PipelineC (again), dct example, looking for help/interest

Hi folks, I'm looking for feedback on PipelineC and ideas of what to implement next.

I will point you to a recent reddit post which ultimately points to GitHub.
  

https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/

Here is the code to get you interested:  

// This is the unrolled version of the original dct copy-and-pasted algorithm
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
// PipelineC iterations of dctTransformUnrolled are used  
// to unroll the calculation serially in O(n^4) time  

// Input 'matrix' and start=1 to begin calculation  
// Input 'matrix' must stay constant until return .done  

// 'sum' accumulates over iterations/clocks and should be pipelined  
// So 'sum' must be a volatile global variable  
// Keep track of when sum is valid and can be read+written
volatile uint1_t dct_volatiles_valid;  
// sum will temporarily store the sum of cosine signals  
volatile float dct_sum;  
// dct_result will store the discrete cosine transform  
// Signal that this is the iteration containing the 'done' result  
typedef struct dct_done_t  
{  
        float matrix[DCT_M][DCT_N];  
        uint1_t done;  
} dct_done_t;  
volatile dct_done_t dct_result;  
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{  
        // Assume not done yet  
        dct_result.done = 0;  
          
        // Start validates volatiles  
        if(start)  
        {  
                dct_volatiles_valid = 1;  
        }  
          
        // Global func to handle getting to BRAM  
        //     1) Lookup constants from BRAM (using iterators)  
        //     2) Increment iterators  
        // Returns next iterators and constants and will increment when requested
        dct_lookup_increment_t lookup_increment;  
        uint1_t do_increment;  
        // Only increment when volatiles valid  
        do_increment = dct_volatiles_valid;  
        lookup_increment = dct_lookup_increment(do_increment);  
          
        // Unpack struct for ease of reading calculation code below  
        float const_val;  
        const_val = lookup_increment.lookup.const_val;  
        float cos_val;  
        cos_val = lookup_increment.lookup.cos_val;  
        dct_iter_t i;  
        i = lookup_increment.incrementer.curr_iters.i;  
        dct_iter_t j;  
        j = lookup_increment.incrementer.curr_iters.j;  
        dct_iter_t k;  
        k = lookup_increment.incrementer.curr_iters.k;  
        dct_iter_t l;  
        l = lookup_increment.incrementer.curr_iters.l;  
        uint1_t reset_k;  
        reset_k = lookup_increment.incrementer.increment.reset_k;  
        uint1_t reset_l;  
        reset_l = lookup_increment.incrementer.increment.reset_l;  
        uint1_t done;  
        done = lookup_increment.incrementer.increment.done;  
          
          
        // Do math for this volatile iteration only when the
        // volatiles can safely be read+written
        if(dct_volatiles_valid)  
        {  
                // ~~~ The primary calculation ~~~:  
                // 1) Float * cosine constant from lookup table    
                float dct1;  
                dct1 = (float)matrix[k][l] * cos_val;  
                // 2) Increment sum  
                dct_sum = dct_sum + dct1;  
                // 3) constant * Float and assign into the output matrix  
                dct_result.matrix[i][j] = const_val * dct_sum;  
                  
                // Sum accumulates during the k and l loops  
                // So reset when they are rolling over  
                if(reset_k & reset_l)  
                {  
                        dct_sum = 0.0;  
                }  
                  
                // Done yet?  
                dct_result.done = done;  
                  
                // Reset volatiles once done  
                if(done)  
                {  
                        dct_volatiles_valid = 0;  
                }  
        }  
          
        return dct_result;  
}  
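
For anyone who wants to poke at the control flow without a PipelineC toolchain, here is a rough plain-C software model of the iteration above. It is only a sketch, not the actual PipelineC source. Everything named dct_model_* and dct_tables_t is hypothetical, the pixel type is assumed to be uint8_t, and the lookup tables are passed in as plain arrays because the BRAM contents are not shown in this post. It also assumes 'l' is the fastest-moving iterator and that reset_k/reset_l mean "this iterator is at its last value".

```c
#include <stdint.h>

#define DCT_M 8
#define DCT_N 8

/* Hypothetical stand-in for the BRAM contents (not shown in the post). */
typedef struct {
    float const_val[DCT_M][DCT_N];             /* scale factor per output [i][j] */
    float cos_val[DCT_M][DCT_N][DCT_M][DCT_N]; /* cosine product, indexed [i][j][k][l] */
} dct_tables_t;

/* Hypothetical state carried between iterations (the role of the volatiles). */
typedef struct {
    int i, j, k, l; /* loop iterators */
    float sum;      /* plays the role of dct_sum */
    int done;       /* set on the final iteration */
} dct_model_t;

/* One call models one iteration of dctTransformUnrolled. */
static void dct_model_step(dct_model_t *s, uint8_t in[DCT_M][DCT_N],
                           const dct_tables_t *t, float out[DCT_M][DCT_N])
{
    /* Assumed flag meaning: the iterator is at its last value. */
    int reset_l = (s->l == DCT_N - 1);
    int reset_k = (s->k == DCT_M - 1);
    int reset_j = (s->j == DCT_N - 1);
    int reset_i = (s->i == DCT_M - 1);

    /* The same three steps as the primary calculation. */
    float dct1 = (float)in[s->k][s->l] * t->cos_val[s->i][s->j][s->k][s->l];
    s->sum = s->sum + dct1;
    out[s->i][s->j] = t->const_val[s->i][s->j] * s->sum;

    /* Sum restarts once the k and l loops have both finished. */
    if (reset_k && reset_l) {
        s->sum = 0.0f;
    }
    s->done = reset_i && reset_j && reset_k && reset_l;

    /* Nested counters, l fastest, then k, j, i. */
    s->l = reset_l ? 0 : s->l + 1;
    if (reset_l) {
        s->k = reset_k ? 0 : s->k + 1;
        if (reset_k) {
            s->j = reset_j ? 0 : s->j + 1;
            if (reset_j) {
                s->i = reset_i ? 0 : s->i + 1;
            }
        }
    }
}

/* Steps until done and returns how many iterations that took. */
static int dct_model_run(uint8_t in[DCT_M][DCT_N], const dct_tables_t *t,
                         float out[DCT_M][DCT_N])
{
    dct_model_t s = {0, 0, 0, 0, 0.0f, 0};
    int iterations = 0;
    do {
        dct_model_step(&s, in, t, out);
        iterations++;
    } while (!s.done);
    return iterations;
}
```

Calling dct_model_run on an 8x8 input steps through DCT_M * DCT_N * DCT_M * DCT_N iterations. Note that out[i][j] is overwritten with a partial result on every step and only holds the final value once the k and l loops for that [i][j] have completed.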


What does this synthesize to?  

Essentially a state machine where each state uses the same N clocks' worth of
logic (the body of dctTransformUnrolled) to do work.

Consider the 'execution' of the function in time order. The logic consists of:

~17% of time for getting lookup constants & incrementing the iterators
(dct_lookup_increment), reading the [k][l] value out of input 'matrix'

~21% of time for the 1) Float * cosine constant from lookup table, a floating
point multiplier

~34% of time for the 2) Increment sum addition, a floating point adder

~21% of time for the 3) constant * Float, a floating point multiplier

~5% of time for the 3) assignment into the output matrix at [i][j]

That pipeline takes some fixed number of clock cycles N. That means that, once
set at the start, 'dct_volatiles_valid' will be 1 again every N clock cycles.
The algorithm unrolls as O(n^4), which for the 8x8 case is 8^4 = 4096 total
iterations. So the total latency in clock cycles is N * 4096.
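
To put numbers on that, here is a small plain-C sketch of the latency arithmetic. The function names are hypothetical, and N and the clock frequency are whatever your build ends up with. The values in the usage note below are made up purely for illustration.

```c
#include <stdint.h>

#define DCT_M 8
#define DCT_N 8

/* Total serial iterations: one per (i, j, k, l) combination. */
static uint32_t dct_total_iterations(void)
{
    return (uint32_t)DCT_M * DCT_N * DCT_M * DCT_N;
}

/* Total latency in clock cycles for an N-cycle pipeline. */
static uint64_t dct_latency_cycles(uint32_t n_cycles_per_iteration)
{
    return (uint64_t)n_cycles_per_iteration * dct_total_iterations();
}

/* Same latency in microseconds: cycles divided by cycles-per-microsecond (MHz). */
static double dct_latency_us(uint32_t n_cycles_per_iteration, double clock_mhz)
{
    return (double)dct_latency_cycles(n_cycles_per_iteration) / clock_mhz;
}
```

For example, with an assumed N = 10 and an assumed 100 MHz clock, dct_latency_cycles(10) is 40960 cycles and dct_latency_us(10, 100.0) is about 409.6 microseconds per 8x8 block.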
