Hi folks, I'm looking for feedback on PipelineC and ideas for what to implement next.
I will point you to a recent Reddit post, which ultimately points to GitHub.
Here is the code to get you interested:
// This is the unrolled version of the original dct copy-and-pasted algorithm
//
// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile global variable
// Keep track of when sum is valid and be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// Signal that this is the iteration containing the 'done' result
typedef struct dct_done_t
{
    float matrix[DCT_M][DCT_N];
    uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
    // Assume not done yet
    dct_result.done = 0;

    // Start validates volatiles
    if(start)
    {
        dct_volatiles_valid = 1;
    }

    // Global func to handle getting to BRAM
    //   1) Lookup constants from BRAM (using iterators)
    //   2) Increment iterators
    // Returns next iterators and constants and will increment when requested
    dct_lookup_increment_t lookup_increment;
    uint1_t do_increment;
    // Only increment when volatiles valid
    do_increment = dct_volatiles_valid;
    lookup_increment = dct_lookup_increment(do_increment);

    // Unpack struct for ease of reading calculation code below
    float const_val;
    const_val = lookup_increment.lookup.const_val;
    float cos_val;
    cos_val = lookup_increment.lookup.cos_val;
    dct_iter_t i;
    i = lookup_increment.incrementer.curr_iters.i;
    dct_iter_t j;
    j = lookup_increment.incrementer.curr_iters.j;
    dct_iter_t k;
    k = lookup_increment.incrementer.curr_iters.k;
    dct_iter_t l;
    l = lookup_increment.incrementer.curr_iters.l;
    uint1_t reset_k;
    reset_k = lookup_increment.incrementer.increment.reset_k;
    uint1_t reset_l;
    reset_l = lookup_increment.incrementer.increment.reset_l;
    uint1_t done;
    done = lookup_increment.incrementer.increment.done;

    // Do math for this volatile iteration only when
    // can safely read+write volatiles
    if(dct_volatiles_valid)
    {
        // ~~~ The primary calculation ~~~:
        // 1) Float * cosine constant from lookup table
        float dct1;
        dct1 = (float)matrix[k][l] * cos_val;
        // 2) Increment sum
        dct_sum = dct_sum + dct1;
        // 3) constant * Float and assign into the output matrix
        dct_result.matrix[i][j] = const_val * dct_sum;

        // Sum accumulates during the k and l loops
        // So reset when they are rolling over
        if(reset_k & reset_l)
        {
            dct_sum = 0.0;
        }

        // Done yet?
        dct_result.done = done;
        // Reset volatiles once done
        if(done)
        {
            dct_volatiles_valid = 0;
        }
    }

    return dct_result;
}
What does this synthesize to?
Essentially a state machine where each state uses the same N clocks' worth of logic to do work (the body of dctTransformUnrolled).
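To make the "state machine" idea a bit more concrete, here is a rough software model of just the iterator part. It is only a sketch: the names (model_iters_t, model_flags_t, model_increment, model_count_states) are made up for illustration and are not the real dct_lookup_increment, and both the 8x8 size and the exact meaning of reset_k / reset_l / done are assumptions.

```c
#include <stdint.h>

/* Assumed matrix size, not stated in the post */
#define DCT_M 8
#define DCT_N 8

/* Hypothetical stand-ins for the iterator and flag structs */
typedef struct { uint8_t i, j, k, l; } model_iters_t;
typedef struct { uint8_t reset_k, reset_l, done; } model_flags_t;

/* Report flags for the current state, then advance to the next state.
 * l counts fastest, then k, then j, then i (assumed ordering). */
model_flags_t model_increment(model_iters_t *it)
{
    model_flags_t f;
    f.reset_l = (uint8_t)(it->l == DCT_N - 1);
    f.reset_k = (uint8_t)(it->k == DCT_M - 1);
    f.done = (uint8_t)(f.reset_l && f.reset_k &&
                       it->j == DCT_N - 1 && it->i == DCT_M - 1);

    if (!f.reset_l) {
        it->l = (uint8_t)(it->l + 1);
    } else {
        it->l = 0;
        if (!f.reset_k) {
            it->k = (uint8_t)(it->k + 1);
        } else {
            it->k = 0;
            if (it->j != DCT_N - 1) {
                it->j = (uint8_t)(it->j + 1);
            } else {
                it->j = 0;
                it->i = (uint8_t)((it->i == DCT_M - 1) ? 0 : it->i + 1);
            }
        }
    }
    return f;
}

/* Count how many states are visited before 'done' is flagged */
unsigned model_count_states(void)
{
    model_iters_t it = {0, 0, 0, 0};
    model_flags_t f;
    unsigned count = 0;
    do {
        f = model_increment(&it);
        count++;
    } while (!f.done);
    return count;
}
```

In this model each call to model_increment corresponds to one pass through the function body, so the count of states is the number of passes needed before 'done' goes high.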
Consider the 'execution' of the function in time order. The logic consists of:
~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), and reading the [k][l] value out of input 'matrix'
~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier
~34% of time for the 2) Increment sum addition, a floating point adder
~21% of time for the 3) constant * Float, a floating point multiplier
~5% of time for the 3) assignment into the output matrix at [i][j]
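The three calculation steps in that list can be sketched as a plain software function. This is an illustrative model only: model_acc_t and model_step are hypothetical names, and the lookup values are passed in by the caller instead of coming from BRAM. The point it shows is that step 2 reads the sum left behind by the previous iteration, which is why the next iteration cannot begin until the current one has finished.

```c
/* Hypothetical state carried between iterations */
typedef struct {
    float sum;     /* plays the role of dct_sum */
    float result;  /* plays the role of dct_result.matrix[i][j] */
} model_acc_t;

/* One iteration of the multiply, accumulate, scale sequence */
void model_step(model_acc_t *acc, float pixel, float cos_val,
                float const_val, int reset_k, int reset_l)
{
    /* 1) pixel * cosine constant */
    float dct1 = pixel * cos_val;
    /* 2) accumulate: depends on the sum written by the previous iteration */
    acc->sum = acc->sum + dct1;
    /* 3) scale by the constant and store the output value */
    acc->result = const_val * acc->sum;
    /* the sum starts over when both k and l roll over */
    if (reset_k & reset_l) {
        acc->sum = 0.0f;
    }
}
```

Calling model_step once per [k][l] element, with reset_k and reset_l both set on the last element, leaves the finished output value in acc->result and clears acc->sum for the next [i][j].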
That pipeline takes some fixed number of clock cycles N. That means 'dct_volatiles_valid' will be 1 once every N clock cycles (after being set at the start). The algorithm unrolls as O(n^4), for 4096 total iterations (which matches n = 8, since 8^4 = 4096). So the total latency in clock cycles is N * 4096.
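As a quick sanity check of that arithmetic, here is a small sketch. The function names are hypothetical, and any value plugged in for N or for the clock frequency is a placeholder, not a measured result.

```c
#include <stdint.h>

/* Total latency in clocks: one N-clock pass per (i, j, k, l) combination */
uint64_t model_total_cycles(uint32_t n_clocks_per_pass,
                            uint32_t dct_m, uint32_t dct_n)
{
    uint64_t iterations = (uint64_t)dct_m * dct_n * dct_m * dct_n;
    return iterations * n_clocks_per_pass;
}

/* Convert a cycle count to microseconds for an assumed clock in MHz */
double model_latency_us(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

For example, with a made-up N of 10 clocks and an 8x8 matrix, model_total_cycles(10, 8, 8) gives 10 * 4096 = 40960 clocks.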