PipelineC (again), dct example, looking for help/interest

Sep 07, 2019 1 Replies

Rate this thread:

Julian Kemmerer 6 years ago

Hi folks looking for feedback on PipelineC. Ideas of what to implement next .

I will point you to a recent reddit post which ultimately points to GitHub.

formatting link

ower_resource_usage/

Here is the code to get you interested:

// This is the unrolled version of the original dct copy-and-pasted algorit hm //

formatting link

m/ // PipelineC iterations of dctTransformUnrolled are used // to unroll the calculation serially in O(n^4) time

// Input 'matrix' and start=1 to begin calculation // Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined // So 'sum' must be a volatile global variable // Keep track of when sum is valid and be read+written volatile uint1_t dct_volatiles_valid; // sum will temporarily store the sum of cosine signals volatile float dct_sum; // dct_result will store the discrete cosine transform // Signal that this is the iteration containing the 'done' result typedef struct dct_done_t { float matrix[DCT_M][DCT_N]; uint1_t done; } dct_done_t; volatile dct_done_t dct_result; dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t s tart) { // Assume not done yet dct_result.done = 0; // Start validates volatiles if(start) { dct_volatiles_valid = 1; } // Global func to handle getting to BRAM // 1) Lookup constants from BRAM (using iterators) // 2) Increment iterators // Returns next iterators and constants and will increment when req uested dct_lookup_increment_t lookup_increment; uint1_t do_increment; // Only increment when volatiles valid do_increment = dct_volatiles_valid; lookup_increment = dct_lookup_increment(do_increment); // Unpack struct for ease of reading calculation code below float const_val; const_val = lookup_increment.lookup.const_val; float cos_val; cos_val = lookup_increment.lookup.cos_val; dct_iter_t i; i = lookup_increment.incrementer.curr_iters.i; dct_iter_t j; j = lookup_increment.incrementer.curr_iters.j; dct_iter_t k; k = lookup_increment.incrementer.curr_iters.k; dct_iter_t l; l = lookup_increment.incrementer.curr_iters.l; uint1_t reset_k; reset_k = lookup_increment.incrementer.increment.reset_k; uint1_t reset_l; reset_l = lookup_increment.incrementer.increment.reset_l; uint1_t done; done = lookup_increment.incrementer.increment.done; // Do math for this volatile iteration only when // can safely read+write volatiles if(dct_volatiles_valid) { // ~~~ The primary calculation ~~~: // 1) Float * cosine constant from lookup table float dct1; dct1 = (float)matrix[k][l] * cos_val; // 2) Increment sum dct_sum = dct_sum + dct1; // 3) constant * Float and assign into the output matrix dct_result.matrix[i][j] = const_val * dct_sum; // Sum accumulates during the k and l loops // So reset when they are rolling over if(reset_k & reset_l) { dct_sum = 0.0; } // Done yet? dct_result.done = done; // Reset volatiles once done if(done) { dct_volatiles_valid = 0; } } return dct_result; }

What does this synthesize to?

Essentially a state machine where each state uses the same N clocks worth o f logic to do work. (the body of dctTransformUnrolled).

Consider the 'execution' of the function in time order. The logic consists of:

~17% of time for getting lookup constants & incrementing the iterators (dct _lookup_increment), reading the [k][l] value out of input 'matrix'

~21% of time for the 1) Float * cosine constant from lookup table, a floati ng point multiplier

~34% of time for the 2) Increment sum addition, a floating point adder

~21% of time for the 3) constant * Float, a floating point multiplier

~5% of time for the 3) assignment into the output matrix at [i][j]

That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start ). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.

Vote