PipelineC (again), dct example, looking for help/interest

Hi folks, I'm looking for feedback on PipelineC and ideas for what to implement next.
I will point you to a recent reddit post, which ultimately points to GitHub.

formatting link

ower_resource_usage/
Here is the code to get you interested:
// This is the unrolled version of the original dct copy-and-pasted algorithm
//
formatting link

m/
// PipelineC iterations of dctTransformUnrolled are used
// to unroll the calculation serially in O(n^4) time
// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile global variable
// Keep track of when sum is valid and can be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// 'done' signals that this is the iteration containing the final result
typedef struct dct_done_t
{
    float matrix[DCT_M][DCT_N];
    uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;

dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
    // Assume not done yet
    dct_result.done = 0;

    // Start validates volatiles
    if(start)
    {
        dct_volatiles_valid = 1;
    }

    // Global func to handle getting to BRAM
    //   1) Lookup constants from BRAM (using iterators)
    //   2) Increment iterators
    // Returns next iterators and constants and will increment when requested
    dct_lookup_increment_t lookup_increment;
    uint1_t do_increment;
    // Only increment when volatiles valid
    do_increment = dct_volatiles_valid;
    lookup_increment = dct_lookup_increment(do_increment);

    // Unpack struct for ease of reading calculation code below
    float const_val;
    const_val = lookup_increment.lookup.const_val;
    float cos_val;
    cos_val = lookup_increment.lookup.cos_val;
    dct_iter_t i;
    i = lookup_increment.incrementer.curr_iters.i;
    dct_iter_t j;
    j = lookup_increment.incrementer.curr_iters.j;
    dct_iter_t k;
    k = lookup_increment.incrementer.curr_iters.k;
    dct_iter_t l;
    l = lookup_increment.incrementer.curr_iters.l;
    uint1_t reset_k;
    reset_k = lookup_increment.incrementer.increment.reset_k;
    uint1_t reset_l;
    reset_l = lookup_increment.incrementer.increment.reset_l;
    uint1_t done;
    done = lookup_increment.incrementer.increment.done;

    // Do math for this volatile iteration only when
    // the volatiles can safely be read+written
    if(dct_volatiles_valid)
    {
        // ~~~ The primary calculation ~~~:
        // 1) Float * cosine constant from lookup table
        float dct1;
        dct1 = (float)matrix[k][l] * cos_val;
        // 2) Increment sum
        dct_sum = dct_sum + dct1;
        // 3) constant * Float and assign into the output matrix
        dct_result.matrix[i][j] = const_val * dct_sum;

        // Sum accumulates during the k and l loops
        // So reset when they are rolling over
        if(reset_k & reset_l)
        {
            dct_sum = 0.0;
        }

        // Done yet?
        dct_result.done = done;

        // Reset volatiles once done
        if(done)
        {
            dct_volatiles_valid = 0;
        }
    }

    return dct_result;
}
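
For anyone who wants to poke at the control flow in software first, here is a rough plain-C model of what each iteration does. It is only a sketch under assumptions: step_iters, iters_t, step_t, dct_model and the two lookup-table parameters are hypothetical stand-ins (the real dct_lookup_increment and its BRAM contents are not shown in this post), the loop order with l innermost is a guess, and DCT_M = DCT_N = 8 is assumed because that gives 4096 iterations.

```c
#include <assert.h>
#include <stdint.h>

#define DCT_M 8 /* assumed size: 8*8*8*8 = 4096 iterations */
#define DCT_N 8

/* Hypothetical model of the iterator state kept by dct_lookup_increment. */
typedef struct { int i, j, k, l; } iters_t;

typedef struct {
    iters_t next; /* iterators for the following step */
    int reset_k;  /* k wraps back to 0 on this step */
    int reset_l;  /* l wraps back to 0 on this step */
    int done;     /* this was the last (i, j, k, l) combination */
} step_t;

/* Advance (i, j, k, l) like four nested for-loops with l innermost. */
step_t step_iters(iters_t cur)
{
    step_t s;
    int wrap_j, wrap_i;
    s.next = cur;
    s.reset_l = (cur.l == DCT_N - 1);
    s.reset_k = s.reset_l && (cur.k == DCT_M - 1);
    wrap_j = s.reset_k && (cur.j == DCT_N - 1);
    wrap_i = wrap_j && (cur.i == DCT_M - 1);
    s.next.l = s.reset_l ? 0 : cur.l + 1;
    if (s.reset_l) s.next.k = s.reset_k ? 0 : cur.k + 1;
    if (s.reset_k) s.next.j = wrap_j ? 0 : cur.j + 1;
    if (wrap_j) s.next.i = wrap_i ? 0 : cur.i + 1;
    s.done = wrap_i;
    return s;
}

/* One multiply-accumulate per step, mirroring steps 1) to 3) of the
 * function body above. Returns the number of steps taken. */
int dct_model(uint8_t in[DCT_M][DCT_N],
              float cos_lut[DCT_M][DCT_N][DCT_M][DCT_N],
              float const_lut[DCT_M][DCT_N],
              float out[DCT_M][DCT_N])
{
    iters_t it = {0, 0, 0, 0};
    float sum = 0.0f;
    int steps = 0;
    int done = 0;
    while (!done) {
        step_t s = step_iters(it);
        /* 1) and 2): input * cosine constant, accumulated into sum */
        sum = sum + (float)in[it.k][it.l] * cos_lut[it.i][it.j][it.k][it.l];
        /* 3): scale and store; the last write per (i, j) is the final one */
        out[it.i][it.j] = const_lut[it.i][it.j] * sum;
        /* sum restarts once the k and l loops have both rolled over */
        if (s.reset_k & s.reset_l) sum = 0.0f;
        done = s.done;
        it = s.next;
        steps++;
    }
    assert(steps == DCT_M * DCT_N * DCT_M * DCT_N);
    return steps;
}
```

Each pass through the while loop corresponds to one N-clock iteration of the hardware version, so the step count returned here is the 4096 that gets multiplied by N in the latency estimate.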
What does this synthesize to?
Essentially a state machine where each state uses the same N clocks' worth of
logic to do work (the body of dctTransformUnrolled).
Consider the 'execution' of the function in time order. The logic consists of:
~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix'
~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier
~34% of time for the 2) Increment sum addition, a floating point adder
~21% of time for the 3) constant * Float, a floating point multiplier
~5% of time for the 3) assignment into the output matrix at [i][j]
That pipeline takes some fixed number of clock cycles N, so
'dct_volatiles_valid' will be 1 once every N clock cycles (after being set
at the start). The algorithm unrolls as O(n^4), for 4096 total iterations,
so the total latency in clock cycles is N * 4096.
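
To put numbers on that, a small helper like the one below works. It is just a sketch: dct_total_cycles and dct_latency_us are hypothetical names, DCT_M = DCT_N = 8 is assumed (it matches the 4096 iterations), and the pipeline depth and clock rate are whatever your synthesis run reports, not figures from this post.

```c
#include <assert.h>
#include <stdint.h>

#define DCT_M 8 /* assumed size: 8*8*8*8 = 4096 iterations */
#define DCT_N 8

/* Total latency in clock cycles: N clocks per iteration times the
 * number of (i, j, k, l) iterations. */
uint64_t dct_total_cycles(uint32_t n_pipeline)
{
    uint64_t iterations = (uint64_t)DCT_M * DCT_N * DCT_M * DCT_N;
    assert(n_pipeline > 0);
    return iterations * n_pipeline;
}

/* Same latency in microseconds for a given clock frequency in MHz. */
double dct_latency_us(uint32_t n_pipeline, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)dct_total_cycles(n_pipeline) / clock_mhz;
}
```

For example, a made-up N of 20 clocks gives dct_total_cycles(20) = 81920 cycles, which is about 819 microseconds at a 100 MHz clock.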
Julian Kemmerer