.

I will point you to a recent reddit post which ultimately points to GitHub.

ower_resource_usage/

Here is the code to get you interested:

// This is the unrolled version of the original dct copy-and-pasted algorit

hm

//

m/

// PipelineC iterations of dctTransformUnrolled are used

// to unroll the calculation serially in O(n^4) time

// Input 'matrix' and start=1 to begin calculation

// Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined

// So 'sum' must be a volatile global variable

// Keep track of when sum is valid and be read+written

volatile uint1_t dct_volatiles_valid;

// sum will temporarily store the sum of cosine signals

volatile float dct_sum;

// dct_result will store the discrete cosine transform

// Signal that this is the iteration containing the 'done' result

typedef struct dct_done_t

{

float matrix[DCT_M][DCT_N];

uint1_t done;

} dct_done_t;

volatile dct_done_t dct_result;

dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t s

tart)

{

// Assume not done yet

dct_result.done = 0;

// Start validates volatiles

if(start)

{

dct_volatiles_valid = 1;

}

// Global func to handle getting to BRAM

// 1) Lookup constants from BRAM (using iterators)

// 2) Increment iterators

// Returns next iterators and constants and will increment when req

uested

dct_lookup_increment_t lookup_increment;

uint1_t do_increment;

// Only increment when volatiles valid

do_increment = dct_volatiles_valid;

lookup_increment = dct_lookup_increment(do_increment);

// Unpack struct for ease of reading calculation code below

float const_val;

const_val = lookup_increment.lookup.const_val;

float cos_val;

cos_val = lookup_increment.lookup.cos_val;

dct_iter_t i;

i = lookup_increment.incrementer.curr_iters.i;

dct_iter_t j;

j = lookup_increment.incrementer.curr_iters.j;

dct_iter_t k;

k = lookup_increment.incrementer.curr_iters.k;

dct_iter_t l;

l = lookup_increment.incrementer.curr_iters.l;

uint1_t reset_k;

reset_k = lookup_increment.incrementer.increment.reset_k;

uint1_t reset_l;

reset_l = lookup_increment.incrementer.increment.reset_l;

uint1_t done;

done = lookup_increment.incrementer.increment.done;

// Do math for this volatile iteration only when

// can safely read+write volatiles

if(dct_volatiles_valid)

{

// ~~~ The primary calculation ~~~:

// 1) Float

*** cosine constant from lookup table**

float dct1;

dct1 = (float)matrix[k][l] *cos_val;

float dct1;

dct1 = (float)matrix[k][l] *

// 2) Increment sum

dct_sum = dct_sum + dct1;

// 3) constant

*** Float and assign into the output matrix**

dct_result.matrix[i][j] = const_val *dct_sum;

dct_result.matrix[i][j] = const_val *

// Sum accumulates during the k and l loops

// So reset when they are rolling over

if(reset_k & reset_l)

{

dct_sum = 0.0;

}

// Done yet?

dct_result.done = done;

// Reset volatiles once done

if(done)

{

dct_volatiles_valid = 0;

}

}

return dct_result;

}

What does this synthesize to?

Essentially a state machine where each state uses the same N clocks worth o

f logic to do work. (the body of dctTransformUnrolled).

Consider the 'execution' of the function in time order. The logic consists

of:

~17% of time for getting lookup constants & incrementing the iterators (dct

_lookup_increment), reading the [k][l] value out of input 'matrix'

~21% of time for the 1) Float

*** cosine constant from lookup table, a floati**

ng point multiplier

~34% of time for the 2) Increment sum addition, a floating point adder

~21% of time for the 3) constant *Float, a floating point multiplier

ng point multiplier

~34% of time for the 2) Increment sum addition, a floating point adder

~21% of time for the 3) constant *

~5% of time for the 3) assignment into the output matrix at [i][j]

That pipeline takes some fixed number of clock cycles N. That means every N

clock cycles 'dct_volatiles_valid' will =1 (after being set at the start

). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total

latency in clock cycles is N * 4096.