How do I design this datapath unit for DSP using VHDL/Verilog?

Dear all,

I want to design an arithmetic datapath unit for digital signal processing using VHDL and/or Verilog.

The input is a set of 5 elements (either sequential or parallel), each 8 bits wide. Each of these 5 inputs must be multiplied by a predefined constant matrix (10x10, floating-point values scaled and rounded to integers). The output is a 10x10 matrix that is the sum of these five matrices, each element having 12 bits. So for each element of the matrix, I can have a MAC unit. The internal computation will be 16 bits.

Hence, for each set of 5 inputs x1, x2, x3, x4, x5, the output matrix is

Y=x1*C1+x2*C2+x3*C3+x4*C4+x5*C5 where Y, C1, C2, C3, C4, C5 are matrices;
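For anyone following along, the equation above can be written down as a small behavioural reference model, which is handy later as a golden model for an HDL testbench. This is only a sketch in Python, not synthesizable code. The coefficient values, the function names, and the assumption that the 8-bit inputs are unsigned are all placeholders that do not come from the post.

```python
# Behavioural reference model, not synthesizable HDL.
N = 10  # output matrix is N x N
K = 5   # inputs per set (x1..x5)

def placeholder_coeffs():
    """Return K made-up N x N integer matrices standing in for C1..C5."""
    return [[[((k + 1) * (r - c)) % 7 - 3 for c in range(N)]
             for r in range(N)]
            for k in range(K)]

def reference_output(x, coeffs):
    """Return Y with Y[r][c] = sum over k of x[k] * coeffs[k][r][c]."""
    assert len(x) == K and len(coeffs) == K
    return [[sum(x[k] * coeffs[k][r][c] for k in range(K))
             for c in range(N)]
            for r in range(N)]

def fits_signed(value, bits):
    """True if value is representable in two's complement with that many bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

if __name__ == "__main__":
    x = [255] * K  # largest inputs, assuming unsigned 8-bit values
    y = reference_output(x, placeholder_coeffs())
    worst = max(abs(v) for row in y for v in row)
    print("largest |Y| element:", worst)
    print("fits in a 16-bit accumulator:", fits_signed(worst, 16))
```

Running the same check with the real coefficient matrices would show whether the 16-bit internal width is enough for the worst-case sum. How the 16-bit result is reduced to the 12-bit output is not stated in the post, so the model leaves that step out.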

If I put a MAC at each element, I will have a purely parallel architecture, but I would need 100 16-bit MAC units, which would consume too many resources.

I am considering a parallel-serial architecture that outputs one row at a time, which is 10x12 bits, so the output is produced row by row.

I also need to streamline the datapath operation. Since sets of 5 input elements arrive as a non-stop stream, the output will also stream non-stop. So once a row has been output, that row can be reused for computing/storing the results for the next 5 input elements.

I am OK with the thinking so far, but going further leaves me confused. How do I do the sequential timing control (i.e., decide what to do in which cycle)? Do I need pipelining? How do I design the architecture? I know pipelining in theory from a one-semester course, but now that I have to implement it, I am totally lost...

Finally, how do I program this? Are there any examples for this?

Please help me!

Thanks a lot,

-Walala

You don't mention a very important parameter: speed. Depending on the bandwidth you require, you can probably timeslice the MACs, saving hardware.

-Kevin

Hi Kevin,

Thanks for your answer!

The required output throughput is 33-50 MHz, i.e., it should output 33 million to 50 million 12-bit elements per second, and each set of 5 inputs corresponds to 10x10=100 such 12-bit output elements...
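To make the timeslicing trade-off concrete, the numbers above can be turned into a rough operations budget. The sketch below is only back-of-the-envelope arithmetic in Python. It assumes that every output element costs exactly 5 multiply-accumulate operations (one per input) and that the MACs are kept fully busy. The function name `budget` and the MAC counts tried (1, 10, 100) are illustrative choices.

```python
ELEMENTS_PER_SET = 10 * 10  # one 10x10 output matrix per set of inputs
INPUTS_PER_SET = 5          # x1..x5, one multiply-accumulate each per element

def budget(output_rate_hz, num_macs):
    """Return (input sets per second, MAC operations per second per MAC unit)."""
    sets_per_s = output_rate_hz / ELEMENTS_PER_SET
    total_mac_ops_per_s = output_rate_hz * INPUTS_PER_SET
    return sets_per_s, total_mac_ops_per_s / num_macs

for rate in (33e6, 50e6):
    for macs in (1, 10, 100):
        sets_per_s, per_mac = budget(rate, macs)
        print(f"{rate / 1e6:.0f} M elements/s, {macs:3d} MACs: "
              f"{sets_per_s / 1e3:.0f} k input sets/s, "
              f"{per_mac / 1e6:.2f} M MAC ops/s per unit")
```

On this reading, at 50 million elements per second the design consumes 500,000 input sets per second, and 10 shared MACs would each need to sustain 25 million operations per second. Whether that is comfortable in the target process depends on the multiplier design, which the thread does not specify.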

The technology I am going to use is 0.25 µm.

I think the inputs are naturally serial, but I can make them parallel since there are only 5 of them. Again, though, I am not sure how to partition the internal MACs between parallel and serial, or how to pace the outputs...

It seems the inputs are faster than the outputs; maybe I should make the inputs wait after they are fed into the unit?

Can you give some further advice on this architecture and on how to do the timing? I think it is really difficult... Could you also point me to some resources?

Thanks very much,

-Walala

Assuming you have nailed down the MAC for calculating one element: to calculate one row at a time, you just add a 10:1 mux in front of the MAC and select the inputs for the next element once the current element is done and the accumulator has been cleared. A counter or a simple state machine can drive the mux select signal. This will take at least 10 times longer to produce all 100 elements.
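One way to read the suggestion above is as 10 MACs (one per column), each fed through a 10:1 coefficient mux whose select is a row counter. The sketch below models that schedule cycle by cycle in Python so the control sequence can be checked before writing any HDL. It is a behavioural model under that assumption, not the poster's design. The names `row_serial_mac` and `demo_coeffs` and the coefficient values are hypothetical, and 16-bit overflow is not modelled.

```python
N, K = 10, 5  # 10x10 output matrix, 5 inputs per set

def row_serial_mac(x, coeffs):
    """Yield (row, row_values, cycle) each time a row of N results is complete.

    Models N accumulators, one per column. The row counter plays the role
    of the 10:1 mux select; the step counter k walks through the K inputs.
    """
    cycle = 0
    for row in range(N):              # row counter = coefficient mux select
        acc = [0] * N                 # accumulators cleared before each row
        for k in range(K):            # one multiply-accumulate step per cycle
            for col in range(N):      # these N MACs run in parallel in hardware
                acc[col] += x[k] * coeffs[k][row][col]
            cycle += 1
        yield row, acc, cycle

def demo_coeffs():
    """Made-up coefficients: coeffs[k][row][col] = 10*row + col + k."""
    return [[[10 * r + c + k for c in range(N)] for r in range(N)]
            for k in range(K)]

if __name__ == "__main__":
    for row, values, cycle in row_serial_mac([1, 2, 3, 4, 5], demo_coeffs()):
        print(f"cycle {cycle:2d}: row {row} = {values}")
```

In this model a row is ready every 5 cycles and a full matrix takes 50 cycles, compared with 5 cycles for 100 fully parallel MACs, which matches the factor of 10 mentioned above. In hardware the two loop counters would become the counter or state machine that drives the mux select and the accumulator clear.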

Jim Wu
