Do you have a question? Post it now! No Registration Necessary
- Subject
- Posted on
- Max finding
- 08-27-2003
- Josh Model
August 27, 2003, 12:14 pm
Hi all,
Just wanted to delve into the groups accumulated knowledge. I've got an app
where I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me think.
It's pretty time critical, with throughput more important than latency,
unless latency becomes absurd, and not area critical.
The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and cascading
comparators is pretty ugly.
I did some googling and literature searching and came up with plenty of
CS-style results, but very few hardware results.
This was one...
http://www.fpgacpu.org/usenet/max.html
which was seemed the FPGA equivalent of a very clever IEE paper entitled
"Efficient Parallel Pipelineable VLSI architechture for finding the maximum
binary number" by F. Daneshgaran and K. Yao. which solved the problem using
tristates.
The bitwise parallel-max approach is pretty attactive and reduces the clock
cycles to the bitwidth, e.g. 20 (though I guess you could look at m-bits at
a time, but it becomes more complex) but I think it will get somewhat hung
up on the "check for no candidates" logic.
I was wondering if there were any other "clever" approaches around to this
sort of thing from an FPGA synthesis viewpoint? I'll be the first to admit
to it being a while since I've looked at an algorithms book, so feel free to
club me over the head with any idiocies.
--Josh Model
Asst. Technical Staff
MIT Lincoln Laboratory
Just wanted to delve into the groups accumulated knowledge. I've got an app
where I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me think.
It's pretty time critical, with throughput more important than latency,
unless latency becomes absurd, and not area critical.
The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and cascading
comparators is pretty ugly.
I did some googling and literature searching and came up with plenty of
CS-style results, but very few hardware results.
This was one...
http://www.fpgacpu.org/usenet/max.html
which was seemed the FPGA equivalent of a very clever IEE paper entitled
"Efficient Parallel Pipelineable VLSI architechture for finding the maximum
binary number" by F. Daneshgaran and K. Yao. which solved the problem using
tristates.
The bitwise parallel-max approach is pretty attactive and reduces the clock
cycles to the bitwidth, e.g. 20 (though I guess you could look at m-bits at
a time, but it becomes more complex) but I think it will get somewhat hung
up on the "check for no candidates" logic.
I was wondering if there were any other "clever" approaches around to this
sort of thing from an FPGA synthesis viewpoint? I'll be the first to admit
to it being a while since I've looked at an algorithms book, so feel free to
club me over the head with any idiocies.
--Josh Model
Asst. Technical Staff
MIT Lincoln Laboratory
Re: Max finding
think.
Do you get all 40 numbers on every clock? Or do you get them one at
a time? Or are they all sitting in a memory somewhere?
If the numbers arrive one by one, then keeping track of the largest
(and its index) is pretty simple.
If you need the max of the 40 numbers that you saw most recently,
it's a bit harder because you must keep the numbers in sorted order,
but you also need to know about their arrival order so that you can
throw away the correct value when it reaches the "stale" end of the
list. I've done similar things (on a smaller scale) for median
filters, and it gets kinda messy.
cascading
I don't think it's that ugly, really. A tree of 40 comparator/mux modules
will do it, in a fairly systematic way. (It would be horrific to code in
Verilog, but easy in VHDL.)
Somewhere in the archives here we have a heapsorter design which does
the right kind of thing... I'll seek it out, if you are interested.
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
We've slightly trimmed the long signature. Click to see the full one.
Re: Max finding
Responses *'d
*Right. we get all 40 numbers (outputs of Multiply accumulates) once every
N clock cycles, with N probably being determined by the update rate at which
we can find the maximum. In the ideal situation, N = 1, and we update the
maximum index every clock cycle.
* I think you're right, might not be so bad. If pipelined it would reduce
latency to Ceil(log2(40)) = 6, which is pretty good, and keep throughput at
maximum. The way I see it, each module would be made up (using 20-bit
numbers & 6-bit addresses) a 2-input, 20-bit comparator which serves as the
select for this module, a 20-bit 2:1 mux to pass the greater value, and a
6-bit 2:1 mux to pass the corect index. Cascade these guys in levels and
you'd need 20 + 10 + 5 + 2 + 2 + 1 modules. Still seems sort of inelegant
to me, but it gets the job done, which is the point.
* I'd be interested if it's easy for you to find -- for my app, I think most
of the overhead of building the heap would be wasted, since I only want the
max.
*Thanks,
*--Josh
*Right. we get all 40 numbers (outputs of Multiply accumulates) once every
N clock cycles, with N probably being determined by the update rate at which
we can find the maximum. In the ideal situation, N = 1, and we update the
maximum index every clock cycle.
* I think you're right, might not be so bad. If pipelined it would reduce
latency to Ceil(log2(40)) = 6, which is pretty good, and keep throughput at
maximum. The way I see it, each module would be made up (using 20-bit
numbers & 6-bit addresses) a 2-input, 20-bit comparator which serves as the
select for this module, a 20-bit 2:1 mux to pass the greater value, and a
6-bit 2:1 mux to pass the corect index. Cascade these guys in levels and
you'd need 20 + 10 + 5 + 2 + 2 + 1 modules. Still seems sort of inelegant
to me, but it gets the job done, which is the point.
* I'd be interested if it's easy for you to find -- for my app, I think most
of the overhead of building the heap would be wasted, since I only want the
max.
*Thanks,
*--Josh
Re: Max finding
...
every
which
What's the operating clock rate? If you are running in the low tens of MHz
just up your internal FPGA frequency. That's the "FPGA way".
Also, can you use any information prior to the MAC to pre-calculate which
result will be the largest? Are the coefficients fixed? You get the idea.
What's feeding data into the FPGA? Can any intelligenge be derived from
this source?
It seems to me that the best you can do is a parallel binary search type
algorithm. It will take several clocks no matter what.
Another option might be (WARNING: IT'S UGLY AND GENERALLY BAD DESIGN) to
hand place all the components and look into a fully combinatorial solution.
Depending on your clock rate you might be able to pull this off.
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian
We've slightly trimmed the long signature. Click to see the full one.
Re: Max finding
>
>
>>The "brute force" solution of cycling through the 40 numbers and doing a
>>compare-store on each is pretty slow for FPGA implementation, and
cascading
>>
>>
>>comparators is pretty ugly.
>>
>>
>
>I don't think it's that ugly, really. A tree of 40 comparator/mux modules
>will do it, in a fairly systematic way. (It would be horrific to code in
>Verilog, but easy in VHDL.)
>
>
Aw, heck... it's just a bunch of for loops.
40 values, 20 compares, 20 results.
20 values, 10 compares, 10 results.
10 values, 5 compares, 5 results.
5 values, 2 compares, 3 results.
3 values, 1 compare, 2 results.
2 values, 1 compare, final result.
41 20 bit registers in the pipeline and a result in about 5 clocks.
one stage would look like:
for( i=0; i<10; i=i+1)
begin
result_stage3[i] <= result_stage2[2*i] > result_stage2[2*i+1] ?
result_stage2[2*i] : result_stage2[2*i+1];
index_stage3[i] <= result_stage2[2*i] > result_stage2[2*i+1] ?
index_stage2[2*i] : index_stage2[2*i+1];
end
Site Timeline
- » fixed point divider help
- — Next thread in » Field-Programmable Gate Arrays
- » Q:epax1 dma?
- — Previous thread in » Field-Programmable Gate Arrays
- » engineered data path versus inferred data path
- — Newest thread in » Field-Programmable Gate Arrays
- » waterproof connectors for well pump
- — The site's Newest Thread. Posted in » Electronics Design
- » Newest LTspice update has lots of AD op amps.
- — The site's Last Updated Thread. Posted in » Electronics Design