Max finding

Hi all,

Just wanted to delve into the groups accumulated knowledge. I've got an app where I must find the index of the maximum of 40 or so (unsorted, of course)

20-bit numbers. --the specifics aren't that important, just helps me think. It's pretty time critical, with throughput more important than latency, unless latency becomes absurd, and not area critical.

The "brute force" solution of cycling through the 40 numbers and doing a compare-store on each is pretty slow for FPGA implementation, and cascading comparators is pretty ugly.

I did some googling and literature searching and came up with plenty of CS-style results, but very few hardware results.

This was one...

formatting link
which was seemed the FPGA equivalent of a very clever IEE paper entitled "Efficient Parallel Pipelineable VLSI architechture for finding the maximum binary number" by F. Daneshgaran and K. Yao. which solved the problem using tristates.

The bitwise parallel-max approach is pretty attactive and reduces the clock cycles to the bitwidth, e.g. 20 (though I guess you could look at m-bits at a time, but it becomes more complex) but I think it will get somewhat hung up on the "check for no candidates" logic.

I was wondering if there were any other "clever" approaches around to this sort of thing from an FPGA synthesis viewpoint? I'll be the first to admit to it being a while since I've looked at an algorithms book, so feel free to club me over the head with any idiocies.

--Josh Model Asst. Technical Staff MIT Lincoln Laboratory

Reply to
Josh Model
Loading thread data ...


Do you get all 40 numbers on every clock? Or do you get them one at a time? Or are they all sitting in a memory somewhere?

If the numbers arrive one by one, then keeping track of the largest (and its index) is pretty simple.

If you need the max of the 40 numbers that you saw most recently, it's a bit harder because you must keep the numbers in sorted order, but you also need to know about their arrival order so that you can throw away the correct value when it reaches the "stale" end of the list. I've done similar things (on a smaller scale) for median filters, and it gets kinda messy.


I don't think it's that ugly, really. A tree of 40 comparator/mux modules will do it, in a fairly systematic way. (It would be horrific to code in Verilog, but easy in VHDL.)

Somewhere in the archives here we have a heapsorter design which does the right kind of thing... I'll seek it out, if you are interested.

Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
 Click to see the full signature
Reply to
Jonathan Bromley

Responses *'d

*Right. we get all 40 numbers (outputs of Multiply accumulates) once every N clock cycles, with N probably being determined by the update rate at which we can find the maximum. In the ideal situation, N = 1, and we update the maximum index every clock cycle.
  • I think you're right, might not be so bad. If pipelined it would reduce latency to Ceil(log2(40)) = 6, which is pretty good, and keep throughput at maximum. The way I see it, each module would be made up (using 20-bit numbers & 6-bit addresses) a 2-input, 20-bit comparator which serves as the select for this module, a 20-bit 2:1 mux to pass the greater value, and a
6-bit 2:1 mux to pass the corect index. Cascade these guys in levels and you'd need 20 + 10 + 5 + 2 + 2 + 1 modules. Still seems sort of inelegant to me, but it gets the job done, which is the point.
  • I'd be interested if it's easy for you to find -- for my app, I think most of the overhead of building the heap would be wasted, since I only want the max.
*Thanks, *--Josh
Reply to
Josh Model

Aw, heck... it's just a bunch of for loops.

40 values, 20 compares, 20 results. 20 values, 10 compares, 10 results. 10 values, 5 compares, 5 results. 5 values, 2 compares, 3 results. 3 values, 1 compare, 2 results. 2 values, 1 compare, final result.

41 20 bit registers in the pipeline and a result in about 5 clocks.

one stage would look like: for( i=0; i

Reply to
John Handwork




What's the operating clock rate? If you are running in the low tens of MHz just up your internal FPGA frequency. That's the "FPGA way".

Also, can you use any information prior to the MAC to pre-calculate which result will be the largest? Are the coefficients fixed? You get the idea.

What's feeding data into the FPGA? Can any intelligenge be derived from this source?

It seems to me that the best you can do is a parallel binary search type algorithm. It will take several clocks no matter what.

Another option might be (WARNING: IT'S UGLY AND GENERALLY BAD DESIGN) to hand place all the components and look into a fully combinatorial solution. Depending on your clock rate you might be able to pull this off.

Martin Euredjian
 Click to see the full signature
Reply to
Martin Euredjian

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.