You are thinking in terms of feeding the vectors into an FPGA in parallel. That is hardly necessary (unless you generate entire new vectors every
10ns).
Questions you need to answer:
-How quickly are new vectors generated?
-Where will the vectors come from (or be stored)?
-How quickly do you need an answer?
-What is the maximum value of M and N?
-Is it guaranteed that there will be only one matching value for all possible vector combinations?
-How many different 20 bit values can a vector element have? Is it a straight unsigned binary value or some sort of a code that would limit the number of distinct values to less than 2**20?
Just as an example, if M = N = 512, with the FPGA running at 100MHz, you could do a brute force comparison of two vectors in 2.6 ms. Your M = N = 60 example would take 36 microseconds. Keep in mind, these are worst-case figures.
For the above use two SelectRAM memories preloaded with your vectors. You could load your vectors one word at a time at whatever rate your source could handle. If, say, your source happens to be SDRAM connected to the FPGA you could load the vectors at 133MHz (or 2x with DDR). Depending on the width of the data path that would mean one or more 20 bit word per clock. No need to consume I/O to pass the vectors in parallel.
So, a rough estimate of a complete load->compare->result cycle would be: M = N = 512 ... load = 7.7us; compare = 2.6ms M = N = 60 ... load = 0.9us; compare = 36us
Is that fast enough? You could always run the FPGA faster.
If SDRAM was involved I would probably run the logic at the SDRAM clock (or
2x) rather than an unrelated frequency (like 100MHz).
There are a million possible variations of this concept. It all depends on the constraints you have to work within. A binary search, for example, would provide a very significant improvement in performance. You could replicate the circuit X times in order to obtain a further X-time improvement in reaching a match.