This is not a question... :)
Equality compares are easy. Use a two-input XOR for each bit pair and OR all the results together. With 4-input LUTs, one LUT can XOR two bit pairs and OR the two results, so 32 LUTs cover the XORs and the first level of ORs (two bit pairs per LUT, 64 bits in all). It takes 11 more LUTs (8 + 2 + 1) to combine the rest, for a total of 43 LUTs in four levels. If the design uses the "special" features that most chips have (ORing of LUTs within a CLB), you can use the LUTs in pairs or even groups of four and reduce the number of levels for speed.
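The LUT count above can be checked with a small model. This is only a sketch: the function names are made up for illustration, and it assumes plain 4-input LUTs with no carry-chain or CLB-level ORing tricks.

```python
def lut_cost(width, lut_inputs=4):
    """Return (luts, levels) for a width-bit equality compare.

    Assumption: each first-level LUT XORs lut_inputs // 2 bit pairs and ORs
    the results; each later LUT ORs up to lut_inputs partial results.
    """
    pairs_per_lut = lut_inputs // 2
    signals = -(-width // pairs_per_lut)  # ceiling division
    luts = signals
    levels = 1
    while signals > 1:
        signals = -(-signals // lut_inputs)  # one more level of OR LUTs
        luts += signals
        levels += 1
    return luts, levels


def differs(a, b, width):
    """Bit-level model: XOR each bit pair, then OR all the results."""
    mismatch = 0
    for i in range(width):
        mismatch |= ((a >> i) & 1) ^ ((b >> i) & 1)
    return mismatch  # 1 if any bit differs, 0 if the words are equal


print(lut_cost(64))  # (43, 4)
```

The result for 64 bits matches the 32 + 11 = 43 LUTs in four levels worked out by hand.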
Is that a 16-bit address (64K words) of 8-bit words? Because 64K x 8 = 512 Kbits. You can get this much RAM in the VirtexII if you use the XC2V500 part. Or in the new Spartan3 you could use an XC3S1500. I am not sure which will be cheaper, but I bet it is the Spartan3.
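As a rough check of the memory sizing, here is a short calculation. The 2K x 8 block geometry is an assumption about how each block RAM would be configured, so check the data sheet for the part you pick; the function name is just for illustration.

```python
def block_rams_needed(depth, width, block_depth=2048, block_width=8):
    """Count block RAMs needed to build a depth x width memory.

    Assumption: every block RAM is configured as block_depth x block_width
    (2K x 8 here), and blocks are simply tiled in depth and width.
    """
    rows = -(-depth // block_depth)  # ceiling division
    cols = -(-width // block_width)
    return rows * cols


depth = 2 ** 16            # 16-bit address
total_bits = depth * 8     # 8-bit words
print(total_bits // 1024)              # 512 (Kbits)
print(block_rams_needed(depth, 8))     # 32 blocks under the 2K x 8 assumption
```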
The block RAM will be much faster than anything external to the FPGA. It is synchronous and lends itself well to pipelined operations.
A lot of how your design will be implemented depends on your data flow, which you have said nothing about. Think about how the storage will be organized and accessed. Obviously one large block of memory with one interface will not let you do 64 compares at one time. If your rate of performing these compares is not fast, you can use one compare logic block and run the different data through it sequentially. Then one memory could easily do the job.
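The sequential option can be pictured with a simple software model. This is a sketch under assumptions, not a hardware description: it charges one clock per word read and compared, ignores pipeline latency, and the names are hypothetical.

```python
def sequential_search(memory, key):
    """Run every stored word through one comparator, one word per clock.

    Returns (address, cycles) for the first match, or (None, cycles) if
    no word matches. Pipeline fill latency is ignored in this model.
    """
    cycles = 0
    for address, word in enumerate(memory):
        cycles += 1              # one synchronous read and compare per clock
        if (word ^ key) == 0:    # XOR then "OR": any set bit means mismatch
            return address, cycles
    return None, cycles


memory = [3, 7, 9, 7]
print(sequential_search(memory, 9))  # (2, 3)
```

The trade-off is the one described above: a single comparator and a single memory port, at the cost of one clock per word instead of many compares in parallel.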