Using LRU for 2-way only means 1 toggle bit per set, but true LRU for 4-way has to track 4*3*2 = 24 orderings per set (at least 5 bits plus the update logic), which is more trouble than it's worth. It is indeed possible, but the simpler schemes give almost as good a result. Intel used a 3-bit (1,1,1) toggle hierarchy, called pseudo-LRU, for one of their 4-way designs, with almost as good a result.
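A rough behavioral sketch of such a 3-bit toggle tree for one 4-way set, in Python rather than HDL just to show the state updates. The class name `TreePLRU4` and the bit encoding (b0 picks the pair, b1 and b2 pick within a pair) are assumptions for illustration, not taken from any Intel documentation.

```python
class TreePLRU4:
    """Behavioral model of 3-bit tree pseudo-LRU for one 4-way set."""

    def __init__(self):
        # b0: which pair holds the next victim (0 -> ways 0/1, 1 -> ways 2/3)
        # b1: victim within ways 0/1, b2: victim within ways 2/3
        # (this encoding is an assumption, other encodings work equally well)
        self.b0 = 0
        self.b1 = 0
        self.b2 = 0

    def touch(self, way):
        """On a hit or a fill, point the bits away from the way just used."""
        if way < 2:
            self.b0 = 1              # next victim comes from the other pair
            self.b1 = 1 - way        # within pair 0/1, point at the other way
        else:
            self.b0 = 0
            self.b2 = 1 - (way - 2)  # within pair 2/3, point at the other way

    def victim(self):
        """Follow the bits down the tree to the way to replace."""
        if self.b0 == 0:
            return self.b1
        return 2 + self.b2


plru = TreePLRU4()
for w in (0, 1, 2, 3):
    plru.touch(w)
first_victim = plru.victim()   # way 0, same as true LRU here
plru.touch(0)
second_victim = plru.victim()  # way 2, where true LRU would pick way 1
```

After touching ways 0,1,2,3 in order the tree picks way 0, the same as true LRU. Touch way 0 again and it picks way 2, whereas true LRU would pick way 1; that is the "almost as good" part, bought with 3 bits per set instead of 5.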
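If you are starting out in cache design in FPGA you should probably do a direct-mapped (1-way) cache first; with only one candidate line per index there is no replacement decision to make, and random replacement is the simplest policy once you add ways. From there on, it gets more complex with each extra feature.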
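For anyone starting out, a minimal behavioral sketch of a direct-mapped lookup, again in Python rather than HDL. The line size, line count and the names `DirectMappedCache` and `access` are assumed values for illustration only; only tags and valid bits are modelled, not the data RAM.

```python
LINE_BYTES = 16   # assumed line size
NUM_LINES = 256   # assumed number of lines (a 4 KB cache)


class DirectMappedCache:
    """Tag/valid model of a direct-mapped cache (no data storage)."""

    def __init__(self):
        self.valid = [False] * NUM_LINES
        self.tags = [0] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        line_addr = addr // LINE_BYTES   # drop the byte offset within the line
        index = line_addr % NUM_LINES    # low bits select the one possible line
        tag = line_addr // NUM_LINES     # remaining high bits are compared
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # Only one place this line can live, so no victim choice is needed.
            self.valid[index] = True
            self.tags[index] = tag
        return hit
```

Two addresses that differ by a multiple of LINE_BYTES * NUM_LINES land on the same index and keep evicting each other, which is the conflict behaviour that extra ways are meant to reduce.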
A fully associative cache is really out of the question for both FPGA & ASIC, as the associative row decode adds dearly to access time. But the various 2-, 4-, and 8-way set-associative schemes work almost as well.
I have a cache book by Jim Handy which goes from very light reading to quite deep once it gets into the more interesting issues, but it never addresses the special issues of FPGA design, being written for TTL-to-ASIC readers.
There is a way to get fully associative behaviour without requiring complex, fast HW, but it involves a more SW-like approach, i.e. hashing addresses into a somewhat empty table. The access time becomes a faster cycle times some factor that is at least 1 and has no fixed upper bound, with an average that can be close to 1.5 cycles for a 50% full RAM. You trade one problem for varying access times, and you must keep the table at least 30% empty. IBM has used this for inverted page table MMUs too. I am using it as well, since varying cycle counts don't bother me much. More complex cache designs for multiple CPUs that try to maintain cache coherency already have to deal with timings that vary with snooping and bus interference anyway.
regards
johnjakson_usa_com
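PS: a rough sketch of the hashed-table lookup described above, assuming linear probing and a placeholder multiplicative hash. The post does not say which probing scheme or hash is used, so `home_slot`, `HashedTable`, the table size and the 70% fill limit as coded here are all illustrative assumptions. The probe count returned by each call stands in for the varying number of cycles.

```python
TABLE_BITS = 6                       # assumed table size: 64 slots
TABLE_SLOTS = 1 << TABLE_BITS
MAX_USED = (TABLE_SLOTS * 7) // 10   # keep at least 30% of the slots empty


def home_slot(addr):
    """Placeholder hash: top TABLE_BITS bits of a 32-bit multiplicative hash."""
    return ((addr * 2654435761) & 0xFFFFFFFF) >> (32 - TABLE_BITS)


class HashedTable:
    """Open-addressed table with linear probing (no deletion modelled)."""

    def __init__(self):
        self.keys = [None] * TABLE_SLOTS
        self.vals = [None] * TABLE_SLOTS
        self.used = 0

    def insert(self, addr, val):
        """Store val under addr; return the number of slots probed."""
        slot = home_slot(addr)
        probes = 1
        while self.keys[slot] is not None and self.keys[slot] != addr:
            slot = (slot + 1) % TABLE_SLOTS   # linear probe to the next slot
            probes += 1
        if self.keys[slot] is None:
            if self.used >= MAX_USED:
                raise RuntimeError("table would exceed 70% full")
            self.keys[slot] = addr
            self.used += 1
        self.vals[slot] = val
        return probes

    def lookup(self, addr):
        """Return (value or None, probes); an empty slot ends the search."""
        slot = home_slot(addr)
        probes = 1
        while self.keys[slot] is not None:
            if self.keys[slot] == addr:
                return self.vals[slot], probes
            slot = (slot + 1) % TABLE_SLOTS
            probes += 1
        return None, probes
```

Because the table is never allowed to fill up, every search is guaranteed to reach an empty slot and terminate. For linear probing with a well-behaved hash, the textbook estimate for a successful lookup is about (1 + 1/(1 - a))/2 probes at load factor a, which gives 1.5 at 50% full and climbs quickly as the table fills; that is why the empty margin matters. Evicting entries would need extra handling (tombstones or re-insertion), which this sketch leaves out.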