Implementing a fast cache in Altera Cyclone

Hello,

I want to implement a cache which holds more than 30 words of 2 bytes each. The module should be described in VHDL.

The problem I am facing right now is the following:

It must be possible to write an address into the cache and to read an address out of it. The most important function is checking whether a new incoming address is already stored in the cache, and this check should complete within a few clock cycles. The problem is that the cache has more than 30 entries, so a FIFO or simple RAM structure would need more than 30 clock cycles to scan all the words.

Does anyone have an idea how this problem could be solved to achieve a faster check time?

Thank you very much for your help.

Kind regards,
Andrés Vázquez
G&D System Development - FPGA design

Reply to Vazquez

formatting link

Gr, Hans

Reply to Hans Brand

The usual arrangement in caches is to compute some function of the address and use that as the index into your cache. That way, it's very easy to determine whether an address is held in the cache.

Typically the "function" is just the bottom few bits of the address, because this gives a reasonable distribution of cached addresses into cached locations for many practical problems.

I don't know how big your addresses are, but let's guess they are 16 bits wide. Let's also suppose that you use the bottom 5 bits of the address to index into your 32-slot cache. Now let's suppose you have accessed some locations...

0x1234 -- uses cache slot 0x14
0x1235 -- uses cache slot 0x15
0x9A23 -- uses cache slot 0x03

You can see that this scheme will work well for short sequential burst accesses. However, let's now suppose that you make a fourth access...

0x7255 -- uses cache slot 0x15

This new access will of course evict the 0x1235 entry from slot 0x15.

To make all this work, each cache slot needs to store...

  • those bits of the address that were not used to index it: in my example, the top 11 bits
  • the cached data
  • a "valid" bit, to say that the stored data is correct
  • (for write-back caches) a "dirty" bit, saying that the cache slot has been written with a new data value, but the cached memory location hasn't yet been written with this new value
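Taken together, the lookup logic above can be modeled behaviorally. The sketch below is Python rather than VHDL purely to keep it compact, and it uses the 16-bit-address / 32-slot example from above; it is an illustration of the direct-mapped scheme, not anyone's actual implementation:

```python
# Behavioral model of a 32-slot direct-mapped cache for 16-bit
# addresses: bottom 5 bits index the slot, top 11 bits are the tag.
INDEX_BITS = 5
NUM_SLOTS = 1 << INDEX_BITS          # 32

class DirectMappedCache:
    def __init__(self):
        # Each slot holds a valid bit, dirty bit, tag, and data.
        self.valid = [False] * NUM_SLOTS
        self.dirty = [False] * NUM_SLOTS
        self.tag   = [0] * NUM_SLOTS
        self.data  = [0] * NUM_SLOTS

    def _split(self, addr):
        index = addr & (NUM_SLOTS - 1)   # bottom 5 bits
        tag   = addr >> INDEX_BITS       # top 11 bits
        return index, tag

    def lookup(self, addr):
        """Single-cycle check: is this address in the cache?"""
        index, tag = self._split(addr)
        if self.valid[index] and self.tag[index] == tag:
            return self.data[index]      # hit
        return None                      # miss

    def store(self, addr, value):
        """Write a value, evicting whatever occupied the slot."""
        index, tag = self._split(addr)
        self.valid[index] = True
        self.dirty[index] = True         # write-back: memory now stale
        self.tag[index]   = tag
        self.data[index]  = value

c = DirectMappedCache()
c.store(0x1234, 111)   # slot 0x14
c.store(0x1235, 222)   # slot 0x15
c.store(0x9A23, 333)   # slot 0x03
c.store(0x7255, 444)   # slot 0x15 -- evicts 0x1235
```

Note how the fourth store silently replaces the 0x1235 entry, exactly as in the eviction example above.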

To improve performance, caches can have N-way associativity - in other words, you have more than one cache slot associated with each cache index. This means you have to compare addresses, but typically only 2 or 4 rather than the full 32. You can generally do this with parallel, single-cycle logic.
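As a sketch of the N-way idea, here is a 2-way variant of the same model. The loop over the two ways corresponds to two comparators operating in parallel in hardware; the way-selection policy on store is left to the caller, which is a simplification:

```python
# 2-way set-associative lookup: 16 sets of 2 ways each still gives
# 32 slots, but each address can live in either way of its set.
SET_BITS = 4
NUM_SETS = 1 << SET_BITS             # 16 sets x 2 ways = 32 slots

# ways[w][s] = (valid, tag, data) for way w, set s
ways = [[(False, 0, 0)] * NUM_SETS for _ in range(2)]

def store(addr, value, way):
    s = addr & (NUM_SETS - 1)
    ways[way][s] = (True, addr >> SET_BITS, value)

def lookup(addr):
    s   = addr & (NUM_SETS - 1)
    tag = addr >> SET_BITS
    # Hardware compares both ways in the same cycle; this loop is
    # just the sequential description of those two comparators.
    for way in range(2):
        valid, t, data = ways[way][s]
        if valid and t == tag:
            return data
    return None

# 0x1235 and 0x7255 share the same set (bottom 4 bits = 0x5) but
# can now coexist, one per way -- no eviction this time:
store(0x1235, 222, way=0)
store(0x7255, 444, way=1)
```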

If indexing on the bottom few bits of address doesn't suit your requirement, you can use a more sophisticated address mapping - perhaps some kind of hash function. It doesn't matter, as long as the function is repeatable and distributes addresses onto cache slots in a suitable way. You would then need to store the whole address in the cache.
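As a purely illustrative example of such a mapping (the specific function is my assumption, not a recommendation), an XOR-fold of the address avoids the worst collisions of bottom-bits indexing for strided access patterns:

```python
# One possible repeatable hash for indexing a 32-slot cache:
# XOR-fold the 16-bit address down to 5 bits. In hardware this
# costs just a few XOR gates per index bit.
def hash_index(addr):
    return (addr ^ (addr >> 5) ^ (addr >> 10)) & 0x1F

# Addresses with a stride of 32 all collide in slot 0 under plain
# bottom-5-bit indexing, but spread across slots under the fold:
slots = {hash_index(a) for a in range(0, 32 * 8, 32)}
```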

-- Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, Hampshire, BH24 1AW, UK Tel: +44 (0)1425 471223 mail: snipped-for-privacy@doulos.com Fax: +44 (0)1425 471573 Web:

formatting link

The contents of this message may contain personal views which are not the views of Doulos Ltd., unless specifically stated.

Reply to Jonathan Bromley

This feature is only available in APEX, Excalibur, and Mercury.

- Paul Leventis Altera Corp.

Reply to Paul Leventis

Andres,

I've written two responses, one assuming you want a CAM (that is, something you give a 16-bit entry and it tells you whether that entry is in one of the 32 slots of the CAM), the other assuming you can get away with a cache.

*** If you really do need a CAM ***

CAMs are expensive to implement. You need to look at ALL the data in your cache at once in order to find a match (in a single cycle), and hence using a memory is not a great fit for this task. You're stuck implementing your CAM in logic if you need single-cycle access. With 32 entries of 16 bits each, this means 32x16 = 512 LEs just for the registers. But it's not as bad as it seems -- the LUTs in these LEs will still be used (automatically) by Quartus for related logic if such logic is available, or unrelated logic if your design is getting full.
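A functional model of that register-based CAM is below. In hardware the search becomes 32 parallel 16-bit comparators feeding a priority encoder, all evaluated in one cycle; the Python merely describes the same comparisons sequentially:

```python
# 32-entry x 16-bit CAM built from registers: every entry is
# compared against the search word simultaneously in hardware.
NUM_ENTRIES = 32

entries = [0] * NUM_ENTRIES
valid   = [False] * NUM_ENTRIES

def cam_write(slot, word):
    entries[slot] = word & 0xFFFF
    valid[slot] = True

def cam_match(word):
    """Return the matching slot number, or None. In silicon all 32
    comparisons happen at once; the generator just lists them."""
    return next((i for i in range(NUM_ENTRIES)
                 if valid[i] and entries[i] == word), None)

cam_write(7, 0x9A23)
cam_write(19, 0x1234)
```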

I'd also worry about the explosion in the amount of logic required for comparison if you are doing single-cycle matching. One technique that can halve (or further reduce) the search time and comparison hardware while still using a RAM is to use a wider RAM. You can get two entries per cycle just by using a 16x32 RAM (32 bits wide), which maps into a single M4K block in Cyclone. Or go up to an 8x64 or 4x128 RAM. This way you retrieve multiple entries per cycle.
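The cycle-count arithmetic behind the wider-RAM trick is worth writing down; this small sketch just counts read cycles for the RAM shapes mentioned above:

```python
# Scanning 32 x 16-bit entries through RAMs of various widths:
# a W-bit-wide RAM delivers W // 16 entries per read cycle.
TOTAL_ENTRIES = 32
ENTRY_BITS = 16

def scan_cycles(ram_width_bits):
    entries_per_cycle = ram_width_bits // ENTRY_BITS
    # Ceiling division: cycles needed to examine every entry once.
    return -(-TOTAL_ENTRIES // entries_per_cycle)
```

So a 16x32 RAM takes 16 cycles in the worst case, and a 4x128 RAM only 4.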

Note: Just use the lpm_ram megafunction. Quartus will automatically make use of the right combination of M4K blocks to implement the RAM you need "behind your back".

If you expect a lot of misses and would like misses to come back quickly, you could store your data differently. For example, use a 4x128 RAM and, instead of storing 8 entries per address, store the first 4 bits of all 32 entries at the 1st address, the next 4 bits at the 2nd address, etc. In each clock cycle you compare 4 bits of every entry against 4 bits of the target word. But this way you get an early exit on a miss.
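That transposed layout can be modeled as follows: each RAM address holds one 4-bit slice of all 32 entries, and a search narrows a candidate set slice by slice, exiting as soon as the set is empty. The slice ordering here is an assumption:

```python
# Transposed storage for early-exit matching: RAM address k holds
# bits [4k+3 .. 4k] of all 32 entries side by side (a 4x128 RAM).
NUM_ENTRIES = 32
SLICES = 4                     # four 4-bit slices per 16-bit entry

def nibble(word, k):
    return (word >> (4 * k)) & 0xF

def transpose(entry_list):
    # ram[k][i] = k-th nibble of entry i
    return [[nibble(e, k) for e in entry_list] for k in range(SLICES)]

def match_early_exit(ram, target):
    """Return (hit, cycles_used). Each cycle compares one nibble of
    every surviving candidate; a miss can exit before cycle 4."""
    candidates = set(range(NUM_ENTRIES))
    for k in range(SLICES):
        t = nibble(target, k)
        candidates = {i for i in candidates if ram[k][i] == t}
        if not candidates:
            return False, k + 1        # early exit on a miss
    return True, SLICES

entries = [0] * NUM_ENTRIES
entries[5] = 0x1234
ram = transpose(entries)
```

A hit always costs all 4 cycles, but a word that differs in its low nibble from every stored entry misses after a single cycle.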

I'm no CAM expert and I'm sure there'll be a bunch of other ideas that come up...

*** If you are looking for a cache ***

Jonathan gives a nice description of direct-mapped and N-way set associative caching. I'll just add a few points and ideas.

As a rule of thumb, increasing cache size yields better results (in a microprocessor) than increasing associativity. In selecting the cache design for your system, you need to take into account the memory access patterns, the required system clock speed, tolerable cache latency, etc. Switching from a direct-mapped (i.e. 1-way associative) to a 2- or 4-way set-associative cache will mean that you require more logic for performing reads and writes, and depending on your clock rate, this may limit the speed of your system.

Often you must run software simulations of a system to determine the best cache. I'd suggest trying 1-way, 2-way and 4-way set-associative caches of various sizes and comparing the resulting efficiency (i.e. the average number of clocks required per instruction) of your system on simulated, real-world data that it will handle. Then compare this to the relative clock speeds those different systems will run at (which will require implementing both a direct-mapped and an N-way associative cache in HDL).

Also, note that making a cache deeper is nearly free in Cyclone if you aren't using your memory blocks for other features. You get a 128x36 RAM for the same price as a 32x36 RAM -- so you might as well make a 128-entry direct mapped cache. Or you can stitch multiple RAMs together (automatically through lpm_ram) to create a deeper cache.

Regards,

Paul Leventis Altera Corp.

Reply to Paul Leventis

An auxiliary victim cache could work well in distributed RAM.
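The victim-cache suggestion can be sketched as a direct-mapped cache backed by a small fully associative buffer: an entry evicted from the main cache gets a second chance in the buffer, which is probed in parallel with the main lookup. The sizes below are assumptions for illustration:

```python
# Direct-mapped cache plus a small fully associative "victim
# cache": entries evicted from the main cache get a second chance.
from collections import OrderedDict

INDEX_BITS = 5
NUM_SLOTS = 1 << INDEX_BITS
VICTIM_SLOTS = 4                     # assumed size, not from the thread

main = {}                            # index -> (tag, data)
victims = OrderedDict()              # full address -> data, oldest first

def store(addr, value):
    idx, tag = addr & (NUM_SLOTS - 1), addr >> INDEX_BITS
    if idx in main and main[idx][0] != tag:
        # Evict the old occupant into the victim cache.
        old_tag, old_data = main[idx]
        victims[(old_tag << INDEX_BITS) | idx] = old_data
        while len(victims) > VICTIM_SLOTS:
            victims.popitem(last=False)   # drop the oldest victim
    main[idx] = (tag, value)

def lookup(addr):
    idx, tag = addr & (NUM_SLOTS - 1), addr >> INDEX_BITS
    if idx in main and main[idx][0] == tag:
        return main[idx][1]          # main-cache hit
    return victims.get(addr)         # victim hit, or None on a miss

store(0x1235, 222)
store(0x7255, 444)   # same slot 0x15: 0x1235 moves to the victim cache
```

In hardware the victim buffer is a tiny CAM, so its parallel compare is cheap at 4 entries even though it would be expensive at 32.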

Reply to Tim
