embedded RAM vs. registers

- alb

Fri, Jan 17, 2014 3:29 PM

Hi everyone,

I'm trying to optimize the footprint of my firmware on the target device and I realize there are a lot of parameters which might be stored in the embedded RAM instead of dedicated registers.

Certainly the RAM access logic will 'eat some space', but lots of flops will be released. Is there any recommendation on how to optimally use embedded resources? [1]

The main reason for this optimization is to free some space to include a function which has been added later in the design phase (ouch!).

Thanks a lot,

Al

[1] I know that, put like this, this question is certainly open to a heated discussion! :-)

--
A: Because it fouls the order in which people normally read text. 
Q: Why is top-posting such a bad thing? 
A: Top-posting. 
Q: What is the most annoying thing on usenet and in e-mail?

- GaborSzakacs

Fri, Jan 17, 2014 9:53 PM

It depends on the device you're targeting. To some extent the tools can make use of embedded RAM without changing your RTL. For example, Xilinx tools allow you to place logic into unused BRAMs, and will automatically infer SRLs where the design allows it.

I've often used BRAM as a "shadow memory" to keep a copy of internal configuration registers for readback. That can eliminate a large mux, at least for all register bits that only change when written. Read-only bits and self-resetting bits would still need a mux, but the overall logic could be reduced vs. a complete mux for all bits.

--
Gabor 

P.S. - I find your signature more annoying than top posting.  In my 
opinion the most annoying thing about usenet (besides the text-only 
format) is people who think they have been appointed to police the 
etiquette of other posters.

- alb

Sat, Jan 18, 2014 10:23 PM

Hi Gabor,

On 1/17/2014 10:53 PM, GaborSzakacs wrote: []

[]

Uhm, apparently the Microsemi devices I'm using (IGLOO), together with the toolset (Libero IDE), are not smart enough to take advantage of the local memory, unless I'm inadvertently asking *not* to use it. To be honest, I have not searched deeply for RAM usage on these devices, but the handbook does not provide any clue on 'use of RAM without changing RTL'.

I guess I do not completely follow you here, which mux are you talking about?

Al

p.s.: you are entitled to have your own opinion about Usenet and its users' opinion, no more than I am.

- Gabor

Sun, Jan 19, 2014 3:30 AM

In a system with a processor (external or embedded) you typically have some form of bus to read and write registers within the FPGA. Normally you need the outputs of these registers all the time, so you can't just implement the whole thing as RAM. Now if the CPU wants to be able to read back the values it wrote, you need a big readback multiplexer (unless your IGLOO has internal tristate buffers) to select the register you want to read back. What I do is to have a RAM that keeps a copy of what was written by the CPU. Then the readback mux defaults to the output of this (simple single-port) RAM unless the register is read-only or has some side-effects that could change the register's value when it's not being written by the CPU. If you have a design with a whole lot of registers, you can really reduce the size of the readback mux.
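The shadow-RAM readback scheme described above can be sketched as a small behavioral model. This is only an illustration in Python, not RTL and not Gabor's actual design; the class name `RegisterBank`, its methods, and the notion of a `volatile` address set (read-only or self-resetting registers) are all assumptions made for the example.

```python
class RegisterBank:
    """Behavioral model: flip-flop registers plus a shadow RAM used for readback."""

    def __init__(self, n_regs, volatile=()):
        self.flops = [0] * n_regs      # real registers; their outputs are always available to the logic
        self.shadow = [0] * n_regs     # simple single-port RAM holding a copy of what the CPU wrote
        self.volatile = set(volatile)  # registers whose value can change without a CPU write

    def cpu_write(self, addr, value):
        self.flops[addr] = value
        self.shadow[addr] = value      # every CPU write also lands in the shadow RAM

    def hw_update(self, addr, value):
        # Hardware side effect (status bit, self-clearing strobe); the shadow RAM never sees it.
        if addr not in self.volatile:
            raise ValueError("only volatile registers may change without a CPU write")
        self.flops[addr] = value

    def cpu_read(self, addr):
        # The readback mux only has to cover the volatile registers;
        # everything else defaults to the RAM output.
        if addr in self.volatile:
            return self.flops[addr]
        return self.shadow[addr]

    def mux_inputs(self):
        # One mux leg per volatile register plus one leg for the RAM output.
        return len(self.volatile) + 1


bank = RegisterBank(64, volatile={0, 1})
bank.cpu_write(5, 0xAB)
bank.hw_update(0, 1)
print(bank.cpu_read(5), bank.cpu_read(0), bank.mux_inputs())
```

In this sketch a bank of 64 registers with two volatile ones needs a 3-input readback mux instead of a 64-input one, which is where the saving would come from.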

Of course you could save even more logic by not having readback for values that only change when written by the CPU. These become "write-only" registers, and the software guy then needs to keep his own "shadow" copy of the values he wrote if he needs to read it back later.

Someone said, "Opinions are like a**holes. Everyone has one, and they all stink." In any case I see you removed your signature from the latest post. ;-)

--
Gabor

- alb

Mon, Jan 20, 2014 1:14 PM

Hi Gabor,

On 1/19/2014 4:30 AM, Gabor wrote: []

I follow you if you talk about 'state registers', which of course are needed to keep the current state of the logic, but there are lots of 'configuration registers' which do not need constant access to their values.

A simple example would be the configuration of a UART: you do not need to know *constantly* that you need a parity bit or two stop bits. This type of 'memory' can go in a RAM. Would you agree?

Got your point about the multiplexer.

I tend to avoid local copies of information since they may not mirror efficiently, leading to multiple sources of 'truth' which may eventually bite you. How do you guarantee on a cycle-by-cycle basis that the two locations match perfectly? What happens if they differ? If you do not need cycle accuracy, then which location do you rely upon?

I now understand your, indeed valid, point.

See my opinion on multiple copies above.

[]

See, we are not too far apart with our own personal opinion on 'opinions'.

That is done automatically by my mailer when I'm not the OP, so do not get too excited about that ;-)

- GaborSzakacs

Tue, Jan 21, 2014 4:28 PM

Not at all. The UART needs to know how many stop bits and what sort of parity to use whenever it transmits data. That can be completely asynchronous to the CPU data bus. If the UART needed to get this info from RAM, it would need another address port to that RAM. That's a very inefficient use of hardware to avoid storing 2 or 3 bits in a separate register. If you meant that the UART would read the RAM and then keep a local copy, how is this different (in terms of resource usage) than just having the register implemented in flip-flops?

This is indeed an issue whenever you use this technique to save resources. I look at it as a trade-off. In the case of readback for read/write bits that only change when written by the CPU, the only time you would be out of sync is at start-up. In my case I would either make a rule that the software must write every register at least once before it could be read back, or I would program the "RAM" with the initial register values at config time. This works on Xilinx parts, where the configuration bitstream has bits for all BRAM locations. Not all FPGAs can do this, though. Anyway, I thought this thread was about saving device resources...
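The start-up mismatch and the two remedies mentioned above (software writes every register before reading it back, or the RAM is preloaded at configuration time) can be illustrated with a toy Python model. The reset values in `RESET_VALUES` and the function names are hypothetical, and `None` merely stands in for "unknown RAM contents after power-up".

```python
RESET_VALUES = {0: 0x00, 1: 0x1F, 2: 0x03}  # hypothetical register reset values


def power_up(preload_ram):
    """Return (flops, shadow) as they would look right after start-up."""
    flops = dict(RESET_VALUES)               # flip-flops come up with their reset values
    if preload_ram:
        shadow = dict(RESET_VALUES)          # RAM initialised from the bitstream (device dependent)
    else:
        shadow = dict.fromkeys(RESET_VALUES) # contents unknown, modelled as None
    return flops, shadow


def cpu_write(flops, shadow, addr, value):
    flops[addr] = value
    shadow[addr] = value                     # the write keeps both copies identical


def in_sync(flops, shadow):
    return flops == shadow


# Option 1: no preload, software must write every register once before readback.
flops, shadow = power_up(preload_ram=False)
print(in_sync(flops, shadow))                # False
for addr, value in RESET_VALUES.items():
    cpu_write(flops, shadow, addr, value)
print(in_sync(flops, shadow))                # True

# Option 2: RAM preloaded at configuration time, in sync from the start.
print(in_sync(*power_up(preload_ram=True)))  # True
```

After start-up is handled, the two copies can only diverge for registers that change without a CPU write, which is why those stay on the readback mux.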

--
Gabor

- glen herrmannsfeldt

Tue, Jan 21, 2014 9:56 PM

(snip)

If you think of it that way, (and sometimes I do) then the microprocessor is the biggest waste of transistors ever invented. A huge number of transistors, now in the billions, to get data into, and out of, an arithmetic-logic-unit containing thousands of transistors.

Most of the time, a large fraction of the logic isn't doing anything at all!

Consider the old favorite of introductory digital logic laboratory courses, the digital clock. Almost nothing happens most of the time (ignore display multiplex for now), but once a second the display is updated. In the 1970s, you would build one out of TTL chips. Though the FF's had the ability to switch at MHz rates, here they ran at 1Hz or less. (Well, divide down from 60Hz.) Again, the transistors are being wasted, but now in the time domain instead of the spatial domain.

A small MCU, with small, built-in RAM and ROM (maybe external ROM) has plenty of power to run a digital clock. Many more transistors than the TTL version, and they are used more often than the TTL version, but the economy of scale of building small MCUs more than makes up for it.

As to the previous question, how to build a UART.

If you look inside a terminal server (not that anyone uses them anymore) you find a microprocessor in place of 8 UARTs. A single microprocessor is fast enough to collect the bits from eight incoming serial ports, and drive the bits into eight outgoing ports, along with keeping up the TCP connections to the Ethernet port.

I am sure the people who designed and built some of the early computers would think it strange that we now have a loop waiting for the user to type on the keyboard.

In the early days, single task batch processing made more efficient use of the available resources. Not so much later, multitasking allowed one to keep a single CPU busy, though with less efficient use of RAM. (Decreasing cost of RAM vs. CPU.)

With an FPGA, one has the ability to keep a large number of transistors (gates) busy a large fraction of the time, if one has a problem big enough.

-- glen

- jonesandy

Wed, Jan 22, 2014 12:27 AM

Al,

Most "automatic" conversion of logic from LUTs to RAMs involves using the RAMs like ROMs, preloaded with constant data during configuration. Flash-based FPGAs from Microsemi do not have the ability to preload their BRAMs during "configuration." There is no "configuration" phase at/during startup during which they could automatically be preloaded.

Furthermore, the IGLOO/ProASIC3 series only provide synchronous BRAMs with a clock cycle delay between address in and data out. They can be inferred from RTL, so long as your RTL includes that clock cycle delay.
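To make the one-cycle read latency concrete, here is a minimal behavioral sketch in Python (not inferable RTL). The class `SyncRam` and the helper `simulate` are names invented for this example, and the read-before-write behavior on a simultaneous write is an assumption rather than something taken from the device handbook.

```python
class SyncRam:
    """Synchronous RAM model: the output is registered, so data lags the address by one clock."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.dout = 0                   # registered data output

    def clock(self, addr, we=False, din=0):
        """Model one rising clock edge with the given inputs applied."""
        self.dout = self.mem[addr]      # the read is captured on the edge
        if we:
            self.mem[addr] = din        # assumption: a read during a write returns the old data


def simulate(ram, addrs):
    """Present one address per cycle and record what the logic sees on dout in that same cycle."""
    seen = []
    for addr in addrs:
        seen.append(ram.dout)           # value visible during this cycle
        ram.clock(addr)                 # edge at the end of the cycle captures the address
    return seen


ram = SyncRam([10, 20, 30, 40])
print(simulate(ram, [0, 1, 2, 3]))      # -> [0, 10, 20, 30]
```

The data for address 0 only shows up in the cycle after the address was presented, so any logic that consumes the RAM output has to be written with that extra register stage in mind.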

If you have several identical slow speed interfaces (e.g. UARTs, SPI, I2C, etc.) that could happily run with an effective clock rate of a fraction of your system clock rate, look at C-slow optimization to reduce utilization. There are a few coding tricks that ease translating a single-channel module into a multi-channel, C-slowed module capable of replacing multiple copies of the original.

Retiming can be combined with C-slowing (the two are very synergistic) to enable the original clock rate to be increased, recovering some of the original per-channel performance.

Repipelining can be combined with C-slowing (also synergistic) to hide original design latency, thus recovering some of the per-channel performance without increasing the system clock rate.
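As a rough illustration of C-slowing, the Python sketch below replaces the single state register of a toy design with a chain of C registers and time-multiplexes C channels through one copy of the logic. The 8-bit accumulator in `step`, the class `CSlowed`, and the helper functions are all made up for this example; in real hardware the extra registers would normally be spread through the logic by retiming rather than kept as one chain.

```python
from collections import deque


def step(state, x):
    """Next-state logic of the single-channel design (here: an 8-bit accumulator)."""
    return (state + x) & 0xFF


class CSlowed:
    """One copy of the logic shared by C channels, each running at 1/C of the clock rate."""

    def __init__(self, c):
        self.pipe = deque([0] * c)        # C state registers in series replace the single one

    def clock(self, x):
        state = self.pipe.popleft()       # oldest state belongs to the channel whose turn it is
        self.pipe.append(step(state, x))  # updated state re-enters the chain


def run_independent(inputs):
    """Reference: one private copy of the design per channel."""
    states = [0] * len(inputs)
    for ch, xs in enumerate(inputs):
        for x in xs:
            states[ch] = step(states[ch], x)
    return states


def run_c_slowed(inputs):
    core = CSlowed(len(inputs))
    for round_inputs in zip(*inputs):     # one sample per channel per round
        for x in round_inputs:            # channels take turns on consecutive clocks
            core.clock(x)
    return list(core.pipe)                # after whole rounds the chain is back in channel order


inputs = [[1, 2, 3], [10, 20, 30], [100, 100, 100]]
print(run_independent(inputs), run_c_slowed(inputs))
```

Both runs end with the same per-channel states, which is the point of the transformation: the logic is instantiated once, and only the state storage grows with the number of channels.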

Andy

- alb

Sat, Jan 25, 2014 1:54 PM

- glen herrmannsfeldt

Sat, Jan 25, 2014 4:07 PM

(snip)

I only learned about C-slow a year or two ago, and wasn't sure why it was so different from the pipelining that computer designers did in the 1960's and 1970's.

And yes, I don't know what the C is for.

-- glen

- jonesandy

Mon, Jan 27, 2014 4:37 PM

From what I understand, C-slowing originated as a state-space transform. I don't know what C stands for either.

The earliest reference I have seen for its application to digital circuits is:

C. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," Proceedings of the 3rd Caltech Conference On VLSI, pp. 87-116, March 1983

I do not have access to the paper, but it is cited in many later papers as the initial work in the application of C-slowing to digital circuits. From the title, I would assume that they employed C-slowing to remove all single-clock-cycle feedback paths, which otherwise cannot be retimed.

Andy