CPU vs. FPGA vs. RAM

- Valentin Tihomirov

Sun, Oct 19, 2003 4:10 PM

Massive parallelism is considered the main advantage of FPGAs. Meanwhile, the bottleneck of modern systems is memory performance. How do I benefit in, e.g., image processing from a wide, low-speed FPGA over a high-speed CPU when the image is located in SRAM? Today more and more FPGAs are equipped with embedded RAM. How can an FPGA benefit from concurrent processing if it has to serialize memory access?

- Nicholas C. Weaver

Sun, Oct 19, 2003 4:21 PM

What do you mean by "memory performance"? Latency for sequential access? Latency for parallel accesses? Throughput for a single stream? Throughput for multiple streams sharing memory? Throughput for multiple streams from independent memories?

--
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

- Jake Janovetz

Mon, Oct 20, 2003 3:07 AM

I don't think it matters. He's just saying that FPGAs provide such an abundance of functional units that memory performance (whatever you call it -- the ability of a given system to provide data to the functional units) is the limiting factor. He's got a 10,000 pound gorilla and he's trying to feed it bananas with a teaspoon.

Jake

- Jake Janovetz

Mon, Oct 20, 2003 3:12 AM

Good question, Valentin. I personally think you'll see a lot more use being made of the on-chip embedded RAM as 'cache' memory than previously. Of course, FPGAs are applied to so many problems that this use will not always be applicable, and it has already been applied in many situations. The bottom line is that on-chip memory is going to be faster than off-chip memory. Its usefulness in keeping the FPGA's functional units supplied with data may come from transforming it into a more useful cache structure.

That may also mean that a larger part of an FPGA design will be relegated to the function of cache controller: perhaps using block RAMs as an L2-type cache and distributed local memory as an L1-type cache, and trying to keep the external memory pipe as active as possible in the most efficient way for the functionality implemented.

Jake
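
To make the "on-chip RAM as cache" idea concrete for Valentin's image-processing case, here is a minimal software sketch of a line buffer feeding a 3x3 window. It is an illustration only, not something from the posts above: the function name `box3x3_stream`, the 3x3 box sum, and the two-row buffer layout are all assumptions chosen for the example. The point it shows is that each pixel is fetched from the external memory stream exactly once, while the two row buffers (standing in for block RAM) supply the other eight pixels of every window.

```python
from collections import deque


def box3x3_stream(pixels, width):
    """Yield 3x3 box sums for an image streamed in raster order.

    `pixels` stands in for a serial external-memory (SRAM) stream and is
    consumed exactly once. `line1` and `line2` stand in for on-chip RAM
    holding the previous two image rows (hypothetical layout).
    """
    line1 = [0] * width          # row y-1
    line2 = [0] * width          # row y-2
    col_sums = deque(maxlen=3)   # vertical sums of the last three columns

    for i, p in enumerate(pixels):
        y, x = divmod(i, width)
        if x == 0:
            col_sums.clear()     # new row: window restarts at the left edge

        # Vertical 3-pixel sum at column x, using only buffered rows
        # plus the one pixel just read from the stream.
        col_sums.append(line2[x] + line1[x] + p)

        # Shift the line buffers at this column.
        line2[x] = line1[x]
        line1[x] = p

        # A full window exists once three rows and three columns are seen.
        if y >= 2 and x >= 2:
            yield sum(col_sums)


# 4x4 test image with values 0..15 in raster order
print(list(box3x3_stream(range(16), 4)))  # [45, 54, 81, 90]
```

In hardware the additions for a window would happen in parallel every clock; the Python loop only models the data movement, which is the part that relieves the serialized external memory access.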

- H. Peter Anvin

Mon, Oct 20, 2003 6:50 PM

Followup to: By author: snipped-for-privacy@yahoo.com (Jake Janovetz) In newsgroup: comp.arch.fpga

This is of course why FPGAs have extremely high-speed onboard memory in small chunks that can be independently wired. The quantities are limited, of course, just like the caches on a CPU, but a lot of FPGA designs wouldn't be possible or practical without it.

-hpa

--
 at work,  in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64

- Nicholas C. Weaver

Tue, Oct 21, 2003 2:17 PM

Actually, mixed DRAM/logic (which is what gets you the ~10x greater memory density for the HSRA) has effectively failed. It required too much process fiddling, which nobody wants to do these days.

Also, then you have the joy of the DRAM page access time anyway, so you save on the pin crossings, but the rest of the latency is still there, and still measured in several clock cycles.

As for the streaming access example, it's still an order of magnitude greater than a CPU's, where you only have 128 pins at 533 Mbps/pin. But if it is random access from a single block (e.g. pointer chasing), why not just get an Athlon 64 and write it in software?
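
For reference, the CPU figure quoted above works out as follows. This is a small sketch using only the two numbers from the post (128 pins, 533 Mbps per pin); the helper name `peak_bandwidth_gbytes` is made up for the example, and it computes a peak figure that ignores protocol overhead and DRAM page effects.

```python
def peak_bandwidth_gbytes(pins, mbps_per_pin):
    """Peak bandwidth in GB/s (10^9 bytes/s) for `pins` data pins."""
    mbits_per_s = pins * mbps_per_pin   # aggregate megabits per second
    return mbits_per_s / 8 / 1000       # -> megabytes/s -> gigabytes/s


cpu = peak_bandwidth_gbytes(128, 533)   # figures quoted in the post
print(f"{cpu:.2f} GB/s")                # 8.53 GB/s
```

If "an order of magnitude greater" is read as a statement about the FPGA's streaming bandwidth, that puts the streaming case somewhere around 85 GB/s in aggregate.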


- H. Peter Anvin

Tue, Oct 21, 2003 8:28 PM

Followup to: By author: snipped-for-privacy@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver) In newsgroup: comp.arch.fpga

It is, but you get *HUGE* amounts of data for each access. Existing DRAMs mux away an amazing amount of the data read for each array access.

I suspect that DRAM integration is going to happen sooner or later; however, it's not going to happen just yet.

-hpa
