DRC has announced its newest FPGA that drops into AMD's Socket 940

formatting link

So... I do see a possibility here.

Reply to
Jan Panteltje

What about the number of AES/sec?

Reply to
Paul Rubin

: Jan Panteltje wrote:
: >

formatting link

: 8x200MHz only provides 400MB/sec traffic to the CPU so really this is
: useful for tasks which either totally reside on the FPGA side of the
: board or have really high latency (e.g. PK work).
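For what it's worth, the quoted figure checks out arithmetically; a quick sketch (assuming "8x200MHz" means an 8-bit-wide link clocked at 200 MHz with DDR signaling):

```python
# Raw bandwidth of a narrow HyperTransport link: width (bits) x clock (MHz)
# x 2 transfers per clock (DDR), divided by 8 to convert bits to bytes.
def ht_bandwidth_mb_per_s(width_bits: int, clock_mhz: int) -> float:
    return width_bits * clock_mhz * 2 / 8

print(ht_bandwidth_mb_per_s(8, 200))    # the 8x200MHz link above: 400.0 MB/s
print(ht_bandwidth_mb_per_s(16, 1000))  # a 16-bit 1GHz link: 4000.0 MB/s
```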

Sitting on the HT bus like that offers residence about as close as you can get to a mainstream CPU. Given the new HT3 stuff - faster, and links possible over 1 meter, i.e. directly joining blades - I really like this approach, especially given the memory architecture that goes along with HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point-to-point multiple-interconnect world of the TigerSHARCs and the old TI C40s.

It feels a bit like a resurgence of the old British Transputer, except with gate arrays mixing with CPUs on an equal footing in terms of connectivity.

cds

Reply to
c d saunter

: HT links are not solely designed for speed. Latency is the key. 16
: lanes of PCIe can compete just fine with a 16x16 1GHz HT link in terms
: of bandwidth.

: Oddly enough the best tasks for this are things which don't return back
: to back [e.g. raytrace a scene].

I wouldn't call that odd - a modern CPU hiding behind caches with long pipelines is always going to struggle with low-latency back/forwards/back/forwards shared tasks with an FPGA/ClearSpeed/xxx

- certainly interesting things happen with FPGA silicon and CPU silicon coupled in an SoC or on an FPGA, but the clock rates are far below a dedicated CPU.

On the serial/parallel issue I have a leaning towards parallel, for simplicity when it comes to the FPGA code and for latency, although serial has benefits for physical complexity and routing. Also, it feels like they leapfrog each other every few months in terms of bandwidth! The world is squeezing itself down a thin pipe these days though...

: What this does open the door for though is for mixed architecture
: systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
: controller on to it.

: Then you have x86 and MIPS in the same system.

: That'd be cool.

An awful lot of cool things are on their way...

Reply to
c d saunter

I'd think if you're going to use such an expensive and exotic approach at all, you'd pipeline it to get one AES operation per cycle, maybe even more than one if you're doing something like EAX mode, or CTR mode on a large block in parallel.
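The CTR-mode point is that keystream blocks depend only on (key, nonce, counter) and never on each other, which is why a pipelined core can start a fresh block every cycle. A toy sketch of that independence (a hash stands in for the AES block encryption here - this is NOT a real cipher, and all names are illustrative):

```python
import hashlib

# Each keystream block is a pure function of (key, nonce, counter), so a
# hardware pipeline can accept a new counter value every clock with no
# feedback path between blocks.
def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    return hashlib.sha256(key + nonce + counter.to_bytes(16, "big")).digest()[:16]

def ctr_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(plaintext), 16):
        ks = keystream_block(key, nonce, i // 16)   # independent per block
        chunk = plaintext[i:i + 16]
        out.extend(b ^ k for b, k in zip(chunk, ks))
    return bytes(out)

key, nonce = b"k" * 16, b"n" * 8
msg = b"sixteen byte txt" * 4
ct = ctr_encrypt(key, nonce, msg)
assert ctr_encrypt(key, nonce, ct) == msg  # CTR is its own inverse
```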

Reply to
Paul Rubin

Just one? Why not a couple dozen small purpose-designed RISC cores, running in parallel?

Reply to
DJ Delorie

Um, yes, it does look familiar, doesn't it. If you go back to the origins of HT, when it was called something else at AlphaWorks IIRC, the key people had originally come from Inmos and had worked on the PLLs for the Transputer, and maybe those links too. The fellow is now a Fellow at AMD after they bought them out. In a previous life the same people were at Meiko and did their own routers, used to stitch up T800s and then later several other CPUs, ultimately leading to the Alpha platform after Meiko went belly up.

When I first heard Xilinx was taking an HT license, which seems a long time ago now, I wondered when this would happen.

When I first saw the early marketing for the Hammer with 1, 2, 3, 4 of these HT links and the memory channel too, I could only say out loud: looks and smells like a Transputer to me with 20 years of development. But it isn't really; it doesn't have the process scheduler or any real support for programming concurrently per occam, just links. When I see the product today, though, with a huge price premium on the number of HT links, I am disappointed: one Opteron with 1 link is cheap enough, but add more links and the cost goes way up as it looks more and more like a server platform. The number of links on the Transputer was always an issue back then; 4 is a minimum.

The socket module, though, looks a bit like a SFF TRAM module, but the multi-socket Opteron boards are not really TRAM carriers that can be populated with general-purpose computing modules on a grid. Perhaps that will come back again, but probably with more modest links.

I have been suggesting a Transputer resurgence for some time by building an FPGA Processor Element hooked up with a specialized MMU that shares the available memory bandwidth of RLDRAM amongst many PEs, using latency-hiding multithreading so the PEs don't appear to have any memory wall. By distributing n PE+MMU nodes into the fabric, one can then add algorithm-specific extensions or coprocessors to each and copy the node in systolic fashion over the array. Each PE uses only 1 BRAM, so quite a few PEs would fit. The Transputer is really now defined by all the good stuff that goes into the MMU rather than the PEs. There is a paper on it at wotug.org for anyone interested.
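To illustrate the latency-hiding idea (a hypothetical toy model, not the actual MMU design from the paper): each cycle the PE issues an instruction from some thread that is not waiting on memory, so with enough threads the pipeline stays busy even though every individual load is slow. The latency and load-frequency numbers below are arbitrary assumptions.

```python
# Toy barrel-processor model: one issue slot per cycle, shared by N threads.
MEM_LATENCY = 8   # cycles a thread stalls after issuing a load (assumed)
LOAD_EVERY = 4    # a thread issues a load every 4th instruction (assumed)

def utilization(n_threads: int, cycles: int = 10_000) -> float:
    stall_until = [0] * n_threads   # cycle at which each thread becomes ready
    issued_by = [0] * n_threads     # instructions issued per thread
    busy = 0
    for t in range(cycles):
        ready = [i for i in range(n_threads) if stall_until[i] <= t]
        if not ready:
            continue                # pipeline bubble: every thread stalled
        i = min(ready, key=lambda x: issued_by[x])  # fair pick among ready
        issued_by[i] += 1
        busy += 1
        if issued_by[i] % LOAD_EVERY == 0:
            stall_until[i] = t + MEM_LATENCY        # this thread waits on memory
    return busy / cycles

print(utilization(1))   # single thread: the memory wall dominates
print(utilization(4))   # 4-way MTA: much of the latency is hidden
```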

When you build algorithms in FPGA around arrays of customizable PEs, I think some of the reasons for having an Opteron in the system may become moot: put the CPUs into the FPGA, as many copies as you can fit, since all the real bandwidth is in/out of all the BlockRAMs, not the more limited I/O pins.

I will have to look more into HT3 though.

John Jakson transputer guy

Reply to
JJ

I always thought it would be neat to design a CPU cell in a QFP FPGA, such that all the pins on each side were designed to interface to an adjacent cell, making the PCB routing trivial. The cells along the boundary would be programmed to use the free edges to talk to external peripherals.

I suppose with a BGA you could use the outer rows to talk to adjacent cells, and the inner rows to interface to a RAM chip on the other side of the board.

Reply to
DJ Delorie

Given the FPGA resources needed for 1 PE, 1 BRAM and about 500 LUTs/FFs, and then putting around 10 of them with a shared MMU, which requires unknown resources at this time, one might get a combined resource figure that is still insignificant compared to the size of the largest FPGAs that would likely be placed in these 940 sockets.

Each MMU uses more resources than a few PEs but also would chew up a good portion of the I/O pins, say 120 or so for 1 RLDRAM interface and more for external links. It becomes obvious one is really I/O-limited or content-limited, so an array of much smaller FPGAs makes more sense on a TRAM-carrier type board. Then every FPGA might get 4 MMU memory systems, giving effectively 40 or so PEs running at 300 MHz, or 100 MIPS each. The total of 40 x 100 MIPS still doesn't look so good compared to 1 Opteron, but the system is very different. You end up with 160 or so threads, since each PE is a 4-way MTA; you have to keep every thread busy, and that requires occam- or HDL-like parallel programming, versus possibly only 1 thread on an Opteron. The big payback is that all these threads get to see almost no memory wall, with full random access over their local memory banks, with some additional latency for nearby MMUs and more so for off-FPGA nodes.

You either have a thread wall or you have a memory wall. The thread wall is not really a problem for occam, CSP, Transputer, parallel people, but is a huge barrier to most Opteron customers. The memory wall, though, is a real problem, requiring possibly 1000-clock-cycle memory accesses for everything that misses the cache system, and caches can never be big enough for the sorts of datasets some have in mind, nor can the TLBs have enough associativity. I believe these memory walls are most likely halving the typical throughput of sequential CPUs for even a modest miss rate. That's why I am prone to suggesting getting rid of the Opteron and putting the CPUs right inside the algorithm, with local coprocessors per PE or better still per MMU. One such coprocessor could be an FP unit, which uses the same reasoning as the MMU. If an FP unit can deliver 1 flop per clock shared over 40 threads, each thread gets FP slices with very little latency, on the order of a load or store op.
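The halving claim is easy to sanity-check with a back-of-the-envelope effective-CPI calculation (illustrative numbers only):

```python
# Effective throughput of a sequential CPU, relative to its cache-perfect
# throughput: CPI_eff = CPI_base + misses_per_instruction * miss_penalty.
def relative_throughput(cpi_base: float, misses_per_instr: float,
                        penalty_cycles: int) -> float:
    return cpi_base / (cpi_base + misses_per_instr * penalty_cycles)

# Just one miss per 1000 instructions, a 1000-cycle penalty, base CPI of 1:
print(relative_throughput(1.0, 0.001, 1000))  # 0.5 -> throughput halved
```

So a 0.1% per-instruction miss rate is already enough to halve throughput under these assumptions, which is what makes the miss rate "modest".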

I haven't really worried too much about packaging, BGA vs. edge-connected; I suspect that the medium-size parts are big enough to hold enough PEs and use up the I/O for RLDRAM and some for HT-like links. I would probably put each FPGA and related RLDRAM on its own module, so it would look a little like these DRC modules, or really a SFF modern TRAM. That separates the module design from the motherboard design, and then you can get some volume on these modules.

Don't even ask why I wouldn't use regular SDRAM: about 20x less random throughput, which would effectively limit me to only 1-2 or so PEs per MMU, and that would leave the FPGA almost empty.

John Jakson transputer guy

Reply to
JJ

For any processor with no substantial caches, one might assume every 5th opcode is a load or store; for a nice register-heavy design, maybe every 10th opcode. For a classic SDRAM interface the performance will be very poor. The usual thing to do is to gang up lots of very expensive BRAMs into I- and D-caches, which gives up a lot of the parallel bandwidth they each have when used separately. Even then, each core now uses lots of BRAM, some CPU logic, and an SDRAM controller, and a good chunk of the I/O is gone. That sort of system can be replicated maybe 4 times depending on I/O count, and none of these has any performance to write home about. But one could put additional algorithmic content next to each node.
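A back-of-envelope sketch of the paragraph's assumptions (the SDRAM random-access latency figure here is an assumption for illustration, not a measured number):

```python
# Sustained IPC when a fixed fraction of opcodes are loads/stores and each
# one pays the full uncached memory latency:
# cycles per instruction = 1 issue cycle + amortized memory stall cycles.
def effective_ipc(mem_op_fraction: float, mem_latency_cycles: int) -> float:
    return 1.0 / (1.0 + mem_op_fraction * mem_latency_cycles)

print(effective_ipc(1 / 5, 40))   # every 5th opcode, ~40-cycle SDRAM: 1/9 IPC
print(effective_ipc(1 / 10, 40))  # register-heavy, every 10th opcode: 0.2 IPC
```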

Memory limits, and hence I/O pads, are the crux of the problem. My Transputer design uses 1 BRAM/PE, hence on paper maybe 554 PEs might fit in the biggest FPGA, but that doesn't work. The LUT/BRAM usage takes it down to half that, and then assume the MMUs consume the rest of the fabric in a regular tiling. Even so, the memory traffic of 250-odd PEs can't be funneled through maybe only 4 memory interfaces, even RLDRAM, so the PE count either has to come way down and/or more of the BRAMs have to be used as local caches, which gives up a lot of their bandwidth again.

One way around the I/O limit I have been thinking of is to bring the RLDRAM inside the FPGA. Since we can't do that, instead replicate the RLDRAM logical architecture of n concurrent slower banks, using up all remaining BRAM and aggregating it into a cache that can be shared with multiple PEs at the L1 level. Only when those miss does the L2 RLDRAM come into play, so trading PEs down for BRAM caches allows more Transputer nodes to share the few RLDRAM interfaces.

((n*PE + MMU + BRAM cache) * k + MMU + RLDRAM interface) * 4 or so.

Q: I am curious how many separate memory channels people have actually put onto the largest FPGAs. I suspect at the high end, for independent RLDRAM controllers, it is around 4, due to the specialized use of the clock resources needed to make the DDR interfaces work. I also wonder if those serial-interface DRAMs have come out yet that would allow many more memory channels per FPGA.

John Jakson transputer guy

Reply to
JJ
+---------------
| What this does open the door for though is for mixed architecture
| systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
| controller on to it.
|
| Then you have x86 and MIPS in the same system.
+---------------

But *not* necessarily running ccNUMA with each other!! See my recent post on "comp.lang.lisp" [yeah, they were talking about the prospects for using the same DRC FPGA for an update on the Lisp Machine]:

formatting link

especially the bits about the difference between "non-coherent HT", used for ordinary I/O (PIOs & DMA), and the "coherent HT" used for the inter-Opteron ccNUMA cache-coherency. I *strongly* suspect the DRC FPGA[1] only does non-coherent HT, which, while just fine for a DMA-style crypto co-processor, wouldn't let your FPGA-based MIPS CPU participate in the Opteron cache-coherency protocol.

-Rob

[1] Well, the *chip* could probably do either; I'm actually referring to whatever libraries of HT protocol support that come with it.

----- Rob Warnock

627 26th Avenue San Mateo, CA 94403 (650)572-2607
Reply to
Rob Warnock

How do you prevent the pirates from stealing your private symmetric AES key from the FPGA? _This_ is the hard part, not the decryption process. I can easily implement an over-1-Gbit/s 128-bit AES en/decryptor even on a Cyclone, but it is meaningless, as the key is not (and cannot be) protected.

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

Use a Virtex II or Virtex 4 and it can be.

There are degrees of protection. The protection available in the Virtex II or Virtex 4 isn't absolute, of course, but it would take tremendous resources to extract a key embedded in one. You wouldn't be able to read the key back out electrically due to the FPGA's own encryption system, which is based on triple DES or AES with a key in internal SRAM.

To extract the application symmetric AES key, you'd have to be able to decap the FPGA without cutting power to it or shorting out any internal nodes, then microprobe it. And you'd have to know *where* to probe it; unless you had the original design files, just finding where the application key was stored would be an immense task.

(Note that I'm not talking about finding out where the FPGA bitstream decryption key is stored; that would be relatively easy, since you could use ANY decapped Virtex II/4 part to search for that. The application's decryption key would be somewhere inside the FPGA configuration.)

Reply to
Eric Smith
