Large RAM sizes in embedded systems

I'd be curious to hear from anyone who has worked on embedded systems with relatively large amounts of directly-addressable RAM.

I'm discussing with a colleague/client an application that would operate on a data set of nominally 740GB. The nature of the data and the required processing makes it a "must-have" performance improvement to hold the entire data set in RAM rather than swapping it in from some secondary storage mechanism. It's perfectly acceptable for the machine to take an entire week to cold-boot. It's not acceptable, once booted, for it to wait several seconds to page in data from a hard disk :)

Hence my RAM requirement would be specced at 1TB of error-correcting RAM. The hardware interfaces I would require are gigabit Ethernet, SATA for the boot media, and a means of connecting to an ASIC that does all the real processing work. The interface for the latter is not yet defined, but would quite likely be PCI Express. All this suggests a PC-style architecture as the way to go.

I'm really not finding much (read: anything) in the way of monolithic computer modules that can address 1TB. I've found mention of server clusters that have that much in aggregate, but it's spread across several computers. OS support for such large RAM sizes also appears to be problematic, but I could work around this.

Is anyone else dealing with similar problems? This is strictly a theoretical investigation for me right now - more of a feasibility review than anything else - but it's quite an intriguing project. Maybe the right approach is to build a massively parallel engine with identical modules handling manageable (8GB?) slices of the data set. However, this would be very expensive in terms of power and additional support circuitry.
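As a rough feasibility sketch of the "identical modules, manageable slices" idea - sizes, names and the 8GB slice figure below are illustrative assumptions only, not a design:

/* Hypothetical sketch: map a global byte offset in the ~740GB data set
 * onto one of N identical modules, each holding an 8GB slice.
 * All names and sizes here are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define SLICE_BYTES (8ULL * 1024 * 1024 * 1024)     /* assumed 8GB per module */

struct location {
    unsigned module;        /* which module holds the byte        */
    uint64_t offset;        /* offset within that module's slice  */
};

static struct location locate(uint64_t global_offset)
{
    struct location loc;
    loc.module = (unsigned)(global_offset / SLICE_BYTES);
    loc.offset = global_offset % SLICE_BYTES;
    return loc;
}

int main(void)
{
    uint64_t dataset = 740ULL * 1024 * 1024 * 1024;
    printf("modules needed: %llu\n",
           (unsigned long long)((dataset + SLICE_BYTES - 1) / SLICE_BYTES)); /* 93 */

    struct location loc = locate(123456789012ULL);
    printf("module %u, offset %llu\n", loc.module,
           (unsigned long long)loc.offset);
    return 0;
}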

Reply to
larwe

Is it really several seconds? It sounds like some sort of search mechanism, and a hash table can be O(1) if overflow is properly handled. The disadvantage is that the search is only for exact equality, not nearest match. If this is an accurate observation, you can experiment with my hashlib package, available under GPL or a negotiated license, and let the underlying virtual memory system do the work. The size of the data set is less important than the total count of entries. At any rate, see:
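For illustration only - this is not hashlib's actual interface, just a minimal chained hash table showing why exact-match lookup is O(1) on average and why the OS's virtual memory only needs to touch one bucket per query:

/* Minimal chained hash table, illustrative only.  The key hashes straight
 * to one bucket, so only that bucket's (short) chain is examined; the
 * virtual memory system pages in just those entries. */
#include <string.h>

#define NBUCKETS 1048576                /* power of two for cheap masking */

struct entry {
    struct entry *next;
    char         *key;
    void         *payload;              /* e.g. the offset of a data frame */
};

static struct entry *buckets[NBUCKETS];

static unsigned long hash(const char *s)        /* djb2 string hash */
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h & (NBUCKETS - 1);
}

void *lookup(const char *key)
{
    for (struct entry *e = buckets[hash(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->payload;
    return NULL;                        /* exact match only, no "nearest" */
}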

--
Merry Christmas, Happy Hanukah, Happy New Year
        Joyeux Noel, Bonne Annee.
Reply to
CBFalconer

How about this?

formatting link

Reply to
Arlet

I can find likely candidates for the desired section inexactly, using a hash of fuzzy-logic analysis results(1), but individual data "frames" are quite large. The ASIC can swallow data very quickly indeed (and can be paralleled). The bottleneck would be pulling the data frame off secondary storage.

The ideal scenario would really be to have a second ASIC do the search then hand over the DRAM buses to the "processing" ASIC.

(1) - Imagine I was indexing pictures, which I'm not. My fuzzy parameters could be things like this: Upper left-hand quadrant is predominantly {black, red, blue, green, magenta, cyan, yellow, white}. This quadrant has {0-9%, 10-19%, 20-29%, 30-39%, ..., 90-100%} pixels that are brighter than the overall image average. And so on. My data set has a lot of characteristics that could be quantized this way to yield a pretty good hashable index.
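For what it's worth, a minimal sketch of that quantize-and-hash idea, following the picture analogy (all field names below are purely hypothetical):

/* Hypothetical sketch: each quantized observation (dominant colour bucket,
 * brightness decile, ...) becomes a few bits, packed into one word that
 * can be fed to a hash table of candidate frames. */
#include <stdint.h>

enum colour { BLACK, RED, BLUE, GREEN, MAGENTA, CYAN, YELLOW, WHITE }; /* 3 bits */

struct fuzzy_features {
    enum colour ul_dominant;        /* upper-left quadrant's dominant colour */
    unsigned    ul_bright_decile;   /* 0..9: % of pixels above image average */
    /* ... more quantized characteristics of the real data set ...           */
};

static uint32_t fuzzy_key(const struct fuzzy_features *f)
{
    uint32_t key = 0;
    key |= (uint32_t)f->ul_dominant        & 0x7;       /* 3 bits */
    key |= ((uint32_t)f->ul_bright_decile & 0xF) << 3;  /* 4 bits */
    /* ... shift in the remaining quantized parameters ...        */
    return key;     /* hash this to find candidate frames */
}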

Reply to
larwe

Perhaps the right approach is to _buy_ as many PC-architecture boards as you need to hold that much memory, each with appropriate RTOS support (Linux with RTAI?), and rope them together on a bus?

I don't know if the best way to realize this would be to use your gigabit Ethernet, or to try to get them talking nicely on one CompactPCI bus, or VME, or PC/104, or something stranger.

I do think that if you have the space (and power supplies) this may be the least-engineering-effort way to develop the hardware. It may even be an effective way to prototype a system that you can come back to later, replacing all the SBCs with something stripped down to the bare essentials needed to boot up and host your gargantuan blocks of memory.

--
Tim Wescott
Wescott Design Services
Reply to
Tim Wescott

Let us assume for a moment that you have access to the best chips, layouts and bank accounts.

You can build it with 24 modules of 32G each.

Each module consists of 128 512M x 4 DDR2 SDRAMs, i.e. 15 multiplexed row/column address pins and 3 bank pins per chip. You can buffer and share the address pins, but you need individual bank pins (128 x 3 = 384 pins), so each module requires a roughly 500-pin FPGA.

24 x 16-layer 8" x 11" PCBs
24 x 500-pin FPGAs
3072 x 512M x 4 SDRAMs

Assuming you can get the FPGAs for $50 and the SDRAMs for $5 each, it can be built for approximately $20,000, give or take a few K.
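A quick back-of-envelope check of those figures (all numbers taken from the post above):

/* Check: 512M x 4 DDR2 = 2Gbit = 256MB per chip, 128 chips per module,
 * 24 modules, FPGAs at ~$50 and SDRAMs at ~$5 each. */
#include <stdio.h>

int main(void)
{
    const unsigned chips_per_module = 128;
    const unsigned modules          = 24;
    const unsigned chip_mbytes      = 512 * 4 / 8;       /* 256MB per chip */
    const double   fpga_price = 50.0, sdram_price = 5.0;

    unsigned long long module_gb = (unsigned long long)chips_per_module * chip_mbytes / 1024;
    unsigned long long total_gb  = module_gb * modules;
    unsigned long long chips     = (unsigned long long)chips_per_module * modules;
    double chip_cost = modules * fpga_price + chips * sdram_price;

    /* 32GB per module, 768GB total (covers the 740GB set), 3072 chips, ~$16,560 */
    printf("%lluGB/module, %lluGB total, %llu SDRAMs, ~$%.0f in chips\n",
           module_gb, total_gb, chips, chip_cost);
    return 0;
}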

Reply to
linnix

Have you proven you can't get the performance you need from a hard-drive-based system? I would think that with proper indexing (I hope you don't expect to sort through a terabyte looking for your data; even with high-speed DDR2 memory that is going to take a while!) you could retrieve even a megabyte or two in 50 or 100 ms, probably less. The key here, of course, is to know where the data you need is located. You may need some sort of home-grown file system tweaked to your requirements. Keep the "file allocation table" in local RAM. Just how fast does it need to be?
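A rough model of that suggestion, with the index held in local RAM so each fetch costs roughly one seek plus the transfer time. The seek time and drive throughput below are assumptions typical of drives of this era, not measured figures:

/* Assumed figures: ~9 ms average seek + rotation, ~60 MB/s sustained. */
#include <stdio.h>

int main(void)
{
    const double seek_ms    = 9.0;     /* assumed average seek + rotation  */
    const double drive_mb_s = 60.0;    /* assumed sustained transfer rate  */
    const double frame_mb   = 2.0;     /* "a megabyte or two" per frame    */

    double total_ms = seek_ms + frame_mb / drive_mb_s * 1000.0;
    printf("~%.0f ms per frame from one drive\n", total_ms);   /* ~42 ms */
    return 0;
}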

Dave

Reply to
David Lundquist

A tentative budget to build the machine is US$200k.

I'd need to build them from scratch, unfortunately. The largest size I see as COTS is 4GB.

Part of my original question is to see if people are aware of COTS stuff that I didn't find in my searching.

1TB of RAM is something like $70,000 of 4GB modules :)
Reply to
larwe

At the moment I've lashed together a very very simple and dumb prototype that suggests the idea can be made to work for much smaller data sets. As with many other things, the hard part comes when you try to scale it :)

The lack of "here's the website, here's the catalog number" type answers in this thread is leading me to believe I need to revisit the underlying design assumptions.

Reply to
larwe

HP's Integrity Superdome servers can have 1TB of RAM. Sun's Blade 8000 P supports 640GB (something in the back of my head tells me that Suns could handle 1TB if you swapped CPU modules for RAM; I'm not sure which model it was). Pricing is a different matter.

As for your throughput: 400MB/s is not that many disks in RAID-0 (assuming that you need sequential access).
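A quick sanity check of that remark, assuming roughly 70 MB/s sustained sequential throughput per drive (an assumption, not a measured figure):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double target_mb_s = 400.0;   /* figure from the post            */
    const double drive_mb_s  = 70.0;    /* assumed per-drive sustained rate */
    printf("%.0f drives in RAID-0\n", ceil(target_mb_s / drive_mb_s));  /* 6 */
    return 0;
}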

Regards, Alvin.

Reply to
Alvin Andries

If the largest COTS memory modules are 4 GiB, you would still need 256 of them, taking up considerable board area or total volume, so the wire lengths would be considerable. Unfortunately, the speed of light is not very fast, and in PCB tracks the propagation velocity is 200,000 km/s or even less, depending on the dielectric.

In a truly random-access system, the performance would suffer very badly from the propagation delay (this would basically be a half-duplex environment), but in a block-transfer system, such as cache-line loading, block loading or "DMA"-style transfers, the propagation/line-turnaround delay would only be suffered once per block transferred.

It might be preferable to store any index structures (etc.) into a smaller RAM with shorter propagation delays and use block transfers for getting the actual data from memory physically located at a larger distance from the processing power.
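Worked numbers for that argument, using the 200,000 km/s figure from the post. The 0.5 m one-way trace length and the burst rate are assumptions for a physically large memory array, not measured values:

#include <stdio.h>

int main(void)
{
    const double v_m_per_ns = 0.2;     /* 200,000 km/s = 0.2 m/ns           */
    const double trace_m    = 0.5;     /* assumed one-way trace length      */
    const double one_way_ns = trace_m / v_m_per_ns;         /* 2.5 ns       */
    const double turnaround = 2.0 * one_way_ns;             /* 5 ns         */

    /* A random single-word access pays the turnaround every time; a
     * 64-byte burst at an assumed 3.2 GB/s pays it once per ~20 ns of
     * useful transfer. */
    double burst_ns = 64.0 / 3.2;
    printf("turnaround %.1f ns vs %.1f ns per 64-byte burst\n",
           turnaround, burst_ns);
    return 0;
}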

Paul

Reply to
Paul Keinanen

We can do it for $100K.

You probably mean SIMM/DIMM modules, which don't make much sense for the number of chips you need.

Yes, you would have to build the RAM modules. The 60-ball chips have much smaller footprints than SIMM/DIMM sockets, so you would save a lot of PCB space as well as height. Eight layers should work for the individual chips, with additional layers for cross-routing. Sixteen layers should be sufficient for a 500-to-1000-ball FPGA. You don't need a high gate count for the SDRAM controller, but you would need the high ball count.

Furthermore, I would use a stackable wide bus (200 to 300 pins) between the 24 modules.

About $20,000 in chips, in 3K quantities. Micron makes 512M x 4 parts, and perhaps 1G x 4 soon. I would use deep, narrow chips for easier routing/buffering.

Reply to
linnix

It's all theory. We don't have any use for Crays, nor for higher-speed ARM CPUs to do high-speed video, nor...

There are clever ways to do things, so we don't need to move data around at high speed.

Example: DVD movies to an LCD. The speed will slow down because we won't send 7 megabits per second to the LCD; we will have a look-up ROM in every LCD that creates parts of the image. IBM will probably disagree, because while they were supposed to be studying this practical application of decompression, they were wasting tax dollars on fractals...

It's trivial to compress video, but only if you DON'T study MPEG!! It violates simple common sense!!

So, in the future, computers will slow streams, NOT speed them.

The database at the target need not be programmed with all the look-up data; it can be adaptive, saving lots of programming. We have gigabytes of HDD; my new OS will delay the start of your DVD movies to do short calculations and store some fragments of reference frames, later saving them to its adaptive dictionary. The next DVD movie it plays will make use of that info. That's all I can do in text; you would need to see it work... it would take hundreds of pages to do here...

The way I started was to assume a slow CPU, then argue about where to get faster specialty peripheral chips like video accelerators, then figure out how to replace them with more general-purpose hardware... Aha! That's where I got the idea to move the work from the CPU board to the LCD controller in the LCD. Now you only have to send high-level data, not each and every bit of RGB!!

It's simple...

If you study the Cray-1 stuff, they were paid by the hour, so they wanted to do something on the Cray-1 that took many hours/days. So they searched for some unneeded task and studied how to make it twice as long, then how to program it in twice the time...

But when one is not paid by the hour...

I work for free... so everything you see will be clever... like hackers... they do a clever job...

Reply to
werty

You're going to have to be more specific.

Is there some way to frame up the data access patterns? Like, any natural frame size, any typical pattern between frames?

A hard drive array will have no problem beginning to serve up a purely randomly chosen frame in the 10 ms ballpark, which is a lot faster than your "several seconds".

If your frames are truly huge then each frame would be split up among multiple disks so that you get the sum of all the bandwidths.

740GB data sets are not all that unusual. Treating it as "I gotta have it all in memory at once" is unusual.

Others have already remarked on tools for associative searches. There are specialized RAID arrays with built-in knowledge of a few very specific associative searches; indeed, it is beginning to sound like this is what you want to build :-). In the 1970s these specialized drive arrays were sometimes called "rotating associative array memories". Googling for terms like that is difficult because those terms have since been co-opted for other meanings; I think a more current term might be "content-addressable memory".

If the data set is fixed, and the thing you are searching for is two-dimensional and of known size, then there are some really funky photographic techniques that use optics and holograms instead of computers.

Tim.

Reply to
shoppa

Is this an embedded problem? It sounds like you are trying to apply an embedded solution where a network cluster, or possibly a hybrid solution, would be more appropriate. It's hard to know without understanding the nature of your dataset and some hard throughput numbers.

larwe wrote:

Reply to
machinamentum

... snip ...

... snip ...

Yes. Embedded does not mean small; a better defining word might be 'dedicated'.

Please do not top-post. I fixed this one. Your answer should follow (or be intermixed with) the material you quote, after snipping anything not germane to your reply. See the following links:

Reply to
CBFalconer
