The Embedded Heap

The dead-trivial routines I linked you to (and that I wrote) do exactly this. And they are VERY SHORT. (However, you may want to add a check for a NULL pointer in the free() function for production use -- I provided those quickly as the minimal example of just how easy this stuff is.) The malloc() function "sees" contiguous free blocks as a single block during allocation. It's really quite simple.
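The actual routines aren't reproduced in this thread, but the idea can be sketched. The following is an illustrative first-fit allocator over a static arena, not Jon's actual code; names like `my_malloc` are made up. The point is the forward coalescing in `my_free()`, which is what makes contiguous free blocks look like one block to the next allocation:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative first-fit allocator (NOT the code from the thread).
 * Each block carries a small header; my_free() coalesces forward,
 * so contiguous free blocks are "seen" as a single block. */

#define HEAP_SIZE 1024

typedef struct block {
    size_t size;            /* payload size in bytes       */
    int    free;            /* 1 = free, 0 = allocated     */
    struct block *next;     /* next block in address order */
} block_t;

static alignas(8) uint8_t heap[HEAP_SIZE];
static block_t *head;

void heap_init(void) {
    head = (block_t *)heap;
    head->size = HEAP_SIZE - sizeof(block_t);
    head->free = 1;
    head->next = NULL;
}

void *my_malloc(size_t n) {
    n = (n + 7) & ~(size_t)7;                   /* keep blocks aligned */
    for (block_t *b = head; b; b = b->next) {
        if (b->free && b->size >= n) {
            if (b->size >= n + sizeof(block_t) + 8) {   /* split */
                block_t *rest = (block_t *)((uint8_t *)(b + 1) + n);
                rest->size = b->size - n - sizeof(block_t);
                rest->free = 1;
                rest->next = b->next;
                b->size = n;
                b->next = rest;
            }
            b->free = 0;
            return b + 1;
        }
    }
    return NULL;                                /* out of memory */
}

void my_free(void *p) {
    if (!p) return;               /* the NULL check mentioned above */
    block_t *b = (block_t *)p - 1;
    b->free = 1;
    while (b->next && b->next->free) {          /* coalesce forward */
        b->size += sizeof(block_t) + b->next->size;
        b->next = b->next->next;
    }
}
```

A real version would also coalesce backward (e.g. with a doubly linked list); this one only merges with following blocks, which is enough to show the effect.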

I have a short routine (that I used myself for testing but didn't post) that displays the current state of the heap any time you want to see it. You could use that, for example, to observe and collect empirical data from real world runs. (I'd probably modify it to "write snapshots to memory" for later examination in an embedded system where printf() is problematic.) I think your approach is good -- look at the data. And luckily it is VERY EASY to do, too.

I think I've begun to understand where you are headed. If you are able to show that what is already there (the malloc/free I showed you does what you talked about, already) is good enough, then you don't have to do much at all. Just instrument it, test and verify it, document the results, and move on.

If you really want to move allocated memory so that the largest possible free space is available when you want it, then I think it is easy enough to use two otherwise generic tools and to combine them into a specialized tool perfect for your situation. Very small amount of code required, too.

I'd cobble two generic tools (malloc/free and memory marking) into a customized tool for the purpose. A lot of times, that's what a craftsman has to do. Not everything in every possible variety is ready-made all the time. You know that as well as any. Your case is a constrained one and probably any "good solution" isn't simultaneously also a "broad market solution," broad enough to gain a supporting base anyway. That is, if the already decent malloc/free shows itself unable to meet your needs, of course.

Interesting problem. But I don't think a good solution is at all difficult to come by. Especially since you control the startup code, as well.

Jon

Reply to
Jon Kirwan

The worst-case scenario is that whenever you ask for an N-byte block, your existing allocations are all exactly N-1 bytes apart. IOW, the maximum heap size approaches the total actually allocated plus the number of allocations multiplied by the size of the largest block.

If the number of allocations and the largest allocation are both bounded, then so is the maximum heap size. But it might be a very large bound relative to the amount of memory actually used.
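That bound is simple arithmetic; a sketch (the function name is illustrative, and the figures in the usage note are made up):

```c
#include <assert.h>
#include <stddef.h>

/* The bound described above: each of the n live allocations can
 * strand a free gap just too small for the largest request, so the
 * heap may need roughly total_allocated + n * largest bytes in the
 * worst case. */
size_t worst_case_heap(size_t total_allocated,
                       size_t n_allocations,
                       size_t largest_block) {
    return total_allocated + n_allocations * largest_block;
}
```

For example, 16 live allocations totalling 4 KB with a largest request of 256 bytes already budgets 8 KB -- double the memory actually in use.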

Reply to
Nobody

It is well known that existing malloc implementations may behave quite badly for some sequences of allocations. Average behaviour is much better. If you need an absolute guarantee, then buddy-style systems are close to optimal, but on average they are wasteful (depending on the ratio of maximal allocation to minimal allocation, you will need an amount of memory that is 8-60 times bigger than the maximal amount of allocated memory).

In another post you wrote that you know what allocations will be made. This is a comfortable situation, quite unlike what normally happens. In such a case probably the best solution is to roll your own malloc which has a number of pools for different sizes. From a pool you allocate blocks of fixed (maximal) size. When doing an allocation you search for the pool with block size bigger than or equal to the requested amount. A single simulation run allows you to determine how many blocks to put in each pool. Concerning the block sizes of the pools, the simplest scheme is to use powers of 2, but possibly you may save some memory by choosing different sizes.
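A minimal sketch of this set-of-pools idea (the pool sizes and counts here are arbitrary placeholders; a real simulation run would supply the actual numbers):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Set-of-pools allocator sketch: each pool holds fixed-size slots
 * threaded on a free list.  Allocation searches the handful of pools
 * then pops in O(1); freeing pushes back.  No fragmentation, fully
 * deterministic behaviour. */

#define NPOOLS 3
static const size_t pool_size[NPOOLS]  = { 16, 64, 256 };  /* powers of 2 */
static const size_t pool_count[NPOOLS] = {  8,  4,   2 };  /* from "simulation" */

static alignas(8) uint8_t storage[16*8 + 64*4 + 256*2];
static void *free_head[NPOOLS];          /* head of each pool's free list */

void pools_init(void) {
    uint8_t *p = storage;
    for (int i = 0; i < NPOOLS; i++) {
        free_head[i] = NULL;
        for (size_t j = 0; j < pool_count[i]; j++) {
            *(void **)p = free_head[i];  /* push slot onto free list */
            free_head[i] = p;
            p += pool_size[i];
        }
    }
}

void *pool_alloc(size_t n) {
    for (int i = 0; i < NPOOLS; i++) {   /* first pool big enough */
        if (pool_size[i] >= n && free_head[i]) {
            void *slot = free_head[i];
            free_head[i] = *(void **)slot;
            return slot;
        }
    }
    return NULL;                         /* no pool can satisfy request */
}

void pool_free(void *p, size_t n) {
    for (int i = 0; i < NPOOLS; i++) {
        if (pool_size[i] >= n) {         /* same pool alloc would choose */
            *(void **)p = free_head[i];
            free_head[i] = p;
            return;
        }
    }
}
```

A production version would record which pool a slot came from (e.g. in a one-byte header) rather than re-deriving it from the size, since an allocation can spill into a larger pool when its natural pool is empty.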

With a fixed allocation pattern it is tempting to just test how well it works with a given malloc. But if you need an absolute guarantee this is not enough: with high probability malloc will work OK, so testing may discover problems but does not add much confidence that malloc will work (even without testing, confidence is relatively high). The set-of-pools approach has the advantage that it will "converge" in a single cycle, so you know that it will always work. Of course, if you use a malloc which itself uses a set of pools (this is a popular approach), the same will hold. But there are mallocs which work differently, so you need to check.

--
                              Waldek Hebisch 
hebisch@math.uni.wroc.pl
Reply to
Waldek Hebisch

Since Tim apparently is working in C, the Boehm-Demers-Weiser collector would be more appropriate.

formatting link


You don't *have* to use static allocation, but you do have to prove that dynamic allocation won't fail. It can be proven that a specific heap size will - or will not - support a given pattern of allocation and freeing.

AFAIK, there aren't hard real time GC implementations for C. BDW can be used in soft RT systems if timing permits.

George

Reply to
George Neuner

As long as there is not a leak, you won't see unbounded usage. However, given non-moving blocks [i.e. lacking heap compaction], for any particular heap size there exists some patterns of allocation and freeing which will lead to failure due to heap fragmentation.

A lot depends on how the application uses this algebra library. In the best case, using a dedicated heap, it may be possible to blindly allocate without freeing during use and then simply clear the entire heap when you're finished [ mark-release aka "stack" allocation ].
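That mark-release scheme is about the simplest allocator there is; a sketch, with illustrative names:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Mark-release aka "stack" allocation: a dedicated arena where
 * allocating is a pointer bump and "freeing" is resetting the
 * pointer once the whole computation is finished. */

#define ARENA_SIZE 4096
static alignas(8) uint8_t arena[ARENA_SIZE];
static size_t arena_top;

void *arena_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;        /* keep allocations aligned */
    if (arena_top + n > ARENA_SIZE)
        return NULL;                 /* dedicated heap exhausted */
    void *p = &arena[arena_top];
    arena_top += n;
    return p;
}

void arena_release(void) {           /* clear the entire heap */
    arena_top = 0;
}
```

There is no per-block free at all, which is exactly why fragmentation cannot occur.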

If you're in the position of constantly calling the library to refine answers, you may be screwed unless you can compact the heap.

There's nothing per se wrong with using dynamic allocation ... as long as you can prove that allocation won't be a point of failure. Depending on the application, this may be anywhere from fairly easy to impossibly hard.

George

Reply to
George Neuner

Last time I looked, that GC was _approximate_: it wouldn't necessarily collect all garbage. Did it even do compaction?

A lot of _mallocs_ aren't suitable for hard real time work; determining the least bad location to minimise fragmentation can be problematic.

Reply to
Tom Gardner

Hey. I like that approach. Basically the algorithm is:

  • Start with small set of saved data
  • add a bit of information
  • Do a sh**load of math
  • save that small set of data
  • repeat

As long as I can coerce the library to save the data in static memory I could insert your "flush the heap" right before the repeat. Or, I could just verify that the heap is really empty at that point, every time.
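Assuming the library runs against a dedicated bump arena, the loop above might look like this sketch (every name here is hypothetical, and the inner loop is just a stand-in for the real math):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Skeleton of the cycle: the small result set lives in static
 * memory; the scratch heap is wiped at the top of every iteration,
 * so it is provably empty at each "repeat". */

#define SCRATCH_SIZE 1024
static alignas(8) uint8_t scratch_heap[SCRATCH_SIZE];
static size_t scratch_top;

static void *scratch_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;
    if (scratch_top + n > SCRATCH_SIZE) return NULL;
    void *p = &scratch_heap[scratch_top];
    scratch_top += n;
    return p;
}

static double saved[4];                  /* small set of saved data */

int run_cycles(int cycles) {
    for (int c = 0; c < cycles; c++) {
        scratch_top = 0;                 /* "flush the heap" */
        double *work = scratch_alloc(64 * sizeof(double));
        if (!work) return -1;            /* would be a proven-impossible path */
        for (int i = 0; i < 64; i++)     /* stand-in for the math */
            work[i] = saved[i % 4] + i;
        for (int i = 0; i < 4; i++)      /* save the small result set */
            saved[i] = work[i];
    }
    return 0;
}
```

Because the flush happens unconditionally, heap usage per cycle is identical every time, so one instrumented run really does tell you everything.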

--
Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

Tim didn't explicitly say he was using C, or at least that it was required. Maybe he can consider alternatives.

BDW isn't suitable for critical systems because it isn't guaranteed to always distinguish data from pointers. An adversarial data pattern can make it run out of memory.

Reply to
Paul Rubin

BDW is conservative and non-moving because C is an uncooperative language that doesn't identify most pointers. It is, however, both incremental [with VMM support] which minimizes GC pauses, and thread aware so that a well designed multithread application need not be stopped for collections [at least on a multiprocessor].

Conservative collectors *can* potentially mistake random pointer aligned values for actual pointers and erroneously keep a heap block that really should be freed. However, the probability of this occurring is very low: the collector knows the actual set of allocated blocks and so can ignore any phantom "pointers" that don't point into an actual allocated block.
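That filtering step can be sketched like this (illustrative only, not BDW's actual code; the block table and addresses are made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A conservative collector treats a scanned word as a pointer only
 * if it lands inside a block the allocator actually handed out;
 * everything else is dismissed as a phantom "pointer". */

typedef struct { uintptr_t start; size_t len; } alloc_block;

static const alloc_block blocks[] = {   /* known allocated blocks */
    { 0x10000, 256 },
    { 0x20000, 1024 },
};

int is_valid_pointer(uintptr_t v) {
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++)
        if (v >= blocks[i].start && v < blocks[i].start + blocks[i].len)
            return 1;                   /* points into a live block */
    return 0;                           /* phantom - ignore it */
}
```

Real collectors do this lookup with a fast page/size-class table rather than a linear scan, but the principle is the same: a random word must hit an actual allocated block to cause retention.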

But yes, it can happen that some bogus value is mistaken for a pointer into a large data structure and that structure won't be freed. However, it is also likely that phantom pointers are transient - on the stack or in a heap block that itself will be freed, and so the mistake will self-correct at some future collection.

"Precise" collectors don't make these mistakes, but they require cooperation from the compiler. C - as described by the standard - doesn't cooperate and so cannot use a precise collector.

The majority of dynamic memory managers aren't suitable. The term "malloc" is kind of C specific so I prefer to avoid it when discussing dynamic allocation in general.

Similarly very few GC systems are suitable for HRT - there are only a few that I'm aware of that can guarantee pause times and application response [given adequate resources].

George

Reply to
George Neuner

That's all as I remembered it.

However, while the average probability of having a phantom pointer is low, I'm equally sure that in some pernicious circumstances it is nearly 100%.

How one ensures pernicious circumstances are avoided in an embedded system is left as an exercise for the student :)

Again, I agree. IMNSHO some GCs are sufficient for soft real-time systems.

Reply to
Tom Gardner

Can you cite one or more please?

The last time (years ago) I looked into GC for a real-time system, none of those I could find would give any sort of guarantee on performance. I'm still of the opinion that malloc/free and garbage collection are a bad idea in most real-time applications.

But I'm also very interested in what a predictable garbage collector might look like.

Thanks.

- Bill

Reply to
Bill Leary


Depends on the characteristics of your system and what guarantees you require. If exceeding a deadline is unacceptable, then malloc and GC are "undesirable".

If, OTOH, you want good mean and 95th percentile performance and can tolerate infrequent extended pauses, then GC is eminently practical.

Good examples can be found by looking at where Java is used in high-frequency trading, where milliseconds are an age. They avoid creating garbage as much as possible, and tune the devil out of the GC. (That mob are so concerned about latency that they spent $600m to put their own fibre-optic cable across the Atlantic, and put some of their algorithms into FPGAs.)

Other examples are in high-availability telecoms applications such as PAYG mobile phone charging.


I'd like to see what a predictable L1/L2 memory cache would look like :( Even with a 486 and its tiny cache and simple benchmarks, people measured peak execution times that were 10x the mean. I rather doubt the situation has improved.

Reply to
Tom Gardner

There is a substantial literature on the subject:

formatting link

The short version is you can get guaranteed bounds on gc pauses, in exchange for somewhat more total overhead than traditional schemes. Basically you run the GC operations sort of concurrently with the application program, so every time the program allocates any memory, the GC also cleans up the heap a little bit more. That allows bounding the maximum latency of any single allocation. This is in contrast with typical semispace GC where allocation is a simple pointer increment, and then eventually the program has to pause while the GC scans the live heap.
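A toy model of that invariant (not a real collector; `K` and the work accounting are purely illustrative): every allocation adds GC work to a backlog and then performs at most K units of it, so the GC-induced pause on any single allocation is bounded by K regardless of heap size.

```c
#include <assert.h>

/* Toy model of incremental, allocation-coupled GC.  'gc_backlog'
 * stands in for outstanding scanning/sweeping work; each allocation
 * retires at most K units of it, bounding the per-allocation pause. */

#define K 4
static long gc_backlog;        /* outstanding GC work units         */
static long max_pause;         /* worst GC work done in one alloc   */

void toy_alloc(long work_created) {
    gc_backlog += work_created;         /* allocating creates work  */
    long step = gc_backlog < K ? gc_backlog : K;
    gc_backlog -= step;                 /* do a little GC each time */
    if (step > max_pause) max_pause = step;
}
```

The scheme only works if the per-allocation quantum K keeps pace with the garbage being created; proving that it does, for a given allocation rate, is exactly the scheduling analysis the literature formalizes.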

Perry Cheng's PhD thesis is pretty readable and informative, iirc:

formatting link

There are a few other interesting-looking links on that page that I haven't examined.

Reply to
Paul Rubin

Sure. I agree with all of that. But...

Ah, sorry about that. I got the idea we were talking about GC used with C, not as one built into a language.

Yes. I worked on an application like that written in Eiffel, which also has built-in GC.

Not to my knowledge.

- Bill

Reply to
Bill Leary

Thanks for the memory jog. Yes, I've seen that approach, and used it. I'd forgotten about it when I asked the question. Probably because the application where we used it was pretty loose on its real-time aspects.

Right. That pause was what caused us to tune the GC to do collection more often than just in "out of memory, have to collect" situations.

"A parallel, real-time garbage collector" This sounds awfully familiar. I'm wondering if it wasn't one of the documents we used for ideas trying to get around the many seconds long pauses we were experiencing with that Eiffel system.

Indeed. Thank you.

- Bill

Reply to
Bill Leary

Not disagreeing, but explaining why it's unlikely.

You can always craft a pathological example involving phantom pointers, but in a real application, it's extremely unlikely - particularly with a sophisticated GC that tracks allocated blocks and where you can deliberately allocate untraced objects. BDW is quite sophisticated in these regards.

It's not terribly hard to get a random value that points into the heap, but getting one that points to a known allocated block is somewhat harder. The heap in most systems sits in the middle of the memory map, above the code and below the stack. Integer values tend to be too small and [partial] FP values tend to be too big. Where you are most likely to have problems is with text and with integer signal and image data - but such data can be deliberately placed in objects that the GC won't trace for pointers.

Also recall that a GC'd heap requires some extra room wrt a manual heap because freeing of blocks is not [necessarily] synchronous with the application ceasing to use them. Typically the headroom is on the order of 25%, so the odds that a random value would point to a real allocated block are somewhat reduced to begin with.

The most common problem with GC is accidental retention of a real pointer that references a large data structure. This can be avoided by taking care in programming. Second most common problem is forgetting that GC only cleans up memory and does not handle other resources.

Careful programming. "Embedded" is a continuum that contains ranges where use of GC is perfectly acceptable. As it exists now, it simply isn't acceptable for all uses. Sub-microsecond HRT and safety-critical systems are hot topics for current GC research.

I am certain that you know this, but I'm going to say it anyway:

HRT doesn't mean "really, really fast" - the distinguishing character of an HRT system is deterministic response, not the time scale at which it operates. I used to work on hard systems that had 600ms cycles.

There are several GC implementations, both moving and non-moving, which have deterministic pause and guaranteed scheduling. These are suitable for HRT given an appropriate application.

George

Reply to
George Neuner

I think a lot of the time you have to assume adversarial input, i.e. if it's possible to craft a pathological example, then someone WILL. Safety critical systems are also supposed to use formal verification, which means the safety properties have to be proved for ALL inputs, including pathological ones. So, for example, in critical HRT systems it's not even ok to use hash tables, much less something like imprecise GC.

Reply to
Paul Rubin


I *absolutely* agree that imprecise, conservative GC has no place in safety critical systems. However, as I said previously, "'embedded' is a continuum". I do believe that GC has a place in the embedded world, and eventually, I think precise, HRT GC will be used in safety critical systems ... but those won't be programmed in C.

George

Reply to
George Neuner

And often constructing pathological inputs is a lot easier than people assume.

The most interesting part of Musser's original Introsort paper is his surprisingly straightforward system for building pathological Quicksort inputs (even with the usual refinements, like median-of-three).

Reply to
Robert Wessel


What Paul is describing is an old serial co-routine style GC model. It is still viable and widely used (often with VMM assist), but it dates from the 1970's and is not state of the art in RTGC.

Bacon's and Cheng's Metronome is a state of the art system, though by no means the only one worthy of consideration.

You should also look at:
- Baker's parallel Treadmills,
- Blelloch and Cheng's replicating collector,
- Henriksson's Slack collector,
- McCloskey's Staccato collector.

Sorry I don't have URLs handy - I have a lot of actual paper in my library - but they shouldn't be too difficult to find.

The most comprehensive technical reference is the bible (part II):

- Jones, Hosking & Moss, "The Garbage Collection Handbook", CRC Press, 2012

formatting link

The bible (part I) also is worth having for foundation material. The second book is self-contained but foundation is covered more tersely than in the first book.

- Jones & Lins, "Garbage Collection: Algorithms for Automatic Dynamic Memory Management", Wiley 1996

formatting link

When reading GC literature you need to be aware of GC-speak. Older papers in particular tended to abuse terminology that is understood differently in other computing circles:

You need to understand the tri-color abstraction because nearly all papers refer to it but many don't bother to define it.

In older papers "real time" almost invariably means SRT - defined approximately as "imperceptible to a user operating a GUI". Determinism as in HRT was not really a consideration until the late 90's. You still need to be careful to distinguish whether any particular paper is talking about SRT or HRT.

Similarly, "concurrent" usually did not mean "parallel", but rather referred to serial co-routine style interleaved incremental collection. "Concurrent" includes collectors based on VMM hardware which use protection faults to switch between the mutator and collector.

However "parallel" generally means simultaneous. Parallel collectors run in one or more separate threads, synchronizing with mutator thread(s) as necessary but otherwise not interfering with mutator operations.

The main thing to understand is that real-world GC systems typically are hybrids which may use different algorithms to manage different heaps or different types of objects, and which may switch algorithms based on runtime heuristics or tuning parameters. No real GC system is simple, and all have more than one trick.

Evaluating a particular implementation requires both a thorough understanding of the basics and good intuition about how the various techniques can be extended and combined. Good familiarity with the intended hardware (and OS services, if any) is a must [preaching to the choir in this group 8-)].

George

Reply to
George Neuner
