The Embedded Heap

I know this is a dangerous question to ask, because people tend to run on emotion with this issue.

I ask, because _I_ tend to run on emotion ("hey! That guy suggested using "new"! Where's the torches and pitchforks?!!"), but I'm contemplating an application where a traditional heap -- if I understand its working correctly -- may actually work out, even in a 24-7 application.

I know I've worked on applications where heap management was an issue -- my first job out of college was at a company whose product ran under DOS and which had to reboot the machine every night at 1:00AM, in no small part because the heap became hopelessly fragmented and the only way to fix it was to bail out of the application and resume. (It wasn't a leak. We had an instrumented heap; we could see the free space leveling off while the largest available block just got smaller and smaller until finally -- poof! -- things didn't work anymore.)

So I'm looking at an application where allowing use of a heap would let me use some canned code (for linear algebra -- it's mathematically intensive stuff I'm doing). The heap usage is going to be a set of stuff with disparate sizes, but each element in the set is always going to be the same size from iteration to iteration. Some of the matrices are going to be around forever, but most are going to come and go with each iteration of the loop.

My feeling is that if the heap manager is intelligent enough it should be able to fit all the short-lived, same-size stuff back into spaces from which it came, even if the long-lived stuff is kinda in the middle. I can see execution time being an issue, but if the structure of the heap settles out after a while then that time should be bounded to some maximum.

So -- does this make sense? Does anyone have any suggested reading for me to do, to understand this?

My alternative is to hand-allocate storage for a bunch of intermediate steps, and probably have my code doing backflips to make it all work. I can do this, but I'd at least like to think that I can just copy it out of the theory books and have it work.

--
My liberal friends think I'm a conservative kook. 
My conservative friends think I'm a liberal kook. 
Why am I not happy that they have found common ground? 

Tim Wescott, Communications, Control, Circuits & Software 
http://www.wescottdesign.com
Reply to
Tim Wescott

Boilerplate heap management is usually quite ignorant of specific embedded application requirements. It just does something mostly easy, most of the time. But as I'm sure you will be told, you can write (buy, or otherwise select from free stuff) heap management that does more of what you want.

In your above case, compaction might have been better. But if automated (background), then you would get "time hits" which might kill your application for other reasons. If manual, you'd have to work out when you could afford the compaction timing... itself sometimes a nasty enough problem.

The "forever" ones should NOT come from heap. That's a given. So you don't have to worry about those.

The fact that the element size is always the same only means that the granularity of your allocations is set in stone. But it doesn't change the fact that you will be allocating varying chunk sizes, if I read you correctly.

You don't provide enough information, in my opinion, to be able to address your points well. It's already a given that you don't need to use the heap for "long-lived" stuff if by that you mean the "going to be around forever" stuff. You allocate those at compile time and are done with it. They never cause you any trouble. It's the short-lived (meaning less than the program operating lifetime) stuff that matters.

One technique to use, especially if you already know that you will be re-allocating things of the same size, is to just set them up statically (meaning compile-time allocated) and to do allocation manually. Array a[] can be used for purpose-1, then reused for purpose-2 at a later time, for example. It's not hard to do that, in many cases.

Keep in mind that programming used to be done "back in the day" when heaps didn't exist except in university laboratories. Many is the time I've had to write code using very limited memory resources and a language that was allocated entirely at compile time. It's quite doable. Ugly at times. But in the case you seem to be talking about, not that bad.

People did get complex things done in real-world applications using languages that didn't support heap allocation at all, you know? Heap is merely a convenience people have grown used to, not a required feature to get real work done. (And I have to tell you that we are no longer in anything like the kind of memory-limited situations I remember dealing with. It's high cotton today.)
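To make that concrete, a minimal sketch of what I mean (the union and the phase functions are purely illustrative, not from your application):

/* One compile-time-allocated scratch area, reused for different
   purposes in different (non-overlapping) phases of the loop. */
static union {
    double gain[32][32];     /* phase 1: a 32x32 working matrix */
    double residuals[1024];  /* phase 2: a residual vector      */
} scratch;

void phase1(void)
{
    scratch.gain[0][0] = 1.0;     /* ...use as a matrix here...   */
}

void phase2(void)
{
    scratch.residuals[0] = 0.0;   /* ...reuse the same bytes here */
}

So long as the two uses never overlap in time, the storage cost is the larger of the two, not the sum.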

Another tack is to use memory marking techniques. I use this with operating systems which have to undergo a "soft reboot" that is guaranteed to reset subsystems. The O/S, of course, can reset the heap space it manages in such a situation. But each subsystem may have its own static (compile-time) variables. These variables are initialized at hard boot time, of course, because the C language (for example) guarantees this. But a soft reboot does NOT make those guarantees. So I force the subsystems to rely upon a memory marking module that is managed by the O/S. If the memory mark is lost (the O/S performed a soft boot), then each subsystem will reinit itself as well.
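A bare-bones sketch of the shape of it (the mark value, slot number, and names are all illustration; the mark array is the part the O/S owns and clears on any boot):

#include <stdint.h>

#define MEM_MARK_VALID 0xA5C3A5C3u

/* Owned by the O/S; cleared to zero on every boot, hard or soft. */
extern uint32_t os_mem_mark[];
#define MY_SLOT 3                 /* this subsystem's slot (illustrative) */

static int subsys_state;          /* static: only a HARD boot re-zeroes it */

void subsys_enter(void)
{
    if (os_mem_mark[MY_SLOT] != MEM_MARK_VALID) {
        subsys_state = 0;         /* mark was lost: a soft boot happened,
                                     so re-init our own statics, too      */
        os_mem_mark[MY_SLOT] = MEM_MARK_VALID;
    }
    /* ...normal operation... */
}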

Anyway, I kind of understand where you are coming from. But without lots more details it is hard to provide a "best answer" for you.

Sounds like you want to keep using things the way you are used to, while getting advantages from libraries tailored to your application. The way you do that is to think, then design, then write and test. Just like in hardware, where you think, then design, then build and test.

Good results come from good thinking and design. Less so from grabbing up boilerplate answers others developed without any knowledge of your application.

Jon

Reply to
Jon Kirwan

What I _want_ to do is to get a good solid job done with the least expenditure of billable hours, to provide my customer with the best ratio of value to dollars.

Normally, a good way to do that is to not reinvent the wheel at every step of the way -- but in this case, the available "wheels" were not meant for the environment.

Writing a matrix library from scratch would be fun, and I'd get paid many bux for it, but I don't feel that it serves my customer well unless it takes less time than making the existing candidates play nicely in the available framework.

Hence the hope that someone can point to an article that does a good job of delineating all of the issues with heap management -- then I can make the decision myself, without having to cough up a bunch of possibly competition-sensitive information.

--

Tim Wescott, Communications, Control, Circuits & Software 
http://www.wescottdesign.com
Reply to
Tim Wescott

Always a good idea.

I never suggested you do that. But I probably didn't read your wording closely enough. I suppose you are using a standard library for matrix work that depends upon heap. If so, I apologize for my failure to understand.

Understood. What have you already examined in doing searches so far? And why have you rejected what you saw so far?

Jon

Reply to
Jon Kirwan

Yes. Actually I've looked at two so far, and both of them seem to have some very clever memory-management features which, while cool in a desktop environment, will make it hard to predict exactly what they're doing.

The fundamental problem boils down to the fact that if you're dealing with matrices, then

A = B * C;

almost guarantees that A is going to change size; at this point nearly any sensible numerical package is going to go allocate something from somewhere.
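Schematically, somewhere inside any such package there has to be something like this (the struct and names are my illustration, not any particular library's API):

#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t  rows, cols;
    double *data;                 /* rows*cols elements, row-major */
} matrix_t;

static matrix_t *matrix_new(size_t r, size_t c)
{
    matrix_t *m = malloc(sizeof *m);           /* heap hit #1 */
    if (!m) return NULL;
    m->rows = r;
    m->cols = c;
    m->data = calloc(r * c, sizeof *m->data);  /* heap hit #2 */
    if (!m->data) { free(m); return NULL; }
    return m;
}

/* Assumes B->cols == C->rows. */
matrix_t *matrix_mul(const matrix_t *B, const matrix_t *C)
{
    matrix_t *A = matrix_new(B->rows, C->cols); /* A's size depends on B
                                                   and C, so it has to be
                                                   allocated fresh        */
    if (!A) return NULL;
    for (size_t i = 0; i < B->rows; i++)
        for (size_t j = 0; j < C->cols; j++)
            for (size_t k = 0; k < B->cols; k++)
                A->data[i * A->cols + j] +=
                    B->data[i * B->cols + k] * C->data[k * C->cols + j];
    return A;
}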

I've already chosen a processor that allows me to be somewhat profligate with memory resources, specifically to hold down on coding time. I just need to know enough detail to know if I can hold them down enough, without having to stand on my head vis-a-vis the matrix math.

When it's a complex and esoteric subject like this I often start my search by publicly whining on USENET. Given that no one's jumping in with an article that they like, I guess a search (and the subsequent plowing through dreck) is next on the list.

Sigh.

--

Tim Wescott, Communications, Control, Circuits & Software 
http://www.wescottdesign.com
Reply to
Tim Wescott

Okay. I guess this still assumes a lot about your matrix package. Perhaps they have documentation on "optimization" or on "fragmentation" where they discuss ways you can help them minimize problems?

And as I mentioned in my first response to you, there are malloc/free packages which support compaction. Of those, a subset will allow you to turn off automatic compaction. And a further subset of those would support an explicit call for compaction, so that it happens when you can afford the time.

Normally, for problems similar to yours, a developer would use memory pools and assign allocators for them. But this means that you have control over the libraries that would then use these. You don't and, as I gather, they just use the one and only "malloc/free" library you link into them.

I can think of some messy ways to add a layer that they would call, but which would use different pools... but then you said you really don't want to get into that mess. And I get that.

I think I'd be looking for manual compaction and not worry too much about the rest. You can then "salt" your own code, here and there according to what you learn as time proceeds, in order to keep the fragmentation down to a tolerable level.

Still sounds like what I just mentioned. Look for something that supports manual (not automatic) compaction. There are lots of methods for allocation, freeing, and compaction, with various tradeoffs between size and compute time. I don't think it matters too much in your case which way you go there. Main thing is that it fits your existing libraries and that it isn't otherwise ridiculous.

Okay. I see. Well, I honestly figured you'd already done some diligence before posting. (That usually saves the time of others and makes for a better post with good targeting to help others answer your purpose more accurately. But I know I don't need to tell you that.)

I'm at odds for what to suggest. The simplest malloc/free code looks about like this:

[link]

I would probably use that and add a manual garbage collection function to it. One more subroutine to add, is all. And that isn't hard or complicated. That assumes, of course, that you can accept the idea of manual garbage collection/compaction and salt your code appropriately.
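To make "one more subroutine" concrete, here is roughly what such a pass looks like against an address-ordered free list (the header layout and names are my own sketch, not taken from the linked code):

#include <stddef.h>
#include <stdint.h>

typedef struct block {
    size_t        size;      /* payload bytes following the header    */
    struct block *next;      /* next free block, ascending by address */
} block_t;

static block_t *free_list;   /* head of the address-ordered free list */

/* Manual "garbage collection": fold physically adjacent free blocks
   into single larger ones.  Call it whenever you can afford the time. */
void heap_coalesce(void)
{
    block_t *b = free_list;
    while (b && b->next) {
        uint8_t *end = (uint8_t *)(b + 1) + b->size;
        if (end == (uint8_t *)b->next) {  /* adjacent: merge them    */
            b->size += sizeof(block_t) + b->next->size;
            b->next  = b->next->next;
        } else {
            b = b->next;                  /* gap in between: move on */
        }
    }
}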

Jon

Reply to
Jon Kirwan

I forgot to mention:

In any embedded system using a C compiler, there will be a "crt0" component (called that because the code is often found in crt0.asm or something similar) that initializes the library system and compile-time statics before calling main(). If you supply your own malloc() and free(), they will usually replace the library versions when linking (but not always, if the compiler toolset tries to be too smart). The catch is that library initialization code may presume the library versions of malloc/free are ready to go BEFORE main() can start up, and conflict with your replacements.

If your own version captures all references to malloc/free (as it should), but your own malloc/free system can't be initialized until you reach main(), then those libraries may go belly-up on you: the code won't even start right and may blow up in your face.
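One defensive trick is to make the replacement allocator initialize itself lazily, on first call, so that calls arriving before main() still work. A sketch, assuming single-threaded startup (the bump allocator just stands in for your real free-list code):

#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

static uint8_t heap[32768];
static size_t  heap_top;

static void my_heap_init(void)  /* placeholder for your real setup */
{
    heap_top = 0;
}

void *malloc(size_t n)          /* replaces the library version at link  */
{
    static bool ready;          /* zeroed by crt0 before ANY caller runs */
    if (!ready) {               /* first call may arrive before main()   */
        my_heap_init();
        ready = true;
    }
    n = (n + 7u) & ~(size_t)7u; /* keep 8-byte alignment */
    if (heap_top + n > sizeof heap) return NULL;
    void *p = &heap[heap_top];
    heap_top += n;
    return p;
}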

Just a note. I sometimes need to modify the crt0.asm code.

Jon

Reply to
Jon Kirwan

Well, fortunately(?) it's on a system where I have to be responsible for the startup code. So I can make damned sure that I set up any custom malloc/free code at that point.

--
Tim Wescott 
www.wescottdesign.com
Reply to
Tim Wescott

No way of finding a spot in the code where it is safe to silently do a reset? :P

-Lasse

Reply to
langwadt

Tim, you're making it kind of hard, because your question and your requirements are so vague. In general you shouldn't have to deal with the intricacies of this kind of stuff. You mention malloc, so I guess you're using C. For your specific issue of whether free will coalesce adjacent blocks after you've freed them separately: yes, any reasonable implementation should do that.

There are a lot of different malloc implementations expressing choices driven in part by how much RAM has to be managed, but the basic idea of all of them is they are supposed to abstract the problem away so you DON'T have to worry about the details. So if you're going to use malloc, just use it. Don't worry too much about how it works unless you're sure you're going to have performance problems.

The main hazard of malloc/free is bugs coming from buffer overruns, double-frees, memory leaks from forgetting to free stuff during error recovery, and so on. This is something humans shouldn't have to deal with. The solution if you have enough memory is automatic management, aka garbage collection. (I'll include reference counting schemes like C++'s shared_ptr in this terminology, though others wouldn't include it).

I guess it may be relevant to your linear algebra stuff that a reference like x[i] might be faster if x is at a fixed location (so it's just an indexed lookup with an immediate offset) than if x is a variable (on the x86 there's an LEA instruction that combines 2 registers and an offset, but whatever CPU you're using might not have one). But if you're that starved for cycles, that affects all kinds of things.

I'd say if you have enough cpu and ram, don't mind occasional pauses, and want to minimize development and debugging time, use GC if you can, preferably in a higher level language (consider Python with the Numpy or Scipy libraries for the linear algebra, for example). If you have hard realtime requirements and/or your thing is safety critical, you have to use static allocation (e.g. MISRA C guidelines, Ada Ravenscar profile, etc). Manual malloc/free is something to avoid if you can help it, just because of the bookkeeping headache.

You should look at the docs of Anton Ertl's Forth GC implementation for a good explanation of the issues of GC vs manual allocation. It's good reading even if you're not using Forth:

[link]

Reply to
Paul Rubin

Does the library record allocations internally? Or is all allocated memory eventually handed back to the caller to maintain or free as it sees fit?

If necessary, can you modify this library in any way? Or are you limited to whatever you can put above and/or below it?

Even if it has "internal" allocations and you can't modify, it may be possible to use a context-aware allocator which considers the call chain when making allocation decisions (i.e. stuff that you get to free or relocate goes in one pool, internal allocations in another).
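A crude sketch of that idea, with the "context" set explicitly by the caller rather than inferred from the call chain (everything here is hypothetical; a real version would also tag each block with its pool so a free routine could route correctly):

#include <stddef.h>
#include <stdint.h>

typedef struct {                 /* each pool is a bump arena, for brevity */
    uint8_t *base;
    size_t   size, used;
} pool_t;

static uint8_t mem_a[16384], mem_b[16384];
pool_t internal_pool = { mem_a, sizeof mem_a, 0 }; /* library-internal   */
pool_t caller_pool   = { mem_b, sizeof mem_b, 0 }; /* handed back to you */

static pool_t *current = &caller_pool;

void set_alloc_context(pool_t *p)   /* flip before/after library calls */
{
    current = p;
}

void *context_malloc(size_t n)      /* the malloc the library links to */
{
    n = (n + 7u) & ~(size_t)7u;
    if (current->used + n > current->size) return NULL;
    void *p = current->base + current->used;
    current->used += n;
    return p;
}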

In the worst case, you may be able to forcibly reinitialise it, i.e. restore all of its data segments to their initial state then call any initialisation code, allowing you to reset the heap.

Reply to
Nobody

So the real problem you face is how to combine the free space so that there is only one free block collecting up all of the smaller free blocks. But to do so, you may have to move already-allocated blocks so that they are contiguous, leaving all the free space in one remaining contiguous section. And if you move allocated blocks around, how does the library code you are using update the pointers it has kept around (if any)?

(Hopefully, the matrix library doesn't keep pointers in static memory.)

SIDE NOTE

The code I linked to you will automatically collect adjacent free blocks during the malloc() call, so there is no need at all for a separate garbage collection step to fold adjacent free blocks together. For many applications, the fact that malloc() automatically "sees" adjacent free blocks as a single, larger free block during its search means that in many cases there is nothing more needed. And you can easily see just how short and simple the code is.

Fragmentation still happens, but it isn't nearly as harmful as it might otherwise be. It's also the only way to handle things without substituting something else for raw pointers. If you were to actually MOVE an allocated block around, then you would also need to worry about code that isn't under the control of malloc/free holding pointers to memory blocks that are no longer valid. So you can't just "garbage collect" and move allocated blocks around without modifying things in a more substantial way.

Since most implementations of malloc/free are at least as good as the one I showed you, doing better gets squarely back to the problem outlined already -- maintaining valid pointers when the blocks have been moved. (It's easy enough to move them; what is hard is making sure that all owners of pointers to them are also updated.)

END SIDE NOTE

So long as the matrix library doesn't retain memory pointers in static lifetime variables (something you may, or may not, know), this won't be a problem for the matrix library. It won't be calling for any garbage collection so the blocks can't move while it is working. The problem will then rest only on the code you write. And that means you can make things work out by writing appropriate code.

If you are coding in C and your library requires access to malloc/free/realloc/calloc, then write yourself simple code to do the work so that you can manage the garbage collection step and update your pointers when that takes place. I'd likely use a modified "memory marking" structure for pure speed, small size, and ease of use. It's one extra data structure (probably small) and a couple of extra very small functions. It's robust, too. I don't know an implementation on the web, though. But it would take only a few hours to write it up and a few more to test and validate it. I can refer you to documentation on memory marking and how I might modify the concept slightly to deal well with this issue, if you decide to pursue it.
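For flavor, here's the unmodified skeleton of memory marking; the modifications I have in mind (for updating pointers after a move) would sit on top of this:

#include <stddef.h>
#include <stdint.h>

static uint8_t arena[65536];
static size_t  top;                  /* first free byte in the arena */

typedef size_t mem_mark_t;

mem_mark_t mem_mark(void)            /* remember the high-water mark */
{
    return top;
}

void mem_release(mem_mark_t m)       /* free everything allocated since */
{
    top = m;
}

void *mem_alloc(size_t n)
{
    n = (n + 7u) & ~(size_t)7u;      /* keep 8-byte alignment */
    if (top + n > sizeof arena) return NULL;
    void *p = &arena[top];
    top += n;
    return p;
}

Take a mark at the top of your loop, let the iteration churn, and release back to the mark at the bottom; anything long-lived gets allocated before the first mark ever happens.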

Otherwise, if you still insist on going beyond the very simple automatic coalescing of adjacent free blocks that the malloc() code I already showed you manages, then you will need to figure out how you are going to deal with moving old, allocated blocks around in memory. You can find garbage-collecting allocators on the web that can be used in C, but your matrix library obviously won't link to them and use them properly. It expects the traditional model and does not expect to be using smart pointers (for example.)

The lucky thing is that if, during a call into your matrix library, you can ensure that the free block starts out as large as possible, then you are as good as you can get. And pointers that you send into the library and pointers you get back from the library can be managed by your own code using something akin to memory marking. ("Atoms" would be another way to handle it in your code, too, but the code for "atoms" involves hashing and is much more work to write and test.)

Jon

Reply to
Jon Kirwan

It depends.

Overload the new() operator to trace all memory requests. See for yourself what happens in your particular task on your particular system. Proceed accordingly.
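In plain C, the equivalent is a pair of logging wrappers (names are illustrative; assumes some console on the target):

#include <stdio.h>
#include <stdlib.h>

void *trace_malloc(size_t n, const char *where)
{
    void *p = malloc(n);
    printf("alloc %6zu at %p (%s)\n", n, p, where);
    return p;
}

void trace_free(void *p, const char *where)
{
    printf("free        %p (%s)\n", p, where);
    free(p);
}

/* Route your own code through these to capture the pattern. */
#define MALLOC(n) trace_malloc((n), __func__)
#define FREE(p)   trace_free((p), __func__)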

Vladimir Vassilevsky DSP and Mixed Signal Designs

[link]

Reply to
Vladimir Vassilevsky

I am already hyperventilating.

Three suggestions, that can be used independently or combined.

(1) As mentioned in other replies, implement your own memory allocation subsystem.

A common approach in embedded systems is to use a number of "pools", each pool being a statically allocated chunk of memory containing N blocks of the same size.

Typically a small number of pools is used (I have seen from 3 to 10), each with blocks of a different size.

The fact that all blocks in a single pool are of the same size makes managing it trivial. (One linked list of free blocks, using the first few bytes of each block to hold a next pointer.)

A malloc request tries to find a free block in the first pool holding blocks at least as large as the requested size. If all are used, it tries the pool with the next larger block size, etc.

Freeing a block just puts back that block in the free block linked list. Adjacent free blocks are never combined.

Adding counters to each pool to keep track of max/avg usage, and of how many times a request had to be satisfied by a larger pool down the chain, makes it possible to fine-tune the number of pools and the block sizes to a particular project's runtime behavior. (A rough sketch follows below.)

Of course, allocating memory from one of a few fixed-size block pools implies some waste; memory constraints may make this approach inappropriate.
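The sketch of (1) (sizes, counts, and names are illustrative; it assumes the pool array is sorted by ascending block size and every block is at least pointer-sized):

#include <stddef.h>

typedef struct {
    void  *free_head;   /* free list threaded through the blocks */
    size_t block_size;
    size_t misses;      /* requests that spilled to a larger pool */
} pool_t;

void pool_init(pool_t *p, void *mem, size_t block_size, size_t count)
{
    p->block_size = block_size;
    p->free_head  = NULL;
    p->misses     = 0;
    char *c = mem;
    for (size_t i = 0; i < count; i++) {   /* thread the free list */
        *(void **)c  = p->free_head;
        p->free_head = c;
        c += block_size;
    }
}

void *pool_alloc(pool_t *pools, size_t npools, size_t n)
{
    for (size_t i = 0; i < npools; i++) {
        if (pools[i].block_size < n) continue;   /* too small          */
        if (pools[i].free_head) {
            void *b = pools[i].free_head;        /* pop a free block   */
            pools[i].free_head = *(void **)b;
            return b;
        }
        pools[i].misses++;                       /* spill to next pool */
    }
    return NULL;                                 /* everything full    */
}

void pool_free(pool_t *p, void *b)
{
    *(void **)b  = p->free_head;                 /* push it back */
    p->free_head = b;
}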

(2) If you can identify one (or a few) critical malloc usages (called most often, or deep in loops, etc.), use a custom malloc()/free() for that case only.

A dedicated pool as described before, or a wrapper around the standard functions that does not always release a freed block, assuming a new malloc request for the same size will come soon. These dedicated malloc()/free() could keep, for example, an internal list of 3 blocks. free() will put a block in that list, or call the "real" free() if all slots are used. malloc() will try to return one of these 3 saved blocks, or call the "real" malloc() if none is available.
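For example (the slot count and the single block size served are made up):

#include <stdlib.h>

#define CACHE_SLOTS 3
#define BLOCK_SIZE  256          /* the one size this wrapper serves */

static void *cache[CACHE_SLOTS];

void *hot_malloc(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i]) {          /* reuse a recently freed block */
            void *p = cache[i];
            cache[i] = NULL;
            return p;
        }
    }
    return malloc(BLOCK_SIZE);   /* cache empty: the "real" malloc */
}

void hot_free(void *p)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i]) {         /* keep it around for next time */
            cache[i] = p;
            return;
        }
    }
    free(p);                     /* cache full: the "real" free */
}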

(3) Storage "arenas" - This is more useful for usage patterns where memory is allocated with varying sizes as a process goes on, and it is all deallocated at some later point.

Allocate one big block large enough for a subsystem's needs, allocate memory from inside this block as needed (very quick: one pointer to the beginning of the free area inside the arena, no search), and deallocate the whole arena when done. (No need to deallocate individual blocks inside the arena if it is known they are not used anymore.)
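A minimal sketch of (3) (names and the alignment choice are illustrative):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;
    size_t   size, used;
} arena_t;

void arena_init(arena_t *a, void *mem, size_t size)
{
    a->base = mem;
    a->size = size;
    a->used = 0;
}

void *arena_alloc(arena_t *a, size_t n)   /* no search, just a bump */
{
    n = (n + 7u) & ~(size_t)7u;
    if (a->used + n > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_reset(arena_t *a)              /* "free" everything at once */
{
    a->used = 0;
}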

--
Roberto Waltman 

[ Please reply to the group, 
  return address is invalid ]
Reply to
Roberto Waltman

They are all reasonable suggestions in various situations, but this whole discussion seems like premature optimization. Really, the most sensible approach is to implement whatever is easiest and most obvious, then benchmark it to see whether it satisfies the requirements. If it does, you're done. If not, that's when to look for speedups.

Reply to
Paul Rubin

Well, I'm hoping that my problem is to find an answer to the following question:

I assume a good "C-style" malloc routine that does nothing special beyond merging any bits of free space that happen to be contiguous, without moving blocks of memory. I constantly stir this heap by making allocation and deallocation requests of it. These requests come from a finite set of possible sizes (for example ten 30-byte blocks, two 100-byte blocks, etc.), and moreover the requests happen in a fixed pattern that repeats once with each iteration of the algorithm.

Given these assumptions, is the amount of heap fragmentation bounded, can I predict that bound mathematically, and (better), is there a reliable means by which I can measure that bound by watching the heap behavior over some limited number of iterations of my algorithm?

Intuitively the answers that I think are correct are "yes", "yes", and "maybe". And I should think that this would be a problem that's cropped up in the embedded space before, so I would think that there would be at least one white paper on it -- perhaps even one by each person who's hawking a customized heap manager. What I really want is such a paper that proves to me that the answers are "yes", "yes" and "yes", and gives me guidance on the number of iterations that I need to observe.

I guess I'll either have to slog through some papers, or do some math myself.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

Mostly I want to make sure that there's not some known danger of unbounded heap usage in the case of requests drawn from a bounded set of sizes. I really think the answer is "no", but I'd rather not be surprised by this after the product is fielded. Particularly because if my customer ever double-checks with an experienced embedded person, there's a good chance that they'll be appalled at the notion of using a heap.

So, the thread is either about due diligence, or about me covering my ass, or maybe both.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

It is all implementation specific. But presumably nothing stops you from running your client millions of times (just skip the actual math part that uses the cpu time and doesn't affect the allocation pattern) and looking at the shape of the heap afterwards and seeing if you have apparent convergence. If you need absolute guarantees (at pain of possible nuclear meltdown or whatever), you shouldn't use malloc and you should probably write in Ada rather than C.
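For instance, on a glibc workstation you can replay the request pattern and watch the numbers flatten out (mallinfo() is glibc-specific and the pattern below is made up; on the target you'd poll your own allocator's counters instead):

#include <malloc.h>   /* mallinfo(): glibc-specific */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 12

static const size_t pattern[NBLOCKS] = {   /* one loop iteration's requests */
    30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 100, 100
};

int main(void)
{
    void *p[NBLOCKS];
    for (long it = 0; it < 1000000; it++) {
        for (int i = 0; i < NBLOCKS; i++)  /* the per-iteration churn */
            p[i] = malloc(pattern[i]);
        for (int i = 0; i < NBLOCKS; i++)
            free(p[i]);
        if (it % 100000 == 0) {
            struct mallinfo mi = mallinfo();
            printf("iter %8ld  arena %d  free %d\n",
                   it, mi.arena, mi.fordblks);  /* flat => converged */
        }
    }
    return 0;
}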

Reply to
Paul Rubin

It does sound like your allocation pattern is simple enough that you can use completely static storage. But a malloc implementation would have to be rather badly broken if it leaks memory the way you describe, as long as you actually free everything you allocate. It's much more common for storage leaks to be caused by bugs in the application itself.

As someone else said, you can also use "pool" or "arena" style allocation, where you allocate big blocks of memory, carve them up into smaller pieces used by the algorithm, and then at the end of the algorithm you can free the big blocks, releasing all the small pieces at once without having to worry about accounting for them individually. That should also prevent most fragmentation problems.

Fancier systems use copying-style garbage collection, where the GC simply copies all the data from the live heap into a new heap, then frees the old heap. The simplest version of this has 2x ram overhead (the heap lives in half the ram, then it gets copied to the other half and the first half is freed). These systems are quite reliable with a little bit of care. Phone switches programmed in Erlang keep running for decades, including across software upgrades (you can load new code into the switch while it is running, without interrupting existing conversations etc).

Reply to
Paul Rubin

I should have mentioned: yes, there are tools for this. Valgrind is the best-known one, at least in my little corner of the world. I don't know if it works with typical embedded toolchains, but maybe you can port your application to a workstation for testing, which is generally useful anyway.

Reply to
Paul Rubin
