"write()" semantics

That wouldn't be necessary. It costs nothing for me to make unlimited copies of read-only data -- just another pointer to the data (*you* can't change the data, either!)

Then you're violating "the *only* thing in the page(s) was THE BUFFER". If you want that capability, then, by necessity, a copy operation is required. So, you can't use the "fast path" interface but must resort to a slower, more traditional interface.

Reply to
Don Y

You said earlier:

So, I don't understand. If the write buffer is mapped in read-only memory (e.g. in between executable code), does the page "disappear", or not?

Generally, I wouldn't know what variables or buffers would be in a certain page, unless they were cleverly allocated.

But since you're playing with the page tables anyway, it seems like you could implement a Copy-On-Write page fault handler, which would remove all restrictions on use, while still giving you the same performance in cases where pages aren't written to.
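A minimal sketch of how such a fault handler might look, assuming hypothetical kernel helpers (page_for, alloc_page, copy_page, map_page) and a per-page reference count:

  /* invoked on a write fault to a page previously marked read-only */
  void cow_fault(struct vm_map *map, uintptr_t addr)
  {
      struct page *orig = page_for(map, addr);

      if (orig->refcount == 1) {
          /* sole remaining owner: just restore write permission */
          map_page(map, addr, orig, PROT_READ | PROT_WRITE);
          return;
      }

      /* shared: give the faulting context its own private copy */
      struct page *copy = alloc_page();
      copy_page(copy, orig);          /* the actual byte copy */
      orig->refcount--;
      map_page(map, addr, copy, PROT_READ | PROT_WRITE);
  }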

Reply to
Arlet Ottens

One "instance" disappears. Since they are all identical instances, you don't see any change.

The problem is different for *writeable* pages because you can't have two (writeable) instances of the same page.

You would allocate your "buffer" from a pool that was created for that purpose!
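E.g., a minimal sketch -- using posix_memalign() to stand in for a dedicated pool -- so that every buffer occupies whole, page-aligned pages that nothing else shares:

  #include <stdlib.h>

  #define PAGESIZE 4096   /* assumed page size, for illustration */

  void *buf_alloc(size_t len)
  {
      /* round the request up to a whole number of pages */
      size_t rounded = (len + PAGESIZE - 1) & ~(size_t)(PAGESIZE - 1);
      void *p;

      /* page-aligned and page-granular: safe to (un)map wholesale */
      if (posix_memalign(&p, PAGESIZE, rounded) != 0)
          return NULL;
      return p;
  }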

Yes, that was the approach I was exploring when I posed the question. But, that adds overhead to *every* write() -- even if the user doesn't need it. (alternatively, you add yet another "mode"...)

If, for example, people tend to stuff things into a buffer, write that buffer to some "device" and then forget all about that data as they move on to process "new" data, then the costs of CoW are *wasted*.

Just like a *real* copy() would have been wasted.

Note that the cost of CoW when the page actually *is* touched exceeds the cost of a blind, unconditional, "a priori" copy. So, if the user is typically "filling" a buffer, writing it, then filling it anew (with different data) and writing *that*, etc., then the CoW implementation is slower than the blind copy approach -- because the buffer/page gets touched *always* (an extra trap, some housekeeping, *then* the actual copy)
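(As a rough tally, with symbolic costs:

  blind copy:               C_copy
  CoW, page never touched:  C_remap
  CoW, page touched:        C_remap + C_trap + C_bookkeeping + C_copy

CoW only pays off when the touch probability is low enough that the *expected* cost stays below C_copy.)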

Reply to
Don Y

Which is the way *most* programs are organized. I'm simply pointing out some exceptions.

My point, though, is that programmers rely (by necessity) on the call semantics even when those semantics are unknown to the programmer.

Protecting the buffer has costs. Leaving it unprotected and telling the programmer to keep hands off does not cost the OS anything. That's why nearly all OSes operate in that way.

Not uncommon at all. Witness the march to "managed" environments - JVM, .NET, etc. - and back to interpreted languages. All of these have enormous costs and yet programmers are flocking to them.

Of course. At minimum there's VMM and cache circuitry even if the CPU thinks it's working with flat addresses.

Obviously you're limited by how many concurrent segments the VMM can track. But in a reasonable system the number of working segments would likely be quite large ... 10s of thousands if not more.

Even Intel's half-baked system allowed 16,000 segments system wide. The problem was the CPU was aware of them and could only use 4-6 at a time (depending on model). There also was a segment cache that retained some model-dependent number of segment descriptors - a cached descriptor did not have to be revalidated and so could be reloaded quickly. Validation took many (~100) cycles on Intel.

Segment translation - like page translation - is best done by VMM. The CPU should just blissfully use flat addresses.

Yeah. But that introduces a lot of potential problems that are not easily solved ... such as what to do when two programs need to share memory but want to use different sized pages. It's perfectly possible to have differently sized page frames overlaid on the same physical addresses, but changing the maps in the VMM unit might be an issue. To be efficient you need to deal with one size at a time. With differently sized pages, the page tables would have different structures, so the VMM unit would have to be flushed and reconfigured every time a new page size is encountered.

I don't immediately see a design that could work with multiple sizes simultaneously. AFAIK, no VMM unit does so now ... even the ones that permit multiple pages sizes.

Normally COW is used in disjoint memory maps - e.g., 2 processes sharing a page - it doesn't matter to either that its private "copy" is at the same logical address as the original because after the copy the original is no longer visible. The problem is that the kernel and the user view of the process are not disjoint: when the page is copied, the kernel now potentially can see 2 different copies of it having the same logical address - one in kernel space and one in user space.

This would break congruency of the kernel and user view of the process and I'm not aware of any OS that would tolerate it. The difficulty lies in the fact that programs don't expect addresses to change due to COW - and so COW maintains logical addresses (while remapping the copy to a new physical address).

It's probably possible to temporarily allow separate kernel and user space copies with the understanding that the kernel has to merge its view of user after the I/O operation, but that could get complicated with many I/O operations simultaneously in progress.

Likewise, it's probably possible to have a COW implementation remap a known I/O buffer page to a new logical address in kernel space ... thereby preserving congruency with user space ... but that could require some fancy bookkeeping like in GC languages: every pointer/reference the kernel has to the page would have to be updated after the remapping.

George

Reply to
George Neuner

Understood.

Though "they" tend to offer some notion of simplicity/convenience in return for those "constraints". In my case, I'm offering "efficiency" (I think that might be harder for The Average Joe to relate to -- as it's is obvious only indirectly)

So, the segment descriptors would reside in memory (like page tables). What mechanism would the processor use to know which segment(s) applied to a particular address?

See, this would limit the utility of this scheme (I'm still thinking about how you work around this limitation *in* the CPU)

Why can't the MMU track page sizes? (efficiency is a relative term; VMM isn't as efficient as a non-managed space -- in terms of hardware complexity)

So, you're in the Intel boat -- pick a page size and live with it across the board...

But, once copied (well, some epsilon thereafter) they are no longer the same "object". E.g., a physical page exists at a particular logical address in one process and at another in another process. As long as neither process "touches" the page, it's just an efficiency hack for sharing the data without duplicating it (and consuming more physical memory in the process). Like having two instances of a program running at the same time (shared CODE).

Once one of the (logical) pages is modified, then they are no longer the same object -- a new physical page must be instantiated for one of the "almost-copies". The efficiency of storing a single copy is discarded by the act of modifying it.

E.g., in the write() scenario, it's like allocating a new buffer so you can manipulate the contents of that buffer while keeping the contents of the original buffer intact for the pending I/O operation. You no longer have the same "buffer" (contents) as you had previously.

I'm not seeing why the logical address is changing (?)

No, in the write() case, the kernel can *discard* its copy of the buffer! If the user has touched it (necessitating instantiation of a new physical page to hold an alterable copy of the buffer), then the user no longer cares about the original buffer's *contents* (though the buffer can still reside at the same logical address in that "process")

The page becomes "owned" by the kernel until the kernel is through with it (i.e., the write() completes). Thereafter, it can be released to the free pool.
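Something like this, as a minimal sketch (current_task(), unmap_pages() and queue_io() are hypothetical kernel primitives):

  #include <stdint.h>

  /* "fast path" write(): the caller donates whole pages to the kernel */
  int write_fast(int fd, void *buf, size_t len)
  {
      /* only page-aligned, whole-page buffers qualify */
      if (((uintptr_t)buf | len) & (PAGESIZE - 1))
          return -1;      /* must fall back to the slow (copying) path */

      unmap_pages(current_task(), buf, len);  /* caller forfeits the pages */
      queue_io(fd, buf, len);      /* kernel owns them until the write     */
      return 0;                    /* completes; then: back to free pool   */
  }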

Reply to
Don Y

(We are talking protected mode on Intel, aren't we?) With all due respect, one can argue that Intel's segmentation system is overly complicated and overly flexible. I can't imagine anything I wanted to do, and couldn't, with Intel's segments at the assembler level. E.g. I could have a buffer allocated in a segment and require the supplier to remount that segment read-only before accepting it for writing. The 4 levels of privilege allow a lot of enforcement on lower levels, etc. If I am the writer, running at a higher privilege than the user program, I could check/spy on the user program to make sure it didn't sneak in a privileged instruction doing something to the segment that I don't want..., whatever.

Compilers and especially linkers do a bad job at exposing the functionality. That may be because that functionality is totally foreign to the programming language used.

You don't want paging, you want more segments; Intel has thousands of them, each with an individual size and properties. This is not an abstraction present in a high level language, but neither is paging.

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

The VMM unit would cache recently used descriptors and perform lookup by range comparison against the address. On a cache miss, you'd go to memory based tables. There are a number of reasonable ways to arrange the tables, but I'd likely use a hash table or possibly a range ordered trie.

This is similar to how paging is done now. Most systems use some form of b-tree to arrange their memory based page tables. When a page miss occurs, the processor has to search the memory tables for the correct page entry. This happens more often than you think ... the normal mode of operation is to flush the VMM unit at a context switch and fault in page descriptors as needed. Most OSes try to improve on that by saving/restoring the process's working set of page descriptors, but a large program can quickly transition to a new working set and render the page restore operation useless.

Well I'm talking about a system where the CPU uses flat addresses and where segmentation is handled entirely by the VMM unit (like paging is now). You could have as many active segments as the VMM unit allows. Remember that segmentation is performed on logical addresses - there is no "translation" but merely a range and protection mode check. The only issue is context awareness - a segment belongs to a process (or the kernel), but the same is true of a VMM page.

Has nothing to do with Intel ... AFAIK *NOBODY* makes a VMM unit that handles multiple page sizes simultaneously. A number of processor families - including IA-64 - can switch page sizes on the fly, but they don't handle multiple sizes simultaneously.

Thinking harder about the problem, I can see a way to allow several power of 2 page sizes (to a limit) simultaneously, but at the cost of a LOT more comparator circuitry - think about a cache with an address comparator on each word rather than each burst line, a LOT more power usage and a likely increase of several cycles in access latency ... although I think a clever enough design could preserve overall throughput.

Yes, a new physical page is allocated ... but that new page remains at the SAME LOGICAL address in the context that touched it ... and that is the problem: the kernel normally has the current user process mapped into its address space so that it can access any user space address without a protection fault.

When user touches the COW page, the user gets a unique copy in its memory map, but the kernel would see both the original and the copy and each would have the same logical address in the *kernel* memory map.

The only workable solution is to remap the kernel's copy immediately to a new *logical* address *before* the kernel tries to use it. This would work only if the buffer is merely raw data to the kernel ... if there are any relative addresses present on the page those would be broken by logical remapping.

But this is not the way COW works ... a new kernel call specific mechanism would have to be devised.

George

Reply to
George Neuner

But, since each segment can be a different size, how does the processor map a particular *arbitrary* address into the appropriate segment -- and, from that, its descriptor?

E.g., with a paged MMU, you *know* which page each address is in simply from the address itself (given a fixed page size). So, lookups can happen "deterministically"...
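(I.e., with a fixed 2^N byte page size, the "lookup" is pure bit manipulation -- no searching involved:

  page   = addr >> PAGE_SHIFT;       /* which page            */
  offset = addr & (PAGESIZE - 1);    /* where within the page */
)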

Aside from the mapping of arbitrary addresses into appropriate segments, I understand this.

But, you could (within reason) trade page size for efficiency. That would allow the "user" to make this tradeoff and know its costs.

Oh, I don't operate that way. I put pages wherever it is convenient and maintain a map of "what's where". This allows me to sidestep the "MSb of address (effectively) determines user/kernel". The kernel acts like yet another "task" in that sense -- "foo" might exist at address X in the kernel, Y in one task and Z in another.

Correct. The same is true if a (shared) page in task X also exists in task Y at a different address. If you want to store pointers/references in shared objects, then you have to ensure that those referenced objects exist in the same places in each memory space.

Reply to
Don Y

First of all, in most systems page lookups are NOT deterministic. Typically there is a separate table for the kernel, for each process and for global allocations. Most systems permit page tables to be swapped, so the required table (or relevant portion) may not be RAM resident when needed. The tables themselves usually are arranged in a b-tree that may be up to 7 levels deep with 64-bit addresses on systems I am aware of (depending on page and process size). Some operating systems allow the tables - or sometimes at least the indexes - to be locked into RAM, but only systems with relatively small memories (and few processes) or massively huge memories can afford to do this.

Ok, let's go through this step by step. I may be repeating some stuff but bear with me.

First, my segments are not indexing bases for the CPU (where Intel fell down), but rather are just protection zones defined by start and end addresses (or alternatively a start address and an extent). They don't really need a "descriptor" other than a name by which the kernel or process can reference them.

I'm assuming the VMM unit can manage some reasonable number of segments - 1K or more (similar to current paging units) - and that it can perform range comparisons on all of them simultaneously (similar to an associative cache). The contents of the VMM unit at any time would be context/process dependent (again like page tables). And, of course, there would always be one (master) segment that covered the whole of the context.

Replacement policies are debatable, but IMO segments with the tightest (i.e. most specific) address ranges should be retained whenever possible and when a segment is ejected due to insufficient slots, then a flag should be set on the next larger inclusive segment to indicate that a search is necessary. Naturally, the contents of the VMM unit represent the "working set" and can be bulk saved/restored at a context switch.

So it's obvious what happens when the most specific segment containing the target address is in the VMM unit ... it's identified by range comparison and the access is (dis)allowed depending on the segment's protections.

If a memory search is triggered, then just like a page fault there is a trap into an OS handler which searches for and loads the most specific segment. Obviously, the handler can do whatever it wants and the segment data can be stored in any way that's convenient to search.

I haven't decided what is the best representation for memory tables wrt searching ... there are several that would work reasonably well. B-tree obviously is a candidate (ala page tables), but I'd be tempted to use multisets and map-reduce if I had SIMD operations available. I can think of a few graph solutions as well.

I'm not sure what you mean by "*in* the CPU" because I'm talking about outside the CPU ... the VMM unit technically is between the CPU and memory (regardless of being on the die). Ideally the CPU need know nothing about virtual memory manipulation.

If you're pondering how to abuse Intel CPU's then don't bother ... there's no way to make this work on Intel. Intel's segments are used as both index bases and protection zones - they are a CPU visible part of the address. The segment VMM unit checks protection but plays no part in (re)loading the CPU's segment registers - that all is under program/OS control (for "flat addresses" the OS loads all the segment registers with the process base address).

[The above is a bit over-simplified ... it's correct for IA-32 or in 32-bit mode on x86-64. In "flat" address mode, registers FS and GS still are used for threading by various OSes (which remains true in 64-bit mode). However, in 64-bit mode, FS and GS are range checked only and the other segment registers are not recognized. In 64-bit mode only page mode protection is available.]

There is no microprocessor I am aware of that operates the way I've described ... this is an amalgamation of functionality from mainframe systems I've encountered. I've long wanted to take a FPGA/ASIC core CPU with restartable instructions but no internal VMM and put a good segmentation unit around it. But my VHDL skills aren't up to the task and I have doubts that any of the ?->VHDL compilers could fit the design into a reasonably sized part.

George

Reply to
George Neuner

Hi George,

[grrrr... I need to figure out how to keep active threads "visible"!]

[As an aside: do you recall the mainframe that had the asynchronous processing elements (multipliers, etc.)? I want to think Burroughs but that may be incorrect. It's not important, just perhaps fresher in your mind than mine -- I'm too lazy to drag out my course notes...]

Yes, but my point was that pages begin on fixed physical boundaries. The sorts of segments we're discussing would lift that requirement. I.e., *any* address can exist in a segment at any relative offset in that segment (whereas XX..XXX..00 will always be at the *start* of a particular page)

I consider the parameters that define the segment to be its "descriptor" (what you mention I would call a "handle")

So, you're allowing segments to exist within segments? I.e., a particular address may reside in one or more *segments*? Can segments overlap in ways that are not proper subsets of each other (i.e., AAAAAAxxxxBBBBBBBBB -- where the xxxx locations are in A & B, though the A's are only in A and the B's are only in B)?

This seems like it would significantly complicate identifying the appropriate segment for a particular address (i.e., you can't even claim the "smallest enclosing segment" wins since two equally sized segments could contain a particular address)

Or, am I missing something?

But how do you constrain the "system" so that pathological cases don't break it? E.g., all segments are a single byte and cover the entire address space. How do you even *detect* when this sort of thing is happening?

Sorry, I meant "in the CPU" as pertaining to "in an active device" (vs. just structs residing in memory) so that they can be acted upon "in hardware" instead of forcing the CPU to "run code" to resolve every memory reference.

I think the better bang-for-buck is just variable sized pages (arbitrarily constrained to begin on "convenient boundaries", etc.). This, in effect, is what I'm doing with my "write buffers"; ignoring actual size and just fitting into whole page(s) since the hardware can handle those efficiently.

With gobs of memory and/or memory *space*, the cost is low.

*And*, if your algorithms tend to want to process large(r) buffers (instead of tens or hundreds of bytes), then there really is no penalty[1]

[1] Actually, in my case, the penalty is overall latency, as a complete buffer must be processed before it can be "passed along". Obviously, the bigger the buffer, the longer it takes to process -- meaning the "later" it is sent down the pipe (along with all that follow). [this *is* actually a serious constraint as it effectively requires more memory be available to meet the real-time constraints]
Reply to
Don Y

Depends on what era you're talking about ... I believe that in the 80's a number of mainframe designs had asynch pipelines: IBM 3090, Vax 9000, Crays, etc. I don't recall Burroughs having multiple pipelines although the later models had multiple processors.

Yes, but that just changes how you search. I'm not disputing that a segment search is or will be more expensive than a page search ... but I think you've been counting cycles so long that you are assuming anything beyond a simple table lookup is too complex. [I have some sympathy for that view but IMO it's rapidly becoming obsolete even in the very small. These days it's often cheaper to stick in a 32-bit mpu than use something smaller that needs more glue logic.]

A page /could/ be found directly by its prefix, but in modern 64-bit systems such a direct mapped table would be enormous, so the tables are split and arranged into a search tree which maps some number of address bits at each level, with only the final step being a table index. Even in a 32-bit system where the tables are a more manageable 4..8MB/process, in many cases keeping them in RAM for all the runnable processes still is not feasible.

For segments the VMM hardware search is a parallel map-reduce operation - essentially the same as a cache lookup but using range comparison instead of address equality. The smallest enclosing segment is selected, but (as you noted) it may not be the most specific.

There would be at least one segment covering the whole of the process space, locked into the cache while the process is current. If/when a covering segment is displaced from the cache, flags should be set on other covering segments indicating that a memory search is needed if these segments are selected.

The simplest memory representation of a segment is a tuple of at least { start addr, end addr, permissions }. It can optionally include an owning process id and other information that might allow sharing among processes. I won't go into that here because it is a black hole of possibilities.
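In C terms, something like the following -- the search shown is just the sequential equivalent of what the VMM unit would do in parallel:

  #include <stdint.h>
  #include <stddef.h>

  struct seg {
      uintptr_t start;    /* first address in the segment   */
      uintptr_t end;      /* last address in the segment    */
      unsigned  perms;    /* read/write/execute permissions */
  };

  /* return the tightest segment enclosing addr, or NULL to fault */
  const struct seg *seg_lookup(const struct seg *tab, size_t n, uintptr_t addr)
  {
      const struct seg *best = NULL;

      for (size_t i = 0; i < n; i++) {
          if (addr < tab[i].start || addr > tab[i].end)
              continue;
          if (!best || (tab[i].end - tab[i].start) < (best->end - best->start))
              best = &tab[i];     /* most specific range wins */
      }
      return best;
  }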

WRT how to organize the segments, the most logical form IMO is some kind of multimap (like in the C++ STL or Java library) with an ordering defined such the most specific range is found first. Because you'd typically keep a separate map for each process, they won't get obscenely huge.

It is most advantageous to keep the tuples in a linear table and construct separate search maps if needed. On a CPU with SIMD the linear table could be efficiently searched directly and also is perfect for DMAing into the VMM unit (which could do the search very simply in hardware).

[Some of my views are colored by a lot of playing with powerful DSPs that have memory->memory DMA which can route data through FP pipelines or other attached units in parallel with ALU processing. Most CPU based systems don't have general memory->memory DMA (though they could if desired). Many systems require that the source or the target be within a defined I/O range (bus mastering may be both, but few systems allow both addresses to be anywhere in memory).]

You've forgotten that the point of this was to protect the user process from violating itself during asynchronous I/O (e.g., mucking with a buffer while I/O is in progress). Therefore the only segments that ought to exist within a process - apart from the one defining the whole process space - should be those that are defining system call parameters and I/O buffers. Depending on how these objects are grouped, I think they should form a set that is either completely disjoint or completely hierarchical.

There are no semantic issues that prevent sharing mappings among different calls, or different processes, or even concurrent calls by different threads in the same process.

However, sharing among multiple processes can be problematic because that really must be done with physical addresses rather than virtual ones. It can be handled, but the mappings become more complex than just a pair of address limits.

You can't ever fully prevent pathological behavior with any sufficiently powerful feature. Even now you can take mmap() on Unix/Linux, or VirtualProtect on Windows, and set no-access protections for nearly every page of your process (except the page containing the user page fault handler ... you'll crash instantly if that can't be read). You can potentially take a fault on every instruction. Nothing but common sense prevents this. [And actually there is a legitimate use for this. One well known method of incremental GC uses VMM calls to enforce a read barrier. It marks the user heap no-access and then scans, remaps and unprotects individual pages as the program faults on them.]
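For instance, the Unix mechanics of that read barrier are just (a sketch; heap_page and the actual scan/fix are assumed):

  #include <stddef.h>
  #include <sys/mman.h>

  /* arm the barrier: revoke all access, so the next touch faults */
  void barrier_arm(void *heap_page, size_t pagesize)
  {
      mprotect(heap_page, pagesize, PROT_NONE);
  }

  /* the SIGSEGV handler scans/fixes the page, then disarms it */
  void barrier_disarm(void *heap_page, size_t pagesize)
  {
      mprotect(heap_page, pagesize, PROT_READ | PROT_WRITE);
  }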

The simple way to prevent problems is to limit the number of segments a process can create. You wrote that you would need many thousands of segments, but I really question that. You may in fact be correct ... you are experienced enough not to be dazzled by shiny objects ... but I'm an OS/system software guy and it's been my experience that almost any design claiming to need thousands of *anything* is wrong-headed.

Another way to prevent problems with a powerful feature is to invent a DSL and control it using the compiler or runtime library. Don't let programmers work in C ... create a new language that controls how they can use the feature. As I mentioned in some previous message, I used a system where I/O buffers had to be declared as such and the compiler/runtime/OS conspired to magically protect them during use.

A related option is to bury the feature within the system API so its use is completely hidden from the programmer. IOW, let the user call read() or write() normally but manipulate things using a hidden system API behind the scenes to protect the buffers.

That said, I'm of the opinion that programmers should not be protected from themselves. If pulsing some bit turns on the saw motor then you put a BIG honking warning in the manual about it and stop worrying. If the programmer cuts someone's hand off, then maybe he'll learn to be more careful. [I used to play with pneumatic conveyor loading robots. The robots had to be synchronized to the conveyor and until that is completed the whole place is a hard hat area. Some of those loaders can throw like a major league pitcher.]

George

Reply to
George Neuner

Had to be at least 10 years earlier. I.e., I was at school mid 70's and learned about it, then. I suspect I will have to drag out my course notes (this has been one of those "little things" nagging at my memory for several years :< )

But it has to occur in some "reasonably" bounded time frame since any particular (arbitrary) reference could touch a new segment...

Yes.

Depends on the sort of system you are dealing with. E.g., a "desktop" system that can spawn new processes "indefinitely" (subject to the whims of its user(s)) is different than something embedded that is designed with foreknowledge of "the most it will ever do"

Would you allow addresses to be members of segments that were not wholly contained in the "other" segment? (i.e., where the intersection of two segments is not empty and the *union* of those same two segments is larger than the largest of the two?)

But, again, what limits the number of such tuples (and, thus, the cost of the search?)

Actually, no. Recall I was asking if it was a common coding practice to reference (even R/O) a "buffer" (to avoid using the term "segment", here) after an asynchronous write() has conceptually passed that buffer's contents to "something else".

Or, in my case, I want write() to be able to *unmap* the buffer from the "producer's" memory space -- so, if it was commonplace to reference the buffer's contents (again, even R/O) *after* the write(), that producer would be screwed -- he'd have to arrange to keep a copy of the buffer *or* ensure all references happen before it is write()-d.

Understood (and the example I am describing falls under "I/O buffers" if you consider any passing of data between processes as "I/O")

I'm specifically looking at this as a key feature in my multimedia server (see threads, elsewhere). Briefly, I (myself) have ~25-30 network drops throughout the house. At the end of each, conceivably, I can locate an audio/video/AV client (let's just deal with simple audio clients, for the moment). Currently, those are one or two "channel" data sinks -- you pass audio streams to them and they "play" them (speakers, etc.).

For practical purposes, I think ~14 audio streams is a good "typical" number to scale against. More is possible (especially in other deployments/applications) as is less. (this is based on two video programs being played -- each with 5.1 sound -- plus two separate "stereo" audio programs).

*In* the server, I want to code data for each of these streams and "write()" it, eventually, to the protocol handler for distribution to the active clients. Since this streams indefinitely, you can't just process "all" of the audio in one shot -- process a "buffer" at a time (whatever *that* is).

Buffers can't be too big because that impacts the hardware in the clients. And, also leaves them more vulnerable to dropped packets, etc.

So, there is a queue of "processed buffers" waiting for each client. As well as buffers "being processed" for those same clients. (remember, everything here is "x 14", at least).

Additionally, the server is responsible for some signal processing and mixing (the clients "see" a data stream designed specifically for each of them). So, I was going to use the same mechanism for IPC -- treating the "next process" in the signal processing chain as the "output device" of the previous process, etc.

There is value (and cost!) in this sort of approach as it can exploit protection domains to prevent a buggy "signal processor" from corrupting data that other consumers will need. I expect it to pay off, handsomely, during development where it is just too hard to sort out why something "doesn't sound right"

Yes, I plan on doing the unmap() *in* the write() call. All is well as long as the caller doesn't later try to reference that buffer's contents -- since it no longer belongs to him!

[I initially thought I could just let the user reference the buffer and use that to fault a *new* buffer in its place. But, that might not be the most efficient way of doing things (and, would confuse a programmer who expected the previous buffer's contents to still be accessible, there!)]

I have mixed feelings, here: I don't want you to *cripple* me in an attempt to prevent me from doing something that you think *might* hurt me -- especially as I have a tendency to do things that you might not have foreseen as "typical". (this, IMO, is where things like "prohibiting pointers" are overly paternalistic)

OTOH, I want the system to provide reliable services that expedite my development of reliable *systems*.

[consider how much easier it is to produce a reliable system when competing tasks/processes/developers *know* they have some protection from each other's (mis)actions. IMO, that's the sort of place you burn CPU cycles... *not* images of little folders flying through the air!]

Yeah, I worked on tablet press automation at one point. Compressing "powders" into "tablets" (several tons of pressure operating at 100+ Hz). Were it not for the "guards" you could lose a finger 10 times over in the time it took your larynx to *start* thinking about screaming! :-/

[apparently, the same sorts of machines are used to manufacture the explosive charges used to deploy airbags. I've often wondered what it would be like to watch *that* manufacturing process "go awry"... :> ]
Reply to
Don Y

That's true, but extending the "reasonable" upper bound is the purpose of multi-tasking. In paging systems it's extremely rare that a page fault has to be handled immediately - and it's never the case that paging I/O must be done immediately ... the most that ever will happen immediately is remapping (to a new address) or changing protections on an already resident page.

Absolutely. An embedded system frequently may run only a single application and without a paging device. It may use MMU only for kernel/user space separation (and sometimes not even that). In these cases the page tables certainly will be resident. But I was speaking in general rather than for any particular case.

You can't have overlapped segments in the same scope, but there's no technical reason to disallow segments overlapping from different scopes.

E.g., a segment could be owned by a particular thread and be enforced only when that thread is active. You'd simply add an owner field to the segment tuple and make matching it to the currently running thread a condition of selection. (This is already done by CPU paging units to prevent thrashing and having to spill/fill pages unnecessarily during a context switch.)

Now you're going to ask about multiple CPUs/cores and what happens when 2 threads with overlapping segments are running simultaneously? The answer is: the memory accesses still would be serialized and the MMU would invoke the requested protection for each access ... but if the protections permit concurrent writes to the overlapped zone, then it can be corrupted unless external synchronization is used.

The only limits would be ones imposed by the system: how many concurrent segment slots the MMU supports, how much space is available for process/thread information, how fast can the memory "tables" be processed, etc.

My personal preference would be use a separate linear table for each scope (process or thread) and to DMA the table into the MMU. Let it do search/selection using hardware while the CPU continues with other processing.

With 32-bit addresses a segment tuple could be just 12 bytes including thread id and protection bits. If you look at the current Linux process table (which is actually a thread table) adding another 3KB table - describing 256 segments - to each entry wouldn't really change much ... the entries can be over 40KB each now.
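(One plausible packing, just to make the arithmetic concrete:

  #include <stdint.h>

  struct seg {            /* 12 bytes with 32-bit addresses */
      uint32_t start;
      uint32_t end;
      uint16_t owner;     /* owning thread/process id */
      uint16_t prot;      /* protection bits          */
  };

256 of these per context is the 3KB table mentioned above.)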

For that matter, if I were designing (single core) hardware for this, I would create an MMU to handle maybe 512 or 1K segments, let each protected context in the OS have that many and just spill/fill the MMU on each context switch. Obviously, a multi-core CPU would need more intelligent MMU handling.

I'm not sure if that was a typo.

As I said above, this is not possible from within the same scope, but can be done using segments in different scopes.

Not necessarily ... you could keep read access just by modifying the segment protection rather than unmapping it. Or, equivalently, you could map a new read only segment over the same addresses.

George

Reply to
George Neuner

OK, so you're treating it as the same (cost) sort of operation as paging, in general. So, you really, *really* want the MMU to be able to handle *lots* of segments concurrently.

[E.g., with paging, locality of reference can give "reasonable" performance with very few active page table entries in the MMU since multiple consecutive hits will fall in the same page. OTOH, accesses to "data objects" can be sporadic and scattered possibly touching several objects in rapid succession...]

Understood.

So, you could define the API to accept a "Segment*" as an argument in which "start" and "end" addresses for the new segment (to be created) are found. This ensures it will be a subset.

But, that just ensures the OS won't (intentionally) put "bad magic" into a segment descriptor. How can you *enforce* that in the actual hardware? I.e., so that a faulty API implementation doesn't effectively create this sort of overlap... and how the hardware might misbehave if this contract is broken? (I like designing hardware so that "CAN'T HAPPEN" really *can't* happen!)

Understood.

OK, so you define a pure *model* for the mechanism and then let the actual hardware implementation trade time/space/efficiency off as appropriate. E.g., like packaging a 32b processor (ALU) with an 8 bit external bus...

And the "segment descriptor table" can, itself, be created as a formal object (and given its own "segment descriptor")...

Yes, but that brings up other potential races/hazards. E.g., what if the new owner modifies the contents, etc. (yeah, you could work around that with CoW).

The idea of write *exporting* the data (and the memory supporting it) seems easier to relate to. Before you read(), the data doesn't (physically) exist. And, after you write(), it's gone forever!

Then, with async I/O, you don't worry about interactions between you (producer) and the consumer... you forfeit access to the data once you hand it off to someone. Async vs. sync is just a way for you to know when the data has been "consumed" by the consumer (think about it) -- so it really only makes sense for physical device access (not IPC -- which should tend towards async interfaces -- unless you want to limit the number of buffers "in play" at any given time)

Reply to
Don Y

Well ... it's easy enough to check if you can limit the number of segments in any particular execution context (thread, process, etc.) to what the MMU can hold internally. The API would talk to the MMU directly, and a new segment could be checked against those already defined -- rejecting anything that falls outside the special segment defining the entire process space, anything that overlaps any other defined segment and, naturally, any new definition that would overflow the MMU.
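A sketch of that check, reusing the segment tuple from before and enforcing the "completely disjoint or completely hierarchical" rule:

  #include <stddef.h>

  /* accept a new segment only if, against every existing segment,
     it is either fully disjoint or fully nested (either way around) */
  int seg_ok(const struct seg *tab, size_t n, struct seg nu)
  {
      for (size_t i = 0; i < n; i++) {
          int disjoint = nu.end < tab[i].start || nu.start > tab[i].end;
          int nested   = (nu.start >= tab[i].start && nu.end <= tab[i].end) ||
                         (tab[i].start >= nu.start && tab[i].end <= nu.end);
          if (!disjoint && !nested)
              return 0;   /* partial overlap: reject */
      }
      return 1;
  }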

This would require either a context tag (thread/process id) for each segment or spilling/filling the MMU at every context change - which would likely impact performance too much (though perhaps not with DMA).

If you need more simultaneous segments than your MMU can hold, then you have a problem. Again, using DMA you could check a new segment by blowing the in-memory table(s) through the hardware, but you'd need to validate an asynchronous API and also deal with caching replacement policies in the MMU.

Personally, I'd try to avoid the issue by making the MMU as large as is practical. There are some very reasonably priced FPGAs that could make a good stand-alone MMU for a CPU lacking its own ... even a few that I think might handle a CPU core plus MMU in one package (I'd have to study it - I haven't worked with FPGAs for several years, but they've only gotten bigger and cheaper and there are a number of respectable cores available that are 5K-20K LEs). But again, I have trouble imagining a system - let alone a single application - really needing hundreds of thousands of simultaneously available segments. I know some web service developers like to create a new (maybe lightweight) thread for each request, but it doesn't make sense to me to create gazillions of threads unless there is enough I/O bandwidth to keep a significant fraction of them runnable ... ISTM that having a million threads all pending is pointless. I believe it's better in general to use a thread pool appropriately sized for the bandwidth of the most restrictive I/O channel and just deal with queuing. YMMV.

(Un)Mapping and changing protection (if allowed) naturally would have to be atomic operations, but you're correct that there are other race hazards. However, they can be mitigated (though not eliminated) by providing locking mechanisms. This is yet another reason for advocating a DSL rather than a library because proper usage can be made required language syntax rather than optional function calls that could be forgotten.

For example, presuming a predefined I/O buffer type that uses protected segments behind the scenes, you could define differing buffer types for RO, WO, RW which could be locked for exclusive or shared access as appropriate when in scope:

  var
    buf1 : WriteBuffer;
    buf2 : ReadBuffer;
    buf3 : IoBuffer;
  begin
    /* outside "with" blocks buffers are NA */

    with ( buf1, buf2 ) do
      :               /* buf1 is WO, buf2 is RO */
    end;

    with ( buf3 ) do
      :               /* buf3 is RW */
    end;
  end.

Or, alternatively, you could require an explicit locking syntax which directly manipulates segment protections:

  var
    buf1, buf2, buf3 : IoBuffer;
  begin
    /* outside "with" blocks buffers are NA */

    with ( out buf1, in buf2 ) do
      :               /* buf1 is WO, buf2 is RO */
    end;

    with ( buf3 ) do
      :               /* unqualified buf3 is RW */
    end;

    with ( inout buf2 ) do
      :               /* RW using keyword */
    end;
  end.

There are (dis)advantages to either approach - and if you intend to permit shared read locks then you must establish whether writes or reads take precedence - but the point is that, by creating a DSL, you get to define and enforce the semantics you consider to be reasonable and proper.

Makes perfect sense to me ... but then I'm an OS/systems software guy who designs languages and writes compilers as a hobby (and also occasionally for pay). I pay close attention to semantics and, like you, I carefully consider the ramifications of providing and/or using a particular feature. However, I am fully aware that the vast majority of application programmers neither know nor understand the semantics of features they routinely rely on.

In the abstract I generally agree with the notion of a programmable system providing mechanism and leaving usage to the programmer ... (always when I'm to be the programmer 8-) ... but in practice I've found that a certain amount of system defined policy is always necessary. IME the majority of application programmers aren't sophisticated enough to develop good policy and aren't conscientious enough to stick with it on their own. They are, however, willing to accept language limitations that don't greatly affect implementing their intended applications. This is why the newer "managed code" languages: Java, C#, VisualBasic, Eiffel, etc., have become so popular (and even the original managed language - Lisp - is experiencing a resurgence).

George

Reply to
George Neuner

Hi George,

[I don't see the need for DMA -- as a separate "device". I.e., let the MMU/"segment manager" move all the data... it's already got the physical address bus wired to it. Pass it an *object* (segment_t) that defines the segments for the new context and let it pull them in, at will]

I can't anticipate the costs of this sort of action (discarding the MMU's contents) in "real world" applications. I.e., you can get a feel for what *paging* will cost an application... since most references will tend to be in already faulted pages, etc.

Likewise, the cost/benefit of saving/restoring FPU state at context switch -- you can see what sort of floating point operations a thread invokes. E.g., if the next thread to run has *no* FP usage, then it's silly to dump the FPU... this *current* thread may end up being reactivated immediately after that FP-starved thread so the save/restore is just waste.

OTOH, if you "fully embrace" segments and use them liberally for your objects, then you can have *lots* of them with lots of interactions that are just too hard to keep track of in your head while planning the application (or any thread therein).

I am a big fan of coming up with mechanisms that can be "generously embraced". For example, a "checkbox" or a "radio button" should be implementable as its own *window* (ideally). In practical terms, if the "window" mechanism becomes too expensive, then the programmer has to start thinking about where to draw the line -- i.e., what *should* have its own window and what should be *part* of a window's contents.

E.g., the write() behavior that I mentioned at the start of this thread is inspired by a desire (on my part) to be able to move chunks of (physical) memory quickly among threads (by eliminating bcopy()) *and* retain the benefits of protection domains, etc. If I can make that mechanism efficient, then it leads to greater use (in my particular application) -- instead of having to do some things one way and other things *another* way...

Understood.

I.e., you are assuming {Write,Read,IO}Buffer are now reserved keywords in your language? And, they *impose* the necessary behaviors?

Understood.

I write AS IF the programmer is competent (fully expecting him *not* to be -- or, to be *careless*) and just make sure I carefully document what I am doing and "why". I.e., I'll let him/her compare floats for (exact!) equality -- since there may be a real need to do so! *But*, when I do "fuzzy compares", I will make sure I leave notes explaining why they aren't *exact* and why I chose the particular fuzz factor that I did, etc.

I.e., I'll let you run with scissors. I'll advise you of the dangers of doing so. But, if you hurt yourself in the process, I won't provide any sympathy...

Agreed. OTOH, if the mechanism implicitly enforces a particular policy (as an unavoidable consequence of its design -- like my write()), then all you (I) can do is ensure the user is aware of this by providing good documentation.

"You can lead a programmer to documentation, but you can't make him

*think*" -- ha! quite a good play on words (horse/water/*drink*), if I may say so! :>

Yup. Treat the programmer as if he needed his hand held... "Oooh! Don't do that! You might hurt yourself!!"

Reply to
Don Y

Yeah ... I like to make my posts stand alone as much as possible so people don't have to search back through the thread to figure out what's going on (I hate doing that!). Unfortunately it does make the posts longer as the discussion becomes more complicated.

I'm using the term "DMA" generically to refer to accesses that are not being generated programmatically from the CPU.

You need the moral equivalent of DMA in order for the MMU to perform hardware aided searches through memory tables *while* simultaneously performing its memory protection functions for CPU generated accesses. Obviously the MMU could perform this itself using free memory cycles[*], but the effect would be the same as if DMA were being used.

[*] that is, if there are any free cycles ... modern CPUs are so fast relative to DRAM that a system can become completely memory bound with just a few compute bound threads. Very few systems have enough memory bandwidth even to keep the CPU running 100%, never mind trying to do I/O at the same time. [Back in the days of 100MHz SDRAM I worked on a system that needed to burst 32MB within ~6ms (~5.1GB/s) and be able to repeat that 100 times per second. IIRC we had to use 4 4-way banked memories to meet the data rate requirements.]

Modern CPUs have hundreds of rename registers to mitigate the costs of context change saves/restores - the saves may be deferred or may not happen at all depending on what else is going on.

I can see the theoretical utility in making each program object an individual hardware protected entity, but I have never encountered a system that did so and I have no feel for what problems it might cause or how severely performance might be impacted.

I have a fair amount of experience with implementing compiler based programmatic object protection, so I can speak to that.

That I understand ... I have quite a bit of experience with stream processing/filtering applications ... I just can't speak to the hardware protection issue.

Unfortunately there is no easy way to model it either. Windows or Linux can approximate what you want to do among *processes* using memory mapped files, but their VMM APIs weren't designed for rapidly changing mappings and there isn't any way to achieve protection among threads in a single process.
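E.g., a sketch of the producer side using POSIX shared memory (the name "/iobuf" and the single-page size are arbitrary):

  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define PAGESIZE 4096

  /* producer: fill one page, then "hand it off" by dropping the mapping */
  void produce(void)
  {
      int   fd = shm_open("/iobuf", O_CREAT | O_RDWR, 0600);
      char *buf;

      ftruncate(fd, PAGESIZE);
      buf = mmap(NULL, PAGESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

      /* ... fill buf ... */

      munmap(buf, PAGESIZE);   /* emulates forfeiting the buffer at write() */
      close(fd);
  }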

I don't know if you have ever used Windows's IStorages/IStreams (they are available on Linux by installing Mono). I have used them in applications for the type of hand-off assembly line processing you are talking about - they act like files and facilitate dealing with arbitrarily sized data.

They don't have hardware protection within a single process, but there is a mode which provides programmatic protection in which objects are bound to the thread that creates them and, by default, are inaccessible to other threads in the process. In this mode you must use a marshalling API to pass pointers to your private objects to another thread or process.

[There is also a mode where objects are shareable among threads by default and, within the process, you can just pass the pointers around directly. I've mostly just used shared mode and hand-off messages - I never had a reason other than curiosity to try the protected mode.]

Looking at (and maybe playing with) these a bit might give you a feel for what functions your API might need to provide and how it might feel to use it in practice.

Not necessarily "reserved" - that has other implications (see below) - but I am presuming that IoBuffer is a data type known to the compiler (like a Pascal compiler knows what "file of ..." means) and implemented by the runtime library using segmentation. Also I'm presuming that {Write|Read}Buffer exist as specializations of IoBuffer.

In this example I am assuming that "in" and "out" are keywords which result in modification of the segment protections.

"Reserved" generally implies that a given word isn't available for any other purpose ... such as naming a variable or function. Reserving words is a convenience for the compiler developer, but so long as the usage can be determined, there really is no need to do it. For example, in PL/1 it was legal to write things like "if if then then else else" where the second occurrence of each could denote either a variable or a (zero argument) function to be called. Lisp also has a long history of name abuse: e.g., "(list list)" is a perfectly legal call of the built-in function "list" with an argument also named "list".

This really isn't that hard to do ... most languages require the compiler to maintain multiple separate name spaces for functions, global variables, local variables, structure/object fields, etc. Multiple name spaces really only affect parsing ... once the initial IR is constructed the names of things become irrelevant (other than for providing error messages).

George

Reply to
George Neuner

Understood.

Yes, but keeping an extra set (or five) of "registers" is different than keeping an extra set of segment descriptor tables... :>

My point was that the cost of the operation is just too hard to internalize so that you can get a feel for what makes sense to do in a particular application, etc. (this was the point of the FPU analogy as it is relatively easy to gauge FP usage of a set of tasks)

Agreed. It's too "foreign" to be able to relate to, with any degree of confidence.

I've been modeling it "by discipline" in the implementation of my audio clients. I.e., "pretending" that I can't access a "buffer" after I have write()-d it (no MMU in the design so I can't see for sure what the consequences there would be). I am trying to get a feel for what hidden costs (in terms of altering *how* I "do things") there might be with this sort of approach (before biting the bullet to implement it "for real" in the server).

I don't do Windows (or Linux) :>

I suspect that carries a fair bit of overhead.

This is the approach I am taking with the audio clients' "emulation" of the write() behavior -- pass pointers to buffers AS IF they had actually been moved between address spaces; the only real "trick" being the fact that "thread A" may have allocated the block/buffer while "thread F" eventually free()'s it.
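In code, the discipline amounts to this (struct queue and queue_put()/queue_get() are hypothetical IPC primitives):

  #include <stdlib.h>

  #define BUFSIZE 4096

  /* stage A: allocate, fill, hand off -- and never touch it again */
  void stage_a(struct queue *stage_b_in)
  {
      char *buf = malloc(BUFSIZE);
      /* ... fill buf ... */
      queue_put(stage_b_in, buf);
      buf = NULL;               /* discipline: we no longer "own" it */
  }

  /* stage F: the final consumer free()'s what A malloc()-ed */
  void stage_f(struct queue *stage_f_in)
  {
      char *buf = queue_get(stage_f_in);
      /* ... consume buf ... */
      free(buf);
  }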

I.e., *if* I screw up and reference a buffer that I no longer "own", the hardware won't tell me... instead, things will crash :>

Here, I am trying for a truly minimalist implementation. The benefits of protection domains is a big win (so I want to try to maintain that) but much of the flexibility you would *want* in this sort of an interface I am willing to discard (e.g., force buffers to be multiples of page sizes, page aligned, etc.).

OK. Let's call it "special" -- whether the programmer has defined it as such (elsewhere) or the language does... I.e., {W,R,I}Buffer don't have the same semantics as {Write,Read,IO}Buffer.

Reply to
Don Y
