non load/store architecture?

and Q were different and Q > P)

My favorite saying was: NOP took 160 microseconds -- other instructions took longer. ;-)

--
Thad
Reply to
Thad Smith

Please feel free to try our remote-control i8096 ICE at

formatting link

You can view the equipment in a video window, and you'll have session windows for the i8096 serial port and the ICE console. You can paste C code into an entry window, compile, and get Motorola S-records to paste into the console window for execution on the mcu. (I may add automatic upload over the aux serial port soon).

Remote control is limited to one user at a time; reset buttons will revoke control from a user or an idle connection for your convenience (please be gentle). In the future I hope to add a user queue to avoid access contention.

I will add links to a collection of sources and welcome your submissions. When time permits, I will add a more interesting target, perhaps a robot arm for telerobotics experiments. For now, there is a single LED on PORT 2, BIT 5 to flash and the serial port for I/O.
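
If you want something to paste in to get started, here is a minimal C sketch that should flash that LED by toggling port 2, bit 5 in a busy-wait loop. It is untested, and the SFR address and delay count are assumptions - check the 8096 datasheets linked below and your compiler's own SFR declarations before using it.

    /* Minimal LED-flash sketch for the i8096 target (untested).
     * IOPORT2_ADDR is an assumption - take the real SFR address from
     * the datasheet, or use your compiler's built-in SFR names. */
    #define IOPORT2_ADDR  0x10u                /* assumed address of I/O port 2 */
    #define LED_BIT       (1u << 5)            /* PORT 2, BIT 5 drives the LED  */

    #define IOPORT2 (*(volatile unsigned char *)IOPORT2_ADDR)

    static void delay(volatile unsigned int n)
    {
        while (n--)                            /* crude busy-wait; tune n */
            ;
    }

    void main(void)
    {
        for (;;) {
            IOPORT2 ^= LED_BIT;                /* toggle the LED */
            delay(20000u);
        }
    }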

I am still searching for proper documentation for the HMI-200; for now I have included a link to a Windows help file for a source-level debugger which includes many references to setup and operation of the emulator. There are also links to PDF datasheets, user's manuals and appnotes.

Regards,

Michael Grigoni Cybertheque Museum

Reply to
msg

Of all the computers I have used, the IBM 1620 was the only pure memory-only machine with no programming concept of registers. There have been some near misses: processors that put their registers in memory, and some, like the Z8 and more recently the RS08 that you referred to, that moved many of the accumulator operations to memory-to-memory operations, significantly improving the execution bandwidth by doing so.

The IBM 1620 was additive and remarkably nice to program. Conceptually it had a lot going for it. Numbers were stored as variable length fields. Adding two 40 digit numbers took the same code as adding two 5 digit numbers. Two or three years ago we looked at the IBM 1620's instruction set as a model for an embedded processor memory only instruction set. The extra address field in most instructions killed the advantages of memory to memory operations. I suspect that quite a few people have gone through the same exercise.
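
To give a rough flavour of why the field length doesn't change the code, here is a small C sketch of field-terminated decimal addition in the 1620 style: the operand length is carried by a flag on the high-order digit, not by the instruction. The representation and names below are mine, not the 1620's actual encoding.

    /* Field-terminated decimal add, loosely modelled on the IBM 1620.
     * d[0] is the low-order digit; FLAG marks the high-order digit. */
    #include <stdio.h>

    #define FLAG 0x10              /* flag bit marking the end of a field */

    /* dst += src.  Both fields are arrays of digits 0-9, with FLAG OR'd
     * into the last (high-order) digit.  dst must be at least as long as
     * src.  Returns the carry out of the high-order digit. */
    static int field_add(unsigned char *dst, const unsigned char *src)
    {
        int carry = 0, more_src = 1;
        for (;;) {
            int s = more_src ? (*src & 0x0F) : 0;
            int sum = (*dst & 0x0F) + s + carry;
            carry = sum / 10;
            *dst = (unsigned char)((*dst & FLAG) | (sum % 10));
            if (more_src) {
                if (*src & FLAG) more_src = 0;
                else src++;
            }
            if (*dst & FLAG) break;          /* high-order digit of dst reached */
            dst++;
        }
        return carry;
    }

    int main(void)
    {
        /* 1234 + 987 = 2221: low-order digit first, FLAG on the last digit */
        unsigned char a[] = { 4, 3, 2, FLAG | 1 };
        unsigned char b[] = { 7, 8, FLAG | 9 };
        field_add(a, b);
        for (int i = 3; i >= 0; i--)
            printf("%d", a[i] & 0x0F);
        printf("\n");                        /* prints 2221 */
        return 0;
    }

The same field_add handles 5-digit and 40-digit operands unchanged; only the data differs.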

w..

Reply to
Walter Banks

x86 is not a bad architecture because of being a non-load/store architecture, or because it is CISC. There is nothing inherently bad about being CISC, and nothing inherently good about RISC. They give different scope for different sorts of implementation, but it is possible to make a bad RISC ISA or a good CISC ISA (the 68k/ColdFire is an example of a very nice CISC ISA). The x86 was widely held to be a poor and limited design when the 8086 first came out, and modern x86 chips have a terrible architecture (but with some very nice implementations) as a result of incremental steps keeping backwards compatibility with such a bad starting point.

Reply to
David Brown

Does (will?) the RS08 allow a frame shift on the short and tiny opcodes?

before my time :) - but yes, if you do fully memory-to-memory opcodes, that has a size cost. Register-register means you thrash load/store, so the best seems to be variable-width opcodes, and support for some frame offset so the shortest opcodes (some call them register, some call them short/tiny) can work into windows of memory. Yes, that needs to be stacked on INT, but that's not a big cost.

The 80c51 has 2 bits of frame offset in its register handling; the Z8/C166 have more. (A quick sketch of the 80c51 mapping follows below.)

It does shift some of the hard work onto compiler writers :)

- but PCs these days have plenty of resource for this work.
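
For the 80c51 case: those two bits are RS1:RS0 in the PSW, selecting one of four banks of R0-R7 in the first 32 bytes of internal RAM. A trivial sketch of the mapping, in plain C rather than any compiler-specific SFR syntax:

    /* How the 80c51's two "frame offset" bits map register names to RAM:
     * PSW bits RS1:RS0 select one of four banks of R0-R7, which live in
     * the first 32 bytes of internal RAM. */
    #include <stdio.h>

    static unsigned reg_address(unsigned bank, unsigned n)  /* bank 0-3, Rn 0-7 */
    {
        return bank * 8u + n;        /* R0-R7 occupy bank*8 .. bank*8+7 */
    }

    int main(void)
    {
        printf("bank 2, R3 -> internal RAM address 0x%02X\n",
               reg_address(2, 3));   /* prints 0x13 */
        return 0;
    }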

On the history front, I recall one company, decades ago now, that had a very large (for the time) 6805 code base, and IIRC a full build in assembler took 45 minutes! Microcontroller speeds have advanced much less than tool speeds.

-jg

Reply to
Jim Granville

I have the bias of a performance jock. I define "good" as "optimizable." Yes, RISC is inherently better than CISC for optimization; that is the point of RISC. I also think it is easier to write optimizing compilers for RISC than for VLIW, judging by the industry's experience with the Itanium. But at least VLIW attempts to regiment instruction scheduling. That's what you have to do to get performance. The CISC "whatever length, whatever latency" stuff does not cut it.

Define "nice." Maybe you think it's nice because it's low power or easy to program or something.

Yep, Worse Is Better.

Cheers, Brandon Van Every

Reply to
Brandon J. Van Every

The programming environment (including compilers and hardware) should help the programmer express the _application_ problem in a convenient way and not bother too much about the underlying hardware, especially if the application is to be portable between various platforms.

I assume that you are referring to the alias problem, i.e. a function having multiple pointer parameters, all of which could point to the same physical memory area in the caller's context?

In many cases different functionality is divided into subroutines simply for manageability and readability reasons, and the code is actually called from only one place, so it can be inlined without any loss of memory usage or performance. In many cases inlining also reveals what the various parameters are pointing at and can indicate whether there is an alias problem.
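
To make the alias problem concrete, here is a small C sketch (the names are just illustrative): without restrict the compiler has to assume the two pointers may overlap, so the running total must be written back through memory on every iteration; restrict, or inlining into a caller where the objects are visibly distinct, is what lets it stay in a register.

    /* Without 'restrict' the compiler must assume 'sum' may point into 'a',
     * so *sum has to be stored (and a[i] re-read) on every iteration. */
    void accumulate(const int *a, int n, int *sum)
    {
        for (int i = 0; i < n; i++)
            *sum += a[i];            /* a[i] could change if *sum aliases a[] */
    }

    /* With 'restrict' (or after inlining into a caller where the arrays are
     * visibly distinct) the total can be kept in a register and stored once. */
    void accumulate_no_alias(const int *restrict a, int n, int *restrict sum)
    {
        int total = *sum;
        for (int i = 0; i < n; i++)
            total += a[i];
        *sum = total;
    }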

With global optimisation across the whole program (helped by version control systems), most alias problems could be spotted globally, and alias-aware code generated only when needed.

Load everything, process everything in temporary store (registers), store all results is basically a caching problem, so the alias problem could also be solved with a suitable cache hardware. Then it is more a question of semantics if you call something a register or cache line:-)

They don't want to be, and I don't see why they should be. RISC is just one fad among other fads in computing.

Paul

Reply to
Paul Keinanen
[Snipped]

I've got some C code written 15 years ago which still needs the occasional minor update. On the original 25MHz 386 used for development a full build took approximately 1h30. The last time I re-built everything on an AMD XP2600 machine, it took just over 3 minutes.

Regards Anton Erasmus

Reply to
Anton Erasmus

Brandon J. Van Every wrote:

"Good" for an ISA can mean many things. From a low-level programmer's viewpoint, a "good" ISA is easy to work with at the assembly level, and it's (relatively :-) easy to make an optimising C compiler. It should provide whatever OS support functions (MMU, traps, etc.) are appropriate for the size of the cpu and it's applications. From the hardware viewpoint, it should be possible to make implementations that are small, low-power, give high instructions-per-clock, low branch overhead, small code size, etc.

If pure performance is your only requirement, then that means good compiler support, high IPC, and low latencies. Since you brought up the Itanium (of which I only know a little), we can compare a few architectures ranging from pure CISC (x86), through half-way (ColdFire - it's technically CISC, but has many RISC features) and pure RISC (PPC), to VLIW (Itanium).

From the compiler writer's viewpoint, the sweet-spot is probably the ColdFire, possibly the PPC. Lots of registers and an orthogonal instruction set are important - both have these. The ColdFire wins out because lots of common sequences that require two or more RISC instructions can be handled as one (function preludes and cleanups are much simpler with the 68k, and extremely common "load data, use data" sequences are shorter and faster on the 68k, and don't require an extra register). The x86 is horrible for compilers, requiring all sorts of tricks to make good code (although modern versions are better). Instruction prefixes must be a nightmare. And VLIW requires incredible compiler fortune-telling abilities in order to get good instruction sequencing.

For the hardware implementation, a well-constructed CISC ISA is easy on a small system with consistent, fast memory (such as a microcontroller). Of course, the x86 is by no means well-constructed, with its prefix codes. And for faster processes, it involves all sorts of complex register renaming schemes to allow pipelined and superscalar execution. Instruction decoding logic is large and slow, and pipelines are often long, complex, and inconsistent across instruction types (leading to long delays on mispredicted branches). The lack of registers means much more memory IO, which causes stalls and requires complex scheduling. RISC is much easier in this way - the instruction coding is far simpler, and there is much greater similarity across instruction types. Because you have many more registers, there are fewer bottleneck registers, and thus much less need for register renaming and other tricks. The optional condition code updates of the PPC also help earlier branch prediction. The ColdFire lies somewhat in between - its decode logic is harder than a pure RISC cpu would need, but far simpler than an x86. You need some anonymous registers to deal with direct memory operands, but not many, as much of the code is RISC-style register-register. The VLIW cpu can be made with very high ipc, but only for ideal code - it can't reschedule instructions dynamically, and is thus only fast for processing large loops (assuming that no data outside the L1 cache is read or written during the loop).

Where does that leave us? A pure RISC architecture is best when aiming for maximal ipc, and can be run at higher clock speeds as each step is simpler. But the ColdFire code is more compact, leading to lower bandwidth requirements on the instruction bus, and it does more per instruction, giving better performance for the same ipc. VLIW is a failed concept in most cases (it can be useful in DSP's, and some scientific programs, but not for general use), and it is not even worth considering making a fast core that executes x86 instructions directly - modern cores translate the x86 code into an internal RISC code. (I believe the high-speed ColdFire cores do that too to some extent - most instructions are RISCy enough to implement directly, but some are broken into a few RISC codes in the decoder.)

Being low power is a good indicator of an efficient ISA - the x86 is not low power, and neither are most fast RISC cores as they need high clock speeds. Remember, what's important for a processor is the work done per clock, not just instructions per clock, and in comparison to a pure RISC architecture, the ColdFire sacrifices a little ipc for a lot more work done per instruction.

But the most obvious "nice" feature of the ColdFire ISA can be seen by looking at assembly listings. Writing nearly optimal code for it is easy, and (equally important) it is easy to understand generated optimal code. There is no need for x86-style abuse of addressing modes to write good code, and there is no need to have a 600 page manual on-hand to interpret the nuances of PPC instruction codes. This makes it nice for the programmer, nice for the compiler writer, and nice for the hardware implementer.

Additionally, this being c.a.e., ColdFire is eminently suitable for embedded systems. Its support for and handling of interrupts and exceptions, in particular, is excellent (I know there are other RISC cpus, like the ARM, that are also good), and compact code is a clear benefit.

Modern x86 implementations are like turds polished until they glow in the dark.

Reply to
David Brown

That's surprisingly poor. Just a x30 speed-up, i.e. like a 750 MHz 386 with correspondingly faster memory and disk. The 386 had no internal cache and ran at a minimum of two clocks per instruction. Caches, particularly the L1 Icache, combined with several instructions per clock, should outperform that. Perhaps the problem is memory speed or I/O. Even 15 years ago a 25MHz 386 was a slow beast - are you sure it wasn't a 486 (typically 2x - 2.5x a 386)?

Peter

Reply to
Peter Dickerson

Having lots of registers isn't necessarily better. In the end, it only matters how quickly you can access the data. As long as you aggressively cache the local data, instructions with memory operands can be just as efficient as those with register access. An efficient CPU can implement a set of 'virtual registers' that map to part of the local stack frame, allowing multi-way access to multiple memory locations. This also allows improvements to be made, like extending the size of this virtual register map, without changing the ISA.

A RISC CPU with many registers needs to find a balance between the number of registers and the overhead of encoding them in the instruction word (with 32 registers, for example, each register field costs 5 bits, so a three-operand instruction spends 15 of its 32 bits just naming registers). This means that for many small functions, where you only need a few local variables, some bits in the instruction word are wasted. On the other hand, when you do run out of registers on a RISC machine, you'll waste space on extra load/store instructions.

Given the bottleneck of the external memory bus, you can get the highest performance by having a densely coded, complicated instruction set, with short instructions for the most common operations on local operands, and longer instructions for less common ones/far away operands. Complicated instructions and non-orthogonal designs may not be very appealing for us humans, but as long as we have the technology to create a compiler that can use them, and an instruction decoder that can decode them, it does allow higher performance than something that is constrained by requirements to have it orthogonal and "clean".

Reply to
Arlet

Sure, "could." In the real world, aren't.

I'm not a CPU designer, but it sounds like you're just saying you'll make the die more expensive. Or give up some other functionality, like an FPU. RISC is also about cost.

They should be if they care about optimization. Most people don't.

Cheers, Brandon Van Every

Reply to
Brandon J. Van Every

Working on OpenGL device driver optimization for the DEC Alpha, I never saw instruction cache misses. Only data cache. Performance code is in small loops, not huge hulking one-shots. I say "more compact code improves performance" is theory, and not observable in practice. More compact data, on the other hand, matters a great deal.

The units of work I've always cared about are FPU adds, multiplies, and divides. There isn't more arithmetic work to do per instruction. You could do more load/store work, but assuming you hit your primary data cache, that's not your bottleneck anyways. The arithmetic is. As I said above, instruction cache bloat doesn't matter in tightly looping code. Or, I'd wager, in loosely looping code either. Instruction caches are pretty big compared to the looping code. If all your code is one-shot then you've got completely different system caching issues, nothing to do with the CPU.

Now I suppose if you design CPUs with almost no cache, you might care about instructions being small. But then, you're not designing a performance CPU anyways. So who's gonna care about the performance? "Good" won't mean optimization, it'll mean low power or cheap to manufacture or something.

Cheers, Brandon Van Every

Reply to
Brandon J. Van Every

Having a particularly fast cache for the top stack area would be a good idea for fast cpus with few registers. Unfortunately, it's not as easy as it sounds (which is why it is not done even on high-end x86 chips). The big problem is that the area that needs the fast-as-possible cache varies too much, so that swapping in and out of L1 cache would take too long. Write-back buffers and read-buffers make up a sort of L0 cache, which helps. Remember that for high clock speed devices, register access is single-cycle, while L1 cache access may be up to a dozen clock cycles (on a cache hit).

Some cpus, like the SPARC, have a system that can be considered either a huge register set, or a very fast stack. At any given time, only a certain number (typically 32) of the registers are directly accessible. During a function call, the visible register window gets shuffled along, giving a new set of visible registers. This makes function calls far cheaper than a more standard register set.
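
A toy software model of that overlap (made-up sizes, and the real window overflow/underflow traps are left out) just to show how a call hands over arguments without touching memory:

    /* Toy model of overlapping register windows, SPARC-style.  Real
     * hardware traps to spill/fill windows when the file is exhausted;
     * that part is omitted here, and the sizes are arbitrary. */
    #include <stdio.h>

    #define NWINDOWS     8
    #define WINDOW_REGS 16               /* 8 locals + 8 outs per window */

    static long regfile[NWINDOWS * WINDOW_REGS];   /* physical register file */
    static int  cwp = 0;                           /* current window pointer */

    /* The ins of the current window are the outs of the caller's window. */
    static long *ins(void)  { return &regfile[((cwp + NWINDOWS - 1) % NWINDOWS) * WINDOW_REGS + 8]; }
    static long *outs(void) { return &regfile[cwp * WINDOW_REGS + 8]; }

    static void save(void)    { cwp = (cwp + 1) % NWINDOWS; }              /* on call   */
    static void restore(void) { cwp = (cwp + NWINDOWS - 1) % NWINDOWS; }   /* on return */

    int main(void)
    {
        outs()[0] = 42;              /* caller puts an argument in its %o0   */
        save();                      /* "call": slide the window along       */
        printf("callee sees %%i0 = %ld\n", ins()[0]);   /* 42, no memory copy */
        restore();
        return 0;
    }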

As for making virtual registers, that is to some extent what modern x86 cpus do when translating from x86 code to internal RISC code, as part of the smart implementation of a stupid ISA.

A disadvantage of having many registers, which would also apply to a fast-cached stack, is the overhead on context switches. So having too many registers is not a good idea. The appropriate number depends on the type of application and the type of processor architecture. If you have a pure load/store RISC architecture, then you need lots of registers as all memory access has to go through them - 32 registers is a typical number. If you also have direct memory addressing modes, then you don't need as many, as single-use memory data does not have to use a register. Thus 16 registers is a good number for the ColdFire, and was also chosen for amd64. CPUs with 64 or even 128 registers can be useful, but the context switch overhead is very big (for the Itanium, that's 1 KB of register data), and it takes a very advanced compiler and particular application types to actually make use of the whole register set.

True.

I agree. Have you looked at the ColdFire architecture? It is very much what you are suggesting here, except that there are only a couple of points at which orthogonality is sacrificed. The architecture has 8 data registers and 8 address registers. There is no difference between any of the data registers, and the only special address register is that A7 is the stack pointer used for calls and returns. Common instructions, especially register to register modes, are a single 16-bit opcode (and single-cycle execution), while extension words are used to hold immediate data, addresses, etc.

For a smaller (16-bit) example with a similar mix of RISC and CISC ideas, see the msp430.

Reply to
David Brown

In current commercial reality, compilers are not infinitely smart, nor do they have infinite development resources allocated to make them so. Especially when a new CPU comes out every 2 years. So, programmers inadvertently specify things that compilers and CPU architectures cannot handle optimally. This is what gives hand ASM coders a job. Somebody who doesn't care writes the 1st cut of code. If it turns out someone should care, an optimization jock comes in and fixes things up. If he can. If the original careless programmer, knowing nothing of underlying HW operations or limitations, didn't paint the software into a corner.

Most software doesn't have to perform, so most programmers aren't aware of the performance implications of the code they write. At the extreme, you get the kind of "execute once in a blue moon" bloat that's typical of Microsoft products. With such code, programmers don't just blow off the underlying HW, they blow off any algorithmic design principle that could be remotely called efficient.

Cheers, Brandon Van Every

Reply to
Brandon J. Van Every

The compiler is an old DOS compiler that could use expanded/extended memory. The use of extra memory speeded up compiles by a factor of approximately 2. On the original machine all object files were written to a RAM disk, which improved the linking speed by a factor of 5 to 10. On the modern machine everything was kept on the hard disk, and the use of extended/expanded memory was disabled.

I have not tried to compile this code on a modern machine actually running DOS. It was compiled under Windows 2000 in a command prompt. How much overhead this has I do not know.

Also having thought about it again after your comments, the 3 minute time was for a Duron 600 machine. The XP2600 time was in the region of 45 seconds if I recall correctly.

Regards Anton Erasmus

Reply to
Anton Erasmus

The advantages of compact code will depend a lot on the rest of the device, such as the types, sizes, and speeds of the cache(s) and databuses. Also relevant to c.a.e., though not a speed issue, is that the size of the code has a direct bearing on the cost for typical embedded systems (i.e., flash code store).

For the type of code you are talking about, that's true. As always, there are no correct answers as it all depends on the application. In particular, if you are doing heavy FP work then the FP units are likely to be the bottleneck, and individual instructions shuffling data around or doing simple arithmetic (the most common type of instruction in most code) don't matter much. But in a lot of code it does matter. Take the simple C code "x = 123456;", where "x" is a 32-bit global variable. On the ColdFire, this compiles to:

move.l #123456,%d0
move.l %d0,x

Two instructions, each 6 bytes long, each executing in 1 clock (plus a write access to memory).

On the PPC, this compiles to:

lis %r0,0x1
ori %r0,%r0,57920
lis %r9,x@ha
stw %r0,x@l(%r9)

(See what I mean about ColdFire code being nice and clear?)

That's four instructions, each 4 bytes, each executing in 1 clock (plus a write access to memory).

The ColdFire generates more compact code, running at twice the speed for the same clock. That's what I mean by greater work done per clock.

We are clearly coming at this from different experiences, if OpenGL drivers on an Alpha are typical for your programming, while I work mostly with smaller processors (the ColdFire I am using at the moment has no cache - all its flash and sram are internal, with single cycle access). But performance is very important to small systems - high performance means you can use slower clock speeds, leading to lower power, lower EMI, and cheaper components. It might not be the most important factor, but it is still there.

Even on cached processors, small code means better use of the cache. Critical loops will (should!) fit within even a small instruction cache, but programs consist of more than their critical loops. A complete instruction cache miss might mean a stall of a hundred or more clock cycles (which might be worth twice that in instruction counts on a superscalar processor) - there is a reason why more expensive processors have larger instruction caches. More compact code gives the same benefits of a larger cache.

Reply to
David Brown

For example, take a graphical application, like a web browser, waiting for the next network packet that comes in for an image it has requested. As the packet is received, we basically get a one-shot code execution of the interrupt handler, network stack, firewall, application, GUI libraries, and graphics driver. All added up, that's quite a big chunk of code, with poor locality, and very little looping.

Reply to
Arlet

True.

formatting link

So increase the number of caches from 1 to ? and what logical divisions are possible?

But a single register shuffle up down one may be more effective.

Virtual registers are only really useful for out-of-order execution.

Workspace pointers were the solution to this, but as speeds have risen, caching of the workspace has been the problem/solution/not yet implemented?

orthogonal = easy decode circuit and less pipelining needed.

I did like the 68k; I understand that ColdFire has some of the more wasteful instructions removed - is this true?

or for a minimal core

formatting link

taking note of the fact that +/-1 cache locality costs less than +/-n, and many other factors. it turns out a two perimeter chip 44-pin facet has quite a bit of space left for putting in cache ;) and a full 16 bits for opcode expansion to a 32 bit architecture.

Reply to
jacko

... snip ...

25 years ago my PascalP ran on a machine independent pcode interpreter, which was a stack machine (with no registers). This resulted in very compact code and extreme portability, but had a speed penalty. I also developed a native code generator, driven from the same compiler, for the 8080. This was highly register limited, but the generator postponed actual pushes until needed. It did this by keeping track of which registers (out of three) held the top three stack items. It also kept track of constants in registers, so that they could be either used directly, or the registers incremented (or decremented) to form the appropriate constant. These simple optimizations resulted in quite compact and efficient code. IIRC a minimal program compiled to about 2KB of machine code after library linking, and 100 odd bytes of pcode. The linker loaded nothing unneeded.
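
For anyone who hasn't written that sort of generator, here is a C sketch of the shape of the technique - the names and structure are illustrative only, not PascalP's actual code generator. The virtual expression stack is modelled in the compiler; real PUSHes are only emitted when the three register pairs are exhausted.

    /* Illustrative "postponed push" code generation: the top of the
     * expression stack lives in registers (H, D, B standing for the
     * 8080 pairs HL, DE, BC) and is only spilled with a real PUSH
     * when a fourth value arrives. */
    #include <stdio.h>

    #define NREGS 3
    static const char *regname[NREGS] = { "H", "D", "B" };
    static int top = 0;                 /* stack items currently held in registers */

    static void vpush(int value)        /* push a constant onto the virtual stack */
    {
        if (top == NREGS) {             /* all pairs busy: spill the deepest item */
            printf("    PUSH %s\n", regname[0]);
            printf("    ; shuffle pairs: %s<-%s, %s<-%s\n",
                   regname[0], regname[1], regname[1], regname[2]);
            top--;
        }
        printf("    LXI  %s,%d\n", regname[top], value);
        top++;
    }

    static void vadd(void)              /* add the two topmost virtual-stack items */
    {
        printf("    ; add %s into %s\n", regname[top - 1], regname[top - 2]);
        top--;
    }

    int main(void)
    {
        vpush(1); vpush(2); vadd();     /* (1 + 2): no memory traffic at all        */
        vpush(3); vpush(4); vadd();     /* still fits in the three pairs            */
        vadd();                         /* deeper expressions would hit the PUSH path */
        return 0;
    }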

The system complied with the ISO standard with a very few exceptions (such as gotos out of procedures). The compliance details are available at:

--
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.
Reply to
CBFalconer
