ARM/Atmel buses, architecture - Naimi

I don't understand points 2, 3 and 4. In the Harvard architecture, we separate code and data flow. E.g.: LDI R16, 0x0F

0x0F is pulled from flash (using the data bus - operand) at the same time as the LDI (via the code bus - opcode). The register R16 is a GPR, so it's in the CPU.

What the heck is 2 for: 'a set of address buses for accessing the data'? And the others: 'a set of address buses to access the opcodes'?

Why do we have: 'In RISC processors, there are four sets of buses'?

RISC processors have separate buses for data and code. In all the x86 processors, like all other CISC computers, there is one set of buses for the address (e.g., A0–A24 in the 80286) and another set of buses for data (e.g., D0–D15 in the 80286) carrying opcodes and operands in and out of the CPU.

To access any section of memory, regardless of whether it contains code or data operands, the same address bus and data bus are used.

In RISC processors, there are four sets of buses: (1) a set of data buses for carrying data (operands) in and out of the CPU, (2) a set of address buses for accessing the data, (3) a set of buses to carry the opcodes, and (4) a set of address buses to access the opcodes. The use of separate buses for code and data operands is commonly referred to as Harvard architecture. We examined the Harvard architecture of the AVR in the previous section.

Veek. M

You are confusing internal busses with external busses. Yes, most CPUs have one external memory interface for both data and code, but internally they can have any number of busses desired, since there are many registers for data and/or operands in the code.

--

Rick C
rickman

The book is explaining things very badly - it is mixing up logical and physical buses, internal and external buses, and address spaces. It is also making terrible generalisations in its distinctions between RISC and CISC, and appears to have a view of "Harvard vs. von Neumann" from the 1970's.

It is much more common to talk of a "data bus" as including the address lines and the data lines needed for reading and writing data, and an "instruction bus" as including the address lines and data lines needed for reading instructions and opcodes. On some systems, such as the AVR, these are physically separate and access separate memory blocks. On most modern cpus, they are physically separate from the cpu core, and may be connected to independent L0 caches or buffers, but are later combined in a multiplexer, crosspoint or bus controller, perhaps along with a unified cache, before connecting to a single memory space. And on smaller or older cpus, the pathways may never be independent - there is just one bus coming out of the core.
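To make the LDI example from the original question concrete: on the AVR the 8-bit constant is encoded inside the 16-bit instruction word itself (bit pattern 1110 KKKK dddd KKKK, with Rd restricted to R16..R31), so it arrives with the instruction fetch from program memory and no data-memory access is involved. A minimal Python sketch of that encoding; the function names are made up for illustration:

```python
def encode_ldi(rd, k):
    """Encode AVR 'LDI Rd, K' as a 16-bit word: 1110 KKKK dddd KKKK."""
    if not 16 <= rd <= 31:
        raise ValueError("LDI only works on R16..R31")
    if not 0 <= k <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    d = rd - 16                      # 4-bit register field
    return 0xE000 | ((k & 0xF0) << 4) | (d << 4) | (k & 0x0F)


def decode_ldi(word):
    """Recover (rd, k) from an LDI instruction word."""
    if word & 0xF000 != 0xE000:
        raise ValueError("not an LDI instruction")
    k = ((word >> 4) & 0xF0) | (word & 0x0F)   # high and low nibble of K
    rd = 16 + ((word >> 4) & 0x0F)
    return rd, k


word = encode_ldi(16, 0x0F)          # LDI R16, 0x0F
print(hex(word), decode_ldi(word))
```

The operand and the opcode are one and the same fetch here; the separate data bus only comes into play for instructions such as LD/ST that touch SRAM.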

David Brown

To expand (I hope) on what David is saying:

A "pure" (or maybe "old-time") Harvard architecture is one where instructions and data are _entirely_ separate -- see, for instance, the ADSP-21xx series of processors, where the program memory space is not only separate from the data memory space, but it's a different width (24 bits as opposed to the data memory width of 16 bits). It is, in fact, awkward to read instruction memory, and IIRC you can't access the upper 8 bits.

Ditto for Von Neumann architecture, e.g. the MC68HC11, which had one unified memory space out of which everything came (and no problems reading anything at all).

Looking at the data I have for the ARM Cortex, at least the M4 core seems to expose multiple busses, one of which is for instructions and another of which is for data. In a processor designed to take advantage of this, you could have separate memory spaces that could be accessed simultaneously by code and data fetches, speeding things up. I've seen this sort of thing called a "Harvard architecture", but I'm not sure that I'm fully convinced that it's really the best nomenclature to use, because while it may be a Harvard architecture physically, logically it's Von Neumann.

--
Tim Wescott 
Wescott Design Services 
Tim Wescott

I haven't seen any references that would refer to this as "Harvard". The distinction of a Harvard architecture is at the ISA level. It really doesn't say much about the physical structure of the CPU. In the case of the M4, I expect the two buses are connected to I and D caches rather than directly to memory, but I haven't seen the diagram. I don't recall for sure that the M4 has I and D caches, but I believe they are at least optional.

--

Rick C
rickman

I looked at the ARM architecture spec, which didn't say much about it, and the ST document, which talks about three different busses.

Yes, I'm sure that they just go to the pertinent caches. I've seen that sort of thing called "Harvard" before: I don't know if either ARM or ST is committing that sin.

--
Tim Wescott 
Wescott Design Services 
Tim Wescott

Often it is referred to as a "modified Harvard architecture", which makes a bit more sense - though I think any use of the terms "Harvard" or "von Neumann" on modern cpus is confusing.

The problem is that different people use it for different levels - some for the ISA level (which is really the important difference, at least for users of the processor), and some for the physical buses.

They are indeed optional. Typically, slower and cheaper M4 devices have no cache, while bigger and faster ones have I and D caches (and often single-precision hardware floating point support). For M4 up to about 80 MHz and only internal ram, caches are not needed - a flash buffer with double-width flash banks keeps the instruction bus reasonably close to saturation, while internal ram is zero wait state for data. It is only with faster operations and external memory that you really make use of caches.

And yes, some people (and even some manufacturers) will tell you that such an arrangement is a Harvard architecture, despite the single unified address space which is the key characteristic of von Neumann. I'd hate to think what these folks would say to a core with more than two buses (some ARMs have extra buses to tightly-coupled memories, for example).
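The ISA-level distinction can be sketched as a toy model in Python (class and method names are purely illustrative, not taken from any real part): in the Harvard-style machine, address 0 on the instruction side and address 0 on the data side are different cells, while in the unified machine both ports see the same cell, however many physical buses sit in between.

```python
class HarvardMachine:
    """Two address spaces: a data store can never change an instruction."""

    def __init__(self, program, data):
        self.program_memory = list(program)
        self.data_memory = list(data)

    def fetch(self, addr):            # instruction side
        return self.program_memory[addr]

    def load(self, addr):             # data side
        return self.data_memory[addr]

    def store(self, addr, value):     # data side
        self.data_memory[addr] = value


class UnifiedMachine:
    """One address space, possibly reached through two separate buses."""

    def __init__(self, memory):
        self.memory = list(memory)

    def fetch(self, addr):            # instruction bus
        return self.memory[addr]

    def load(self, addr):             # data bus
        return self.memory[addr]

    def store(self, addr, value):     # data bus
        self.memory[addr] = value


h = HarvardMachine(program=["nop"], data=[0])
h.store(0, 42)
print(h.fetch(0), h.load(0))          # instruction untouched by the store

u = UnifiedMachine(["nop"])
u.store(0, 42)
print(u.fetch(0))                     # the store is visible to instruction fetch
```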

David Brown

The most commonly seen Harvard CPUs are the "modified" variant which allows code and data to be together in a common memory, but *cached* separately. [DSPs obviously have other ideas].

Yes. I used the ADSP-21k which was similar. It had 48 bit instructions and 16/32/40 bit data [16/32 bit integer, 32/40 bit FP]. However, on the 21k, code and data could be together in the same memory space - just having different alignment requirements. One of the DMA channels could do 16/32 bit to 48 bit packing/unpacking to facilitate loading code [e.g., so you could use 16 or 32 bit ROMs].
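As a rough illustration of what such packing does, here is a Python sketch that assembles 48-bit instruction words from a stream of 16-bit ROM words and splits them again. The most-significant-word-first order is an assumption made for the example, not something taken from the ADSP-21k documentation, and the function names are hypothetical.

```python
def pack48_from16(words):
    """Combine groups of three 16-bit words into 48-bit instruction words.

    Assumes most-significant word first (an illustrative choice).
    """
    if len(words) % 3 != 0:
        raise ValueError("need a multiple of three 16-bit words")
    packed = []
    for i in range(0, len(words), 3):
        hi, mid, lo = words[i:i + 3]
        for w in (hi, mid, lo):
            if not 0 <= w <= 0xFFFF:
                raise ValueError("word does not fit in 16 bits")
        packed.append((hi << 32) | (mid << 16) | lo)
    return packed


def unpack48_to16(instructions):
    """Split 48-bit instruction words back into 16-bit words."""
    words = []
    for ins in instructions:
        if not 0 <= ins < (1 << 48):
            raise ValueError("instruction does not fit in 48 bits")
        words.extend([(ins >> 32) & 0xFFFF, (ins >> 16) & 0xFFFF, ins & 0xFFFF])
    return words


rom = [0x1234, 0x5678, 0x9ABC]
print([hex(i) for i in pack48_from16(rom)])
```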

George

George Neuner

I guess my question would be, what is the point of drawing a distinction between Harvard, modified Harvard and von Neumann? Sure there are a few advantages to separating code and data cache, but I consider that to be an issue of cache design. I have never even given any thought to which of the three basic architectures a given CPU used.

--

Rick C
rickman

In the 90's, "Harvard architecture" was something of a buzzword for "way fast" -- this was back when the usual embedded processor had just one memory bus and no cache or a vestigial one. Generally a Harvard architecture processor was significantly faster by virtue of having two entirely separate memory systems.

To some extent having a pair of independent memory systems for instructions and data (which some of the ST ARM chips have) still does help speed things along, which is important if you're running the processor to the limit.
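A back-of-the-envelope model of why this helps, as a Python sketch. It assumes every bus moves one word per cycle with no wait states and that fetches and data accesses overlap perfectly when they have their own bus; real parts will land somewhere in between. The function names and the workload numbers are invented for the example.

```python
def cycles_single_bus(instruction_fetches, data_accesses):
    """One shared bus: every fetch and every data access takes its own cycle."""
    return instruction_fetches + data_accesses


def cycles_dual_bus(instruction_fetches, data_accesses):
    """Separate instruction and data buses with ideal overlap:
    the busier bus sets the cycle count."""
    return max(instruction_fetches, data_accesses)


# Hypothetical workload: 1000 instructions, 400 of which touch data memory.
shared = cycles_single_bus(1000, 400)
split = cycles_dual_bus(1000, 400)
print(shared, split, shared / split)
```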

--
Tim Wescott 
Control systems, embedded software and circuit design 
Tim Wescott

I guess none of us gives much thought to what an architecture is called, of course. Dwelling on which 1940s term to use for a post-2000 CPU seems somewhat out of place anyway. We just get to the details and use the parts as they are.

George separates the DSPs for a good reason. The one I know well is the TI 5420 (I have made stuff with it and wrote an assembler for it). It does have separate program and data buses, and memories, internally. Some of the data RAM was "dual access" per cycle; in fact it could do 3 data accesses and one program access per cycle (how else is it supposed to do a MAC per cycle anyway). No caches to speak of; it is a small device. The 2 cores consumed just about 300 mW IIRC, at a 100 MHz clock, which is not bad at all for a late-90s processor.

And I don't remember encountering the word "Harvard" while I was at it, not that it would not fit. Or may be I have but did not notice it.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

Dimiter_Popoff

The essence of the von Neumann architecture is the idea that the program is stored in memory, as opposed to being hard-wired. In this respect, Harvard is merely a variant of von Neumann.

However, classic von Neumann acknowledges that code itself is also data for other code, and thus it allows for the idea that the program can modify itself.

Self-modifying code is a serious issue for deep pipelines: the old version of a modified instruction may already have been fetched. Detecting and dealing with this is very expensive: partial results must be discarded, (at least parts of) the pipeline must be flushed, and the modified instruction stream must be restarted.
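The stale-fetch hazard can be shown with a toy simulation in Python. Everything here is hypothetical: a three-instruction "ISA" (add, patch, halt), a prefetch queue standing in for the pipeline, and a `snoop_writes` flag standing in for hardware that detects a write to an already fetched address and flushes.

```python
from collections import deque


def run(program, prefetch_depth, snoop_writes):
    """Execute a toy program and return the accumulator.

    Instructions are tuples: ("add", n), ("patch", addr, new_instr), ("halt",).
    """
    memory = list(program)
    queue = deque()                   # (address, instruction copy) pairs
    fetch_pc = 0
    acc = 0
    while True:
        # Prefetch: copy instructions from memory into the queue.
        while len(queue) < prefetch_depth and fetch_pc < len(memory):
            queue.append((fetch_pc, memory[fetch_pc]))
            fetch_pc += 1
        if not queue:
            return acc
        addr, instr = queue.popleft()
        if instr[0] == "halt":
            return acc
        if instr[0] == "add":
            acc += instr[1]
        elif instr[0] == "patch":
            _, target, new_instr = instr
            memory[target] = new_instr
            if snoop_writes and any(a == target for a, _ in queue):
                queue.clear()         # discard stale copies
                fetch_pc = addr + 1   # restart fetching after the patch


prog = [("patch", 1, ("add", 100)), ("add", 1), ("halt",)]
print(run(prog, prefetch_depth=1, snoop_writes=False))  # patched instruction runs
print(run(prog, prefetch_depth=3, snoop_writes=False))  # stale instruction runs
print(run(prog, prefetch_depth=3, snoop_writes=True))   # flush, then patched runs
```

With a deep queue and no snooping, the old `("add", 1)` has already been fetched when the patch lands, so the program silently computes the wrong result.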

The Harvard design became popular because it actively discourages trying to write self-modifying code.

But carrying separation of code and data all the way to disjoint memory is expensive with large memories, so CPUs tend to have unified memory with disjoint caching [the "modified" Harvard design].

George

George Neuner

Not exactly this, but along this line of thought: something bit me some 15 years ago when I was dealing with the 5420, in a way I still remember; it wasted me a day perhaps. I had written some self-modifying code, not for use in the end device, just a one-time utility for me. I ran it on the 5420 (it was computationally intensive and this was my fastest option at the time, and I had just made a new toolchain for it, so I wanted to use it, etc.). Something pretty simple did not work; the self-modifying code seemed not to get modified, or something like that. But when I looked with the monitor (I had written one for the 5420, too) it was OK. Yet it ran as if it was not. It turned out I had not read the entire errata sheet. There was something about a write to program memory which never got initiated unless a write to some data memory took place after it (or something of the sort)... So when the monitor trapped, it obviously did write to data memory (stack etc.) and I saw the correct program memory...

Still remember it, could even locate the source (just found the comment "hopefully some write has begun....", did not try to get into the details again of course).

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

Dimiter_Popoff

And yet, "self-modifying code" is exactly what the Java Hot-spot JIT compiler is. The first and second time through, the byte-code is interpreted, and after the second it's on a queue of things to compile to native code. When a compile thread completes that task, that code path diverges to the newly-generated native code. The first task of that code is often to decide whether the branches further down this path are what this code has been optimised for, and to return to the interpreter if the assumptions are not met.

My point is, self-modifying code, under the right conditions, is not only possible, but is the right solution for some problems.

Clifford Heath

It may be that you didn't flush the write(s) ???

AFAIK, starting from the Pentium Pro, all the Intel chips have snooped data writes and automatically invalidated corresponding code cache addresses. They also monitor the code cache and flush the pipeline if any in-flight instruction is invalidated.

Modern chips snoop the unified L2 cache so cross-core writes can be seen earlier [before they go all the way to memory]. I would have thought that the E5420 would be in that class, but some of the old chips only monitored actual memory writes.

Still, the write wouldn't be seen until it hit at least the L2 cache. The E5420 had write-back L1, so the writes would have needed to be flushed explicitly [or you would have needed to wait until the modified lines were replaced.]

Intel is rather inexplicably nice to self-modifying code: generally the worst that happens is that it will have poor performance. Many other manufacturers are actively hostile: a lot of chips make even writing _correct_ self-modifying code a challenge.

On most non-Intel chips, in addition to flushing code modifying writes all the way to memory, you must deliberately invalidate the modified addresses in the code cache. On many chips you must also deliberately invalidate branch predictions.

And then, with some chips, you still must time everything correctly because the chip will continue to execute already fetched instructions regardless of whether they have been invalidated in cache.
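A toy Python model of the first two steps (push the modified data out of the D-cache, then invalidate the I-cache). The class and method names are invented for the sketch; on a real chip these would be cache-maintenance instructions or register writes, and the exact sequence and required barriers are chip-specific.

```python
class SplitCacheCPU:
    """Write-back D-cache and a separate I-cache with no hardware coherence."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.dcache = {}              # dirty lines not yet written back
        self.icache = {}

    def store(self, addr, value):
        self.dcache[addr] = value     # stays in the D-cache until cleaned

    def fetch(self, addr):
        if addr not in self.icache:   # miss: fill from main memory
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def clean_dcache(self, addr):
        if addr in self.dcache:       # write the dirty line back to memory
            self.memory[addr] = self.dcache[addr]

    def invalidate_icache(self, addr):
        self.icache.pop(addr, None)   # force a refill on the next fetch


cpu = SplitCacheCPU({0: "old"})
cpu.fetch(0)                          # I-cache now holds "old"
cpu.store(0, "new")                   # code-modifying write
print(cpu.fetch(0))                   # still the old instruction
cpu.clean_dcache(0)
print(cpu.fetch(0))                   # memory is updated, I-cache is not
cpu.invalidate_icache(0)
print(cpu.fetch(0))                   # finally the new instruction
```

Skipping either step leaves the instruction side looking at stale contents, which is exactly the class of bug that is easy to introduce and easy to miss under a debugger.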

George

George Neuner

JIT generation is not really applicable. It is "self-modifying" in a broad sense, but not in a way that is detrimental to code correctness. [performance is a different issue]

Although it is called "Just In Time", the reality is that JIT code isn't being rewritten *AS* it is being executed: that is, a block of code is generated, and only when it is complete is the CPU permitted to enter it. There is no concern that old (incorrect) instructions will be fetched and executed before they are overwritten with new (correct) instructions.

Most Harvard chips have no real issues dealing with JIT code because new code typically is at a different address than the old code it replaces. All the JIT systems I am familiar with indirect calls through a jump table rather than patching call sites directly, so that they are free to replace code blocks at will. Which is not to say that old code addresses, branch prediction targets, etc. should not also be invalidated [and also external things like changing page protections]: these are needed so that code generation buffers can be reused. But typically reuse happens at timescales that ensure any cache traces would have been aged out and gone anyway.
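The jump-table indirection can be sketched in a few lines of Python. This is only an analogy for the dispatch pattern, with invented names; it does not generate machine code. Callers always go through the table, and a slot is swapped only once the replacement is completely built.

```python
def interpret_square(x):
    """Slow path: stands in for interpreting byte-code."""
    result = 0
    for _ in range(abs(x)):
        result += abs(x)
    return result


# Every call site goes through this table rather than calling a body directly.
jump_table = {"square": interpret_square}


def call(name, *args):
    return jump_table[name](*args)


def install_compiled(name, compiled_fn):
    """Publish a finished replacement with a single table update.

    The old body is never edited in place, so no caller can observe a
    half-written routine.
    """
    jump_table[name] = compiled_fn


print(call("square", 7))                      # served by the interpreter stub
install_compiled("square", lambda x: x * x)   # "compiled" version is complete
print(call("square", 7))                      # same call site, new code
```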

George

George Neuner

While self modifying code was necessary in old computers without index registers, I am surprised it is still used today.

In those old computers the low end of the instruction word was the data address. These bits in the instruction word were modified and the instruction was then executed to access a different element in an array.

Sequentially accessing an array was simple: increment the whole instruction word, which incremented the address part by one. Of course, you needed an array limit check that compared the current instruction word with the last instruction word. Fail to do this and, sooner or later, the instruction word would become a completely different instruction :-)
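A Python sketch of that technique on an imaginary machine. The 16-bit word layout (4-bit opcode on top of a 12-bit address) and the opcode value are assumptions made up for the example, not any particular historical computer.

```python
ADDR_BITS = 12
ADDR_MASK = (1 << ADDR_BITS) - 1
OP_ADD = 0x1                          # hypothetical "add memory to accumulator"


def make_word(opcode, address):
    """Build an instruction word: opcode in the high bits, address in the low."""
    return (opcode << ADDR_BITS) | (address & ADDR_MASK)


def sum_array(data_memory, first, count):
    """Sum count (>= 1) cells by repeatedly incrementing the ADD instruction."""
    instr = make_word(OP_ADD, first)
    last = make_word(OP_ADD, first + count - 1)   # limit check target
    acc = 0
    while True:
        opcode, address = instr >> ADDR_BITS, instr & ADDR_MASK
        if opcode != OP_ADD:
            raise RuntimeError("instruction word turned into another opcode")
        acc += data_memory[address]
        if instr == last:             # compare whole instruction words
            return acc
        instr += 1                    # "modify the code": bump the address field


memory = [0] * 16
memory[4:8] = [10, 20, 30, 40]
print(sum_array(memory, first=4, count=4))

# Without a limit check the address field eventually overflows into the opcode:
print(hex(make_word(OP_ADD, ADDR_MASK) + 1))
```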

upsidedown

Hah, there is a nice misunderstanding :). It was not an Intel part and I did not know 5420 did apply to one of theirs.

It was a TI TMS-whatever-5420 dsp (of their C54xx series).

I certainly have had my share of forgetting to invalidate the I-cache on Power (PPC) while I was porting DPS to it (well over 10 years ago), but these were just routine errors, easy to catch. Nowhere near as nasty as the one I remember for the (my...) 5420, which took reading the errata sheet to comprehend (the 5420 DSP has no caches and no sync, dcbf, icbi etc. opcodes at all; they had messed something up internally, which was not critical as long as one was aware of it).

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

Dimiter_Popoff

Ah. I never worked with TI - all my DSP work was with ADI chips.

George

George Neuner
