Re: Intel details future Larrabee graphics chip

OK, bswapq is x86_64, but then that is probably where the extension makes sense. Since the PowerPC is a load/store architecture (mostly) I counted the byte reverse load/stores.

The 8-byte (doubleword) byte-reverse load/store is apparently an extension for the Cell Broadband Engine PPE variant of the PowerPC. The following is from the CellBE_PXCell_Handbook_v1.11_12May08_pub.pdf at

formatting link

page 740. It may be easier to just search on "ibm cell alphaworks" and look around the site.

A.2.1 New PowerPC Instructions

The PPE implements the following new instructions, relative to version 2.02 of the PowerPC Architecture:

  * ldbrx - Load Doubleword Byte Reverse Indexed X-form
  * stdbrx - Store Doubleword Byte Reverse Indexed X-form

Details follow starting on the next page.
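
For comparison, here is a portable C sketch (my own, not from the handbook) of the operation these instructions provide; on targets that have ldbrx or x86-64 bswap, a compiler will often collapse this pattern into the single instruction:

#include <stdint.h>
#include <string.h>

/* Load 8 bytes from p and return them byte-reversed - the operation that
 * ldbrx (or a 64-bit load followed by bswap on x86-64) does in one go. */
uint64_t load64_byte_reversed(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);  /* ordinary 8-byte load */
    v = ((v & 0x00FF00FF00FF00FFull) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFull);
    v = ((v & 0x0000FFFF0000FFFFull) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFull);
    return (v << 32) | (v >> 32);  /* finally swap the 32-bit halves */
}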

Reply to
Dennis

I see, so I haven't missed anything while reading PowerISA 2.04. Being PPE-only precludes me from using it; yes, it'll work on one PPC64 machine I have access to, but it will bomb on the other. A shame.

formatting link

Thank You.

--
mailto:av1474@comtv.ru
Reply to
malc
+---------------
| Aside: does anyone know why the "Harvard" approach was promoted from
| being a trivial but important variation of Von Neumann to being of
| equal rank, starting about 20 years ago?
+---------------

My guess would be the rise of separate I & D on-chip caches, which naturally leads to a Harvard approach inside the CPU/ALU pipeline, while the external memory interface remains (mostly) von Neumann [except for details of cache-flush/-invalidate ops needed to make the external memory be *truly* von Neumann].

Coincidentally, it was just over 20 years ago that AMD brought out the Am29000, which was an odd little Harvard hybrid (with *no* caches) with separate instruction & data busses and a *shared* address bus [and burst mode on both the I- & D-busses so once you'd started a burst on one you could use the address bus to start a burst for the other]. It may have had a small contribution to the terminology change in the 1987-1992 timeframe or so.

Though if the 29k had any real effect on long-term Marketing language, it was probably just that it caused other vendors with on-chip primary caches to look up and say, "Oh, *we* have a Harvard Architecture *too*, inside our CPU pipeline." [Which they did, all along, though nobody had talked about it that way much before.]

+---------------
| so, despite the nonsense in Wikipedia...
+---------------

Well, they did get this bit right, IMHO:

Modern high performance CPU chip designs incorporate aspects of both Harvard and von Neumann architecture. On-chip cache memory is divided into an instruction cache and a data cache. Harvard architecture is used as the CPU accesses the cache. In the case of a cache miss, however, the data is retrieved from the main memory, which is not divided into separate instruction and data sections. Thus, while a von Neumann architecture is presented to the programmer, the hardware implementation gains the efficiencies of the Harvard architecture.

+---------------
| and almost all programming languages have used separate code and data
| "address spaces" since the invention of COBOL and FORTRAN, and were/are
| always talked about as using the Von Neumann model (as they do).
+---------------

This seems to be conflating a number of issues which were more nearly orthogonal than your statement would lead one to believe:

  0. How programming languages [other than assembler] talked about "code" versus "data". Mostly, they didn't! They only talked about "data". How the code got into memory in the first place was magic buried in the operating system and/or the linker. [One exception was Lisp, which usually included a full compiler/linker within the run-time system, and thus needed to know how to get "data" to become "code". But even there the user programmer didn't worry about it. You said (COMPILE 'FOO) and FOO was now a compiled code object instead of an interpreted s-expression data object. Not really "separate" at the language level, then.]

  1. Sometimes "code" and "data" were split by (virtual) addressing only to separate access capabilities, as with the read-only "high segment" and read-write "low segment" of the DEC PDP-10. But note that even with this split, it was still a pure von Neumann machine: code could be fetched from either segment and a data LOAD from the high seg worked fine. There was still only one address bus & one instruction/data bus.

[Well, except in later models where the high seg could be read- and execute-protected if the program counter wasn't *in* the high seg, and it could only get there by jumping to specific locations containing "PORTAL" instructions.]

  2. Some small-address-space machines needed to split the address space to get beyond the word-length limits of the machine. A classic example is the DEC PDP-11, which was initially a 16-bit von Neumann machine with 16-bit addressing, but which was later given 22 bits (or more) of physical addressing on the bus. User-mode (well, mapped) addressing was split into "code" and "data" accesses, and addresses could thus be re-used to double the effective virtual address space. [Since there was no way to store into "code" space, it became effectively read-only.] It was still von Neumann, though, even with that split.

Even weirder was the DEC PDP-8, which was also a pure von Neumann machine with 12-bit instructions/data, but in larger models with up to 15 bits of bus address. The CIF (Current Instruction Field) register held the upper 3 address bits of the current program counter. The CDF register held the upper 3 address bits used by *indirect* data references, but the upper 3 address bits for any direct data references came from the CIF.

  3. True Harvard splits at the hardware level the physical paths by which "code" enters the CPU and the physical paths by which "data" enters the CPU. The Am29000 was such a machine, although it *shared* the virtual address space [and the address bus itself] between code & data, and thus was Harvard in hardware and von Neumann in software. [Well, except that there really was *no* way for the software to read instruction memory as "data" if the system designer hadn't provided some "sneak path" between the I- & D-busses, so in its pure form the only way to store "data" in "code" space was as a sequence of load-immediate instructions.]
[Note that the 29000 was almost *perfectly* tuned to be connected to video DRAM (VRAM or VDRAM) as its main memory, with the I-bus connected to the "serial" port of the VRAM and the A- & D-busses connected to the normal A & D ports of the VRAM. Besides giving the separate I- & D-busses needed for the Harvard architecture, it also provided the "sneak path" that you needed to be able to write "data" that could be later treated as "code".]

[Note#2: The Am29030 added on-chip I-cache (though *not* D-cache!), and eliminated the separate I-bus pins. So at the bus level it was back to von Neumann. But the microarchitecture remained the same as the Am29000, so at that level it was still "Harvard".]

  4. Sometimes I & D were split because instructions and the data were of different widths!! E.g., the 8X300 with 16-bit instructions, 13-bit instruction addresses, 8-bit data, and just 9-bit data addresses (actually, two data spaces with 8-bit addressing on each, but sharing a single 8-bit data bus); the Microchip PIC 8-bit series, with 8-bit data and 12-, 13-, or 14-bit instructions (depending on model) and a mishmash of instruction addressing (depending on model/ROM size); etc.

Anyway, "code" & "data" were split or not for many different reasons, not just one, and sometimes it was just a partitioning of (virtual) addresses, sometimes a physical separation between busses; sometimes it was to increase bandwidth, sometimes to increase address space, and sometimes to separate access capabilities. If you look across all the architectures of the past 2-3 decades, somewhere you'll find a mix & match of almost all possible combinations of these reasons.

-Rob

----- Rob Warnock

627 26th Avenue San Mateo, CA 94403 (650)572-2607
Reply to
Rob Warnock

In article , snipped-for-privacy@rpw3.org (Rob Warnock) writes:
|> Nick Maclaren wrote:
|> +---------------
|> | Aside: does anyone know why the "Harvard" approach was promoted from
|> | being a trivial but important variation of Von Neumann to being of
|> | equal rank, starting about 20 years ago?
|> +---------------
|>
|> My guess would be the rise of separate I & D on-chip caches, which
|> naturally leads to a Harvard approach inside the CPU/ALU pipeline,
|> while the external memory interface remains (mostly) von Neumann
|> [except for details of cache-flush/-invalidate ops needed to make
|> the external memory be *truly* von Neumann].

Well, separate I and D caches was a well-established technology by the 1960s (and probably a lot earlier). I suppose that the current crop of kiddies were taught by the sort of "computer scientists" who deified themselves in the 1980s and denigrated earlier work to do so.

|> +---------------
|> | so, despite the nonsense in Wikipedia...
|> +---------------
|>
|> Well, they did get this bit right, IMHO:
|>
|> Modern high performance CPU chip designs incorporate aspects of
|> both Harvard and von Neumann architecture. ...

True.

|> +---------------
|> | and almost all programming languages have used separate code and data
|> | "address spaces" since the invention of COBOL and FORTRAN, and were/are
|> | always talked about as using the Von Neumann model (as they do).
|> +---------------
|>
|> This seems to be conflating a number of issues which were more
|> nearly orthogonal than your statement would lead one to believe:

That's fair.

|> 0. How programming languages [other than assembler] talked about "code"
|> versus "data". Mostly, they didn't! They only talked about "data".
|> How the code got into memory in the first place was magic buried in
|> the operating system and/or the linker. ...

Not really. Firstly, the code of a function was almost always an opaque read-only object, but pointers to it could often be manipulated just like any other pointers to opaque read-only data objects - even excluding LISP, that was true in BCPL, Algol 68 and others. Secondly, that "magic" aspect was largely true of data in many early languages - and is almost always true of genuinely high-level ones.

|> Anyway, "code" & "data" were split or not for many different reasons,
|> not just one, ...

Indeed. But it STILL doesn't answer my question, which is why the revisionists have turned established terminology on its side, and propagate complete nonsense about Von Neumann (restricted sense) and Harvard being very different architectural models.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

If you reread my post that caused you to start this particular sub-thread, you would find out that I actually refer to Von Neumann and Harvard as two very similar architectural models. So who are those evil revisionists?!

Reply to
already5chosen

Agreed, sort of:

If you can do the same as most hw, i.e. punt at Inf/NaN/Denorm, then the real cost is in the multi-way branch on the exponent field. The problem is that Zero is quite common, so we cannot simply lump it together with the other extreme exponent cases but have to special-case it:

if (exp + 1 > 1) {
    // Regular number
    ...
} else if ((bits & ~SIGN) == 0) {
    // Zero
    ...
} else if (exp == 0) {
    // Denorm
    ...
} else {
    // Inf/NaN
}
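
For concreteness, here is a self-contained C sketch of that multi-way classification with the exponent tests spelled out (the names fclass, classify and SIGN are mine):

#include <stdint.h>

enum fclass { FC_NORMAL, FC_ZERO, FC_DENORM, FC_INF_NAN };

/* Classify a 32-bit IEEE single from its bit pattern by branching on the
 * biased exponent field (1..0xFE = normal, 0 = zero/denorm, 0xFF = Inf/NaN). */
enum fclass classify(uint32_t bits)
{
    const uint32_t SIGN = 0x80000000u;
    uint32_t exp = (bits >> 23) & 0xFFu;

    if (exp != 0 && exp != 0xFFu)  return FC_NORMAL;   /* regular number */
    if ((bits & ~SIGN) == 0)       return FC_ZERO;     /* +0 or -0 */
    if (exp == 0)                  return FC_DENORM;   /* denormal */
    return FC_INF_NAN;                                 /* Inf or NaN */
}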

I didn't know that, but that would only be a single well-predicted branch up front in each routine, right? ... OK, using binary encoding makes a sw implementation quite easy! In fact, it seems like a useful working format for a sw implementation.

If I were going to implement 754R (using the mod_1000 encoding) in software I would have to handle all FADD/FSUB in some form of BCD, but FMUL would probably be faster by first converting to pure binary, or at least a much larger base, like 1E9.

FDIV in decimal isn't obvious, but I'd try a two-stage binary reciprocal approach, i.e. a 32-bit reciprocal used in two iterations with back-multiply and subtract.

Terje

Reply to
Terje Mathisen

A lot earlier than the 60s? When would that be? According to the Wikipedia article, the first machine on which you would even theoretically want such caches wasn't built until 1948, i.e. just 12 years before the 60s.
Reply to
already5chosen

In article , snipped-for-privacy@yahoo.com writes:
|> > Indeed. But it STILL doesn't answer my question, which is why the
|> > revisionists have turned established terminology on its side, and
|> > propagate complete nonsense about Von Neumann (restricted sense) and
|> > Harvard being very different architectural models.
|>
|> If you reread my post that caused you to start this particular
|> sub-thread, you would find out that I actually refer to Von Neumann and
|> Harvard as two very similar architectural models.

And I never said that you didn't.

|> So who are those evil revisionists?!

Dunno. But you can see evidence of their work on Wikipedia and in an increasing number of technical papers.

To some extent, even you are a revisionist, because the traditional terminology "Von Neumann architecture" includes the Harvard variant as a subclass. As it is.

This sort of revisionism causes much more serious problems than it might appear to, because it makes it much harder for anyone trying to propose a significant architectural change to do so. Even if they win their argument, other people are likely to claim that they are making a radical change by moving from Von Neumann to Harvard architectures!

Seriously.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , snipped-for-privacy@yahoo.com writes:
|> > Well, separate I and D caches was a well-established technology by
|> > the 1960s (and probably a lot earlier).
|>
|> A lot earlier than 60s? When would it be?
|> According to Wikipedia article, the first machine on which you would
|> even theoretically want such caches wasn't built until 1948 i.e. just
|> 12 years before 60s.

I didn't say "by 1960" but "by the 1960s". The first machines I know of that had much in the way of transparent caching were late 1960s, and included several with separate I and D caches.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , Terje Mathisen writes:
|> > If you can decode an IEEE 754 value in 3-5 instructions, and get all
|> > of the special cases right, then it has hardware assistance. Note
|> > that merely breaking the number up into fields is the easy part of
|> > the decoding. Stopping at that point isn't interesting.
|>
|> Agreed, sort of:
|>
|> If you can do the same as most hw, i.e. punting at Inf/NaN/Denorm,

No, you can't - that's not according to specification!

|> then
|> the real cost is in the multi-way branch on the exponent field, with the
|> problem being the fact that Zero is quite common, so we cannot simply
|> lump it together with the other extreme exponent cases but have to
|> specialcase it: ...

Don't bet on the others being rare - it's very application-dependent. In particular, denorms are NOT rare in many programs, and only some architectures have a "position of first bit" opcode.

|> > And I said "a hundred times as expensive", not "100 instructions",
|> > though it could well be 100 executed instructions. The reason that
|> > I said it was expensive is that it will often/usually have a lot of
|> > mispredicted branches. You are aware that there are TWO formats of
|> > decimal, aren't you?
|>
|> I didn't know that, but that would only be a single well-predicted
|> branch up front in each routine, right?

In general, yes. But only one of them would be less than horrible to decode efficiently and correctly in software.

|> OK, using binary encoding makes a sw implementation quite easy! In fact,
|> it seems like a useful working format for a sw implementation.

Even that is not nice. You have a hard-to-predict branch based on which of the two binary variants is used, plus the other tests. Also, unless IEEE 754R was changed radically after I stopped following it, even the binary representation supports cohorts - and decoding a number means getting that right, too.

A decent decoder would support the densely packed decimal format, too, which is NOT pretty in software!

Regards, Nick Maclaren.

Reply to
Nick Maclaren

So you are saying that since the Von Neumann architecture can be seen as a generalization of the Harvard one, we should treat Harvard as a subset of Von Neumann? That sounds logically correct, but it ignores the practical engineering restrictions imposed by the Von Neumann generalization. More importantly, historically, Von Neumann's big invention was treating code and data as the same. So I don't see why machines that most certainly do nothing like that, at either the logical or the physical layer, should be referred to as Von Neumann machines.

Conclusion: the people that use the term "Von Neumann architecture" as a common replacement for "architecture based on interpreting of serial or near-serial instruction streams fetched from random-access memory" are true revisionists.

Reply to
already5chosen

In article , snipped-for-privacy@yahoo.com writes:
|> Conclusion: the people that use the term "Von Neumann architecture" as
|> a common replacement for "architecture based on interpreting of serial
|> or near-serial instruction streams fetched from random-access memory"
|> are true revisionists.

Ah. Well, I side with Backus - who is both massively more eminent than I am and of a previous generation.

formatting link

Are you claiming that he was being a revisionist in that?

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Sorry, ACM portal refuses to show me what you mean.

Reply to
already5chosen

In article , snipped-for-privacy@yahoo.com writes:
|> >
|> > formatting link
|>
|> Sorry, ACM portal refuses to show me what you mean.

It refuses to show me, now. Try the following for the full article:

formatting link

If that fails, use Google on "Von Neumann reference", select the second match, and look at "all 59 versions".

In particular, see the description of a Von Neumann computer and what Von Neumann languages are.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Are you aware of the binary-to-decimal conversion algorithm I discovered 10+ years ago? (AMD used to show it in their optimization manual, without any attribution. :-()

Using a 32-bit cpu it will convert any input to decimal in about the same time as a single integer DIV opcode, i.e. 30-50 cycles.

Larger inputs should be split using reciprocal multiplication by 2^32/1e9 (instead of division) to work modulo 1e9.

The same approach can handle 64-bit chunks on a 64-bit cpu, making even 100+ mantissa bits doable in maybe 100 cycles.
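
To illustrate the reciprocal-multiplication split on a 64-bit cpu, here is a small C sketch (my own assumptions, not Terje's code; it uses the unsigned __int128 extension of gcc/clang for the 64x64 high multiply):

#include <stdint.h>

/* Split n into hi*1e9 + lo without a divide: multiply by a precomputed
 * fixed-point reciprocal of 1e9, back-multiply and subtract, then apply
 * at most one correction (the estimate can be low by at most 1). */
void split_1e9(uint64_t n, uint64_t *hi, uint64_t *lo)
{
    const uint64_t D     = 1000000000ull;   /* 1e9 */
    const uint64_t RECIP = 18446744073ull;  /* floor(2^64 / 1e9) */

    uint64_t q = (uint64_t)(((unsigned __int128)n * RECIP) >> 64); /* ~ n/1e9 */
    uint64_t r = n - q * D;                 /* back-multiply and subtract */
    if (r >= D) { r -= D; q++; }            /* fix a possible underestimate */
    *hi = q;
    *lo = r;
}

Applied repeatedly to the high part, this peels off nine decimal digits per step, which is the modulo-1e9 chunking described above.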

Terje

Reply to
Terje Mathisen

I figured out that you most likely had in mind this particular citation: "Conventional programming languages are growing ever more enormous, but not stronger. Inherent defects at the most basic level cause them to be both fat and weak: their primitive word-at-a-time style of programming inherited from their common ancestor--the von Neumann computer... etc"

Yes, Backus is most certainly a revisionist. The property he is talking about predated Von Neumann's contribution. If anything, he should have praised Von Neumann for showing us one possible way out of the maze, although probably not the best one from a performance perspective.

Reply to
already5chosen

On Aug 24, 7:59 pm, "Wilco Dijkstra" wrote: [....]

Yes. There was a short circuit between the headphones.

Reply to
MooseFET

|> ... real cost is in the multi-way branch on ...
|> ... so we cannot simply lump it together ...

I use a similar layout, but there is no need to decode the inputs at all:

// 32-bit IEEE float in x, y
if (((x + 0x800000) & 0x7f000000) != 0 &&
    ((y + 0x800000) & 0x7f000000) != 0) {
    // normal case, now decode x, do operation and return
}
// now deal with special cases

In most cases I need just one easily predictable branch to catch the special cases without decoding the operands, and use conditional execution for the special cases. However for the binary operators you have to be careful when dealing with zero first:

// x * 0 or 0 * y -> return 0
if ((x ...

|> > And I said "a hundred times as expensive", not "100 instructions",
|> ...
|> I didn't know that, but that would only be a single well-predicted branch up
|> front in each routine, right?

Indeed, and only that if you need to write one routine to handle both formats.
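
As an aside, here is a minimal standalone check of the exponent-field filter above (the helper name and test values are mine; x is the raw 32-bit IEEE single bit pattern):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <float.h>

/* True for normal numbers (biased exponent 1..0xFE), false for
 * zero, denormals, infinities and NaNs. */
static int is_normal_bits(uint32_t x)
{
    return ((x + 0x800000u) & 0x7f000000u) != 0;
}

int main(void)
{
    float tests[] = { 1.0f, -2.5f, 0.0f, -0.0f, FLT_MIN / 2 /* denormal */,
                      INFINITY, -INFINITY, NAN };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        uint32_t bits;
        memcpy(&bits, &tests[i], sizeof bits);
        printf("%g -> %s\n", (double)tests[i],
               is_normal_bits(bits) ? "normal" : "special");
    }
    return 0;
}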

Wilco

Reply to
Wilco Dijkstra

messagenews:g8ov69$4sc$ snipped-for-privacy@gemini.csx.cam.ac.uk...

The lookup requires a trip to memory. Some processors, I think the Blackfin is one, have an add with reverse carry. You can use this to speed up the FFT sequence.
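
For what it's worth, the effect of a reverse-carry add can be mimicked in plain C by propagating the carry downward from the top index bit; a sketch (my own naming, for an FFT of 2^nbits points):

/* Advance a bit-reversed counter i by one: the carry runs from the
 * most significant of the nbits index bits downward, which is exactly
 * what an add-with-reverse-carry does in a single step. */
unsigned rev_increment(unsigned i, unsigned nbits)
{
    unsigned bit = 1u << (nbits - 1);
    while (i & bit) {   /* clear the run of 1s starting at the top */
        i ^= bit;
        bit >>= 1;
    }
    return i | bit;     /* set the first 0 bit found (wraps to 0 at the end) */
}

The reordering loop of the FFT then just steps j = rev_increment(j, log2n) alongside the ordinary counter.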

Reply to
MooseFET

In article , MooseFET writes:
|> On Aug 24, 8:25 pm, "Wilco Dijkstra" wrote:
|> > > Some of the cryptographic algorithms are similar. Inverting bits
|> > > (as used in FFTs) is, too, but I don't know any algorithms where
|> > > that is a major bottleneck.
|> >
|> > Indeed. Various architectures do implement bitreverse, but it is hardly
|> > needed as CPUs already have the ultimate bitshuffle instruction:
|> > the lookup table.
|>
|> The lookup requires a trip to memory. Some processors, I think the
|> Blackfin is one, have an add with reverse carry. You can use this to
|> speed up the FFT sequence.

Yes. And that is the reason it is generally insane to implement it by a lookup table for large FFTs - the effects on the cache more than offset its increased speed. As it's a fairly minor component of the algorithm, anyway, what the hell?

Regards, Nick Maclaren.

Reply to
Nick Maclaren
