Re: Intel details future Larrabee graphics chip

at the time (in following email), i was still on a kick about (the same) shared pages appearing at different virtual addresses in different virtual address spaces (or even the same shared pages appearing at different virtual addresses in the same virtual address space) ... misc. related posts

formatting link

from long ago and far away (with regard to 3090):

Date: 11/17/83 13:40:41 To: wheeler

The machine has a split cache, the instruction cache is managed with real addresses. No problems.

The operand cache is managed with two directories: one holds LOGICAL addresses (i.e. mixture of real and virtual), and the other holds real addresses. It appears to the outside world to be managed with real addresses. I can think of no reason why shared pages will be peculiar in this environment.

... snip ...

related old email about the 3090 cache operation

formatting link

in this post, also mentioning 801 (separate I&D cache) from 1975:

formatting link
Flash 10208

this (earlier) email mentions 5880 (amdahl mainframe clone) having separate I & D caches

formatting link
in this post
formatting link
blast from the past ... macrocode

misc. posts mentioning 801 (romp, rios, power/pc, etc).

formatting link

One of the differences between the 801 split cache and the 3090 (& 5880) split cache ... was that the 3090 (& 5880) managed cache consistency (between I & D caches) in hardware ... while the 801 required software to flush the D-cache & invalidate the I-cache (e.g. program loaders that may have modified instruction streams ... still sitting in the data cache ... had to do this to make sure the modifications were correctly reflected in the I-cache instruction stream).
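
An aside on what that software-managed coherence looks like from a loader's point of view: a minimal sketch, assuming GCC/Clang and their __builtin___clear_cache barrier (the 801/ROMP had its own cache-control operations rather than this builtin, so this only illustrates the required sequence, not period code):

#include <string.h>

typedef int (*entry_fn)(void);

/* Hypothetical loader step: copy freshly generated code into place, then
 * make the D-cache writes visible to the I-cache before jumping to it.
 * On the 3090/5880 the hardware did this; on an 801-style split cache
 * the flush/invalidate is software's job. */
static entry_fn install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);            /* instructions land in the D-cache */

    /* GCC/Clang builtin: flush the written range from the D-cache and
     * invalidate the corresponding I-cache lines (a no-op on machines
     * with hardware I/D coherence). */
    __builtin___clear_cache((char *)dst, (char *)dst + len);

    return (entry_fn)dst;             /* now safe to execute */
}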

other old email mentioning 801

formatting link

semi-related recent post in this thread (discussing virtual memory & paging from the 60s):

formatting link
Future architecture

for related topic drift ... "small" shared segments in ROMP chip (801 used later in PC/RT)

formatting link
formatting link

in this post:

formatting link
Multiple mappings

and (this time, Iliad chip ... another 801)

formatting link

in this post:

formatting link
To RISC or not to RISC

similar post along this line

formatting link
The Perfect Computer - 36 bits?
formatting link
Taxes

--
40+yrs virtualization experience (since Jan68), online at home since Mar70
Reply to
Anne & Lynn Wheeler

Sorry, this is a language problem:

By punt I didn't mean "give up" but "accept big performance hit".

In my performance-sensitive code (i.e. most of it) "to punt" would mean giving up on providing a fast (i.e. useful) path and falling back on something slow that does work.

BitScan() isn't really needed at this particular point, a small lookup table is enough to handle the leading 754R exponent bits.
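
For the record, that table is tiny: in the 754R decimal interchange formats the top five combination-field bits jointly encode the two MSBs of the biased exponent and the leading digit (or flag Inf/NaN), so 32 entries cover every case. A sketch of the generating rule as I read the format, not Terje's actual code:

#include <stdint.h>

enum dclass { DEC_FINITE, DEC_INF, DEC_NAN };

struct comb5 { uint8_t exp_msbs, lead_digit; enum dclass cls; };

/* Decode the top five combination bits G0..G4 of a decimal64/decimal128
 * encoding. In a real library this would be a 32-entry table generated
 * once; the branches below are just the generating rule. */
static struct comb5 decode_comb5(unsigned g)      /* g = G0..G4, 0..31 */
{
    struct comb5 r = { 0, 0, DEC_FINITE };

    if ((g >> 3) != 3) {                /* G0G1 != 11: leading digit 0..7 */
        r.exp_msbs   = g >> 3;
        r.lead_digit = g & 7;
    } else if (((g >> 1) & 3) != 3) {   /* 11xxG4, xx != 11: digit 8..9 */
        r.exp_msbs   = (g >> 1) & 3;
        r.lead_digit = 8 + (g & 1);
    } else {
        r.cls = (g & 1) ? DEC_NAN : DEC_INF;   /* 11110 = Inf, 11111 = NaN */
    }
    return r;
}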

For normalization it is very useful though!

There is a canonical way to represent most/all results, but you have to support all possible encodings of decimal fp numbers.

This sucks, but only sort of: It seems to make it easier to support significance arithmetic.

DPD is fun: there are actually two different ways to solve the DPD problem (encoding 1000 decimal values using 10 bits, without requiring any expensive hw operation on either packing or unpacking).

I think it was one of the IBM guys (Hack?) who challenged me to figure out the encoding, and I came up with the alternate approach. The one they did choose has the advantage of needing smaller tables for a sw pack/unpack routine.
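
To make the 1000-values-in-10-bits point concrete: any injective map from three digits to a 10-bit declet can be driven from tables in software. The sketch below deliberately uses the trivial binary declet rather than the actual DPD bit patterns (which are arranged so hardware can pack/unpack with a few gates), so it only illustrates the table-driven pack/unpack shape, not the real encoding:

#include <stdint.h>

static uint16_t pack_tab[10][10][10];   /* three digits -> 10-bit declet    */
static uint16_t unpack_tab[1024];       /* declet -> digits, one per nibble */

static void init_declet_tables(void)
{
    for (int a = 0; a < 10; a++)
        for (int b = 0; b < 10; b++)
            for (int c = 0; c < 10; c++) {
                uint16_t declet = (uint16_t)(100 * a + 10 * b + c);
                pack_tab[a][b][c] = declet;
                unpack_tab[declet] = (uint16_t)((a << 8) | (b << 4) | c);
            }

    /* The 24 left-over bit patterns (1000..1023) carry no value of their
     * own; a decoder still has to accept them (cf. the non-canonical
     * encodings mentioned above), so pin them to something harmless. */
    for (int d = 1000; d < 1024; d++)
        unpack_tab[d] = 0;
}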

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

In article , Terje Mathisen writes:
|>
|> > |> If you can do the same as most hw, i.e. punting at Inf/NaN/Denorm,
|> >
|> > No, you can't - that's not according to specification!
|>
|> Sorry, this is a language problem:
|>
|> By punt I didn't mean "give up" but "accept big performance hit".

Ah. That's fine - IF you know that such things will be rare. And my experience is that they aren't half as rare in real programs as is usually assumed.

|> > Don't bet on the others being rare - it's very application-dependent.
|> > In particular, denorms are NOT rare in many programs, and only some
|> > architectures have a "position of first bit" opcode.
|>
|> BitScan() isn't really needed at this particular point, a small lookup
|> table is enough to handle the leading 754R exponent bits.

That doesn't help with denormals, though! You can't always use them in denormalised form, and can rarely use the same code as for normal numbers unless you normalise them.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

> ... real cost is in the multi-way branch on ...

> ... common, so we cannot simply lump it together ...

This is actually two branches, equally hard/easy to predict.

Aha! I think Nick will claim that this is cheating, even though I tend to like this approach: Yes, it does lose some nice properties close to the underflow limit, but it runs so much faster on nearly all hw. :-)

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Which is why I wrote "at this particular point". I _did_ mention the general problem of being forced to accept all possible (i.e. non-canonical) encodings, where bitscan would be an absolute requirement.

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

> ... years ago? (AMD used to show it in their ...

> ... as a single integer DIV opcode, i.e. ...

> ... (instead of division) to work modulo 1e9.

That is very quick indeed. I normally use divmod by 10 - this is faster for small numbers (even when using hardware divide on a non-x86 system), but slower for >5 digits. I've not tried your algorithm, but I guess on a simple ARM one could get around 50 cycles as well, and 30 on a superscalar one.
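
For reference, the cheap way to do that divmod by 10 without a hardware divider is a reciprocal multiplication; a sketch with the standard round-up constant (my illustration, not Wilco's or Terje's actual code):

#include <stdint.h>

/* x/10 and x%10 for any 32-bit unsigned x, using one 32x32->64 multiply
 * instead of a divide: 0xCCCCCCCD = ceil(2^35 / 10), and the usual
 * round-up-reciprocal argument shows (x * 0xCCCCCCCD) >> 35 == x/10
 * for all x < 2^32. */
static inline uint32_t divmod10(uint32_t x, uint32_t *rem)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    *rem = x - 10u * q;
    return q;
}

/* Peel decimal digits off a binary value, least significant first. */
static int to_digits(uint32_t x, uint8_t digits[10])
{
    int n = 0;
    do {
        uint32_t r;
        x = divmod10(x, &r);
        digits[n++] = (uint8_t)r;
    } while (x != 0);
    return n;
}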

A 50-bit 754R mantissa would need to be split into two parts (using 64x64->128, which would be slow on a 32-bit system), followed by multiplying by 1000 to get the 5 DPD values plus the leading digit. I guess one could use 32-bit fixed point by doing the multiply by 1000 as x - ((x*3) >> 7). Getting all that down to less than 50 cycles seems a challenge, except perhaps on x64.

Binary to decimal conversion of binary floating point numbers is a bit more complex still (printing decimal floating point numbers is trivial because of the decimal exponent).

Wilco

Reply to
Wilco Dijkstra

> ... real cost is in the multi-way branch on ...

> ... common, so we cannot simply lump it together ...

Not on architectures with conditional execution or other means to avoid branches. For example one can optimize:

if (a != 0 && b != 0)

as

if (min(a,b) != 0)

or

if ((clz(a) | clz(b)) < 32)

However given branch prediction is pretty good nowadays, it probably doesn't save much.
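
A compilable version of those rewrites, with the signedness caveat raised later in the thread handled by keeping everything unsigned, and clz(0) pinned to 32 since __builtin_clz(0) is undefined (again just an illustrative sketch):

#include <stdint.h>

static inline unsigned clz32(uint32_t x)
{
    return x ? (unsigned)__builtin_clz(x) : 32;   /* __builtin_clz(0) is UB */
}

/* All three are equivalent for *unsigned* a and b (for signed values the
 * min() form breaks, e.g. a = -1, b = 0). */
static inline int both_nonzero_branchy(uint32_t a, uint32_t b)
{
    return a != 0 && b != 0;            /* two tests, possibly two branches */
}

static inline int both_nonzero_min(uint32_t a, uint32_t b)
{
    return (a < b ? a : b) != 0;        /* one test; min() maps to cmov/csel */
}

static inline int both_nonzero_clz(uint32_t a, uint32_t b)
{
    return (clz32(a) | clz32(b)) < 32;  /* bit 5 set => one operand was zero */
}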

> ... this approach: Yes, it does lose some ...

> ... nearly all hw. :-)

Yes, if the choice is between fast and slow, most people tend to choose fast - unsurprisingly. In the embedded space FP is used for problems that never cause underflow or overflow (beyond badly implemented math functions - sin(x) can cause underflow if you don't special case small x).

Wilco

Reply to
Wilco Dijkstra

In article , Terje Mathisen writes:
|> Wilco Dijkstra wrote:
|>
|> >...
|> > In most cases I need just one easily predictable branch to catch the
|> > special cases without decoding the operands, and use conditional
|> > execution for the special cases. However for the binary operators
|> > you have to be careful when dealing with zero first:
|> >
|> > // x * 0 or 0 * y -> return 0
|> > if ((x [...] return (x ^ y) & 0x80000000;
|> >
|> > This checks that neither value is a Inf/NaN, and so one of them
|> > must be a zero or denormal (this code flushes denormals to zero).
|>
|> Aha! I think Nick will claim that this is cheating, even though I tend
|> to like this approach: Yes, it does lose some nice properties close to
|> the underflow limit, but it runs so much faster on nearly all hw. :-)

I am not going to support the IEEE 754 specification; I am merely saying that it is a pain to handle in software, efficiently and fully.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Absolutely. I wrote the kernels of a compliant IEEE implementation in assembler a long time ago. Dynamic rounding was very tricky, as was support for user trap handlers and keeping the 5 status bits up to date. Denormals took a lot of code due to the architecture not supporting a count-leading-zeroes instruction - however I managed to use them unnormalized in most cases and branch to a shared normalization function at the end of each operation. We ended up conditionally assembling the code for various subsets of IEEE as the overhead of including everything was just too much...

Wilco

Reply to
Wilco Dijkstra

Would you kindly explain to me how to normalize a denormal without expanding the exponent range?

Reply to
JosephKK

In article , JosephKK writes:
|>
|> Would you kindly explain to me how to normalize a denormal without
|> expanding the exponent range?

That's how you do it!

Regards, Nick Maclaren.

Reply to
Nick Maclaren

You don't.

I.e. a sw library will almost certainly choose to work in an internal exponent format with a lot more bits, like a 32-bit int.

To follow the spec you have to denormalize (if needed) the result again after each operation, unless you can fake it exactly.

One possible idea would be to mask away (with proper rounding) the bottom bits that would have been shifted away during the conversion to external exponent range.
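
A sketch of that for binary single precision (decimal is messier, but the exponent-widening idea is the same): unpack into a sign / wide int exponent / 32-bit fraction triple, let denormal exponents drop below emin on the way in, and squeeze them back on the way out. Rounding, overflow and NaNs are omitted; the bits shifted away in pack() are exactly the ones the masking-with-proper-rounding remark above is about.

#include <stdint.h>

struct sfloat {                 /* internal unpacked form                  */
    uint32_t sign;              /* 0 or 1                                  */
    int32_t  exp;               /* unbiased, far wider than the format's   */
    uint32_t frac;              /* mantissa with the hidden 1 in bit 23    */
};

/* Unpack an IEEE single; normalize denormals into the wide exponent. */
static struct sfloat unpack(uint32_t bits)
{
    struct sfloat f;
    f.sign = bits >> 31;
    f.exp  = (int32_t)((bits >> 23) & 0xFF);
    f.frac = bits & 0x7FFFFF;

    if (f.exp == 0) {                      /* zero or denormal             */
        if (f.frac != 0) {
            int shift = __builtin_clz(f.frac) - 8;   /* bring bit 23 up    */
            f.frac <<= shift;
            f.exp = 1 - 127 - shift;       /* exponent drops below emin    */
        }
    } else {
        f.frac |= 1u << 23;                /* hidden bit                   */
        f.exp -= 127;
    }
    return f;
}

/* Repack, re-denormalizing results whose exponent fell below emin = -126.
 * Rounding and overflow to infinity are deliberately left out. */
static uint32_t pack(struct sfloat f)
{
    if (f.frac == 0)
        return f.sign << 31;                          /* signed zero       */
    if (f.exp < -126) {
        int shift = -126 - f.exp;
        f.frac = (shift < 32) ? f.frac >> shift : 0;  /* back to denormal  */
        return (f.sign << 31) | f.frac;
    }
    return (f.sign << 31) | ((uint32_t)(f.exp + 127) << 23)
                          | (f.frac & 0x7FFFFF);
}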

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

> ... one can optimize:
>
> if (a != 0 && b != 0)
>
> as
>
> if (min(a,b) != 0)

No, you (in general) cannot, since C specifies that the && operator shall only evaluate the second part if the first is true.

For some specific sequences, of which this is one, it might be possible for the compiler to figure out that it is legal to evaluate both halves, but I think you should probably rewrite the code to make that clearer!
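
Concretely, the eager rewrite is only safe when the right-hand side cannot trap or have side effects; the classic case where the && really is load-bearing is a pointer guard (my illustration):

/* Legal to evaluate eagerly: both sides are side-effect free and cannot
 * trap, so a compiler may turn this into the min()/clz() form. */
int both_set(unsigned a, unsigned b)
{
    return a != 0 && b != 0;
}

/* NOT legal to evaluate eagerly: the && is what protects the load.
 * Evaluating *p when p == NULL would introduce a fault that the
 * original program does not have. */
int nonempty(const char *p)
{
    return p != NULL && *p != '\0';
}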

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

> ... one can optimize:
>
> if (a != 0 && b != 0)
>
> as
>
> if (min(a,b) != 0)

Also, the optimisation assumes a and b are >= 0. If they are declared as unsigned, this is no problem, but for signed integers it requires extra analysis. Besides, in many cases min(a,b) is compiled into something like a [...]

> For some specific sequences, of which this is one, it might be
> possible for the compiler to figure out that it is legal to evaluate
> both halves, ...

It is not that difficult to make a (conservative) analysis that checks if expressions can be evaluated quickly without error and use this to speculatively evaluate such expressions ahead of time, such as above. So this is not where the real problem lies.

Torben

Reply to
Torben Ægidiu

If I was coding such a library I would most likely convert to an internal format with a longer mantissa and a base 256 exponent. While the numbers are being held in the internals, a few extra bytes needed for such a format would be a low price to pay for the greater speed.

A base 256 number with a longer mantissa speeds up adding and subtracting at the cost of some speed in the multiply and divide. Making the mantissa a multiple of the natural word length of the processor gets you most of that back.
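
A sketch of what such an internal format might look like on a 64-bit machine (my own illustration of the idea, not a description of any particular library): the exponent counts in units of 2^8, so aligning operands for add/subtract only ever shifts by whole bytes, and the mantissa is an integral number of machine words.

#include <stdint.h>

#define MANT_WORDS 2                  /* 128-bit mantissa, two native words */

struct bigfloat {
    int32_t  sign;                    /* +1 or -1                           */
    int32_t  exp256;                  /* exponent in units of 2^8 (base 256)*/
    uint64_t mant[MANT_WORDS];        /* big-endian words, mant[0] is high  */
};

/* Align b to the (larger) exponent of the other operand before an add:
 * with a base-256 exponent the shift count is always a multiple of 8 bits,
 * and renormalization after the operation is needed far less often than
 * with a binary exponent - the speed being traded for some multiply/divide
 * convenience.  Caller guarantees target_exp256 >= b->exp256. */
static void align_to(struct bigfloat *b, int32_t target_exp256)
{
    int shift_bits = 8 * (target_exp256 - b->exp256);
    while (shift_bits >= 64) {                           /* whole words     */
        for (int i = MANT_WORDS - 1; i > 0; i--)
            b->mant[i] = b->mant[i - 1];
        b->mant[0] = 0;
        shift_bits -= 64;
    }
    if (shift_bits > 0) {                                /* remaining bytes */
        for (int i = MANT_WORDS - 1; i > 0; i--)
            b->mant[i] = (b->mant[i] >> shift_bits)
                       | (b->mant[i - 1] << (64 - shift_bits));
        b->mant[0] >>= shift_bits;
    }
    b->exp256 = target_exp256;
}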

Reply to
MooseFET

In article , MooseFET writes:
|> On Aug 26, 7:51 pm, Terje Mathisen wrote:
|> >
|> > I.e. a sw library will almost certainly choose to work in an internal
|> > exponent format with a lot more bits, like a 32-bit int.
|>
|> If I was coding such a library I would most likely convert to an
|> internal format with a longer mantissa and a base 256 exponent. While
|> the numbers are being held in the internals, a few extra bytes needed
|> for such a format would be a low price to pay for the greater speed.

That was the choice that IBM made on the System/360.

It introduces a bit more complexity, reduces the accuracy slightly, and breaks one invariant that is assumed by many programs, too, but none of that is enough to get excited about. I wouldn't do that, but that means just that my preference is different from yours :-)

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , glen herrmannsfeldt writes:
|>
|> > I didn't say "by 1960" but "by the 1960s". The first machines I know
|> > of that had much in the way of transparent caching were late 1960s,
|> > and included several with separate I and D caches.
|>
|> I thought the 360/85 was supposed to be the first with cache,
|> which I believe was 1968. It had to be transparent, because
|> there is no description of cache in S/360.

That could well be so.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

I wonder if LabView programmers might make good VHDL/Verilog programmers?

Reply to
Joel Koltner

I am not sure what you mean by "like gcc." It is usual to write FPGA code in Verilog (or VHDL). The problem is that the thought process for writing such code is more like logic design (which it is) than software design (as for gcc).

There are people using C as a hardware description language, but to me that isn't the right way to go. It encourages the idea that you can think about hardware the same way as C programming, and even that you might port algorithms from existing C code. I think, though, that dataflow is closer to the right way to think about logic design than traditional programming languages are.

Both Xilinx and Altera now have freely available tools.

I would say not much more than you need to know about most computers to write portable and fast C code - though you probably don't think much about those details either.

-- glen

Reply to
glen herrmannsfeldt

Nick Maclaren wrote: (snip)

I thought the 360/85 was supposed to be the first with cache, which I believe was 1968. It had to be transparent, because there is no description of cache in S/360.

-- glen

Reply to
glen herrmannsfeldt
