Re: Intel details future Larrabee graphics chip


The variables were meant to be (cast to) unsigned indeed.

Actually you get:

    cmp   a,#0
    cmpne b,#0
    beq   else

Most ARM compilers can do basic stuff like this. The compilers I worked on can generate far more complex sequences of conditional execution.

The idea was to show that one can compile the && using a single branch even on ISAs that do not have conditional execution or a conditional move. If you don't have a min/clz instruction then using it as a primitive is not a good idea either.
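As a minimal sketch in C (the helper and its names are mine, not from the original code): for unsigned a and b, "a && b" is true exactly when min(a, b) is non-zero, so the whole test needs only one conditional branch:

    void and_test(unsigned a, unsigned b)
    {
        unsigned lt = (a < b);              /* 0 or 1, e.g. via a set-less-than, no branch */
        unsigned m  = b ^ ((a ^ b) & -lt);  /* branch-free minimum of a and b */
        if (m != 0) {                       /* the single branch */
            /* then-part: both a and b are non-zero */
        } else {
            /* else-part */
        }
    }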

The analysis of whether it is correct is trivial indeed, so compilers routinely optimize the &&, ||, and ?: operators. What is hard, however, is deciding whether it is beneficial without profile feedback.

Wilco

Reply to
Wilco Dijkstra


You can often use denormals unnormalized, with the exponent forced to 1. This gives the correct result for add/sub and multiply, but for division and square root it is usually better to normalize first.

When you normalize you just need an extra exponent bit to allow for negative exponents. I often bias the exponent by 1 so that the largest denormal has exponent -1 (rather than 0). This allows the overflow and underflow tests to be done using a single compare. It also makes recombining the exponent and mantissa easier.
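As a sketch of that single compare (the bounds here are purely illustrative):

    enum { EXP_MIN = -1, EXP_MAX = 254 };   /* illustrative bounds only */

    /* One unsigned compare catches both underflow (e < EXP_MIN)
       and overflow (e > EXP_MAX) at the same time. */
    int exponent_in_range(int e)
    {
        return (unsigned)(e - EXP_MIN) <= (unsigned)(EXP_MAX - EXP_MIN);
    }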

You can use a wider internal format for the calculations as long as no rounding is done to the internal format. You can only round once when you create the final result with the reduced exponent/mantissa (ie. rounding must be done on the final denormal value).

Wilco

Reply to
Wilco Dijkstra

The mantissa would normally already be 32-bit or 64-bit internally. Addition is already as simple as:

res = manta + (mantb >> (expa - expb));

I can't see how rebasing the exponent could possibly simplify this.

Wilco

Reply to
Wilco Dijkstra
+---------------
| snipped-for-privacy@rpw3.org (Rob Warnock) writes:
| |> My guess would be the rise of separate I & D on-chip caches, which
| |> naturally leads to a Harvard approach inside the CPU/ALU pipeline,
| |> while the external memory interface remains (mostly) von Neumann...
|
| Well, separate I and D caches was a well-established technology by
| the 1960s (and probably a lot earlier).
+---------------

Well, you couldn't tell it by me!! ;-} I started coding in 1965, and *none* of the machines I learned on[1] had *any* caches yet, not even the venerable DEC PDP-10 (KA10) we got in 1970 (FCS Sep. 1967) -- and in those days the -10 was used for quite significant timesharing loads! Not until the KL10 (FCS June 1975) did the PDP-10 series get any cache at all.[2]

And the first microprocessor I ran into that had separate on-chip I & D caches would have been the MIPS R3000, circa 1988.

You were clearly in a different world than I was at the time. [Not surprising, as I was nowhere near any of the centers of advanced computing research until 1980 or so.]

-Rob

[1] IBM 1410, LGP-30, IBM 1620, DEC PDP-10, PDP-8, PDP-11.
[2] I don't count the "fast registers" (accumulators) of the KA10 & KI10
    or the TLB of the KI10 as "cache" in the sense of this discussion.

-----
Rob Warnock
627 26th Avenue
San Mateo, CA 94403
(650)572-2607
Reply to
Rob Warnock

I do a lot of both and it is obvious that I should have expanded on the gcc comment.

I can get the source code for gcc. People can look at it to make sure it is right. It tends to be bug free as a result.

I don't have to relearn all of the terminology for everything and what to click where and what to never ever click to get gcc to spit out a working result from my input code.

gcc is free in that I don't have to pay for it or give up other things of value to get the right to use it. Free or low-cost tools are better from the point of view of getting a large number of people to use them.

gcc just does the task at hand. They didn't try to build an editor and web browser and e-mail client into it. It just takes in source code and makes object code. This makes my life much easier.

I tend to think in terms of the flow of the information or in logic gates depending on which model works better for the logic problem at hand.

Both the Altera and Xilinx tools are bloated with a lot of stuff that doesn't add any usefulness like a clumsy editor.

Both are only free in Windows land. Even there, they require that you apply for a "free license" to be able to use them.

Both are buggy. I spent about a month tracking down something that turned out to be a bug in Altera's "Quartus". This wasn't just the "not handling 'Z' values correctly" one, either. It was VHDL that, when compiled, produced a result which missed a term. I narrowed it down and actually found it in the output to the fitter, BTW.

If I write careful "C" for gcc, it is portable without thinking about the details of the CPU chip that runs it. If you keep it simple gcc produces very good results.

Reply to
MooseFET

It is that nasty ">>" operator, and the "if" logic you forgot to include, that makes it slow. If the processor you are working with doesn't do the shifts quickly, base 256 exponents speed things up a lot.

Consider writing a floating point package for a Z80 and see how it makes a huge difference in an extreme case.
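Roughly, in C (the 4-byte layout here is mine, just to illustrate the point):

    #include <string.h>

    /* Base-256 sketch: mantissa stored as 4 little-endian bytes, exponent
       counted in byte-sized steps. Aligning by d steps is a byte move -
       cheap even on a Z80 - instead of 8*d one-bit shifts. */
    void align_base256(unsigned char mant[4], unsigned d)
    {
        if (d >= 4) { memset(mant, 0, 4); return; } /* everything shifted away */
        memmove(mant, mant + d, 4 - d);             /* drop d low-order bytes */
        memset(mant + 4 - d, 0, d);                 /* zero-fill at the top */
    }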

Reply to
MooseFET

Even with FPGAs the code can be quite portable. Usually only a quite small portability layer, containing the PLLs and special IO pins, is needed. Most of the memories can nowadays be written portably, etc. That is comparable to the code needed to support different operating systems in C programs.

Of course some coders like to use special blocks, big IP cores from the vendors, etc. At that point the portability is not so great, but that is also comparable to the usage of OS-specific libraries in software projects.

I think the bigger obstacle in FPGA use is the parallel vs. sequential mindset of programming.

--Kim

Reply to
Kim Enkovaara

That is what I understood the requirements to be as well. But I asked Nick, as it was Nick who I thought asserted a denormal could be renormalized.

Reply to
JosephKK

Would the both of you try reading IEEE 754 please?

Reply to
JosephKK


And what actual hardware are we talking about?

Reply to
JosephKK

Not all hardware has this capability. For that matter not all software does either. Then what do you do?

Reply to
JosephKK

I would be really surprised. Functional/procedural languages are conceptually very different from logic HDLs. Though at least a few will be.

Reply to
JosephKK

In article , JosephKK writes:
|>
|> >|> Would you kindly explain to me how to normalize a denormal without
|> >|> expanding the exponent range?
|> >
|> >That's how you do it!
|>
|> Not all hardware has this capability. For that matter not all software
|> does either. Then what do you do?

Terje has already answered your question.

If your software doesn't have a large enough integer, you emulate a larger one - a standard technique.
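For instance, a 64-bit add built from two 32-bit halves is only a few lines (a sketch; the names are mine):

    #include <stdint.h>

    /* A 64-bit integer emulated as two 32-bit halves. */
    typedef struct { uint32_t lo, hi; } u64emu;

    u64emu add64(u64emu a, u64emu b)
    {
        u64emu r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
        return r;
    }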

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , snipped-for-privacy@rpw3.org (Rob Warnock) writes:
|>
|> +---------------
|> | |> My guess would be the rise of separate I & D on-chip caches, which
|> | |> naturally leads to a Harvard approach inside the CPU/ALU pipeline,
|> | |> while the external memory interface remains (mostly) von Neumann...
|> |
|> | Well, separate I and D caches was a well-established technology by
|> | the 1960s (and probably a lot earlier).
|> +---------------
|>
|> Well, you couldn't tell it by me!! ;-} I started coding in 1965,
|> and *none* of the machines I learned on[1] had *any* caches yet, ...

The IBM 370/165 did! I can't now remember the details, but that was a (very) late 1960s machine.

|> You were clearly in a different world than I was at the time.
|> [Not surprising, as I was nowhere near any of the centers of
|> advanced computing research until 1980 or so.]

Cambridge had stopped doing much hardware work by the time I arrived.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Base256 could help on CPUs with slow shifts, but only if you use a non-IEEE format. If you convert between base256 and IEEE exponent for every operation then you end up with more shifting overall. The above shift would need at most a 4-bit shift on an 8-bitter, so it's not too bad at all.

Wilco

Reply to
Wilco Dijkstra


We were talking about emulating IEEE in software, so it works on any hardware. My C implementation requires a 32-bit type, so as long as your compiler has one, it will work.

Wilco

Reply to
Wilco Dijkstra

I think you missed the point about keeping the numbers as base 256 while they are being worked on. This means that you only need to convert to and from the IEEE format on the way in and out. If you are doing an FFT, the conversion time would be small compared to the savings in the FFT.

> The above shift would need at most a 4-bit shift on an 8-bitter, so
> it's not too bad at all.

A 4-bit shift takes quite a bit of time on an 8-bitter. You have only one carry bit to transfer the bits between bytes, so it is usually faster to do four one-bit shifts. Here it is for an 8051:

    CLR C       ; Shift in a zero
    MOV A,LSB   ; Load the lowest
    RLC A       ; Shift up one
    MOV LSB,A
    MOV A,LSB+1 ; Next byte
    RLC A
    MOV LSB+1,A
    MOV A,LSB+2 ; Next byte
    RLC A
    MOV LSB+2,A
    MOV A,LSB+3 ; Next byte
    RLC A
    MOV LSB+3,A

As you can see it comes out to 13 instructions per one bit shift. This makes it well worth avoiding if you can.

Reply to
MooseFET

Yes, I know that he didn't show the right operation for the alignment before the add, but I don't think it matters to my disagreement with him, since the right one requires an extra bit of "if" logic and, depending on the exponent, the prepending of a one.
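Spelled out in C, the steps he skipped look roughly like this (a sketch with my own names; 24-bit stored fractions and normalized inputs assumed - denormal inputs would skip the implicit one):

    #include <stdint.h>

    uint32_t align_and_add(uint32_t manta, int expa, uint32_t mantb, int expb)
    {
        manta |= 1u << 23;                 /* prepend the implicit one */
        mantb |= 1u << 23;
        if (expa < expb) {                 /* the "if" logic: order the operands */
            uint32_t tm = manta; manta = mantb; mantb = tm;
            int te = expa; expa = expb; expb = te;
        }
        unsigned d = (unsigned)(expa - expb);
        return manta + (d < 32 ? mantb >> d : 0);  /* may still need renormalizing */
    }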

Reply to
MooseFET

IIRC, the 360 used a base 16 exponent and could shift nibbles quickly so it was fairly fast for its day. You could get to an integer quickly using the "add without normalizing" instruction.

There was also a very fast but inaccurate "unnatural log" routine that was paired with an "unnatural base to any power" routine, which let you do powers very quickly but inaccurately.

Reply to
MooseFET

I have never seen a case of FPGA code going between Xilinx's tools and Altera's without some very serious rewriting.

I don't see that as an obstacle at all. The sequential mindset is a learned thing. If you do FPGAs, parallel computers, random logic or design analog circuits, you think in parallel terms.

Reply to
MooseFET
