Intel-Altera, again

On 02.06.2015 at 17:49, John Larkin wrote:

What color is the sky on your home planet?

clumsy, microcoded, register-poor

Try to inform yourself about what register renaming is. No one has more.

You should have noted that Intel is not in the x86 business; Intel is in the best-revenue business. They have dropped everything that could lead them astray, and that was good and made them #1. Where are the old DRAM heroes that Intel left this market to? Where is NEC, Fujitsu, Hitachi, Mostek, TI DRAM? Is Micron still prospering? Intel still has an architectural-level ARM license; they may develop their own designs, not just buy IP black boxes like most others.

I tried to get info on configuration memory scrubbing and triple module redundancy on Virtex; that was impossible until I hid behind my customer's customer, who were big enough. But still no copy of XAPP-186.

That is needed only by Apple, just to be incompatible. Nothing for the masses that could not be done with USB. And USB also stems from Intel. Bring volume that is worth the support. When I was at Infineon fiber optics, Intel and even Maxim used to call and ask what we might need. And Xilinx can do remarkable prices when you design a CPU board for a cell phone base station, with millions to be built to cover this planet.

I have been a Xilinx user since the XC2064. It is enough now. With Altera on leading-edge Intel processes, Xilinx loses its advantage of being the pilot TSMC customer for new processes.

Gerhard

Reply to
Gerhard Hoffmann

If you have a reasonable number of registers, and they don't have bizarre restrictions on their uses, you don't need to do insane pipeline-tangled renaming and scoreboarding and stuff.

x86 is a 40-year old architecture. Times have changed, but Intel hasn't.

The iAPX 432 and Itanic (like the HP 3000 and others) were attempts by semi-academics to construct byzantine microcoded instruction sets. They failed because they were too slow. Intel, the best IC fabber in the world, somehow doesn't get it. I think the management and culture is so emotionally attached to x86 that they will watch as ARM drains their life away.

Sad day that IBM went with Intel and not the 68K. The 68K maps nicely onto a RISC implementation, i.e. ColdFire. But genuine on-purpose RISC is the real winner.

That probably won't continue, at least as regards CPUs. Their best future may be in fab, which they are good at. They are already doing contract fab, unthinkable ten years ago.

Is Intel the next Kodak?

I thought they let their ARM license lapse. The Altera acquisition gets it back.

Xilinx assigns metallics to customers. Some get gold support, some silver, some bronze. We are lead or coal or dirt. We're allowed to direct questions to the distributor, who can't answer them.

We like Altera; I just hope Intel doesn't ruin them.

--

John Larkin         Highland Technology, Inc 
picosecond timing   laser drivers and controllers 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

Cell phones are the new DRAM. They were hot and very profitable for a while, but now they are much more of a commodity, and they will be even more so in the future.

I guess I shouldn't be too negative on phones. It is more like they are following the PC curve than the DRAM curve. But they will top out before too much longer, with little growth other than what can be had from innovation (think iPhone).

Lol. That is very true. Isolation from all but the top customers is their goal and it works well because you can't deal directly with billions of customers.

I am sure Altera will remain largely intact. It is just too much money to pour into a hole that you are just going to fill in again. But it all depends on where the dollars end up coming from. I feel the FPGA market is topping out with the classic comms customers. But I haven't done much research into it as this is not an area that affects me.

I think FPGAs are no longer innovating, and we will be seeing stagnation in new products, other than the likelihood that they will finally start offering more and more integrated hardware such as Ethernet and other buses to allow them to integrate into other mainstream products. In essence, the FPGA fabric will become another peripheral in the SoC and/or on the system board. But that will take a few years.

--

Rick
Reply to
rickman

Yeah, they subcontracted the sneering.

How am I supposed to know if I can use 20,000 chips a year (that was the threshold) if I can't see the data sheet, until....

When Intel was founded, a VAX MIP worth of CPU cost over a million dollars in today's money. Now it costs a few cents. Factor of 1e8 roughly.

People don't need more compute power for Facebook and Twitter and texting. They want cheap and low power.

If all Intel can sell is heat-sunk gigaflops for big bucks, they are doomed.

--

John Larkin         Highland Technology, Inc 
picosecond timing   laser drivers and controllers 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

Actually, you /do/ need the register renaming and scoreboarding stuff, even if you have a solid number of registers.

When you have lots of logical registers with one-to-one mappings to physical registers, it is up to the compiler to schedule instructions and organise register usage. For bigger functions and those doing a large amount of calculation, that's okay - but smaller functions tend to re-use the same registers. Making good use of anything beyond about 32 registers becomes intractable for compilers - and for the majority of code, 16 registers is good enough. If you compile a module with two exported functions "foo" and "bar", the compiler does not know whether these will be used as "foo(x); bar(y);", or "foo(x); foo(y);", or whatever combination is actually run - thus it cannot pick optimal register allocations to allow the code to run with minimal scheduling delays.
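A minimal C sketch of that separate-compilation limitation, reusing the "foo" and "bar" names from above (the function bodies are invented purely for illustration):

/* module.c - compiled in isolation, with no view of any caller. */

int foo(int x)
{
    /* Register choices and the instruction schedule for foo are fixed here. */
    return x * 3 + 1;
}

int bar(int y)
{
    return y * y - 7;
}

/*
 * Whether a caller later runs "foo(a); bar(b);" or "foo(a); foo(b);"
 * back to back is unknown at this point, so the compiler cannot tune
 * the register allocation or scheduling for the interleaving that will
 * actually occur at run time.
 */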

(This is one of the reasons why Itanium is a disaster for most work - the idea was that the compiler would handle scheduling, multi-issuing and register allocation at compile-time, and it failed miserably for most purposes.)

When you have lots more physical registers than logical ones, the cpu can make the re-ordering and scheduling decisions at run-time. That means it can optimise for the code being run in the order it is run, which gives better results than a compiler trying to guess in advance which functions are used in combination. But such re-ordering is expensive in die space and power consumption. And if you have too few logical registers, a significant part of the instruction stream is taken up by shuffling data between these logical registers and memory (especially the stack), while significant die space is used to get fast read and write buffers to avoid the impact of this while still making sure caches and memory are consistent for all bus masters.

The best performance, therefore, is achieved by having plenty of registers (at least 16, but no more than 32) with orthogonal instruction sets (little or no restrictions on the use of those registers), /and/ register re-naming and scoreboarding. This lets the compiler do a decent job, minimises the complications for the hardware, but still lets the cpu do the final scheduling re-arrangements at run time.

Intel has changed over time, but the x86 architecture is stuck with requiring compatibility, which keeps it limited. The trouble is that the original x86 design was a rather poor ISA, using old techniques and outdated ideas when it was first introduced. Thus it is an 8-bit architecture that has been extended and enhanced - it's like building a pyramid upside down. (Compare this to its competitor, the 68k - this was designed from the start as a 32-bit ISA even though the first versions worked as 16-bit internally to reduce costs. The 68k was designed to be forward-compatible, rather than backward-compatible.) Modern x86 cores are fantastic pieces of engineering, built on this poor base. They show that you can, in fact, polish a turd.

That's not why the "Itanic" failed. It had several flaws, but I think the biggest was that it was designed with the idea that the scheduling, multi-issue instruction packing, and register allocation would all be handled at compile time. Intel (and HP) simply assumed that compiler technology would improve to take advantage of the Itanium. Unfortunately, reality got in the way - it turns out that after an incredible development effort, compilers could efficiently use only a fraction of the Itanium power, and even then for only a small percentage of source code. A few tasks could be handled well, but most code simply could not use the cpu core well.

But since the chip was designed to be running multiple instructions in parallel all the time, with an emphasis on consistent throughput rather than power efficiency, these unused units were all burning power all the time - they are famed for their power requirements.

Another key flaw was the 686 compatibility mode for running "legacy" code. This was an afterthought, to help users transition to native code programs, so little effort was put into its speed. But in reality customers had a substantial amount of 686-compiled programs that they could not get in Itanium versions (a classic chicken-and-egg problem), and the Itanium ran them at a quarter of the speed of then-current x86 cpus.

Yes indeed. And it is also sad that the Coldfire is pretty much dead now - Freescale could not keep up the momentum with PPC, ARM, /and/ Coldfire cores. Something had to give, and it was Coldfire.

Reply to
David Brown

Interesting point of view.

I have written quite a few assembler programs for the PDP-11 (8 general-purpose registers, 6 actually usable) and VAX (16 registers, 12 usable, or only 9 if you used character string or packed decimal instructions).

In most cases 6 registers were more than enough; I couldn't have used 16 registers more effectively. It is surprising that the compiler can handle only that limited range of registers.
Reply to
upsidedown

Compilers can /handle/ as many registers as they can get (well, compilers like gcc and llvm anyway). But for most code, there simply isn't any use for large numbers of registers - 12-16 general-purpose registers are usually sufficient, and extra ones are only marginally helpful. Sometimes you have code with a lot of calculations where it can be useful to do a bit of loop unrolling and pass data through different registers to make sure that all the instruction latencies and pipelining overlap optimally (especially when you have a super-scalar cpu, use some multi-cycle instructions, or need to read data that might not be in the cache). But I have not seen indications or claims that more than 32 registers are helpful in all but the most specialised of functions.
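A hedged sketch of that unrolling idea - splitting an accumulation across several independent registers so that multi-cycle latencies can overlap (the function name and the unroll factor are just illustrative):

#include <stddef.h>

/* Sum with four independent accumulators so that consecutive adds do
 * not all depend on the same register; a super-scalar cpu can keep
 * several of them in flight at once. */
double sum4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}

Note that floating-point addition is not associative, so a compiler will only make this transformation itself if you allow relaxed FP maths; written by hand, it keeps four registers live instead of one.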

Note that even if your functions look small and simple, with such limited data that they couldn't possibly use more than a few registers, function inlining and other inter-procedural optimisations can often make use of more registers.

When you are writing assembly manually, it becomes increasingly difficult to track large numbers of registers and to make sure they are all being used optimally. The tendency is to simply assign local variables or temporaries to registers. This works quite well on simple cpus that handle one instruction at a time. But once you are dealing with more complex cpus that can track more than one instruction in operation, it becomes less than ideal. Tracking complex register allocation manually across loops, branches, etc., quickly becomes impractical for a human - especially when changes in one part of the code can have knock-on effects later on. This is one of the key reasons why a good compiler will generally produce faster code than a good assembly programmer on a modern cpu for most code. (Hand-written assembly may still be best for very specialised parts.)

Reply to
David Brown

Moreover, in many applications one does not even need to know any HDL in order to use them efficiently. For example, Altera has recently introduced the necessary driver pack to translate OpenCL input into their FPGA bitstreams, which means that, among other things, our GPU-based databases would run directly on the FPGA accelerator cards.

Best regards, Piotr

Reply to
Piotr Wyderski

We are thinking along the same lines. Think of a geo map lookup. We can compile the whole city into an FPGA. It hardly needs to be changed, perhaps once a year. Most of the time, we need to compute the distance from one point to all entries. It can certainly be done all in parallel.
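Roughly the kind of computation being described - one query point against every stored entry, with each distance independent of the others (the flat arrays and the squared-distance metric are just assumptions for the sketch):

#include <stddef.h>

/* Squared Euclidean distance from one query point to all entries.
 * Each iteration is independent, which is what makes the job a natural
 * candidate for evaluating in parallel on an FPGA or GPU. */
void distances_sq(const float *xs, const float *ys, size_t n,
                  float qx, float qy, float *out)
{
    for (size_t i = 0; i < n; i++) {
        float dx = xs[i] - qx;
        float dy = ys[i] - qy;
        out[i] = dx * dx + dy * dy;
    }
}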

We need socket J (771 pads) Xeons with FPGA.

Reply to
edward.ming.lee

For a machine with a human-friendly instruction set (PDP-11 and 68K, for example), people can presumably write an important subroutine better than a compiler can. You're right, 4 to 6 registers is usually what you need to write nice fast code.

--

John Larkin         Highland Technology, Inc 
picosecond timing   laser drivers and controllers 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

The second ZYNQ project that we did had the hottest FPGA we've ever seen. A real finger-burner, and the box would hang up at about 70C ambient. We had to add the heat sink, the fan, and the silly adapter plate.

formatting link

formatting link

The max FPGA clock is only 128 MHz, and much of it runs at 64, so we were surprised at the power dissipation. There are a lot of busy sinc3 and FIR filters. Maybe the two 600 MHz ARMs use a lot of power. It would be interesting to investigate, but the fan was a pragmatic fix, out of the project critical path.

--

John Larkin         Highland Technology, Inc 
picosecond timing   laser drivers and controllers 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

What's the power dissipation of this chip with the tiny heat sink? It's tiny compared to the massive copper-piped aluminum heat sink of the 80 W Xeon.

When you get to equivalent processing power, ARM uses just as much power. Yes, x86 uses more, but ARM is catching up as well. For the Xeon chip, much of it also comes from the 4 MB to 6 MB cache.

Reply to
edward.ming.lee

John, I am subscribed to this group as a hobbyist, and the quoted posting was about what I do professionally. These two activities are completely separate, i.e. I neither design nor build the hardware that runs the high-performance apps. We order it from external vendors, but currently it is mostly based on GPU chips. So it is the vendor's problem to solve the heat dissipation issues. My team expects a working device with an agreed interface and does the software part. We do not use (directly) the high-speed bit-level communication interfaces the FPGA vendors are so proud of; all we need is the number-crunching capacity.

I just wanted to say that we are more than eager to make use of such devices:

formatting link

when/if they approach our perimeter and outperform the GPUs. :-)

Best regards, Piotr

Reply to
Piotr Wyderski

In stack processors with no registers a similar effect is seen. Typical stack usage is under 8 entries, and even with complex programs a 16-level stack is usually sufficient. Of course we aren't talking about C code with stack frames. This is a stack machine programmed in a stack-oriented language.
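A tiny C model of such a data stack, just to make the depth argument concrete - the 16-entry limit mirrors the figure above, everything else is invented:

#include <assert.h>

#define STACK_DEPTH 16          /* the "16-level stack" from the text */

static int stack[STACK_DEPTH];
static int sp;                  /* items currently on the stack */
static int max_sp;              /* high-water mark */

static void push(int v)
{
    assert(sp < STACK_DEPTH);   /* overflow would mean 16 was not enough */
    stack[sp++] = v;
    if (sp > max_sp)
        max_sp = sp;
}

static int pop(void)
{
    assert(sp > 0);
    return stack[--sp];
}

Instrumenting push with a high-water mark like max_sp is one way to check the claim that typical programs stay under 8 entries.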

--

Rick
Reply to
rickman

The heat sink alone reduces chip temp a few degC. The pin fins are silly in free air. We're gaining at least 25C with the sink and the fan. I wanted to run the oven temp up until the box failed, but certain parties didn't want to go past 100C.

I don't know the chip power dissipation; it's not immediately easy to measure.

Actually, maybe it is; I'll give it a try.

--

John Larkin         Highland Technology, Inc 
picosecond timing   precision measurement  

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

If you allow recursion (or an unlimited number of nested while() loops, which is equivalent), you need stack frames of some kind.

Best regards, Piotr

Reply to
Piotr Wyderski

I used to think that Intel could afford the best engineers on the planet, or at least ones good enough to understand how their own products really work under the hood. The above example shows that this is not always the case. Subscribe to comp.arch; there are at least two designers of x86 chips from Intel and AMD there, and I believe they'll gladly explain the details to you.

Today the ISA doesn't matter, as it is either emulated in hardware (which is the case for the top-performance processor families), or buried deeply under a JavaVM of some kind, or both. There is a small group of people interested in (and being paid for) getting over-the-top performance figures, but this is a niche, and even we no longer do that the "conventional" way. Usually it is done by heavy-weight SIMD vectorization, low-overhead parallelism and synchronization techniques, and a move towards GPGPU. Hardcore cases are approached with reconfigurable computing (with automatic compilation of domain-specific languages into HDL), but that is still in its infancy and probably heavily overhyped. Single-thread performance is no longer important, and that is the constant factor behind the (alleged) design flaws you enumerated.
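For what it's worth, the kind of SIMD vectorization mentioned above looks roughly like this - a minimal SSE sketch, assuming an x86 target and an element count that is a multiple of 4:

#include <xmmintrin.h>          /* SSE intrinsics */

/* out[i] = a[i] + b[i], four floats per instruction. */
void add_f32(const float *a, const float *b, float *out, unsigned n)
{
    for (unsigned i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    /* unaligned loads */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}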

Which is usually called good engineering...

"Ye shall know them by their fruits"

Where are their competitors today? Do they even exist?

Best regards, Piotr

Reply to
Piotr Wyderski

Recursion is equivalent to iteration, no? I think this was discussed in the Forth group once, and the result was that there are *very* few algorithms that actually need recursion. If you want unlimited anything, that is impossible unless you are using a Turing machine, perhaps (unlimited memory).

--

Rick
Reply to
rickman

I read somewhere that Intel sells around 1% of the CPUs made. The big architectures are ARM and MIPS.

--

John Larkin         Highland Technology, Inc 
picosecond timing   precision measurement  

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

To an *unbounded* nesting level of iterations. A program with k levels of nested loops can be rewritten into a program with k levels of recursive calls (and vice versa).
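A small C illustration of that equivalence - the same depth-first walk written once with recursive calls and once with a loop plus an explicit stack (the tree type and the 64-entry bound are invented for the example):

struct node { int value; struct node *left, *right; };

/* Recursive form: each nested call gets an implicit stack frame. */
int sum_rec(const struct node *t)
{
    if (!t)
        return 0;
    return t->value + sum_rec(t->left) + sum_rec(t->right);
}

/* Iterative form: the same traversal, with the frames made explicit as
 * entries on a manually managed stack. A deep enough tree would still
 * need more than 64 entries - the bookkeeping does not go away, it
 * just moves. */
int sum_iter(const struct node *t)
{
    const struct node *stack[64];
    int sp = 0, total = 0;

    if (t)
        stack[sp++] = t;
    while (sp > 0) {
        const struct node *n = stack[--sp];
        total += n->value;
        if (n->left)  stack[sp++] = n->left;
        if (n->right) stack[sp++] = n->right;
    }
    return total;
}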

Nothing needs recursion and everything needs recursion; it's a matter of interpretation. But recursion is extremely convenient, and there is rarely even the desire to replace it with iteration if the problem becomes sufficiently complex. In the simple cases -- yes, of course it should be done. Even here it depends on the ROI factor.

Sure, but even a 2 MiB default Windows stack is in practice a way better approximation of the concept of infinity than the mentioned 16 entries. Sometimes you *must* have more, so there must be a way to get more. Whether you get it via stack frames or by heap allocation is not really important. If your working area is limited and small, so must be the programs.

Best regards, Piotr

Reply to
Piotr Wyderski
