Is this Intel i7 machine good for LTSpice?

The first program that I used that had a noticeable improvement with the FPU was SPICE. There it made a huge difference. Similar applications showed the same kind of results.

?-)

Reply to
josephkk

You need to study up on Amdahl's law. It relates how often an operation occurs in the instruction and data streams to how much speeding that operation up can affect overall performance.
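A minimal sketch of Amdahl's law makes the point: the overall speedup is capped by the fraction of the workload that the faster hardware actually touches. The fractions below are made-up illustrative numbers, not measurements of SPICE.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the workload is sped up by
    `factor` (Amdahl's law); the rest runs at the original speed."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# If only 10% of the run time is floating-point, even an infinitely
# fast FPU cannot speed the program up by more than ~1.11x:
print(round(amdahl_speedup(0.10, 1e9), 2))   # → 1.11

# A SPICE-like workload that is 80% floating-point gains far more
# from an FPU that is 10x faster than emulation:
print(round(amdahl_speedup(0.80, 10), 2))    # → 3.57
```

This is why an FPU transformed SPICE but barely moved the needle on programs dominated by integer and memory work.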

?-)

Reply to
josephkk

Generating the check bits, and using them to calculate a corrected output, can in principle be handled by look-up tables - which get a bit big - and in practice are handled by logic networks which are almost as fast.

Discrete logic is almost as old-fashioned as hydraulic logic.

In practice, either solution is going to be realised in programmable logic, and the look-up table is the version that uses most gates to get the lowest propagation delay, and "logic" is the approach that trades off fewer gates against longer propagation paths that make more choices.

For special cases, well-realised logic can be as fast as a look-up table - faster if you can exploit the reduced number of gates to use bigger, faster gates to realise the relevant logic paths.
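The look-up-table-versus-logic trade-off being discussed can be sketched with a toy Hamming(7,4) code: the syndrome can come out of a small XOR network (the "logic" option) or a precomputed table (the look-up option). This is an illustrative sketch of syndrome decoding in general, not any particular chip's ECC.

```python
# Hamming(7,4): bit positions 1..7, parity bits at positions 1, 2, 4.
def encode(d):
    """d: four data bits [d1, d2, d3, d4] -> 7-bit codeword (as a list)."""
    c = [0] * 8                      # index 0 unused; positions 1..7
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # each parity bit covers the positions
    c[2] = c[3] ^ c[6] ^ c[7]        # whose binary index has that bit set
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def syndrome(word):
    """XOR together the positions of all set bits: the result is the
    position of a single-bit error, or 0 if the word is clean."""
    s = 0
    for pos in range(1, 8):
        if word[pos - 1]:
            s ^= pos
    return s

def correct(word):
    """Flip the bit the syndrome points at. In hardware this step is the
    look-up table (syndrome -> bit to flip) or an equivalent decoder."""
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1
    return word

cw = encode([1, 0, 1, 1])
bad = list(cw)
bad[4] ^= 1                          # inject a single-bit error
print(correct(bad) == cw)            # → True
```

For real 64-bit SECDED the syndrome is wider and the "table" correspondingly bigger, which is exactly where the gates-versus-delay trade-off bites.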

It's a weasel phrase, introduced to set up exactly that response.

The cells have to be tiny, so the electric charge involved is equally tiny.

The cost came in extra components, extra board area, extra pins, extra bus tracks and extra bus drivers. Extra propagation delay didn't really come into it.

I was looking at setting up a gigaword or so of random access memory to hold the lithographic data for a single layer of an integrated circuit.

It was to be accessed by a variable-aperture electron beam microfabricator, writing flashes of electrons onto the electron-beam-sensitive resist on a moving silicon wafer. The machine never got built, but we spent almost four million UK pounds on the project around 1985 and 1986. Cambridge Instruments had agreed to commercialise what Thomson-CSF had sold us as a prototype machine, which turned out to be a proof-of-principle machine, which turned out to have to be redesigned in every detail. When we were sure that it was going to cost us almost as much again to finish the project (which we could have afforded), and that it was going to tie up every engineer and programmer in the place for the next eighteen months (which we couldn't afford), we spent upwards of three million pounds buying our way out of our promises to finish the project.

The data buffer design study did get published

formatting link

J.P.Melot was a collaborator from the University of Bristol, with CERN experience, and totally brilliant, and Mike Penberth was my boss, who knew about stuff like hydraulic logic. He wasn't a particularly creative engineer, but he was utterly brilliant at working out why things weren't working, or not working right.

That could be longer ago, but I doubt it. There were micro-programmed machines around in the mid-1980's but the people who used them tended to be very specialised number crunchers.

--
Bill Sloman, Sydney
Reply to
Bill Sloman

On Wed, 12 Nov 2014 20:10:07 -0800, josephkk Gave us:

The memory is tagged with the speed it runs at.

Two sets of two sticks.

One pair with and one without ECC

The MOBO requires an ECC setting for the RAM, so the check bits are generated FOR the stick as part of the chipset's hand off methods for the memory bus.

The timing tag declarations on both sticks are identical.

Both then run at identical speeds, and all this imaginary overhead you are all quacking about is already taken into account and managed OUTSIDE the speed at which the RAM sticks are operated and touted as able to run.

The math is REAL simple. The MOBO pings each at the same rate. They run at the same rate.

ZERO difference in a running machine because BOTH are accessed at identical speeds and will benchmark that way too. Down at the nitty gritty level, there is more taking place, but it does so WITHIN the timing constraints of the declared access rate for the array.

Slower because more is being done? Maybe... down there at the nitty gritty level. Nothing we see though.

OC them till they start failing, and you might see the ECC fail more often trying to keep up. That would be the test right there. Spool up the clock and watch the errors and error corrections start wading in. Then talk about specific causation.

Reply to
DecadentLinuxUserNumeroUno

You don't seem to understand: in logic, more gates do not mean shorter delays. I can assure you that more logic is slower than less.

I'm not sure why you are bringing programmable logic into this. That is a red herring.

It was in the late 80's. Star Technologies was a spin off from Floating Point Systems. FPS decided the market wanted 64 bit floating point and Star Tech was about speed at 32 bits. They provided a machine (two rack cabinets) that did 100 MFLOPS... the second fastest floating point in the world next to the Cray. This was before DSPs were terribly useful. But it didn't last long. They pumped out a design that did 50 MFLOPS in a single 9U rack which was incorporated into GE CAT scanners. They nursed that design for continuing support for a long time. Ultimately they folded without ever producing another viable design. The day of the array processor was over.

--

Rick
Reply to
rickman

One should also remember that magnetic core as well as dynamic RAM performs a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing); for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
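The access-versus-cycle-time distinction can be modelled in a few lines. The timing figures below are assumed round numbers for illustration, not any real part's datasheet values.

```python
class DestructiveMemory:
    """Toy model of a destructive-readout memory (core or DRAM): sensing
    wipes the stored row, so every read must be followed by a writeback
    before the next access - which is why cycle time exceeds access time."""
    ACCESS_NS = 150      # assumed: time until data is valid at the sense amps
    WRITEBACK_NS = 100   # assumed: time to restore the row afterwards

    def __init__(self, bits):
        self.cells = list(bits)

    def read(self, i):
        sensed = self.cells              # sensing transfers ALL bits out...
        self.cells = [0] * len(sensed)   # ...destroying the stored charge
        value = sensed[i]
        self.cells = sensed              # writeback restores the whole row
        return value

mem = DestructiveMemory([1, 0, 1])
print(mem.read(2))                                        # → 1
print(DestructiveMemory.ACCESS_NS
      + DestructiveMemory.WRITEBACK_NS)                   # → 250
```

The data is usable after 150 ns, but the next access can't start until the 250 ns cycle completes, which is the gap the poster is pointing at.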

Putting the ECC logic into the writeback loop doesn't slow down the _cycle_ time, as long as the ECC writeback is phase shifted from the main data write back.

Of course, this does require that the ECC logic is on the same memory chip, using ECC memory bits and logic on separate chips doesn't work.

For a high radiation environment (if it makes sense to use DRAMs at all), I would put the ECC into the writeback loop so that the memory is flushed (ECC corrected) at every refresh as well as at every read access to a column. This will quickly catch single-bit errors, which are correctable, before they grow into a multi-bit non-correctable error.
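The scrub-on-refresh idea can be sketched as follows. For brevity the sketch stores each word in triplicate and majority-votes it - a stand-in for real SECDED ECC, but the scrubbing logic is the same: every refresh pass rewrites each word with its corrected value, so a single upset is healed before a second one can join it.

```python
def vote(a, b, c):
    """Bitwise majority of three copies of a word."""
    return (a & b) | (a & c) | (b & c)

class ScrubbedMemory:
    """Toy scrub-on-refresh model (triple redundancy standing in for ECC)."""
    def __init__(self, words):
        self.banks = [list(words), list(words), list(words)]

    def refresh(self):
        """Each refresh pass writes back the corrected value of every word,
        so correctable single-bit errors never get a chance to accumulate."""
        for i in range(len(self.banks[0])):
            good = vote(self.banks[0][i], self.banks[1][i], self.banks[2][i])
            for bank in self.banks:
                bank[i] = good

mem = ScrubbedMemory([0xDEAD, 0xBEEF])
mem.banks[1][0] ^= 0x0004       # a single-event upset flips one bit
mem.refresh()                   # scrubbing corrects it during refresh
print(hex(mem.banks[1][0]))     # → 0xdead
```

Without the scrub, a second upset in the same word could land before the next read and turn a correctable error into an uncorrectable one.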

Reply to
upsidedown

On Thu, 13 Nov 2014 09:19:27 +0200, snipped-for-privacy@downunder.com Gave us:

It sounds like running ECC on each column string might be better than byte, word, or actual string correction would be. And it would achieve what you said about catching single-bit errors before they become monsters in the datagrams.

Reply to
DecadentLinuxUserNumeroUno

Anything that used a compiler that could generate inline FP code would benefit enormously, but if you had a noddy compiler that just had a bunch of library routines that were either calls to the emulator or calls to FP code in a subroutine, then the benefits were much less. This old page shows the variation in different sqrt coding tricks from way back:

formatting link

The inline code was approximately 5x faster than the ordinary sqrt call.

How much benefit you got from the FPU depended critically on the quality of your compiler. You often got a bit of extra precision thrown in too, since the FP stack holds intermediate results to 80 bits.
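The kind of sqrt coding trick those old pages compared usually boiled down to Newton-Raphson iteration, which an inlining compiler could unroll right into the caller. A sketch of the iteration itself (the seed choice here is deliberately crude; real implementations used bit-level tricks to get a better starting guess):

```python
import math

def newton_sqrt(x, iterations=6):
    """Newton-Raphson square root: each iteration roughly doubles the
    number of correct bits, which is why a decent seed plus a few
    unrolled inline iterations could beat a generic library call."""
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0     # crude seed, assumed for illustration
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess

print(abs(newton_sqrt(2.0) - math.sqrt(2.0)) < 1e-12)   # → True
```

With quadratic convergence, six iterations from even this poor seed reach full double precision for moderate arguments; the inline-versus-call gap was all in the removed call overhead and the compiler's freedom to keep operands on the FP stack.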

The original Intel FPU had a few quirks in the trig functions which were found when Cyrix did a full analysis for their own numeric FPU (which was faster, more accurate and cheaper than the Intel part).

--
Regards, 
Martin Brown
Reply to
Martin Brown


Generating the check bits, and using them to calculate a corrected output, can in principle be handled by look-up tables - which get a bit big - and in practice are handled by logic networks which are almost as fast.


In practice, either solution is going to be realised in programmable logic, and the look-up table is the version that uses most gates to get the lowest propagation delay, and "logic" is the approach that trades off fewer gates against longer propagation paths that make more choices.

If I build my extra logic with ECLinPS and you build yours with 74LS, this won't be true.

If you can buy purpose-built ECC chips for your particular choice of word length, it's certainly going to be a red herring. If you have real-world requirements that don't correspond to an application that buys more than 100,000 chips per year, you are going to realise most of your system in a programmable logic device.


There were micro-programmed machines around in the mid-1980's but the people who used them tended to be very specialised number crunchers.

When I applied for my job at EMI central research in 1975, one of the job interviews was with the guys who were building the number-crunching logic for the EMI body-scanner. I knew enough to ask them whether they were going to use AMD's TTL bit-slice components or Motorola's ECL bit-slices.

At the time they hadn't made up their minds, but by the time I'd got the job (and got the security clearance that let me actually start work) they'd gone for the AMD parts. They weren't as fast, but they integrated bigger chunks of functionality. By the 1980's integrated circuits could integrate a lot more transistors, and bit-slices weren't all that interesting.

--
Bill Sloman, Sydney
Reply to
Bill Sloman

ECC correction makes more sense for longer words. 64-bit words were a sweet spot, because they could be error detected and corrected with an eight-bit check word.
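The sweet-spot claim follows directly from the Hamming bound: the number of check bits grows only logarithmically with word length, so the relative overhead shrinks as words get longer. A quick sketch of the arithmetic:

```python
def secded_check_bits(data_bits):
    """Minimum check bits for single-error-correct, double-error-detect:
    the Hamming bound 2**k >= data_bits + k + 1 gives k bits for
    correction, plus one overall parity bit for double-error detection."""
    k = 1
    while 2 ** k < data_bits + k + 1:
        k += 1
    return k + 1

# Relative overhead falls as the word gets longer:
for m in (8, 16, 32, 64):
    print(m, secded_check_bits(m))
# → 8 5
#   16 6
#   32 7
#   64 8
```

An 8-bit word needs 5 check bits (62% overhead), while a 64-bit word needs only the 8 mentioned above (12.5%) - which is why 64-bit SECDED became the standard arrangement.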

Packet-switched networks detected and corrected on whole packets, with even longer check words.

--
Bill Sloman, Sydney
Reply to
Bill Sloman


The similarities between core and DRAM are real. Early DRAM could not provide the next sequential address read because it had no registers to store it. Newer DRAM does (since EDO at least). That said, newer DRAM speeds up sequential reads over early DRAM by having those registers and not needing another complete cycle, just another data clock for the next sequential read. See the DDR series specifications. The restore part of the cycle continues unabated, so a non-sequential read after two or more added sequential reads can occur much sooner.

???

ECC is just stored on more bits of memory word width. The ECC calculations are all done on the CPU chip (both directions).

ECC can be designed to correct as many bits of the word as you want. Want 4-bit correction and detection of almost all many-bit errors? It can be done. It is a problem in optimization: how much do you pay in ECC and secondary ECC to detect everything?

?-)

Reply to
josephkk

Mine will still be faster because I'm using it in a Ferrari. What are you talking about... no, don't tell me. This conversation has gone south.

Thanks for sharing.

--

Rick
Reply to
rickman

Not sure what you are talking about. The writeback either doesn't use the ECC, just copying what is in memory... not very useful... or it has to verify the ECC and apply the correction before writing it back... or the third option is to flag an error in parallel, but then the entire write operation has to be repeated to accommodate the error correction. That extension messes up the timing and is hard to incorporate into most applications of ECC.

That's likely not a significant speed issue since the refresh is only a small portion of the memory bandwidth. But you still leave the gap of possible corruption between the last refresh and next read.

--

Rick
Reply to
rickman

Apart from the first DRAMs that used all address lines at once, all the rest have multiplexed addresses with RAS/CAS selection.

This does not slow the access. For instance, the first RAS/CAS addressed DRAM was 4096x1 bit with 64 rows and 64 columns. The high 6 bits were decoded with the RAS signal and selected one of the 64 rows. After a while, all the bits from that row were transferred to the 64 column sense amplifiers and latches.

After the low address bits were decoded with the CAS signal, it just selected one of the 64 column sense amplifier/latch bits and presented it to the data out pin. Since the DRAM cell access time was much longer than the output column select multiplexer's delay, multiplexing did not slow things much even for a single access.

Now that the 64 column bits are already in the 64 internal latches, performing several CAS cycles with different low address bits allowed fast random access _within_ a 64-bit row, just multiplexing out the selected bit, instead of doing a dynamic RAM cell access each time.

Later models had internal column address counters, allowing sequential column bit access without doing a RAM cell access after the initial row activation.
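The RAS/CAS access pattern described above can be sketched as a toy model of that 4096x1 part: RAS does the slow row sense once, and CAS reads are then just a 64-to-1 multiplexer on the already-latched row.

```python
class Dram4096x1:
    """Toy model of a 4096x1 DRAM organised as 64 rows x 64 columns,
    with RAS/CAS multiplexed addressing as described in the post."""
    def __init__(self):
        self.array = [[0] * 64 for _ in range(64)]
        self.row_latch = None

    def ras(self, row):
        """Slow step: sense all 64 bits of one row into the column latches."""
        self.row_latch = list(self.array[row])

    def cas(self, col):
        """Fast step: just select one latched bit (a 64-to-1 multiplexer);
        no cell access, so repeated CAS cycles on one row are cheap."""
        return self.row_latch[col]

d = Dram4096x1()
d.array[5][17] = 1
d.ras(5)                       # one row activation...
print(d.cas(17), d.cas(16))    # → 1 0   ...then multiple fast column reads
```

Page-mode, EDO and the DDR burst counters mentioned later are all elaborations of this same trick: reuse the latched row instead of re-sensing the array.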

Video-RAMs were very similar. All the TV-line bits were taken from 1024 column bits and then parallel loaded into a shift register clocked by the bit clock. This required a slow line select only every 64 us, so no problem with propagation delays.

Since all the column bits are available simultaneously within the chip, my point was that it would make sense to put the ECC processing within the memory chip itself.

Reply to
upsidedown


I am not buying it. It requires memory chips to bring out I/O pins to indicate "valid/no error", "error corrected", and "error, not corrected". Now if accesses are more than one chip wide, you have to combine the status bits somehow, whether or not they are shipped to the CPU/DMA/video. Also, you may want a different ECC protection profile than what the memory chip maker provides.

?-)

Reply to
josephkk

On Fri, 14 Nov 2014 20:28:48 -0800, josephkk Gave us:

snip

The motherboard is involved. I am sure as much as can be placed on the RAM 'device' (stick) itself is. The chipset manages its part. The checksum code required by whatever monitors and manages all of it (the chipset) ultimately gets generated and used for comparison. Any errors are handled by it, the RAM, and that little management code hard-wired into it all. Then it is back to square one: next refresh and compare sequencing. It should happen without missing a beat compared to similarly timed non-ECC RAM.

Seeing a bunch of errors would likely slow things, but that would also indicate a bigger problem.

Reply to
DecadentLinuxUserNumeroUno
