ECC memory module modification?

- G
- gnuarm.deletethisbit
  

Mon, Sep 24, 2018 12:05 AM

I'm not sure I understand your questions. The ECC circuit will detect and correct 1-bit errors in each 64-bit word using 8 bits of ECC code. The memory interface has 72 data I/Os for these bits and transfers two words on every clock cycle, one on the rising edge and one on the falling edge. Internally it is typical for the logic driving this to run at twice the external clock rate. Or I have seen designs that run at the same clock rate but use two lanes of processing.
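As a rough illustration of the double-data-rate arithmetic described above, the sketch below computes peak figures for a 64+8 bit bus. The 200 MHz clock is an assumed example value, and the function name is hypothetical.

```python
def ddr_peak_rate(clock_mhz: float, data_bits: int = 64, check_bits: int = 8):
    """Return (million transfers/s, payload MB/s, total bus width in bits)."""
    # One transfer on the rising edge and one on the falling edge.
    transfers_mt_s = clock_mhz * 2
    # Only the data bits count as payload; the check bits ride alongside.
    payload_mb_s = transfers_mt_s * data_bits / 8
    bus_width = data_bits + check_bits
    return transfers_mt_s, payload_mb_s, bus_width


# 200 MHz is an assumed example clock, not a value taken from the thread.
mt_s, mb_s, width = ddr_peak_rate(200.0)
print(f"{width}-bit bus, {mt_s:.0f} MT/s, {mb_s:.0f} MB/s of payload")
# prints "72-bit bus, 400 MT/s, 3200 MB/s of payload"
```

The check bits add 8/64 = 12.5% more wires and storage but do not change the payload rate.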

Is that more clear?

Rick C.

- B
- bill.sloman
  

Mon, Sep 24, 2018 2:18 AM

He is asking about making a mod to the memory module to introduce errors to verify the unit is correcting the error. The thing I'm unclear on is what type of memory module this is. I guess they can use a module in a non-ECC computer that supports the extra bits. The computer just ignores them.


r two different processors, one with and one without ECC.

it was done in discrete logic.

Around 1985. The actual error-detector and error-corrector was a single chip - either an ASIC or something programmable.

--
Bill Sloman, Sydney

- D
- Don KB7RPU
  

Mon, Sep 24, 2018 2:30 PM

Do DDR and ECC both happen internally within the memory module? In other words, by the time the memory word is presented on the mobo bus its 64 bits are free of any single bit error and "good to go?" The need to hack two memory chips on the module bugs me. Why two?

Thank you, 73,

--
Don, KB7RPU 
There was a young lady named Bright Whose speed was far faster than light; 
She set out one day In a relative way And returned on the previous night.

- J
- Jasen Betts
  

Mon, Sep 24, 2018 8:39 PM

No. DDR is an interface specification; it doesn't happen, it just is.

No again: the bus is 72 bits wide, and the correction is done inside the CPU package.

Where are you observing that?


- D
- Don KB7RPU
  

Mon, Sep 24, 2018 10:38 PM

Allow me to review the DDR specification for my own benefit. DDR is an acronym for Double Data Rate. It's a protocol that enables the memory to be read on the upbeat and downbeat of each clock cycle. "Whose clock cycle?" is a question. You lead me to believe that DDR uses the processor's "instruction" clock for its beat.

DDR specifies only 64 bits, DQ0 to DQ63:

formatting link

How does the processor read the eight additional ECC bits? Does the memory module multiplex the ECC bits on the mobo bus?

You can see the hack if you freeze frame about 25-30 seconds into the OP's video:

formatting link

The video clearly shows a +5VDC Molex connector, a push button switch, a tiny PCB "with chips on it", and about half a dozen white wires connected to two chips on the memory module. A single white wire seems connected in the vicinity of DQ48 through DQ52, but the video's too fuzzy to know for sure.

Thank you, 73,

--
Don, KB7RPU 

- D
- Dave Platt
  

Tue, Sep 25, 2018 12:35 AM

It looks to me as if the extra 8 bits are carried on CB0 through CB7.

Take a look at page 6 of:

formatting link

The 8 check bits are wired up to memory ICs, in the same way that the various DQ bits are.

- D
- Don KB7RPU
  

Tue, Sep 25, 2018 2:46 AM

U16, U36, U5, and U27 are right where they belong, in the middle. Leave it to Micron to offer decent documentation. OK, it's finally time to get serious with this hack. Here's a picture of my sacrificial mobo in its Sterilite container:

formatting link

The gray IDC cable connects to the SPD EEPROM mobo pins. It enables me to easily connect a logic analyzer to investigate.

Unfortunately, the mobo encased in Sterilite is a Desktop model that doesn't accept ECC memory. Fortunately, a client's ancient S875 mobo failed a while ago. Although a re-cap revived it, an S875 is only a 32-bit mobo. It's useless, except for a hack. :0) So the S875 mobo will replace the Desktop mobo. The S875 is already populated with ECC memory. Some bone pile scrounging will be necessary to find appropriate non-ECC memory.

God only knows why the Intel video connects white wires to two chips on the memory module. It just doesn't matter to me anymore. A DQX trace needs to be cut on both memory modules and then what? Use John's idea to open circuit the trace with a push button? That's probably too much of a kludge, no? Is a nuanced data switch better? Then there's Rick's approach to just mash the data line down to ground. This approach seems like a good starting point to me. YMMV.
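One thing worth checking before picking the "mash the data line to ground" approach is when it produces a visible error at all. The sketch below is a purely logical model, assuming the grounded line always reads back as 0; it says nothing about the electrical side (termination, drive strength). The line number 48 and the function names are hypothetical.

```python
import random


def read_with_stuck_low(word: int, line: int) -> int:
    """Model a DQ line held at ground: that bit always reads back as 0."""
    return word & ~(1 << line)


def count_visible_errors(words, line):
    """Count words whose read-back value differs from what was written."""
    return sum(1 for w in words if read_with_stuck_low(w, line) != w)


LINE = 48  # hypothetical choice of DQ line
rng = random.Random(0)
random_words = [rng.getrandbits(64) for _ in range(1000)]

print(count_visible_errors([0] * 1000, LINE))              # 0
print(count_visible_errors([(1 << 64) - 1] * 1000, LINE))  # 1000
print(count_visible_errors(random_words, LINE))            # roughly half
```

Under this model an error only shows up in words that store a 1 on the grounded line, and it is always a single-bit error, so the test pattern written to memory matters as much as the switch.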

Thank you, 73,

--
Don, KB7RPU 

- G
- geos
  

Tue, Sep 25, 2018 5:30 AM

What value of resistor are you going to use? I found something that looks like another approach to this hack on the PassMark site (they are the authors of the memory testing software that enables ECC error injection):

formatting link

thank you, geos

- D
- Don KB7RPU
  

Tue, Sep 25, 2018 2:33 PM

That's a good question about the value of the resistor. For lack of a viable alternative, a value of 4.7 kΩ suggests itself as a likely candidate. This is a case where the WC-412A twiddle box touted by Bob Pease comes in handy. Interesting how Team Group's "mash hack" only causes a bit error on CPU 1, DIMM slot 1, of the S2600CO4 mobo. The software that displays the count-up clock for this hack will use BSD in my case. It may be important for the software to exercise /all/ available memory.

Thank you, 73,

--
Don, KB7RPU 

- T
- Tom Del Rosso
  

Wed, Sep 26, 2018 8:17 PM

I think that's 7 bits. Effectively 6 to locate the error and 1 to indicate an error.

- L
- Lasse Langwadt Christensen
  

Wed, Sep 26, 2018 9:25 PM

It is 8 bits for 64 data bits, so the result can be no error, one error that can be corrected, or two errors that can be detected but not corrected.
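Those three outcomes are what a code with minimum Hamming distance 4 gives you. As a small sketch, the textbook Hamming (7,4) code is used here instead of the 72-bit code, only because all 16 codewords can be enumerated; the helper names are hypothetical. Adding one overall parity bit raises the minimum distance from 3 to 4.

```python
from itertools import combinations


def hamming74(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def extended(d):
    """Append one overall parity bit, giving an (8,4) extended codeword."""
    cw = hamming74(d)
    return cw + [sum(cw) % 2]


def min_distance(codewords):
    """Smallest number of differing bit positions between any two codewords."""
    return min(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(codewords, 2))


words = [[(n >> k) & 1 for k in range(4)] for n in range(16)]
print(min_distance([hamming74(w) for w in words]))  # 3
print(min_distance([extended(w) for w in words]))   # 4
```

At distance 3 a double error can land on the same received word as a single error in a different codeword, so it would be silently miscorrected. At distance 4 the two cases stay distinguishable.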

- T
- Tom Del Rosso
  

Wed, Sep 26, 2018 10:01 PM

Yeah well, 7 bits would do that. Didn't the PDP-11 have 16-bit words and 5 ECC bits? Then it would be 6 for 32 bits and 7 for 64.
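The 7-versus-8 disagreement can be checked against the usual Hamming bound: single-error correction needs the smallest r with 2^r >= m + r + 1 for m data bits, and double-error detection adds one overall parity bit. A minimal sketch, with hypothetical function names:

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC check bits plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


for m in (8, 16, 32, 64):
    print(m, sec_check_bits(m), secded_check_bits(m))
# 8 4 5
# 16 5 6
# 32 6 7
# 64 7 8
```

So 5, 6 and 7 bits are the correction-only counts for 16, 32 and 64 data bits, and the eighth bit on a 72-bit DIMM is what buys double-error detection.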

- T
- Tom Del Rosso
  

Wed, Sep 26, 2018 10:05 PM

If you had 8-bit memory then you'd have 4 ECC bits.

1 would be parity for all 8 bits. 1 would be parity for the odd bits. 1 would be parity for bits 2,3,6,7. 1 would be parity for bits 4,5,6,7.

- L
- Lasse Langwadt Christensen
  

Wed, Sep 26, 2018 10:53 PM

the ECC bits could also have errors

- T
- Tom Del Rosso
  

Wed, Sep 26, 2018 11:07 PM

That would count as a 1-bit error so it's correctable.

The ECC bits could have their own parity bit if that's what you're saying. But that almost doubles the delay in generating the code, and then the ECC bits would have more protection than the data.

- D
- Dave Platt
  

Wed, Sep 26, 2018 11:23 PM

As I understand it, most motherboards use a modified Hamming code. Properly done, these codes give you the same level of robustness against single-bit errors, no matter which single bit ends up being flipped (original-data or ECC), and you don't have to have a separate parity-of-the-parity-bits bit.

The specific codes are apparently chosen so that the depths of the parity-XOR trees are the same (or similar) for all of the parity bits, so the processing delays are consistent.

As far as I know, these ECCs are all "systematic" codes - that is, the original 64 data bits are passed through, intact, into the extended 72-bit codes in memory. ECCs don't have to be systematic - some interesting ones give you an ECC-encoded output that looks nothing like the original input. The original data values are recalculated during the "read and check" operation. These non-systematic codes seem to be used mostly in communications, and not in storage applications.

- T
- Tom Del Rosso
  

Wed, Sep 26, 2018 11:35 PM

I don't know if my other post shows up, but is the following not how it's done?

8-bit memory would use 4 ECC bits.

1 would be parity for all 8 bits.
1 would be parity for the odd bits.
1 would be parity for bits 2,3,6,7.
1 would be parity for bits 4,5,6,7.
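One way to test the scheme as stated is to list which checks disagree for every possible single-bit flip, data or check, and look for two flips that give the same pattern. This sketch assumes the data bits are numbered 0 to 7; the names are hypothetical.

```python
# Data bits covered by each check bit, as listed in the post (bits 0..7 assumed).
COVERAGE = [
    set(range(8)),   # check 0: all 8 bits
    {1, 3, 5, 7},    # check 1: odd bits
    {2, 3, 6, 7},    # check 2
    {4, 5, 6, 7},    # check 3
]


def syndrome_for_data_bit(i):
    """Which checks disagree when only data bit i is flipped."""
    return tuple(int(i in cover) for cover in COVERAGE)


def syndrome_for_check_bit(j):
    """Which checks disagree when only stored check bit j is flipped."""
    return tuple(int(k == j) for k in range(len(COVERAGE)))


def find_collisions():
    """Return pairs of single-bit flips that produce the same syndrome."""
    seen, collisions = {}, []
    positions = [("data", i) for i in range(8)] + [("check", j) for j in range(4)]
    for kind, idx in positions:
        s = syndrome_for_data_bit(idx) if kind == "data" else syndrome_for_check_bit(idx)
        if s in seen:
            collisions.append((seen[s], (kind, idx), s))
        else:
            seen[s] = (kind, idx)
    return collisions


print(find_collisions())
# [(('data', 0), ('check', 0), (1, 0, 0, 0))]
```

With this assignment a flip of data bit 0 and a flip of check bit 0 look the same, so the decoder cannot tell whether to correct the data. The usual Hamming layout avoids this by giving every data bit a position number with at least two bits set, which leaves the power-of-two positions for the check bits.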

- D
- Dave Platt
  

Wed, Sep 26, 2018 11:49 PM

There are a bunch of ways to do it, with different tradeoffs.

formatting link

has a good overview... click through to the "Hamming (72, 64)" slide #17 and thereafter.

Slide 20 shows the Hsiao enhanced (72,64) ECC. Each parity bit is the result of XORing 26 data bits. The writeup on this claims some advantages, e.g. "simpler to implement at silicon with reduce[d] gate count."
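To make the "26 data bits per parity bit" figure concrete, here is a minimal sketch of a Hsiao-style (72,64) code. The column selection below is one possible choice (all 56 weight-3 columns plus 8 weight-5 columns picked to balance the rows). It is an assumption for illustration, not the matrix from the slides or from any vendor.

```python
from itertools import combinations

R = 8  # number of check bits


def mask(bits):
    """Build an integer with the given bit positions set."""
    m = 0
    for b in bits:
        m |= 1 << b
    return m


# Odd-weight columns: 56 of weight 3, plus 8 of weight 5 formed as the
# complements of the cyclic triples {i, i+1, i+2} mod 8 (assumed selection).
weight3 = [mask(c) for c in combinations(range(R), 3)]
weight5 = [0xFF ^ mask((i % R, (i + 1) % R, (i + 2) % R)) for i in range(R)]
DATA_COLS = weight3 + weight5
CHECK_COLS = [1 << j for j in range(R)]


def encode(data: int) -> int:
    """Return the 8 check bits for a 64-bit data word."""
    check = 0
    for i, col in enumerate(DATA_COLS):
        if (data >> i) & 1:
            check ^= col
    return check


def decode(data: int, check: int):
    """Return (status, data after any correction)."""
    syndrome = encode(data) ^ check
    if syndrome == 0:
        return "ok", data
    if bin(syndrome).count("1") % 2 == 0:
        return "double-error detected", data      # even weight: two flips
    if syndrome in DATA_COLS:
        return "corrected data bit", data ^ (1 << DATA_COLS.index(syndrome))
    if syndrome in CHECK_COLS:
        return "corrected check bit", data
    return "uncorrectable", data


row_weights = [sum((col >> j) & 1 for col in DATA_COLS) for j in range(R)]
print(row_weights)  # [26, 26, 26, 26, 26, 26, 26, 26]
```

Because every column has odd weight, any single flip gives an odd-weight syndrome and any double flip gives an even-weight one, so the decoder needs no separate overall parity bit. Equal row weights mean every check bit has an XOR tree of the same size.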

Lattice, Marvell, and Intel all have (slightly different) ways of implementing this approach... I don't know whether the differences have technical advantages, or are a matter of patent-generation and -avoidance.

- L
- Lasse Langwadt Christensen
  

Wed, Sep 26, 2018 11:50 PM

On Thursday, September 27, 2018 at 01:24:34 UTC+2, Dave Platt wrote:

"Due to the limited redundancy that Hamming codes add to the data, they can only detect and correct errors when the error rate is low. This is the case in computer memory (ECC memory), where bit errors are extremely rare and Hamming codes are widely used. In this context, an extended Hamming code having one extra parity bit is often used. Extended Hamming codes achieve a Hamming distance of 4, which allows the decoder to distinguish between when at most one 1-bit error occurs and when any 2-bit errors occur. In this sense, extended Hamming codes are single-error correcting and double-error detecting, abbreviated as SECDED."

- D
- Don KB7RPU
  

Thu, Sep 27, 2018 12:29 AM

Xil SECDED for N bits of data requires K parity bits to be stored with the data where: N