Machine check while reading DMA Status Registers

I am repeatedly seeing a machine check when reading one of the DMA
Status Registers (DMASRx).  Our board has an MPC8248 processor running
Linux 2.6.26. We are using PCI DMA as described in section 9.13 of the
MPC8272 PowerQUICC II Family Reference Manual. The machine check
occurs after our system has been running for anywhere from 8 to 50
hours, during which time the status registers are read successfully
billions of times. I am using all 4 DMA engines in chaining mode.

Has anyone else seen this?  Any suggestions on how to debug it?

Re: Machine check while reading DMA Status Registers

Not this and not on the same platform, but similar experience
on the MPC5200B.
When pushed harder - FTP downloading at 100 Mbps - some silicon
timing problem comes up and corrupts the ATA (IDE, while writing
to disk) transfer. It happens once or twice per gigabyte - and that
is after I discovered another bunch of timing issues they had on
the chip and worked around them to get it working that well.
Another benefit of doing it all in-house (SDMA microcode included)
was that the 1-2 errors/GB turned out to be recoverable: I simply
retry the faulty disk write, and at that low rate the retries go
unnoticed.
 When I asked their support what to do about the problem, I was
told that Linux did not use DMA for ATA, so could I not also
use PIO please..... Turned out I had been right never to even
consider Linux or some other mickey-mouse stuff for real work.
 Bottom line - today's SoC parts are way too complex for the
silicon manufacturers to test 100%. They just don't have the
software to do it - the chip gets released and is tested in the
field by customers.
 Obviously there will be hardware issues - even Freescale will
have them (and I doubt there is a vendor who delivers SoC silicon
with fewer bugs/gate than they do, in fact I doubt anyone matches
them above a certain complexity).

 If you can understand what causes the failure - it may well be
something other than a silicon error, I am just fresh out of
wrestling some of these so that is where my mind goes - then you
can work around it in software. Especially if you get the machine
check exception only for that reason, you can abort and retry the
corrupted transfer or something similar.
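
 A minimal sketch of the abort-and-retry idea, in C. Everything here
is hypothetical and for illustration only: retry_transfer, the
xfer_status codes and the fake_transfer simulation are made-up names,
not part of any real driver or of the MPC8248/MPC5200B code discussed
above. Catching the machine check and reinitialising the DMA channel
are not shown; the sketch only covers the bounded retry loop around a
transfer that can report a fault.

```c
/* Hypothetical status codes; not taken from any real driver. */
enum xfer_status { XFER_OK = 0, XFER_FAULT = -1 };

/* Callback that performs one attempt of a transfer. */
typedef enum xfer_status (*xfer_fn)(void *ctx);

/*
 * Run fn up to max_attempts times, stopping at the first success.
 * Returns the number of attempts used on success, or 0 if every
 * attempt failed (or max_attempts was not positive).
 */
static int retry_transfer(xfer_fn fn, void *ctx, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (fn(ctx) == XFER_OK)
            return attempt;
        /* A real driver would abort and re-arm the channel here. */
    }
    return 0;
}

/* Simulated transfer that fails a configurable number of times first. */
struct fake_xfer {
    int failures_left;  /* attempts that will still report a fault */
    int calls;          /* total attempts made so far */
};

static enum xfer_status fake_transfer(void *ctx)
{
    struct fake_xfer *f = ctx;

    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return XFER_FAULT;
    }
    return XFER_OK;
}
```

 Keeping the attempt count bounded matters: a fault that is rare and
transient is hidden by the retry, while a persistent one still
surfaces as a failure instead of looping forever.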
 Boy, I had not written such a long post in a long time :-).


Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------ /
