CentOS Dell PowerEdge 830 PCIe Training Error & SATA port 0 not found

Anyone have experience with these 2 CentOS Dell PowerEdge 830 errors? (1) PCIe Training Error: Embedded Bus#00/Dev#1C/Func#4 (2) SATA port 0 not found

I'm in a small training class where the teacher's old computer died. I told her I'd look at it; those are the two errors on the screen:

(1) [photo link]

(2) [photo link]

Opening the case, I see only this card in a long slot on the motherboard.

[photo link]

I don't know what the card does, but it has a SATA cable to each of 4 HDDs.

[photo link]

It seems disk 0 of the four disks is an "unknown device" for some reason.

[photo link]

Only 3 of the 4 "Arrays" are found (What is an array? Is that a disk?)

[photo link]

Do you have debugging advice that I can give to this teacher for her Dell?

Reply to
Oliver Wilson


My first inclination is to think that the card is a RAID controller and that the first disk has failed. Part of the "conversation" is to identify each device to the controller. These are smart devices these days. I would check disk 0. The disk itself is probably OK, but its controller may have failed.

Reply to
dansabrservices

The RAID card seems to have four SATA devices attached, apparently all working, which are formed into three arrays:

Since there are two 80GB ST380013 disks, those are both likely members of the 74GB RAID 1 array #2.

The single 500GB ST3500320 is likely a single-disk volume, the 465GB array #1.

And the single 1TB WD1003FZEX is likely the single-disk volume, the 931GB array #0.

So I'd say the physical and logical disks are fine, and that at some point the RAID controller is talking to them. The issue seems to be that the server is sometimes having trouble negotiating the PCIe link to the RAID card.

If you're lucky, removing and re-seating the PCIe card will fix it, in case it's just a loose contact; but other people reporting this error seem to have had either failed capacitors or a mismatch of PCIe generations between the card and the motherboard.

Reply to
Andy Burns


I am confused.

The PCIe slots are the black unused slots, I think; there is only one card installed, which is the RAID card you identified.

I did move the RAID card from the leftmost long white slot to the rightmost long white slot and that "helped".

Both errors remained, but at least the machine booted to CentOS after I made that switch (I also reseated all the cables, blew the dust out, and rebooted, so any number of things could have allowed the machine to boot to CentOS).

My main confusion about the SATA 0 unknown is whether it's the 1TB disk that's bad, or the RAID card that is bad.

I seem to see you saying that the 1TB drive is actually good? Did I understand that correctly?

If the 1TB drive is likely good, then are you saying the RAID card is likely bad?

Reply to
Oliver Wilson

I think the message about SATA port0 is referring to a SATA port on the motherboard, not a SATA port on the RAID card.

One of your photos shows the RAID card reporting that all four drives and all three arrays are good, so the 1TB drive is most likely fine.

Hopefully the RAID card is good too; after all, that same photo shows it having detected the drives and reporting the arrays as optimal.

I would move the card back to the slot it was in. Different PCIe slots can have different numbers of "lanes", and the PCIe training error you show means the motherboard and the PCIe card were unable to agree on the number of lanes to use.
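Once the machine boots, one way to see what link the card actually negotiated is lspci (from pciutils). A quick sketch; the bus address 04:00.0 below is just a placeholder for whatever address your RAID card shows up at:

  # Find the RAID card's PCI bus address
  lspci | grep -i raid

  # Compare the link the card supports (LnkCap) with the link it
  # actually trained to (LnkSta); substitute the address found above
  sudo lspci -vv -s 04:00.0 | grep -E 'LnkCap|LnkSta'

  # If LnkSta shows a narrower width or lower speed than LnkCap
  # (e.g. x1 instead of x4), the link isn't training properly.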

Reply to
Andy Burns

That's the way it looked to me, as well. The SATA ports and drives on the motherboard are normally handled by the motherboard chipset and the BIOS.

The motherboard BIOS doesn't deal directly with the ports on the add-on card. These are the responsibility of the card's own on-board BIOS - the resulting drives/volumes are registered as drives, but not as "ports" per se.

Simply unplugging, and then re-seating a controller card can often be effective at resolving problems like this. Not always, but it sometimes works.

Make sure that the card is properly seated in the slot, both when you first plug it in, and after you screw the card bracket to the case!

I've seen plenty of situations in which a bent bracket, or a case having slots of a funny size, or a bit of obstruction at the bottom of the card slot where the bracket "finger" fits in, is enough to cause the act of "screwing down" the card to actually flex the card upwards a bit out of the PCI or PCIe slot. Even if it works OK at first, the card sometimes works its way upwards a bit further and the slot connection becomes intermittent.

Reply to
Dave Platt

Do you have, or can you get, a Linux system that you can use to check the disks? It might have a spare SATA connector you can connect the disk being tested to or, easier, you could use a USB-connected disk dock that you can slot the disks being tested into.

If so, try these three tests, all to be run with the disk powered up but not mounted.

- 1 (quick) run gparted to look at the disk partitioning. Are any errors reported? Does the partitioning scheme look sensible and is it the same on mirror disks?

- 2 (slower) run "fsck -p" against each partition on each disk. If any errors are reported, try using fsck to repair the failing partition(s).

- 3 install smartmontools if it isn't already installed and use smartctl to see how many hours each disk has run and what prefailure and/or failure indications each of them shows (example commands for all three checks are sketched below)
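A minimal sketch of those three checks from a shell, assuming the disk under test shows up as /dev/sdb (parted is used for the partition-table check, since it prints the same information gparted displays):

  # 1. (read-only) print the partition table; errors are reported here too
  sudo parted /dev/sdb unit GB print

  # 2. check a filesystem; -p ("preen") makes only safe automatic repairs
  #    repeat for each partition: sdb1, sdb2, ...
  sudo fsck -p /dev/sdb1

  # 3. SMART data: overall verdict, then the full attribute table
  #    (Power_On_Hours, Reallocated_Sector_Ct, etc.)
  sudo smartctl -H /dev/sdb
  sudo smartctl -A /dev/sdb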

I've had good quality (Fujitsu and Western Digital) disks fail at around 40-50k hours and cheap consumer crap fail at 3000 hours.

If those tests show the disks are OK, THEN you should suspect the RAID controller.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

If (some of) the disks have RAID metadata on them, be very careful attaching them to non-RAID SATA ports ...

Reply to
Andy Burns

I don't 'do' RAID (never needed it outside RAID 1 on Tandem NonStop and Stratus fault-tolerant systems), but apart from suggesting gparted or fsck repairs (which the OP can easily ignore), everything else I suggested is, or should be, read-only. How could read-only checks mess up RAID metadata?

Colour me genuinely puzzled: an explanation would be appreciated.

What I described (looking at what gparted, fsck and smartctl have to say) is no more and no less than what I do routinely to hopefully spot failing disks before they break.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Read-only checks wouldn't, but an inexperienced user could accidentally write something. And if the disks are from a RAID system, the partitions probably don't start where partition tools will be looking for them, so you'd get the false impression that there are no valid partitions on the disks.

Reply to
Andy Burns

s/partitions/file-systems

Generally to inspect RAID disks that aren't attached to their RAID controller, you need special software, e.g.
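On Linux, one such tool is mdadm: its --examine mode is read-only and reports any RAID metadata it recognises (md superblocks, plus DDF and Intel IMSM container formats). A sketch, assuming the docked disk appears as /dev/sdb:

  # Read-only: report any RAID metadata found on the whole disk
  sudo mdadm --examine /dev/sdb

  # ...and on each partition, in case the metadata is per-partition
  sudo mdadm --examine /dev/sdb1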

Reply to
Andy Burns

OK, noted. Thanks.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie
