USB mass storage error recovery

I'm getting stuck trying to figure out how to recover from a USB mass storage "error". This is for "bulk only" protocol. The root problem is that the RTOS we use is returning errors when they don't really exist and terminating transfers in the middle. I think I can fix that problem, but in the meantime I was trying to add some error checking and recovery to their mass storage driver. I'd also like to be able to recover from real errors if they ever happen.

What happens is that the CBW command bytes are sent successfully, then the data phase is interrupted mid-stream. When the host ignores the error and tries to read the CSW status, it hangs forever.

My first approach was to detect the error and return from the transfer routine without reading the CSW. But the very next I/O operation will fail. I then tried doing a "bulk-only mass storage reset" operation, but that I/O also hangs. I then tried first clearing stalled endpoints out of desperation, and then doing the reset, but that didn't help.
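For reference, the Bulk-Only Transport specification describes reset recovery as three control requests in this order: the class-specific Bulk-Only Mass Storage Reset, then Clear Feature (ENDPOINT_HALT) on the bulk-in endpoint, then the same on the bulk-out endpoint. The sketch below only builds those three SETUP packets; the `usb_setup` type and `bot_reset_recovery` function are hypothetical names for illustration, and actually submitting the packets depends on the host stack in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a standard 8-byte USB SETUP packet (hypothetical type name). */
typedef struct {
    uint8_t  bmRequestType;
    uint8_t  bRequest;
    uint16_t wValue;
    uint16_t wIndex;
    uint16_t wLength;
} usb_setup;

/* Fill out[] with the reset recovery sequence; returns the packet count. */
size_t bot_reset_recovery(usb_setup out[3], uint16_t interface,
                          uint8_t ep_in, uint8_t ep_out)
{
    assert(out != NULL);

    /* 1. Bulk-Only Mass Storage Reset: class request to the interface. */
    out[0] = (usb_setup){ 0x21, 0xFF, 0, interface, 0 };

    /* 2. CLEAR_FEATURE(ENDPOINT_HALT) on the bulk-in endpoint. */
    out[1] = (usb_setup){ 0x02, 0x01, 0, ep_in, 0 };

    /* 3. CLEAR_FEATURE(ENDPOINT_HALT) on the bulk-out endpoint. */
    out[2] = (usb_setup){ 0x02, 0x01, 0, ep_out, 0 };

    return 3;
}
```

Note that this order (reset first, then both clears) is the reverse of the "clear the stalls, then reset" attempt described above.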

So now I'm baffled. What I think is happening is that the mass storage devices are still waiting to read bytes from the data phase, and can not leave that state even if they see bytes on the control endpoint.

Is there anything I can do to clear this stuck state?

-- Darin Johnson

Reply to
darin

We (myself and another engineer) have encountered this on our project too. The other engineer handles the software and would be the one with suggestions for you. I have forwarded your post to him and asked him either to post any suggestions here, or to reply to me so that I can post them here.

Reply to
Noway2

Here is the procedure for clearing faults on a bulk storage transfer that we are using in our project. Hope this helps.

The following are the steps I take during a bulk transport:

  1. Send the CBW. If the pipe stalls, clear the stall and go to the data transport stage. If the clear stall fails, or the original result was some other error, perform a Bulk Reset and exit the transport routine.

  2. Send/receive data. If the pipe stalls, clear it and go to the read CSW stage. If the clear stall fails, or the original result was some other error, perform a Bulk Reset and exit the transport routine. Do not try to read the CSW.

  3. Read the CSW. If the pipe stalls, clear it and try to reread the CSW. If the clear stall fails, some other error occurs, or the CSW is invalid, perform a Bulk Reset.

On all of the steps above, if the Bulk Reset fails, the HC or device is not working properly. If other devices ARE working properly (or if it can be verified that the HC is functioning properly), assume the device is corrupt and ignore it. Otherwise try a hardware reset.
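The three steps above boil down to a small decision table. Here is a minimal sketch of it in C, assuming a hypothetical set of result codes; none of these names come from Nucleus or any real host stack. The caller is assumed to report an invalid CSW as `XFER_OTHER_ERROR`, and to pass whether the clear stall succeeded (the flag is ignored unless the result was a stall).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; a real host stack defines its own. */
typedef enum { XFER_OK, XFER_STALL, XFER_OTHER_ERROR } xfer_result;
typedef enum { STAGE_CBW, STAGE_DATA, STAGE_CSW } bot_stage;
typedef enum {
    ACT_NEXT_STAGE,  /* continue with the following stage */
    ACT_REREAD_CSW,  /* stall cleared during CSW stage: read it again */
    ACT_DONE,        /* valid CSW received */
    ACT_BULK_RESET   /* give up on this transfer and reset the device */
} bot_action;

/* Decide what to do after one stage of a bulk-only transfer completes. */
bot_action bot_next_action(bot_stage stage, xfer_result result,
                           bool clear_stall_ok)
{
    assert(stage == STAGE_CBW || stage == STAGE_DATA || stage == STAGE_CSW);

    switch (result) {
    case XFER_OK:
        return stage == STAGE_CSW ? ACT_DONE : ACT_NEXT_STAGE;
    case XFER_STALL:
        if (!clear_stall_ok)
            return ACT_BULK_RESET;   /* could not clear the stall */
        return stage == STAGE_CSW ? ACT_REREAD_CSW : ACT_NEXT_STAGE;
    case XFER_OTHER_ERROR:
    default:
        return ACT_BULK_RESET;       /* never go on to read the CSW */
    }
}
```

A transport routine would call this after each stage and exit as soon as it sees `ACT_BULK_RESET`. How many CSW rereads to allow before resetting is left to the caller.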

Reply to
Noway2

OK, I get the "other error" here. It's not a "real" error since the RTOS vendor supplied USB software is broken, but it's probably a good simulation of a real error. The problem is that the Bulk Reset hangs also (at the status stage I think). The vendor supplied mass storage software doesn't implement any timeouts to detect a hang...
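As a rough illustration of the missing timeout, a bounded wait around a completion check might look like the sketch below. The names `wait_for_completion` and `poll_fn` are hypothetical, and `countdown_poll` is only a stand-in for a real completion check; a real driver would more likely block on an RTOS semaphore with a tick timeout than spin on a poll count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion check: returns true once the transfer is done. */
typedef bool (*poll_fn)(void *ctx);

typedef enum { WAIT_DONE, WAIT_TIMEOUT } wait_result;

/* Poll at most max_polls times instead of waiting forever. */
wait_result wait_for_completion(poll_fn poll, void *ctx, unsigned max_polls)
{
    assert(poll != NULL);

    for (unsigned i = 0; i < max_polls; i++) {
        if (poll(ctx))
            return WAIT_DONE;
        /* A real driver would sleep for one tick here. */
    }
    return WAIT_TIMEOUT;   /* caller can now cancel the transfer and recover */
}

/* Stand-in poll source: reports completion once the counter reaches zero. */
bool countdown_poll(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

With something like this around the status stage of the Bulk Reset, a hang becomes a `WAIT_TIMEOUT` that the driver can act on.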

There were basically two bugs in the software - the host controller driver cancelling transfers too soon, and the mass storage driver not handling errors. I tried to solve the latter first, but I made more headway after fixing the host controller instead. Though if there ever is a real error...

-- Darin Johnson

Reply to
Darin Johnson

Hi,

How about some more info? What OS are you using? Are you having problems with a host or a device?

I am having similar problems with a device "function" running on Nucleus. It sends quite a bit of data in both directions. It always hangs when I do a "format" from WinXP Home. It seems to hang near the end of the format. I do not know if this is a Nucleus driver stack problem or if I have a hardware driver issue (which is what I am debugging).

I am not seeing any stalls on the bus. What error is the mass storage driver not handling? Again is this a host or device issue?

Regards, Steve

There is no "x" in my email address.

Reply to
Steve Calfee

It's Nucleus, with a "host" driver, using EHCI. The error is not a real error, but it would be a transaction error (bad PID, CRC, etc). The OS assumes that if this bit is set that there's an error, although the HW retries the transaction up to 3 times in this case. From what I can see, *any* error during the data phase, other than a STALL, would cause problems. For instance, if the endpoint halted due to too many transaction or buffer errors.
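To make the distinction concrete, here is a minimal sketch of how a host driver might classify a qTD from its token word. The bit positions follow my reading of the EHCI qTD token layout (Active in bit 7, Halted in bit 6, Transaction Error in bit 3), so check them against the spec before relying on them; the function and type names are hypothetical and not taken from Nucleus. The point is that a Transaction Error bit left set after a successful hardware retry is not a failure unless the Halted bit is set as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* qTD token status bits (positions assumed from the EHCI specification). */
#define QTD_STS_ACTIVE (1u << 7)  /* controller still owns the qTD */
#define QTD_STS_HALT   (1u << 6)  /* endpoint halted: a real failure */
#define QTD_STS_XACT   (1u << 3)  /* a transaction error was seen */

typedef enum { QTD_PENDING, QTD_DONE, QTD_FAILED } qtd_state;

/* The token lives in memory the controller updates, hence volatile. */
qtd_state qtd_classify(const volatile uint32_t *token)
{
    assert(token != NULL);

    uint32_t t = *token;
    if (t & QTD_STS_ACTIVE)
        return QTD_PENDING;
    if (t & QTD_STS_HALT)
        return QTD_FAILED;
    /* XactErr without Halted: the hardware retried and succeeded. */
    return QTD_DONE;
}
```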

-- Darin Johnson

Reply to
Darin Johnson

Hmmm. Nucleus sounds a bit pants, then.

I *hate* not being able to trust 3rd-party code.

Steve

Reply to
Steve at fivetrees

It's not all bad, it just comes with lots of parts. Some of the parts are very reliable and stable and do what you want well, while others are relatively new. An advantage is that you get all the source code, a disadvantage is that you sometimes need the source code...

-- Darin Johnson

Reply to
Darin Johnson

Understood. But - if it were me, I'd put all sorts of compiler warnings over the untested new bits, or over provisional code. I mean, no timeouts... that's pretty bad. I'd hate to have to find that out the hard way.

Steve

Reply to
Steve at fivetrees
