CAN bus reply problems

Ska 21 years ago

Hi folks!

We are developing a system that uses the CAN bus as the network connecting different nodes. A PC needs to ask the nodes for some data (the node status), and the nodes have to answer the request immediately. To ask a node for its status, we send a "remote frame" message with a specific ID on the CAN bus. The relevant node has to answer with the data in a "data frame" message. Each node sits in a while loop, reading a buffer and sending back data when necessary.

Usually everything goes well, but sometimes one of the nodes does not answer the PC's request, even though the request is sent on the bus (it is seen by another node, and it can be seen with an oscilloscope connected to the CAN bus lines). It seems the node does not see the message; it misses the interrupt for updating the buffer...

We usually send a sequence of "remote frame" messages, waiting each time for the answer: send, wait for answer, send, wait, ... Even if we insert a sleep between one send and the next, messages are sometimes missed by a node... We changed the baud rate (from 500 kbit/s to 20 kbit/s) but the problem is not solved. We are using a T89C51CC03 microcontroller by ATMEL.

Have you ever experienced this problem? Any suggestions?

Thank you in advance for any help!

Cheers, Ska

Tim Wescott 21 years ago

1: This is either a problem with your microprocessor or with your code.
2: I have no experience with Atmel & CAN.
2a: The TMS320F2812 has been rock solid for me.
3: No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism.
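A minimal sketch of the timeout-and-retry idea from point 3, in C. Everything named here (`can_link`, `poll_node`, the `fake_*` helpers) is hypothetical and not taken from any Atmel driver; the real send and wait routines would be supplied by whatever CAN API the node or PC uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver hooks; none of these names come from the thread. */
typedef bool (*can_send_fn)(uint16_t id, void *ctx);   /* queue a remote frame */
typedef bool (*can_wait_fn)(uint16_t id, uint32_t timeout_ms, void *ctx);
                                                       /* true if the reply arrived in time */
typedef struct {
    can_send_fn send_request;
    can_wait_fn wait_reply;
    void *ctx;
} can_link;

/* Ask one node for its status.
 * Returns the number of attempts used (1..max_attempts), or 0 if all failed. */
unsigned poll_node(const can_link *link, uint16_t id,
                   uint32_t timeout_ms, unsigned max_attempts)
{
    assert(link && link->send_request && link->wait_reply);
    for (unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
        if (!link->send_request(id, link->ctx))
            continue;                 /* a failed transmit still counts as an attempt */
        if (link->wait_reply(id, timeout_ms, link->ctx))
            return attempt;           /* got the data frame */
    }
    return 0;                         /* caller decides what "node lost" means */
}

/* Test double: a fake node that ignores the first `drops` requests. */
typedef struct { unsigned drops; unsigned sent; } fake_node;

bool fake_send(uint16_t id, void *ctx)
{
    (void)id;
    ((fake_node *)ctx)->sent++;
    return true;
}

bool fake_wait(uint16_t id, uint32_t timeout_ms, void *ctx)
{
    (void)id; (void)timeout_ms;
    fake_node *n = ctx;
    return n->sent > n->drops;
}
```

Returning the attempt count rather than a plain success flag makes it cheap to log how often retries are needed, which is the failure rate asked about later in the thread.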

Tim Wescott Wescott Design Services http://www.wescottdesign.com

Heinz-Jürgen Oertel 21 years ago

I cannot answer your specific question; in other words, I don't know which part of your software or hardware is responsible. It could be the driver, it could be a misconfiguration of the CAN controllers, it could be the cabling. But you should consider switching your node monitoring from the master/slave principle you are using now to something else. Your current implementation looks exactly like the _old_ CANopen Node Guarding mechanism. CANopen switched to Heart Beat years ago, where each node is an autonomous Heart Beat Producer and can be monitored by every node that wishes to do so. The benefit is more flexibility and reduced bandwidth for the node monitoring. Anyway, it can happen that one of the Heart Beat Consumers misses one Heart Beat of one of the Producers. In this case, increase the rate or accept that one or more HBs are missing.
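The consumer side of such a heartbeat scheme could look roughly like the following C sketch. It is not the CANopen API: `hb_monitor` and its functions are made-up names, the node count is arbitrary, and the `tolerance` parameter is one assumed way of "accepting that one or more HBs are missing".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HB_MAX_NODES 8u   /* arbitrary size for this sketch */

typedef struct {
    uint32_t last_seen_ms[HB_MAX_NODES];
    bool     seen_once[HB_MAX_NODES];
    uint32_t period_ms;   /* heartbeat period of the producers */
    uint32_t tolerance;   /* periods that may elapse before a node counts as lost */
} hb_monitor;

void hb_init(hb_monitor *m, uint32_t period_ms, uint32_t tolerance)
{
    assert(m && period_ms > 0 && tolerance > 0);
    for (unsigned i = 0; i < HB_MAX_NODES; ++i) {
        m->last_seen_ms[i] = 0;
        m->seen_once[i] = false;
    }
    m->period_ms = period_ms;
    m->tolerance = tolerance;
}

/* Call from the receive path whenever a heartbeat frame from `node` arrives. */
void hb_on_heartbeat(hb_monitor *m, unsigned node, uint32_t now_ms)
{
    if (node < HB_MAX_NODES) {
        m->last_seen_ms[node] = now_ms;
        m->seen_once[node] = true;
    }
}

/* True if `node` has produced a heartbeat recently enough. */
bool hb_is_alive(const hb_monitor *m, unsigned node, uint32_t now_ms)
{
    if (node >= HB_MAX_NODES || !m->seen_once[node])
        return false;
    /* Unsigned subtraction stays correct across timer wrap-around. */
    return (uint32_t)(now_ms - m->last_seen_ms[node])
           <= m->period_ms * m->tolerance;
}
```

With a period of 100 ms and a tolerance of 3, a node is reported lost only after more than 300 ms of silence, so a single missed heartbeat does not raise an alarm.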

Regards Heinz

with best regards / mit freundlichen Grüßen
Heinz-Jürgen Oertel
port GmbH, http://www.port.de
mailto:oe@port.de
phone +49 345 77755-0, fax +49 345 77755-20
Regensburger Str. 7b, D-06132 Halle/Saale, Germany
CAN Wiki: http://www.CAN-Wiki.info
Newsletter: http://www.port.de/engl/company/content/abo_form.html

Ska 21 years ago

Hello Tim, hello Heinz, hello everybody

Thank you for your mails.

What you are both telling me is that "No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism"! This is exactly what we are doing now, but it is something I don't like so much... :( We set a maximum number of retry messages (say 10) and it sometimes happens that the attempts go over this threshold! In this case we reset and restart the CAN bus but, as I said, it is something we don't like so much...

...mmm...

Regards, Ska

Hans-Bernhard Broeker 21 years ago

[Massive quote without actual referral snipped. Please don't do that.]

What you're observing appears to be a rate of failure to receive CAN messages that is quite a lot beyond expectations of the protocol, unless you were operating in a pathologically noisy environment --- but you didn't mention anything like that.

What this hints at is a genuine bug in the receiving end, but I'm afraid you didn't reveal enough of its details for anybody out here to be able to remote-diagnose it more precisely. So I'll just bombard you with some questions:

Did you test this with only two nodes on the bus, and check if the receiving one ACKs the transmission?

What *is* the rate of failure, anyway, i.e. one in how many messages gets lost? What is the rate of transmissions with CRC or other failures, on the same network?

Do you have any way of debugging into the receiving CAN controller's register banks after a failed receival, to distinguish if the message actually failed to arrive in the message box, or just failed to raise the IRQ it's configured to? (There's a bug like that in another 8051 derivative with integrated CAN...)

Do you have a storage scope that would let you record the exact signalling up to the point of failure, so you could go look for any differences between successful and failing transmissions, on physical level?

Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de) Even if all the snow were burnt, ashes would remain.

Stephen 21 years ago

Hans-Bernhard Broeker writes:

Which 8051 is that then?

Hans-Bernhard Broeker 21 years ago

DS80C390 Rev. B3 and B4

-- Hans-Bernhard Broeker ( snipped-for-privacy@physik.rwth-aachen.de) Even if all the snow were burnt, ashes would remain.

Stephen 21 years ago

Ahhh... (sigh of relief). Just about to start coding on a 400. Sure would be a killer if the CAN interrupts didn't work as advertised!

Heinz-Jürgen Oertel 21 years ago

This really should not happen in CAN networks. If one of the nodes sees a wrong message, whatever the reason (CRC, bit failure, framing error, ...), it generates an error frame and causes the transmitter to retransmit. The probability of a _lost_ message, i.e. a message not seen by a receiver, is very, very low. Looks like a bug in your driver (or CPU, but I'm not aware of such a problem in the Atmel chips).

Heinz

Rich Walker 21 years ago

There are a couple of cases where you *will* see lost messages:

1. Too many messages for the receiver. Trivial case, but if your protocol doesn't allow for it, it *will* bite you in the ass one day.
2. Errors on the bus. Eventually, someone is going to go TX-Passive. Now everyone has lost whatever message they were going to get from there.

The normal reliability of CAN allows people to handwave both of these problems into "shouldn't happen". However, if you're building a higher-level protocol on top of CAN, you have to take these faults into account, because they will happen sometime.
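One way to make the first case visible instead of silent is to queue received frames between the interrupt handler and the main loop and to count every frame that had to be dropped. The C sketch below assumes a single producer (the ISR) and a single consumer (the main loop); `rx_queue`, `can_frame` and the queue size are hypothetical names and values, not part of any vendor driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_QUEUE_SIZE 8u   /* must be a power of two; holds SIZE - 1 frames */

typedef struct {
    uint16_t id;
    uint8_t  dlc;
    uint8_t  data[8];
} can_frame;

typedef struct {
    can_frame slots[RX_QUEUE_SIZE];
    volatile uint8_t  head;     /* written only by the ISR */
    volatile uint8_t  tail;     /* written only by the main loop */
    volatile uint16_t dropped;  /* frames discarded because the queue was full */
} rx_queue;

void rxq_init(rx_queue *q)
{
    q->head = 0;
    q->tail = 0;
    q->dropped = 0;
}

/* ISR side: store a received frame, or count it as lost if there is no room. */
bool rxq_push(rx_queue *q, const can_frame *f)
{
    uint8_t next = (uint8_t)((q->head + 1u) & (RX_QUEUE_SIZE - 1u));
    if (next == q->tail) {
        q->dropped++;           /* overflow is recorded, not hidden */
        return false;
    }
    q->slots[q->head] = *f;
    q->head = next;             /* publish only after the slot is filled */
    return true;
}

/* Main-loop side: fetch the oldest frame, if any. */
bool rxq_pop(rx_queue *q, can_frame *out)
{
    if (q->tail == q->head)
        return false;           /* queue empty */
    *out = q->slots[q->tail];
    q->tail = (uint8_t)((q->tail + 1u) & (RX_QUEUE_SIZE - 1u));
    return true;
}
```

A non-zero `dropped` count after a test run would point at the receiver being too slow rather than at the bus. The one-byte indices are chosen so that each index update is a single store on an 8-bit core.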

cheers, Rich.

[happily swamping 8-bit micros on 1MBit CAN with flaky connectors and high error rates since ... oooh, gosh, *that* long ago?]

rich walker | Shadow Robot Company | rw@shadow.org.uk technical director 251 Liverpool Road | need a Hand? London N1 1LX | +UK 20 7700 2487 www.shadow.org.uk/products/newhand.shtml

Heinz-Jürgen Oertel 21 years ago

As I said already, yes, you are right: in _very rare_ cases a message can be lost through a failure of the CAN protocol (described in every good CAN book). What you are describing is different. The first case is definitely a problem of processing power, bad driver design or bad network design. The second case is what you are calling TX-Passive. This term cannot be found in the ISO 11898 standard. Assuming you mean:

It takes part in the bus communication but when an error has been detected, a passive error flag is sent (opposed to active error flag)

A transmitting node in Error-Passive still sends messages, and normally is received by other nodes. If network quality, or whatever, e.g. a transceiver defect, is getting worse, the transmitting node is switched off. And only in this case, nothing is sent. But in this case, as well as when you cut the cable of the transmitter, no other node can receive anything.
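The states described here follow from the two error counters every CAN controller keeps. As a sketch in C, using the standard fault-confinement thresholds (error passive above 127, bus off above 255 on the transmit counter): the function and type names are invented for illustration, and how the counters are actually read depends on the controller.

```c
#include <stdbool.h>

typedef enum {
    CAN_ERROR_ACTIVE,    /* normal operation, sends active error flags */
    CAN_ERROR_PASSIVE,   /* still transmits, but sends passive error flags */
    CAN_BUS_OFF          /* transmitter switched off */
} can_error_state;

/* tec: transmit error counter, rec: receive error counter. */
can_error_state can_classify(unsigned tec, unsigned rec)
{
    if (tec > 255u)
        return CAN_BUS_OFF;
    if (tec > 127u || rec > 127u)
        return CAN_ERROR_PASSIVE;
    return CAN_ERROR_ACTIVE;
}

/* Only a bus-off node stops sending; an error-passive node still transmits. */
bool can_may_transmit(can_error_state s)
{
    return s != CAN_BUS_OFF;
}
```

Logging the result of such a check on the misbehaving node would show whether it ever leaves the error-active state during the failures.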

What the OP described, that one of the nodes loses messages from time to time, looks strange. Therefore, to me it really looks like a bug in the software.

Regards Heinz

Ska 21 years ago

Hello everybody

Thank you for your answers.

I'm happy to learn that the CAN bus should be more reliable than what we are experiencing now. I will try to answer your questions; let's see if I can do it properly.

For Hans-Bernhard

Yes, we did it. In our net we have some nodes, as I said. We have a serial line to which we send the output of the printf calls used in the code for test purposes. We inserted a printf in the code of one of the nodes (not the addressee node but another one; call it the "print node") to print the message IDs seen on the CAN bus. When we send the message on the CAN bus, sometimes the "print node" sees it (it writes a message with the correct ID to the serial line) but the addressee node does not send anything back. Sometimes neither of them sees anything. We are sure that the problem is that the addressee node does not send anything back (why? Either because it does not receive the message or because it decides not to transmit an answer), because we set up an LED that is turned on briefly whenever a node transmits something, and we do not see it flashing.

More or less the failure rate is about one message in 10 messages (but the rate is higher if you take into account consecutive failures...) We use the API set to communicate with the bus and we know that there is a CRC in the message sent, but we did not check it...

As I wrote, the only thing we know is that we send the message but the node does not receive it...

Ehm... no, I think no... ... ...

For Rich

> There are a couple of cases where you *will* see lost messages:
>
> 1. Too many messages for the receiver.
>    Trivial case, but if your protocol doesn't allow for it, it *will*
>    bite you in the ass one day.

This is not the case, I think, because the PC sends a single message and waits for an answer before sending another message...

Hans-Bernhard Broeker 21 years ago

Actually, you didn't. You ran a test, but not the one I described above. The ACK I'm talking about is that of the CAN bus protocol itself, where a receiving node sends back a single bit, inside the time frame of the message being transmitted on the bus, to inform the sender that at least one node successfully received it.

The test target here is to find how far into the receiving node the CAN message still makes it.

Change that LED's usage to "flash if something received", please. That's the more important test for the moment.

That's *way* too much. It suggests a serious software bug, mismatch in hardware clock rates, or misconfigured bit timing on the CAN bus.
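Whether two nodes agree on the bit timing can be checked with a little arithmetic: one bit consists of one synchronisation quantum plus the propagation and phase segments, and each time quantum lasts `prescaler` clock cycles. The C sketch below uses invented names and plain segment counts; actual register encodings differ between controllers and are not reproduced here, so take the real values from the datasheet.

```c
#include <stdint.h>

typedef struct {
    uint32_t clock_hz;    /* clock feeding the CAN bit-timing prescaler */
    uint32_t prescaler;   /* clock cycles per time quantum */
    uint32_t prop_seg;    /* propagation segment, in time quanta */
    uint32_t phase_seg1;  /* phase segment 1, in time quanta */
    uint32_t phase_seg2;  /* phase segment 2, in time quanta */
} can_bit_timing;

/* One bit = 1 sync quantum + prop_seg + phase_seg1 + phase_seg2. */
uint32_t can_quanta_per_bit(const can_bit_timing *t)
{
    return 1u + t->prop_seg + t->phase_seg1 + t->phase_seg2;
}

/* Nominal bit rate in bit/s, or 0 if the prescaler is invalid. */
uint32_t can_bit_rate(const can_bit_timing *t)
{
    if (t->prescaler == 0)
        return 0;
    return t->clock_hz / (t->prescaler * can_quanta_per_bit(t));
}

/* Sample point in per mille of the bit time (end of phase segment 1). */
uint32_t can_sample_point_permille(const can_bit_timing *t)
{
    return 1000u * (1u + t->prop_seg + t->phase_seg1) / can_quanta_per_bit(t);
}
```

For example, a 16 MHz clock with prescaler 2 and segments 7/4/4 gives 16 quanta per bit, 500 kbit/s and a sample point at 75%. The same settings on a node clocked at 12 MHz give 375 kbit/s, which is the kind of clock mismatch mentioned above.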

Please do so.

Apparently, you don't really know even that. You only know for a fact that it doesn't send the answer. You haven't established why, yet.

Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de) Even if all the snow were burnt, ashes would remain.

R Adsett 21 years ago

Perhaps we should start with some even simpler checks. Re-reading through the thread, I don't think it's been established that the bus is properly terminated. I have seen a CAN bus work something like 10% to 90+% of the time when not properly terminated. I have seen symptoms quite close to this when the bus had a broken termination resistor.

Robert

Ska 21 years ago

Robert, I did not understand the test you proposed... Can you explain it again? Is it a hardware check?

Cheers, Ska

R Adsett 21 years ago

Yes, it's a hardware check. Make sure that you have the proper terminating resistors on the bus. While CAN is quite tolerant of variation on the bus in my experience, missing termination resistors will cause the error rate to rise (often quite dramatically).

Simply find the ends of the bus cables and look for the resistors. You can verify the resistance with a multimeter.
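As a worked example of what the multimeter should show: high-speed CAN normally uses a 120 Ω terminator at each end, so with the bus powered down the reading between CAN_H and CAN_L is the two in parallel, about 60 Ω. The C sketch below turns that arithmetic into a rough check. The 120 Ω nominal value is the usual one but is an assumption about this particular bus, and the 10% tolerance and 1 kΩ "open" threshold are arbitrary choices for illustration.

```c
typedef enum {
    TERM_OK,            /* about 60 ohm: both terminators present */
    TERM_ONE_MISSING,   /* about 120 ohm: only one terminator found */
    TERM_NONE,          /* very high reading: no terminator at all */
    TERM_UNEXPECTED     /* anything else, e.g. a short or a wrong value */
} term_status;

/* Resistance of two resistors in parallel. */
double parallel_ohms(double r1, double r2)
{
    return (r1 * r2) / (r1 + r2);
}

/* True if x lies within +/- tol (as a fraction) of target. */
int within(double x, double target, double tol)
{
    return x >= target * (1.0 - tol) && x <= target * (1.0 + tol);
}

/* Interpret a reading taken between CAN_H and CAN_L with the bus unpowered. */
term_status classify_termination(double measured_ohms)
{
    const double nominal = 120.0;                          /* assumed terminator value */
    const double both = parallel_ohms(nominal, nominal);   /* 60 ohm */
    const double tol = 0.10;                               /* assumed tolerance */

    if (within(measured_ohms, both, tol))
        return TERM_OK;
    if (within(measured_ohms, nominal, tol))
        return TERM_ONE_MISSING;
    if (measured_ohms > 1000.0)                            /* assumed "open" threshold */
        return TERM_NONE;
    return TERM_UNEXPECTED;
}
```

A reading near 120 Ω on a temporary bench setup usually means one end of the cable was left unterminated.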

Robert

simpleton 21 years ago

Yes, improper termination seems to cause errant behaviour. This could also lead to the situation where you're seeing CAN "messages". Consider two CAN nodes only. If the bus is incorrectly terminated, then a possible ACK would never be received by the transmitting node. Someone on the bus has to send an acknowledge reply or the transmitting node will keep transmitting. This causes a lot of bus activity but no messages are being recognized. Also, what transceivers are you using? Not all transceivers seem to work together.

Ska 21 years ago

Robert, you are right. We are actually working with a temporary system configuration. We will work on the test set-up to terminate the bus and I will give feedback on the results, hopefully by the end of the week.

Thanks, Ska

Dan Danknick 21 years ago

Actually, this is not necessary for CAN. The beginning of the frame contains a node ID that possible recipients filter through their match/accept registers. Active receivers calculate CRC as the frame bytes clock in and then compare it to the CRC at the frame end. If they match, the accepting receiver drives the bus active (low) for one bit in a designated tailing window. This lets the master, or sender of the frame, know that someone received it.

Use your scope to look at the bus for this ACK bit. If you see it, but the receiver doesn't process the frame, you've missed the interrupt. If you don't see the ACK bit, then the receiver didn't match the node ID or the CRC, or it's in Bus Off mode for error containment.

Also be sure you have both ends properly terminated; I've seen wild behavior on DeviceNET packets at 125, 250 and 500 kb/s.

Dan

Paul Keinanen 21 years ago

> node ID

"Node ID" is only meaningful for some higher-level protocols, such as CANopen, but it does not make any sense in simple CAN bus systems, which rely fully on message identifiers.

Unless the receiver is in the "bus off" or "error passive" mode, _all_ receivers should monitor the bus and signal ACK or error frame accordingly.

> accepting receiver drives the bus

The ACK bit is sent by _any_ active (also "nonaddressed") device. Also if _any_ receiver detects a CRC or other error, it will send the error flag, which mutilates the message and no device will accept it.

This is only usable with only two devices (sender and receiver) on the bus. With more than two devices, someone else will acknowledge it. Instead of an oscilloscope, you should also be able to tell from the transmitter status registers whether someone ACKed the transmitted frame.

Or you have configured the mask registers incorrectly.

The identifier match should not affect the appearance of the ACK.
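To illustrate how a wrong mask makes a node ignore frames it has already acknowledged, here is a generic acceptance-filter check in C for 11-bit identifiers. The convention assumed here is that a mask bit of 1 means "this identifier bit must match" and 0 means "don't care"; some controllers use the opposite polarity, so the datasheet decides. The function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed convention: mask bit 1 = must match `code`, 0 = don't care.
 * Some controllers invert this polarity. Only the low 11 bits are compared. */
bool can_filter_accepts(uint16_t id, uint16_t code, uint16_t mask)
{
    return ((id ^ code) & mask & 0x7FFu) == 0;
}
```

With code 0x120 and mask 0x7F0, identifiers 0x120 to 0x12F pass and 0x13F does not. If the mask polarity is accidentally inverted, the filter may reject the very identifier the PC is polling, while the node still ACKs the frame on the bus.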

It should be possible to determine from the _transmitter_ status registers whether the frame was ACKed or an error flag was generated by the receiving device.

Paul
