CAN bus reply problems

- Ska

Mon, Mar 21, 2005 1:44 PM

Hi folks!

We are developing a system that uses the CAN bus as the network connecting different nodes. We have a PC that needs to ask the nodes for some data (the node status), and the nodes have to answer the request immediately. To ask each node for its status we send a "remote frame" message with a specific ID on the CAN bus. The addressed node has to answer with the relevant data in a "data frame" message. Each node sits in a while loop, reading a buffer and sending back data when necessary.

Usually everything goes well, but sometimes one of the nodes does not answer the PC's request, even though the request is sent on the bus (it is seen by another node, and it can be seen with an oscilloscope connected to the CAN bus lines). It seems the node does not see the message: it misses the interrupt that updates the buffer...

We usually send a sequence of "remote frame" messages, waiting each time for the answer: send, wait for answer, send, wait, ... Even if we insert a sleep between one send and the next, messages are sometimes missed by a node... We changed the baud rate (from 500 kbit/s to 20 kbit/s) but the problem is not solved. We are using a T89C51CC03 microcontroller by ATMEL.

Have you ever experienced this problem? Any suggestion?

Thank you in advance for any help!

Cheers, Ska

- Tim Wescott

Mon, Mar 21, 2005 4:02 PM

1: This is either a problem with your microprocessor or with your code.
2: I have no experience with Atmel & CAN.
2a: The TMS320F2812 has been rock solid for me.
3: No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism.
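
A minimal Python sketch of the timeout-and-retry idea in point 3, as it might look on the PC side. The names `request_status`, `send_request` and `poll_reply` are hypothetical placeholders for whatever the CAN adapter's driver actually provides; the 50 ms timeout and the retry limit are arbitrary example values, not recommendations.

```python
import time
from typing import Callable, Optional, Tuple


def request_status(
    send_request: Callable[[int], None],            # hypothetical: puts a remote frame on the bus
    poll_reply: Callable[[int], Optional[bytes]],   # hypothetical: returns reply data or None
    can_id: int,
    timeout_s: float = 0.05,
    max_retries: int = 10,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[bytes], int]:
    """Send a request and wait for the reply, re-sending after each timeout.

    Returns (reply, attempts); reply is None if every attempt timed out.
    """
    for attempt in range(1, max_retries + 1):
        send_request(can_id)
        deadline = clock() + timeout_s
        # Busy-poll until the reply arrives or this attempt times out.
        while clock() < deadline:
            reply = poll_reply(can_id)
            if reply is not None:
                return reply, attempt
    return None, max_retries
```

Returning the attempt count alongside the reply makes it easy to log how often retries are actually needed, which is the number that tells you whether something is wrong underneath.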

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

- Heinz-Jürgen Oertel

Mon, Mar 21, 2005 5:57 PM

I cannot answer your specific question; in other words, I don't know which part of your software or hardware is responsible. It could be the driver, it could be a misconfiguration of the CAN controllers, it could be the cabling. But you should consider switching your node monitoring from the master/slave principle you are using now to something else. Your current implementation looks exactly like the _old_ CANopen Node Guarding mechanism. CANopen switched to Heartbeat years ago: each node is an autonomous Heartbeat producer and can be monitored by every node that wishes to do so. The benefit is more flexibility and reduced bandwidth for node monitoring. Even so, it can happen that one of the Heartbeat consumers misses a Heartbeat from one of the producers. In that case, increase the rate or accept that one or more Heartbeats go missing.
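
A rough Python sketch of the consumer side of such a scheme, to show the "accept that one or more Heartbeats go missing" idea. This is not the CANopen implementation; the class name, the `missed_allowed` parameter and the use of plain floats for time are all assumptions made for illustration.

```python
class HeartbeatConsumer:
    """Remembers when each producer was last heard and reports the silent ones."""

    def __init__(self, period_s: float, missed_allowed: int = 2):
        # A node counts as lost only after more than `missed_allowed`
        # heartbeat periods have passed beyond the one it was due in.
        self.limit_s = period_s * (missed_allowed + 1)
        self.last_seen = {}

    def on_heartbeat(self, node: int, now: float) -> None:
        # Call this from the receive path whenever a heartbeat frame arrives.
        self.last_seen[node] = now

    def lost_nodes(self, now: float) -> list:
        # Nodes whose last heartbeat is older than the configured limit.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.limit_s)
```

With a 1 s period and `missed_allowed=2`, a node is reported only once it has been silent for more than 3 s, so a single dropped heartbeat does not raise an alarm.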

Regards Heinz

--

with best regards / mit freundlichen Grüßen

   Heinz-Jürgen Oertel
+===================================================================
| Heinz-Jürgen Oertel  port GmbH  http://www.port.de
| mailto:oe@port.de
| phone +49 345 77755-0     fax   +49 345 77755-20
| Regensburger Str. 7b,     D-06132 Halle/Saale,  Germany 
| CAN Wiki    http://www.CAN-Wiki.info
| Newsletter: http://www.port.de/engl/company/content/abo_form.html
+===================================================================

- Ska

Tue, Mar 22, 2005 10:34 AM

Hello Tim, hello Heinz, hello everybody

Thank you for your mails.

What you are both telling me is that "No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism"! This is exactly what we are doing now, but it is something I don't like so much... :( We set a maximum number of retries (say 10) and it sometimes happens that the attempts exceed this threshold! In this case we reset and restart the CAN bus but, as I said, it is something we don't like so much...

...mmm...

Regards, Ska

- Hans-Bernhard Broeker

Tue, Mar 22, 2005 2:12 PM

[Massive quote without actual referral snipped. Please don't do that.]

What you're observing appears to be a rate of failure to receive CAN messages that is quite a lot beyond expectations of the protocol, unless you were operating in a pathologically noisy environment --- but you didn't mention anything like that.

What this hints at is a genuine bug in the receiving end, but I'm afraid you didn't reveal enough of its details for anybody out here to be able to remote-diagnose it more precisely. So I'll just bombard you with some questions:

Did you test this with only two nodes on the bus, and check if the receiving one ACKs the transmission?

What *is* the rate of failure, anyway, i.e. one in how many messages gets lost? What is the rate of transmissions with CRC or other failures, on the same network?

Do you have any way of debugging into the receiving CAN controller's register banks after a failed receival, to distinguish if the message actually failed to arrive in the message box, or just failed to raise the IRQ it's configured to? (There's a bug like that in another 8051 derivative with integrated CAN...)

Do you have a storage scope that would let you record the exact signalling up to the point of failure, so you could go look for any differences between successful and failing transmissions, on physical level?

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

- Stephen

Tue, Mar 22, 2005 7:22 PM

Hans-Bernhard Broeker writes:

> (There's a bug like that in another 8051 derivative with integrated CAN...)

Which 8051 is that then?

- Hans-Bernhard Broeker

Tue, Mar 22, 2005 7:54 PM

DS80C390 Rev. B3 and B4


- Stephen

Tue, Mar 22, 2005 8:06 PM

Ahhh... (sigh of relief). Just about to start coding on a 400. Sure would be a killer if the CAN interrupts didn't work as advertised!

- Heinz-Jürgen Oertel

Tue, Mar 22, 2005 10:44 PM

This really should not happen in CAN networks. If one of the nodes sees a corrupted message, whatever the reason (CRC, bit error, framing error ...), it generates an error frame and causes the transmitter to retransmit. The probability of a _lost_ message, i.e. a message not seen by a receiver, is very, very low. Looks like a bug in your driver (or CPU, but I'm not aware of such a problem in the Atmel chips).

Heinz


- Rich Walker

Wed, Mar 23, 2005 2:45 PM

There are a couple of cases where you *will* see lost messages:

1. Too many messages for the receiver. Trivial case, but if your protocol doesn't allow for it, it *will* bite you in the ass one day.
2. Errors on the bus. Eventually, someone is going to go TX-Passive. Now everyone has lost whatever message they were going to get from there.

The normal reliability of CAN allows people to handwave both of these problems into "shouldn't happen". However, if you're building a higher-level protocol on top of CAN, you have to take these faults into account, because they will happen sometime.

cheers, Rich.

[happily swamping 8-bit micros on 1MBit CAN with flaky connectors and high error rates since ... oooh, gosh, *that* long ago?]

--
rich walker         |  Shadow Robot Company | rw@shadow.org.uk
technical director     251 Liverpool Road   |
need a Hand?           London  N1 1LX       | +UK 20 7700 2487
www.shadow.org.uk/products/newhand.shtml

- Heinz-Jürgen Oertel

Wed, Mar 23, 2005 9:01 PM

As I said already: yes, you are right, in _very rare_ cases a message can be lost through a failure of the CAN protocol itself (this is described in every good CAN book). What you are describing is different. The first case is definitely a problem of processing power, bad driver design or bad network design. As for the second case, which you call TX-Passive: this term is not found in the ISO 11898 standard. Assuming you mean Error-Passive:

It takes part in the bus communication but when an error has been detected, a passive error flag is sent (opposed to active error flag)

A transmitting node in Error-Passive still sends messages, and these are normally received by the other nodes. If the network quality gets worse for whatever reason, e.g. a defective transceiver, the transmitting node is switched off (Bus-Off). Only in this case is nothing sent. But then, just as when you cut the cable of the transmitter, no other node can receive anything from it.
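
For readers who want the states spelled out, here is a simplified Python sketch of CAN fault confinement. The thresholds (error-passive above 127, bus-off above 255) and the basic transmit counter rule (+8 on a failed transmission, -1 on a successful one) follow the CAN specification; the special cases of the counting rules are left out, and the function names are made up for this example.

```python
from enum import Enum


class ErrorState(Enum):
    ERROR_ACTIVE = "error-active"
    ERROR_PASSIVE = "error-passive"
    BUS_OFF = "bus-off"


def error_state(tec: int, rec: int) -> ErrorState:
    """Map transmit/receive error counters to the fault confinement state."""
    if tec > 255:
        return ErrorState.BUS_OFF
    if tec > 127 or rec > 127:
        return ErrorState.ERROR_PASSIVE
    return ErrorState.ERROR_ACTIVE


def tec_after(results, tec: int = 0) -> int:
    """Apply the basic TEC rule to a sequence of transmit outcomes.

    `results` holds True for a successful transmission, False for a failed one.
    Simplification: the exceptions in the specification are ignored.
    """
    for ok in results:
        if tec > 255:
            break  # bus-off: the node no longer transmits
        tec = max(tec - 1, 0) if ok else tec + 8
    return tec
```

Under this simplified rule, 16 consecutive failed transmissions take a fresh node to error-passive and 32 take it to bus-off, which is why a node only drops off the bus when things are persistently bad.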

What the OP described, that one of the nodes does lose messages from time to time, looks strange. Therefore, to me it really looks like a bug in the software.

Regards Heinz


- Ska

Fri, Mar 25, 2005 9:26 AM

Hello everybody

Thank you for your answers.

I'm happy to learn that the CAN bus should be more reliable than what we are experiencing now. I will try to answer your questions; let's see if I can do it properly.

For Hans-Bernhard

> Did you test this with only two nodes on the bus, and check if the receiving one ACKs the transmission?

Yes, we did it. In our net we have several nodes, as I said. We have a serial line to which we send the output of the printf calls used in the code for test purposes. We inserted a printf in the code of one of the nodes (not the addressee node, another node; call it the "print node") to print the message IDs seen on the CAN bus. When we send the message on the CAN bus, it sometimes happens that the "print node" sees it (it writes a message with the correct ID to the serial line) but the addressee node does not send anything back. Sometimes neither of them sees anything. We are sure that the problem is that the addressee node does not send anything back (why? because it does not receive the message, or because it decides not to transmit an answer to it) because we set up an LED that is turned on briefly when a node transmits something, and we do not see it flashing.

> What *is* the rate of failure, anyway, i.e. one in how many messages gets lost? What is the rate of transmissions with CRC or other failures, on the same network?

The failure rate is roughly one message in 10 (but the rate is higher if you take consecutive failures into account...). We use the API set to communicate with the bus and we know that there is a CRC in the message sent, but we did not check it...

> Do you have any way of debugging into the receiving CAN controller's register banks after a failed receival, to distinguish if the message actually failed to arrive in the message box, or just failed to raise the IRQ it's configured to?

As I wrote, the only thing we know is that we send the message but the node does not receive it...

> Do you have a storage scope that would let you record the exact signalling up to the point of failure [...]?

Ehm... no, I think no... ... ...

--
For Rich
> There are a couple of cases where you *will* see lost messages:
> 
> 1. Too many messages for the receiver.
>    Trivial case, but if your protocol doesn't allow for it, it *will*
>    bite you in the ass one day.

This is not the case, I think, because the PC sends a single message
and waits for an answer before sending another message...

- Hans-Bernhard Broeker

Fri, Mar 25, 2005 12:29 PM

> Yes, we did it.

Actually, you didn't. You ran a test, but not the one I described above. The ACK I'm talking about is that of the CAN bus protocol itself, where a receiving node sends back a single bit, inside the time frame of the message being transmitted on the bus, to inform the sender that at least one node successfully received it.
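
As a reference for where that single bit sits, here is a small Python sketch listing the fields of a base-format (11-bit identifier) CAN frame in transmission order. The field lengths are those of the CAN 2.0A frame format; the function names are made up. All positions are counted before bit stuffing, so on a real scope trace the ACK slot can appear a few bit times later than computed here.

```python
def base_frame_fields(data_bytes: int = 0, remote: bool = False):
    """Fields of a base-format CAN frame with their lengths in bits (unstuffed)."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("a classic CAN frame carries 0 to 8 data bytes")
    # A remote frame has no data field, whatever its DLC says.
    payload_bits = 0 if remote else 8 * data_bytes
    return [
        ("SOF", 1), ("identifier", 11), ("RTR", 1), ("IDE", 1), ("r0", 1),
        ("DLC", 4), ("data", payload_bits), ("CRC", 15), ("CRC delimiter", 1),
        ("ACK slot", 1), ("ACK delimiter", 1), ("EOF", 7),
    ]


def ack_slot_index(fields) -> int:
    """0-based bit position of the ACK slot, ignoring stuff bits."""
    position = 0
    for name, bits in fields:
        if name == "ACK slot":
            return position
        position += bits
    raise ValueError("no ACK slot in field list")


def ack_slot_offset_us(fields, bit_rate: float) -> float:
    """Time from the start of SOF to the start of the ACK slot, in microseconds."""
    return ack_slot_index(fields) * 1e6 / bit_rate
```

For a remote frame this gives 44 bits in total with the ACK slot at bit 35, i.e. roughly 70 µs after the start of frame at 500 kbit/s if no stuff bits were inserted.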

The test target here is to find how far into the receiving node the CAN message still makes it.

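> [...] we set up an LED that is turned on briefly when a node transmits something, and we do not see it flashing.

Change that LED's usage to "flash if something received", please. That's the more important test for the moment.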

> [...] the failure rate is about one message in 10 messages

That's *way* too much. It suggests a serious software bug, mismatch in hardware clock rates, or misconfigured bit timing on the CAN bus.
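
To check the clock-rate and bit-timing suspicion on paper, the generic CAN bit-timing arithmetic can be sketched in Python as below. This is the textbook model (one sync time quantum followed by propagation, phase 1 and phase 2 segments), not the T89C51CC03 register encoding; `brp` here is the actual prescaler division factor, and all names and example values are assumptions. Check the datasheet for how these map to the real registers.

```python
def bit_timing(f_clk_hz: float, brp: int, prop_seg: int, phase_seg1: int, phase_seg2: int):
    """Return (bit_rate, sample_point) for a generic CAN bit-timing setup.

    Segment lengths are given in time quanta; the sync segment is always 1 tq.
    """
    tq_per_bit = 1 + prop_seg + phase_seg1 + phase_seg2
    bit_rate = f_clk_hz / (brp * tq_per_bit)
    # The bit is sampled at the end of phase segment 1.
    sample_point = (1 + prop_seg + phase_seg1) / tq_per_bit
    return bit_rate, sample_point


def rate_mismatch(rate_a: float, rate_b: float) -> float:
    """Relative difference between the bit rates of two nodes."""
    return abs(rate_a - rate_b) / max(rate_a, rate_b)
```

For example, `bit_timing(16_000_000, 2, 7, 4, 4)` gives 500 kbit/s with a 75% sample point. The same settings on a node clocked at 11.0592 MHz give 345.6 kbit/s, a mismatch of about 31%, far more than CAN resynchronization can absorb. Running every node's actual clock and settings through the same calculation is a quick way to rule this cause in or out.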

> [...] we know that there is a CRC in the message sent, but we did not check it...

Please do so.

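> As I wrote, the only thing we know is that we send the message but the node does not receive it...

Apparently, you don't really know even that. You only know for a fact that it doesn't send the answer. You haven't established why, yet.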


- R Adsett

Fri, Mar 25, 2005 3:01 PM

Perhaps we should start with some even simpler checks. Re-reading through the thread, I don't think it's been established that the bus is properly terminated. I have seen a CAN bus work something like 10% to 90+% when not properly terminated. I have seen symptoms quite close to this when the bus had a broken termination resistor.

Robert

- Ska

Thu, Mar 31, 2005 1:04 PM

Robert, I did not understand the test you proposed... Can you explain it again? Is it a hardware check?

Cheers, Ska

- R Adsett

Thu, Mar 31, 2005 2:15 PM

Yes, it's a hardware check. Make sure that you have the proper terminating resistors on the bus. While CAN is quite tolerant of variation on the bus in my experience, missing termination resistors will cause the error rate to rise (often quite dramatically).

Simply find the ends of the bus cables and look for the resistor at each one. You can verify its resistance with a multimeter.
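
A small Python sketch of how to read the multimeter result, assuming the usual high-speed CAN arrangement of one nominal 120 Ω resistor at each end of the bus, measured between CAN_H and CAN_L with the bus unpowered. The function names and the 15% tolerance are assumptions; transceiver input resistance will shift real readings slightly.

```python
def parallel(*resistors_ohm: float) -> float:
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def classify_termination(measured_ohm: float, nominal_ohm: float = 120.0,
                         tolerance: float = 0.15) -> str:
    """Interpret a CAN_H to CAN_L resistance reading taken on an unpowered bus."""
    both = parallel(nominal_ohm, nominal_ohm)  # two terminators in parallel

    def near(target: float) -> bool:
        return abs(measured_ohm - target) <= tolerance * target

    if near(both):
        return "both terminators present"
    if near(nominal_ohm):
        return "one terminator missing"
    if measured_ohm > nominal_ohm * (1 + tolerance):
        return "no termination found"
    return "unexpected reading"
```

A healthy bus therefore reads about 60 Ω, a bus with one resistor missing or broken reads about 120 Ω, and a much lower reading (for example 40 Ω) suggests an extra terminator somewhere in the middle.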

Robert

- simpleton

Mon, Apr 4, 2005 8:20 AM

Yes, improper termination seems to cause errant behaviour. This could also lead to the situation where you're seeing CAN "messages". Consider two CAN nodes only. If the bus is incorrectly terminated, then a possible ACK would never be received by the transmitting node. Someone on the bus has to send an acknowledge reply or the transmitting node will keep transmitting. This causes a lot of bus activity, but no messages are being recognized. Also, what transceivers are you using? Not all transceivers seem to work together.

- Ska

Tue, Apr 5, 2005 10:20 AM

Robert, you are right. We are actually working with a temporary system configuration. We will work on the test set-up to terminate the bus and I will report back on the results, hopefully by the end of the week.

Thanks, Ska

- Dan Danknick

Fri, Apr 8, 2005 5:52 AM

Actually, this is not necessary for CAN. The beginning of the frame contains a node ID that possible recipients filter through their match/accept registers. Active receivers calculate the CRC as the frame bytes clock in and then compare it to the CRC at the end of the frame. If they match, the accepting receiver drives the bus active (low) for one bit in a designated trailing window. This lets the master, or sender of the frame, know that someone received it.

Use your scope to look at the bus for this ACK bit. If you see it, but the receiver doesn't process the frame, you've missed the interrupt. If you don't see the ACK bit, then the receiver didn't match the node ID or the CRC, or it's in Bus Off mode for error containment.

Also be sure you have both ends properly terminated; I've seen wild behavior on DeviceNET packets at 125, 250 and 500 kb/s.

Dan

- Paul Keinanen

Fri, Apr 8, 2005 7:07 AM

> node ID

"Node ID" is only meaningful for some higher-level protocols, such as CANopen; it does not make any sense in simple CAN bus systems, which rely fully on message identifiers.

Unless the receiver is in the "bus off" or "error passive" mode, _all_ receivers should monitor the bus and signal ACK or an error frame accordingly.

> accepting receiver drives the bus

The ACK bit is sent by _any_ active (also "non-addressed") device. Also, if _any_ receiver detects a CRC or other error, it will send the error flag, which mutilates the message so that no device will accept it.

> Use your scope to look at the bus for this ACK bit.

This is only usable with only two devices (sender and receiver) on the bus. With more than two devices, someone else will acknowledge it. Instead of using an oscilloscope, you should also be able to tell from the transmitter status registers whether someone ACKed the transmitted frame.

> If you see it, but the receiver doesn't process the frame, you've missed the interrupt.

Or you have configured the mask registers incorrectly.

> If you don't see the ACK bit, then the receiver didn't match the node ID or the CRC

The identifier match should not affect the appearance of the ACK.

It should be possible to determine from the _transmitter_ status registers, if the frame was ACKed or an error flag generated by the receiving device.
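
Pulling the suggestions in this thread together, the fault can be narrowed down step by step. The Python sketch below assumes a test bus with only two nodes (so any ACK must come from the node under test) and four observations gathered from status registers, an LED or debug output; the function name and its parameters are hypothetical, and the messages only restate checks already proposed above.

```python
def locate_loss(tx_acked: bool, rx_mailbox_filled: bool,
                rx_irq_fired: bool, reply_sent: bool) -> str:
    """Say how far a request got into the receiving node (two-node bus only)."""
    if not tx_acked:
        # Nobody acknowledged: the frame never made it into the receiver.
        return "not ACKed: check wiring, termination, bit timing, bus-off state"
    if not rx_mailbox_filled:
        # The controller saw a valid frame but did not store it.
        return "ACKed but not stored: check acceptance mask/filter configuration"
    if not rx_irq_fired:
        return "stored but no interrupt: check interrupt enables and the ISR"
    if not reply_sent:
        return "received but not answered: check the application loop"
    return "no loss"
```

Checking the stages in this order matters: each answer is only meaningful once the stage before it is known to be good.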

Paul