Can the CAN bus be overloaded with too many transactions?

Greetings,

We are using 5 Jaguars (black) with a 2CAN for motor control and for acquiring detailed data (motor current, bus voltage, encoders, temp, etc.) using 3 different threads with an aggregate rate of just over 1000 transactions/sec. While this generally works rather well, we are seeing an occasional CAN timeout message making it to the console screen. We have had some debate as to whether it is possible to overwhelm the CAN bus with too much traffic, and whether we need to significantly reduce our polling frequency to avoid this occasional CANJaguar timeout.

**I would like to solicit any opinions folks have regarding the use of the CAN bus with the 2CAN and whether it really is possible to have too much CAN traffic, which can result in CAN timeouts.**

It does seem like the Jaguars themselves, and their ability to respond to each request over the Ethernet/CAN bus, should be the limiting factor, and that it should not be possible to cause the bus actual difficulty with lots of traffic. A series of transactions to each individual Jaguar can happen in rapid succession (sequential in code) and would represent a rather high (but brief) traffic burst.

I just spent some time looking at the WPILib CANJaguar implementation and was rather surprised to see the semaphore implementation is at the individual Jaguar level and not shared at the higher bus level. I tried to go down a level further to see how this might be handled at the lower levels but could not find the source for FRC_NetworkCommunication_JaguarCANDriver_receiveMessage.

I am curious as to whether any of the lower level implementations (perhaps in the 2CAN itself) attempt to ensure only one CANJaguar transaction is out on the CAN bus at a time?

I am wondering whether our multithreaded use of CANJaguar, with its individual per-Jaguar semaphores, might be exposing a system vulnerability when our threads occasionally happen to transact simultaneously and request info from different Jags at the same time?

I’m going to try to protect these calls with a single shared semaphore (vs. one per Jaguar) and see if that resolves this occasional glitch.
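As a rough illustration of the single shared semaphore idea, here is a minimal sketch in standard C++. Everything in it is hypothetical: `SharedBusLock` and `maxOverlap` are not WPILib names, and `std::mutex` merely stands in for whatever semaphore primitive the real code uses. The wrapper holds one bus-wide lock for the duration of each transaction, and the helper counts how many transactions are ever in flight at once.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: one lock shared by every Jaguar on the bus,
// instead of one lock per Jaguar object.
class SharedBusLock {
public:
    // Runs one complete request/response exchange while holding the lock.
    template <typename Fn>
    auto transact(Fn&& fn) -> decltype(fn()) {
        std::lock_guard<std::mutex> hold(busMutex_);
        return fn();
    }

private:
    std::mutex busMutex_;
};

// Starts `threads` workers that each perform `perThread` fake transactions
// and returns the highest number of transactions seen in flight at once.
int maxOverlap(int threads, int perThread) {
    assert(threads > 0 && perThread > 0);
    SharedBusLock bus;
    std::atomic<int> inFlight{0};
    std::atomic<int> worst{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                bus.transact([&] {
                    int now = ++inFlight;  // this transaction has started
                    int seen = worst.load();
                    // Record a new maximum if this is the most overlap seen.
                    while (now > seen &&
                           !worst.compare_exchange_weak(seen, now)) {
                    }
                    --inFlight;            // this transaction has finished
                    return 0;
                });
            }
        });
    }
    for (auto& th : pool) th.join();
    return worst.load();
}
```

With the shared lock in place, `maxOverlap(3, 2000)` should report 1, meaning no two exchanges ever overlapped. In real robot code the lambda body would be the actual CANJaguar call.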

Any thoughts or opinions on this would be greatly appreciated.

Thanks,
John

Using a 2CAN instead of the serial bus should help a whole lot.

Really, you want an async interface where nothing gets serialized except on the actual transmission of messages out onto the bus. The CAN protocol handles collisions from multiple devices, so receive works out to just listening and being able to keep up. If some device does not respond for some reason, each exchange should have an independent timeout.

Of course, the S/W stack is not ideal and you have to respect any constraints the Jaguar has in terms of handling overlapping commands.

At any rate, there should be plenty of bandwidth for this to happen, assuming a solid S/W implementation and a low error rate. The first thing I’d look into is the error rate, using the 2CAN web utility and also checking the cabling, particularly the terminators.

Then I’d check to see if your thread scheme is fighting with the serialization that happens to commands. If you issue a bunch of commands and they wind up being serialized, some commands may sit in a queue for a long time before they ever get sent. This could even be creating an unbounded queue that eventually winds up being emptied by timeouts.

Maybe a good way to go would be a thread per Jaguar, to match the serialization. Avoid issuing the same command while waiting for a prior instance of the same command to complete. If you’re using synchronous commands (where the function to issue a command does not return until the command exchange has completed or the command has timed out), this will automatically throttle things and you probably don’t have to worry about issuing multiple overlapping instances of the same command. Avoid having these threads do anything that can hold up the main thread that is doing the periodic interaction with the driver station.
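One way to avoid issuing a command while a prior instance of it is still outstanding is a small in-flight flag per command type. The sketch below is only an illustration; `InFlightGuard` is a made-up name and not part of WPILib or the Jaguar firmware.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical guard: at most one outstanding instance of a given command.
class InFlightGuard {
public:
    // Returns true if the caller may send the command now, or false if an
    // earlier instance has not yet completed or timed out.
    bool tryBegin() { return !pending_.exchange(true); }

    // Call when the response arrives or the exchange times out.
    void finish() {
        assert(pending_.load());  // finish() without a matching tryBegin()
        pending_.store(false);
    }

private:
    std::atomic<bool> pending_{false};
};
```

A polling thread would call `tryBegin()` before each request and simply skip that cycle when it returns false, rather than queueing a second copy of the same request. With purely synchronous calls, as noted above, this throttling happens automatically.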

Watch out for something like a Jaguar resetting – if you do get errors, you may need some sort of exception handling to reconfigure the affected Jaguar as they lose configuration (except for CAN ID) when they reboot. Sometimes, not everything reboots at once – Jags may not reboot across a cRIO reboot and also it is possible for them to reboot by themselves if there is something like a brownout or high current fault.

Anyway, the bottom line is the 2CAN works very well in my experience and you should be able to do what you are trying to do. However, it is possible you run into a bug in the Jaguar or the CAN S/W and lose a whole lot of time on this. Given the date, you may not want to mess with this unless you can do so while the robot is in the bag (withholding, a second control system or robot, …).

Greetings,

Thanks for the reply and discussion. We have a relatively small error rate of approximately 1 timeout every 3 or 4 seconds or about 1/4000 transactions.
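For reference, the arithmetic behind that ratio, using only the figures quoted in this thread (roughly 1000 transactions/sec and one timeout every 3 to 4 seconds). The function name is just for illustration.

```cpp
#include <cassert>

// Number of transactions that occur, on average, between two timeouts.
double transactionsPerTimeout(double txPerSecond, double secondsPerTimeout) {
    assert(txPerSecond > 0.0 && secondsPerTimeout > 0.0);
    return txPerSecond * secondsPerTimeout;
}
```

At 1000 transactions/sec, a timeout every 3 seconds is 1 in 3000 and a timeout every 4 seconds is 1 in 4000.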

While I suspect these occasional timeouts are harmless given our use (the next output is never more than 20 ms away), I would like to understand what is causing them with some frequency, as I suspect (graciously) there may be a possible multithreading vulnerability in the system (our code, WPILib, 2CAN, Jaguar, or elsewhere). We would prefer zero CAN timeout errors on the system if possible and would like to understand any limitations and expectations in full detail.

There has also been some debate within the team as to whether too many Jag transactions/sec can actually cause these timeouts. I have heard this bus overload theory from other teams as well and would like to separate fact from fiction on this front.

> Of course, the S/W stack is not ideal and you have to respect any constraints the Jaguar has in terms of handling overlapping commands.

I wonder how often the Jags ever get overlapping commands, if ever, under normal use. Any single thread execution is essentially serialized nicely with each send/receive (ack) transaction.

> Then I’d check to see if your thread scheme is fighting with the serialization that happens to commands. If you issue a bunch of commands and they wind up being serialized

The basic WPILib seems to implement a simple blocking call that waits for a response or ack to each command. Introduce another thread and we still have each thread running synchronously through each of its programmed transactions. The two threads will contend for the CANJaguar m_transactionSemaphore only if they happen to hit the same Jaguar instance; there is no serialization across Jaguars at all. Given the relatively quick responsiveness of each Jag transaction (~200us for the entire call), simultaneous Jag transactions are rare, but they are still expected to occur occasionally. Given this very infrequent command overlap, I suspect that ensuring complete (inter or intra Jag) serialization with a single semaphore shared by all Jaguars will cause no noticeable performance degradation (an occasional, brief thread suspension). I’ll try this experiment and see whether it reduces or eliminates our occasional error.

> Really, you want an async interface where nothing gets serialized except on the actual transmission of messages out onto the bus

I’m curious as to where this low-level serialization is ensured. Is it in the 2CAN?

Thanks again,

john

Sorry for the slow response – travelling with little time and not very good connectivity…

This is mostly speculation/educated guesswork, but the 2CAN basically just gets packets from the Ethernet side, takes the message payload, and puts it onto the CAN bus. So, messages (in a single packet) get serialized on the Ethernet side, and they won’t get interleaved as they are placed on the CAN bus, even if somehow the 2CAN were to have multiple messages ready to place onto the CAN bus. If the Ethernet side is faster (as it likely is), it would be possible for multiple messages to arrive on the Ethernet side while a prior message was still being sent on the CAN side. In this case, these would be buffered and sent in order.

From the CAN side to the Ethernet side, it is very similar. So the 2CAN is not involved at all in serialization in the sense of holding back a command while waiting for the response to a prior one. Given this, it doesn’t matter if the commands are to the same Jaguar or different ones. The main thing you get with the 2CAN is just a fatter pipe that lets you get more commands through and also reduces the time for even single commands to make it from one end to the other (and back). In other words, the 2CAN winds up looking like a faster wire than the serial scheme but not different in terms of functionality. Of course, there is the whole web interface, but this isn’t part of the communications path when the 2CAN is being used in the robot.

For bus saturation, you need to work out how much data is flowing across the bus in a unit of time (typically one second). This has to account for traffic in both directions. You have to compare this to the bandwidth of the bus. If the utilization is more than a certain percentage of the available bandwidth, you are likely going to have issues. For example, commands may spend a long time waiting before they make it out onto the wire. Once you get into this situation, things can continue to back up more and more. Synchronous execution may limit this, but you could still see some of it when using multiple threads. With serial connectivity, the serial bus is going to have the lower/limiting bandwidth. When using the 2CAN, the CAN bus itself is going to be the limiting factor.
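As a back-of-the-envelope version of that calculation, here is a small sketch. It rests on several assumptions that are not stated in this thread and should be checked against your setup: a 1 Mbit/s CAN bit rate, extended (29-bit ID) frames, every transaction consisting of one request frame and one reply frame, and a worst case of 8 data bytes with maximum bit stuffing. It also ignores any periodic traffic such as heartbeats. The function names are made up for illustration.

```cpp
#include <cassert>

// Worst-case bits on the wire for one extended-ID CAN data frame, including
// the 3-bit interframe space and the maximum possible number of stuff bits.
int worstCaseFrameBits(int dataBytes) {
    assert(dataBytes >= 0 && dataBytes <= 8);
    int stuffable = 54 + 8 * dataBytes;  // SOF through CRC: subject to stuffing
    int fixedTail = 13;                  // CRC/ACK delimiters, ACK, EOF, interframe
    int maxStuff = (stuffable - 1) / 4;  // at most one stuff bit per 4 bits
    return stuffable + fixedTail + maxStuff;
}

// Fraction of the bus bandwidth consumed (1.0 means fully saturated).
double busUtilization(double framesPerSecond, int dataBytes,
                      double bitsPerSecond) {
    assert(bitsPerSecond > 0.0);
    return framesPerSecond * worstCaseFrameBits(dataBytes) / bitsPerSecond;
}
```

Under those assumptions, 1000 transactions/sec is 2000 frames/sec at up to 160 bits each, or about 320,000 bits/sec, which is roughly 32% of a 1 Mbit/s bus. Real frames are often shorter than the worst case, so the actual figure may be lower.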