CAN Timeout Errors and Loss of DS Comms

Hello all, we are having some issues getting our 2017 robot running after the 2018 software upgrade.

A little background: an hour ago our 2017 robot was still running the 2017 software, and all systems worked, including all CAN devices (we only have a PDP and a PCM). We use the PCM for solenoids and the PDP to poll current and voltage.

After flashing the RIO, upgrading the DS, and loading the new CTRE LifeBoat, we can see the PDP and PCM in the Silverlight web interface in Internet Explorer. Both are up to date and have passed the self-test and the flashing-light test.

Our upgraded robot's solenoids work as expected. However, whenever we try to use .getVoltage or .getCurrent of the PDP, we get a CTRE CAN Timeout Error, and after a few errors we lose all comms with the DS. The only way to restore comms is a full robot power cycle; deploying code does not restore DS comms, although we are still connected to the roboRIO (we can ping it and access its web server).

Disabling these calls restores system stability, and we do not lose comms or receive errors.

We have re-deployed code after the LifeBoat update and confirmed that the CTRE JARs are a part of our new build path.

However, I cannot get around these timeout issues. We are trying to get current on channels 0-15. With the calls to the PDP libraries disabled, the robot runs fine, and the Silverlight self-test shows the proper current draw on the appropriate channels when the robot is moving around.

At no point do we have any issues with the PCM. We inspected the wired connections and everything seems fine. We have this issue on 2 roboRIOs, both of which had the full CAN bus up and running prior to the software update, and no wiring or software changes were made other than the upgrade.

Did I miss a step in the upgrade? Any help would be appreciated.

Thanks,
Kevin

The PDP and PCM classes aren’t actually part of CTRE Phoenix - they’re part of WPILib. So while you wouldn’t be able to see them in the web-interface without using LifeBoat, the software API should still be working.

If you’re getting CAN receive errors for the PDP, make sure the device ID matches what you’re constructing in code - while the code might not have changed, it’s possible it got changed on the device.

I can’t say for certain, but my guess is that your comms issue is probably related to this:

Thanks, we did try changing the constructor values and making sure that they correspond to the proper CAN ID on the web page.

We currently have PCM = 1 and PDP = 2, although all permutations we tried still yielded the same issues.

Last year we had both set to 0 because it was allowable to have a 0 ID for all dissimilar CAN device types. I think CTRE did away with that (maybe?), but it's not a device numbering issue.

Nope. We (CTRE) haven’t changed anything in how those IDs work - you can still have identical IDs across different device types.

The PDP sends out its CAN frames every 25ms by default. The CAN Receive timeout is only generated if it’s been more than 50ms since the last frame was received by the RIO.
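The timing rule described above can be sketched as a small staleness check. This is only an illustrative model, not the WPILib or CTRE implementation: the class name ReceiveTimeoutModel and its methods are hypothetical, and only the 25 ms frame period and the 50 ms threshold come from the post.

```java
/** Illustrative model only; not the WPILib or CTRE implementation. */
class ReceiveTimeoutModel {
    // Default PDP frame period and receive-timeout threshold, as stated in the post.
    static final long FRAME_PERIOD_MS = 25;
    static final long RECEIVE_TIMEOUT_MS = 50;

    // Time (ms) at which the most recent frame arrived.
    private long lastFrameMs;

    ReceiveTimeoutModel(long startMs) {
        lastFrameMs = startMs;
    }

    /** Record the arrival of a new frame. */
    void onFrameReceived(long nowMs) {
        lastFrameMs = nowMs;
    }

    /** True only if MORE than 50 ms have passed since the last frame. */
    boolean isTimedOut(long nowMs) {
        return nowMs - lastFrameMs > RECEIVE_TIMEOUT_MS;
    }
}
```

With frames arriving every 25 ms, the age of the latest frame should stay well below the threshold, so under this model a timeout implies that at least two consecutive frames were missed (or that the bookkeeping of the last arrival time went wrong).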

I wasn’t able to reproduce your issue with my test setup here. If you’ve checked that your device IDs are correct and the PDP is on the bus and are still seeing the issue, send us an e-mail (support@ctr-electronics.com) and we can walk through some troubleshooting.

New Developments:

Other commands like pdp.getTotalCurrent() work, but if I attempt to get a channel current or the voltage, there is a timeout issue and comms crash.

:frowning:

Please contact support (support@ctr-electronics.com) with a screenshot of the PDP self-test.

Sent email but posting picture here in case anyone else has ideas: seems to be software/protocol related at the moment.





Update number 2:

There does not seem to be any issue if I create a new 2018 project and access the PDP in new code. In a default command-based project, if in disabledPeriodic I do:

PowerDistributionPanel pdp = new PowerDistributionPanel(2);

and make calls to getCurrent(0), there are no perceived timeout errors. But I still have the issue with my 2017 code running. Both projects reference the same JAR files provided by CTRE and WPILib.

It would appear that the stack trace above (from 2017 code ported over to 2018) shows the issue in code that was not written by us, but by WPILib's LiveWindow? I think 2018 has an automated telemetry feature, because our code is not in the loop in the above stack trace.

So we found a workaround, or possibly a solution. For some reason the PDP is not happy when our code runs and the new 2018 telemetry feature is enabled.

We have our own thread that polls the PDP every 20 ms. This worked flawlessly in 2016 and 2017. 2018 added the new telemetry feature, which I believe is enabled by default and also polls the PDP. Is it possible that the double calls are causing the timeouts?

In any event, ensuring the PDP was set to ID 0 for WPILib and disabling the new 2018 telemetry feature made all CAN timeouts go away. Our system became stable again, and our own polling thread provided all PDP CAN data properly.

The following line was added in robotInit:
LiveWindow.disableAllTelemetry();

It would be nice to have this feature disabled by default, or to have it made more apparent in the 2018 code updates so that more teams can find these low-level details.

Hope this helps someone else.

You can turn off telemetry for just the PDP: LiveWindow.disableTelemetry(pdp);

From the "New for 2018" page: http://wpilib.screenstepslive.com/s/currentCS/m/getting_started/l/801080-new-for-2018

LiveWindow now provides continuous telemetry (e.g. of motor and sensor values) for most WPILib classes via NetworkTables. Telemetry is sent each loop iteration of the IterativeRobot and TimedRobot templates (SampleRobot does not provide this functionality). Dashboards such as Shuffleboard provide ways to record this telemetry for later playback and analysis.

Most WPILib classes add themselves to the LiveWindow when they are constructed. While we recommend telemetry be left enabled, telemetry for specific instances can be disabled using LiveWindow.disableTelemetry(), or all telemetry can be disabled using LiveWindow.disableAllTelemetry().

To implement this change, the Sendable interface now uses a property definition interface rather than multiple functions. The LiveWindowSendable and NamedSendable classes have been deprecated.

Subsystem now provides an addChild() function to use instead of LiveWindow.addSensor() and LiveWindow.addActuator(), which have been deprecated.

If you construct the PowerDistributionPanel with ID=0, and your PDP is not at ID=0, you’ll see CAN timeouts from the LiveWindow code because it is trying to get information for ID=0. If your PDP is at some other ID, you should construct PowerDistributionPanel at that ID instead (which you said works fine in a new project). Is it possible your 2017 code is constructing PowerDistributionPanel with ID=0 somewhere? It looks like you might be?

It would help understand this issue if you could answer the following questions:

  • What ID is your PDP actually set to?
  • How are you constructing the PowerDistributionPanel class? (code snippet)
  • How are you polling the PDP in your code? (code snippet)

Thank you,

This is the file we found which led us down the path of turning off telemetry. It may be helpful to other teams if the LiveWindow.disableTelemetry() and LiveWindow.disableAllTelemetry() calls were formatted as code snippets so that they stand out better.

Is there documentation that describes exactly what telemetry sends? Without knowing for sure what data is sent, I think we will just use our own telemetry link to control the data stream and polling rates, and disable all WPILib telemetry.

Thanks. Our initial code always had the PDP at ID 0, constructed with the default constructor, which defaults to 0.

pdp = new PowerDistributionPanel(); 

That code, and the code which gets us the pertinent data from the PDP, runs in its own periodic timer thread that we kick off in robotInit. It runs at a rate of 50 Hz to provide us with some automated voltage and current monitoring.
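A background poller like the one described above could be sketched as follows. This is not the team's actual code: it uses Java's ScheduledExecutorService as a stand-in for their timer thread, and VoltageMonitor, the DoubleSupplier source, and the low-voltage threshold are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

/** Hypothetical 50 Hz monitor; the supplier stands in for a call such as pdp.getVoltage(). */
class VoltageMonitor {
    private static final long PERIOD_MS = 20; // 50 Hz, as described in the post

    private final DoubleSupplier voltageSource;
    private final double lowVoltageThreshold; // assumed threshold, not from the post
    private volatile double lastVoltage = Double.NaN;
    private volatile boolean lowVoltage = false;
    private ScheduledExecutorService executor;

    VoltageMonitor(DoubleSupplier voltageSource, double lowVoltageThreshold) {
        this.voltageSource = voltageSource;
        this.lowVoltageThreshold = lowVoltageThreshold;
    }

    /** One polling step: read the source and update the published values. */
    void sampleOnce() {
        double v = voltageSource.getAsDouble();
        lastVoltage = v;
        lowVoltage = v < lowVoltageThreshold;
    }

    /** Start polling every 20 ms. If sampleOnce() throws, later runs are suppressed. */
    synchronized void start() {
        if (executor != null) {
            return; // already running
        }
        executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::sampleOnce, 0, PERIOD_MS, TimeUnit.MILLISECONDS);
    }

    /** Stop the polling thread. */
    synchronized void stop() {
        if (executor != null) {
            executor.shutdownNow();
            executor = null;
        }
    }

    double getLastVoltage() {
        return lastVoltage;
    }

    boolean isLowVoltage() {
        return lowVoltage;
    }
}
```

In robot code, the supplier would presumably be something like pdp::getVoltage, with start() called once from robotInit. Taking the source as a DoubleSupplier keeps the sketch independent of WPILib, so it can be exercised without a robot.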

The CAN timeouts only happened with getVoltage and getCurrent calls, not with any other PDP methods, which means we could communicate with the PDP. The stack trace shows that it isn’t our code that is timing out, but the code from LiveWindow. Since we don’t use LiveWindow, it immediately jumped out at us that it could be the new 2018 telemetry link.

Turning that off resolved all timeouts. If we disable our thread but leave telemetry on, the timeout errors stop as well. So it would appear that our calls at 50 Hz and the LiveWindow calls (which I think run at the rate of the TimedRobot class, 50 Hz by default as well) are bombarding the CAN bus.

At no point did we have any other CAN timeout issues (we only use PCM on CAN and no issues controlling solenoids).

We will opt to use our own telemetry link, which has served us well for years, until we have more time to play with the WPILib telemetry and see whether it provides additional information for our specific needs.

The code snippet above with PDP(2) was just a snapshot of us playing around with the IDs and constructors to see if that had anything to do with it.

The code that you linked from our repo is the code I am talking about, which we enable to perform background checks on the PDP. It is the only line of code that we use to instantiate a PDP; we do not have any other PDP instantiation elsewhere in our code. The only two things calling into the PDP were our thread (which you linked) and, new for 2018, the telemetry link.

Regards,
Kevin

Each WPILib class registers what data gets sent for telemetry. The telemetry is collected and sent to NT each periodic/timed loop iteration, but at present a NT flush is not called, so how frequently it’s sent over NT is based on the NT update rate (100 ms by default). For the PDP, the telemetry setup code is here: https://github.com/wpilibsuite/allwpilib/blob/master/wpilibj/src/main/java/edu/wpi/first/wpilibj/PowerDistributionPanel.java#L112; all 16 channel currents, the voltage, and the total current are included in the telemetry.

There is actually no change on the CAN bus from reading the telemetry data more frequently. The PDP autonomously sends CAN telemetry messages at a periodic rate; the roboRIO code just pulls off the most recently received message and doesn’t cause additional CAN transmissions. The timeout message is generated based on cached information on the roboRIO side, namely how recently the last message was received. What we discovered is that the HAL implementation isn’t locking this cache appropriately, so what’s almost certainly happening is that when you poll the PDP from a separate thread, a data race is causing the spurious timeout errors. We’ll fix this in the next update release.
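To illustrate the kind of locking described above, here is a minimal sketch of a frame cache whose data and timestamp are guarded by a single lock. It is an assumption-laden model, not the HAL code (which is not written in Java) and not necessarily how the actual fix works: GuardedFrameCache and its methods are hypothetical, and only the 50 ms threshold comes from earlier in the thread.

```java
/** Illustrative only: cached value and arrival time are always read and written together. */
class GuardedFrameCache {
    private static final long RECEIVE_TIMEOUT_MS = 50;

    private final Object lock = new Object();
    private double voltage;        // last decoded value
    private long lastFrameMs = -1; // -1 means no frame has been received yet

    /** Called by the receive path when a new frame arrives. */
    void store(double newVoltage, long nowMs) {
        synchronized (lock) {
            voltage = newVoltage;
            lastFrameMs = nowMs;
        }
    }

    /** Called by any reader thread; throws if the cached frame is missing or stale. */
    double readVoltage(long nowMs) {
        synchronized (lock) {
            if (lastFrameMs < 0 || nowMs - lastFrameMs > RECEIVE_TIMEOUT_MS) {
                throw new IllegalStateException("CAN receive timeout");
            }
            return voltage;
        }
    }
}
```

Because both methods take the same lock, a reader on one thread (for example a team's polling thread) and a reader on another (for example the telemetry loop) can never observe a half-updated pair of value and timestamp, which is the sort of inconsistency that could otherwise surface as a spurious timeout.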

Does this mean you were able to re-create our issue? If so, that's great news!