The diagnostics have evolved and I doubt the documentation of all the info is up to date. I’ll address your questions here for now.
The DLink is a relatively closed device. I’m sure it could log things just through configuration, but thus far we haven’t found it useful, and we are running stock FW, so at this point, the DLink logs nothing.
The cRIO cooperates in the protocol and helps the DS to distinguish certain issues which would otherwise look identical. In particular, the cRIO and DS will both start pinging devices when comms is down. When comms comes back up, this info is used to identify the break in the comms chain and tattle on the device that rebooted.
The majority of the diagnostics and all of the logging are done by the DS.
The system watchdog is a low level FPGA timed service that turns off the I/O when the RT system doesn’t feed it. This is to provide safety and isn’t specific to comms.
Comms is required in order for the RT task to feed the FPGA. Loss of comms, cable, router, DS, etc. will interrupt the feedings. Because the FPGA enables the I/O instantly when it is fed, comms that is just a bit late can cause stutters, so the FPGA keeps a counter of these events. The system watchdog count indicates transitions, but not duration. The FPGA watchdog timeout is 100ms, which is quite short.
Neither the DS nor the other components implement the watchdog, but if they are late or negligent in feeding, they will cause a watchdog trip. The DS reports the 44004 error when the UDP read from the robot times out after 500ms. Thus, in a situation where packet loss is not absolute, 100ms windows of comms loss are more likely than 500ms windows, so you may see watchdog counts without matching 44004 errors.
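As a rough illustration of why the two timeouts fire at different rates, here is a minimal Python sketch. It is not the DS or FPGA code. The arrival times are made up, `count_timeouts` is a hypothetical helper, and it simplifies by applying one list of arrival gaps to both thresholds, even though the watchdog depends on DS-to-robot traffic and the 44004 error on robot-to-DS traffic.

```python
FPGA_WATCHDOG_MS = 100    # from the post: FPGA watchdog timeout
DS_READ_TIMEOUT_MS = 500  # from the post: DS UDP read timeout (error 44004)


def count_timeouts(arrival_times_ms, timeout_ms):
    """Count gaps between consecutive arrivals that exceed timeout_ms.

    Each long gap counts once, however long it lasts, which mirrors a
    counter that records transitions but not duration.
    """
    count = 0
    for prev, cur in zip(arrival_times_ms, arrival_times_ms[1:]):
        if cur - prev > timeout_ms:
            count += 1
    return count


# Made-up packet arrival times in milliseconds (an assumption for illustration).
arrivals = [0, 20, 40, 60, 200, 220, 240, 900, 920, 1040, 1060]

watchdog_trips = count_timeouts(arrivals, FPGA_WATCHDOG_MS)
ds_timeouts = count_timeouts(arrivals, DS_READ_TIMEOUT_MS)
print(watchdog_trips, ds_timeouts)
```

With these sample times there are three gaps longer than 100ms but only one longer than 500ms, so the shorter deadline trips more often on the same traffic.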
Additionally, the error messages reporting on the comms are being sent over potentially faulty comms. So lost packets may drop error messages. This is why there are summary messages and sometimes confusing inconsistencies where you may expect to see messages come in pairs, but find that sometimes they are solo.
To summarize, the FPGA and its deadline are concerned with safety. The DS uses the timeouts to help diagnose the issue and report problems. The DS does its best to filter and present a summary of the comms in the log file viewer, but this is complicated by missing messages.
You didn’t ask about it, but the charts tab shows two primary measurements of comms health in addition to errors/warnings. The latency or trip time measurement is of successful packets. In FRC, this number generally indicates retries at the wifi layer. The wifi layer will retry the UDP traffic at least four times in a clean environment and more in a noisy environment before moving on and letting the upper layers deal with it. UDP deals with it by doing nothing. TCP would deal with it using NACKs, timeouts, and retransmission.
The blue bars indicate the packets that never arrived. They didn’t make the full trip, but some portion of them may have made it from DS to robot, keeping it enabled and allowing it to drive, even though the status from robot to DS was lost.
If you had a clean room and a robot connected to the DS over wifi, you’d expect no retransmissions or failures. Add RF noise and you’d start to see latency rise as retransmission is used to overcome noise. Add more noise and latency will peak and loss will start to rise.
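The shape of that curve can be seen in a toy model. This is only a sketch under loud assumptions: each transmission attempt is lost independently with the same probability, and the wifi layer gives up after a fixed number of attempts (four here, picked to echo the figure above). Real wifi retry behavior is more involved, and `retry_model` is a hypothetical helper, not anything from the DS.

```python
def retry_model(p_attempt_loss, max_attempts):
    """Return (packet loss probability, expected attempts per packet).

    Assumes independent attempts with identical loss probability and a
    hard cap of max_attempts tries before the packet is dropped.
    """
    # The packet is lost only if every attempt fails.
    loss = p_attempt_loss ** max_attempts
    # Attempt k+1 happens only if the first k attempts all failed.
    expected_attempts = sum(p_attempt_loss ** k for k in range(max_attempts))
    return loss, expected_attempts


# Sweep the per-attempt loss from a clean room to a very noisy venue.
for p in (0.0, 0.1, 0.5, 0.9):
    loss, attempts = retry_model(p, max_attempts=4)
    print(f"attempt loss {p:.1f}: packet loss {loss:.4f}, "
          f"mean attempts {attempts:.3f}")
```

In this model the mean number of attempts (a stand-in for latency) climbs first and levels off near the attempt cap, while packet loss stays small until the noise is heavy and then rises quickly.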
One more item to add to your list of configurations to attempt:
You may want to consider cutting back or turning off the video stream in a noisy demo venue. The TCP traffic degrades less gracefully under noise and will interfere with the UDP traffic.
Greg McKaskle