The short answer is that if we discover network latency/dropout to be a significant problem, we will move our image processing application to an onboard laptop. Failing that, our next fallback is to reduce resolution and/or framerate. To be frank, we auto-target just fine at 5 fps (because the gyro loop is closed at 200 Hz); we do 30 because we can.
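Roughly, the split looks like the sketch below (not our actual code; Gyro, VisionClient, and Turret are placeholder interfaces, and the gain is made up). The point is that the fast loop only ever talks to the gyro, so the vision frame rate mostly just decides how often the setpoint gets refreshed.

```java
// Sketch of the control split described above: the heading loop closes on the
// gyro at ~200 Hz, while vision merely nudges the setpoint whenever a result
// arrives, so 5 fps vs. 30 fps barely matters for aiming.
public class AimLoop implements Runnable {
    public interface Gyro { double getAngleDegrees(); }
    public interface VisionClient { Double pollTargetOffsetDegrees(); } // null if no new frame
    public interface Turret { void setTurnPower(double power); }

    private final Gyro gyro;
    private final VisionClient vision;
    private final Turret turret;
    private double headingSetpoint;
    private static final double KP = 0.02; // made-up proportional gain

    public AimLoop(Gyro gyro, VisionClient vision, Turret turret) {
        this.gyro = gyro;
        this.vision = vision;
        this.turret = turret;
        this.headingSetpoint = gyro.getAngleDegrees();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // New vision result? Just move the setpoint; the fast loop does the rest.
            Double offset = vision.pollTargetOffsetDegrees();
            if (offset != null) {
                headingSetpoint = gyro.getAngleDegrees() + offset;
            }
            // Close the loop on the gyro every cycle (~200 Hz).
            double error = headingSetpoint - gyro.getAngleDegrees();
            turret.setTurnPower(KP * error);
            try { Thread.sleep(5); } catch (InterruptedException e) { return; }
        }
    }
}
```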
However, I do not expect this to be a major concern. In past seasons, teams have streamed live camera data directly to their dashboards with few problems. The only difference is that now we are cutting out the cRIO altogether. We haven't run simulations against an "FRC network simulator" (if you know of a tool that could be used for this purpose, I would be interested in trying it), but in theory there is PLENTY of bandwidth to go around. With reasonable compression settings these images are only on the order of 10-20 kilobytes apiece.
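Back-of-the-envelope, taking the high end of those numbers: 30 frames/s x 20 kB/frame = 600 kB/s, or roughly 4.8 Mbit/s of camera traffic, which is a small slice of what the wireless link should be able to move.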
We don't timestamp the images, but we do transmit our heading synchronously with each new camera image becoming available. That way, the results returned by the vision application do not go "out of date" if they are received late. Out-of-order packets would be a bigger problem (it's UDP under the hood), but absolute worst case - like you said - this would be a transient problem and would straighten itself out within a second or two.
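The idea, sketched out (the classes and field names here are made up for illustration, not our actual wire format): the heading grabbed when the frame goes out rides along with it, the vision app echoes it back with its answer, and the offset gets applied relative to where the robot was pointing *then* rather than now. A monotonically increasing frame id would also give a cheap way to drop out-of-order results, though that's a suggestion, not something we currently do.

```java
// Illustrative pairing of a frame with the heading at capture time, so a late
// vision result still yields a valid absolute setpoint.
public class StampedVision {
    /** Sent alongside each frame handed to the vision app. */
    public static final class FrameHeader {
        public final long frameId;
        public final double headingAtCaptureDeg; // gyro heading when the image was grabbed
        public FrameHeader(long frameId, double headingAtCaptureDeg) {
            this.frameId = frameId;
            this.headingAtCaptureDeg = headingAtCaptureDeg;
        }
    }

    /** Echoed back by the vision app with its answer for that frame. */
    public static final class VisionResult {
        public final long frameId;
        public final double headingAtCaptureDeg; // copied from the matching FrameHeader
        public final double targetOffsetDeg;     // where the target sat in that image
        public VisionResult(long frameId, double headingAtCaptureDeg, double targetOffsetDeg) {
            this.frameId = frameId;
            this.headingAtCaptureDeg = headingAtCaptureDeg;
            this.targetOffsetDeg = targetOffsetDeg;
        }
    }

    /** Even if the result arrives 100+ ms late, the setpoint it implies is still valid. */
    public static double absoluteTargetHeading(VisionResult r) {
        return r.headingAtCaptureDeg + r.targetOffsetDeg;
    }
}
```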
EDIT: Forgot to add, we also do low-pass filtering of both outputs from the vision system to help smoothness (and to reject momentary disturbances like when we occlude the vision target with a flying basketball). This should help with occasional frame drops as well.
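Something along the lines of this first-order filter, one instance per vision output (a generic sketch; the alpha here is arbitrary, not our tuned value):

```java
// Plain exponential (first-order low-pass) smoothing of a vision output.
public class LowPassFilter {
    private final double alpha;   // 0..1, smaller = smoother but slower to respond
    private double state;
    private boolean initialized;

    public LowPassFilter(double alpha) {
        this.alpha = alpha;
    }

    /** Feed each new vision sample through; returns the smoothed value. */
    public double update(double sample) {
        if (!initialized) {
            state = sample;
            initialized = true;
        } else {
            state += alpha * (sample - state);
        }
        return state;
    }

    /** On a dropped frame, just keep using the last filtered value. */
    public double get() {
        return state;
    }
}
```

With a smallish alpha, one wild sample (basketball in front of the target) only pulls the output a fraction of the way toward the bad reading, and a dropped frame simply means the last filtered value stays in effect for another cycle.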