Has anyone experimented with solving for multi-tag pose estimates across multiple cameras? I was just thinking: if seeing two AprilTags with one camera leads to a large accuracy boost, would seeing two tags with two different cameras have a similarly helpful effect? Surely there is some way to synthesize those separate detections.
The issue with pose estimation across cameras is the difficulty in synchronizing them, particularly with standard USB cameras. I’ve observed that some cameras with internal clocks may be more effective.
That definitely makes sense, but if you somehow were able to synchronize two cameras, I’m sure that could significantly improve your output.
Also, theoretically, if your odometry was accurate enough over a very short period of time, you could just perform some latency compensation between the cameras. (AKA if the cameras are 14 ms off from each other, you could probably just offset one measurement by the distance you would travel in 14 ms.)
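A minimal sketch of that offset idea, assuming you have the robot-relative ChassisSpeeds and have somehow measured the time skew between the two cameras (the class and method names here are just placeholders):

```java
import edu.wpi.first.math.geometry.Pose2d;
import edu.wpi.first.math.geometry.Twist2d;
import edu.wpi.first.math.kinematics.ChassisSpeeds;

public class CameraSkewCompensation {
  /**
   * Shifts a pose reported by the "late" camera forward by the distance the robot
   * travels during the inter-camera time skew, using the current robot-relative speeds.
   */
  public static Pose2d compensateForSkew(Pose2d latePose, ChassisSpeeds speeds, double skewSeconds) {
    // Integrate the current velocity over the skew interval as a twist,
    // then apply it to the late camera's pose estimate.
    Twist2d twist = new Twist2d(
        speeds.vxMetersPerSecond * skewSeconds,
        speeds.vyMetersPerSecond * skewSeconds,
        speeds.omegaRadiansPerSecond * skewSeconds);
    return latePose.exp(twist);
  }
}
```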
We never did the multi-tag pose estimation via PhotonVision; it’s on our list to do this season though…
We did have 2 cameras on our robot though, and did combine the results to make one pose estimate, even when a camera saw multiple tags. We just used the WPILib odometry classes to do this: iterate over each camera, then for each tag seen by that camera, make an addVisionMeasurement() call. WPILib does all the math for you in the background. With this setup we were within about +/- 2 mm of accuracy when we saw the speaker, even at ~3 m; it was absolutely insane.
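For what that loop roughly looks like (not their exact code), here is a sketch assuming the 2024 PhotonVision API and a WPILib SwerveDrivePoseEstimator; the camera-to-transform map is a made-up name for whatever holds your measured mounting positions:

```java
import edu.wpi.first.apriltag.AprilTagFieldLayout;
import edu.wpi.first.math.estimator.SwerveDrivePoseEstimator;
import edu.wpi.first.math.geometry.Pose3d;
import edu.wpi.first.math.geometry.Transform3d;
import org.photonvision.PhotonCamera;
import org.photonvision.targeting.PhotonPipelineResult;
import org.photonvision.targeting.PhotonTrackedTarget;

import java.util.Map;

public class MultiCameraVision {
  private final SwerveDrivePoseEstimator poseEstimator;
  private final AprilTagFieldLayout fieldLayout;
  // Hypothetical map of each camera to its measured robot-to-camera transform.
  private final Map<PhotonCamera, Transform3d> robotToCamera;

  public MultiCameraVision(SwerveDrivePoseEstimator poseEstimator,
                           AprilTagFieldLayout fieldLayout,
                           Map<PhotonCamera, Transform3d> robotToCamera) {
    this.poseEstimator = poseEstimator;
    this.fieldLayout = fieldLayout;
    this.robotToCamera = robotToCamera;
  }

  /** Call periodically: feed every tag from every camera into the pose estimator. */
  public void update() {
    for (var entry : robotToCamera.entrySet()) {
      PhotonCamera camera = entry.getKey();
      Transform3d camTransform = entry.getValue();
      PhotonPipelineResult result = camera.getLatestResult();

      for (PhotonTrackedTarget target : result.getTargets()) {
        fieldLayout.getTagPose(target.getFiducialId()).ifPresent(tagPose -> {
          // field->camera = field->tag composed with (camera->tag)^-1
          Pose3d fieldToCamera = tagPose.transformBy(target.getBestCameraToTarget().inverse());
          // field->robot = field->camera composed with (robot->camera)^-1
          Pose3d fieldToRobot = fieldToCamera.transformBy(camTransform.inverse());
          // Passing the capture timestamp lets WPILib handle latency compensation.
          poseEstimator.addVisionMeasurement(fieldToRobot.toPose2d(), result.getTimestampSeconds());
        });
      }
    }
  }
}
```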
This is really useful (for the 2024 season) if you have one camera pointed at the speaker, then 1-2 cameras pointed about 120° apart to catch tags at the amp or stage. Then you can locate yourself in more areas of the field. The thing to watch out for is that with more cameras, your measurements of the camera mounting positions need to be better and better, as you start bouncing around the mounting position errors. (Our multiple-camera setup both looked at the speaker this season, but it will be different next year.)
WPILib does the latency compensation for you already; that is why you pass in the timestamp with the reading.
It even keeps a wheel speed sample buffer too, so it can go back and say: x milliseconds ago, vision said I was at this point, these were the wheel speeds, and it will dynamically correct your position based on the old wheel speeds and camera readings. It’s honestly really slick.
I understand how to use multiple cameras to update your pose in the built-in Kalman filter in WPILib, but there is still a difference between this and what I am asking.
In PhotonVision, for example, MultiTag solves for your position using the corner points from all of the visible tags in a single solve. This gives you a more accurate position than just averaging the results from the separate AprilTags individually.
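As a rough illustration of what that single-camera multi-tag solve looks like (not PhotonVision’s actual implementation; the corner-collection inputs are assumed to already be matched up in order):

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;
import org.opencv.core.Point;
import org.opencv.core.Point3;

import java.util.List;

public class MultiTagPnp {
  /**
   * Solves for the camera pose relative to the field using the corners of every
   * visible tag in one PnP solve, rather than averaging per-tag solutions.
   *
   * @param fieldCorners 3D corner positions of every detected tag, in field coordinates
   * @param imageCorners the matching detected pixel corners, in the same order
   */
  public static boolean solveCameraPose(List<Point3> fieldCorners,
                                        List<Point> imageCorners,
                                        Mat cameraMatrix,
                                        MatOfDouble distCoeffs,
                                        Mat rvecOut,
                                        Mat tvecOut) {
    MatOfPoint3f objectPoints = new MatOfPoint3f();
    objectPoints.fromList(fieldCorners);
    MatOfPoint2f imagePoints = new MatOfPoint2f();
    imagePoints.fromList(imageCorners);

    // One solve over all corners of all tags; rvec/tvec map field coordinates
    // into the camera frame, so invert afterwards to get the camera-in-field pose.
    return Calib3d.solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvecOut, tvecOut);
  }
}
```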
OpenCV’s solvePnP only seems to support one camera perspective. Another idea would be OpenCV’s triangulatePoints() method, then trying to match up that array of triangulated points with the known AprilTag corner positions on the field using estimateAffine3D(). This seems like it could work, but only in the case where the cameras see the same tags together.
Something that I think would probably get pretty close to what’s necessary would be finding the estimated positions of all the tag corners relative to each camera using solvePnP, combining all of those point sets (using the known camera mounting transforms to put them in one frame), and then finding the transform that minimizes error using estimateAffine3D.
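Here is a hedged sketch of that combination step, assuming you already have, for each detected tag, its corner positions expressed in the robot frame (from per-camera solvePnP plus the camera mounting transform) and the matching corner positions in the field frame:

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.CvType;
import org.opencv.core.Mat;

import java.util.List;

public class MultiCameraSolve {
  /**
   * Given matched 3D point sets (tag corners in the robot frame vs. the same
   * corners in the field frame), estimate the robot-to-field transform that
   * best maps one onto the other. Needs at least 4 correspondences.
   *
   * Note: estimateAffine3D fits a general 12-parameter affine transform, not a
   * strictly rigid one, so scale/shear in the result is a sign of bad inputs.
   */
  public static Mat solveRobotToField(List<double[]> robotFramePoints,
                                      List<double[]> fieldFramePoints) {
    Mat src = toMat(robotFramePoints);
    Mat dst = toMat(fieldFramePoints);

    Mat affine = new Mat();   // 3x4 [R|t]-style output
    Mat inliers = new Mat();  // RANSAC inlier mask
    Calib3d.estimateAffine3D(src, dst, affine, inliers);
    return affine;
  }

  /** Packs a list of {x, y, z} arrays into an Nx3 double matrix. */
  private static Mat toMat(List<double[]> points) {
    Mat mat = new Mat(points.size(), 3, CvType.CV_64F);
    for (int i = 0; i < points.size(); i++) {
      mat.put(i, 0, points.get(i));
    }
    return mat;
  }
}
```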
After some searching, this OpenCV extension would do what I’m looking for, solving-wise.
Something of note: I’ve never really worked with OpenCV before and would have no idea how to actually end up using this with something like PhotonVision. My guess is that this could work, but I am not sure it would run very fast on a roboRIO, so it would have to be run on a coprocessor. This would entail one of probably two options.
Option A
Continuously sending speed/pose data to a coprocessor over NT so it can continually do the compensation calculations. The coprocessor would collect camera data from up to two cameras plugged into itself, but it could also collect from other cameras over NT.
Option B
The roboRIO takes all the corner data down from NT, offsets the camera poses by latency, and then sends a “request” of sorts over NT to a coprocessor, which calculates a solution to the request, at which point the roboRIO takes back the calculated pose.
Option B would have an extra back-and-forth over NT before the poses were accurately synthesized and sent back to the roboRIO, but it would probably be easier to program. (All the confusing stuff for me would be in Java on the roboRIO, and the coprocessor would basically just be used as a method call of sorts.)
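A minimal sketch of the Option B request/response pattern on the roboRIO side, assuming NT4 and made-up topic names (/multicam/request and /multicam/result); the coprocessor would subscribe to the request topic and publish its solved pose to the result topic:

```java
import edu.wpi.first.networktables.DoubleArrayPublisher;
import edu.wpi.first.networktables.DoubleArraySubscriber;
import edu.wpi.first.networktables.NetworkTableInstance;

public class MultiCamRequester {
  private final DoubleArrayPublisher requestPub;
  private final DoubleArraySubscriber resultSub;

  public MultiCamRequester() {
    NetworkTableInstance nt = NetworkTableInstance.getDefault();
    // Flattened, latency-compensated corner data goes out on the request topic.
    requestPub = nt.getDoubleArrayTopic("/multicam/request").publish();
    // The coprocessor writes back something like [x, y, thetaRadians, timestampSeconds].
    resultSub = nt.getDoubleArrayTopic("/multicam/result").subscribe(new double[0]);
  }

  /** Send one flattened packet of corner data to the coprocessor. */
  public void sendRequest(double[] flattenedCornerData) {
    requestPub.set(flattenedCornerData);
  }

  /** Poll for the latest solved pose; returns null until the coprocessor answers. */
  public double[] pollResult() {
    double[] result = resultSub.get();
    return result.length == 0 ? null : result;
  }
}
```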
That mapping3d thing you found seems to be aimed at solving for a single point using multiple views, which is a special case of what you’re really interested in, which I think is finding the most likely pose given multi-camera, multi-tag sightings. That’s a good goal, but it’s not simple.
You might be able to make great progress by going the other way: simplify the problem even further, down to 2D stereo. The field is 2D, so you can project all the tag sightings down into the field plane, which makes the math much easier.
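As one concrete (and deliberately simplified) illustration of the 2D idea, this is my own sketch, not from the post above: with field-frame bearings to two tags at known field positions (e.g. gyro heading plus camera yaw to each tag), the robot position falls out of a 2x2 line-intersection solve.

```java
import edu.wpi.first.math.geometry.Translation2d;

public class PlanarTwoTagSolver {
  /**
   * Locates the robot in the field plane from bearings to two tags at known field
   * positions. bearing1/bearing2 are field-frame angles (radians) from the robot
   * toward each tag. Returns null if the two bearing lines are nearly parallel.
   */
  public static Translation2d solve(Translation2d tag1, double bearing1,
                                    Translation2d tag2, double bearing2) {
    // Unit direction from the robot toward each tag, in the field frame.
    double d1x = Math.cos(bearing1), d1y = Math.sin(bearing1);
    double d2x = Math.cos(bearing2), d2y = Math.sin(bearing2);

    // The robot lies on the line through tag1 with direction d1 and on the
    // line through tag2 with direction d2; intersect the two lines.
    double cross = d1x * d2y - d1y * d2x;
    if (Math.abs(cross) < 1e-9) {
      return null; // bearings are parallel, no unique intersection
    }
    double t1 = ((tag2.getX() - tag1.getX()) * d2y
               - (tag2.getY() - tag1.getY()) * d2x) / cross;
    return new Translation2d(tag1.getX() + t1 * d1x, tag1.getY() + t1 * d1y);
  }
}
```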
No, I mean to make it work all you have to do is add an enum with the camera properties and it will work. Of course not as well as if it were tuned or filtered, but it would work.
I was looking into doing this at one point as well, but never figured it out.
The thought is that the MultiTag solutions are better than single-tag because you have twice the information, which allows the system to get rid of ambiguity. But a camera doesn’t always see more than one tag on the field, so what if you had two cameras? In this case, even if the two cameras only saw the same single tag, you would still have twice the information (MultiCamera instead of MultiTag).
I never figured it out (I actually started down a rabbit hole with GTSAM), but in theory this would work.