Challenge problem: 'ladder of abstraction' iterative design in FRC

This is a really interesting read, and well worth your time if you’re interested in interactive, iterative design processes.
http://worrydream.com/LadderOfAbstraction/

When reading this, it occurred to me that this kind of design process could be really useful for designing automated software solutions for the kinds of problems that we face in FRC every build season.

Here’s a challenge problem for the FIRST community, then: how can we make these kinds of iterative design techniques easy to apply to the robotics problems we deal with every year?

I think the first step is software that contains interactive robot simulators, like PyFRC, Gazebo, the CCRE, or frc-java-simulator. But we can do better, perhaps with an interactive toolkit or more iteration-friendly simulation capabilities.

That article brings up a number of interesting points! I know that we would have had significantly better autonomous modes if this kind of system were available and used this past season.

On the technical side of implementing these kinds of features, I’m not quite sure how it would be done well. If you try to abstract out time, you’d really have to run the entire program through (in 1:1 time) and record the results, as resuming from arbitrary states in complex robot code would probably be extremely complicated. If you tried to speed this up, the program could very easily “break the abstraction” (intentionally or not) by referencing System.currentTimeMillis() (or the equivalent call in C++), which would cause different behavior unless you managed to plug every single hole like that in the system.
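As a tiny illustration of that leak (all names here are made up for the example): even if a framework routes everything through a replaceable clock, nothing stops user code from going straight to the real timer.

interface Clock {
	long millis();
}

class EmulatedClock implements Clock {
	private long virtualMillis = 0;
	public long millis() { return virtualMillis; }
	public void advance(long ms) { virtualMillis += ms; } // emulator fast-forwards
}

class UserCode {
	void runStep(Clock clock) {
		long good = clock.millis();            // respects the abstraction
		long bad = System.currentTimeMillis(); // silently bypasses it
		// Under fast-forwarding, 'good' and 'bad' diverge, and any logic
		// based on 'bad' behaves differently than it would on the robot.
	}
}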

I suppose that adding recording functionality to existing emulation platforms might be the easiest way to provide a basic level of this kind of functionality. I might try to add this onto the CCRE’s emulator this summer, and if I do I’ll report how useful it seems to be.

EDIT: Something that would make this kind of iteration easier would be to have Compile on Save working in NetBeans, plus another tool to swap the updated code into the running application, as this would let us modify code in real time in an emulator.

I agree, that’s why I called it a challenge problem. :slight_smile:

I think the tricky thing is really being able to maintain state about the robot’s behavior in a way that the simulator/emulator can properly reason about and adjust.

Another way I could imagine tackling the problem (particularly for autonomous mode state machines) is controlling the robot via abstractions that have state built into them, so that the simulator/stepper has ultimate control over the state machine. The variables are then exposed via some mechanism, and that’s how you handle variation over time.

I implemented a very early version of something like that in our code this year, where the actual logic for switching states is handled by an underlying object, and the user code doesn’t have (much) to do with changing it.
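Roughly, I’m imagining something like this (a sketch with invented names, not our actual code): the framework object owns the current state and its entry time, so a simulator can inspect them, jump to an arbitrary state, and single-step.

// All mutable state lives in the framework object, not in user code, so a
// simulator/stepper can save it, restore it, or force a transition.
class AutoStateMachine {
	private int state;
	private long stateEntryTime;

	public int getState() { return state; }

	// The simulator (or the machine itself) can jump to any state.
	public void setState(int newState, long now) {
		state = newState;
		stateEntryTime = now;
	}

	// One update tick: purely a function of the stored state and 'now'.
	public void step(long now, DriveOutput drive) {
		switch (state) {
			case 0: // drive backwards for 5 seconds
				drive.set(-1, -1);
				if (now - stateEntryTime > 5000) setState(1, now);
				break;
			case 1: // drive forwards for 1 second
				drive.set(1, 1);
				if (now - stateEntryTime > 1000) setState(2, now);
				break;
			default: // done
				drive.set(0, 0);
		}
	}
}

interface DriveOutput {
	void set(double left, double right);
}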

One more thought that I’m definitely going to implement by next year for PyFRC is just showing a basic 2D field with where the robot is. It’s easy enough to do, and though not totally accurate, it would get one close enough for good testing.

It’s actually not that hard to put together robot code that is amenable to this sort of analysis. In our code this year, any function that used time took a variable containing the current time, so time wasn’t a problem. And we had a single structure that held almost all of our program’s state, so running from arbitrary states was easy too.
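A minimal sketch of that shape (written in Java here, with invented names; the idea is language-independent):

// All program state lives in one plain object, and time-dependent logic
// takes 'now' as a parameter, so a test can replay from any saved state.
class RobotState {
	double shooterSpeed;
	long lastFireTime;
}

class ShooterLogic {
	// A pure function of (previous state, inputs, time): trivially
	// rerunnable from any starting point.
	static RobotState update(RobotState in, boolean trigger, long now) {
		RobotState out = new RobotState();
		out.shooterSpeed = trigger ? 1.0 : 0.0;
		out.lastFireTime = trigger ? now : in.lastFireTime;
		return out;
	}
}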

Here are the only things that weren’t totally clean:

  1. We used the filesystem: Our shooter’s speed presets depended on a config file, so if you wanted it to work the same every time you needed to start with the same config file. This was easy to deal with in practice; we would just delete the file and let it get recreated.
  2. Some data about the current state of the jags wasn’t kept with the other program state. This state was never available to the rest of the program, so it wasn’t an issue when trying to test the other pieces.

This year we made our pieces truly independent of the underlying system, and it meant that we could get great test coverage. It’s not that hard.

Yes, it’s not all that hard. Most of the CCRE’s systems do go through a replaceable abstraction layer. However, all of these cases involve functionality where it will be obvious if you try to go around it. (For example, it won’t compile on the actual robot.) Unfortunately, System.currentTimeMillis() isn’t like this. I can easily make a replaceable timing system, but then built-in timing systems cannot be used (like System.currentTimeMillis(), java.util.Timer, the WPILib Timer, Thread.sleep, Object.wait, and probably a bunch more.)

The key part here is that since I’m writing a large framework for more people to use than just me or others on my team, I want to avoid forcing this kind of change in user programs.

My current best idea is implementing a class transformer in the emulator that substitutes timing-related calls with abstracted calls. This makes for a more complex implementation, but it would be less complex for the user. The fake implementations could have an auto-time-skip system: whenever no threads are ready to run, it fast-forwards until the next sleep or wait times out and then continues. This would let it run at maximum simulation speed by eliminating the delays from the system, without actually making the program run in a different fashion, and therefore wouldn’t limit the usefulness of the framework.
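For a sense of what those fake implementations might look like, here’s a minimal sketch under a big simplifying assumption: a single simulated thread, so there’s never a question of which sleeper wakes first. (FakeTime and its methods are made-up names, not anything that exists yet.)

// A fake clock for the emulator: sleeps never block, they just advance
// virtual time, so the simulation runs as fast as the host CPU allows.
// With multiple threads, sleep() would instead record a wakeup time and
// block; once every thread is blocked, the clock would jump straight to
// the earliest pending wakeup.
public class FakeTime {
	private static long virtualMillis = 0;

	public static synchronized long currentTimeMillis() {
		return virtualMillis;
	}

	public static synchronized void sleep(long millis) {
		virtualMillis += millis; // fast-forward instead of actually waiting
	}
}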

That is exactly how PyFRC allows you to pause/step time, except that in python it’s a bit easier to replace it out from underneath the program. :slight_smile:

I think not allowing calls to the built-in timing systems is reasonable, and a custom “timing system” is not required to replace them. A single way to get the current time is sufficient. Other abstractions can be built on top of that.

I know you’ve said that you’ve had your last compatibility-breaking release of the CCRE, but how many external users do you have?

Ah, nice! I’ll probably end up using it then, if it works for you.

I disagree that it’s reasonable, at least for my purposes. I’m trying to have as few “gotchas” as possible, and if I avoid requiring use of a special timing mechanism, I avoid another place where users could be confused.

Well, at the very minimum, two ways to interact with time are required: getting the current time and sleeping (you cannot reasonably implement either in terms of the other), at which point a custom timing system might be the best bet. I’m probably going to use Java’s java.lang.instrument.Instrumentation interface and define a custom agent to let me rewrite calls to the core timing methods. (Probably using the ASM4 library, which I’ve worked with before.) Alternatively, I could define a custom ClassLoader that provides similar transformations and reload the emulator and all of its dependencies with it, although I’m not sure that would work properly if I need to modify built-in classes.

It’s a more complicated implementation, but a less complicated interface.
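To make that concrete, here’s a rough sketch of what the call rewriting might look like with ASM4. Everything here is hypothetical: emulator/FakeTime stands in for whatever fake timing class the emulator would provide, and a real agent would also need to handle Thread.sleep, Object.wait, and friends.

import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Rewrites calls to System.currentTimeMillis() so that they go to a fake
// clock instead. The emulator would run every loaded class through this
// (via an Instrumentation ClassFileTransformer or a custom ClassLoader)
// before handing it to the JVM.
public class TimeCallRewriter extends ClassVisitor {
	public TimeCallRewriter(ClassVisitor next) {
		super(Opcodes.ASM4, next);
	}

	@Override
	public MethodVisitor visitMethod(int access, String name, String desc,
	                                 String signature, String[] exceptions) {
		MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
		return new MethodVisitor(Opcodes.ASM4, mv) {
			@Override
			public void visitMethodInsn(int opcode, String owner, String name, String desc) {
				if (opcode == Opcodes.INVOKESTATIC
						&& owner.equals("java/lang/System")
						&& name.equals("currentTimeMillis")) {
					// Same ()J signature, so the operand stack is unaffected.
					super.visitMethodInsn(Opcodes.INVOKESTATIC,
							"emulator/FakeTime", "currentTimeMillis", desc);
				} else {
					super.visitMethodInsn(opcode, owner, name, desc);
				}
			}
		};
	}
}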

Nine robots that I know of have used the CCRE: seven on Team 1540 (four bunnybots, our 2014 prototype, our 2014 competition robot, and our 2013 competition robot, although that was a post-season addition) and two from our school but not strictly within Team 1540 (ScumBot and the ScumBot control system test platform). In other words, we don’t know of any external users. This makes sense, though, since it wouldn’t be likely that anyone outside of our school would want to use a first-year tool, which is what it was as of last season.

The biggest reason we know of that would make others not want to use the CCRE is worry about its stability. If it’s unstable and breaks during a competition, they’d need someone who understands it to fix it on the spot. By keeping it as stable as possible, we do our best to minimize this concern. (Also, let’s avoid clogging this thread up with anything unrelated to the topic at hand, so if you want to discuss this part about the CCRE more, let’s move to the CCRE’s thread or private messages.)

More on topic, I started implementing a basic prototype version of this:
https://i.imgur.com/h0ofKI3.png
Currently it’s a system that graphs the actuator outputs of autonomous mode (or any other period of time). It doesn’t yet fast-forward the execution, but it does run the exact same unrestricted code that we’ve previously been able to write. It’s for the following autonomous mode:

protected void autonomousMain() throws AutonomousModeOverException, InterruptedException {
	moveForTime(-1, 5000); // drive backwards for five seconds

	waitForTime(300);
	shifter.set(true); // shift gears, with a pause on either side
	waitForTime(300);

	moveForTime(1, 1000); // drive forwards for one second

	waitForTime(300);
	shifter.set(false);
}

private void moveForTime(float speed, int time) throws AutonomousModeOverException, InterruptedException {
	// Run both sides of the drivetrain at the given speed for the given
	// number of milliseconds, then stop.
	leftOut.set(speed);
	rightOut.set(speed);
	waitForTime(time);
	leftOut.set(0);
	rightOut.set(0);
}

Prepare to have your mind blown :stuck_out_tongue:


void sleep(float length){
	float start=get_time();
	while(get_time()<start+length);
}

I am well aware of this “solution”.
I said “reasonably implement”. Your suggestion, as I truly hope you know, pegs CPU usage at 100%, consuming all available processing power. That is not reasonable, except in very specific use cases.

Also, if you try this on the cRIO under Java, you’ll notice that it hangs every single thread other than the main thread for the duration, because Squawk does cooperative multitasking and this code has no “Thread.yield()”. (It’s still a bad idea in C++, of course, but not quite as horrendous.)

The cRIO’s Java threading is cooperative? Ick! I’m tempted to go read the Java specification to see if that’s even allowed.

EDIT: Went and read the Java spec. This does appear to be allowed. In particular, there’s the section where it mentions that “sleep” contains an implicit yield, and the section where it says that having one thread “diverge” (i.e., infinitely loop with no observable effects) means that all threads are allowed to do so. I am now very happy not to be writing Java.

By the way, efficiency is not relevant to the point I was trying to make. And if you’re doing cooperative stuff, then your implementation can just change to:


void sleep(float len){
    float start=get_time();
    while(get_time()<start+len) yield();
}

This still pegs your CPU, which, as I mentioned earlier, is not relevant.

I only figured this out by diving into the Squawk source code that I found online, so it’s possible that the cRIO uses a majorly different version of Squawk, but I sincerely doubt it, especially as it matches all of my observations and it seems difficult to change the concurrency model of Squawk. Luckily, this doesn’t actually matter all that much in practice. The main place where it’s annoying is vision processing: at least with old versions of the FRC library, our vision processing could block other threads from running for ~100 ms and cause major lag. Luckily, we changed our code so that we didn’t have this issue, but it still wasn’t very nice.

Actually, CPU usage here is completely relevant. Would you like to use a computer that’s spending almost all of its CPU on busywaiting? It would probably be slow, and there are certainly many better ways to do it. Busywaiting is a known antipattern. The only reason it could make sense is if you are sleeping for a very short amount of time (see “What are trade offs for ‘busy wait’ vs ‘sleep’?” on Stack Overflow).

I agree that this is a good solution in some rare cases. I’ve even used it myself in bare-metal programming, when there’s literally nothing else the computer could possibly be spending its time on. But it’s not a good solution in this case. While it is a solution that “works”, it is not a reasonable solution for this context.

I feel your pain with that one. There’s definitely been stuff provided that just doesn’t seem like it works the way it should. Like on the old PICs: if you took their standard template and inserted a single call to atan2() or cos(), the robot would totally die. (The built-in math functions took longer than the watchdog’s timeout, which was updated in the same loop.)



Why do you care about the CPU load percentage? If you have a latency problem, then you have a latency problem regardless of CPU load. If you’re not getting the throughput you want, it could be because you’re waiting on internal storage. If your robot is drawing too much power, I doubt it’s because of the CPU. (Does the cRIO’s CPU ever go into a low-power mode anyway?) And I’ve never heard of the cRIO having a thermal problem.

Maybe this is just the electrical engineer in me talking, but when I hear about an embedded system where the CPU usage will never spike above 2%, my reaction is not that the software is super-efficient but that the designers should have put a cheaper part on the board.

One reason I can think of is because if you artificially inflate the load to 100%, then when you have an actual load problem you won’t notice it.

Ah, nice. I’ve only been here in FRC for three years, so I never experienced that, but it sounds very annoying.

Actually, in rare cases with very large numbers of shared Cluck objects and a previous, less optimized version of Cluck (though that reminds me that I need to finish optimizing reconnections), we had the CPU pegged at 100% for a few moments when the cRIO reconnected to the Poultry Inspector. This didn’t cause any issues, but if multiple other threads had been running at 100% at the same time, it could have noticeably affected the responsiveness of the robot.

We also had somewhere around eight threads running on our robot this season, at least a few of which were waiting for something, and it would have been a mess if those were all busywaiting. Especially if any of them had a higher priority than the others: even with yielding, I don’t think Squawk will ever yield to a lower-priority thread while a higher-priority thread is ready to run! In other words, busywaiting is a really good way to starve important threads of resources.

We also had a finals match in the 2013 season where the drive team had to disable a chunk of functionality on the Phidget Console due to it being broken and using 100% CPU on the Driver Station, which caused the real control functions to lag horribly.

Yes, but in this case it’s not 2%. It’s 100%. This is actually relevant, as full usage can cause noticeable slowdowns. Even if the robot can handle running at 100% and still perform all of its functions, it’s a recipe for disaster if you add much more, as you’ll never know how much CPU is doing real work versus being consumed by busywaiting.

I’ve given plenty of reasons why unneeded 100% CPU usage can be or is a bad idea, and if you provide getting the current time, it won’t be much more work to also provide waiting until a specific time. There’s a reason that no well-built software component uses busywaiting for extended periods of time. (Kernels use very brief spinlocks, but those operate at a fraction of a fraction of the time scale we’re working at here.) Busywaiting is a known anti-pattern and should be avoided.
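For example, a blocking wait-until-a-deadline is only a few lines on top of a get-time primitive, assuming you’re willing to use a monitor. (This is a generic sketch, not any particular framework’s API.)

// Waits until the deadline without spinning: the thread actually blocks
// in Object.wait, so it costs (almost) no CPU while it waits.
class TimingSystem {
	private final Object monitor = new Object();

	public long now() {
		return System.currentTimeMillis(); // or the emulator's fake clock
	}

	public void waitUntil(long deadline) throws InterruptedException {
		synchronized (monitor) {
			long remaining = deadline - now();
			while (remaining > 0) {
				monitor.wait(remaining);
				remaining = deadline - now();
			}
		}
	}
}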

Anyway, we should get back to the topic. This isn’t the right place to discuss busywaiting further.

EDIT: Oops, virtuald ninja’d me with this response:

Yes, exactly. I touched on this above, even. Also, keep in mind that we’re also talking about code that runs on the computer running the emulation, and usually it’s not nice to have your development computer using lots of unneeded CPU.

Certainly, if you’ve got the CPU pegged you have to look at some other metric to measure your performance. We just put everything together in one loop and used something like this:


// Time is assumed to be a floating-point seconds type defined elsewhere.
#include <algorithm>
using Time=double;

class Perf_mon{
	Time start,last,worst_case_;
	unsigned iterations;

	public:
	Perf_mon():start(-1),last(-1),worst_case_(0),iterations(0){}

	// Call once per loop iteration with the current time.
	void update(Time t){
		if(last==-1){
			start=t;
		}else{
			auto elapsed=t-last;
			worst_case_=std::max(worst_case_,elapsed);
		}
		last=t;
		iterations++;
	}

	// Average time per iteration; the small constant avoids division by zero.
	Time average()const{ return (last-start)/(iterations+.0001); }
	Time worst_case()const{ return worst_case_; }
};

And this was kind of overkill since we could just look at the worst case, notice it was much faster than the driver station data updated, and be done.