Use a Markov chain model to explore subsystem reliability.

In this paper, a Markov chain is used to model subsystem reliability and explore the effects of increased robustness and repairability. After analysis, it is clear that improving robot uptime is best accomplished by a combination of increased robustness and repairability.

Of course, I was particularly drawn to the conclusions about the reliability percentages corresponding to time spent repairing. Understanding this relationship is obviously key to establishing design criteria.

Have you started to look at designs and evaluate what constitutes an “83% chance of being repaired within 2.5 minutes”? What sort of design process changes would you suggest to hit these metrics?

Perhaps I’m misreading this, but it seems like your chain is still 5% likely to break every 300 seconds even while you’re waiting in the pit or the queue.

I’d presume there’s some assumption of “failure rate applies to robot in motion” and the failure rate of a robot sitting in storage is comparatively negligible?

Not sure though, I didn’t dig too hard into all the math implications.

In the process of composing a response, I noticed that I made a mistake in the general solution for the steady state. It should have been <a/(a+b), b/(a+b)> and not <b/(a+b), a/(a+b)>. This has been corrected, along with some minor formatting errors. There were about 89 downloads when I swapped the file this morning; sorry for not catching it sooner.
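For anyone who wants to sanity-check the corrected steady state numerically, here’s a minimal sketch. The names p and q are mine (standing in for per-step break and repair probabilities), and the values are placeholders:

```python
# Two-state working/broken chain; p = P(working -> broken) per step,
# q = P(broken -> working) per step. Both values are illustrative.
p, q = 0.05, 0.25

# Closed-form stationary distribution over (working, broken):
# <q/(p+q), p/(p+q)> in this parameterization.
closed = (q / (p + q), p / (p + q))

# Numerical check: start in "working" and iterate the chain.
working, broken = 1.0, 0.0
for _ in range(10_000):
    working, broken = (working * (1 - p) + broken * q,
                       working * p + broken * (1 - q))

print(closed, (working, broken))
```

However you label the components, the iterated distribution should match the closed form to machine precision.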

Glad that you enjoyed it; it was fun putting together that paper. I’d also suggest that understanding the relationship between robot uptime/reliability as a function of repairability and robustness/durability is key. It’s enlightening to run an iteration or two of gradient ascent (or descent, depending on which side you pick to optimize). In general, an extra pound of repairability is not equivalent to an extra pound of durability/robustness.
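To make that last point concrete with a toy example (my own parameterization, not the paper’s notation): for a two-state chain, uptime is U = q/(p+q), where p is the per-step failure probability and q the per-step repair probability. A finite-difference check shows the two marginal gains are generally unequal:

```python
# Illustrative sketch: uptime of a two-state working/broken chain.
# "Robustness" lowers p (failure prob); "repairability" raises q
# (repair prob). Values are made up for the example.
def uptime(p, q):
    return q / (p + q)

p, q = 0.05, 0.10
eps = 1e-6

# Finite-difference sensitivities: marginal uptime gained per unit
# of added robustness (lowering p) vs added repairability (raising q).
dU_drobust = (uptime(p - eps, q) - uptime(p, q)) / eps
dU_drepair = (uptime(p, q + eps) - uptime(p, q)) / eps

print(dU_drobust, dU_drepair)  # the two marginal gains differ
```

With these made-up numbers the robustness gradient is roughly twice the repairability gradient; that ratio shifts as p and q change, which is exactly why a pound of one is not a pound of the other.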

Have you started to look at designs and evaluate what constitutes an “83% chance of being repaired within 2.5 minutes”? What sort of design process changes would you suggest to hit these metrics?

That’s a very good question. The short answer is that I haven’t looked at designs, and I’m unlikely to do so. That’s primarily a function of my background (math/computer science, not manufacturing or mechanical/electrical engineering), and my recognition that many, many variables are involved in those numbers. I don’t believe there’s a closed form solution that is going to tell me how fast something is to repair or how often it’s going to break before I actually build it. And even if I had enough data to do a confidence interval on the robustness or repairability of different designs this year, it’s unlikely that the game next year will result in designs for similar mechanisms and subsystems. I’d be more interested in studying things that don’t change year to year, like the impact of prototyping, manufacturing techniques, tolerances, and preventative maintenance. Maybe I’m biased though, since that’s basically a list of the things that 6844 got wrong this year.

As far as a design process goes, I’d start by making sure you’re being strategic and thoughtful. Ask yourself and your students what the failure modes are. Identify the “mission impact” of each failure - “if this breaks, what does that do in a match? If this breaks, what does that do to our chances of being picked?” Estimate how frequently you’ll encounter the failure mode. Estimate how quick the failure mode will be to repair. Multiply both estimates by a safety/fudge factor. Come up with a preventative maintenance plan, and practice repairs. Consider whether your repairs will have any impact on software. Build and bag spare assemblies when possible.
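The estimation steps above can be sketched as a back-of-envelope budget. Every failure mode and number below is a hypothetical placeholder, not data from the paper:

```python
# Hypothetical failure-mode budget: expected repair minutes per event
# = est. occurrences * est. repair minutes * safety/fudge factor.
failure_modes = [
    # (name, est. occurrences per event, est. repair minutes, fudge factor)
    ("chain derails",        2.0, 10.0, 1.5),
    ("sensor cable unplugs", 1.0,  5.0, 2.0),
]

for name, freq, minutes, fudge in failure_modes:
    expected = freq * minutes * fudge
    print(f"{name}: ~{expected:.0f} expected repair minutes per event")
```

Summing the column gives a crude total downtime budget to weigh against your pit schedule.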

I also suggest that subsystem designers pick a “favorite” failure mode. For example, wheels slipping on a drivetrain might be considered a failure, as it results in a loss of propulsion - but it also means that you’re not tripping your main breaker. Pick a failure mode, optimize the other failure modes into statistical improbabilities, and then optimize that failure mode for repairability.

You’re almost right - I can see how that could be confusing, based on how I phrased things in the paper. If I update the paper again, I might change this.

On average, there’s a 5% chance that it will break in a 2.5 minute window. That average accounts for time spent in matches, time on the practice field, time tuning auto, and the rest. Not going to lie - it’s a pretty sad drivetrain.

Analytically, it changes significantly. See spoiler for details.

the mathematical details
[spoiler]I assume that you’re saying that you’d always fix after debugging, so we’d end up with a matrix like this (do me a favor and imagine square brackets around this):

 w    0    1-f
1-w   d     0
 0   1-d    f

Then, the steady state probability for being in the working state (mouthful, I know - but I don’t know how to phrase that any better) is:

1 / (1 + (1-w)/(1-d) + (1-w)/(1-f))

If you focus on the transitions between states, then you get a form that is, in my opinion, a bit nicer:

a = 1 - w
b = 1 - d
c = 1 - f
x = 1 / (a / a + a / b + a / c)

If you want, I can go through the derivation, but it would need to be typeset in LaTeX; it would be pretty unreadable here.

[/spoiler]
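Sticking with the example numbers (2.5-minute steps, 5 minutes of average debugging, 25 minutes of average fixing - so my reading is w = 0.95, d = 0.5, f = 0.9), here’s a quick sketch that checks the closed form in the spoiler against simply iterating the chain:

```python
# Three-state working/debugging/fixing chain from the spoiler above.
# w, d, f are the self-loop probabilities; these values encode
# 2.5-minute steps, 5 min average debugging, 25 min average fixing
# (my reading of the example - adjust to taste).
w, d, f = 0.95, 0.5, 0.9
a, b, c = 1 - w, 1 - d, 1 - f

# Closed-form steady-state probability of being in the working state.
x = 1 / (a / a + a / b + a / c)

# Numerical check: start in "working" and apply the transition rules
# (columns of the matrix are the "from" states).
W, D, F = 1.0, 0.0, 0.0
for _ in range(100_000):
    W, D, F = (W * w + F * (1 - f),
               W * (1 - w) + D * d,
               D * (1 - d) + F * f)

print(x, W)  # both come out to about 0.625
```

The iterated value converges to the closed form, which lines up with the .625 figure quoted for this example.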

Interestingly, this hardly impacts the example I used.

If you put in that debugging takes, on average, 5 minutes and that fixing it takes, on average, 25 minutes (notice that when summed together, this is the same average of 30 minutes), we have a .625 for the three-state/debug model and .624 for the two-state model. This may be a mathematical coincidence rather than anything really meaningful, however.