Vectorization vs. Loop Implementation

In most modern control theory, I’ve seen heavy usage of linear algebra. When implementing these methods on a robot, would it be more efficient to use a linear algebra library and vectorize the implementation, or to use a more traditional style of programming? Take the Ramsete controller as an example: is it better to implement the controller using matrices, or to work with the individual elements as standard scalar variables?

1 Like

Disclaimer: I haven’t done linear algebra on a roboRIO recently, but I think I can provide the start of an answer.

Technically speaking, it depends on how you want to define “efficient.”

I’d guess that memory and execution time gains would be minimal, if present at all… though I’d be happy to hear if anyone has experience to the opposite.

At a minimum, I’d want the ability to tell the library which matrices will be constant values, and which will have fixed size, to prevent any additional overhead in the library from being general-purpose where specificity is desired.

The big advantage of using a good library would be to let you hide the nitty-gritty of matrix operations behind function calls in an elegant way, so the source code better expresses the intent of the functionality.

If I do find myself doing lots of matrix operations in the near future, I’d probably either use a library or make my own tidy little class to accomplish this clarity goal.
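
Something like this minimal sketch (class and method names are made up) is the kind of thing I mean; the call sites read as math instead of nested index loops:

```java
// Minimal fixed-size 2x2 matrix class (hypothetical; just to illustrate the
// clarity goal). Call sites read as a.times(b) instead of index loops.
public final class Mat2 {
  private final double a, b, c, d; // row-major entries

  public Mat2(double a, double b, double c, double d) {
    this.a = a; this.b = b; this.c = c; this.d = d;
  }

  public Mat2 times(Mat2 o) {
    return new Mat2(
        a * o.a + b * o.c, a * o.b + b * o.d,
        c * o.a + d * o.c, c * o.b + d * o.d);
  }

  public Mat2 plus(Mat2 o) {
    return new Mat2(a + o.a, b + o.b, c + o.c, d + o.d);
  }
}
```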

FWIW, industry tools like MathWorks’ MATLAB/Simulink suite provide a simulation environment and language where you have lots of flexibility to design exactly the matrix math setup you want. Then, once you’re happy, it generates fixed output source code that you incorporate into a bigger codebase. Best of both worlds: flexibility in design, then runtime execution/memory efficiency.

2 Likes
  1. Use someone else’s linalg library. Let them write the SIMD instructions for you. The upcoming modern controls addition to WPILib takes this approach.
  2. The shortest (and likely most efficient) implementation of RAMSETE does not touch any matrix math, so it’s not a good example for this kind of question (see the sketch below).
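
To illustrate point 2, a minimal sketch of the usual scalar RAMSETE update (variable names are mine; b and zeta are the tuning gains, and the pose error is expressed in the robot’s reference frame):

```java
// Scalar RAMSETE update: no matrices anywhere.
public final class Ramsete {
  private final double b;    // aggressiveness gain, > 0
  private final double zeta; // damping gain, in (0, 1)

  public Ramsete(double b, double zeta) {
    this.b = b;
    this.zeta = zeta;
  }

  // sin(x)/x, with the x -> 0 limit handled via a Taylor expansion
  private static double sinc(double x) {
    return Math.abs(x) < 1e-9 ? 1.0 - x * x / 6.0 : Math.sin(x) / x;
  }

  // ex/ey/etheta: pose error in the robot frame; vd/wd: reference
  // linear/angular velocity. Returns {v, omega} velocity commands.
  public double[] calculate(double ex, double ey, double etheta,
                            double vd, double wd) {
    double k = 2.0 * zeta * Math.sqrt(wd * wd + b * vd * vd);
    double v = vd * Math.cos(etheta) + k * ex;
    double w = wd + k * etheta + b * vd * sinc(etheta) * ey;
    return new double[] {v, w};
  }
}
```
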
3 Likes

My understanding is it depends a lot on the language you’re programming in. For example, MATLAB and Python are relatively slow because of their dynamic type systems and interpreted nature, so by using matrix operations you’re calling out to C/C++ implementations that run quickly. On the other hand, if you’re programming in lower-level languages or something like Julia, loop implementations can be very fast.

1 Like

Modern compilers are good at vectorizing code in general, but it looks like Eigen (a C++ matrix library) is a bit better than a hand-written naive matrix multiplication.

Our team programs in Java. Do you know of any good Java linear algebra libraries to use?

I probably should have defined efficiency better. I was mostly speaking of runtime and memory usage, but clean function calls and linear algebra operations are definitely a huge benefit if I choose a linear algebra library. Do you know of any decent libraries for Java?
For designing controllers in MATLAB, do you just export the gains from the controllers you design there to the robot program? I’m not the biggest fan of that method, but I may just be misinterpreting “output source code”.

WPILib uses EJML with a custom wrapper that enforces matrix sizes at compile time. I really enjoy using this wrapper (and I used it extensively while contributing to the WPILib modern controls addition). The wrapper and EJML come with WPILib anyway, so they’re also quite convenient. Here’s a great example of what code written with the wrapper looks like.
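
As a rough sketch of the flavor (I’m assuming the post-merge WPILib package layout under edu.wpi.first.math; at the time of this thread the classes lived in the state-space-v2 branch, so exact paths may differ):

```java
import edu.wpi.first.math.Matrix;
import edu.wpi.first.math.Nat;
import edu.wpi.first.math.VecBuilder;
import edu.wpi.first.math.numbers.N1;
import edu.wpi.first.math.numbers.N2;

public class WrapperExample {
  public static void main(String[] args) {
    // Matrix dimensions are type parameters, so a size mismatch is a
    // compile-time error instead of a runtime exception from EJML.
    Matrix<N2, N2> A = Matrix.mat(Nat.N2(), Nat.N2()).fill(
        0.0, 1.0,
        0.0, -2.0);
    Matrix<N2, N1> x = VecBuilder.fill(1.0, 0.5);

    // times() only compiles when the inner dimensions agree.
    Matrix<N2, N1> xDot = A.times(x);
    System.out.println(xDot.get(0, 0) + ", " + xDot.get(1, 0));
  }
}
```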

2 Likes

The modern controls library being added for 2021 supports synthesizing feedforward and feedback controllers from state-space models. It also has factory functions in LinearSystemId for creating state-space models from either frc-characterization constants or physical constants like mass, the type and number of DC motors, etc.
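
For example (a sketch assuming the merged WPILib API; the constants here are made up):

```java
import edu.wpi.first.math.system.plant.DCMotor;
import edu.wpi.first.math.system.plant.LinearSystemId;

public class PlantExamples {
  public static void main(String[] args) {
    // From physical constants: a two-NEO elevator with an 8 kg carriage,
    // 0.0254 m drum radius, and 20:1 reduction. Returns a LinearSystem
    // holding the A, B, C, D matrices.
    var elevator =
        LinearSystemId.createElevatorSystem(DCMotor.getNEO(2), 8.0, 0.0254, 20.0);

    // From frc-characterization constants: kV in V/(m/s), kA in V/(m/s^2).
    var fromCharacterization = LinearSystemId.identifyPositionSystem(2.5, 0.4);
  }
}
```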

Version 1 of the state-space PR generated models and controller gains in Python and emitted C++/Java files containing the matrices, but version 2 (this PR) keeps everything in C++/Java, which is much more convenient.

1 Like

For the Eigen case, it looks like it’s computing the answer at compile time? I don’t see any actual multiply instructions in the generated assembly, just some fiddling with constants and loading some of the pre-calculated results from .L8. That’s only going to work if all the matrix values are compile-time constants, which I’m not sure is true for real robot code.

Change the matrix values to be read from the command line (e.g., = strtod(argv[i], nullptr)) and the code changes. I can’t tell by looking which is more efficient. Or whether any of it matters: without measuring it to be a hot spot in the code, it’s just a guessing game whether there’s any practical difference between the two.

Also, per the OP’s mention of vectorizing, the Cortex-A9 on the roboRIO is ARMv7-A, which means code needs to use floats; the hardware doesn’t support vectorizing double-precision floating-point ops.

EJML was independently the top choice I found by searching, so there’s two votes for that library.

One thing to consider: it’s not the easiest endeavor, but you can load native C++ libraries (.so) from Java and call them. There’s some overhead, but it could be worth it if Eigen (or similar) fits the bill. It looks like someone may have already done it, but not having actually used it, I’m not sure whether I should recommend it.
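
The Java side of such a binding is pretty small. A sketch (library and method names here are hypothetical):

```java
// Java side of a JNI binding to a native linear algebra routine.
public final class NativeLinalg {
  static {
    // Loads libnativelinalg.so from java.library.path at class-load time.
    System.loadLibrary("nativelinalg");
  }

  // Implemented in C++ (e.g., forwarding to Eigen); the JNI header for
  // this signature is generated with `javac -h`.
  public static native double[] multiply(
      double[] a, double[] b, int aRows, int aCols, int bCols);
}
```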

This one I can speak to a bit better.

So, any “codegen” environment worth its salt will support a mechanism for making certain constant values tunable at runtime. Simulink does expose this.

The most common usage I’d see is for traditional PID controllers or debounce/threshold values: it’s possible to tweak them at runtime. In my day job, “tunable” just means “stored in some writeable memory address,” and then other tools are used to manipulate it. Similar things could be done for FRC, but it might require some extra support (i.e., make “tunable” mean “stored in NetworkTables”). All this is to say that, even though there’s no specific FRC support package for Simulink (yet), the concept you’re referring to definitely does exist.
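
A minimal sketch of the “tunable means stored in NetworkTables” idea, using SmartDashboard as the NetworkTables frontend (the class itself is made up):

```java
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

// A gain backed by a NetworkTables entry, so dashboards can rewrite it
// at runtime without a redeploy.
public class TunableGain {
  private final String key;
  private final double defaultValue;

  public TunableGain(String key, double defaultValue) {
    this.key = key;
    this.defaultValue = defaultValue;
    SmartDashboard.putNumber(key, defaultValue); // publish the default once
  }

  public double get() {
    // Read the latest value each loop iteration.
    return SmartDashboard.getNumber(key, defaultValue);
  }
}
```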

Where it gets a bit sticky is if you’re relying on the guts of the development environment to do some set of matrix transformations for you. In this case, you wouldn’t be directly adjusting your tune values on the robot, you’d probably have to have some form of an offboard tool to do the matrix computations, and deploy the results to a more abstract set of tune parameters on the robot.

This is more of a specific example of a workflow and an assumption of where some tools draw their boundaries. Really, I think the requirement for any workflow is more high-level:

  1. All exposed “knobs” must be human-understandable
  2. Every computation that can be done once and saved should only be done once (and saved).
    a) Offboard tool, onboard logic, probably doesn’t matter. The key is it’s done once.
  3. The remaining minimum computation that is done periodically should be fast.
    a) This is where using a library that exposes cpu-instruction-level vector optimizations is good.

I think I’m running down an “unnecessary depth-first search” that’s probably diverging from what you were initially looking for.

1 Like

Yeah, I messed that up. I should have written it as a function that takes the matrices as arguments and returns a result.

Actually, it can make a huge difference. See this snippet instead (that actually has multiply instructions in it this time 🙂): Compiler Explorer

The hand-written multiply is much shorter, but that’s because it’s using branch instructions (loops, essentially). The Eigen version fully vectorizes the computation and unrolls the loops.

It actually does. That’s what vmul.f64 and friends do in the updated Compiler Explorer link above. Granted, it won’t be as efficient as if it had a 64-bit FPU. Here’s some documentation on NEON instructions (NEON is a collection of SIMD instructions for ARM): Documentation – Arm Developer

We looked at jeigen before EJML, actually. The numbers they published for overhead were around 10x, which was a bit concerning. For state-space-v2, we used EJML for the majority of the linalg, but wrapped some Eigen routines for the matrix exponential and a DARE solver because we didn’t want to bother rewriting those in EJML. We also made a JNI binding for an IsStabilizable() function because the implementation for that uses complex matrices, and EJML doesn’t really support that. We could have doubled the matrix sizes and used real matrices instead, but we didn’t feel like it.

JNI isn’t a well-designed FFI for developers. I don’t recommend developing for it yourself if you can avoid it.

For this use case, you can easily just do all the computation at startup on the roboRIO and simplify your workflow and tooling. Some of the most computationally expensive control operations you can do, like system discretization or solving the DARE, take under 1 ms on the roboRIO (I’m ignoring trajectory optimization). We use a patched version of the DARE solver from https://drake.mit.edu. Less than 1 ms is fast enough to redo the observer and controller synthesis at every timestep. 3512 does that for our drivetrain controller because the model is nonlinear. Our tuning parameters are the Q and R matrices from LQR instead of the controller gains.
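
A sketch of what that startup synthesis looks like with the WPILib classes that came out of this work (post-merge package paths; the constants are made up):

```java
import edu.wpi.first.math.VecBuilder;
import edu.wpi.first.math.controller.LinearQuadraticRegulator;
import edu.wpi.first.math.numbers.N1;
import edu.wpi.first.math.numbers.N2;
import edu.wpi.first.math.system.LinearSystem;
import edu.wpi.first.math.system.plant.LinearSystemId;

public class StartupSynthesis {
  public static void main(String[] args) {
    // Model from characterization constants (made-up values).
    LinearSystem<N2, N1, N1> plant = LinearSystemId.identifyPositionSystem(2.5, 0.4);

    // The tuning knobs are the state/control cost tolerances (Bryson's
    // rule) rather than the gain matrix itself; the DARE is solved inside.
    LinearQuadraticRegulator<N2, N1, N1> lqr =
        new LinearQuadraticRegulator<>(plant,
            VecBuilder.fill(0.02, 0.4), // allowed state errors: 2 cm, 0.4 m/s
            VecBuilder.fill(12.0),      // allowed control effort: 12 V
            0.020);                     // controller period in seconds

    System.out.println(lqr.getK()); // the synthesized feedback gain
  }
}
```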

We used the “generate constants” workflow in 2019, but the JIT controller synthesis has been much more convenient with regard to both build system integration and developer ergonomics. Granted, this assumes you can find implementations of the control theory you need or reimplement it yourself. MATLAB supports a lot more than just lqe() and lqr().

We do our initial tuning in simulation. GradleRIO targets x86 as well, so we’ve been using unit tests to simulate the models, write CSVs of the responses, and visualize them with matplotlib (Python). We don’t need to worry about tuning values much at runtime because our models are fairly accurate and roboRIO deploys don’t take long.
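
A sketch of that workflow (made-up constants, and again assuming the post-merge WPILib API):

```java
import java.io.FileWriter;
import java.io.IOException;

import edu.wpi.first.math.Matrix;
import edu.wpi.first.math.VecBuilder;
import edu.wpi.first.math.numbers.N1;
import edu.wpi.first.math.system.LinearSystem;
import edu.wpi.first.math.system.plant.LinearSystemId;

public class FlywheelResponseTest {
  public static void main(String[] args) throws IOException {
    // Flywheel velocity model from made-up kV/kA constants.
    LinearSystem<N1, N1, N1> flywheel =
        LinearSystemId.identifyVelocitySystem(0.02, 0.002);

    Matrix<N1, N1> x = VecBuilder.fill(0.0);  // state: angular velocity
    Matrix<N1, N1> u = VecBuilder.fill(12.0); // input: step to full voltage

    try (FileWriter csv = new FileWriter("response.csv")) {
      csv.write("t,velocity\n");
      for (int i = 0; i < 250; i++) {
        x = flywheel.calculateX(x, u, 0.020); // advance the model one 20 ms step
        csv.write((i * 0.020) + "," + x.get(0, 0) + "\n");
      }
    }
    // Plot response.csv offline, e.g. with matplotlib.
  }
}
```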

We start with models based on motor datasheets and sum of forces/torques (allwpilib/LinearSystemId.h at state-space-v2 · mcm001/allwpilib · GitHub), then use linear and/or angular Kv/Ka from frc-characterization to synthesize more accurate models once the robot is done (see the “Identify” functions in the previous link).

1 Like

Just to be clear on terminology: the hardware supports double precision. However, NEON’s “advanced SIMD” does not support double precision (see Table 2-1 in ARM’s Cortex-A9 NEON/VFP manual). What this means is there are SIMD instructions for single precision, but only scalar instructions for double precision. The compiler also requires extra options to auto-vectorize.

Your testing right now is using scalar VFPv3 instructions regardless of whether it’s doubles or floats. If you compile the left hand code with both floats and -ffast-math -mfpu=neon-vfpv3 -ftree-vectorize you’ll see it will auto-vectorize and actually use the NEON SIMD instructions. Interestingly, the compiler does not auto-vectorize the Eigen code even if you use float and the same compiler options. However, you would need to benchmark to see which is actually faster to execute.

If you compile with -O3, it doesn’t use NEON instructions and it unrolls the loop with scalar instructions instead. So this may simply be a test case which doesn’t really show the power of SIMD.

2 Likes

Yep. As is, this looks like a “what’s your L1-D and FP forwarding microarchitecture” benchmark.

Not clear if this was the question, but -ffast-math (or -Ofast) is needed because ARM FP vectorization isn’t strictly IEEE 754-compliant.

Also, I’d recommend -fvect-cost-model for any attempts to tree-vectorize on ARM.

But in any case, still weird that the Eigen code is handled differently. And that the compiler unrolls it using -O2 - I thought that was enabled at -O3 and higher?

Maybe the Eigen code is overriding compiler options, both explicitly enabling unrolling and changing options needed for vectorize to work?

Sorry for the possible continued derail, but the OP did ask about vectorization.

Sorry, should have been more clear in my response.

For the naive loop implementation (with float):
-O2: no unroll, does not use NEON
-O3: unrolls, does not use NEON
-O2 -ffast-math -ftree-vectorize -mfpu=neon: no unroll, uses NEON
-O3 -ffast-math -ftree-vectorize -mfpu=neon: unrolls, uses NEON (correction from what I said above)

For the Eigen code (with float):
Same results for all of the above (unrolled loop that doesn’t use NEON)

It appears that Eigen embeds all kinds of optimizations, including SIMD optimizations, but the matrices used in this test case (a 3×2 times a 2×4) are too small for Eigen to enable SIMD. Doing a larger multiply (e.g., a 6×4 times a 4×8) with Eigen with -mfpu=neon results in NEON-using generated code.

To quote from the Eigen manual:

SSE, AltiVec, NEON and S390x work with packets of 128 bits, or 16 bytes. This means 4 ints, or 4 floats, or 2 doubles, except for S390x which doesn’t support 32-bit floats in hardware. Even though it is usually preferable to work on packets that are 128-bit aligned, starting from Eigen 3.3, Eigen will also vectorize unaligned vectors and matrices as a last resort. So only tiny vectors or matrices that are smaller than a packet won’t be vectorized.

1 Like

Sweet, excellent set of info!

I assume you’d still want to hook teleop init, or make sure you’ve got an easy button to restart robot code, if you’re iterating over many values. Even though the compile-deploy time is fairly small, IMO I’d still prefer workflows that don’t require it.

The fact that it can be done in single-digit milliseconds is good to know; I had no concept of the scale. My original line of thought had presumed there was some “offboard” infrastructure already in place, and far more limits on the embedded controller than what we have in FRC. Especially since the RIO can do it, it’s better to do it there rather than add an extra tool to the chain.

We use a patched version of the DARE solver from https://drake.mit.edu. Less than 1ms is fast enough to redo the observer and controller synthesis at every timestep.

Will the DARE solver implemented in WPILib be available for anyone to use? I was having trouble understanding how to implement it myself due to my lack of understanding of what DARE actually is (most resources I’ve seen with LQR kind of just gloss over the DARE and demonstrate its usage rather than its derivation), as it’s seemingly beyond the scope of my current capabilities. I’d prefer to design my controller strictly in our robot project without needing to export gains from programs such as MATLAB.

Yes, the DARE implementation will be available along with an LQR implementation (and observers). You should never have to touch (or even think about) MATLAB.

You can take a look at what we’ve been working on here: GitHub - mcm001/allwpilib at state-space-v2

1 Like
