CPU Load in FRC RC

It has been years - no, decades - since my last programming class. (We used punchcards, if that’s a clue…).

My question has to do with the relative load on the CPU for the various operations a program can perform. I was always under the impression that an IF statement took a lot more CPU time than a series of 3 or 4 simple additions - but a thread here made me re-think that belief.

So, can anyone explain to me the load on the CPU of the various operations, particularly Math (integer and quasi-floating [bit shifts?], if there’s any difference), Logical comparisons (<=), Conditionals, and I/O statements (and what else did I miss?).

Also, my assumption is that if I do a certain calculation, it takes as long to do it in a ‘subroutine’ (or whatever we call them today - a called function?) as it would in the main programming loop. (Ignore proper programming practice; I’m only worried about CPU load.)

Not looking for exact numbers, a relative comparison would be fine. Even better would be a reference where I could just read on it myself…

Thanks,
Don

Sure, every operation takes a certain number of ticks to complete; however, none of the operations you have mentioned would come anywhere close to requiring more than the CPU can handle.

I have no references, but each operation compiles to a certain amount of assembly, which requires a certain number of CPU ticks to complete.

Is there any specific reason why you are asking for this information? I personally do not see a reason to worry about how many ticks a given operation takes to complete.

One thing you mentioned was whether a calculation would take longer in a subroutine (a function, these days) than in the main loop; both take the same time to do the actual work. Calling a function does take a few more ticks than having all your code in the main loop, but it is so VERY little that it is negligible, and for the sake of being able to read your own code it is better to keep it in a function.
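Just to illustrate (the names here are made up, not from the default code), both versions below cost the same for the arithmetic itself; the second only adds the few extra ticks for the call, return, and argument copy:

/* The actual work: scale a raw value into a new range. */
unsigned int scale_value(unsigned char raw)
{
    return (unsigned int) raw * 2u + 10u;
}

void main_loop(void)                     /* stand-in for wherever your loop code lives */
{
    unsigned char raw = 127;
    unsigned int a = (unsigned int) raw * 2u + 10u;   /* math done in-line in the loop */
    unsigned int b = scale_value(raw);                /* same math, plus a small call overhead */

    (void) a;
    (void) b;                            /* reference the results so they aren't flagged as unused */
}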

Alright, at the very least, I can state that IF statements won’t cost much processing time. The PIC doesn’t have a deep pipeline, so you lose very little time on a branch; it’s not particularly costly. Secondly, doing a calculation in a function/subroutine incurs a fairly significant processing cost versus not doing so. This is mostly because when you call a function, the program has to set up a stack frame, saving the current processor state and possibly passing arguments, so the instructions in the function can operate in a clean environment. It’s not a huge hit, something on the order of 10-20 instructions depending on circumstances, but it does take time.

Edit: X-Istence, computation time can be very, very important if you’re coding an interrupt service routine or other bits of code where you want things to happen very quickly. Calling 20 different functions in an interrupt just to keep your code clean isn’t a good idea at all. If you want to use functions to keep code clean, I think you can make them inline functions. An inline function basically tells the compiler to take the function body and paste it in wherever the function is called. I’m not positive the C18 compiler supports them out of the box, but I’ll look into it.
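Roughly what I mean, in case it helps (whether C18 honors the inline keyword is exactly what I’d have to check; the macro form is the old-school fallback that always works, and the names are just made up):

/* Ordinary function: every call pays the call/return and argument-copy cost. */
unsigned char apply_deadband(unsigned char pwm)
{
    return (pwm > 117 && pwm < 137) ? 127 : pwm;   /* snap near-neutral values to neutral */
}

/* Inline version: asks the compiler to paste the body at each call site,
   so there is no call overhead - if the compiler supports it. */
static inline unsigned char apply_deadband_inline(unsigned char pwm)
{
    return (pwm > 117 && pwm < 137) ? 127 : pwm;
}

/* Preprocessor macro: the text is always pasted in, no call overhead,
   but note the argument gets evaluated more than once. */
#define APPLY_DEADBAND(p)  (((p) > 117 && (p) < 137) ? 127 : (p))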

You can look at the listing file to see exactly how many assembly instructions different C operations take.

The PIC processor is designed so that nearly every instruction executes in one cycle. Branch instructions are an exception; they take two. But the architecture of the PIC CPU is unusual in that read/write data and program code are not in the same memory space, and the ALU has some odd restrictions on where results end up. This means that what looks like a simple operation in C might end up being a simple operation on the PIC, or it might end up being a page of assembly language to implement. Floating point arithmetic is particularly costly, as the PIC ALU doesn’t support floating point in hardware.

If you’re interested in the relative cost in program space (which translates almost directly to execution time) for various operations, I suggest you try compiling a program which uses those operations and then inspect the listing file to see how the compiler translates them into PIC assembly language.
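For instance, a throwaway test file like this one (my own, nothing special about the values) should make the relative costs obvious in the listing: I’d expect the 8-bit add to compile to a handful of in-line instructions, the 16-bit multiply to become a call into a library routine, and the float divide to become a much larger one.

unsigned char a = 3, b = 5;
int   c = 1234, d = 56;
float x = 1.5, y = 2.5;

void main(void)
{
    unsigned char s;
    int   p;
    float q;

    s = a + b;      /* simple 8-bit add: done in-line, a few instructions */
    p = c * d;      /* 16-bit multiply: compiled as a call to a math routine */
    q = x / y;      /* float divide: a far larger library call */

    a = s;          /* assign the results so the optimizer can't discard them */
    c = p;
    x = q;
}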

Don,

As Alan mentioned, branching instructions by themselves only take two instruction clock cycles. (An instruction clock is 4 oscillator clock cycles, so the 40 MHz PIC processor in reality only has a 10 MHz instruction clock. I will be talking strictly in instruction clock cycles in this note.) It’s the instructions that have to be executed within the IF statement that take most of the time, not the actual branching itself.

In general, addition and subtraction operations all occur in-line; that is, the actual assembly code is generated separately for each add/sub operation in the routine where the operation is found, and it is quick. For more complex operations (multiplication, division, trig, all floating point operations), the operands are passed to a math subroutine (function) that performs the actual operation and returns the result to the calling routine. The source code for the math routines can be found in the mcc18 directory if you selected the appropriate option during the original install of the software.

Some of these routines actually document their min/max/mean clock cycles to execute. Based on a couple of observations, these times do not include the time to copy your arguments into the math variables (minimum 2 clock cycles per byte) or to copy the result back into another variable. They also don’t include the necessary CALL/RETURN (4 clock cycles). Here are some samples - the math operation only, not the call/return or the passing of the arguments:

unsigned char * unsigned char = 6 clock cycles
signed short * signed short = 35 clock cycles
signed short / signed short = 85 clock cycles average (min 28, max 149)
signed long / signed short = 376 clock cycles average (min 84, max 421)
any floating point multiply/divide = 1835 clock cycles average
any floating point addition/subtraction = 80 clock cycles average

It should be noted that the floating point information came from the C18 compiler version 2.2 and may not be the same in version 2.4 (I believe they changed their floating point storage format between these two versions). Trig routines generally will have multiple floating point operations. I assume around 6 multiply/divide for lack of a better number (from a quick scan of the source – I may be way off base). Based on that assumption, a SINE call could consume upwards of 11,000 clock cycles (0.11% cpu). Doing one of these operations in the main loop of your code (approximately 38 times per second) would result in over 400,000 clock cycles per second (>4% cpu) - very costly.
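If you only need modest precision, one way around that cost is to pre-compute the sine values and look them up as integers instead of calling the floating point routines at all. A rough sketch (the 10-degree step and the scale factor of 127 are just choices I made for illustration):

/* sin(x) scaled by 127, for x = 0, 10, 20, ... 90 degrees */
static const unsigned char sine_table[10] =
{
    0, 22, 43, 64, 82, 97, 110, 119, 125, 127
};

/* Returns sin(deg) * 127 for 0 <= deg <= 90, truncated to 10-degree steps.
   One small divide and a table read instead of thousands of clock cycles
   of floating point math. */
unsigned char fast_sine(unsigned char deg)
{
    return sine_table[deg / 10];
}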

I’m not sure if this answered your question or not, but hopefully it helped.

Mike

Thought I’d add some pictures from the PIC16 architecture guide. It describes the two-stage pipeline used within the processor, with each instruction cycle consisting of 4 clock cycles they name Q1…Q4. You can see in the diagram that the instruction fetched after a branch/call has to be flushed, since it was fetched from the wrong PC.

The clock input (from OSC1) is internally divided by four to generate four non-overlapping quadrature clocks, namely Q1, Q2, Q3, and Q4. Internally, the program counter (PC) is incremented every Q1, and the instruction is fetched from the program memory and latched into the instruction register in Q4. The instruction is decoded and executed during the following Q1 through Q4.

An “Instruction Cycle” consists of four Q cycles (Q1, Q2, Q3, and Q4). Fetch takes one instruction cycle while decode and execute takes another instruction cycle. However, due to Pipelining, each instruction effectively executes in one cycle. If an instruction causes the program counter to change (e.g. GOTO ) then an extra cycle is required to complete the instruction.

The instruction fetch begins with the program counter incrementing in Q1. In the execution cycle, the fetched instruction is latched into the “Instruction Register (IR)” in cycle Q1.

This instruction is then decoded and executed during the Q2, Q3, and Q4 cycles. Data memory is read during Q2 (operand read) and written during Q4 (destination write). The diagram shows the operation of the two stage pipeline for the instruction sequence shown.

At time TCY0, the first instruction is fetched from program memory. During TCY1, the first instruction executes while the second instruction is fetched. During TCY2, the second instruction executes while the third instruction is fetched. During TCY3, the fourth instruction is fetched while the third instruction (CALL SUB_1) is executed. When the third instruction completes execution, the CPU forces the address of instruction four onto the Stack and then changes the Program Counter (PC) to the address of SUB_1. This means that the instruction that was fetched during TCY3 needs to be “flushed” from the pipeline. During TCY4, instruction four is flushed (executed as a NOP) and the instruction at address SUB_1 is fetched. Finally during TCY5, instruction five is executed and the instruction at address SUB_1+1 is fetched.

So: the PC is incremented and latched in Q1, and the next instruction fetch is initiated during Q1. The instruction returned from program memory is then latched in Q4 for execution during the next instruction cycle. If the current instruction changes the PC by writing new data in Q3, as a branch or call does, then that new PC won’t be available for use until the following instruction cycle’s Q1 period, but the next instruction fetch is already under way… so the next instruction is flushed by executing it as a no-operation.

Pipeline.JPG
Clocks.JPG

OK, thanks for the details. Different from what I was thinking, and so very helpful.

The reason for the question is to help my programmers generate more efficient code. As everyone knows, there’s more than one way to skin a cat in software, and some careful choices can bring us faster execution. This isn’t a really significant concern - not yet, at least - because we’re nowhere near the limit of code space or CPU cycles in the loop. But, with some of the things being planned, we may come close to one, the other, or both - and I want to be prepared when we do.

Part of this whole exercise is to let the kids understand what ‘efficient code’ means. Again, coming out of the punch card era, where a 1 MHz processor and 100 MB of online disk were very large mainframe characteristics, today’s bloated software, while nifty, is misleading. I showed a kid a copy of TinyEd, a word processor for DOS whose .EXE is something like 6k. He didn’t believe it: how could a word processor be only 6k? Compare that to MS Word.

Thanks again, I have enough information and references to move forward on my own.

Don