Quote:
Compilers, not programmers, should do optimizations
Correct. Too bad we don't have a decent optimizing compiler available.
But is it better to have code that reads:
Code:
for (samples_received = 0; samples_received < 10; samples_received++)
{
    :
Or code like this?
Code:
samples_left_to_receive = 10;
do {
    :
} while (--samples_left_to_receive != 0);
Both do the same thing, and even the best optimizing compiler won't turn one into the other, yet the second is much more efficient: the decrement of the control variable also sets the status flags for the loop test, avoiding a separate comparison operation. On the PIC18F, the second form is more than 3x as efficient, purely because the programmer chose a different but equivalent construct.
Does this really matter? In most cases, no. But if the code happens to execute at interrupt level, or runs thousands of times per second, then generally yes, it does matter. A few changes like this can usually cut the interrupt code footprint to 1/2 or 1/3 of what it was. Typically I scrutinize only about 10% of the code I write for efficiency, yet the run-time impact can be huge.
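To make the pattern concrete, here is a minimal sketch of the count-down idiom doing real work in time-critical code; read_rx_byte() and the other names are hypothetical stand-ins, not code from a real project:
Code:
/* Sketch only.  read_rx_byte() is a hypothetical h/w access; hook
   rx_burst() into your compiler's interrupt mechanism as needed. */
unsigned char rx_buf[10];

extern unsigned char read_rx_byte(void);    /* hypothetical */

void rx_burst(void)
{
    unsigned char *p = rx_buf;
    unsigned char samples_left_to_receive = 10;
    do {
        *p++ = read_rx_byte();  /* pointer walk: no per-pass indexed
                                   RAM address computation */
    } while (--samples_left_to_receive != 0);  /* decrement doubles
                                                  as the loop test */
}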
Knowing the underlying architecture of the processor you're working on can also affect which code constructs you choose. Yeah, it would be great if the compiler figured this stuff out for you... but often it doesn't, or, as above, can't. On the PIC, for example, computing RAM addresses isn't exactly efficient. In some cases it makes sense to unroll "for(...)" loops. For example:
Code:
for (ndx = 0; ndx < 10; ndx++)
{
    sample[ndx] = 0;
}
is terribly inefficient. It takes 14-16 instructions to calculate the RAM address, plus the overhead of ~13 instructions to manage the loop variable. Unrolling the loop in the above case into the following is a huge win: execution is on average 10x faster with little change in code size.
Code:
sample[0] = 0;
sample[1] = 0;
:
sample[9] = 0;
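If writing out every element by hand feels error-prone, a small helper macro gives the same effect. This is just a sketch with names of my own choosing, and the win only holds while every index stays a compile-time constant:
Code:
/* Sketch: hypothetical unrolling macro.  Every index is a constant,
   so the compiler can emit a fixed RAM address for each store
   instead of computing one at run time. */
#define CLEAR5(arr, base)       \
    do {                        \
        (arr)[(base) + 0] = 0;  \
        (arr)[(base) + 1] = 0;  \
        (arr)[(base) + 2] = 0;  \
        (arr)[(base) + 3] = 0;  \
        (arr)[(base) + 4] = 0;  \
    } while (0)

unsigned char sample[10];

void clear_samples(void)
{
    CLEAR5(sample, 0);  /* sample[0]..sample[4] */
    CLEAR5(sample, 5);  /* sample[5]..sample[9] */
}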
Anyway, my point is that a programmer should care about how code constructs map to the underlying architecture. If that didn't matter, the way to go would be to just change everything to extended floating point precision and do all calculations that way -- the "smarts" in the compiler would figure out when we needed integers vs floats and 8-bit values vs 32-bit values and hide all that nonsense from the programmer.
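As one more small illustration of that mapping (my example, not anything above): on an 8-bit PIC, even the width of the loop counter changes the generated code, so picking the matching type is the programmer's call:
Code:
/* Sketch: an 'unsigned char' counter matches the 8-bit register
   width and can decrement-and-test in a single instruction; a
   16-bit 'int' counter drags multi-byte arithmetic into every pass. */
void narrow_counter(void)
{
    unsigned char n = 10;   /* one byte, one decrement/test per pass */
    do {
        /* ... work ... */
    } while (--n != 0);
}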
PS
Another common trick/practice on the PIC when loading data from h/w, such as:
Code:
379: timer_count = TMR1H;
380: timer_count <<= 8;
381: timer_count += TMR1L;
382: timer_count -= offset;
06730 50CF MOVF 0xfcf, W, ACCESS
06732 6F17 MOVWF 0x17, BANKED
06734 6B18 CLRF 0x18, BANKED
06736 C517 MOVFF 0x517, 0x518
06738 F518 NOP
0673A 6B17 CLRF 0x17, BANKED
0673C 50CE MOVF 0xfce, W, ACCESS
0673E 6E2B MOVWF 0x2b, ACCESS
06740 6A2C CLRF 0x2c, ACCESS
06742 502B MOVF 0x2b, W, ACCESS
06744 2717 ADDWF 0x17, F, BANKED
06746 502C MOVF 0x2c, W, ACCESS
06748 2318 ADDWFC 0x18, F, BANKED
0674A 5119 MOVF 0x19, W, BANKED
0674C 5F17 SUBWF 0x17, F, BANKED
0674E 511A MOVF 0x1a, W, BANKED
06750 5B18 SUBWFB 0x18, F, BANKED
A simple change can result in much more efficient code: use a union with an anonymous struct, so the two bytes are written directly and the compiler is spared the 16-bit shift and add:
Code:
typedef union u_U16
{
    unsigned int data;      /* the full 16-bit value */
    struct {
        unsigned char b0;   /* low byte (PIC18 stores little-endian) */
        unsigned char b1;   /* high byte */
    };                      /* anonymous struct overlays 'data' */
} u_U16;
u_U16 timer_drift;
384: timer_drift.b1 = TMR1H;
385: timer_drift.b0 = TMR1L;
386: timer_drift.data -= offset;
06752 CFCF MOVFF 0xfcf, 0x516
06754 F516 NOP
06756 CFCE MOVFF 0xfce, 0x515
06758 F515 NOP
0675A 5119 MOVF 0x19, W, BANKED
0675C 5F15 SUBWF 0x15, F, BANKED
0675E 511A MOVF 0x1a, W, BANKED
06760 5B16 SUBWFB 0x16, F, BANKED
Another quick 3x improvement in both code size and execution speed.
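One caveat: the trick assumes the struct overlays 'data' little-endian, b0 on the low byte, which is how the PIC18 compilers lay it out (the listing above confirms it: TMR1L lands at the lower address). If the union ever gets ported elsewhere, a quick self-test of the u_U16 defined above is cheap insurance; this sketch is mine, not part of the code above:
Code:
#include <assert.h>

/* Sketch: verify b0 overlays the low byte of 'data'.  On a
   big-endian target b0/b1 would need to be swapped. */
void check_u_U16_layout(void)
{
    u_U16 t;
    t.data = 0x1234;
    assert(t.b0 == 0x34);   /* low byte  */
    assert(t.b1 == 0x12);   /* high byte */
}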