Hardware by design: ChibiTerm - Render loop timing test

Sunday, July 29, 2018

ChibiTerm - Render loop timing test

Original post date: 03/06/2016

I wrote some naive code for the font rendering to get an idea of the CPU cycles needed. PB1 was set before entering the code and cleared afterwards, so the render time shows up as a pulse on the pin. (I hope I wrote the code correctly, as I won't be able to debug it until much later.)

Looking at the assembly code, I get the impression that the Keil compiler did some clever optimization. The loop is supposed to run 80 times, with 1 memory write to the buffer per iteration and the pointers incremented by 1 each time. Instead, the compiler ran the loop 40 times with 2 memory stores per iteration and the pointers incremented by 2!
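For reference, the loop described above might look roughly like the C below. This is a sketch, not the code from the post: the names bufptr, textptr and tblptr come from the lookup line quoted later in the post, while the function names, the COLS constant and the 8-bit element types are assumptions. The second function spells out by hand the 2x unrolling the compiler appeared to perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS 80  /* characters per scan line, from the post */

/* Naive form: one font lookup and one store per iteration.
   tblptr is assumed to point at the font bytes for the current row. */
void render_naive(uint8_t *bufptr, const uint8_t *textptr,
                  const uint8_t *tblptr)
{
    for (size_t i = 0; i < COLS; i++)
        *bufptr++ = tblptr[*textptr++];
}

/* Hand-written equivalent of the compiler's apparent output:
   40 iterations, two stores each, pointers advanced by 2. */
void render_unrolled2(uint8_t *bufptr, const uint8_t *textptr,
                      const uint8_t *tblptr)
{
    assert(COLS % 2 == 0);  /* unrolling by 2 needs an even column count */
    for (size_t i = 0; i < COLS / 2; i++) {
        bufptr[0] = tblptr[textptr[0]];
        bufptr[1] = tblptr[textptr[1]];
        bufptr  += 2;
        textptr += 2;
    }
}
```

Both functions write the same 80 bytes; the unrolled one simply pays the loop overhead (compare and branch) half as often.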

I am not entirely sure about the assembly code, but the compiler knows more about ARM code than I do. The optimization level was set to -O3 (highest) and optimized for speed.

Assembly language output for above C code

This piece of code seems to be fast enough for the job! A scan line is 32us, so the rendering (17us) only needs about 1/2 of the CPU cycles available. The optimization helps a lot.

Rendering time measured from logic analyzer

The following is one scan line. (Note: the zoom scales are not the same.)

With the extra leeway, I can probably put the render code in the same IRQ handler that triggers the DMA. This would simplify the synchronization and buffer management between the rendering and DMA.
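One way this could be arranged is sketched below, with two line buffers swapped on every scan line. Everything here is hypothetical: scanline_irq, dma_start, render_line and the buffer names are illustrative, and dma_start only records a pointer so the sketch can run on a host instead of touching real DMA registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS 80  /* characters per scan line, from the post */

/* Two line buffers: DMA streams one while the CPU renders the other. */
static uint8_t line_buf[2][COLS];
static unsigned ready_idx;          /* buffer holding the last rendered line */

/* Hypothetical stand-in for starting a DMA transfer. */
static const uint8_t *dma_source;   /* records what the DMA was pointed at */
static void dma_start(const uint8_t *src) { dma_source = src; }

/* Hypothetical stand-in for the font render loop. */
static void render_line(uint8_t *dst, unsigned line)
{
    assert(dst != NULL);
    for (size_t i = 0; i < COLS; i++)
        dst[i] = (uint8_t)line;     /* placeholder pattern, not glyph data */
}

/* Called once per scan line, e.g. from the IRQ that triggers the DMA. */
void scanline_irq(unsigned next_line)
{
    dma_start(line_buf[ready_idx]);                    /* send the finished line */
    render_line(line_buf[ready_idx ^ 1u], next_line);  /* render the next one */
    ready_idx ^= 1u;                                   /* swap roles */
}
```

Because the DMA start and the render happen in the same handler, the buffer swap needs no flags or locks; the only requirement is that the render finishes within one scan line, which the timing above suggests it does.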

BTW, interrupts on the Cortex-M chips are pretty cool.

Oops, I forgot a shift in my font table lookup code:

*bufptr++ = tblptr[(*textptr++)<<4];  // font padded to 16 bytes per char

This mistake turns out to suggest an optimization: organizing the font table by scan line rows. Without the shift, the character code indexes the table directly, which works if the table stores row 0 of every character, then row 1, and so on. Since I am rendering a full scan line at a time, the scan line offset only needs to be computed once in the setup code. I am going to rerun the test with the shift instruction and see how bad it is.
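The two table layouts can be compared with a short sketch. The 16 bytes per character comes from the post; the 256-character set size and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CHARS 256   /* assumed character set size */
#define ROWS   16   /* bytes per character, from the post */

/* Character-major layout: the 16 rows of a glyph are contiguous,
   so each lookup needs (ch << 4) plus the row offset. */
uint8_t lookup_char_major(const uint8_t *font, uint8_t ch, unsigned row)
{
    assert(row < ROWS);
    return font[((unsigned)ch << 4) + row];
}

/* Row-major layout: row 0 of every glyph, then row 1, and so on.
   The row base is computed once per scan line; the inner loop
   then indexes with the character code only, with no shift. */
const uint8_t *row_base(const uint8_t *font, unsigned row)
{
    assert(row < ROWS);
    return font + row * CHARS;
}

/* Convert a character-major table to row-major (a one-time step). */
void to_row_major(const uint8_t *src, uint8_t *dst)
{
    for (unsigned ch = 0; ch < CHARS; ch++)
        for (unsigned row = 0; row < ROWS; row++)
            dst[row * CHARS + ch] = src[(ch << 4) + row];
}
```

With the row-major table, the setup code would call row_base once per scan line and the inner loop becomes a plain tblptr[*textptr++] lookup.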

Here is the new result:

The 80 previously missing shift instructions add about 2.4us.

Assembly output of the C shift instruction

2.4us / 80 = 30ns per shift, and at 50MHz (20ns per cycle) that is 1.5 cycles each!? How can that be?

The first shift instruction probably has 1 wait state, as it comes right after a branch (a prefetch miss). The second shift happens to be in the prefetch buffer, so it has 0 wait states. So on average, there are 0.5 added cycles per shift due to wait states.
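The arithmetic behind this estimate can be written out as a small sketch. The function names are illustrative, and the second function assumes each shift takes a single execute cycle, which the post implies but does not state.

```c
#include <assert.h>

/* Average CPU cycles per added instruction, given the extra time
   measured, the number of added instructions and the core clock. */
double cycles_per_instruction(double extra_us, unsigned count,
                              double clock_mhz)
{
    assert(count > 0);
    /* us * MHz = cycles, since 1 MHz is one cycle per microsecond */
    return extra_us * clock_mhz / count;
}

/* Average over a pair of shifts: one execute cycle each (assumed),
   plus 1 wait state on the first shift and 0 on the second. */
double average_with_wait_states(void)
{
    return ((1 + 1) + (1 + 0)) / 2.0;
}
```

Calling cycles_per_instruction(2.4, 80, 50.0) gives roughly 1.5, which matches the average of the wait-state model.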

Here is an interesting experiment:

If speed can be improved by doing more transfers per loop iteration, what is the limit or optimal point short of a full loop unroll?

Loop unrolling results

The compiler changes its code generation style at n=8 because it runs out of registers to hold the values. That is probably also why it gave up optimizing after putting 4 transfers per loop, which would explain why n=2 and n=4 have the same results.
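As an illustration of what "n transfers per loop" means here, the sketch below writes out n=1 and n=4 by hand, including the shift. The function names and COLS constant are assumptions; the actual test code was not published in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS 80  /* characters per scan line, from the post */

/* n = 1: one transfer per iteration, 80 iterations. */
void render_n1(uint8_t *bufptr, const uint8_t *textptr,
               const uint8_t *tblptr)
{
    for (size_t i = 0; i < COLS; i++)
        *bufptr++ = tblptr[(unsigned)(*textptr++) << 4];
}

/* n = 4: four transfers per iteration, 20 iterations. */
void render_n4(uint8_t *bufptr, const uint8_t *textptr,
               const uint8_t *tblptr)
{
    assert(COLS % 4 == 0);  /* column count must be a multiple of n */
    for (size_t i = 0; i < COLS / 4; i++) {
        bufptr[0] = tblptr[(unsigned)textptr[0] << 4];
        bufptr[1] = tblptr[(unsigned)textptr[1] << 4];
        bufptr[2] = tblptr[(unsigned)textptr[2] << 4];
        bufptr[3] = tblptr[(unsigned)textptr[3] << 4];
        bufptr  += 4;
        textptr += 4;
    }
}
```

Each extra transfer in the loop body needs registers for its character code and font byte, which fits the observation that the compiler changes strategy once n is large enough to exhaust them.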

Without the shifts, n=2 yields 14.25us. This time the saving is only 1.6us. This probably means that the prefetch kept the 4 shift instructions at 0 wait states. The font optimization might be more trouble than it's worth.

Lessons learnt:

Some mistakes lead to discovery.

Cycle counting can be tricky on modern CPUs.

The ARM C compiler is pretty smart.
