Fast bitbanding gpio/sram access

arpruss
Posts: 153
Joined: Sat Sep 30, 2017 3:34 am

Re: Fast bitbanding gpio/sram access

Post by arpruss » Sun Nov 26, 2017 3:08 pm

Here are per-operation cycle timings and code sizes for my test code, compiled with -O3 and timed with DWT->CYCCNT. The test code uses 400 unrolled operations. For reads, this is a sequence of 400 reads alternating between PB12 and PA7. For writes, it is a sequence of 400 writes repeating the pattern 1 to PB12, 1 to PA7, 0 to PB12, 0 to PA7. All of the tests share the same setup code, so differences in byte size should mainly reflect the read/write operations themselves.
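The per-operation figures below are averages over the unrolled run, which is why half-cycle values appear. A minimal sketch of that arithmetic, assuming two samples of the 32-bit cycle counter taken before and after the run; the function names and the `overhead` calibration term are hypothetical and not taken from the test code:

```c
#include <stdint.h>

/* Elapsed cycles between two samples of a 32-bit cycle counter.
   Unsigned subtraction stays correct even if the counter wrapped once. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cost of one operation in an unrolled run of n_ops operations.
   overhead is the cost of an empty measurement (hypothetical calibration). */
static double cycles_per_op(uint32_t start, uint32_t end,
                            uint32_t overhead, uint32_t n_ops)
{
    return (double)(elapsed_cycles(start, end) - overhead) / (double)n_ops;
}
```

On the target, `start` and `end` would be read from DWT->CYCCNT immediately before and after the 400 operations.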

Summary: For reading, bitbanded is fastest and smallest. For constant-value writing, gpio_write_bit() is by far the fastest, while bitbanded is slightly smaller.
  • digitalRead: 52.5 cycles (40568 bytes)
  • gpio_read_bit: 9.5 cycles (39744 bytes)
  • reading from register with premade mask: 9 cycles (39360 bytes)
  • bitbanded read: 7 cycles (38128 bytes)
  • digitalWrite: 54 cycles (39760 bytes)
  • gpio_write_bit: 2 cycles (37344 bytes)
  • writing to ODR register with premade mask: 12.5 cycles (41072 bytes)
  • writing to BSRR register with premade mask: 4 cycles (38144 bytes)
  • bitbanded write: 7 cycles (37328 bytes)
However, gpio_write_bit() becomes significantly less space- and time-efficient when the value being written is not known at compile time (e.g., I had it write a volatile uint8 that was flipped on each write). It still beats bitbanded writing by one clock cycle, but at the cost of much larger code.

dave j
Posts: 9
Joined: Thu Nov 02, 2017 8:49 pm

Re: Fast bitbanding gpio/sram access

Post by dave j » Sun Nov 26, 2017 3:20 pm

The advantages from bit-banding really come from pre-calculating the address - that way you just need to do a read or write when you use it. Calculating it each time as you are doing loses the main advantage of the technique.
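As a sketch of what pre-calculating the address means: on Cortex-M3, each bit in the peripheral bit-band region maps to one 32-bit word in an alias region, and the alias address is a fixed function of the register address and bit number. The macro names below are illustrative, and the GPIOB address is the STM32F103 one, used here only as an example:

```c
#include <stdint.h>

#define PERIPH_BASE     0x40000000u  /* start of the peripheral bit-band region */
#define PERIPH_BB_BASE  0x42000000u  /* start of its bit-band alias region */

/* Alias word address for one bit of a peripheral register:
   each byte of the region spans 32 alias bytes, each bit spans 4. */
#define BITBAND_PERIPH(reg_addr, bit) \
    (PERIPH_BB_BASE + (((reg_addr) - PERIPH_BASE) * 32u) + ((bit) * 4u))

/* STM32F103 example: GPIOB base 0x40010C00 + IDR offset 0x08 */
#define GPIOB_IDR_ADDR  0x40010C08u

/* Computed once. Because this is a constant expression, the compiler
   can fold it, so no address arithmetic is left at run time. */
static const uint32_t PB12_IN_ALIAS = BITBAND_PERIPH(GPIOB_IDR_ADDR, 12u);

/* On the target, a read would then be a single load:
   uint32_t level = *(volatile uint32_t *)PB12_IN_ALIAS;   (0 or 1) */
```

When the pin is only known at run time, the same expression can be evaluated once and the resulting pointer stored, so that each later access is still a single load or store.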

arpruss
Posts: 153
Joined: Sat Sep 30, 2017 3:34 am

Re: Fast bitbanding gpio/sram access

Post by arpruss » Sun Nov 26, 2017 3:49 pm

I made a new version that replaces the bitbanded DIGITAL_WRITE() with an optimized version of gpio_write_bit(). The optimized version is one clock cycle faster than gpio_write_bit() when the value being written (which must be either 0 or 1; other values yield unpredictable results) is unknown at compile time, and has the same speed as gpio_write_bit() when the value is known at compile time. https://gist.github.com/arpruss/5be978f ... 7abf954c68

The new DIGITAL_WRITE() trades space for speed. If you want to trade speed for space, use DIGITAL_WRITE_BITBAND() instead, which will always be faster and smaller than digitalWrite().
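The restriction to values 0 or 1 is what a branch-free BSRR write would impose. The following is a sketch of one such formulation, not the code from the gist; `bsrr_word` is a hypothetical helper. It relies on the STM32F1 BSRR layout, where writing 1 to bit n sets pin n and writing 1 to bit n+16 resets it:

```c
#include <stdint.h>

/* Build the word to store into BSRR for pin `bit` (0..15).
   value must be exactly 0 or 1; anything else shifts by the wrong
   amount and produces a meaningless word. */
static inline uint32_t bsrr_word(unsigned bit, uint32_t value)
{
    /* value == 0: keep the bit in the upper (reset) half.
       value == 1: shift it down 16 places into the lower (set) half. */
    return (0x10000u << bit) >> (value << 4);
}
```

With a compile-time constant value, the whole expression folds to a single constant, which is consistent with the constant-value case costing only a store.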

Note that my DIGITAL_READ() has an advantage over gpio_read_bit(), because DIGITAL_READ() always returns 0 or 1, while gpio_read_bit() returns 0 or a 32-bit mask. Thus, one can do things like:

Code: Select all

uint8 value = DIGITAL_READ(PA8);
DIGITAL_WRITE(PA8,value);
which has the expected result. But if one does:

Code: Select all

uint8 value = gpio_read_bit(GPIOA,8);
gpio_write_bit(GPIOA,9,value); // unexpected result!
you will always be writing zero to PA9, because gpio_read_bit(GPIOA,8) returns either 0 or 0x100, and both truncate to 0 when stored in the 8-bit value. Moreover, bit-banged serial-data reading is easier with DIGITAL_READ(): because it always returns 0 or 1, you can just shift its output by the right number of bits and OR it into a buffer.
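The truncation and the shift-and-OR pattern can be reproduced off-target. In this sketch the input register is passed in as a plain value so that it runs on a host; the function names are hypothetical and only mimic the two return conventions described above:

```c
#include <stdint.h>

/* Masked read, in the style attributed to gpio_read_bit() above:
   returns 0 or the bit's mask (e.g. 0x100 for bit 8). */
static uint32_t read_masked(uint32_t idr, unsigned bit)
{
    return idr & (1u << bit);
}

/* Normalized read: always returns 0 or 1. */
static uint32_t read_bit01(uint32_t idr, unsigned bit)
{
    return (idr >> bit) & 1u;
}

/* Shift eight samples into a byte, MSB first.
   samples[i] stands in for the input register value at clock i. */
static uint8_t shift_in_byte(const uint32_t samples[8], unsigned bit)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (uint8_t)((out << 1) | read_bit01(samples[i], bit));
    }
    return out;
}
```

Storing `read_masked(0x100, 8)` into a `uint8_t` yields 0, while `read_bit01(0x100, 8)` yields 1. If a masked read is all that is available, `!!` or a comparison against 0 normalizes it before it is stored in a narrow type.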

The main disadvantage of my macros is that you can't store the port in a variable (though you can store it in a macro). It must be an explicit "PB13" port label (the macros do compile-time string manipulation to extract the port data from the argument).
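A minimal sketch of how this kind of compile-time extraction could work, using preprocessor stringification. It is not taken from the gist; the macro names are hypothetical, and it assumes pin labels are plain identifiers (such as enum constants) rather than macros that would themselves expand:

```c
/* Two-level stringification: the outer macro expands its argument first,
   so a pin label stored in another macro still ends up as "PB3". */
#define PIN_STR_(pin) #pin
#define PIN_STR(pin)  PIN_STR_(pin)

/* "PB13"[1] is 'B', so the port index is 1 (A = 0, B = 1, ...). */
#define PIN_PORT_INDEX(pin) (PIN_STR(pin)[1] - 'A')

/* One or two decimal digits follow the port letter. */
#define PIN_BIT(pin) \
    (PIN_STR(pin)[3] == '\0' \
        ? (PIN_STR(pin)[2] - '0') \
        : (PIN_STR(pin)[2] - '0') * 10 + (PIN_STR(pin)[3] - '0'))

/* A pin label stored in a macro still works. */
#define ROTATION_DETECTOR_PIN PB3
```

Every operand is a string-literal character, so an optimizing compiler can reduce these expressions to constants. A pin held in a variable has no label to stringify, which is why this approach cannot accept one.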

One thing I like about my DIGITAL_*() macros is that I don't need two #defines per pin in my sketch header. If I were using gpio_*_bit(), I would need something like:

Code: Select all

#define ROTATION_DETECTOR_GPIO GPIOB
#define ROTATION_DETECTOR_BIT 3
but with DIGITAL_*(), just as with the stock digital*(), I just need:

Code: Select all

#define ROTATION_DETECTOR_PIN PB3

arpruss
Posts: 153
Joined: Sat Sep 30, 2017 3:34 am

Re: Fast bitbanding gpio/sram access

Post by arpruss » Sun Nov 26, 2017 3:53 pm

dave j wrote:
Sun Nov 26, 2017 3:20 pm
The advantages from bit-banding really come from pre-calculating the address - that way you just need to do a read or write when you use it. Calculating it each time as you are doing loses the main advantage of the technique.
The macros only work when the port is explicitly specified using the PXxx format, and then all the calculations are done by the compiler at compile time. I checked the assembly output both with -O3 and with no optimization. Here is a snippet without optimization (-g):

Code: Select all

aa=DIGITAL_READ(PB12);
 8002466:	6808      	ldr	r0, [r1, #0]
 8002468:	6018      	str	r0, [r3, #0]
aa=DIGITAL_READ(PA7);
 800246a:	6810      	ldr	r0, [r2, #0]
 800246c:	6018      	str	r0, [r3, #0]
The fact that in my benchmark the bitbanded versions consistently produce the smallest code also shows that the calculations are done at compile time.

stevestrong
Posts: 2043
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany

Re: Fast bitbanding gpio/sram access

Post by stevestrong » Sun Nov 26, 2017 4:13 pm

Your benchmark also has to count the instructions that load the constant values into registers r1, r2, r3.

Pito
Posts: 1729
Joined: Sat Mar 26, 2016 3:26 pm
Location: Rapa Nui

Re: Fast bitbanding gpio/sram access

Post by Pito » Sun Nov 26, 2017 4:16 pm

..trades space for speed. If you want to trade speed for space,..
How can you trade speed for size, or vice versa, when manipulating pins?
Size means clocks here.
Pukao Hats Cleaning Services Ltd.

mrburnette
Posts: 2190
Joined: Mon Apr 27, 2015 12:50 pm
Location: Greater Atlanta

Re: Fast bitbanding gpio/sram access

Post by mrburnette » Sun Nov 26, 2017 5:03 pm

I absolutely "love" these threads; truly interesting from a chip architecture perspective.

But I usually post (and have done so many times before) that such techniques are conceptually anti-Arduino. For someone coming over to STM32duino from AVR, it is paradigm quicksand: not only are we STM32-centric, we may even be writing code for a specific uC within the STM32 product family.

This is a good read and a good refresher on why Arduino cores are inherently inefficient at pin addressing. It all plays into library writers' concerns about whether to write generic code or use #ifdef to broaden the library's appeal (useful scope).

Essentially, the choice made will both delight and dismay prospective users; which is just another way of saying 'you cannot please everyone.'


Ray

Ollie
Posts: 205
Joined: Thu Feb 25, 2016 7:27 pm

Re: Fast bitbanding gpio/sram access

Post by Ollie » Sun Nov 26, 2017 5:40 pm

My reasons to like bitbanding I/O were
  • Atomic operations eliminated conflicts with interrupts
  • Self-documentation when using variable names instead of port letters and pin numbers
  • Accessing pins that were not known at compilation time
The reason why I have abandoned bitbanding is
  • It is not supported in F7 and H7

Rick Kimball
Posts: 1077
Joined: Tue Apr 28, 2015 1:26 am
Location: Eastern NC, US

Re: Fast bitbanding gpio/sram access

Post by Rick Kimball » Sun Nov 26, 2017 5:53 pm

arpruss wrote:
Sun Nov 26, 2017 3:08 pm
The test code uses 400 unrolled operations. For read operations, this is a sequence of 400 reads from PB12 and PA7, alternating.
I think that is a ridiculous test. Who is going to be doing that? The typical use case of digital write is some conditional code then a pin change, then some more conditional logic and another digitalWrite. This is going to push your bitband registers out of r1,r2,r3 and it will have to reload them.

Can you explain why this test matters?
-rick

dannyf
Posts: 228
Joined: Wed May 11, 2016 4:29 pm

Re: Fast bitbanding gpio/sram access

Post by dannyf » Sun Nov 26, 2017 6:22 pm

The main disadvantage of my macros is that you can't store the port in a variable (though you can store it in a macro).
Not that big of a deal to most C programmers. The Arduino crowd, however, seems to be more challenged by that.
One thing I like about my DIGITAL_*() macros is that I don't need to have two #defines per port in my sketch header. If I was using gpio_*_bit(), I would need to do something like
You are paying a (small) price for that, however.
