Speed Up your IO !!

Re: Speed Up your IO !!

Post by dannyf »

those numbers probably explain why bit-banding is rarely mentioned in the datasheets of modern CMx chips.

Post by ag123 »

bit banding works well, but it is not present in all Arm Cortex-M architectures.
e.g. the speed shown here
https://www.youtube.com/watch?v=7Ey1U36YtY0
https://github.com/ag88/Adafruit_ILI9341_SPI_stm32duino
is mostly due to SPI with DMA, but roughly 10% of it comes from bit banding as well.
The trouble is that the code works on STM32 F1 and F4, but not on F0, G0, F7 or H7, as Cortex-M0/M0+ and Cortex-M7 don't include bit banding.
Hence, for a more 'portable' implementation, it is better to simply use registers.
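
For comparison, a register-based version can stay nearly as short as the bit-band one. The sketch below is only an illustration and is not taken from the linked repository: `gpio_regs_t`, `pin_set` and `pin_clear` are hypothetical names, and the struct merely mimics two GPIO registers so the code compiles off-target. It relies on the BSRR layout of the STM32 families discussed here, where writing a 1 to bits 0-15 sets a pin and writing a 1 to bits 16-31 resets it.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO port: only the registers used here. */
typedef struct {
    volatile uint32_t ODR;   /* output data register */
    volatile uint32_t BSRR;  /* bit set/reset register */
} gpio_regs_t;

/* Set a pin: one store to the lower half of BSRR, no read-modify-write. */
static inline void pin_set(gpio_regs_t *port, uint8_t pin) {
    port->BSRR = (uint32_t)1u << pin;
}

/* Clear a pin: one store to the upper half of BSRR. */
static inline void pin_clear(gpio_regs_t *port, uint8_t pin) {
    port->BSRR = (uint32_t)1u << (pin + 16u);
}
```

On target you would pass the port's register block from the device header instead of this mock; the mock does not emulate the hardware updating ODR.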

bit banding reduces the code to something as simple-looking as

Code:

	/* note bit banding is used here */
	inline void dc_command() { *dc_addr = 0; } // 0
	inline void dc_data()    { *dc_addr = 1; } // 1
	inline void cs_clear()   { *cs_addr = 0; }
	inline void cs_set()     { *cs_addr = 1; }
but it is less portable than using registers.
bit banding also requires getting the alias address of the pin, like so:

Code:

#define BB_PERI_REF      0x40000000
#define BB_PERI_BASE     0x42000000
/* return the bit-band alias address for one bit of a peripheral register */
volatile uint32_t *bb_perip(volatile void *address, uint8_t bit) {
	return (volatile uint32_t *)(BB_PERI_BASE + ((uint32_t) address - BB_PERI_REF) * 32 + bit * 4);
}

#ifdef ARDUINO_ARCH_STM32
  __IO uint32_t *base = portOutputRegister(digitalPinToPort(_cs));
  cs_addr = bb_perip(base, STM_PIN(digitalPinToPinName(_cs)));
  base = portOutputRegister(digitalPinToPort(_dc));
  dc_addr = bb_perip(base, STM_PIN(digitalPinToPinName(_dc)));
#elif defined(ARDUINO_ARCH_STM32F1)
  volatile uint32 *base = &(PIN_MAP[_cs].gpio_device->regs->ODR);
  cs_addr = bb_perip(base, PIN_MAP[_cs].gpio_bit);
  base = &(PIN_MAP[_dc].gpio_device->regs->ODR);
  dc_addr = bb_perip(base, PIN_MAP[_dc].gpio_bit);
#elif defined(ARDUINO_ARCH_STM32F4)
  volatile uint32 *base = portOutputRegister(digitalPinToPort(_cs));
  cs_addr = bb_perip(base, digitalPinToBit(_cs));
  base    = portOutputRegister(digitalPinToPort(_dc));
  dc_addr = bb_perip(base, digitalPinToBit(_dc));
#endif
in a sense, the above gets you the pin's address in the bit-band alias region, given the PAxx symbol for the pin.
but for this to work on anything other than F1 and F4, it needs more "#ifdefs", since F0, G0, F7 and H7 would have to switch back to registers.
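
To make the address arithmetic in bb_perip() concrete, here is a small worked sketch that does the same calculation on plain integers, so it can be checked on a PC. The helper name `bb_alias_addr` is hypothetical, and the example address is an assumption taken from the F103 memory map (GPIOC at 0x40011000, ODR at offset 0x0C), not something from the code above.

```c
#include <stdint.h>

#define BB_PERI_REF   0x40000000u  /* start of the peripheral bit-band region */
#define BB_PERI_BASE  0x42000000u  /* start of the peripheral bit-band alias  */

/* Alias address for one bit of a peripheral register.
   Every bit gets its own 32-bit word in the alias region:
   32 bytes per register byte, 4 bytes per bit. */
static inline uint32_t bb_alias_addr(uint32_t reg_addr, uint8_t bit) {
    return BB_PERI_BASE + (reg_addr - BB_PERI_REF) * 32u + (uint32_t)bit * 4u;
}
```

For PC13 on an F103, bb_alias_addr(0x4001100C, 13) works out to 0x42000000 + 0x1100C * 32 + 13 * 4 = 0x422201B4. Writing 1 or 0 to that word changes only bit 13 of GPIOC's ODR.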

Post by dannyf »

the macros I provided earlier are for generic applications -> they can even take variable ports or bits.

for fixed ports / bits, you can speed things up significantly by precalculating the alias address and operating on that address.

This requires minor changes to the set of macros I provided earlier, but the speed gain is significant: 3.7K ticks / 1K run. On a 72MHz F103, that is 3.7 ticks per flip, so you are talking about flipping a pin at roughly 19.5MHz. Short of using a hardware module, that's hard to beat.

Code:

		//5 flips per pass, 200 passes: 3.7K ticks / 1K flips
		for (tmp=0; tmp<1000/5; tmp++) {
			*odrptr = 1; *odrptr = 0;
			*odrptr = 1; *odrptr = 0;
			*odrptr = 1; *odrptr = 0;
			*odrptr = 1; *odrptr = 0;
			*odrptr = 1; *odrptr = 0;
		}
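
The step from tick counts to pin frequency is simple arithmetic; a minimal sketch follows. The helper name `toggle_hz` is hypothetical, and it assumes that one "run" is one full high/low flip and that the tick count covers exactly those flips.

```c
#include <stdint.h>

/* Toggle frequency in Hz, given the CPU clock, the number of full
   high/low flips, and the CPU ticks those flips took.
   The 64-bit intermediate avoids overflow in cpu_hz * flips. */
static inline uint32_t toggle_hz(uint32_t cpu_hz, uint32_t flips, uint32_t ticks) {
    return (uint32_t)(((uint64_t)cpu_hz * flips) / ticks);
}
```

With the numbers quoted in this thread, toggle_hz(72000000, 1000, 3700) gives about 19.46 MHz for the bit-band version, and toggle_hz(72000000, 1000, 4500) gives 16 MHz for a 4.5K-tick run.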

Post by GonzoG »

GonzoG wrote: Mon Oct 23, 2023 7:27 pm
I got 19MHz with digitalReadFast and 33MHz with digitalWriteFast, but by typing out 1000 lines of code.
A while loop needs a few cycles.
ManX84 wrote: Mon Oct 23, 2023 9:43 pm
@GonzoG Please give us your code! (I do not pretend to get the fastest .. just starting to play)

Sorry.. copied the wrong values. Those were Mops/s, so the frequency is half of those. Measured with a "toy" oscilloscope, digitalWriteFast gave me an 18MHz signal.
[Attachment: 1.jpg]
The code is really simple, this pair repeated 500 times:

Code:

digitalWriteFast(PA1,1);
digitalWriteFast(PA1,0);

Post by dannyf »

GonzoG wrote:
18MHz signal.
Fast IO is implemented via BRR/BSRR.

Hard to imagine that they could get that fast.

Post by GonzoG »

dannyf wrote: Wed Oct 25, 2023 9:42 pm
18MHz signal.
Fast IO is implemented via BRR/BSRR.

Hard to imagine that they could get that fast.

digitalWriteFast and digitalReadFast use the LL functions. With -O3 (fastest) optimization they can go this fast.
On F401 and F411 they need 1 MCU cycle.
This is what I get with F411CE:
[Attachment: 6.jpg]

Post by dannyf »

GonzoG wrote:
This is what I get with F411CE:

without knowing what clock the CPU is running at, the raw GPIO speed doesn't mean a whole lot.

the digitalWriteFast() implementation is essentially an if-then-else branch on top of a BSRR / BRR macro. For a known value, a smart compiler will reduce it to just the BSRR or the BRR write. That becomes a single STR instruction, which is a 2-cycle instruction. I confirmed this on Keil MDK, and I would imagine that GCC is similar.
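
A rough sketch of that if-then-else shape, written against a mock register block so it can be compiled on a PC. This is not the core's actual digitalWriteFast() source; `fastio_regs_t` and `write_fast` are hypothetical names used only to show why a constant argument lets the compiler discard the branch.

```c
#include <stdint.h>

/* Hypothetical mock of the two F1-style registers involved. */
typedef struct {
    volatile uint32_t BSRR;  /* write 1 to set a pin   */
    volatile uint32_t BRR;   /* write 1 to reset a pin */
} fastio_regs_t;

/* When `val` is a compile-time constant and the call is inlined,
   an optimizing compiler can drop the test and keep one store. */
static inline void write_fast(fastio_regs_t *port, uint32_t mask, int val) {
    if (val) port->BSRR = mask;
    else     port->BRR  = mask;
}
```

Calling write_fast(port, 1u << 13, 1) then costs the same as writing BSRR directly, which is why the "fast" call and the raw register write should time about the same.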

for me, a 1K run using raw BSRR/BRR executes in 4.5K ticks.

here is what I used:

Code:

//four alternative ways of flipping a GPIO pin
//uncomment one of them at a time to test the speed
#define PC13FLP()		do {FIO_SET(GPIOC, 1<<13); FIO_CLR(GPIOC, 1<<13); } while (0)				//using BRR/BSRR
//#define PC13FLP()		FIO_FLP(GPIOC, 1<<13)														//flipping ODR
//#define PC13FLP()		do {*odrptr = 1; *odrptr = 0;} while (0)									//bit-banding. odrptr points to the bit banding alias address for PC13
//#define PC13FLP()		do {PC13FAST(1); PC13FAST(0);} while (0)									//mimicking digitalWriteFAST()

//helper macros to make the coding easier
#define PC13FAST(val)	do {if (val) FIO_SET(GPIOC, 1<<13); else FIO_CLR(GPIOC, 1<<13);} while (0)	//simulate FASTIO
#define PC13FLPx5()		do {PC13FLP(); PC13FLP(); PC13FLP(); PC13FLP(); PC13FLP();                } while (0)
#define PC13FLPx10()	do {PC13FLPx5(); PC13FLPx5();                                             } while (0)
#define PC13FLPx50()	do {PC13FLPx10(); PC13FLPx10(); PC13FLPx10(); PC13FLPx10(); PC13FLPx10(); } while (0)
#define PC13FLPx100()	do {PC13FLPx50(); PC13FLPx50();                                           } while (0)
#define PC13FLPx500()	do {PC13FLPx100();PC13FLPx100();PC13FLPx100();PC13FLPx100();PC13FLPx100();} while (0)
#define PC13FLPx1K()	do {PC13FLPx500();PC13FLPx500();                                          } while (0)
the first four macros represent four different ways to flip a pin. The last few macros make the test simpler: by swapping in / out of different flavors of PC13FLP(), you get to test their speed without rewriting your code.
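
The unrolling trick can be checked on a PC by swapping the pin flip for a counter. This is only a sketch: `FLP`, `flip_count` and `run_1k` are hypothetical stand-ins, and the counter replaces the real GPIO access.

```c
/* Hypothetical stand-in for PC13FLP(): count expansions instead of touching a pin. */
static unsigned flip_count = 0;
#define FLP()       do { flip_count++; } while (0)

/* Same nesting as the PC13FLPx...() macros: 5 * 2 * 5 * 2 * 5 * 2 = 1000. */
#define FLPx5()     do { FLP(); FLP(); FLP(); FLP(); FLP(); } while (0)
#define FLPx10()    do { FLPx5(); FLPx5(); } while (0)
#define FLPx50()    do { FLPx10(); FLPx10(); FLPx10(); FLPx10(); FLPx10(); } while (0)
#define FLPx100()   do { FLPx50(); FLPx50(); } while (0)
#define FLPx500()   do { FLPx100(); FLPx100(); FLPx100(); FLPx100(); FLPx100(); } while (0)
#define FLPx1K()    do { FLPx500(); FLPx500(); } while (0)

/* Expands to 1000 back-to-back flips with no loop overhead in between. */
static unsigned run_1k(void) {
    flip_count = 0;
    FLPx1K();
    return flip_count;
}
```

Because the expansion contains no loop, the measured ticks reflect the pin writes themselves rather than a counter and branch.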

Not seeing any way of getting this faster than 4K cycles / 1K run.