ART accelerator

ag123 · Post by **ag123** » Thu Jan 09, 2020 3:11 pm

While playing with my PILL_F401 board, I got a little curious about the 'ART accelerator'. It is also available on the F411, F405, F407 and various other F4 parts. (Note that it is different from the 'ChromART' accelerator: ChromART is one level 'up' and does DMA2D, i.e. graphics acceleration. If you want to play with DMA2D 'ChromART', one board that has it is the STM32F429 Discovery: https://www.st.com/en/evaluation-tools/ ... overy.html)
This post is purely about the MCU's cache for flash.

It turns out the ART accelerator is basically the cache settings for flash.
I found this Stack Overflow thread:
https://stackoverflow.com/questions/113 ... 6/12020932
One of the replies there refers to this doc with F2 code; look for "ART configuration":
https://www.st.com/content/ccc/resource ... 033348.pdf

I used this code to toggle the caches and see the difference. Currently these defines are only valid in the official core.

Code: Select all

void ARTtoggle() {
	if((FLASH->ACR & FLASH_ACR_ICEN)!=FLASH_ACR_ICEN) { // art disabled
		/* enable the ART accelerator */
		/* enable prefetch buffer */
		FLASH->ACR |= FLASH_ACR_PRFTEN;
		/* Enable flash instruction cache */
		FLASH->ACR |= FLASH_ACR_ICEN;
		/* Enable flash data cache */
		FLASH->ACR |= FLASH_ACR_DCEN;
		asm("wfi"); //wait for a systick interrupt i.e. delay(1)
		Serial.println("ART enabled");
	} else {
		/* disable the ART accelerator */
		/* disable flash instruction cache */
		FLASH->ACR &= ~FLASH_ACR_ICEN;
		/* disable flash data cache */
		FLASH->ACR &= ~FLASH_ACR_DCEN;
		/* keep the prefetch buffer enabled (it stays on in both states) */
		FLASH->ACR |= FLASH_ACR_PRFTEN;
		asm("wfi"); //wait for a systick interrupt, i.e. delay(1)
		Serial.println("ART disabled");
	}
}

A sample run looks like this (compiled with -O3, so the optimizer may be cheating a little on the Mflops):

ART disabled
Beginning Whetstone benchmark at 84 MHz ...
Loops:10000, Iterations:1, Duration:5596.96 millisec
C Converted Single Precision Whetstones:178.67 Mflops
ART enabled
Beginning Whetstone benchmark at 84 MHz ...
Loops:10000, Iterations:1, Duration:4654.92 millisec
C Converted Single Precision Whetstones:214.83 Mflops

The 'ART accelerator' is enabled by default in the official core, i.e. the cache flags are already set, at least on the F401 that I played with.
If for any reason you think it isn't enabled, you could try the code above; it works even when called from setup() in an Arduino sketch.
The test sketch is the same as the PILL_F401 Whetstone blinky for the official core, with the test code above added:
viewtopic.php?f=14&p=509#p509
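
To check whether the caches really are on, rather than toggling blindly, the raw value of FLASH->ACR can be decoded. The following is only a sketch: `describeAcr` is a hypothetical helper, and the bit positions are restated locally (as I understand the F4 register layout) so that it also compiles on a PC. On the board, the FLASH_ACR_* macros from the device header are the ones to trust.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed FLASH_ACR bit layout for STM32F4, restated locally so this
// compiles off-target. On the MCU, prefer the CMSIS FLASH_ACR_* macros.
constexpr uint32_t ACR_LATENCY_MASK = 0xFu;      // wait states, bits 3:0 (some F4 parts only use 2:0)
constexpr uint32_t ACR_PRFTEN       = 1u << 8;   // prefetch buffer enable
constexpr uint32_t ACR_ICEN         = 1u << 9;   // instruction cache enable
constexpr uint32_t ACR_DCEN         = 1u << 10;  // data cache enable

// Hypothetical helper: turn a raw ACR value into a readable status line.
std::string describeAcr(uint32_t acr) {
    std::string s = "latency=" + std::to_string(acr & ACR_LATENCY_MASK);
    s += (acr & ACR_PRFTEN) ? " prefetch=on" : " prefetch=off";
    s += (acr & ACR_ICEN)   ? " icache=on"   : " icache=off";
    s += (acr & ACR_DCEN)   ? " dcache=on"   : " dcache=off";
    return s;
}
```

In a sketch this could be printed with something like `Serial.println(describeAcr(FLASH->ACR).c_str());`, assuming std::string is usable in the core you build with.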

serial terminal commands
p - print temperature
s - stop printing
w - run whetstone benchmark
a - toggle 'ART accelerator'

This article gives good background on the 'ART accelerator':
https://eda360insider.wordpress.com/201 ... -in-1000s/

Bingo600 · Post by **Bingo600** » Thu Jan 09, 2020 5:36 pm

Informative

Thanx
Bingo

ag123 · Post by **ag123** » Thu Jan 09, 2020 5:43 pm

I hope I'll find some time to patch this back into libmaple.
Another thing I realized is that, with all these elaborate caches, the time between one instruction and the next can vary from about 1 clock cycle when it is in cache (could be less, e.g. there are 2 FPUs, so if both run an instruction it is 2 instructions per cycle) to something like 7-15 clock cycles (e.g. wait states for flash).
This puts a big twist on bit-banging I/O: it isn't really possible to just bit-bang a sequence and expect the same timing every time. Some more elaborate hardware, e.g. timers and DMA, probably needs to be involved.
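
To put rough numbers on that spread, here is a small conversion from clock cycles to time. It assumes the 84 MHz clock used in the benchmarks and takes the 1 and 7-15 cycle figures above at face value (they are estimates, not measurements); `cyclesToNs` is just an illustrative helper name.

```cpp
#include <cassert>
#include <cmath>

// Time taken by a number of core clock cycles, in nanoseconds.
double cyclesToNs(double cycles, double clockHz) {
    return cycles * 1e9 / clockHz;
}
```

At 84 MHz one cycle is about 11.9 ns, so the estimated range works out to roughly 12 ns for a cached instruction versus about 83 to 179 ns when the core waits on flash.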

Pito · Post by **Pito** » Fri Jan 10, 2020 12:56 am

@ag123: I do not understand how you can get similar results @84MHz to what I get @168MHz (-O3, FPU on)
Me:

Code: Select all

Loops: 10000 Iterations: 1 Duration: 4171 millisec.   0 clocks
C Converted Single Precision Whetstones: 239.75 Mflops

You:

Code: Select all

Loops:10000, Iterations:1, Duration:4654.92 millisec
C Converted Single Precision Whetstones:214.83 Mflops

PS: you cannot get more than 119.32 Mflops, imho..

ag123 · Post by **ag123** » Fri Jan 10, 2020 3:52 am

Try the ARM compiler toolchain in the official core. Its optimization is apparently more aggressive than the old toolchain I used with libmaple, which gives lower Mflops.
It seems the toolchain used in the official core is likely the same as the new ARM GCC 9 compiler:
https://developer.arm.com/tools-and-sof ... /downloads
But I'd agree that with -O3 it is probably not doing the FP calcs; it cheats.

-Os is likely real and gives about 60 Mflops, but it may not be using the 2nd concurrent FPU; I think there are 2 in the MCU.

The ratios between the numbers still make sense, cache vs no cache: a 20% speedup (5596.96 ms vs 4654.92 ms) is probably about as much as the cache can deliver.
There are other bottlenecks and possible cache misses.
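
The figures in the sample run can be cross-checked with a little arithmetic. This sketch assumes the benchmark follows the classic Whetstone convention of 100,000 Whetstone instructions per loop unit; that is inferred from the printed numbers, not read from the benchmark source, and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Rating as the benchmark appears to print it: total Whetstone
// instructions (assumed 100,000 per loop unit) per second, in millions.
double whetstoneMflops(double loops, double iterations, double durationMs) {
    double seconds = durationMs / 1000.0;
    return loops * iterations * 100000.0 / seconds / 1e6;
}

// How much faster the short run is than the long one, in percent.
double speedupPercent(double slowMs, double fastMs) {
    return (slowMs / fastMs - 1.0) * 100.0;
}
```

With the durations from the first post, `whetstoneMflops(10000, 1, 5596.96)` and `whetstoneMflops(10000, 1, 4654.92)` reproduce the printed 178.67 and 214.83 to within rounding, and `speedupPercent(5596.96, 4654.92)` comes out at about 20.2%.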

ag123 · Post by **ag123** » Fri Jan 10, 2020 9:06 am

Out of curiosity, I took the arm-none-gcc compiler from the official core,
rebuilt the sketch with Steve's libmaple core, and got:
Beginning Whetstone benchmark at 84 MHz ... -Os
Loops:10000, Iterations:1, Duration:15050.42 millisec
C Converted Single Precision Whetstones:66.44 Mflops
Beginning Whetstone benchmark at 84 MHz ... -O3
Loops:10000, Iterations:1, Duration:8305.48 millisec
C Converted Single Precision Whetstones:120.40 Mflops

What seems to be missing is getting the ART accelerator working in Steve's libmaple core.
The other difference is that the official core seems to use a math lib from the ARM CMSIS DSP library; I'm not sure if that makes a difference.
But if it runs from flash and I get 120.40 Mflops with maybe 7 wait states, what would it be with 0 wait states?
For sure -O3 still cheated. Maybe after we fix the ART accelerator I'd get the same 214 (cheated) Mflops,
and by extension you'd get closer to 500 (cheated) Mflops at 168 MHz.

It seems the F4xx ART accelerator consists of a few things:
- a prefetch buffer (I kept this on in the ART toggle, so it probably contributes to the higher Mflops in both states)
- an additional instruction cache and data cache (these account for the difference seen with the ART toggle code)

Edit:
Tried that; it didn't work, I still get 120.40 Mflops.

Pito · Post by **Pito** » Fri Jan 10, 2020 10:58 am

I've been using the same compiler as the official STM core (with the 239Mflops @168Mhz).
There is only 1 FPU inside.
Again, you cannot get more than 119 Mflops at 84MHz.

You have to enable printing of the results, as I have done!!
Otherwise the compiler may optimize out entire blocks of the benchmark.

Code: Select all

Loops: 10000 Iterations: 1 Duration: 4171 millisec.   0 clocks
C Converted Single Precision Whetstones: 239.75 Mflops
0       0       0       1.00    -1.00   -1.00   -1.00   0
120000  140000  120000  -0.00   0.00    -0.00   0.00    120000
140000  120000  120000  -0.00   0.00    0.00    0.00    140000
3450000 1       1       1.00    -1.00   -1.00   -1.00   3450000
2100000 1       2       6.00    6.00    0.00    0.00    2100000
320000  1       2       0.00    0.00    0.00    0.00    320000
8990000 1       2       1.00    1.00    1.00    1.00    8990000
6160000 1       2       3.00    2.00    3.00    0.00    6160000
0       2       3       1.00    -1.00   -1.00   -1.00   0
930000  2       3       1.00    1.00    1.00    1.00    930000
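
Printing the results works because output is observable, so the compiler has to produce the values. A related way to get a similar effect is to store each result in a volatile variable. The sketch below only illustrates the idea and is not taken from the benchmark: `harmonicSum` is a hypothetical stand-in for one benchmark module and `g_sink` is an invented name.

```cpp
#include <cassert>

// Stores to a volatile object count as observable behaviour, so the
// compiler is not allowed to drop them.
volatile float g_sink = 0.0f;

// Hypothetical stand-in for one benchmark module: sum of 1/k for k = 1..n.
float harmonicSum(int n) {
    float acc = 0.0f;
    for (int k = 1; k <= n; ++k) {
        acc += 1.0f / static_cast<float>(k);
    }
    return acc;
}

// Runs the work and publishes the result so it cannot be discarded.
float runAndKeep(int n) {
    float r = harmonicSum(n);
    g_sink = r;  // forces the value of r to be produced
    return r;
}
```

Note that this only guarantees the store happens. If the input is a compile-time constant the optimizer may still precompute the answer, so a real benchmark should also take its loop counts from something known only at run time.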

ag123 · Post by **ag123** » Fri Jan 10, 2020 11:09 am

I got this instead with the same code, but rebuilt in the libmaple core:

ART disabled
Beginning Whetstone benchmark at 84 MHz ... -O3
Loops:10000, Iterations:1, Duration:11118.45 millisec
C Converted Single Precision Whetstones:89.94 Mflops
ART enabled
Beginning Whetstone benchmark at 84 MHz ... -O3
Loops:10000, Iterations:1, Duration:8287.50 millisec
C Converted Single Precision Whetstones:120.66 Mflops

The attached file is the source code I used.
Serial commands:
p - print temperature
s - stop printing
w - run whetstone benchmark
a - toggle 'ART accelerator'

The file flashF4.h in there
is derived from flash.h in the core, except that the ACR register definitions are changed to those for the F401; it should work on the F407 and other F4xx parts too, I guess.

But given that the stock flash.h won't work, it seems the ART accelerator is enabled by default by the on-chip firmware (i.e. ST's boot ROM),
even though the documented reset values seem to be 0 (disabled).

Pito · Post by **Pito** » Fri Jan 10, 2020 11:14 am

You will get the same number with the STM compiler once you enable printing of the results!
You have to see the same printout as in my post above, otherwise you get crap.

ag123 · Post by **ag123** » Fri Jan 10, 2020 11:35 am

Thanks, I got this for the official core as well,
with printing of results enabled, compiled with -O3:

ART enabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:11249.84 millisec
C Converted Single Precision Whetstones:88.89 Mflops
ART disabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:13464.02 millisec
C Converted Single Precision Whetstones:74.27 Mflops

Arduino for STM32
