ART accelerator

ag123
Posts: 1657
Joined: Thu Dec 19, 2019 5:30 am
Answers: 25

Re: ART accelerator

Post by ag123 »

well, it seems the 'ART Accelerator' cache accounts for a 10-20% speedup vs cache disabled
the other thing that isn't really apparent is whether the whetstone run is in double precision
i don't think double precision calcs cost just 2x single precision
the cost is probably many times that of single precision
so these single precision mflops are still 'believable', in the sense that they are much easier to reach than double precision mflops
nvidia & amd gpus spew Tflops of single precision vector calculations, but in double precision they often can't do better than say 50-150 Gflops, which is about the max that common desktop gpus deliver
Pito
Posts: 94
Joined: Tue Dec 24, 2019 1:53 pm

Re: ART accelerator

Post by Pito »

Double is approx. 2x slower than single (both hw and sw).
I think we need to reconcile the Whetstone results with reality.

You CANNOT get 237 MFLOPS with a Cortex-M4 FPU at 168 MHz.

Based on the above web page with Whetstone results, ~240 MFLOPS is the result of a Pentium 4 at a 2 GHz clock.
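
A quick sanity check of the figures above, as a minimal sketch: dividing the reported rate by the core clock gives the number of floating-point operations per cycle the core would have to sustain. The helper name `flops_per_cycle` is made up for illustration; the 237 and 168 figures are the ones quoted in this thread.

```c
#include <assert.h>

/* Hypothetical helper: floating-point operations per clock cycle implied
 * by a reported MFLOPS figure at a given core clock in MHz. */
static double flops_per_cycle(double mflops, double clock_mhz)
{
    assert(clock_mhz > 0.0);   /* a zero or negative clock makes no sense */
    return mflops / clock_mhz;
}
```

Calling `flops_per_cycle(237.0, 168.0)` gives roughly 1.41, i.e. more than one floating-point result per clock sustained over the whole benchmark, which is the part that looks doubtful.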
Pukao Hats Cleaning Services Ltd.

Re: ART accelerator

Post by ag123 »

well, for single precision 'whetstone' i think it is still valid. apparently the fpu in the f4xx has 2 fpu units, i.e. it can do 2 fp ops in parallel, independent of each other,
as the 'enablefpu()' code literally enables the CP10 and CP11 coprocessors

Code: Select all

// Enable the FPU (Cortex-M4 - STM32F4xx and higher)
// http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/BEHBJHIG.html
void enablefpu() {
	  __asm volatile
	  (
	    "  ldr.w r0, =0xE000ED88     \n" /* The FPU enable bits are in the CPACR. */
	    "  ldr r1, [r0]              \n" /* Read the CPACR */
	    "  orr r1, r1, #( 0xf << 20 )\n" /* Set bits 20-23 to enable CP10 and CP11 coprocessors */
	    "  str r1, [r0]              \n" /* Write back the modified value to the CPACR */
	    "  dsb                       \n" /* Wait for the store to complete */
	    "  isb                       \n" /* Flush the pipeline now the FPU is enabled */
	    : /* no outputs */
	    : /* no inputs */
	    : "r0", "r1", "memory"           /* registers and memory the asm modifies */
	  );
}
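
To make the bit manipulation in the assembly above easier to follow, here is a minimal host-side sketch of the same read-modify-write on a plain integer. The helper names are hypothetical and nothing here touches hardware. As far as the ARM documentation goes, bits 21:20 are the CP10 access field and bits 23:22 the CP11 access field, and on the Cortex-M4 the two fields together gate access to a single FPU rather than to two separate units.

```c
#include <assert.h>
#include <stdint.h>

/* CPACR holds one 2-bit access field per coprocessor number. */
#define CPACR_CP10_SHIFT  20u   /* bits 21:20 */
#define CPACR_CP11_SHIFT  22u   /* bits 23:22 */
#define CPACR_FULL_ACCESS 0x3u  /* 0b11 = full access */

/* Hypothetical helper: the value the assembly would write back,
 * given the CPACR value it read. */
static uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    uint32_t v = cpacr | (CPACR_FULL_ACCESS << CPACR_CP10_SHIFT)
                       | (CPACR_FULL_ACCESS << CPACR_CP11_SHIFT);
    assert(((v >> CPACR_CP10_SHIFT) & 0xFu) == 0xFu);  /* same as 0xf << 20 */
    return v;
}

/* Hypothetical helper: true only when both fields grant full access. */
static int cpacr_fpu_enabled(uint32_t cpacr)
{
    uint32_t cp10 = (cpacr >> CPACR_CP10_SHIFT) & 0x3u;
    uint32_t cp11 = (cpacr >> CPACR_CP11_SHIFT) & 0x3u;
    return cp10 == CPACR_FULL_ACCESS && cp11 == CPACR_FULL_ACCESS;
}
```

Starting from a CPACR of 0, `cpacr_with_fpu_enabled(0)` yields `0x00F00000`, which matches the `0xf << 20` mask in the assembly.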
Last edited by ag123 on Fri Jan 10, 2020 3:37 pm, edited 1 time in total.

Re: ART accelerator

Post by Pito »

[Attachment: Whetstone results.PNG]
Pukao Hats Cleaning Services Ltd.

Re: ART accelerator

Post by ag123 »

oops, i read that as single precision, but maybe our little cortex m4 F4xx runs as fast as a P4 :lol:
i think these whetstone benchmarks are real after all, but i'm not sure how different they are from the BenchNT ones

updated the 144 MHz results a couple of posts back, it is not stable at 144 MHz even at 7 wait states
but i managed to pull 1 set of results between hangs
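
For the wait-state count and the cache-on versus cache-off runs discussed in this thread, a minimal sketch of how the flash access control value might be composed. The bit positions are an assumption based on the STM32F4 reference manual layout of FLASH_ACR and should be checked against the manual for the exact part; the helper name is hypothetical and the code only computes a value, it does not write any register.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FLASH_ACR bit positions (STM32F4 reference manual layout). */
#define ACR_LATENCY_MASK 0x7u        /* wait states, 0..7 */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper: compose a FLASH_ACR value for a given number of
 * wait states, with the ART prefetch and caches either all on or all off. */
static uint32_t flash_acr_value(unsigned wait_states, int art_on)
{
    assert(wait_states <= ACR_LATENCY_MASK);
    uint32_t v = wait_states & ACR_LATENCY_MASK;
    if (art_on)
        v |= ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    return v;
}
```

With 7 wait states, `flash_acr_value(7, 1)` gives `0x707` and `flash_acr_value(7, 0)` gives `0x7`, which would be the two settings to compare in a cache-on versus cache-off benchmark run.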

Re: ART accelerator

Post by ag123 »

there is an update here
viewtopic.php?p=725#p725
our little cortex m4 has VFP
http://infocenter.arm.com/help/topic/co ... dejjh.html
1.5.9. Vector Floating-Point (VFP)

The VFP coprocessor supports floating point arithmetic operations and is a functional block within the ARM1176JZF-S processor. The VFP coprocessor is mapped as coprocessor numbers 10 and 11. Software can determine whether the VFP is present by the use of the Coprocessor Access Control Register. See c1, Coprocessor Access Control Register for more details.
-O3 + a special VFP lib probably vectorised some fp instructions, hence the speed.
so that NTBench whetstone probably isn't a 1-1 comparison here. if features like SSE were used on the P4, it would likely be much faster.
maybe back then gcc didn't have a -O3 that could auto vectorise c/c++ code
:lol:

but then that is the ARM11 VFP and it is quite impressive, fp ops run at 1 fp op per cycle, which is why we see these mflops.
since then intel has caught up and these days intel chips basically also execute fp ops in 1 cycle. i think with vector fp, it is much more than 1 fp op per cycle
and intel is much more extreme, intel probably does 64 bit floating point in 1 cycle, the number of transistors involved is probably huge.
one way to do away with loops: rather than, say, taking 10 clocks to add 10 numbers, it is possible to build adders that add all 10 numbers in 1 cycle, with everything replaced by hardware
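
The "add 10 numbers in hardware" idea can be sketched in software as an adder tree: neighbouring pairs are added level by level, and all additions within one level are independent, so hardware could do them at the same time. This is only an illustration with a hypothetical function name, not a description of any particular FPU.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n values by adding neighbouring pairs level by level, the way a
 * hardware adder tree would. Overwrites v, returns the total and stores
 * the number of levels (adder stages) in *levels. */
static long tree_sum(long *v, size_t n, unsigned *levels)
{
    assert(v != NULL && levels != NULL && n > 0);
    *levels = 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++)
            v[i] = v[2 * i] + v[2 * i + 1];  /* independent adds within a level */
        if (n % 2) {                          /* odd element passes through */
            v[half] = v[n - 1];
            half++;
        }
        n = half;
        (*levels)++;
    }
    return v[0];
}
```

For the numbers 1 to 10 this returns 55 after 4 levels instead of 9 sequential additions. Each level still adds delay in real hardware, so whether the whole tree fits in 1 cycle depends on the clock period.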

and more interesting stuff here, and it seems it is there in the F4
https://community.arm.com/cfs-file/__ke ... D00_M7.pdf

in addition, the stm32 F1 and F4 don't seem to be on the same nm process node
https://en.wikipedia.org/wiki/STM32
the F4 runs at higher MHz and cooler (dennard scaling)