ART accelerator

ag123 · Post by **ag123** » Fri Jan 10, 2020 2:43 pm

well, it seems the 'ART Accelerator' cache accounts for a 10-20% speedup vs cache disabled
the other thing that isn't really apparent is whether the whetstone is in double precision
i think double precision calcs aren't just 2x the cost of single precision
they are probably many times more expensive than single precision
so these single precision mflops are still 'believable', in the sense that they are much easier to reach than double precision mflops
nvidia & amd gpus spew Tflops of single precision vector calculations, but in double precision they often can't do better than, say, 50-150 Gflops, and that is about the max that common desktop gpus deliver
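
One concrete reason double precision can cost far more than 2x on this chip: the Cortex-M4 FPU is single-precision only, so `double` arithmetic falls back to software routines. Below is a minimal sketch of the same recurrence in both precisions. The names `kernel_f32` and `kernel_f64` are made up for illustration and are not taken from any Whetstone source. Note the `f` suffixes: in C an unsuffixed literal such as `0.999` is a `double` and would silently promote the whole expression.

```c
#include <assert.h>

/* Hypothetical kernel: n steps of x = x*a + b, kept entirely in float.
 * The 'f' suffixes keep every operation single precision, so a
 * single-precision FPU can execute it in hardware. */
float kernel_f32(int n) {
    assert(n >= 0);
    float x = 0.5f;
    for (int i = 0; i < n; i++) {
        x = x * 0.999f + 0.001f;
    }
    return x;
}

/* Same recurrence in double. On a core whose FPU is single-precision
 * only, each multiply and add here becomes a software library call. */
double kernel_f64(int n) {
    assert(n >= 0);
    double x = 0.5;
    for (int i = 0; i < n; i++) {
        x = x * 0.999 + 0.001;
    }
    return x;
}
```

Timing each kernel over the same `n` on the target (for example with a cycle counter) would show the actual ratio for a given board and set of compiler flags.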

Pito · Post by **Pito** » Fri Jan 10, 2020 3:05 pm

Double is approx. 2x slower than single (both hw and sw).
I think we need to reconcile the Whetstone results with reality.

You CANNOT get 237 MegaFLOPS with a 168 MHz Cortex M4 FPU.

Based on the above web page with Whetstone results, ~240 MegaFLOPS is the result of a Pentium 4 at a 2 GHz clock.
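
The objection can be put as a single ratio: reported MFLOPS divided by clock in MHz gives floating-point operations counted per clock cycle. A small sketch, using only the figures quoted in this thread; the function name `flops_per_cycle` is made up for illustration.

```c
#include <assert.h>

/* Floating-point operations counted per clock cycle.
 * mflops    : reported result in millions of FP operations per second
 * clock_mhz : core clock in MHz (must be positive) */
double flops_per_cycle(double mflops, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return mflops / clock_mhz;
}
```

With the numbers from the thread, `flops_per_cycle(237.0, 168.0)` comes out at about 1.41, while `flops_per_cycle(240.0, 2000.0)` is 0.12. A value above 1.0 means the benchmark is counting more than one floating-point operation per clock, which is exactly the figure being questioned.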

ag123 · Post by **ag123** » Fri Jan 10, 2020 3:34 pm

well for single precision 'whetstone' i think it is still valid; apparently the fpu in the f4xx has 2 fpu units, i.e. it can do 2 fp ops in parallel, independent of each other,
as the 'enablefpu()' code literally enables the CP10 and CP11 coprocessors

// Enable the FPU (Cortex-M4 - STM32F4xx and higher)
// http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/BEHBJHIG.html
void enablefpu() {
	  __asm volatile
	  (
	    "  ldr.w r0, =0xE000ED88    \n"  /* The FPU enable bits are in the CPACR. */
	    "  ldr r1, [r0]             \n"  /* Read CPACR */
	    "  orr r1, r1, #( 0xf << 20 )\n" /* Set bits 20-23 to enable CP10 and CP11 coprocessors */
	    "  str r1, [r0]              \n" /* Write back the modified value to the CPACR */
	    "  dsb                       \n" /* Wait for store to complete */
	    "  isb"                          /* Reset pipeline now the FPU is enabled */
	    : : : "r0", "r1", "memory"       /* Tell the compiler that r0, r1 and memory are modified */
	  );
}
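
The bit arithmetic in that routine can be checked off-target. The sketch below rebuilds the `0xf << 20` mask from the two 2-bit access fields (CP10 in bits 21:20, CP11 in bits 23:22, value 0b11 meaning full access). The macro and function names are made up for illustration. As far as the ARM documentation describes it, both fields gate access to the same FPU and are meant to be set to the same value, so enabling CP10 and CP11 does not by itself imply two separate FPU units.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the address and bit positions are the ones
 * used by enablefpu() above. */
#define CPACR_ADDR        0xE000ED88u
#define CPACR_CP10_FULL   (0x3u << 20)  /* CP10 access field, bits 21:20 */
#define CPACR_CP11_FULL   (0x3u << 22)  /* CP11 access field, bits 23:22 */
#define CPACR_FPU_FULL    (CPACR_CP10_FULL | CPACR_CP11_FULL)

/* The two fields together are exactly the 0xf << 20 mask in the asm. */
static_assert(CPACR_FPU_FULL == (0xFu << 20), "mask must match enablefpu()");

/* Value to write back: the old CPACR with both FPU fields set. */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr) {
    return cpacr | CPACR_FPU_FULL;
}

/* Nonzero only if both fields already grant full access. */
int cpacr_fpu_is_enabled(uint32_t cpacr) {
    return (cpacr & CPACR_FPU_FULL) == CPACR_FPU_FULL;
}
```

On the target the read-modify-write would go through a `volatile uint32_t *` at `CPACR_ADDR`, followed by the same `dsb` and `isb` barriers as in the assembly version.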

Pito · Post by **Pito** » Fri Jan 10, 2020 3:37 pm

Attachment: Whetstone results.PNG (64.85 KiB)

ag123 · Post by **ag123** » Fri Jan 10, 2020 3:39 pm

oops, i read single precision, but maybe our little cortex m4 F4xx runs as fast as a P4

i think these whetstone benchmarks are real after all, but i'm not sure how different they are from the BenchNT ones

updated the 144 MHz results a couple of posts back; not stable at 144 MHz even at 7 wait states
but managed to pull 1 set of results between hangs
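
For anyone repeating the cache on/off and wait-state comparison, here is a rough sketch of how the flash access control value could be composed. The bit positions are as I recall them from the STM32F4 reference manual (LATENCY in the low bits, PRFTEN bit 8, ICEN bit 9, DCEN bit 10) and should be checked against the manual for the exact part. The 3-bit latency limit and all names here are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FLASH_ACR bit layout for an STM32F4; verify per device. */
#define ACR_LATENCY_MAX  7u          /* assumes a 3-bit LATENCY field */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Compose a FLASH_ACR value from a wait-state count and a flag that
 * turns the ART accelerator features (prefetch + caches) on or off. */
uint32_t flash_acr_value(uint32_t wait_states, int art_on) {
    assert(wait_states <= ACR_LATENCY_MAX);
    uint32_t acr = wait_states;
    if (art_on) {
        acr |= ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    }
    return acr;
}
```

Running the same benchmark once with `flash_acr_value(ws, 1)` and once with `flash_acr_value(ws, 0)` written to the register is one way to reproduce the cache-enabled vs cache-disabled comparison.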

ag123 · Post by **ag123** » Sun Jan 19, 2020 7:17 pm

there is an update here
viewtopic.php?p=725#p725
our little cortex m4 has VFP
http://infocenter.arm.com/help/topic/co ... dejjh.html

1.5.9. Vector Floating-Point (VFP)

The VFP coprocessor supports floating point arithmetic operations and is a functional block within the ARM1176JZF-S processor. The VFP coprocessor is mapped as coprocessor numbers 10 and 11. Software can determine whether the VFP is present by the use of the Coprocessor Access Control Register. See c1, Coprocessor Access Control Register for more details.

-O3 + a special VFP lib probably vectorised some fp instructions, hence the speed.
so that NTBench whetstone probably isn't a 1-1 comparison here. if features like SSE were used on the P4, it would likely be much faster.
maybe back then gcc didn't have a -O3 that could auto-vectorise c/c++ code

but then it is the ARM11 VFP and it is quite impressive: fp ops run at 1 fp op per cycle, which is why we see these mflops.
since then intel has caught up and these days intel chips basically also execute fp in 1 cycle. i think with vector fp it is much more than 1 fp op per cycle
and intel is much more extreme: intel probably does 64-bit floating point in 1 cycle, and the number of transistors involved is probably huge.
one way to do away with loops: rather than, say, taking 10 clocks to add 10 numbers, it is possible to build adders that add all 10 numbers in 1 cycle, with everything replaced by hardware

and more interesting stuff here, and it seems it is there in the F4
https://community.arm.com/cfs-file/__ke ... D00_M7.pdf

in addition, the stm32 F1 and F4 don't seem to be on the same nm node
https://en.wikipedia.org/wiki/STM32
the F4 runs faster, at higher MHz, and cooler (dennard scaling)
