Execute from external SRAM

Pito
Posts: 1529
Joined: Sat Mar 26, 2016 3:26 pm
Location: Rapa Nui

Post by Pito » Thu Dec 22, 2016 10:58 am

Most probably not applicable for MM or BP boards, but with larger ZET6 boards it could be an interesting exercise :)
I'm looking for a simple way to tell the compiler/linker to place code at an address in external SRAM and run it from there.
For example, I've got 512kB free from 0x68000000 (on the 103ZET6), plus 64kB internally.
Execution from the external SRAM will be slower, but the size creates interesting opportunities.
The internal SRAM could still be used for vectors/stack and fast buffers.
One option would be to place only specific functions into the external SRAM.
If our core and C experts know how and can provide a hint, I am ready to elaborate.
Pukao Hats Cleaning Services Ltd.

Post by Pito » Thu Dec 22, 2016 11:14 am

For example I found at https://ez.analog.com/thread/10607
Running code from SRAM in GNU GCC..
First of all, you need the -mlong-calls switch specified to the compiler. This is needed because calling functions in flash from SRAM, or vice versa, requires a branch instruction that can branch "further away". Then you need the following in an accessible header file or at the top of a C file that will use it:

#define RAMFUNC __attribute__ ((long_call, section (".ramfunctions")))

This is a macro for a function attribute which you can apply to a function that you want in SRAM. It really just places the function in a section called ".ramfunctions", whereas by default it would be placed in a section called ".text". (We need to update the linker script in a later step to tell the linker what to do with .ramfunctions.) You can apply this to specific functions as shown below. (You have to apply it to the prototype as well.)

RAMFUNC void MyRAMFunc(unsigned uiNumsamples) ;

RAMFUNC void MyRAMFunc(unsigned uiNumsamples) {
// Function Body
}

That works for C functions. For assembler functions you need to put something like this at the top (again putting the code into an appropriately named section).

.section ".ramfunctions"
.align 8
.global PutOneChar
.thumb
.thumb_func
.type PutOneChar, %function

Then in the linker script file (*.ld) you need something like the line marked with the !!!! comment below, inserted to tell the linker to put the functions marked as .ramfunctions into SRAM.

.data : AT (_etext)
{
    _data = .;
    *(vtable vtable.*)
    *(.data .data.*)
    *(.gnu.linkonce.d*)
    . = ALIGN(4);
    *(.ramfunctions) /* !!!! Placing functions in .ramfunctions section in RAM */
    _edata = . ;
} > SRAM

This will store the function in flash to begin with, then copy it automatically into SRAM before it gets executed (almost identical to how initialised variables are handled). Just place RAMFUNC on any function that you want in RAM. Bear in mind that the attribute should be applied to any function that this function calls as well.
Last edited by Pito on Thu Dec 22, 2016 3:03 pm, edited 2 times in total.

stevestrong
Posts: 1609
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany

Post by stevestrong » Thu Dec 22, 2016 12:38 pm

Interesting. I used a similar technique on an 8051-based platform a decade ago, but only because running from SRAM was much faster than running from flash, which seems not to be the case here.
But where should the code come from? From flash? If it already fits in flash, then you could just run it from there, and it would be faster, too.
On the other hand, and more meaningfully, it would be nice to load code selectively from an SD card and let it run from SRAM. But then you would first have to get the code parts onto the SD card somehow...

Post by Pito » Thu Dec 22, 2016 3:01 pm

Yea, the final solution would be to load the chunks from an SDcard for example. But that is a long way to go..
So as the first step we have to get text+data+bss into the external sram.

Rick Kimball
Posts: 1014
Joined: Tue Apr 28, 2015 1:26 am
Location: Eastern NC, US

Post by Rick Kimball » Thu Dec 22, 2016 5:07 pm

I've used this before with internal RAM on an msp432, where the flash has 4 wait states when running at 48MHz. The chip has an abundance of internal RAM (64k). In this use case it is perfect for time-critical code, which will run at zero wait states. If you don't use the -mlong-calls command line option or pragmas, the linker will generate "veneers", basically local anonymous functions that do a ldr pc,[pc], with the far address as a word constant.

Code:

200000b8 <___ZN7print_tI16serial_default_tILm9600ELm72000000E8GPIO_PINILm1ELm9EES1_ILm0ELm65535EEEE5_putsEPKh.isra.4_veneer>:
200000b8:       f85f f000       ldr.w   pc, [pc]        ; 200000bc <___ZN7print_tI16serial_default_tILm9600ELm72000000E8GPIO_PINILm1ELm9EES1_ILm0ELm65535EEEE5_putsEPKh.isra.4_veneer+0x4>
200000bc:       08000277        .word   0x08000277

This stuff works best with code that doesn't call other functions in flash. Otherwise it is bouncing back and forth doing long-style address branches. Calling flash routines from flash normally allows the use of short 2-byte b.n label branches. However, those have a limited range that won't allow you to jump very far.

.ramfunctions might be good for interrupt handlers that run without wait states (assuming you are using zero-wait-state RAM). The CPU takes care of the long calls by default without you doing anything. However, as with all things, there are trade-offs. Running from RAM might not be as fast as you think, depending on how you have the flash code cache configured. In addition, when you run code from RAM I think the instruction fetches compete with normal data variable accesses on the data bus. Normally instruction bus requests are handled in parallel with data bus requests. With the msp432 there was a separate ICODE bus that could access RAM, ROM and flash in parallel; I'm not sure if the stm32 works that way.

Code:

  /*
   * vector table systick handler must have c name bindings
   */
  __attribute__ ((section(".ramfunc")))
  void SysTick_Handler(void) {
    ++tickcount;
20000004:       f240 0314       movw    r3, #20
20000008:       f2c2 0300       movt    r3, #8192       ; 0x2000
2000000c:       681a            ldr     r2, [r3, #0]
2000000e:       3201            adds    r2, #1
20000010:       601a            str     r2, [r3, #0]
20000012:       4770            bx      lr
-rick
Last edited by Rick Kimball on Fri Dec 23, 2016 6:40 pm, edited 1 time in total.

Post by Pito » Thu Dec 22, 2016 6:20 pm

Execution from an external SRAM could be 5-6x slower than from flash (or less, since flash on the 103 runs with 2 wait states and there is no ART accelerator, btw).
A read/write with a 10ns SRAM could take 6/4 cycles on the 103ZET6, or something like that.
http://www.st.com/content/ccc/resource/ ... 200423.pdf
But the speed is not the critical factor here. The chance to run off the external sram is the main motivation.

Post by Pito » Wed Dec 28, 2016 9:09 pm

As a quick speed comparison experiment I generated 30000 random uint16 numbers and sorted by Bubblesort.
So far only the array has been placed in the external SRAM (10ns 256kx16). The array is accessed via pointers, not allocated on the heap.
Internal SRAM:

Code:

Generating 30000 16bit uints:
BubbleSorting 30000 16bit uints:
Elapsed: 147182682  usecs
Sorted last 100 in ascending order:
29900 65270
29901 65271
29902 65274
29903 65281
29904 65282
29905 65287
29906 65288
..
29994 65516
29995 65518
29996 65519
29997 65523
29998 65528
29999 65533

External SRAM:

Code:

Generating 30000 16bit uints:
BubbleSorting 30000 16bit uints:
Elapsed: 275602656  usecs
Sorted last 100 in ascending order:
29900 65270
29901 65271
29902 65274
29903 65281
29904 65282
29905 65287
29906 65288
..
29994 65516
29995 65518
29996 65519
29997 65523
29998 65528
29999 65533

The internal SRAM is 275 s / 147 s = 1.87x faster than the external FSMC one when accessed from a sketch running in flash.
Not bad.. :)

I've tried __attribute__((at 0x..)), but the compiler ignores it with a -Wattributes warning (GCC does not implement the armcc-style "at" attribute).
As for .ramfunctions, I do not know how to organize that in the .ld file yet (we need 2 different SRAM segments defined in the .ld).
Also, the FSMC needs to be initialized before it can be used. So it seems the MCU has to be started with the standard
internal SRAM, and then the "modules" to be executed have to be loaded into the external SRAM (and compiled outside the sketch).
Last edited by Pito on Sat Dec 31, 2016 7:05 pm, edited 1 time in total.

Post by Pito » Thu Dec 29, 2016 12:34 am

Fyi - to compare 8/16/32 bit access via FSMC (array accessed via pointers, not in Heap):

Code:

BubbleSorting 30000 8bit uints:
Elapsed: 241123281  usecs

BubbleSorting 30000 16bit uints:
Elapsed: 275602656  usecs

BubbleSorting 30000 32bit uints:
Elapsed: 323401444  usecs
Last edited by Pito on Sat Dec 31, 2016 7:01 pm, edited 1 time in total.

Post by Pito » Sat Dec 31, 2016 6:37 pm

In the meantime I've got a 512kB large Heap working inside the external SRAM (via FSMC).
Here are some results from running a standard Bubblesort on a set of 3000 generated random uints, for various uint sizes.
[Attachment: Bubblesort_1.JPG]
The BS loops for reference (EXRAM8, EXRAM16, EXRAM32, EXRAM64, swap always of EXRAM type, n=3000):

Code:

  for (c = 0 ; c < ( n - 1 ); c++)
  {
    for (d = 0 ; d < n - c - 1; d++)
    {
      if (EXRAM32[d] > EXRAM32[d+1]) /* For increasing order  */
      {
        swap32       = EXRAM32[d];
        EXRAM32[d]   = EXRAM32[d+1];
        EXRAM32[d+1] = swap32;
      }
    }
  }

victor_pv
Posts: 1652
Joined: Mon Apr 27, 2015 12:12 pm

Post by victor_pv » Mon Jan 02, 2017 3:30 am

Pito, the old leaflabs bootloader had an option to upload a sketch to RAM and run it from there.
The way it worked was by having a separate linker script that uses whatever address you want to run from (I don't remember where the FSMC sits in the memory map, but whatever address that is).
When using that linker script, nothing was compiled to use flash addresses; everything went to RAM addresses, so you didn't have to do anything different in your code. It was all done by the linker for every piece of code, constants, etc. Obviously someone had to copy all that to RAM before running, but that was done by the original bootloader with a certain upload option.

Then, when uploading the sketch to the Maple board, the bootloader would be the one to copy that code to RAM, starting at the address the linker was set to, and finally call the entry point.

The bootloader was already on the board, the linker doesn't even know about it, and the generated binary does not include any routine to copy the program from flash to RAM. It links exactly the same as if it was going to run from flash, but using RAM addresses. You will understand it better if you download that linker script and check it out.

I did some tests myself on running from normal RAM, but with a custom bootloader that would first write the sketch to flash, then copy it to RAM and run it. If the board was rebooted, it would copy the sketch to RAM again and run it, so I didn't have to upload every single time, as was the case with the leaflabs run-from-RAM option. I did that while doing some speed tests running from RAM, because I wanted to avoid having to upload after every reset.

I would do the same for your case, except that given the size, I agree it is better to use an SD card to store the code. So you could:
1.- Write a sketch that runs from flash, initializes the FSMC, then accesses an SD card, reads whatever file you want, copies it to RAM, and calls the entry point. This should be easy to do.
2.- Create a new board variant that uses a modified linker script which links the executable code to the bottom of the FSMC address range. The generated bin file can be copied to an SD card, then loaded and executed by the program described in point 1.
    You can also modify that same linker script to place the stack at the top of the external RAM, since the FSMC is already initialized by the time that code is called, but it may be better to keep the stack in internal memory for speed.
3.- Make sure the code generated in point 2 does not somehow disable the FSMC, change the function of a pin used by the FSMC, or anything like that.

This was the maple linker script for RAM, called ram.ld:

Code:

/*
 * libmaple linker script for RAM builds.
 *
 * A Flash build puts .text, .rodata, and .data/.bss/heap (of course)
 * in SRAM, but offsets the sections by enough space to store the
 * Maple bootloader, which uses low memory.
 */

/*
 * This pulls in the appropriate MEMORY declaration from the right
 * subdirectory of stm32/mem/ (the environment must call ld with the
 * right include directory flags to make this happen). Boards can also
 * use this file to use any of libmaple's memory-related hooks (like
 * where the heap should live).
 */
INCLUDE mem-ram.inc

/* Provide memory region aliases for common.inc */
REGION_ALIAS("REGION_TEXT", ram);
REGION_ALIAS("REGION_DATA", ram);
REGION_ALIAS("REGION_BSS", ram);
REGION_ALIAS("REGION_RODATA", ram);

/* Let common.inc handle the real work. */
INCLUDE common.inc

This is mem-ram.inc, which is included by the script above and defines the bottom address and size of RAM:

Code:

MEMORY
{
  ram (rwx) : ORIGIN = 0x20000C00, LENGTH = 17K
  rom (rx)  : ORIGIN = 0x08005000, LENGTH = 0K
}

Finally there is a common.inc file, which contains all the real linker stuff and is included by ram.ld. When the linker gets to it, it places all the code in the region REGION_TEXT, which as shown above is defined as an alias to ram, starting at address 0x20000C00. The first 3KB were reserved for the bootloader, which needs that space to run while it copies the code to RAM and calls it (it would not be good if it started copying data over its own stack, heap, etc.).
The common.inc file probably has not changed since then. ram.ld and mem-ram.inc may not be in the core any more, but if you download an older version, they should be there.

REGION_ALIAS("REGION_TEXT", ram); => This is the executable code, you could create a new alias, say "fsmc"
REGION_ALIAS("REGION_DATA", ram); => variables, you could move it to fsmc, or keep it in normal ram
REGION_ALIAS("REGION_BSS", ram); => this I don't remember right now... :oops:
REGION_ALIAS("REGION_RODATA", ram); => read only data (constants), again may be better in normal RAM unless it is too big.

The bin file generated with that linker script will already include a piece of code that copies any initial values for the RODATA and DATA regions to those addresses. Since those normally live in RAM, someone needs to initialize them from flash, and that is done by a small piece of startup code that gets linked in (the linker script provides the addresses it uses). So when main is called, all variables and constants have their initial values. So let's say you keep REGION_DATA and RODATA in normal RAM: your program then doesn't need to worry about reading several files for several regions of RAM or anything like that. Just read the bin file from the SD card, copy it to RAM at the correct starting address, and call the entry point. The linker has already placed routines that will copy any initial data needed to the DATA and RODATA regions, set the stack address, and call main.
So the bin file is copied all at once to sequential addresses. Your program in flash does not need to care about copying some parts to internal RAM for constants and others to FSMC for code. It just needs to dump the content of the bin file straight to FSMC RAM, starting at the address matching the linker script's REGION_TEXT, and call it. There is code in there already to spread out the pieces that would normally be in RAM and need to be initialized.

I hope this all makes sense. I edited it 3 times and still think it is confusing...
