WS2812B (Neopixel) library has been added to the F1 core

RogerClark
Posts: 7481
Joined: Mon Apr 27, 2015 10:36 am
Location: Melbourne, Australia

Post by RogerClark » Mon Jun 12, 2017 7:26 am

I've written a WS2812B (Neopixel) library which uses SPI DMA, see viewtopic.php?f=13&t=2179

As it's closely coupled to the LibMaple SPI DMA functions, including the new SPI.dmaSendAsync, I have decided to add it to the core libraries. In the longer term I will probably remove the separate repo I initially created just for the library.

Post by RogerClark » Mon Jun 12, 2017 11:07 am

Just to get some metrics, I did some approximate timings by running effects for a million cycles.

So, for a 10m, 300 LED strip:

The RainbowCycle effect would take around 600uS to set up using the bit-banged system, and this time could not be recovered.
The SPI version takes around 990uS to set up the same thing, because the time to set a pixel in the buffer is 1.8uS instead of 0.99uS.

Transmission time is always 1.25uS per pixel, so a 300 LED strip would take 375uS to send.

So I think it's a dead heat between these two methods, even for effects where the whole strip has to be updated.


Perhaps there is a way to optimise the setPixelColor function.

This has to index into the lookup table for each colour channel RGB, and then copy 3 bytes from the LUT (in flash) to the data in RAM.

Code:

void WS2812B::setPixelColor(uint16_t n, uint8_t r, uint8_t g, uint8_t b)
{
  // Each pixel occupies 9 encoded bytes: (n<<3) + n is n*9, and +1 skips the first byte of the buffer
  uint8_t *bptr = pixels + (n<<3) + n + 1;

  // Green goes first; each LUT entry is 3 bytes, so index 3 x g into the lookup
  uint8_t *tPtr = (uint8_t *)encoderLookup + g*2 + g;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;

  // Red
  tPtr = (uint8_t *)encoderLookup + r*2 + r;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;

  // Blue
  tPtr = (uint8_t *)encoderLookup + b*2 + b;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;
  *bptr++ = *tPtr++;
}

As each encoded channel is 24 rather than 32 bits, I have to do this as 3 single-byte transfers.
I'm already using x*2+x to do the multiply by 3, to try to speed things up.
And I'm just incrementing and copying data from pointer to pointer.
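
To spell out the index arithmetic in setPixelColor, here is a minimal sketch. The helper names pixelOffset and lutOffset are hypothetical (they are not in the library), and reading the "+1" as one skipped leading byte is taken from the code above, not from any documentation.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of pixel n in the SPI buffer: 9 encoded bytes per LED,
// plus the one leading byte that the original code skips with "+1".
// (Hypothetical helper, for illustration only.)
constexpr std::size_t pixelOffset(uint16_t n) {
    return (static_cast<std::size_t>(n) << 3) + n + 1;   // 8n + n + 1 == 9n + 1
}

// Byte offset of channel value v in the lookup table: 3 encoded bytes per value.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t lutOffset(uint8_t v) {
    return static_cast<std::size_t>(v) * 2 + v;          // 2v + v == 3v
}

// The shift-and-add forms agree with the plain multiplications.
static_assert(pixelOffset(0) == 1, "first pixel starts after the leading byte");
static_assert(pixelOffset(299) == 299 * 9 + 1, "last pixel of a 300 LED strip");
static_assert(lutOffset(255) == 765, "last entry of a 768 byte table");
```

An optimising compiler will normally choose between shift-and-add and a multiply instruction by itself, so writing n*9 and v*3 directly is likely to generate the same code. That is worth checking in the disassembly rather than assuming.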

But perhaps this doesn't result in very concise assembler code, possibly because these are single-byte reads.

Actually, I could probably transfer them as 32 bits but overlap them, but the locations would not be 32-bit or even 16-bit aligned, so perhaps that would not help.

If anyone can think of a way to optimise this, please let me know.

Post by RogerClark » Mon Jun 12, 2017 11:50 am

Just to partially answer my own question...

I have yet to test this, but currently the LUT is in Flash, as it's declared as const.
But reads from Flash incur wait states. (I can't remember the value, but I think it's 2 wait states.)
So putting the 768 byte LUT in RAM will save 18 processor cycles, I think.

But I think that's only 125nS.

I did also try doing some 32-bit transfers, and this did appear to speed things up as well, by around 300nS.
But I did not have time today to confirm whether using 32-bit transfers actually resulted in valid data being copied.

I know that @racemaniac uses 4 encoded bits per pixel bit, hence his transfers are always 32 bits, but this lengthens the transmission time by 1/3, which I think gives a longer overall time to build and send a new pattern to the LEDs.

Ultimately, it's probably easier to just run the code on an F4 at 168MHz if it is running too slow on a 72MHz F103!

stevestrong
Posts: 1824
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany

Post by stevestrong » Mon Jun 12, 2017 11:52 am

I would try the following:
- LUT of 32 bit values storing data in format [ r1 | r2 | r3 | x ], [ g1 | g2 | g3 | x ], [ b1 | b2 | b3 | x ]

- prepare an array uint32_t temp[3], wherein the goal is to have
temp[0] = [ r1 | r2 | r3 | g1 ];
temp[1] = [ g2 | g3 | b1 | b2 ];
temp[2] = [ b3 | x | x | x ];
This should involve only shift and AND / OR instructions on the (32 bit) LUT values.
Some kind of bit packing like this: https://community.arm.com/processors/f/ ... 3x-32-bits

- memcopy byte-wise from temp[] as source to destination, only the first 9 bytes.
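
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: lutEntry and packPixel are hypothetical names, the 32-bit entry layout (first byte in bits 0..7, unused byte in bits 24..31) is my assumption, and the byte order produced by memcpy relies on a little-endian core such as the Cortex-M3.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 32-bit LUT entry (layout is an assumption): the three encoded
// bytes sit in bits 0..23 so that, on a little-endian core, b1 is the first
// byte in memory. Bits 24..31 are the unused "x" byte.
inline uint32_t lutEntry(uint8_t b1, uint8_t b2, uint8_t b3) {
    return uint32_t(b1) | (uint32_t(b2) << 8) | (uint32_t(b3) << 16);
}

// Pack three LUT entries into 9 consecutive bytes at dst, in the order
// first, second, third, using only shifts, ANDs and ORs.
inline void packPixel(uint8_t *dst, uint32_t first, uint32_t second, uint32_t third) {
    uint32_t temp[3];
    temp[0] = (first & 0x00FFFFFFu) | (second << 24);          // f1 f2 f3 s1
    temp[1] = ((second >> 8) & 0x0000FFFFu) | (third << 16);   // s2 s3 t1 t2
    temp[2] = (third >> 16) & 0x000000FFu;                     // t3 x  x  x
    std::memcpy(dst, temp, 9);   // copy only the 9 meaningful bytes
}
```

Whether this beats nine single-byte copies would need measuring, since the extra shifts and masks may cost as much as the byte accesses they replace.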
Last edited by stevestrong on Mon Jun 12, 2017 12:08 pm, edited 1 time in total.

racemaniac
Posts: 622
Joined: Sat Nov 07, 2015 9:09 am

Post by racemaniac » Mon Jun 12, 2017 11:57 am

Really wondering what will come of this :).
I indeed went for 4 bits per signal (with the LUT stored in RAM) to optimize this as much as possible. As I don't use it blocking, optimizing the blocking part at the cost of a slightly slower non-blocking part doesn't bother me. If, however, we find an efficient way of doing 3 bits per signal, I'd use that too :).

Post by RogerClark » Mon Jun 12, 2017 12:17 pm

Steve,

BTW, I originally used a function to generate the 24 bits, by shifting <<3 and OR'ing in either 0b100 or 0b110, but I presumed it was slower than using a LUT.

But I should do some timing tests, just in case it's faster to do 8 shifts and 8 ORs.
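
For reference, the shift-and-OR encoding described here might look like the sketch below. The names encodeByte and fillLookup are hypothetical, and storing the most significant encoded byte first in the table is my assumption about the SPI byte order, not something confirmed by the library source.

```cpp
#include <cstdint>

// Expand one 8-bit channel value into 24 SPI bits, most significant bit first.
// Each input bit becomes the triplet 110 (one) or 100 (zero).
inline uint32_t encodeByte(uint8_t value) {
    uint32_t encoded = 0;
    for (int bit = 7; bit >= 0; --bit) {
        encoded <<= 3;
        encoded |= ((value >> bit) & 1u) ? 0x6u : 0x4u;   // 0b110 or 0b100
    }
    return encoded;   // only the low 24 bits are used
}

// Build a 768 byte lookup table from the same function:
// 3 bytes per value, first transmitted byte first (assumed order).
inline void fillLookup(uint8_t (&lut)[256 * 3]) {
    for (int v = 0; v < 256; ++v) {
        const uint32_t e = encodeByte(static_cast<uint8_t>(v));
        lut[v * 3 + 0] = static_cast<uint8_t>(e >> 16);
        lut[v * 3 + 1] = static_cast<uint8_t>(e >> 8);
        lut[v * 3 + 2] = static_cast<uint8_t>(e);
    }
}
```

Generating the table at start-up with fillLookup is also one way to get the LUT into RAM without keeping a const copy in Flash.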

Post by stevestrong » Mon Jun 12, 2017 12:27 pm

oh, and don't forget to set flash wait states = 1 ;)

Post by RogerClark » Mon Jun 12, 2017 10:15 pm

stevestrong wrote:oh, and don't forget to set flash wait states = 1 ;)
I think the "law of diminishing returns" is now starting to apply ;-)

I may try writing a better version of the bit-banged approach, to try to keep the USB running, but I have no immediate need to use the LED strip at the moment, and have some other things I really need to spend my time on ;-)

Post by RogerClark » Tue Jun 13, 2017 12:51 am

I moved the LUT to RAM and it speeded up the setPixelColor function from 1.64uS to 1.22uS

The bit-banged version takes 0.95uS so the LUT version is now about 28% slower, which is not too bad.

PS. I initially screwed up adding the WS2812B folder to the repo, because I copied it from the other external repo I made for it, and it had a .git folder in it, which I'd not noticed. This was really confusing Git, as it thought the folder was some sort of "submodule".

I tried simply deleting the .git folder but that didn't help, so I had to remove the commit and then add the folder again.

Post by RogerClark » Tue Jun 13, 2017 10:38 pm

For anyone interested in the internal workings of the WS2812B with reference to pulse timing, I found this very interesting post

https://cpldcpu.com/2014/01/14/light_ws ... he-ws2812/

I was thinking of doing these sorts of timing tests myself, but as it's been done already, there is no need to do it all again.

The main thing I was wondering was the minimum Reset time, which in practice seems to be somewhere between 6 and 8uS; far below the spec value of 50uS

From reading some other experimental blogs about the WS2812B, I was also wondering whether all that mattered when sending a 1 or 0 bit was the length of time that the input is held at logic high, and whether a short pulse of 100nS followed by logic low for 100nS would perhaps be OK.

However, it seems from the analysis in that blog that the minimum overall period for each bit needs to be approximately 1100nS or longer.

So it would not be possible to achieve a significant speed increase by shortening the overall period of each pixel bit.

One other thing that strikes me, but which I don't know how to take advantage of:

Using the SPI bit triplets of 100 or 110 to send a zero or a one, only the middle bit changes.

The first bit is always 1 and the last bit is always zero.

I read another blog post about a method to DMA multiple strings of LEDs, by DMAing to GPIO using the BSRR reg.
They used 3 DMA channels:

The first channel runs at a period of about 1.2uS and drives the GPIO pin high.
The second channel seems to be triggered by the first channel, and also controlled by data in a buffer, which specifies whether it sets or resets the GPIO, and runs at a speed of about 450ns
The third channel runs at a period of about 850nS and is triggered off the first channel and always drives the GPIO low.

I have no idea how to configure the DMA to do this; perhaps it's done in code, or with timers to trigger the DMA.

It seems very complex, and consumes a lot of RAM for the GPIO control, as each entry would need 32 bits for the BSRR reg.
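
To give a feel for the RAM cost, here is a host-side sketch of the buffer that the data-dependent middle channel would read from. It assumes one 32-bit word per transmitted bit, written to the STM32 GPIO bit set/reset register (BSRR), where the low 16 bits set pins and the high 16 bits reset them. The function names are hypothetical, and std::vector is used only to keep the sketch short; on the F103 this would be a static array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// BSRR semantics: writing 1 to bit p sets pin p,
// writing 1 to bit (p + 16) resets pin p.
inline uint32_t bsrrSet(unsigned pin)   { return 1u << pin; }
inline uint32_t bsrrReset(unsigned pin) { return 1u << (pin + 16); }

// Build the words for the middle DMA channel (hypothetical helper):
// keep the pin high for a one bit, pull it low early for a zero bit.
// Bytes are taken in transmission order, most significant bit first.
inline std::vector<uint32_t> buildMiddleWords(const uint8_t *data,
                                              std::size_t bytes,
                                              unsigned pin) {
    std::vector<uint32_t> words;
    words.reserve(bytes * 8);
    for (std::size_t i = 0; i < bytes; ++i) {
        for (int bit = 7; bit >= 0; --bit) {
            const bool one = ((data[i] >> bit) & 1u) != 0;
            words.push_back(one ? bsrrSet(pin) : bsrrReset(pin));
        }
    }
    return words;
}
```

With this layout a 300 LED strip needs 300 x 24 = 7200 words, i.e. 28,800 bytes, compared with 2,700 bytes for the 9 bytes per pixel SPI buffer. The one thing it buys is that a single word can carry bits for several pins on the same port, which is presumably how the blog post drives multiple strings at once.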

But it is an interesting concept, and perhaps can be equally well done in code.

If the processor was faster it could possibly be done using an ISR, but I think the call overhead etc. is too high for this to work on the F103 @ 72 MHz.
