Using code from @devan, plus some additions of my own, I've reduced the time the bootloader takes between the upload finishing and the sketch running.
The time saving comes from using a BKP register, which survives an MCU reset, to store whether the MCU has just handled a DFU upload.
If it has, then when the bootloader is reset by DFU-Util on the PC, the code no longer needs to go through the normal startup process of flashing the LED and waiting for a timeout; instead it goes straight to the code that checks for a valid sketch and jumps to its address.
Initially I tried to just jump to the sketch address when the bootloader got a DFU Reset, but this seemed to prevent the USB in the sketch from working.
Using a modified version of @devan's code meant that I didn't need to worry about cleaning up all the registers and hardware etc. that the bootloader uses; instead the MCU can simply be reset as normal.
Note: It may be possible to clean up the MCU state and jump straight to the user code, but I doubt it's going to be noticeably faster, as the code now doesn't do much apart from initialise before going straight to the code that jumps to the sketch.
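To make the flag-and-reset idea concrete, here is a minimal sketch that can be compiled on a PC. It is not the actual bootloader code: the backup register is modelled as a plain variable, the magic value and the function names are made up for illustration, and clearing the flag after reading it is my assumption about how a later cold boot gets back to the normal startup path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical magic value; the constant in the real bootloader may differ. */
#define MAGIC_UPLOAD_DONE 0x424Cu

/* Stand-in for the BKP data register so the sketch runs on a PC.
   On the MCU this would be a backup register that keeps its
   contents across a reset. */
uint16_t fake_bkp_reg = 0;

/* Called when a DFU upload has finished, just before the MCU is reset. */
void mark_upload_done(void)
{
    fake_bkp_reg = MAGIC_UPLOAD_DONE;
}

/* Called early at boot. Returns true if the previous run ended with a
   DFU upload, and clears the flag so that the next boot is a normal one
   (the clearing step is an assumption, not taken from the real code). */
bool consume_upload_flag(void)
{
    if (fake_bkp_reg == MAGIC_UPLOAD_DONE) {
        fake_bkp_reg = 0;
        return true;
    }
    return false;
}
```

If consume_upload_flag() returns true, the bootloader would skip the LED flashing and the DFU wait and go straight to the valid-sketch check.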
In addition to the speed up, the code also checks for another magic number in BKP_10, which locks it in perpetual bootloader mode.
However, I'll need to update the core to take advantage of this. To improve cold boot speed, I'd then also need to reduce the timeouts in the bootloader, which the change to the core makes possible.
i.e. we could set much lower DFU wait timeouts for cold boot, because for sketch uploads the value in BKP_10 would lock the bootloader in perpetual bootloader mode (waiting for a DFU upload), so it wouldn't matter what the cold boot timeout was.
However, we can't make the cold boot timeout really short, except on boards like the Maple mini which have a button to put the bootloader into "Perpetual" mode. Otherwise, if the sketch has crashed, it would be very hard to reset the board at the right time for DFU to work (as there would only be a narrow timeout window).
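As a rough illustration of how the two magic numbers and the cold boot timeout could fit together, here is a small sketch of the decision. All the constants and the function name are hypothetical (the real magic values and timeouts are not shown here), and it only models the choice of wait time, not the hardware access.

```c
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define MAGIC_UPLOAD_DONE  0x424Cu      /* set after a DFU upload          */
#define MAGIC_STAY_IN_DFU  0x424Du      /* set by the core before an upload */
#define COLD_BOOT_WAIT_MS  500u         /* example short cold boot timeout  */
#define WAIT_FOREVER       0xFFFFFFFFu  /* perpetual bootloader mode        */

/* Given the value read from BKP_10 at boot, return how long the
   bootloader should wait for DFU before trying to start the sketch. */
uint32_t dfu_wait_ms(uint16_t bkp10)
{
    if (bkp10 == MAGIC_UPLOAD_DONE) {
        return 0u;            /* upload just finished: go to the sketch */
    }
    if (bkp10 == MAGIC_STAY_IN_DFU) {
        return WAIT_FOREVER;  /* locked in perpetual bootloader mode */
    }
    return COLD_BOOT_WAIT_MS; /* any other value: normal cold boot */
}
```

With this arrangement the cold boot timeout only matters when neither magic number is present, which is why it could be shortened on boards that have another way into perpetual mode.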
One other thought, which may partially get around this problem, is that most generic boards have jumpers on both Boot0 and Boot1, but Boot1 has no effect if Boot0 is LOW.
So we could use Boot1 as a switch to tell the bootloader that it needs to enter perpetual bootloader mode (continuously wait for DFU). This would be an easy change to the bootloader code, as we'd need to set PB2 as the Button on generic boards.
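For what it's worth, the Boot1 check itself would be tiny. The sketch below takes the value of the port B input data register as a parameter so it can be tried on a PC. The function name is made up, and the polarity (jumper HIGH means "stay in the bootloader") is an assumption that would need checking against the actual boards.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boot1 is on PB2, i.e. bit 2 of the port B input data register. */
#define BOOT1_PIN 2u

/* Returns true if the Boot1 jumper asks for perpetual bootloader mode.
   Assumes HIGH means "stay in the bootloader"; swap the test if the
   board wiring turns out to be the other way round. */
bool boot1_requests_perpetual(uint32_t gpiob_idr)
{
    return ((gpiob_idr >> BOOT1_PIN) & 1u) != 0u;
}
```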