I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.
I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.