This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Cortex M0/M0+/M1 32-bit x 32-bit --->64-bit signed multiply

I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.

lsrs r3,r0,#16 //Factor0 hi [16:31]
uxth r0,r0 //Factor0 lo [0:15]

uxth r2,r1 //Factor1 lo [0:15]
lsrs r1,r1,#16 //Factor1 hi [16:31]

mov r4,r0 //Factor0 lo * Factor1 lo
muls r4,r2 //

muls r0,r1 //Factor0 lo * Factor1 hi

muls r2,r3 //Factor1 lo * Factor0 hi

muls r1,r3 //Factor1 hi * Factor0 hi

adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)

mov r2,#0 //
adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
lsls r2,r2,#16 //

lsrs r3,r0,#16 //Extract partial result [bits 16-31]

lsls r0,r0,#16 //Extract partial result [bits 0-15]

adds r0,r4 //Result [bits 0-31] + C
adcs r2,r3 //Partial [bits 16-47]
adds r1,r2 //Results [bit 32-63]

As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:

MOV Rd, #<immediate>
LSLS Rd,Rd

and

MOV Rd,#<immediate>
RORS Rd,Rd

So if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?

My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.

Many thanks for your time,
Sean

PS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.

Top replies

Sean Dunlevy over 6 years ago in reply to 42Bastian Schick +1

Your 32-bit x 32-bit -->64 bit is an interesting example of how optimized Thumb is significantly different from other scaler-CPUs. Realizing that the MULS is just 1 cycle is important (unless you have...

Parents

0 Sean Dunlevy over 3 years ago in reply to Sean Dunlevy

When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.

I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.
Cancel
Up 0 Down

Cancel

Reply

0 Sean Dunlevy over 3 years ago in reply to Sean Dunlevy

When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.

I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.
Cancel
Up 0 Down

Cancel

Children

No data