I managed to produce a 32-bit x 32-bit --> 64-bit code fragment that takes 18 cycles to complete (it is in-line).
oldmulshift32:
    lsrs r3,r0,#16  // Factor0 hi [16:31]
    uxth r0,r0      // Factor0 lo [0:15]
    uxth r2,r1      // Factor1 lo [0:15]
    lsrs r1,r1,#16  // Factor1 hi [16:31]
    movs r4,r0      // Copy Factor0 lo [0:15]
    muls r4,r2      // Factor0 lo * Factor1 lo
    muls r0,r1      // Factor0 lo * Factor1 hi
    muls r2,r3      // Factor1 lo * Factor0 hi
    muls r1,r3      // Factor1 hi * Factor0 hi
    adds r0,r2      // (Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
    movs r2,#0      //
    adcs r2,r2      // C --> bit 0
    lsls r2,r2,#16  // C --> bit 16 (r2 contains $00000000 or $00010000)
    lsrs r3,r0,#16  // Extract partial result [bits 16-31]
    lsls r0,r0,#16  // Extract partial result [bits 0-15]
    adds r0,r4      // Result [bits 0-31]
    adcs r2,r3      // Partial [bits 16-47] + C
    adds r1,r2      // Result [bits 32-63]
Happily, Jens Bauer was able to provide me with an alternative that takes only 17 cycles to execute.
mulshift32:
    uxth r2,r0      /* [1] Extract Factor 0 lo [0-15]  */
    lsrs r0,r0,#16  /* [1] Extract Factor 0 hi [16-31] */
    lsrs r3,r1,#16  /* [1] Extract Factor 1 hi [16-31] */
    uxth r1,r1      /* [1] Extract Factor 1 lo [0-15]  */
    movs r4,r1      /* [1] Copy Factor 1 lo [0-15]     */
    muls r1,r2      /* [1] Factor 1 lo x Factor 0 lo   */
    muls r4,r0      /* [1] Factor 1 lo x Factor 0 hi   */
    muls r0,r3      /* [1] Factor 0 hi x Factor 1 hi   */
    muls r3,r2      /* [1] Factor 1 hi x Factor 0 lo   */
    lsls r2,r4,#16  /* [1] (Factor 1 lo x Factor 0 hi) << 16 */
    lsrs r4,r4,#16  /* [1] (Factor 1 lo x Factor 0 hi) >> 16 */
    adds r1,r2      /* [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) << 16) */
    adcs r0,r4      /* [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 lo x Factor 0 hi) >> 16) + C */
    lsls r2,r3,#16  /* [1] (Factor 1 hi x Factor 0 lo) << 16 */
    lsrs r3,r3,#16  /* [1] (Factor 1 hi x Factor 0 lo) >> 16 */
    adds r1,r2      /* [1] Low word + ((Factor 1 hi x Factor 0 lo) << 16) */
    adcs r0,r3      /* [1] High word + ((Factor 1 hi x Factor 0 lo) >> 16) + C */

My question therefore encompasses any alternative code fragments that take fewer than 17 cycles. You will note that my code uses 3 cycles to move the value of the C flag to bit 16, i.e. a carry. This bit seemed the most likely place to save cycles, which is why I included it.

Recently, researchers have found new methods of multiplying very large numbers. The Karatsuba method only requires 3 MULS instructions, but it needs so many extra logic operations that in practice it is slower.

The problem is that the quick-reference guide to Thumb does not explain the action of C very well. Now, I DO have a solution to this: if bits 6 or 7 are set in the shift/rotate amount, C is set appropriately. Since I got my Cortex-M0+ based SoC I have wondered why the values in bits 6 & 7 were included in the shift/rotate... and this MAY be why. Rotating by 64 or 128 will leave the original value in the register, BUT the N & C flags will be set where appropriate.

I have concluded that Sophie Wilson is a genius... but isn't good at communicating the underlying logic. I think she assumes that we are all as bright as her, and since it's 'obvious', she does not mention it.

Many thanks.

PS: I'm aiming to write the code in Thumb4 or earlier so that it is compatible with the maximum number of devices. It certainly IS a large project, but both the Chinese & Indian governments have shown great interest, and with ARM helping me along, I will give it my best shot.
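The Karatsuba trade-off mentioned above can be made concrete with a short C sketch (my own illustration, not from the post). On 16-bit halves, three multiplies replace four, but the middle operands are up to 17 bits wide, so the middle product can be 34 bits. That overflow is exactly what forces the extra fix-up logic on a CPU whose MULS only keeps the low 32 bits. Here the middle product is simply widened to 64 bits to keep the sketch correct:

```c
#include <stdint.h>

/* Karatsuba on 16-bit halves: three multiplies instead of four.
   Illustrative only; a real Thumb MULS would truncate the middle
   product, requiring the extra logic that makes this slower. */
static uint64_t karatsuba32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t z0 = a_lo * b_lo;                             /* MUL 1 */
    uint32_t z2 = a_hi * b_hi;                             /* MUL 2 */
    /* 17-bit x 17-bit: can be up to 34 bits, hence the widening */
    uint64_t z1 = (uint64_t)(a_lo + a_hi) * (b_lo + b_hi); /* MUL 3 */
    uint64_t mid = z1 - z0 - z2;  /* = a_lo*b_hi + a_hi*b_lo */

    return ((uint64_t)z2 << 32) + (mid << 16) + z0;
}
```

The identity used is z1 - z0 - z2 = a_lo*b_hi + a_hi*b_lo, trading one multiply for two extra additions, two subtractions, and the overflow handling.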
Sean,
how close are you to the 90%? I guess it is closely related to the bit rate/compression, right? I guess for an audiobook 32 kbit/s should be enough.
BTW: I was thinking forwards, backwards and in circles, but see no way of either squeezing another cycle out of the mul32x32 routine or building a multiply which delivers only the high half of the result. But from the code you posted earlier, it seems you are using the full 64 bits of the result and do the rounding at the end.
The question is: do you use the full 32 bits, or does the input data have fewer valid bits? That is: do you need a general multiply, or will a special one work?
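To illustrate the "special multiply" idea: if one factor is known to fit in 16 bits (a plausible case for audio samples, though that is my assumption, not something stated in the thread), only two of the four 16x16 partial products survive, so two MULS suffice. A hedged C sketch (the name `mul32x16_sketch` is mine):

```c
#include <stdint.h>

/* Special-case multiply: 32-bit x 16-bit -> 48-bit result.
   Only two 16x16 partial products are needed instead of four. */
static uint64_t mul32x16_sketch(uint32_t a, uint16_t b)
{
    uint32_t lo = (a & 0xFFFFu) * b;   /* MUL 1: a_lo * b */
    uint32_t hi = (a >> 16) * b;       /* MUL 2: a_hi * b */
    return ((uint64_t)hi << 16) + lo;  /* overlap-add the halves */
}
```

On a Cortex-M0+ this would also drop the carry juggling between the two cross products, which is where the general routines above spend several of their cycles.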