Hi - I have spent several months optimising my MP2/MP2.5/MP3 decoder on my Raspberry Pi Pico. Profiling highlighted the fact that, of the 19 files containing code, 5 small files take most of the time. I mean, it's the polyphase filter and so forth, so no surprise. I'm running the MP3 decoder from an interrupt, so I stack LR and store SP to memory, since SP is a very powerful register (all of those addressing modes and so on), but I keep running into the same issue.

The MULSHIFT32 macro is used thousands of times throughout the code and, as the name suggests, it multiplies two 32-bit values together and returns the top 32 bits of the result. Just to give an example, the polyphase loop takes 417 lines. If MULH were a valid instruction, it would take 409 to 421 cycles depending on branches, but if I use the macro, which some really excellent coder on here managed to get down to 17 cycles (I forget his name but he is awesome), it takes 627 cycles. That means that, as it is, it can play a high quality mono stream with the clock speed set at 64MHz, but my calculations suggest that if I had MULH, it would manage with the clock at 48MHz, which would save power.

I know exactly nothing about microcode in processors. I am aware of the Cortex M1, but it seems that it can only perform the slow multiply (33 cycles in this case), and many moons ago I did read about the ARM7EJ-S, which sounded fascinating.

Now, the aim is to produce a USB memory stick that uses the Cortex M0+ processors found in 95% of these sticks to support not only MP2/2.5/3 (and possibly AAC, which looks more complex to encode but no more complex to decode) but also ACELP (the patents just ran out for MP3 & ACELP), as well as 1, 2, 3, 4 and 5-bit ADPCM (I suppose 1-bit is technically delta compression?) and LPC10 (LPC10e is still under patent). A good friend showed me some encoding tricks for LPC and, considering that it was using 300, 600 & 1200 bits per second, the quality was great - certainly good enough not to be annoying and, I would hope, sufficiently good for its use in audiobooks aimed at education.

I do apologise if this is just a stupid question, but I am wondering how much extra silicon adding a MULH would take (yes, I know I'm dumb). On the plus side, the reference fixed-point MP3 decoder I am using IS in C but, interestingly, the code has been written in a manner which presumes sixteen 32-bit registers, one of which is the SP. When I use these tricks to get back r13 & r14, I have been able to avoid needing a stack frame or a buffer in RAM (obviously, since I use SP, it's a dang good thing I DO NOT need a stack frame).

Many, many people have helped me and, while my memory is terrible, you know who you are. Although illness has slowed me down (seizures), I am getting it together. I am honestly wondering just how much adding a MULH to the instruction set would cost in silicon, because it really does knock off ⅓ of the processing time.

Of course, Naveed has been a constant source of inspiration. I've just ordered some more audio stuff and an OLED screen, so I intend to get some great sound from a humble USB stick.
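For anyone following along, here is a rough C sketch of the job MULSHIFT32 has to do on a core whose multiply only returns the low 32 bits of the product, as on the Cortex-M0+. It is not the 17-cycle routine mentioned above and makes no cycle-count claim; it only shows the idea of building the top word from four 16x16 partial products. The names `mulshift32_ref`, `mulshift32_partial` and `mulshift32_selfcheck` are made up for illustration, and the code assumes two's complement with arithmetic right shift of negative values (true for GCC and Clang on ARM).

```c
#include <assert.h>
#include <stdint.h>

/* Reference result: the top 32 bits of the signed 64-bit product,
   i.e. what a MULH/SMULL-style instruction would hand back directly. */
static int32_t mulshift32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Same result using only 32-bit multiplies and adds.
   Split each operand into an unsigned low half and a signed high half:
   x*y = hh*2^32 + (lh + hl)*2^16 + ll. */
static int32_t mulshift32_partial(int32_t x, int32_t y)
{
    int32_t xl = (int32_t)((uint32_t)x & 0xFFFFu); /* 0..65535 */
    int32_t yl = (int32_t)((uint32_t)y & 0xFFFFu); /* 0..65535 */
    int32_t xh = x >> 16;                          /* -32768..32767 */
    int32_t yh = y >> 16;                          /* -32768..32767 */

    uint32_t ll = (uint32_t)xl * (uint32_t)yl;     /* fits in 32 bits unsigned */
    int32_t  lh = xl * yh;                         /* fits in 32 bits signed */
    int32_t  hl = xh * yl;                         /* fits in 32 bits signed */
    int32_t  hh = xh * yh;                         /* at most 2^30 */

    /* Fold the carries upward 16 bits at a time so nothing overflows. */
    int32_t t = lh + (int32_t)(ll >> 16);
    int32_t u = hl + (t & 0xFFFF);

    return hh + (t >> 16) + (u >> 16);
}

/* Small self-check against the 64-bit reference on a few edge cases. */
static void mulshift32_selfcheck(void)
{
    static const int32_t v[] = { 0, 1, -1, 0x7FFF, 0x10000, -0x10000,
                                 0x12345678, -0x12345678,
                                 INT32_MAX, INT32_MIN };
    const int n = (int)(sizeof v / sizeof v[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(mulshift32_partial(v[i], v[j]) == mulshift32_ref(v[i], v[j]));
}
```

Four multiplies plus the shifts, masks and adds is why the macro costs so much more than a single long-multiply instruction would.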
Ah, I forgot you want to build an all-in-one chip. It would be cool if there were ASICs with CM3 hard macros where you would not have to pay the full license (as it is paid by the fab) but only per chip.

Or: ARM could provide the RTL for SMULL so it could be "added" to a CM0+ :-)

Anyway, despite the time you have already spent, would RISC-V be an option? (Honestly, I have not bothered to look into the ISA yet, as I do not see any projects with it.)