What increase in throughput can I expect on my device from changing a sequence of
```
fmla v1.4s, v1.4s, v1.4s
```
to
```
mla v1.16b, v1.16b, v1.16b
```
?
My device consists of X3, A715 and A510 processors.
When profiling peakflops, I got a ~2x increase in throughput. I would have expected a 4x increase.
Is there any matrix-multiplication-related instruction on Arm from which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?
Great to hear that you have findings for the throughput increase.
I don't have a datasheet for the S9 tablet at hand, but there is a generic way to check the Arm CPU type.
On Android or a Linux-like OS, you can run the command "cat /proc/cpuinfo". Here is one example for you.
Please check the CPU part number. Once you know the CPU type of each CPU ID, you can try to map it to the socket ID.
<quote>
# cat /proc/cpuinfo
processor : 0
BogoMIPS : 26.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd46
CPU revision : 2

processor : 1
BogoMIPS : 26.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd46
CPU revision : 2
</quote>
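If it helps, the check described above could be scripted roughly as follows. This is only a sketch, not an official tool: `ARM_PARTS`, `parse_cpuinfo` and `describe` are hypothetical names introduced here, and the part-number table covers only the three cores mentioned in this thread (0xd46 is Cortex-A510, 0xd4d is Cortex-A715, 0xd4e is Cortex-X3). It also assumes the usual "key : value" layout of /proc/cpuinfo shown in the example.

```python
# Sketch: map each CPU id in /proc/cpuinfo to a core name and report
# whether the int8 dot-product (asimddp) and int8 matrix-multiply (i8mm)
# features are advertised. The table below is an illustrative assumption
# covering only the cores discussed in this thread.
ARM_IMPLEMENTER = 0x41
ARM_PARTS = {
    0xD46: "Cortex-A510",
    0xD4D: "Cortex-A715",
    0xD4E: "Cortex-X3",
}


def parse_cpuinfo(text):
    """Return {cpu_id: {"implementer", "part", "features"}} from cpuinfo text."""
    cpus = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key : value" pair
        key, value = key.strip(), value.strip()
        if key == "processor" and value.isdigit():
            current = cpus.setdefault(
                int(value),
                {"implementer": None, "part": None, "features": set()},
            )
        elif current is None:
            continue  # ignore fields that appear before any "processor" line
        elif key == "Features":
            current["features"] = set(value.split())
        elif key == "CPU implementer":
            current["implementer"] = int(value, 16)
        elif key == "CPU part":
            current["part"] = int(value, 16)
    return cpus


def describe(cpus):
    """Return one summary line per CPU id, sorted by id."""
    lines = []
    for cpu_id in sorted(cpus):
        info = cpus[cpu_id]
        part = info["part"]
        if part is None:
            name = "unknown"
        elif info["implementer"] == ARM_IMPLEMENTER and part in ARM_PARTS:
            name = ARM_PARTS[part]
        else:
            name = f"unknown part {part:#x}"
        feats = info["features"]
        dp = "yes" if "asimddp" in feats else "no"
        mm = "yes" if "i8mm" in feats else "no"
        lines.append(f"cpu{cpu_id}: {name}, asimddp={dp}, i8mm={mm}")
    return lines


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not a Linux-like system; nothing to report
    for line in describe(parse_cpuinfo(text)):
        print(line)
```

For the example output above, both entries would be reported as Cortex-A510 with asimddp and i8mm present. Regarding the int8 question: asimddp indicates the SDOT/UDOT instructions and i8mm indicates SMMLA/UMMLA/USMMLA, all of which accumulate int8 products into 32-bit lanes. Per 128-bit instruction, SDOT performs 16 multiply-accumulates and SMMLA performs 32, compared with 4 for fmla on .4s. The throughput you actually measure still depends on how many of these each core can issue per cycle, so it is worth profiling them on each core type separately.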