Sigh, I had a nice big post with a bunch of details written out but much of it got deleted when I posted. Oh well, here's the short version.
Is there any documentation describing how NEON performs the actual memory accesses across the AXI bus for its ld4 instruction? I'm trying to read from a hardware FIFO on the assumption that ld4 reads the data in order and then de-interleaves it, but the results I'm seeing on the hardware imply that it is either performing the accesses out of order, performing more accesses than I'd expect, or otherwise not behaving the way I assumed. I expect an ld4 {vN.4s-vM.4s}, [x] instruction to read 16 bytes each from addresses x, x+16, x+32, and x+48, in that order, but that does not seem to be the case.
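For reference, here is a minimal scalar sketch in C of the register result that an ld4 on four .4s registers is expected to produce: 16 consecutive 32-bit words are de-interleaved so that element j of each 4-word structure lands in register j. The names vec4s_model and ld4_4s_model are made up for illustration. The sketch models only the final register contents and says nothing about how, or in what order, the core issues the underlying bus reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit vector register holding 4 x 32-bit lanes. */
typedef struct {
    uint32_t lane[4];
} vec4s_model;

/* Scalar model of the de-interleave performed by ld4 on four .4s registers.
 * src holds 16 consecutive 32-bit words (four 4-element structures).
 * Word i belongs to structure i / 4 and is element i % 4 of that structure,
 * so it ends up in register i % 4, lane i / 4.
 * This models the register result only, not the memory access pattern. */
static void ld4_4s_model(const uint32_t src[16], vec4s_model regs[4])
{
    assert(src != NULL && regs != NULL);
    for (size_t i = 0; i < 16; ++i) {
        regs[i % 4].lane[i / 4] = src[i];
    }
}
```

With src filled with 0..15, register 0 would hold words 0, 4, 8, 12 and register 3 would hold words 3, 7, 11, 15.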
The memory in question is uncached device memory. Interleaved writes to this memory appear to happen in order, as I'd expect. From the AXI transaction information I'm logging, I can see that it's performing single-beat 16-byte reads, but I am not tracking the actual addresses, so I'm not sure of the order. Any details on the exact behavior would be greatly appreciated!
After some additional logging of the AXI transactions being performed in response to the ld4 instruction, it seems to be performing multiple reads of each 16-byte memory region and only using half the data from each read. It reads 16 bytes from address [x] twice: the first time it uses the lower 8 bytes, and the second time the upper 8 bytes. This behavior repeats four times, once for each 16-byte region.
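To make the consequence for a read-sensitive FIFO concrete, here is a small C simulation of the pattern described above. It is only a sketch under assumptions: the FIFO pops one 16-byte entry on every bus read, and the two reads of each region arrive back to back (low-half read first, then high-half read). The types beat128 and fifo_model and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte read beat, split into the two 8-byte halves of the bus. */
typedef struct { uint64_t lo, hi; } beat128;

/* Hypothetical FIFO model: every bus read pops one entry. */
typedef struct {
    const beat128 *entries;
    size_t count;
    size_t next;   /* index of the next entry to pop */
} fifo_model;

static beat128 fifo_read(fifo_model *f)
{
    assert(f->next < f->count);   /* the model does not handle underflow */
    return f->entries[f->next++];
}

/* Models the observed behavior for one 64-byte load: each of the four
 * 16-byte regions is read twice. The low 8 bytes are kept from the first
 * read and the high 8 bytes from the second (assumed ordering). */
static void double_read_64_bytes(fifo_model *f, uint64_t out[8])
{
    for (size_t region = 0; region < 4; ++region) {
        beat128 first = fifo_read(f);
        beat128 second = fifo_read(f);
        out[2 * region] = first.lo;
        out[2 * region + 1] = second.hi;
    }
}
```

In this model a 64-byte load consumes eight FIFO entries instead of four, and each 16-byte result is stitched together from the halves of two different entries, which matches the kind of scrambled data the double reads would produce.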
Basically, to read 64 bytes it actually ends up reading 128 bytes and discarding half of them. I find it hard to believe that the A53 core was designed to behave this way (especially given that this is device memory!), so I'm guessing that this has something to do with the interconnect. Maybe the A53 core is requesting 64-bit reads, but the interconnect, not realizing it's accessing device memory, just assumes that it's safe to widen them. Not sure. Either way, I can work around this by keeping track of which part of the 128-bit AXI bus was last accessed and reading the FIFO only once for each transaction, but since there's no way (that I've seen) to actually detect this behavior based on the AXI transaction information, I now have a very fragile system that only works for this very particular setup.
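One way to picture that workaround is the C model below. It is a sketch, not the actual hardware logic: the FIFO pops an entry on the first read of a pair, holds it, and replays the held entry on the second read. The simple toggle is an assumption, since the post does not say exactly how the accessed half is tracked, and paired_fifo_model and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte read beat, split into the two 8-byte halves of the bus. */
typedef struct { uint64_t lo, hi; } beat128;

/* Hypothetical FIFO model that pops only once per pair of reads. */
typedef struct {
    const beat128 *entries;
    size_t count;
    size_t next;           /* index of the next entry to pop */
    beat128 held;          /* entry returned by the first read of a pair */
    bool second_pending;   /* true when the next read should replay held */
} paired_fifo_model;

static beat128 paired_fifo_read(paired_fifo_model *f)
{
    if (f->second_pending) {
        /* Second read of the pair: replay the held entry, do not pop. */
        f->second_pending = false;
        return f->held;
    }
    assert(f->next < f->count);   /* the model does not handle underflow */
    f->held = f->entries[f->next++];
    f->second_pending = true;
    return f->held;
}

/* Same assumed master behavior as observed: two reads per 16-byte region,
 * keeping the low half of the first read and the high half of the second. */
static void double_read_64_bytes(paired_fifo_model *f, uint64_t out[8])
{
    for (size_t region = 0; region < 4; ++region) {
        beat128 first = paired_fifo_read(f);
        beat128 second = paired_fifo_read(f);
        out[2 * region] = first.lo;
        out[2 * region + 1] = second.hi;
    }
}
```

The model also shows why the setup is fragile: if a master ever issues just one read per region, the toggle gets out of step and the next read replays a stale entry.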