Hello!
I am currently working on creating a shared virtual address space in Linux arm64 on a Xilinx Zynq Ultrascale+ board. In the future it should be possible to share pointers/addresses between the Cortex A53s and the FPGA utilizing the built in ARM SMMU 500 (IOMMU) and the Cache Coherent Interface (CCI) without any user action necessary.
To do so I used the driver/iommu/arm-smmu.c iommu driver of Linux kernel v4.14.0 and modified it to remove most of virtualization abstractions, along with a separate custom kernel module and ioctls. Thereby each process gets its own individual SMMU context bank which holds its own aarch64 page table. Right now it is already possible to successfully read and write data from the FPGA with virtual addresses using a separate page table by manually mapping the allocated pages to the same virtual addresses again.
It would be more convenient if the MMU and SMMU could share the same page table, thus avoiding the unnecessary setup of a second, redundant page table. To do so, I made the following changes:
- Configure SMMU to use 39 bit VA size and 40 bit PA size (4 KB page size)
- Take the PGD pointer out of curr task_struct and pass it to the correct SMMU context bank PGD entry
All other SMMU hardware configurations are the same as in arm-smmu.c.
However, this leads to non-deterministic behavior. A test case where the FPGA reads multiple values and writes them back to a different location using the shared virtual addresses works only sometimes. I implemented the test program to intentionally pause after setting up all necessary data structures and initializing them and just before the FPGA is instructed to transfer the values. The longer the pause, the more values are correctly transferred. After about 10 seconds of pause the test always concludes successfully. In my opinion this sounds like a cache issue where the updated page table entries (PTE) created by Linux are still in CPU cache and the SMMU accesses the wrong ones which leads to no transfers (but also no translation error/failure in the SMMU). So either I have to flush/clean the cache at the correct location in linux source code or change some SMMU flags (context bank, stream-2-context register, ...), MMU flags or memory/shareability flags of the PTEs.
I already discovered that the MAIR was setup differently for the SMMU. I changed the SMMU code to match the MAIR of the MMU and also use the correct memory attribute index when necessary. But that didn't help. Furthermore, I also checked chapter 1.5.2 "Differences between ARM architecture and SMMU translation schemes" in the ARM SMMU v2 architecture specification.
What confuses me the most is that manually built page tables work correctly, but the Linux generated ones cause the described non-deterministic behavior.
Any information or hints about how to correctly setup the SMMU to use/share page tables generated by Linux or either change the way Linux configures its page tables in order to work properly with the SMMU would be greatly appreciated.
Thank you so much!
Best
Seems great work.
Out of interest, "Thereby each process gets its own individual SMMU context bank which holds its own aarch64 page table", how did you manage the SMMU context bank with application switching?
1. Change streamID for SMMU when user process switching?
2. Change streamID and Context bank mapping when user process switching?