Hi Experts,
We are trying to port a people-counting application to the ZCU104 platform. We want to offload the ML part to the FPGA and run the pre- and post-processing modules on the ARM CPU cores. When we run the application, the pre/post-processing modules take a lot of time, so we want to reimplement them using NEON intrinsics.
Here is the issue: when we compile both the plain float code and the NEON code with the -O3 flag, we see the same latency numbers for both.
Can you please suggest any tips, or how to analyse this further?
Thanks and Regards,
Raju
But what are the results that are causing you concern? And what CPU are you targeting/testing on?
Have you got any code snippets to give us more of an idea about the example?
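As a starting point for that comparison, and only as a guess until real code is posted: at -O3, GCC and Clang will often auto-vectorise simple float loops, so the "plain" version may already be running as NEON code, which would explain identical timings. The sketch below is a hypothetical stand-in for a pre-processing step (a scale-and-offset normalisation), not the original application code. The names normalize_scalar, normalize_neon and run_comparison are made up for illustration, and the NEON path is only compiled when the compiler defines __ARM_NEON.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C version: out[i] = in[i] * scale + offset.
 * At -O3 the compiler is free to vectorise this loop by itself. */
static void normalize_scalar(const float *in, float *out, size_t n,
                             float scale, float offset)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + offset;
}

/* Hand-written NEON version, 4 floats per iteration.
 * Falls back to the scalar loop when NEON is not available. */
static void normalize_neon(const float *in, float *out, size_t n,
                           float scale, float offset)
{
#if defined(__ARM_NEON)
    const float32x4_t vscale = vdupq_n_f32(scale);
    const float32x4_t voffset = vdupq_n_f32(offset);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(in + i);
        v = vaddq_f32(vmulq_f32(v, vscale), voffset);
        vst1q_f32(out + i, v);
    }
    for (; i < n; ++i)              /* leftover elements */
        out[i] = in[i] * scale + offset;
#else
    normalize_scalar(in, out, n, scale, offset);
#endif
}

/* Times both versions with clock() and returns the number of elements
 * where the outputs differ by more than a small tolerance (0 = match),
 * or -1 if allocation fails. */
int run_comparison(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    float *in = malloc(n * sizeof *in);
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    if (!in || !a || !b) {
        free(in); free(a); free(b);
        return -1;
    }
    for (size_t i = 0; i < n; ++i)
        in[i] = (float)(i % 256);   /* fake 8-bit pixel values */

    volatile float sink = 0.0f;     /* discourages removing the loops */
    const float scale = 1.0f / 255.0f, offset = -0.5f;

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        normalize_scalar(in, a, n, scale, offset);
        sink += a[(size_t)r % n];
    }
    clock_t t1 = clock();
    for (int r = 0; r < reps; ++r) {
        normalize_neon(in, b, n, scale, offset);
        sink += b[(size_t)r % n];
    }
    clock_t t2 = clock();

    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > 1e-5f) ++mismatches;
    }
    printf("scalar: %.6f s, neon: %.6f s, mismatches: %d\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, mismatches);
    free(in); free(a); free(b);
    return mismatches;
}
```

Calling run_comparison(1 << 20, 100) from main and building once with -O3 and once with -O3 -fno-tree-vectorize (GCC) should show whether the scalar loop was being vectorised automatically. GCC's -fopt-info-vec option reports which loops it vectorised. If the two versions still match with auto-vectorisation disabled, the loop is more likely limited by memory bandwidth than by arithmetic, and a real benchmark would need more care than this sketch takes.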