I am using ArmPL (22.0) to compute FFTs, but fftwh (FP16) is slower than fftwf (FP32) on a Kunpeng 920 Arm server. I expected fftwh to be about 2x faster than fftwf.

code:

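/* In-place forward 2D FFT in single precision (FP32).
   FFTW's 2D planner takes (n0, n1) with n0 the slower-varying dimension;
   the sizes tested here are square, so the argument order does not matter. */
static void fftwf_armpl_fp32(fftwf_complex* signal, int row, int col) {
    fftwf_plan plan_f = fftwf_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(plan_f);
    fftwf_destroy_plan(plan_f);
}

/* Same transform in half precision (FP16) through the fftwh interface. */
static void fftwf_armpl_fp16(fftwh_complex* signal, int row, int col) {
    fftwh_plan plan_h = fftwh_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwh_execute(plan_h);
    fftwh_destroy_plan(plan_h);
}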

Size	FP32 (ms)	FP16 (ms)
256*256	4.45	3.09
512*512	16.4	12.7
1024*1024	35.7	36.0
2048*2048	180.1	169.1
4096*4096	761.5	861.4
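
To make the gap to the expected 2x explicit, the timings above can be turned into FP32/FP16 ratios. The following is a minimal sketch, not part of the original post: the `timing` struct and the `speedup` and `print_speedups` helpers are hypothetical names, and the numbers are copied from the table.

```c
#include <assert.h>
#include <stdio.h>

/* Timings copied from the table above, in milliseconds. */
struct timing {
    int n;            /* transform is n*n */
    double fp32_ms;
    double fp16_ms;
};

static const struct timing timings[] = {
    {256, 4.45, 3.09},
    {512, 16.4, 12.7},
    {1024, 35.7, 36.0},
    {2048, 180.1, 169.1},
    {4096, 761.5, 861.4},
};

/* Ratio above 1.0 means FP16 was faster than FP32. */
static double speedup(double fp32_ms, double fp16_ms) {
    assert(fp16_ms > 0.0);
    return fp32_ms / fp16_ms;
}

/* Print one line per table row with the FP32/FP16 ratio. */
static void print_speedups(void) {
    for (size_t i = 0; i < sizeof timings / sizeof timings[0]; ++i) {
        printf("%4d*%-4d  %.2fx\n", timings[i].n, timings[i].n,
               speedup(timings[i].fp32_ms, timings[i].fp16_ms));
    }
}
```

Calling `print_speedups()` from `main` gives ratios of roughly 1.44x, 1.29x, 0.99x, 1.07x and 0.88x from the smallest to the largest size, so FP16 never approaches 2x and falls behind FP32 at 4096*4096.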

yan.wei over 1 year ago in reply to Chris Goodyer

Thanks. I timed only the fftwh_execute and fftwf_execute calls, and FP16 is still slower than FP32. The Kunpeng 920 (Armv8.2) supports FP16 instructions, so why does FP16 give no speedup?
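
Timing only the execute call, with planning and plan destruction excluded, could look like the sketch below. This is an illustration under assumptions, not the code used in the thread: `bench_fn`, `now_ms`, `best_of_ms` and `count_call` are hypothetical names, and `count_call` is only a stand-in for a wrapper around the real execute call.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: wraps whatever call is being timed. */
typedef void (*bench_fn)(void *ctx);

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* One untimed warm-up call, then `repeats` timed calls.
   Returns the fastest timed call in milliseconds. */
static double best_of_ms(bench_fn fn, void *ctx, int repeats) {
    assert(fn != NULL && repeats > 0);
    fn(ctx);  /* warm-up: not included in the result */
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        double t0 = now_ms();
        fn(ctx);
        double dt = now_ms() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}

/* Stand-in workload: counts how often it was called.
   In a real benchmark this would wrap the FFT execute call. */
static void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

For the code in the question, the callback would be a small wrapper that calls fftwf_execute or fftwh_execute on a plan created beforehand, with the plan passed through `ctx`. Because the transforms are in place, each repeat operates on the output of the previous one, so reinitialising the buffer between runs may be worth considering.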

Chris Goodyer over 1 year ago in reply to yan.wei

Hi.

Thanks for confirming. We can observe a similar lack of extra performance on other 128-bit Neon platforms. Using these functions on, for example, an A64FX would show the 2x performance difference we would expect.

We've added looking at this to our future work list. Thanks for raising the issue.

Chris