What is the difference between SVM and cl::Buffer?

hterrolle 1 month ago

Hi,

I asked this question on the Khronos forum but got no answer, so I decided to ask it here.

I use the following process with OpenCL on Android.

Working with:

- Mali-G715-Immortalis MC11 r1p2

- OpenCL 3.0 v1.r38p1-01eac0.c1a71ccca2acf211eb87c5db5322f569

- SVM_COARSE_GRAIN_BUFFER supported

1. I create the platform, queue and device, create all my cl::Buffer objects and compile all the kernels at the start of my application.
2. I get a picture from my camera and send the byte data through JNI (jbyteArray => (uint8_t*)inPtr) to my C++ function.
3. I take the (uint8_t*)inPtr pointer and use cl::Buffer to feed the buffer with the camera picture data:
   bufferNV21 = cl::Buffer(gContext, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, isize*sizeof(cl_uchar), inPtr, NULL);
   This takes less than 1 ms.
4. I run my NV21toRGB kernel, then do some work on the output buffer.
5. I use enqueueMapBuffer to map the buffer to a pointer, buf, in my program's memory, which is then used for CPU processing with pthreads. This takes less than 2 ms.
6. Then I copy the CPU result back to a GPU buffer with:
   bufferligne = cl::Buffer(gContext, CL_MEM_USE_HOST_PTR, (1024*1024)*sizeof(cl_uchar4), buf, NULL); // replaces enqueueWriteBuffer
   This takes less than 3 ms.
7. I run some kernels on the bufferligne cl::Buffer.
8. Then I send the GPU buffer (bufferMMM) back to the Java output bitmap using:
   gQueue.enqueueReadBuffer(bufferMMM, CL_TRUE, 0, osize*sizeof(cl_uchar4), out, 0, &arraySecondEvent); // for OpenCL
   This last part takes between 3 and 5 ms, sometimes less.
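The NV21toRGB kernel itself is not shown in this post. As a way to check its output on the CPU, here is a minimal reference sketch in plain C++. It assumes the usual NV21 layout (a full-resolution Y plane followed by interleaved V/U samples at half resolution) and full-range BT.601 coefficients; both are assumptions, and the names nv21Size, nv21PixelToRgb and clampToByte are hypothetical helpers, not part of the original code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Clamp an intermediate value to the 0..255 range of one colour channel.
inline std::uint8_t clampToByte(long v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, v)));
}

// Size in bytes of an NV21 frame with even dimensions:
// width*height luma bytes plus width*height/2 interleaved V/U bytes.
inline std::size_t nv21Size(int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return static_cast<std::size_t>(width) * height * 3 / 2;
}

// Convert the pixel at (x, y) of an NV21 frame to {R, G, B}.
// Assumption: full-range BT.601 coefficients; a real camera pipeline may
// use a different range or matrix.
inline std::array<std::uint8_t, 3> nv21PixelToRgb(const std::uint8_t* frame,
                                                  int width, int height,
                                                  int x, int y) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(x >= 0 && x < width && y >= 0 && y < height);

    // The V/U plane starts right after the Y plane.
    const std::uint8_t* vu = frame + static_cast<std::size_t>(width) * height;

    const float Y = frame[static_cast<std::size_t>(y) * width + x];
    // One V/U pair is shared by each 2x2 block of pixels; V comes first.
    const std::size_t vuIndex =
        static_cast<std::size_t>(y / 2) * width + static_cast<std::size_t>(x / 2) * 2;
    const float V = static_cast<float>(vu[vuIndex]) - 128.0f;
    const float U = static_cast<float>(vu[vuIndex + 1]) - 128.0f;

    return {clampToByte(std::lround(Y + 1.402f * V)),
            clampToByte(std::lround(Y - 0.344136f * U - 0.714136f * V)),
            clampToByte(std::lround(Y + 1.772f * U))};
}
```

Comparing a few pixels of the GPU output against this reference is usually enough to catch a swapped U/V order or a wrong plane offset. The isize used when creating bufferNV21 would be expected to match nv21Size(width, height) if the frame really is NV21.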

So is it relevant to use SVM with my configuration, and what should I change if I want to use SVM? Would the change be at step 3, 5, 7 or 8?

And what does SVM do that cl::Buffer does not? I would like to understand why to use it.

I was able to improve the speed by using, in the kernel.cl file:

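#pragma OPENCL EXTENSION cl_khr_priority_hints : enable // speeds up the OpenCL queue driver
#pragma OPENCL EXTENSION CL_QUEUE_PRIORITY_HIGH_KHR : enable // note: CL_QUEUE_PRIORITY_HIGH_KHR is a host-side queue priority value from cl_khr_priority_hints, not an extension name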

and in the .cpp file:

// Optional extension support
#define CL_HPP_USE_IL_KHR
#define CL_HPP_USE_CL_SUB_GROUPS_KHR
#define CL_HPP_OPENCL_API_WRAPPER

hterrolle 14 days ago in reply to John Kesapides

Hi John,

Thanks for the tips. The input from Java costs nothing until the map/unmap. The only cost I can see, after a year of testing now, is on the output to Java memory. So there is no need to use SVM; CL_MEM_USE_HOST_PTR looks fast enough. In my tests I do not need to call unmap after map, just a flush at the end of the processing before the next frame.

For the output to Java I use enqueueReadBuffer, and that can cost me between 2 and 5 ms. I tried to use map/unmap, but while it works fine from Java to OpenCL, it does not work the same way from OpenCL to Java.

The other problem I found: removing the Event.wait() calls sped up all the kernels, but I need at least one wait() before the transfer from GPU to CPU (map/unmap), otherwise the data on the CPU is not up to date. This wait costs around 10 ms, sometimes more, and it is not the map/unmap itself that costs.

So the last improvement I can see is from OpenCL to Java. I need to copy the data; I cannot use a pointer. I may be wrong, but you should have a look at it. To me it is strange that it works from Java to OpenCL but not the other way.

PS: I just tested with the CL_FALSE flag rather than CL_TRUE on enqueueReadBuffer, and that looks to be the solution, with 0 ms of work. ;))

Thanks for your answer, I found a new solution for improvement ;)) ;))

So it looks like enqueueReadBuffer with CL_FALSE works like map/unmap. So why does map/unmap not work from OpenCL to Java?
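A note on the 0 ms measurement: with CL_FALSE, enqueueReadBuffer only submits the read and returns, so a timer around the call may be measuring the submission rather than the copy. Per the OpenCL specification, the destination pointer is only safe to use once the event returned by the call has completed (or after gQueue.finish()). The sketch below is a plain standard C++ analogy, not OpenCL code: std::async stands in for the command queue and std::future for cl::Event, and submitRead and blockingRead are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Stand-in for a non-blocking read (CL_FALSE): starts the copy on another
// thread and returns immediately. The returned future plays the role of
// the cl::Event; dst must not be read until it has completed.
inline std::future<void> submitRead(const std::vector<std::uint8_t>& src,
                                    std::vector<std::uint8_t>& dst) {
    assert(dst.size() >= src.size());
    return std::async(std::launch::async, [&src, &dst] {
        if (!src.empty()) {
            std::memcpy(dst.data(), src.data(), src.size());
        }
    });
}

// Stand-in for a blocking read (CL_TRUE): the same copy, but the call
// returns only once dst has been filled.
inline void blockingRead(const std::vector<std::uint8_t>& src,
                         std::vector<std::uint8_t>& dst) {
    submitRead(src, dst).get();
}
```

Timing only submitRead gives a number close to zero whatever the buffer size, while the full cost reappears at the point where the future (or, in OpenCL, the event) is waited on. If that wait can be moved to just before the Java side actually needs the bitmap, the copy overlaps with other work, which may explain part of the gain seen here.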

On shared memory there should normally not be any problem, nor any need for SVM. It is just a matter of how the old functions are implemented on the new hardware. Some do it well, some do not. ;))

MediaTek and Huawei seem to do it well. If Qualcomm offered me a Snapdragon equal to the MediaTek 9200+, I could check it. ;)) ;)) ;))

Thanks again for posting an answer that forced me to think about performance rather than about data processing.

Best Regards

hervé.