EDIT: Updated March 2015 to include more information on the GPU memory system, to help developers optimize compute shaders.
In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.
In this blog I will finish the construction of this abstract machine, forming the final component: a stereotypical Mali "Midgard" GPU programmable core. This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.
The "Midgard" family of Mali GPUs (the Mali-T600, Mali-T700, and Mali-T800 series) use a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.
The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.
The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can run in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or, in the case of tiling workloads (see the second blog in this series for an overview of tiling), sent to a fixed-function tiling unit.
The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle.
The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed-function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handle the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.
There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but the Mali-T880 has three.
Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, because as long as at least one thread is ready to execute we maintain efficient overall execution.
The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.
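To make that concrete, here is a minimal, hypothetical OpenCL C sketch (the kernel name and buffer layout are invented for illustration) in which each work-item processes 16 8-bit luminance pixels, so a single saturating vector add maps onto one 16 x int8 SIMD operation:

```c
/* Hypothetical OpenCL C sketch: brighten an 8-bit luminance image.
 * Each work-item handles 16 pixels, so one saturating vector add maps
 * onto a single 128-bit (16 x int8) SIMD operation in the A-pipe.
 */
__kernel void brighten_luma(__global const uchar16 *src,
                            __global uchar16       *dst,
                            uchar                   gain)
{
    size_t i = get_global_id(0);
    /* add_sat clamps at 255 instead of wrapping on overflow */
    dst[i] = add_sat(src[i], (uchar16)(gain));
}
```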
While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This works out to 34 FP32 FLOPS per shader core per clock cycle; each core has two arithmetic pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.
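Working that arithmetic through explicitly, here is a small, purely illustrative sketch of the derivation in plain C:

```c
/* Back-of-envelope check of the per-core arithmetic throughput, using
 * only the public figures quoted above (Mali-T760 MP16, 326 FP32
 * GFLOPS at 600MHz). The numbers are approximate.
 */
#include <stdio.h>

int main(void)
{
    double gflops    = 326.0;  /* rated FP32 GFLOPS, whole GPU      */
    double clock_ghz = 0.6;    /* 600 MHz                           */
    int    cores     = 16;     /* Mali-T760 MP16                    */
    int    a_pipes   = 2;      /* arithmetic pipelines per core     */

    double flops_per_clock_gpu  = gflops / clock_ghz;             /* ~543 */
    double flops_per_clock_core = flops_per_clock_gpu / cores;    /* ~34  */
    double flops_per_clock_pipe = flops_per_clock_core / a_pipes; /* ~17  */

    printf("FP32 FLOPS per clock: GPU %.0f, core %.0f, pipeline %.0f\n",
           flops_per_clock_gpu, flops_per_clock_core, flops_per_clock_pipe);
    return 0;
}
```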
The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.
The load/store pipeline (LS-pipe) is responsible for all shader memory accesses which are not related to texturing.
For graphics workloads this means reading per-vertex attribute inputs and writing computed per-vertex outputs during vertex shading, and, during fragment shading, reading the per-vertex output values written by the vertex shader so they can be interpolated as varying values.
In general every instruction is a single memory access operation, although like the arithmetic pipeline they are vector operations and so could load an entire "highp" vec4 varying in a single cycle.
In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading a fragment, only to throw the result away at the end of the pipeline if it is killed by ZS testing. Coloring fragments just to discard them would cost a huge amount of performance and wasted energy, so where possible we do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment which may call "discard" and as such has an indeterminate depth state until it exits the tripipe).
In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague seanellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on Arm Mali GPUs - so I won't dive into any more detail here.
This section is an after-the-fact addition to this blog, so if you have read this blog before and don't remember this section, don't worry, you're not going crazy. We have been getting a lot of questions from developers writing OpenCL kernels and OpenGL ES compute shaders asking for more information about the GPU cache structure, as it can be really beneficial to lay out data structures and buffers to optimize cache locality. The salient facts are:
Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.
If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:
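As a rough, unofficial sketch, those peak figures can be derived from the per-core, per-clock rates described earlier in this blog (one 32-bit pixel written and one bilinear texel per core per clock, roughly 34 FP32 FLOPS per core per clock, and 256 bits of memory access per clock for an 8-core design):

```c
/* Rough sketch of theoretical peak figures for a Mali-T760 MP8 at
 * 600MHz, derived from the per-core, per-clock rates described in this
 * blog. Real silicon will be limited by the surrounding memory system.
 */
#include <stdio.h>

int main(void)
{
    int    cores     = 8;
    double clock_ghz = 0.6;                       /* 600 MHz            */

    double gflops   = 34.0 * cores * clock_ghz;   /* ~163 FP32 GFLOPS   */
    double gpix_s   = 1.0  * cores * clock_ghz;   /* 4.8 GPix/s fill    */
    double gtex_s   = 1.0  * cores * clock_ghz;   /* 4.8 GTex/s bilinear*/
    double gbytes_s = (256.0 / 8.0) * clock_ghz;  /* 32 B/clk = 19.2 GB/s */

    printf("FP32: %.0f GFLOPS, fill: %.1f GPix/s, "
           "texture: %.1f GTex/s, bandwidth: %.1f GB/s\n",
           gflops, gpix_s, gtex_s, gbytes_s);
    return 0;
}
```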
The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads; or, more accurately, the compute thread creator also spawns vertices.
A document explaining the Midgard family performance counters, which map onto the block architecture described in this blog, can be found on my blog on the Midgard family.
This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the Arm DS-5 Streamline profiling tools.
My next blog on Mali performance is available below.
Read Mali Performance 1: Checking the Pipeline: https://community.arm.com/graphics/b/blog/posts/mali-performance-1-checking-the-pipeline
Comments and questions welcomed as always,
TTFN,
Pete
19.2GB/s, subject to the ability of the rest of the memory system outside of the GPU to supply data this quickly. Like most features of an Arm-based chip, the down-stream memory system is highly configurable, in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of the ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.
Hi alprakas,
chrisvarns is correct on this, I think. I would recommend you watch the following presentation from my colleague johangronqvist, which looks at multiple implementations of SGEMM and how the kernel was tuned to maximise cache efficiency, both from the point of view of workgroup size and through modifications to the kernel itself.
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
Hope that's of some use,
Tim
Yes, good point. Make sure you are issuing workgroups which align along the cache lines in memory, not sideways across them. That is a very quick way to kill performance.
Making sure workgroups are exact cacheline multiples in terms of data access (input and output) is also a good idea (i.e. fully use all of the data you touch). Your second workgroup size of 20x4 is not a power of two, so it is likely that you are loading a cache line and only using part of it.
Cache lines are 64 bytes if that helps.
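As a hypothetical host-side sketch (the function and variable names are invented), one way to apply this is to pick the x-dimension of the local work size so that each row of a workgroup covers whole 64-byte cache lines when consecutive work-items read consecutive floats:

```c
/* Hypothetical host-side sketch: choose a workgroup shape whose
 * x-dimension covers whole 64-byte cache lines when each work-item
 * reads one float. Error handling omitted; 'queue' and 'kernel' are
 * assumed to exist already.
 */
#include <CL/cl.h>

#define CACHE_LINE_BYTES 64
#define BYTES_PER_ITEM   sizeof(cl_float)  /* one float per work-item in x */

void enqueue_2d(cl_command_queue queue, cl_kernel kernel,
                size_t width, size_t height)
{
    /* 64 / 4 = 16 work-items per cache line in the x-dimension */
    size_t local[2]  = { CACHE_LINE_BYTES / BYTES_PER_ITEM, 4 };
    size_t global[2] = { width, height };  /* assumes width % 16 == 0 */

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                           global, local, 0, NULL, NULL);
}
```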
I've certainly seen cases where the SHAPE of the workgroup has a massive impact on the cache friendliness, and therefore on the memory bandwidth and processing speed of a workgroup, and that's possibly what you're hitting. I believe timhar01 posted some good investigations into this sort of thing, so he should be able to point you in the right direction. It's not the SIZE that's causing this variation, but the memory access pattern as influenced by differences in the workgroup FOOTPRINT.
Hth,
Chris
Can you please confirm whether it is 128 bits or something else on the 5422 SoC?
From the GPU point of view it's one 128-bit interface if you have 1-4 cores, and two 128-bit interfaces if you have more than that.
There is an almost 3x jump in power consumption between the workgroup size variations. I am thoroughly confused now as to the cause of this.
I suspect you are hitting register pressure problems - this is a side effect of workgroup size, rather than a direct effect of it. This blog from anton explains some of the constraints around register allocation, showing how different numbers of registers affect the number of concurrent threads which can run in the core.
Arm Mali Compute Architecture Fundamentals
Due to the memory coherency and barrier behavior of compute workgroups we have to guarantee that all threads in a workgroup fit into the shader core at the same time, otherwise executing "barrier()" operations would be impossible, so there is a tradeoff between the number of work registers which can be used by a thread and the total number of threads in the workgroup. Summarizing Anton's blog: threads using at most 4 work registers allow up to 256 threads per core, threads using 5-8 registers drop that to 128 threads per core, and threads using 9-16 registers drop it to 64 threads per core.
I suspect your kernel has a lot of concurrent data so really needs 9-16 work registers. In your first case with a size of (16,4) your workgroup size is 64, so the compiler can allocate up to the full 16 work registers per thread. In your second case (20, 4) you force the workgroup size to 80 (i.e. bigger than 64) so the compiler is only allowed to allocate a maximum of 8 registers per thread. If you need more variables than that concurrently alive then you are going to force the compiler to use the stack, and due to the number of concurrent threads in the system (128 x core count) that stack access can add up to a lot of bandwidth very quickly.
It sounds like you need to reduce the amount of concurrent live variable data in your kernel. This can be achieved either by restructuring your code to reduce the number of live variables, or by using a lower precision datatype (e.g. fp16 rather than fp32).
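For example, here is a minimal, hypothetical OpenCL C sketch of the second option, keeping intermediate values in fp16 (requires the cl_khr_fp16 extension) so each live value needs half the register space:

```c
/* Hypothetical OpenCL C sketch: keep intermediates in fp16 ("half") to
 * reduce live register state. Requires the cl_khr_fp16 extension.
 */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void blend_rows(__global const half *a,
                         __global const half *b,
                         __global half       *out,
                         float                mix_factor)
{
    size_t i = get_global_id(0);
    half   m = (half)mix_factor;   /* convert once, keep in fp16 */
    /* the whole expression stays in fp16, halving the register
       footprint compared with promoting everything to fp32 */
    out[i] = a[i] * (1.0h - m) + b[i] * m;
}
```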
EDIT: Chris has reminded me that workgroup size is not part of the compile-time state for OpenCL (it usually is for OpenGL ES 3.1 compute kernels), so it is probably not this - sorry. See some of the ideas below...
HTH, Pete
Peter Harris wrote:
The idea here is that if the workgroup size is small, the working set for the workgroup is small and can easily fit into the Cache hierarchy.
The workgroup size doesn't really constrain the size of the working set in that way. It is entirely possible to have threads from multiple different workgroups live in the shader core at the same time. i.e. if you have a 16 element workgroup then we could have 16 different workgroups running and filling the available 256 thread slots.
The only real constraint is that workgroups should be a multiple of 4 to maximize thread count.
Unfortunately, this is not really matching the experimental observations. When I run an application with a smaller workgroup size, the bandwidth requirement obtained from the Mali counters in DS-5 (Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel) is much smaller than with a larger workgroup size.
For example, with (wg_size_x, wg_size_y) = (16,4), the bandwidth calculated using the formula mentioned in that link is 93M * bus width size, whereas if I increase the workgroup size to the next level, say (20,4), the bandwidth requirement suddenly jumps to 376M * bus width size. This is also reflected in the memory bus power consumption: there is an almost 3x jump in power consumption between the workgroup size variations. I am thoroughly confused now as to the cause of this. And then there is this paper (http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6618834&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%…), which talks about the effect of workgroup size for CPUs. Of course, this might be totally useless in the case of GPUs, as you described.
Anything else that I should check, in your opinion?
May I also check with you regarding the memory bus width on this SoC: can you please confirm whether it is 128 bits or something else on the 5422 SoC?
Thanks again!