Silicon chips based on the Arm Cortex-M55 and Cortex-M85 processors are reaching the market. To help software developers make the most of Cortex-M55/M85 based devices, we prepared this page to highlight key points and useful information about Cortex-M55 and Cortex-M85 software development. Please note that this page is a work in progress and will be updated as new materials and information become available.
For an overview of the Armv8.1-M architecture and the Cortex-M55 processor, the following papers can be useful:
The official product page and product document can be found here:
Key resources for Helium programming:
https://www.arm.com/technologies/helium
https://developer.arm.com/Architectures/Helium
Cortex-M85 announcement blog: Cortex-M85: Highest Performing Cortex-M Processor ever (Internet of Things (IoT) blog, Arm Community)
A presentation video of "Arm DevSummit 2022 - Harnessing the capabilities from the Arm Cortex-M85 processor" is available here (requires registration).
There are also many links to other Cortex-M-related resources listed in this page: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/cortex-m-resources.
There has been a range of enhancements in the Cortex-M55 r1 release compared with r0:
Due to the pipeline optimizations, there can be instruction cycle differences between silicon chips based on r0 and r1 of the Cortex-M55.
There has been a range of enhancements in the Cortex-M85 r1 release compared with r0:
Arm Compiler 6 is available here:
For best performance, please use Arm Compiler 6.15 or later. Arm Compiler 6.16 is now included in Keil MDK 5.34. If you are using an older version of the Keil MDK, please upgrade to the latest version. Note: Version 6.14 does support Armv8.1-M but is not as optimized as newer versions.
When using Cortex-M55 with Arm Compiler 6, the following command-line options can be used to select specific Cortex-M55 configuration:
When using Cortex-M85 with Arm Compiler 6, the following command-line options can be used to select specific Cortex-M85 configuration:
By default, when Cortex-M55/M85 is selected in Arm Compiler 6, the compiler assumes that the target supports Helium and the FPU. To disable generation of Helium instructions, you need to add “+nomve”.
You can also specify architecture instead of specifying the processor. For example:
Please note that:
Other information:
Or
(where N is the coprocessor number)
#if (__ARM_FEATURE_MVE & 2) /* MVE Float */ … #endif
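As a minimal sketch of how this feature macro can be used, the helper below reports the Helium level the code was compiled for. It relies on the ACLE convention that bit 0 of __ARM_FEATURE_MVE indicates integer MVE and bit 1 indicates floating-point MVE; the function name mve_support_level is hypothetical and only used for illustration.

```c
/* Hypothetical helper: returns 0 (no Helium), 1 (integer-only Helium) or
   3 (integer and floating-point Helium), based on the ACLE feature macro. */
int mve_support_level(void)
{
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
    return 3;   /* MVE with floating-point support */
#elif defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
    return 1;   /* integer-only MVE */
#else
    return 0;   /* no MVE, for example when built with +nomve */
#endif
}
```

The same pattern can guard Helium-specific code paths so that a single source file still builds for targets without Helium.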
One of the key Arm Compiler 6 features that is useful when using an Armv8.1-M processor is the auto-vectorization support. This enables a range of processing workloads to take advantage of the Helium and low-overhead-branch extension features without completely rewriting them for low-level optimization.
-Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize
Sometimes, software developers might need to change the source code slightly to help the compiler vectorize certain loop operations. Examples of code that cannot be vectorized, or is difficult to vectorize, include:
Some existing program code might contain manually unrolled loops, written that way to get better performance on earlier processors. When such code is ported to the Cortex-M55 processor, the manual unrolling can make it more difficult for Arm Compiler 6 to identify auto-vectorization opportunities. Therefore, it might be necessary to modify the code to remove the manual loop unrolling.
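The following sketch illustrates the kind of change described above. Both functions are hypothetical examples, not taken from any Arm library: the first is unrolled by hand, the second is the plain rolled loop that leaves unrolling and vectorization decisions to the compiler. Whether a particular compiler version vectorizes either form depends on the options used.

```c
#include <stddef.h>
#include <stdint.h>

/* Manually unrolled by four. The hand-written structure can hide the
   simple loop from the auto-vectorizer. Assumes n is a multiple of 4. */
int32_t sum_unrolled(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc += x[i];
        acc += x[i + 1];
        acc += x[i + 2];
        acc += x[i + 3];
    }
    return acc;
}

/* Rolled form: one element per iteration, same result. */
int32_t sum_rolled(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}
```

The rolled version also removes the restriction that n must be a multiple of four.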
Software developers should also rule out pointer aliasing in loops by using the restrict qualifier where applicable.
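A minimal sketch of the restrict qualifier in a loop is shown below. The function name scale_add and its parameters are hypothetical. The qualifier is a promise from the programmer that the three arrays do not overlap, so the caller must make sure this is true.

```c
#include <stddef.h>

/* out[i] = a[i] + k * b[i]
   'restrict' tells the compiler that out, a and b never alias, so it does
   not need to assume that a store to out[] can change a[] or b[]. */
void scale_add(float *restrict out,
               const float *restrict a,
               const float *restrict b,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + k * b[i];
    }
}
```

Compiling such a function with the -Rpass options listed earlier is one way to check whether the loop was actually vectorized.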
By default, the Loop and branch info cache is disabled after processor reset. To get the best out of the Low Overhead Branch (LOB) extension in Armv8.1-M, set the LOB bit in the Configuration and Control Register (CCR) to 1 to enable this cache. (Note: This hardware cache is not related to the I-cache and D-cache.)
For example, if you are using CMSIS-CORE in your project:
// Enable Loop and branch info cache
SCB->CCR |= SCB_CCR_LOB_Msk;
__DSB();
__ISB();
If you are not using CMSIS-CORE in your project:
#define CCR_ADDR (0xE000ED14UL)
#define CCR *(volatile unsigned int *) CCR_ADDR
#define __ISB() __builtin_arm_isb(0xF)
#define __DSB() __builtin_arm_dsb(0xF)
CCR |= 0x00080000UL;
__DSB();
__ISB();
Similar to the floating-point unit (FPU), the Helium hardware needs to be enabled before it can be used. The procedure is the same as enabling the FPU: to use Helium features, coprocessors 10 and 11 must be enabled. For example, if you are using CMSIS-CORE in your project:
// Enable CP10 and CP11
SCB->CPACR |= ((3U << 10U*2U) |   /* CP10 Full Access */
               (3U << 11U*2U) );  /* CP11 Full Access */
__DSB();
__ISB();
#define CPACR_ADDR (0xE000ED88UL)
#define CPACR *(volatile unsigned int *) CPACR_ADDR
#define __ISB() __builtin_arm_isb(0xF)
#define __DSB() __builtin_arm_dsb(0xF)
CPACR |= ((3U << 10U*2U) | (3U << 11U*2U) );
__DSB();
__ISB();
If TrustZone is used, Secure privileged software should also set up the NSACR and CPPWR registers to define whether the Non-secure world is allowed to access the Helium and FPU features.
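As a rough sketch of the NSACR part of this setup, the helper below sets the CP10 and CP11 bits (bits 10 and 11) that grant Non-secure access to the FPU and Helium. The function name is hypothetical, and the register is passed as a pointer so that the logic can be exercised on a host machine. On a target, Secure privileged code would pass the address of NSACR (SCB->NSACR in CMSIS-CORE, assumed here to be at 0xE000ED8C) and follow the write with __DSB() and __ISB(). CPPWR is not covered by this sketch.

```c
#include <stdint.h>

#define NSACR_CP10_BIT ((uint32_t)1u << 10)  /* Non-secure access to CP10 */
#define NSACR_CP11_BIT ((uint32_t)1u << 11)  /* Non-secure access to CP11 */

/* Hypothetical helper: allow the Non-secure world to use FPU and Helium.
   'nsacr' points to the NSACR register (or to a test variable). */
static inline void allow_nonsecure_fpu_helium(volatile uint32_t *nsacr)
{
    *nsacr |= NSACR_CP10_BIT | NSACR_CP11_BIT;
}
```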
For Cortex-M55/M85 based devices that have instruction and data caches implemented, you might need to enable these caches based on the application requirements. Generally, if the application runs code from, or accesses data in, memories connected to the main AXI bus, it is best to enable the caches. For example:
By default, the caches are disabled at startup. In CMSIS-CORE based software projects you can use:
These functions include manual cache invalidation. In the Armv8-M architecture, caches can also be invalidated automatically when they are enabled. On the Cortex-M55/M85 processors, you can enable the caches using the following code:
(If you are using CMSIS-CORE in your project):
// Enable Instruction and Data caches
SCB->CCR |= (SCB_CCR_IC_Msk|SCB_CCR_DC_Msk);
__DSB();
__ISB();
(If you are not using CMSIS-CORE in your project):
#define CCR_ADDR (0xE000ED14UL)
#define CCR *(volatile unsigned int *) CCR_ADDR
#define __ISB() __builtin_arm_isb(0xF)
#define __DSB() __builtin_arm_dsb(0xF)
CCR |= 0x00030000UL;
__DSB();
__ISB();
By default, branch prediction is disabled in the Cortex-M85. It is enabled using the BP (Branch Prediction) bit in the Configuration and Control Register (CCR).
// Enable Branch Prediction
SCB->CCR |= SCB_CCR_BP_Msk;
__DSB();
__ISB();
#define CCR_ADDR (0xE000ED14UL)
#define CCR *(volatile unsigned int *) CCR_ADDR
#define __ISB() __builtin_arm_isb(0xF)
#define __DSB() __builtin_arm_dsb(0xF)
CCR |= 0x00040000UL;
__DSB();
__ISB();
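The hexadecimal constants used in the non-CMSIS examples on this page (0x00030000, 0x00040000 and 0x00080000) correspond to CCR bits 16 to 19. The sketch below names those bits and wraps the read-modify-write in a small helper. The macro and function names are hypothetical, and the register is passed as a pointer so that the logic can be checked on a host. On a target, the pointer would be the CCR address (0xE000ED14) and the call would be followed by __DSB() and __ISB().

```c
#include <stdint.h>

/* CCR bit positions implied by the constants used on this page */
#define CCR_DC_BIT   ((uint32_t)1u << 16)  /* data cache enable        */
#define CCR_IC_BIT   ((uint32_t)1u << 17)  /* instruction cache enable */
#define CCR_BP_BIT   ((uint32_t)1u << 18)  /* branch prediction enable */
#define CCR_LOB_BIT  ((uint32_t)1u << 19)  /* loop and branch info cache enable */

/* Hypothetical helper: set the requested enable bits and leave the
   other CCR bits unchanged. */
static inline void ccr_enable_features(volatile uint32_t *ccr, uint32_t mask)
{
    *ccr |= mask;
}
```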
For Cortex-M55 r0px and r1p0: Depending on the system design, the processor might attempt to put the Extension Processing Unit (EPU) into a retention state to save power if the EPU has been enabled but is not being used. After the EPU has entered the retention state, if the software executes an FPU or Helium instruction, the processor wakes up the EPU automatically. While this is beneficial to energy efficiency and is completely transparent to software, the automatic power switching sequences can delay the program’s operation and therefore reduce performance.
To avoid this performance penalty, change the ELPSTATE bits in the Core Power Domain Low Power State Register (CPDLPSTATE) to 0b00 (ON) or 0b01 (clock gated). Software should switch the ELPSTATE bits back to 0b11 when the application does not require the EPU, for example, when the device is about to enter a sleep mode. (After a reset the value of CPDLPSTATE is 0x00000333, meaning that the processor would attempt to switch the EPU into the retention state because ELPSTATE is set to OFF (0b11).)
(Setting ELPSTATE to 0b01 when using CMSIS-CORE in your project):
/* Note: This code fragment is included in the example SystemInit code for the Cortex-M55 processor */
PWRMODCTL->CPDLPSTATE = (PWRMODCTL->CPDLPSTATE & 0xFFFFFFCFUL) |
                        (0x1 << PWRMODCTL_CPDLPSTATE_ELPSTATE_Pos);
(Setting ELPSTATE to 0b01 without using CMSIS-CORE in your project):
#define CPDLPSTATE_ADDR (0xE001E300UL)
#define CPDLPSTATE *(volatile unsigned int *) CPDLPSTATE_ADDR
CPDLPSTATE = (CPDLPSTATE & 0xFFFFFFCFUL) | (0x01UL << 4);
Note: The CMSIS-CORE v5.7 header file for Cortex-M55 is missing the register definition for the CPDLPSTATE register. This is added in v5.8.
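The read-modify-write on the ELPSTATE field (bits 5:4, as implied by the 0xFFFFFFCF mask and the shift by 4 in the code above) can be sketched as a pure function, which makes the arithmetic easy to check against the documented reset value of 0x00000333. All names in this sketch are hypothetical.

```c
#include <stdint.h>

#define ELPSTATE_POS   4u
#define ELPSTATE_MASK  ((uint32_t)3u << ELPSTATE_POS)   /* bits 5:4 */

/* ELPSTATE encodings mentioned in the text */
enum {
    ELPSTATE_ON        = 0,  /* 0b00 */
    ELPSTATE_CLK_GATED = 1,  /* 0b01 */
    ELPSTATE_OFF       = 3   /* 0b11, reset value */
};

/* Hypothetical helper: return the CPDLPSTATE value with only the
   ELPSTATE field replaced by 'state'. */
static inline uint32_t cpdlpstate_with_elpstate(uint32_t reg, uint32_t state)
{
    return (reg & ~ELPSTATE_MASK) | ((state << ELPSTATE_POS) & ELPSTATE_MASK);
}
```

For example, starting from the reset value 0x00000333, selecting the clock gated state gives 0x00000313, and switching back to OFF before entering sleep restores 0x00000333.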
Cortex-M55 r1 supports limited static branch prediction by reusing the Low-Overhead Branch (LOB) hardware. In a few cases, this can help performance. In the r1p0 release this feature is disabled by default, and can be enabled by clearing the DISLOBR bit (bit 5) of the Auxiliary Control Register (ACTLR). In the r1p1 release this bit is cleared by default.
The CMSIS-DSP library has been optimized for the Cortex-M55/M85 processor.
CMSIS-DSP is now released as source code only. This is different from the past, when binary builds (libraries) were also available. The change was made because the Cortex-M processors are highly configurable, and building the libraries for all possible configuration variants is becoming impractical.
To compile the CMSIS-DSP libraries with Arm Compiler 6, please select “-Ofast” optimization level for best performance.
Usually, application code that uses CMSIS-DSP can be reused directly in Cortex-M55 projects and takes advantage of the Helium technology immediately. However, in a few cases code modifications are required:
arm_biquad_cascade_df1_init_f32
arm_biquad_cascade_df1_mve_init_f32
Note: It takes a new argument: pCoeffsMod. Its size is 32*numStages float32_t elements.
For best performance, the buffers for filter processing should be at least 64-bit aligned (128-bit aligned is even better).
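A minimal sketch of the buffer sizing and alignment points above, using C11 alignas. The buffer names, NUM_STAGES and BLOCK_SIZE are hypothetical, and plain float is used as a stand-in for the CMSIS-DSP float32_t type so that the example does not depend on the CMSIS-DSP headers.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STAGES 2u    /* hypothetical number of biquad stages */
#define BLOCK_SIZE 64u   /* hypothetical processing block size   */

/* 32 float elements per stage, matching the pCoeffsMod size note above,
   and 128-bit (16-byte) alignment for best performance. */
static alignas(16) float coeffs_mod[32u * NUM_STAGES];
static alignas(16) float io_buffer[BLOCK_SIZE];

/* Returns non-zero if p is aligned to 'alignment' bytes. */
static inline int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A check such as is_aligned(io_buffer, 16) can be placed in a debug assertion to catch buffers that were allocated without the intended alignment.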
When migrating old projects, please review the DSP functions used in the project to see if any of them are in the deprecated function list (https://arm-software.github.io/CMSIS-DSP/main/deprecated.html). If so, you should consider updating the code.
Program code running from AXI connected memories does not have the same issue.
Note: Sometimes C compilers insert a NOP at the beginning of the loop after DLS/WLS{TP}. This NOP is not a part of the loop, but a padding instruction to keep the instructions in the loop aligned. To determine the correct loop address, please use the negative offset in the LE instruction, and do not rely on the position of the WLS/DLS{TP} instruction.
Please note that the cache management in the Armv8-M architecture has some differences when compared to the Armv7-M architecture.
Enabling cache using automatic invalidation
Disabling caches
Code example to disable I and D caches:
// Disable instruction and data cache
// On Cortex-M55 this disables line-fills preventing any
// new data from being written to the cache
SCB->CCR &= ~SCB_CCR_IC_Msk;
SCB->CCR &= ~SCB_CCR_DC_Msk;
// Clean and invalidate caches to write back any dirty
// data to memory
SCB_CleanInvalidateDCache();
SCB_InvalidateICache();
__DSB();
__ISB();
// Clear MSCR.xACTIVE to disable cache lookups
SCB->MSCR &= ~SCB_MSCR_ICACTIVE_Msk;
SCB->MSCR &= ~SCB_MSCR_DCACTIVE_Msk;
__DSB();
__ISB();
// Sequence complete: all instruction fetches and
// data reads/writes now go to main memory
Please note:
Event 0x0003 L1D_CACHE
Event 0x0036 LL_CACHE_RD
Event 0x0037 LL_CACHE_MISS_RD
Event 0x0039 L1D_CACHE_MISS_RD
Event 0x0040 L1D_CACHE_RD
These are all essentially the same event, indicating that a load/store operation has accessed the cache. Technically this is correct, as the cache logic is used as part of the access (instead of the TCM). However, it is confusing because the D-cache is disabled. Therefore, it was decided that in Cortex-M55 r1 these events are masked when the caches are deactivated in MSCR.
The Armv8.1-M architecture introduced half-precision floating-point arithmetic support. Half-precision floating-point values are 16 bits wide, and the format is defined by the IEEE 754-2008 standard. To support half-precision arithmetic operations, the _Float16 data type is defined in the C11 extension ISO/IEC TS 18661-3:2015.
In Armv8.1-M processors like the Arm Cortex-M55 and Cortex-M85 processors, half-precision floating-point support is included when the FPU (Floating-Point Unit) is implemented. The Helium technology (M-Profile Vector Extension) introduced in Armv8.1-M also supports half-precision vector operations.
The format of half precision floating-point is shown in the diagram below:
Please note that the IEEE 754 half precision floating-point format is different from bfloat16 (__bf16), which is a different 16-bit floating-point format typically used for machine learning applications. For reference, the bfloat16 format is shown below. Bfloat16 is not supported by the Armv8.1-M architecture.
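To make the difference between the two 16-bit formats concrete, the sketch below decodes raw bit patterns on any host, without needing _Float16 support. It uses the standard layouts: IEEE 754 half precision has 1 sign bit, 5 exponent bits (bias 15) and 10 fraction bits, while bfloat16 has 1 sign bit, 8 exponent bits and 7 fraction bits, which is the upper half of a single-precision value. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single-precision value. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Decode an IEEE 754 half-precision bit pattern. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    uint32_t exp  = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t frac = (uint32_t)h & 0x3FFu;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent in both formats */
        return bits_to_float(sign | 0x7F800000u | (frac << 13));
    }
    if (exp == 0u) {
        /* Zero or subnormal: value is frac * 2^-24 */
        float mag = (float)frac * (1.0f / 16777216.0f);
        return sign ? -mag : mag;
    }
    /* Normal number: rebias the exponent from 15 to 127 */
    return bits_to_float(sign | ((exp + 112u) << 23) | (frac << 13));
}

/* Decode a bfloat16 bit pattern: it is the top 16 bits of a float. */
float bf16_bits_to_float(uint16_t b)
{
    return bits_to_float((uint32_t)b << 16);
}
```

Note that the same value has different encodings in the two formats: 1.0 is 0x3C00 in half precision and 0x3F80 in bfloat16.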
Although half-precision floating-point does not offer the same level of accuracy and data range as single-precision floating-point, there are two main advantages of using this format in embedded applications:
Half-precision floating-point is supported by modern compilers including Arm Compiler 6, LLVM and GCC. Since _Float16 is part of a C11 extension, C/C++ projects should ideally specify the C11 standard when using _Float16. However, current versions of Arm Compiler 6, LLVM and GCC accept _Float16 regardless of the C standard being used, so omitting the C standard option is not a major issue.
An example of using _Float16 in C code is shown below. In addition to the _Float16 data type, the extension also allows you to define half-precision constants using the f16 suffix (e.g. 3.14f16). You can also use _Float16 for type casting.
#include "stdio.h"
#include "ARMCM55.h"

static volatile _Float16 A1, B1;

int main(void)
{
    _Float16 C1;
    A1 = 0.5f16;
    B1 = (_Float16) 0.5;
    C1 = A1*B1;
    printf("%f\n", (double) C1);
    while(1);
}
Please note that _Float16 data type is different from __fp16 data type, which is supported by Arm C Language Extension (ACLE).
For more information about the differences, please visit the following web pages: