When does arm-none-eabi-gcc optimize away empty loops?

I am used to gcc optimizing away the sort of "for (i=0; i<DELAYCOUNT; i++) ;" loops that people sometimes try to use for delays.

But arm gcc seems to be very inconsistent in this area.

With the following code, arm-gcc version 5.4, 6, 8, 9, or 10 at -Os, -O2, or -O3 optimizes away the loop in delay(), but NOT the for loop in main():

void delay() {
  for (int i=0; i < 9000000; i++) {}
}

int main() {
  while(1) {
    for(int i=0; i<9000000; i++){}  //Run a few cycles doing nothing
  }
}

arm gcc 7 optimizes away both loops, as does g++.


Output from gcc 10:

/Downloads/gcc-arm-10/bin/arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -g -Os -Wall -Wextra loop.c -c; arm-objdump -S loop.o

loop.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <delay>:
void delay() {
  for (int i=0; i < 9000000; i++) {}
   0:   4770            bx      lr

Disassembly of section .text.startup:

00000000 <main>:

int main() {
   0:   4b02            ldr     r3, [pc, #8]    ; (c <main+0xc>)
  while(1) {
    for(int i=0; i<9000000; i++){}  //Run a few cycles doing nothing
   2:   3b01            subs    r3, #1
   4:   2b00            cmp     r3, #0
   6:   d1fc            bne.n   2 <main+0x2>
   8:   e7fa            b.n     0 <main>
   a:   46c0            nop                     ; (mov r8, r8)
   c:   00895440        .word   0x00895440


(I'm not happy about the extra "cmp" instruction, either. The subs will already have set the flags. With -mcpu=cortex-m4 it does better.)
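
For comparison, here is a minimal sketch of the two loop variants I would expect to survive at every optimization level, because each iteration has a side effect the compiler has to preserve. The function names delay_volatile and delay_asm are made up for illustration, and the empty asm statement is a GCC extension rather than standard C. I have not checked the generated code on every gcc version listed above, so treat this as an assumption about the usual behavior, not a guarantee.

```c
#include <stdint.h>

/* Hypothetical helper: the counter is volatile, so every read and write
   of i is an observable access that the compiler must emit. */
uint32_t delay_volatile(uint32_t count) {
    volatile uint32_t i;
    for (i = 0; i < count; i++) {
        /* empty on purpose */
    }
    return i;  /* final counter value, returned only to make it checkable */
}

/* Hypothetical helper: an empty volatile asm statement (GCC extension)
   in the body. The compiler may not delete it, so the loop has to run
   the body `count` times. */
uint32_t delay_asm(uint32_t count) {
    uint32_t i;
    for (i = 0; i < count; i++) {
        __asm__ volatile ("");
    }
    return i;
}
```

Neither version gives a calibrated delay; the time per iteration still depends on the core, flash wait states, and the code the compiler chooses to emit. The point is only that the loop is no longer "empty" from the optimizer's point of view.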

