Suppose software does a write to device memory that allows early write acknowledgement, then executes a DSB instruction.
*device_memory = 1; /* Suppose the memory type is Device-nGnRE */
asm volatile("dsb sy" : : : "memory");
Because the memory type is Device-nGnRE, the memory system can return a write acknowledgement before the write has reached its endpoint. Does this mean the DSB instruction can complete before the write has reached its endpoint? Could someone point me to a section of the "Armv8 Architecture Reference Manual" that explains this?
jatron said: it seems like the DSB instruction can complete before the write has reached its endpoint
True.
jatron said: Does the processor wait for a write acknowledgement at any time, or does it simply push all writes to a store buffer for completion at some later time?
The processor places the store in the store-buffer when it commits the instruction. This step signals the completion of the store instruction: the processor can retire the instruction and reclaim some of the resources it reserved for its execution. The fact that the store isn't even flowing on the bus yet doesn't prevent the processor from continuing to execute other instructions (subject to dependencies, barriers, synchronization events, and resource availability).
The processor unit that interfaces with the bus to transmit the store waits for the write-ack as/if required by the protocol. Again, the wait doesn't prevent the processor from continuing to execute other instructions (subject to ...).
So the processor does both, but not in the either/or way your question suggests.
Whether the buffer collects multiple such writes before initiating a transaction over the bus, or issues each one as soon as it arrives, is an implementation detail I am not aware of. You may want to refer to the Cortex-R4 TRM for some details, as an example.
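For what it's worth, here is a toy software model of the flow described above, written in plain C. It is only a sketch and does not describe any real core or interconnect: the two-stage layout, the FIFO depth, and every name in it (toy_soc, commit_store, dsb, deliver_all) are made up for illustration. It assumes the interconnect is the component that returns the early write-ack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4  /* hypothetical depth, chosen only for the example */

struct write { uint32_t *addr; uint32_t value; };
struct fifo  { struct write slot[FIFO_DEPTH]; size_t head, count; };

static void fifo_push(struct fifo *f, struct write w)
{
    assert(f->count < FIFO_DEPTH);  /* the toy model has no back-pressure */
    f->slot[(f->head + f->count) % FIFO_DEPTH] = w;
    f->count++;
}

static struct write fifo_pop(struct fifo *f)
{
    assert(f->count > 0);
    struct write w = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return w;
}

/* Stage 1: store-buffer in the core. Stage 2: a buffer in the interconnect. */
struct toy_soc { struct fifo store_buffer; struct fifo interconnect; };

/* The store instruction commits: the write is only queued in the store-buffer. */
static void commit_store(struct toy_soc *s, uint32_t *addr, uint32_t value)
{
    fifo_push(&s->store_buffer, (struct write){ addr, value });
}

/* DSB-like barrier: wait until every earlier store has been acknowledged.
 * Assumption: the interconnect acks as soon as it accepts the write
 * (early write-ack), so the barrier ends before the endpoint is updated. */
static void dsb(struct toy_soc *s)
{
    while (s->store_buffer.count > 0)
        fifo_push(&s->interconnect, fifo_pop(&s->store_buffer));
}

/* Some time later, independently of the core: writes reach the endpoint. */
static void deliver_all(struct toy_soc *s)
{
    while (s->interconnect.count > 0) {
        struct write w = fifo_pop(&s->interconnect);
        *w.addr = w.value;
    }
}
```

In this model, calling commit_store() and then dsb() leaves the target location unchanged until deliver_all() runs, which mirrors the point above: the barrier waits for the acknowledgement, not for the write to arrive at the device.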
jatron said: Could someone point me to a section of the "Armv8 Architecture Reference Manual" that clarifies this
Such implementation details may be found in the TRMs of actual processors, and/or protocol specs, but not in the arch. manual.
Assuming that the stores in your code snippet are not reordered by the compiler (and so are presented to the processor in this same programmed order), and assuming that they target the same peripheral, the arch. manual guarantees the programmer that a compliant implementation will ensure that the peripheral sees them in that same order. Therefore, the components involved in the IO must not reorder the stores (nR), and must not coalesce stores in case there are overlaps (nG).
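As a rough sketch of the "not reordered by the compiler" part: the register layout (demo_regs) and the helper names below are hypothetical, and the inline asm assumes a GCC- or Clang-compatible compiler. Declaring the registers volatile keeps the compiler from reordering or merging the two stores; the Device-nGnRE attributes are then what keep the hardware from doing so. On anything other than AArch64 the helper falls back to a compiler-only barrier, so the snippet still builds on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical peripheral register block, for illustration only. */
struct demo_regs {
    volatile uint32_t data;
    volatile uint32_t ctrl;
};

static inline void dsb_sy(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" : : : "memory");
#else
    __asm__ volatile("" : : : "memory");  /* compiler barrier only */
#endif
}

static void demo_start(struct demo_regs *regs, uint32_t payload)
{
    assert(regs != NULL);
    regs->data = payload;  /* first store in program order */
    regs->ctrl = 1u;       /* second store; volatile keeps the compiler from
                              swapping it with the first one */
    dsb_sy();              /* later instructions wait for both stores to be
                              acknowledged (early ack permitted for nGnRE) */
}
```

On real hardware regs would point at the peripheral's mapped address; a plain struct in ordinary memory is enough to compile and exercise the code, but it says nothing about bus behaviour.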
Edit: The store-buffer, referred to above, is the one typically found in the LSU/L1-controller.
Edit: 'This step signals the completion of the store' is incorrect. Should be 'This step signals the completion of the store instruction'.