ARM Architecture Versions - Reference - Professional Embedded ARM Development (2014)

Professional Embedded ARM Development (2014)

Part II. Reference

Appendix B. ARM Architecture Versions

ARM architecture versions are often a source of confusion. ARM architecture versions (designs) are written as ARMv, whereas ARM cores (the CPU) are written as ARM. Also, ARM cores do not always have the same first number as their architecture. The ARM940T is based on the ARMv4 architecture, whereas the ARM926EJ-S is based on the ARMv5 architecture. The following table lists the different ARM Architectures and their associated families.

















Cortex-M0, Cortex-M0+, Cortex-M1


Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A12, Cortex-A15


Cortex-R4, Cortex-R5, Cortex-R7






Cortex-A53, Cortex-A57


The first ARM processor was created April 26, 1985. It was targeted as a coprocessor for the BBC Micro, one of Acorn’s best-selling computers. Only a few hundred were ever made. It was originally designed to help Acorn work on the ARM2 processor but was sold to third-party developers to accustom them to the new architecture. You can imagine the thrill of using an 8-bit CPU running with a 32-bit coprocessor.

The ARM1 was a revolution for its time; it was a fully functional 32-bit processor with 26-bit addressing. It had 16 general-purpose 32-bit registers, all instructions were 32-bit, and the instruction set was orthogonal, meaning that the instructions were not tied to any particular register. This was contrary to the 6502, which had LDA and LDX instructions and loaded only one particular register. Acorn’s philosophy, from the start, was simplicity.


Following on the success of the ARM1, Acorn started work on the ARM2. Less than a year after the release of ARM1, ARM2 became the first commercially available RISC processor. The ARM2 was also based on the second version of the ARM architecture, called ARMv2.

The major weak point for ARM1 was the lack of hardware multiplication support. Multiplication was done in software using shifts and additions, but the general effect was considered to be “horribly slow.” ARMv2 fixed this, adding two instructions: MUL and MLA.

Another weakness of the ARMv1 structure was the lack of floating-point hardware. Acorn decided to address this problem by adding hardware support for coprocessors and intended to develop and deliver a floating-point coprocessor at a later date.

Another change was in the Fast Interrupt controller. Two new registers were added. Instead of banking registers R10 to R15, R8 and R9 were added to the list, increasing performance by reducing memory access to the stack.

ARM3, still based on the ARMv2 architecture, was released in 1989. The clock speed was increased to 25MHz, giving a performance of approximately 13 MIPS (compared to the 4 MIPS of ARM2). It contained approximately 300,000 transistors.

The ARM3 was the first ARM processor to use cache. ARM elected to use a fairly simple caching model: a 64-way set-associative cache with random write through a replacement method and 128-bit cache lines. Alterations were made to the coprocessor interface to support this, and the cache system was designated coprocessor zero.


This was the period where ARM spun itself away from Acorn. Thus, for some reason, there was no ARM4 or ARM5.

ARM6 introduced 32-bit addressing support, while still retaining compatibility to the previous 26-bit mode. Two new processor modes were added for handling memory fetch errors and undefined instructions, and two new registers were added: the Current Processor State Register (CPSR) and the Stored Processor State Register (SPSR). This now enabled ARM cores to use virtual memory without the need for previous, tedious tasks.

ARM6 was clocked at 20, 30, and 33 MHz and produced approximately 17, 26, and 28 MIPS average, respectively. It was also power-efficient, enabling a low (for the time) 3.3 v core voltage. This was the beginning for mobile embedded systems because the first product to use an ARM6 was the Apple Newton MessagePad.

ARM7 continued on the success of ARM6. The company, ARM Limited, was a huge success and added more features as the wider market requested them.

ARM7 doubled the size of the cache to 8 k and also doubled the size of the translation look-aside buffer. These changes increased performance over ARM6 by 40 percent.

ARM7 also introduced an extended instruction set which, quite logically for an ARM extension, has been named Thumb. Thumb is a second 16-bit instruction set, allowing (theoretically) programs to be one-half of their memory size. However, by using only eight registers, and a lack of conditional execution support, it ran slower but was a response to other embedded 8-bit and 16-bit processors.

ARM7 also introduced hardware debugging. Previously, engineers had to rely on the software ARMulator for debugging, but with the ARM7, on-chip debugging was possible. The target system can be run as normal, but with external hardware and software, the developer can set breakpoints, step through code, and examine registers and memory.

ARM7 also included advanced multiplication; it included both 32-bit and 64-bit multiplication and multiplication/accumulation, enabling ARM processors to be used in applications where DSPs were more traditionally used. The advanced multiplication core was so successful that it was used in ARM8, ARM9, and StrongARM processor cores.


ARM7-TDMI (Thumb + Debug + Multiplier + ICE) is an improvement of the original ARM7 core, based on the ARMv4T architecture. It is capable of 130 MIPS and was one of the most widely used cores for embedded systems. They were used on Apple iPods, Nintendo’s Game Boy Advance, and most of the major mobile telephones.

ARM9-TDMI is a successor to the popular ARM7-TDMI, still using the ARMv4T architecture. They have reduced heat production and clock frequency improvements, both at the cost of adding more transistors. A lot of work was done on the ARM9, the pipeline was greatly improved, and most instructions were executed in only one clock cycle.


The ARMv5 architecture, used in ARM9 and ARM10, introduced Jazelle DBX, or Direct Bytecode Execution, enabling execution of Java bytecode in hardware. Aimed mainly at the mobile phone market, Jazelle enabled Java ME applications and games to run faster by converting recognized bytecodes into native ARM instructions. ARM claims that approximately 95 percent of bytecode in typical programs ends up being directly processed in hardware.

The ARMv5 introduced saturating arithmetic instructions, enabling more intense calculations without the risk of overflow. With four dedicated instructions, calculations can be made that, when exceeding the maximum size of a 32-bit integer, set the overflow bit but return the maximum allowed value (−231 or 231 −1).

The ARM926EJ-S, one of the most popular ARM cores, is based on the ARMv5 architecture.


In 2002, ARM started licensing the ARMv6 core, namely, the ARM11 family. The ARMv6 architecture implemented Single Instruction, Multiple Data instructions (SIMD), heavily used in the mobile telephone market for MPEG-4. The addition of SIMD instructions effectively doubled MPEG-4 processing speed.

ARMv6 also solved some of the problems faced with data alignment; from then on, unaligned data access and mixed-endian data access was supported.

The core pipeline was increased from a five-stage pipeline to an eight-stage pipeline, increasing clock speeds with expected speeds at 1 GHz.


The ARMv6 architecture addressed the most demanding applications, but ARM was faced with a problem. With more and more clients requiring higher clock speeds and more data crunching power, it was clear that ARM was giving its customers everything they needed. However, some clients were no longer interested in such a complex architecture and were looking for something more light-weight, while still retaining all the advantages of ARM’s technological research. The ARMv6-M architecture was created.

The ARMv6-M architecture introduced the Cortex-M core, designed for Microcontroller applications, and uses the Thumb subset. This microcontroller was much smaller than previous processors, and addressed clients that needed ultralow-powered devices with a small footprint. NXP’s UM10415, based on the Cortex M0, ran at 48 MHz and had 32 Kb of flash memory and 8 Kb of RAM. With 25 GPIO lines, a UART port, SPI and I2C controllers, it was designed for ultramobility. Its footprint was 7 mm by 7 mm.

The ARMv6-M architecture was designed for simplicity. Simplicity for the developer, who could develop an entire embedded system from C without even touching assembly, but also simplicity from an electronics point of view, vastly reducing power and heat. It no longer included the ARM instruction set; it relied solely on Thumb-1 and Thumb-2 instructions. The pipeline was reduced to a two-stage pipeline, slightly decreasing performance, but vastly decreasing power usage. Although performance is lower on an ARMv6-M compared to an ARMv5, target devices did not need the processing power delivered by ARM11 processors. Cortex M processors are designed for microcontroller applications, with advanced I/O capability, but reduced processing power.


With the success of the ARMv6, and with the introduction of the new Cortex line, ARM introduced the ARMv7. This subset of the architecture is used for Cortex-A and Cortex-R processors. Cortex-M uses its own architecture subset.

ARMv7 includes optional virtualization technology, and for the processors that include this (Cortex-A7 and Cortex-A15), hardware division is also supported.


The ARMv7-M architecture is derived from the ARMv7-AR but excludes all the functions that the Cortex-M cannot use; it does not contain the ARM assembly language, but only Thumb and Thumb-2. The ARMv6-M architecture was a huge success, and the v7-M architecture extended the architecture to enable the full subset of Thumb and Thumb-2 to be used. It also added something that was missing: divide instructions. Processors based on the ARMv7-M architecture (the Cortex-M3 and Cortex-M4) support hardware division, saturated math, and an accelerated hardware multiplier.


The ARMv8 architecture is a switch to 64-bit computing. Featuring the new A64 instruction set, these 64-bit processors remain binary-compatible with 32-bit versions and can run 32-bit applications inside a 64-bit operating system. They also retain full compatibility with Thumb and Thumb-2, easing new development.

Capable of addressing 64-bits of memory, they also have advanced features for cache management and SIMD instructions, making them the ideal processors for demanding mobile applications, like video editing and extreme multimedia. ARMv8 is also an excellent processor for server applications, and numerous manufacturers are looking closely at this design to solve some of the age old problems of server farms—heat, power consumption, and space requirements.

Today, there are two processor designs in ARMv8; the Cortex-A53 and the Cortex-A57. Both of these designs can be used together using ARM’s big.LITTLE technology, and support up to 16 cores.