Professional Embedded ARM Development (2014)
Part II. Reference
Appendix A. Terminology
When studying embedded systems and reading through technical documentation, you encounter a lot of terms that, at first, make no sense. A processor might be capable of a certain number of MIPS, or maybe it supports JTAG debugging, or even SIMD, but what exactly does that mean? This appendix presents some of the most common terms and an explanation of each.
During a branch operation, when branching to a new portion of code, there can be a performance hit while the processor fetches instructions from another portion of memory. To reduce this penalty, some ARM implementations include branch prediction hardware, which fetches instructions from memory according to the outcome the branch predictor expects. Newer branch prediction hardware can reach 95 percent accuracy, greatly increasing branch performance. Some designs sidestep misprediction entirely by speculatively executing both sides of a branch and discarding one of the two when the actual result is known.
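As a rough illustration of how prediction hardware tracks branch outcomes, the following is a minimal C sketch of a classic two-bit saturating counter, a common building block in branch predictors (the structure and function names are illustrative, not from any specific ARM core):

```c
#include <stdint.h>

/* Illustrative model of a 2-bit saturating-counter branch predictor.
   States 0-1 predict "not taken", states 2-3 predict "taken". */
typedef struct { uint8_t state; } predictor_t;

/* Returns the current prediction: 1 = taken, 0 = not taken. */
int predict(const predictor_t *p) { return p->state >= 2; }

/* Trains the counter with the actual branch outcome, saturating
   at the ends so a single surprise does not flip a strong bias. */
void train(predictor_t *p, int taken)
{
    if (taken && p->state < 3)
        p->state++;
    else if (!taken && p->state > 0)
        p->state--;
}
```

Because the counter saturates, a branch that is almost always taken keeps being predicted as taken even after one unexpected outcome.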
When memory became cheaper and systems started using more of it, it became clear that too much time was spent fetching data from memory or writing data back to it. Analysis showed that memory accesses exhibit spatial locality; in essence, if an access is made to a particular location in memory, there is a high probability that accesses to that location or to neighboring locations will follow within the lifetime of a program. To speed up the process, a cache was put in place.
A CPU cache is a faster form of memory used to reduce the average time to access memory. The cache is much smaller, but much faster, than system memory, and stores copies of portions of system memory. When the CPU requests data that is present in the cache, it fetches the data directly from the cache, resulting in much faster access times. Modified cache contents, known as dirty data, are written back to memory when the cache space is required for new data, or during explicit maintenance operations.
All application-class ARM processors since ARM9 have had a Harvard cache architecture, where instruction cache and data cache are separated. Most Cortex-A processors support a two-level cache architecture in which the most common configuration is a pair of separate L1 caches, backed by a unified L2 cache.
When the processor requests memory and the data is found in the cache, this is known as a cache hit; access to system memory is not required.
Instead of reading single words from memory, data is transferred between memory and cache in fixed-size blocks called cache lines. The requested location is read or written along with the memory surrounding it, because it is highly likely that data located near this location will be used in the near future.
When the processor requests memory and the data is not found in the cache, this is known as a cache miss; the processor must now access system memory before the operation can complete.
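The hit, miss, and cache-line behavior described above can be sketched with a toy direct-mapped cache model in C. The sizes here (four lines of 16 bytes) are deliberately tiny and illustrative, not taken from any real ARM core:

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  16   /* bytes per cache line (illustrative) */
#define NUM_LINES  4    /* number of lines in this toy cache   */

/* Toy direct-mapped cache: only tags and valid bits are modeled,
   which is enough to decide hit versus miss for an address. */
typedef struct {
    uint32_t tag[NUM_LINES];
    int      valid[NUM_LINES];
} cache_t;

void cache_init(cache_t *c) { memset(c, 0, sizeof *c); }

/* Returns 1 on a cache hit, 0 on a miss (filling the line on a miss). */
int cache_access(cache_t *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_SIZE;   /* which memory block        */
    uint32_t index = line % NUM_LINES;   /* which slot it maps to     */
    uint32_t tag   = line / NUM_LINES;   /* identifies the block held */

    if (c->valid[index] && c->tag[index] == tag)
        return 1;                        /* hit: data already cached  */
    c->valid[index] = 1;                 /* miss: fetch the whole line */
    c->tag[index]   = tag;
    return 0;
}
```

Note how an access to one address makes later accesses to the same 16-byte line hit, which is exactly the spatial-locality payoff of transferring whole cache lines.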
In the early days of ARM, cores supported a coprocessor architecture. Although ARM processors do not have any instructions similar to Intel’s x86 CPUID commands, ARM chips enable up to 16 coprocessors to be connected to a core. The ARM architecture provided an elegant way to extend the instruction set using coprocessors: an instruction that is not recognized by the ARM core is offered to each coprocessor until one of them accepts it. Coprocessors let designers add extended functionality without heavily modifying an ARM core; keeping the power and the simplicity of the core, a designer can add a coprocessor to handle things that the ARM core was not designed to do. For example, XScale processors have a DSP coprocessor and advanced interrupt handling functions implemented directly as coprocessors, and an integrator could even attach an HD video decoder this way.
The last ARM core to support the external coprocessor architecture was the ARM1176. Today, the ARM architecture still defines the coprocessor instruction set and uses “coprocessor instructions,” but the coprocessor architecture no longer exists as separate hardware. Configuration, debug, trace, VFP, and NEON instructions still exist, but there is no external hardware associated with them; all the functions are now internal to the core.
CP10 defines coprocessor instructions for Vector Floating Point (VFP).
CP11 defines coprocessor instructions for NEON, ARM’s Advanced SIMD extension.
CP14 defines coprocessor instructions for the debug unit. These facilities assist the development of application software, operating systems, and hardware. The debug unit enables stopping program execution, examining and altering processor and coprocessor state, altering memory and peripheral state, and restarting the processor core.
CP15 is a special coprocessor, designed for memory and cache management, designated as the system control coprocessor. CP15 is used to configure the MMU, TCM, cache, cache debug access, and system performance data. Each ARM processor can have up to 16 coprocessors, named CP0 to CP15, and CP15 is reserved by ARM.
Each CPU has a frequency, measured in hertz. The hertz is a unit of frequency, defined as the number of cycles per second of a periodic phenomenon. A processor’s clock is, basically, a signal that repeatedly goes from a logical 0 to a logical 1 and back again, millions of times a second. A cycle begins every time the signal goes from a logical 0 to a logical 1. In an 800 MHz processor, that is, a processor running at 800,000,000 Hz, there are 800 million cycles a second. Different operations require a different number of cycles to complete. Although some instructions may take only one cycle, access to the memory subsystem can sometimes take thousands, if not tens of thousands, of cycles. Also, some processors can do several things in one cycle; while one part of the CPU is busy executing an instruction, another is already fetching the next one.
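The arithmetic in the paragraph above is worth making concrete. A minimal C sketch (the function names are my own) converts a clock frequency into a cycle time, and a cycle count into a latency:

```c
#include <stdint.h>

/* Duration of one clock cycle, in nanoseconds.
   At 800 MHz, one cycle lasts 1e9 / 800e6 = 1.25 ns. */
double cycle_time_ns(uint64_t freq_hz)
{
    return 1e9 / (double)freq_hz;
}

/* Latency of an operation that takes a given number of cycles.
   A 1000-cycle memory access at 800 MHz costs 1250 ns. */
double op_latency_ns(uint64_t cycles, uint64_t freq_hz)
{
    return (double)cycles * cycle_time_ns(freq_hz);
}
```

This is why a cache matters so much: an instruction that completes in one 1.25 ns cycle can be stalled for microseconds waiting on a slow memory access.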
An exception is a condition triggered when normal program flow is interrupted by an internal or external event. It can be caused by attempting to read protected memory (or a memory section that does not exist) or by dividing by zero. When an exception occurs, normal execution is halted, and the Program Counter is set to the address of the relevant exception handler.
An interrupt is an internal or external signal informing the processor that something requires its attention. There are several types of interrupts: normal interrupts, fast interrupts, and software interrupts. They are called interrupts because of their effect on the processor, which has to “interrupt” its execution sequence to service them. This is done by an interrupt handler.
Jazelle was originally designed to enable Java bytecode to be executed directly by the processor, implemented as a third execution state alongside ARM and Thumb. Support for Jazelle is indicated by the letter J in the processor name (for example, the ARM926EJ-S), and is a requirement for ARMv6; however, newer devices include only a trivial implementation. With the advances in processor speeds and performance, Jazelle has been deprecated and should not be used for new applications.
Short for Joint Test Action Group, JTAG is the common name for the IEEE 1149.1 test access port and boundary scan architecture. Although originally devised by electronic engineers as a way of testing printed circuit boards using boundary scans, today it is widely used to debug embedded systems. Several JTAG probes exist, enabling programmers to take control of processors and to help with debugging and performance analysis.
Short for Million Instructions per Second, MIPS was an early attempt at benchmarking. A 1 MHz processor, executing one instruction per clock, is a 1 MIPS processor. However, not every instruction executes in a single clock cycle, far from it. The Intel 4004, the first single-chip CPU, was benchmarked at 0.07 MIPS.
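The definition reduces to a single division, sketched here in C (a hypothetical helper, not part of any benchmark suite):

```c
#include <stdint.h>

/* MIPS = instructions executed / (elapsed seconds * one million).
   A processor retiring one instruction per cycle at 800 MHz for
   one second scores 800 MIPS by this definition. */
double mips(uint64_t instructions, double seconds)
{
    return (double)instructions / (seconds * 1e6);
}
```

The formula also shows why the metric aged badly: it counts instructions, not useful work, so two processors with very different instruction sets can score the same MIPS while doing very different amounts of computation.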
MIPS was a popular benchmarking method until processors arrived at a speed of more than 1 GHz, where MIPS no longer had any real meaning. Because single instructions carry out varying amounts of work, the idea of comparing computers by numbers of instructions became obsolete. Today, the Dhrystone benchmark, while dated, is often quoted in benchmark results. To accurately estimate a processor’s speed, a variety of benchmarks are used.
NEON is an extension of the original SIMD instruction set, often referred to as the Advanced SIMD Extensions. It extends the SIMD concept with instructions that operate on 64-bit registers (D, for double word) and 128-bit registers (Q, for quad word).
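To make the register layout concrete, here is a plain-C model of one Advanced SIMD idea: a 64-bit D register treated as four 16-bit lanes, added lane by lane with no carry between lanes. Real NEON code would use an instruction such as VADD.I16 or compiler intrinsics; this sketch only illustrates the data layout:

```c
#include <stdint.h>

/* Adds two 64-bit values as four independent 16-bit lanes.
   Each lane wraps around on overflow; no carry crosses lanes. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t la  = (uint16_t)(a >> (16 * lane));
        uint16_t lb  = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(la + lb);  /* per-lane wraparound */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```

A real SIMD unit performs all four lane additions in parallel in a single instruction, which is where the speedup over a scalar loop comes from.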
Processors with out-of-order execution can reorganize the instructions inside the pipeline to improve efficiency and avoid pipeline stalls. When an instruction creates a stall, the processor may attempt to execute another instruction that does not depend on the pending result, even if it comes later in program order.
To accelerate instruction throughput, pipelines were introduced. Instead of the processor completing one instruction at a time, fetching, decoding, executing, and writing data before starting the next, these steps become stages of a pipeline. A pipeline is composed of several stages, each responsible for one specific action: fetch, decode, execute, data write, and so on. An instruction makes its way through the pipeline, passing through each stage. On each clock cycle, every stage runs; while one instruction is being decoded, another is being executed, and so on.
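The throughput benefit can be expressed as back-of-the-envelope arithmetic. With an S-stage pipeline and no stalls, one instruction completes per cycle once the pipeline is full, so n instructions take n + S - 1 cycles instead of n * S. A small C sketch (helper names are my own):

```c
/* Cycles for n instructions on an ideal S-stage pipeline:
   S - 1 cycles to fill the pipeline, then one completion per cycle. */
unsigned pipelined_cycles(unsigned n_instructions, unsigned n_stages)
{
    if (n_instructions == 0)
        return 0;
    return n_instructions + n_stages - 1;
}

/* Cycles for the same work if each instruction must finish
   all S stages before the next one starts. */
unsigned unpipelined_cycles(unsigned n_instructions, unsigned n_stages)
{
    return n_instructions * n_stages;
}
```

For 100 instructions on a five-stage pipeline, that is 104 cycles instead of 500, which is why stalls (and the branch prediction hardware that avoids them) matter so much.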
A register is a small amount of storage available directly inside the CPU. ARM is a load/store architecture: calculations are performed only on registers, and an ARM CPU does not write results directly into system memory. To complete an operation, data must first be loaded from memory into a register; after the calculation completes, the register contents are written back to system memory.
Registers are the fastest of all memories, capable of being read and written in a single processor cycle.
Short for Single Instruction Multiple Data, SIMD instructions operate on several data items packed into a single register, which is useful for multimedia applications where the same mathematical operation must be performed on large amounts of data. SIMD has since been augmented by NEON.
System on a Chip designates a computer chip containing a processor core and the required peripheral devices embedded directly onto the same microchip. SoCs tend to have built-in memory, input and output peripherals, and sometimes graphics engines, but they often require external devices to be effective (especially flash and RAM).
ARM cores are available in two formats: as a hard macro, where the physical layout of the ARM core is fixed and peripherals are added around the existing form, or as a synthesizable core, where the ARM core is delivered as Verilog source. In the synthesizable form, device manufacturers can perform custom modifications, tweaking the design for higher clock speeds, smaller size, or lower power consumption.
TrustZone is a security extension for select ARM processors, providing two virtual processors backed by hardware-based access control. The application core can switch between the two states (referred to as worlds), which prevents data from leaking from one world to the other. TrustZone is typically used to run a rich operating system in the less trusted world and more specialized security code (for example, DRM) in the more trusted world.
A vector table is a place in memory containing the responses to exceptions. On ARMv7-A and ARMv7-R, it is eight words long and contains simple jump instructions. On ARMv7-M, it is much larger and doesn’t contain instructions, only memory addresses; put simply, it contains pointers to where the real code lies. For example, when an ARMv7-A/R CPU receives an interrupt, an exception is taken, setting the PC to a specific location in the vector table. The instruction there makes the processor jump to the area of code responsible for handling that particular exception. It is your job to correctly populate the vector table and to make sure that the vectors are correct.
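An ARMv7-M-style table (addresses, not instructions) can be sketched in C as an array of function pointers. On real hardware the table would be placed at the vector base address with a linker-section attribute; the handler names, the stack pointer value, and the entry positions after Reset are illustrative here:

```c
#include <stdint.h>

typedef void (*handler_t)(void);

/* Records which handler ran last, so the dispatch can be observed. */
volatile int g_last_vector = -1;

void reset_handler(void)     { g_last_vector = 1; }
void hardfault_handler(void) { g_last_vector = 2; }

/* Entry 0 holds the initial stack pointer value (an address, not
   code); subsequent entries point at handler functions.
   The 0x20002000 value is a made-up RAM address for illustration. */
const handler_t vector_table[] = {
    (handler_t)(uintptr_t)0x20002000,  /* initial stack pointer    */
    reset_handler,                     /* Reset                    */
    hardfault_handler,                 /* fault (illustrative slot) */
};

/* Models what the exception hardware does: look up the entry for
   the exception number and jump to it. */
void dispatch(int vector)
{
    vector_table[vector]();
}
```

Populating this array with the right handler in each slot is exactly the "correctly populate the vector table" job described above; a wrong or missing entry sends the processor to the wrong code the moment that exception fires.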