Professional Embedded ARM Development (2014)
Part II. Reference
Appendix D. NEON Intrinsics and Instructions
This appendix contains information on, and a list of instructions used with, the NEON engine. Data types, lane types, and intrinsics are listed.
DATA TYPES
Table D-1 lists the different data types supported on the NEON engine, and the corresponding C data types.
TABLE D-1: NEON Data Types
DATA TYPE |
D-REGISTER (64 BITS) |
Q-REGISTER (128 BITS) |
Signed integers |
int8x8_t |
int8x16_t |
int16x4_t |
int16x8_t |
|
int32x2_t |
int32x4_t |
|
int64x1_t |
int64x2_t |
|
Unsigned integers |
uint8x8_t |
uint8x16_t |
uint16x4_t |
uint16x8_t |
|
uint32x2_t |
uint32x4_t |
|
uint64x1_t |
uint64x2_t |
|
Floating-point |
float16x4_t |
float16x8_t |
float32x2_t |
float32x4_t |
|
Polynomial |
poly8x8_t |
poly8x16_t |
poly16x4_t |
poly16x8_t |
LANE TYPES
Table D-2 lists the different lane types per class, and the amount of possible types for each class.
TABLE D-2: Data Lane Types
CLASS |
COUNT |
TYPES |
int |
6 |
int8, int16, int32, uint8, uint16, uint32 |
int/64 |
8 |
int8, int16, int32, int64, uint8, uint16, uint32, uint64 |
sint |
3 |
int8, int16, int32 |
sint16/32 |
2 |
int16, int32 |
int32 |
2 |
int32, uint32 |
8-bit |
3 |
int8, uint8, poly8 |
int/poly8 |
7 |
int8, int16, int32, uint8, uint16, uint32, poly8 |
int/64/poly |
10 |
int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16 |
arith |
7 |
int8, int16, int32, uint8, uint16, uint32, float32 |
arith/64 |
9 |
int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32 |
arith/poly8 |
8 |
int8, int16, int32, uint8, uint16, uint32, poly8, float32 |
floating |
1 |
float32 |
any |
11 |
int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, float32 |
ASSEMBLY INSTRUCTIONS
Table D-3 contains a list of NEON instructions, as well as a brief description of each instruction.
TABLE D-3: NEON Instructions
INSTRUCTION |
DESCRIPTION |
VABA |
Absolute difference and Accumulate |
VABD |
Absolute difference |
VABS |
Absolute Value |
VACGE |
Absolute Compare Greater Than or Equal |
VACGT |
Absolute Compare Greater Than |
VACLE |
Absolute Compare Less Than or Equal |
VACLT |
Absolute Compare Less Than |
VADD |
Add |
VADDHN |
Add, Select High Half |
VAND |
Logical AND |
VBIC |
Bitwise Bit Clear |
VBIF |
Bitwise Insert if False |
VBIT |
Bitwise Insert if True |
VBSL |
Bitwise Select |
VCEQ |
Compare Equal |
VCGE |
Compare Greater Than or Equal |
VCGT |
Compare Greater Than |
VCLE |
Compare Less Than or Equal |
VCLS |
Count Leading Sign bits |
VCLT |
Compare Less Than |
VCLZ |
Count Leading Zeroes |
VCNT |
Count set bits |
VCVT |
Convert between different number formats |
VDUP |
Duplicate scalar to all lanes of vector |
VEOR |
Bitwise Exclusive OR |
VEXT |
Extract |
VFMA |
Fused Multiply and Accumulate |
VFMS |
Fused Multiply and Subtract |
VHADD |
Halving Add |
VHSUB |
Halving Subtract |
VLD |
Vector Load |
VMAX |
Maximum |
VMIN |
Minimum |
VMLA |
Multiply and Accumulate |
VMLS |
Multiply and Subtract |
VMOV |
Move |
VMOVL |
Move Long |
VMOVN |
Move Narrow |
VMUL |
Multiply |
VMVN |
Move Negative |
VNEG |
Negate |
VORN |
Bitwise OR NOT |
VORR |
Bitwise OR |
VPADAL |
Pairwise Add and Accumulate |
VPADD |
Pairwise Add |
VPMAX |
Pairwise Maximum |
VPMIN |
Pairwise Minimum |
VQABS |
Absolute Value, Saturate |
VQADD |
Add, Saturate |
VQDMLAL |
Saturating Double Multiply Accumulate |
VQDMLSL |
Saturating Double Multiply and Subtract |
VQDMUL |
Saturating Double Multiply |
VQDMULH |
Saturating Double Multiply returning High half |
VQMOVN |
Saturating Move |
VQNEG |
Negate, Saturate |
VQRDMULH |
Saturating Double Multiply returning High half |
VQRSHL |
Shift left, Round, Saturate |
VQRSHR |
Shift Right, Round, Saturate |
VQSHL |
Shift Left, Saturate |
VQSHR |
Shift Right, Saturate |
VQSUB |
Subtract, Saturate |
VRADDH |
Add, Select High Half, Round |
VRECPE |
Reciprocal Estimate |
VRECPS |
Reciprocal Step |
VREV |
Reverse Elements |
VRHADD |
Halving Add, Round |
VRSHR |
Shift Right and Round |
VRSQRTE |
Reciprocal Square Root Estimate |
VRSQRTS |
Reciprocal Square Root Step |
VRSRA |
Shift Right, Round and Accumulate |
VRSUBH |
Subtract, select High half, Round |
VSHL |
Shift Left |
VSHR |
Shift Right |
VSLI |
Shift Left and Insert |
VSRA |
Shift Right, Accumulate |
VSRI |
Shift Right and Insert |
VST |
Vector Store |
VSUB |
Subtract |
VSUBH |
Subtract, Select High half |
VSWP |
Swap Vectors |
VTBL |
Vector Table Lookup |
VTBX |
Vector Table Extension |
VTRN |
Vector Transpose |
VTST |
Test Bits |
VUZP |
Vector Unzip |
VZIP |
Vector Zip |
INTRINSIC NAMING CONVENTIONS
Intrinsics provide an elegant way to write NEON instructions using C. NEON intrinsics are created using the following structure:
v[q][r]name[u][n][q][_lane][_n][_result]_type
where:
q indicates a saturating operation.
r indicates a rounding operation.
name is the descriptive name of the operation.
u indicates signed-to-unsigned saturation.
n indicates a narrowing operation.
q indicates an operation on 128-bit vectors.
_n indicates a scalar operand supplied as an argument.
_lane indicates a scalar operand taken from the lane of a vector.
result is the result type in short form.
For example, vmul_s16 multiplies two vectors of signed 16-bit values and is equivalent to VMUL.I16. Some examples in C include:
uint32x4_t vec128 = vld1q_u32(i); // Load 4 32-bit values
uint8x8_t vadd_u8 (uint8x8_t,
uint8x8_t); //Add two lanes
int8x16_t vaddq_s8 (int8x16_t, int8x16_t); //Saturating add two lanes