# Discovering Modern C++. An Intensive Course for Scientists, Engineers, and Programmers(2016)

### Chapter 5. Meta-Programming

*Meta-Programs* are programs on programs. Reading a program as a text file and performing certain transformations on it is certainly feasible in most programming languages. In C++, we can even write programs that compute during compilation or transform themselves. Todd Veldhuizen showed that the template type system of C++ is Turing-complete [*49*]. This means that everything computable can be calculated in C++ at compile time.

We will discuss this intriguing asset thoroughly in this chapter. In particular, we will look at three major applications of it:

• Compile-time calculations (§*5.1*);

• Information about and transformations of types (§*5.2*); and

• Code generation (§*5.3*–§*5.4*).

These techniques allow us to make the examples from the preceding chapters more reliable, more efficient, and more broadly applicable.

**5.1 Let the Compiler Compute**

Meta-programming in its full extent was probably discovered thanks to a bug. Erwin Unruh wrote in the early 90s a program that printed prime numbers as *Error Messages* and thus demonstrated that C++ compilers are able to compute. This program is certainly the most famous C++ code that does not compile. The interested reader will find it in *Appendix A.9.1*. Please take this as testimony of exotic behavior and not as inspiration for your future programs.

Compile-time computations can be realized in two ways: backward compatible with template *Meta-Functions* or more easily with constexpr. The last feature was introduced in C++11 and extended in C++14.

**5.1.1 Compile-Time Functions**

c++11/fibonacci.cpp

Even in modern C++, it is not trivial to implement a prime-number test and we therefore start with something simpler: Fibonacci numbers. They can be computed recursively:

**constexpr** long **fibonacci**(long n)

{

return n <= 2 ? 1 : **fibonacci**(n - 1) + **fibonacci**(n - 2);

}

A constexpr in C++11 is in most cases a single return statement. We are further allowed to include certain statements without computation: empty ones, (certain) static_assert, type definitions, and using declarations and directives. Compared to a regular function, aconstexpr is rather restrictive:

• It cannot read or write anything outside the function. That is, no side effects!

• It cannot contain variables.^{1}

*1*. Only in C++11. This limitation was lifted in C++14. We come to this in *Section 5.1.2*.

• It cannot contain control structures like if or for.^{1}

• It can only contain a single computational statement.^{1}

• It can only call functions that are constexpr as well.

In comparison to template meta-functions (§*5.2.5*), on the other hand, a constexpr is perceivably more flexible:

• We can pass in floating-point types.

• We can even process user types (if they can be handled at compile time).

• We can use type detection.

• We can define member functions.

• We can use conditionals (which are simpler than specializations).

• We can call the functions with run-time arguments.

An easy function for floating-point numbers is square:

constexpr double square(**double** x)

{

return x * x;

}

Floating-point types are not allowed as template arguments, and before C++11 it was not possible at all to perform compile-time calculations with floating points. We can generalize the preceding function with a template parameter to all suitable numeric types:

template <typename **T**>

constexpr **T** square(**T** x)

{

return x * x;

}

The generalized function even accepts user types under certain conditions. Whether a type is suitable for constexpr functions depends on subtle details. Simply put, the type definition does not impede the creation of compile-time objects by containing, for instance, volatile members or arrays sized at run time.

The language standard tried to define conditions under which a type is usable in a constexpr function. It turned out that this is not definable as a type property because it depends on how the type is used—more precisely which constructor is called in the considered constexprfunction. Accordingly, the respective type trait is_literal_type was doomed to be useless and declared deprecated in C++14.* ^{2}* A new definition is expected in C++17.

*2*. Nonetheless, attention was paid that the useless result is well defined.

A really nice feature of constexpr functions is their usability at compile and run time, for instance:

long **n**= atoi(argv[1]);

cout "fibonacci(" n ") = " fibonacci(**n**) '\n';

Here we passed the first argument from the command line (which is definitely not known during compilation). Thus, whenever one or more arguments are only known at run time, the function cannot be evaluated at compile time. Only when all function arguments are available at compile time can the function be computed during compilation.

The hybrid applicability of constexpr functions implies in turn that their parameters can only be passed to constexpr functions. Passing a parameter to a regular function would impede the usage at compile time. Conversely, we cannot pass a function parameter to a compile-time evaluation like static_assert as it prevents run-time calls. As a consequence, we cannot express assertions within constexpr functions in C++11. Recent compilers provide assert as constexpr in C++14 mode so it can be used for parameters.

The language standard regulates which functions from the standard library must be realized as constexpr. Some library implementations realize additional functions as constexpr. For instance, the following function:

constexpr long floor_sqrt(long n)

{

return **floor**(**sqrt**(n));

}

is supported by g++ in versions 4.7–4.9. In contrast, floor and sqrt are not constexpr functions in earlier and later versions (or other compilers) so the code above does not compile there.

**5.1.2 Extended Compile-Time Functions**

The restrictions on compile-time functions are relaxed as far as is reasonable in C++14. Now, we can use:

• Void functions, e.g.:

constexpr **void** square(int &x) { x *= x; }

• Local variables, as long as they

– Are initialized;

– Have neither static nor thread storage duration; and

– Have a literal type.

• Control structures, except

– goto (which we do not want to use anyway);

– Assembler code, i.e., an asm block; and

– try-blocks.

The following example is allowed in C++14 but not in C++11 (for multiple reasons):

template <typename T>

constexpr T power(const T& x, int n)

{

**T r**(1);

**while** (--n > 0)

r *= x;

return r;

}

c++14/popcount.cpp

With these extensions, compile-time functions are almost as expressive as regular functions. As a more technical example we will also realize popcount (short for population count) which counts the number of 1-bits in binary data:

constexpr size_t popcount(size_t x)

{

int count= 0;

for (; x != 0; ++count)

x&= x - 1;

return count;

}

Analyzing this algorithm will also contribute to a better understanding of binary arithmetic. The key idea here is that x&= x - 1 sets the least significant bit to zero and leaves all other bits unaltered.

c++11/popcount.cpp

The function can be expressed in C++11 constexpr as well and is even shorter in the recursive formulation:

constexpr size_t popcount(size_t x)

{

return x == 0 ? 0 : popcount(x & x-1) + 1;

}

This stateless, recursive computation might be less comprehensible for some readers, and maybe clearer to others. It is said that finding iterative or recursive programs easier to understand depends on the order in which they are introduced to a programmer. Fortunately, C++ let us implement both.

**5.1.3 Primeness**

We have already mentioned that prime numbers were the topic of the first serious meta-program, although not compilable. Now we want to demonstrate that we can compute them in (compilable) modern C++. More precisely, we will implement a function that tells us at compile time whether a given number is prime. You may ask: “Why on earth do I need this information during compilation?” Fair question. Actually, the author used this compile-time function once in his research for categorizing cyclic groups with semantic concepts (§*3.5*). When the size of the group is a prime number, it is a field, otherwise a ring. An experimental compiler (ConceptGCC [*21*]) enabled such algebraic concepts in C++, and their model declarations contained the compile-time check for primeness (unfortunately before constexpr was available).

c++14/is_prime.cpp

Our algorithmic approach is that 1 is not prime, even numbers are not prime except 2, and for all other numbers we check that they are not divisible by any odd number larger than 1 and smaller than itself:

constexpr bool is_prime(int i)

{

if (i == 1)

return false;

if (i % 2 == 0)

return i == 2;

for (int j= 3; j < i; j+= 2)

if (i % j == 0)

return false;

return true;

}

Actually, we only need to test the divisibility by odd numbers smaller than the square root of the parameter i:

constexpr bool is_prime(int i)

{

if (i == 1)

return false;

if (i % 2 == 0)

return i == 2;

int max_check= static_cast<int>(sqrt(i)) + 1;

for (int j= 3; j < max_check; j+= 2)

if (i % j == 0)

return false;

return true;

}

Unfortunately, this version only works with standard libraries where sqrt is a constexpr (g++ 4.7–4.9 as before). Otherwise we have to provide our own constexpr implementation. For instance, we can use the fixed-point algorithm from *Section 4.3.1*:

constexpr int square_root(int x)

{

double r= x, dx= x;

while (const_abs((r * r) - dx) > 0.1) {

r= (r + dx/r) / 2;

}

return static_cast<int>(r);

}

As you can see, we performed the iterative approach as double and only converted it to int when the result is returned. This leads to a (sufficiently) efficient and portable implementation:

constexpr bool is_prime(int i)

{

if (i == 1)

return false;

if (i % 2 == 0)

return i == 2;

int max_check= square_root(i) + 1;

for (int j= 3; j < max_check; j+= 2)

if (i % j == 0)

return false;

return true;

}

c++11/is_prime.cpp

At the end, we like taking the challenge to implement this (well, the first version of it) with the restricted constexpr in C++11:

constexpr bool is_prime_aux(int i, int div)

{

return div >= i ? true :

(i % div == 0 ? false : is_prime_aux(i, div + 2));

}

constexpr bool is_prime(int i)

{

return i == 1 ? false :

(i % 2 == 0 ? i == 2 : is_prime_aux(i, 3));

}

Here we need two functions: one for the special cases and one for checking the divisibility by odd numbers from 3.

In theory, we can implement every calculation with a C++11 constexpr. It provides all features of μ-recursive functions: constant and successor function, projection, and recursion. In computability theory, it is further proved that the μ-recursive functions are equivalent to a Turing machine so that every computable function can be implemented with μ-recursive functions and in turn with C++11 constexpr. So much for theory. In practice, the effort (and pain) it takes to realize really complicated calculations with that restricted expressiveness is a different kettle of fish.

**Backward compatibility:** Before constexpr was introduced, compile-time calculations were realized with template *Meta-Functions*. They are more limited in their applicability (no float or user types) and harder to implement. If you cannot use C++11 features for some reason or are just interested in historic programming, you are welcome to read the meta-function section in Appendix 5.2.5.

**5.1.4 How Constant Are Our Constants?**

Declaring a (non-member) variable to be const:

const int i= something;

can establish two levels of constancy:

**1.** The object cannot be changed during program execution (always the case).

**2.** The value is already known at compile time (sometimes the case).

Whether the value of i is available during compilation depends on the expression that is assigned to it. If something is a literal:

const long i= 7, j= 8;

then we can use it during compilation, e.g., as a template parameter:

template <long N>

struct static_long

{

**static const** long value= N;

};

static_long<i> si;

Simple expressions of compile-time constants are usually available during compilation:

const long k= i + j;

static_long<k> sk;

When we assign a variable to a const object, it is definitely not available during compilation:

long ll;

cin ll;

const long cl= ll;

static_long<cl> scl; // Error

The constant cl cannot be changed in the program. On the other hand, it cannot be used during compilation as it depends on a run-time value.

There are scenarios where we cannot say from the program sources what kind of constant we have, for instance:

const long ri= floor(sqrt(i));

static_long<ri> sri; // compiles with g++ 4.7-4.9

Here, ri is known during compilation when sqrt and floor are both constexpr in the implementation of the standard library (e.g., in g++ 4.7–4.9); otherwise it is an error when used as a template argument.

To ensure that a constant has a compile-time value, we have to declare it constexpr:

constexpr long ri= floor(sqrt(i)); // compiles with g++ 4.7-4.9

This guarantees that ri is known during compilation; otherwise this line does not compile.

Note that constexpr is stricter for variables than for functions. A constexpr variable accepts only compile-time values while a constexpr function accepts both compile-time and run-time arguments.

**5.2 Providing and Using Type Information**

In *Chapter 3*, we saw the expressive power of function and class templates. However, all those functions and classes contained exactly the same code for all possible argument types. To increase the expressiveness of templates further, we will introduce smaller or larger code variations depending on the argument types. Thus, we first need information on types to dispatch on. Such type information can be technical—like is_const or is_reference—or semantic/domain-specific—like is_matrix or is_pressure. For most technical type information, we will find library support in the headers <type_traits> and <limits> as illustrated in *Section 4.3*. Domain-specific type properties are waiting for us to be implemented.

**5.2.1 Type Traits**

c++11/magnitude_example.cpp

In the function templates we have written so far, the types of temporaries and the return values were equal to one of the function arguments. Unfortunately, this does not always work. Imagine we implement a function that returns out of two values the one with the minimal magnitude:

template <typename T>

T inline min_magnitude(const T& x, const T& y)

{

using std::abs;

**T** ax= **abs**(x), ay= **abs**(y);

return ax < ay ? x : y;

}

We can call this function for int, unsigned, or double values:

double d1= 3., d2= 4.;

cout "min |d1, d2| = " **min_magnitude(d1, d2)** '\n';

If we call this function with two complex values:

std::complex<double> c1(3.), c2(4.);

cout "min |c1, c2| = " **min_magnitude(c1, c2)** '\n';

we will see an error message like this:

no match for operator< in ax < ay

The problem is that abs returns double values here which provide the comparison operator, but we store the results as complex values in the temporaries.

We now have different possibilities to solve this problem. For instance, we could avoid the temporaries altogether by comparing the two magnitudes without storing them. In C++11 or higher, we could leave it to the compiler to deduce the type of the temporaries:

template <typename T>

T inline min_magnitude(const T& x, const T& y)

{

using std::abs;

**auto ax= abs(x), ay= abs(y);**

**return ax < ay ? x : y;****}**

In this section, we choose a more explicit approach for demonstration’s sake: the magnitude type for possible argument types is provided by the user. Explicit type information is less important in newer standards but not entirely superfluous. Furthermore, knowing the basic mechanisms helps us understand tricky implementations.

Type properties are supplied in C++ by *Type Traits*. These are essentially meta-functions with type arguments. For our example, we will write a type trait to provide the magnitude’s type (for C++03 just replace each using with a traditional typedef). This is implemented by template specialization:

template <typename **T**>

struct Magnitude {};

template <>

struct Magnitude<**int**>

{

using type= **int**;

};

template <>

struct Magnitude<**float**>

{

using type= **float**;

};

template <>

struct Magnitude<**double**>

{

using type= **double**;

};

template <>

struct Magnitude<**std::complex<float>** >

{

using type= **float**;

};

template <>

struct Magnitude<**std::complex<double>** >

{

using type= **double**;

};

Admittedly, this code is quite clumsy. We can abbreviate the first definitions by postulating “if we don’t know better, we assume that T’s Magnitude type is T itself.”

template <typename **T**>

struct Magnitude

{

using type= **T**;

};

This is true for all intrinsic types and we handle them all correctly with one definition. A slight disadvantage of this definition is that it incorrectly applies to all types without a specialized trait. Classes for which we know that the definition above is not correct are all instantiations of the class template complex. So we define specializations like this:

template <>

struct Magnitude<**std::complex<double>** >

{

using type= **double**;

};

Instead of defining them individually for complex<float>, complex<double>, and so on, we can use partial specialization for all complex types:

template <typename T>

struct Magnitude<**std::complex<T>** >

{

using type= **T**;

};

Now that the type traits are defined, we can use them in our function:

template <typename T>

T inline min_magnitude(const **T**& x, const **T**& y)

{

using std::abs;

**typename Magnitude<T>::type** ax= abs(x), ay= abs(y);

return ax < ay ? x : y;

}

We can also extend this definition to (mathematical) vectors and matrices, to determine, for instance, the **return** type of a norm. The specialization reads:

template <typename T>

struct Magnitude<**vector<T>** >

{

using type= **T**; // not really perfect

};

However, when the value type of the vector is complex its norm will not be complex. Thus, we do not need the value type itself but its respective Magnitude:

template <typename T>

struct Magnitude<**vector<T>** >

{

using type= **typename Magnitude<T>::type**;

};

Implementing type traits requires some perceivable programming effort, but it pays off by enabling much more powerful programming afterward.

**5.2.2 Conditional Exception Handling**

c++11/vector_noexcept.cpp

In *Section 1.6.2.4*, we introduced the qualifier noexcept which indicates that a function is not allowed to throw an exception (i.e., exception-handling code is not generated and eventual exceptions would kill the program or lead to undefined behavior). For function templates, it may depend on the type argument whether it is exception-free.

For example, a clone function does not throw an exception when the argument type has an exception-free copy constructor. The standard library provides a type trait for this property:

std::is_nothrow_copy_constructible

This allows us to express the exception freedom of our clone function:

#include <type_traits>

template <typename T>

inline T clone(const T& x)

noexcept(**std::is_nothrow_copy_constructible<T>::value**)

{ return T{x}; }

You might feel that this implementation is somewhat disproportionate: the function head is many times the size of the body. Honestly, we share this feeling and think that such verbose declarations are only necessary for heavily used functions under the highest coding standards.

Another use case for conditional noexcept is the generic addition of two vectors that will be exception-free when the bracket operator of the vector class does not throw:

template <typename T>

class my_vector

{

const T& operator[](int i) const **noexcept**;

};

template <typename Vector>

inline Vector operator+(const Vector& x, const Vector& y)

**noexcept**(**noexcept(x[0])**)

{ ... }

The double noexcept certainly needs some familiarization. Here the outer one is a conditional declaration and the inner one the corresponding condition on an expression. This condition holds when a noexcept declaration is found for the respective expression—here for the bracket operator of the type of x. For instance, if we added two vectors of type my_vector, the addition would be declared noexcept.

**5.2.3 A const-Clean View Example**

c++11/trans_const.cpp

In this section, we will use type traits to solve technical problems of *Views*. These are small objects that provide a different perspective on another object. A prototypical use case is the transposition of a matrix. One way to provide a transposed matrix is of course to create a new matrix object with corresponding interchanged values. This is a quite expensive operation: it requires memory allocation and deallocation and copying all data of the matrix with switched values. A view is more efficient as we will see.

**5.2.3.1 Writing a Simple View Class**

In contrast to constructing an object with new data, a view only refers to an existing object and adapts its interface. This works very well for matrix transposition: we only have to switch the roles of rows and columns in the interface:

**Listing 5–1: Simple view implementation**

template <typename Matrix>

class transposed_view

{

public:

using value_type= typename Matrix::value_type;

using size_type= typename Matrix::size_type;

explicit transposed_view(Matrix& A) : **ref**(A) {}

value_type& operator()(size_type r, size_type c)

{ return **ref**(c, r); }

const value_type& operator()(size_type r, size_type c) const

{ return **ref**(c, r); }

private:

Matrix& **ref**;

};

Here we assume that the Matrix class supplies an operator() accepting two arguments for the row and column index and returns a reference to the corresponding entry *a** _{ij}*. We further suppose that type traits are defined for value_type and size_type. This is all we need to know about the referred matrix in this mini-example (ideally we would specify a concept for a simple matrix). Real template libraries like MTL4 of course provide larger interfaces. However, this small example is good enough to demonstrate the usage of meta-programming in certain views.

An object of class transposed_view can be treated like a regular matrix; e.g., it can be passed to all function templates expecting a matrix. The transposition is achieved on the fly by calling the referred object’s operator() with switched indices. For every matrix object, we can define a transposed view that behaves like a matrix:

mtl::dense2D<float> A= {{2, 3, 4},

{5, 6, 7},

{8, 9, 10}};**transposed_view<mtl::dense2D<float> > At(A)**;

When we access At(i, j), we will get A(j, i). We also define a non-const access so that we can even change entries:

**At(2, 0)**= 4.5;

This operation sets A(0, 2) to 4.5.

The definition of a transposed view object does not lead to particularly concise programs. For convenience, we add a function that returns the transposed view:

template <typename Matrix>

inline **transposed_view<Matrix> trans**(Matrix& A)

{

return transposed_view<Matrix>(A);

}

Now we can use the trans elegantly in our scientific software, for instance, in a matrix-vector product:

v= trans(A) * q;

In this case, a temporary view is created and used in the product. Since most compilers will inline the view’s operator(), computations with trans(A) will be as fast as with A.

**5.2.3.2 Dealing with const-ness**

So far, our view works nicely. Problems arise when we build the transposed view of a constant matrix:

**const** mtl::dense2D<float> B(A);

We still can create the transposed view of B but we cannot access its elements:

cout "trans(B)(2, 0) = " trans(B)(2, 0) '\n'; // Error

The compiler will tell us that it cannot initialize a float& from a const float. When we look at the location of the error, we realize that this happens in the non-constant overload of the operator. This raises the question of why the constant overload is not used which returns a constant reference and fits our needs perfectly.

First of all, we should check whether the ref member is really constant. We never used the const declarator in the class definition or the function trans. Help is provided from the *Run-Time Type Identification (RTTI)*. We add the header <typeinfo> and print the type information:

#include <typeinfo>

...

cout "trans(A) is " typeid(tst::trans(A)).name() '\n';

cout "trans(B) is = " typeid(tst::trans(B)).name() '\n';

This will produce the following output with g++:

typeid of trans(A) = N3tst15transposed_viewIN3mtl6matrix7dense2DIfNS2_10

parametersINS1_3tag9row_majorENS1_5index7c_indexENS1_9non_fixed10

dimensionsELb0EEEEEEE

typeid of trans(B) = N3tst15transposed_viewIKN3mtl6matrix7dense2DIfNS2_10

parametersINS1_3tag9row_majorENS1_5index7c_indexENS1_9non_fixed10

dimensionsELb0EEEEEEE

The output is not particularly readable. The types that are printed by RTTI are name-mangled. When we look very carefully, we can see the extra K in the second line that tells us that the view is instantiated with a constant matrix type. Nonetheless, we recommend you not waste your time with mangled names. An easy (and portable!) trick to achieve readable type names is provoking an error message like this:

int ta= trans(A);

int tb= trans(B);

A better way is using a *Name Demangler*. For instance, the GNU compiler comes with a tool called c++filt. By default it only demangles function names and we have to add the flag -t as in the following command pipe: const_view_test|c++filt -t. Then we will see:

typeid of trans(A) = transposed_view<mtl::matrix::dense2D<float,

mtl::matrix::parameters<mtl::tag::row_major, mtl::index::c_index,

mtl::non_fixed::dimensions, false, unsigned long> > >

typeid of trans(B) = transposed_view<mtl::matrix::dense2D<float,

mtl::matrix::parameters<mtl::tag::row_major, mtl::index::c_index,

mtl::non_fixed::dimensions, false, unsigned long> > const>

Now we can clearly see that trans(B) returns a transposed_view with template parameter const dense2D<...> (not dense2D<...>). Accordingly, the member ref has the type const dense 2D<...>&.

When we go one step back it now makes sense. We passed an object of type const dense2D<...> to the function trans which takes a template argument of type parameter Matrix&. Thus, Matrix is substituted by const dense2D<...> and the **return** type is accordinglytransposed_view <const dense2D<...> >. After this short excursion into type introspection, we are certain that the member ref is a constant reference. The following happens:

• When we call trans(B), the function’s template parameter is instantiated with const dense2D<float>.

• Thus, the **return** type is transposed_view<const dense2D<float> >.

• The constructor parameter has type const dense2D<float>&.

• Likewise, the member ref is a const dense2D<float>&.

There remains the question why the non-const version of the operator is called despite our referring to a constant matrix. The answer is that the constancy of ref does not matter for the choice but whether or not the view object itself is constant. To ascertain that the view is also constant, we could write:

const transposed_view<const mtl::dense2D<float> > Bt(B);

cout "Bt(2, 0) = " Bt(2, 0) '\n';

This works but is pretty clumsy. A brutal possibility to get the view compiled for constant matrices is to cast away the constancy. The undesired result would be that mutable views on constant matrices enable the modification of the allegedly constant matrix. This violates our principles so heavily that we do not even show what the code would look like.

**Rule**

Consider casting away const only as the very last resort.

In the following, we will empower you with very strong methodologies for handling constancy correctly. Every const_cast is an indicator of a severe design error. As Herb Sutter and Andrei Alexandrescu phrased it, “If you go const you never go back.” The only situation where we need const_cast is when we deal with const-incorrect third-party software, i.e., where read-only arguments are passed as mutable pointers or references. That is not our fault and we have no choice. Unfortunately, there are still many packages around that are entirely ignorant of theconst qualifier. Some of them are too large for a quick rewrite. The best we can do about it is add an appropriate API on top of it and avoid working with the original API. This saves us from spoiling our applications with const_casts and restricts the unspeakable const_cast to the interface. A good example of such a layer is *Boost::Bindings* [*30*] that provides a const-correct high-quality interface to BLAS, LAPACK, and other libraries with similarly old-fashioned interfaces (to phrase it diplomatically). Conversely, as long as we only use our own functions and classes we are able to avoid every const_cast with more or less extra work.

To handle constant matrices correctly, we could implement a second view class especially for constant matrices and overload the trans function to return this view for constant arguments:

template <typename Matrix>

class **const_transposed_view**

{

public:

using value_type= typename Matrix::value_type;

using size_type= typename Matrix::size_type;

explicit const_transposed_view(const Matrix& A) : ref(A) {}

const value_type& operator()(size_type r, size_type c) const

{ return ref(c, r); }

private:

const Matrix& ref;

};

template <typename Matrix>

inline **const_transposed_view<Matrix> trans**(const Matrix& A)

{

return **const_transposed_view<Matrix**>(A);

}

With this extra class, we solved our problem. But we added a fair amount of code for it. And what is even worse than the code length is the redundancy: our new class const_transposed_view is almost identical to transposed_view except for not containing a non-const operator(). Let’s look for a more productive and less redundant solution. To this end, we introduce in the following two new meta-functions.

**5.2.3.3 Check for Constancy**

Our problem with the view in *Listing 5–1* is that it cannot handle constant types as template arguments correctly in all methods. To modify the behavior for constant arguments, we first need to find out whether an argument is constant. The meta-function is_const that provides this information is very simple to implement by partial template specialization:

template <typename **T**>

struct is_const

{

static const bool value= **false**;

};

template <typename T>

struct is_const<**const T**>

{

static const bool value= **true**;

};

Constant types match both definitions, but the second one is more specific and therefore picked by the compiler. Non-constant types match only the first one. Note that we only look at the outermost type: the constancy of template parameters is not considered. For instance, view<const matrix> is not regarded as constant because the view itself is not const.

**5.2.3.4 Compile-Time Branching**

The other tool we need for our view is a type selection depending on a logical condition. This technology was introduced by Krzysztof Czarnecki and Ulrich W. Eisenecker [*8*]. The *Compile-Time If* is named conditional in the standard library. It can be realized by a rather simple implementation:

**Listing 5–2:** **conditional** **a.k.a. compile-time** **if**

template <bool **Condition**, typename ThenType, typename ElseType>

struct conditional

{

using type= **ThenType**;

};

template <typename ThenType, typename ElseType>

struct conditional<**false**, ThenType, ElseType>

{

using type= **ElseType**;

};

When this template is instantiated with a logical expression and two types, only the primary template (on top) matches when the first argument evaluates to true and ThenType is used in the type definition. If the first argument evaluates to false, then the specialization (below) is more specific so that ElseType is used. Like many ingenious inventions it is very simple once it is found. This meta-function is part of C++11 in header <type_traits>.^{3}

*3*. With C++03, you can use boost::mpl::if_c from Boost’s Meta-Programming Library (MPL). If you program with both the standard type traits and Boost, pay attention to the sometimes deviating naming conventions.

This meta-function allows us to define funny things like using double for temporaries when our maximal iteration number is larger than 100, otherwise float:

using **tmp_type**=

typename conditional<(max_iter > 100), **double**, **float**>::type;

cout "typeid = " typeid(tmp_type).name() '\n';

Needless to say, max_iter must be known at compile time. Admittedly, the example does not look extremely useful and the meta-if is not so important in small isolated code snippets. In contrast, for the development of large generic software packages, it becomes extremely important.

Please note that the comparison is surrounded by parentheses; otherwise the greater-than symbol > would be interpreted as the end of the template arguments. Likewise, expressions containing right shift must be surrounded by parentheses in C++11 or higher for the same reason.

To relieve us from typing typename and ::type when referring to the resulting type, C++14 introduces a template alias:

template <bool b, class T, class F>

using conditional_t= **typename** conditional<b, T, F>**::type**;

If the standard library of your compiler does not provide this alias, you can quickly add it—maybe with a different name or in another namespace to avoid name conflicts later on.

**5.2.3.5 The Final View**

Now we have all we need to revise the view from *Listing 5–1*. The problem was that we returned an entry of a constant matrix as a mutable reference. To avoid this, we can try to make the mutable access operator disappear in the view when the referred matrix is constant. This is possible but more complicated. We will come back to this approach in *Section 5.2.6*.

An easier solution is to keep both the mutable and the constant access operator but choose the **return** type of the former depending on the type of the template argument:

**Listing 5–3:** **const-safe view implementation**

1 template <typename Matrix>

2 class transposed_view

3 {

4 public:

5 using value_type= Collection<Matrix>::value_type;

6 using size_type= Collection<Matrix>::size_type;

7 private:

8 using **vref_type**= conditional_t<is_const<Matrix>::**type**,

9 **const value_type&**,

10 **value_type&**>;

11 public:

12 transposed_view(Matrix& A) : ref(A) {}

13

14 **vref_type** operator()(size_type r, size_type c)

15 { return ref(c, r); }

16

17 const value_type& operator()(size_type r, size_type c) const

18 { return ref(c, r); }

19

20 private:

21 Matrix& ref;

22 };

This implementation differentiates the return type of mutable views between constant and mutable referred matrices. This establishes the desired behavior as the following case study will show.

When the referred matrix is mutable, the **return** type of operator() depends on the constancy of the view object:

• If the view object is mutable, then operator() in line 14 returns a mutable reference (line 10); and

• If the view object is constant, then operator() in line 17 returns a constant reference.

This is the same behavior as in *Listing 5–1*.

If the matrix reference is constant, then a constant reference is always returned:

• If the view object is mutable, then operator() in line 14 returns a constant reference (line 9); and

• If the view object is constant, then operator() in line 17 returns a constant reference.

Altogether, we implemented a view class that supplies write access only when the view object and the referred matrix are both mutable.

**5.2.4 Standard Type Traits**

**Note:** The type traits in the standard originate from the Boost library collection. If you are moving from the Boost type traits to those of the standard, be aware that some names are different. There are also slight differences in the *Section 4.3*.

**5.2.5 Domain-Specific Type Properties**

Now that we have seen how the type traits in the standard library are realized, we can use this knowledge for ourselves and realize domain-specific type properties. Not surprisingly, we pick linear algebra as the domain and implement the property is_matrix. To be on the safe side, a type is only considered as a matrix when we explicitly declare it as such. Thus, types are by default declare as not being a matrix:

template <typename **T**>

struct is_matrix

{

static const bool value= **false**;

};

The standard library supplies a false_type containing just this static constant.* ^{4}* We can save ourselves some typing by deriving from this meta-function and inheriting (See

*Chapter 6*) value from false_type:

*4*. In C++03 you can resort to boost::mpl::false_.

template <typename T>

struct is_matrix

**: std::false_type**

{};

Now we specialize the meta-predicate for all known matrix classes:

template <typename Value, typename Para>

struct is_matrix<**mtl::dense2D<Value, Para>** >

: **std::true_type**

{};

// more matrix classes ...

The predicate can also depend on the template arguments. For instance, we might have a transposed_view class applicable to matrices and vectors (certainly tricky to implement but this is a different question). Certainly, a transposed vector is not a matrix whereas a transposed matrix is:

template <typename Matrix>

struct is_matrix<**transposed_view<Matrix>** >

: **is_matrix<Matrix>**

{};

// more views ...

More likely, we will realize separate views for transposing matrices and vectors, e.g.:

template <typename Matrix>

struct is_matrix<**matrix::transposed_view<Matrix>** >

: **std::true_type**

{};

To ascertain that our view is used correctly, we verify with static_assert that the template argument is a (known) matrix:

template <typename Matrix>

class transposed_view

{

static_assert(**is_matrix<Matrix>::value**,

"Argument of this view must be a matrix!");

// ...

};

If the view is instantiated with a type that is not a matrix (or not declared as such), the compilation aborts and the user-defined error message is printed. Since static_assert does not create run-time overhead, it should be used wherever errors can be detected at the type level or regarding compile-time constants. Up to C++14, the error message must be a literal. Later C++ versions will probably allow us to compose messages with type information.

When we try to compile our test with the static assertion, we will see that trans(A) compiles but not trans(B). The reason is that const dense2D<> is considered different from dense2D<> in template specialization so it is still not considered as a matrix. The good news is that we do not need to double our specializations for mutable and constant types, but we can write a partial specialization for all constant arguments:

template <typename T>

struct is_matrix<**const T**>

: is_matrix<T> {};

Thus, whenever a type T is a matrix, then const T is as well.

**5.2.6 enable_if**

A very powerful mechanism for meta-programming is enable_if, discovered by Jaakko Järvi and Jeremiah Wilcock. It is based on a convention for compiling function templates called *SFINAE: Substitution Failure Is Not An Error*. It means that function templates whose header cannot be substituted with the argument types are just ignored and do not cause an error.

Such substitution errors can happen when the **return** type of a function is a meta-function on a template argument, e.g.:

template <typename T>**typename Magnitude<T>::type**

inline min_abs(const T& x, const T& y)

{

using std::abs;

auto ax= abs(x), ay= abs(y);

return ax < ay ? ax : ay;

}

Here our **return** type is the Magnitude of T. When Magnitude<T> contains no member type for the type of x and y, the substitution fails and the function template is ignored. The benefit of this approach is that a function call can be compiled when for multiple overloads exactly one of them can be successfully substituted. Or when multiple function templates can be substituted and one of them is more specific than all others. This mechanism is exploited in enable_if.

Here we will use enable_if to select function templates based on domain-specific properties. As a showcase, we realize the *L*_{1} norm. It is defined for vector spaces and linear operators (matrices). Although these definitions are related, the practical real-world implementation for finite-dimensional vectors and matrices is different enough to justify multiple implementations. Of course, we could implement *L*_{1} norm for every matrix and vector type so that the call one_norm(x) would select the appropriate implementation for this type.

c++03/enable_if_test.cpp

To be more productive, we like to have one single implementation for all matrix types (including views) and one single implementation for all vector types. We use the meta-function is_matrix and implement is_vector accordingly. Further, we need the meta-function Magnitude to handle the magnitude of complex matrices and vectors. For convenience, we further provide a template alias Magnitude_t to access contained type information.

Next, we implement the meta-function enable_if that allows us to define function overloads that are only viable when a given condition holds:

template <bool Cond, typename T= void>

struct enable_if {

typedef T type;

};

template <typename T>

struct enable_if<false, T> {};

It defines a type only when the condition holds. Our realization is compatible with <type_traits> from C++11. The program snippet here serves as illustration while we use the standard meta-function in production software.

As in C++14, we want to add a template alias for making the notation more concise:

template <bool Cond, typename T= void>

using enable_if_t= typename enable_if<Cond, T>::type;

As before, it saves us from writing the typename-::type pair. Now we have all we need to implement the *L*_{1} norm in the generic fashion we aimed for:

1 template <typename T>

2 enable_if_t<**is_matrix<T>::value**, Magnitude_t<T> >

3 inline one_norm(const **T**& A)

4 {

5 using std::abs;

6 Magnitude_t<T> max{0};

7 for (unsigned c= 0; c < num_cols(A); c++) {

8 Magnitude_t<T> sum{0};

9 for (unsigned r= 0; r < num_cols(A); r++)

10 sum+= abs(A[r][c]);

11 max= max < sum ? sum : max;

12 }

13 return max ;

14 }

15

16 template <typename T>

17 enable_if_t<**is_vector<T>::value**, Magnitude_t<T> >

18 inline one_norm(const **T**& v)

19 {

20 using std::abs;

21 Magnitude_t<T> sum {0};

22 for (unsigned r= 0; r < size(v); r++)

23 sum+= abs(v[r]);

24 return sum;

25 }

The selection is now driven by enable_if in lines 2 and 17. Let us look at line 2 in detail for a matrix argument:

**1.** is_matrix<T> is evaluated to (i.e., inherited from) true_type.

**2.** enable_if_t< > becomes Magnitude_t<T>.

**3.** This is the **return** type of the function overload.

Here is what happens in this line when the argument is not a matrix type.

**1.** is_matrix<T> is evaluated to false_type.

**2.** enable_if_t< > cannot be substituted because enable_if< >::type does not exist in this case.

**3.** The function overload has no **return** type and is erroneous.

**4.** It is therefore ignored.

In short, the overload is only enabled if the argument is a matrix—as the names of the meta-functions say. Likewise the second overload is only available for vectors. A short test demonstrates this:

matrix A= {{2, 3, 4},

{5, 6, 7},

{8, 9, 10}};

dense_vector<float> v= {3, 4, 5}; // from MTL4

cout "one_norm(A) is " one_norm(A) "\n";

cout "one_norm(v) is " one_norm(v) "\n";

For types that are neither matrix nor vector, there will be no overload of one_norm available, and there should not be. Types that are considered both matrix and vector would cause an ambiguity and indicate a design flaw.

**Limitations:** The mechanism of enable_if is quite powerful but can complicate debugging. Especially with old compilers, error messages caused by enable_if are usually rather verbose and at the same time not very meaningful. When a function match is missing for a given argument type, it is hard to determine the reason because no helpful information is provided to the programmer; he/she is only told that no match is found, period. Newer compilers (clang++ ≥ 3.3 or gcc ≥ 4.9) inform the programmer that an appropriate overload was found which is disabled byenable_if.

Furthermore, the enabling mechanism cannot select the most specific condition. For instance, we cannot specialize implementation for, say, is_sparse_matrix. This can be achieved by avoiding ambiguities in the conditions:

template <typename T>

enable_if<is_matrix<T>::value && **!is_sparse_matrix<T>::value**,

Magnitude_t<T> >

inline one_norm(const T& A);

template <typename T>

enable_if<**is_sparse_matrix<T>::value**, Magnitude_t<T> >

inline one_norm(const T& A);

Evidently, this becomes quite error-prone when too many hierarchical conditions are considered.

The SFINAE paradigm only applies to template arguments of the function itself. Member functions cannot apply enable_if on the class’s template argument. For instance, the mutable access operator in line 9 of *Listing 5–1* cannot be hidden with enable_if for views on constant matrices because the operator itself is not a template function.

In the previous examples, we used SFINAE to invalidate the **return** type. This does not work for functions without a **return** type like constructors. In this case, we can introduce a dummy argument with a default value that can be invalidated as well when our condition does not hold. Problematic are functions without optional arguments and customizable **return** types like conversion operators.

Several of these issues can be addressed with *Anonymous Type Parameters*. Since this feature is not frequently used in daily programming, we put it into *Section A.9.4*.

**5.2.7 Variadic Templates Revised**

In *Section 3.10*, we implemented a variadic sum that accepted an arbitrary number of arguments of mixed types. The problem in this implementation was that we did not know an appropriate return type and used that of the first argument. And failed miserably.

In the meantime, we have seen more features and want to re-attack the problem. Our first approach is to use decltype to determine the result type:

template <typename T>

inline T sum(T t) { return t; }

template <typename T, typename ...P>

auto sum(T t, P ...p) -> **decltype( t + sum(p...) )** // Error

{

return t + sum(p...);

}

Unfortunately, this implementation does not compile for more than two arguments. To determine the **return** type for *n* arguments, the **return** type for the last *n –* 1 arguments is needed which is only available after the function is completely defined but not yet for the trailing **return** type. The function above is recursively called but not instantiated recursively.

**5.2.7.1 Variadic Class Template**

Thus, we have to determine the result type first. This can be done recursively by a variadic type trait:

// Forward declaration

template <typename ...P> struct sum_type;

template <typename T>

struct sum_type<**T**>

{

using type= **T**;

};

template <typename T, typename ...P>

struct sum_type<**T, P ...**>

{

using type= **decltype(T() + typename sum_type<P...>::type())**;

};

template <typename ...P>

using sum_type_t= typename sum_type<P...>::type;

Variadic class templates are also declared recursively. For the sake of visibility, we first need the general form as a declaration before we can write the definition. The definition always consists of two parts:

• The composite part—how we define the class with *n* parameters in terms of *n –* 1 parameters; and

• The base case for usually zero or one argument.

The example above uses an expression that did not appear in the variadic function template before: P... unpacks the type pack.

Please note the different compilation behaviors of recursive functions and classes: the latter are instantiated recursively and the former are not. That is the reason why we can use decltype recursively in the variadic class but not in the variadic function.

**5.2.7.2 Decoupling Return Type Deduction and Variadic Computation**

Using the previous type trait, we can realize the variadic function with a reasonable **return** type:

template <typename T>

inline T sum(T t) { return t; }

template <typename T, typename ...P>

inline **sum_type_t<T, P...>** sum(T t, P ...p)

{

return t + sum(p...);

}

This function yields the correct results for the previous examples:

auto s= **sum(-7, 3.7f, 9u, -2.6)**;

cout "s is " s " and its type is "

typeid(s).name() '\n';

auto s2= **sum(-7, 3.7f, 9u, -42.6)**;

cout "s2 is " s2 " and its type is "

typeid(s2).name() '\n';

which are:

s is 3.1 and its type is d

s2 is -36.9 and its type is d

**5.2.7.3 Common Type**

The standard library provides a type trait similar to sum_type named std::common_type in <type_traits> (plus the alias common_type_t in C++14). The motivation of this type trait is that intrinsic C++ types have implicit conversion rules so that the result type of an expression is independent of the operation; it only depends on the argument types. Thus, x + y + z, x - y - z, x * y * z, and x * y + z all have the same type when the variables are intrinsics. For intrinsic types, the following meta-predicate always evaluates to thetrue_type:

is_same<decltype(x + y + z),

common_type<decltype(x), decltype(y), decltype(z)>

Likewise for other expressions.

User-defined types are not guaranteed to return the same types in all operations. Therefore, it can make sense to provide operation-dependent type traits.

The standard library contains a minimum function. This function is limited to two arguments of the same type. Using common_type and variadic templates, we can easily write a generalization:

template <typename T>

inline **T** minimum(const T& t) { return **t**; }

template <typename T, typename ...P>**typename std::common_type<T, P...>::type**

minimum(const T& t, const P& ... p)

{

typedef typename std::common_type<T, P... >::type res_type;

return **std::min(res_type(t), res_type(minimum(p...)))**;

}

To avoid confusion, we called the function minimum. It takes an arbitrary number of arguments with arbitrary types as long as std::common_type and comparison are defined for it. For instance, the expression

minimum(-7, 3.7f, 9u, -2.6)

returns a double with the value -7. In C++14, the variadic overload of minimum simplifies to

template <typename T, typename ...P>

inline **auto** minimum(const T& t, const P& ... p)

{

using res_type= std::common_type_t<T, P...>;

return std::min(res_type(t), res_type(minimum(p...)));

}

by using the template alias and return type deduction.

**5.2.7.4 Associativity of Variadic Functions**

Our variadic implementations of sum added the first argument to the sum of the remaining. That is, the right-most + is computed first. On the other hand, the + operator in C++ is defined left-associative; the left-most + is first calculated in summations. Unfortunately, the corresponding left-associative implementation:

template <typename T>

inline T sum(T t) { return t; }

template <typename **...P**, typename T>

typename std::common_type<**P...**, T>::type

sum(P **...p**, T t)

{

return sum(p...) + t;

}

does not compile. The language does not support splitting off the last argument.

Integer numbers are associative (i.e., the order of calculation does not matter). Floating-point numbers are not due to rounding errors. Thus, we have to pay attention that changes in evaluation order due to the use of variadic templates do not cause numeric instabilities.

**5.3 Expression Templates**

Scientific software usually has strong performance requirements—especially when C++ is involved. Many large-scale simulations of physical, chemical, or biological processes run for weeks or months and everybody is glad when at least a part of this very long execution time can be saved. The same can be said about engineering, e.g., for static and dynamic analysis of large constructions. Saving execution time often comes at the price of program sources’ readability and maintainability. In *Section 5.3.1*, we will show a simple implementation of an operator and discuss why this is not efficient, and in the remainder of *Section 5.3* we will demonstrate how to improve performance without sacrificing the natural notation.

**5.3.1 Simple Operator Implementation**

Assume we have an application with vector addition. We want, for instance, to write the following vector expression:

w = x + y + z;

Say we have a vector class like in *Section 3.3*:

template <typename T>

class vector

{

public:

explicit vector(int size) : my_size(size), data(new T[my_size]) {}

const T& operator[](int i) const { check_index(i); return data[i]; }

T& operator[](int i) { check_index(i); return data[i]; }

// ...

};

We can of course provide an operator for adding such vectors:

**Listing 5–4: Naïve addition operator**

1 template <typename T>

2 inline vector<T> **operator+**(const vector<T>& x, const vector<T>& y)

3 {

4 x.check_size(size(y));

5 vector<T> sum(size(x));

6 for (int i= 0; i < size(x); ++i)

7 sum[i] = x[i] + y[i];

8 return sum;

9 }

A short test program checks that everything works properly:

vector<float> x= {1.0, 1.0, 2.0, -3.0},

y= {1.7, 1.7, 4.0, -6.0},

z= {4.1, 4.1, 2.6, 11.0},

w(4);

cout "x = " x std::endl;

cout "y = " y std::endl;

cout "z = " z std::endl;

w= x + y + z;

cout "w= x + y + z = " w endl;

If this works as expected, what is wrong with it? From the software engineering perspective: nothing. From the performance perspective: a lot.

The following list shows which operation is performed in which line of operator+ when the statement is executed:

**1.** Create a temporary variable sum for the addition of x and y (line 5).

**2.** Perform a loop reading x and y, adding it element-wise, and writing the result to sum (lines 6+7).

**3.** Copy sum to a temporary variable, say, t_xy, in the return statement (line 8).

**4.** Delete sum with the destructor when it goes out of scope (line 9).

**5.** Create a temporary variable sum for the addition of t_xy and z (line 5).

**6.** Perform a loop reading t_xy and z, adding it element-wise, and writing the result to sum (lines 6+7).

**7.** Copy sum to a temporary variable, say, t_xyz, in the return statement (line 8).

**8.** Delete sum (line 9).

**9.** Delete t_xy (after second addition).

**10.** Perform a loop reading t_xyz and writing to w (in assignment).

**11.** Delete t_xyz (after assignment).

This is admittedly the worst-case scenario; however, it really happened with old compilers. Modern compilers perform more optimizations by static code analysis and optimize return values (§*2.3.5.3*), thus avoiding the copies into temporaries t_xy and t_xyz.

The optimized version performs still:

**1.** Create a temporary variable sum for the addition of x and y (for distinction we call it sum_xy) (line 5).

**2.** Perform a loop reading x and y, adding it element-wise, and writing the result to sum (lines 6+7).

**3.** Create a temporary variable sum (for distinction, sum_xyz) for the addition of sum_xy and z (line 5).

**4.** Perform a loop reading sum_xy and z, adding it, and writing the result to sum_xyz (lines 6+7).

**5.** Delete sum_xy (after second addition).

**6.** Perform a loop reading sum_xyz and writing it element-wise to w (in assignment).

**7.** Delete sum_xyz (after assignment).

How many operations did we perform? Say our vectors have dimension *n*, then we have in total:

• 2*n* additions

• 3*n* assignments

• 5*n* reads

• 3*n* writes

• 2 memory allocations

• 2 memory deallocations

By comparison, if we could write a single loop or an inline function:

template <typename T>

void inline **add3**(const vector<T>& x, const vector<T>& y,

const vector<T>& z, vector<T>& sum)

{

x.check_size(size(y));

x.check_size(size(z));

x.check_size(size(sum));

**for** (int i= 0; i < size(x); ++i)

**sum[i] = x[i] + y[i] + z[i]**;

}

This function performs

• 2*n* additions

• *n* assignments

• 3*n* reads

• *n* writes

The call of this function:

add3(x, y, z, w);

is of course less elegant than the operator notation. It is also a little bit more prone to errors: we need to look at the documentation to see whether the first or the last argument contains the result. With operators, the semantics are evident.

In high-performance software, programmers tend to implement a hard-coded version of every important operation instead of freely composing them from smaller expressions. The reason is obvious; our operator implementation performed additionally

• 2*n* assignments

• 2*n* reads

• 2*n* writes

• 2 memory allocations

• 2 memory deallocations

The good news is we have not performed additional arithmetic. The bad news is that the operations above are more expensive. On modern computers, it takes much more time to read large amounts of data from or to write to the memory than executing fixed or floating-point operations.

Unfortunately, vectors in scientific applications tend to be rather long, often larger than the caches of the platform, and the vectors must really be transferred to and from main memory. In *Figure 5–1*, we depict the memory hierarchy symbolically. The chip on top represents the processor, the blue chips beneath it the L1 cache, the disks L2, the floppies the main memory, and the cassettes the virtual memory. The hierarchy consists of small fast memory close to the processor and large slow memory. When a data item from slow memory is read (marked in the second (blue) cassette), a copy is held in every faster memory (second floppy, first disk, first blue L1 cache chip).

**Figure 5–1: Memory hierarchy**

In case of shorter vectors, the data might reside in L1 or L2 cache and the data transfer is less critical. But in this case, the allocation and deallocation become serious slow-down factors.

**5.3.2 An Expression Template Class**

The purpose of *Expression Templates* (ET) is to keep the original operator notation without introducing the overhead induced by temporaries. The technique was independently discovered by Todd Veldhuizen and Daveed Vandevoorde.

c++11/expression_template_example.cpp

The solution to the dilemma of elegance versus performance is the introduction of a class for intermediate objects keeping references to the vectors and allowing us to perform all computations later in one sweep. The addition does not return a vector any longer but an object referring to the arguments:

template <typename T>

class **vector_sum**

{

public:

vector_sum(const vector<T>& v1, const vector<T>& v2)

: v1(v1), v2(v2) {}

private:

const vector<T> &v1, &v2;

};

template <typename T>**vector_sum<T>** operator+(const vector<T>& x, const vector<T>& y)

{

return {x, y};

}

Now we can write x + y but not w= x + y yet. It is not only that the assignment is not defined; we have not yet provided vector_sum with enough functionality to perform something useful in the assignment. Thus, we first extend vector_sum so that it looks like a vector itself:

template <typename T>

class vector_sum

{

public:

// ...

friend int size(const vector_sum& x) { return size(x.v1); }

T operator[](int i) const { return v1[i] + v2[i]; }

private:

const vector<T> &v1, & v2;

};

The most interesting function in this class is the bracket operator: when the *i*-th entry is accessed, we compute the sum of the operands’ *i*-th entries on the fly.

The drawback of computing element-wise sums in the bracket operator is the repeated calculation when the entries are accessed multiple times. This happens, for instance, in matrix vector multiplication when we compute A * (x+y). Thus, it can be beneficial for some operations to evaluate the vector sum upfront instead of doing so element-wise in the access operator.

To evaluate w= x + y, we also need an assignment operator for vector_sum:

template <typename T> class vector_sum; // forward declaration

template <typename T>

class vector

{ // ...

vector& operator=(**const vector_sum<T>**& that)

{

check_size(size(that));

for (int i= 0; i < my_size; ++i)

data[i]= that[i];

return *this;

}

};

The assignment iterates over the data from the current object and over that of the parameter that. Since the latter is a vector_sum, the expression that[i] computes an element-wise sum, here x[i] + y[i]. Thus, in contrast to what the naïve implementation from *Listing 5–4*would perform, the evaluation of w= x + y has now

• Only one loop;

• No temporary vector;

• No additional memory allocation and deallocation; and

• No additional data reads and writes.

In fact, the same operations are performed as in the loop:

for (int i= 0; i < size(w); ++i)

w[i] = x[i] + y[i];

The cost to create a vector_sum object is negligible. The object will be kept on the stack and does not require memory allocation. Even this little effort for creating the object is normally optimized away by most compilers with decent static code analysis.

What happens when we add three vectors? The naïve implementation from *Listing 5–4* returns a vector, and this vector can be added to another vector. Our approach returns a vector_sum, and we have no addition for vector_sum and vector. Thus, we would need another expression template class and a corresponding operation:

template <typename T>

class **vector_sum3**

{

public:

vector_sum3(const vector<T>& **v1**, const vector<T>& **v2**,

const vector<T>& **v3**)

: v1(v1), v2(v2), v3(v3)

{ ... }

T **operator[]**(int i) const { return v1[i] + v2[i] + v3[i]; }

private:

const vector<T> &v1, &v2, &v3;

};

template <typename T>**vector_sum3<T>** inline operator+(const vector_sum<T>& x,

const vector<T>& y)

{

return {x.v1, x.v2, y};

}

Furthermore, vector_sum must declare our new plus operator as **friend** to access its private members, and vector needs an assignment for vector_sum3. This becomes increasingly annoying. Also, what happens if we perform the second addition first: w= x + (y + z)? Then we need another plus operator. What if some of the vectors are multiplied with a scalar like w= x + dot(x, y) * y + 4.3 * z, and this scalar product is also implemented by an ET? Our implementation effort runs into combinatorial explosion and we need a more flexible solution that we introduce in the next section.

**5.3.3 Generic Expression Templates**

c++11/expression_template_example2.cpp

So far, we have started from a specific class (vector) and generalized the implementation gradually. Although this can help us understand the mechanism, we now go directly to the general version working for arbitrary vector types and views thereof:

template <typename V1, typename V2>

inline vector_sum<V1, V2> operator+(const V1& x, const V2& y)

{

return {x, y};

}

We now need an expression class with arbitrary arguments:

template <typename V1, typename V2>

class vector_sum

{

public:

vector_sum(const V1& v1, const V2& v2) : v1(v1), v2(v2) {}

???? operator[](int i) const { return v1[i] + v2[i]; }

private:

const V1& v1;

const V2& v2;

};

This is rather straightforward. The only issue is what type to return in operator[]. For this we must define value_type in each class (an external type trait would be more flexible but we want to keep it as simple as possible here). In vector_sum, we could take the value_type of the first argument, which can itself be taken from another class. This is an acceptable solution as long as the scalar types are identical in the entire application. However, we have seen in *Section 3.10* that we can get completely absurd results with mixed arguments when we do not pay attention to the result types. To be prepared for mixed arithmetic, we take the common_type_t of the arguments’ value types:

template <typename V1, typename V2>

class vector_sum

{

// ...

using value_type= **std::common_type_t**<typename V1::value_type,

typename V2::value_type>;

value_type operator[](int i) const { return v1[i] + v2[i]; }

};

If our class vector_sum does not need the explicit declaration of a value_type, we can use decltype(auto) in C++14 as a return type and leave the type deduction entirely to the compiler. In contrast, trailing return types do not work when the template is instantiated withvector_sum itself as this creates a dependency on itself.

To assign different kinds of expressions to a vector class, we should also generalize the assignment operator:

template <typename T>

class vector

{

public:

template <typename **Src**>

vector& operator=(const **Src**& **tha**t)

{

check_size(size(that));

for (int i= 0; i < my_size; ++i)

data[i]= **that[i]**;

return *this;

}

};

This assignment operator accepts every type except vector<T> for which we need a dedicated copy-assignment operator. To avoid code redundancy, we could implement a method that performs the actual copy and is called from both the general and the copy assignment.

**Advantages of expression templates:** Although the availability of operator overloading in C++ resulted in notationally nicer code, the scientific community refused to give up programming in Fortran or to implement the loops directly in C/C++. The reason was that the traditional operator implementations were too expensive. Due to the overhead of creating temporary variables and of copying vector and matrix objects, C++ could not compete with the performance of programs written in Fortran. This problem has been resolved with the introduction of generics and expression templates. Now we can write extremely efficient scientific programs in a notationally convenient manner.

**5.4 Meta-Tuning: Write Your Own Compiler Optimization**

Compiler technology is progressing, and more and more optimization techniques are provided. Ideally, we all write software in the easiest and most readable way, and the compiler generates the optimal executable out of it. We would only need a newer and better compiler for our programs to become faster and faster. Unfortunately, this only sometimes really works.

In addition to general-purpose optimizations like copy elision (§*2.3.5.3*), compilers provide numeric optimization techniques like *Loop Unrolling*: the loop is transformed such that multiple iterations are performed in one. This decreases the overhead of loop control and increases the potential for concurrent execution. Many compilers apply this technique only on the inner loop whereas unrolling multiple loops often enables even better performance. Several iterative calculations benefit from the introduction of additional temporaries which in turn can require semantic information that is not available for user types or operations.

Some compilers are particularly tuned for specific operations—especially those used in benchmarks and even more specifically for the LINPACK benchmark used to rate the 500 fastest computers in the world (*www.top500.org*). For instance, they can use pattern matching to recognize the typical three-nested loop in canonical dense-matrix multiplication and replace this code with a highly tuned assembler implementation that can be one or more orders of magnitude faster. These programs use seven or nine loops with platform-dependent block size to squeeze out the last bit of every cache level, transpose sub-matrices, run on multiple threads, perform a filigree register choreography, and so on.^{5}

*5*. One could sometimes get the impression that the High-Performance Computing (HPC) community believes that multiplying dense matrices at near-peak performance solves all performance issues of the world or at least demonstrates that everything can be computed at near-peak performance if only one tries hard enough. Fortunately, more and more people in the supercomputer centers realize that their machines are not only running dense-matrix operations and that real-world applications are in most cases limited by memory bandwidth and latency.

Having a simple loop implementation replaced with code running at almost peak performance is definitely a great achievement. Unfortunately, this makes many programmers believe that most calculations can be similarly accelerated. Often small changes suffice to fall out of the pattern and the performance is much less spectacular than expected. No matter how generally the patterns are defined, their applicability will always be limited. An algorithmic change (e.g., multiplying triangular instead of rectangular matrices) that does not impede the blocking and unrolling optimizations will probably fall out of the compiler’s special-case optimization.

To make a long story short: compilers can do a lot but not everything. Regardless of how many special cases are tuned by a compiler, there will always be a need for domain-specific optimizations. Alternatively to the techniques in compilers, there are tools like ROSE [*35*] that allow you to transform source code (including C++) with user-defined transformations on the abstract syntax tree (AST, the compiler-internal representation of programs).

A major roadblock for compiler optimization is that certain transformations require semantic knowledge. This is only available for types and operations known to the compiler implementers. The interested reader might also look at a deeper discussion of the topic in [*19*]. Research is going on to provide user-defined transformations with concept-based optimization [*47*]. Unfortunately, it will take time for this to become mainstream; even the new *Concepts Lite* extension planned for C++17 can only be a step toward optimization driven by user semantics since it will only deal with syntax at first.

In the remainder of this chapter, we will show user-defined code transformations with meta-programming in the domain of linear algebra. The goal is that the user writes operations as clearly as possible and the function and class templates reach for the maximal achievable performance. Given the Turing completeness of the template system, we can provide any desirable user interface while realizing an implementation underneath that behaves equivalently to the most efficient code, as we will demonstrate in this section. Writing well-tuned templates is a significant programming, testing, and benchmarking effort. For this to pay off, the templates should be contained in well-maintained libraries that are available to a broad user community (at least within the research team or company).

**5.4.1 Classical Fixed-Size Unrolling**

c++11/fsize_unroll_test.cpp

The easiest form of compile-time optimization can be realized for fixed-size data types, in particular for math vectors like those in *Section 3.7*. Similar to the default assignment, we can write a generic vector assignment:

template <typename T, int **Size**>

class fsize_vector

{

public:

const static int **my_size**= Size ;

template <typename Vector>

self& operator=(const self& that)

{

for (int i= 0; i < **my_size**; ++i)

**data[i]= that[i]**;

}

};

A state-of-the-art compiler will recognize that all iterations are independent from each other; e.g., data[2]= that[2]; is independent of data[1]= that[1];. The compiler will also determine the size of the loop during compilation. As a consequence, the generated binary for anfsize_vector of size 3 will be equivalent to

template <typename T, int **Size**>

class fsize_vector

{

template <typename Vector>

self& operator=(const self& that)

{

**data[0]= that[0]**;

**data[1]= that[1]**;

**data[2]= that[2]**;

}

};

The right-hand-side vector that might be an expression template (§*5.3*) for, say, alpha * x + y and its evaluation will be also inlined:

template <typename T, int **Size**>

class fsize_vector

{

template <typename Vector>

self& operator=(const self& that)

{

**data[0]= alpha * x[0] + y[0]**;

**data[1]= alpha * x[1] + y[1]**;

**data[2]= alpha * x[2] + y[2]**;

}

};

To make the unrolling more explicit and for the sake of introducing meta-tuning step by step, we develop a functor that performs the assignment:

template <typename Target, typename Source, int **N>****struct fsize_assign****{**

**void operator()(Target& tar, const Source& src)**

**{**

**fsize_assign<Target, Source, [*N-1**>()(tar, src);

std::cout "assign entry " N '\n';

**tar[N]= src[N]**;

}

};

template <typename Target, typename Source>

struct fsize_assign<Target, Source, **0**>

{

void operator()(Target& tar, const Source& src)

{

std::cout "assign entry " 0 '\n';

**tar[0]= src[0]**;

}

};

The printouts will show us the execution. To save ourselves from explicitly instantiating the argument types, we parameterize the operator() instead of the class:

template <int N>

struct fsize_assign

{

template <typename **Target**, typename **Source**>

void operator()(Target& tar, const Source& src)

{

fsize_assign<N-1>()(tar, src);

std::cout "assign entry " N '\n';

tar[N]= src[N];

}

};

template <>

struct fsize_assign<0>

{

template <typename **Target**, typename **Source**>

void operator()(Target& tar, const Source& src)

{

std::cout "assign entry " 0 '\n';

tar[0]= src[0];

}

};

Then the vector types can be deduced by the compiler when the operator is called. Instead of a loop implementation, we call the recursive assignment functor in the operator:

template <typename T, int Size>

class fsize_vector

{

static_assert(my_size > 0, "Vector must be larger than 0.");

self& operator=(const self& that)

{

**fsize_assign<my_size-1>**{}(*this, that);

return *this;

}

template <typename Vector>

self& operator=(const Vector& that)

{

**fsize_assign<my_size-1>**{}(*this, that);

return *this;

}

};

The execution of the following code fragment:

fsize_vector<float, 4> v, w;

v[0]= v[1]= 1.0; v[2]= 2.0; v[3]= -3.0;

w= v;

shows the expected 'margin-top:10.0pt;margin-right:0cm;margin-bottom: 10.0pt;margin-left:25.0pt;line-height:normal'>assign entry 0

assign entry 1

assign entry 2

assign entry 3

In this implementation, we replaced the loop with a recursion—counting on the compiler to inline the operations and the loop control. Otherwise the recursive function calls would be even slower than an ordinary loop.

This technique is only beneficial for small loops that run in L1 cache. Larger loops are dominated by loading the data from memory and the loop overhead is irrelevant. To the contrary, unrolling all operations with very large vectors usually decreases the performance because a lot of instructions need to be loaded, so the transfer of the data must wait. As mentioned before, compilers can unroll such operations by themselves and hopefully contain heuristics to decide when it is better not to. We have observed that the automatic unrolling of single loops is sometimes faster than explicit implementations like that above.

One would think that the implementation should be simpler with constexpr at least with the C++14 extensions. Unfortunately, it is not because we would mix compile-time arguments—the size—with run-time arguments—the vector references. Thus, the constexpr would degrade to an ordinary function.

**5.4.2 Nested Unrolling**

In our experience, most compilers unroll non-nested loops. Even good compilers that can handle certain nested loops will not be able to optimize every program kernel, in particular those with many template arguments that are instantiated with user-defined types. We will demonstrate here how to unroll nested loops at compile time with matrix vector multiplication as an example.

c++11/fsize_unroll_test.cpp

For this purpose, we introduce a simplistic fixed-size matrix type:

template <typename T, int **Rows**, int **Cols**>

class fsize_matrix

{

static_assert(Rows > 0, "Rows must be larger than 0.");

static_assert(Cols > 0, "Cols must be larger than 0.");

using self= fsize_matrix;

public:

using value_type= T;

const static int my_rows= Rows, my_cols= Cols;

fsize_matrix(const self& that) { ... }

// Cannot check column index here!

const T* operator[](int r) const { return data[r]; }

T* operator[](int r) { return data[r]; }

mat_vec_et<self, fsize_vector<T, **Cols**> >

operator*(const fsize_vector<T, **Cols**>& v) const

{

return {*this, v};

}

private:

T **data[Rows][Cols]**;

};

The bracket operator returns a pointer for the sake of simplicity whereas a good implementation should return a proxy that allows for checking the column index (see §*A.4.3.3*). The multiplication with a vector is realized by means of an expression template to avoid copying the result vector. Then the vector assignment is overloaded for our expression template type:

template <typename T, int Size>

class fsize_vector

{

template <typename Matrix, typename Vector>

self& operator=(const mat_vec_et<Matrix, Vector>& that)

{

using et= mat_vec_et<Matrix, Vector>;

using mv= fsize_mat_vec_mult<et::my_rows-1, et::my_cols-1>;

mv{}(that.A, that.v, *this);

return *this;

}

};

The functor fsize_mat_vec_mult computes the matrix vector product with respect to the three arguments. The general implementation of the functor reads:

template <int Rows, int Cols>

struct fsize_mat_vec_mult

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

fsize_mat_vec_mult<Rows, Cols-1>()(A, v_in, v_out);

v_out[Rows]+= A[Rows][Cols] * v_in[Cols];

}

};

Again, the functor only takes the size arguments as explicit parameters whereas the types of the containers are deduced. The operator assumes that all smaller column indices are already handled and that we can increment v_out[Rows] by A[Rows][Cols] * v_in[Cols]. In particular, we assume that the first operation on v_out[Rows] initializes the value. Thus, we need a (partial) specialization for Cols = 0:

template <int Rows>

struct fsize_mat_vec_mult<Rows, 0>

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

fsize_mat_vec_mult<Rows-1, Matrix::my_cols-1>()(A, v_in, v_out);

v_out[Rows]= A[Rows][0] * v_in[0];

}

};

The careful reader will have noticed the substitution of += with =. We also have to call the computation for the preceding row with all columns and inductively for all smaller rows. For the sake of simplicity, the number of matrix columns is taken from an internal definition in the matrix type.* ^{6}* We further need a (full) specialization to terminate the recursion:

*6*. Passing this as an extra template argument or taking type traits would have been more general because we would not depend on the my_cols definition in the class.

template <>

struct fsize_mat_vec_mult<0, 0>

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

v_out[0]= A[0][0] * v_in[0];

}

};

With the inlining, our program will execute the operation w= A * v for vectors of size 4 as if we computed

w[0]= A[0][0] * v[0];

w[0]+= A[0][1] * v[1];

w[0]+= A[0][2] * v[2];

w[0]+= A[0][3] * v[3];

w[1]= A[1][0] * v[0];

w[1]+= A[1][1] * v[1];

w[1]+= A[1][2] * v[2];

Our tests have shown that such an implementation is really faster than the compiler optimization on loops.

**5.4.2.1 Increasing Concurrency**

A disadvantage of the preceding implementation is that all operations on an entry of the target vector are performed in one sweep. Therefore, the second operation must wait for the first one, the third for the second, and so on. The fifth operation can be done in parallel with the fourth, the ninth with the eighth, et cetera. However, this is a quite limited concurrency. We like having more parallelism in our program so that parallel pipelines in super-scalar processors or even SSEs are enabled. Again, we can twiddle our thumbs and hope that the compiler will reorder the statements into our favorite order or take it in our own hands. More concurrency is provided when we traverse the result vector and the matrix rows in the “inner” loop:

w[0]= A[0][0] * v[0];

w[1]= A[1][0] * v[0];

w[2]= A[2][0] * v[0];

w[3]= A[3][0] * v[0];

w[0]+= A[0][1] * v[1];

w[1]+= A[1][1] * v[1];

We only need to restructure our functor. The general template now reads:

template <int Rows, int Cols>

struct fsize_mat_vec_mult_cm

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

fsize_mat_vec_mult_cm<Rows-1, Cols>()(A, v_in, v_out);

v_out[Rows]+= A[Rows][Cols] * v_in[Cols];

}

};

Now, we need a partial specialization for row 0 that goes to the next column:

template <int Cols>

struct fsize_mat_vec_mult_cm<0, Cols>

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

fsize_mat_vec_mult_cm<Matrix::my_rows-1,

Cols-1>()(A, v_in, v_out);

v_out[0]+= A[0][Cols] * v_in[Cols];

}

};

The partial specialization for column 0 must also initialize the entry of the output vector:

template <int Rows>

struct fsize_mat_vec_mult_cm<Rows, 0>

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

fsize_mat_vec_mult_cm<Rows-1, 0>()(A, v_in, v_out);

v_out[Rows]= A[Rows][0] * v_in[0];

}

};

Finally, we still need a specialization for row and column 0 to terminate the recursion. This can be reused from the previous functor:

template <>

struct fsize_mat_vec_mult_cm<0, 0>

: fsize_mat_vec_mult<0, 0> {};

Note that we perform the same operation on different data from which *SIMD* architectures can benefit. SIMD stands for *Single Instruction, Multiple Data*. Modern processors contain *SSE* units that perform arithmetic operations simultaneously on multiple floating-point numbers. To use these SSE commands, the processed data must be aligned and contiguous in memory and the compiler must be aware of it. In our examples, we did not address the alignment issue, but the unrolled code makes it clear that identical operations are executed on contiguous memory.

**5.4.2.2 Using Registers**

Another feature of modern processors that we should keep in mind is cache coherency. Processors are nowadays designed to share memory while retaining consistency in their caches. As a result, every time we write a data structure in memory like our vector w, a cache invalidation signal is sent to other cores and processors. Regrettably, this slows down computation perceivably.

Fortunately, the cache invalidation bottleneck can be avoided in many cases, simply by introducing temporaries in functions that reside in registers when the types allow for it. We can rely on the compiler to make good decisions where temporaries are located. C++03 had the keywordregister. However, it was a mere hint and the compiler was not obliged to store variables in registers. Especially when a program is compiled for a target platform that was not considered in the development process, it would do more harm than good if the register use was mandatory. Thus, the keyword was deprecated in C++11 given that compilers have pretty good heuristics to locate variables platform-dependently without help from the programmer.

The introduction of temporaries requires two classes: one for the outer and one for the inner loop. Let us start with the outer loop:

1 template <int Rows, int Cols>

2 struct fsize_mat_vec_mult_reg

3 {

4 template <typename Matrix, typename VecIn, typename VecOut>

5 void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out )

6 {

7 fsize_mat_vec_mult_reg<Rows-1, Cols>()(A, v_in, v_out);

8

9 typename VecOut::value_type tmp;

10 fsize_mat_vec_mult_aux<Rows, Cols>()(A, v_in, tmp);

11 v_out[Rows]= tmp;

12 }

13 };

We assume that fsize_mat_vec_mult_aux is defined or declared before this class. The first statement in line 7 calls the computations on the preceding rows. A temporary is defined in line 9 assuming that it will be located in a register by a decent compiler. Then we start the computation for this matrix row. The temporary is passed as a reference to an inline function so that the summation will be performed in a register. In line 10, we write the result back to v_out. This still causes the invalidation signal on the bus but only once for each entry. The functor must be specialized for row 0 to avoid infinite loops:

template <int Cols>

struct fsize_mat_vec_mult_reg<0, Cols>

{

template <typename Matrix, typename VecIn, typename VecOut>

void operator()(const Matrix& A, const VecIn& v_in, VecOut& v_out)

{

typename VecOut::value_type tmp;

fsize_mat_vec_mult_aux<0, Cols>()(A, v_in, tmp);

v_out[0]= tmp;

}

};

Within each row, we iterate over the columns and increment the temporary (within a register hopefully):

template <int Rows, int Cols>

struct fsize_mat_vec_mult_aux

{

template <typename Matrix, typename VecIn, typename ScalOut>

void operator()(const Matrix& A, const VecIn& v_in, ScalOut& tmp)

{

fsize_mat_vec_mult_aux<Rows, Cols-1>()(A, v_in, tmp);

tmp+= A[Rows][Cols] * v_in[Cols];

}

};

To terminate the computation for the matrix column, we write a specialization:

template <int Rows>

struct fsize_mat_vec_mult_aux<Rows, 0>

{

template <typename Matrix, typename VecIn, typename ScalOut>

void operator()(const Matrix& A, const VecIn& v_in, ScalOut& tmp)

{

tmp= A[Rows][0] * v_in[0];

}

};

In this section, we showed different ways to optimize a two-dimensional loop (with fixed sizes). There are certainly more possibilities: for instance, we could try an implementation with good concurrency and register usage at the same time. Another optimization would be the agglomeration of the write-backs to minimize the cache invalidation signals further.

**5.4.3 Dynamic Unrolling–Warm-up**

c++11/vector_unroll_example.cpp

As important as the fixed-size optimization is, acceleration for dynamically sized containers is needed even more. We start here with a simple example and some observations. We will reuse the vector class from *Listing 3–1*. To show the implementation more clearly, we write the code without operators and expression templates. Our test case will compute

*u* = 3*v* + *w*

for three short vectors of size 1000. The wall clock time will be measured with <chrono>. The vectors v and w will be initialized, and to make absolutely sure that the data is in cache, we will run some additional operations before the timing. For conciseness, we moved the benchmarking code to *Appendix A.9.5*.

We compare the straightforward loop with one that performs four operations in each iteration:

for (unsigned j= 0; j < rep; j++)

for (unsigned i= 0; i < s; i+= 4) {

u[i]= 3.0f * v[i] + w[i];

u[i+1]= 3.0f * v[i+1] + w[i+1];

u[i+2]= 3.0f * v[i+2] + w[i+2];

u[i+3]= 3.0f * v[i+3] + w[i+3];

}

This code will obviously only work when the vector size is divisible by 4. To avoid errors we can add an assertion on the vector size but this is not really satisfying. Instead, we generalize this implementation to arbitrary vector sizes:

**Listing 5–5: Unrolled computation of** *u* = 3*v* + *w*

for (unsigned j= 0; j < rep; j++) {

unsigned sb= s / 4 * 4;

for (unsigned i= 0; i < sb; i+= 4) {

u[i]= 3.0f * v[i] + w[i];

u[i+1]= 3.0f * v[i+1] + w[i+1];

u[i+2]= 3.0f * v[i+2] + w[i+2];

u[i+3]= 3.0f * v[i+3] + w[i+3];

}

for (unsigned i= sb; i < s; ++i)

u[i]= 3.0f * v[i] + w[i];

}

Sadly, we see the largest benefit with the oldest compilers. Using gcc 4.4 with the flags -O3-ffast-math -DNDEBUG running on an Intel i7-3820 3.6 GHz resulted in

Compute time native loop is 0.801699 μs.

Compute time unrolled loop is 0.600912 μs.

Measured timings in this chapter are averages over at least 1000 runs, so the accumulated execution time was more than 10s and the clock provides sufficient resolution.

Alternatively or in addition to our hand-coded unrolling, we can use the compiler flag -funroll-loops. This resulted in the following execution time on the test machine:

Compute time native loop is 0.610174 μs.

Compute time unrolled loop is 0.586364 μs.

Thus, the compiler flag supplied us with a similar performance gain.

The compiler is able to apply more optimizations when the vector size is already known at compile time:

const unsigned s= 1000;

Then it can be easier to transform the loop or to determine that a transformation is beneficial:

Compute time native loop is 0.474725 μs.

Compute time unrolled loop is 0.471488 μs.

With g++ 4.8, we observed run times around 0.42 μs and with clang 3.4 even 0.16 μs. Investigating the generated assembler revealed that the main difference was how the data is moved from the main memory to the floating-point registers and back.

It also demonstrated that 1D loops are very well optimized by modern compilers, often better than by our hand tuning. Nonetheless, we will show the meta-tuning technique first in one dimension as preparation for higher dimensions where it still supplies significant accelerations.

Assuming that loop unrolling is beneficial for a given calculation on a platform in question, we ask ourselves next: “What is the optimal block size for the unrolling?”

• Does it depend on the expression?

• Does it depend on the types of the arguments?

• Does it depend on the computer architecture?

The answer is yes. All of them. The main reason (but not the only one) is that different processors have different numbers of registers. How many registers are needed in one iteration depends on the expression and on the types (a complex value needs more registers than a float).

In the following section, we will address both issues: how to encapsulate the transformation so that it does not show up in the application and how we can change the block size without rewriting the loop.

**5.4.4 Unrolling Vector Expressions**

For easier understanding, we discuss the abstraction in meta-tuning step by step. We start with the previous loop example *u* = 3*v* + *w* and implement it as a tunable function. The function’s name is my_axpy, and it has a template argument for the block size so we can write, for instance:

for (unsigned j= 0; j < rep; j++)

my_axpy<2>(u, v, w );

This function contains an unrolled main loop with customizable block size and a clean-up loop at the end:

template <unsigned BSize, typename U, typename V, typename W >

void my_axpy(U& u, const V& v, const W & w)

{

assert(u.size() == v.size() && v.size() == w.size());

unsigned s= u.size(), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

my_axpy_ftor<0, BSize>()(u, v, w, i);

for (unsigned i= sb; i < s; ++i)

u[i]= 3.0f * v[i] + w[i];

}

As mentioned before, deduced template types, like the vector types in our case, must be defined at the end of the parameter list and the explicitly given arguments, in our case the block size, must be passed to the first template parameters. The implementation of the block statement in the first loop can be implemented similarly to the functor in *Section 5.4.1*. We deviate a bit from this implementation by using two template parameters where the first one is increased until it is equal to the second. We observed that this approach yielded faster binaries on gcc than using only one argument and counting it down to zero. In addition, the two-argument version is more consistent with the multi-dimensional implementation in *Section 5.4.7*. As for fixed-size unrolling, we need a recursive template definition. Within each operator, a single operation is performed and the following called:

template <unsigned Offset, unsigned Max>

struct my_axpy_ftor

{

template <typename U, typename V, typename W >

void operator()(U& u, const V& v, const W & w, unsigned i)

{

u[i+Offset]= 3.0f * v[i+Offset] + w[i+Offset];

my_axpy_ftor<Offset+1, Max>()(u, v, w, i);

}

};

The only difference from fixed-size unrolling is that the indices are relative to the index i. The operator() is first called with Offset equal to 0, then with 1, 2, and so on. Since each call is inlined, the functor call behaves like one monolithic block of operations without loop control and function call. Thus, the call of my_axpy_ftor<0, 4>()(u, v, w, i) performs the same operations as one iteration of the first loop in *Listing 5–5*.

Of course, this compilation would end in an infinite loop without the specialization for Max:

template <unsigned Max>

struct my_axpy_ftor<Max, Max>

{

template <typename U, typename V, typename W>

void operator()(U& u, const V& v, const W& w, unsigned i) {}

};

Performing the considered vector operation with different unrolling parameters yields

Compute time unrolled<2> loop is 0.667546 μs.

Compute time unrolled<4> loop is 0.601179 μs.

Compute time unrolled<6> loop is 0.565536 μs.

Compute time unrolled<8> loop is 0.570061 μs.

Now we can call this operation for any block size we like. On the other hand, it is an unacceptable programming effort to provide such functors for each vector expression. Therefore, we will now combine this technique with expression templates.

**5.4.5 Tuning an Expression Template**

c++03/vector_unroll_example2.cpp

In *Section 5.3.3*, we implemented expression templates for vector sums (without unrolling). In the same manner we could implement a scalar-vector product, but we leave this as *Exercise 5.5.4* for the motivated reader and consider expressions with addition only, for example:

*u* = *v* + *v* + *w*

Our baseline performance is

Compute time is 1.72 μs.

To incorporate meta-tuning into expression templates, we only need to modify the actual assignment because this is where all loop-based vector operations are performed. The other operations (addition, subtraction, scaling, . . . ) solely return small objects containing references. We can split the loop in operator= as before into the unrolled part at the beginning and the completion at the end:

template <typename T>

class vector

{

template <typename Src>

vector& operator=(const Src& that)

{

check_size(size(that));

unsigned s= my_size, sb= s / 4 * 4;

for (unsigned i= 0; i < sb; i+= 4)

assign<0, 4>()(*this, that, i);

for (unsigned i= sb; i < s; ++i)

data[i]= that[i];

return *this;

}

};

The assign functor is realized analogously to my_axpy_ftor:

template <unsigned Offset, unsigned Max>

struct assign

{

template <typename U, typename V>

void operator()(U& u, const V& v, unsigned i)

{

u[i+Offset]= v[i+Offset];

assign<Offset+1, Max>()(u, v, i);

}

};

template <unsigned Max>

struct assign<Max, Max>

{

template <typename U, typename V>

void operator()(U& u, const V& v, unsigned i) {}

};

Computing the expression above yields

Compute time is 1.37 μs.

With this rather simple modification we now accelerated all vector expression templates. In comparison to the previous implementation, however, we lost the flexibility to customize the loop unrolling. The functor assign has two arguments, thus allowing for customization. The problem is the assignment operator. In principle we can define an explicit template argument here:

template <unsigned BSize, typename Src>

vector& operator=(const Src& that)

{

check_size(size(that));

unsigned s= my_size, sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

assign<0, BSize>()(*this, that, i);

for (unsigned i= sb; i < s; ++i)

data[i]= that[i];

return *this;

}

The drawback is that we cannot use the symbol = as a natural infix operator any longer. Instead we must write:

u. operator=<4>(v + v + w);

This has indeed a certain geeky charm, and one could also argue that people did, and still do, much more painful things for performance. Nonetheless, it does not meet our ideals of intuitiveness and readability.

Alternative notations are

unroll<4>(u= v + v + w);

or

unroll<4>(u)= v + v + w;

Both versions are implementable but we find the latter more readable. The former expresses more correctly what we are doing, while the latter is easier to implement and the structure of the computed expression retains better visibility. Therefore we show the realization of the second form.

The function unroll is simple to implement: it just returns an object of type unroll_vector (see below) with a reference to the vector and type information for the unroll size:

template <unsigned BSize, typename Vector>

unroll_vector<BSize, Vector> inline unroll(Vector& v)

{

return unroll_vector<BSize, Vector>(v);

}

The class unroll_vector is not complicated either. It only needs to take a reference of the target vector and to provide an assignment operator:

template <unsigned BSize, typename V>

class unroll_vector

{

public:

unroll_vector(V& ref) : ref(ref) {}

template <typename Src>

V& operator=(const Src& that)

{

assert(size(ref) == size(that));

unsigned s= size(ref), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

assign<0, BSize>()(ref, that, i);

for (unsigned i= sb; i < s; ++i)

ref[i]= that[i];

return ref;

}

private:

V& ref;

};

Evaluating the considered vector expressions for some block sizes yields

Compute time unroll<1>(u)= v + v + w is 1.72 μs.

Compute time unroll<2>(u)= v + v + w is 1.52 μs.

Compute time unroll<4>(u)= v + v + w is 1.36 μs.

Compute time unroll<6>(u)= v + v + w is 1.37 μs.

Compute time unroll<8>(u)= v + v + w is 1.4 μs.

These few benchmarks are consistent with the previous results; i.e., unroll<1> is equal to the canonical implementation and unroll<4> is as fast as the hard-wired unrolling.

**5.4.6 Tuning Reduction Operations**

The techniques in this section are applicable in similar fashion to a variety of vector and matrix norms. They can also be used for dot products and tensor reductions.

**5.4.6.1 Reducing on a Single Variable**

c++03/reduction_unroll_example.cpp

In the preceding vector operations, the *i*-th entry of each vector was handled independently of any other entry. For reduction operations, they are related by one or more temporary variables. And these temporary variables can become a serious bottleneck.

First, we test whether a reduction operation, say, the discrete *L*_{1} norm (also known as the Manhattan norm), can be accelerated by the techniques from *Section 5.4.4*. We implement the one_norm function in terms of a functor for the iteration block:

template <unsigned BSize, typename Vector>

typename Vector::value_type

inline one_norm(const Vector& v)

{

using std::abs;

typename Vector::value_type sum(0);

unsigned s= size(v), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

one_norm_ftor<0, BSize>()(sum, v, i);

for (unsigned i= sb; i < s; ++i)

sum += abs(v[i]);

return sum;

}

The functor is also implemented in the same manner as before:

template <unsigned Offset, unsigned Max>

struct one_norm_ftor

{

template <typename S, typename V>

void operator()(S& sum, const V& v, unsigned i)

{

using std::abs;

sum+= abs(v[i+Offset]);

one_norm_ftor<Offset+1, Max>()(sum, v, i);

}

};

template <unsigned Max>

struct one_norm_ftor<Max, Max>

{

template <typename S, typename V>

void operator()(S& sum, const V& v, unsigned i) {}

};

For reductions, we can see a tuning benefit with more recent compilers like gcc 4.8:

Compute time one_norm<1>(v) is 0.788445 μs.

Compute time one_norm<2>(v) is 0.43087 μs.

Compute time one_norm<4>(v) is 0.436625 μs.

Compute time one_norm<6>(v) is 0.43035 μs.

Compute time one_norm<8>(v) is 0.461095 μs.

This corresponds to a speedup of 1.8. Let us try some alternative implementations.

**5.4.6.2 Reducing on an Array**

c++03/reduction_unroll_array_example.cpp

When we look at the previous computation, we see that a different entry of v is used in each iteration. But every computation accesses the same temporary variable sum, and this limits the concurrency. To provide more concurrency, we can use multiple temporaries* ^{7}* in an array, for instance. The modified function then reads:

*7*. Strictly speaking, this is not true for every possible scalar type we can think of. The addition of the sum type must be a commutative monoid because we change the evaluation order. This holds of course for all intrinsic numeric types and certainly for almost all user-defined arithmetic types. However, we are free to define an addition that is not commutative or not monoidal. In this case, our transformation would be wrong. To deal with such exceptions, we need semantic concepts which will hopefully become part of C++ at some point. (Especially since the author has already spent a lot of time with them.)

template <unsigned BSize, typename Vector>

typename Vector::value_type

inline one_norm(const Vector& v)

{

using std::abs;

typename Vector::value_type sum[BSize];

for (unsigned i= 0; i < BSize; ++i)

sum[i]= 0;

unsigned s= size(v), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

one_norm_ftor<0, BSize>()(sum, v, i);

for (unsigned i= 1; i < BSize; ++i)

sum[0]+= sum[i];

for (unsigned i= sb; i < s; ++i)

sum[0]+= abs(v[i]);

return sum[0];

}

Now, each instance of one_norm_ftor operates on another entry of the sum array:

template <unsigned Offset, unsigned Max>

struct one_norm_ftor

{

template <typename S, typename V>

void operator()(S* sum, const V& v, unsigned i)

{

using std::abs;

sum[Offset]+= abs(v[i+Offset]);

one_norm_ftor<Offset+1, Max>()(sum, v, i);

}

};

template <unsigned Max>

struct one_norm_ftor<Max, Max>

{

template <typename S, typename V>

void operator()(S* sum, const V& v, unsigned i) {}

};

Running this implementation on the test machine yielded

Compute time one_norm<1>(v) is 0.797224 μs.

Compute time one_norm<2>(v) is 0.45923 μs.

Compute time one_norm<4>(v) is 0.538913 μs.

Compute time one_norm<6>(v) is 0.467529 μs.

Compute time one_norm<8>(v) is 0.506729 μs.

This is even a bit slower than the version with one variable. Maybe an array is more expensive to pass as an argument even in an inline function. Let us try something else.

**5.4.6.3 Reducing on a Nested Class Object**

c++03/reduction_unroll_nesting_example.cpp

To avoid arrays, we can define a class for n temporary variables where n is a template parameter. Then the class design is more consistent with the recursive scheme of the functors:

template <unsigned BSize, typename Value>

struct multi_tmp

{

typedef multi_tmp<BSize-1, Value> sub_type;

multi_tmp(const Value& v) : value(v), sub(v) {}

Value value;

sub_type sub;

};

template <typename Value>

struct multi_tmp<0, Value>

{

multi_tmp(const Value& v) {}

};

An object of this type can be recursively initialized so that we do not need a loop as for the array. A functor can operate on the value member and pass a reference to the sub member to its successor. This leads us to the implementation of our functor:

template <unsigned Offset, unsigned Max>

struct one_norm_ftor

{

template <typename S, typename V>

void operator()(S& sum, const V& v, unsigned i)

{

using std::abs;

sum.value+= abs(v[i+Offset]);

one_norm_ftor<Offset+1, Max>()(sum.sub, v, i);

}

};

template <unsigned Max>

struct one_norm_ftor<Max, Max>

{

template <typename S, typename V>

void operator()(S& sum, const V& v, unsigned i) {}

};

The unrolled function that uses this functor reads:

template <unsigned BSize, typename Vector>

typename Vector::value_type

inline one_norm(const Vector& v)

{

using std::abs;

typedef typename Vector::value_type value_type;

multi_tmp<BSize, value_type> multi_sum(0);

unsigned s= size(v), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

one_norm_ftor<0, BSize>()(multi_sum, v, i);

value_type sum= multi_sum.sum();

for (unsigned i= sb; i < s; ++i)

sum += abs (v[i]);

return sum;

}

There is still one piece missing: we must reduce the partial sums in multi_sum at the end. Unfortunately, we cannot write a loop over the members of multi_sum. So, we need a recursive function that dives into multi_sum which is easiest to implement as a member function with the corresponding specialization:

template <unsigned BSize, typename Value>

struct multi_tmp

{

Value sum() const { return value + sub.sum(); }

};

template <typename Value>

struct multi_tmp<0, Value>

{

Value sum() const { return 0; }

};

Note that we started the summation with the empty multi_tmp, not the innermost value member. Otherwise we would need an extra specialization for multi_tmp<1, Value>. Likewise, we could implement a general reduction as in accumulate, but this would require the presence of an initial element:

template <unsigned BSize, typename Value>

struct multi_tmp

{

template <typename Op>

Value reduce(Op op, const Value& init) const

{ return op(value, sub.reduce(op, init)); }

};

template <typename Value>

struct multi_tmp<0, Value>

{

template <typename Op>

Value reduce(Op, const Value& init) const { return init; }

};

The compute times of this version are

Compute time one_norm<1>(v) is 0.786668 μs.

Compute time one_norm<2>(v) is 0.442476 μs.

Compute time one_norm<4>(v) is 0.441455 μs.

Compute time one_norm<6>(v) is 0.410978 μs.

Compute time one_norm<8>(v) is 0.426368 μs.

Thus, in our test environment, the performance of the different implementations is similar.

**5.4.6.4 Dealing with Abstraction Penalty**

c++03/reduction_unroll_registers_example.cpp

In the previous sections, we introduced temporaries for enabling more independent operations. These temporaries are only beneficial, however, when they are assigned to registers. Otherwise they can even slow down the entire execution due to extra memory traffic and cache invalidation signals. With certain old compilers, arrays and nested classes were located in main memory and the run time of the unrolled code was even longer than that of the sequential one.

This is a typical example of *Abstraction Penalty*: semantically equivalent programs running slower due to a more abstract formulation. To quantify the abstraction penalty, Alex Stepanov wrote in the early 90s a benchmark for measuring the impact of wrapper classes on the performance of the accumulate function [*40*]. The idea was that compilers able to run all versions of the test with the same speed should be able to perform STL algorithms without overhead.

At that time, one could observe significant overhead for more abstract codes whereas modern compilers can easily deal with the abstractions in that benchmark. That does not imply that they can handle every level of abstraction, and we always have to check whether our performance-critical kernels could be faster with a less abstract implementation. For instance, in MTL4, the matrix vector product is generically implemented in an iterator-like fashion for all matrix types and views. This operation is specialized for important matrix classes and tuned for those data structures, partially with raw pointers. Generic, high-performance software needs a good balance of generic reusability and targeted tuning to avoid perceivable overhead on one hand and combinatorial code explosion on the other.

In our special case of register usage, we can try to help the compiler optimization by transferring the complexity away from the data structure. Our best chances that temporaries are stored in registers is to declare them as function-local variables:

inline one_norm(const Vector& v)

{

typename Vector::value_type s0(0), s1(0), s2(0), ...

}

The question is now how many we should declare. The number cannot depend on the template parameter but must be fixed for all block sizes. Furthermore, the number of temporaries limits our ability to unroll the loop.

How many temporaries are actually used in the iteration block depends on the template parameter BSize. Unfortunately, we cannot change the number of arguments in a function call depending on a template parameter, i.e., passing fewer arguments for smaller values of BSize. Thus, we have to pass all variables to the iteration block functor:

for (unsigned i= 0; i < sb; i+= BSize)

one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

The first calculation in each block accumulates on s0, the second on s1, and so on. Unfortunately, we cannot select the temporary argument-dependently (unless we specialize for every value).

Alternatively, each computation is performed on its first function argument and subsequent functors are called without the first argument:

one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, v, i);

one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, v, i);

one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, v, i);

This is not realizable with templates either.

The solution is to rotate the references:

one_norm_ftor<1, BSize>()(s1, s2, s3, s4, s5, s6, s7, s0, v, i);

one_norm_ftor<2, BSize>()(s2, s3, s4, s5, s6, s7, s0, s1, v, i);

one_norm_ftor<3, BSize>()(s3, s4, s5, s6, s7, s0, s1, s2, v, i);

This rotation is achieved by the following functor implementation:

template <unsigned Offset, unsigned Max>

struct one_norm_ftor

{

template <typename S, typename V>

void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6,

S& s7, const V& v, unsigned i)

{

using std::abs;

s0+= abs(v[i+ Offset]);

one_norm_ftor<Offset+1, Max>()(s1, s2, s3, s4, s5, s6, s7,

s0, v, i);

}

};

template <unsigned Max>

struct one_norm_ftor<Max, Max>

{

template <typename S, typename V>

void operator()(S& s0, S& s1, S& s2, S& s3, S& s4, S& s5, S& s6,

S& s7, const V& v, unsigned i) {}

};

The corresponding one_norm function based on this functor is straightforward:

template <unsigned BSize, typename Vector>

typename Vector::value_type

inline one_norm(const Vector& v)

{

using std::abs;

typename Vector::value_type s0(0), s1(0), s2(0), s3(0), s4(0),

s5(0), s6(0), s7(0);

unsigned s= size(v), sb= s / BSize * BSize;

for (unsigned i= 0; i < sb; i+= BSize)

one_norm_ftor<0, BSize>()(s0, s1, s2, s3, s4, s5, s6, s7, v, i);

s0+= s1 + s2 + s3 + s4 + s5 + s6 + s7;

for (unsigned i= sb; i < s; ++i)

s0+= abs(v[i]);

return s0;

}

A slight disadvantage is the overhead for very small vectors: all registers must be accumulated after the block iterations even when there are none. A great advantage, on the other hand, is that the rotation allows for block sizes larger than the number of temporaries. They are reused without corrupting the result. Nonetheless, the actual concurrency will not be larger.

The execution of this implementation took on the test machine:

Compute time one_norm<1>(v) is 0.793497 μs.

Compute time one_norm<2>(v) is 0.500242 μs.

Compute time one_norm<4>(v) is 0.443954 μs.

Compute time one_norm<6>(v) is 0.441819 μs.

Compute time one_norm<8>(v) is 0.430749 μs.

This performance is comparable to the nested-class implementation for compilers that handle the latter properly (i.e., data members in registers); otherwise the rotation code is clearly faster.

**5.4.7 Tuning Nested Loops**

c++11/matrix_unroll_example.cpp

The most used (and abused) example in performance discussions is dense-matrix multiplication. We do not claim to compete with hand-tuned assembler codes, but we show the power of meta-programming to generate code variations from a single implementation. As a starting point, we use a template implementation of the matrix class from *Section A.4.3*. In the following, we use this simple test case:

const unsigned s= 128; // s= 4 in testing, 128 in timing

matrix<float> A(s, s), B(s, s), C(s, s);

for (unsigned i= 0; i < s; ++i)

for (unsigned j= 0; j < s; j++) {

A(i, j)= 100.0 * i + j;

B(i, j)= 200.0 * i + j;

}

mult(A, B, C);

A matrix multiplication is easily implemented with three nested loops. One of the six possible ways of nesting is a dot-product-like calculation for each entry from *C*:

*c** _{ik}* =

*A*

*·*

_{i}*B*

^{k}where *A** _{i}* is the

*i*-th row of

*A*and

*B*

*the*

^{k}*k*-th column of

*B*. We use a temporary in the innermost loop to decrease the cache-invalidation overhead of writing to

*C*’s elements in each operation:

template <typename Matrix>

inline void mult(const Matrix& A, const Matrix& B, Matrix& C)

{

assert(A.num_rows() == B.num_rows()); // ...

typedef typename Matrix::value_type value_type;

unsigned s= A.num_rows();

for (unsigned i= 0; i < s; ++i)

for (unsigned k= 0; k < s; k++) {

value_type tmp(0);

for (unsigned j= 0; j < s; j++)

tmp+= A(i, j) * B(j, k);

C(i, k)= tmp;

}

}

For this implementation, we provide a backward-compatible benchmark in *Appendix A.9.6*. The run time and performance of our canonical implementation (with 128 *×* 128 matrices) are:

Compute time mult(A, B, C) is 1980 μs. These are 2109 MFlops.

This implementation is our reference regarding performance and results. For the development of the unrolled implementation, we go back to 4 *×* 4 matrices. In contrast to *Section 5.4.6*, we do not unroll a single reduction but perform multiple reductions in parallel. Regarding the three loops, this means that we unroll the two outer loops and perform block operations in the inner loop; i.e., multiple i and j values are handled in each iteration. This block is realized by a size-parameterizable functor.

As in the canonical implementation, the reduction is not performed directly on elements of C but in temporaries. For this purpose, we use the class multi_tmp from §*5.4.6.3*. For the sake of simplicity, we limit ourselves to matrix sizes that are multiples of the unroll parameters (a full implementation for arbitrary matrix sizes is realized in MTL4). The unrolled matrix multiplication is shown in the following function:

template <unsigned Size0, unsigned Size1, typename Matrix>

inline void mult(const Matrix& A, const Matrix& B, Matrix& C)

{

using value_type= typename Matrix::value_type;

unsigned s= A.num_rows();

mult_block<0, Size0-1, 0, Size1-1> block;

for (unsigned i= 0; i < s; i+= Size0)

for (unsigned k= 0; k < s; k+= Size1) {

multi_tmp<Size0 * Size1, value_type> tmp(value_type(0));

for (unsigned j= 0; j < s; j++)

block(tmp, A, B, i, j, k);

block.update(tmp, C, i, k);

}

}

We still have to implement the functor mult_block. The techniques are essentially the same as in the vector operations, but we have to deal with more indices and their respective limits:

template <unsigned Index0, unsigned Max0, unsigned Index1,

unsigned Max1>

struct mult_block

{

typedef mult_block<Index0, Max0, Index1+1, Max1> next;

template <typename Tmp, typename Matrix>

void operator()(Tmp & tmp, const Matrix& A, const Matrix& B,

unsigned i, unsigned j, unsigned k)

{

tmp.value+= A(i + Index0, j) * B(j, k + Index1);

next()(tmp.sub, A, B, i, j, k);

}

template <typename Tmp, typename Matrix>

void update(const Tmp & tmp, Matrix& C, unsigned i, unsigned k)

{

C(i + Index0, k + Index1)= tmp.value;

next().update(tmp.sub, C, i, k);

}

};

template <unsigned Index0, unsigned Max0, unsigned Max1>

struct mult_block<Index0, Max0, Max1, Max1>

{

typedef mult_block<Index0+1, Max0, 0, Max1> next;

template <typename Tmp, typename Matrix>

void operator()(Tmp & tmp, const Matrix& A, const Matrix& B,

unsigned i, unsigned j, unsigned k)

{

tmp.value+= A(i + Index0, j) * B(j, k + Max1);

next()(tmp.sub, A, B, i, j, k);

}

template <typename Tmp, typename Matrix>

void update(const Tmp & tmp, Matrix& C, unsigned i, unsigned k)

{

C(i + Index0, k + Max1)= tmp.value;

next().update(tmp.sub, C, i, k);

}

};

template <unsigned Max0, unsigned Max1>

struct mult_block<Max0, Max0, Max1, Max1>

{

template <typename Tmp, typename Matrix>

void operator()(Tmp & tmp, const Matrix& A, const Matrix& B,

unsigned i, unsigned j, unsigned k)

{

tmp.value+= A(i + Max0, j) * B(j, k + Max1);

}

template <typename Tmp, typename Matrix>

void update(const Tmp & tmp, Matrix& C, unsigned i, unsigned k)

{

C(i + Max0, k + Max1)= tmp.value;

}

};

With appropriate logging, we can show that the same operations are performed for each entry of C as in the canonical implementation. We can also see that calculations regarding the entries are interleaved. In the following logging, we multiply 4 *×* 4 matrices and unroll 2 *×* 2 blocks. From the four temporaries, we observe two:

tmp.4+= A[1][0] * B[0][0]

tmp.3+= A[1][0] * B[0][1]

tmp.4+= A[1][1] * B[1][0]

tmp.3+= A[1][1] * B[1][1]

tmp.4+= A[1][2] * B[2][0]

tmp.3+= A[1][2] * B[2][1]

tmp.4+= A[1][3] * B[3][0]

tmp.3+= A[1][3] * B[3][1]

C[1][0]= tmp.4

C[1][1]= tmp.3

tmp.4+= A[3][0] * B[0][0]

tmp.3+= A[3][0] * B[0][1]

tmp.4+= A[3][1] * B[1][0]

tmp.3+= A[3][1] * B[1][1]

tmp.4+= A[3][2] * B[2][0]

tmp.3+= A[3][2] * B[2][1]

tmp.4+= A[3][3] * B[3][0]

tmp.3+= A[3][3] * B[3][1]

C[3][0]= tmp.4

C[3][1]= tmp.3

In temporary number 4, we accumulate *A*_{1} · *B*^{0} and store the result to *c*_{1}_{,}_{0}. This is interleaved with the accumulation of *A*_{1} · *B*^{1} in temporary 3, allowed for using multiple pipelines of super-scalar processors. We can also see that

The implementation above can be simplified. The first functor specialization only differs from the general functor in the way the indices are incremented. We can factor this out with an additional loop2 class:

template <unsigned Index0, unsigned Max0, unsigned Index1,

unsigned Max1>

struct loop2

{

static const unsigned next_index0= Index0,

next_index1= Index1 + 1;

};

template <unsigned Index0, unsigned Max0, unsigned Max1>

struct loop2<Index0, Max0, Max1, Max1>

{

static const unsigned next_index0= Index0 + 1, next_index1= 0;

};

Such a general class has a high potential for reuse. With this class, we can fuse the functor template and the first specialization:

template <unsigned Index0, unsigned Max0, unsigned Index1,

unsigned Max1>

struct mult_block

{

typedef loop2<Index0, Max0, Index1, Max1> l;

typedef mult_block<l::next_index0, Max0,

l::next_index1, Max1> next;

template <typename Tmp, typename Matrix>

void operator()(Tmp& tmp, const Matrix& A, const Matrix& B,

unsigned i, unsigned j, unsigned k)

{

tmp.value+= A(i + Index0, j) * B(j, k + Index1);

next()(tmp.sub, A, B, i, j, k);

}

template <typename Tmp, typename Matrix>

void update(const Tmp& tmp, Matrix& C, unsigned i, unsigned k)

{

C(i + Index0, k + Index1)= tmp.value;

next().update(tmp.sub, C, i, k);

}

};

The other specialization remains unaltered.

Last but not least, we want to see the impact of our not-so-simple matrix product implementation. The benchmark yielded on our test machine:

Time mult<1, 1> is 1968 μs. These are 2122 MFlops.

Time mult<1, 2> is 1356 μs. These are 3079 MFlops.

Time mult<1, 4> is 1038 μs. These are 4022 MFlops.

Time mult<1, 8> is 871 μs. These are 4794 MFlops.

Time mult<1, 16> is 2039 μs. These are 2048 MFlops.

Time mult<2, 1> is 1394 μs. These are 2996 MFlops.

Time mult<4, 1> is 1142 μs. These are 3658 MFlops.

Time mult<8, 1> is 1127 μs. These are 3705 MFlops.

Time mult<16, 1> is 2307 μs. These are 1810 MFlops.

Time mult<2, 2> is 1428 μs. These are 2923 MFlops.

Time mult<2, 4> is 1012 μs. These are 4126 MFlops.

Time mult<2, 8> is 2081 μs. These are 2007 MFlops.

Time mult<4, 4> is 1988 μs. These are 2100 MFlops.

We can see that mult<1, 1> has the same performance as the original implementation, which in fact is performing the operations in exactly the same order (so far the compiler optimization does not change the order internally). We also see that most unrolled versions are faster, up to a factor of 2.3.

With double matrices, the performance is slightly lower in general:

Time mult is 1996 μs. These are 2092 MFlops.

Time mult<1, 1> is 1989 μs. These are 2099 MFlops.

Time mult<1, 2> is 1463 μs. These are 2855 MFlops.

Time mult<1, 4> is 1251 μs. These are 3337 MFlops.

Time mult<1, 8> is 1068 μs. These are 3908 MFlops.

Time mult<1, 16> is 2078 μs. These are 2009 MFlops.

Time mult<2, 1> is 1450 μs. These are 2880 MFlops.

Time mult<4, 1> is 1188 μs. These are 3514 MFlops.

Time mult<8, 1> is 1143 μs. These are 3652 MFlops.

Time mult<16, 1> is 2332 μs. These are 1791 MFlops.

Time mult<2, 2> is 1218 μs. These are 3430 MFlops.

Time mult<2, 4> is 1040 μs. These are 4014 MFlops.

Time mult<2, 8> is 2101 μs. These are 1987 MFlops.

Time mult<4, 4> is 2001 μs. These are 2086 MFlops.

This shows that other parameterizations yield more acceleration and that the performance could be doubled.

Which configuration is the best and why is—as mentioned before—not the topic of this book; we only show programming techniques. The reader is invited to try this program on his/her own computer. The techniques in this section are intended for best L1 cache usage. When matrices are larger, we should use more levels of blocking. A general-purpose methodology for locality on L2, L3, main memory, local disk, . . . is recursion. This avoids reimplementation for each cache size and performs even reasonably well in virtual memory; see, for instance, [*20*].

**5.4.8 Tuning Résumé**

Software tuning including benchmarking [*25*] is an art of its own with advanced compiler optimization. The tiniest modification in the source can change the run-time behavior of an examined computation. In our example, it should not have mattered whether the size was known at compile time or not. But it did. Especially when the code is compiled without -DNDEBUG, the compiler might omit the index check in some situations and perform it in others. It is also important to print out computed values because the compiler might omit an entire computation when it is obvious that the result is not needed.

In addition, we must verify that repeated execution—for better clock resolution and amortized measuring overhead—is really repeated. When the result is independent on the number of repetitions, clever compilers might perform the code only once (we observed this in the unrolled reduction with clang 3.4 and block size 8). Such optimizations happen in particular when the results are intrinsic types while computations on user-defined types are usually not subject to such omissions (but we cannot count on it). Especially, the CUDA compiler performs an intensive static code analysis and rigorously drops all calculations without impact—leaving the benchmarking programmer utterly bewildered (and often prematurely excited about allegedly extremely fast calculations that were actually never performed or not as often as intended).

The goal of this section was not to implement the ultimate matrix or scalar product. In the presence of the new GPU and many-core processors with hundreds and thousands of cores and millions of threads, our exploration of super-scalar pipelining seems like a drop in the ocean. But it is not. The tuned implementations can be combined with multi-threading techniques like that of *Section 4.6* or with OpenMP. The template-parameterized blocking is also an excellent preparation for SSE acceleration.

More important than the exact performance values is for us to illustrate the expressive and generative power of C++. We can generate any execution that we like from any syntactic representation that pleases us. The best-known code generation project in high-performance computing is ATLAS [*51*]. For a given dense linear-algebra function, it generates different implementations from C and assembler snippets and compares their performance on a target platform. After a training phase, an efficient implementation of the BLAS library [*5*] is available for the targeted platform.

In C++, we can use any compiler to generate every possible implementation without the need of external code generators. Programs can be tuned by simply changing template arguments. Tuning parameters can be easily set in configuration files platform-dependently, leading to significantly different executables on various platforms without the need of massive reimplementation.

Performance tuning is, however, shooting at moving targets: what yields a great benefit today might be irrelevant or even detrimental tomorrow. Therefore, it is important that performance improvements are not hard-wired deep inside an application but that we are able to configure our tuning parameters conveniently.

The examples in this section demonstrated that meta-tuning is quite elaborate, and to our disappointment, the benefit of the transformations is not as pronounced as it used to be when we first investigated them in 2006. We have also seen in several examples that compilers are very powerful to apply general-purpose optimizations and often yield better results with less effort. Effort-wise, it is also advisable not to compete with highly tuned libraries in popular domains like dense linear algebra. Such libraries as MKL or Goto-BLAS are extremely efficient, and our chances of outperforming them after enormous work are tiny. All this said, we shall focus our efforts on the most important targets: domain-specific optimizations of fundamental kernels with strong impact on our applications’ overall run time.

**5.5 Exercises**

**5.5.1 Type Traits**

Write type traits for removing and adding references. Add a domain-specific type trait for the meta-predicate is_vector and assume that the only known vectors are so far my_vector<Value> and vector_sum<E1, E2>.

**5.5.2 Fibonacci Sequence**

Write a meta-template that generates the Fibonacci sequence at compile-time. The Fibonacci sequence is defined by the following recursion:

*x*_{0} = 0

*x*_{1} = 1

*x** _{n}* =

*x*

_{n}_{–1}+

*x*

_{n}_{–2}for

*n*≥ 2.

**5.5.3 Meta-Program for Greatest Common Divisor**

Write a meta-program for the GCD (greatest common divisor) of two integers. The algorithm is as follows: Write a generic function for an integer type I that computes the GCD.

1 function gcd(a, b):

2 if b = 0 return a

3 else return gcd(b, a mod b)

template <typename I>

I gcd (I a, I b) { ... }

Then write an integral meta-function that executes the same algorithm but at compile time. Your meta-function should be of the following form:

template <int A, int B>

struct gcd_meta {

static int const value = ... ;

} ;

i.e., gcd_meta<a,b>::value is the GCD of a and b. Verify whether the results correspond to your C++ function gcd().

**5.5.4 Vector Expression Template**

Implement a vector class (you can use std::vector<double> internally) that contains at least the following members:

class my_vector {

public:

typedef double value_type ;

my_vector( int n );

// Copy Constructor from type itself

my_vector( my_vector& );

// Constructor from generic vector

template <typename Vector>

my_vector( Vector& );

// Assignment operator

my_vector& operator=( my_vector const& v );

// Assignment for generic Vector

template <typename Vector>

my_vector& operator=( Vector const& v );

value_type& operator() ( int i );

int size() const;

value_type operator() ( int i ) const;

};

Make an expression for a scalar multiplied with a vector:

template <typename Scalar, typename Vector>

class scalar_times_vector_expression

{};

template <typename Scalar, typename Vector>

scalar_times_vector_expressions<Scalar, Vector>

operator*( Scalar const& s, Vector const& v )

{

return scalar_times_vector_expressions<Scalar, Vector>( s, v );

}

Put all classes and functions in the namespace math. You can also create an expression template for the addition of two vectors.

Write a small program, e.g.:

int main() {

math::my_vector v( 5 );

... Fill in some values of v ...

math::my_vector w( 5 );

w = 5.0 * v;

w = 5.0 * (7.0 * v );

w = v + 7.0*v; // (If you have added the operator+)

}

Use the debugger to see what happens.

**5.5.5 Meta-List**

Create a list of types. Implement the meta-functions insert, append, erase, and size.