Optimizing Program Performance on CPU Instruction Level

By Mao Luo

CPU Instruction Processing

In general, processing an instruction involves a number of operations. These operations are organized into a particular sequence of stages, with the aim of making all instructions follow a uniform sequence.
Fetch: Fetch instruction from memory.

Decode: Read registers while decoding the instruction.

Execute: Execute the operation or calculate an address.

Memory: Write data to memory or read data from memory.

Write Back: Write the results into a register.

CPU Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. In a pipelined system, the task to be performed is divided into a series of discrete stages. A key feature of pipelining is that it increases the throughput of the system: once the pipeline is full, an ideal five-stage pipeline completes one instruction per clock cycle rather than one every five cycles. Two main kinds of dependency restrict the performance of pipelining.

Data Dependency: the results computed by one instruction are used as data by a following instruction.

Control Dependency: one instruction determines the location of the following instruction, as when executing a jump, call, or return.

When such dependencies have the potential to cause an erroneous computation by the pipeline, they are called hazards.
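For example, in the sketch below (hypothetical code, not part of the combine example that follows), the second statement has a data dependency on the first, and the if statement introduces a control dependency:

long f(long x, long y, long z) {
   long a = x * y;   /* first computation */
   long b = a + z;   /* data dependency: needs the value of a */
   if (b > 0) {      /* control dependency: which instruction is fetched */
       b = -b;       /*   next depends on the outcome of the comparison */
   }
   return b;
}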

The hardware takes measures, such as stalling and forwarding, to resolve these hazards, but it also pays to write code that avoids these dependencies in the first place.


Avoid Data Dependency

Program Example:

void combine1(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   data_t* data = get_vec_start(v);
   data_t acc = IDENT;
   for (i = 0; i < length; ++i) {
       acc = acc OP data[i];
   }
   *dest = acc;
}

As an optimization example, consider combine1, which combines all of the elements in a vector into a single value according to some operation.

#define IDENT 0
#define OP +

By using different definitions of compile-time constants IDENT and OP, the code can be recompiled to perform different operations on the data.

typedef int data_t;
typedef struct {
   long int len;
   data_t* data;
} vec_rec, *vec_ptr;

The basic data type can also be changed to float, double, etc.
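For example, redefining the compile-time constants as follows (one possible variant) recompiles combine1 to compute the product of a vector of doubles:

typedef double data_t;
#define IDENT 1      /* identity element for multiplication */
#define OP *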

#include <cstdlib>    // malloc, calloc, free
#include <algorithm>  // std::fill

vec_ptr new_vec(long int len) {
   vec_ptr result = (vec_ptr)malloc(sizeof(vec_rec));
   if (!result) {
       return NULL;
   }
   result->len = len;
   if (len > 0) {
       data_t* data = (data_t*)calloc(len, sizeof(data_t));
       if (!data) {
           free((void *)result);
           return NULL;
       }
       result->data = data;
       std::fill(data, data + len, 1);   /* initialize all elements to 1 for testing */
   } else {
       result->data = NULL;
   }
   return result;
} 
long int vec_length(vec_ptr v) {
    return v->len;
} 
data_t* get_vec_start(vec_ptr v) {
    return v->data;
}
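
The original post does not show a driver; a minimal sketch might look like this (the function main and the printf format assume data_t is int):

#include <cstdio>    // printf
#include <cstdlib>   // free

int main() {
   vec_ptr v = new_vec(1000);   /* 1000 elements, each initialized to 1 */
   if (!v) {
       return 1;
   }
   data_t result;
   combine1(v, &result);        /* with IDENT 0 and OP +, result is 1000 */
   printf("result = %d\n", result);
   free(v->data);
   free(v);
   return 0;
}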

Loop Unrolling and Enhancing Parallelism

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration. Loop unrolling can improve performance in two ways. First, it reduces the number of operations such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

void combine2(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   long int limit = length - 1;
   data_t* data = get_vec_start(v);
   data_t acc = IDENT;
 
   for (i = 0; i < limit; i+=2) {
       acc = (acc OP data[i]) OP data[i+1];
   }
   for (; i < length; i++) {
       acc = acc OP data[i];
   }
   *dest = acc;
}

This loop unrolling just reduces the effect of loop overhead. Even though the loop has been unrolled by a factor of 2, there are still n operations along the critical path. Since the code accumulates the value in a single variable acc, it cannot compute a new value for acc until the preceding computation has completed. So even though the functional unit can start a new operation every clock cycle, it will only start one every L cycles, where L is the latency of the combining operation.
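
For example, with a hypothetical combining latency of L = 3 cycles, the n sequential updates of acc take about 3n cycles no matter how far the loop is unrolled. If the accumulation is instead split across two registers, as in combine3 below, the two chains of n/2 operations each can proceed in parallel, and the bound drops to roughly 3n/2 cycles.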

For a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining the results at the end.

void combine3(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   long int limit = length - 1;
   data_t* data = get_vec_start(v);
   data_t acc0 = IDENT;
   data_t acc1 = IDENT;
 
   for (i = 0; i < limit; i += 2) {
       acc0 = acc0 OP data[i];
       acc1 = acc1 OP data[i+1];
   }
   for (; i < length; i++) {
       acc0 = acc0 OP data[i];
   }
   *dest = acc0 OP acc1;
}

In combine3, the inner loop contains two operations, but these instructions translate into operations that read and write separate registers, with no data dependency between them. There are now two critical paths, one corresponding to combining the even-numbered elements and one for the odd-numbered elements. Many compilers do loop unrolling automatically, but relatively few then introduce this form of parallelism.

Another way to break the sequential dependencies is the reassociation transformation.

void combine4(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   long int limit = length - 1;
   data_t* data = get_vec_start(v);
   data_t acc = IDENT;
   for (i = 0; i < limit; i+=2) {
       acc = acc OP (data[i] OP data[i+1]);
   }
   for (; i < length; i++) {
       acc = acc OP data[i];
   }
   *dest = acc;
}

This approach also increases the number of operations that can be performed in parallel: within each iteration, data[i] OP data[i+1] does not depend on acc and can be computed ahead of time, so only the operation that merges each pair into acc lies on the critical path, cutting its length from n to n/2 operations. The reassociation transformation can thus reduce the number of operations along the critical path in a computation, resulting in better performance by better utilizing the pipelining capabilities of the functional units. Most compilers will not attempt any reassociation of floating-point operations, since these operations are not guaranteed to be associative.
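
A small sketch of the floating-point issue (the values are chosen purely for illustration): rounding means that regrouping can change the result.

#include <cstdio>

int main() {
   double a = 1e20, b = -1e20, c = 3.14;
   printf("%g\n", (a + b) + c);   /* prints 3.14 */
   printf("%g\n", a + (b + c));   /* prints 0: 3.14 is absorbed by -1e20 */
   return 0;
}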

Using Vector Instructions Through Built-in Functions

On some targets, the instruction set contains SIMD vector instructions that operate on multiple values held in one large register at the same time. SIMD is the acronym for “Single Instruction, Multiple Data.” The idea behind the SIMD execution model is that each 16-byte XMM register can hold multiple values. SSE instructions can perform vector operations on these registers, such as adding or multiplying four or two sets of values in parallel. GCC supports extensions to the C language that let programmers express a program in terms of vector operations that can be compiled into the SIMD instructions of SSE. This coding style is preferable to writing code directly in assembly language, since GCC can also generate code for the SIMD instructions found on other processors.
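As a minimal illustration of the extension (a sketch, separate from the combine examples that follow), a vector type is declared with the vector_size attribute, and ordinary arithmetic operators then apply element-wise:

typedef int v4si __attribute__ ((vector_size(4 * sizeof(int))));   /* four 32-bit ints */

v4si add4(v4si a, v4si b) {
   return a + b;   /* element-wise add; with -msse2 this becomes a single paddd */
}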

First, we combine GCC’s vector extensions with loop unrolling. On bcdev1, the option “-msse2” needs to be added.

typedef int v4si __attribute__ ((vector_size(4 * sizeof(int))));
void combine5(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   data_t* data = get_vec_start(v);
   v4si acc = { IDENT, IDENT, IDENT, IDENT};
   for (i = 0; i < length - 3; i += 4) {
       v4si* temp = reinterpret_cast<v4si*>(data + i);
       acc = __builtin_ia32_paddd128(acc, *temp);
   }
   data_t* sum = reinterpret_cast<data_t*>(&acc);
   data_t acc_last = sum[0] OP sum[1] OP sum[2] OP sum[3];
   for (; i < length; ++i) {
       acc_last = acc_last OP data[i];
   }
   *dest = acc_last;
}

Second, we combine GCC’s vector extensions, loop unrolling, and multiple accumulators.

void combine6(vec_ptr v, data_t* dest) {
   long int i;
   long int length = vec_length(v);
   data_t* data = get_vec_start(v);
   v4si acc1 = { IDENT, IDENT, IDENT, IDENT};
   v4si acc2 = { IDENT, IDENT, IDENT, IDENT};
   for (i = 0; i < length - 7; i += 8) {
       v4si* temp1 = reinterpret_cast<v4si*>(data + i);
       acc1 = __builtin_ia32_paddd128(acc1, *temp1);
       v4si* temp2 = reinterpret_cast<v4si*>(data + i + 4);
       acc2 = __builtin_ia32_paddd128(acc2, *temp2);
   }
   v4si acc = __builtin_ia32_paddd128(acc1, acc2);
   data_t* sum = reinterpret_cast<data_t*>(&acc);
   data_t acc_last = sum[0] OP sum[1] OP sum[2] OP sum[3];
   for (; i < length; ++i) {
       acc_last = acc_last OP data[i];
   }
   *dest = acc_last;
}

We used gprof to test these versions with a vector of length 100000000.

GCC itself has options to unroll loops (“-funroll-loops”) and to vectorize (“-ftree-vectorize”); compiling combine1 with these two options produces the same generated assembly code.


Avoid Control Dependency

A processor that employs branch prediction with speculative execution begins executing the instructions at the predicted branch target before the branch outcome is known. It does this in a way that avoids modifying registers or memory locations until the prediction is confirmed. If the prediction is correct, the processor commits the results. If the prediction is incorrect, it must discard all of the speculative results and restart the instruction fetch process at the correct location. Such a misprediction can incur a serious penalty, 20-40 clock cycles of wasted effort, causing a serious degradation of program performance.

Write Code Suitable for Implementation with Conditional Moves

Recent versions of x86 processors have conditional move instructions. Conditional move instructions can be implemented as part of the pipelined processing of ordinary instructions. There is no need to guess whether or not the condition will hold, and hence no penalty for guessing incorrectly.

GCC is able to generate conditional moves when code uses conditional operations to compute values and then updates the program state with those values, as opposed to using conditionals to selectively update program state. On bcdev1, the option “-O1” needs to be added.

// Rearrange two vectors so that for each i, b[i] >= a[i]
void minmax1(int* a, int* b, int n) {
   int i;
   for (i = 0; i < n; i++) {
       if (a[i] > b[i]) {
           int t = a[i];
           a[i] = b[i];
           b[i] = t;       
       }
   }
}
void minmax2(int* a, int* b, int n) {
   int i;
   for (i = 0; i < n; i++) {
       int min = a[i] < b[i] ? a[i] : b[i];   /* both values computed */
       int max = a[i] < b[i] ? b[i] : a[i];   /*   unconditionally    */
       a[i] = min;   /* state then updated with the computed values */
       b[i] = max;
   }
}

To verify that minmax2 indeed uses conditional moves, we can check the generated assembly. The relevant assembly for the two functions is shown below.

[Side-by-side assembly listings for minmax1 and minmax2]

Write code to test these two functions, again using gprof with a vector of length 100000000.
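
The driver is not shown in the original post; a minimal sketch might be the following (N, the random fill, and the seed are assumptions — random data makes the branch in minmax1 hard to predict):

#include <cstdlib>

const int N = 100000000;

int main() {
   int* a = (int*)malloc(N * sizeof(int));
   int* b = (int*)malloc(N * sizeof(int));
   if (!a || !b) {
       return 1;
   }
   srand(0);
   for (int i = 0; i < N; i++) {   /* random data: branches are unpredictable */
       a[i] = rand();
       b[i] = rand();
   }
   minmax1(a, b, N);   /* swap in minmax2 to compare the two under gprof */
   free(a);
   free(b);
   return 0;
}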

If a[i] > b[i] (or a[i] < b[i]) always holds, the branch is almost perfectly predictable, and the branching version minmax1 can be more efficient than the conditional-move version.
