Introduction to data oriented design – matrix multiplication

What is data oriented design?

Data oriented design is an approach that optimises CPU cache usage through the data layout itself. We separate and sort data fields according to their access order, which should decrease cache misses and speed up code whose bottleneck is the memory bus.

Do not confuse this approach with data-driven programming.

Example

Object oriented design:

typedef struct {
    char name[64];
    double income;
} Employee;
Employee employee[200];

Data oriented design:

char employeeName[200][64]; // or employeeName[200 * 64]
double employeeIncome[200];

The first example shows the standard OOP approach: we embed all employee information in one Employee structure and then create an array of employees. In most cases this approach is preferred, since it is easier to maintain, read and understand.

The second example shows a different approach, where we keep employee names and employee incomes in separate arrays. The names could also be kept in a single-dimension array, which would be even better than a multi-dimension array, since each extra dimension may require a jump to a different memory block and cause a cache miss. With the second layout we have to make sure that the index “i” used to access employeeName can be used on employeeIncome to reach the matching income.

In other words, employeeName[i] and employeeIncome[i] have to represent data of the same employee. Whenever we modify the tables we need to update the indices accordingly, which makes this layout harder to use than the standard OOP approach; a sketch of keeping the two arrays in sync follows below.
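To make this concrete, here is a minimal sketch of keeping the two arrays in sync; the addEmployee/removeEmployee helpers and the MAX_EMPLOYEES constant are my illustration, not part of the original example:

#include <string.h>

#define MAX_EMPLOYEES 200

char employeeName[MAX_EMPLOYEES][64];
double employeeIncome[MAX_EMPLOYEES];
int employeeCount = 0;

/* Append a new employee, writing both arrays under the same index
   so employeeName[i] and employeeIncome[i] stay matched. */
int addEmployee(const char *name, double income) {
    if (employeeCount >= MAX_EMPLOYEES) return -1;
    strncpy(employeeName[employeeCount], name, 63);
    employeeName[employeeCount][63] = '\0';
    employeeIncome[employeeCount] = income;
    return employeeCount++;
}

/* Remove employee i by moving the last entry into its slot;
   both arrays must be updated together or the indices go out of sync. */
void removeEmployee(int i) {
    employeeCount--;
    memcpy(employeeName[i], employeeName[employeeCount], 64);
    employeeIncome[i] = employeeIncome[employeeCount];
}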

There is, however, one big benefit of this alignment. Let's imagine we have to raise each income by 10%:

// OOP layout: strided access through an array of structs
for(int i=0; i<200; i++) {
    employee[i].income *= 1.1;
}
// DOD layout: sequential access through a flat array of doubles
for(int i=0; i<200; i++) {
    employeeIncome[i] *= 1.1;
}

In the first loop, each iteration has to jump sizeof(Employee) bytes (about 64 + 8 = 72) to reach the next income. Let's say a cache line is 128 bytes: reading the first employee loads 128 bytes into cache starting at employee[0], so we get the whole of employee[0] and most of employee[1], unfortunately without employee[1].income. When we then access employee[1].income we get a cache miss and have to load new data into cache.

In the second loop we also read 128 bytes at a time, but since each income is a double (8 bytes), one cache line holds 128 / 8 = 16 incomes. We can read 16 incomes without a single cache miss.

Data oriented design works great for algorithms whose bottleneck is memory access: it speeds up data access and lets the CPU spend its time on actual work.
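The timings later in this article compare implementations by wall-clock time. A minimal way to take such a measurement in C is sketched below; using clock() and the repetition count are my assumptions about methodology, not something stated in the article:

#include <stdio.h>
#include <time.h>

#define COUNT 200
#define REPS  1000000

double employeeIncome[COUNT];

int main(void) {
    /* Hypothetical micro-benchmark: repeat the income update enough
       times for the elapsed time to be measurable. */
    clock_t start = clock();
    for (int rep = 0; rep < REPS; rep++)
        for (int i = 0; i < COUNT; i++)
            employeeIncome[i] *= 1.1;
    printf("%.3fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}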

Matrix multiplication

This simple algorithm shows how powerful DOD can be for code that requires a lot of memory reads and writes but little CPU work. From Wikipedia:

In mathematics, a matrix (plural matrices) is a rectangular array or table of numbers, symbols, or expressions, arranged in rows and columns, which is used to represent a mathematical object or a property of such an object.

\[
\begin{bmatrix}
3 & 3 & 3 & 3 \\
3 & 1 & 1 & 3 \\
3 & 0 & 0 & 3 \\
3 & 2 & 2 & 3
\end{bmatrix}
\]

Multiplication

The result of multiplying two matrices A and B is a third matrix C. We can multiply matrices only if the first matrix's column count matches the second matrix's row count.

Matrix A with dimensions MxN and matrix B with dimensions NxL will result in matrix C with dimensions MxL.

To calculate C<sub>i,j</sub> we need to calculate the dot product of the i-th row of matrix A with the j-th column of matrix B.

The dot product is simply the sum of products of values at the same index in the row of A and the column of B:

\[
C_{i,j}=\sum_{r=1}^{n}A_{i,r} \cdot B_{r,j}
\]
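For example, for two 2x2 matrices:

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
=
\begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix}
=
\begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
\]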

Double array implementation

The natural representation of a matrix is an array of arrays of some number type (like int):

void basicMul(int **a, int **b, int **c, int n) {
    for(int row=0; row<n; row++) {        
        for(int col=0; col<n; col++) {
             int val = 0;
             for(int i = 0; i<n; i++) {
                 val += a[row][i] * b[i][col];
             }
             c[row][col] = val;
        }
    } 
}

However, in this case, whenever we move to a different row we jump to a different array that can lie far away in memory. When the rows are scattered like this we get a lot of cache misses, which can dramatically slow down performance; the allocation sketch below shows why.
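Each row of an int** matrix is typically a separate allocation, so consecutive rows are not guaranteed to be anywhere near each other. This allocation sketch is my illustration, not the benchmark code used for the results below:

#include <stdlib.h>

/* Sketch: allocate an n x n matrix as an array of separately
   allocated rows. Each malloc may return memory far from the
   previous one, so walking from row to row can jump between
   distant blocks. */
int **allocMatrix(int n) {
    int **m = malloc(n * sizeof(int *));
    for (int row = 0; row < n; row++)
        m[row] = malloc(n * sizeof(int)); /* rows may be scattered */
    return m;
}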

“Aligned” shows the result when memory is contiguous; “fragmented” shows the result when each row was far away in memory. To achieve fragmentation I allocated N + 1000 columns for each row array. The extra columns were ignored by the algorithm but shifted the row arrays apart in memory.

C aligned    C fragmented    Kotlin aligned    Kotlin fragmented
13.98s       58.88s          31s               58.71s

Comparing the aligned and fragmented results we see a huge difference: the fragmented array is 4.2x slower for C and almost 2x slower for Kotlin. Performance of this code depends heavily on memory fragmentation. We can easily get rid of this issue by changing the multi-dimension array into a one-dimension array.

Single dimension array implementation

The previous example shows how allocation can cause jumps through memory and hurt performance. The easiest solution is to allocate all required data in a single allocation, which prevents fragmentation.

For a matrix of size MxN we can allocate a single-dimension array of size M*N and store the rows in order: R<sub>0</sub>, R<sub>1</sub>, … To get to the i-th row (counting from 0) we need to skip “i” rows; each row has “n” columns, so the starting offset of the i-th row is i * n. To get to the j-th column in that row we skip an additional “j” columns, giving the final formula:

\[
matrix[i][j] => matrix[i * n + j]
\]
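With this layout the whole matrix comes from one allocation; a short sketch (the helper name is my own):

#include <stdlib.h>

/* One malloc for the whole n x n matrix: all rows are contiguous,
   so a[row * n + col] walks through a single memory block. */
int *allocFlatMatrix(int n) {
    return malloc((size_t)n * n * sizeof(int));
}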

void basicMul(int *a, int *b, int *c, int n) {
        for(int row=0; row<n; row++) {
                for(int col=0; col<n; col++) {
                        int val = 0;
                        for(int i = 0; i<n; i++) {
                                val += a[row * n + i] * b[i * n + col];
                        }
                        c[row * n + col] = val;
                }
        }
}

Now each matrix is always one contiguous memory block, so fragmentation is no longer an issue. For C the result is stable and slightly faster than the perfectly aligned double-array implementation. The Kotlin implementation improved 2x over its best double-array result and is now less than 10% slower than the C version.

C         Kotlin
13.70s    15s

Transposed array implementation

Let's have a look at the main multiplication loop:

val += a[row * n + i] * b[i * n + col];

Our loop iterator is “i”, which means that for array “a” we move to neighbouring bytes on each iteration. This is perfect for the cache, since the CPU prefetches contiguous blocks of bytes, so we should get a lot of cache hits.

Unfortunately, for array “b” we jump “n” elements (n * sizeof(int) bytes) on each iteration, which will most likely cause a lot of cache misses. We can improve this by transposing the matrix: a transposed matrix swaps the value at i,j with the value at j,i. After transposing we can change the inner iteration over the “b” matrix into linear access:

void transpose(int *a, int n) {
	for(int row=0; row<n - 1; row++) {
		for(int col=row + 1; col<n; col++) {
			int tmp = a[row * n + col];
			a[row * n + col] = a[col * n + row];
			a[col * n + row] = tmp;
		}
	}
}


void basicMulT(int *a, int *bT, int *c, int n) {
	for(int row=0; row<n; row++) {
		for(int col=0; col<n; col++) {
			int val = 0;
			for(int i = 0; i<n; i++) {
				val += a[row * n + i] * bT[col * n + i];
			}
			c[row * n + col] = val;
		}
	} 
}

The result includes both transposing the B matrix in place and the multiplication. We improved the previous result roughly 10x for C and almost 3x for Kotlin. This also shows the power of C's memory management.

C        Kotlin
1.33s    5.39s

Cache blocking implementation

We can do better with the cache blocking technique. This algorithm splits the matrices into blocks of size “BS”x“BS”; instead of running the internal loop over the full “N” elements, it runs only “BS” iterations and then jumps to the next row. You might think these jumps should cause a lot of cache misses, but if one BSxBS block of matrix “a” and one of matrix “b” fit in cache together, we actually get far fewer overall cache misses while computing a whole BSxBS block of the result.

// Multiplies a by the already transposed bT, block by block.
// Note: this version assumes n is divisible by bs.
void blockingMulT(int *a, int *bT, int *c, int n, int bs) {
	for(int i=0; i<n*n; i++) c[i]=0;
	for(int row=0; row<n; row+=bs) {
		for(int col=0; col<n; col+=bs) {
			for(int kk = 0; kk<n; kk += bs) {
				// Accumulate one bs x bs tile; the a and bT tiles stay cache-resident.
				for(int i=row; i<row+bs; i++) {
					for(int j=col; j<col+bs; j++) {
						int val = c[i * n + j];
						for(int k = kk; k<kk + bs; k++) {
							val += a[i * n + k] * bT[j * n + k];
						}
						c[i * n + j] = val;
					}
				}
			}
		}
	}
}
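A reasonable heuristic for picking “bs” is to make the tiles of a, bT and c touched in one block step fit into the L1 data cache together: for a 32 KB L1 and 4-byte ints that gives bs around 50, so a power of two like 32 is a safe starting point. These numbers are my assumption; the article does not state which block size was used. A hypothetical usage sketch, reusing the transpose function from the previous section:

#include <assert.h>

/* Hypothetical wrapper: transpose B, pick a tile size and multiply. */
void mulBlocked(int *a, int *b, int *c, int n) {
    int bs = 32;              /* assumed tile size; tune per CPU */
    assert(n % bs == 0);      /* loop bounds in blockingMulT need this */
    transpose(b, n);          /* B is consumed in transposed form */
    blockingMulT(a, b, c, n, bs);
}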

The final result shows that this only helps C, since Kotlin runs on the JVM, which pollutes our cache:

C        Kotlin
0.84s    7.4s

Summary

The final results show how much a correct memory layout can improve algorithm performance:

  • C started at 58.88s and finished at 0.84s, which is 70x faster(!)
  • Kotlin started at 58.71s and its best result was 5.39s, almost 11x faster.
