Summary

LU Decomposition is an example of how to use Nu+ features to optimize an algorithm, using multithreaded and vectorial code.

Reusable code decomposition and multithreaded and vectorial optimization methods are shown; they can be applied to every other parallelizable algorithm, from image processing to machine learning ones.

Algorithm Description

LU Decomposition is used in solving systems of linear equations

Algorithm decomposes A in:

upper triangolar matrix
lower triangolar matrix

Partial Pivoting

Partial pivoting is an optimization of standard LU Decomposition that aims to reduce numerical instability

Therefore a matrix is needed to keep track of pivoting operations
Partial because pivoting is applied on rows only

These matrices have to verify the relation:

Pseudocode

nu+ Optimization

In this section, it is shown how the pseudo-code algorithm can be optimized using nu+ features.

Pivoting phase

The pivoting phase has been made multi-threaded. Every k-th iteration (k ranging from 0 to m-1) of the pivoting operations works on the elements belonging to the k-th column and to rows from k to m-1.

Less rows than threads

In the case rows are less than threads, the algorithm disables the exceeding threads and enables only as many threads as rows.

if (rowsTreated < THREADS) {
		if(id<rowsTreated-1)
		{
			v[id] = u[k+id][k];
			__builtin_nuplus_flush((int)&v[id]);
		}

A following barrier make disabled threads wait for working threads.

Rows are multiple of threads

Every thread is enabled, working on a number of rows defined by work_per_thread variable

	work_per_thread = rowsTreated / THREADS;
		if (rowsTreated % THREADS == 0) {
			for(int i=0; i<work_per_thread; i++)
			{
				v[k+id*work_per_thread+i] = u[k+id*work_per_thread+i][k];
				__builtin_nuplus_flush((int)&v[k+id*work_per_thread+i]);
				
			}

Rows are not multiple of threads

Every thread is enabled, making at least "work_per_thread" operation. Last thread (identified by the greatest thread_id) makes "extra_work" more operations.

			extra_work = work_per_thread % THREADS;
			for(int i=0; i<work_per_thread + (id==THREADS-1) * extra_work; i++)
			{		
				v[k+id*work_per_thread+i] = u[k+id*work_per_thread+i][k];
				__builtin_nuplus_flush((int)&v[k+id*work_per_thread+i]);
			}

If Threads with id less than the maximum will complete their operations before the maximum id one they wait for its completion in next barrier.

Row exchanging phase

For row exchaning phase, both multithreading and vectorialization has been used.

Multithreading in row exchanging

Multithreading row-exchanging has been made possible by creating an ad-hoc data structure: a vector containing pointers to matrices.

This way threads can access in the same cycle to own matrix using their own id and making the programmer sure that no access conflict can happen.

        vec16f32 vector_k_th = vectorialize(vom[id][k]);
	vec16f32 vector_pivot_th = vectorialize(vom[id][piv]);

vom is "vector of matrices" and every thread access to one ad one only matrix identified by own id.

Vectorialization

Row exchanges make use of vector built-ins and mask register operation.

Rows are temporarily stored in vector variables.

	vec16f32 vector_k_th = vectorialize(vom[id][k]);
	vec16f32 vector_pivot_th = vectorialize(vom[id][piv]);
	vec16f32 bufferVector = vector_k_th;

First, a mask-register writing allows only the desired elements to be exchanged.

	vec16i32 mask = createVectorMask(k);
	__builtin_nuplus_write_mask_regv16i32(mask);


vec16f32 createVectorMask (int k)
{   
	vec16f32 mask = __builtin_nuplus_makevectori32(0);
	
	
	for (int i= (id==0)*k; i<(id%2==0)*SIZE + (id==1)*(k-1); i++)
	{
		mask[i] = -1;
	}
	
	return mask;
}

Then the k-th vector row is exchanged with the pivot-th one.

        vector_k_th = vector_pivot_th;
	vector_pivot_th = bufferVector;

Vector variables are then de-vectorialized and their elements stored in the original locations.

    	devectorialize(vector_k_th,k);
	devectorialize(vector_pivot_th,piv);

Rows are saved in memory.

Mask-register is reset, enabling all lanes.

void restoreVectorMask()
{
	
	 __builtin_nuplus_write_mask_regv16i32(__builtin_nuplus_makevectori32(-1));
}

Built-ins usage

Even if LU decomposition kernel is not too complex, it was required to use several nu+ built-ins for different purposes.

Flush

This built-in is needed when it is necessary to store a variable in memory. If flush wasn’t used, computations would have effects on core caches only.

Barrier

Barriers allow the programmer to synchronize threads’ executions. As instance, when performing operations on 𝐿 and 𝑈 at the end of each iteration, other threads shall no start pivoting or rows-exchanging phase, because they would work on non-updated matrices. With barrier built-in it is possible forcing threads to wait for other ones’ job completion.

Mask register writing

Using this intrinsic it is possible activate or deactivate hardware lanes, i.e. performing row exchanging only on some elements, not for all the row

LU Decomposition

Contents

Summary

Algorithm Description

Partial Pivoting

Pseudocode

nu+ Optimization

Pivoting phase

Less rows than threads

Rows are multiple of threads

Rows are not multiple of threads

Row exchanging phase

Multithreading in row exchanging

Vectorialization

Built-ins usage

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools