Difference between revisions of "Programming Model"

From NaplesPU Documentation
Jump to: navigation, search
(Nu+ Programming Model)
Line 3: Line 3:
 
In nu+ each thread runs the same code provided by the user, through builtins developers can parallelize it or differentiate flows. Each thread has a private stack.
 
In nu+ each thread runs the same code provided by the user, through builtins developers can parallelize it or differentiate flows. Each thread has a private stack.
  
 +
===SIMD Support===
 
Arithmetic operators (+, -, *, /, %), relational operators (==, !=, <, <=, >, >=), bitwise operators (&, |, ^, ~, <<, >>), logical operators (&&, ||, !) and assignment operators (=, +=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=) can be used with scalar and vector types and produce a scalar or vector signed integer result respectively. In some cases, mixed scalar/vector operations are possible. In such a cases, the scalar is considered as a vector with all elements equal to the scalar value.
 
Arithmetic operators (+, -, *, /, %), relational operators (==, !=, <, <=, >, >=), bitwise operators (&, |, ^, ~, <<, >>), logical operators (&&, ||, !) and assignment operators (=, +=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=) can be used with scalar and vector types and produce a scalar or vector signed integer result respectively. In some cases, mixed scalar/vector operations are possible. In such a cases, the scalar is considered as a vector with all elements equal to the scalar value.
  

Revision as of 16:09, 3 June 2019

Nu+ Programming Model

In nu+ each thread runs the same code provided by the user, through builtins developers can parallelize it or differentiate flows. Each thread has a private stack.

SIMD Support

Arithmetic operators (+, -, *, /, %), relational operators (==, !=, <, <=, >, >=), bitwise operators (&, |, ^, ~, <<, >>), logical operators (&&, ||, !) and assignment operators (=, +=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=) can be used with scalar and vector types and produce a scalar or vector signed integer result respectively. In some cases, mixed scalar/vector operations are possible. In such a cases, the scalar is considered as a vector with all elements equal to the scalar value.

For instance, to add two vectors:

#include <stdint.h>
int main (){
  vec16i32 a;
  vec16i32 b;
  …
  vec16i32 c = a+b;
}

or a vector with a scalar:

#include <stdint.h>
int main (){
 vec16i32 a;
 int b;
 …
 vec16i32 c = a+b;
}

In order to access vector elements, it is possible to use the operator []. For instance

#include <stdint.h>
int main (){
 vec16i32 a;
 // assign some values:
 for (int i=0; i<16; i++) a[i]=i;
 int sum = 0;
 // calculate sum
 for (int i=0; i<16; i++) sum += a[i];
}

Vectors can be initialized using curly bracket syntax. For instance, a constant vector:

#include <stdint.h>
int main (){
  const vec16i32 a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
}

or a non-constant vector

#include <stdint.h>
int main (){
 int x, y, z;
 ...
 vec16i32 a = { x, y, z, x, y, z, x, y, z, x, y, z, x, y, z, x};
}

Conversions between vectors with the same number of elements can be performed using the LLVM intrisic __builtin_convertvector. Vector types v16i32, v16u32, v16f32 can be converted between each other. Similarly, vector types v8i64, v8u64, v8f64can be converted between each other. For instance:

#include <stdint.h>
int main (){
 vec16f32 a;
 ...
 vec16i32 d = __builtin_convertvector(a,vec16i32);
}

It is also possible to convert floating point vectors with a different number of elements. In such a case, it is required to use the two nu+ intrinsics __builtin_nuplus_v8f64tov16f32 or __builtin_nuplus_v16f32tov8f64. The first one converts 8 double precision FP elements into 8 single precision FP elements that are placed in the first 8 elements of a v16f32 vector. The second one converts the first 8 single precision FP elements of a v16f32 vector into 8 double precision FP elements. For instance:

#include <stdint.h>
int main (){
 vec16f32 a;
 ...
 vec8f64 b = __builtin_nuplus_v16f32tov8f64(a);
}

Vector comparisons can be done in two different ways. It is possible to use the traditional relational operators that result in another vector type of the same size of the two vectors. For instance:

#include <stdint.h>
int main (){
 vec16i32 a;
 vec16i32 b;
 …
 vec16i32 c = a < b;
}

After executing the above code, the vector c elements will be equal to 0xFFFFFFFF or 0x00000000 according to the result of the comparison. In addition, we also provide vector comparison intrinsics, as follow:

#include <stdint.h>
int main (){
 vec16i32 a;
 vec16i32 b;
 …
 int c = __builtin_nuplus_mask_cmpi32_slt (a, b)
}

After executing the above code, the integer c will contain a bitmap that can be directly used to write the mask register if required. Using vector comparison intrinsics is the natural way of performing comparisons in nu+.

Note that, in nu+, all instructions are masked and at the beginning, all lanes are enabled. If you want to handle SIMD control flow, you need to explicitly take care of masking operations so that they are only applied to some elements. For instance, take a look at the above code:

#include <stdint.h>
int main (){
 vec16i32 a;
 vec16i32 b;
 …
 int c = __builtin_nuplus_mask_cmpi32_slt (a, b) //generate mask for a<b
 int rm_old = __builtin_nuplus_read_mask_reg();  //save mask register
 __builtin_nuplus_write_mask_reg(c);             //write mask register for a=b
 __builtin_nuplus_write_mask_reg(c);             //write mask register for a>=b
 do_somethingelse();
 __builtin_nuplus_write_mask_reg(rm_old);        //restore the old mask
}

Nu+ Misc Intrinsics

Intr.png

Matrix Multiplication Multithreaded Example

The following code shows a multithread version of the matrix multiplication kernel running on a nu+ core. The code spreads the output computation among all available threads. Since each output element computation is independent of others, the thread parallelism is achieved dividing the outer cycle of the function. Output matrix row calculations are equally spanned across all cores and further spanned across threads. For each thread, the function first calculates the portion of the output matrix to compute at the core level: N / CORE_NUMB, with N the dimension of the matrix and CORE_NUMB the number of nu+ core in the system. Then, each thread starts computing the outer loop with start_loop = (core_id * N / CORE_NUMB) + thread_id the portion of output matrix to compute is multiplied to the running core ID (core_id) and added to the running thread ID (thread_id), and each iteration increments by the number of threads in the core THREAD_NUMB.

void matrix_mult(const int a[N][N], const int b[N][N], int mult[N][N], int core_id, int thread_id) {
 int start_loop = (core_id * N / CORE_NUMB) + thread_id;
 int end_loop = N / CORE_NUMB * (core_id + 1);
 
 for (int i = start_loop; i < end_loop; i += THREAD_NUMB){
   for (int j = 0; j < N; j++)
     for (int k = 0; k < N; k++)
       mult[i][j] += a[i][k] * b[k][j];
 }
}

Parameters core_id and thread_id are passed by the main function, and fetched from nu+ control registerr through builtins:

#define CORE_ID    __builtin_nuplus_read_control_reg(0)
#define THREAD_ID  __builtin_nuplus_read_control_reg(2)

In this way, each thread in each core computes a portion of the output matrix concurrently to the others.

The final result is ready when all thread over the system ended the assigned task, a system synchronization is required and in the main function this is achieved using barrier builtin provided by the programming model:

__builtin_nuplus_barrier(42, CORE_NUMB * THREAD_NUMB - 1);

The builtin above synchronizes CORE_NUMB * THREAD_NUMB number of threads (total number of thread running in the system) on the ID 42. When all threads hit the barrier the output matrix is ready, although most of it could be in private L1 caches. The next step is to flush output lines into the main memory:

if (THREAD_ID == 0 && CORE_ID == 0) {
     for (int i = 0; i < N*N; i += 64 / sizeof(int)) {
       __builtin_nuplus_flush((int) &C[i / N][i % N]);
     }
     __builtin_nuplus_write_control_reg(N*N, 12); // For cosimulation purpose
   }

Generally, flush operations are performed by a thread, in this case the first thread in the first core calls the flush builtin which sends output results still in L1 caches to the main memory.

The full code for the main function is attached below:

#define CORE_ID    __builtin_nuplus_read_control_reg(0)
#define THREAD_ID  __builtin_nuplus_read_control_reg(2)

int main(){
   init_matrix(A);
   init_matrix(B);

   matrix_mult(A, B, C, CORE_ID, THREAD_ID);
   __builtin_nuplus_barrier(CORE_ID + 1, THREAD_NUMB - 1);

   if (THREAD_ID == 0 && CORE_ID == 0) {
     for (int i = 0; i < N*N; i += 64 / sizeof(int)) {
       __builtin_nuplus_flush((int) &C[i / N][i % N]);
     }
     __builtin_nuplus_write_control_reg(N*N, 12); // For cosimulation purpose
   }

  return (int)&C;
}


Matrix Multiplication Vectorial Example

On the other hand, the vectorial version of the matrix multiplication function computes the output matrix in a SIMD multithreading fashion. Both input and output matrices are organized in target specific vector types, becoming vectors of vectors. Such an organization results in a distribution of the N column partial results on the 16 hardware lanes; each thread calculates N partial results every cycle. The size of the matrix must be multiple of 16.

void kernel_function(vec16i32 *A, vec16i32 *B, vec16i32 *C, int N) {
   uint32_t coreId =  __builtin_nuplus_read_control_reg(0);
   uint32_t threadId = __builtin_nuplus_read_control_reg(2);
   uint32_t nT = 2; // number of threads
   uint32_t nL = 16; // number of lanes
   uint32_t nC = N/nL; 
   uint32_t ndivnT = N/nT;
   uint32_t tIdndivnT = threadId*ndivnT;
   uint32_t tIdndivnTnC = tIdndivnT*nC;

   for (uint32_t i = coreId; i < ndivnT*nC; i+=CORE_NUMB){
       uint32_t col = (tIdndivnT+i)%nC;
       C[tIdndivnTnC+i] = 0;
       for (uint32_t j = 0; j < nC; j++){
           for (uint32_t k = 0; k < nL; k++){
               C[tIdndivnTnC+i] += A[tIdndivnTnC+i-col+j][k] * B[(nC*k)+(j*N)+col]; 
            }
       }
   }
}

Please note, the C[tIdndivnTnC+i] += A[tIdndivnTnC+i-col+j][k] * B[(nC*k)+(j*N)+col] performs 16 operations on 16 different data at time. The organization of the code and the thread parallelization are equivalent to its scalar version.

OpenCL support for Nu+

OpenCL defines a platform as a set of computing devices on which the host is connected to. Each device is further divided into several compute units (CUs), each of them defined as a collection of processing elements (PEs). Recall that the target platform is architecturally designed around a single core, structured in terms of a set of at most eight hardware threads. Each hardware threads competes with each other to have access to sixteen hardware lanes, to execute both scalar and vector operations on 32-bit wide integer or floating-point operands.

Plat model.png

The computing device abstraction is physically mapped on the nu+ many-core architecture. The nu+ many-core can be configured in terms of the number of cores. Each nu+ core maps on the OpenCL Compute Unit. Internally, the nu+ core is composed of hardware threads, each of them represents the abstraction of the OpenCL processing element.

Execution Model Matching

From the execution model point of view, OpenCL relies on an N-dimensional index space, where each point represents a kernel instance execution. Since the physical kernel instance execution is done by the hardware threads, the OpenCL work-item is mapped onto a nu+ single hardware-thread. Consequently, a work-group is defined as a set of hardware threads, and all work-items in a work-group executes on a single computing unit.

Execution model.png

Memory Model Matching

OpenCL partitions the memory in four different spaces:

  • the global and constant spaces: elements in those spaces are accessible by all work-items in all work-group.
  • the local space: visible only to work-items within a work-group.
  • the private space: visible to only a single work-item.

The target-platform is provided of a DDR memory, that is the device memory in OpenCL nomenclature. Consequently, variables are physically mapped on this memory. The compiler itself, by looking at the address-space qualifier, verifies the OpenCL constraints are satisfied.

Each nu+ core is also equipped with a Scratchpad Memory, an on-chip non-coherent memory portion exclusive to each core. This memory is compliant with the OpenCL local memory features

Memory mod.png

Finally, each hardware thread within the nu+ core has a private stack. This memory section is private to each hardware thread, that is the OpenCL work-item, and cannot be addressed by others. As a result, each stack acts as the OpenCL private memory.

Programming Model Matching

OpenCL supports two kinds of programming models, data- and task-parallel. A data-parallel model requires that each point of the OpenCL index space executes a kernel instance. Since each point represents a work-item and these are mapped on hardware threads, the data-parallel requirements are correctly satisfied. Please note that the implemented model is a relaxed version, without requiring a strictly one-to-one mapping on data.

A task-parallel programming model requires kernel instances are independently executed in any point of the index space. In this case, each work-item is not constrained to execute the same kernel instance of others. The compiler frontend defines a set of builtins that may be used for this purpose. Moreover, each nu + core is built of 16 hardware lanes, useful to realise a lock-step execution. Consequently, OpenCL support is realised to allow the usage of vector types. As a result, the vector execution is supported for the following data types:

  • charn, ucharn, are respectively mapped on vec16i8 and vec16u8, where n=16. Other values of n are not supported.
  • shortn, ushortn, are respectively mapped on vec16i16 and vec16i32, where n=16. Other values of n are not supported.
  • intn, uintn, are respectively mapped on vec16i32 and vec16u32, where n=16. Other values of n are not supported.
  • floatn, mapped on vec16f32, where n=16. Other values of n are not supported.

OpenCL Runtime Design

OpenCL APIs are a set of functions meant to coordinate and manage devices, those provide support for running applications and monitoring their execution. These APIs also provides a way to retrieve device-related information.

The following Figure depicts the UML Class diagram for the OpenCL Runtime as defined in the OpenCL specification. Grey-filled boxes represent not available features due to the absence of hardware support.

Uml.png

The custom OpenCL runtime relies on two main abstractions:

  • Low-level abstractions, not entirely hardware dependent, provide device-host communication support.
  • High-level abstractions, according to OpenCL APIs, administrate the life cycle of a kernel running onto the device.

OpenCL Example

The following code shows a vectorial matrix multiplication in OpenCL running on the nu+ device.

#include <opencl_stdlib.h>
#define WORK_DIM 4

__kernel void kernel_function(__global int16 *A, __global int16 *B, __global int16 *C, int rows, int cols)
{
   __private uint32_t threadId = get_local_id(0); 

   uint32_t nT = WORK_DIM; // number of threads
   uint32_t nL = 16;       // number of lanes

   uint32_t N = rows;
   uint32_t nC = N / nL;
   uint32_t ndivnT = N / nT;
   uint32_t tIdndivnT = threadId * ndivnT;
   uint32_t tIdndivnTnC = tIdndivnT * nC;
   for (uint32_t i = 0; i < ndivnT * nC; i++)
   {
       uint32_t col = (tIdndivnT + i) % nC;
       C[tIdndivnTnC + i] = 0;
       for (uint32_t j = 0; j < nC; j++)
       {
           for (uint32_t k = 0; k < nL; k++)
           {
               C[tIdndivnTnC + i] += A[tIdndivnTnC + i - col + j][k] * B[(nC * k) + (j * N) + col];
           }
       }
   }
}