Programming using is a bit more complex than in other languages. Although there are some excellent packages, such as , the documentation is poor, lacks examples and it’s difficult to use. CUDA Go mumax CUDA is for , so the best alternative is to use and invoke an external function with your . This is what I will do in this example, where I multiply two matrices using . C Command cgo Cuda Kernel CUDA If you want to know more about programming, read the . CUDA my article Kernel I created a that has the Kernel function and a helper function to be called externally. Note that I used because this is how invokes functions: Simple Kernel extern C cgo __ { row = blockIdx.y*blockDim.y+threadIdx.y; col = blockIdx.x*blockDim.x+threadIdx.x; (col < size && row < size) { result = ; ( ix= ;ix<size;ix++) { result += A[row*size+ix]*B[ix*size+col]; } C[row*size+col] = result; } } { { total = size*size; * gpu_A; * gpu_B; * gpu_C; msize = total * ( ); cudaMalloc(( **)&gpu_A, msize); cudaMemcpy(gpu_A,A,msize,cudaMemcpyHostToDevice); cudaMalloc(( **)&gpu_B, msize); cudaMemcpy(gpu_B,B,msize,cudaMemcpyHostToDevice); cudaMalloc(( **)&gpu_C,msize); ; ; vecmul<<<grid,blocks>>>(gpu_A,gpu_B,gpu_C,size); cudaMemcpy(C,gpu_C,msize,cudaMemcpyDeviceToHost); cudaFree(gpu_A); cudaFree(gpu_B); cudaFree(gpu_C); } } # include <stdio.h> # include <cuda.h> global__ void vecmul ( *A, * B, *C, size) float float float int // Row and Column indexes: int int // Are they bellow the maximum? if float 0 for int 0 extern "C" void maxmul ( *A, * B, *C, size) float float float int int // Allocate device memory: float float float int sizeof float void void void // Blocks & grids: dim3 blocks (size,size) dim3 grid ( , ) 1 1 // Call the kernel: // Get the result Matrix: //Free device matrices The function is the kernel and the function is the helper. Its function is to allocate memory in the , copy the parameters, invoke the kernel, and copy the result. Values are passed by reference. vecmul() maxmul() GPU Go code invokes the function and displays the result: Program maxmul.go helper main { C.maxmul(&a[ ], &b[ ], &c[ ], C. (size)) } { a := []C.float{ , , , , , , , , } b := []C.float{ , , , , , , , , } c []C.float = ([]C.float, ) Maxmul(a,b,c, ) fmt.Println(c) } package /* void maxmul(float *A, float* B, float *C, int size); #cgo LDFLAGS: -L. -L./ -lmaxmul */ import "C" import "fmt" func Maxmul (a []C.float, b []C.float, c []C.float, size ) int 0 0 0 int func main () //in := []C.float{1.23, 4.56} //C.test(&in[0]) // C 1.230000 4.560000 -1 2 4 0 5 3 6 2 1 3 0 2 3 4 5 4 7 2 var make 9 3 Before importing the package, which allows to invoke external functions in pure code (extern C), I pass the configuration of , indicating the prototype of the function , the path to and its name. C C cgo C lib I had to create a function in the code to invoke the external function to make things easier. It simply passes the reference to the arrays (the address of the first position) and the array size (in this case 3x3 = 9). In we work with matrices. wrapper Go CUDA flat I used the type to create containing my arrays (transformed into vectors). Then I called the function. Note that I passed the size of each row (or column). C.float slices Compiling To compile the code use the command: C nvcc --ptxas-options=-v --compiler-options -o libmaxmul.so --shared maxmul.cu '-fPIC' You need to have CUDA and the Nvidia driver installed! Then just run the code with the command: Go go run maxmul.go ... [19 36 16 27 41 31 28 15 24] And this is the result of matrix multiplication! Full Source Code is here: https://github.com/cleuton/golang-network/tree/master/english/cuda/nostress