JavaScript: Enable Parallelism for Private File Uploading

by Artsem, June 20th, 2023

Too Long; Didn't Read

Private file storage means that only you and the application you rely on own your data. In this article, I will explain how to speed up file uploading for private applications. I will use the TweetNaCl.js package, which stays as close as possible to the original C implementation and is quite easy to use.

Intro

Secure or private file storage: what is the difference? If your application provides secure file handling, it means that only you and the application you rely on own your data, e.g., Dropbox and Google Drive. But if you need true privacy and don’t want the company you rely on to share or sell your files to a third party, you need to think about moving to private solutions like Tresorit or Sharekey. In this article, I will try to explain how to speed up file uploading for private applications. The image below illustrates how private file uploading in an app should work.

Image 1 - Schema of file uploading

An intro to the file uploading process

File uploading is a trivial and pretty common task in web development, and there are plenty of articles about how to upload files using JavaScript. Here, I will explain a simplified version of the mechanism.


If we are talking about the general approach to file uploading, it consists of several steps:


  1. Reading the file from the device
  2. Preparing the file for uploading
  3. Uploading the file


If we are talking about the 1st and 3rd points, the approach is quite clear. The 2nd point deserves a note: if an error is thrown somewhere in the process of uploading a file, the whole upload has to be restarted from scratch, so the way we prepare the file matters.


Also, if you are going to transmit huge files over the network, you can face performance issues and even get a 413 error (Payload Too Large). The way to work around such issues is quite simple: we just need to split the file into chunks. After that, we can encrypt each chunk, and we are good to go.

Encryption

Encryption is the heart of a private application. In short, encryption is a way of scrambling data so that only authorized parties can understand the information. It hides your data from anyone who shouldn’t have access to it. JS, like many other programming languages, gives you the ability to add encryption to your application, but it goes about it a bit differently.


Commonly, the process of encryption is based on manipulating binary data.


Traditionally, the JavaScript language didn’t really support interaction with binary data. Sure, there were strings, but they abstracted away the underlying data storage mechanism. There were also arrays, but those can contain values of any type and aren’t appropriate for representing binary buffers.


Eventually, the ArrayBuffer object was created, and it is now a core part of the language. An instance of ArrayBuffer represents a fixed-length buffer of binary data that cannot be resized, and it cannot be read or written directly. Because raw binary data is ambiguous, a “view” into the buffer must first be created, and we use that view to read from and write to the underlying bytes. Several such views are available in JavaScript, e.g., Uint8Array, Int8Array, and so on.
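
A minimal sketch of the relationship between a buffer and its views:

// a fixed-length buffer of raw bytes; it cannot be read or written directly
const buffer = new ArrayBuffer(8);

// two views over the same 8 bytes, interpreting them differently
const bytes = new Uint8Array(buffer);   // eight 8-bit unsigned integers
const words = new Uint32Array(buffer);  // two 32-bit unsigned integers

bytes[0] = 255;
console.log(words[0]);          // 255 on little-endian machines: the views share memory
console.log(buffer.byteLength); // 8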


Why mention this here? First, to encrypt our file, we need to represent it as a TypedArray and then manipulate its bytes so that a third party cannot make sense of them. And second, there is SharedArrayBuffer (but more on that later).


Okay, we are done with the representation of the file. The one thing left is the process of encryption itself. I don’t think this is the right article to dive deep into the nuances of encryption, so let’s just use an open-source solution. Here, I will use the TweetNaCl.js package, which stays as close as possible to the original C implementation and is quite easy to use.


So let’s start the implementation and see what we can optimize.

First steps and the first problems

Let’s start with the file reading process, which is quite straightforward.


Let's say we have an input of type file with id="file_handler".

let chunksQueue = [];

const fileInput = document.getElementById('file_handler');

fileInput.addEventListener('change', function () {
  const file = fileInput.files[0];

  readFile(file).then((res) => {
    const chunksAmount = getChunksAmount(res.length);
    chunksQueue = getChunksQueue(chunksAmount);
    sendChunk(res, file.name).catch((e) => console.error(e));
  });
});


The readFile function reads the file into a Uint8Array, which is the preferred format for encryption.

function readFile(file) {
  return new Promise((resolve, reject) => {
    const fileReader = new FileReader();

    fileReader.onload = function () {
      const data = fileReader.result;
      const fileUint8Array = new Uint8Array(data);

      resolve(fileUint8Array);
    };

    // reject so the caller's .catch can handle read failures
    fileReader.onerror = () => reject(fileReader.error);

    fileReader.readAsArrayBuffer(file);
  });
}


Also, let’s define CHUNK_SIZE, the size of one file chunk. The number of chunks is then Math.ceil(file.size / CHUNK_SIZE). Based on chunksAmount, we will build chunksQueue, which contains the indices of the file chunks that should be transferred to the server.

const CHUNK_SIZE = 1000000; // 1 MB per chunk, the same value the worker uses later

function getChunksAmount(fileSize) { return Math.ceil(fileSize / CHUNK_SIZE) }

const getChunksQueue = (chunksQuantity) => {
  return new Array(chunksQuantity).fill().map((_, index) => index);
};


The sendChunk function is responsible for encrypting, uploading, and retrying the upload in case of an error:

async function sendChunk(file, fileName) {
  if (!chunksQueue.length)
    return;

  const chunkId = chunksQueue.shift();
  const begin = chunkId * CHUNK_SIZE;
  const chunk = file.slice(begin, begin + CHUNK_SIZE);

  const encryptedChunk = encrypt(chunk, secretKey);

  upload({ fileChunk: encryptedChunk, fileName, chunkId })
    .then(() => {
      sendChunk(file, fileName);
    })
    .catch(() => {
      // put the failed chunk back at the front of the queue and retry
      chunksQueue.unshift(chunkId);
      sendChunk(file, fileName);
    });
}


encrypt is the function for encrypting chunks:

function encrypt(chunk, secretKey) {
  // create a nonce; it must also be sent to the server, but that's out of scope for this article
  const nonce = nacl.randomBytes(24);

  // chunk encrypted with nonce and secret key
  const encryptedChunk = nacl.secretbox(chunk, nonce, secretKey);

  // base64 encoded typed array to be sent to the server
  return nacl.util.encodeBase64(encryptedChunk);
}


Don’t forget to create the user’s secretKey, which will be used for encrypting and decrypting files:

const secretKey = nacl.randomBytes(32);
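
For completeness, here is a minimal sketch of the matching decryption step. It is not part of the upload flow in this article, and it assumes the nonce was stored alongside each chunk:

function decrypt(encodedChunk, nonce, secretKey) {
  // base64 string back into bytes
  const box = nacl.util.decodeBase64(encodedChunk);

  // returns null if the key is wrong or the data was tampered with
  const chunk = nacl.secretbox.open(box, nonce, secretKey);

  if (!chunk)
    throw new Error('Decryption failed');

  return chunk;
}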


And the last client function - the function for uploading file chunks to the server:

async function upload({ fileChunk, fileName, chunkId }) {
  return fetch(`${BACKEND_URL}/upload`, {
    method: 'POST',
    body: JSON.stringify({ fileChunk, fileName, chunkId }),
    headers: {
      'Content-Type': 'application/json',
    },
  })
    .then(response => response.json())
    .then(res => {
      if (res.status === 200)
        return 'ok';

      // throw so the caller's .catch can re-queue the chunk
      throw new Error('Server Error');
    });
}


The last step is handling file chunks on the server. First, let’s create a simple server.js file that will be responsible for running the express server. There will be two endpoints: one for serving the static client index.html file, and an API route for uploading file chunks. All chunks will be stored on the filesystem in a dedicated folder (uploaded).

// server.js
const express = require('express');
const path = require('path');
const fs = require('fs');

const app = express();

const hostname = '127.0.0.1';
const port = 5000;

// increase the request body size limit to avoid 413 errors; express.json also parses JSON bodies
app.use(express.json({ limit: '500mb' }));
// for accessing js files from index.html
app.use(express.static('public'));

app.get('/', (req, res) => {
  res.sendFile(path.join(__dirname, './index.html'));
});

app.post('/upload', (req, res) => {
  const { fileChunk, fileName, chunkId } = req.body;
  const directoryPath = path.join(__dirname, `./uploaded/${fileName}`);

  // create a folder for the file's chunks in the file system
  fs.mkdir(directoryPath, (e) => {
    // if the folder already exists, do not treat it as an error
    if (e && e.code !== 'EEXIST') {
      console.error(e);

      // return so we don't try to send a second response below
      return res.status(500).send({ status: 500, error: e.message });
    }

    // write the file chunk into the created folder
    fs.writeFile(path.join(directoryPath, `./${chunkId}`), fileChunk, (e) => {
      if (e)
        res.status(500).send({ status: 500, message: e.message });
      else
        res.status(200).send({ status: 200, message: 'Success' });
    });
  });
});

app.listen(port, hostname, () => {
  console.log(`Server running at http://${hostname}:${port}/`);
});


Looking at the proposed approach, we can notice that each chunk is encrypted synchronously, which can hurt UI performance. As said in the Encryption section, encryption in JS is a manipulation of TypedArrays, and the only way to manipulate array data is with a loop. Long synchronous loops block the event loop (creepy-sounding, I know), which ultimately reduces I/O responsiveness.


So in some cases, when the file is huge, encrypting its chunks may stop the web page from responding for some time.


To avoid this behavior, we need to move the encryption work somewhere else. We could try to make it asynchronous, but the best option is to hand this job to another thread.

Parallelism

Browser JavaScript has no single, bespoke implementation like most other programming languages do. There are several engines used in different web browsers, like V8 in Chrome and SpiderMonkey in Firefox. But most of them provide Web Workers, which are mainly used for achieving parallelism. There is more than one type of Web Worker (e.g., SharedWorker, ServiceWorker). The simplest one is the dedicated worker, which is what this article uses.


The main difference between a dedicated worker and the others is that a dedicated worker is only accessible by the script that spawned it.
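
Just to illustrate the messaging model, here is a minimal round trip between the two files (the file names are placeholders):

// main.js: spawn the worker and exchange one message
const worker = new Worker('worker.js');

worker.onmessage = (msg) => console.log(msg.data); // logs 'pong'
worker.postMessage('ping');

// worker.js: reply to whatever the main thread sends
// self.onmessage = () => self.postMessage('pong');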


First, let’s add headers to our express static response on the server so the client can access the worker.js file:

app.use(express.static(
  'public', {
    setHeaders: function(res) {
      // header necessary for accessing the worker file from the client
      res.set('Cross-Origin-Embedder-Policy', 'require-corp');
    }
  }
));


On the client, we need to create a worker instance and change the logic of sendChunk to use the worker for encryption instead of the main process.

// instantiating the worker
const worker = new Worker("worker.js");

async function sendChunk(file, fileName) {
  if (!chunksQueue.length)
    return;

  const chunkId = chunksQueue.shift();
  const begin = chunkId * CHUNK_SIZE;
  const chunk = file.slice(begin, begin + CHUNK_SIZE);

  worker.onmessage = (msg) => {
    const { encryptedChunk: fileChunk } = msg.data;

    upload({
      fileChunk,
      fileName,
      chunkId,
    }).then(() => {
      sendChunk(file, fileName);
    }).catch(() => {
      // put the failed chunk back at the front of the queue and retry
      chunksQueue.unshift(chunkId);
      sendChunk(file, fileName);
    });
  };

  worker.postMessage({ chunk });
}


And finally, the dedicated worker code:

// worker.js
console.log('Running worker...');

if (typeof importScripts === 'function') {
  importScripts('nacl.min.js');
  importScripts('nacl-util.min.js');
  importScripts('encrypt.js');

  const secretKey = nacl.randomBytes(32);

  self.onmessage = (msg) => {
    if (msg.data.chunk) {
      const { chunk } = msg.data;
      const encryptedChunk = encrypt(chunk, secretKey)

      postMessage({ encryptedChunk });
    }
  }
}


We have come a long way, so let’s analyze what we have at this point and look for weak spots.

Image 2 - Schema of file uploading with encryption in separate thread


In this example, we moved the heavy synchronous calculations into the worker so they would not block the main process. But if we look through the current code, we can see that we are using the message-passing API to hand file chunks to the worker. This can slow down the upload because every file chunk is cloned on its way to the worker.
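
As an aside, postMessage itself offers a partial remedy that I won't use here: an ArrayBuffer can be transferred instead of cloned, a zero-copy move that detaches the buffer from the sender:

// default: structured clone, the chunk's bytes are copied to the worker
worker.postMessage({ chunk });

// with a transfer list: zero-copy move of the underlying buffer;
// after this, chunk is detached and unusable on the main thread
worker.postMessage({ chunk }, [chunk.buffer]);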


What if we could use something like shared memory, to which our worker would have the same access as the main process?

Shared Memory

The SharedArrayBuffer class allows you to share memory between two threads without relying on message passing. It is essentially the same object as the ArrayBuffer mentioned in the Encryption section, with only one difference: it can be shared between all workers. Data still gets sent to and from the worker using the postMessage() method. Certain types are so-called transferable objects that move from one context to another with a zero-copy operation, and a SharedArrayBuffer goes one step further: it is neither copied nor moved, since both sides simply see the same memory. This results in high performance.
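
A minimal sketch of the mechanics (note that browsers only expose SharedArrayBuffer on cross-origin isolated pages, which is what the COEP header above was for):

// main thread: allocate shared memory and hand it to the worker
const shared = new SharedArrayBuffer(1024); // fixed length, in bytes

const worker = new Worker('worker.js');
worker.postMessage({ shared }); // the buffer is shared, not cloned

// worker.js:
// self.onmessage = (msg) => {
//   const view = new Uint8Array(msg.data.shared);
//   view[0] = 42; // immediately visible to the main thread
// };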


To better understand how we can use this, let’s edit our previous example.


Image 3 - Schema of file uploading with Shared memory.

As you can see, even with shared memory, we are still transferring data between the worker and the main process (illustrated by the red arrow in the 3rd image). This can be solved by using another piece of shared memory for the encrypted result, or by simply moving the API request inside the worker.


In this article, we will use the 2nd approach.


So our workflow will be transformed into:

Image 4 - Schema of file uploading with Shared memory v2.

But when you try to use SharedArrayBuffer, you will find that you need to copy the file’s Uint8Array into the SharedArrayBuffer: a SharedArrayBuffer is just a pointer to memory, and it only accepts a byteLength when initialized. You have to create a view over it and copy the existing data in, and copying the file’s Uint8Array would slow down the upload.
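
A sketch of the extra copy this forces, assuming fileBytes is the Uint8Array we read earlier:

// SharedArrayBuffer is allocated by size only; it cannot wrap existing bytes
const shared = new SharedArrayBuffer(fileBytes.byteLength);

// so the file's bytes must be copied in through a view: the O(n) copy we wanted to avoid
new Uint8Array(shared).set(fileBytes);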


Then the final solution appeared. Why not just pass the File object from the input? After refreshing my memory on what a File instance is in JS, I found that File is an interface that provides information about a file and allows JavaScript in a web page to access its content. In other words, it is just a handle: it doesn’t contain the file’s content, it only refers to it. So I decided to simply pass it as a message from the main process and read the file’s content inside the worker.

Final solution

Image 5 - Final schema

According to image 5, we need to move all the logic from the main process to the worker. Only the input handler will stay:

// main.js
document.addEventListener("DOMContentLoaded", () => {
  const worker = new Worker('/worker.js');

  const fileInput = document.getElementById('file_handler');

  fileInput.addEventListener('change', function () {
    const file = fileInput.files[0];

    worker.postMessage({ file });
  });
});


And worker.js:

console.log('Running worker...');

if (typeof importScripts === 'function') {
  importScripts('nacl.min.js');
  importScripts('nacl-util.min.js');
  importScripts('encrypt.js');
  importScripts('upload.js');
  importScripts('sendChunk.js');
  importScripts('readFile.js');

  let chunksQueue = [];
  const CHUNK_SIZE = 1000000;

  function getChunksAmount(fileSize) { return Math.ceil(fileSize / CHUNK_SIZE) }

  const getChunksQueue = (chunksQuantity) => {
    return new Array(chunksQuantity).fill().map((_, index) => index);
  };

  self.onmessage = (msg) => {
    if (msg.data.file) {
      const { file } = msg.data;

      readFile(file).then((res) => {
        const chunksAmount = getChunksAmount(res.length);
        chunksQueue = getChunksQueue(chunksAmount);
        sendChunk(res, file.name, chunksQueue).catch((e) => console.error(e));
      });
    }
  };
}

You can find the full source code on GitHub: https://github.com/arty-ms/ParallelUpload

Conclusion

We can see that parallelism can bring a lot of benefit to JS developers and can make your UI more responsive and effective, as all the computation logic can be executed on a separate thread.

When I started this article, my plan assumed that SharedArrayBuffer would be the final solution. But in the process of writing the code, I realized that it wasn’t.


Still, it’s worth mentioning that SharedArrayBuffer is one of the core features of multithreaded JS, and everyone who uses the Workers API should be familiar with it. However, it is not useful in every case.


The next article will cover the usage of wasm code in JS, one of the most underrated client-side tools.
