How to Speed Up File Downloads With Python by@exactor

How to Speed Up File Downloads With Python

image
Maksim Kuznetsov HackerNoon profile picture

Maksim Kuznetsov

Just a Senior Python Developer.

Very often, some admins set limits on the speed of downloading files, this reduces the load on the network, but at the same time it is very annoying for users, especially when you need to download a large file (from 1 GB), and the speed fluctuates around 1 megabit per second (125 kilobytes per second). Based on these data, we conclude that the download speed will be at least 8192 seconds (2 hours 16 minutes 32 seconds). Although our bandwidth allows us to transfer up to 16 Mbps (2 MB per second) and it will take 512 seconds (8 minutes 32 seconds).

These values ​​​​are not taken by chance, for such a download, initially I used exclusively 4G Internet.

Case:

The utility developed by me and laid out below - works only if:

  • You know in advance that your bandwidth is higher than the download speed
  • Even large site pages load quickly (the first sign of artificially low speed)
  • You are not using a slow proxy or VPN
  • Good ping to the site

What are these restrictions for?

  • optimization of the backend and the return of static files
  • DDoS Protection

How is this slowdown implemented?

Nginx

location /static/ {
   ...
   limit rate 50k; -> 50 kilobytes per second for a single connection 
   ...
}

location /videos/ {
   ...
   limit rate 500k; -> 500 kilobytes per second for a single connection
   limit_rate_after 10m; -> after 10 megabytes download speed, will 500 kilobytes per second for a single connection
   ...
}

Feature with zip file

An interesting feature was discovered when downloading a file in the zip extension, each part allows you to partially display the files in the archive, although most archivers will say that the file is broken and not valid, some of the content and file names will be displayed.

Code parsing:

To create this program, we need Python, asyncio, aiohttp, aiofiles. All code will be asynchronous to increase performance and minimize overhead in terms of memory and speed. It is also possible to run on threads and processes, but when loading a large file, errors may occur when a thread or process cannot be created.

async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length

This function returns the length of the file. And the request itself uses HEAD instead of GET, which means that we get only the headers, without the body (content at the given URL).

def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size

This generator returns ranges for download. An important point is to choose part_size that is a multiple of 1024 to keep proportions per megabytes, although it seems that any number will do. It doesn't work correctly with part_size = 1, so I defaulted to 10 MB per part.

async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            file = await aiofiles.open(save_path, 'wb')
            await file.write(await request.content.read())

One of the main functions is a file download. It works asynchronously. Here we need asynchronous files to speed up disk writes by not blocking input and output operations.

async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        tasks.append(download(URL, {'Range': f'bytes={sizes[0]}-{sizes[1]}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)

The most basic function gets the filename from the URL, converts it into a numbered .part file, creates a temporary directory under the original file, all parts are downloaded into it. await asyncio.gather(*tasks) allows you to execute all collected coroutines concurrently, which significantly speeds downloading. After that, the already synchronous shutil.copyfileobj method concatenates all files into one file.

async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])

The main function receives a list of URLs from the command line and, using the already familiar asyncio.gather, it starts downloading many files at the same time.

Benchmark:

On one of the resources I found, a benchmark was carried out on downloading a Gentoo Linux image from a site of one university(slow server).

  • async: 164.682 seconds
  • sync: 453.545 seconds

Download DietPi distribution (fast server):

  • async: 17.106 seconds best time, 20.056 seconds worst time
  • sync: 15.897 seconds best time, 25.832 worst time

As you can see, the result reaches almost 3x acceleration. On some files, the result reached 20-30 times.

Possible improvements:

  • More secure download. If there is an error, restart the download.
  • Memory optimization. One of the problems is a 2x increase in storage space consumption. (when all parts are downloaded, copied to a new file, but the directory has not yet been deleted). Easily fixed by deleting the file immediately after copying the contents of the part.
  • Some servers keep track of the number of connections and can ruin such a load, this requires pausing or greatly increasing the size of the part.
  • Adding a progress bar.

In conclusion, I can say that asynchronous loading is the way out, but unfortunately not a silver bullet in the matter of downloading files.

import asyncio
import os.path
import shutil

import aiofiles
import aiohttp
from tempfile import TemporaryDirectory
import sys
from urllib.parse import urlparse


async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length


def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size


async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            file = await aiofiles.open(save_path, 'wb')
            await file.write(await request.content.read())


async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1]}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)


async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])


if __name__ == '__main__':
    import time

    start_code = time.monotonic()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    print(f'{time.monotonic() - start_code} seconds!')

Comments

Signup or Login to Join the Discussion

Tags

Related Stories