Quite often admins put a limit on the speed at which files can be downloaded. This reduces the load on the network, but it is very annoying for users, especially when you need to download a large file (1 GB or more) and the speed hovers around 1 megabit per second (125 kilobytes per second). At that rate the download will take at least 8192 seconds (2 hours 16 minutes 32 seconds), even though our bandwidth allows up to 16 Mbit/s (2 MB per second), which would get the job done in 512 seconds (8 minutes 32 seconds).
These values were not chosen at random: for this kind of download I initially used 4G Internet exclusively.
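For the curious, the arithmetic above is easy to reproduce (a throwaway sketch; 1 GB is treated as 1024 MB and 1 Mbit/s as roughly 0.125 MB/s):

size_mb = 1024               # file size: 1 GB taken as 1024 MB
throttled_mb_s = 0.125       # 1 Mbit/s is roughly 0.125 MB/s
full_speed_mb_s = 2.0        # 16 Mbit/s is roughly 2 MB/s

print(size_mb / throttled_mb_s)   # 8192.0 seconds, about 2 h 16 min 32 s
print(size_mb / full_speed_mb_s)  # 512.0 seconds, about 8 min 32 s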
Case:
The utility I developed and laid out below works only if:
- the server reports the file size via Content-Length in response to a HEAD request;
- the server honors the Range header, i.e. allows the file to be downloaded in byte ranges.
What are these restrictions for?
How is this slowdown implemented?
Nginx
location /static/ {
    ...
    limit_rate 50k;          # 50 kilobytes per second for a single connection
    ...
}

location /videos/ {
    ...
    limit_rate 500k;         # 500 kilobytes per second for a single connection
    limit_rate_after 10m;    # the 500 KB/s limit kicks in after the first 10 megabytes
    ...
}
A quirk with zip files
An interesting quirk turned up when downloading a file with the .zip extension: each part partially exposes the files inside the archive. Although most archivers will report that the file is broken or invalid, some of the content and some file names are still displayed.
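As a rough illustration (a sketch that is not part of the downloader; the part file name is hypothetical), Python's zipfile module behaves the same way: a lone part is usually rejected as broken, even though readable fragments remain in the raw bytes:

import zipfile

part_path = 'archive.zip.part0'   # hypothetical first chunk of an interrupted download

try:
    with zipfile.ZipFile(part_path) as zf:
        print(zf.namelist())      # only works if the central directory made it into this part
except zipfile.BadZipFile:
    # most tools give up here, even though some file names and content
    # can still be found in the raw bytes
    print('archive looks broken, but partial data may still be inside')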
Code walkthrough:
To build this program we need Python, asyncio, aiohttp and aiofiles. All of the code is asynchronous, which improves performance and keeps the memory and speed overhead to a minimum. The same thing could be done with threads or processes, but when downloading a large file errors may occur because a new thread or process cannot be created.
async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length
This function returns the length of the file. The request uses HEAD instead of GET, which means we receive only the headers, without the body (the content at the given URL).
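For example, with the function above already defined, a quick check looks like this (the URL is a placeholder; the server must send Content-Length):

import asyncio

size = asyncio.run(get_content_length('https://example.com/big-file.iso'))
print(size)  # total size in bytes, or None if the server did not send Content-Length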
def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size
This generator yields the byte ranges to download. It is worth picking a part_size that is a multiple of 1024 so the parts line up with whole megabytes, although almost any value will do; it does not behave well with part_size = 1, so I defaulted to 10 MB per part.
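To see what the generator actually yields, here is its output for a 25 MB file with the default 10 MB part size:

for start, end in parts_generator(25 * 1024 ** 2):
    print(start, end)

# 0 10485760
# 10485760 20971520
# 20971520 26214400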
async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            # open the part file asynchronously and let the context manager
            # flush and close it when the write is done
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())
One of the core functions is the file download itself, and it works asynchronously. Asynchronous files (aiofiles) are needed here so that writing to disk does not block other input/output operations.
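Taken on its own, a single ranged chunk could be fetched like this (a sketch; the URL, byte range and file name are placeholders):

import asyncio

# fetch only the first 10 MB of a hypothetical file;
# HTTP ranges are inclusive, hence the upper bound of 10485759
asyncio.run(download('https://example.com/big-file.iso',
                     {'Range': 'bytes=0-10485759'},
                     'big-file.iso.part0'))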
async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so the upper bound is reduced by one
        # to keep neighbouring parts from overlapping by a byte
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)
The main workhorse: it takes the file name from the URL, creates a temporary directory (prefixed with that name) in the current directory, and downloads every numbered .part file into it. await asyncio.gather(*tasks) runs all of the collected coroutines concurrently, which speeds up the download significantly. After that, the (synchronous) shutil.copyfileobj concatenates all of the parts into a single file.
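With all of the pieces above defined, downloading a single file comes down to one call (the URL is a placeholder):

import asyncio

# downloads example.iso in 10 MB pieces and reassembles it in the current directory
asyncio.run(process('https://example.com/example.iso'))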
async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])
The main function receives a list of URLs from the command line and, using the already familiar asyncio.gather, starts downloading several files at the same time.
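In other words, a run with two links on the command line (the script name and URLs below are made up) is equivalent to this sketch:

import asyncio

# same effect as: python downloader.py https://example.com/a.iso https://example.com/b.iso
async def run_all():
    urls = ['https://example.com/a.iso', 'https://example.com/b.iso']
    await asyncio.gather(*[process(url) for url in urls])

asyncio.run(run_all())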
Benchmark:
The benchmark was run by downloading a Gentoo Linux image from the site of a university I came across (a slow server):
Downloading the DietPi distribution (a fast server):
As you can see, the result is almost a 3x speed-up; on some files the speed-up reached 20-30 times.
Possible improvements:
In conclusion, I can say that asynchronous downloading is a way out, though unfortunately not a silver bullet when it comes to downloading files.
import asyncio
import os.path
import shutil
import aiofiles
import aiohttp
from tempfile import TemporaryDirectory
import sys
from urllib.parse import urlparse
async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length


def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size


async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())


async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so the upper bound is reduced by one
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)


async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])


if __name__ == '__main__':
    import time

    start_code = time.monotonic()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    print(f'{time.monotonic() - start_code} seconds!')