AWS Simple Storage Service (S3) is by far the most popular service on AWS. The simplicity and scalability of S3 made it a go-to platform not only for storing objects, but also for hosting them as static websites, serving ML models, providing backup functionality, and so much more. It became the simplest solution for event-driven processing of images, video, and audio files, and even matured to a de-facto replacement of Hadoop for big data processing. In this article, we'll look at various ways to leverage the power of S3 in Python. Some use cases may really surprise you!

Note: each code snippet below includes a link to a GitHub Gist, shown as: (Gist).

1. Reading objects without downloading them

Imagine that you want to read a CSV file into a Pandas dataframe without downloading it. Here is how you can read the object's body directly as a Pandas dataframe (Gist):

Similarly, if you want to upload and read small pieces of textual data such as quotes, tweets, or news articles, you can do that using the S3 resource method put(), as demonstrated in the example below (Gist).

2. Downloading files to a temporary directory

As an alternative to reading files directly, you could download all files that you need to process into a temporary directory. This can be useful when you have to extract a large number of small files from a specific S3 directory (ex. near real-time streaming data), concatenate all this data together, and then load it to a data warehouse or database in one go. Many analytical databases can process larger batches of data more efficiently than performing lots of tiny loads. Therefore, downloading and processing the files first, and then opening a single database connection for the Load part of ETL, can make the process more robust and efficient.

By using a temporary directory, you can be sure that no state is left behind if your script crashes in between (Gist).
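Since the linked Gists show the pattern, a minimal sketch of the direct-read approach might look like the following. The bucket and key names are made up, and the injectable `s3_client`/`s3_resource` parameters are our own addition (handy for testing with a stub client), not necessarily how the original Gist is written:

```python
import io

import pandas as pd


def read_csv_from_s3(bucket, key, s3_client=None):
    """Return the CSV object s3://bucket/key as a dataframe, no local file needed."""
    if s3_client is None:  # boto3 imported lazily so a stub client can be injected
        import boto3
        s3_client = boto3.client("s3")
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    # obj["Body"] is a streaming body: read it and hand the bytes to pandas
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


def put_text(bucket, key, text, s3_resource=None):
    """Store a small piece of text (a quote, a tweet) directly as an S3 object."""
    if s3_resource is None:
        import boto3
        s3_resource = boto3.resource("s3")
    s3_resource.Object(bucket, key).put(Body=text.encode("utf-8"))


if __name__ == "__main__":
    # hypothetical bucket and key
    df = read_csv_from_s3("annageller", "sales/customers.csv")
    print(df.head())
```

Passing the client in explicitly also makes the helpers easy to unit-test without touching AWS.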
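The temporary-directory pattern from section 2 can be sketched roughly as below. The helper takes any `download_file(key, local_path)` callable (for example the bound method of a boto3 Bucket), so the S3 interaction is an assumption kept at the edges; bucket name and prefix are placeholders:

```python
import tempfile
from pathlib import Path


def process_in_tempdir(keys, download_file, process):
    """Download every key into a throwaway directory, run `process` on the
    local paths, and guarantee cleanup even if anything raises."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = []
        for key in keys:
            target = Path(tmpdir) / Path(key).name
            download_file(key, str(target))
            paths.append(target)
        # process the whole batch while the directory still exists
        return process(paths)
    # leaving the `with` block deletes the directory and all downloaded files


if __name__ == "__main__":
    import boto3

    bucket = boto3.resource("s3").Bucket("my-bucket")  # hypothetical bucket
    keys = [o.key for o in bucket.objects.filter(Prefix="events/")]
    combined = process_in_tempdir(
        keys,
        bucket.download_file,
        lambda paths: b"".join(p.read_bytes() for p in paths),
    )
    # `combined` can now be loaded into the warehouse in a single batch
```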
3. Specifying content type when uploading files

Often when we upload files to S3, we don't think about the metadata behind that object. However, setting it explicitly has some advantages. Let's look at an example.

Starting from line 9, we first upload a CSV file without explicitly specifying the content type. When we then check how this object's metadata has been stored, we find out that it was labeled as binary/octet-stream. Typically, most files will be labeled correctly based on the file's extension, but issues like this may happen unless we specify the content type explicitly.

Starting from line 21, we do the same, but we explicitly pass text/csv as the content type. A HeadObject operation confirms that the metadata is now correct (Gist).

4. Retrieving only objects with a specific content-type

You may ask: what benefit do we get by explicitly specifying the content type in ExtraArgs? In the example below, we try to filter for all CSV files (Gist).

This will return a list of ObjectSummary objects that match this content-type:

Out[2]: [s3.ObjectSummary(bucket_name='annageller', key='sales/customers.csv')]

If we hadn't specified the content type explicitly, this file wouldn't have been found.

5. Hosting a static HTML report

S3 is not only good at storing objects but also at hosting them as static websites. First, we create an S3 bucket that can have publicly available objects.

Turning off the "Block all public access" feature (image by author)

Then, we generate an HTML page from any Pandas dataframe you want to share with others, and we upload this HTML file to S3. This way, we managed to build a simple tabular report that we can share with others (Gist).
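The upload-with-content-type step from section 3 can be sketched as follows; the file, bucket, and key names are placeholders, and the injectable `s3_client` parameter is our own testing convenience:

```python
def upload_with_content_type(filename, bucket, key, s3_client=None):
    """Upload a file with an explicit Content-Type and return the type
    S3 actually stored, as reported by a HeadObject call."""
    if s3_client is None:
        import boto3
        s3_client = boto3.client("s3")
    # without ExtraArgs, an unrecognized file may end up as binary/octet-stream
    s3_client.upload_file(filename, bucket, key,
                          ExtraArgs={"ContentType": "text/csv"})
    return s3_client.head_object(Bucket=bucket, Key=key)["ContentType"]


if __name__ == "__main__":
    # hypothetical names
    print(upload_with_content_type("customers.csv", "annageller",
                                   "sales/customers.csv"))
```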
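One way to implement the filtering from section 4 (a sketch under the assumption that checking `content_type` per object is acceptable; it issues one HeadObject request per key, so it suits buckets with a modest number of objects):

```python
def objects_with_content_type(bucket_name, content_type, s3_resource=None):
    """Return the ObjectSummary entries whose stored Content-Type matches."""
    if s3_resource is None:
        import boto3
        s3_resource = boto3.resource("s3")
    return [
        summary
        for summary in s3_resource.Bucket(bucket_name).objects.all()
        # summary.Object() triggers a HeadObject call that exposes the metadata
        if summary.Object().content_type == content_type
    ]


if __name__ == "__main__":
    print(objects_with_content_type("annageller", "text/csv"))  # hypothetical bucket
```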
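The report upload from section 5 might look roughly like this. The bucket/key are placeholders, the returned virtual-hosted-style URL only resolves publicly if the bucket allows public access, and the stub-friendly `s3_client` parameter is our own addition:

```python
import pandas as pd


def upload_html_report(df, bucket, key, s3_client=None):
    """Render a dataframe as an HTML table and store it as a text/html
    object, so browsers display it instead of downloading it."""
    if s3_client is None:
        import boto3
        s3_client = boto3.client("s3")
    html = df.to_html(index=False)
    s3_client.put_object(Bucket=bucket, Key=key,
                         Body=html.encode("utf-8"),
                         ContentType="text/html")
    return f"https://{bucket}.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    report = pd.DataFrame({"product": ["A", "B"], "units": [3, 5]})
    print(upload_html_report(report, "annageller", "sales_report.html"))
```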
If you did not configure your S3 bucket to allow public access, you will receive an S3UploadFailedError:

boto3.exceptions.S3UploadFailedError: Failed to upload sales_report.html to annageller/sales_report.html: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

To solve this problem, you can either enable public access for specific files on this bucket, or you can use presigned URLs as shown in the section below.

6. Generating presigned URLs for temporary access

When you generate a report, it may contain sensitive data. You may not want to allow everybody in the world to look at your business reports. To solve this issue, you can leverage an S3 feature called presigned URLs, which allows granting permissions to a specific S3 object by embedding a temporary credential token directly into the URL.

Here is the same example from above, but now using a private S3 bucket with "Block all public access" set to "On" (Gist) and a presigned URL (Gist).

The URL created by the script above should look similar to this:

Presigned URL (image by author)

7. Uploading large files with multipart upload

Uploading large files to S3 at once has a significant disadvantage: if the process fails close to the finish line, you need to start entirely from scratch. Additionally, the process is not parallelizable. AWS approached this problem by offering multipart uploads. This process breaks down large files into contiguous portions (parts). Each part can be uploaded in parallel using multiple threads, which can significantly speed up the process. Additionally, if the upload of any part fails due to network issues (packet loss), it can be retransmitted without affecting other parts.

To leverage multipart uploads in Python, boto3 provides the class TransferConfig in the module boto3.s3.transfer. The caveat is that you actually don't need to use it by hand. Any time you use the S3 client's method upload_file(), it automatically leverages multipart uploads for large files.
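The presigned-URL generation from section 6 can be sketched as below; bucket and key are placeholders, and the injectable `s3_client` is our own testing convenience:

```python
def presigned_report_url(bucket, key, expires_in=3600, s3_client=None):
    """Create a time-limited GET URL for a private object; anyone holding
    the URL can read that one object until it expires."""
    if s3_client is None:
        import boto3
        s3_client = boto3.client("s3")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


if __name__ == "__main__":
    # valid for one hour; bucket and key are hypothetical
    print(presigned_report_url("annageller", "sales_report.html"))
```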
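For reference, overriding the TransferConfig defaults might look roughly like this. The small kwargs-building helper is our own indirection (it keeps the logic testable), and the file, bucket, and key names are made up:

```python
def transfer_settings(threads=None, multipart_threshold_mb=8):
    """Build keyword arguments for boto3.s3.transfer.TransferConfig.
    threads=None keeps boto3's default concurrency; threads=0 disables
    multithreading entirely; any other value sets max_concurrency."""
    settings = {"multipart_threshold": multipart_threshold_mb * 1024 * 1024}
    if threads == 0:
        settings["use_threads"] = False
    elif threads:
        settings["max_concurrency"] = threads
    return settings


if __name__ == "__main__":
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")
    # e.g. 16 parallel threads, switch to multipart above 25 MB
    config = TransferConfig(**transfer_settings(threads=16,
                                                multipart_threshold_mb=25))
    s3.upload_file("yellow_tripdata.csv", "my-bucket",
                   "taxi/yellow_tripdata.csv", Config=config)
```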
But if you want to optimize your uploads, you can change the default parameters of TransferConfig to:

- set a custom number of threads,
- disable multithreading,
- specify a custom threshold from which boto3 should switch to multipart uploads.

Let's test the performance of several transfer configuration options. In the images below, you can see the time it took to upload a 128.3 MB file from the New York City Taxi dataset:

- using the default configuration,
- specifying a custom number of threads,
- disabling multipart uploads.

We can see from the image above that when using a relatively slow WiFi network, the default configuration provided the fastest upload result. In contrast, when using a faster network, parallelization across more threads turned out to be slightly faster. Regardless of your network speed, using the default configuration seems to be good enough for most use cases. Multipart upload did help speed up the operation, but adding more threads did not help. This might vary depending on the file size and stability of your network. But from the experiment above we can infer that it's best to just use s3.upload_file() without manually changing the transfer configuration. Boto3 takes care of that well enough under the hood. But if you really are looking into speeding up the S3 file transfer, have a look at the section below.

Note: the ProgressPercentage, passed as Callback, is a class taken directly from the boto3 docs. It allows us to see a progress bar during the upload.

8. Making use of S3 Transfer Acceleration

In the last section, we looked at using multipart uploads to improve performance. AWS provides another feature that can help us upload large files, called S3 Transfer Acceleration. It speeds up uploads (PUTs) and downloads (GETs) over long distances between applications or users sending data and the S3 bucket storing data.
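Enabling the feature and creating an accelerated client from boto3 can be sketched as follows; the bucket and file names are placeholders, and the injectable `s3_client` parameter is our own testing convenience:

```python
def enable_transfer_acceleration(bucket, s3_client=None):
    """Turn on Transfer Acceleration for the bucket (a one-time setting)."""
    if s3_client is None:
        import boto3
        s3_client = boto3.client("s3")
    s3_client.put_bucket_accelerate_configuration(
        Bucket=bucket,
        AccelerateConfiguration={"Status": "Enabled"},
    )


def accelerated_s3_client():
    """An S3 client that routes transfers through the accelerate endpoint."""
    import boto3
    from botocore.config import Config
    return boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))


if __name__ == "__main__":
    enable_transfer_acceleration("my-bucket")  # hypothetical bucket
    accelerated_s3_client().upload_file("big_file.csv", "my-bucket",
                                        "big_file.csv")
```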
Instead of sending data directly to the target location, we end up sending it to an edge location closer to us, and AWS then sends it in an optimized way from the edge location to the end destination.

Why is it an "optimized" way? Because AWS moves the data solely within the AWS network, i.e. from the edge location to the target destination in a specific AWS region.

**When can we gain significant benefits using S3 Transfer Acceleration?**

- when we are sending large objects, typically more than 1 GB,
- when we are sending data over long distances, ex. from the eu-central-1 region to us-east-1.

To enable this feature, go to "Properties" within your S3 bucket page and select "Enable":

Alternatively, you can enable this feature from Python (Gist):

To use this feature in boto3, we need to enable it on the S3 client object (Gist):

Now we can test the performance. First, let's test the same file from the previous section. We can barely see any improvement. When comparing the performance between purely doing a multipart upload and additionally turning on S3 Transfer Acceleration, we can see that the performance gains are tiny, regardless of the object size we examined. Below is the same experiment using a larger file (1.6 GB in size).

In our example, we were sending data from Berlin to the eu-central-1 region located in Frankfurt (Germany). The nearest edge location seems to be located in Hamburg. The distances are rather short. If we had to send the same 1.6 GB file to a US region, then Transfer Acceleration could provide a more noticeable advantage.

**TL;DR for optimizing upload and download performance using Boto3:**

- you can safely use the default configuration of s3_client.upload_file() in most use cases: it has sensible defaults for multipart uploads, and changing anything in the TransferConfig did not provide any significant advantage in our experiment,
- use S3 Transfer Acceleration only when sending large files over long distances, i.e.
cross-region data transfers of particularly large files.

Note: enabling S3 Transfer Acceleration can incur additional data transfer costs.

Thank you for reading!

Also published on: https://dashbird.io/blog/aws-s3-python-tricks/