Harnessing the power of OpenAI’s Whisper Large V2, an automatic speech recognition model, we’ve dramatically reduced audio transcription costs and time. Here’s a deep dive into our benchmark against the substantial English CommonVoice dataset and how we achieved a 99.1% cost reduction.

A Costly Comparison

Traditionally, utilizing a managed service like AWS Transcribe would set you back about $10,500 for transcribing the entirety of the English CommonVoice dataset. Using a custom model? That’s an even steeper $13,134. In contrast, our approach using Whisper on SaladCloud incurred just $117, achieving the same result. Measured against the $13,134 figure, that works out to the 99.1% reduction.

Behind The Scenes: Our Architecture

Our simple batch processing framework comprises:

Storage: Audio files stored in AWS S3.
Queue System: Jobs queued via AWS SQS, with unique identifiers and accessible URLs for each audio clip.
Transcription & Storage: Post transcription, results are stored in DynamoDB.
Worker Coordination: We integrated HTTP handlers using AWS Lambda for easy access by workers to the queue and table.

We wanted to keep the framework components fully managed and serverless, to provide as close an analogue as possible to using managed transcription services. The framework itself incurred a cost of $28 during transcription, mainly due to S3 costs associated with uploading and downloading millions of files. This amount does not include any costs from the node pool.

Discover our open-source code for a deeper dive:

Job Queue Service
Recording Service
Whisper Inference Server Docker Image
Whisper Benchmark Worker Docker Image

Deployment on SaladCloud

With our inference container and services ready, we leveraged SaladCloud’s Public API. We used the API to deploy 2 identical container groups with 100 replicas each, all using the modest RTX 3060 with only 12GB of VRAM. We filled the job queue with URLs to the 2.2 million audio clips included in the dataset, and hit start on our container groups.
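To make the worker side of this architecture concrete, here is a minimal sketch of the loop each replica might run: take a job from the queue, download the clip, transcribe it, and record the result. This is not the benchmark's actual worker code. The `Job` type, `run_worker`, and the four callables are hypothetical names chosen for illustration, and the in-memory stand-ins replace the real queue handlers, S3 download, Whisper inference server, and results table.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Job:
    job_id: str     # unique identifier assigned when the job was queued
    audio_url: str  # accessible URL for the audio clip


def run_worker(
    get_job: Callable[[], Optional[Job]],
    download: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
    record: Callable[[str, str], None],
) -> int:
    """Pull jobs until the queue is empty; return how many were processed."""
    processed = 0
    while True:
        job = get_job()
        if job is None:  # queue drained, nothing left to do
            return processed
        audio = download(job.audio_url)
        text = transcribe(audio)
        record(job.job_id, text)
        processed += 1


# In-memory stand-ins (assumptions) so the sketch runs without cloud services.
queue = deque([
    Job("job-1", "https://example.com/a.mp3"),
    Job("job-2", "https://example.com/b.mp3"),
])
results: Dict[str, str] = {}

processed = run_worker(
    get_job=lambda: queue.popleft() if queue else None,
    download=lambda url: url.encode(),                           # fake audio bytes
    transcribe=lambda audio: f"transcript of {audio.decode()}",  # fake model
    record=results.__setitem__,                                  # fake results table
)
```

Passing the four steps in as callables keeps the loop itself independent of any particular cloud service, so each piece can be swapped for a real HTTP call or model invocation.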
Our tasks were completed in a mere 15 hours, incurring $89 in costs from Salad and $28 in costs from our batch framework.

Performance Comparison of Whisper-Large-v2 Across Different Clouds

The result? An average transcription rate of one hour of audio every 16.47 seconds, translating to an impressive $0.00059 per audio minute. Notably, SaladCloud’s cost-performance ratio dramatically outshone major competitors, even when deploying custom models.

It’s worth noting that AWS Transcribe’s billing structure can greatly inflate costs for shorter audio clips (which make up most of the CommonVoice corpus), a setback not encountered on per-second billing platforms, and its cost-performance would likely improve somewhat when transcribing longer content. We tried to set up an apples-to-apples comparison by running our same batch inference architecture on AWS ECS… but we couldn’t get any GPUs. The GPU shortage strikes again.

Optimizing Further

While our benchmark results are already quite compelling, we’ve identified areas for potential performance enhancements:

Eager Fetching: Our current setup leaves GPUs momentarily idle while the next audio clip is fetched for transcription. With an ‘eager fetching’ mechanism, the worker downloads the subsequent clip while the current one is still being processed. By the time one clip is done, the next is ready for transcription, eliminating the waiting period.
Batch Processing: Another approach could involve batch downloading, where multiple clips are fetched simultaneously, reducing the frequency of download requests and better utilizing the GPU’s capabilities. This would further reduce any downtime associated with data retrieval.

By integrating these process improvements, we anticipate that overall transcription throughput could improve by 20-50% on this dataset.
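The eager-fetching idea can be sketched with a single background thread that downloads clip i+1 while clip i is being transcribed. This is an illustrative sketch rather than the benchmark's implementation. `transcribe_with_prefetch` and the `download` and `transcribe` callables are hypothetical names, and a real worker would pull URLs from the job queue instead of a fixed list.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def transcribe_with_prefetch(
    urls: Iterable[str],
    download: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
) -> List[Tuple[str, str]]:
    """Transcribe clips in order, fetching the next clip during transcription."""
    url_list = list(urls)
    results: List[Tuple[str, str]] = []
    if not url_list:
        return results

    # One background thread is enough: it only ever holds the next download.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download, url_list[0])
        for i, url in enumerate(url_list):
            audio = pending.result()  # blocks only if the download is not done yet
            if i + 1 < len(url_list):
                # Start fetching the next clip before transcribing this one.
                pending = pool.submit(download, url_list[i + 1])
            results.append((url, transcribe(audio)))
    return results


# Stand-in callables (assumptions) so the sketch runs without network or GPU.
demo = transcribe_with_prefetch(
    ["clip-1", "clip-2", "clip-3"],
    download=lambda url: url.encode(),
    transcribe=lambda audio: audio.decode().upper(),
)
```

Raising `max_workers` and submitting several downloads ahead would move this toward the batch-downloading variant, at the cost of holding more clips in memory at once.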
Higher throughput would not only reduce processing time but also lead to even more significant cost savings, maximizing the efficiency of this approach.

SaladCloud: The Most Affordable GPU Cloud for AI Audio Transcription

For startups and developers eyeing cost-effective, powerful GPU solutions, SaladCloud is a game changer. Boasting the market’s most competitive GPU prices, it offers a solution to sky-high cloud bills and limited GPU availability.

In an era where cost-efficiency and performance are paramount, leveraging the right tools and architecture can make all the difference. Our Whisper Large Inference Benchmark is a testament to the savings and efficiency achievable with innovative approaches. We invite developers and startups to explore our open-source resources and discover the potential for themselves.

Also published here.

Reducing AI Transcription Costs and Time With Salad
