We’ve all been there. You’re building a Rails app that has a large dataset, and you plan for the app to run analytics on the dataset. If you’re following the Test Driven Development (TDD) model, then you’re going to need to test your analytics methods. The problem is that your dataset is too large to load in your test suite. You could always copy the first 500 or so lines into a test fixture, but that tends to produce weird results. What you really need is a little more control over what that test dataset looks like. Let’s start with an example.

Design

I’m building an app to analyze data from a bike sharing network. I have a network of Stations that each have a name and a dock_count, and each Station has many Trips, either as the start_station or as the end_station. Trips have a duration, start_station_id, and end_station_id.

    Station attributes: name, dock_count
    Trip attributes: duration, start_station_id, end_station_id

I know that I want each station to have some trips associated with it, but how can we get that set up? First, let’s start with building some dummy stations in Ruby. I could create 5 stations doing this:

    Station.create(name: "station 1", dock_count: 5)
    Station.create(name: "station 2", dock_count: 6)
    Station.create(name: "station 3", dock_count: 7)
    Station.create(name: "station 4", dock_count: 8)
    Station.create(name: "station 5", dock_count: 9)

Or, I could build them with a times loop and save myself some repetition:

    5.times do |time|
      # time runs from 0 to 4, so add 1 to get "station 1" through "station 5"
      Station.create(name: "station #{time + 1}", dock_count: time + 5)
    end

Cool, now I have my stations covered. I will always have 5 stations and they will be consistently named (I’m assuming that the database is being cleaned so the IDs are always 1–5). Next, let’s move on to building out trips for these stations. We should probably go over what analytics I want to pull from these.
To keep this simple, I want to calculate:

    - average count of trips started at each station
    - average count of trips ended at each station
    - average duration of the trips
    - standard deviation of the duration of the trips

Ok, so how do we do it? I want to dictate what the average value is, then build out a dataset that reflects it. Let’s say that each station should average 5 trips. Average trips per station is simply:

    total_trips / total_stations = average_trips_per_station

And, therefore:

    total_trips = average_trips_per_station * total_stations

So if we have 5 stations and want to average 5 trips per station, we need to build 25 trips, and it doesn’t matter which stations those trips are assigned to. Building it out, we can use another loop:

    25.times do
      Trip.create(duration: 1, start_station_id: rand(1..5), end_station_id: rand(1..5))
    end

Why did I use rand(1..5) there? Because I know that I only have stations with IDs of 1–5 and I don’t really care which ones get assigned to each trip. This will work just fine, but do we really want our duration to always be 1? It is a bit boring, and I think we can do something a bit fancier here. What if we set up our trips so that the duration follows a normal distribution (bell curve)? After a little googling I found this handy formula, which builds an array of data with a specified length (desired_count), average (avg), and standard deviation (stdev):

    Array.new(desired_count) { avg + stdev * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand) }

How do we use this in our data generator? Simple: we need to build data for 25 trips, and let’s say our desired average duration is 60 with a standard deviation of 5. The code will look like this:

    trip_durations = Array.new(25) { 60 + 5 * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand) }

Plugging this into IRB and using the descriptive statistics gem, which gives me the #mean and #standard_deviation methods, I get a mean and standard deviation that are pretty close to the targets.
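If the gem isn’t at hand, the same check can be sketched in plain Ruby. The helper names mean and sample_stdev below are made up for this illustration (they are not the gem’s API), and dividing by n - 1 is an assumption about which flavor of standard deviation you want, so the numbers may differ slightly from what the gem reports.

```ruby
# Hypothetical helpers for illustration; not part of any gem.
def mean(values)
  values.sum / values.size.to_f
end

def sample_stdev(values)
  avg = mean(values)
  # Sum of squared deviations from the mean, divided by n - 1.
  variance = values.sum { |v| (v - avg)**2 } / (values.size - 1).to_f
  Math.sqrt(variance)
end

# Same formula as above, except that 1 - rand keeps the argument to
# Math.log strictly positive (rand can return exactly 0.0).
trip_durations = Array.new(25) do
  60 + 5 * Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
end

puts "mean:  #{mean(trip_durations).round(2)}"
puts "stdev: #{sample_stdev(trip_durations).round(2)}"
```

The printed values change on every run, which is exactly the wobble you’d expect from only 25 random samples.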
Since the dataset is built using the rand function, the results will be a bit off for a small dataset like this. I imagine that as the array gets larger, the mean and standard deviation get closer to what we want. Of course, this level of accuracy isn’t acceptable for something that runs randomly at the beginning of every test, so we would probably want to run the code in IRB. Anyway, we can implement it like this:

    5.times do |time|
      # time runs from 0 to 4, so add 1 to get "Station 1" through "Station 5"
      Station.create(name: "Station #{time + 1}", dock_count: time + 5)
    end

    trip_durations = Array.new(25) { 60 + 5 * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand) }

    25.times do
      Trip.create(duration: trip_durations.pop, start_station_id: rand(1..5), end_station_id: rand(1..5))
    end

That’s it! Now I’ve got a set of code that will build 5 stations and 25 trips. Plus, I know exactly what I’m getting (for the most part)!
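One way to shrink that "for the most part" is to seed the random number generator, so every test run builds identical data. Here is a minimal sketch in plain Ruby, with a few assumptions: build_durations is a hypothetical helper, the seed 42 is arbitrary, and the hashes stand in for the attributes you would pass to Trip.create.

```ruby
# Hypothetical helper: normally distributed durations from a seeded generator.
def build_durations(count, avg, stdev, rng)
  Array.new(count) do
    # 1 - rng.rand lies in (0, 1], so Math.log never receives 0.
    avg + stdev * Math.sqrt(-2 * Math.log(1 - rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
  end
end

station_count = 5
average_trips_per_station = 5
total_trips = average_trips_per_station * station_count

rng = Random.new(42) # arbitrary seed; any fixed integer works
durations = build_durations(total_trips, 60, 5, rng)

# Plain hashes standing in for the attributes passed to Trip.create.
trip_attributes = durations.map do |duration|
  {
    duration: duration,
    start_station_id: rng.rand(1..station_count),
    end_station_id: rng.rand(1..station_count)
  }
end

puts trip_attributes.first(3)
```

With a fixed seed the mean and standard deviation still won’t be exactly 60 and 5, but they will be the same numbers on every run, so a test can assert against them. The station IDs only line up if the stations really do have IDs 1–5, as assumed earlier.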

Build Dummy Data with Relationships
