Mapbox and Tippecanoe for big census data
Check out the finished map here!
Housing policy is something I deal with a lot, and so I spend a lot of time trying to make sense of housing data. While thinking about the relationship of the rental housing market with home ownership (typically represented across time), I started to wonder what that relationship looks like geographically.
Certainly there are parts of cities known for having lots of condos, or apartments, or single family homes; but I was curious what this looks like on the whole, and if larger structures could be discerned.
This inquiry turned into its own formidable technical challenge, and resulted in a pretty interesting data-set; read on to find out more, and how I built it!
The US Census makes owner/renter information readily available for census block geometries, but viewing it as a simple choropleth (colored polygons) leaves something to be desired. Often, one of the prime spatial consequences of rentership is increased density, since rented homes are often smaller or located inside multi-family buildings. But a choropleth map showing the relative incidence of owners to renters loses this important information. If we instead view the data at a per-person level, drawing one point per person, we can capture both dimensions of information at once.
This lets us compare structures of both density and ownership in very dense locations like Manhattan:
And locations that have a visible spread from urban to suburban, like Washington DC:
This makes it easy to start forming questions about why development took the shape (literally) that it did; what historical forces, policies, and timelines created particular shapes and conglomerations of one type or the other? Or, by the same token, what gave rise to those areas with no discernible structure at all?
I will admit I have no answers yet, but it’s an interesting jumping off point.
How I built it
The data used originally comes from the US Census Bureau's SF1 2010 Census. This is the most recent census with data available at the block level, and that fine grain of detail is necessary to produce more interesting visualizations. Using the US Census API and TIGER/Line geometry database, you can grab both census variables of interest, and the geometries they are associated with. In our case we took total population, population who rented their home, and population who owned their home.
The population number has to be converted to an equal number of point geometries that fall within the block geometry. There are several ways of doing this, and in my current role I happen to have access to far more compute time than development time (and it's far cheaper). This was also a low priority project, so I coded it up the cheap way (brute force), set it running on a Friday evening, and enjoyed the weekend in wine country. When we came back on Monday there was a nice data-set waiting. :)
For every block feature we fed the geometry into a function that first calculates the bounding box, then generates a random point within that bounding box. We test if the point is inside of the actual polygon or just in its bounding box using ray-casting.
If the point falls inside the polygon we save it, otherwise it gets discarded. We keep doing this until we have the same number of points inside the polygon as the number of people recorded for that block in the SF1 census, for each type we’re interested in (renters/owners), then move on to the next block.
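For the curious, the brute force loop described above can be sketched roughly like this. This is a minimal illustration rather than the script that actually ran: the function names are made up for this example, and it assumes each block is a single outer ring of (x, y) pairs with no holes and a non-zero area.

```python
import random


def point_in_polygon(x, y, ring):
    """Ray-casting test: cast a ray to the right of (x, y) and count edge crossings."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the ray's y value can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside  # odd number of crossings means inside
        j = i
    return inside


def random_points_in_polygon(ring, count, rng=random):
    """Rejection sampling: draw from the bounding box, keep only points inside the ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < count:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, ring):
            points.append((x, y))
    return points
```

Calling random_points_in_polygon once per block with the renter count and once with the owner count gives the two sets of points. The wasted work grows with how little of the bounding box the polygon fills, which is why long, thin, or diagonal blocks are the slow ones.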
For what it’s worth, if anyone is interested in doing this in a more reasonable amount of time, the following should be a faster solution: calculate a constrained Delaunay triangulation of the polygon, pick a triangle at random with probability proportional to its area (otherwise small triangles end up denser than large ones), and generate a random point inside that triangle using barycentric coordinates. This eliminates the rejected points from the brute force method and should provide much faster point generation, especially for heavily skewed geometries, at the expense of more complex code.
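Here is a rough sketch of the sampling half of that faster approach. I have not run this against the census data, and it deliberately skips the triangulation itself: it assumes you already have the polygon as a list of triangles (from whatever triangulation library you prefer), and all the names are hypothetical. Triangles are chosen in proportion to their area so the points stay uniform over the whole polygon.

```python
import random


def triangle_area(a, b, c):
    """Area of the triangle a, b, c from the 2D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def random_point_in_triangle(a, b, c, rng=random):
    """Uniform point in a triangle using barycentric weights."""
    r1, r2 = rng.random(), rng.random()
    # Reflect draws that land in the far half of the parallelogram back into the triangle.
    if r1 + r2 > 1.0:
        r1, r2 = 1.0 - r1, 1.0 - r2
    w = 1.0 - r1 - r2
    return (
        w * a[0] + r1 * b[0] + r2 * c[0],
        w * a[1] + r1 * b[1] + r2 * c[1],
    )


def random_points_in_triangulation(triangles, count, rng=random):
    """Pick triangles weighted by area, then sample one point in each pick."""
    weights = [triangle_area(*t) for t in triangles]
    chosen = rng.choices(triangles, weights=weights, k=count)
    return [random_point_in_triangle(*t, rng=rng) for t in chosen]
```

Every draw produces a usable point, so the cost per block depends only on the number of people, not on the shape of the block.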
The output of the above script ended up being nearly 40GB of GeoJSON files containing point features. The next question naturally was, “How on earth are we going to load this into anything?”
Fortunately, vector map tiling is a magical thing! Mapbox maintains a program for just such situations, called Tippecanoe. Tippecanoe takes in huge quantities of GeoJSON geometries and converts them to Mapbox Vector Tiles, a compact protobuf-encoded tile format, packaged together in a single SQLite database. This lets you serve your data as small digestible vector tiles, and helps ensure the texture and density of the data is preserved across all zoom levels.
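If you want to script the Tippecanoe step, something like the following should work. Treat it as a sketch: the flags here are a reasonable starting point for dense point data, not necessarily the ones used for this map, and the file names and the "people" layer name are placeholders. It assumes the tippecanoe binary is installed and on your PATH.

```python
import subprocess


def build_tippecanoe_command(geojson_paths, output_path, layer_name="people"):
    """Assemble a tippecanoe invocation as an argument list (no shell quoting needed)."""
    return [
        "tippecanoe",
        "-o", output_path,            # output .mbtiles file
        "-l", layer_name,             # put every input file into one named layer
        "-zg",                        # let tippecanoe guess a suitable maximum zoom
        "--drop-densest-as-needed",   # thin the densest areas when a tile gets too large
        *geojson_paths,
    ]


def run_tippecanoe(geojson_paths, output_path, layer_name="people"):
    """Run tippecanoe and raise if it exits with an error."""
    command = build_tippecanoe_command(geojson_paths, output_path, layer_name)
    subprocess.run(command, check=True)
```

For example, run_tippecanoe(["renters.geojson", "owners.geojson"], "tenure.mbtiles") would tile two (hypothetical) input files into one database.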
The resulting vector tile database, called a .mbtiles file, was around 2GB in size and uploaded to Mapbox’s tile servers easily.
All that was left was to apply some styling in Mapbox Studio. To help the points read well across zoom levels, both their diameter and their opacity are functions of zoom. That way, at low zooms, the small translucent points overlap, and the densest areas show up as brighter regions.
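I did the styling by clicking around in Studio, but the same idea can be written down as a Mapbox style layer. The sketch below builds one as a Python dictionary and prints it as JSON. The zoom stops, colors, layer id, and the "tenure" property on each point are all assumptions made up for illustration, not the values from the finished map.

```python
import json


def zoom_interpolation(stops):
    """Build a style expression that linearly interpolates a value across zoom levels."""
    expression = ["interpolate", ["linear"], ["zoom"]]
    for zoom, value in stops:
        expression.extend([zoom, value])
    return expression


def build_circle_layer(source_id, source_layer):
    """Circle layer whose point radius and opacity both grow with zoom."""
    return {
        "id": "people-points",
        "type": "circle",
        "source": source_id,
        "source-layer": source_layer,
        "paint": {
            # Tiny, faint points when zoomed out; larger, more solid points when zoomed in.
            "circle-radius": zoom_interpolation([(4, 0.5), (16, 3)]),
            "circle-opacity": zoom_interpolation([(4, 0.2), (16, 0.9)]),
            # Assumes each point feature carries a hypothetical "tenure" property.
            "circle-color": [
                "match", ["get", "tenure"],
                "renter", "#f28e2b",
                "owner", "#4e79a7",
                "#cccccc",
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_circle_layer("census", "people"), indent=2))
```

Because the opacity is low when zoomed out, stacked points add up visually, which is what produces the bright cores over dense neighborhoods.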
You can check out the finished product here! Showing demographic information at a person-by-person level can really change your perception of the data, and forces you to remember just why it’s important: we’re talking about real people here!
If you create visualizations like this that you find interesting, I’d love to see them. Shoot me a tweet!