How do you build a complex logistics system that collects a country-wide harvest in a developing country with little or no good data and uneven road conditions? In this case study from Mountain Hazelnuts Ventures (“MHV”, a for-purpose, for-profit environmental company in Bhutan; see the background info at the World Economic Forum blog), the goal is to collect the hazelnut harvest from thousands of farmers spread throughout the country, along roads that are currently poorly mapped.
This approach is equally applicable to other distribution or collection cases, such as optimizing vaccine delivery to health centers, delivering material to distributors, planning a training tour, collecting community information, operating a last-mile microloan network, …
This is the rather lengthy documentation of the full approach, from trees to trips. It starts with the strategy, then explains the support systems, and then the 3-step process (1-Estimate Harvest; 2-Cluster farmer harvest into collection points; and 3-Thread the collection points into detailed truck trips).
Collectively, it took roughly 1000 person-hours to build the system and train the team from scratch. It is actively updated and improved. It is also based on many open software pieces: it would have been impossible without them, and every extra piece available would have saved us still more time.
From trees to trips: MHV Hazelnut logistics
How do we estimate, and collect, the harvest from millions of trees across Bhutan into the processing plant?
Collecting the harvest is the main outcome, but the process and pipeline also build a reliable, adaptable, locally run Data Science team, and services able to support other questions.
Overview of the logistics system
To know the harvest collection needs, we need to know the harvest volume, timing and road access. To know these, we need good road maps (we use OSM) and a custom routing engine (OSRM, which runs on top of OSM data) to guide our loaded trucks (“DCM”s, a type of truck). That means we need JOSM to improve OSM, and QGIS to visualize the harvest and support decision making. To assist with the mapping we use the free tier of the commercial service Mapbox (which pulls OSM data), as well as DigitalGlobe (DG) satellite images and processed data from our corporate management tool (RMT), which has information such as farmer locations, GPS traces and tree phenology data. Then, to prioritize the tracing, compute logistics statistics and estimate the harvest, we need to do some data science. We use Python, running in Jupyter notebooks for documentation and clarity. To record and manage the knowledge we create, and to collaborate among the members of the team, we use Git, backed up on GitHub, where we also track progress and file issues.
Total software cost: $0.
A — People and Software
To ensure the best fit for our purpose, cope with connectivity challenges, and maximize other applications, we focus on in-house development and training while reducing the cost and time invested as much as possible. This means favoring “off-the-shelf” software tools, open-source software whenever possible, code that runs on low-spec computers, and minimal Internet dependence.
We ended up building a team of 3 local data scientists (1 person with no previous coding experience, and 2 from the IT team with PHP coding experience). The team built up the capacity to use Ubuntu, Git, GitHub, Python (in Jupyter notebooks), Docker and Node, and to set up a local server that runs some of the services and serves as a secondary local repository.
B — Maps
Maps, specifically road traces on a map, are a central part of logistics, for example to route the trucks to the farmers and back. Google Maps and ESRI are not viable options for several reasons, for example:
- License restrictions make off-line support or programmatic custom routing expensive or impossible.
- Their maps are severely incomplete, both quantitatively and qualitatively, and unconnected. More importantly, it is not possible to fix them, despite our having the data and the local surveying capacity to do so.
- Where routing is available, the speed assumed for the driving time to the farmers is unrealistic. For example, despite being classed as “highway”, many road segments are undergoing renovation, are unpaved or gravel, or suffer from frequent road blocks (due to weather or construction). In other words, 90 km/h for this “Highway” class is unrealistic.
Instead, we chose OpenStreetMap (OSM):
- It allows us to improve it, and we have a wealth of data to contribute: either from public sources, such as the supremely timed release of DigitalGlobe premium imagery on May 9, 2017, or from our corporate data, for example the traces of our support team visiting every farmer monthly for years.
- It allows us to leverage the OSM ecosystem of services, from Mapbox, whose routing uses the improved OSM data, to running our own routing service directly, fully offline and customized for the driving speed of loaded trucks.
- It allows us to programmatically query the data in many ways, for example to calculate farmer-to-road distance, or to do scenario planning if a road gets blocked (or paved).
According to several estimates [e.g. Mapbox], Bhutan's roads were less than 1/3 mapped, and also had plenty of “road islands” (traced roads that are not connected to other roads and so, routing-wise, isolated).
We were also encouraged by the strong support for OSM by the government, together with the World Bank and the OSM community, which had very positive recent pilot projects to train and map the capital.
Improving the map
Once improving the OSM map became a core anchor for delivering good logistics planning, visualizing the harvest needs and routing our trucks, we decided to invest time in training the team in road tracing, and in finding out how to leverage our corporate data for that purpose.
The company has hundreds of field support staff who monitor the health of the orchards, assist farmers with caring for the trees, and transport material. This team has been visiting each of the thousands of farmers roughly monthly for several years, carrying company phones through which they submit their reports. The phones also report their location automatically every hour. In other words, we have millions of GPS registrations of road points.
To leverage our internal GPS registrations of roads, we use Mapbox. For our use volume, we can use their free tier, so there is no cost. We used their service to create a custom map with the GPS traces, but also to test the OSRM approach, before deploying our own OSRM instance (so we could use a custom driving profile).
Custom Routing engine
We use OSRM, a routing engine that runs on OSM data. One of the benefits is that we can create a “driving profile” that sets the speeds of our trucks based on the road type. For example, we can set a maximum speed of 30 km/h on a highway (instead of the traditional 120 km/h of standard driving route engines), 15 km/h on unpaved roads, slower still on gravel, …
We run the OSRM service locally on each Data Scientist's laptop to maximize speed, but also on a server in the office that updates automatically from OSM, so anyone in the company can use it for planning or to account for roadblocks. As our team updates and improves the map in OSM, we pull the data down into OSRM, which gives us the most value while benefiting the greater community that uses OSM. Running the service takes little more than 3 `docker` commands, plus some UI customizations and `osmupdate` instructions to pull the latest changes automatically.
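To give a feel for what such a profile changes, here is a minimal arithmetic sketch in Python. It is not the OSRM profile itself (OSRM profiles are written in Lua); it only compares travel times. The highway and unpaved speeds are the ones quoted above, while the gravel value and the example route are assumptions made up for illustration.

```python
# Truck speeds in km/h per road class. Highway and unpaved come from the
# text; the gravel value is an assumed placeholder for "slower".
TRUCK_SPEED_KMH = {"highway": 30.0, "unpaved": 15.0, "gravel": 10.0}
# Standard engine: the text only gives the 120 km/h highway figure, so the
# other classes are (unrealistically) left at the same value here.
CAR_SPEED_KMH = {"highway": 120.0, "unpaved": 120.0, "gravel": 120.0}


def travel_hours(segments, speeds):
    """Sum driving time over (road_class, length_km) segments."""
    return sum(length_km / speeds[road_class] for road_class, length_km in segments)


# Hypothetical route: 60 km of highway followed by 15 km of unpaved road.
route = [("highway", 60.0), ("unpaved", 15.0)]
truck_h = travel_hours(route, TRUCK_SPEED_KMH)  # 60/30 + 15/15
car_h = travel_hours(route, CAR_SPEED_KMH)      # 75/120
```

On this made-up route the truck profile gives several times the driving time of a standard profile, which is why the weekly truck plans depend so much on it.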
Roads reaching the farmer
From the internal management tool (RMT) we know the location of the farmers, but the important factor is to know the roads that lead all the way from the farmer to the processing plant.
The team is using JOSM to improve OSM using satellite data, but we can also help prioritize by calculating the distance from each farmer to the closest known road. Farmers might not live close to a truck road, but we can rank these gaps and leverage local knowledge, satellite data and our GPS traces to figure out whether these distances to the road are real or the result of untraced segments.
With Python we can calculate the distance from each farmer to the closest road, since OSRM returns both the location of the requested point (the farmer) and the first point in the route. With these 2 points, we calculate the “geodetic distance” as an estimate of the surface distance.
Ranking the biggest distances gives a very good mapping priority, but we can also plot the histogram of “access from farmer to road”. When the map is accurate, it provides a good proxy for “Accessibility”, an important socioeconomic development metric.
As a byproduct we can also plot the difference between the geodetic (straight-line) distance from the farmer to the factory and the distance by road, which gives a sense of how much detour mountains, rivers, cliffs, … impose. This difference points to the importance of accurate road information: in roughly 50% of cases the road distance is more than 3 times the geodetic distance, and in 10% of cases it is more than 5 times longer. In that sense, distances in Bhutan are “triple what they look” on the map.
With QGIS we can visualize the result, basically layering an OSM background with dots for each farmer, aggregated into province totals (using “Join by location”) and styled by harvest totals and collection time.
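A minimal sketch of that calculation, under some assumptions: it uses the haversine (great-circle) formula on a spherical Earth as a stand-in for the geodetic distance, and it assumes the OSRM JSON response has already been fetched and decoded, reading the snapped point from the first waypoint's `location` field (given as longitude, latitude). The function names and the sample response are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def farmer_to_road_m(farmer_lonlat, osrm_response):
    """Distance from the requested point to the road point OSRM snapped it to."""
    snapped_lon, snapped_lat = osrm_response["waypoints"][0]["location"]
    return haversine_m(farmer_lonlat[0], farmer_lonlat[1], snapped_lon, snapped_lat)


# Hypothetical farmer location and a trimmed, made-up OSRM response.
sample_response = {"waypoints": [{"location": [89.64, 27.48]}]}
gap_m = farmer_to_road_m((89.64, 27.47), sample_response)
```

Sorting farmers by this gap, largest first, gives the tracing priority list described here.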
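The ratio behind those figures is simple to compute once both distances are known per farmer. The sketch below assumes two parallel lists of road and straight-line distances; the numbers are invented for illustration and are not MHV data.

```python
import statistics


def detour_ratios(road_km, straight_km):
    """Road distance divided by straight-line distance, per farmer."""
    return [r / s for r, s in zip(road_km, straight_km) if s > 0]


def share_above(ratios, threshold):
    """Fraction of farmers whose detour ratio exceeds the threshold."""
    return sum(1 for r in ratios if r > threshold) / len(ratios)


# Hypothetical farmer-to-factory distances in km.
road_km = [30.0, 45.0, 12.0, 80.0, 66.0]
straight_km = [10.0, 15.0, 6.0, 16.0, 20.0]

ratios = detour_ratios(road_km, straight_km)
median_ratio = statistics.median(ratios)
share_over_3 = share_above(ratios, 3.0)
```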
C — Logistic pipeline
Once the ground work and all support systems are in place, the Logistics Pipeline consists of 3 steps:
- Estimate Farmer Harvest
Get how many nuts each farmer will have, and when they will be ready. Our first model is extremely naive: the yield is based purely on tree height (taller trees give more nuts), and the timing on altitude (trees at higher elevation give nuts later). After the data-driven model creates the harvest estimates, there should be a manual correction based on local knowledge of each farmer.
- Cluster farmer harvest into collection points
We won’t stop at each of the thousands of farmers. Instead we create collection points. These collection points are placed closest to the farmers with the most nuts, and no farmer needs to travel more than 3 km by road to reach their collection point. The data-driven collection points should be adjusted based on local knowledge, the actual space available to stage the expected amount of nuts, …
- Thread visits to all collection points with trucks
Figure out the instructions for each driver to go collect all the nuts, and how many nuts, from whom, they should expect at each stop.
As described, each data-driven step should be followed by a manual adjustment to account for several factors, such as incomplete data, cultural or practical reasons, … For example, the collection point might not hold 2 tons of nuts, but 200 m down the road there might be a good place.
1 — Estimate Farmer Harvest
The first step is to know how many nuts, from where, and when they are ready.
We don’t yet have enough years of harvest to create a more complex estimator based on previous years and the many variables the company is collecting. For now, we assign a conservative, and naive, yield based purely on the average height of the trees in an orchard, multiplied by 80% of the number of trees reported as healthy and growing. The timing is a linear delay with altitude (a first rough estimate informed by hazelnut experts).
The nuts, and the data, are available at the orchard level, and each farmer might have several orchards. Since the first stage of the collection is for the farmers to dry the nuts at their house, we need to aggregate them all at the farmer location. Logistics is at the farmer level, not the orchard level.
2 — Collection points
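The estimate and the aggregation can be sketched as follows. Only the 80% factor comes from the text; the yield coefficient, the reference week, the reference altitude and the delay slope are placeholder assumptions, not MHV's values, and so are the field names. Taking the latest orchard week as the farmer's ready week is also an assumption.

```python
from collections import defaultdict

HEALTHY_FRACTION = 0.8       # from the text: count 80% of the healthy trees
KG_PER_TREE_PER_M = 0.5      # assumed placeholder yield per metre of height
BASE_WEEK = 34               # assumed harvest week at the reference altitude
BASE_ALTITUDE_M = 1000.0     # assumed reference altitude
WEEKS_PER_1000_M = 2.0       # assumed linear delay with altitude


def orchard_estimate(avg_height_m, healthy_trees, altitude_m):
    """Return (kg, week) for one orchard using the naive model."""
    kg = KG_PER_TREE_PER_M * avg_height_m * HEALTHY_FRACTION * healthy_trees
    week = BASE_WEEK + WEEKS_PER_1000_M * (altitude_m - BASE_ALTITUDE_M) / 1000.0
    return kg, round(week)


def farmer_estimates(orchards):
    """Aggregate orchard rows into {farmer_id: (total_kg, latest_week)}."""
    totals = defaultdict(lambda: [0.0, 0])
    for o in orchards:
        kg, week = orchard_estimate(o["avg_height_m"], o["healthy_trees"], o["altitude_m"])
        totals[o["farmer_id"]][0] += kg
        totals[o["farmer_id"]][1] = max(totals[o["farmer_id"]][1], week)
    return {fid: (kg, week) for fid, (kg, week) in totals.items()}


# Hypothetical orchard rows.
orchards = [
    {"farmer_id": "F1", "avg_height_m": 2.0, "healthy_trees": 100, "altitude_m": 1000.0},
    {"farmer_id": "F1", "avg_height_m": 1.0, "healthy_trees": 50, "altitude_m": 2000.0},
    {"farmer_id": "F2", "avg_height_m": 3.0, "healthy_trees": 10, "altitude_m": 1500.0},
]
per_farmer = farmer_estimates(orchards)
```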
We cannot stop to collect the harvest for each of the thousands of farmers, especially when many are clustered close to each other. It is much more efficient to create collection points so the trucks can minimize stops. Each collection point should ensure that no farmer has to walk long distances with the harvest, as well as time the collection window according to when it will be ready.
Since we now have the harvest estimate, and the roads to the farmers, the first pass at the creation of the collection points is programmatically simple: sort the farmers by decreasing harvest and, going down the list, bring to each farmer's location all other farmers within 3 km by road.
We will only need to stop once in the harvest season. Since the timing is mostly dependent on altitude, there won’t be too much spread among the farmers in a collection point. The low altitude farmer can wait a week or two for the higher altitude farmer to be ready. We keep track of the timing per farmer, and the spread of timing, in case we need to manually split a point into 2 collection points in different weeks.
The code to cluster farmers into collection points, based on a proximity threshold and giving priority by harvest amount, comes down to a few lines of Python. The first part calculates the distance from each farmer to all others; the second part runs down the farmers in decreasing order of expected harvest and adds all others within 3 km by road.
Note that no road can be shorter than the geodetic distance, so it is very efficient to skip farmers that are further than 3 km in a straight line (the “fast_range” function).
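A minimal sketch of this greedy clustering is shown below. It is a reconstruction from the description, not the original notebook code: distances are computed lazily through two caller-supplied functions rather than precomputed, and everything except the `fast_range` name and the 3 km threshold is hypothetical, including the toy data.

```python
def cluster_farmers(harvest_kg, straight_km, road_km, max_km=3.0):
    """Greedy clustering: the biggest unassigned harvest becomes a collection point."""

    def fast_range(a, b):
        # A road is never shorter than the straight line, so anyone beyond
        # max_km in a straight line is skipped without a routing call.
        return straight_km(a, b) <= max_km

    unassigned = set(harvest_kg)
    clusters = {}
    for centre in sorted(harvest_kg, key=harvest_kg.get, reverse=True):
        if centre not in unassigned:
            continue  # already attached to a bigger farmer's point
        unassigned.discard(centre)
        members = [centre]
        for other in sorted(unassigned):
            if fast_range(centre, other) and road_km(other, centre) <= max_km:
                members.append(other)
        unassigned.difference_update(members)
        clusters[centre] = members
    return clusters


# Toy example with 4 hypothetical farmers. B is close to A in a straight
# line, but a river makes the road between them 8 km long.
harvest = {"A": 500, "B": 300, "C": 200, "D": 100}
straight = {"AB": 1.0, "AC": 2.0, "AD": 10.0, "BC": 2.5, "BD": 9.0, "CD": 8.0}
road = {"AB": 8.0, "AC": 2.5, "AD": 14.0, "BC": 9.0, "BD": 12.0, "CD": 11.0}
road_calls = []  # records how many routing lookups were actually needed


def straight_km(a, b):
    return straight["".join(sorted((a, b)))]


def road_km(a, b):
    road_calls.append((a, b))
    return road["".join(sorted((a, b)))]


clusters = cluster_farmers(harvest, straight_km, road_km)
```

In the toy data only 2 of the 6 possible pairs reach the (expensive) road lookup; the rest are either filtered by the straight-line check or never compared because the farmer is already assigned.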
We also tried clustering based on other properties, but they proved worse than the method above:
- Geodetic distance. Very fast, but doesn’t account for a river, or a mountain in the middle. Very common in Bhutan.
- Admin boundary. It might make sense, to account for landscape, but the boundaries create very artificial limits for otherwise closer farmers.
For each cluster we keep track of who belongs to it, so there is an easy, and important, step of mapping these clusters and double-checking them with the field staff, adjusting the actual pick-up location, the members of the cluster, or other important information we might have missed.
3 — Collection points to truck trips
Once we have the collection points, we need to dispatch trucks from the factory, trying to minimize the number of trucks and the time they drive.
Our approach is to split the harvest by week, as if every week is its own harvest, and it doesn’t really matter when the collection happens within the week.
For each truck, we assume 2 drivers working 8h/day, and add 20% to the time of each trip they make. We use the customized routing engine, which has specific truck speeds based on road type and road surface, and restricts access to certain road types (don’t use foot tracks, for example). We also assume a loading and unloading speed of 1 ton/hour.
As we calculate each stage of the truck fleet stops, we print out an instruction sheet with the driving directions, the collection stops with the harvest to collect and from whom, a tally of each truck's cumulative time and load, …
For each week, we send the first truck to the furthest point (we have to go there anyway, and starting there, we can reuse the return trip). We add the travel time to the truck tally, as well as the time to collect the harvest at that stop.
We also check if the collection stop has more harvest than fits on a truck (e.g. 6 tons). If so, we calculate how many full truck loads we need. For each one we send an empty truck, account for the round trip and the load/unload cycle, and do the same for the next full truck load needed. These trips to faraway collection points are the main reason the truck fleet grows with increasing harvest.
Once collected, we check the next closest collection point, and see if its harvest fits on the truck. If so, we send the truck over. If the truck is full, we send it to the factory, account for the unload time, and send it back again to the furthest point.
Choosing the closest point among hundreds of stops is very time consuming. Unlike in the collection points step, we need to check all pairs, so we cannot use the geodetic trick to make it efficient. Instead, we use the “table” service of OSRM. With it, we can send up to 100 locations and OSRM returns a matrix with the distances of all pairs. It is a little tricky to batch these requests into a lookup table, but it makes this step roughly 1000 times faster.
Before going to the next collection stop, we also check if the truck has used up all its available time for the week (2 drivers * 8 h/day * 7 days/week = 112 h/week). If so, instead of sending it over, we send it to the factory, as this means that the week is over. If it runs out of time for that week on the way back, we add the excess to the next week as an overflow (the truck is still coming back).
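These assumptions translate into a small amount of arithmetic per stop. The sketch below assumes the 20% margin applies to driving time only, not to loading; the function names are made up for illustration.

```python
TRIP_TIME_MARGIN = 0.20        # from the text: add 20% to each trip
LOADING_KG_PER_HOUR = 1000.0   # from the text: 1 ton/hour


def trip_hours(driving_hours):
    """Routing-engine driving time plus the safety margin."""
    return driving_hours * (1 + TRIP_TIME_MARGIN)


def handling_hours(kg):
    """Time to load (or unload) the given weight."""
    return kg / LOADING_KG_PER_HOUR


def stop_hours(driving_hours, kg):
    """Hours added to a truck's tally for driving to a stop and loading there."""
    return trip_hours(driving_hours) + handling_hours(kg)


# Hypothetical stop: 5 h of driving to collect 1500 kg.
example_hours = stop_hours(5.0, 1500.0)
```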
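The full-load split is an integer division. A minimal sketch, assuming the 6-ton capacity and the 1 ton/hour handling speed mentioned in the text, and assuming the one-way driving time passed in already includes any margin:

```python
def full_loads(stop_kg, max_load_kg=6000):
    """Return (number of full truck loads, kg remaining for a shared truck)."""
    return divmod(stop_kg, max_load_kg)


def full_load_cycle_hours(one_way_hours, max_load_kg=6000, kg_per_hour=1000.0):
    """Empty run out, load, loaded run back, unload at the factory."""
    return 2 * one_way_hours + 2 * max_load_kg / kg_per_hour


# Hypothetical stop with 14.5 tons, 4 h from the factory.
loads, remainder_kg = full_loads(14500)
cycle_h = full_load_cycle_hours(4.0)
```

With a 12-hour load/unload cycle on top of the driving, each dedicated full-load trip consumes a sizeable share of a truck's week.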
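One way to do that batching is sketched below, as an assumption-laden reconstruction rather than the original code. It splits the stops into blocks of 50 so that each request carries at most 100 coordinates, asks OSRM's table service for one source block against one destination block at a time, and stores the returned `durations` matrix (in seconds) in a dictionary keyed by stop indices. The HTTP call is left to a caller-supplied `fetch_json` function, and `fake_fetch` is a stand-in so the example runs without a server; the base URL and the `driving` profile name are also assumptions.

```python
from itertools import product

MAX_LOCATIONS = 100  # per-request limit quoted in the text


def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


def table_url(base_url, sources, destinations):
    """Build an OSRM table request for two blocks of (lon, lat) points."""
    coords = ";".join(f"{lon},{lat}" for lon, lat in sources + destinations)
    src_idx = ";".join(str(i) for i in range(len(sources)))
    dst_idx = ";".join(str(len(sources) + j) for j in range(len(destinations)))
    return (f"{base_url}/table/v1/driving/{coords}"
            f"?sources={src_idx}&destinations={dst_idx}")


def build_lookup(points, fetch_json, base_url="http://localhost:5000"):
    """Fill {(i, j): seconds} for all pairs, one block pair per request."""
    half = MAX_LOCATIONS // 2  # two blocks share one request
    blocks = chunks(list(range(len(points))), half)
    lookup = {}
    for src_block, dst_block in product(blocks, repeat=2):
        url = table_url(base_url, [points[i] for i in src_block],
                        [points[j] for j in dst_block])
        durations = fetch_json(url)["durations"]
        for row, i in enumerate(src_block):
            for col, j in enumerate(dst_block):
                lookup[(i, j)] = durations[row][col]
    return lookup


requests_made = []


def fake_fetch(url):
    # Stand-in for an HTTP call: returns a constant matrix of the right shape.
    requests_made.append(url)
    params = dict(p.split("=") for p in url.split("?", 1)[1].split("&"))
    n_src = len(params["sources"].split(";"))
    n_dst = len(params["destinations"].split(";"))
    return {"durations": [[1.0] * n_dst for _ in range(n_src)]}


# 120 hypothetical stops -> blocks of 50, 50 and 20 -> 9 requests.
points = [(89.0 + i * 0.001, 27.0) for i in range(120)]
lookup = build_lookup(points, fake_fetch)
```

After this, finding the closest remaining stop is a dictionary lookup instead of a routing request.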
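The checks a truck goes through before each move can be condensed into one decision function. This is a simplified sketch of the rules described above, with hypothetical names and return values; the real pipeline also updates the tallies and prints the instruction sheet.

```python
DRIVERS_PER_TRUCK = 2
HOURS_PER_DRIVER_PER_DAY = 8
DAYS_PER_WEEK = 7
WEEKLY_HOURS = DRIVERS_PER_TRUCK * HOURS_PER_DRIVER_PER_DAY * DAYS_PER_WEEK


def next_move(used_hours, load_kg, stop_kg, max_load_kg=6000):
    """Decide what a truck does before heading to the next closest stop."""
    if used_hours >= WEEKLY_HOURS:
        return "return_week_over"   # no time left this week
    if load_kg + stop_kg > max_load_kg:
        return "return_to_unload"   # next stop does not fit on the truck
    return "go_collect"


def overflow_hours(used_hours, hours_to_factory):
    """Hours past the weekly budget, carried into next week's tally."""
    return max(0.0, used_hours + hours_to_factory - WEEKLY_HOURS)
```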
This is an example of the Harvest Collection Sheet:
# Harvest Collection instructions
Generated on 2017-06-23 09:41:59.193354
* DCM max load: 6000
* Drivers per DCM: 2
* Max driving hours per driver per day: 8
* Max weekly hours per DCM (with above settings): 112
* Loading speed: 1000 kg per hour
71.0% of collection points (***) need full truck loads.
*** stops to make, for a total of *** tons of nuts
## Week: ***
** pickups this week to do. Total harvest *** Kg.
--- Status before pick up ---
Trucks kg. loads: 
Trucks hr. usage: 
Next pick up confirmed. Going there:
Directions to next collection point:
*** Kg, *** h away, ***h to load
9 pickups left to do. Harvest left *** Kg.
Closest stop is ***…
Below is a more complex stage, with a sizeable fleet, a next stop that needs several full truck loads, and a truck about to finish its available time for the week:
Trucks kg. loads: [*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*]
Trucks hr. usage: [*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*]
-> This place needs 2 full DCMs, and **** remains to pick up
-> Sending free truck 37 of fleet.
-> Sending free truck 38 of fleet.
Truck won’t hold next stop. Don’t go collect, back to factory, ** away.
At the factory, truck aims again for further stop: ********
Next pick up confirmed. Going there:
Directions to next collection point:
****** Kg, ****** h away, ****** h to load
****** pickups left to do. Harvest left ****** Kg.
We look forward to your comments and feedback. Thank you!
MHV Data Science team