Authors:
(1) Yujie Hu, Department of Geography, University of Florida, Gainesville, FL 32611 and UF Informatics Institute, University of Florida, Gainesville, FL 32611;
(2) Changzhen Wang, Department of Geography & Anthropology, Louisiana State University, Baton Rouge, LA 70803;
(3) Ruiyang Li, Children’s Environmental Health Initiative, Rice University, Houston, TX 77005;
(4) Fahui Wang, Department of Geography & Anthropology, Louisiana State University, Baton Rouge, LA 70803.
Estimating a massive drive time matrix between locations is a practical but challenging task. The challenges include the availability of reliable road network (including traffic) data, programming expertise, and access to high-performance computing resources. This research proposes a method for estimating a nationwide drive time matrix between ZIP code areas in the U.S., a geographic unit at which many national datasets such as health information are compiled and distributed. The method (1) does not rely on intensive efforts in data preparation or access to advanced computing resources, (2) uses algorithms of varying complexity and computational time to estimate drive times of different trip lengths, and (3) accounts for both interzonal and intrazonal drive times. The core design samples ZIP code pairs at intensities that vary with trip length and derives their drive times via the Google Maps API; these Google times are then used to adjust and improve primitive drive time estimates that are cheap to compute. The result provides a valuable resource for researchers.
Estimating a drive time matrix between locations is a critical task in spatial analysis, commonly encountered by researchers in geography, urban planning, transportation engineering, business management, and operations research. To name a few, analytical models such as spatial interaction modeling (Simini et al., 2012), travel demand estimation (McFadden, 1974), location-allocation problems (ReVelle and Swain, 1970; Hu et al., 2019), spatial accessibility measures (Luo and Wang, 2003; Dony et al., 2015; Balomenos et al., 2019; Zhu et al., 2020), and delineation of health care market areas (Wang et al., 2020) all rely on reliable drive time estimates from a set of origin locations to a set of destination locations. For a small matrix, this task has become routine in many GIS and transportation packages, such as ESRI® ArcGIS and Caliper® TransCAD. For a large matrix, however, it remains challenging.
Most studies in this scope focus on small areas like individual cities or counties. Estimating the time matrix at larger geographic scales, such as the national scale, is increasingly important to researchers and policy makers. Take U.S. health care as an example: a national travel time matrix is the most critical component needed to study the average geographic access to health care (Onega et al., 2008; Boscoe et al., 2012), measure the variations in geographic access across regions (Onega et al., 2017), identify the areas where geographic access to health care is significantly lower than average, suggest the most appropriate sites for new care facilities, and facilitate the implementation of other strategies for reducing health care disparities such as remote health care, health care on wheels, and transit to care.
The challenges in calibrating a large drive time matrix include the availability of reliable road network (including traffic) data, programming expertise, and computational power. Many studies assume free-flow conditions on roads to eliminate the data requirement on traffic, and are often limited to estimating drive times from areas (e.g., ZIP code area, census tract) to the nearest locations (Onega et al., 2008; 2017; Boscoe et al., 2012; Ikram et al., 2015) or between locations within a short range (Shi et al., 2012), which significantly reduces the number of OD (origin-destination) pairs. Most recently, Saxon and Snow (2019) estimated a drive time matrix from each census tract to each primary care location within 62 miles (100 km) in the U.S. by tapping into advanced computing resources such as distributed computing and sophisticated algorithms, which may not be accessible to most researchers. They did not consider traffic conditions or node impedance, and the result tended to underestimate drive times, especially for short trips.
One way to account for traffic effects in drive time estimation is to use traffic sensors or auxiliary data sources. Some sensors are installed at fixed locations along the road and can accurately capture travel speeds and times. These include loop detectors (Kwon et al., 2000; Coifman, 2002) and automatic vehicle identification systems, such as toll collection systems (El Faouzi et al., 2009), license plate recognition systems (van Hinsbergen et al., 2009), and Bluetooth-based systems (Bhaskar and Chung, 2013). Other sensors are more flexible, such as floating or probe vehicles, which are a sample of vehicles equipped with GPS units running with traffic (De Fabritiis et al., 2008). Based on information collected on a vehicle’s location, direction, and speed at short time intervals, drive times between any two locations in a network can be readily obtained (Semanjski, 2015). A few recent studies attempted to estimate drive times using big data. Toole et al. (2015) used call detail records (CDRs) to obtain drive times for all road segments in five selected cities worldwide. Woodard et al. (2017) derived drive times on all roadways in the Seattle metropolitan region based on collected mobile phone GPS data. Although these methods can provide highly accurate estimates of actual drive times in traffic, their reliance on installed physical equipment or big crowdsourced data restricts their use to small areas, ranging from major corridors to metro regions.
An alternative approach is to use third-party web mapping services such as Google Maps and MapQuest. For example, the Google Maps Distance Matrix API uses Google's data, such as its road network and collected traffic information, to estimate drive times between a set of origins and a set of destinations. Similarly, MapQuest’s Route Matrix API uses open-source mapping data from the OpenStreetMap project to achieve this goal. A further benefit of these commercial web services is that they relieve analysts of the burden of preparing street network data and accessing GIS/transportation software (Boscoe et al., 2012). However, the free usage of these services comes with request limits. For instance, Google Maps offers free usage up to 40,000 OD records per month (Hu and Downs, 2019). A similar restriction applies to MapQuest. As a result, researchers usually use this approach to derive drive times for a limited number of OD pairs (Wang and Xu, 2011).
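To make the request limit concrete, the following minimal Python sketch builds a request URL for the Google Maps Distance Matrix API and computes how many months of free quota a full OD matrix would consume at the 40,000 records per month cited above. The helper names `build_request_url` and `months_needed`, the example coordinates, the placeholder key, and the 1,000-zone scenario are illustrative assumptions, not part of the authors' program. The sketch only constructs the URL; it sends no request.

```python
import math
from urllib.parse import urlencode

# Public JSON endpoint of the Google Maps Distance Matrix API.
ENDPOINT = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_request_url(origins, destinations, api_key):
    """Build a request URL from lists of (lat, lng) tuples.

    Hypothetical helper: it only assembles the query string and
    does not contact the service.
    """
    params = {
        "origins": "|".join(f"{lat},{lng}" for lat, lng in origins),
        "destinations": "|".join(f"{lat},{lng}" for lat, lng in destinations),
        "mode": "driving",
        "key": api_key,
    }
    return f"{ENDPOINT}?{urlencode(params)}"


def months_needed(n_pairs, free_records_per_month=40_000):
    """Months of free quota needed to cover n_pairs OD records."""
    return math.ceil(n_pairs / free_records_per_month)


# Illustrative coordinates (roughly Gainesville, FL and Baton Rouge, LA).
points = [(29.65, -82.32), (30.45, -91.19)]
url = build_request_url(points, points, api_key="YOUR_API_KEY")

# Hypothetical scenario: a full matrix among 1,000 zones has 1,000,000 pairs.
print(months_needed(1_000 * 1_000))  # 25
```

Because the number of OD pairs grows with the square of the number of zones, even a modest study area exhausts the free quota quickly, which is why a full matrix cannot simply be requested pair by pair.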
Another issue related to drive time estimation between areas is the so-called aggregation error (Hu and Wang, 2016). The centroid-to-centroid approach assumes that all people in an area live at its centroid (Hewko et al., 2002), and inevitably overlooks intrazonal travel (Kordi et al., 2012; Bhatta and Larsen, 2011). For example, the average commuting time within a traffic analysis zone (TAZ) is 11.3 minutes for auto drivers in Cleveland, Ohio (Wang, 2003). Given the average area of 82.25 square miles for the ZIP code areas in the U.S., intrazonal drive times at the ZIP code area level can be significant, especially in low-density suburban or rural areas. Omitting them accounts for a high percentage of the error for short-range trips. A common approach approximates intrazonal travel distance as the radius of an area-equivalent circle (Frost et al., 1998; Horner and Murray, 2002; Hu and Wang, 2015). Some recent studies use Monte Carlo simulation techniques to improve the estimation of trip lengths between area units (Hu and Wang, 2016; Hu et al., 2017), and offer a viable solution to intrazonal drive time estimation.
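The two intrazonal approximations mentioned above can be illustrated with a short Python sketch. The first function computes the radius of the area-equivalent circle, r = sqrt(A / pi), which for the average ZIP code area of 82.25 square miles is about 5.12 miles. The second is a simplified Monte Carlo estimate that assumes the zone is a perfect circle and measures straight-line distances between random point pairs. The cited studies work with actual zone geometries, so this is a toy version under those stated assumptions, and the function names are hypothetical.

```python
import math
import random


def equivalent_circle_radius(area):
    """Radius of a circle with the same area as the zone."""
    return math.sqrt(area / math.pi)


def mean_intrazonal_distance(radius, n_pairs=100_000, seed=0):
    """Monte Carlo mean straight-line distance between two random
    points in a circular zone (assumption: the zone is a disk)."""
    rng = random.Random(seed)

    def random_point():
        # Taking the square root makes points uniform over the disk's area.
        r = radius * math.sqrt(rng.random())
        theta = 2 * math.pi * rng.random()
        return r * math.cos(theta), r * math.sin(theta)

    total = 0.0
    for _ in range(n_pairs):
        (x1, y1), (x2, y2) = random_point(), random_point()
        total += math.hypot(x2 - x1, y2 - y1)
    return total / n_pairs


radius = equivalent_circle_radius(82.25)  # average U.S. ZIP code area, sq mi
print(f"equivalent radius: {radius:.2f} miles")
print(f"simulated mean distance: {mean_intrazonal_distance(radius):.2f} miles")
```

For a disk, the simulated mean should approach the known analytical value of 128r / (45 pi), roughly 0.905 times the radius. Both figures are distances, not drive times; converting them requires an assumed travel speed or a network-based estimate.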
This study seeks to estimate a very large drive time matrix between ZIP code areas in the U.S. The ZIP code area is a popular geographic unit used in many nationwide datasets. For instance, it is often the finest geographic scale at which health information is compiled and distributed in the U.S. (Berke and Shi, 2009). Our method (1) does not rely on intensive efforts in data preparation or access to advanced computing resources, (2) uses algorithms of varying complexity and computational time to estimate drive times of different trip lengths, and (3) accounts for both interzonal and intrazonal drive times. Both the program and the results will be available for free download, providing a valuable resource for researchers.
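The general idea of sampling OD pairs at different intensities by trip length, and then using reference times for the sampled pairs to adjust cheaper estimates, can be sketched in Python as follows. This is not the authors' implementation: the distance bands, the sampling rates, and the choice of a simple linear least-squares adjustment are all assumptions made for illustration, and the reference times stand in for whatever a web mapping service would return.

```python
import random

# Hypothetical (lower, upper, rate) distance bands in miles; the cutoffs
# and sampling rates are illustrative assumptions, not the paper's values.
BANDS = [(0, 50, 0.5), (50, 300, 0.1), (300, float("inf"), 0.01)]


def sample_pairs(pairs, bands=BANDS, seed=0):
    """Sample OD pairs band by band.

    pairs: list of (origin, destination, distance) tuples.
    Each band contributes round(rate * band size) randomly chosen pairs.
    """
    rng = random.Random(seed)
    sampled = []
    for lower, upper, rate in bands:
        in_band = [p for p in pairs if lower <= p[2] < upper]
        k = round(rate * len(in_band))
        sampled.extend(rng.sample(in_band, k))
    return sampled


def fit_adjustment(primitive, reference):
    """Ordinary least squares fit of reference = a + b * primitive.

    Assumption: a linear relation is adequate; returns (a, b).
    """
    n = len(primitive)
    mean_x = sum(primitive) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in primitive)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(primitive, reference))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Toy data: primitive estimates (minutes) and reference times for a sample.
a, b = fit_adjustment([10, 20, 30], [15, 27, 39])
adjusted = [a + b * t for t in [12, 25]]  # apply to unsampled pairs
```

Only the sampled pairs need a reference time from an external service; every other pair receives an adjusted value computed locally, which keeps the number of paid or rate-limited requests small.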
This paper is available on arXiv under CC BY 4.0 DEED license.
[1] This is a preprint of: Hu, Y., Wang, C., Li, R., & Wang, F. (2020). Estimating a large drive time matrix between ZIP codes in the United States: A differential sampling approach. Journal of Transport Geography, 86, 102770. https://doi.org/10.1016/j.jtrangeo.2020.102770