How Alibaba’s Xianyu tech team recode GPS data and map a billion items into local ‘business districts’
Among China’s rising e-commerce platforms, Alibaba’s Xianyu(闲鱼) has emerged as a popular new destination for buying and selling second-hand goods, with over a billion user-submitted items already listed nationwide. The sheer volume of that inventory, dispersed not among warehouses but among households in “business districts” throughout the country, has challenged the app’s designers to expedite the flow of goods using GPS location data for both items and buyers — a challenge that conventional processing methods have thus far proved no match for.
To build a scalable system for effectively matching items and buyers to business districts, the Xianyu team used a set of algorithms for recoding location data based on GeoHash precision matching, which helped to vastly reduce the number of computations needed to complete exchanges. This approach also furnished a novel method for inferring codes for regions adjacent to those identified algorithmically, again reducing burdensome computations.
Let’s take a closer look at the effort behind Xianyu’s achievement and how the app’s framework simplifies spatial relationships in a coding context.
Hashing Out the Problem
To effectively divide cities into smaller business districts, Xianyu first evaluates factors such as transport networks and the distribution of malls and residential areas, and then maps divisions accordingly, as was the case for Alibaba’s home city of Hangzhou.
Based on user-shared GPS information, items listed on the platform are first encoded as point data, which is distributed in an unsorted manner across the regional map. Xianyu takes stock of the items located within a business district using this GPS data and recommends them to possible buyers located in the same district. Useful recommendations thus rely on the accurate calculation of the items that fall within a specified business district.
In terms of computation in the Xianyu database, business districts act as polygon data parameters that differ in shape and size and which do not overlap with each other. Items, meanwhile, are GPS-tagged point data. The question is, then, how can Xianyu quickly and accurately determine which business district these massive data points belong in? A conventional approach would be to use spatial relation equations to calculate the point-surface relations, but this involves processing huge amounts of data — a task already beyond feasibility considering the volume of goods traded on Xianyu.
Collectively, Chinese cities are divided into a total of approximately 10,000 business districts of varying sizes, ranging from 10 to 80 parameter lines each. Using the above-mentioned approach to match items with business districts, 20 quadrillion basic operations would have to be completed to process the total volume of data. Attempts to take this approach using Alibaba’s internal offline computing clusters returned no results after running the problem for over two days.
Faced with the need to innovate, Xianyu used a GeoHash-based precision matching algorithm coupled with rough GeoHash matching and, to a limited extent, a spatial relation equation-based precision matching algorithm. In the space of a day, this approach was able to successfully return results for the same monumental task, accurately matching data for one billion items and 10,000 business districts.
Technical Principles of GeoHash Point Data
At the most basic level, GeoHash encodes 2D geographic coordinates into strings of letters and digits. Each of these strings represents a specific geographic rectangle and is shared by all coordinates falling inside this rectangle. Therefore, the longer the string is, the smaller and more precisely defined the corresponding rectangle will be.
When encoding a geographic location, the target latitude and longitude are calculated to decide whether they fall in the left or right section of the original intervals of [-90,90] for latitude and [-180,180] for longitude. 0 is recorded for points falling in the left section, and 1 for those falling in the right. The interval obtained in the previous step is then folded in half to perform a further search function, which in turn returns a new binary code. When this process results in a binary code of a length that reaches the required precision, the code is then interleaved according to the rule that longitude values are placed on even bits and latitude values on odd bits, resulting in a new binary string. In the last step, this binary string is converted to a character string according to the Base32 table, which yields the precise GeoHash string of the geographic coordinates.
An example of GeoHash string calculation is given below with the coordinates “30.280245, 120.027162”. First, the latitude is binary that has been coded according to the following steps:
1. Divide [-90,90] into two equal halves. “30.280245” falls in the right section (0,90), so the first bit is 1.
2. Divide [0,90] into two equal halves. “30.280245” falls in the left section (0,45), so the second bit is 0.
3. Keep repeating this process to make the resultant interval smaller and smaller, with its two end points increasingly close to “30.280245”.
The first rounds of iteration are detailed in the procedures shown in the figure below:
These iterations continue until the code’s length offers the required precision. The complete iteration table of a 15-bit binary code is as follows:
The resultant binary code for latitude is 10101 01100 01000.
Binary code for the longitude is reached by the same iteration procedures:
The resultant binary code for longitude is 11010 10101 01101.
Following the rule that longitude values are placed on even-position bits and latitude values on odd-position bits, the binary codes of latitude and longitude are then interleaved to produce the final binary code: 11100 11001 10011 10010 00111 00010.
Next, Base32 codes are determined, with every five binary codes corresponding to one duotricemary code. Thus, the five binary bits are converted to decimal bits, returning values of 28, 25, 19, 18, 7, 2. The resulting code, based on the Base32 code table below, is wtmk72.
These results can be subsequently verified at geohash.org. In this case, verification results are as follows:
The first bits shown in the result from geohash.org are consistent with those yielded by the previous calculation. Multiple rounds of iteration via binary divisive procedures can likewise return longer and more precise results, as shown on the verification website.
The GeoHash string length corresponds to precision as shown in the following:
Obtaining GeoHash Codes for Surface Data
The standard GeoHash algorithm introduced in the previous section can only calculate GeoHash codes for 2D point coordinates. To complete location matching tasks, though, Xianyu also needed a way to determine GeoHash codes for the related surface data (polygon objects in the geographic information system, or GIS). This required a further extension of the algorithm applied in the previous effort.
The first step in this method is to determine the minimum bounding rectangle (MBR) of the polygon in question, and then to calculate the GeoHash code of the MBR’s southwestern corner coordinates. The rectangular GeoHash block corresponding to this GeoHash code is determined in an inverse GeoHash encoding process. Starting from this GeoHash block, Xianyu’s mechanism then searches for adjacent GeoHash blocks of the same size to the north and east, continuing until a GeoHash block is fully defined from the MBR. The edges of some of the GeoHash blocks thus situated might not intersect the polygon in question, and if so these blocks must be removed through a series of calculations detailed in subsequent steps.
For the example above, the HD image of the result is given below, where blue GeoHash blocks partially intersect with the original polygon and orange GeoHash blocks are contained entirely by it.
The following flow chart demonstrates the algorithm above:
A Quick Algorithm for Finding Adjacent GeoHash Blocks
In the GeoHash encoding flow chart in the previous section, the two steps marked in turquoise and orange are tasked with finding the adjacent GeoHash string to the east and north respectively.
With a conventional approach, the coordinates of one point inside the adjacent block can be obtained using the information which describes the current decoded GeoHash block. This point is then GeoHash-encoded, and the result is used as the GeoHash code for the adjacent block. As shown in the figure below, obtaining the codes for the 8 adjacent blocks surrounding “wtmk72” involves decoding “wtmk72” into the coordinates of 4 vertices (N1, N2, N3, and N4), which is done using an inverse GeoHash encoding process. These coordinates are then used to calculate the coordinates of N5, a random point inside the right adjacent block. N5 is then GeoHash-encoded to get “wtmk78”, the code for the right adjacent block. The codes of the other 7 adjacent blocks around “wtmk72” can be obtained in the same way.
This method requires a decoding process followed by an encoding process. This takes up a large amount of time, especially if the specified GeoHash string is long (thus requiring many decoding-encoding cycles).
By observing the pattern of the GeoHash coding table and the characteristics of Z-order curves, Xianyu was able to verify a method for quickly determining adjacent GeoHash strings by looking up tables.
Here it makes sense to continue using the GeoHash string “wtmk72” as an example. The corresponding decimal numbers are “28, 25, 19, 18, 7, 2”, and the binary codes are 11100 11001 10011 10010 00111 00010. Where w corresponds to the 5 binary codes 11100 that represent “longitude, latitude, longitude, latitude, longitude” respectively, t corresponds to the 5 binary codes 11001 that represent “latitude, longitude, latitude, longitude, latitude” respectively. To generalize this result, odd-position bit characters in the GeoHash string (‘w’, ‘m’, and ‘7’, in this example) represent binary bits that correspond to “longitude, latitude, longitude, latitude, longitude”, while even-position bit characters (‘t’, ‘k’, and ‘2’, in this example) represent binary bits that correspond to “latitude, longitude, latitude, longitude, latitude”.
The binary codes for ‘w’ are 11100, translating to “right, up, right, down, left”. The binary codes for ‘t’ are 11001, translating to “up, right, down, left, up”.
This character-direction conversion reveals the character-location mapping for odd-position bits displayed in the following table:
The following is the character-location mapping table for even-position bits:
Interestingly, these two tables can be mutually converted by inverting their axes and transposing accordingly.
Using these two tables allows Xianyu’s system to quickly determine the 8 characters surrounding each character. To calculate the 8 GeoHash values surrounding a given GeoHash string:
· If the last character of the string does not exceed the boundary in this direction, the approach is to keep the first several characters unchanged and take the adjacent character in this direction.
· If the last character exceeds the table boundary in this direction, the approach is to find the adjacent character of the penultimate character in this direction, and then the circularly adjacent character of the last character in the same direction.
· If the adjacent character of the penultimate character in this direction also falls outside of the table boundary, it is instead necessary to find the adjacent character of the third-to-last character in this direction first, and so on.
Considering the example of “wtmk72”, finding the 8 strings surrounding this GeoHash string is equivalent to locating the adjacent character of the last character ‘2’. The even-position bits table applies to ‘2’, whose 8 adjacent characters are ‘1’, ‘3’, ‘9’, ‘0’, ‘8’, ‘p’, ‘r’, ‘x’, where, ‘p’, ‘r’, and ‘x’ fall under the lower boundary of the table and the adjacency is obtained by connecting the upper and lower part of the even-position bits table. For these 3 characters that fall under the boundary, the approach is to find the adjacent character immediately below the penultimate character ‘7’. ‘7’ is on an odd-position bit, so the odd-position bits table applies here. The adjacent character immediately below ‘7’ in the table is ‘5’. Therefore, the 8 adjacent GeoHash strings for “wtmk72” are “wtmk71”, “wtmk73”, “wtmk79”, “wtmk70”, “wtmk78”, “wtmk5p”, “wtmk5r”, and “wtmk5x”.
With this adjacent character-focused quick algorithm, the surface data GeoHash encoding algorithm detailed in the flow chart in the previous section becomes vastly more efficient.
Linking Massive Point Data with Surface Data
Xianyu’s new approach to establishing links between massive point data and surface data involved calculating the GeoHash codes for the GPS data of items (point data) and business district AOI data (polygon data), as detailed using the aforementioned algorithms. These GeoHash codes are the same in length, and every piece of point data has a unique GeoHash string. Likewise, every piece of surface data corresponds to one or several GeoHash codes, which are either “fully contained” or “partially contained”.
To proceed with these strings where fully contained, Xianyu joined the GeoHash strings for all items into the strings of business districts. The resultant <items, business district> data enabled it to be determined which business districts the items belonged to.
For other items which the appropriate business districts still needed to be determined for, the approach was instead to join their GeoHash strings into the partially contained strings of business districts. The resulting <items, business districts> data indicated possible “business district-items” inclusion patterns. To verify such inclusion relations, geometric relation operations were applied to the GPS data of items and business district AOI data.
As shown in the following figure, the GeoHash code for Item 1’s point data is “wtmk70j”, successfully joining it with the surface data’s fully-contained string “wtmk70j”. This affirms that Item 1 can be categorized as belonging with this surface data.
The GeoHash code for Item 2’s point data is “wtmk70r”, successfully joining it to the surface data’s partially contained string “wtmk70r”. Item 2 thus seemingly belongs to the surface data, although this requires confirmation with further geometric point-surface calculations. The GeoHash code of Item 3’s point data does not match any of the surface data’s GeoHash block code, which quickly confirms that Item 3 is not part of this surface data.
In real scenarios, massive amounts of item GPS data from across China and large amounts of business district data need to be reconciled effectively. Generally, most items can be assigned to business districts in one step due to being fully contained in the first instance of surface data. For other items, GeoHash matching can reduce the geometric calculations to match items and business districts from the “1 item x all business districts nationwide” Cartesian product scale to the “1 item x 1 (or several) business districts” Cartesian product scale. In this way, the vast majority of unnecessary geometric calculations can be avoided, saving great deals of time.
In practice, processing the data of one billion items and of 10,000 business districts with the quick algorithm detailed in this article, Xianyu used one billion GeoHash point encoding calculations, 10,000 GeoHash surface encoding calculations, and 5 million calculations that determine if points fall within surfaces, totaling roughly 180 billion basic operations — only a minuscule portion of the 20 quadrillion basic calculations required for a conventional approach. Using the algorithm detailed in this article, Alibaba’s offline computing platform accomplished this feat in less than a day.
There are a number of geometric algorithms that can determine the inclusion relation between a given point and polygon. The most commonly used is the ray casting algorithm. Simply put, this involves drawing a ray from the point and counting the number of times the ray intersects with the polygon’s lines. If the number is odd, the point is inside the polygon; if not, it must be outside of it.
In essence, using GeoHash to simplify the calculation of spatial relations between massive point data and surface data is a space indexing approach. A number of useful space indexing techniques have been applied in the GIS field, including R tree series (R tree, R + tress, R * tree), quadtree, K-D tree, and grid indexing, each with its own distinct characteristics. In addition to processing point-surface relation issues, the approach in this article can also quickly handle point-point, polygon-polygon, point-line, and line-line relations. Examples include quickly determining which roads massive bus stops belong to, and whether multiple roads or railways intersect, among others.
(Original article by Luo Junfeng罗俊沣)