How to query database records by their relative distance to each other on Earth?

Or: using Google's geocoding service and the Great Circle Distance-formula to make distance-based queries against a MySQL database.

I recently ran into a problem that comes up again and again. A customer has a couple of events in their database and they want to find out which events take place near a client's location so they can tell them about it. A few years ago, this kind of thing would have required a huge investment in licenses for commercial databases, but today, with the availability of free internet geocoding webservices, it turns out that the solution is surprisingly simple. In the following paragraphs I present solutions for the problem using SQL from PHP and Python.

I quickly found out that calculating the distance between two points on Earth, defined by their latitude and longitude, is not so hard. Besides being a rather simple trigonometric problem, there are also multiple formulas that differ in their complexity and accuracy, but all of them can easily be found on Wikipedia. I settled for a simple approach, using the Great Circle Distance-formula.
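For reference, this is the variant of the formula that the SQL query further down works out, written as a sketch in my own notation: φ1, λ1 and φ2, λ2 are the latitude and longitude of the two points in radians, Δλ is the difference between the two longitudes, and R is the radius of the sphere (6372.795 km in the query).

```latex
d = R \cdot \operatorname{atan2}\!\left(
      \sqrt{\left(\cos\varphi_2 \,\sin\Delta\lambda\right)^2
          + \left(\cos\varphi_1 \,\sin\varphi_2
                - \sin\varphi_1 \,\cos\varphi_2 \,\cos\Delta\lambda\right)^2},\;
      \sin\varphi_1 \,\sin\varphi_2
          + \cos\varphi_1 \,\cos\varphi_2 \,\cos\Delta\lambda
    \right)
```

The sign of Δλ does not matter here, because the sine term is squared and the cosine is an even function.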

Surprisingly, Python has a very good library that implements an even better model for calculating the distance between two points on this planet, based on an ellipsoid instead of a sphere. It also provides a convenient interface to the geocoding services of Google, Yahoo and geocoder.us. The library is called geopy. So, if you're using Python, you can stop reading now and just study geopy's documentation :-). Come back here if you need a recipe to calculate the distance between two places using SQL.

There's also a convenient interface to Google's geocoder for Java called GeoGoogle.

However, it seems that no such library exists for PHP. I only found example code in an article on developer.com and a library for automatically creating JavaScript and the mark-up for Google Maps mash-ups called GoogleMapAPI.

So, for the sake of your convenience I'll post my code for accessing the Google Geocoder API from PHP. For my convenience, I opted to use JSON output, using the excellent PEAR Services_JSON-library to quickly parse Google's answer in PHP. If you're a bit sadistic and loathe yourself, you can also use the great abomination that is PHP's XML support... that's entirely up to you.

/* In the following code $address is the address you want to geocode and API_KEY is a constant
that contains your registered Google Maps API key. This code requires PHP's curl
extension for the HTTP communication. */

$url = 'http://maps.google.de/maps/geo?q=' . urlencode($address) . '&output=json&key=' . API_KEY;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$gmaps_output = @curl_exec($ch);
curl_close($ch);

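// Load PEAR's Services_JSON; adjust the path to match your installation.
require_once 'Services/JSON.php';

// Both stay null unless the geocoder returns exactly one match.
$lat = $lng = null;

$json = new Services_JSON();
$response = $json->decode($gmaps_output);
if (!$json->isError($response)) {
  if ($response->Status->code == 200) {        // 200 = request succeeded
    if (count($response->Placemark) == 1) {    // skip ambiguous results
      // Coordinates arrive in the order longitude, latitude.
      $lng = $response->Placemark[0]->Point->coordinates[0];
      $lat = $response->Placemark[0]->Point->coordinates[1];
    }
  }
}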

You should replace maps.google.de with a host that is in the top-level domain of the country of the address you're querying. That seems to give the best results.
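If you prefer to do the parsing step in Python without geopy, a minimal sketch could look like the following. It assumes the reply has exactly the shape the PHP code above relies on (Status, Placemark, Point, coordinates); the function name extract_coordinates and the sample reply are made up for illustration and are not taken from Google's documentation.

```python
import json


def extract_coordinates(raw_json):
    """Return (lat, lng) from a geocoder reply, or None if it is unusable."""
    try:
        response = json.loads(raw_json)
    except ValueError:
        return None  # the reply was not valid JSON
    if not isinstance(response, dict):
        return None
    if response.get("Status", {}).get("code") != 200:
        return None  # the geocoder reported an error
    placemarks = response.get("Placemark", [])
    if len(placemarks) != 1:
        return None  # no hit, or an ambiguous address
    # The coordinates arrive in the order longitude, latitude.
    lng, lat = placemarks[0]["Point"]["coordinates"][:2]
    return lat, lng


# Hypothetical sample reply, reduced to the fields used above.
SAMPLE_REPLY = (
    '{"Status": {"code": 200},'
    ' "Placemark": [{"Point": {"coordinates": [13.405, 52.52, 0]}}]}'
)

if __name__ == "__main__":
    print(extract_coordinates(SAMPLE_REPLY))
```

Like the PHP version, this deliberately rejects replies with more than one placemark, so an ambiguous address never ends up in the database.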

Now, let's say that you geocoded an address and it's time to put it in the database. You can of course just use two FLOAT columns, but MySQL, PostgreSQL and most other databases have at least some support for geographic types, usually based on the recommendations of the Open Geospatial Consortium. Using them may come in useful one day. So for MySQL I created a table like this:

CREATE TABLE geocoded (
  id INT(11) PRIMARY KEY NOT NULL auto_increment,
  gm_coor POINT NOT NULL,
  address_metadata_id INT(11) NOT NULL
);

To enter data in this table you use an insert query like this:

INSERT INTO geocoded (gm_coor, address_metadata_id) VALUES(
   GeomFromText('POINT([LATITUDE] [LONGITUDE])'), [whatever]);
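From Python, the same insert could be prepared roughly as follows. This is only a sketch: the helper names point_wkt and insert_args are hypothetical, the column names follow the CREATE TABLE statement above, and the %s placeholders assume a DB-API driver that uses the "format" parameter style (as MySQLdb does).

```python
def point_wkt(lat, lng):
    """Build the WKT text for a point, latitude first as in the table above."""
    lat, lng = float(lat), float(lng)
    if not -90.0 <= lat <= 90.0 or not -180.0 <= lng <= 180.0:
        raise ValueError("coordinates out of range")
    # Fixed-point formatting avoids exponent notation for tiny values.
    return "POINT(%.6f %.6f)" % (lat, lng)


# Assumption: the driver substitutes %s placeholders itself.
INSERT_SQL = (
    "INSERT INTO geocoded (gm_coor, address_metadata_id) "
    "VALUES (GeomFromText(%s), %s)"
)


def insert_args(lat, lng, address_metadata_id):
    """Return the (sql, parameters) pair to hand to cursor.execute()."""
    return INSERT_SQL, (point_wkt(lat, lng), address_metadata_id)
```

With an open cursor you would then call cursor.execute(*insert_args(52.52, 13.405, 7)), which leaves the quoting of both values to the driver.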

Now, if you're using PostgreSQL, you can use the excellent PostGIS extension that, besides implementing convenient column types to store geographic data, also implements, for example, the ST_length_spheroid function that is much easier to use than the following code to compare distances. But if you're stuck with MySQL, the following SQL snippet will implement the formula of the Wikipedia article mentioned above:

SELECT id, X(gm_coor) AS latitude, Y(gm_coor) AS longitude,
  ATAN2(
    SQRT(
      POW(COS(RADIANS(__LAT__)) *
           SIN(RADIANS(Y(gm_coor) - __LNG__)), 2) + 
      POW(COS(RADIANS(X(gm_coor))) * SIN(RADIANS(__LAT__)) - 
          SIN(RADIANS(X(gm_coor))) * COS(RADIANS(__LAT__)) * 
          COS(RADIANS(Y(gm_coor) - __LNG__)), 2)), 
    (SIN(RADIANS(X(gm_coor))) * SIN(RADIANS(__LAT__)) + 
     COS(RADIANS(X(gm_coor))) * COS(RADIANS(__LAT__)) * 
     COS(RADIANS(Y(gm_coor) - __LNG__)))
  ) * 6372.795 AS distance 
FROM geocoded
HAVING distance < [RANGE IN KILOMETRES]

To use the snippet, you just need to replace the placeholders __LAT__ and __LNG__ with the latitude and longitude of any point on Earth in degrees, and [RANGE IN KILOMETRES] with the search radius. Please note that the result is only an approximation of the real distance because, as I mentioned, the model treats the Earth as a sphere rather than using an accurate representation of its shape.
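To sanity-check what the query returns, it can help to have the same calculation outside the database. The following Python function is my own transcription of the SQL expression above, term by term; the name great_circle_km is made up, and the radius is the constant from the query.

```python
from math import atan2, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6372.795  # same constant as in the SQL query


def great_circle_km(lat1, lng1, lat2, lng2):
    """Distance in km between two points given in degrees, on a sphere."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlng = radians(lng2 - lng1)
    # First argument of ATAN2 in the query: the square root term.
    numerator = sqrt(
        (cos(phi2) * sin(dlng)) ** 2
        + (cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlng)) ** 2
    )
    # Second argument of ATAN2 in the query.
    denominator = sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(dlng)
    return atan2(numerator, denominator) * EARTH_RADIUS_KM


if __name__ == "__main__":
    # Berlin to Munich, coordinates rounded; expect roughly 500 km.
    print(great_circle_km(52.52, 13.405, 48.137, 11.575))
```

If the value from MySQL and the value from this function disagree noticeably for the same pair of points, the first thing to check is whether latitude and longitude were swapped when the POINT was stored.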

Why do you use HAVING instead of WHERE?

This question is easily answered: MySQL does not permit the use of aliased columns (like distance) in the WHERE-clause, because their value might not have been calculated yet when the WHERE-clause is evaluated. The HAVING-clause works very differently. In this case, the database first calculates the full result set, i.e. it calculates the distance between the input coordinates and every row in the table "geocoded" and then throws all rows away that don't match the criteria specified in the HAVING-clause.

You need to be careful with this, as it can be a huge performance problem. If you don't have another criterion by which you can shrink the result set to a manageable size, you should first query against a bounding rectangle. In Germany, one degree of longitude corresponds to approximately 78.6 km (this figure shrinks the further north you go) and one degree of latitude to approximately 111 km, so you can do something like this:

[...] WHERE X(gm_coor) > (__LAT__ - 1.0) AND X(gm_coor) < (__LAT__ + 1.0) AND
Y(gm_coor) > (__LNG__ - 1.0) AND Y(gm_coor) < (__LNG__ + 1.0) HAVING distance < [...]

Because the bounds are plus or minus one degree, this first confines the search to coordinates within a rectangle of roughly 222 km x 157 km, which will potentially save you lots of time. Alternatively, you could recalculate the distance in the WHERE-clause by copying the full formula there and not using the alias, but be careful, because that might be even more of a performance hit. MySQL's query optimizer does not use column indexes if, instead of the column, a function of the column is used in the WHERE-clause. So in some cases, WHERE FUNC(col1) < X might perform worse than HAVING FUNC(col1) < X. This has been discussed in detail on the excellent MySQL performance blog.
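Instead of hard-coding a margin of one degree, the rectangle can be derived from the search radius. Here is a rough Python sketch; the function name bounding_box is hypothetical, the 111 km per degree of latitude is the approximate figure from above, and the longitude margin is scaled by the cosine of the latitude. It is only meant as a prefilter, so pad the radius a little, and note that it does not handle the poles or the 180° meridian.

```python
from math import cos, radians

KM_PER_DEGREE_LAT = 111.0  # approximate figure used in the text


def bounding_box(lat, lng, range_km):
    """Return (min_lat, max_lat, min_lng, max_lng) around a centre point.

    Approximation only: not valid near the poles or across the
    180 degree meridian.
    """
    dlat = range_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    km_per_degree_lng = KM_PER_DEGREE_LAT * cos(radians(lat))
    dlng = range_km / km_per_degree_lng
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng


if __name__ == "__main__":
    # Margins for a 50 km search around a point in Berlin.
    print(bounding_box(52.52, 13.405, 50.0))
```

The four values go into the WHERE-clause in place of the fixed offsets, while the HAVING-clause still does the exact distance check on the rows that remain.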

Have fun!

I hope this article gave you some pointers. If you have any questions or want to leave a comment, feel free to do so: just go to the comments page for this article. Also, take a look at the external links; they might prove quite helpful. Thank you!