C’mon, please Google?

Steve Lime recently asked a question on the OSGeo discussion list about how search engines are exploiting spatial information. This is a topic that has intrigued me for a while too, and I think that the answer is that they are not leveraging this data to anywhere near the level they could be.

Google appears committed to its goal of organizing the world’s information, but they have barely scratched the surface of spatial information. Online spatial information is just as hard to find today as web sites were before Google came along.

I’m not naïve enough to think that Google would provide a spatial discovery portal without a strong incentive. Fortunately for us, that’s what separates spatial information from other domain-specific search problems: spatial context will contribute greatly to the quality of Google’s search results and provide them with better targeting for their advertising.

Google is already partway there. They have developed some strong code to parse addresses for web page geocoding, and they are already indexing spatial data and services: e00, kml, dwg, wms. Why would they not go the extra few yards (metres?) in making their index spatially-enabled? Well, possibly because it’s a bit more work than that.

On the web, you have one-dimensional documents (basically just streams of text) linked together in a massive topological construct. Indexing these documents requires you to analyze things like the structure of the document, importance of the document within its own site, and the quality of its neighbours in the web topology.

Spatial indexing is somewhat different, in at least three ways. First, within documents you are dealing with two- or three-dimensional relationships between elements. Second, spatial data is often presented in a format that looks similar to a spreadsheet with cryptic headings and numeric values; metadata is often either implied or stored in an auxiliary location. Third, there may not be an explicit definition of the projection the data are stored in, making it difficult to determine their true location in space.

The first item is important because understanding the relationships between elements in spatial documents is a large part of understanding the data. Some documents may have their elements evenly distributed throughout an entire country, while others may have data clustered in large cities. In some subject areas, documents containing a concentration of elements for a specific area might be given a higher authority rating than documents with their elements scattered over the landscape.
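As a rough illustration of how "clustered versus scattered" could be turned into a number, here is a minimal sketch. It assumes a document's elements have already been reduced to (x, y) points and that a region of interest (say, a country's bounding box) is known. The function name `concentration`, the grid approach, and the entropy-based score are my own illustrative choices, not anything Google is known to use.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def concentration(points: Iterable[Point], extent: Box, grid_size: int = 4) -> float:
    """Return a score from 0 (evenly spread over the extent) to 1 (all in one cell).

    Hypothetical measure: 1 minus the normalized Shannon entropy of how the
    points fall into a grid_size x grid_size grid laid over the extent.
    """
    min_x, min_y, max_x, max_y = extent
    width, height = max_x - min_x, max_y - min_y
    if width <= 0 or height <= 0 or grid_size < 2:
        raise ValueError("extent must have positive area and grid_size must be >= 2")

    def cell_index(value: float, low: float, span: float) -> int:
        # Clamp so that points on the upper edge land in the last cell.
        return min(int((value - low) / span * grid_size), grid_size - 1)

    counts: Counter = Counter()
    for x, y in points:
        if min_x <= x <= max_x and min_y <= y <= max_y:  # ignore points outside the extent
            counts[(cell_index(x, min_x, width), cell_index(y, min_y, height))] += 1

    total = sum(counts.values())
    if total == 0:
        return 0.0

    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    max_entropy = math.log(grid_size * grid_size)
    return 1.0 - entropy / max_entropy
```

A data set with one point in every grid cell scores close to 0, while a data set whose points all fall in a single cell scores 1. Whether a high or a low score should raise a document's authority would depend on the subject area.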

The second item is a difficult problem, but within Google’s ability to solve. The context of the document might be picked up from text files in the same directory, information gleaned from pages that link to the document, or the general subject of the site the data is linked from. As a user, it would be time-consuming to track down this information, but not impossible. For a computer it would be more difficult, but Google is a world leader in extracting semantic information from unstructured data. Developing algorithms to attach meaning to uncoded spatial documents is something that they are uniquely qualified to do.

The third problem is also quite difficult. Again, part of the issue is that the projection definition may not be stored in the same file as the data; it might be stated on the page linking to the files, or it may just be implied. The implied case is difficult, but not always unsolvable. For projection information, there are general standards of practice in most parts of the world. For instance, in British Columbia most local data is stored in UTM, while provincial data is generally available in one of UTM, BC Albers, or less frequently in Lat/Lon WGS84. Taking documents found on domains that have been geocoded to BC and determining the best fit with common local projections will often give good results.
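To make the "best fit" idea concrete, here is a minimal sketch. It assumes the coordinates have already been pulled out of a file that is known (from the geocoded domain) to relate to BC, and it simply checks which candidate projection's plausible coordinate range contains most of the points. The extents below are rough, illustrative values I have assumed rather than authoritative bounds, and the names `CANDIDATE_EXTENTS`, `fit_score`, and `guess_projection` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

# Rough, assumed coordinate ranges that data located in BC would occupy
# in each candidate projection. Illustrative only.
CANDIDATE_EXTENTS: Dict[str, Box] = {
    "Lat/Lon WGS84": (-139.5, 48.0, -114.0, 60.5),                      # degrees
    "UTM (zones 8-11)": (166_000.0, 5_300_000.0, 834_000.0, 6_700_000.0),  # metres
    "BC Albers": (35_000.0, 300_000.0, 1_900_000.0, 1_750_000.0),       # metres
}


def fit_score(points: List[Point], box: Box) -> float:
    """Fraction of points that fall inside the candidate's coordinate range."""
    if not points:
        return 0.0
    min_x, min_y, max_x, max_y = box
    inside = sum(1 for x, y in points if min_x <= x <= max_x and min_y <= y <= max_y)
    return inside / len(points)


def guess_projection(points: List[Point]) -> Tuple[str, float]:
    """Return the best-fitting candidate name and its score (0 to 1)."""
    scores = {name: fit_score(points, box) for name, box in CANDIDATE_EXTENTS.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]
```

For example, `guess_projection([(491000.0, 5456000.0)])` picks the UTM candidate, since those values only make sense as a UTM easting and northing. A real implementation would also need to work out which UTM zone applies and would treat a low best score as "unknown" rather than trusting it.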

I have kept my discussion mainly to file-based data sets, but I should be clear that it is just as important to do deep data mining against geospatial web services, emerging standards like GeoRSS, and the existing geourl and Dublin Core metadata tags. Fortunately, the problem domain is considerably smaller with many of the web-based formats than for file-based spatial information.

I am sure that there are other problems that will be stumbled into along the way, but the benefits are well worth it: end users can have a much fuller understanding of their world, spatial professionals can easily locate the information that they need, and Google will be able to develop better profiles of their users related to their location.

So, how about it, Google? Don’t you want to know that your users are coming from a location within a 25-year flood zone having average assessment values of $700,000 per parcel? I’m sure that there are lots of us who would love to help you. In return, all you would have to do is return spatial information in a special search or as a layer in Google Earth with a link to the source data or service :)

Hmm. Maybe I’m too full of myself, thinking that someone from Google might actually read this. Ah well. Might as well go for broke: I wouldn’t complain if I could store spatial information in Google Base, perform GEOS-style spatial queries on it, and return it as GML or KML… Oh, and it would be nice to be able to publish this data to Google Earth using collection pointers created with something like Google Sitemaps.

Enough whining for one day. -j
