Tuesday, July 28, 2009

Thoughts on GeoWeb Standards

Andrew Turner recently asked for our thoughts on GeoWeb standards, and I thought I'd put it as a post here instead of cluttering too much of his comment stream.

I've been thinking about the different standards and their place in the world a lot recently. I'm not someone who takes strong stances on anything, and you're not, I hope, going to read this post and think that I'm a KML partisan, and that it's only because I work at Google that I think positive thoughts about it. I prefer instead to explore the problem space.

The problem isn't adoption, clearly. It's findability.

Adoption rate:

There's no question that KML has a phenomenal adoption rate. Michael Jones went over the numbers during his GeoWeb talk, but in case you missed it:

  • More than 500,000,000 KML/KMZ documents on the Internet
  • More than 250,000 Internet websites hosting KML/KMZ content
  • 2 billion placemarks accessible on the public Internet
Those are staggering numbers, especially compared to just last year when Google announced it had indexed tens of millions of KML files on a hundred thousand unique domains. Growth by a power of ten in a year is a lot.

GeoRSS has also expanded rapidly. I don't have numbers on it, but I'm sure it's also a very large number.

There are other formats, too, like GeoJSON, that are great, and I really look forward to seeing what happens with them.

Findability

Frankly, I think we can do a much better job. Fundamentally, one of the problems is that geographic data doesn't lend itself well to linkablity. Sure, you can link within the data, but few people do. A limited number of KML files link to other KML files. GeoRSS can contain a variety of links, but often not to other geographic data files, but rather to HTML or binary media.

KML has been described as the HTML of geographic data. Whether that's true or not is a matter of some discussion, though I happen to think it is (more on that in another post I guess, after lots of people tell me I'm full of it). But one of the principle characteristics of HMTL is linking, which is weakly implemented in KML. Linking happens in two places, the Atom link element, and in the description balloon. Atom links usually refer to HTML media, as in "this is the site credited with authorship." In the description balloon, you're operating in essentially an HTML environment, leading people to be less likely to author KML with links to other KML files, but rather with links to HTML. When authors do put links to KML, it's mostly within their site, not to other KML files elsewhere.

My point isn't to encourage people to link to KMLs created by others, but rather that for findability purposes, on the HTML web we rely primarily on a link structure. The early web was made up of pages that delivered content, and linked to other sites. Whole pages developed early as directories of other sites, and they linked to other directories. Google web search was built on using the number and authority of links of others to rank pages. The "GeoWeb" isn't really a web in the same way. It uses the technologies that built the web, that live on the web, but it itself doesn't constitute a web in a meaningful sense. The vast majority of links to geographic data that I've found are HTML links within full HTML pages, with the next set being programmatic.

Is that the nature of geographic data? Or have we just not found the true linkability of it? I tend to think it's the former. Geographic data is heirarchical, it is ontological, it is content rich, it is combinable. It is linkable through common ontologies. But geographic data doesn't lend itself to easy linking in the same way. It's the nature of structured data, it must relate to a structure. Ontologies are almost the antithesis of linkablity outside the domain.

So that suggests that we need to find another mechanism for findability. Deep searches are possible, but generally when you want geographic data, you either want points on a map, "This is where that thing is" which is fairly easy to do, and I think we've done it well. Or, you want a metadata search of some kind, "Give me all the polygons that fall within this bounding box, and where property X is between Y and Z." That no one does well on a global scale, only within limited sets of data. Searching on text is great for web pages, because they are composed primarily of text. But searching for data is a whole other problem not easily solved by our current mechanisms.

Some people have written about using Semantic Web technologies to provide the linking, and particularly Andrew notes LinkedGeoData in his comments on his blog post. I've always been of the opinion that the Semantic Web is too complex. One of the joys of HTML is the ease by which you can link pages. The authoring tools aren't really there yet either. I'd be happy to be proved wrong. I used to think that standards, like RDF, that have languished for so long will never take off. However, the explosion of Ajax in the last few years has made me less skeptical. I don't know if the Semantic Web is the technology of the future and it always will be, or if it will actually take off. I remain fairly skeptical however, and as yet there's no widely adopted viewer for it either.

Combinability

Perhaps the true value of XML based formats comes from their combinability. Whether it's Atom (or RSS) and GML to make GeoRSS, or Atom and KML to produce a Google Data API, or Atom and KML to produce, well, KML containing Atom. This greatly increases their usability, and I think I sense another post coming on since this one is getting long. But my point being the XML standards provide the only really good way of doing this while retaining proper namespaces. The downside is, of course, the verbocity of XML and the pain of XML schema.

Wrap Up

Don't get me wrong, I think that KML and GeoRSS are great, as are a lot of other formats I haven't mentioned, like GeoJSON and others. Andrew asked also about other interesting topics, like expressiveness and durability, which I haven't gotten to. Ultimately, though, if we can't solve the findability problem, other technologies will come in that do.

Monday, July 6, 2009

Wrap-Up on GRUPP

I took a vacation immediately after GRUPP, so I didn't write a follow-up post to my Day 1 post. But I did want to just follow-up with a couple of resources that stood out for me beyond what I already wrote:

  1. JavaRosa
    JavaRosa is an open-source platform for data collection on mobile devices. Basically, a supped-up XForms browser for Java ME, it's pretty awesome.

  2. Open Data Kit
    Built on JavaRosa, it's designed specifically for Android, and to take advantage of all the awesomeness that is an Android phone.

I've decided in the coming weeks to play with ODK and blog about it, and try to think about other solutions for mobile offline/online data collection. I haven't done mobile development before, so this will be interesting.