
    Exploring the OS Locator OpenData set

    January 21st, 2011

    Fiona Hemsley-Flint had a good look at the OS Locator dataset which is available from the Ordnance Survey Open Data portal. I thought a summary of her findings might be of use to others thinking about how to use this dataset.

    Overview

    OS Locator contains a list of all the road names in the UK, “derived from a number of Ordnance Survey datasets [Meridian2, Road database, Locality dataset, Boundary-Line]. These include the roads database which contains information on road names and road numbers and is the latest generation of Ordnance Survey’s sophisticated and highly detailed geographic data”. OS recommend viewing it on top of mid-scale datasets such as 1:10k and 1:25k Raster and OS Street View (which is freely available via OS OpenData).

    Geometries

    Each feature is geo-referenced by a centre point and a bounding box (although some of the bboxes are actually line features where the road segment of the feature is horizontal or vertical).
    Figure 1. OS Locator names shown on an OS map: multiple occurrences of Ferry Road, differentiated by their locality.

    Attribution

    The roads have a name and/or a classification, where the classification represents a road number (e.g. ‘A1’ or ‘B1243’). They also have an associated settlement (town), locality, county/region and local authority; the latter two are derived from Boundary-Line, while it is unclear what is used to form the ‘Locality dataset’. Locality and settlement are likely to be the most useful of these attributes when displaying result sets. For roads which cross locality boundaries, a point is assigned for each separate locality, therefore one road may have more than one point associated with it, distinguished by its locality.

    Storage

    851,505 rows of data were added to a development server.
    Multiple geometry columns have been added to take into account the different geometries available.
    A ‘tsvector’ column has also been added to implement Postgres text search functionality. An example query might be:
    select name, classification, locality, settlement from os.locator_nov_10 where search @@ to_tsquery('high & street & edinburgh');

    Which returns the following result set:

    Name	Classification	Locality	Settlement
    CORSTORPHINE HIGH STREET		Se Corstorphine	EDINBURGH
    HIGH STREET		Musselburgh Central	EDINBURGH
    HIGH STREET		Musselburgh North	EDINBURGH
    HIGH STREET		Holyrood	EDINBURGH
    HIGH STREET	A199	Musselburgh North	EDINBURGH
    HIGH STREET	A199	Musselburgh Central	EDINBURGH
    NORTH HIGH STREET		Musselburgh North	EDINBURGH
    NORTH HIGH STREET	A199	Musselburgh West	EDINBURGH
    PORTOBELLO HIGH STREET	B6415	Milton	EDINBURGH
    PORTOBELLO HIGH STREET	B6415	Portobello	EDINBURGH
    NORTH HIGH STREET	A199	Musselburgh North	EDINBURGH
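    For reference, here is a minimal Python sketch of how such a text-search column could be set up and queried. The table and column names (os.locator_nov_10, search) follow the post; the connection details, the choice of source columns for the tsvector, and the GIN index are assumptions of mine rather than a description of the actual development server.

    # Hypothetical sketch of the text-search setup described above (psycopg2 assumed installed).
    import psycopg2

    conn = psycopg2.connect("dbname=gazetteer")  # placeholder connection string
    cur = conn.cursor()

    # One plausible way the 'search' column could be built and indexed; the real source
    # columns used to populate it are not stated in the post.
    cur.execute("""
        ALTER TABLE os.locator_nov_10 ADD COLUMN search tsvector;
        UPDATE os.locator_nov_10
           SET search = to_tsvector('english',
                                    coalesce(name, '') || ' ' ||
                                    coalesce(locality, '') || ' ' ||
                                    coalesce(settlement, ''));
        CREATE INDEX locator_search_idx ON os.locator_nov_10 USING gin(search);
    """)
    conn.commit()

    # The example query from the post, parameterised.
    cur.execute(
        "SELECT name, classification, locality, settlement "
        "FROM os.locator_nov_10 WHERE search @@ to_tsquery(%s)",
        ("high & street & edinburgh",),
    )
    for row in cur.fetchall():
        print(row)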

    Overall, the dataset contains a comprehensive list of the road names within the UK. Decisions will need to be made about how to treat multiple features that actually refer to the same real-world road.

    The main limitation of this dataset is that it can only be used to show the user the general location of a road – it can’t be used as a precise address gazetteer since it only provides street names with no knowledge of building numbers.


    More on the use of Unlock Places by georeferencer.org

    November 19th, 2010

    Some months back, Klokan Petr Pridal, who maintains OldMapsOnline.org and works with libraries and cartographic institutes across Europe, wrote with some questions about the Unlock Places service. We met at FOSS4G where I presented our work on the Chalice project and the Unlock services.
    Petr writes about how Unlock is used in his applications, and what future requirements from the service may be:


    It was great to meet you at FOSS4G in Barcelona and discuss with you the progress related to Unlock and possible cooperation with OldMapsOnline.org and usage in Georeferencer.org services.

    As you have mentioned, the most important thing for us would be to have in the Unlock API/database the bounding boxes (or bounding polygons) for places as a direct part of the JSON response. We need that mostly for villages, towns and cities and for areas such as districts or countries – all over the world. We need something like “bounds” as provided by the Google geocoding API.

    The second most important feature is to have the chance to install the service on our own servers – especially in case you can’t provide guarantees for it in the future.

    It would also be great to have the chance to improve the service for non-English languages, but right now gazetteers and text processing are not the primary target of our research.

    At the moment the Unlock API is in use:

    As a standard gazetteer search service, to zoom the base maps to a place people type in the search box in our Georeferencer.org service – a collaborative online georeferencing service for scanned historical maps. It is in use by the National Library of Scotland and a couple of other libraries.

    Here’s an example map (you need to register first).

    The uniqueness of Unlock is in the openness of the license (primarily GeoNames.org CC-BY and also OS OpenData) and also, so far, the very good availability of the online service (EDINA hardware and network?). We are missing the bounding box to be able to zoom our base maps to the correct area (determine the appropriate zoom level). The Unlock API replaced the Google Geocoder, which we can’t use because we also display non-Google maps (such as Ordnance Survey OpenData) and we are potentially deriving data from the gazetteer database (the control points on the old maps), which is against Google’s TOS.

    In the future we are keen to extend the gazetteer with alternative historical toponyms (which people can identify on georeferenced old maps too), or to participate in such work.

    The other usage of the Unlock API is:

    As a metadata text analyzer, in a service such as our http://geoparser.appspot.com/, where we automatically parse existing library textual metadata to identify place names and locate the described maps, including automatic approximation of their spatial coverage (by identifying the map scale and physical size in the text and doing some simple maths on top of it). This service is in a prototype phase only; we are using Yahoo Placemaker and I was testing the Unlock Text API with it too.

    Here the huge advantage of Unlock would be primarily the possibility to add custom gazetteers (with GeoNames as the default one), language detection (for example via the Google Language API or otherwise) and also the possibility to add other tools into the workflow, such as a lemmatizer for a particular language – the simplest available via hunspell/aspell/ispell dictionary integration or via existing morphological rule-based software.

    The problem is that without returning the lemmatization of the text the geoparser is almost unusable in non-English languages – especially Slavic ones.

    We are very glad for the availability of your results and of the reliable online services you provide. We can concentrate on the problems we need to solve primarily (georeferencing, clipping, stitching and presentation of old maps for later analysis) and use the results of your research as a component solving a problem we are touching on and have to practically solve somehow.


    I’m very glad that Petr wrote at such length about his comprehensive use of Unlock, pushing the edges of what we are doing with the service.

    We have some work in the pipeline adding bounding boxes for places worldwide by making Natural Earth Data searchable through Unlock Places. Natural Earth is a generalised dataset intended for use in cartography, but should also have quite a lot of re-use value for map search.


    Connecting archives with linked geodata – Part II

    October 22nd, 2010

    This is part two of a blog post which started with a presentation about the Chalice project and our aim to create a 1000-year place-name gazetteer, available as linked data, text-mined from volumes of the English Place Name Survey.

    Something else I’ve been organising is a web service called Unlock; it offers a gazetteer search service that searches with, and returns, shapes rather than just points for place-names. It has its origins in a 2001 project called GeoCrossWalk, extracting shapes from MasterMap and other Ordnance Survey data sources and making them available under a research-only license in the UK to subscribers of EDINA’s Digimap service.

    Now that so much open geodata is out there, Unlock contains an open data place search service, indexing and interconnecting the different sources of shapes that match up to names. It has geonames and the OS Open Data sources in it; we are adding search of Natural Earth data in short order, and looking at ways to enhance what others (Nominatim, LinkedGeoData) are already doing with search and re-use of OpenStreetmap data.

    The gazetteer search service sits alongside a placename text mining service. However, the text mining service is tuned to contemporary text (American news sources), and a lot of that has to do with data availability and the sharing of models and sets of training data. The more interesting use cases are in archive mining, of somewhat unusual, semi-structured sets of documents and records (parliamentary proceedings, historical population reports, parish and council records). Anything that is recorded will yield data, *is* data, back to the earliest written records we have.


    Place-names can provide a kind of universal key to interpreting the written record. Social organisation may change completely, but the land remembers, and place-names remain the same. Through the prism of place-names one can glimpse pre-history; not just what remains of those people wealthy enough to create *stuff* that lasted, but of everybody who otherwise vanished without trace.

    The other reason I’m here at FOSS4G: to ask for help. We (the authors of the text mining tools at the Language Technology Group, colleagues at EDINA, smart funders at JISC) want to put together a proper open source distribution of the core components of our work, for others to customise, extend, and work with us on.

    We could use advice – the Software Sustainability Institute is one place we are turning to for advice on managing an open source release and, hopefully, a community. OSS Watch supported us in structuring an open source business case.

    Transition to a world that is open by default turns out to be more difficult than one would think. It’s hard to get many minds to look in the same direction at the same time. Maybe legacy problems, kludges either technical, or social, or even emotional, arise to mess things up when we try to act in the clear.

    We could use practical advice on managing an open source release of our work to make it as self-sustaining as possible. In the short term: how best to structure a repository for collaboration, for branching and merging; where we should most usefully focus efforts at documentation; how to automate the process of testing to free up effort where it can be more creative; how to find the benefits in moving the process of working from a closed to an open world.

    The Chalice project has a sourceforge repository where we’ve been putting the code the EDINA team has been working on; this includes an evolution of Unlock’s web service API, and user interface / annotation code from Addressing History. We’re now working on the best way to synchronise work-in-progress with currently published, GPL-licensed components from LTG, more pieces of the pipeline making up the “Edinburgh geoparser” and other things…


    Search and retrieve bounding boxes and shapes

    August 20th, 2010

    So we have a cool project running called Chalice, text-mining and locating historic placenames to build a historic gazetteer stretching back beyond Domesday, for a few areas of England and Wales. Claire Grover from LTG had some questions about using a shape-based rather than point-based gazetteer during “geographic information retrieval”; I thought it worth posting the answers here, as Unlock Places is able to do a lot more in public since the addition of Ordnance Survey Open Data.

    http://unlock.edina.ac.uk/features/Edinburgh – now by default returns info from OS Open Data sources, including Boundary-Line as well as Meridian2, which have bounding boxes and detailed shapes for things like counties and parishes, though note they are all contemporary.

    (The above is just an alias for
    http://unlock.edina.ac.uk/ws/nameSearch?name=Edinburgh )

    So that’s a way to get bounding boxes and shapes for places that are in geonames, by comparing with other sources. The default search results have bounding boxes attached; one must look up a link to see the detailed geometry.

    Here’s how then to filter the query for place-names to a specific bounding box:
    http://unlock.edina.ac.uk/ws/spatialNameSearch?format=json&name=Stanley&minx=-8&maxx=4&miny=53&maxy=64&operator=within
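    As a quick illustration, here is a minimal Python sketch calling both of the searches above with the requests library. The URLs and parameters come from this post; the format parameter on the plain name search is an assumption carried over from the second URL, and the responses are printed raw rather than parsed into any documented structure.

    import requests

    BASE = "http://unlock.edina.ac.uk/ws"

    # Plain name search (the /features/Edinburgh URL above is just an alias for this).
    r = requests.get(BASE + "/nameSearch", params={"name": "Edinburgh", "format": "json"})
    r.raise_for_status()
    print(r.json())

    # The same kind of search filtered to a bounding box, as in the spatialNameSearch URL above.
    r = requests.get(BASE + "/spatialNameSearch", params={
        "format": "json",
        "name": "Stanley",
        "minx": -8, "maxx": 4,
        "miny": 53, "maxy": 64,
        "operator": "within",
    })
    r.raise_for_status()
    print(r.json())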

    We have ‘search for names inside the shape which has this ID’ on our todo list but don’t yet have a pressing use case – for many things bounding boxes are enough, and sometimes one even wants that bit of extra inclusion (e.g. Shropshire’s bounding box will contain a lot more than Shropshire, but as Shropshire’s boundary has changed over time, some approximation about the shape is actually helpful for historic geocoding).

    Note that all place-names for the UK will have county containment information – we added this for Digimap – one day they may start using it!

    You may also be interested to play around with http://mapit.mysociety.org/ – it has all the same OS Open Data sources and mostly the same set of queries, but in places does a little more – it doesn’t have geonames integrated, though.

    Lasma did some work on conflating different mentions of places based on point-polygon relationships (e.g. if a shape and a point have the same name, and the shape contains the point, the name is “the same thing”). However this was an experiment that is not really finished. For example –
    http://unlock.edina.ac.uk/ws/uniqueNameSearch?name=Edinburgh – I see this returns a shape in preference to a point – and wonder if it always will, if a shape is available. However this is not much use when you actively want a set of duplicate names, as you do while geoparsing. It would be good to revisit this, again, with concrete use cases. And of course it would be good to do this for much wider than the UK, with shapes extracted from OpenStreetmap. Investigating…


    Your questions answered, @klokancz from Oldmapsonline.org

    July 30th, 2010

    Klokan Petr Pridal, the creator of the wonderful Old Maps Online and MapTiler, has been using Unlock Places in some collaborative project work with the National Library of Scotland. He had some technical questions for us, and some questions about the intended usage and future of the service, so I thought it worthwhile republishing the answers here on the Unlock blog.

    First, I like the API a lot… It is well documented, with examples. Easy to use. [Thanks!]

    It is a bit confusing that you use the “name” parameter instead of “q” (according to OpenSearch.org), but otherwise it is very nice. I was testing it with the Google Closure UI.AutoComplete, which uses JSONP and a callback function – it is similar to the jQuery module.

    Right, our use of the “name” parameter for query is a legacy thing – it comes from Unlock’s predecessor, GeoCrossWalk. There’s been a lot of development in OpenSearch Geo since then; it would be worth our while to support it. However, I see OpenSearch as mainly for collections of geo-referenced things (datasets or documents) – not for the georeferences themselves – though of course it could be used to do both.

    There’s also the quicklinks API which was a thought experiment. It looks a lot more like the new MapIt API, which we’re also thinking about implementing in front of Unlock.


    It is great that you have a bbox for the results and an external link for the detailed footprint. The API gives anybody access to your combined geonames database with other sources of data like OSM or OS. Geometry, or at least a bounding box, is something I horribly miss in the GeoNames API – and you have solved this problem!

    Ordnance Survey have solved this problem for us with Open Data by releasing sources of shapes that can be used outwith academic publications! (We’ve always had this in the academic-use-only version of Unlock, formerly GeoCrossWalk). We’re now looking at adding OpenStreetmap data to derive the same kind of bounding box and optional detailed shape, for Europe rather than just mainland UK.

    At the moment we are especially interested in using your gazetteer via “nameAndFeatureSearch” for the “populated places” database. I am considering linking the EDINA Unlock API from our Georeferencer.org service, instead of the GeoNames.org API, which was planned originally. We can’t use the Google Maps Geocoding API because of its TOS. I expect that if we use your service and save the coordinates from GeoNames in our database, it is legal, the same as if we used GeoNames.org directly.

    For geonames data, “This work is licensed under a Creative Commons Attribution 3.0 License“. We preserve the attribution in our search results. If you’re republishing the coordinates then you should ideally keep the source data and make the attribution too – that goes for all the different data source attributions we make.


    BTW Georeferencer.org is also going to be used on the National Library of Scotland maps later this year…

    I’m really looking forward to seeing this, and I’m hoping to see more use made of the NLS Maps API in projects here at EDINA.

    I have a couple of questions related to the API:

    – Is utf-8 input supported? I was not able to find records for “Nürnberg” or “Paříž”, while queries like “Nurnberg” or “Pariz” give correct results. Is a utf-8 encoded query passed automatically (urlencoded) to your service, or are any special parameters necessary?

    I passed this question on to Joe and Lasma; they went into a huddle, and a couple of hours later, Lasma sent this:

    Indeed, it was only doing ascii search. Joe just deployed a fix.
    Now you can do utf8 search.

    So utf8 search should now be behaving correctly as you would have expected it to. Thanks for pointing this out and helping us to improve the search service.
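    For what it’s worth, with a typical HTTP client the UTF-8 query really is just percent-encoded automatically – no special parameter is needed. A tiny Python sketch (reusing the nameSearch endpoint mentioned earlier in this post; the format parameter is an assumption):

    import requests

    # The non-ASCII name is percent-encoded for us (name=N%C3%BCrnberg in the final URL).
    r = requests.get("http://unlock.edina.ac.uk/ws/nameSearch",
                     params={"name": "Nürnberg", "format": "json"})
    print(r.url)
    print(r.status_code)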

    – Is a combined query with country or administrative area possible? Something like “London, USA” or “Leith, Edinburgh”?

    Currently, if you do this sort of query – a comma-separated list of names – you see all Londons, and all USAs – as in this query: http://unlock.edina.ac.uk/features/London,USA?format=json

    The various Londons that are, in fact, in the USA, will be marked with a country element ‘United States’.

    But, I think what you’re asking for isn’t this – you’d like the Unlock Places search to pick out the Londons-contained-by-USA and just return those. We could do this, but don’t expose this sort of query via the API. We could change the meaning of comma-separated lists of names to do this, but that might break other people’s worlds. So the best answer I can give you is, we’ll think about how best to implement it and look at the access logs to see if we can reasonably change the meaning of the current API function.
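    In the meantime a client-side workaround is simple enough. This sketch asks for all the matches and keeps only the Londons whose country element says ‘United States’; the ‘features’/‘properties’ field names are guesses at the JSON layout rather than documented structure.

    import requests

    resp = requests.get("http://unlock.edina.ac.uk/features/London,USA",
                        params={"format": "json"})
    resp.raise_for_status()
    data = resp.json()

    # Keep only the Londons marked as being in the United States.
    londons_in_usa = [
        f for f in data.get("features", [])
        if f.get("properties", {}).get("name") == "London"
        and f.get("properties", {}).get("country") == "United States"
    ]
    print(len(londons_in_usa), "Londons found in the United States")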

    – What are the Terms and Conditions of the online service? Is it completely free for anybody, or are there already limits set on the number of requests, usage from websites which are behind a password, commercial web services, derived data, etc.?

    So there are two versions of the Unlock Places gazetteer search service. One is completely open, built on various open data sources, and can be used by anyone for any purpose. We don’t have throttling or quotas on the API.
    If persistent or demented-looking requests ever become a problem, we’ll think about throttling requests from particular hosts. I like the approach that OpenStreetmap’s Nominatim search service takes here – to say, “if you’re planning really heavy traffic, please talk to us first, we can schedule it at a quiet time or you can install your own instance of Nominatim”.

    In the past I’ve fired off a million requests without any pause, to search through the 1881 census microdata placenames for UKDA, and this happily didn’t affect the performance of the service.

    The second gazetteer search service is limited to UK academic institutions that subscribe to the Digimap Ordnance Survey Collection, and the ways in which the data can be re-used are limited to academic services.
    The Archaeology Data Service, for example, uses Unlock Places in some of its services in this way. They don’t require a login, but they do have terms of use of their service, and don’t expose the Unlocked data directly.

    – Do you plan to release (make available for download) your gazetteer database? If not, would you be willing to submit (later on?) at least the database with GeoNames.org IDs and the bboxes back to Mark Wick of GeoNames.org, so the great work you did is preserved also in the official free GeoNames database? You have much more to offer than bboxes, but at least that would be excellent for the community.
    I feel that release of the database is important for sustainability…

    Right, everything in the open data side of Unlock is built from publicly available sources which are open licensed. One thing we could try is putting together a data package – using Open Knowledge Foundation’s datapkg project, for example – that would automate the process of rebuilding a database that looks like Unlock’s, from these different sources.

    – Are you going to support the service in the future?

    Unlock (Places, and Text) is a service supported by JISC, which manages technology funding for research and innovation in the UK. It’s hosted at the EDINA National Datacentre at the University of Edinburgh, which is also mostly supported directly by JISC.

    So EDINA has a service level agreement with JISC to maintain Unlock with a maximum of 10 hours of downtime in a year – I think we’re close to that.

    Our current agreement with JISC to support and develop the Unlock service at EDINA runs until July 2011. Its ongoing existence after that depends on whether we, and JISC, can convincingly make the case that Unlock is creating “impact and value” in academia and beyond (museums, libraries and archives nearest by).

    One of the best ways we can make the case is to get more feedback from people like you, Petr – what you like about the service, what you wish it did, what it’s offering to research that commercial or government services cannot reach. Some more thoughts about that are at the bottom of my last post discussing MySociety’s MapIt service.


    Thank you a lot for your online service!

    Thank you a lot for your long email, Petr, and I hope it helps encourage others to write.


    Work in progress with OS Open Data

    April 2nd, 2010

    The April 1st release of many Ordnance Survey datasets as open data is great news for us at Unlock. As hoped for, Boundary-Line (administrative boundaries), the 50K gazetteer of placenames and a modified version of Code-Point (postal locations) are now open data.

    Boundary-Line of Edinburgh shown on Google Earth. Contains Ordnance Survey data © Crown copyright and database right 2010

    We’ll be putting these datasets into the open access part of Unlock Places, our place search service, and opening up Unlock Geocodes based on Code-Point Open. However, this is going to take a week or two, because we’re also adding some new features to Unlock’s search and results.

    Currently, registered academic users are able to:

    • Grab shapes and bounding boxes in KML or GeoJSON – no need for GIS software, re-use in web applications
    • Search by bounding box and feature type as well as place name
    • See properties of shapes (area, perimeter, central point) useful for statistics visualisation

    And soon we’ll be publishing these new features, currently in testing:

    • Relationships between places – cities, counties and regions containing found places – in the default results
    • Re-project points and shapes into different coordinate reference systems

    These have been added so we can finally plug the Unlock Places search into EDINA’s Digimap service.

    Having Boundary-Line shapes in our open data gazetteer will mean we can return bounding boxes or polygons through Unlock Text, which extracts placenames from documents and metadata. This will help to open up new research directions for our work with the Language Technology Group at Informatics in Edinburgh.

    There are some organisations we’d love to collaborate with (almost next door, the Map Library at the National Library of Scotland and the Royal Commission on Ancient and Historical Monuments of Scotland) but have been unable to, because Unlock and its predecessor GeoCrossWalk were limited by license to academic use only. I look forward to seeing all the things the OS Open Data release has now made possible.

    I’m also excited to see what re-use we and others could make of the Linked Data published by Ordnance Survey Research, and what their approach will be to connecting shapes to their administrative model.

    MasterMap, the highest-detail OS dataset, wasn’t included in the open release. Academic subscribers to the Digimap Ordnance Survey Collection get access to places extracted from MasterMap, and improvements to other datasets created using MasterMap, with an Unlock Places API key.


    Notes on Linked Data and Geodata Quality

    March 15th, 2010

    This is a long post talking about geospatial data quality background before moving on to Linked Data about halfway. I should probably try to break this down into smaller posts – “if I had more time, I would write less”.

    Through EDINA’s involvement with the ESDIN project between national mapping and cadastral agencies (NMCAs) across Europe, I’ve picked up a bit about data quality theory (at least as it applies to geography). One of ESDIN’s goals is a common quality model for the network of cooperating NMCAs.

    I’ve also been admiring Muki Haklay’s work on assessing the data quality of collaborative OpenStreetmap data using comparable national mapping agency data. His recent assessment of OSM and Google MapMaker’s Haiti streetmaps showed the benefit of analytical data quality work, helping users assess how well what they have matches the world, and assisting with conflation to join different spatial databases together.

    Today I was pointed at Martijn Van Exel’s presentation at WhereCamp EU on “map quality”, ending with a consideration of how to measure quality in OpenStreetmap. Are map and underlying data quite different when we think about quality?

    The ISO specs for data quality have their origins in industrial and military quality assurance – “acceptable lot quality” for samples from a production line. One measurement, “circular error probable“, comes from ballistics design – the circle of error was once a literal circle round successive shots from an automatic weapon, indicating how wide a distance between shots, thus inaccuracy in the weapon, was tolerable.

    The ISO 19138 quality models apply to highly detailed data created by national mapping agencies. There’s a need for reproducible quality assessment of other kinds of data, less detailed and less complete, from both commercial and open sources.

    The ISO model presents measures of “completeness” and “consistency”. For completeness, an object or an attribute of an object is either present, or not present.

    Consistency is a bit more complicated than that. In the ISO model there are error elements, and error measures. The elements are different kinds of error – logical, temporal, positional and thematic. The measures describe how the errors should be reported – as a total count, as a relative rate for a given lot, as a “circular error probable”.

    Geographic data quality in this formal sense can be measured, either by a full inspection of a data set or in samples from it, in several ways:

    • Comparison to another data set, ideally of known and high quality
    • Comparing the contents of the dataset, using rules to describe what is expected.
    • Comparing samples of the dataset to the world, e.g. by intensive surveying.

    The ISO specs feature a data production process view of quality measurement. NMCAs apply rules and take measurements before publishing data, submitting data to cross-border efforts with neighbouring EU countries, and later after correcting the data to make sure roads join up. Practitioners definitely think in terms of spatial information as networks or graphs, not in terms of maps.

    Collaborative Quality Mapping

    Muki Haklay’s group used different comparison techniques – in one instance comparing variable-quality data to known high-quality data, in another comparing the relative completeness of two variable-quality data sources.

    Not so much thought has gone into the data user’s needs from quality information, as opposed to the data maintainer’s clearer needs. Relatively few specialised users will benefit from knowing the rate of consistency errors vs topological errors – for most people this level of detail won’t provide the confidence needed to reuse the information. The fundamental question is “how good is good enough?” and there is a wide spectrum of answers depending on the goals of each re-user of data.

    I also see several use cases for quality information flagging up data which is interesting for research or search purposes, but not appropriate to use for navigation or surveying purposes, where errors can be costly.

    An example: the “alpha shapes” that were produced by Flickr based on the distribution of geo-tagged images attached to a placename in a gazetteer.

    Another example: polygon data produced by bleeding-edge auto-generalisation techniques that may have good results in some areas but bizarre errors in others.

    Somewhat obviously, data quality information would be very useful to a data quality improvement drive. GeoFabrik made the OpenStreetmap Inspector tool, highlighting areas where nodes are disconnected or names and feature types for shapes are missing.

    Quality testing

    What about quality testing? When I worked as a Perl programmer I enjoyed the test coverage and documentation coverage packages: a visual interface to show how much progress you’ve made on clearly documenting your code, and to show how many decisions that should be tested for integrity remain untested.

    Software packages come with a set of tests – ideally these tests will have helped with the development process, as well as providing the user with examples of correct and efficient use of the code, and aiding in automatic installation of packages.

    Donald Knuth promoted the idea of “literate programming“, where code fully explains what it is doing. For code, this concept can be extended to “literate testing” of how well software is doing what is expected of it.

    At the Digimap 10th Birthday event, Glen Hart from Ordnance Survey Research talked about increasing data usability for Linked Data efforts. I want to link to this the idea of “literate data“, and think about a data-driven approach to quality.

    A registry based on CKAN, like data.gov.uk, could benefit from a quality audit. How can one take a quality approach to Linked Data?

    To start with, each record has a set of attributes and, to reach completeness, they should all be filled in. This ranges from the data license to maintainer contact information to the resource download. Many records in CKAN.net are incomplete. Automated tests could be run on the presence or absence of properties for each package. The results could be displayed on the web, with the option to view the relative quality of package collections belonging to groups, or tags. The process would help identify areas that need focus and follow-up. It would help to plan and follow progress on turning records into downloadable data packages. Quality testing could help reward groups that were being diligent in maintaining metadata.

    The values of properties will have constraints, and these can be used to test for quality – links should be reachable, email contact addresses should elicit at least one response, locations in the dataset should be near locations in the metadata, time ranges should match, and values that should be numbers actually are numbers.
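    A rough Python sketch of that kind of automated test, applied to a single package record, might look like this. The required fields, the record layout and the example record are illustrative assumptions, not the actual CKAN schema.

    import requests

    REQUIRED_FIELDS = ["title", "license", "maintainer_email", "resources"]

    def check_package(record):
        """Return a list of completeness/constraint problems for one package record."""
        problems = []
        # Completeness: every required property should be present and non-empty.
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                problems.append("missing or empty field: " + field)
        # Constraint: every resource download link should be reachable.
        for resource in record.get("resources", []):
            url = resource.get("url", "")
            try:
                if requests.head(url, allow_redirects=True, timeout=10).status_code >= 400:
                    problems.append("unreachable link: " + url)
            except requests.RequestException:
                problems.append("unreachable link: " + url)
        return problems

    # A hand-made example record with an empty licence field.
    example = {"title": "Example dataset", "license": "",
               "resources": [{"url": "http://example.org/data.csv"}]}
    for problem in check_package(example):
        print(problem)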

    Some datasets listed in the data.gov.uk catalogues have URLs that don’t dereference, i.e. are links that don’t work. It’s difficult to find out what packages these datasets are attached to, where to get the actual data or contact the maintainers.

    To see this in real data, visit the bare SPARQL endpoint at http://services.data.gov.uk/analytics/sparql and paste this query into the search box (it’s looking for everything described as a Dataset, using the scovo vocabulary for statistical data):

    PREFIX scv: <http://purl.org/NET/scovo#>

    SELECT DISTINCT ?p
    WHERE {
    ?p a scv:Dataset .
    }

    The response shows a set of URIs which, when you try to look them up to get a full description, return a “Resource not found” error. The presence of a quality test suite would catch this kind of incompleteness early in the release schedule, and help provide metrics of how fast identified issues with incompleteness and inconsistency were being fixed.
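    A sketch of that test in Python: run the query above, then try to dereference each returned URI and count the failures. It assumes the endpoint returns standard SPARQL JSON results when asked via the Accept header, and that a failed look-up shows up as an HTTP error or a “Resource not found” body – both assumptions rather than documented behaviour.

    import requests

    ENDPOINT = "http://services.data.gov.uk/analytics/sparql"
    QUERY = """
    PREFIX scv: <http://purl.org/NET/scovo#>
    SELECT DISTINCT ?p WHERE { ?p a scv:Dataset . }
    """

    resp = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]

    broken = []
    for row in bindings:
        uri = row["p"]["value"]
        try:
            r = requests.get(uri, timeout=10)
            if r.status_code >= 400 or "Resource not found" in r.text:
                broken.append(uri)
        except requests.RequestException:
            broken.append(uri)

    print(len(broken), "of", len(bindings), "dataset URIs did not dereference cleanly")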

    The presence of more information about a resource, from a link, can be agreed on as a quality rule for Linked Data – it is one of the Four Principles after all, that one should be able to follow a link and get useful information.

    With OWL schemas there is already some modelling of data objects and attributes and their relations. There are rules languages from W3C and elsewhere that could be used to automate some quality measurement – RIF and SWRL. These languages require a high level of buy-in to the standards, a rules engine, expertise.

    Data package testing can be viewed like software package testing. The rules are built up, piece by piece, growing as the code does, ideally. The methods used can be quite ad-hoc, using different frameworks and structures, as long as the results are repeatable and the coverage is thorough.

    Not everyone will have the time or patience to run quality tests on their local copy of the data before use, so we need some way to convey the results. This could be an overall score, a count of completeness errors – something like the results of a software test run:

    3 items had no tests...
    9 tests in 4 items.
    9 passed and 0 failed.
    Test passed.

    For quality improvement, one needs to see the detail of what is missing. Essentially this is a picture of a data model with missing pieces. It would look a bit like the content of a SPARQL query:

    ?dataset a scv:Dataset .
    ?dataset dc:title ?title .
    ?dataset scv:datasetOf ?package .
    etc...

    After writing this I was pointed at WIQA, a Linked Data quality specification language by the group behind dbpedia and Linked GeoData, which basically implements this with a SPARQL-like syntax. I would like to know more about in-the-wild use of WIQA and integration back into annotation tools…


    Dev8D: JISC Developer Days

    March 5th, 2010

    The Unlock development team recently attended the Dev8D: JISC Developer Days conference at University College London. The format of the event is fairly loose, with multiple sessions in parallel and the programme created dynamically as the four days progressed. Delegates were encouraged to use their feet to seek out what interested them! The idea is simple: developers, mainly (but not exclusively) from academic organisations, come together to share ideas, work together and strengthen professional and social connections.

    A series of back-to-back 15-minute ‘lightning talks’ ran throughout the conference; I delivered two – describing EDINA’s Unlock services and showing users how to get started with the Unlock Places APIs. Discussions after the talks focused on the question of open sourcing and the licensing of Unlock Places software generally – and what future open gazetteer data sources we plan to include.

    In parallel with the lightning talks, workshop sessions were held on a variety of topics such as linked data, iPhone application development, working with Arduino and the Google App Engine.

    Competitions

    Throughout Dev8D, several competitions or ‘bounties’ were held around different themes. In our competition, delegates had the chance to win a £200 Amazon voucher by entering a prototype application making use of the Unlock Places API. The most innovative and useful application would win!

    I gave a quick announcement at the start of the week to discuss the competition and how to get started using the API, and then demonstrated a mobile client for the Unlock Places gazetteer as an example of the sort of competition entry we were looking for. This application makes use of the new HTML5 web database functionality – enabling users to download and store Unlock’s feature data offline on a mobile device. Here are some of the entries:

    Marcus Ramsden from Southampton University created a plugin for EPrints, the open access repository software. Using the Unlock Text geoparser, ‘GeoPrints’ extracts locations from documents uploaded to EPrints and then provides a mechanism to browse EPrint documents using maps.

    Aidan Slingsby from City University entered some beautiful work displaying point data (in this case a gazetteer of British placenames) as tag-maps, density estimation surfaces and chi surfaces rather than the usual map pins! The data was based on GeoNames data accessed through the Unlock Places API.

    And the winner was… Duncan Davidson from Informatics Ventures, University of Edinburgh. He used the Unlock Places APIs together with Yahoo Pipes to present data on new start-ups and projects around Scotland. By converting data containing local council names into footprints, Unlock Places allowed the data to be mapped using KML and Google Maps, letting his users navigate around the data using maps – and search it using spatial constraints.

    Some other interesting items at Dev8D…

    • <sameAs>
      Hugh Glaser from the University of Southampton discussed how sameAs.org works to establish linkage between datasets by managing multiple URIs for Linked Data without an authority. Hugh demonstrated using sameAs.org to locate co-references between different data sets.
  • Mendeley
      Mendeley is a research network built around the same principle as last.fm. Jan Reichelt and Ben Dowling discussed how, by tracking, sharing and organising journal/article history, Mendeley is designed to help users discover and keep in touch with similarly minded researchers. I heard of Mendeley last year and was surprised by the large (and rapidly increasing) user base – the collective data from its users is already proving a very powerful resource.
    • Processing
      Need to do rapid visualisation of images, animations or interactions? Processing is a Java-based sketchbook/IDE which will help you to visualise your data much more quickly. Ross McFarlane from the University of Liverpool gave a quick tutorial on Processing.js, a JavaScript port using <Canvas>, illustrating the power and versatility of this library.
    • Genetic Programming
      This session centred around some basic aspects of genetic algorithms/evolutionary computing and emergent properties of evolutionary systems. Delegates focused on creating virtual ants (with Python) to solve mazes, and by visualising their creatures with Processing (above), Richard Jones enabled developers to work on something a bit different!
    • Web Security
      Ben Charlton from the University of Kent delivered an excellent walk-through of the most significant and very common threats to web applications. Working from the OWASP Top 10 project, he discussed each threat with real world examples. Great stuff – important for all developers to see.
    • Replicating 3D Printer: RepRap
      Adrian Bowyer demonstrated RepRap – short for Replicating Rapid-prototyper. It’s an open source (GPL) device, able to create robust 3D plastic components (including around half of its own components). Its novel capability of being able to self-copy, with material costs of only €350, makes it accessible to small communities in the developing world as well as individuals in the developed world. His inspiring talk was well received, and this super illustration of open information’s far-reaching implications captured everyone’s imagination.

    All in all, a great conference. A broad spread of topics, with the right mix of sit-and-listen to get-involved activities. Whilst Dev8D is a fairly chaotic event, it’s clear that it generates a wealth of great ideas, contacts and even new products and services for academia. See Dev8D’s Happy Stories page for a record of some of the outcomes. I’m now looking forward to seeing how some of the prototypes evolve and I’m definitely looking forward to Dev8D 2011.


    A very long list of census placenames

    February 9th, 2010

    Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910,096 lines long.

    A corner of a page of a census record

    Many placenames have the name of a containing county, though some don’t. The data is full of errors: mistakes in the original records, mis-heard names, maybe errors in transcription.

    This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?

    I made a start at this over the weekend, because I also wanted an excuse to play with the redis nosql data store.

    To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places; rather than search for exact locations at this stage, we look for known-to-exist-as-place names. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.
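    A much simplified sketch of that grounding pass is below. The closestMatchSearch name comes from this post, but the endpoint URL, parameters, response layout and input filename are placeholders, and a real 36-hour run would obviously need batching and error handling.

    import requests

    SEARCH_URL = "http://unlock.edina.ac.uk/ws/closestMatchSearch"  # assumed endpoint path

    def grounded(name):
        """Return True if the gazetteer knows a place with (very nearly) this name."""
        r = requests.get(SEARCH_URL, params={"name": name, "format": "json"})
        r.raise_for_status()
        return bool(r.json().get("features"))

    known, unknown = [], []
    for line in open("census_placenames_unique.txt", encoding="utf-8"):  # placeholder file
        name = line.strip()
        (known if grounded(name) else unknown).append(name)

    print("known by geonames:", len(known))
    print("unknown by geonames:", len(unknown))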

    unique placenames: 667513
    known by geonames: 34180
    unknown by geonames: 633333

    We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetmap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.

    Next, each of the unlocated placenames is compared to the grounded group of places, and if one name is very similar to another (as measured by Levenshtein distance, with a handy Python module) then a reference is stored saying that one place is the sameAs another.

    Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. It’s hard to say at this stage how many will be in some kind of error (Easton matching Aston) – 1 in 20, or hopefully many fewer.

    place:sameas:WELBOURN : place:WELBURN
    place:sameas:WELBOURY : place:WELBURY
    place:sameas:ALSHORNE : place:ASHORNE
    place:sameas:PHURLIGH : place:PURLEIGH
    place:sameas:LANGATHN : place:LLANGATHEN
    place:sameas:WIGISTON : place:WIGSTON
    place:sameas:ALSHORPE : place:ASHOPE
    place:sameas:PELSCHAM : place:ELSHAM
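    For illustration, here is a condensed sketch of that similarity pass, storing sameAs references in redis with keys like the ones above. The distance threshold and the python-Levenshtein module are my choices here, and the brute-force pairing would need something smarter for hundreds of thousands of names.

    import Levenshtein  # pip install python-Levenshtein
    import redis

    r = redis.Redis()

    def link_similar(unlocated, grounded, max_distance=1):
        """Store a place:sameas:<name> -> place:<known> reference for each close match."""
        links = 0
        for name in unlocated:
            for known in grounded:
                if Levenshtein.distance(name, known) <= max_distance:
                    r.set("place:sameas:" + name, "place:" + known)
                    links += 1
                    break
        return links

    print(link_similar(["WELBOURN", "PHURLIGH"], ["WELBURN", "PURLEIGH"], max_distance=2))  # prints 2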

    As a next stage, I plan to run the similarity test again, on the placenames derived from it in the first stage, with a higher threshold for similarity.

    This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetmap’s Nominatim geocoding search service. I should probably write to them and mention this.

    There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim) others are sub-regions or suburbs attached to other placenames, north/south/east/west prefixes.

    What next?

    Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?

    A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.

    We can also look at sound similarity algorithms like soundex and metaphone. There are concerns that this would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
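    A rough-guess sketch along those lines, using the jellyfish library for soundex and metaphone keys (the library choice is mine, and a human annotator would still confirm or reject every suggestion):

    import jellyfish

    def phonetic_candidates(name, gazetteer_names):
        """Return gazetteer names that sound like the unknown name."""
        return [g for g in gazetteer_names
                if jellyfish.soundex(g) == jellyfish.soundex(name)
                or jellyfish.metaphone(g) == jellyfish.metaphone(name)]

    # Some matches will be genuine variants, others false positives to be rejected by hand.
    print(phonetic_candidates("STORNOOWAY", ["STORNAWAY", "STRANRAER", "STONEHAVEN"]))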

    A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).

    Records with unknown placenames can be roughly located near the places of related records.

    How close is close enough for search? If the record is floating near the street, or the neighbourhood, that it belongs in, is that close enough?

    And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?


    Places you won't find in any dictionary

    January 12th, 2010

    Tobar an Dualchais is an amazing archive of Gaelic and Scots speech and song samples. Under the hood, each of their records is annotated with places – the names of the village, or island, or parish, where the speaker came from.

    We’ve been trying to Unlock their placename data, so the names can be given map coordinates, and the recordings searched by location. Also, I wanted to see how much difference it would make if the Ordnance Survey 50K gazetteer were open licensed, thus enabling us to use it for this (non-research) project.

    Out of 1628 placenames, we found 851 exact matches in the 50K gazetteer and 1031 in the geonames.org gazetteer. Just 90 placenames were in the 50K but not in geonames. There’s a group of 296 placenames that we couldn’t find in any of our gazetteer data sources. Note that this is an unusual sample, focused on remote and infrequently surveyed places in the Highlands and Islands, but I had hoped for more from the 50K coverage.

    There are quite a few fun reasons why there are so many placenames that you won’t find in any dictionary:

    • Places that are historic don’t appear in our contemporary OS sources. Many administrative areas in Scotland changed in 1974, and current OS data does not have the old names or boundaries. Geonames has some locations for historic places (e.g. approximate centroids for the old counties) though without time ranges.
    • Typographical errors in data entry. E.g. “Stornooway” and “Stornaway” – using the gazetteer web service at the content creation stage would help with this.
    • Listings for places that are too small to be in a mid-scale gazetteer. For example, the Tobar an Dualchais data includes placenames for buildings belonging to clubs and societies where Gaelic sound recordings were made. Likely enough, some small settlements have escaped the notice of surveyors for OS and contributors to geonames.
    • Some places exist socially but not administratively. For example, our MasterMap gazetteer has records for a “Clanyard Bay”, “Clanyard House”, “Clanyard Mill” but not Clanyard itself. The Gazetteer for Scotland describes Clanyard as “a locality, made up of settlements” – High, Low and Middle Clanyards.
    • Geonames has local variant spellings as alternative names, and these show up in our gazetteer search, returning the more “authoritative” name.
    • Limitations in automated search for descriptions of names. For example, some placenames look like “Terregles (DFS) see also Kirkcudbrightshire”. I’m hoping the new work on fulltext search will help to address this – but there will always need to be a human confirmation stage, and fixes to the original records.

    It’s been invaluable to have a big set of known-to-be-placenames contributed in free-text fields by people who aren’t geographers. I would like to do more of this.

    I saw a beautiful transcript of an Ordnance Survey Object Name Book on a visit to RCAHMS. Apparently many of the English and Welsh ones were destroyed in the war, but the Scottish ones survived. But that is a story for another time.