
    Talk to us about JISC 06/11

    June 23rd, 2011

    Glad to hear that Unlock has been cited in the JISC 06/11 “eContent Capital” call for proposals.

    The Unlock team would be very happy to help anyone fit a beneficial use of Unlock into their project proposal. This could feature the Unlock Places place-name and feature search; and/or the Unlock Text geoparser service which extracts place-names from text and tries to find their locations.

    One could use Unlock Text to create Linked Data links to geonames.org or Ordnance Survey Open Data. Or use Unlock Places to find the locations of postcodes; or find places within a given county or constituency…

    Please drop an email to jo.walsh@ed.ac.uk, or look up metazool on Skype or Twitter, to chat about how Unlock fits with your proposal for JISC 06/11 …


    Search and retrieve bounding boxes and shapes

    August 20th, 2010

    So we have a cool project running called Chalice, text-mining and locating historic placenames to build a historic gazetteer stretching back beyond Domesday for a few areas of England and Wales. Claire Grover from LTG had some questions about using a shape-based rather than point-based gazetteer during “geographic information retrieval”. I thought it worth posting the answers here, as Unlock Places can do a lot more in public since the addition of Ordnance Survey Open Data.

    http://unlock.edina.ac.uk/features/Edinburgh – this now by default returns information from OS Open Data sources, including Boundary-Line and Meridian2, which have bounding boxes and detailed shapes for things like counties and parishes, though note they are all contemporary.

    (The above is just an alias for
    http://unlock.edina.ac.uk/ws/nameSearch?name=Edinburgh )

    So that’s a way to get bounding boxes and shapes for places that are in geonames, by comparing with other sources. The default search results have bounding boxes attached; one must follow a link to see the detailed geometry.

    Here’s how then to filter the query for place-names to a specific bounding box:
    http://unlock.edina.ac.uk/ws/spatialNameSearch?format=json&name=Stanley&minx=-8&maxx=4&miny=53&maxy=64&operator=within
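    A minimal sketch of that call from Python, using the requests library; the field names in the JSON response depend on Unlock's output format, so this just fetches and prints the raw result rather than assuming a schema:

    import requests

    # Query Unlock Places for place-names falling within a bounding box.
    params = {
        "format": "json",
        "name": "Stanley",
        "minx": -8, "maxx": 4,
        "miny": 53, "maxy": 64,
        "operator": "within",
    }
    resp = requests.get("http://unlock.edina.ac.uk/ws/spatialNameSearch", params=params)
    resp.raise_for_status()
    print(resp.json())  # inspect the structure before pulling out bounding boxes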

    We have ‘search for names inside the shape which has this ID’ on our todo list, but don’t yet have a pressing use case – for many things bounding boxes are enough, and sometimes one even wants that bit of extra inclusion (e.g. Shropshire’s bounding box will contain a lot more than Shropshire, but as Shropshire’s boundary has changed over time, some approximation around the shape is actually helpful for historic geocoding).

    Note that all place-names for the UK will have county containment information – we added this for Digimap – one day they may start using it!

    You may also be interested to play around with http://mapit.mysociety.org/ – it has all the same OS Open Data sources and mostly the same set of queries, but in places does a little more – it doesn’t have geonames integrated, though.

    Lasma did some work on conflating different mentions of places based on point-polygon relationships (e.g. if a shape and a point have the same name, and the shape contains the point, the name refers to “the same thing”). However, this was an experiment that is not really finished. For example –
    http://unlock.edina.ac.uk/ws/uniqueNameSearch?name=Edinburgh – I see this returns a shape in preference to a point, and wonder if it always will when a shape is available. However, this is not much use when you actively want a set of duplicate names, as you do while geoparsing. It would be good to revisit this, again with concrete use cases. And of course it would be good to do this for a much wider area than the UK, with shapes extracted from OpenStreetmap. Investigating…
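    For what it’s worth, the core of that conflation heuristic is simple enough to sketch. Assuming shapely geometries and a pair of candidate records (the record structure below is made up for illustration, not the real gazetteer schema):

    from shapely.geometry import Point, shape

    # If a polygon and a point share a name, and the polygon contains the
    # point, treat them as mentions of the same place.
    def same_place(polygon_record, point_record):
        """Each record: a dict with a 'name' and a GeoJSON 'geometry'."""
        if polygon_record["name"].lower() != point_record["name"].lower():
            return False
        polygon = shape(polygon_record["geometry"])
        point = Point(point_record["geometry"]["coordinates"])
        return polygon.contains(point)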


    An appreciation of MySociety’s MapIt service

    July 27th, 2010

    Impressed by the new MySociety service for doing interesting things with Ordnance Survey OpenData – MapIt. The API is well thought out, quick and clean; the documentation fits onto one page; the backend is free software.

    I will confess to mild chagrin, because as well as having all these wonderful properties, MapIt does almost everything that Unlock Places does for Boundary-Line and Code-Point. Compare, contrast:

    A simple search for records about a place beginning with a name, returning the results in JSON:

    The detail of the shape describing that place, in GeoJSON (in both cases the ID to be looked up is taken from the JSON results of the previous request):

    MapIt does things that are still on our todo list – such as exposing ST_Touches geometry query over web-based API:
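    As an illustration, here is a rough side-by-side in Python. The MapIt paths follow its documented API (name search, then an area’s boundary as GeoJSON, then neighbouring areas via the touches query); treat the exact endpoints as an assumption, since they may have changed since this was written:

    import requests

    # Unlock Places: search by name; results carry IDs and bounding boxes.
    unlock = requests.get("http://unlock.edina.ac.uk/ws/nameSearch",
                          params={"name": "Edinburgh", "format": "json"}).json()

    # MapIt: areas matching a name, then one area's boundary and neighbours.
    areas = requests.get("http://mapit.mysociety.org/areas/Edinburgh").json()
    area_id = next(iter(areas))  # the response is keyed by area ID
    boundary = requests.get("http://mapit.mysociety.org/area/%s.geojson" % area_id).json()
    touching = requests.get("http://mapit.mysociety.org/area/%s/touches" % area_id).json()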

    Matthew Somerville, MapIt’s creator, writes that “MapIt is really just an extension of the service we have always run internally for our own purposes” – MySociety services like Fix My Street, Write To Them and the renowned They Work For You.

    It’s great to see a service that looks so much like Unlock emerge from the internal needs of an organisation with a track record of geospatially aware, simple useful web tools.

    However, I pause to think: what are we providing with Unlock Places search through OS Open Data that MapIt isn’t doing at least as well?

    Well, we have a few more data sources, so a more comprehensive gazetteer search; MapIt is directed towards building applications around government data and assumes the client will probably know the “right” names or codes. We could implement a neat “Give me the official names and shapes for this more vernacular name” wrapper, perhaps.

    We have geonames mirrored in Unlock too – only point data, but global coverage – and are working on adding OpenStreetmap (probably just for Europe) to the cross-search. But I wonder, quite hard, how much we would gain from improving and adding to the MapIt codebase instead of persevering with our own gazetteer API code.

    A future focus for Unlock Places (from the New Year on) is adding historic place-names to the gazetteer, so we can do historic place-name text mining with Unlock Text – incorporating the data coming out of the CHALICE project – as this is a common request for researchers, and not something that’s currently being done commercially.

    The Unlock Text service remains a bit more novel. This does text mining across documents (plain text, HTML or XML metadata), extracts likely placenames and uses the gazetteer search to pick the most likely locations. The text miner looks for other entities too – personal and organisational names, references to dates – but we only expose the placename part over the web API.


    Linking Placename Authorities

    April 9th, 2010


    Putting together a proposal for JISC call 02/10 based on a suggestion from Paul Ell at CDDA in Belfast. Why post it here? I think there’s value in working on these things in a more public way, and I’d like to know who else would find the work useful.

    Summary

    Generating a gazetteer of historic UK placenames, linked to documents and authority files in Linked Data form. Both working with existing placename authority files, and generating new authority files by extracting geographic names from text documents. Using the Edinburgh Geoparser to “georesolve” placenames and link them to widely-used geographic entities on the Linked Data web.

    Background

    GeoDigRef was a JISC project to extract references to people and places from several very large digitised collections, to make them easier to search. The Edinburgh Geoparser was adapted to extract place references from large collections.

    One roadblock in this and other projects has been the lack of an open historic placename gazetteer for the UK.

    Placenames in authority files, and placenames text-mined from documents, can be turned into geographic links that connect items in collections with each other and with the Linked Data web; a historic gazetteer for the UK can be built as a byproduct.

    Proposal

    Firstly, working with placename authority files from existing collections, starting with the digitised volumes of the English Place Name Survey as a basis.

    Where place names are found, they can be linked to the corresponding Linked Data entity in geonames.org, the motherlode of place name links on the Linked Data web, using the georesolver component of the Edinburgh Geoparser.
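    A rough sketch of that resolution step, using the public geonames.org search API rather than the Geoparser’s own georesolver component (the username is a placeholder – the service needs a registered account – and the country filter is an assumption):

    import requests

    # Look up a placename against geonames.org and build the corresponding
    # Linked Data URI (http://sws.geonames.org/<id>/).
    def geonames_uri(placename, username="demo", country="GB"):
        resp = requests.get("http://api.geonames.org/searchJSON",
                            params={"q": placename, "country": country,
                                    "maxRows": 1, "username": username})
        results = resp.json().get("geonames", [])
        if not results:
            return None
        return "http://sws.geonames.org/%s/" % results[0]["geonameId"]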

    Secondly, using the geoparser to extract placename references from documents and using those placenames to seed an authority file, which can then be resolved in the same way.

    An open source web-based tool will help users link places to one another, remove false positives found by the geoparser, and publish the results as RDF using an open data license.

    Historic names will be imported back into the Unlock place search service.

    Context

    This will leave behind a toolset for others to use, as well as creating new reference data.

    Building on work done at the Open Knowledge Foundation to convert MARC/MADS bibliographic resources to RDF and add geographic links.

    Making re-use of existing digitised resources from CDDA to help make them discoverable, and providing a path in for researchers.

    Geonames.org has some historic coverage, but it is hit and miss (e.g. “London” has “Londinium” as an alternate name, but at the contemporary location). The new OS OpenData sources are all contemporary.

    Once a placename is found in a text, it may not be found in a gazetteer. The more places correctly located, the higher the likelihood that other places mentioned in a document will also be correctly located. More historic coverage means better georeferencing for more archival collections.


    Work in progress with OS Open Data

    April 2nd, 2010

    The April 1st release of many Ordnance Survey datasets as open data is great news for us at Unlock. As hoped for, Boundary-Line (administrative boundaries), the 50K gazetteer of placenames and a modified version of Code-Point (postal locations) are now open data.

    Boundary Line of Edinburgh shown on Google earth. Contains Ordnance Survey data © Crown copyright and database right 2010

    We’ll be putting these datasets into the open access part of Unlock Places, our place search service, and opening up Unlock Geocodes based on Code-Point Open. However, this is going to take a week or two, because we’re also adding some new features to Unlock’s search and results.

    Currently, registered academic users are able to:

    • Grab shapes and bounding boxes in KML or GeoJSON – no need for GIS software, re-use in web applications
    • Search by bounding box and feature type as well as place name
    • See properties of shapes (area, perimeter, central point) useful for statistics visualisation

    And soon we’ll be publishing these new features, currently in testing:

    • Relationships between places – cities, counties and regions containing found places – in the default results
    • Re-project points and shapes into different coordinate reference systems (see the sketch below)

    These have been added so we can finally plug the Unlock Places search into EDINA’s Digimap service.
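    To give a feel for what the re-projection feature does, here is the equivalent transform done client-side with pyproj – converting a British National Grid point (EPSG:27700, as used by Ordnance Survey data) to WGS84 longitude/latitude. This is just an illustration of the idea, not the Unlock API itself:

    from pyproj import Transformer

    # British National Grid (EPSG:27700) -> WGS84 lon/lat (EPSG:4326).
    transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
    easting, northing = 325000, 674000  # roughly central Edinburgh
    lon, lat = transformer.transform(easting, northing)
    print(lon, lat)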

    Having Boundary-Line shapes in our open data gazetteer will mean we can return bounding boxes or polygons through Unlock Text, which extracts placenames from documents and metadata. This will help to open up new research directions for our work with the Language Technology Group at Informatics in Edinburgh.

    There are some organisations we’d love to collaborate with (almost next door, the Map Library at the National Library of Scotland and the Royal Commission on Ancient and Historical Monuments of Scotland) but have been unable to, because Unlock and its predecessor GeoCrossWalk were limited by license to academic use only. I look forward to seeing all the things the OS Open Data release has now made possible.

    I’m also excited to see what re-use we and others could make of the Linked Data published by Ordnance Survey Research, and what their approach will be to connecting shapes to their administrative model.

    MasterMap, the highest-detail OS dataset, wasn’t included in the open release. Academic subscribers to the Digimap Ordnance Survey Collection get access to places extracted from MasterMap, and improvements to other datasets created using MasterMap, with an Unlock Places API key.


    A very long list of census placenames

    February 9th, 2010

    Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910096 lines long.

    A corner of a page of a census record

    Many placenames have the name of a containing county, though some don’t. The data is full of errors: mistakes in the original records, mis-heard names, maybe errors in transcription.

    This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?

    I made a start at this over the weekend, because I also wanted an excuse to play with the redis nosql data store.

    To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places; rather than searching for exact locations at this stage, we look for names known to exist as places. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.
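    The grounding pass was roughly this shape. The /ws/closestMatchSearch path is inferred from the other Unlock endpoints mentioned above, the input filename is hypothetical, and the “did it match” test is a placeholder that depends on the actual response format:

    import requests

    known, unknown = [], []
    with open("census_placenames_unique.txt") as f:  # hypothetical input file
        for line in f:
            name = line.strip()
            if not name:
                continue
            resp = requests.get("http://unlock.edina.ac.uk/ws/closestMatchSearch",
                                params={"name": name, "format": "json"})
            matched = bool(resp.ok and resp.json())  # placeholder "any results?" test
            (known if matched else unknown).append(name)
    print(len(known), "known;", len(unknown), "unknown")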

    unique placenames: 667513
    known by geonames: 34180
    unknown by geonames: 633333

    We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetmap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.

    Next, each of the unlocated placenames is compared to the grounded group of places, and if one name is very similar to another (as measured by Levenshtein distance, with a handy Python module) then a reference is stored recording that one place is the sameAs another.

    Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. It’s hard to say at this stage how many will be in some kind of error (Easton matching Aston) – 1 in 20, or hopefully many fewer.

    place:sameas:WELBOURN : place:WELBURN
    place:sameas:WELBOURY : place:WELBURY
    place:sameas:ALSHORNE : place:ASHORNE
    place:sameas:PHURLIGH : place:PURLEIGH
    place:sameas:LANGATHN : place:LLANGATHEN
    place:sameas:WIGISTON : place:WIGSTON
    place:sameas:ALSHORPE : place:ASHOPE
    place:sameas:PELSCHAM : place:ELSHAM
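    Keys like those could be produced by something along these lines, using the python-Levenshtein module and redis-py. The 0.9 similarity threshold is an illustrative guess, not the value actually used:

    import Levenshtein
    import redis

    r = redis.Redis()
    THRESHOLD = 0.9  # assumed cut-off for "very similar"

    # For each unlocated name, find the most similar grounded name and, if it
    # is similar enough, store a sameAs reference using the key convention above.
    def link_similar(unlocated_names, grounded_names):
        for name in unlocated_names:
            best = max(grounded_names, key=lambda g: Levenshtein.ratio(name, g))
            if Levenshtein.ratio(name, best) >= THRESHOLD:
                r.set("place:sameas:%s" % name, "place:%s" % best)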

    As a next stage, I plan to run the similarity test again, on the placenames derived in the first stage, with a higher threshold for similarity.

    This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetmap’s Nominatim geocoding search service. I should probably write to them and mention this.

    There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim), others are sub-regions or suburbs attached to other placenames, or north/south/east/west prefixes.

    What next?

    Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?

    A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.

    We can also look at sound similarity algorithms like soundex and metaphone. There are concerns that this would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
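    A sketch of what those rough-guess suggestions might look like, using the jellyfish library (not something the original work necessarily used – any soundex/metaphone implementation would do):

    import jellyfish

    # Offer phonetically similar gazetteer names for a human annotator to
    # confirm or reject.
    def phonetic_suggestions(name, gazetteer_names):
        s, m = jellyfish.soundex(name), jellyfish.metaphone(name)
        return [g for g in gazetteer_names
                if jellyfish.soundex(g) == s or jellyfish.metaphone(g) == m]

    # e.g. phonetic_suggestions("PHURLIGH", ["PURLEIGH", "BURLEIGH", "PAISLEY"])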

    A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).

    Records with unknown placenames can be roughly located near the places of related records.

    How close is close enough for search? If the record is floating near the street, or the neighbourhood, that it belongs in, is that close enough?

    And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?