
    Using source identifiers to link data

    November 29th, 2010

    In the Chalice project we’ve used Unlock Places to make links across the Linked Data web, using the source identifier which appears in the results of each place search. As this might be useful to others, it’s worth walking through an example.

    This search for “Bosley” shows us results in the UK from geonames and from the Ordnance Survey 50K gazetteer: http://unlock.edina.ac.uk/ws/nameSearch?name=Bosley&country=uk

    Here’s an extract of one of the results, the listing for Bosley in the Ordnance Survey 1:50K gazetteer:

    <identifier>11083412</identifier>
    <sourceIdentifier>28360</sourceIdentifier>
    <name>Bosley</name>
    <country>United Kingdom</country>
    <custodian>Ordnance Survey</custodian>
    <gazetteer>OS Open 1:50 000 Scale Gazetteer</gazetteer>

    The sourceIdentifier shown here is the identifier published by each of the original data sources that Unlock Places is using to cross-search.

    Ordnance Survey Research re-uses these identifiers to create its Linked Data namespace. For any place in the 50K gazetteer, we can reconstruct the link that refers to that place by appending the source identifier to this URL, which is the namespace for the 50K gazetteer: http://data.ordnancesurvey.co.uk/id/50kGazetteer/

    So our reference to Bosley can be made by adding the source identifier to the namespace:

    http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360

    The same goes for source identifiers for places found in the geonames.org place-name gazetteer.

    <sourceIdentifier>2655141</sourceIdentifier>
    <name>Bosley</name>
    <gazetteer>GeoNames</gazetteer>

    Geonames uses http://sws.geonames.org/ as a namespace for its Linked Data links for places. So we can reconstruct the link for Bosley using the source identifier like this:

    http://sws.geonames.org/2655141/

    Note that the link needs the forward slash on the end to work correctly. If one looks at either of these links with a web browser, one is redirected to a human-readable page describing that place. To see the machine-readable, RDF version of the link’s contents, look at it with a command-line program such as curl, asking to “Accept” the RDF version:

    curl -L http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360 -H "Accept: application/rdf+xml"
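
    If you’re scripting this, the whole pattern fits in a few lines of Python. Here’s a minimal sketch (assuming the requests library; the namespaces and identifiers are the ones shown above):

    import requests

    # Linked Data namespaces for the two gazetteers discussed above
    NAMESPACES = {
        "OS Open 1:50 000 Scale Gazetteer": "http://data.ordnancesurvey.co.uk/id/50kGazetteer/",
        "GeoNames": "http://sws.geonames.org/",
    }

    def linked_data_uri(gazetteer, source_identifier):
        """Append the source identifier to the gazetteer's Linked Data namespace."""
        uri = NAMESPACES[gazetteer] + str(source_identifier)
        if gazetteer == "GeoNames":
            uri += "/"  # geonames links need the trailing slash to work correctly
        return uri

    def fetch_rdf(uri):
        """Ask for the machine-readable RDF rather than the human-readable page."""
        response = requests.get(uri, headers={"Accept": "application/rdf+xml"})
        response.raise_for_status()
        return response.text

    print(fetch_rdf(linked_data_uri("OS Open 1:50 000 Scale Gazetteer", 28360)))

    The geonames link is the same call with the other namespace, e.g. linked_data_uri("GeoNames", 2655141).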

    I hope this is useful to others. We could add the links directly into the default search results, but many users may not be that interested in seeing RDF links in place-name search results. Thoughts on how we could offer this as a more useful function would be much appreciated.


    Connecting archives with linked geodata – Part II

    October 22nd, 2010

    This is part two of a blog post that started with a presentation about the Chalice project and our aim to create a 1000-year place-name gazetteer, available as linked data, text-mined from volumes of the English Place Name Survey.

    Something else I’ve been organising is a web service called Unlock; it offers a gazetteer search service that searches with, and returns, shapes rather than just points for place-names. It has its origins in a 2001 project called GeoCrossWalk, which extracted shapes from MasterMap and other Ordnance Survey data sources and made them available in the UK under a research-only license to subscribers to EDINA’s Digimap service.

    Now that so much open geodata is out there, Unlock contains an open data place search service, indexing and interconnecting the different sources of shapes that match up to names. It has geonames and the OS Open Data sources in it; we’ll be adding search of Natural Earth data in short order, and are looking at ways to enhance what others (Nominatim, LinkedGeoData) are already doing with search and re-use of OpenStreetmap data.

    The gazetteer search service sits alongside a placename text mining service. However, the text mining service is tuned to contemporary text (American news sources); a lot of that has to do with data availability and the sharing of models and sets of training data. The more interesting use cases are in archive mining, of semi-unusual, semi-structured sets of documents and records (parliamentary proceedings, historical population reports, parish and council records). Anything that is recorded will yield data, *is* data, back to the earliest written records we have.


    Place-names can provide a kind of universal key to interpreting the written record. Social organisation may change completely, but the land remembers, and place-names remain the same. Through the prism of place-names one can glimpse pre-history; not just what remains of those people wealthy enough to create *stuff* that lasted, but of everybody who otherwise vanished without trace.

    The other reason I’m here at FOSS4G is to ask for help. We (the authors of the text mining tools at the Language Technology Group, colleagues at EDINA, smart funders at JISC) want to put together a proper open source distribution of the core components of our work, for others to customise, extend, and work with us on.

    We could use advice – the Software Sustainability Institute is one place we are turning for guidance on managing an open source release and, hopefully, a community. OSS Watch supported us in structuring an open source business case.

    Transition to a world that is open by default turns out to be more difficult than one would think. It’s hard to get many minds to look in the same direction at the same time. Legacy problems and kludges – technical, social, or even emotional – arise to mess things up when we try to act in the clear.

    We could use practical advice on managing an open source release of our work to make it as self-sustaining as possible. In the short term: how best to structure a repository for collaboration, for branching and merging; where we should most usefully focus efforts at documentation; how to automate the process of testing to free up effort where it can be more creative; how to find the benefits in moving the process of working from a closed to an open world.

    The Chalice project has a sourceforge repository where we’ve been putting the code the EDINA team has been working on; this includes an evolution of Unlock’s web service API, and user interface / annotation code from Addressing History. We’re now working on the best way to synchronise work-in-progress with currently published, GPL-licensed components from LTG, more pieces of the pipeline making up the “Edinburgh geoparser” and other things…


    OpenStreetmap and Linked Geodata

    October 14th, 2010

    I’ve been travelling overmuch for the last six weeks, but met lots of lovely people. Most recently, during a trip this week to discuss the Open Knowledge Foundation‘s part in the LOD2 consortium project, I had a long chat with Jens and Claus, the developers and academics behind Linked Geo Data, the Linked Data version of the OpenStreetmap data.

    [Image: linked geodata browser]

    The most interesting bit for Unlock is the RESTful interface for searching the data: by point, radius and bounding box, by feature class, and by contents of labels assembled from tags. So it looks like OpenSearch Geo, much as Unlock’s place search API does.

    Claus has made a mapping between tags and clusters of tags in OpenStreetmap and a simple linkedgeodata.org ontology. Here’s the mapping file – warning, it is quite large – OSM->linkedgeodata mapping rules. I pointed him at Jochen Topf’s new work on OSM tag analysis and clustering, Taginfo.

    As well as the REST interface, there is a basic GeoSPARQL endpoint using Virtuoso as a Linked Data store – we ran containment queries for polygons, returning polygons, with reasonable performance. There is some fracturing in the GeoSPARQL world, both in proposed standards and in actual implementations.
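
    To give a flavour of the kind of spatial query involved – this is an illustrative sketch rather than the exact queries we ran, and the endpoint URL and the geo:geometry predicate are assumptions – here’s a point-and-radius filter using Virtuoso’s bif:st_intersects built-in (the third argument is a distance, in kilometres on Virtuoso), via the SPARQLWrapper Python library:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed endpoint; substitute the real Virtuoso SPARQL endpoint for the store
    ENDPOINT = "http://linkedgeodata.org/sparql"

    QUERY = """
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?thing ?label WHERE {
      ?thing geo:geometry ?geom ;
             rdfs:label ?label .
      # Virtuoso spatial built-in: features within ~5 km of an example point (lon, lat)
      FILTER (bif:st_intersects (?geom, bif:st_point (-2.125, 53.187), 5))
    }
    LIMIT 20
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["thing"]["value"], row["label"]["value"])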

    So we want to be able to return LinkedGeodata.org URLs in the results of our search. Right now Unlock’s place search returns original source identifiers (from geonames, etc) as well as our local identifiers, for place-names and shapes. In fact Unlock could help with mapping LinkedGeodata.org URLs across to geonames URLs, which are quite widely used and are an entry point into the bigger Linked Data web.

    Another very interesting tool for making links between things on the Linked Data web is SILK, by Chris Bizer, Anja Jentsch and their research group at the Freie Universitat Berlin. The latest (or still testing?) release of SILK has some spatial inference capacity as well as structural inference. So we could try it out on, for example, the Chalice data just to see what kind of links can be made between URLs for linkedgeodata things and URLs for historic place-names.

    We’ve been setting up an instance of OpenStreetmap for Unlock and other purposes at EDINA recently. Our plan with this is to start working from Nominatim, which has a point-based gazetteer for place-names down to street address level, and attempt to extract and/or generalise shapes as well as points corresponding to the names. We’re doing this to provide more/richer data search, rather than republishing original datasets in some more/differently interpretable form. So there’s lots of common ground and I hope to find ways to work together in future to make sure we complement and don’t duplicate.


    What else we’ve been up to lately

    July 22nd, 2010

    The Unlock blog has been quiet for a couple of months; since we added Ordnance Survey Open Data to the gazetteer search the team members have mostly been working on other things.

    Joe Vernon, our lead developer, has been working on the backend software for EDINA’s Addressing History project. This is a collaboration with the National Library of Scotland to create digitised and geocoded versions of historic post office directories. The sneak preview of the API is looking promising – though I agree with the commenter who suggests it should all be Linked Data!

    Lasma Sietinsone, our database engineer, has been working on new data backends for Geology Roam, the new service within Digimap. She’s now finally free to start work on our OpenStreetmap mirror and adding search of OpenStreetmap features to Unlock’s open data gazetteer.

    I’ve been putting together a new project which has just started – CHALICE, short for Connecting Historical Authorities with Links, Contexts and Entities. This is a collaboration with several partners – Language Technology Group, who do the text mining magic behind the Unlock Text service; the Centre for Data Digitisation and Analysis in Belfast; and the Centre for e-Research at KCL. The CHALICE project arose from discussions at the wrap-up workshop on “Embedding GeoCrossWalk” (as Unlock was once known). It will involve text mining to create a historic gazetteer for parts of the UK in Linked Data form.

    I also worked with Yin Chen on a survey of EDINA services with an eye to where use of Linked Data could be interesting and valuable; then took a long holiday.

    So we are overdue for another burst of effort on the Unlock services, and there should be lots more to write about here on the blog over the coming weeks and months.


    Notes from Linking Geodata seminar at CeRch

    July 20th, 2010

    Note, this blog entry was originally published in May 2010.

    While on a bit of a road trip, I had the chance to give a short seminar at the Centre for e-Research at Kings College London. It was informal and we weren’t expecting much of a showing, so there are no slides; here is a quick summary.

    I was introduced by Dr Stuart Dunn, and talked about project ideas we had just been discussing – the attempt to mine the English Place Name Survey for its structure, now called CHALICE; mining archaeological site records and artefact descriptions and attaching them to entities in OpenStreetmap using LinkedGeodata.org; and mining key reference terms from documents in archives, attempting to link documents to reference data.

    Linked Geodata seemed like a good place to start: pick out a sample entry and walk through the triples – at this point there was a bit of jumping about and graph-drawing on the whiteboard.

    There’s a list of mappings between items in Linked GeoData and in dbpedia.org, and likely thus through to geonames.org and other rich sources of Linked Data. Cf. Linked Geodata Datasets. Via sameas.org, geographic links can be traversed to arrive at related media objects, resources and events.
    geonames.org, with its 8m+ points, seems to be widely used in the academic geographic information retrieval community, due to its global coverage and open license.

    The text mining process used in the Edinburgh geoparser and elsewhere is two-phase: the first phase is extraction, looking purely at the text for entities which seem likely to be placenames; the second phase is looking those names up in a gazetteer, and using relations between them to guess which of the suggested locations is the most likely.
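
    As a toy illustration of that second, “georesolution” phase (this is not the Edinburgh geoparser’s actual code), the sketch below picks, for each extracted name, the candidate location that sits closest to the candidates for the other names, on the assumption that places mentioned together tend to cluster geographically:

    import math

    # Candidate gazetteer entries per extracted place-name: (lat, lon) pairs.
    # In practice these would come from a gazetteer lookup such as Unlock Places.
    candidates = {
        "Bosley":       [(53.19, -2.12), (40.5, -90.1)],   # UK village vs. a namesake elsewhere
        "Macclesfield": [(53.26, -2.13)],
        "Congleton":    [(53.16, -2.21)],
    }

    def distance(a, b):
        """Rough great-circle distance in km (haversine)."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))

    def resolve(candidates):
        """For each name, keep the candidate closest on average to the other names' candidates."""
        resolved = {}
        for name, options in candidates.items():
            others = [pt for n, opts in candidates.items() if n != name for pt in opts]
            resolved[name] = min(options,
                                 key=lambda pt: sum(distance(pt, o) for o in others) / len(others))
        return resolved

    print(resolve(candidates))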

    Point data, cartographic in origin. Polygon geoparsing.
    Machine learning approaches to both phases.

    We looked at UK-postcodes.com and the great work @pezholio has done on the RDF representations of postcodes there, with links across to some of the statistical area namespaces from data.gov.uk – along with the work that Ordnance Survey Research have in hand, there’s lots of new Linked Open Geodata in the UK.

    Historic names and shapes, and temporal linking, are areas where more practical and open research has yet to be done.


    Linking Placename Authorities

    April 9th, 2010


    I’m putting together a proposal for JISC call 02/10 based on a suggestion from Paul Ell at CDDA in Belfast. Why post it here? I think there’s value in working on these things in a more public way, and I’d like to know who else would find the work useful.

    Summary

    Generating a gazetteer of historic UK placenames, linked to documents and authority files in Linked Data form. Both working with existing placename authority files, and generating new authority files by extracting geographic names from text documents. Using the Edinburgh Geoparser to “georesolve” placenames and link them to widely-used geographic entities on the Linked Data web.

    Background

    GeoDigRef was a JISC project to extract references to people and places from several very large digitised collections, to make them easier to search. The Edinburgh Geoparser was adapted to extract place references from large collections.

    One roadblock in this and other projects has been the lack of an open historic placename gazetteer for the UK.

    Placenames in authority files, and placenames text-mined from documents, can be turned into geographic links that connect items in collections with each other and with the Linked Data web; a historic gazetteer for the UK can be built as a byproduct.

    Proposal

    Firstly, we will work with placename authority files from existing collections, starting with the digitised volumes of the English Place Name Survey as a basis.

    Where place names are found, they can be linked to the corresponding Linked Data entity in geonames.org, the motherlode of place name links on the Linked Data web, using the georesolver component of the Edinburgh Geoparser.

    Secondly, we will use the geoparser to extract placename references from documents and use those placenames to seed an authority file, which can then be resolved in the same way.

    An open source web-based tool will help users link places to one another, remove false positives found by the geoparser, and publish the results as RDF using an open data license.

    Historic names will be imported back into the Unlock place search service.

    Context

    This will leave behind a toolset for others to use, as well as creating new reference data.

    Building on work done at the Open Knowledge Foundation to convert MARC/MADS bibliographic resources to RDF and add geographic links.

    Making re-use of existing digitised resources from CDDA to help make them discoverable, providing a path in for researchers.

    Geonames.org has some historic coverage, but it is hit and miss (e.g. “London” has “Londinium” as an alternate name, but at the contemporary location). The new OS OpenData sources are all contemporary.

    A placename found in a text may not be found in a gazetteer. The more places correctly located, the higher the likelihood that other places mentioned in a document will also be correctly located. More historic coverage means better georeferencing for more archival collections.


    Notes on Linked Data and Geodata Quality

    March 15th, 2010

    This is a long post talking about geospatial data quality background before moving on to Linked Data about halfway through. I should probably try to break this down into smaller posts – “if I had more time, I would write less”.

    Through EDINA‘s involvement with the ESDIN project between mapping and cadastral agencies (national mapping and cadastral agencies, or NMCAs) across Europe, I’ve picked up a bit about data quality theory (at least as it applies to geography). One of ESDIN’s goals is a common quality model for the network of cooperating NMCAs.

    I’ve also been admiring Muki Haklay’s work on assessing the data quality of collaborative OpenStreetmap data using comparable national mapping agency data. His recent assessment of OSM and Google MapMaker’s Haiti streetmaps showed the benefit of analytical data quality work, helping users assess how well what they have matches the world, and assisting with conflation to join different spatial databases together.

    Today I was pointed at Martijn Van Exel’s presentation at WhereCamp EU on “map quality”, ending with a consideration of how to measure quality in OpenStreetmap. Are map and underlying data quite different when we think about quality?

    The ISO specs for data quality have their origins in industrial and military quality assurance – “acceptable lot quality” for samples from a production line. One measurement, “circular error probable“, comes from ballistics design – the circle of error was once a literal circle round successive shots from an automatic weapon, indicating how wide a distance between shots, thus inaccuracy in the weapon, was tolerable.

    The ISO 19138 quality models apply to highly detailed data created by national mapping agencies. There’s a need for reproducible quality assessment of other kinds of data, less detailed and less complete, from both commercial and open sources.

    The ISO model presents measures of “completeness” and “consistency”. For completeness, an object or an attribute of an object is either present, or not present.

    Consistency is a bit more complicated than that. In the ISO model there are error elements, and error measures. The elements are different kinds of error – logical, temporal, positional and thematic. The measures describe how the errors should be reported – as a total count, as a relative rate for a given lot, as a “circular error probable”.

    Geographic data quality in this formal sense can be measured, either by a full inspection of a data set or in samples from it, in several ways:

    • Comparison to another data set, ideally of known and high quality
    • Comparing the contents of the dataset, using rules to describe what is expected.
    • Comparing samples of the dataset to the world, e.g. by intensive surveying.
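
    As a crude sketch of the first approach – comparing a dataset against a reference of known quality – completeness can be reported as counts of omissions and commissions. This is purely illustrative, matching features by name, and not any NMCA’s actual procedure:

    # Reference data of known (high) quality, and the dataset under test.
    reference = {"Bosley", "Macclesfield", "Congleton", "Wildboarclough"}
    under_test = {"Bosley", "Macclesfield", "Congelton"}   # note the misspelling

    omissions   = reference - under_test    # in the reference, missing from the data
    commissions = under_test - reference    # in the data, absent from the reference

    print(f"completeness: {len(reference & under_test)}/{len(reference)} features matched")
    print(f"omission errors: {sorted(omissions)}")
    print(f"commission errors: {sorted(commissions)}")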

    The ISO specs feature a data production process view of quality measurement. NMCAs apply rules and take measurements before publishing data or submitting it to cross-border efforts with neighbouring EU countries, and again later, after correcting the data to make sure roads join up. Practitioners definitely think in terms of spatial information as networks or graphs, not in terms of maps.

    Collaborative Quality Mapping

    Muki Haklay’s group used different comparison techniques – in one instance comparing variable-quality data to known high-quality data, in another comparing the relative completeness of two variable-quality data sources.

    Not so much thought has gone into the data user’s needs from quality information, as opposed to the data maintainer’s clearer needs. Relatively few specialised users will benefit from knowing the rate of consistency errors vs topological errors – for most people this level of detail won’t provide the confidence needed to reuse the information. The fundamental question is “how good is good enough?” and there is a wide spectrum of answers depending on the goals of each re-user of data.

    I also see several use cases for quality information to flag up data which is interesting for research or search purposes, but not appropriate to use for navigation or surveying purposes, where errors can be costly.

    An example: the “alpha shapes” that were produced by Flickr based on the distribution of geo-tagged images attached to a placename in a gazetteer.

    Another example: polygon data produced by bleeding-edge auto-generalisation techniques that may have good results in some areas but bizarre errors in others.

    Somewhat obviously, data quality information would be very useful to a data quality improvement drive. GeoFabrik made the OpenStreetmap Inspector tool, highlighting areas where nodes are disconnected or names and feature types for shapes are missing.

    Quality testing

    What about quality testing? When I worked as a Perl programmer I enjoyed the test coverage and documentation coverage packages: a visual interface to show how much progress you’ve made on clearly documenting your code, and to show how many decisions that should be tested for integrity remain untested.

    Software packages come with a set of tests – ideally these tests will have helped with the development process, as well as providing the user with examples of correct and efficient use of the code, and aiding in automatic installation of packages.

    Donald Knuth promoted the idea of “literate programming“, where code fully explains what it is doing. This concept can be extended to “literate testing” of how well software is doing what is expected of it.

    At the Digimap 10th Birthday event, Glen Hart from Ordnance Survey Research talked about increasing data usability for Linked Data efforts. I want to link to this the idea of “literate data“, and think about a data-driven approach to quality.

    A registry based on CKAN, like data.gov.uk, could benefit from a quality audit. How can one take a quality approach to Linked Data?

    To start with, each record has a set of attributes, and to reach completeness they should all be filled in. This ranges from the data license to maintainer contact information to the resource download. Many records in CKAN.net are incomplete. Automated tests could be run on the presence or absence of properties for each package. The results could be displayed on the web, with an option to view the relative quality of the package collections belonging to groups or tags. The process would help identify areas that need focus and follow-up. It would help to plan and follow progress on turning records into downloadable data packages. Quality testing could help reward groups that are diligent in maintaining metadata.

    The values of properties will have constraints, and these can be used to test for quality – links should be reachable, email contact addresses should get at least one response, locations in the dataset should be near locations in the metadata, time ranges should match, and values that should be numbers actually are numbers.
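
    Here’s a minimal sketch of what such automated tests might look like in Python, assuming the package metadata has already been fetched from the CKAN API as a dictionary; the field names are illustrative rather than CKAN’s exact schema:

    import re
    import requests

    REQUIRED_FIELDS = ["title", "license", "maintainer_email", "resources"]

    def completeness_errors(package):
        """Presence-or-absence tests: which required properties are missing or empty?"""
        return [f for f in REQUIRED_FIELDS if not package.get(f)]

    def constraint_errors(package):
        """Value-constraint tests: links reachable, contact address plausible."""
        errors = []
        email = package.get("maintainer_email", "")
        if email and not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            errors.append(f"implausible contact address: {email}")
        for resource in package.get("resources", []):
            url = resource.get("url", "")
            try:
                if requests.head(url, allow_redirects=True, timeout=10).status_code >= 400:
                    errors.append(f"unreachable link: {url}")
            except requests.RequestException:
                errors.append(f"unreachable link: {url}")
        return errors

    package = {"title": "Example dataset", "license": "", "maintainer_email": "nobody@example",
               "resources": [{"url": "http://example.org/data.csv"}]}
    print(completeness_errors(package), constraint_errors(package))

    Results from a run like this could then be rolled up per group or tag, as suggested above.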

    Some datasets listed in the data.gov.uk catalogues have URLs that don’t dereference, i.e. are links that don’t work. It’s difficult to find out what packages these datasets are attached to, where to get the actual data or contact the maintainers.

    To see this in real data, visit the bare SPARQL endpoint at http://services.data.gov.uk/analytics/sparql and paste this query into the search box (it’s looking for everything described as a Dataset, using the scovo vocabulary for statistical data):

    PREFIX scv: <http://purl.org/NET/scovo#>

    SELECT DISTINCT ?p
    WHERE {
    ?p a scv:Dataset .
    }

    The response shows a set of URIs which, when you try to look them up to get a full description, return a “Resource not found” error. The presence of a quality test suite would catch this kind of incompleteness early in the release schedule, help provide metrics of how fast identified issues with incompleteness and inconsistency were being fixed.
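
    A quality test of this kind can be scripted directly against the endpoint. The sketch below (Python with the requests library, assuming the endpoint speaks the standard SPARQL protocol and still exists at that address) runs the query above and tries to dereference each URI it returns:

    import requests

    ENDPOINT = "http://services.data.gov.uk/analytics/sparql"
    QUERY = """
    PREFIX scv: <http://purl.org/NET/scovo#>
    SELECT DISTINCT ?p WHERE { ?p a scv:Dataset . }
    """

    response = requests.get(ENDPOINT, params={"query": QUERY},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()

    broken = []
    for binding in response.json()["results"]["bindings"]:
        uri = binding["p"]["value"]
        # Dereference each dataset URI and ask for RDF; a 404 here is a completeness failure.
        check = requests.get(uri, headers={"Accept": "application/rdf+xml"}, allow_redirects=True)
        if check.status_code >= 400:
            broken.append(uri)

    print(f"{len(broken)} dataset URIs did not dereference")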

    The presence of more information about a resource, from a link, can be agreed on as a quality rule for Linked Data – it is one of the Four Principles after all, that one should be able to follow a link and get useful information.

    With OWL schemas there is already some modelling of data objects and attributes and their relations. There are rules languages from W3C and elsewhere that could be used to automate some quality measurement – RIF and SWRL. These languages require a high level of buy-in to the standards, a rules engine, expertise.

    Data package testing can be viewed like software package testing. The rules are built up piece by piece, ideally growing as the code does. The methods used can be quite ad-hoc, using different frameworks and structures, as long as the results are repeatable and the coverage is thorough.

    Not everyone will have the time or patience to run quality tests on their local copy of the data before use, so we need some way to convey the results. This could be an overall score, a count of completeness errors – something like the results of a software test run:

    3 items had no tests...
    9 tests in 4 items.
    9 passed and 0 failed.
    Test passed.

    For quality improvement, one needs to see the detail of what is missing. Essentially this is a picture of a data model with missing pieces. It would look a bit like the content of a SPARQL query:

    ?dataset a scv:Dataset ;
        dc:title ?title ;
        scv:datasetOf ?package ;
        etc...

    After writing this I was pointed at WIQA, a Linked Data quality specification language by the group behind dbpedia and Linked GeoData, which basically implements this with a SPARQL-like syntax. I would like to know more about in-the-wild use of WIQA and integration back into annotation tools…


    Dev8D: JISC Developer Days

    March 5th, 2010

    The Unlock development team recently attended the Dev8D: JISC Developer Days conference at University College London. The format of the event is fairly loose, with multiple sessions in parallel and the programme created dynamically as the 4 days progressed. Delegates are encouraged to use their feet to seek out what interests them! The idea is simple: developers, mainly (but not exclusively) from academic organisations come together to share ideas, work together and strengthen professional and social connections.

    A series of back-to-back 15 minute ‘lightning talks’ ran throughout the conference; I delivered two, describing EDINA’s Unlock services and showing users how to get started with the Unlock Places APIs. Discussions after the talks focused on the question of open sourcing and the licensing of Unlock Places software generally – and what future open gazetteer data sources we plan to include.

    In parallel with the lightning talks, workshop sessions were held on a variety of topics such as linked data, iPhone application development, working with Arduino and the Google app engine.

    Competitions
    Throughout Dev8D, several competitions or ‘bounties’ were held around different themes. In our competition, delegates had the chance to win a £200 Amazon voucher by entering a prototype application making use of the Unlock Places API. The most innovative and useful application wins!

    I gave a quick announcement at the start of the week to discuss the competition and how to get started using the API, and then demonstrated a mobile client for the Unlock Places gazetteer as an example of the sort of competition entry we were looking for. This application makes use of the new HTML5 web database functionality – enabling users to download and store Unlock’s feature data offline on a mobile device. Here are some of the entries:

    Marcus Ramsden from Southampton University created a plugin for EPrints, the open access repository software. Using the Unlock Text geoparser, ‘GeoPrints’ extracts locations from documents uploaded to EPrints, then provides a mechanism to browse EPrint documents using maps.

    Aidan Slingsby from City University entered some beautiful work displaying point data (in this case a gazetteer of British placenames) as tag-maps, density estimation surfaces and chi surfaces rather than the usual map-pins! The data was based on GeoNames data accessed through the Unlock Places API.

    And the winner was… Duncan Davidson from Informatics Ventures, University of Edinburgh. He used the Unlock Places APIs together with Yahoo Pipes to present data on new start-ups and projects around Scotland. By converting local council names into footprints, Unlock Places allowed the data to be mapped using KML and Google Maps, letting his users navigate around the data using maps – and search the data using spatial constraints.

    Some other interesting items at Dev8D…

    • <sameAs>
      Hugh Glaser from the University of Southampton discussed how sameAs.org works to establish linkage between datasets by managing multiple URIs for Linked Data without an authority. Hugh demonstrated using sameAs.org to locate co-references between different data sets.
    • Mendeley
      Mendeley is a research network built around the same principle as last.fm. Jan Reichelt and Ben Dowling discussed how by tracking, sharing and organising journal/article history, Mendeley is designed to help users to discover and keep in touch with similarly minded researchers. I heard of Mendeley last year and was surprised by the large (and rapidly increasing) user base – the collective data from its users is already proving a very powerful resource.
    • Processing
      Need to do rapid visualisation of images, animations or interactions? Processing is a Java-based sketchbook/IDE which will help you visualise your data much quicker. Ross McFarlane from the University of Liverpool gave a quick tutorial of Processing.js, a JavaScript port using <Canvas>, illustrating the power and versatility of this library.
    • Genetic Programming
      This session centred around some basic aspects of Genetic Algorithms/Evolutionary Computing and emergent properties of evolutionary systems. Delegates focused on creating virtual ants (with Python) to solve mazes and visualising their creatures with Processing (above); Richard Jones enabled developers to work on something a bit different!
    • Web Security
      Ben Charlton from the University of Kent delivered an excellent walk-through of the most significant and very common threats to web applications. Working from the OWASP Top 10 project, he discussed each threat with real world examples. Great stuff – important for all developers to see.
    • Replicating 3D Printer: RepRap
      Adrian Bowyer demonstrated RepRap – short for Replicating Rapid-prototyper. It’s an open source (GPL) device, able to create robust 3D plastic components (including around half of its own components). Its novel capability of being able to self-copy, with material costs of only €350, makes it accessible to small communities in the developing world as well as individuals in the developed world. His inspiring talk was well received, and this super illustration of open information’s far-reaching implications captured everyone’s imagination.

    All in all, a great conference. A broad spread of topics, with the right mix of sit-and-listen and get-involved activities. Whilst Dev8D is a fairly chaotic event, it’s clear that it generates a wealth of great ideas, contacts and even new products and services for academia. See Dev8D’s Happy Stories page for a record of some of the outcomes. I’m now looking forward to seeing how some of the prototypes evolve, and I’m definitely looking forward to Dev8D 2011.


    Thoughts on Unlocking Historical Directories

    January 26th, 2010

    Last week I talked with Evelyn Cornell, of the Historical Directories project at the University of Leicester. The directories are mostly local listings information, trade focused, that pre-date telephone directories. Early ones are commercial ventures, later ones often produced with the involvement of public records offices and postal services. The ones digitised at the library in Leicester cover England and Wales from 1750 to 1919.

    This is a rich resource for historic social analysis, with lots of detail about locations and what happened in them. On the surface, the directories have a lot of research value for genealogy and local history. Below the surface, waiting to be mined, is location data for social science, economics, enriching archives.

    Evelyn is investigating ways to link the directories with other resources, or to find them by location search, to help make them more re-usable for more people.

    How can the Unlock services help realise the potential in the Historical Directories? And will Linked Data help? There are two strands here – looking at the directories as data collections, and looking at the data implicit in the collections.

    Let’s get a bit technical, over the fold.

    Geo-references for the directories

    Right now, each directory is annotated with placenames – the names of one or more counties containing places in the directory. Headings or sub-sections in the document may also contain placenames. See, for example, the sample record for a directory covering Bedfordshire.

    As well as a name, the directories could have a link identifying a place. For example, the geonames Linked Data URL for Bedfordshire. The link can be followed to get approximate coordinates for use on a map display. This provides an easy way to connect with other resources that use the same link.
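
    Following the link programmatically is straightforward. Here’s a small sketch using the requests and rdflib Python libraries, assuming geonames serves RDF/XML for these URIs via content negotiation; the URI below is the Bosley one from an earlier post, standing in for whichever place a directory is annotated with:

    import requests
    from rdflib import Graph, Namespace

    WGS84 = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

    def coordinates(geonames_uri):
        """Follow a geonames Linked Data link and return (lat, long) from its RDF description."""
        rdf = requests.get(geonames_uri, headers={"Accept": "application/rdf+xml"})
        rdf.raise_for_status()
        graph = Graph()
        graph.parse(data=rdf.text, format="xml")
        return float(graph.value(predicate=WGS84.lat)), float(graph.value(predicate=WGS84.long))

    # Stand-in URI: use the geonames URI for the county or place the directory covers.
    print(coordinates("http://sws.geonames.org/2655141/"))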

    The directory records would also benefit from simpler, re-usable links. Right now they have quite complex-looking URLs of the form lookup.asp?[lots of parameters]. To encourage re-use, it’s worth composing links that look cleaner, more like /directory/1951/kellys_trade/. This could also help with search engine indexing, making the directories more findable via Google. There are some Cabinet Office guidelines on URIs for the Public Sector that could be useful here.

    Linked Data for the directories

    Consider making each ‘fact file’ of metadata for a given directory available in a machine-readable form, using common Dublin Core elements where possible. This could be embedded in the page using a standard like RDFa, or done at a separate URL, with an XML document describing and linking to the record.
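
    As a minimal sketch of the second option – a separate machine-readable document – here’s what a Dublin Core description might look like built with Python’s rdflib; the clean URL, title and geonames link are illustrative placeholders rather than real records:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    # Hypothetical clean URL for one directory record (see the note on URLs above)
    directory = URIRef("http://www.historicaldirectories.org/directory/1898/kellys_bedfordshire")

    # Placeholder: the geonames Linked Data URI for the county the directory covers
    county = URIRef("http://sws.geonames.org/XXXXXXX/")

    g = Graph()
    g.bind("dc", DC)
    g.add((directory, DC.title, Literal("Example trade directory for Bedfordshire, 1898")))
    g.add((directory, DC.date, Literal("1898")))
    g.add((directory, DC.coverage, county))

    print(g.serialize(format="xml"))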

    Consider a service like RCAHMS’ Scotland’s Places, which brings together related items from the catalogues of several different public records bodies in Scotland, when you visit a location page. Behind the scenes, different archives are being “cross-searched” via a web API, with records available in XML.

    Mining the directories

    The publications on the Historical Directories site are in PDF format. There have been OCR scans done but these aren’t published on the site – they are used internally for full-text search. (Though note the transcripts along with the scans are available for download from the UK Data Archive). The fulltext search on the Historical Directories site works really well, with highlights for found words in the PDF results.

    But the gold in a text-mining effort like this is found in locations of the individual records themselves – the listings connected to street addresses and buildings. This kind of material is perfect for rapid demographic analysis. The Visualising Urban Geographies project between the National Library of Scotland and University of Edinburgh is moving in this direction – automatically geo-coding addresses to “good enough” accuracy. Stuart Nicol has made some great teaching tools using search engine geocoders embedded in a Google Spreadsheet.

    But this demands a big transition – from “raw” digitised text to structured tabular data. As Rich Gibson would say about Planet Earth, “It’s not even regularly irregular” – and the process can’t currently be successfully automated.

    Meanwhile, some of the directories do have more narrative, descriptive text, interleaved with tabular data on population, trade and livestock. This material reminds me of the Statistical Accounts of Scotland.

    For this kind of data there may be useful yield from the Unlock Text geoparsing service – extracting placenames and providing gazetteer links for the directory. Places mentioned in directories will necessarily be clustered together, so the geoparser’s techniques for ranking suggested locations and picking the most likely one should work well.

    This is skimming the surface of what could be done with historic directories, and I would really like to hear about other related efforts.


    Linked Data, JISC and Access

    January 8th, 2010

    With 2010 hindsight, I can smile at statements like:

    “The Semantic Web can provide an underlying framework to allow the deployment of service architecture to support virtual organisations. This concept is now sometimes given the description the Semantic Grid.”

    But that’s how it looked in the 2005 JISC report on “semantic web technologies”, which Paul Miller reviews at the start of his draft report on Linked Data Horizons.

    I appreciate the new focus on fundamental raw data, the “core set of widely used identifiers” which connect topic areas and enable more of JISC’s existing investments to be linked up and re-used. JACS codes for undergraduate courses, or ISSNs for academic journals – simple things that can be made quickly and cheaply available in RDF, for open re-use.

    It was a while after I read Paul’s draft before I clocked what was missing – a consideration of how Access Management schemes will affect the use of Linked Data in academic publishing.

    Many JISC services require a user to prove their academic credentials; so do commercial publishers, public sector archives – the list is long, and growing.

    URLs may have user/session identifiers in them, and to access a URL may involve a web-browser-dependent Shibboleth login process that touches on multiple sites.

    Publishers support UK Federation, and sell subscriptions to institutions. On their public sites, one can see summaries, abstracts, thumbnails, but to get data, one has to be attached to an institution that pays a subscription and is part of the Federation.

    Sites can publish Linked Data in RDF about their data resources. But if publishers want their data to be linked and indexed, they have to make two URLs for each bit of content; one public, one protected. Some data services are obliged to stay entirely Shibboleth-protected for licensing reasons, because the data available there is derived from other work that is licensed for academic use only.

    EDINA’s ShareGeo service has this problem – its RSS feed of new data sets published by users is public, but to look at the items in it, one has to log in to Digimap through the UK Federation.

    Unfortunately this breaks with one of the four Linked Data Principles – “When someone looks up a URI, provide useful information, using the standards“.

    Outwith the access barrier, non-commercial terms of use for scholarly resources don’t complement a Linked Data approach well. For example, OCLC’s WorldCat bibliography search forbids “automated information-gathering devices“, which would catch a crawler/indexer looking for RDF. As Paul tactfully puts it:

    To permit effective and widespread reuse, data must be explicitly licensed in ways that encourage third party engagement.