Liberate the newspapers! An update on the Europeana Newspapers project

Newspapers are one of the most sought-after sources of cultural heritage material available, since they contain a wealth of information ranging from international news events to local and family announcements. In the Europeana Newspapers project, partners from all over Europe have been working together on adding an impressive amount of newspaper content (18 million digitised newspaper pages and over 30 million records related to newspaper titles) to Europeana and The European Library. Since this is historical newspaper content from before 1940, nearly all of the material (and all of the metadata associated with it) is in the public domain, apart from a small amount of 20th-century material which still falls under copyright.

An important part of the project focuses on boosting the potential use of this historical newspaper material for future research by improving the searchability of the content. When newspaper material is scanned, the first result is a simple image of the newspaper page. To enable valuable features for research (such as searching for specific articles or finding names of people), an additional refinement process is needed. In this process the page is divided into sections, the text is converted to full text through Optical Character Recognition (OCR), and certain content, such as names of people or locations, is classified through Named Entity Recognition (NER).

A newspaper separated into articles through techniques such as Optical Layout Recognition (OLR)

Europeana Newspapers will deliver such full-text material for 10 million newspaper pages from various libraries throughout Europe through a new content browser of The European Library. A prototype of this browser was released in early 2014; a beta version with improved functionality will be released later this year. Unfortunately, it is not easy to transform these historic newspapers from a scanned image into 100% correct text: newspapers were often printed in unusual fonts or on poor-quality paper, or were damaged in various ways over the years, all of which makes it more difficult for the OCR software to recognise the text correctly. In the future, The European Library hopes to give users the ability to submit corrections to the OCR text. An overview of the newspaper content that is currently available is here; more material will be added throughout 2014.
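To illustrate why OCR errors matter for searchability, here is a minimal sketch in Python. It is not how The European Library's browser works; the sample sentence, the function names and the similarity cutoff are all assumptions made up for this example. It shows a plain keyword search missing a word that the OCR misread (a historic long "s" recognised as "f"), and a simple fuzzy match from Python's standard `difflib` module recovering it.

```python
import difflib
import re


def exact_hits(keyword, text):
    """Return tokens in the text that match the keyword exactly (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t == keyword.lower()]


def fuzzy_hits(keyword, text, cutoff=0.8):
    """Return tokens that are merely similar to the keyword.

    The cutoff of 0.8 is an arbitrary illustrative value, not a recommendation.
    """
    tokens = sorted(set(re.findall(r"\w+", text.lower())))
    return difflib.get_close_matches(keyword.lower(), tokens, n=5, cutoff=cutoff)


# Hypothetical OCR output: the long "s" in "ship" and "Amsterdam" was read as "f".
ocr_text = "The fhip arrived in Amfterdam on Tuesday"

print(exact_hits("Amsterdam", ocr_text))  # the exact search finds nothing
print(fuzzy_hits("Amsterdam", ocr_text))  # the fuzzy search finds the misread word
```

Fuzzy matching of this kind trades precision for recall, which is one reason corrected OCR text remains valuable.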

With so many different libraries and such a large amount of content involved, there have been a number of discussions around how best to provide access to the material: whether to expose the full text with OCR errors to the user, how to deal with the different image and metadata formats, and how to solve the sheer challenge of storing these often huge files (a master file of a newspaper image ranges from 10 to 50 MB; more about the solution to this problem in this blog). The project now expects to deliver the full-text content without any copyright restrictions, which is great news for OpenGLAM.
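To give a rough sense of the storage challenge, the figures mentioned in this post can be combined in a back-of-the-envelope calculation. This is only a sketch: it assumes one master file per page, that the 10 to 50 megabyte range applies to all 18 million pages, and decimal units (1 TB = 1,000,000 MB). The real totals for the project may differ.

```python
PAGES = 18_000_000     # digitised newspaper pages mentioned in this post
MB_PER_TB = 1_000_000  # decimal units: 1 terabyte = 1,000,000 megabytes


def total_terabytes(pages, mb_per_page):
    """Estimate total storage in terabytes, assuming one master file per page."""
    return pages * mb_per_page / MB_PER_TB


low_tb = total_terabytes(PAGES, 10)   # every master file at the small end
high_tb = total_terabytes(PAGES, 50)  # every master file at the large end

print(f"Estimated master-file storage: {low_tb:.0f} TB to {high_tb:.0f} TB")
```

Under these assumptions the master files alone would occupy somewhere between 180 and 900 terabytes.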

The ability to freely use this valuable data will open up a wealth of possibilities for future research in areas such as text mining, information retrieval and language technology. Users will be able to search through the entire collection by keyword and to compare newspaper coverage of a specific historic event across different European countries. The project has been running a great series of blogs interviewing researchers about their work with historic newspapers, giving a first idea of the value this material has for a wide variety of research topics, such as how job ads provide information on the careers of men and women in the 19th and 20th centuries or how jokes and slang moved between America and Britain in the 1800s.

The availability of such a large corpus of digital newspaper content from libraries throughout Europe is a great step forwards. It is, however, only a fraction of the total material available, the majority of which is not even digitised yet: a 2012 survey held by the project found that only 26% of libraries had digitised more than 10% of their collection. An interesting paper on this, titled Representation and Absence in Digital Resources: The Case of Europeana Newspapers, will be presented at the next Digital Humanities (DH2014) conference in July.

If you want to find out more about the results of the project you can join the final event ‘Newspapers in Europe and the Digital Agenda for Europe’ at The British Library on 29-30 September 2014. This event is specifically aimed at policy makers and cultural heritage professionals and has a mission that OpenGLAM definitely supports: Liberate the newspapers!