Jonathan Stray. Credit: Knight Foundation
Last week the Associated Press was one of 16 winners of the Knight Foundation's News Challenge, which pledged $4.7 million in funding to winning projects. AP's winning proposal was to build Overview, an open source analytical tool to help journalists find stories in "data dumps".

The project secured a grant of $475,000, which will be used to create a system that helps journalists understand large sets of unstructured or semi-structured text documents and find the stories within them. The system will be distributed with comprehensive training materials.

Following the win, AP's interactive technology editor Jonathan Stray, who will lead the AP team, explained that the tool will clean up data, illustrate patterns and create data visualisations.

"Overview, as we dubbed the tool, will create maps that display relationships among topics, people, places and dates," he said in a release.

"The goal is an interactive system where the computers do the visualization, while a human guides the exploration."

We caught up with Stray this week to find out more about the project, the growing importance of data journalism to AP and how the agency will work with newsrooms across the globe to develop the technology.

What functionalities will the tool offer?

We would like Overview to solve three different problems out of the box:
  • creating visual maps of the topics of a large set of documents,
  • visualizing conversation threads and social networks in emails,
  • automatically categorizing documents into different types, such as separating emails from meeting minutes, because the document sets that journalists receive very often contain many different types of documents scrambled together.
What makes Overview different from other data journalism tools which help clean data?

Overview is specifically designed for large volumes of unstructured text data. There are now fairly good tools for data in the form of spreadsheets or structured databases, but very often journalists have to try to find stories in a huge volume of natural-language documents. Numbers can be charted and locations can be mapped, but it's much harder to get a sense of the content and topics of thousands or millions of pages of text.

What impact do you see the project having on the journalism industry?

We hope that Overview will become the first-line tool for document dump reporting, and that journalists and others will be empowered to more closely examine the document sets available on their beats and within their communities.

How will the money be used initially?

The majority of the money will be used to hire developers to build the system. There is also budget for documentation and training, to ensure that the system is easy to learn. The code will be continuously available on GitHub, but we intend to have an alpha release about 12 months after main development begins.

How important is data becoming as part of AP's journalism, and how do you see the tool developing this?

AP pursues stories of all types and formats all across the world, from breaking news to enterprise investigative work. However, it's clear that data journalism is a rapidly expanding field, if only to keep up with the enormous volumes of documents and data obtained through Freedom of Information requests, government transparency initiatives, and leaks. All of this information means nothing if we don't have the tools to make sense of it.

Are there plans to make the tool widely available, or only to AP staff?

Overview is an open-source project and we will be working with newsrooms around the world to design and deploy this technology. The participation of others is essential to the success of the project, both because we need help and because this project is as much about creating a community of knowledgeable journalists and technologists as it is about developing software.

What are your long-term plans for development and what are the main goals?

Our experience so far suggests that no two document dumps are quite alike. That means there can't be a one-size-fits-all tool. Although we will try to solve some of the most common problems, such as email visualization, our long-term goal is for Overview to be a flexible platform that can be configured for a variety of tasks. To that end we are planning a comprehensive plug-in architecture so others can extend the system for their own needs.

Mostly we hope to make document dump reporting much quicker and easier. We hope this will lead reporters to attempt more ambitious projects, and find stories that might otherwise be missed. Also, a carefully designed visualization can itself be a type of narrative when it clarifies the relationships between things.

Image by the Knight Foundation on Flickr. Some rights reserved.
