A US journalist who founded a not-for-profit news site is to launch an open-source programme that will turn scanned documents – such as receipts – into structured data.
Charles C. Duncan Pardo, founding editor of Raleigh Public Record, plans to launch the tool DocHive in exactly one month.
Duncan, who launched the online-only site dedicated to covering the capital city of North Carolina four years ago, told Journalism.co.uk he started developing the tool to speed up the process of extracting data from scanned campaign finance returns.
Faced with manually entering the data from receipts into Excel "one row at a time", Duncan decided he needed to find a more efficient way of pulling data from scanned documents and converting it to a spreadsheet.
Raleigh Public Record, which has a budget of $80,000 a year, has three part-time members of staff and "relies very heavily on freelancers".
"We don't have a lot of resources, but we do have a lot of friends and connections", Duncan said, including his brother Edward, a software developer, who was keen to help him tackle the problem.
Duncan successfully applied for grant funding to "really get rolling".
In an announcement post on the Reporters' Lab site, Duncan explains how it works. The programme converts the PDF into an image file and then "uses a template to break a page up into smaller sections".
"For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the programme will take each of those sections and turn it into a separate image file.
"The software takes that small image and uses optical character recognition technology to read the words or numbers and insert them into a CSV file."
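The steps Duncan describes (split a page into sections using a template, read each section with OCR, write the results to a CSV file) can be sketched roughly as follows. This is an illustration only, not DocHive's actual code: the template layout, the field names and every function name here are hypothetical, and the "page" is a toy grid of characters with a stand-in OCR function so that the example runs using only the Python standard library. A real pipeline would pass image crops to an OCR engine instead.

```python
import csv
import io

# Hypothetical template: field name -> (left, top, right, bottom) box on the page.
TEMPLATE = {
    "donor_name": (0, 0, 10, 1),
    "occupation": (10, 0, 20, 1),
    "amount": (20, 0, 28, 1),
}


def crop(page, box):
    """Cut one rectangular section out of a page given as a list of rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def page_to_record(page, template, ocr):
    """Apply the template to a page and run the OCR callable on each section."""
    return {field: ocr(crop(page, box)).strip() for field, box in template.items()}


def records_to_csv(records, fieldnames):
    """Write the extracted records to CSV text, one row per page."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def fake_ocr(section):
    """Stand-in for a real OCR engine: the toy page is already made of text."""
    return "\n".join("".join(row) for row in section)


# A toy one-row "scanned page" whose columns line up with TEMPLATE.
toy_page = ["Jane Doe  Teacher   250.00  "]

record = page_to_record(toy_page, TEMPLATE, fake_ocr)
csv_text = records_to_csv([record], list(TEMPLATE))
print(csv_text)
```

Keeping the template as plain data, separate from the code, reflects the idea that templates for different document types could be shared and swapped without touching the rest of the pipeline.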
DocHive is due to launch at the NICAR conference next month (28 February), and Duncan hopes journalists and developers working for news sites elsewhere will be able to adapt it to suit specific purposes.
"We are creating a wiki for the documentation," he said. "The hope is as other people tackle different documents, they will share those templates with others who are facing similar problems with documents."
Duncan will add updates on DocHive to Reporters' Lab and Raleigh Public Record.