Journalists now have more easily accessible data at their fingertips than ever before.

But when it comes to telling stories with that data, the sheer amount of information available – coupled with tight deadlines for processing it – can be overwhelming.

At the NICAR conference in Denver, Colorado, on 10 March, investigative journalist Giannina Segnini shared some techniques for capturing the value of data and evaluating its quality.

"Data lies, as people lie," said Segnini, who is head of the MS Data Concentration Program at Columbia University's Graduate School of Journalism. She previously led the Investigative Unit at Costa Rican news outlet La Nación.

"At some point, data was collected by humans, so that's why it's errant."

Here are her recommendations for verifying data, adapted from a chapter she wrote for The Verification Handbook.

Where did the data come from?

"The first thing you do when you interview someone is a background check, so why should it be different with data?" asked Segnini.

She recommends thinking about data as if each column were a person – she calls them "ladies".

"Interviewing these 'ladies' is the only way to really understand their context in order to evaluate their quality," said Segnini.

"You have to have respect and understand that they are showing a version of reality. Who collected the data? Who paid for the data collection? What are the legal aspects – private data or public data?"

Usually this information can be found by searching around on the website where the data is available.

Understand the different types of data

Data can be structured or unstructured. Unstructured data, such as email, is designed for humans. It is not organised in a pre-defined manner and may contain data such as dates, locations, and numbers.

Structured data, such as a spreadsheet, is organised and machine-readable.

Every tool used to clean and refine data will take these two types of data into consideration.

Segnini recommends Talend, an open-source platform for processing Big Data that also evaluates its quality.

Data journalists tend to prefer individual records over aggregated data because individual records can be verified, she added.

"I basically do not work with statistical data," said Segnini. "I do not trust numbers."

Use what you know

When Segnini was working on a data project about abortion, the first thing she did with the dataset was to examine the information from a region she, being Costa Rican, knows very well: Latin America.

"In this case, they extrapolated the data in Latin America for two countries. I knew that was not possible, so I rejected the data."

Segnini also recommends choosing 50 records from your dataset at random and checking them against independent sources, such as published reports, to see if they hold up.
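
As a rough illustration of that spot check, here is a minimal Python sketch that draws 50 records at random from a dataset. The function name `sample_records` and the example rows are hypothetical and not taken from Segnini's talk; the checking against outside sources still has to be done by hand.

```python
import random


def sample_records(records, k=50, seed=None):
    """Return up to k records chosen at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    records = list(records)
    return rng.sample(records, min(k, len(records)))


# Hypothetical dataset: 1,000 rows with an id and an amount.
rows = [{"id": i, "amount": i * 10} for i in range(1000)]

# Draw the records to verify manually against published reports.
to_check = sample_records(rows, k=50, seed=42)
for record in to_check[:5]:
    print(record)
```

Fixing the seed means a colleague can reproduce the same sample later, which may help when documenting how the check was done.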

If you're looking at a dataset relating to a specific organisation, you could also talk to a former employee who is familiar with the data to see if it rings true, she said.

Do your maths

Segnini advises checking "the maths behind the indicators". She uses OpenRefine, an open-source tool, to explore data.

"Look at the extremes, in terms of money, dates and any other variables. Do they add up?"
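
One possible way to run that kind of check outside OpenRefine is sketched below in Python. The helper names `extremes` and `total_matches`, the field names and the payment records are all invented for illustration; the idea is simply to surface the smallest and largest values in a column and to compare a column's sum with a reported total.

```python
from datetime import date


def extremes(records, field):
    """Return the (smallest, largest) value found in one field."""
    values = [r[field] for r in records if r[field] is not None]
    return min(values), max(values)


def total_matches(records, field, reported_total, tolerance=0.01):
    """Check whether the field's sum agrees with a reported total."""
    computed = sum(r[field] for r in records if r[field] is not None)
    return abs(computed - reported_total) <= tolerance


# Hypothetical payment records; the third row is deliberately suspicious.
payments = [
    {"paid_on": date(2015, 1, 12), "amount": 1200.00},
    {"paid_on": date(2015, 2, 3), "amount": 850.50},
    {"paid_on": date(1915, 3, 9), "amount": 99999999.00},
]

print(extremes(payments, "amount"))   # smallest and largest payment
print(extremes(payments, "paid_on"))  # earliest and latest date
print(total_matches(payments, "amount", reported_total=2050.50))
```

A date a century out of range or a payment several orders of magnitude larger than the rest is not proof of an error, but it is a prompt to go back to the source and ask.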

Know your acronyms

Datasets often use acronyms and definitions. Some are generic – "forms" quite often explain who collected the data and how, for example.

Others are specific to the company that collected the data, and can be confusing for outsiders (sometimes perhaps intentionally so).

Occasionally there will be a glossary or similar on the site where you found the data.

However, Segnini also recommends getting to know the IT department of any companies you cover or are likely to cover.

Contact information for IT workers is not usually as easy to find online as details for chief executives, for example, but often they hold the key to unlocking those mysterious acronyms.

"As journalists we're accustomed to look up ministers, secretaries, chief executives, but these 'invisible' people in the IT departments are amazing," she said.

"Get to know them. Because then before even getting your hands on the data, you can get a dictionary to explain the acronyms."
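
Once you have such a dictionary, a short script can show which columns are still unexplained. The Python sketch below is only an illustration: the column names, the definitions and the helpers `undefined_columns` and `describe` are made up, and a real data dictionary would come from the organisation that holds the data.

```python
def undefined_columns(columns, data_dictionary):
    """Return the column names that have no entry in the data dictionary."""
    return [c for c in columns if c not in data_dictionary]


def describe(columns, data_dictionary):
    """Map each column to its definition, flagging the ones still missing."""
    return {
        c: data_dictionary.get(c, "UNDEFINED: ask the data owner")
        for c in columns
    }


# Hypothetical column headers from a downloaded spreadsheet.
columns = ["CTR_NO", "PROV_ID", "AMT_USD", "ST_CD"]

# Hypothetical dictionary obtained from the IT department.
data_dictionary = {
    "CTR_NO": "Contract number",
    "PROV_ID": "Provider identifier",
    "AMT_USD": "Contract amount in US dollars",
}

print(undefined_columns(columns, data_dictionary))
for name, meaning in describe(columns, data_dictionary).items():
    print(f"{name}: {meaning}")
```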

Update 15/03: An earlier version of this article incorrectly stated that La Nación was an Argentine news outlet.