Data visualisation on properties owned by Costa Rica's government ministers
These are just two of many criminal cases brought following probes by the investigative unit at La Nación, Giannina Segnini, who heads the unit, told Journalism.co.uk. Past and present criminal cases relate not only to Costa Rica, but also to the UK, the US, France, Finland and "many other countries".
Segnini has been leading the investigative unit since 1994. "At the beginning it was literally a unit, composed of only me," she said. It later expanded to a team of three journalists focusing purely on in-depth investigations.
Investigating corruption is not the purpose of the unit, Segnini pointed out. "We are not a prosecutor's office," she said, but what her team does do is uncover hard evidence, such as bank transfers and documents.
"We have been doing these types of investigation on a regular basis. Two years ago we started an experiment that combined the investigative methods that we have been using for years with the power of data and database analysis."
The team started scraping websites and collating databases of public data, adding two developers to work with the three journalists.
"The good thing about this is that if you combine those two worlds, the outcome is very powerful. Not only can you actually prove facts, but also ... you are not relying on a source," Segnini added. "When you work with data you can also have access to the whole universe of information; if you are dependent on a source that's difficult to achieve."
- 'Zero-waste' data cooking
"The concept is based on the belief that there has to be no waste in data," Segnini said. "Every new dataset is a new ingredient. And if you have all the information in different databases, if you relate and correlate everything, you may have a wider way to explain things.
"We are using every single bit of data that we are collecting, we are relating it with the other information we have in terms that there is no waste in our reporting."
The investigative unit does this by consolidating datasets and storing the information on a server, where the team can draw on it for investigations. One way they cross-reference data is by using identity numbers, which are assigned to every citizen and company in Costa Rica and are publicly available. In addition to these unique ID numbers, the team also uses dates and chronological information to link records.
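As a rough illustration of cross-referencing by identity number, here is a minimal Python sketch. It is not La Nación's code: the records, ID formats, field names and the helper functions `index_by` and `properties_linked_to` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical records: all IDs, names and fields are invented for illustration.
company_roles = [
    {"person_id": "1-0111-0222", "company_id": "3-101-000001", "role": "director"},
    {"person_id": "1-0333-0444", "company_id": "3-101-000001", "role": "secretary"},
    {"person_id": "1-0111-0222", "company_id": "3-101-000002", "role": "director"},
]
property_records = [
    {"owner_id": "3-101-000001", "property": "Lot A", "registered": "2010-05-01"},
    {"owner_id": "1-0333-0444", "property": "Lot B", "registered": "2011-07-15"},
]


def index_by(records, key):
    """Group a list of dicts by the value stored under `key`."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index


def properties_linked_to(person_id):
    """Properties held directly, or by companies in which the person has a role."""
    props_by_owner = index_by(property_records, "owner_id")
    roles_by_person = index_by(company_roles, "person_id")
    # The person's own ID plus the IDs of every company they are linked to.
    owner_ids = [person_id] + [r["company_id"] for r in roles_by_person.get(person_id, [])]
    found = []
    for owner_id in owner_ids:
        found.extend(p["property"] for p in props_by_owner.get(owner_id, []))
    return sorted(found)


print(properties_linked_to("1-0333-0444"))
```

Because citizens and companies each carry a unique number, the same lookup works whether a property is registered to a person or to a company that person is linked to.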
- Scraping 24/7
"At the beginning we were scraping all this information in a very manual way, now most of the process is automated."
Segnini gave the example of publicly available data on marriages, births and deaths. The information is scraped from the site weekly and automatically fed into the news outlet's database.
After the data is automatically scraped, it is cleaned (using Google Refine) and consolidated onto a repository server called i2, IBM software also used by the FBI and CIA, according to Segnini.
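The clean-and-consolidate step might look something like the following Python sketch. It makes several assumptions: the scraped rows are invented, an in-memory SQLite database stands in for the repository, and the hand-written `clean` function stands in for the kind of normalisation a tool like Google Refine performs. The table name `civil_events` is hypothetical.

```python
import sqlite3

# Hypothetical scraped rows; a real pipeline would receive these from the weekly scrape.
scraped_rows = [
    {"id": " 1-0111-0222 ", "name": "  maría   PÉREZ ", "event": "marriage", "date": "2012-09-01"},
    {"id": "1-0111-0222", "name": "María Pérez", "event": "marriage", "date": "2012-09-01"},
    {"id": "1-0333-0444", "name": "JUAN mora", "event": "Birth", "date": "2012-09-03"},
]


def clean(row):
    """Trim whitespace and normalise casing so that duplicates become identical."""
    return {
        "id": row["id"].strip(),
        "name": " ".join(row["name"].split()).title(),
        "event": row["event"].strip().lower(),
        "date": row["date"].strip(),
    }


def load(rows, conn):
    """Insert cleaned rows; the primary key silently drops repeated records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS civil_events ("
        "person_id TEXT, name TEXT, event TEXT, event_date TEXT, "
        "PRIMARY KEY (person_id, event, event_date))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO civil_events VALUES (:id, :name, :event, :date)",
        [clean(r) for r in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real repository
load(scraped_rows, conn)
for record in conn.execute("SELECT * FROM civil_events ORDER BY person_id"):
    print(record)
```

Keying the table on ID number, event type and date means a weekly re-scrape can be loaded repeatedly without creating duplicate records.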
"There is a component called iBase, and so everything goes to this repository, but it's SQL-based so once we have the information there it's basically ready to be analysed."
There are then two processes that occur, Segnini explained. "One is that the information automatically goes to this repository that can be accessed and searched, not just by the investigative unit, but by the rest of the newsroom."
La Nación developed its own software so all journalists can access the information.
The other process is that the data is analysed. The unit uses another component of i2 called Analyst's Notebook, along with tools including Tableau and statistical software such as R.
Segnini gave an example of how data can be used and re-used, and of the power of linking databases. She explained that the investigative unit collects data on "almost all of the flights coming in and going out from the Costa Rica" and, if there is a plane crash, they have data on the plane. The news outlet can also check who owns a plane and whether there is a loan against it, and if the plane is owned by a company they can see who the directors of that company are.
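A minimal sketch of that chain of lookups, from aircraft to owner to company directors, is shown below in Python with SQLite. The schema, tail numbers, company IDs and director names are all invented for the example, and the real repository may be organised quite differently. The sketch also tallies flights by destination as one simple kind of aggregate query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, invented for illustration only.
conn.executescript("""
CREATE TABLE flights (flight_id INTEGER PRIMARY KEY, tail_number TEXT,
                      destination TEXT, flight_date TEXT);
CREATE TABLE aircraft (tail_number TEXT PRIMARY KEY, owner_id TEXT, has_loan INTEGER);
CREATE TABLE directors (company_id TEXT, director_name TEXT);
INSERT INTO flights VALUES (1, 'TI-XYZ', 'Panama City', '2012-08-01'),
                           (2, 'TI-XYZ', 'Miami', '2012-08-09'),
                           (3, 'TI-ABC', 'Miami', '2012-08-10');
INSERT INTO aircraft VALUES ('TI-XYZ', '3-101-000001', 1),
                            ('TI-ABC', '3-101-000002', 0);
INSERT INTO directors VALUES ('3-101-000001', 'Director One'),
                             ('3-101-000001', 'Director Two'),
                             ('3-101-000002', 'Director Three');
""")


def aircraft_profile(tail_number):
    """Return owner, loan status and company directors for one aircraft."""
    row = conn.execute(
        "SELECT owner_id, has_loan FROM aircraft WHERE tail_number = ?",
        (tail_number,),
    ).fetchone()
    if row is None:
        return None
    owner_id, has_loan = row
    directors = [
        r[0]
        for r in conn.execute(
            "SELECT director_name FROM directors WHERE company_id = ? "
            "ORDER BY director_name",
            (owner_id,),
        )
    ]
    return {"owner_id": owner_id, "has_loan": bool(has_loan), "directors": directors}


def destination_counts():
    """Count flights per destination, most frequent first."""
    return conn.execute(
        "SELECT destination, COUNT(*) FROM flights "
        "GROUP BY destination ORDER BY COUNT(*) DESC, destination"
    ).fetchall()


print(aircraft_profile("TI-XYZ"))
print(destination_counts())
```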
"This cross-referencing is really useful for daily coverage but we also use this information to analyse and find patterns and trends in terms of destinations of the flights."
- Big data, big budget?
"It's not really expensive and what we are doing using i2 we could have done by using only SQL. The difference is that you can visualise the data in real-time, it's not a structured relational database, but an object-oriented database, and that's really helpful when you are investigating and examining the information you have."
La Nación is currently generating more data than it can process or analyse, Segnini explained. "We are therefore changing the way we are working," she said.
"We are inviting journalists from the newsroom, from other departments, like sport, entertainment and the economy section. The idea is that one of them comes and works with us and the data we have for a specific project and then they go back to their department."
- Property tax investigation
When thinking of how to cover the debate over tax reform, the team knew that most tax information was not publicly available.
"We wondered if there was a way to find out whether people were actually paying their taxes. Tax information is confidential but we realised that there is a specific tax, which is the property tax, that is very easy to follow." The team could then link that back to La Nación's database of registered properties.
They looked at the property tax records for those pushing for tax reform, creating a database with the properties under the names of the ministers, the properties under their wives' or husbands' names, and under their company names. They found the declared value of these properties, which is public information on a public registry, and linked that to another database listing the values of the properties.
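One way such a linkage could be expressed is sketched below in Python. The officials, property IDs and amounts are invented, and the ratio of declared to reference value is only an assumed comparison for illustration; the article does not say how the team compared the two figures.

```python
# Hypothetical data: property IDs, officials and amounts are invented for illustration.
declared_values = {"P-001": 50_000, "P-002": 120_000, "P-003": 80_000}
reference_values = {"P-001": 150_000, "P-002": 125_000}  # P-003 has no reference value

holdings = [
    {"official": "Official A", "held_by": "self", "property_id": "P-001"},
    {"official": "Official A", "held_by": "spouse", "property_id": "P-002"},
    {"official": "Official B", "held_by": "company", "property_id": "P-003"},
]


def compare_values(holdings, declared, reference):
    """Attach declared and reference values to each holding, plus their ratio."""
    rows = []
    for holding in holdings:
        pid = holding["property_id"]
        declared_value = declared.get(pid)
        reference_value = reference.get(pid)
        # Leave the ratio empty when either figure is missing.
        if declared_value is not None and reference_value:
            ratio = round(declared_value / reference_value, 2)
        else:
            ratio = None
        rows.append(
            {**holding, "declared": declared_value, "reference": reference_value, "ratio": ratio}
        )
    return rows


for row in compare_values(holdings, declared_values, reference_values):
    print(row)
```

Keeping rows with a missing value, rather than dropping them, makes the gaps in the data visible to whoever reviews the results.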
"The beauty of this case is that normally you get information because there's a source who comes to you and gives you the information. In this case if we had a source we would have probably only got part of the facts or the information. But in this case we got the whole universe of information."
Hat tip: Hacks/Hackers London. The September meet-up included a presentation by Giannina Segnini.