A graph showing the number of IEDs cleared from the Afghanistan War Logs
Only a couple of years ago, the idea that journalists would need to know how to use a spreadsheet would have been laughed out of the newsroom. Now those benighted days are way behind us and extracting stories out of data is part of every journalist's toolkit of skills.
Some people say the answer is to become a sort of super hacker, write code and immerse yourself in SQL. If you decide to take that approach, you can find a load of resources here. But a lot of the work we do is just in excel and that's what I'll deal with here.
Of course, you could just ignore the whole thing, hope it'll go away and you can get back to longing to write colour pieces. But if you're interested in generating great news stories from the "tsunami of data" being released by the government, this is where you need to start.
1) Sourcing the data
This is a much undervalued skill - with many journalists simply outsourcing it to research departments and work experience students. You can't imagine doing that with key interviews for a crucial piece, yet key facts are often treated appallingly. You can do worse than start with http://www.guardian.co.uk/data - we've already done the work of identifying and cleaning up hundreds of datasets from carbon emissions, via crime rates to Bafta winners. Alternatively, I have a Delicious feed where I link to datasets - you can find that at http://www.delicious.com/smfrogers
But broadly, the general approach is to look for the most authoritative place for your data. Here are a few very specific examples:
GDP - from the Office for National Statistics. There are loads of different definitions of GDP. You probably want inflation-adjusted, which has the label: ABMI. You can find that at this link. It should be pointed out that the ONS has incredible info on this site - but it is also the world's worst website. You can get GDP refined into an easier form on the Guardian datablog here.
- Carbon emissions from different countries - from the US Energy Information Agency. A fantastic resource and has every country in the world on it.
- UK immigration statistics from the Home Office. These stats also cover asylum and come out annually but also every few months. They offer excel downloads and data for where applications come from
And here are a couple of general Guardian resources that you might find useful:
- World government data search - this amalgamates government data from many different countries such as crime, geo data and health - over 15,000 datasets are available
- Global aid and development data search. This brings together aid and development data from the UN, OECD, World Bank, the IMF and others - and covers over 3,200 datasets
2) Cleaning up the data
Adobe PDF files are the enemy of open data. Often files are published as PDFs to make them as inaccesible as humanly possible. Tables on pages can sometimes be extracted using Adobe Professional (which has an option if you highlight the table and control-click it). Often it's simpler to use a basic programme like Text Wrangler to access the data.
If a table's published on a web page, one thing worth trying is saving the web page as html – and then opening it in Excel, which then opens its as a table.
And even when data is published in spreadsheet (or CSV) form, it often comes in from respected sources in a right-old state. The Office for National Statistics, for instance, is fond of inserting hidden columns and blank lines into excel that serve no purpose but to make the data harder to analyse. The answer is to copy the data out, open a new excel window and 'paste special' the data into it as 'values' – then you an work out where those troublesome lines are.
3) Keep the codes
Often official data comes with the official codes added in – for instance, each school, hospital, constituency and local authority has a unique identifier code. Countries have them too (the UK's code is GB, for instance) – you can get them directly from the ISO at this link. They're useful because you may want to start mashing datasets together and it's amazing how many different spellings and word arrangements can get in the way of that. There's Burma and Mayanmar, for instance, or Durham, City of compared to City of Durham. Codes allow compare like with like.
There's a useful function that will help you with this – VLOOKUP in excel allows you to bring datasets together – it looks for the value in one sheet and brings in the results from another.
4) Other useful functions and formulas
PIVOT TALE REPORTS – this Excel function allows users to count the number of times something comes up, aggregate data or work out averages. It's really useful and we used it a lot with the WikiLeaks data.
Percentage change – this is the one I use, although I'm sure there are others:
|A Data 2005||B Data 2006||Result|
(For which the result = 92).
Rate per 100,000 – very useful for making data more proportional – if you ever want to compare China to the UK, for example:
|A UK||B Population||Rate per 100,000|
(For which the result = 0.04).
5) Publishing and visualising the data
We publish data using Google Spreadsheets because it makes them very easy to share – although we've normally manipulated the data in a spreadsheet package first. If you're working online, there are lots of ways to visualise the data (not least the built-in Google chart functions (which you an then embed on you site as mini interactives).
Many Eyes, Wordle and Timetric are all also useful. But the best option of all is to find a designer you can work with – designing a graphic and analysing data are two different jobs.
Google Fusion tables, a new service, is fantastic for mapping data – we used it here to show the WikiLeaks Iraq death figures.
Know your billions – it's amazing how few people know how many zeroes are in a billion, yet happily quote billions all the time. A billion is a 1,000 million and it looks like this: 1,000,000,000
Know the really big numbers, at least roughly: it will help you get a sense of how likely it is that a story is true. Some good ones are: How many people are born, die, total UK budget, size of the deficit, what the deficit is.
The main thing is: treat numbers like any other type of information. They're not sacred (although the facts are) – they can bear scrutiny and investigation just like anything else.
Simon Rogers edits guardian.co.uk/data. You can follow him on Twitter at @datastore and @smfrogers.
Interested in taking data journalism training further? Register for Journalism.co.uk's own 'Introdcution to data journalism' course, taught by former BBC Washington correspondent and Guardian digital research editor and data blog contributor Kevin Anderson. See full details at this link.