In the first of our three-part series on creating data-driven stories with import.io, Bea Schofield and Alex Gimson demonstrate how to create an API to extract data from websites.
import.io is a free tool created by a London start-up for turning websites into data. It aims to make the vast amount of data available on the web more accessible.
Essentially, import.io allows users to turn any website into a table of data or an API in just a few minutes without needing to write any computer code.
Analysing data can provide rich pickings for sourcing stories, such as Oxfam’s recent report that the poorest half of the world’s population has the same amount of wealth as the richest 85 people.
Working with import.io, Oxfam discovered these statistics by analysing data from the most recent Forbes rich list, published early this year.
The screencast above demonstrates how easy it is to create an API to the Forbes rich list using an import.io extractor, or alternatively you can just follow the steps below.
1. Go to import.io and create your free account, then download their data browser – basically a web browser with some extra functions – and open it from your desktop. Next, navigate to the URL with the data you want to use and hit the small io button in the top-right corner.
Click 'let’s get cracking' and choose the extractor tool (the one on the right).
If you can still see the data in the window, as we can in this example, click 'yes'.
The list in this example contains multiple entries, so you need to click on the 'multiple results' button (the one in the middle).
2. Now it is time to begin extracting the data.
You will need to start by training the rows. To do this, highlight all of the data for one full result (i.e. one person’s information) and click the 'train rows' button. You will need to train a few examples (in this case two) before the tool is able to recognise the pattern of data on the page.
Next you will need to train the exact bits of data you want on the page by adding columns. To do this, click on the 'add columns' button, type the name into the box and select the data type (text, number, currency, link, etc.).
Then highlight an example of the data you want in that column and click train. In some cases you may need to train multiple examples of data for a column.
For this use case, you will need to map the rank as a number, name as text, net worth as currency, age as a number and source as text.
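The column mapping above can also be written down as a typed record, which makes it easier to check the data once it has been extracted. The following is a minimal Python sketch and not part of import.io: the class name `RichListRow`, the helper functions and the field names are hypothetical, and the sample values are placeholders rather than real figures.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class RichListRow:
    """One row of the list, using the five columns described above."""
    rank: int            # number
    name: str            # text
    net_worth: Decimal   # currency
    age: int             # number
    source: str          # text


def parse_currency(raw: str) -> Decimal:
    """Strip a leading currency symbol and thousands separators (assumed format)."""
    cleaned = raw.strip().lstrip("$£€").replace(",", "")
    return Decimal(cleaned)


def parse_row(raw: dict) -> RichListRow:
    """Convert a dict of raw strings into a typed row."""
    return RichListRow(
        rank=int(raw["rank"]),
        name=raw["name"].strip(),
        net_worth=parse_currency(raw["net_worth"]),
        age=int(raw["age"]),
        source=raw["source"].strip(),
    )


# Placeholder values for illustration only, not real rich-list data.
example = parse_row({
    "rank": "1",
    "name": " Example Person ",
    "net_worth": "$1,000",
    "age": "50",
    "source": "Example Co",
})
```

Choosing the right data type for each column matters for the same reason here as it does in the tool: numbers and currency values can only be sorted and added up if they are stored as numbers and not as text.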
3. Once you have trained all the data you need, you can publish your API by clicking 'I’ve got what I need', then 'I’m done training' and finally 'upload to import.io'.
import.io will then create an API based on the data you trained. Clicking on 'show me the data' will take you to the dataset page where you can view the data you trained and refresh it whenever you like.
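Once the data is in a table, simple analysis of the kind described at the start of this article becomes possible, such as adding up the net worth of the top entries. The sketch below is a rough illustration in Python and makes several assumptions: that the dataset has been saved as CSV text, that the columns are named `rank` and `net_worth`, and that net worth is already a plain number. The function name `total_net_worth` is hypothetical and the sample rows are placeholders, not real figures.

```python
import csv
import io
from decimal import Decimal

# Placeholder data standing in for an exported dataset (assumed CSV layout).
SAMPLE_CSV = """rank,name,net_worth,age,source
1,Person A,300,60,Example A
2,Person B,200,55,Example B
3,Person C,100,70,Example C
"""


def total_net_worth(csv_text: str, top_n: int) -> Decimal:
    """Sum the net_worth column for the top_n rows, ordered by rank."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: int(row["rank"]))
    return sum((Decimal(row["net_worth"]) for row in rows[:top_n]), Decimal(0))


top_two = total_net_worth(SAMPLE_CSV, 2)
```

With a real dataset, `top_n` would be set to 85 to total the wealth of the 85 richest people on the list.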