How notebooks can boost your work with data
David Kane is a data scientist and researcher with over ten years’ experience working with charity data. He is also the Product Lead at 360Giving, a charity that helps funders publish open data about their grants, and empowers people to use this data to improve charitable giving. You can use 360Giving’s search engine for grants data, GrantNav, to explore the data and understand how grants are given in the UK.
As part of the 2021 Data4Good Festival, I ran a session showing how you can use code notebooks for data cleaning and analysis. You can view and run the notebook using Google Colab, and Festival ticket holders can rewatch the session for a walkthrough of how to use a notebook and some of its features.
What are notebooks?
Notebooks are files that bundle together code with explanatory text and give you a rich environment to gather and explore datasets.
At 360Giving we’ve started using Colab notebooks regularly to perform and share analysis. The notebooks all start by gathering data from the 360Giving Datastore, an online database where developers and researchers can access all data published in the 360Giving Data Standard. We can then clean and understand the data, view data tables and visualise the data.
So, why use coding notebooks? What advantages do they give over traditional data analysis which might use Excel or other software?
1. Show your working
A main benefit of notebooks is that they combine the code and the explanation in one place. Previously, I might have had a spreadsheet full of complex formulas in one place, and a separate document elsewhere with some notes on what I had actually done.
Notebooks are made up of cells. These cells can contain the actual code which is run in the notebook itself, and you can display outputs from this code (like a data table or a chart). But cells can also contain text that explains what is happening – what you are doing, why you chose a particular method, what others might need to know about this piece of analysis.
Notebooks use a simple but very flexible text format called Markdown. This enables you to add richer text to your notes, including lists, links, images, and headings.
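To make the idea of cells concrete, here is a minimal sketch of what sits inside a Jupyter notebook file. It assumes the standard .ipynb JSON layout (format version 4), and the contents of both cells are made up for illustration.

```python
import json

# A notebook file is JSON: a list of cells plus some metadata.
# The cell contents below are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A text cell, written in Markdown (heading, bold text, a list)
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Grants analysis\n",
                "This notebook adds up a few **example** grants.\n",
                "\n",
                "- amounts are in pounds\n",
                "- the figures are made up",
            ],
        },
        {
            # A code cell; its outputs are filled in when the cell is run
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["amounts = [1000, 2500, 400]\n", "sum(amounts)"],
        },
    ],
}

# Serialise the notebook to text, as it would appear on disk
text = json.dumps(notebook, indent=1)

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Saving that text to a file with an .ipynb extension should give you something Jupyter or Colab can open, with one text cell followed by one code cell.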
Increasingly this means that you can publish a version of the notebook itself, without the need for separately writing up the work. I wrote a blog post showing how to use Wikidata to look up the location of football stadiums – the whole thing is from a notebook, and running the code cells would allow you to repeat the analysis.
2. Prototype and test ideas
Notebooks are also slightly different to how code is run elsewhere. Each code cell can be run by itself, without needing to re-run all the cells before it (although it can access the results of those previous cells). This makes notebooks ideal for testing and trying out ideas, without needing to get your code right the first time.
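The cell-by-cell workflow looks roughly like this. The sketch is written as a single script, with comments marking where each cell would begin, and the grant records are made up for illustration.

```python
# --- Cell 1: gather the data (run once) ---
# Made-up records standing in for a real dataset.
grants = [
    {"funder": "Funder A", "amount": 1000},
    {"funder": "Funder B", "amount": 2500},
    {"funder": "Funder A", "amount": 400},
]

# --- Cell 2: a first idea, using the result of cell 1 ---
total = sum(grant["amount"] for grant in grants)
print(total)

# --- Cell 3: a second idea, added later ---
# Re-running only this cell reuses `grants` without reloading it.
by_funder = {}
for grant in grants:
    by_funder[grant["funder"]] = by_funder.get(grant["funder"], 0) + grant["amount"]
print(by_funder)
```

In a notebook, you could keep editing and re-running the third cell until the output looks right, while the data gathered in the first cell stays in memory.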
This is how we use notebooks at 360Giving. We have a “minimum viable notebook” which contains the details needed to connect to the 360Giving Datastore and some example queries to fetch data. This can quickly be copied and then expanded to test out a new piece of analysis.
The fact you can try out code and iterate on it without worrying too much about breaking anything also makes notebooks an ideal environment for learning how to code.
3. Share and enjoy
Notebooks are designed to be shared. If you’re using Google Colab then you can share the notebook like any other Google Drive file, and multiple people can edit the same file. If you’re using Jupyter notebooks then the files are like any other file, and can be emailed as an attachment, uploaded to websites, etc.
Services like GitHub and nbviewer make it easy to view these files, even if you are not running the Jupyter notebook software. And services like Binder and Google Colab mean you can run these notebooks, as well as viewing them.
Sharing notebooks might be about collaborating with others – you might want to get a senior manager to look over some figures, or check your method with an expert. But sharing notebooks also helps lend credibility and rigour to your work. It means others can see the code and try to reproduce your results. They can also build on your work to do their own analysis.
4. Don’t start from scratch
By running a Python notebook you get access not just to basic Python, but to all the other modules that have been created for specialist tasks.
In the notebook I shared as part of the Data4Good Festival, I showed some of the tools I use in notebooks. I rely heavily on the Pandas library to load, transform and analyse data. And I love how easily tools like Folium integrate, making it possible to map your data with just a couple of lines of code. If you look closely, you can see that Folium’s own tutorial is written in a notebook.
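As a rough sketch of that Pandas and Folium combination (not the code from the Festival notebook), the example below totals some grants by city and then plots one marker per city. The data, column names and the `make_map` helper are all made up for illustration. In practice you would load a real file, for example with `pd.read_csv`.

```python
import pandas as pd

# Made-up grants with rough coordinates, for illustration only.
grants = pd.DataFrame(
    {
        "recipient": ["Charity A", "Charity B", "Charity C"],
        "city": ["Leeds", "Bristol", "Leeds"],
        "amount": [1000, 2500, 400],
        "lat": [53.80, 51.45, 53.80],
        "lon": [-1.55, -2.59, -1.55],
    }
)

# Transform and analyse: total awarded per city, largest first.
by_city = (
    grants.groupby("city", as_index=False)
    .agg(total=("amount", "sum"), lat=("lat", "first"), lon=("lon", "first"))
    .sort_values("total", ascending=False)
)
print(by_city)


def make_map(cities):
    """Plot one marker per city using Folium (hypothetical helper)."""
    import folium  # imported here so the Pandas steps run without Folium installed

    grant_map = folium.Map(location=[52.5, -1.9], zoom_start=6)
    for row in cities.itertuples():
        label = f"{row.city}: £{row.total}"
        folium.Marker([row.lat, row.lon], popup=label).add_to(grant_map)
    return grant_map
```

In a notebook with Folium installed, ending a cell with `make_map(by_city)` should display the interactive map directly below that cell.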
There are hundreds of other Python modules that are useful for tasks like machine learning, geographical data, connecting with data sources, charting and visualising data and much more.
These are some of the ways I think coding notebooks can help you produce excellent data analysis. They are easy to get up and running, and with the right coding skills – or a willingness to learn – you can quickly take advantage of the opportunities. You can get started yourself on Google Colab, or if you want to find out more about 360Giving data you can request access to the 360Giving Datastore.
This blog post is a repost and was originally written for, and published by, the Data4Good Festival. Find the original blog at: data4goodfest.org.uk/notebooks-boost.