Spatial Analytics

Geography, some economics and data analysis.

Reproducible workflows: Sharing your data analysis

Phil Donovan / 2018-09-17


Developing robust, reproducible analytical practice is essential in today’s ‘data science’ space, where analysis is becoming more and more complex. We’re quite rightly moving away from simpler, more limited spreadsheet environments towards more flexible, more powerful languages such as R and Python. However, this comes at a cost. Spreadsheets have one virtue: all of the analysis is right there for an analyst to see and interpret straight away. Spreadsheet formulas are notoriously hard to debug because they read like a rapper’s lyric sheet, e.g. =$AB$5 * ($AC$67 * ($A$16 + $C$81)), but the analyst is at least at the data coalface, staring directly at the cells, and it is therefore easier to spot errors in the moment. With environments such as Python and R, you are slightly more abstracted away from the data in that immediate, visual sense, making it harder to pick up on errors.

Now, one of the most common practices for ensuring analytical ‘quality’ is to get someone other than yourself to check your work. However, time often means a lot of money, which is why this process needs to be as quick and simple as possible. And checking other people’s work is often extremely painful. Commonly, you have to spend a large amount of time getting your head around the folder and file structure, and then fixing broken file paths which are hard-wired to the previous individual’s PC, before you can even begin to focus on what is actually going on in the code!

In order to create and share reproducible workflows, we need to make our work as transferable and understandable as possible. So in this blog post I detail a couple of easy-to-adopt patterns and conventions for setting up a reproducible R project whose code will work on another person’s computer without any hard-coded file paths having to be re-written. It sounds dull, but what is far duller is grappling with someone else’s arcane file structure and re-writing file paths every time they send you a script.

Structuring your folders and files

Have a clear and well-structured file system and naming convention. Below is a file structure for an R analysis that I have adapted from the work of my colleague, Alex Raichev:
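A sketch of that layout, reconstructed from the description below, might look like the following (the individual file names are just placeholders):

project/
├── project.Rproj
├── data/
│   ├── collected/
│   │   └── nz_census_data.csv
│   └── processed/
├── R/
│   └── extra_functions.R
├── notebooks/
│   └── analysis.Rmd
├── outputs/
└── other/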

The project directory is the root for all information used in the project; all other folders and files are nested within it. Here you would create an “RStudio Project”, along with anything else you may have, such as Git files and the like. The data directory is where all of the collected and processed data is kept: the collected directory holds the ‘raw’, ‘unprocessed’ data that you have sourced, while the processed directory holds the outputs of the processing and intermediate steps involved in an analysis. Meanwhile, the R directory is for your pure R scripts (mostly functions), i.e. not RMarkdown files. The actual analysis of the project is best conducted in a notebook; hence the notebooks directory! Notebooks are a great way of commenting on, chunking and therefore communicating an analysis, which is why they are preferable to a plain .R script with comments. The outputs directory is relatively self-explanatory, as is the other directory, which contains miscellaneous files such as Microsoft Word documents or Excel spreadsheets.
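To make the ‘chunking’ point concrete, a notebook living in the notebooks directory might look roughly like this minimal RMarkdown sketch (the chunk names and contents are illustrative only):

---
title: "NZ census analysis"
output: html_document
---

First we load the collected census data.

```{r load-data}
library(tidyverse)
library(here)

census <- read_csv(here("data", "collected", "nz_census_data.csv"))
```

Then we summarise it, with narrative text like this sitting between the chunks to explain each step of the analysis.

```{r summarise}
summary(census)
```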

Having a well-structured set of folders and files means that if you share a project with someone, they can easily go back and check not only your analysis but also the input data and outputs. It also helps to isolate a project, making it more self-contained and easier to share.

Where are we? here()!

Use here() from the here package. The purpose of the here() function is to find the root directory of the project; in a nutshell, it answers the question “where are we?”, so that we do not hard-wire a directory structure which is specific to our computer, and our computer only. Often in projects, an analyst will write absolute paths which are operating-system and user specific. This needs to stop, as Jenny Bryan explains:

If the first line of your #rstats script is setwd(“C:\Users\jenny\path\that\only\I\have”), I will come into your lab and SET YOUR COMPUTER ON FIRE.

Upon loading, the here package searches for the project root by looking for an ‘.Rproj’ file, a Git repository, or a ‘.here’ file if one has been created, and then works out where that project directory sits on the user’s operating system. Paths can then be expressed relative to the project root, and here() converts them into absolute paths for whichever machine the code happens to be running on.

library(here)
## here() starts at /Users/phildonovan/Documents/sites/spatialanalytics

and you can use here() to build a path

csv_path <- here("data", "collected", "nz_census_data.csv") 

In the above, the arguments passed to here() are appended onto the root file path, giving

"pc_specific_path_to_project_root/data/collected/nz_census_data.csv"

which could then be used to import the data into R with csv_data <- read_csv(csv_path).

The benefit is that here() finds the project root automatically, meaning that you do not have to set it yourself, which would entail hard-coding it.
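If you ever want to check which directory here() has settled on, or to pin the root yourself (the ‘.here’ file mentioned above), the package also ships a couple of helpers; a quick sketch of how I understand them to work:

# Explain which criterion was used to choose the current project root.
dr_here()

# Write an empty '.here' file into the current directory so that it is
# treated as the project root from now on.
set_here()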

Wrapping it all together with the right path() to here()

Now, to bundle this all together into a mock workflow, I want to introduce one more function: path() from the new fs package, which provides a set of tools for working with folders, files and paths in a consistent manner. The path() function simply pastes the pieces of a file path together (with colour highlighting when printed), which, when used in conjunction with here(), makes handling file paths really easy.

path(here(), "data", "collected", "my_data.csv")
## /Users/phildonovan/Documents/sites/spatialanalytics/data/collected/my_data.csv
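As an aside, fs can also create the folder skeleton described earlier, so a fresh project can be scaffolded in a couple of lines; a rough sketch, assuming the directory names from above:

library(fs)
library(here)

# dir_create() is vectorised and quietly skips directories that already
# exist, so the whole project skeleton can be created at once.
dir_create(path(here(), "data", c("collected", "processed")))
dir_create(path(here(), c("R", "notebooks", "outputs", "other")))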

Assuming the file structure I spoke about earlier, here is an example workflow which demonstrates everything I’ve discussed.

library(tidyverse)
library(here)
library(fs)

# Set up file path constants without hard-coding 
# the user's OS-specific details, e.g. no 'C:\\Documents\\...'!
R_SCRIPT_DIR <- here("R")
COLLECTED_DATA_DIR <- here("data", "collected")
PROCESSED_DATA_DIR <- here("data", "processed")
OUTPUT_DIR <- here("outputs")

# Source external r script
f_path <- path(R_SCRIPT_DIR, "extra_functions.R")
source(f_path)

# Read in data.
census <- path(COLLECTED_DATA_DIR, "nz_census_data.csv") %>% 
  read_csv()

# Process ... and output some intermediate data.
path(PROCESSED_DATA_DIR, "processed_census_data.csv") %>% 
  write_csv(processed_census_data, .)

# Finish analysis ... and save the final outputs
path(OUTPUT_DIR, "final_outputs.csv") %>% 
  write_csv(output_data, .)

Notice how there are no hard-coded, operating-system-specific file paths? Furthermore, notice how easy it is to spot where the data is coming from and going to? I believe this example shows how quickly and simply you can create reproducible research which you can share with your colleagues and friends, without driving them batty.

Finally, thank you Jim Hester! Another title for this post could have been ‘An ode to Jim Hester’, who wrote the two packages in this post (here and fs).