# Spatial Analytics

Geography, some economics and data analysis.

# Parellising GIS workflows in R

### Phil Donovan / 2017-09-09

Recently, I have been working on parcel data for a project. Parcel data, even in New Zealand, is a large data set to process and I’m often waiting for analyses to finish. Modern computers all have multiple cores; however, by default they only send one task to one core. Parallisation is the process of breaking tasks up and sending them to multiple cores for processing. Hence the ability to speed things up. So let’s have a play to see if we can speed things up.

Firstly, let’s set the analysis up by importing the libraries and some data. The libraries I use are the new simple features (sf) library for R which interfaces with GDAL/OGR open source GIS tools and is built to work within the ‘tidyverse’ world of Hadley Wickham.

# Libraries
library(sf)
library(tidyverse)
library(parallel)
# Set number of points
number_of_points <- 10000

# Generate two matrices
sf_df_1 <- matrix(rnorm(number_of_points),nc=2) %>%
data.frame %>%
st_as_sf(coords = c("X1", "X2"))

sf_df_2 <- matrix(rnorm(number_of_points),nc=2) %>%
data.frame %>%
st_as_sf(coords = c("X1", "X2"))

OK, now we’re ready to roll! Let’s calculate the point to point distance between the data sets normally.

system.time({
st_distance(sf_df_1, sf_df_2)
})
##    user  system elapsed
##  40.157   0.513  42.082

Let’s have a look at what we can do it in when parallel. For this, I use the mclapply function from the parallel package which works exactly the same as the lapply function1. Sadly however, mclapply is only available on Mac Os X or Linux2. It is possible to implement parallisation in R on Windows but it is more difficult.

The workflow for implementing parallelisation in R is as follows:

1. Find the number of cores on your machine and leave one spare for everything else happening on the machine.
2. Split the first data set into groups based on the number of cores.
3. Loop over, using mclapply using 3 cores at one time, performing the analysis.

Let’s now write and run the code.

# Find out the number of cores on my machine and keep one spare for everything else.
n_cores <- detectCores() - 1

# Generate vector and add it as the group column to data set.
group_vector <- rep(1:n_cores, length.out = nrow(sf_df_1))

# Split the data up by the groups
point_sample_list <- split(sf_df_1, group_vector)

# Now we are ready to calculate the distance in parallel.
system.time({
mclapply(point_sample_list,
function(x) {
st_distance(x, sf_df_2)
},
mc.cores = n_cores)
})
##    user  system elapsed
##  22.494   0.707  28.169

This is only a small test, on a small 4-core laptop; imagine the potential gains on servers with 32 cores.

I have plans to create a function which automatically parallises GIS analyses in R. However, for this post, I’ll leave it there.

1. This is essentially an R loop which will always return a ‘list’ of results.

2. Importantly, the mclapply function relies on forking or shared memory addresses which makes parallisation much more efficient as different cores can access the same data sets in memory; however, Linux and OSX are the only operating systems in which this is possible.