Techniques we’ll cover in this part:
- point-in-polygon aggregation with postcode areas and postcode districts
- calculating proportions of variant use and plotting choropleth maps
- interactive choropleth maps with mapview()
- smoothing via local spatial autocorrelation (hot spot analysis with the Getis-Ord Gi* statistic)
In this section we’ll be working with our first proper linguistic dataset, mainly to put many of the techniques covered so far into practice - but there are a few new things to cover as well.
We’re working with a sample of the responses to the long-running Our Dialects survey, which collects usage data on a range of dialectal features across the UK by asking questions about vocabulary (e.g. what you call a bread roll), pronunciation (e.g. whether certain word pairs rhyme) and grammar.
As usual, load in the sf, tidyverse, rmapshaper and mapview packages if they aren’t already (also pryr if you want to check the object size of your shapefiles before plotting):
library(sf)
library(tidyverse)
library(rmapshaper)
library(pryr)
library(mapview)
Read in the dialect survey dataset and start exploring it:
dialect <- read_csv("data/dialect_survey.csv")
It has 6000 rows (randomly sampled from the full 14,000-row dataset) and 38 columns:
dim(dialect)
## [1] 6000 38
Exercise
Explore the dataset to see what demographic and linguistic columns it contains, and how the different types of response are coded (try running table() on a particular column to list its unique values and the frequency of each).
Answers
Checking the columns of demographic metadata:
#these are the non-linguistic columns
dialect %>%
  select(point_id:region) %>%
  head()
## # A tibble: 6 × 8
## point_id sex age age.group dob lat long region
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 female 22 21 to 40 1992 53.4 -3.06 North West
## 2 2 female 24 21 to 40 1989 50.5 -3.73 South West
## 3 3 male 22 21 to 40 1991 50.4 -3.74 South West
## 4 4 male 87 > 80 1927 57.1 -2.07 Scotland
## 5 5 male 17 < 20 1997 57.1 -2.13 Scotland
## 6 6 female 22 21 to 40 1991 50.4 -3.77 South West
These are the 10 lexical variables:
#notice how missing values are coded as NA
dialect %>%
  select(bread:walkway) %>%
  head()
## # A tibble: 6 × 10
## bread furniture footwear clothing meal group frozen prank gum walkway
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 roll couch <NA> trousers tea you … <NA> <NA> <NA> <NA>
## 2 roll <NA> pumps trousers dinn… you <NA> <NA> <NA> <NA>
## 3 roll <NA> plimsol… trousers dinn… you <NA> <NA> <NA> <NA>
## 4 bap, muffi… couch, s… <NA> trousers dinn… you <NA> <NA> <NA> <NA>
## 5 bun settee <NA> trousers supp… yous <NA> <NA> <NA> <NA>
## 6 roll <NA> other (… trousers dinn… you <NA> <NA> <NA> <NA>
Checking how they’re coded:
#lexical questions are coded with the word they use - for people who reported
# use of multiple forms, these are comma separated. People were also allowed
# to select 'other' and type out a form that wasn't listed as an option
table(dialect$meal)
##
## dinner dinner, supper dinner, supper, tea
## 2483 56 25
## dinner, tea other (breakfast) other (evening meal)
## 385 1 3
## other (heaven on a plate) other (main meal) other (unspecified)
## 1 1 3
## supper supper, tea tea
## 285 11 2746
These are the 13 phonological variables:
dialect %>%
  select(foot.cut:thin.fin) %>%
  head()
## # A tibble: 6 × 13
## foot.cut class.farce for.more sauce.source one.gone book.spook fur.bear
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 rhyme <NA> rhyme don't rhyme rhyme don't rhy… rhyme
## 2 don't rhyme <NA> rhyme <NA> don't rh… don't rhy… don't r…
## 3 don't rhyme <NA> rhyme <NA> don't rh… don't rhy… don't r…
## 4 don't rhyme <NA> don't rhyme don't rhyme don't rh… rhyme don't r…
## 5 rhyme <NA> don't rhyme don't rhyme don't rh… rhyme don't r…
## 6 don't rhyme <NA> rhyme <NA> don't rh… don't rhy… don't r…
## # ℹ 6 more variables: pour.poor <chr>, eight.ate <chr>, singer.finger <chr>,
## # mute.moot <chr>, spa.spar <chr>, thin.fin <chr>
Checking how they’re coded:
#phonological questions are coded as either rhyme vs don't rhyme...
table(dialect$class.farce)
##
## don't rhyme rhyme
## 1746 1132
#...or same vs different, depending on the question wording
table(dialect$pour.poor)
##
## different same
## 1214 4786
These are the 7 grammatical variables:
dialect %>%
  select(give.it.me:neither.me) %>%
  head()
## # A tibble: 6 × 7
## give.it.me I.done.it look.at.them things.what not.got.it you.was neither.me
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 c a <NA> a <NA> b <NA>
## 2 b c b b <NA> <NA> <NA>
## 3 a d d d <NA> <NA> <NA>
## 4 c c <NA> c <NA> c <NA>
## 5 e e <NA> e <NA> e <NA>
## 6 c d d d <NA> <NA> <NA>
Checking how they’re coded:
#grammatical questions are coded on a scale of a-e,
#where a = "I'd use this form myself"
#and e = "A native speaker of English would never say this"
table(dialect$give.it.me)
##
## a b c d e
## 1494 1565 1719 959 263
Just like with the Twitter dataset, before we perform any geospatial analysis we first need to transform this into a shapefile based on the lat/long values
dialect <- dialect %>%
  st_as_sf(coords = c("long", "lat"))
Because we have the region column, we can already calculate some summary statistics, e.g. the proportion of people from Yorkshire who rhyme the words foot and cut (i.e. who don’t have the FOOT-STRUT split):
dialect %>%
  st_drop_geometry() %>%
  filter(region == 'Yorkshire and The Humber') %>%
  count(foot.cut) %>%
  mutate(rate = n / sum(n))
## # A tibble: 2 × 3
## foot.cut n rate
## <chr> <int> <dbl>
## 1 don't rhyme 164 0.206
## 2 rhyme 633 0.794
So about 20% of people from Yorkshire have the vowel split - compare this to people from the South East, where 96% of people report it!
dialect %>%
  st_drop_geometry() %>%
  filter(region == 'South East') %>%
  count(foot.cut) %>%
  mutate(rate = n / sum(n))
## # A tibble: 2 × 3
## foot.cut n rate
## <chr> <int> <dbl>
## 1 don't rhyme 600 0.955
## 2 rhyme 28 0.0446
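If you’d rather compare all of the regions in one go instead of filtering them one at a time, a grouped version of the same pipe does the job - this is just a sketch, and the filter on missing values is my own addition to keep the rates comparable:
#proportion of 'rhyme' vs 'don't rhyme' responses to foot.cut within each region
dialect %>%
  st_drop_geometry() %>%
  filter(!is.na(foot.cut)) %>%
  count(region, foot.cut) %>%
  group_by(region) %>%
  mutate(rate = n / sum(n))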
This is quite a blunt way of approaching the geographic analysis though - there are only 9 regions of England but we know that dialect patterns are much more local and finer-grained than this!
Let’s perform some point-in-polygon aggregation to add other geocoded columns we can use in the analysis. Let’s start with postcode area (e.g. the ‘YO’ in YO10 5DD). We used postcode areas in Part 2 of this workshop so you should already have the uk.areas shapefile in your R environment, but if not make sure you load it in first:
uk.areas <- read_sf("shapefiles/uk-areas.shp") %>%
  ms_simplify(keep = 0.01)
We can use the same methods from before to perform point-in-polygon aggregation:
- set the CRS of the dialect dataset, and make sure it’s the same as the uk.areas shapefile
- use st_make_valid() to ensure that the shapefile is topologically valid after we simplified the geometries
- use st_join() to overlay the two shapefiles and perform the point-in-polygon aggregation

st_crs(dialect) <- st_crs(uk.areas)

uk.areas <- st_make_valid(uk.areas)

dialect <- dialect %>%
  st_join(uk.areas, join = st_within)
This should have added a new column named pc_area, corresponding to the postcode area for each respondent.
Has the geocoding actually worked? Let’s look at a random sample of the data, focusing on the geometry column (our original lat/long values), the new pc_area column we’ve just geocoded, and the region column we already had in the dataset:
dialect %>%
  select(geometry, pc_area, region) %>%
  sample_n(10)
## Simple feature collection with 10 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -3.126306 ymin: 51.00715 xmax: -0.1171217 ymax: 53.85991
## Geodetic CRS: WGS 84
## # A tibble: 10 × 3
## geometry pc_area region
## * <POINT [°]> <chr> <chr>
## 1 (-2.395596 53.37376) WA North West
## 2 (-2.740788 53.40464) WA North West
## 3 (-2.289099 53.85991) BB North West
## 4 (-0.1171217 51.00715) RH South East
## 5 (-2.075188 52.35304) B West Midlands
## 6 (-0.3212205 53.76869) HU Yorkshire and The Humber
## 7 (-0.7636585 51.30798) GU South East
## 8 (-1.883778 52.43449) B West Midlands
## 9 (-1.121013 53.52484) DN Yorkshire and The Humber
## 10 (-3.126306 53.39714) CH North West
Well it’s done something, but I don’t have the encyclopedic knowledge of all latitude and longitude values to know if those postcode areas have been assigned correctly. Let’s plot only the responses geocoded as coming from ‘YO’ (Yorkshire), ‘M’ (Manchester) or ‘B’ (Birmingham)
ggplot(uk.areas) +
geom_sf() +
geom_sf(data = filter(dialect, pc_area %in% c('YO', 'M', 'B')),
aes(colour = pc_area)) +
theme_void()
Certainly looks right! We do have a few pesky NAs in the pc_area column though, i.e. points that apparently don’t overlap with any polygon in the uk.areas shapefile.
Let’s take a closer look by plotting only these points in an interactive map so we can really zoom in and see where they are
dialect %>%
  filter(is.na(pc_area)) %>%
  mapview()
Looks like they’re all points that lie on the coast - this is a case of the geometries not lining up perfectly (possibly because we simplified the uk.areas shapefile) and/or the lat/long values falling very slightly into the sea, so they’re technically not overlapping with any of the polygons. Let’s just get rid of them:
dialect <- dialect %>%
  filter(!is.na(pc_area))
Exercise
Postcode areas are ok, but let’s take it one step further for a more detailed level of geospatial analysis. Following the same methods as above, perform point-in-polygon aggregation for postcode districts (e.g. the ‘YO10’ in YO10 5DD) for each respondent - there are over 2800 of these across Great Britain so we’ll be able to make much more detailed maps.
You can download a copy of the uk.districts shapefile here. A brief note on labels: this shapefile is unfortunately missing the ‘BT’ postcode districts across Northern Ireland, so technically it’s a shapefile of Great Britain, not the United Kingdom, but I’m calling it uk.districts for consistency with uk.areas (which does contain the ‘BT’ postcode area).
Answers
#read in the postcode districts shapefile and simplify it straight away
uk.districts <- read_sf("shapefiles/uk-districts.shp") %>%
  ms_simplify(keep = 0.05)
ggplot(uk.districts) +
geom_sf()
#check that the CRS of 'dialect' matches this new shapefile
#(take note of the double == sign!)
st_crs(dialect) == st_crs(uk.districts)
## [1] TRUE
#it does match, so we can go straight ahead with the st_join() process
#after making the shapefile 'valid' first
uk.districts <- st_make_valid(uk.districts)
#perform the st_join() and filter out any responses that don't overlap with a postcode district
dialect <- dialect %>%
  st_join(uk.districts, join = st_within) %>%
  filter(!is.na(pc_dist))
#the dialect object now has a new column (labelled 'pc_dist') of postcode districts
#let's plot just the postcode districts/responses corresponding to Yorkshire
#(i.e. those beginning YO) - for those of you who like regular expressions,
#the filter() command below could be replaced with str_detect(pc_dist, 'YO[0-9]*')
uk.districts %>%
  filter(str_sub(pc_dist, 1, 2) == 'YO') %>%
  ggplot() +
  geom_sf() +
  geom_sf(data = filter(dialect, str_sub(pc_dist, 1, 2) == 'YO')) +
  theme_void()
Great - in addition to our actual geometry column (of lat/long points), we now have columns for:
- region (the broad UK region each respondent lives in)
- pc_area (their postcode area, e.g. ‘YO’)
- pc_dist (their postcode district, e.g. ‘YO10’)
This gives us a lot to work with when it comes to investigating regional patterns of dialect use. It means we can plot choropleth maps, i.e. polygons colour-coded with % use of a given linguistic variant, rather than colour-coded point maps (which can become over-crowded with large datasets, as we’ve already seen).
So how do we calculate these percentages? Choropleth maps can only plot the rate of use of a single variant of a single variable at once, so we need to pick what to plot first. How about people who rhyme the words class and farce, which represents a TRAP-BATH split (apologies to the non-sociolinguists among you who have no idea what a TRAP-BATH split is!)
We start by calculating the total number of responses to that variable in each postcode area (which means having to filter out the NAs so they aren’t counted). We’ll save this to a dataframe called totals.
totals <- dialect %>%
  st_drop_geometry() %>%
  filter(!is.na(class.farce)) %>%
  group_by(pc_area) %>%
  summarise(total = n())
totals
## # A tibble: 118 × 2
## pc_area total
## <chr> <int>
## 1 AB 12
## 2 AL 8
## 3 B 70
## 4 BA 28
## 5 BB 86
## 6 BD 36
## 7 BH 10
## 8 BL 76
## 9 BN 18
## 10 BR 5
## # ℹ 108 more rows
This is a workshop on mapping in R (not on the tidyverse itself!) so I don’t want to spend too much time covering more general data wrangling techniques, but this is a relatively complex ‘pipe chain’ of functions so let’s break it down line by line to see what we just did:
- take the dialect dataset and drop its geometries using st_drop_geometry() to turn it into a regular dataframe
- filter() out the missing responses in the column of interest (there are lots of missing responses in the dataset because the survey has gone through many iterations with different sets of questions each time)
- group_by() the pc_area column to split the analysis up by each postcode area
- summarise() to create a total column, which is just the total number of rows in each postcode area (after filtering out the NAs of course)

The next step is to count the number of times our chosen response - rhyme - occurs in each postcode area. The code is almost identical to what we just saw - the only difference is in the filter() command. We save this to a dataframe called counts:
counts <- dialect %>%
  st_drop_geometry() %>%
  filter(class.farce == 'rhyme') %>%
  group_by(pc_area) %>%
  summarise(n = n())
counts
## # A tibble: 108 × 2
## pc_area n
## <chr> <int>
## 1 AB 3
## 2 AL 7
## 3 B 21
## 4 BA 19
## 5 BB 5
## 6 BD 2
## 7 BH 10
## 8 BL 2
## 9 BN 15
## 10 BR 5
## # ℹ 98 more rows
Notice that the counts dataframe has only 108 rows, compared with 118 rows for totals. This is because counts only lists the postcode areas where the response of interest, in this case rhyme, is reported at least once. If this response is completely absent from a particular postcode area, it won’t be listed here.
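If you want to see exactly which postcode areas are missing, an anti_join() lists the rows of totals that have no match in counts (a quick optional check; the by = "pc_area" argument just makes the join column explicit):
#postcode areas with no 'rhyme' responses at all
anti_join(totals, counts, by = "pc_area")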
This means that when we join together the totals and counts dataframes using left_join(), those postcode areas missing from counts will have NA as their value. We want to replace these NAs with 0 to represent the fact that 0% of the survey participants from that area responded in this way.
The code below does this, and also calculates a prop column as before, which will give a value between 0 and 1 representing the proportion of respondents in each postcode area who rhyme the words. We’ll save this to a new dataframe called props.
props <- left_join(totals, counts) %>%
  replace_na(list(n = 0)) %>%
  mutate(prop = n / total)
props
## # A tibble: 118 × 4
## pc_area total n prop
## <chr> <int> <int> <dbl>
## 1 AB 12 3 0.25
## 2 AL 8 7 0.875
## 3 B 70 21 0.3
## 4 BA 28 19 0.679
## 5 BB 86 5 0.0581
## 6 BD 36 2 0.0556
## 7 BH 10 10 1
## 8 BL 76 2 0.0263
## 9 BN 18 15 0.833
## 10 BR 5 5 1
## # ℹ 108 more rows
Then we can join this dataframe back to our uk.areas shapefile to plot a choropleth map colour-coded by this new prop column:
uk.areas %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  theme_void()
Cool!
The process is the same for grammatical variables: pick your variable (i.e. the column name) and the response you want to plot when performing the filter for the counts dataframe, e.g. filter(give.it.me == 'a'). You might want to plot people who responded with ‘a’ or ‘b’ (which means they either use the phrase themselves, or have heard it used in their area), in which case you’d just change the filter() command to filter(give.it.me == 'a' | give.it.me == 'b').
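As a sketch of what that counts step might look like in full for a grammatical variable (the totals step is the same as before, just filtering out NAs in give.it.me instead):
#count people who answered 'a' or 'b' to the give.it.me question in each postcode area
counts <- dialect %>%
  st_drop_geometry() %>%
  filter(give.it.me == 'a' | give.it.me == 'b') %>%
  group_by(pc_area) %>%
  summarise(n = n())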
The process is also the same for lexical variables, but remember that a filter like filter(bread == 'muffin') will only plot people who report use of muffin and no other variant. Recall that many people reported use of more than one variant. To plot everyone who uses a particular variant, whether exclusively or not, you would simply change the filter to filter(str_detect(bread, 'muffin')), using str_detect() to keep rows that contain the string ‘muffin’ somewhere in the response.
Let’s try this 3-step process again to demonstrate:
totals <- dialect %>%
  st_drop_geometry() %>%
  filter(!is.na(bread)) %>%
  group_by(pc_area) %>%
  summarise(total = n())

counts <- dialect %>%
  st_drop_geometry() %>%
  filter(str_detect(bread, 'muffin')) %>% #notice str_detect() here
  group_by(pc_area) %>%
  summarise(n = n())

props <- left_join(totals, counts) %>%
  replace_na(list(n = 0)) %>%
  mutate(prop = n / total)
uk.areas %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  theme_void()
Looks like it’s fairly localised! Remember we can use mapview() to generate an interactive map instead, which allows us to zoom in and see exactly which postcode areas use muffin. Because we specify label = 'pc_area', hovering the mouse over each polygon reveals the postcode area label, but you can set this to label = 'prop' instead if you want the hover-over text to display the actual proportion itself.
uk.areas %>%
  left_join(props) %>%
  mapview(zcol = 'prop', label = 'pc_area')
Exercise
Now it’s time to try it yourself: choose different dialect features/variants and make some choropleth maps based on postcode areas and also postcode districts - what are the pros/cons of each type? What interesting patterns of regional variation do you find?
Answers
#looking at foot-cut by postcode district
totals <- dialect %>%
  st_drop_geometry() %>%
  filter(!is.na(foot.cut)) %>%
  group_by(pc_dist) %>%
  summarise(total = n())

counts <- dialect %>%
  st_drop_geometry() %>%
  filter(foot.cut == "rhyme") %>%
  group_by(pc_dist) %>%
  count()

props <- left_join(totals, counts) %>%
  replace_na(list(n = 0)) %>%
  mutate(prop = n / total)

uk.districts %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  theme_void()
You might have noticed that with a more fine-grained level of aggregation, i.e. over postcode districts rather than areas, we get (1) more missing values, i.e. postcode districts with no responses, and (2) noisier values, i.e. postcode districts where you get extreme and unreliable percentages because there are only 1 or 2 respondents.
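To get a feel for just how sparse the data are at this level, a quick check on the props dataframe we just built counts how many districts rest on only one or two responses (a sketch - the threshold of 3 respondents is arbitrary):
#how many postcode districts have fewer than 3 respondents for this variable?
props %>%
  summarise(n_districts = n(),
            n_sparse = sum(total < 3))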
One solution is to perform a kind of smoothing called local spatial autocorrelation. This isn’t just to make prettier maps with less noise: it’s also a robust method for investigating the spatial structure of your data by looking for hot spots or cold spots (you sometimes see this referred to as ‘hot spot analysis’ for that reason). In other words, it looks for polygons/areas that not only have a particularly high (or low) value for your chosen variable, but that are also surrounded by nearby polygons/areas that also have particularly high (or low) values.
This is a technique we used in our recent publication using the Our Dialects survey data, and I’ll walk you through the main steps here.
We need a new package in our arsenal for this: spdep. Let’s install and load it in:
install.packages("spdep")
library(spdep)
Step 1: We first need the raw data, so generate your props dataframe just as we did before. Let’s look at who calls a bread roll a cob:
totals <- dialect %>%
  st_drop_geometry() %>%
  filter(!is.na(bread)) %>%
  group_by(pc_dist) %>%
  summarise(total = n())

counts <- dialect %>%
  st_drop_geometry() %>%
  filter(str_detect(bread, 'cob')) %>%
  group_by(pc_dist) %>%
  count()

props <- left_join(totals, counts) %>%
  replace_na(list(n = 0)) %>%
  mutate(prop = n / total)

uk.districts %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  theme_void()
Step 2: use st_centroid() to calculate the geographic centroid of each polygon in your shapefile (i.e. each postcode area or district). In this example, we’ll use postcode districts:
centroids <- uk.districts %>%
  st_centroid() %>%
  st_coordinates() %>%
  as.data.frame() %>%
  rename(centroid_long = X, centroid_lat = Y) %>%
  mutate(pc_dist = uk.districts$pc_dist)
head(centroids)
## centroid_long centroid_lat pc_dist
## 1 -2.124527 57.13125 AB10
## 2 -2.080886 57.13860 AB11
## 3 -2.162241 57.08594 AB12
## 4 -2.241504 57.11100 AB13
## 5 -2.293247 57.10742 AB14
## 6 -2.211241 57.14225 AB15
Here’s what those centroids look like (zoomed in on the North West to make it clearer)
ggplot(uk.districts) +
geom_sf() +
geom_point(data = centroids, aes(x = centroid_long, y = centroid_lat),
size = 0.5, alpha = 0.5) +
coord_sf(ylim = c(53, 54), xlim = c(-3.5, -1.5)) +
theme_void()
Step 3: calculate a spatial weights matrix, which we need to perform the spatial autocorrelation smoothing. This involves taking each polygon centroid, as we just calculated, and working out its k nearest neighbours and its distance from those neighbours. What’s k? Whatever you want it to be! The higher we set k, the stronger the smoothing.
We can see below, for the postcode districts around Greater Manchester, how increasing the size of k means we take into account more and more neighbouring polygons.
Let’s set k to 25 for now, and go ahead and make our spatial weights matrix:
#add centroids to our props dataframe, and transform into a matrix
props_centroid_matrix <- props %>%
  left_join(centroids) %>%
  select(centroid_long, centroid_lat) %>%
  as.matrix()

#find nearest neighbours - we set k to 25 here
neighbours <- knn2nb(knearneigh(props_centroid_matrix, k = 25, longlat = TRUE))

#calculate distances between neighbours
distances <- nbdists(neighbours, props_centroid_matrix, longlat = TRUE)
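If you want to sanity-check the neighbour structure before going any further, spdep’s plot method for neighbour objects draws a line between each centroid and its k nearest neighbours (an optional check, not part of the original workflow):
#draw the links between each centroid and its 25 nearest neighbours
plot(neighbours, props_centroid_matrix)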
Step 4: set the weights in the spatial weights matrix to be the reciprocal distance (1/x), so that closer neighbours = smaller distance = higher weight = more influence in the smoothing. In other words, we don’t just consider the 25 nearest neighbours equally - we apply more weight to the nearer of those neighbours.
#weighting of the neighbours by reciprocal distance
recip_distances <- lapply(distances, function(x) 1/x)

#return this in a format for spdep
swm <- nb2listw(neighbours, recip_distances)
Step 5: Calculate the Getis-Ord Gi* statistic, the technical term for our hotspot analysis, using localG() and providing it with our aggregated values (in the prop column of our props dataframe) and the spatial weights matrix we just made (swm). Here, we add this back to the props dataframe in a column called z:
props <- props %>%
  mutate(z = as.numeric(localG(prop, swm)))
Now let’s plot the values of z and marvel at the smoothness! Note how we’ve used scale_fill_viridis(), which is a nice colour palette for reasons explained here - you first need to install and load the viridis package if you want to use it though.
#load viridis for the scale_fill_viridis() colour palette
library(viridis)

uk.districts %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = z)) +
  scale_fill_viridis() +
  theme_void()
To make comparisons easier, let’s plot the raw map and smoothed map next to each other (install and load in the patchwork package to display plots side-by-side like this):
#load patchwork to display plots side-by-side
library(patchwork)

raw.map <- uk.districts %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  scale_fill_viridis() +
  theme_void()

smooth.map <- uk.districts %>%
  left_join(props) %>%
  ggplot() +
  geom_sf(aes(fill = z)) +
  scale_fill_viridis() +
  theme_void()

raw.map + smooth.map
Notice how the smoothing algorithm has smoothed over those random yellow-coloured polygons dotted here and there, and instead highlighted a hotspot focused around the obvious cluster in the East Midlands. This is good! Those isolated yellow postcode districts are almost certainly outliers - rates misleadingly calculated at 100%, most likely on the basis of a single respondent. At this level of geographic granularity, there’s no way you’d genuinely get such a localised pattern specific to one tiny postcode district and not the surrounding ones! (but if anyone knows of such a dialectal pattern, I’m all ears).
Let’s zoom into the Midlands to get a clearer comparison:
raw.map <- raw.map + coord_sf(ylim = c(52, 54), xlim = c(-3, 0))
smooth.map <- smooth.map + coord_sf(ylim = c(52, 54), xlim = c(-3, 0))

raw.map + smooth.map
The smoothed map isn’t perfect. There are still lots of grey/blank polygons, because localG() can’t do anything with geometries that have missing values. We would have to manually interpolate these values, which is possible (and what we did in the Journal of Linguistic Geography paper) but that is sadly beyond the scope of this workshop!
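If you’re curious just how many districts end up blank, a quick count of the joined data makes the scale of the problem explicit (a sketch reusing the objects from above):
#how many postcode districts have no smoothed value to plot?
uk.districts %>%
  left_join(props) %>%
  st_drop_geometry() %>%
  count(blank = is.na(z))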
Exercise
Plot the smoothed map with mapview() so you can zoom in and explore the smoothing.
Still here? Well here’s some bonus content for you!
Conducting hot spot analysis in the way we just did can sometimes introduce a weird smoothing artefact that I like to call the ‘donut problem’. Compare the raw and smoothed maps below (k = 5 in the hotspot analysis)
When we get one isolated polygon with a high value, surrounded by others with lower values, the smoothing actually creates a donut-type pattern where the surrounding polygons get higher z-scores than the polygon that did actually have the high value to begin with! This doesn’t seem like a very good system, and it arises because of how the nearest neighbour calculation works: each polygon is assigned a z-score based on the values of its k nearest neighbours, as we’ve already established, but in the default version of this algorithm each polygon doesn’t take itself into account.
We can fix this if we change the nearest neighbour search as shown below. Notice we lower the value of k by 1, because we then use include.self() so that each polygon includes itself alongside its 4 other nearest neighbours.
neighbours <- knn2nb(knearneigh(matrix, k = 4, longlat = TRUE))
neighbours <- include.self(neighbours)
distances <- nbdists(neighbours, matrix, longlat = TRUE)
recip_distances <- lapply(distances, function(x) 1/x)
However, we run into a problem when we compute a spatial weights matrix based on reciprocal distance, because each polygon now has itself as a nearest neighbour, which obviously has a distance from itself of 0, and we can’t take the reciprocal of 0!
Solution? Before we calculate the reciprocal distances, we run the following loop to go through each set of distances and replace the 0 with the next smallest distance in the set divided by 2 (the division by 2 is somewhat arbitrary - the replacement just needs to be smaller than the genuine distances but greater than 0).
for (x in 1:length(distances)) {
  distances[[x]] = replace(distances[[x]],
                           distances[[x]] == 0,
                           sort(distances[[x]])[2]/2)
}
If we add the above into our workflow, here’s what the new smoothing looks like:
Hooray, no donuts!