That’s it - we won’t be covering any new GIS techniques in this part. Instead, it’s designed as a sandbox for you to put all your new skills to the test by playing around with an entirely different kind of linguistic dataset.

I’ve downloaded the entire database from The World Atlas of Language Structures Online (WALS), which is freely available on their GitHub repo (although in a format that requires lots of pre-processing - luckily I’ve done the hard work for you!)

Load it in using read_csv() (remember you can download the datasets needed for this workshop here):

wals <- read_csv("data/wals_clean.csv")

Let’s get a quick feel for the structure and content of this dataset before you go off and do your own thing with it:

head(wals)
## # A tibble: 6 × 11
##   lang_id lang    feature_id feature feature_val_id feature_val feature_val_desc
##   <chr>   <chr>   <chr>      <chr>            <dbl> <chr>       <chr>           
## 1 abi     Abipón  100A       Alignm…            472 Accusative  Accusative alig…
## 2 abk     Abkhaz  100A       Alignm…            473 Ergative    Ergative alignm…
## 3 abn     Arabana 100A       Alignm…            471 Neutral     Neutral alignme…
## 4 abu     Abun    100A       Alignm…            471 Neutral     Neutral alignme…
## 5 ace     Acehne… 100A       Alignm…            474 Active      Active alignment
## 6 acm     Achuma… 100A       Alignm…            472 Accusative  Accusative alig…
## # ℹ 4 more variables: macro_area <chr>, family <chr>, latitude <dbl>,
## #   longitude <dbl>

It has over 76,000 rows because it’s in ‘long format’, i.e. each language-feature pair is on its own row:

wals %>% nrow()
## [1] 76477
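If you’d rather have one row per language and one column per feature (handy when comparing features against each other later), the long table can be reshaped. Here’s a minimal sketch on a small made-up tibble, assuming tidyr’s pivot_wider(); the language names and values below are invented for illustration and are not taken from WALS:

```r
library(dplyr)
library(tidyr)

# Toy long-format data: one row per language-feature pair (values are made up)
toy_long <- tibble(
  lang        = c("LangA", "LangA", "LangB"),
  feature     = c("Uvular Consonants", "Polar Questions", "Uvular Consonants"),
  feature_val = c("None", "Question particle", "Uvular stops only")
)

# Reshape to wide format: one row per language, one column per feature.
# Language-feature pairs that are missing in the long table become NA.
toy_wide <- toy_long %>%
  pivot_wider(names_from = feature, values_from = feature_val)

toy_wide
```

On the real wals table you would probably want to select() just lang, feature and feature_val first, since pivot_wider() treats every remaining column as an identifier.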

There are 193 different typological features:

wals %>% 
  select(feature) %>% 
  unique() %>% 
  nrow()
## [1] 193

Let’s look at 10 random features as an example:

unique(wals$feature) %>% sample(10)
##  [1] "NegSVO Order"                                                                                 
##  [2] "Order of Subject and Verb"                                                                    
##  [3] "Asymmetrical Case-Marking"                                                                    
##  [4] "Suppletion in Imperatives and Hortatives"                                                     
##  [5] "Red and Yellow"                                                                               
##  [6] "Nasal Vowels in West Africa"                                                                  
##  [7] "Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase"
##  [8] "Question Particles in Sign Languages"                                                         
##  [9] "The Position of Negative Morphemes in Object-Initial Languages"                               
## [10] "Polar Questions"

And there are over 2,600 languages (although not all languages have a value for all features):

wals %>% 
  select(lang) %>% 
  unique() %>% 
  nrow()
## [1] 2662
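Because coverage is uneven, it can be worth checking how many features each language actually has a value for before you pick something to map. A small sketch on invented data (the language and feature labels are placeholders), using dplyr’s count() and n_distinct():

```r
library(dplyr)

# Toy long-format data: LangA has three features, LangC two, LangB one
toy <- tibble(
  lang    = c("LangA", "LangA", "LangA", "LangB", "LangC", "LangC"),
  feature = c("F1", "F2", "F3", "F1", "F1", "F3")
)

# Number of features recorded per language, best-covered languages first
coverage <- toy %>%
  count(lang, name = "n_features", sort = TRUE)

coverage

# n_distinct() is a shorter way of counting unique values in one column
n_distinct(toy$lang)
```

The same pattern with count(feature) would tell you which features are recorded for the most languages.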

What other info do we have?


The actual value of a feature is in the feature_val column (with a further description of that value in feature_val_desc). We have the macro_area of the language, and its family, as well as some geographic information of course (in latitude and longitude).

Remember this is just a plain csv, so we first need to convert it into a spatial-type object so that we can perform geospatial operations on it.

#convert from plain dataframe into sf object
wals <- wals %>%
  st_as_sf(coords = c("longitude", "latitude"))

#set CRS
st_crs(wals) <- 4326
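The two steps above can also be done in a single call, since st_as_sf() accepts a crs argument. A minimal sketch on two made-up points (the names and coordinates are invented for illustration):

```r
library(sf)

# Toy table with plain numeric coordinate columns
toy_pts <- data.frame(
  lang      = c("LangA", "LangB"),
  longitude = c(-58.0, 41.0),
  latitude  = c(-29.0, 43.0)
)

# coords expects x first, then y, so longitude comes before latitude.
# crs = 4326 declares the coordinates as WGS84 longitude/latitude.
toy_sf <- st_as_sf(toy_pts, coords = c("longitude", "latitude"), crs = 4326)

# Check that the CRS was attached
st_crs(toy_sf)$epsg
```

Note that the coordinate columns are dropped once they become the geometry; pass remove = FALSE if you want to keep them as ordinary columns too.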

As an example, let’s plot the distribution of the ‘Uvular Consonants’ feature:

wals %>%
  filter(feature == 'Uvular Consonants') %>%
  mapview(zcol = 'feature_val', label = 'lang')

We can also plot a static map using ggplot:

world %>%
  ggplot() +
  geom_sf() +
  geom_sf(data = filter(wals, feature == 'Uvular Consonants'), aes(colour = feature_val)) +
  theme_void() +
  theme(legend.position = 'bottom')

Exercise

Now it’s time to explore! Here are some ideas of things to look at:

  1. Plot various maps (static or interactive) of different features - do any of them show interesting regional patterning or clustering? To what extent does this just reflect genetic relationships and the regional distribution of language families rather than actual geographic diffusion (arising from e.g. contact)?
  2. How about looking at co-variation between different typological features? e.g. is there a correlation between the size of a language’s consonant inventory and the size of its vowel inventory?
  3. Try searching for other world-level databases containing information like average altitude or various demographic statistics - do you find any random correlations between these country-level statistics and linguistic features? (Remember that claim that languages in high-altitude settings tend to have more ejective consonants?)
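For idea 2, one possible starting point is a simple cross-tabulation of two features. The sketch below runs on invented data: the four languages and their values are made up, and although ‘Consonant Inventories’ and ‘Vowel Quality Inventories’ are WALS feature names, it’s worth checking unique(wals$feature) for the exact spelling in this dataset.

```r
library(dplyr)
library(tidyr)

# Toy long-format data: two features for each of four made-up languages
toy <- tibble(
  lang        = rep(c("LangA", "LangB", "LangC", "LangD"), each = 2),
  feature     = rep(c("Consonant Inventories", "Vowel Quality Inventories"),
                    times = 4),
  feature_val = c("Small", "Small",
                  "Large", "Large",
                  "Large", "Small",
                  "Small", "Small")
)

# Put the two features side by side, keep languages that have both,
# then count how often each combination of values occurs
crosstab <- toy %>%
  pivot_wider(names_from = feature, values_from = feature_val) %>%
  drop_na() %>%
  count(`Consonant Inventories`, `Vowel Quality Inventories`)

crosstab
```

If wals has already been converted to an sf object, calling st_drop_geometry() and selecting just lang, feature and feature_val first keeps the reshape simple.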

I’ve already prepared some datasets you can load in for idea 3 - in the data folder you’ll find the following CSV files:

  • elevation_stats.csv: column for average elevation in metres (avg_elevation) - downloaded from here
  • carrots_turnips.csv: columns for total production of carrots/turnips in tonnes (production_tons), production per person in kg (production_pp_kg), total acreage in hectares (acreage_hectare) and yield in kg per hectare (yield_kg_hc) - downloaded from here
  • land_stats.csv: columns for proportion of arable land (arable_land_prop), proportion of crop cover (crop_cover_prop) and proportion of forest cover (forest_cover_prop) - downloaded from here
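To link country-level statistics to individual languages, one option is a spatial join: find which country polygon each language point falls in, then attach the statistics by country name. The sketch below is only an illustration under stated assumptions: the two square ‘countries’, the languages and the elevation values are all made up, and the country column name is hypothetical (check which column holds the country name in the real files and in your basemap).

```r
library(sf)
library(dplyr)

# Helper for building a toy square polygon (not real borders)
square <- function(xmin, ymin, size = 10) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}

# Two made-up countries as an sf polygon layer
toy_world <- st_sf(
  country  = c("Lowland", "Highland"),
  geometry = st_sfc(square(0, 0), square(20, 0), crs = 4326)
)

# Three made-up language points; LangC falls outside both polygons
toy_langs <- st_as_sf(
  data.frame(lang      = c("LangA", "LangB", "LangC"),
             longitude = c(5, 25, 50),
             latitude  = c(5, 5, 5)),
  coords = c("longitude", "latitude"), crs = 4326
)

# Made-up country-level statistics (avg_elevation mirrors elevation_stats.csv)
toy_elev <- tibble(country = c("Lowland", "Highland"),
                   avg_elevation = c(120, 2400))

# Spatial join adds the country each point falls in (NA if none),
# then an ordinary left_join attaches the statistics by country name
joined <- toy_langs %>%
  st_join(toy_world) %>%
  left_join(toy_elev, by = "country")

joined
```

Languages that don’t fall inside any polygon (e.g. points just off a coastline) end up with NA, so it’s worth counting those before reading anything into the results.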

So many options… the world (atlas of language structures) is your oyster!