We are grateful to Rachel Spicer, Post Doctoral Fellow at the London School of Economics and Political Science, for feedback on enhancing accessibility and reproducibility of this vignette.
In psychological studies, decades of analyses of data originating from primarily Western, Educated, Industrial, Rich and Democratic (WEIRD) societies have contributed to the lack of replicability and generalizability of findings (Henrich et al., 2010). This has motivated the development of metrics that quantify meaningful cultural differences between societies (Muthukrishna et al., 2020), which bear on the validity of porting results obtained from analyzing one population over to another different population.
2D non-metric multidimensional scaling (NMDS) plot generated from a distance matrix summarising how culturally distinct regions are. Each entry of the distance matrix reports the cultural fixation index (\(\text{CF}_{ST}\)) between two regions. Also shown are distances within the regions, which capture within-nation variation. Figure reproduced from Muthukrishna et al. (2020), with permission from Michael Muthukrishna.
In contrast to between-society differences that reflect important cultural heterogeneities, differences within putatively singular populations or societies could suggest the presence of distinct communities that reflect meaningful sociological or demographic characteristics. For example, it is known that genealogical communities exist within the United States, and the persistence of homogamous mating in such communities (the determinants of which include, some scholars have suggested, geographic isolation and homophilous sorting by transmitted cultural traits; see Kalmijn, 1998 and Feldman and Cavalli-Sforza, 1981) is manifested in genetic differences (Han et al., 2017). See the Figure below and pay attention to the French Canadians, the Amish, the New Mexicans and the Scandinavians.
Distribution of ancestral birth locations in North America associated with clusters of individuals. A higher odds ratio (OR) implies that most present-day descendants of individuals from that ancestral birth location belong to the same cluster. Figure reproduced from Han et al. (2017) under the Creative Commons Attribution (CC-BY) license.
📖 In this vignette, we will:
We obtain publicly available survey questionnaire response data from the World Values Survey (WVS). The WVS is a research program aimed at understanding how beliefs, values and norms change over time and across nations globally; it is the largest non-commercial, cross-national, and time-series survey of public attitudes and values around the world. Motivated by the question of within-society differences, we focus on interrogating the cultural homogeneity of Singaporeans based on a representative collection of responses to the WVS questionnaire. To illustrate our method we use data from Wave 6 of the WVS that was released in 2012. (A report by Mathew, Teo, Tay and Wang, 2021 analyzing Singaporeans from 2020’s Wave 7 of the WVS has been published by the Institute of Policy Studies; here is a press release.)
To download the WVS data, please read the instructions in our Github repo.
The code below loads and describes the WVS data before running data processing steps.
## Load packages
library(tidyverse)
## Load the data
## This file can be downloaded with registration from the WVS website
wvs <- readRDS("ex_data/WVS/F00007762-WV6_Data_R_v20180912.Rds")
# How many samples and how many features?
dim(wvs)
#> [1] 89565 440
# Show the wvs tibble before processing...
wvs
#> # A tibble: 89,565 × 440
#> V1 V2 V2A V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> <labe> <labe> <labe> <lab> <lab> <lab> <lab> <lab> <lab> <lab> <lab> <lab> <lab>
#> 1 6 12 12 1 1 1 1 -2 1 1 2 1 1
#> 2 6 12 12 2 1 2 3 4 2 2 2 2 2
#> 3 6 12 12 3 1 3 2 4 2 1 2 2 2
#> 4 6 12 12 4 1 1 3 4 3 1 2 1 2
#> 5 6 12 12 5 1 1 1 2 1 1 1 3 2
#> 6 6 12 12 6 1 2 2 2 4 1 2 1 2
#> 7 6 12 12 7 1 1 1 1 1 1 2 2 1
#> 8 6 12 12 8 1 1 1 1 2 2 2 1 2
#> 9 6 12 12 9 1 1 1 2 2 2 2 2 2
#> 10 6 12 12 10 1 1 1 2 1 1 1 1 2
#> # … with 89,555 more rows, and 427 more variables: V13 <labelled>, V14 <labelled>,
#> # V15 <labelled>, V16 <labelled>, V17 <labelled>, V18 <labelled>,
#> # V19 <labelled>, V20 <labelled>, V21 <labelled>, V22 <labelled>,
#> # V23 <labelled>, V24 <labelled>, V25 <labelled>, V26 <labelled>,
#> # V27 <labelled>, V28 <labelled>, V29 <labelled>, V30 <labelled>,
#> # V31 <labelled>, V32 <labelled>, V33 <labelled>, V34 <labelled>,
#> # V35 <labelled>, V36 <labelled>, V37 <labelled>, V38 <labelled>, …
# What doe each feature correspond to?
feature_anno <- wvs %>% map_chr(~attributes(.)$label)
head(feature_anno)
#> V1 V2
#> "Wave" "Country Code"
#> V2A V3
#> "Country/regions [with split ups]" "Interview number"
#> V4 V5
#> "Important in life: Family" "Important in life: Friends"
# What are the labels (i.e., categories) attached to feature V2?
# [!] Numbers do not indicate the number of samples with that label in the full dataset
attr(wvs$V2, "labels")
#> Missing; Unknown Not asked in survey Not applicable No answer
#> -5 -4 -3 -2
#> Don't know Algeria Azerbaijan Argentina
#> -1 12 31 32
#> Australia Armenia Brazil Belarus
#> 36 51 76 112
#> Chile China Taiwan Colombia
#> 152 156 158 170
#> Cyprus Ecuador Estonia Georgia
#> 196 218 233 268
#> Palestine Germany Ghana Haiti
#> 275 276 288 332
#> Hong Kong India Iraq Japan
#> 344 356 368 392
#> Kazakhstan Jordan South Korea Kuwait
#> 398 400 410 414
#> Kyrgyzstan Lebanon Libya Malaysia
#> 417 422 434 458
#> Mexico Morocco Netherlands New Zealand
#> 484 504 528 554
#> Nigeria Pakistan Peru Philippines
#> 566 586 604 608
#> Poland Qatar Romania Russia
#> 616 634 642 643
#> Rwanda Singapore Slovenia South Africa
#> 646 702 705 710
#> Zimbabwe Spain Sweden Thailand
#> 716 724 752 764
#> Trinidad and Tobago Tunisia Turkey Ukraine
#> 780 788 792 804
#> Egypt United States Uruguay Uzbekistan
#> 818 840 858 860
#> Yemen
#> 887
We mirror the approach taken by Muthukrishna et al. (2020) — itself built on the work of Bell, Richerson and McElreath (2009) — to include only a subset of the questions as features and to recode responses to those questions. Concretely, we
We provide scripts accomplishing this in our Github repo.
To facilitate our presentation, we have saved data generated from performing Steps 1 and 2 above. Below is code to load and preview this saved file.
## Load the data
load("ex_data/WVS/all_MM_num_filtered_recode.RData")
## View data
wv_num_filtered_[1:10, 1:10]
#> Wave Country Code Country/regions [with split ups] Interview number
#> 1 2010-2013 Algeria Algeria 1
#> 2 2010-2013 Algeria Algeria 2
#> 3 2010-2013 Algeria Algeria 3
#> 4 2010-2013 Algeria Algeria 4
#> 5 2010-2013 Algeria Algeria 5
#> 6 2010-2013 Algeria Algeria 6
#> 7 2010-2013 Algeria Algeria 7
#> 8 2010-2013 Algeria Algeria 8
#> 9 2010-2013 Algeria Algeria 9
#> 10 2010-2013 Algeria Algeria 10
#> Important in life: Family Important in life: Friends
#> 1 1 1
#> 2 1 1
#> 3 1 2
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> Important in life: Leisure time Important in life: Politics
#> 1 1 NA
#> 2 2 2
#> 3 1 2
#> 4 2 2
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> Important in life: Work Important in life: Religion
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 2 1
#> 5 1 1
#> 6 2 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
We restrict our samples to individuals from Singapore. We obtain a dataset \(\mathbf{X}\) with \(N=1715\) observations and \(51\) categorical features, the first four of which are individual metadata. Thus there are \(P=47\) questionnaire response features in total.
## Select individuals from Singapore
wv_filtered_sg <- wv_num_filtered_ %>%
filter(`Country Code` == "Singapore") %>%
select_if(function(x) !(all(is.na(x)))) %>%
remove_missing() %>%
mutate(across(where(is.numeric), as.factor))
#> Error in filter(., `Country Code` == "Singapore"): object 'Country Code' not found
## View data
wv_filtered_sg[1:10, 1:10]
#> Wave Country Code Country/regions [with split ups] Interview number
#> 1 2010-2013 Singapore Singapore 1001
#> 2 2010-2013 Singapore Singapore 1002
#> 3 2010-2013 Singapore Singapore 1003
#> 5 2010-2013 Singapore Singapore 1005
#> 7 2010-2013 Singapore Singapore 1007
#> 8 2010-2013 Singapore Singapore 1008
#> 9 2010-2013 Singapore Singapore 1009
#> 10 2010-2013 Singapore Singapore 1010
#> 11 2010-2013 Singapore Singapore 1011
#> 12 2010-2013 Singapore Singapore 1012
#> Important in life: Family Important in life: Friends
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 5 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> 11 1 1
#> 12 1 1
#> Important in life: Leisure time Important in life: Politics
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 5 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> 11 1 1
#> 12 1 2
#> Important in life: Work Important in life: Religion
#> 1 1 2
#> 2 1 1
#> 3 1 2
#> 5 1 1
#> 7 1 2
#> 8 1 1
#> 9 1 2
#> 10 1 1
#> 11 1 1
#> 12 1 1
## Get data dimensions
dim(wv_filtered_sg)
#> [1] 1715 51
A challenge in measuring distances between categorical survey response variables is to determine which distance correctly measures “how far apart” we believe different pairs of categories should be. For example, should “strongly agree” and “not applicable” be more far apart than “strongly disagree” and “not applicable”? (This always depends on the downstream task; e.g., are distances used for visualization, or for classification, or something else.)
To compute Hamming distances between categorical variables, we perform one-hot encoding of all categorical variables with more than two categories. One-hot encoding means we assume that distances between pairs of categories are agnostic to the categories themselves. For example, consider the variable “Justifiable: Homosexuality”. After recoding there are three categories, 0, 1 and 2. Since differences between categories 0, 1 and 2 are assumed not to depend on the choice of categories, we transform 0, 1 and 2 into \(3\)-bits 100, 010 and 001, so that when computing Hamming distances between the categories, differences between any pair of categories is \(1\).
## Binarize array and generate block delimiters
library(mltools) # for one_hot to work
wv_filtered_sg_bin <- one_hot(data.table::as.data.table(wv_filtered_sg[,5:51]))
## Check dimension of one-hot encoded matrix
dim(wv_filtered_sg_bin)
#> [1] 1715 120
## Get block delimiters
block_membership <- c()
for (p in 1:47) {
# Count number of levels in the factor
k <- length(levels(wv_filtered_sg[,p+4]))
# Add to block delimiter vector accordingly
block_membership <- c(block_membership,
rep(p,k))
}
block_delims <- c()
for (p in 1:47) {
# Find position of first occurrence of p in block_membership
block_delims <- c(block_delims,match(p,block_membership))
}
Before we run any tests of exchangeability, let us visualize, using a heatmap and a dendrogram, how responses to the different survey questions correlate with one another. To generate our dendrogram, we perform hierarchical clustering with Ward linkage and distance matrix constructed from sample-sample correlations. (See p. 21 of Charrad et al., 2014 for an explanation of the Ward linkage.) Note that we do not need the one-hot encoded version of \(\mathbf{X}\) to accomplish this step.
## Generate a dendrogram
library(dendextend) # for ladderize to work
features_hclust_Euclidean_ward <- hclust(as.dist(1-cor(t(apply(wv_filtered_sg[,-(1:4)], 1, as.numeric)))),
method = "ward.D")
features_dend <- as.dendrogram(features_hclust_Euclidean_ward) # generate dendrogram for plotting heat map with
# Generate heat map to visualize correlations
library(RColorBrewer) # for brewer.pal to work
color.scheme <- rev(brewer.pal(10,"RdBu")) # generate the color scheme to use
heatmap(cor(t(apply(wv_filtered_sg[,-(1:4)], 1, as.numeric))),
Rowv = ladderize(features_dend),
Colv = ladderize(features_dend),
cexRow = 0.3,
cexCol = 0.3,
margins = c(2,2))
Even though we have to zoom into the figure for details, we already see that there are groups of features that are more correlated with one another. Here are three examples:
The questions corresponding to the IDs are printed below.
## Print table of responses with IDs
kableExtra::kable(data.frame(ID = 1:length(rownames(t(wv_filtered_sg[,-(1:4)]))),
Question = rownames(t(wv_filtered_sg[,-(1:4)]))))
ID | Question |
---|---|
1 | Important in life: Family |
2 | Important in life: Friends |
3 | Important in life: Leisure time |
4 | Important in life: Politics |
5 | Important in life: Work |
6 | Important in life: Religion |
7 | Important child qualities: independence |
8 | Important child qualities: Hard work |
9 | Important child qualities: Feeling of responsibility |
10 | Important child qualities: Imagination |
11 | Important child qualities: Tolerance and respect for other people |
12 | Important child qualities: Thrift saving money and things |
13 | Important child qualities: Religious faith |
14 | Important child qualities: Unselfishness |
15 | Important child qualities: Obedience |
16 | Most people can be trusted |
17 | On the whole, men make better political leaders than women do |
18 | Aims of country: first choice |
19 | Aims of respondent: first choice |
20 | Most important: first choice |
21 | Future changes: Less importance placed on work in our lives |
22 | Future changes: Greater respect for authority |
23 | Protecting environment vs. Economic growth |
24 | Interest in politics |
25 | Income equality |
26 | Private vs state ownership of business |
27 | Government responsibility |
28 | Competition good or harmful |
29 | Political system: Having a strong leader who does not have to bother with parli |
30 | Political system: Having the army rule |
31 | Political system: Having a democratic political system |
32 | Thinking about meaning and purpose of life |
33 | How often do you attend religious services |
34 | Religious person |
35 | Believe in: God |
36 | Believe in: hell |
37 | How important is God in your life |
38 | Justifiable: Claiming government benefits to which you are not entitled |
39 | Justifiable: Avoiding a fare on public transport |
40 | Justifiable: Cheating on taxes if you have a chance |
41 | Justifiable: Someone accepting a bribe in the course of their duties |
42 | Justifiable: Homosexuality |
43 | Justifiable: Prostitution |
44 | Justifiable: Abortion |
45 | Justifiable: Divorce |
46 | Justifiable: Suicide |
47 | How proud of nationality |
We thus see that Group 1 corresponds to “government and the law”, Group 2 corresponds to religion, and Group 3 corresponds to politics. Interestingly, “Important in life: Work” is grouped with politics.
Our first goal is to check whether our entire sample is homogeneous. We run our test, using block delimiters that demarcate one-hot encoded responses.
## Load packages and set directories
library(flintyR)
library(doParallel)
# Register parallel computation
registerDoParallel()
## Get p-value
getPValue(as.matrix(wv_filtered_sg_bin),
block_boundaries = block_delims,
largeP = TRUE) # since P is almost 50 we use asymptotic test
#> [1] 0
We obtain a \(p\)-value of \(0\). This is not surprising, given that (as we’ve just seen) the assumption of statistical independence between responses is likely false. Next, we try grouping these features into \(B\) blocks (\(B\leqslant P\)) based on the same hierarchical clustering procedure we used to visualize correlations. We still find that no matter how many blocks \(B\) we select (amounting to cutting the clustering dendrogram to form \(B\) clusters), we still obtain \(p=0\). Below, we show an example of this computation for \(B=6\).
## Cluster features into k = 6 groups to obtain feature grouping
dist_matrix_Euclidean <- stats::dist(t(data.matrix(wv_filtered_sg[,5:51])), method = "euclidean") # L2 distance
hclust_Euclidean_ward <- hclust(dist_matrix_Euclidean, method = "ward.D")
clusters <- cutree(hclust_Euclidean_ward, k = 6) %>%
as.data.frame()
## Get block delimiters from group membership
block_membership <- c()
for (p in 1:47) {
# Count number of levels in the factor
k <- length(levels(wv_filtered_sg[,p+4]))
# Add to block delimiter vector accordingly
which_group <- clusters[p,]
block_membership <- c(block_membership,
rep(which_group,k))
}
## Compute p-value with block boundaries obtained from feature grouping
getPValue(as.matrix(wv_filtered_sg_bin),
block_labels = block_membership)
#> [1] 0
This suggests the correlation structure between the features is complex, and it is likely that there are clusters within the samples that have distinct cluster-specific correlation structures between their features. In a demographic context, we suspect that there are distinct communities of Singaporeans, wherein the responses to the WVS questions have correlation patterns unique to the community.
Motivated by evidence of heterogeneity, we try to find distinct communities of Singaporeans within our sample. Similar to how we grouped our features, we cluster the samples using hierarchical clustering with Ward linkage and distance matrix constructed from Euclidean distances computed between samples. We assume that such a clustering is efficacious at recovering demographically meaningful communities. We obtain \(K=10\) as the optimal number of clusters using the Dindex metric. (See p. 14 of Charrad et al., 2014 for an explanation of how the Dindex works.)
## Perform tuning to decide the optimal number of clusters
## K to group samples into
library(NbClust) # for NbClust to work
res <- NbClust(data.matrix(wv_filtered_sg[,5:51]),
distance = "euclidean",
min.nc = 2,
max.nc = 12,
method = "ward.D2",
index = "dindex")
#> *** : The D index is a graphical method of determining the number of clusters.
#> In the plot of D index, we seek a significant knee (the significant peak in Dindex
#> second differences plot) that corresponds to a significant increase of the value of
#> the measure.
#>
## Minimize clustering gain on Dindex
-diff(res$All.index) # minimized at q = 11, so N_opt = (q - 1) = 10
#> 3 4 5 6 7 8 9 10 11 12
#> 0.0738 0.0516 0.0447 0.0419 0.0278 0.0200 0.0215 0.0213 0.0153 0.0161
## Perform hierarchical clustering
samples_dist_matrix_Euclidean <- stats::dist(data.matrix(wv_filtered_sg[,5:51]), method = "euclidean") # L2 distance
samples_hclust_Euclidean_ward <- hclust(samples_dist_matrix_Euclidean, method = "ward.D")
## Obtain cluster membership for K = 10 clusters
samples_clusters <- cutree(samples_hclust_Euclidean_ward, k = 10)
After obtaining \(10\) clusters, we group the features independently for each sample cluster using the same hierarchical clustering approach. We then perform the block version of our test at multiple number of blocks \(B\) partitioning our features, where each configuration corresponds to cutting the clustering dendrogram to form \(B\) feature clusters. Below, we demonstrate an example of this procedure for one cluster containing \(241\) individuals.
## Select subsample corresponding to Cluster 8
sg_c8 <- wv_filtered_sg[which(samples_clusters == 8),]
sg_c8 <- sg_c8[,-(1:4)] # remove metadata
## Check that each feature has at least two categorical levels
sg_c8 %>%
map(levels) %>% # get levels per feature
map(length) %>% # get number of levels
unlist() %>%
as.vector() # collapse into vector
#> [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 3 3 3 2 3 3 3 3 2 2 2 4 3 2 2 2 3 3 3
#> [40] 3 3 3 3 3 3 3 2
## Perform feature clustering
c8features_hclust_Euclidean_ward <- hclust(as.dist(1-cor(t(apply(data.matrix(sg_c8), 1, as.numeric)))),
method = "ward.D")
c8features_clusters <- cutree(c8features_hclust_Euclidean_ward, k = 10)
## Binarize array
sg_c8_bin <- one_hot(data.table::as.data.table(sg_c8))
## Get block delimiters from group membership
block_membership <- c()
for (p in 1:47) {
# Count number of levels in the factor
k <- length(levels(sg_c8[,p]))
# Add to block delimiter vector accordingly
which_group <- c8features_clusters[p]
block_membership <- c(block_membership,
rep(which_group,k))
}
## Compute p-value with block boundaries obtained from feature grouping
getPValue(as.matrix(sg_c8_bin),
block_labels = block_membership)
#> [1] 0.1484
After performing the steps described, we find that all but one cluster (Cluster 8, shown in the code chunk above) returned significant \(p\)-values that imply heterogeneity. However, for the only homogeneous cluster, we find that up to \(B=10\) blocks, the samples appear homogeneous. This is no longer true when we treat all \(P=47\) features as statistically independent. Thus, for this cluster, the variation in responses can be attributed to partitionable feature dependencies.
## Compute p-value when assuming features are statistically independent
# [!] This should return p-value of 0
getPValue(as.matrix(sg_c8_bin),
block_boundaries = block_delims,
largeP = TRUE) # since P is almost 50 we use asymptotic test
#> [1] 8.881366e-13
Now that we have found an exchangeable cluster of Singaporeans, we characterize the feature dependencies for this community of individuals.
## Generate heatmap with dendrogram capturing feature-feature correlations
c8features_dend <- as.dendrogram(c8features_hclust_Euclidean_ward)
heatmap(cor(t(apply(data.matrix(sg_c8), 1, as.numeric))),
Rowv = ladderize(c8features_dend),
Colv = ladderize(c8features_dend),
cexRow = 0.3,
cexCol = 0.3,
margins = c(2,2))
Here, we see that there are groups of features that are more correlated with one another. Here are three examples:
By looking back at the questions corresponding to the IDs, we see that Group 1 has to do with corporate culture. Moreover, unlike in the entire sample where Groups 2 and 3 were previously seen to be grouped as one, here they are grouped separately. Upon closer inspection, we notice that IDs 38 to 40 pertain more to obeying civil laws, while IDs 41 to 46 pertain more to moral and/or religious values. Thus, for this cluster we find a different relationship between two groups of traits that were previously agglomerated into one when viewed sample-wide.