We are grateful to Tim Stuart, Postdoctoral Researcher at the New York Genome Center, for feedback on accuracy, enhancing accessibility, and improving reproducibility of this vignette.
In cellular biology, the study of single cells allows biologists to discover mechanisms not detectable by studying only bulk populations of cells. In recent years, advances in sequencing technologies have enabled the quantification of many single cell features (e.g., transcriptome, proteome, metabolome, 3D contact maps), and such datasets have in turn provided new statistical challenges and spurred the development of new probabilistic models (Lähnemann et al., 2020).
Key to discovering mechanisms from single cell data are analyses of heterogeneity between single cell populations. Heterogeneous patterns across multiple biological features reveal biological differences between groups of cells that differ by physical location, temporal factors, or some other biological function, and such differences can subsequently help generate hypotheses relating features to function. Moreover, depending on the question the resolution of the population might differ, for instance at the tissue level, at the cell type level, or even at physical location level among the same cell types.
Many levels of resolution at which to investigate single cells. Figure reproduced from Lähnemann et al. (2020) and Wolf et al. (2019) under the Creative Commons Attribution (CC-BY) license.
Central to single cell analyses is the clustering of a large population of sampled cells from an experiment. Such clustering often partitions the large population into single cell communities that correspond to some known biological function (e.g., cell type or sampling batch). Once these cell communities are obtained, further analyses can be conducted on or between identified communities to generate biological hypotheses.
📖 In this vignette, we will:
We obtain publicly available human peripheral blood mononuclear cell (PBMC) scATAC-seq data from 10X Genomics. Data processing requires Seurat and its extension, Signac (v1.3.0); both can be installed with Bioconductor.
The code below loads and describes the PBMC data before running data processing steps.
## Install older version of Signac
# devtools::install_version("Signac", version = "0.2.5")
## Load packages
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v75)
library(ggplot2)
library(tidyverse)
library(patchwork)
## Load the original data
counts <- Read10X_h5(filename = "ex_data/scATAC-seq/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "ex_data/scATAC-seq/atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
genome = 'hg19',
fragments = 'ex_data/scATAC-seq/atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
## Show the PBMC object before data processing
pbmc
#> An object of class Seurat
#> 87561 features across 8728 samples within 1 assay
#> Active assay: peaks (87561 features, 0 variable features)
We perform all steps of data processing up to Non-linear dimension reduction and clustering in this tutorial provided by the lab of Rahul Satija.
To summarize the essential steps, we
We obtain \(7060\) cells after Step 1, and Step 4 yielded \(15\) clusters altogether, whose sizes are reported below. Steps 2 and 3 together account for technical noise (e.g., differences in sequencing depth). This step is important for us because we wish to interrogate homogeneity of clusters and to explain any remaining sources of heterogeneity in the data not owing to technical noise, if there are any.
## Load summary of clustering, show cluster sizes
load("ex_data/scATAC-seq/scATAC-seq_annotation_cluster.RData")
kableExtra::kable(table(combined_metadata$umap_cluster_labels),
col.names = c("Cluster", "Count")) %>%
kableExtra::kable_styling(position = "center")
Cluster | Count |
---|---|
0 | 2530 |
1 | 722 |
2 | 662 |
3 | 587 |
4 | 578 |
5 | 398 |
6 | 357 |
7 | 282 |
8 | 244 |
9 | 234 |
10 | 202 |
11 | 128 |
12 | 60 |
13 | 50 |
14 | 26 |
These clusters can be visualized in 2D space using UMAP. The code to produce this UMAP plot takes some time to run, so we avoid running it here. Interested readers can find the code in the Satija lab vignettes linked above.
These clusters correspond to myeloid and lymphoid cell populations. One can identify cell types by identifying cluster-defining biological markers using RNA-seq measurements of the same cells; see the Satija lab vignettes here and there for details on accomplishing this “feature anchoring” step.
Before proceeding with our analysis, we note that the \(87561\) features per individual cell correspond to physical regions along the human genome, and each feature records the number of reads, more precisely Tn5cut sites, mapped by that individual cell to the physical region. These read-based records measure chromatin activity and thus provide information about gene regulation.
Data processing has provided us with a dataset \(\mathbf{X}\in\mathbb{R}^{N\times P}\) consisting of \(N=7060\) single cells with \(P=87561\) TF-IDF normalized peak reads/counts features recorded per cell. Additionally, \(\mathbf{X}\) has been partitioned into \(15\) clusters that correspond to certain distinguishable blood cell populations.
Are the clusters homogeneous? How can one verify so? A naïve approach is to assume the normalized peak reads are statistically independent from one another and check for heterogeneity. We do this below for Cluster 11, running the independent features version of our exchangeability test and restricting to peak reads lying in Chromosome 1 to save time. For convenience, we have also saved the clusters generated above, so that we can load them directly and run our test.
## Show Cluster 11 (first six cells and first six features)
load("cluster_11_pbmc_data.RData")
# What is the dimension?
dim(extracted_matrix)
#> [1] 128 87561
# Preview first six cells and six normalized features
extracted_matrix[1:6,1:6]
#> 6 x 6 sparse Matrix of class "dgCMatrix"
#> chr1-565107-565550 chr1-569174-569639 chr1-713460-714823
#> AAACTGCTCCTATCCG-1 . . 1.746733
#> AAATGCCAGTGAAACT-1 . . .
#> AACCAACAGTGTCGGA-1 . . .
#> AACCTGATCGCTAGTA-1 . . .
#> AACGGGATCTTGTCGC-1 . . .
#> AAGATAGGTGCTTACA-1 . . .
#> chr1-752422-753038 chr1-762106-763359 chr1-779589-780271
#> AAACTGCTCCTATCCG-1 . . .
#> AAATGCCAGTGAAACT-1 . . .
#> AACCAACAGTGTCGGA-1 . . .
#> AACCTGATCGCTAGTA-1 . . .
#> AACGGGATCTTGTCGC-1 . . .
#> AAGATAGGTGCTTACA-1 . . .
# Get peak locations
peak_locs <- colnames(extracted_matrix)
# Extract chromosome number
chr_block_labels <- sapply(peak_locs, function(x) {substring(strsplit(x, split = "-")[[1]][1],4)})
# Extract position of all chr1 chromosomes
last_id <- match("2",chr_block_labels)
last_id <- last_id - 1
# Restrict to peak reads in Chr1
X <- extracted_matrix[,1:last_id]
# Check dimension
dim(X)
#> [1] 128 8555
## Run test of exchangeability
library(flintyR)
getPValue(as.matrix(X), largeP = TRUE) # use large P asymptotics since P is large
#> Warning in max(block_labels): no non-missing arguments to max; returning -Inf
#> Warning in seq_len(length.out - 2L) * by: Recycling array of length 1 in vector-array arithmetic is deprecated.
#> Use c() or as.vector() instead.
#> Warning in val - grid: Recycling array of length 1 in array-vector arithmetic is deprecated.
#> Use c() or as.vector() instead.
#> [1] 4.260687e-73
The \(p\)-value above provides evidence that Cluster 11 is non-exchangeable. However, it could be a Type I error arising from wrongly assuming that the features are statistically independent. (In other words, it could also be interpreted as evidence for dependencies between peaks. It could also be interpreted as evidence for both cluster heterogeneity and peak dependency.)
To account for statistical dependencies between peaks, we recall some possible biological factors inducing correlations between peaks.
In our paper and in this vignette, we attempt to investigate the second point by leveraging Hi-C data to group peaks in the scATAC-seq data. Owing to the large number of peak regions (\(P=87561\)) in our scATAC-seq data, for our analysis we will restrict our features to only those peak regions that lie in Chromosome 1. This restriction reduces the number of features to \(P=8555\), as we saw when we ran our test on Cluster 11 earlier.
Before proceeding to integrate Hi-C data with our scATAC-seq data, we provide some background on Hi-C. As described in the second bullet point above, there is evidence suggesting that the 3D structure of the genome plays an important role in determining differences in chromatin accessibility. Chromatin 3D organization is measured by probing physical interaction between non-adjacent genomic loci and such probing in turn is accomplished by chromatin conformation capture (3C) technologies (Dekker et al., 2002). With Hi-C (Lieberman-Aiden et al., 2009) being one of the more advanced capturing technologies, we hence use Hi-C data to supervise the grouping of our peaks. The heat map below depicts what a Hi-C interaction map looks like.
Scatterplot of a Hi-C interaction map of Chromosome 14 of a genome sequenced from lymphoblastoid cells (GM06990). Localized darker patches reveal local regions of increased contact. Plot produced using the Bioconductor package LiebermanAidenHiC2009 with help from this vignette.
Concretely, we use Hi-C data to partition our features into disjoint sets of dependent features (so-called “dependent blocks”), where we group features into the same sets based on whether they belong in the same topologically associating domain (TAD) as identified by Hi-C TAD callers.
There are more than 20 TAD calling approaches; see Pal et al. (2019) for a summary of them. We consider two approaches, DomainCaller — a method proposed by Dixon et al. (2012) that relies on the directionality index — and TopDom — a newer approach that scores patterns of contact frequencies and tests these scores for statistical significance under a non-parametric null distribution Shin et al. (2016). Note both DomainCaller and TopDom are found reliable in a comprehensive meta-analysis of 22 TAD callers by Zufferey et al. (2018). (TADs were first described by Dixon and colleagues, so it should come as no surprise that their DomainCaller method identifies TADs well.)
We provide a guide to generating TAD boundaries in a separate vignette.
## Load packages and set directories
library(Matrix)
library(doParallel)
# Register parallel computation
registerDoParallel()
# Relative directory
rel_dir <- "ex_data/scATAC-seq"
Our first goal is to check whether each cluster of cells is homogeneous, even after accounting for potential dependencies between peaks arising from chromosome folding. To accomplish this, we need to group peaks based on whether they lie in the same TAD. We demonstrate how we perform this step, using TAD membership within Chromosome 1 identified by DomainCaller.
## Grouping peaks
# Load libraries
require(HiTC) # install using Bioconductor if not installed
require(HiCDataHumanIMR90) # install using Bioconductor if not installed
data(Dixon2012_IMR90)
# View TADs identified by DomainCaller
tads_imr90
#> GRanges object with 2338 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> TAD-1 chr1 770138-1290137 *
#> TAD-2 chr1 1290138-1850140 *
#> TAD-3 chr1 1850141-2330140 *
#> TAD-4 chr1 2330141-3610140 *
#> TAD-5 chr1 3770141-6077413 *
#> ... ... ... ...
#> TAD-2334 chrX 146992309-148552096 *
#> TAD-2335 chrX 148592096-149929342 *
#> TAD-2336 chrX 149929343-151969344 *
#> TAD-2337 chrX 152089345-152746806 *
#> TAD-2338 chrX 152786807-154946806 *
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
# Helper function for classifying each ATAC peak into one of many TADs
getDomainCallerLabels <- function(tads_imr90, peak_locs_2) {
chr1_tad_locs <- tads_imr90@ranges[which(tads_imr90@seqnames == "chr1"),]
start_pos <- sapply(peak_locs_2, function(x) {strsplit(x, split = "-")[[1]][2]})
end_pos <- sapply(peak_locs_2, function(x) {strsplit(x, split = "-")[[1]][3]})
tmp_df <- data.frame(START = start_pos,
END = end_pos)
rownames(tmp_df) <- NULL
tmp_df$START <- as.numeric(tmp_df$START); tmp_df$END <- as.numeric(tmp_df$END)
getThisLabel <- function(start_, end_) {
# Find largest start in chr1_tad_locs that start_ beats
max_start_ind <- max(which(chr1_tad_locs@start < start_))
if (is.infinite(max_start_ind)) {
# Feature has no available block label, set to -99
message("Peak Region is not contained in any TAD...")
return(-99)
} else if ((chr1_tad_locs@start[max_start_ind] + chr1_tad_locs@width[max_start_ind] - 1) <= end_) {
# Feature is sandwiched between two TADs
message("Peak Region is sandwiched between two TADs...")
return(-99)
} else {
assertthat::assert_that((chr1_tad_locs@start[max_start_ind] + chr1_tad_locs@width[max_start_ind] - 1) > end_,
msg = paste0("end position ", end_," exceeds right end of TAD", chr1_tad_locs@start[max_start_ind + 1]))
return(max_start_ind)
}
}
# Allocate labels
tmp_df$LABEL <- mapply(getThisLabel,tmp_df$START, tmp_df$END)
# Return
return(tmp_df)
}
# Get domainCaller labels
peak_locs_2 <- peak_locs[which(sapply(peak_locs,function(x) {substring(strsplit(x, split = "-")[[1]][1],4)}) == "1")]
each_peaks_tad_membership <- getDomainCallerLabels(tads_imr90, peak_locs_2)
# View each peak's TAD membership
head(each_peaks_tad_membership)
#> START END LABEL
#> 1 565107 565550 -99
#> 2 569174 569639 -99
#> 3 713460 714823 -99
#> 4 752422 753038 -99
#> 5 762106 763359 -99
#> 6 779589 780271 1
# How many TADs are there?
length(unique(each_peaks_tad_membership$LABEL))
#> [1] 233
# Label the peaks of Cluster 11 data
colnames(X) <- each_peaks_tad_membership$LABEL
Now that we have our peaks grouped by TAD membership, we can visualize correlations between groups of peaks. Visualization provides us with intuition as to whether peaks within the same group are more correlated with one another than are peaks between different groups. Below, we plot the average correlations between peaks from each pair of groups. We do not compute self-correlations when averaging correlations between peaks from the same group, and drop groups consisting of only a single peak.
Heatmap of average correlations between groups of peaks. When computing average correlations between peaks within the same group (the diagonal elements of the heatmap), self-correlations are removed. The heatmap is generated by the code below.
## Code to generate average correlations between groups of peaks (NOT RUN HERE)
# Obtain Chr1 peaks for entire sample of 7060 cells
#tfidf_norm_matrix <- t(pbmc@assays$peaks@data)
#tfidf_norm_matrix_chr1 <- tfidf_norm_matrix[,1:8555]
#colnames(tfidf_norm_matrix_chr1) <- each_peaks_tad_membership$LABEL
# Delete the non-TAD classified features
#tfidf_norm_matrix_chr1_ <- tfidf_norm_matrix_chr1[,which(test_df_output$LABEL != -99)]
# Rename block labels to be compatible
#colnames(tfidf_norm_matrix_chr1_) <- as.numeric(as.factor(colnames(tfidf_norm_matrix_chr1_)))
# Order the peaks by groups
#tfidf_norm_matrix_chr1_sub_mat <- tfidf_norm_matrix_chr1_ %>% as.matrix()
#tfidf_norm_matrix_chr1_sub_mat <- tfidf_norm_matrix_chr1_sub_mat[,order(colnames(tfidf_norm_matrix_chr1_sub_mat))]
#all_tads <- unique(colnames(tfidf_norm_matrix_chr1_sub_mat))
# Construct average correlation matrix to plot heatmap
#ave_cor_matrix <- matrix(nrow = length(all_tads),
# ncol = length(all_tads))
#for (i in all_tads) {
# for (j in all_tads) {
# # Extract submatrix belonging to TAD i
# TAD_i <- tfidf_norm_matrix_chr1_sub_mat[,which(colnames(tfidf_norm_matrix_chr1_sub_mat) == i),drop=FALSE]
# if (i == j) {
# # Compute the non-matching features only
# if (ncol(TAD_i) == 1) {
# x_ij <- 0
# } else {
# cor_ii <- cor(TAD_i)
# cor_ii <- cor_ii - diag(1, nrow(cor_ii), ncol(cor_ii))
# x_ij <- mean(cor_ii) * nrow(cor_ii)^2 / (nrow(cor_ii)^2 - nrow(cor_ii))
# }
# } else {
# # Extract submatrix belonging to TAD j
# TAD_j <- tfidf_norm_matrix_chr1_sub_mat[,which(colnames(tfidf_norm_matrix_chr1_sub_mat) == j),drop=FALSE]
# # Compute average correlation between all features in TAD i and all features in TAD j
# x_ij <- mean(cor(TAD_i, TAD_j))
# }
# # Fill in entry (i,j) of ave_cor_matrix
# ave_cor_matrix[as.numeric(i),as.numeric(j)] <- x_ij
# }
#}
# Save average correlation matrix
#write.csv(ave_cor_matrix, file = "ave_cor_matrix.csv")
# Load average correlation matrix
ave_cor_matrix <- read.csv(paste0(rel_dir, "/ave_cor_matrix.csv"), row.names = 1)
# Plot heat map
#col_scheme <- colorRampPalette(c("blue", "white", "red"))(30)
#gplots::heatmap.2(as.matrix(ave_cor_matrix), col=col_scheme, dendrogram='none', Rowv=FALSE, Colv=FALSE, trace='none')
# Compute mean diagonal and off-diagonal correlation values
ave_cor_matrix <- as.matrix(ave_cor_matrix)
mean(diag(ave_cor_matrix))
#> [1] 0.009481952
mean(ave_cor_matrix - diag(ave_cor_matrix)) * nrow(ave_cor_matrix)^2 / (nrow(ave_cor_matrix)^2 - nrow(ave_cor_matrix))
#> [1] -0.00712446
The heat map and calculation above suggest that peak-peak correlations within a group are slightly higher than peak-peak correlations between distinct groups. However, average correlations between pairs of groups are all very weak, never exceeding \(0.01\) in magnitude.
Take note that TAD grouping is contiguous in nature: peaks physically close to one another are typically grouped together as a TAD, with occasional boundaries breaking the contiguity. This is similar to the grouping of single nucleotide polymorphisms in genetics into contiguous LD blocks, where within-block correlations are empirically higher than between-block correlations.
We proceed with running our exchangeability test. Since the number of blocks \(B=233 - 1 = 232\) (one block corresponds to unclassified peaks) is large, we use the asymptotic version of our test.
## Grouping peaks
# Delete the non-TAD classified features
X_ <- X[,which(each_peaks_tad_membership$LABEL != -99)]
# Remove TADs containing only a single peak
X__ <- X_[,-which(table(colnames(X_)) == 1)]
# Rename block labels to be compatible
colnames(X__) <- as.numeric(as.factor(colnames(X__)))
# Compute p-value
getPValue(X = as.matrix(X__), block_labels = as.numeric(colnames(X__)), largeP = TRUE)
#> [1] 9.398319e-61
We obtain a \(p\)-value that is essentially \(0\). In fact, we found that our test reports zero \(p\)-values for each cluster. This is true even when we use TAD membership identified by TopDom, which we do not show here.
Perhaps the zero \(p\)-values aren’t surprising, given that all clusters, except for Cluster 8, contain multiple cell types. We proceed to analyze subclusters consisting of unique cell types. We run our exchangeability test using the same grouping of peaks by TAD membership. Here, we find that all but three subclusters are non-exchangeable. The three exchangeable subclusters are printed below.
## Show the three exchangeable subclusters
library(data.table)
domain_caller_results <- data.table::fread(paste0(rel_dir, "/domainCaller_p_value_rm1s_df.txt"))
domain_caller_results %>% subset(P_VALUE < 1 & P_VALUE > 0.05)
#> CLUSTER CELL_TYPE SAMPLE_SIZE P_VALUE
#> 1: 0 CD4 Memory 6 0.2891715
#> 2: 0 CD4 Naive 5 0.1785740
#> 3: 6 CD4 Memory 7 0.4547475
To summarize our results thus far, our test provides evidence of heterogeneity, not just in the clusters but also among subclusters consisting of a particular cell type. This is so even after accounting for correlations induced on peak read counts by the 3D structure (TADs) of the genome. Alternatively, if we assert that the clusters and subclusters are exchangeable, then grouping peaks by common TAD membership does not completely capture the complex dependencies between peaks.
We pursue these two directions of analyses below.
Given that all \(15\) clusters and most of their subclusters appear heterogeneous even after accounting for chromatin structure-induced dependencies, we shall attempt to explain what drives the heterogeneity. We consider a few statistical explanations. Each explanation can potentially be translated into some biological interpretation, and we defer the translation and interpretation to biologists who are more familiar with the subject-matter.
Here, we investigate the first point by dropping peaks that based on the fraction of non-zero normalized reads, and defer investigation of the second point to future work.
For example, we consider Cluster 12 (\(60\) cells) and focus on the subcluster consisting of \(59\) NK cells (one cell is a CD8 Naive). Note this subcluster is non-exchangeable when we run flinty at \(\alpha=0.05\). We find that a large proportion of peaks is sparse.
## Load and visualize peak statistics
load(file = "cluster_12_pbmc_data.RData")
load(file = "090721_scATAC-seq_annotation_cluster.RData")
# Get peak locations
peak_locs <- colnames(extracted_matrix)
# Extract chromosome number
chr_block_labels <- sapply(peak_locs, function(x) {substring(strsplit(x, split = "-")[[1]][1],4)})
# Extract position of all chr1 chromosomes
last_id <- match("2",chr_block_labels)
last_id <- last_id - 1
X <- extracted_matrix[,1:last_id]
colnames(X) <- each_peaks_tad_membership$LABEL
# Delete the non-TAD classified features
X_ <- X[,which(each_peaks_tad_membership$LABEL != -99)]
X__ <- X_[,-which(table(colnames(X_)) == 1)]
# Filter out the single CD8 naive
cluster_meta <- combined_metadata[which(combined_metadata$umap_cluster_labels==12),]
X_subclust <- X__[rownames(cluster_meta[which(cluster_meta$pred_cell_type == "NK cell"),]),]
# View histogram of feature sparsity (% zero normalized peak reads)
col_sparse_ratio <- apply(as.matrix(X_subclust), 2, function(x) {mean(x == 0)})
hist(col_sparse_ratio, main = "Histogram of Peak Sparsity", xlab = "Peak Sparsity")
mean((col_sparse_ratio < 0.8))
#> [1] 0.1429788
# View plot of peak variance against peak sparsity
col_variances <- apply(X_subclust,2,var)
plot(x = col_sparse_ratio, y = col_variances,
main = "Peak Variance vs Peak Sparsity",
xlab = "Peak Sparsity",
ylab = "Peak Variance")
To see if feature sparsity contributes to heterogeneity, we remove peaks from each group that are too sparse and try running our exchangeability test once more. After several iterations of filtering peaks, we find that additionally removing peaks that have sparsity \(\leqslant 0.7\) leads to subcluster exchangeability.
## Run test to explain contribution to heterogeneity
# Remove sparse peaks
X_non_sparse <- as.matrix(X_subclust)[,which(col_sparse_ratio < 0.8 & col_sparse_ratio > 0.7)]
colnames(X_non_sparse) <- as.numeric(as.factor(colnames(X_non_sparse)))
# Check number of independent blocks left after removal
length(unique(colnames(X_non_sparse)))
#> [1] 148
# large enough for asymptotic test to be used
# Run flinty
getPValue(X_non_sparse, block_labels = as.numeric(colnames(X_non_sparse)), largeP = TRUE)
#> [1] 0.1579093
As mentioned earlier, we may also assert that the clusters are exchangeable. If so, we must explain how the features are dependent for each cluster, such that under the dependency structure the cluster is exchangeable or homogeneous.
Let us consider Cluster 8, which is a community of \(244\) pre-B cells. Like our analysis of Cluster 12 above, we start with visualizing some peak statistics, including the distributions of peak sparsity and peak variances.
## Load and visualize peak statistics
load(file = "cluster_8_pbmc_data.RData")
load(file = "090721_scATAC-seq_annotation_cluster.RData")
# Get peak locations
peak_locs <- colnames(extracted_matrix)
# Extract chromosome number
chr_block_labels <- sapply(peak_locs, function(x) {substring(strsplit(x, split = "-")[[1]][1],4)})
# Extract position of all chr1 chromosomes
last_id <- match("2",chr_block_labels)
last_id <- last_id - 1
X <- extracted_matrix[,1:last_id]
# View histogram of feature sparsity (% zero normalized peak reads)
col_sparse_ratio <- apply(as.matrix(X), 2, function(x) {mean(x == 0)})
hist(col_sparse_ratio, main = "Histogram of Peak Sparsity", xlab = "Peak Sparsity")
mean((col_sparse_ratio < 0.8))
#> [1] 0.1246055
# View plot of peak variance against peak sparsity
col_variances <- apply(X,2,var)
plot(x = col_sparse_ratio, y = col_variances,
main = "Peak Variance vs Peak Sparsity",
xlab = "Peak Sparsity",
ylab = "Peak Variance")
We now try to group the peaks based on correlations. We perform hierarchical clustering with Ward linkage and distance matrix constructed from peak-peak correlations. (See p. 21 of Charrad et al., 2014 for an explanation of the Ward linkage.) To choose the optimal number of clusters, which correspond to groups of peaks, we use the Dindex metric. (See p. 14 of Charrad et al., 2014 for an explanation of how the Dindex works.) We find that \(K=46\) clusters is optimal. Thus, we run our test again, with the peaks now grouped into \(46\) blocks based on the clustering approach just described.
# Create distance matrix based on pairwise correlations
peak_cor <- cor(as.matrix(X))
na_sums <- apply(peak_cor,2,function(x){sum(is.na(x))})
na_peaks <- which(na_sums == (dim(peak_cor) - 1))
peak_cor_na_rm <- peak_cor[-na_peaks,-na_peaks]
peak_dist <- as.dist(1 - peak_cor_na_rm)
# Subset X
X_ <- as.matrix(X)[,!(colnames(X) %in% names(na_peaks))]
# Remove peak correlation matrix
rm(peak_cor)
rm(peak_cor_na_rm)
# Perform hierarchical clustering
peak_tree <- hclust(peak_dist, method="ward.D")
rm(peak_dist)
# Clear garbage
gc()
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 12172693 650.1 32126820 1715.8 NA 32126820 1715.8
#> Vcells 309536399 2361.6 693332412 5289.8 16384 693332222 5289.8
# Create groups of features
library(dendextend)
peak_dend <- as.dendrogram(peak_tree)
#library(NbClust) # for NbClust to work
# This tuning step takes ~0.5h
#res <- NbClust(data.matrix(X_),
# distance = "euclidean",
# min.nc = 10,
# max.nc = 50,
# method = "ward.D2",
# index = "dindex")
#-diff(res$All.index) # minimized at q = 47, so N_opt = (q - 1) = 46
#which(-diff(res$All.index) == min(-diff(res$All.index)))
# View sizes of each group
clusters <- cutree(peak_dend, k = 46)
table(clusters)
#> clusters
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 63 5075 49 63 22 34 27 49 58 21 47 21 60 56 31 32
#> 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
#> 35 134 48 41 33 21 34 57 48 35 72 40 83 47 41 47
#> 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#> 32 30 35 79 47 44 29 31 48 25 17 30 24 36
# Group peaks into clusters
colnames(X_) <- clusters
# Run test of exchangeability
getPValue(X_, block_labels = as.numeric(colnames(X_)), largeP = TRUE)
#> [1] 0.3087334
We see that under this grouping of peaks and at level \(\alpha=0.05\), the sample is exchangeable.
To interpret this finding and the associated grouping, we inspect how peaks within the same group are distributed across Chr 1.
## Inspect with-group peaks' physical location distribution
phys_loc_df <- data.frame(CHR = rep(1, length(clusters)),
START = as.numeric(sapply(names(clusters), function(x) {strsplit(x, "-")[[1]][2]})),
END = as.numeric(sapply(names(clusters), function(x) {strsplit(x, "-")[[1]][3]})),
GROUP = as.numeric(clusters))
# Save files for chromoMap to read
#write.table(phys_loc_df[,c(1,2,3)], "chr_file.txt", col.names = FALSE, row.names = FALSE)
#write.table(phys_loc_df[,c(1,2,3)], "chr_file.txt", col.names = FALSE, row.names = FALSE)
#phys_loc_df %>%
# mutate(ANNO = paste0("Group", GROUP)) %>%
# select(ANNO, CHR, START, END) %>%
# write.table(file = "anno_file.txt", col.names = FALSE, row.names = FALSE)
phys_loc_df <- read.table("anno_file.txt", col.names = c("CLUSTER","CHR","START","END"))
phys_loc_df <- phys_loc_df %>%
mutate(GROUP =sapply(CLUSTER, function(x) {substring(x, 6) %>% as.numeric()})) %>%
select(CHR,START,END,GROUP)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'select' for signature '"data.frame"'
dat <- apply(phys_loc_df,
1,
function(x) data.table(CHR = x["CHR"], BP = x["START"]:x["END"], GROUP = x["GROUP"])
) %>%
rbindlist()
colors <- colorRampPalette(c("#d7191c", "#fdae61", "#abd9e9", "#2c7bb6"))(46)
ggplot(dat) +
geom_tile(aes(x = CHR, y = BP, fill = factor(GROUP))) +
scale_fill_manual(values=colors) +
theme_bw() +
scale_x_discrete(labels = 1) +
xlab("Chromosome 1") +
ylab("Physical Location (bp)") +
ylim(c(565550, 249219853)) +
labs(fill = "Group")
Given the clusters found above do not group neatly by physical location, we interpret this as evidence that “long-range” peak interactions manifesting in correlations in the data, rather than grouping of peaks into contiguous blocks, must be accounted for in the modeling of scATAC-seq peaks, if the cluster is exchangeable or unstratified. There is some biological basis for this: there are almost certainly trans-acting factors that co-regulate multiple peaks across different TADs. Such co-regulation would likely induce dependencies between the peaks that are not captured by the TAD structure alone.
A broader implication of this analysis is that single cell ATAC peaks cannot simply be grouped into contiguous blocks (with contiguity based on physical location and TAD membership), as we saw earlier. Adequate statistical modeling of scATAC-seq data should go beyond simplistic blocking. This is unlike the case in statistical and population genetic models of evolution, where single nucleotide polymorphisms are grouped into blocks based on high within-block linkage disequilibrium (LD).