A Non-parametric Test for Exchangeability and Homogeneity

Computes the p-value of a multivariate dataset \(\mathbf{X}\), which informs the user if the sample is exchangeable at a given significance level, while simultaneously accounting for feature dependencies. See Aw, Spence and Song (2021) for details.

getPValue(
  X,
  block_boundaries = NULL,
  block_labels = NULL,
  largeP = FALSE,
  largeN = FALSE,
  nruns = 5000,
  type = "unbiased",
  p = 2
)

Arguments

X: The binary or real matrix on which to perform test of exchangeability.
block_boundaries: Vector denoting the positions where a new block of non-independent features starts. Default is NULL.
block_labels: Length \(P\) vector recording the block label of each feature. Default is NULL.
largeP: Boolean indicating whether to use large \(P\) asymptotics. Default is FALSE.
largeN: Boolean indicating whether to use large \(N\) asymptotics. Default is FALSE.
nruns: Resampling number for exact test. Default is 5000.
type: Either an unbiased estimate of (`'unbiased'`, default), or valid, but biased estimate of, (`'valid'`) p-value (see Hemerik and Goeman, 2018), or both (`'both'`). Default is `'unbiased'`.
p: The power \(p\) of \(l_p^p\), i.e., \(||x||_p^p = (x_1^p+...x_n^p)\). Default is 2.

Value

The p-value to be used to test the null hypothesis of exchangeability.

Details

Automatically detects if dataset is binary, and runs the Hamming distance version of test if so. Otherwise, computes the squared Euclidean distance between samples and evaluates whether the variance of Euclidean distances, \(V\), is atypically large under the null hypothesis of exchangeability. Note the user may tweak the choice of power \(p\) if they prefer an \(l_p^p\) distance other than Euclidean.

Under the hood, the variance statistic, \(V\), is computed efficiently. Moreover, the user can specify their choice of block permutations, large \(P\) asymptotics, or large \(P\) and large \(N\) asymptotics. The latter two return reasonably accurate p-values for moderately large dimensionalities.

User recommendations: When the number of independent blocks \(B\) or number of independent features \(P\) is at least 50, it is safe to use large \(P\) asymptotics. If \(P\) or \(B\) is small, however, stick with permutations.

Dependencies: All functions in auxiliary.R

Examples

# Example 1 (get p-value of small matrix with independent features using exact test)
suppressWarnings(require(doParallel))
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
# registerDoParallel(cores = 2)

X1 <- matrix(nrow = 5, ncol = 10, rbinom(50, 1, 0.5)) # binary matrix, small
getPValue(X1) # perform exact test with 5000 permutations
#> Warning: executing %dopar% sequentially: no parallel backend registered
#> [1] 0.169

# should be larger than 0.05

# Example 2 (get p-value of high-dim matrix with independent features using asymptotic test)
X2 <- matrix(nrow = 10, ncol = 1000, rnorm(1e4)) # real matrix, large enough
getPValue(X2, p = 2, largeP = TRUE) # very fast
#> [1] 0.418648

# should be larger than 0.05
# getPValue(X2, p = 2) # slower, do not run (Output: 0.5764)

# Example 3 (get p-value of high-dim matrix with partitionable features using exact test)

X3 <- matrix(nrow = 10, ncol = 1000, rbinom(1e4, 1, 0.5))
getPValue(X3, block_labels = rep(c(1,2,3,4,5), 200))
#> Warning: There exist columns with all ones or all zeros for binary X.
#> [1] 0.2812

# Warning message: # there are features that have zero variation (i.e., all 0s or 1s)
# In getPValue(X3, block_labels = rep(c(1, 2, 3, 4, 5), 200)) :
# There exist columns with all ones or all zeros for binary X.

# Example 4 (get p-value of high-dim matrix with partitionable features using asymptotic test)

## This elaborate example generates binarized versions of time series data.

# Helper function to binarize a marker
# by converting z-scores to {0,1} based on
# standard normal quantiles
binarizeMarker <- function(x, freq, ploidy) {
 if (ploidy == 1) {
   return((x > qnorm(1-freq)) + 0)
 } else if (ploidy == 2) {
   if (x <= qnorm((1-freq)^2)) {
     return(0)
   } else if (x <= qnorm(1-freq^2)) {
     return(1)
   } else return(2)
 } else {
   cat("Specify valid ploidy number, 1 or 2")
 }
}

getAutoRegArray <- function(B, N, maf_l = 0.38, maf_u = 0.5, rho = 0.5, ploid = 1) {
# get minor allele frequencies by sampling from uniform
mafs <- runif(B, min = maf_l, max = maf_u)
# get AR array
ar_array <- t(replicate(N, arima.sim(n = B, list(ar=rho))))
# theoretical column variance
column_var <- 1/(1-rho^2)
# rescale so that variance per marker is 1
ar_array <- ar_array / sqrt(column_var)
# rescale each column of AR array
for (b in 1:B) {
  ar_array[,b] <- sapply(ar_array[,b],
                         binarizeMarker,
                         freq = mafs[b],
                         ploidy = ploid)
}
return(ar_array)
}

## Function to generate the data array with desired number of samples
getExHaplotypes <- function(N) {
  array <- do.call("cbind",
                   lapply(1:50, function(x) {getAutoRegArray(N, B = 20)}))
  return(array)
}

##  Generate data and run test
X4 <- getExHaplotypes(10)
getPValue(X4, block_boundaries = seq(from = 1, to = 1000, by = 25), largeP = TRUE)
#> Warning: There exist columns with all ones or all zeros for binary X.
#> [1] 0.1598537

# stopImplicitCluster()