Python Example

Basics

To check that Python works, generate a random binary matrix (which we refer to as \(\mathbf{X}\)).

>>> import numpy as np
>>> import random

# Define matrix dimensions
>>> rows = 6
>>> cols = 8

# Generate random binary matrix
>>> X = np.array([[random.choice([0, 1]) for _ in range(cols)] for _ in range(rows)])
>>> print(X)
[[1 1 0 1 1 1 0 1]
 [0 0 1 1 0 0 0 0]
 [0 0 0 0 1 1 1 0]
 [1 1 0 1 1 1 1 0]
 [0 0 0 0 1 0 1 0]
 [0 0 0 1 1 0 1 1]]
>>> print(X.shape)
(6, 8)
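Because the matrix above is drawn at random, your output will differ from run to run. A minimal alternative (a stylistic suggestion, not required by zagar) uses NumPy's own generator with a fixed seed, so the matrix and all counts that follow are reproducible:

```python
import numpy as np

# Generate the same binary matrix on every run by fixing the seed
rng = np.random.default_rng(seed=0)
X = rng.integers(0, 2, size=(6, 8))  # entries drawn uniformly from {0, 1}
print(X.shape)  # (6, 8)
```

Any seed works; fixing one simply makes the examples repeatable.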

If the above works, we are ready to proceed with combinatorial calculations. First, import zagar. (If you have not installed it, follow the instructions on the Installation page of this website.)

>>> import zagar

Binary matrices form a rich and practical class of discrete structures.

  • Rich, because they lead to interesting mathematical theory (see, e.g., Barvinok 2010, Cooper et al. 2019, and Liebenau and Wormald 2022, among many others);

  • Practical, because they represent many data objects encountered in reality (e.g., contingency tables in statistics, genetic polymorphism data, codes for information transmission).

With zagar, we can investigate the following combinatorial question:

Given marginal constraints on a binary matrix, such as its row and column sums, how many binary matrices satisfy them?

Answering this question is challenging in general, but for reasonably sized matrices the function count_matrices computes the answer. Let us use the binary matrix \(\mathbf{X}\) generated earlier to demonstrate its usage.

# Compute row and column margins
>>> row_margins = X.sum(axis=1).tolist()
>>> col_margins = X.sum(axis=0).tolist()

# Check
>>> print(row_margins)
[6, 2, 3, 6, 2, 4]
>>> print(col_margins)
[2, 2, 1, 4, 5, 3, 4, 2]

# Count number of binary matrices with observed marginal constraints
>>> zagar.count_matrices(p=row_margins, q=col_margins, sort_p_desc=True)
39940

There are \(39,940\) binary matrices satisfying these margins, a remarkably large number for a \(6\times 8\) matrix!
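For very small dimensions, such counts can be cross-checked by brute force. The sketch below (plain Python, not part of zagar) enumerates every binary matrix of a given shape and keeps those matching the prescribed margins; note that a necessary feasibility condition is that the row-sum total equals the column-sum total, since both count the 1's in the matrix.

```python
from itertools import product

def count_by_enumeration(p, q):
    """Count binary matrices with row sums p and column sums q by
    enumerating all 2^(rows*cols) candidates; feasible only for tiny sizes."""
    rows, cols = len(p), len(q)
    if sum(p) != sum(q):  # both totals count the 1's, so they must agree
        return 0
    count = 0
    for bits in product([0, 1], repeat=rows * cols):
        M = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
        row_ok = all(sum(M[r]) == p[r] for r in range(rows))
        col_ok = all(sum(M[r][c] for r in range(rows)) == q[c] for c in range(cols))
        if row_ok and col_ok:
            count += 1
    return count

print(count_by_enumeration([2, 1, 1], [1, 2, 1]))  # 5
```

For the \(6\times 8\) matrix above this enumeration would visit \(2^{48}\) candidates, which is exactly why a dedicated counting routine like count_matrices is needed.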

Admixed Arrays

Closely related to a binary matrix is an admixed array. An admixed array is a pair \([\mathbf{A},\mathbf{X}]\), where \(\mathbf{A}=(A_{np})\) and \(\mathbf{X}=(X_{np})\) lie in \(\{0,1\}^{N\times 2P}\). Here, \(N\) and \(P\) are positive integers specifying the dimensions of the underlying binary matrices; they correspond to the number of individuals and the number of genetic loci in an admixed population dataset.

Let us generate a random admixed array.

# Define array parameters
>>> N = 6
>>> P = 5

# Generate random admixed array
>>> X = np.array([[random.choice([0, 1]) for _ in range(2*P)] for _ in range(N)])
>>> A = np.array([[random.choice([0, 1]) for _ in range(2*P)] for _ in range(N)])

>>> print(A)
[[1 1 1 1 0 0 1 1 0 0]
 [1 0 1 0 1 0 1 1 0 0]
 [1 1 0 1 0 0 1 1 0 1]
 [1 0 0 1 0 1 1 0 1 1]
 [1 0 1 1 1 1 1 1 0 1]
 [1 1 1 1 0 0 0 0 0 1]]
>>> print(X)
[[1 1 1 0 1 1 0 1 1 1]
 [1 1 0 0 0 1 0 1 1 0]
 [1 0 0 1 1 0 0 1 0 1]
 [1 1 0 1 1 1 0 1 1 0]
 [1 1 0 1 0 1 1 1 1 0]
 [0 0 0 1 0 0 1 0 0 1]]

The following two summary statistics (modulo normalization factors that are irrelevant here) arise in genetic studies of admixed populations.

  • Individual global ancestry: For each individual \(n\in[N]\), their global ancestry dosage is \(A_{n\cdot}=\sum_{p=1}^{2P}A_{np}\). Equivalently, by gathering the values across all individuals we obtain a vector of individual global ancestry dosages called the ancestry dosage vector, \(\boldsymbol{A}=(A_{1\cdot},\ldots,A_{N\cdot})\).

  • Ancestry-specific allele frequencies: For each locus \(p\in[P]\), the pair \((\Phi_{p,0},\Phi_{p,1})\) records the allele dosage specific to each ancestry. Mathematically, \(\Phi_{p,0}=\sum_{n=1}^N \left[(1-A_{np})X_{np}+(1-A_{n(P+p)})X_{n(P+p)}\right]\) and \(\Phi_{p,1}=\sum_{n=1}^N \left(A_{np}X_{np}+A_{n(P+p)}X_{n(P+p)}\right)\). Again, by gathering the values across all loci we obtain two ancestry-specific allele dosage vectors: \(\boldsymbol{\Phi}_0=(\Phi_{1,0},\ldots,\Phi_{P,0})\) and \(\boldsymbol{\Phi}_1=(\Phi_{1,1},\ldots,\Phi_{P,1})\).

We compute these summaries for the admixed array generated above.

# Individual global ancestry (row sums of A)
>>> row_sum_A = A.sum(axis=1).tolist()
>>> print(row_sum_A)
[6, 5, 6, 6, 8, 5]

# Ancestry-specific allele frequencies
>>> phi0 = []
>>> phi1 = []
>>> for p in range(P):
...     s0 = 0
...     s1 = 0
...     for n in range(N):
...         a1 = A[n, p]
...         a2 = A[n, P + p]
...         x1 = X[n, p]
...         x2 = X[n, P + p]
...
...         s0 += (1 - a1) * x1 + (1 - a2) * x2
...         s1 += a1 * x1 + a2 * x2
...
...     phi0.append(s0)
...     phi1.append(s1)

>>> print(phi0)
[2, 4, 1, 3, 4]
>>> print(phi1)
[7, 2, 5, 5, 2]

# Optional: sanity checks
# 1) the row sums of A must total the number of 1's in A
# 2) phi0 and phi1 together must total the number of 1's in X
>>> sum(row_sum_A) == A.sum()
True
>>> (sum(phi0) + sum(phi1)) == X.sum()
True
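The explicit loops above can also be written in vectorized NumPy. The sketch below is equivalent to the loops (it assumes only that A and X are \(N\times 2P\) integer arrays, as generated earlier); the key step is splitting each array into its first and second allele copies so that locus \(p\) is paired with column \(P+p\):

```python
import numpy as np

def ancestry_summaries(A, X):
    """Return (row_sum_A, phi0, phi1) for N x 2P binary arrays A and X."""
    P = A.shape[1] // 2
    # Split each array into the first and second allele copies per locus
    A1, A2 = A[:, :P], A[:, P:]
    X1, X2 = X[:, :P], X[:, P:]
    row_sum_A = A.sum(axis=1)                           # global ancestry dosages
    phi0 = ((1 - A1) * X1 + (1 - A2) * X2).sum(axis=0)  # ancestry-0 allele dosages
    phi1 = (A1 * X1 + A2 * X2).sum(axis=0)              # ancestry-1 allele dosages
    return row_sum_A.tolist(), phi0.tolist(), phi1.tolist()
```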

The code above illustrates how these summaries are calculated from data. They can also be obtained with a single call to the function get_constraints:

# Compute phi0, phi1 and row_sum_A
>>> constraint_list = zagar.get_constraints(A, X)
>>> constraint_list[0] # row_sum_A
[6, 5, 6, 6, 8, 5]
>>> constraint_list[1] # phi0
[2, 4, 1, 3, 4]
>>> constraint_list[2] # phi1
[7, 2, 5, 5, 2]

Now suppose these summaries are viewed as constraints. We can then ask an analogous combinatorial question about our admixed array.

Given the marginal constraints (on individual global ancestry and ancestry-specific allele frequencies) of an admixed array, how many admixed arrays are there that satisfy them?

We concretely define three sets of constrained admixed arrays.

  • Individual global ancestry constraint: \(\mathscr{A}_{1}=\{[\mathbf{A},\mathbf{X}]:~(\forall n \in [N])(A_{n\cdot}=a_{n\cdot})\}\). Equivalently, \(\mathscr{A}_{1}\) denotes the set of admixed arrays for which \(\boldsymbol{A}=\boldsymbol{a}\), for some fixed non-negative integer \(N\)-vector \(\boldsymbol{a}=(a_{1\cdot},\ldots,a_{N\cdot})\).

  • Ancestry-specific allele frequency constraint: \(\mathscr{A}_{2}=\{[\mathbf{A},\mathbf{X}]:~(\forall p\in[P])(\Phi_{p,0}=\phi_{p,0} \wedge \Phi_{p,1}=\phi_{p,1})\}\). Equivalently, \(\mathscr{A}_{2}\) denotes the set of admixed arrays for which \(\boldsymbol{\Phi}_0=\boldsymbol{\phi}_0\) and \(\boldsymbol{\Phi}_1=\boldsymbol{\phi}_1\), for some fixed non-negative integer \(P\)-vectors \(\boldsymbol{\phi}_0=(\phi_{1,0},\ldots,\phi_{P,0})\) and \(\boldsymbol{\phi}_1=(\phi_{1,1},\ldots,\phi_{P,1})\).

  • Both constraints: \(\mathscr{A}_{12}\) denotes the set of admixed arrays for which \(\boldsymbol{A}=\boldsymbol{a}\), \(\boldsymbol{\Phi}_0=\boldsymbol{\phi}_0\) and \(\boldsymbol{\Phi}_1=\boldsymbol{\phi}_1\).

With zagar, we can compute the sizes of these constrained sets using the functions getA1, getA2 and getA12. If you have OpenMP installed, you can speed up getA12 by calling getA12_parallel instead.

# Size of A1
>>> zagar.getA1(N=N, P=P, a_vec=row_sum_A)
30512038196863407653105172480000

# Size of A2
>>> zagar.getA2(N=N, P=P, phi0_vec=phi0, phi1_vec=phi1)
7845452210162977021624320000

# Size of A12
>>> zagar.getA12(N=N, a_vec=row_sum_A, phi0_vec=phi0, phi1_vec=phi1)
574155698908329217448832
# Or use: zagar.getA12_parallel(N=N, a_vec=row_sum_A, phi0_vec=phi0, phi1_vec=phi1) if you have OpenMP
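The size of \(\mathscr{A}_1\) also has a closed form that makes a useful cross-check: the constraint fixes only the row sums of \(\mathbf{A}\), while \(\mathbf{X}\) ranges freely over \(\{0,1\}^{N\times 2P}\), so \(|\mathscr{A}_1|=2^{2NP}\prod_{n=1}^{N}\binom{2P}{a_{n\cdot}}\). A quick sketch (plain Python, independent of zagar) reproducing the getA1 value above:

```python
from math import comb, prod

# Closed-form size of A1: choose the positions of the 1's in each row of A,
# times 2^(2NP) free choices for X
N, P = 6, 5
row_sum_A = [6, 5, 6, 6, 8, 5]
size_A1 = 2 ** (2 * N * P) * prod(comb(2 * P, a) for a in row_sum_A)
print(size_A1)  # 30512038196863407653105172480000
```

No such simple product formula exists for \(\mathscr{A}_2\) or \(\mathscr{A}_{12}\), since their constraints couple \(\mathbf{A}\) and \(\mathbf{X}\); that is where zagar's counting routines do real work.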

Saddle-Point Approximation

Let us consider a regular admixed array, i.e., one whose constraints are \(\boldsymbol{a}=(P,\ldots,P)\) and \(\boldsymbol{\phi}_0=\boldsymbol{\phi}_1=(N,\ldots,N)\). In this constrained case, a saddle-point approximation to a Cauchy integral can be used to estimate the size of \(\mathscr{A}_{12}\):

\(\log_2(|\mathscr{A}_{12}|)\approx 2NP - \frac{1}{2}[N\log_2(\pi P) + P\log_2(\pi N)-\log_2(\pi NP)] - \log_2(e)\cdot(N+P-1)^2/(8NP)\)
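The formula can be transcribed directly into plain Python (independent of zagar). The sketch below reproduces the log2_approx value reported by getA12 further down; its first two terms, without the final correction, give the reported log2_upper_bound:

```python
import math

def log2_A12_saddle(N, P):
    """Saddle-point approximation to log2(|A12|) for a regular admixed array."""
    main = 2 * N * P
    entropy = 0.5 * (N * math.log2(math.pi * P)
                     + P * math.log2(math.pi * N)
                     - math.log2(math.pi * N * P))
    correction = math.log2(math.e) * (N + P - 1) ** 2 / (8 * N * P)
    return main - entropy - correction

print(round(log2_A12_saddle(9, 3), 5))  # approximately 34.59882
```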

We demonstrate this functionality in zagar.

>>> import math

# Array parameters
>>> N = 9
>>> P = 3

# Define regular constraints
>>> a_reg = [P] * N       # length N, all entries = P
>>> phi0_reg = [N] * P    # length P, all entries = N
>>> phi1_reg = [N] * P    # length P, all entries = N

# Compute size of A12 and report the approximation and upper bound
>>> res = zagar.getA12(N=N, a_vec=a_reg, phi0_vec=phi0_reg, phi1_vec=phi1_reg, return_approx=True)
Computing estimated log-counts based on saddle-point approximation.
Note: log2_approx is good only when P and N are large with bounded ratio
>>> print(res)
{'count': 25697841152, 'is_regular': True, 'log2_upper_bound': 35.40699628357531, 'log2_approx': 34.59881989492918}

# Saddle-point approximation
>>> print(res['log2_approx'])
34.59881989492918

# Upper bound
>>> print(res['log2_upper_bound'])
35.40699628357531

# Compute log2(|A12|); note how closely it matches the saddle-point approximation
>>> math.log2(25697841152)
34.5809281141245

For \((N,P)=(9,3)\), there are \(|\mathscr{A}_{12}|=25,697,841,152\) (more than 25 billion!) regular admixed arrays. This corresponds to \(\log_2(|\mathscr{A}_{12}|)\approx 34.6\) bits, which the saddle-point formula approximates well.