Sage Example ============ Basics ------ To make sure Sage works, try to generate a binary matrix (which we refer to as :math:`\mathbf{X}`). .. code-block:: sage sage: from sage.all import * ....: import random # Define matrix dimensions sage: ....: rows = 6 ....: cols = 8 # Generate random binary matrix sage: X = matrix(GF(2), rows, cols, lambda i, j: random.choice([0, 1])) sage: print(X) [1 1 0 1 1 1 0 1] [0 0 1 1 0 0 0 0] [0 0 0 0 1 1 1 0] [1 1 0 1 1 1 1 0] [0 0 0 0 1 0 1 0] [0 0 0 1 1 0 1 1] sage: print(X.parent()) Full MatrixSpace of 6 by 8 dense matrices over Finite Field of size 2 If the above works, we are ready to proceed with performing combinatorial calculations. First, import **zagar**. (If you have not installed it, go to the `Installation page <./installation.html>`__ on this website and follow instructions there.) .. code-block:: sage sage: import zagar Binary matrices form a class of rich and practical discrete structures. - *Rich*, because they lead to interesting mathematical theory (see e.g., `Barvinok 2010 `__, `Cooper et al. 2019 `__, `Liebenau and Wormald 2022 `__ and many others not listed); - *Practical*, because they represent many data objects encountered in reality (e.g., contingency tables in statistics, genetic polymorphism data, codes for information transmission). With **zagar**, we can investigate the following combinatorial question: Given marginal constraints, like the row and column sums, of a binary matrix, how many such matrices are there that satisfy them? The answer to this question is challenging in general. But for reasonably sized matrix dimensions, the function ``count_matrices`` calculates the answer. Let us use the binary matrix :math:`\mathbf{X}` generated earlier to demonstrate its usage. .. code-block:: sage # Compute row and column margins sage: row_margins = [sum(int(x) for x in X.row(i)) for i in range(X.nrows())] sage: col_margins = [sum(int(x) for x in X.column(j)) for j in range(X.ncols())] # Check sage: print(row_margins) [6, 2, 3, 6, 2, 4] sage: print(col_margins) [2, 2, 1, 4, 5, 3, 4, 2] # Count number of binary matrices with observed marginal constraints sage: zagar.count_matrices(p=row_margins,q=col_margins,sort_p_desc=True) 39940 There are :math:`39,940` such binary matrices --- a huge count! Admixed Arrays -------------- Closely related to a binary matrix is an admixed array. An *admixed array* is a pair :math:`[\mathbf{A},\mathbf{X}]`, where :math:`\mathbf{A}=(A_{np})` and :math:`\mathbf{X}=(X_{np})` lie in :math:`\{0,1\}^{N\times 2P}`. Here, :math:`N` and :math:`P` are positive integers providing the dimensionality of the underlying binary matrices; they correspond to the number of individuals and the number of genetic loci in an admixed population dataset. Let us generate a random admixed array. .. code-block:: sage # Define array parameters sage: N=6 sage: P=5 # Generate random admixed array sage: X = matrix(GF(2), N, 2*P, lambda i, j: random.choice([0, 1])) sage: A = matrix(GF(2), N, 2*P, lambda i, j: random.choice([0, 1])) sage: print(A) [1 1 1 1 0 0 1 1 0 0] [1 0 1 0 1 0 1 1 0 0] [1 1 0 1 0 0 1 1 0 1] [1 0 0 1 0 1 1 0 1 1] [1 0 1 1 1 1 1 1 0 1] [1 1 1 1 0 0 0 0 0 1] sage: print(X) [1 1 1 0 1 1 0 1 1 1] [1 1 0 0 0 1 0 1 1 0] [1 0 0 1 1 0 0 1 0 1] [1 1 0 1 1 1 0 1 1 0] [1 1 0 1 0 1 1 1 1 0] [0 0 0 1 0 0 1 0 0 1] The following two summary statistics (modulo normalization factors that are irrelevant here) are studied in genetic studies of admixed populations. - *Individual global ancestry*: For each individual :math:`n\in[N]`, their global ancestry dosage is :math:`A_{n\cdot}=\sum_{p=1}^{2P}A_{np}`. Equivalently, by gathering the values across all individuals we obtain a vector of individual global ancestry dosages called the *ancestry dosage vector*, :math:`\boldsymbol{A}=(A_{1\cdot},\ldots,A_{N\cdot})`. - *Ancestry-specific allele frequencies*: For each locus :math:`p\in[P]`, the pair :math:`(\Phi_{p,0},\Phi_{p,1})` records the allele dosage specific to each ancestry. Mathematically, :math:`\Phi_{p,0}=\sum_{n=1}^N \left[(1-A_{np})X_{np}+(1-A_{n(P+p)})X_{n(P+p)}\right]` and :math:`\Phi_{p,1}=\sum_{n=1}^N \left(A_{np}X_{np}+A_{n(P+p)}X_{n(P+p)}\right)`. Again, by gathering the values across all loci we obtain two *ancestry-specific allele dosage vectors*: :math:`\boldsymbol{\Phi}_0=(\Phi_{1,0},\ldots,\Phi_{P,0})` and :math:`\boldsymbol{\Phi}_1=(\Phi_{1,1},\ldots,\Phi_{P,1})`. We compute these summaries for the admixed array generated above. .. code-block:: sage # Individual global ancestry # Convert type from GF(2) (modulo 2 arithmetic) to ZZ (integer arithmetic) sage: row_sum_A = [sum(ZZ(A[n,j]) for j in range(2*P)) for n in range(N)] sage: print(row_sum_A) [6, 5, 6, 6, 8, 5] # Ancestry-specific allele frequencies sage: phi0 = [] ....: phi1 = [] ....: for p in range(P): ....: s0 = ZZ(0) ....: s1 = ZZ(0) ....: for n in range(N): ....: a1 = ZZ(A[n, p]) ....: a2 = ZZ(A[n, P+p]) ....: x1 = ZZ(X[n, p]) ....: x2 = ZZ(X[n, P+p]) ....: ....: # Now these are ordinary integer products/sums (no mod 2) ....: s0 += (1 - a1)*x1 + (1 - a2)*x2 ....: s1 += a1*x1 + a2*x2 ....: ....: phi0.append(s0) ....: phi1.append(s1) sage: print(phi0) [2, 4, 1, 3, 4] sage: print(phi1) [7, 2, 5, 5, 2] # Optional: sanity checks # 1) row sum constraints must sum to no. 1's in A # 2) column sum constraints must sum to no. 1's in X sage: sum(row_sum_A) == (sum(ZZ(a) for a in A.list())) True sage: (sum(phi0)+sum(phi1)) == (sum(ZZ(x) for x in X.list())) True The code above was written to help reinforce understanding of how these summaries are calculated from data. These summaries can also be obtained with a single call to the function ``get_constraints``: .. code-block:: sage # Compute phi0, phi1 and row_sum_A sage: constraint_list = zagar.get_constraints(A.change_ring(ZZ), X.change_ring(ZZ)) sage: constraint_list[0] # row_sum_A [6, 5, 6, 6, 8, 5] sage: constraint_list[1] # phi0 [2, 4, 1, 3, 4] sage: constraint_list[2] # phi1 [7, 2, 5, 5, 2] Now assume that these summaries are viewed as constraints. We can ask an analogous combinatorial question of our admixed array. Given the marginal constraints (on individual global ancestry and ancestry-specific allele frequencies) of an admixed array, how many admixed arrays are there that satisfy them? We concretely define three sets of constrained admixed arrays. - *Individual global ancestry constraint*: :math:`\mathscr{A}_{1}=\{[\mathbf{A},\mathbf{X}]:~(\forall n \in [N])(A_{n\cdot}=a_{n\cdot})\}`. Equivalently, :math:`\mathscr{A}_{1}` denotes the set of admixed arrays for which :math:`\boldsymbol{A}=\boldsymbol{a}`, for some fixed non-negative integer :math:`N`-vector :math:`\boldsymbol{a}=(a_{1\cdot},\ldots,a_{N\cdot})`. - *Ancestry-specific allele frequency constraint*: :math:`\mathscr{A}_{2}=\{[\mathbf{A},\mathbf{X}]:~(\forall p\in[P])(\Phi_{p,0}=\phi_{p,0} \wedge \Phi_{p,1}=\phi_{p,1})\}`. Equivalently, :math:`\mathscr{A}_{2}` denotes the set of admixed arrays for which :math:`\boldsymbol{\Phi}_0=\boldsymbol{\phi}_0` and :math:`\boldsymbol{\Phi}_1=\boldsymbol{\phi}_1`, for some fixed non-negative integer :math:`P`-vectors :math:`\boldsymbol{\phi}_0=(\phi_{1,0},\ldots,\phi_{P,0})` and :math:`\boldsymbol{\phi}_1=(\phi_{1,1},\ldots,\phi_{P,1})`. - *Both constraints*: :math:`\mathscr{A}_{12}` denotes the set of admixed arrays for which :math:`\boldsymbol{A}=\boldsymbol{a}`, :math:`\boldsymbol{\Phi}_0=\boldsymbol{\phi}_0` and :math:`\boldsymbol{\Phi}_1=\boldsymbol{\phi}_1`. With **zagar**, we can compute the sizes of these constrained sets of arrays using ``getA1``, ``getA2`` and ``getA12``. If you have OpenMP installed, you can speed up the runtime of ``getA12`` by using ``getA12_parallel`` instead. .. code-block:: sage # Size of A1 sage: zagar.getA1(N=N, P=P, a_vec=row_sum_A) 30512038196863407653105172480000 # Size of A2 sage: zagar.getA2(N=N, P=P, phi0_vec=phi0, phi1_vec=phi1) 7845452210162977021624320000 # Size of A12 sage: zagar.getA12(N=N, a_vec=row_sum_A, phi0_vec=phi0, phi1_vec=phi1) # sage: zagar.getA12_parallel(N=N, a_vec=row_sum_A, phi0_vec=phi0, phi1_vec=phi1) # if user has OpenMP 574155698908329217448832 Saddle-Point Approximation ^^^^^^^^^^^^^^^^^^^^^^^^^^ Let us consider a regular admixed array, defined by :math:`\boldsymbol{a}=(P,...,P)` and :math:`\boldsymbol{\phi}_0=\boldsymbol{\phi}_1=(N,...,N)`. In this constrained case, a `saddle-point approximation `__ to a `Cauchy integral `__ can be used to estimate the size of :math:`\mathscr{A}_{12}`: :math:`\log_2(|\mathscr{A}_{12}|)\approx 2NP - \frac{1}{2}[N\log_2(\pi P) + P\log_2(\pi N)-\log_2(\pi NP)] - \log_2(e)\cdot(N+P-1)^2/(8NP)` We demonstrate this functionality in **zagar**. .. code-block:: sage # Array parameters sage: N = 9 sage: P = 3 # Define regular constraints sage: a_reg = vector(ZZ, [P]*N) # length N, all entries = P sage: phi0_reg = vector(ZZ, [N]*P) # length P, all entries = N sage: phi1_reg = vector(ZZ, [N]*P) # length P, all entries = N # Compute size of A12 and report the approximation and upper bound sage: res = zagar.getA12(N=N, a_vec=a_reg, phi0_vec=phi0_reg, phi1_vec=phi1_reg, return_approx=True) Computing estimated log-counts based on saddle-point approximation. Note: log2_approx is good only when P and N are large with bounded ratio sage: print(res) {'count': 25697841152, 'is_regular': True, 'log2_upper_bound': 35.40699628357531, 'log2_approx': 34.59881989492918} # Saddle-point approximation sage: print(res['log2_approx']) 34.59881989492918 # Upper bound sage: print(res['log2_upper_bound']) 35.40699628357531 # Compute log2(A12) sage: log(25697841152, 2).n() 34.5809281141245 # well-approximated by saddle-point approximation For :math:`(N,P)=(9,3)`, there are :math:`|\mathscr{A}_{12}|=25,697,841,152` (more than 25 billion!) regular admixed arrays. This is equivalent to about :math:`\log_2(|\mathscr{A}_{12}|)=34.6` bits, which is well-approximated by the saddle-point formula.