DISCO

class hyppo.ksample.DISCO(compute_distance='euclidean', bias=False, **kwargs)
Distance Components (DISCO) test statistic and p-value.

DISCO is a powerful multivariate k-sample test. It leverages distance matrix capabilities (similar to tests like distance correlation, or Dcorr). In fact, the DISCO statistic is equivalent to our 2-sample formulation of nonparametric MANOVA via independence testing, i.e. hyppo.ksample.KSample, and to hyppo.independence.Dcorr, hyppo.ksample.Energy, hyppo.independence.Hsic, and hyppo.ksample.MMD [1] [2].

Parameters
- compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:

  - From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]. See the documentation for scipy.spatial.distance for details on these metrics.
  - From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.

  Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
- bias (bool, default: False) -- Whether to use the biased or unbiased test statistic.
- **kwargs -- Arbitrary keyword arguments for compute_distance.
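
As an illustration of the metric(x, **kwargs) form described for compute_distance, here is a minimal sketch of a custom metric written with NumPy only. The function name minkowski_metric and its parameter p are hypothetical and chosen for this example; the sketch assumes the callable should return the full square matrix of pairwise distances between the rows of x.

```python
import numpy as np


def minkowski_metric(x, p=2):
    """Hypothetical custom metric: pairwise Minkowski distances between rows of x.

    Follows the metric(x, **kwargs) form; here p plays the role of an extra
    keyword argument.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D input as n samples with a single dimension.
        x = x[:, None]
    # Broadcast to an (n, n, dims) array of coordinate-wise absolute differences.
    diff = np.abs(x[:, None, :] - x[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)


x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = minkowski_metric(x, p=2)  # (3, 3) symmetric matrix with a zero diagonal
```

Under the description above, such a function would presumably be supplied as DISCO(compute_distance=minkowski_metric, p=1), with p forwarded through **kwargs. Treat that call as an assumption to check against the installed hyppo version.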
 
Notes

Traditionally, the formulation for the DISCO statistic is as follows [3]:

Define \(\{ u^i_1 \stackrel{iid}{\sim} F_{U_1},\ i = 1, ..., n_1 \}\) up to \(\{ u^j_K \stackrel{iid}{\sim} F_{U_K},\ j = 1, ..., n_K \}\) as K groups of samples deriving from different distributions with the same dimensionality. If \(d(\cdot, \cdot)\) is a distance metric (e.g. Euclidean), \(N = \sum_{k = 1}^K n_k\), and \(\mathrm{Energy}\) is the Energy test statistic from hyppo.ksample.Energy, then

\[\mathrm{DISCO}_N(\mathbf{u}_1, \ldots, \mathbf{u}_K) = \sum_{1 \leq k < l \leq K} \frac{n_k n_l}{2N} \mathrm{Energy}_{n_k + n_l} (\mathbf{u}_k, \mathbf{u}_l)\]

The implementation in the hyppo.ksample.KSample class (using hyppo.independence.Dcorr) is in fact equivalent to this implementation (for p-values), and the statistics are equivalent up to a scaling factor [1].

The p-value returned is calculated using a permutation test via hyppo.tools.perm_test. The fast version of the test uses hyppo.tools.chi2_approx.

References

[1] Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, and Joshua T. Vogelstein. Nonpar MANOVA via Independence Testing. arXiv:1910.08883 [cs, stat], April 2021.
[2] Cencheng Shen and Joshua T. Vogelstein. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, September 2020. doi:10.1007/s10182-020-00378-1.
[3] Gábor J. Székely and Maria L. Rizzo. Testing for equal distributions in high dimensions. InterStat, 2004.
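
The pairwise formulation in the Notes can be sketched directly in NumPy. This is an illustration under stated assumptions, not hyppo's implementation: it assumes Euclidean distance and the biased (V-statistic) form of the energy statistic, 2 mean d(x, y) - mean d(x, x') - mean d(y, y'). The helper names pairwise_euclidean, energy_statistic, and disco_statistic are hypothetical. Because hyppo computes its statistic through Dcorr and the two agree only up to a scaling factor, the values below are not expected to match DISCO().statistic.

```python
from itertools import combinations

import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distances between the rows of a (n, p) and the rows of b (m, p)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_statistic(x, y):
    """Assumed biased two-sample energy statistic (V-statistic form)."""
    return (
        2.0 * pairwise_euclidean(x, y).mean()
        - pairwise_euclidean(x, x).mean()
        - pairwise_euclidean(y, y).mean()
    )


def disco_statistic(*groups):
    """Sum of n_k * n_l / (2N) * Energy(u_k, u_l) over all pairs k < l."""
    # Coerce every group to shape (n_k, p); 1-D inputs become (n_k, 1).
    groups = [np.asarray(g, dtype=float).reshape(len(g), -1) for g in groups]
    total = sum(len(g) for g in groups)  # N, the pooled sample size
    stat = 0.0
    for u_k, u_l in combinations(groups, 2):
        weight = len(u_k) * len(u_l) / (2.0 * total)
        stat += weight * energy_statistic(u_k, u_l)
    return stat


x = np.arange(7.0)
same = disco_statistic(x, x)            # identical samples
shifted = disco_statistic(x, x + 10.0)  # clearly separated samples
```

Under this biased form, identical samples give a statistic of zero and separated samples give a positive value, whereas the unbiased default shown elsewhere on this page can be negative.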
 
Methods Summary
| DISCO.statistic(*args) | Calculates the DISCO test statistic. |
| DISCO.test(*args[, reps, workers, auto, random_state]) | Calculates the DISCO test statistic and p-value. |
DISCO.statistic(*args)

Calculates the DISCO test statistic.
DISCO.test(*args, reps=1000, workers=1, auto=True, random_state=None)

Calculates the DISCO test statistic and p-value.

Parameters
- *args (ndarray) -- Variable length input data matrices. All inputs must have the same number of samples and dimensions. That is, the shapes must be (n, p), where n is the number of samples and p is the number of dimensions.
- reps (int, default: 1000) -- The number of replications used to estimate the null distribution when the permutation test is used to calculate the p-value.
- workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
- auto (bool, default: True) -- Whether to automatically use the fast approximation. If True and the sample size is greater than 20, hyppo.tools.chi2_approx will be run, and the parameters reps and workers are irrelevant. Otherwise, hyppo.tools.perm_test will be run.
 
- Returns
Examples

>>> import numpy as np
>>> from hyppo.ksample import DISCO
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = DISCO().test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'-1.566, 1.0'
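
To make the roles of reps and random_state concrete, here is a generic sketch of a two-sample permutation p-value. It is not the code of hyppo.tools.perm_test, whose details are not shown on this page. The helper names mean_difference and permutation_pvalue are hypothetical, the statistic is a simple stand-in rather than DISCO, and the (1 + count) / (1 + reps) estimate is an assumption of this sketch.

```python
import numpy as np


def mean_difference(x, y):
    """Stand-in statistic (not DISCO): absolute difference of sample means."""
    return abs(x.mean() - y.mean())


def permutation_pvalue(x, y, statistic, reps=1000, random_state=None):
    """Estimate a p-value by recomputing the statistic on shuffled group labels."""
    rng = np.random.default_rng(random_state)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(reps):
        # Shuffle the pooled samples, then split them back into two groups.
        shuffled = rng.permutation(pooled)
        if statistic(shuffled[:n], shuffled[n:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (1 + exceed) / (1 + reps)


x = np.arange(7.0)
stat, pvalue = permutation_pvalue(x, x, mean_difference, reps=200, random_state=0)
```

With identical samples the observed statistic is zero, so every permuted statistic is at least as large and the estimated p-value is 1.0, in line with the p-value in the example above. Increasing reps refines the null distribution estimate at proportional cost, and fixing random_state makes the result reproducible.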
