Statistics Theory
Showing new listings for Friday, 12 June 2026
- [1] arXiv:2606.12943 [pdf, html, other]
Title: Phase transition of Schott's statistic for high-dimensional heavy-tailed data
Comments: 42 pages
Subjects: Statistics Theory (math.ST)
Consider Schott's statistic (Schott, 2005), defined as the squared Frobenius norm of the sample correlation matrix, for data from $\alpha$-regularly varying populations. We investigate its asymptotic distribution in a general framework characterized by the data dimension $p$, the sample size $n$, and the regularly varying coefficient $\alpha$. In particular, we identify a phase transition phenomenon in the asymptotic behavior. For light-tailed populations ($\alpha > 3$), we revisit the $\alpha$-free asymptotic distribution but relax the constraint on the ratio $p/n$. For heavy-tailed populations ($\alpha < 3$), we derive a new asymptotic normal distribution whose variance explicitly depends on $\alpha$. We also propose a consistent estimator for the asymptotic variance such that the standardized Schott's test statistic remains applicable for unknown location parameters and all $\alpha > 0$.
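To make the object of study concrete, the following is a minimal sketch (not the authors' code) of the quantity the abstract describes: the squared Frobenius norm of the sample correlation matrix. The function names `sample_correlation` and `squared_frobenius` are hypothetical, and the centering, standardization, and variance estimator studied in the paper are not reproduced here.

```python
import math

def sample_correlation(data):
    """Sample correlation matrix of an n-by-p data set given as a list of rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Column norms of the centered data; the 1/(n-1) factors cancel in the ratio.
    norms = [math.sqrt(sum(c[j] ** 2 for c in centered)) for j in range(p)]
    return [
        [sum(c[i] * c[j] for c in centered) / (norms[i] * norms[j]) for j in range(p)]
        for i in range(p)
    ]

def squared_frobenius(matrix):
    """Sum of squared entries of a matrix."""
    return sum(v * v for row in matrix for v in row)
```

Because the diagonal of a correlation matrix is all ones, the squared Frobenius norm equals $p + 2\sum_{i<j} r_{ij}^2$, so it carries the same information as the sum of squared off-diagonal sample correlations.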
- [2] arXiv:2606.13084 [pdf, html, other]
Title: Characterizing metric-space-valued processes: separating classes and weak invariance principles for measure-theoretic inference
Subjects: Statistics Theory (math.ST); Probability (math.PR)
This article investigates stochastic processes taking values in metric spaces that lack a topological vector space structure, a regime characterized by intricate interplay between topological, geometric, and temporal dependence structures. It is formally established that spaces admitting an isometric Hilbertian embedding constitute a strict subclass within the much broader class of metric spaces possessing the ball property. While traditional kernel methods are susceptible to geometric distortion when the underlying space cannot be isometrically embedded into a Hilbert space, we bypass such limitations by exploiting a fundamental structural property inherent to this broader class; namely, that Borel probability measures are uniquely determined by their values on balls. These separating classes provide the foundation for the subsequently introduced measure-theoretic inference methodology. We derive uniform convergence of a family of time-dependent random measures, alongside weak invariance principles for the corresponding nonstationary random fields. This framework explicitly exposes how dependence and geometric complexity influence sample path regularity. Furthermore, because the rapid decay of small-ball probabilities can prohibit the existence of limiting distributions for supremum-based discrepancy measures, we develop $L^p$-based alternatives. By directly leveraging the introduced convergence results, this approach circumvents the need for higher-order $U$-process formulations. Finally, for spaces that do admit an isometric Hilbertian embedding, and where $U$-processes naturally arise, we establish limit theory for both degenerate and nondegenerate multi-parameter $U$-processes, and demonstrate that local discrepancy tests maintain asymptotic stability under dynamic parameter regimes.
- [3] arXiv:2606.13230 [pdf, html, other]
Title: Consistency of variational approximations under bounded Kullback--Leibler divergence
Subjects: Statistics Theory (math.ST)
Variational methods are widely used to approximate posterior distributions in Bayesian inference when exact computation is infeasible. We study when such approximations inherit posterior consistency. Our first result shows that, on a general metric space, a uniform bound on the Kullback--Leibler divergence from the approximating measures to a tight sequence of target measures forces the approximating sequence to be tight. It follows that if the target posteriors converge weakly to a Dirac mass at the true parameter, then any variational sequence with bounded Kullback--Leibler divergence to the targets is also consistent. We also give simple logarithmic-moment conditions that verify this boundedness condition, and illustrate them for smooth generalised posterior distributions.
- [4] arXiv:2606.13280 [pdf, other]
Title: Generalization Bounds for Transformer-Based Next-Token Prediction in a Language Model
Subjects: Statistics Theory (math.ST)
A refined statistical understanding of LLM pre-training requires the analysis of the transformer architecture for data distributions that encapsulate key characteristics of text data. To address this, we propose a text data distribution based on an extension of the log-bilinear language model from the natural language processing literature. For this data generating process, we derive generalization bounds for deep transformer architectures, highlighting the dependence on the network architecture, the vocabulary size, the number of documents and the document length.
- [5] arXiv:2606.13554 [pdf, html, other]
Title: Asymptotic regimes for maximum likelihood estimation in the Ewens--Pitman model: When the strength parameter matters
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We study the large sample asymptotic behaviour of the Maximum Likelihood Estimator of the discount and strength parameters $(\alpha,\theta)$ in the Ewens--Pitman model for random partitions, under mild assumptions on the data-generating mechanism. We show that four distinct regimes arise, depending on the limiting behaviour of the frequency spectrum. In particular, in contrast with previous work, we find that $\theta$ may play a crucial role asymptotically. We further show that the existing literature implicitly focuses on only two of these regimes, and we relate this restriction to the constraints imposed by infinite exchangeability. Under the latter, indeed, the number of distinct blocks and the frequency spectrum are necessarily tied by a rigid structural relation. We prove that this lack of flexibility can be overcome through what we call the scaled Ewens--Pitman model, in which $\theta$ is allowed to grow with the sample size $n$. Finally, we provide empirical evidence from real-world data showing that such extensions are needed to capture frequency spectra that fall outside the classical Ewens--Pitman framework.
New submissions (showing 5 of 5 entries)
- [6] arXiv:2606.12720 (cross-list from math.PR) [pdf, html, other]
Title: On McDiarmid's Inequality under Dependence via Approximate Tensorization of Entropy
Comments: 27 pages
Subjects: Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
We argue that dependent versions of McDiarmid's inequality are a useful but underutilized tool in mathematical statistics, learning theory and theoretical computer science. To make this point, we first highlight that approximate tensorization of entropy (ATE) implies McDiarmid's inequality via the entropy method. Second, we derive McDiarmid's inequality for non-isotropic Gaussian random vectors $X \sim \mathcal N(\mu, \Sigma)$ through ATE with a constant of the order of the condition number of $\Sigma$. We obtain this ATE independently through a simple application of stochastic localization, and we also discuss how a more general ATE for the Gibbs sampler due to Ascolani et al. (2026) generalizes McDiarmid-like concentration to strongly log-concave and log-smooth probability measures. We then apply the resulting concentration inequalities to resolve a question on the concentration of $\operatorname{sign}(X)$ posed by Simone Bombari, investigate Erdős-Rényi graphs under dependence, and prove a Dvoretzky-Kiefer-Wolfowitz-type inequality for observations from a joint measure that fulfills ATE and has continuous marginal CDFs. For the class of strongly log-concave and log-smooth measures, this result improves upon a prior Dvoretzky-Kiefer-Wolfowitz-type inequality for non-i.i.d. observations due to Bobkov and Götze (2010) by establishing the expected $1/\sqrt{n}$ rate of convergence under weak dependence instead of $n^{-1/3}$.
- [7] arXiv:2606.12879 (cross-list from cs.DS) [pdf, html, other]
Title: Diffusion-Network Alignment: An Efficient Algorithm and Explicit Probability Bounds
Subjects: Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
This paper studies a variant of the classic network alignment problem, named diffusion-network alignment. The goal is to align the vertices of a rooted diffusion tree to the vertices of a network, where the diffusion tree could come from a communication trace or contact tracing, and the network could be an online or offline social network. Unlike classic network alignment, where both networks are fully observed, this model captures the information asymmetry between the two networks. To solve this problem, this paper presents an efficient algorithm based on tree correlation tests to extract alignment information from local neighborhoods. We analyze the performance of the algorithm in the sparse graph regime and show that, with high probability, all matched pairs are correct. Furthermore, for each vertex on the diffusion tree, this paper establishes an explicit lower bound on the probability that the vertex is correctly matched. These lower bounds are depth-dependent and increase as vertices get closer to the root.
- [8] arXiv:2606.12892 (cross-list from stat.ML) [pdf, html, other]
Title: Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.
- [9] arXiv:2606.13234 (cross-list from stat.CO) [pdf, html, other]
Title: Switching Hamiltonian Monte Carlo for sampling from mixture distributions
Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Statistics Theory (math.ST)
We introduce a switching Hamiltonian Monte Carlo method for sampling from finite mixture Boltzmann-Gibbs distributions. We propose symmetric numerical integrators to approximate switching Hamiltonian dynamics interlaced with Poisson jumps, where the regime-switching chain is simulated using the uniformization technique or the stochastic simulation algorithm. We prove geometric ergodicity of the resulting Markov chain. We develop an approach based on the discrete Poisson equation associated with numerical schemes to estimate the error in computing ergodic averages. Using this approach, we prove that the proposed numerical integrators have second-order bias. This approach is simple and can be generalized to other settings, for example, kinetic Langevin equations. Finally, we verify the convergence result via a numerical experiment.
- [10] arXiv:2606.13614 (cross-list from stat.ML) [pdf, html, other]
Title: Majority-of-Three is Optimal
Comments: 9 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
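As an illustration of the voting scheme named in the abstract, here is a minimal sketch for a toy hypothesis class of one-dimensional thresholds (label 1 iff $x \ge t$). The class, the rule for picking a consistent threshold, and the use of three separately drawn samples are assumptions made for this example; the abstract does not specify how the three classifiers are obtained, and all function names are hypothetical.

```python
def fit_threshold(sample):
    """Return a threshold consistent with a realizable sample of (x, y) pairs.

    Choosing the smallest positively labeled point classifies every training
    example correctly when labels come from some true threshold.
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")

def predict(threshold, x):
    """Threshold classifier: label 1 iff x >= threshold."""
    return 1 if x >= threshold else 0

def majority_of_three(thresholds, x):
    """Majority vote of three threshold classifiers at the point x."""
    votes = sum(predict(t, x) for t in thresholds)
    return 1 if votes >= 2 else 0
```

For this toy class the majority vote coincides with the classifier given by the median of the three thresholds, which is one way to see why voting can discard a single unlucky learner.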
Cross submissions (showing 5 of 5 entries)
- [11] arXiv:2501.19126 (replaced) [pdf, html, other]
Title: Asymptotic optimality theory of confidence intervals of the mean
Subjects: Statistics Theory (math.ST)
We address the classical problem of constructing confidence intervals (CIs) for the mean of a distribution, given \(N\) i.i.d. samples, such that the CI contains the true mean with probability at least \(1 - \delta\), where \(\delta \in (0,1)\). We characterize three distinct learning regimes based on the minimum achievable limiting width of any CI as the sample size \(N_{\delta} \to \infty\) and \(\delta \to 0\). In the first regime, where \(N_{\delta}\) grows slower than \(\log(1/\delta)\), the limiting width of any CI equals the width of the distribution's support, precluding meaningful inference. In the second regime, where \(N_{\delta}\) scales as \(\log(1/\delta)\), we precisely characterize the minimum limiting width, which depends on the scaling constant. In the third regime, where \(N_{\delta}\) grows faster than \(\log(1/\delta)\), complete learning is achievable, and the limiting width of the CI collapses to zero, converging to the true mean. We demonstrate that CIs derived from concentration inequalities based on Kullback--Leibler (KL) divergences achieve asymptotically optimal performance, attaining the minimum limiting width in both sufficient and complete learning regimes for distributions in two families: single-parameter exponential and bounded support. Additionally, these results extend to one-sided CIs, with the width notion adjusted appropriately. Finally, we generalize our findings to settings with random per-sample costs, motivated by practical applications such as stochastic simulators and cloud service selection. Instead of a fixed sample size, we consider a cost budget \(C_{\delta}\), identifying analogous learning regimes and characterizing the optimal CI construction policy.
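To illustrate what a KL-based confidence interval can look like, the sketch below inverts the standard Chernoff bound for Bernoulli samples: it returns all $q$ with $N \cdot \mathrm{kl}(\hat p, q) \le \log(2/\delta)$. This is a textbook construction chosen for illustration, not necessarily the one analyzed in the paper; the function names and the bisection depth are assumptions.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with q clamped away from 0 and 1."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def kl_confidence_interval(p_hat, n, delta, iters=60):
    """Two-sided interval {q : n * kl(p_hat, q) <= log(2 / delta)}, found by bisection."""
    threshold = math.log(2.0 / delta) / n
    # Upper endpoint: kl(p_hat, q) is increasing in q on [p_hat, 1].
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    upper = lo
    # Lower endpoint: kl(p_hat, q) is decreasing in q on [0, p_hat].
    lo, hi = 0.0, p_hat
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(p_hat, mid) <= threshold:
            hi = mid
        else:
            lo = mid
    lower = hi
    return lower, upper
```

The interval depends on $N$ and $\delta$ only through the ratio $\log(2/\delta)/N$, which is why the relative growth of the sample size and $\log(1/\delta)$ governs whether the width stays at the full support, settles at an intermediate value, or shrinks to zero.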
- [12] arXiv:2504.16279 (replaced) [pdf, html, other]
Title: Sharp Detection Threshold for Correlation among Multiple Unlabeled Gaussian Networks
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Applications (stat.AP)
This paper studies the hypothesis testing problem of deciding whether $m \geq 2$ complete weighted graphs with Gaussian edge weights are mutually correlated after unknown relabelings of their vertices. Under the null model all edge weights are independent standard Gaussians, whereas under the planted model the graphs share a latent vertex alignment and each pair of corresponding edge weights has correlation $\rho$. For fixed $m$, we identify the sharp information-theoretic threshold for detection. Above the threshold, a generalized likelihood-ratio test achieves strong detection, whereas even weak detection is impossible below the threshold. The result extends the two-graph detection threshold of Wu, Xu, and Yu to any fixed number of graphs, exhibits a side-information regime in which two graphs alone are insufficient but multiple graphs enable detection, and, together with the recovery threshold of Vassaux and Massoulié, shows that this Gaussian multi-graph model has no detection--recovery gap.
- [13] arXiv:2511.21441 (replaced) [pdf, other]
Title: Hierarchical Besov-Laplace priors for spatially inhomogeneous binary classification
Comments: 28 pages, supplement included, 4 figures, 4 tables. To appear in Advances in Data Analysis and Classification
Subjects: Statistics Theory (math.ST)
We study nonparametric Bayesian binary classification in the case where the unknown probability response function is possibly spatially inhomogeneous, for example, generally flat across the domain but with localized sharp variations. We consider a hierarchical procedure based on the Besov-Laplace priors from the inverse problems and imaging literature, with a carefully tuned hyper-prior on the regularity parameter. We show that the resulting posterior distribution concentrates towards the ground truth at the optimal rate, automatically adapting to the unknown regularity. To implement posterior inference in practice, we devise an efficient Markov chain Monte Carlo (MCMC) algorithm based on recent ad-hoc dimension-robust methods for Besov-Laplace priors. We then test the approach in extensive numerical simulations, which corroborate the theoretical results.
- [14] arXiv:2512.24701 (replaced) [pdf, html, other]
Title: Epistemic Confidence Statement via Extended Likelihood
Subjects: Statistics Theory (math.ST)
Fisher's fiducial probability has recently attracted renewed attention under the notion of epistemic confidence. Epistemic confidence statements can be formulated through extended likelihoods, thereby clarifying several long-standing controversies regarding the properties of fiducial probability. This formulation establishes a direct connection between Fisher's epistemic notion of confidence for observed data and Neyman's frequentist aleatory coverage probability for future data, thereby enabling the extension of epistemic confidence statements to multidimensional parameters. We demonstrate how higher-order asymptotic theory can be applied to refine the first-order asymptotic epistemic confidence statements of the observed region, as a direct consequence of the extended likelihood property.
- [15] arXiv:2606.11110 (replaced) [pdf, html, other]
Title: Fixed-Threshold One-Bit Toeplitz Covariance Estimation under Sparse-Ruler Sampling
Comments: v2: substantially revised; 21 pages main text + appendix, 59 pages total
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)
We study Toeplitz covariance estimation when fixed-threshold one-bit quantization is combined with deterministic sparse-ruler sampling, so that each observed bit is reused across many lag products. At a nonzero threshold the signs have nonzero mean, and this reuse gives raw sign products a coherent one-vertex variance component governed by weighted row sums; centering removes it and leaves a degenerate sparse-pair statistic. We prove a Gaussian variance contraction theorem for hollow quadratic forms of bounded coordinate transforms, including hard threshold signs: the variance is bounded by the squared correlation operator norm times the squared Frobenius norm of the edge weights, with constants independent of dimension, support size and maximum degree. For the oracle centered sparse-ruler estimator, the leading operator-norm term is \(\gamma_0L_1\kappa_{\rm obs}\sqrt{\varphi(\Omega)\log d/n}\), where \(\varphi(\Omega)=\sum_{s=1}^{d-1}q_s^{-1}\) is the coverage coefficient of the ruler; pooled marginal calibration from the \(n|\Omega|\) observed bits adds a plug-in term. A spectral-packing lower bound in a known-scale identity-neighborhood submodel shows that this dependence is intrinsic under balanced coverage geometry; in the non-saturated regime where the coverage term dominates, the oracle estimator is minimax rate optimal over this submodel.
- [16] arXiv:2111.08157 (replaced) [pdf, html, other]
Title: Fine Stratification of Survey Experiments
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
This paper studies a two-stage model of experimentation, where the researcher first samples representative experimental participants from an eligible pool, then assigns each sampled unit to treatment or control, using matched $k$-tuples randomization at both stages. To implement such designs, we develop a fast new algorithm for matching units into $k$-tuples for any $k \ge 2$ and any dimension of covariates. By surveying 200 recent experimental working papers, we estimate that our algorithm newly enables multivariate fine stratification with provable match quality guarantees for about 44% of experiments in economics. We show that finely stratified sampling and assignment both nonparametrically reduce the variance of treatment effect estimation, with the gains from stratified sampling increasing in the size of the eligible pool and in how well covariates predict treatment effect heterogeneity. We develop new inference methods that fully exploit the efficiency gains from both design stages, allowing researchers to report smaller standard errors if they designed a representative experiment. An application to nine published experiments quantifies the efficiency gains.
- [17] arXiv:2407.18572 (replaced) [pdf, other]
Title: Bernoulli amputation
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Other Statistics (stat.OT)
A novel, stochastic approach to amputation, the process of introducing missing values into a complete dataset, is presented. It allows one to construct a wide variety of missingness patterns by specifying only the distributions of missingness indicators, as opposed to specifying each missingness pattern manually. Missingness indicators are modeled in a principled way via copulas and Bernoulli margins, thus allowing one to incorporate dependence in missingness patterns. Besides more classical missingness mechanisms such as missing completely at random, missing at random, and missing not at random, the approach is able to model structured missingness such as block missingness and, via mixtures, monotone missingness, which are patterns of missing data frequently found in real-life datasets. Properties such as joint missingness probabilities or missingness correlation are derived mathematically. The flexibility of the approach in capturing different missingness patterns, while requiring only distributional assumptions on missingness indicators, is demonstrated with mathematical examples and empirical illustrations based on a well-known example dataset whose sample size is small enough that each missing data point can be identified visually. Finally, an example application to multivariate financial time series is provided.
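The idea of drawing dependent missingness indicators from a copula with Bernoulli margins can be sketched as follows for two columns. The choice of a bivariate Gaussian copula, the function names, and the use of `None` to mark missing values are assumptions made for this illustration; the abstract does not say which copulas or data representation the paper uses.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def missingness_indicators(n, probs, rho, rng):
    """Draw n pairs of indicators (1 = missing) with Bernoulli(probs[j]) margins.

    Dependence between the two indicators comes from a bivariate Gaussian
    copula with correlation rho (an illustrative choice).
    """
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u = (normal_cdf(z1), normal_cdf(z2))  # uniform margins
        rows.append(tuple(1 if u[j] <= probs[j] else 0 for j in range(2)))
    return rows

def amputate(data, indicators):
    """Replace entries flagged as missing by None, leaving the rest unchanged."""
    return [
        [None if flag else value for value, flag in zip(row, mask)]
        for row, mask in zip(data, indicators)
    ]
```

With `rho = 0` the two columns go missing independently, while values of `rho` near 1 with equal margins make them go missing together, which gives a block-like pattern from nothing more than a distributional specification.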