Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.
The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events $S$.
The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.
Definitions
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_1,x_2,\dots,x_m)$, where $x_i\in X$, the empirical frequency of $h\in H$ on $x$ is

$$\widehat{Q}_x(h)=\frac{1}{m}\left|\{\,i: 1\le i\le m,\ h(x_i)=1\,\}\right|.$$

The theoretical probability of $h\in H$ is defined as

$$Q_P(h)=P\{\,y\in X : h(y)=1\,\}.$$
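As an illustration (added here, not part of the original definitions), the following sketch estimates the empirical frequency of a simple predicate on an i.i.d. sample drawn from the uniform distribution on $[0,1]$ and compares it with the theoretical probability; the predicate and distribution are chosen only for demonstration.

```python
import random

def h(y):
    # Example predicate: indicator of the event {y > 0.5}
    return 1 if y > 0.5 else 0

def empirical_frequency(h, sample):
    # \widehat{Q}_x(h) = (1/m) * |{ i : h(x_i) = 1 }|
    return sum(h(x_i) for x_i in sample) / len(sample)

m = 10_000
sample = [random.random() for _ in range(m)]   # m i.i.d. draws from P = Uniform[0, 1]
print(empirical_frequency(h, sample))          # close to Q_P(h) = 0.5
```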
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.

Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis using the concept of growth function.
The statement of the uniform convergence theorem is as follows:
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon>0$ and $m$ a positive integer, we have:

$$P^m\{\,x\in X^m : |Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon\ \text{for some}\ h\in H\,\}\le 4\Pi_H(2m)\,e^{-\varepsilon^{2}m/8},$$

- where, for any $x\in X^m$, $\widehat{Q}_x(h)=\frac{1}{m}\left|\{\,i: 1\le i\le m,\ h(x_i)=1\,\}\right|$,
- and $Q_P(h)=P\{\,y\in X : h(y)=1\,\}$.

$P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi_H$ is defined as follows: for any set $H$ of $\{0,1\}$-valued functions over $X$ and any $D\subseteq X$ (identifying each $h$ with the subset of $X$ on which it equals $1$),

$$\Pi_H(D)=\{\,h\cap D : h\in H\,\}.$$

And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as:

$$\Pi_H(m)=\max_{D\subseteq X,\ |D|=m}\left|\{\,h\cap D : h\in H\,\}\right|.$$
From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state Sauer's lemma, which we will need in the proof.
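To make the theorem concrete, here is an illustrative simulation (an addition, not from the original text): it measures the largest gap between empirical frequency and theoretical probability over the class of threshold predicates $h_a(y)=\mathbf{1}[y\ge a]$ under the uniform distribution on $[0,1]$. This class has VC dimension $1$, and the maximal gap shrinks as the sample grows, as the theorem predicts.

```python
import random

def max_deviation(sample):
    """sup over thresholds a of |Q_hat_x(h_a) - Q_P(h_a)|, where h_a(y) = 1 iff y >= a
    and P = Uniform[0, 1], so that Q_P(h_a) = 1 - a for a in [0, 1]."""
    xs = sorted(sample)
    m = len(xs)
    worst = 0.0
    # The empirical frequency changes only at sample points, so it suffices to
    # examine thresholds at, and just above, each sample point.
    for k, a in enumerate(xs):
        emp_at = (m - k) / m           # fraction of the sample with x_i >= a
        emp_above = (m - k - 1) / m    # fraction for a threshold just above a
        worst = max(worst, abs(emp_at - (1 - a)), abs(emp_above - (1 - a)))
    return worst

for m in (100, 1_000, 10_000):
    sample = [random.random() for _ in range(m)]
    print(m, round(max_deviation(sample), 4))   # the uniform deviation decreases with m
```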
Sauer–Shelah lemma
The Sauer–Shelah lemma relates the shattering number $\Pi_H(m)$ to the VC dimension.

Lemma: $\Pi_H(m)\le\sum_{i=0}^{d}\binom{m}{i}$, where $d$ is the VC dimension of the concept class $H$.
Corollary: $\Pi_H(m)\le m^{d}+1$, so the shattering number grows only polynomially in $m$ once the VC dimension $d$ is finite.
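As a concrete check (an added illustration), consider the class of threshold predicates $h_a(y)=\mathbf{1}[y\ge a]$ on the real line, which has VC dimension $d=1$. Enumerating its restrictions to a sample of $m$ distinct points yields exactly $m+1$ dichotomies, matching the bound $m^{d}+1$ and far below $2^{m}$:

```python
from math import comb

def threshold_restrictions(D):
    """Distinct restrictions {h ∩ D : h ∈ H} for the threshold class H = {1[y >= a]}."""
    D = sorted(D)
    patterns = set()
    # A threshold below every point, plus one just above each point, realizes
    # every achievable dichotomy of D (assuming the points are well separated).
    for a in [D[0] - 1.0] + [y + 1e-9 for y in D]:
        patterns.add(tuple(1 if y >= a else 0 for y in D))
    return patterns

D = [0.1, 0.4, 0.6, 0.8, 0.9]                 # m = 5 distinct sample points
m, d = len(D), 1
print(len(threshold_restrictions(D)))          # 6 = m + 1 dichotomies on this sample
print(sum(comb(m, i) for i in range(d + 1)))   # Sauer bound: 6
print(m**d + 1, 2**m)                          # corollary bound 6, versus 2^m = 32
```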
The proof below follows the standard sources for this argument. Before we get into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.
- Symmetrization: We transform the problem of analyzing $|Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$.
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r\|s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ in some subset of $\{1,2,\dots,m\}$. The symbol $r\|s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.
Symmetrization
Lemma: Let $V=\{\,x\in X^m : |Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon\ \text{for some}\ h\in H\,\}$ and

$$R=\{\,(r,s)\in X^m\times X^m : |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2\ \text{for some}\ h\in H\,\}.$$

Then for $m\ge\frac{2}{\varepsilon^{2}}$, we have $P^m(V)\le 2P^{2m}(R)$.
Proof: By the triangle inequality, if $|Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon$ and $|Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2$ then $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$.

Therefore,

$$\begin{aligned}
&P^{2m}(R)\\
\ge{}&P^{2m}\{\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon\ \text{and}\ |Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}\\
={}&\int_{V}P^{m}\{s:\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon\ \text{and}\ |Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}\,dP^{m}(r)\\
={}&A
\end{aligned}$$

since $r$ and $s$ are independent.
Now for $r\in V$, fix an $h$ such that $|Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon$. For this $h$, we shall show that

$$P^m\left\{\,|Q_P(h)-\widehat{Q}_s(h)|\le\frac{\varepsilon}{2}\,\right\}\ge\frac{1}{2}.$$

Thus for any $r\in V$ the inner probability is at least $\frac{1}{2}$, and hence $A\ge\frac{P^m(V)}{2}$ and $P^{2m}(R)\ge\frac{P^m(V)}{2}$. This accomplishes the first step of the high-level idea.
Notice that $m\cdot\widehat{Q}_s(h)$ is a binomial random variable with expectation $m\cdot Q_P(h)$ and variance $m\cdot Q_P(h)(1-Q_P(h))$. By Chebyshev's inequality we get

$$P^m\left\{\,|Q_P(h)-\widehat{Q}_s(h)|>\frac{\varepsilon}{2}\,\right\}\le\frac{m\cdot Q_P(h)(1-Q_P(h))}{(m\varepsilon/2)^{2}}\le\frac{1}{\varepsilon^{2}m}\le\frac{1}{2}$$

for the assumed bound $m\ge\frac{2}{\varepsilon^{2}}$. Here we use the fact that $x(1-x)\le\frac{1}{4}$ for $x\in[0,1]$.
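As a quick sanity check of this step (an added illustration with arbitrarily chosen parameters), the following simulation estimates $P^m\{|Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}$ for a predicate with $Q_P(h)=0.3$ and a sample size satisfying $m\ge 2/\varepsilon^{2}$; the estimate comfortably exceeds $1/2$.

```python
import random

def estimate_prob_within(p, eps, m, trials=2_000):
    """Monte Carlo estimate of P^m{ |Q_P(h) - Q_hat_s(h)| <= eps/2 } when Q_P(h) = p."""
    good = 0
    for _ in range(trials):
        q_hat = sum(random.random() < p for _ in range(m)) / m
        if abs(p - q_hat) <= eps / 2:
            good += 1
    return good / trials

eps = 0.1
m = 200                                   # satisfies m >= 2 / eps**2
print(estimate_prob_within(0.3, eps, m))  # well above 1/2 in practice
```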
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1,2,3,\dots,2m\}$ that swap $i$ and $m+i$, for each $i$ in some subset of $\{1,2,\dots,m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$$P^{2m}(R)=E[\Pr[\sigma(x)\in R]]\le\max_{x\in X^{2m}}\left(\Pr[\sigma(x)\in R]\right),$$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof: For any $\sigma\in\Gamma_m$,

$$P^{2m}(R)=P^{2m}\{\,x:\sigma(x)\in R\,\}$$

(since coordinate permutations preserve the product distribution $P^{2m}$).

$$\begin{aligned}
\therefore P^{2m}(R)={}&\int_{X^{2m}}1_{R}(x)\,dP^{2m}(x)\\
={}&\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\int_{X^{2m}}1_{R}(\sigma(x))\,dP^{2m}(x)\\
={}&\int_{X^{2m}}\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}1_{R}(\sigma(x))\,dP^{2m}(x)\quad\text{(because }|\Gamma_m|\text{ is finite)}\\
={}&\int_{X^{2m}}\Pr[\sigma(x)\in R]\,dP^{2m}(x)\quad\text{(the expectation)}\\
\le{}&\max_{x\in X^{2m}}\left(\Pr[\sigma(x)\in R]\right).
\end{aligned}$$

The maximum is guaranteed to exist, since the probability under a random permutation can take only finitely many values.
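For intuition about $\Gamma_m$ (an illustrative sketch added here), choosing $\sigma$ uniformly from $\Gamma_m$ amounts to deciding independently, for each index $i\le m$, whether to swap the $i$-th elements of the two halves of the joint sample:

```python
import random

def random_swap_permutation(x):
    """Apply a uniformly random sigma from Gamma_m to a joint sample x of length 2m:
    for each i in {0, ..., m-1}, swap x[i] and x[m+i] independently with probability 1/2."""
    m = len(x) // 2
    y = list(x)
    for i in range(m):
        if random.random() < 0.5:
            y[i], y[m + i] = y[m + i], y[i]
    return y

r = ["r1", "r2", "r3"]                 # first half of the joint sample
s = ["s1", "s2", "s3"]                 # second half (the "ghost" sample)
print(random_swap_permutation(r + s))  # e.g. ['s1', 'r2', 's3', 'r1', 's2', 'r3']
```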
Reduction to a finite class
Lemma: With $R$ and $\Gamma_m$ as in the previous lemma,

$$\max_{x\in X^{2m}}\left(\Pr[\sigma(x)\in R]\right)\le 2\Pi_H(2m)\,e^{-\varepsilon^{2}m/8}.$$
Proof: Let us define $x=(x_1,x_2,\dots,x_{2m})$ and let $t=|H|_x|$, the number of distinct restrictions of $H$ to $x$, which is at most $\Pi_H(2m)$. This means there are functions $h_1,h_2,\dots,h_t\in H$ such that, for any $h\in H$, there is a $j$ between $1$ and $t$ with $h_j(x_k)=h(x_k)$ for $1\le k\le 2m$.
We see that $\sigma(x)\in R$ iff some $h$ in $H$ satisfies

$$\left|\frac{1}{m}\Bigl(\bigl|\{\,1\le i\le m : h(x_{\sigma(i)})=1\,\}\bigr|-\bigl|\{\,m+1\le i\le 2m : h(x_{\sigma(i)})=1\,\}\bigr|\Bigr)\right|\ge\frac{\varepsilon}{2}.$$

Hence we define $w_i^j=1$ if $h_j(x_i)=1$ and $w_i^j=0$ otherwise.
With $w_i^j$ so defined for $1\le i\le 2m$ and $1\le j\le t$, we have that $\sigma(x)\in R$ iff some $j\in\{1,\dots,t\}$ satisfies $\left|\frac{1}{m}\bigl(\sum_{i=1}^{m}w_{\sigma(i)}^{j}-\sum_{i=1}^{m}w_{\sigma(m+i)}^{j}\bigr)\right|\ge\frac{\varepsilon}{2}$. By the union bound we get
$$\begin{aligned}
\Pr[\sigma(x)\in R]&\le t\cdot\max_{j}\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m}w_{\sigma(i)}^{j}-\sum_{i=1}^{m}w_{\sigma(m+i)}^{j}\right)\right|\ge\frac{\varepsilon}{2}\right]\\
&\le\Pi_H(2m)\cdot\max_{j}\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m}w_{\sigma(i)}^{j}-\sum_{i=1}^{m}w_{\sigma(m+i)}^{j}\right)\right|\ge\frac{\varepsilon}{2}\right].
\end{aligned}$$
Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma(i)}^{j}-w_{\sigma(m+i)}^{j}$ equals $\pm\left|w_{i}^{j}-w_{m+i}^{j}\right|$ with equal probability.

Thus,
$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left(w_{\sigma(i)}^{j}-w_{\sigma(m+i)}^{j}\right)\right|\ge\frac{\varepsilon}{2}\right]=\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left|w_{i}^{j}-w_{m+i}^{j}\right|\beta_{i}\right|\ge\frac{\varepsilon}{2}\right],$$
where the probability on the right is over the independent uniform random signs $\beta_i\in\{-1,+1\}$, both possibilities being equally likely. By Hoeffding's inequality, this is at most $2e^{-\varepsilon^{2}m/8}$.
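In detail, the Hoeffding step works out as follows: each summand $\left|w_{i}^{j}-w_{m+i}^{j}\right|\beta_{i}$ is an independent zero-mean random variable taking values in an interval of length at most $2$, so the two-sided Hoeffding bound applied to the sum with deviation $m\varepsilon/2$ gives

$$\Pr\left[\left|\sum_{i=1}^{m}\left|w_{i}^{j}-w_{m+i}^{j}\right|\beta_{i}\right|\ge\frac{m\varepsilon}{2}\right]\le 2\exp\left(-\frac{2(m\varepsilon/2)^{2}}{\sum_{i=1}^{m}2^{2}}\right)=2e^{-\varepsilon^{2}m/8}.$$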
Finally, combining all three parts of the proof, we get the Uniform Convergence Theorem.
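Explicitly, for $m\ge 2/\varepsilon^{2}$ the three lemmas chain together as

$$P^m(V)\le 2P^{2m}(R)\le 2\max_{x\in X^{2m}}\Pr[\sigma(x)\in R]\le 2\cdot\Pi_H(2m)\cdot 2e^{-\varepsilon^{2}m/8}=4\Pi_H(2m)\,e^{-\varepsilon^{2}m/8},$$

while for $m<2/\varepsilon^{2}$ the right-hand side exceeds $1$ (since $\Pi_H(2m)\ge 1$ and $\varepsilon^{2}m/8<1/4$), so the stated bound holds trivially for every positive integer $m$.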