Which concentration bound to use
2022 Feb 18-19
I want to discuss typical use cases for the Markov, Chebyshev, Chernoff, and Hoeffding inequalities.
1. An inequality cheat sheet: http://www.lkozma.net/inequalities_cheat_sheet/
2. Markov inequality (MI): convenient, simple, works for non-negative random variables. Roughly it says "the probability of the random variable being this far (e.g., 1*average) above the average is at most this (e.g., 1/2)". Intuitively, if the distribution has mass concentrated around the mean, then MI gives a loose bound. In other words, for distributions with thin tails, MI does not give a meaningfully tight bound; but otherwise, MI works fine.
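For reference, a standard statement (for a non-negative random variable X and any a > 0):

    Pr[X >= a] <= E[X] / a.

Taking a = 2*E[X] recovers the example above: the probability of exceeding the average by one average, i.e., of exceeding 2*E[X], is at most 1/2.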
3. Chebyshev inequality (CI): simple, follows from MI, and uses the variance of the distribution. Roughly it says "the probability of the random variable being this far (e.g., 2*standard deviation) from the average is at most this (e.g., 1/2^2)". I did some calculations similar to this post (https://math.stackexchange.com/a/1999727), and the conclusion is that Chebyshev captures the tail to some extent and is therefore tighter than MI at the tail.
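For reference, with sigma^2 = Var(X) and any k > 0:

    Pr[|X - E[X]| >= k*sigma] <= 1/k^2.

Taking k = 2 gives the 1/2^2 example above.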
4. Chernoff inequality (CI2): captures the tail (gives an exponentially decaying bound), works for sums of independent random variables with general distributions over [0,1] (and can be generalized to correlated variables), and comes in several forms (e.g., additive, multiplicative). It roughly says "the sample average is not far from the true average if one chooses the sample size properly". CI2 is useful in estimation, e.g., carrying out experiments to estimate an average with high confidence.
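For reference, the generic Chernoff technique is MI applied to exp(lambda*X) and then optimized over the free parameter lambda (a sketch):

    Pr[X >= a] <= exp(-lambda*a) * E[exp(lambda*X)]   for every lambda > 0.

For a sum of independent variables the moment generating function factors, E[exp(lambda*(X1+...+Xn))] = E[exp(lambda*X1)] * ... * E[exp(lambda*Xn)], which is where the exponential decay in n comes from.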
One of the forms of CI2 is agnostic to the hidden average itself: it only takes the desired error and returns a suggested sample size. This form basically says "the mass is concentrated around the mean with radius about sqrt(n), where n is the sample size".
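One form that fits this description (a sketch; here X1, ..., Xn are i.i.d. samples in [0,1] with mean mu, and Xbar_n is their empirical mean):

    Pr[|Xbar_n - mu| >= eps] <= 2*exp(-2*n*eps^2),

so n >= ln(2/delta) / (2*eps^2) samples suffice for additive error eps with confidence 1 - delta, and mu itself never enters. In terms of the sum, the deviation n*eps grows like sqrt(n) at a fixed confidence level, which is the "radius sqrt(n)" picture.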
The additive form is suitable for small error; this form basically says "the mass is concentrated around the mean with radius about sqrt(average)".
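One standard bound with this sqrt(average) behavior is the relative-entropy (KL) form: for i.i.d. samples X1, ..., Xn in [0,1] with mean mu and empirical mean Xbar_n,

    Pr[Xbar_n >= mu + eps] <= exp(-n * D(mu + eps || mu)),

where D(a || p) denotes the KL divergence between Bernoulli(a) and Bernoulli(p); the sqrt(average) radius comes out of the small-eps approximation of D discussed below.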
Linearity of expectation, the AM-GM inequality, Jensen's inequality, and Taylor's theorem are useful in the proofs. An important quantity measuring the similarity of two distributions, the Kullback-Leibler (KL) divergence, is useful here to simplify the expressions.
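As an example of that simplification (continuing the notation above), for Bernoulli distributions the divergence is

    D(a || p) = a*ln(a/p) + (1-a)*ln((1-a)/(1-p)),

and a Taylor expansion around eps = 0 gives D(mu + eps || mu) ~= eps^2 / (2*mu*(1-mu)) for small eps. Plugging this into the exponent above turns the bound into roughly exp(-n*eps^2 / (2*mu*(1-mu))), so at a fixed confidence level the deviation of the sum, n*eps, scales like sqrt(n*mu*(1-mu)), i.e., roughly sqrt(average) when mu is small.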
Related: Stirling's approximation also gives very tight bounds for sums of Bernoulli random variables. But Chernoff can accommodate random variables over [0,1] as long as the average is known. The proof uses Jensen's inequality. The intuition is roughly that when one tries to allocate mass over [0,1] so that a given mean is achieved, the sparsest / least concentrated case is to put all the mass on the two endpoints, 0 and 1, which is exactly the Bernoulli distribution.
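One way to make the two-endpoint intuition precise is a short convexity argument (a sketch, with lambda a free parameter as before): since exp(lambda*x) is convex, on [0,1] it lies below the chord through its endpoints, so for X in [0,1] with E[X] = mu,

    exp(lambda*X) <= (1 - X) + X*exp(lambda),  hence  E[exp(lambda*X)] <= 1 - mu + mu*exp(lambda),

and the right-hand side is exactly the moment generating function of a Bernoulli(mu) variable. So any Chernoff-type bound proved for Bernoulli variables carries over to [0,1]-valued variables with the same mean.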
5. Hoeffding inequality (HI): generalizes CI2 to random variables over [ai, bi], and is also oblivious or agnostic to the average. Hoeffding's lemma is useful in the proof.
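For reference (independent Xi in [ai, bi], S = X1 + ... + Xn, any t > 0):

    Pr[|S - E[S]| >= t] <= 2*exp(-2*t^2 / sum_i (bi - ai)^2).

Hoeffding's lemma is the per-variable ingredient: if X is in [a, b], then E[exp(lambda*(X - E[X]))] <= exp(lambda^2 * (b - a)^2 / 8) for every real lambda.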