# Categorical distribution, structure of the second kind and Gumbel-max trick

Hi all,

This post is about a way of sampling from a Categorical distribution, which appears in Arthur Dempster’s approach to inference as a generalization of Bayesian inference (see Figure 1 in “A Generalization of Bayesian Inference”, 1968), under the name “structure of the second kind”. It’s the starting point of my ongoing work with Ruobin Gong and Paul Edlefsen, which I’ll write about another day. This sampling mechanism turns out to be strictly equivalent to the “Gumbel-max” trick that has received some attention in machine learning; see e.g. this blog post by Francis Bach.

Let’s look at the figure above: the encompassing triangle is the “simplex” with 3 vertices (K vertices more generally). Any point within the triangle is a convex combination of the vertices, $\sum_{k=1}^K w_k V_k$, where the $(w_k)$ are non-negative “weights” summing to one and the $(V_k)$ are the vertices. The weights are the “barycentric coordinates” of the point. Any point $\theta = (\theta_1,\ldots,\theta_K)$ in the triangle induces a partition of the triangle into K sets $(\Delta_k(\theta))$. Each “sub-simplex” $\Delta_k(\theta)$ is obtained by taking the entire simplex and replacing vertex $V_k$ by $\theta$. Its volume, relative to the volume of the entire simplex, equals $\theta_k$. Can you see why? If not, it’s OK, great scientific endeavors require a certain degree of trust and optimism.

Since the relative volume of each $\Delta_k(\theta)$ is $\theta_k$, a point sampled uniformly within the encompassing simplex lands in $\Delta_k(\theta)$ with probability $\theta_k$. In other words, we can sample from a Categorical distribution with probabilities $(\theta_1,\ldots,\theta_K)$ by sampling uniformly within the simplex and identifying the index $k$ such that the point lands in $\Delta_k(\theta)$. This appears in various places in Arthur Dempster’s articles (see references below), because Categorical distributions provide a pedagogical setting for new methods of statistical inference, and because this sampling mechanism does not rely on any arbitrary ordering of the categories (unlike “inverse transform sampling”).
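The mechanism above can be sketched numerically. The following is a minimal sketch for K = 3, not from the post itself: it locates the sub-simplex containing a uniform point via a barycentric-coordinate membership test (the triangle coordinates, the tolerance, and the function names are my own choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Vertices of an equilateral triangle, i.e. the K = 3 simplex drawn in the plane.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
theta = np.array([0.5, 0.3, 0.2])  # target Categorical probabilities
theta_point = theta @ V            # theta, seen as a point in the triangle

def barycentric(point, vertices):
    """Barycentric coordinates of `point` w.r.t. a triangle given by 3 vertices."""
    A = np.vstack([vertices.T, np.ones(3)])  # coordinates + sum-to-one constraint
    return np.linalg.solve(A, np.append(point, 1.0))

def sample_category(rng):
    # A Dirichlet(1,1,1) draw gives a uniformly distributed point in the simplex.
    w = rng.dirichlet(np.ones(3))
    point = w @ V
    # Delta_k(theta) has the same vertices as the triangle, except V_k -> theta.
    for k in range(3):
        sub = V.copy()
        sub[k] = theta_point
        if np.all(barycentric(point, sub) >= -1e-12):
            return k
    raise RuntimeError("point not located in any sub-simplex")

draws = [sample_category(rng) for _ in range(20000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # empirical frequencies, close to theta
```

The empirical frequencies should match $(\theta_1, \theta_2, \theta_3)$ up to Monte Carlo error, which is the claim about the relative volumes of the $\Delta_k(\theta)$.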

How does this relate to the Gumbel-max trick? One way of sampling uniformly within the simplex is to sample independent Exponential(1) variables $(E_k)$ and to define the weights $w_k = E_k / \sum_{k'=1}^K E_{k'}$. Furthermore, a point is within $\Delta_k(\theta)$ if and only if $w_k/w_j \leq \theta_k / \theta_j$ for all $j \neq k$, i.e. if and only if $k$ minimizes $w_j/\theta_j$ over $j$. The next figure illustrates such inequalities: the points with coordinates $(w_k)$ satisfying $w_k/w_j \leq \theta_k / \theta_j$ lie on one side of a line that originates from the vertex opposite the edge $(k,j)$ and goes through $\theta$.
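Note that in the criterion "$k$ minimizes $w_j/\theta_j$" the normalizing sum cancels, so one can equivalently take the argmin of $E_j/\theta_j$. A minimal vectorized sketch (variable names are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, 0.3, 0.2])

n = 20000
E = rng.exponential(scale=1.0, size=(n, 3))  # Exponential(1) draws, row per sample
# w_k = E_k / sum(E) is uniform on the simplex; the membership criterion
# w_k / w_j <= theta_k / theta_j for all j reduces to k = argmin_j E_j / theta_j,
# since the normalizing sum cancels from every ratio.
ks = np.argmin(E / theta, axis=1)
freqs = np.bincount(ks, minlength=3) / n
print(freqs)  # close to theta
```

As a sanity check on the design: $E_j/\theta_j$ is Exponential with rate $\theta_j$, and the probability that an Exponential($\theta_k$) is the smallest among independent Exponentials with rates $(\theta_j)$ is $\theta_k / \sum_j \theta_j = \theta_k$, which recovers the Categorical distribution directly.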

An Exponential(1) variable is also minus the logarithm of a Uniform(0,1) variable. Putting these pieces together, a uniformly distributed point $\sum_{k=1}^K w_k V_k$ in the simplex is within $\Delta_k(\theta)$ if and only if, for all $j\neq k$,

$(-\log (U_k)) / (-\log (U_j)) \leq \theta_k / \theta_j$

$\Leftrightarrow \log (-\log (U_k)) - \log (-\log(U_j)) \leq \log \theta_k - \log \theta_j$

$\Leftrightarrow \log \theta_j - \log (-\log(U_j)) \leq \log \theta_k - \log (-\log (U_k))$.

Since $-\log(-\log(U))$ follows a standard Gumbel distribution, the above mechanism is equivalent to selecting $\text{argmax}_k \{\log \theta_k + G_k\}$ where the $(G_k)$ are independent Gumbel variables. It’s the Gumbel-max trick!
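The equivalence is not merely in distribution: starting from the same uniforms $(U_k)$, the two mechanisms select the same index draw by draw, because $\log \theta_j - \log(-\log U_j) = -\log\big((-\log U_j)/\theta_j\big)$ and $x \mapsto -\log x$ is decreasing. A small sketch checking this (names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.5, 0.3, 0.2])
n = 10000
U = rng.uniform(size=(n, 3))

# "Structure of the second kind" view: smallest ratio (-log U_j) / theta_j.
k_simplex = np.argmin(-np.log(U) / theta, axis=1)
# Gumbel-max view: largest log theta_j + G_j, with G_j = -log(-log U_j).
k_gumbel = np.argmax(np.log(theta) - np.log(-np.log(U)), axis=1)

print(np.array_equal(k_simplex, k_gumbel))  # same index, draw by draw
freqs = np.bincount(k_gumbel, minlength=3) / n
print(freqs)  # close to theta
```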

• It’s hard to trace back the first instance of this sampling mechanism, but it appears in several of Arthur Dempster’s articles, e.g. “New methods for reasoning towards posterior distributions based on sample data”, 1966, and it is discussed at length in “A class of random convex polytopes”, 1972.
• The connection occurred to me while reading Xi’an’s blog post, which points to this interesting article on Emil Gumbel, academic in Heidelberg up to his exile in 1932, “pioneer of modern data journalism” and active opponent of the Nazis. Quoting from the article, “His fate was sealed when, at a speech in memory of the 700,000 who had perished of hunger in the winter of 1916/17, he remarked that a rutabaga would certainly be a better memorial than a scantily clad virgin with a palm frond”.
• The Gumbel-max trick is interesting for many reasons: it amounts to viewing sampling as an optimization program, it can be “relaxed” in various useful ways, etc. In Art Dempster’s work that sampling mechanism is appealing because of its invariance under relabeling of the categories (“category 2” is not between “category 1” and “category 3”). This matters when performing inference on Categorical distributions (i.e. on count data) using Art Dempster’s approach, because the estimation depends on the choice of sampling mechanism and not only on the likelihood function.

Associate professor, Harvard University

## 3 thoughts on “Categorical distribution, structure of the second kind and Gumbel-max trick”

1. Thank you for putting this idea in a deeper perspective! In terms of efficiency, is it possible to rank the cases when one approach does significantly better than the other?

1. Pierre Jacob says:

Ah… a clean self-contained “geometric” proof that the volume of the subsimplex $\Delta_k(\theta)$ is $\theta_k$ times the volume of the encompassing simplex, I’m not sure… but the validity of the Gumbel-Max trick (as shown in https://francisbach.com/the-gumbel-trick/) itself provides a proof of that result.

By Exponential(1) I was referring to the Exponential distribution with rate parameter equal to 1, i.e. with density $x \mapsto \exp(-x)$ on $(0,\infty)$.