Bayes' rule is a familiar, almost natural, result for most people who have studied probability
theory. In words, it tells us how to update the probability of a random
variable given that some event has occurred and that we have some prior
knowledge or belief about the probability of the random variable from
earlier events. The algebra to get to Bayes' rule is simple, but I have always
found it best to have a more spatial perspective on what Bayes' rule is really
stating.
I'll first begin with a 2D square sample space, $\it{S}$. This space is
discrete and we can represent each outcome as a tiny square, $\it{s}$. In this
case, we will have a total of 16 tiny squares in $\it{S}$. This means there is
a 1/16 chance that any given square is randomly selected, hence, $\mathrm{P}(\it{s})$.
$$\begin{array}{|c|c|c|c|}\hline
\it{s_1} & \it{s_2} & \it{s_3} & \it{s_4} \\ \hline
\it{s_5} & \it{s_6} & \it{s_7} & \it{s_8} \\ \hline
\it{s_9} & \it{s_{10}} & \it{s_{11}} & \it{s_{12}} \\ \hline
\it{s_{13}} & \it{s_{14}} & \it{s_{15}} & \it{s_{16}} \\ \hline
\end{array}$$
$$\mathrm{P}(\it{s}_i) = 1/16$$
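To make this concrete, here is a minimal Python sketch of the sample space above: 16 equally likely cells, each with probability 1/16.

```python
# The 4x4 sample space above: 16 equally likely cells s_1 ... s_16.
cells = [f"s_{i}" for i in range(1, 17)]

# With a uniform distribution, every cell has probability 1/len(cells).
p = {s: 1 / len(cells) for s in cells}

print(p["s_1"])          # 0.0625 == 1/16
print(sum(p.values()))   # 1.0 -- some cell in S must be selected
```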
Now say we have the scenario where we are only interested in two subspaces of
$\it{S}$: $\it{S}_A$ and $\it{S}_B$. More specifically, we want to know the
probability of a square randomly occurring in each of these subspaces given
that it occurs in $\it{S}$, and what the probability is of a square occurring
in the intersection, or stated differently, the probability of a square occurring
in both $\it{S}_A$ and $\it{S}_B$.
With this we have the following quantities: $\mathrm{P}(\it{s}_A)$,
$\mathrm{P}(\it{s}_B)$, and $\mathrm{P}(\it{s}_A \cap \it{s}_B) =
\mathrm{P}(\it{s}_B \cap \it{s}_A)$. The updated image of this would look like:
The probability $\mathrm{P}(\it{s}_A)$ is shown in red, $\mathrm{P}(\it{s}_B)$ in blue,
and the overlap is $\mathrm{P}(\it{s}_A \cap \it{s}_B)$. Keep in mind that
$\mathrm{P}(\it{s}_A \cap \it{s}_B) = \mathrm{P}(\it{s}_B \cap \it{s}_A) =
\mathrm{1/8}$.
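Since the colored figure is not reproduced here, the sketch below assumes a particular choice of cells for $\it{S}_A$ and $\it{S}_B$, picked only so that the overlap works out to $2/16 = 1/8$; any other choice with the same overlap would do.

```python
# The original figure is not reproduced here, so the cells assigned to the
# red subspace S_A and the blue subspace S_B below are assumptions, chosen
# only so that the overlap is 2 cells, i.e. P(s_A and s_B) = 2/16 = 1/8.
cells = [f"s_{i}" for i in range(1, 17)]
p = {s: 1 / len(cells) for s in cells}          # uniform over S

S_A = {"s_1", "s_2", "s_5", "s_6"}              # assumed red cells
S_B = {f"s_{i}" for i in range(5, 13)}          # assumed blue cells, s_5..s_12

P_A = sum(p[s] for s in S_A)                    # 4/16 = 0.25
P_B = sum(p[s] for s in S_B)                    # 8/16 = 0.5
P_AB = sum(p[s] for s in S_A & S_B)             # 2/16 = 0.125

print(P_A, P_B, P_AB)                           # 0.25 0.5 0.125
```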
The question we usually want to ask is not what the joint probability is, i.e.,
the probability of picking a square in both $\it{S}_A$ and $\it{S}_B$, but instead
what the probability of a square in $\it{S}_A$ is given that a square
in $\it{S}_B$ has been picked/occurred, or vice versa. So what does this
mean? We want to compare the relative probability of the joint space to that
of the given space where the event has occurred:
\begin{equation} \mathrm{P}(\it{s}_A | \it{s}_B) =
\frac{\mathrm{P}(\it{s}_A \cap
\it{s}_B)}{\mathrm{P}(\it{s}_B)}\label{eq:bayes1}
\end{equation}
and
\begin{equation} \mathrm{P}(\it{s}_B| \it{s}_A) =
\frac{\mathrm{P}(\it{s}_A \cap \it{s}_B)}{\mathrm{P}(\it{s}_A)}
\label{eq:bayes2}\end{equation}
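Continuing the Python sketch with the same assumed subspaces, eq. \ref{eq:bayes1} and eq. \ref{eq:bayes2} are just ratios of the joint probability to the probability of the given subspace:

```python
# The two conditional probabilities above, using the same assumed subspaces:
# P(s_A | s_B) = P(s_A and s_B) / P(s_B),  P(s_B | s_A) = P(s_A and s_B) / P(s_A).
cells = [f"s_{i}" for i in range(1, 17)]
p = {s: 1 / len(cells) for s in cells}

S_A = {"s_1", "s_2", "s_5", "s_6"}              # assumed membership
S_B = {f"s_{i}" for i in range(5, 13)}          # assumed membership

P_AB = sum(p[s] for s in S_A & S_B)

P_A_given_B = P_AB / sum(p[s] for s in S_B)     # 0.125 / 0.5  = 0.25
P_B_given_A = P_AB / sum(p[s] for s in S_A)     # 0.125 / 0.25 = 0.5

print(P_A_given_B, P_B_given_A)                 # not equal, but both share P_AB
```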
Notice how these two equations are not the same, but the probability in the
joint space is: $\mathrm{P}(\it{S}_A \cap \it{S}_B) = \mathrm{P}(\it{S}_B \cap
\it{S}_A)$. This has to be the case just by looking at the illustration with
the colored cells above.
The key is that we can now determine the conditional probabilities, that is,
the probability of a cell in one subspace given that a cell in the other subspace has
been picked or occurred, by rearranging eq. \ref{eq:bayes1} and eq.
\ref{eq:bayes2} for the joint probability and then substituting terms to get:
\begin{equation*} \mathrm{P}(\it{s}_A | \it{s}_B) \mathrm{P}(\it{s}_B) =
\mathrm{P}(\it{s}_B | \it{s}_A) \mathrm{P}(\it{s}_A)\end{equation*}
which is rearranged to get the typical Bayes formula:
\begin{equation}\mathrm{P}\left(\it{s}_A | \it{s}_B\right) =
\frac{\mathrm{P}\left(\it{s}_B | \it{s}_A\right)
\mathrm{P}\left(\it{s}_A\right)}{\mathrm{P}\left(\it{s}_B\right)}
\label{eq:bayesformula}\end{equation}
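As a quick numerical check of eq. \ref{eq:bayesformula}, again using the probabilities from the assumed subspaces in the sketches above, the formula agrees with the direct definition $\mathrm{P}(\it{s}_A \cap \it{s}_B)/\mathrm{P}(\it{s}_B)$:

```python
# Check the Bayes formula against the direct definition of the conditional,
# using the probabilities from the assumed subspaces in the sketches above.
P_A, P_B, P_AB = 0.25, 0.5, 0.125               # P(s_A), P(s_B), P(s_A and s_B)

P_B_given_A = P_AB / P_A                        # likelihood term, 0.5
P_A_given_B_bayes = P_B_given_A * P_A / P_B     # Bayes formula
P_A_given_B_direct = P_AB / P_B                 # definition of the conditional

print(P_A_given_B_bayes, P_A_given_B_direct)    # both 0.25
```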
At first, eq. \ref{eq:bayesformula} might seem like an expected result; after all, it is
just an outcome of analyzing the probabilities of subspaces, but the impact is
really in how one can use this equation to update knowledge. Let us break down the
terms in eq. \ref{eq:bayesformula}.
The first term in the numerator is called the likelihood. It
indicates how probable an event in $\it{S}_B$ is given that an event in
$\it{S}_A$ occurs. It can also represent the probability of the observed data
given the model and its parameters (i.e., the prior over parameters). The second
term in the numerator, the prior, encodes our previous knowledge of the
observations or parameters. Finally, the denominator can be interpreted as the
probability of observing a cell in $\it{S}_B$, or you can think of it as the
data averaged over all possible values of the model parameters.
An important aspect of eq. $\ref{eq:bayesformula}$ is that, since the left-hand side is a
probability function, it must integrate (or sum) to one; the denominator is what provides
this normalization. This just means that over the whole space of outcomes, something must
have happened.
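One way to see this normalization in the grid picture: if the hypotheses partition the sample space, the posterior probabilities must sum to one. The sketch below uses the four rows of the grid as a hypothetical partition and the same assumed $\it{S}_B$ as before:

```python
# If the hypotheses partition the sample space, the posterior probabilities
# must sum to one. Here the four rows of the grid play the role of a
# hypothetical partition, and S_B is the same assumed subspace as above.
cells = [f"s_{i}" for i in range(1, 17)]
p = {s: 1 / len(cells) for s in cells}

rows = [{f"s_{i}" for i in range(r * 4 + 1, r * 4 + 5)} for r in range(4)]
S_B = {f"s_{i}" for i in range(5, 13)}          # assumed blue cells

P_B = sum(p[s] for s in S_B)
posterior = [sum(p[s] for s in row & S_B) / P_B for row in rows]

print(posterior)                                # [0.0, 0.5, 0.5, 0.0]
print(sum(posterior))                           # 1.0 -- normalized
```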
In the example given, the probabilities are just uniform discrete values, so
we obtain a posterior probability that is just a number that represents our
updated knowledge about the probability of a cell in $\it{S}_{A}$ given the
cell is in $\it{S}_{B}$. This is a particularly simple and maybe intuitive
outcome. What is typically more useful is the case where we have a probability density
function that represents our prior knowledge about an event/outcome and we
want to determine the posterior distribution. We then choose a likelihood
that encodes information about what has been observed given the
prior probability and make inferences by sampling the constructed posterior
distribution.
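As a hedged sketch of that workflow, the code below builds a grid approximation of a posterior over a coin's bias; the Beta(2, 2) prior and the 7-heads-in-10-flips data are assumptions chosen purely for illustration:

```python
# A grid approximation of the continuous case: prior density times likelihood,
# normalized by the denominator, then sampled to make inferences.
# The Beta(2, 2) prior and the data (7 heads in 10 flips) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0.001, 0.999, 999)          # grid over the coin bias

prior = theta * (1 - theta)                     # proportional to Beta(2, 2)
prior /= prior.sum()

heads, flips = 7, 10                            # assumed observations
likelihood = theta**heads * (1 - theta)**(flips - heads)

posterior = likelihood * prior                  # numerator of the Bayes formula
posterior /= posterior.sum()                    # denominator normalizes it

# Make inferences by sampling the constructed posterior distribution.
samples = rng.choice(theta, size=10_000, p=posterior)
print(samples.mean())                           # close to the exact mean 9/14
```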