Pub Quiz: Summing Two Random Variables

probability
pub quiz
My company’s holiday pub quiz almost had a question about summing two random variables.
Published: January 14, 2024

# Pub Quiz

Last week, my company held a pub quiz. This was no ordinary pub quiz: it was designed for teams to learn about other teams. One person from each team submitted four questions about their team, and we ended up with about 60 questions from all across the company. And when I say “all” I mean all. We had questions about Data Science, Commercial, Operations, Engineering, Product, Marketing, Leadership, Human Resources, you name it. Having been here over four years and written my share of documentation, I’m fairly familiar with all aspects of our products and engineering teams (though by no means an expert in all these areas), so I came into the pub quiz with confidence. Much of that confidence came from expecting not to need to know British pop culture back through the 70s, as is standard in an actual pub quiz.

That confidence was quickly smashed with questions like:

  • What was the 2023 Coca Cola Christmas commercial slogan?
  • What is the collective age of the Customer Support team?
  • What month was repo XXX created?

The Data Science round was the trickiest of them all, in no small part because reading the question took up almost all of the response time! After the quiz, my teammate who wrote the Data Science questions shared one that didn’t quite make the cut.

I share the question with the reader here, with only some minor rewording. My answer follows.

# Question

## The preamble

You fit the following univariate regression model using Ordinary Least Squares (OLS):

In the original question, the OLS abbreviation was given without the unabbreviated form. Not sure it would have made the question any easier.

\[ y = \alpha + \beta x + \epsilon \]

where the residual, \(\epsilon\), is Normally distributed with mean 0 and standard deviation 1, that is:

\[ \epsilon \sim {N}(0, 1) \]

From OLS, you determine:

  • \(\alpha=3\)
  • \(\beta=2\)

The next month, your colleague reruns the experiment and collects a dataset of the same size. They forgot to check the calibration of the machine used to collect the data, and as a result the dataset now has a measurement error, \(u\). That is,

\[ \displaylines{ \begin{align} y^* &\equiv y + u \\ y^* &= \alpha^* + \beta^*x + \epsilon^* \end{align} } \]

The measurement error has mean 2 and standard deviation 2.
\[ u \sim N(2,2) \]

The residuals in the new sample dataset are also normally distributed.
\[ \epsilon^* \sim {N}(\mu^*, \sigma^*) \]

## And finally the question

What values does your colleague find when running a regression with OLS? (Try solving with a pen and paper):

  • \(\alpha^* = \text{ ??}\)
  • \(\beta^* = \text{ ??}\)
  • \(\mu^* = \text{ ??}\)
  • \(\sigma^* = \text{ ??}\)

# Analytical solution

There are a few things to note immediately:

  1. The hell this is a pub quiz question!
  2. We assume \(\epsilon\) and \(u\) are independent random variables. If they weren’t, this would be a wee bit trickier (we would need information about their joint distribution, i.e. the covariance).
  3. We recognize that the new residuals are going to include the previous residuals as well as the new measurement error, \(\epsilon^* = \epsilon + u\).
  4. Whether it was a typo or meant to be tricky, the question states the standard deviation rather than the variance for the Normal distribution. Typically, the notation uses the variance, \(N(\mu, \sigma^2)\). Rewriting into the standard notation:

I’ve highly probably in all likelihood had the same question on a Probability exam during my Statistics postgrad.

\[ \displaylines{ \begin{alignat*}{2} \epsilon &\sim N(0, 1) && \rightarrow N(0, 1) \\ u &\sim N(2, 2) && \rightarrow N(2, 4) \\ \epsilon^* &\sim N(\mu^*, \sigma^{*}) && \rightarrow N(\mu^*, \sigma^{*2}) \end{alignat*} } \]

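The same ambiguity shows up in software: most numerical libraries parameterize the Normal by its standard deviation rather than its variance. A quick sketch with torch (the library used in the verification below) to illustrate:

Code
import torch

# torch.normal takes (mean, std dev). If N(2, 2) means “standard deviation 2”,
# the sample variance should land near 2**2 = 4 rather than 2.
samples = torch.normal(2.0, 2.0, size=(100_000,))
print(samples.mean().item(), samples.var().item())  # ≈ 2, ≈ 4
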
The problem boils down to realizing that we have a summation of two independent Normally distributed random variables, \(y\) and \(u\). How do independent Normally distributed random variables sum? If you don’t remember, don’t worry, neither did I. After some revision we know that (Ross 2010):

\[ \displaylines{ \begin{align} \text{E}[X + Y] &= \text{E}[X] + \text{E}[Y] \\ \text{Var}[X + Y] &= \text{Var}[X] + \text{Var}[Y] + 2\text{Cov}[X,Y] \end{align} } \]

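These formulas pin down the mean and variance of the sum, but not its shape. That the sum of two independent Normals is itself Normal can be seen from moment generating functions, which factorize over independent variables:

\[ \displaylines{ \begin{align} M_{X+Y}(t) &= M_X(t) \, M_Y(t) \\ &= e^{\mu_X t + \frac{1}{2}\sigma_X^2 t^2} \cdot e^{\mu_Y t + \frac{1}{2}\sigma_Y^2 t^2} \\ &= e^{(\mu_X + \mu_Y)t + \frac{1}{2}(\sigma_X^2 + \sigma_Y^2)t^2} \end{align} } \]

which is exactly the MGF of \(N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)\).
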
The assumption that the random variables are independent means the covariance is zero, \(2\text{Cov}[X,Y] = 0\). We can simply add the means and variances (Ross 2010, 256–57) giving:

\[ \displaylines{ \begin{alignat*}{2} y^* &= y &&+ u \\ &= \alpha + \beta x &&+ \epsilon + u \\ &= \alpha + \beta x &&+ \epsilon^* \\ &= \alpha + \beta x &&+ \Big( N(\mu_{\epsilon}, \sigma_{\epsilon}^2) + N(\mu_u, \sigma_u^2) \Big) \\ &= \alpha + \beta x &&+ \Big( N(\mu_{\epsilon} + \mu_u, \sigma_{\epsilon}^2 + \sigma_u^2) \Big) \\ &= \alpha + \beta x &&+ N(0 + 2, 1 + 4) \\ &= \alpha + \beta x &&+ N(2, 5) \\ \end{alignat*} } \]

where,

\[ \displaylines{ \begin{alignat*}{2} \mu_{\epsilon^*} &= \mu_{\epsilon} &&+ \mu_u \\ &= 0 &&+ 2 \\ &= 2 \end{alignat*} } \]

\[ \displaylines{ \begin{alignat*}{2} \sigma_{\epsilon^*}^2 &= \sigma_{\epsilon}^2 &&+ \sigma_u^2 \\ &= 1 &&+ 4 \\ &= 5 \end{alignat*} } \]

Plugging in the values for \(\alpha\) and \(\beta\), we find the following solution: \[ y^* = 3 + 2x + N(2, 5) \]

  • \(\alpha^* = 3\)
  • \(\beta^* = 2\)
  • \(\mu^* = 2\)
  • \(\sigma^* = \sqrt{5}\)

Remember, we were asked for \(\sigma^*\), not \(\sigma^{*2}\).

We can arguably simplify this a bit further by noting that the bias, \(\alpha\), and the mean of the residuals, \(\mu_{\epsilon^*}\), are both constants that shift the intercept and can be grouped. Subtracting the mean from the error and adding it to the bias…

\[ \displaylines{ \begin{alignat*}{3} y^* &= \alpha &&+ \beta x &&&+ N(2, 5) \\ &= 3 &&+ 2x &&&+ N(2, 5) \\ &= (3+2) &&+ 2x &&&+ N(0, 5) \\ &= 5 &&+ 2x &&&+ N(0,5) \end{alignat*} } \]

And rewriting using the standard deviation rather than the variance (as in the original question), we arrive at:

\[ y^* = 5 + 2x + N(0,\sqrt{5}) \]

  • \(\alpha^* = 5\)
  • \(\beta^* = 2\)
  • \(\mu^* = 0\)
  • \(\sigma^* = \sqrt{5}\)

If instead we took 2 as the variance in \(u \sim N(2,2)\):

  • \(\alpha^* = 5\)
  • \(\beta^* = 2\)
  • \(\mu^* = 0\)
  • \(\sigma^* = \sqrt{3}\)
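
As a sanity check on the grouping argument: with an intercept in the model, OLS residuals have mean zero by construction, so the mean of the measurement error is absorbed into \(\alpha^*\). Here’s a minimal sketch, not part of the original question, that simulates the colleague’s dataset and fits OLS via least squares; the sample size and \(x\) range are arbitrary choices of mine.

Code
import torch

torch.manual_seed(0)
n = 100_000
x = torch.rand(n) * 10  # arbitrary x range

# y* = (3 + 2x + eps) + u, with eps ~ N(0, 1) and u ~ N(2, std dev 2)
y_star = (
    3
    + 2 * x
    + torch.normal(0.0, 1.0, size=(n,))
    + torch.normal(2.0, 2.0, size=(n,))
)

# OLS on the design matrix [1, x]
A = torch.stack([torch.ones(n), x], dim=1)
coef = torch.linalg.lstsq(A, y_star.unsqueeze(1)).solution.squeeze()
alpha_star, beta_star = coef[0].item(), coef[1].item()
residuals = y_star - (alpha_star + beta_star * x)

print(alpha_star, beta_star)  # ≈ 5, ≈ 2
print(residuals.mean().item(), residuals.std().item())  # ≈ 0, ≈ sqrt(5)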


# Computational verification

Let’s check our analytical solution through computational methods. First let’s check that we can indeed simply add the means and variances of independent random variables.

Code
import altair as alt
import pandas as pd
import torch

n = 100_000

err = torch.normal(0, 1, size=(n,))
u = torch.normal(2, 2, size=(n,))

err_computational = err + u
err_analytical = torch.normal(2, 5 ** (0.5), size=(n,))

# torch.histogram returns the endpoints of each bin; we want the middle value,
# so we average each pair of consecutive edges
def hist_df(samples, name):
    density, edges = torch.histogram(samples, density=True)
    midpoints = (edges[:-1] + edges[1:]) / 2
    return pd.DataFrame(
        {
            "x": midpoints,
            "density": density,
            "series": [name] * len(midpoints),
        }
    )


df = pd.concat(
    [
        hist_df(err, "err"),
        hist_df(u, "u"),
        hist_df(err_computational, "err* (computational)"),
        hist_df(err_analytical, "err* (analytical)"),
    ]
)

# Plot
colorscheme = [
    "#368BC1",  # blue
    "#F2BB18",  # yellow
    "#BB4430",  # red
    "#8A9A67",  # green
    "#CC771F",  # orange
    "#8B5260",  # purple
]
selection = alt.selection_point(fields=["series"], bind="legend")

chart = (
    alt.Chart(df)
    .mark_line()
    .encode(
        x="x:Q",
        y="density:Q",
        color=alt.Color(
            "series:N", title="Random Variable", scale=alt.Scale(range=colorscheme)
        ).legend(orient="top-right"),
        opacity=alt.condition(selection, alt.value(0.75), alt.value(0.2)),
    )
    .configure(background="#f8f9fa")
    .properties(
        width=650,
        padding=10,
        title={
            "text": f"Summing Two Independent Random Variables (n={n} samples)",
            "subtitle": "err + u = err*",
        },
    )
    .add_params(selection)
)
chart.display()

stats_df = pd.DataFrame(
    {
        "mean": [
            err.mean().item(),
            u.mean().item(),
            err_computational.mean().item(),
            err_analytical.mean().item(),
        ],
        "var": [
            err.var().item(),
            u.var().item(),
            err_computational.var().item(),
            err_analytical.var().item(),
        ],
    },
    index=["ϵ", "u", "ϵ* (computational)", "ϵ* (analytical)"],
)

print(stats_df)
                        mean       var
ϵ                  -0.001720  1.009451
u                   2.005137  3.989215
ϵ* (computational)  2.003417  4.973287
ϵ* (analytical)     2.008872  4.990684

In the chart, we can see \(\epsilon \sim N(0,1)\) is centered on 0 while \(u \sim N(2,4)\), \(\epsilon^*_{\text{analytical}} \sim N(2, 5)\), and \(\epsilon^*_{\text{computational}} = \epsilon + u \sim N(2, 5)\) are all centered on 2. Summing the means worked!

The variance is a bit more difficult to verify from the chart. We can see that \(\epsilon^*_{\text{analytical}}\) and \(\epsilon^*_{\text{computational}}\) have the same distribution, slightly wider than \(u\). The table confirms it: the error between the sample means and variances and their analytical values is small, and it shrinks towards 0 as we increase the sample size.

Now that we’re confident in random variable arithmetic, let’s check the full analytical solution against a computational one. Below we plot the first measured sample, \(y_1\), with the original signal, \(y_{\text{signal}}\), as a backdrop for reference:

\[ \displaylines{ \begin{alignat*}{2} y_{\text{\scriptsize{signal}}} &= \alpha &&+ \beta x \\ &= 3 &&+ 2 x \end{alignat*} } \]

\[ \displaylines{ \begin{alignat*}{3} y_{\text{1}} &= \alpha &&+ \beta x + \epsilon \\ &= 3 &&+ 2 x + N(\mu, \sigma^2) \\ &= 3 &&+ 2 x + N(0, 1) \end{alignat*} } \]

\[ \displaylines{ \begin{alignat*}{4} y^*_{\text{\scriptsize{analytical}}} &= y &&+ u \\ &= \alpha^* &&+ \beta^* x + \epsilon^* \\ &= \alpha^* &&+ \beta^* x + N(\mu^*, \sigma^{*2}) \\ &= 5 &&+ 2 x + N(0, 5) \\ \end{alignat*} } \]

\[ \displaylines{ \begin{alignat*}{4} y^*_{\text{\scriptsize{computational}}} &= y &&+ u \\ &= \alpha &&+ \beta x + \epsilon &&&+ u \\ &= 3 &&+ 2 x + N(0, 1) &&&+ N(2, 4) \end{alignat*} } \]

Code
import altair as alt
import pandas as pd
import torch

alt.data_transformers.disable_max_rows()

x = torch.arange(0, 10, 0.01)

# Original Signal
a = 3
b = 2
signal = a + (b * x)

# y Our measured sample
err_mean = 0
err_std = 1
err = torch.normal(err_mean, err_std, size=(len(x),))
y = a + (b * x) + err

# y* Analytical
a = 5
b = 2
err_mean = 0
err_std = 5 ** (0.5)
err_analytical = torch.normal(err_mean, err_std, size=(len(x),))
y_analytical = a + (b * x) + err_analytical

# y* Computational
a = 3
b = 2
u_mean = 2
u_std = 2  # std dev of the measurement error, not its variance
u = torch.normal(u_mean, u_std, size=(len(x),))
y_computational = a + (b * x) + (err + u)

# Data munging
df0 = pd.DataFrame({"x": x, "y": signal, "dataset": ["signal (original)"] * len(x)})
df1 = pd.DataFrame({"x": x, "y": y, "dataset": ["y"] * len(x)})
df2 = pd.DataFrame(
    {"x": x, "y": y_computational, "dataset": ["y* (computational)"] * len(x)}
)
df3 = pd.DataFrame({"x": x, "y": y_analytical, "dataset": ["y* (analytical)"] * len(x)})

df = pd.concat([df0, df1, df2, df3])

# Plot
colorscheme = [
    "#368BC1",  # blue
    "#F2BB18",  # yellow
    "#BB4430",  # red
    "#8A9A67",  # green
    "#CC771F",  # orange
    "#8B5260",  # purple
]
selection = alt.selection_point(fields=["dataset"], bind="legend")
chart = (
    alt.Chart(
        df,
        title="Computational simulation of the problem matches the analytical solution",
    )
    .mark_circle(size=10)
    .encode(
        x="x:Q",
        y="y:Q",
        color=alt.Color(
            "dataset:N", title=None, scale=alt.Scale(range=colorscheme)
        ).legend(orient="top-left"),
        opacity=alt.condition(selection, alt.value(0.75), alt.value(0.2)),
    )
    .configure(background="#f8f9fa")
    .properties(width=650, padding=10)
    .add_params(selection)
)

chart.display()

Note that torch.normal takes the standard deviation, not the variance.

To see each series more clearly, click on one in the legend.

The chart shows that the computational simulation of the problem matches the analytical solution, while the table of statistics above shows that the mean and variance of the computational residuals agree with the analytical ones.

# In the end, it didn’t even matter

That took a bit of work. Thankfully, this question didn’t make it into our company pub quiz. Even if it had, I think everyone (sans the author) would have had to ultimately guess the answer. We only had 30 seconds per question. I didn’t even finish reading the problem statement let alone begin tackling it in that time!

In the end, whether this question was included or not wouldn’t have changed the outcome. The quiz was neck and neck between 3 teams for most of the game, my team bouncing between 1\(^{st}\) and 3\(^{rd}\) place. That is, until the final round of questions from the Leadership team, which were highly specific to a certain someone.

If we have another pub quiz, I’ll make sure to sling in some equally maladjusted questions. But I’ll post the answers beforehand here.

Watch this space.

Good thing I had the CEO on my team 🥇

References

Ross, Sheldon. 2010. A First Course in Probability. 8th ed. Pearson Prentice Hall.