Joint Discrete Probability Distributions

A joint distribution is a probability distribution of two or more random variables, which need not be independent. In a joint distribution, each random variable still has its own probability distribution, expected value, variance, and standard deviation. In addition, probabilities are assigned to ordered pairs of values of the random variables. Furthermore, the strength of any linear relationship between two of the variables can be measured.

Most often, a joint distribution having two discrete random variables is given in table form. In this situation, the body of the table contains the probabilities for the different ordered pairs of the random variables, while the margins contain the probabilities for the individual random variables.

Formulas

Suppose the joint distribution of the random variables $X$ and $Y$ is given in table form, so that $P_{XY}(X=x, Y=y)$, typically abbreviated as $P_{XY}(x,y)$, is given for each pair $(x,y)$ of values. As with all discrete distributions, two requirements must hold. The first applies to each pair $(x,y)$, and the second applies to the table as a whole:

$0 \le P_{XY}(x,y) \le 1$

$\sum\limits_{\text{all }x} \sum\limits_{\text{all }y} P_{XY}(x,y) = 1$
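The two requirements can be checked mechanically. The following is a minimal sketch, assuming the table is stored as a list of rows; the function name `is_valid_joint_pmf` and both small tables are made up for illustration.

```python
import math

def is_valid_joint_pmf(table, tol=1e-9):
    """Return True if every entry lies in [0, 1] and all entries sum to 1."""
    entries = [p for row in table for p in row]
    in_range = all(0.0 <= p <= 1.0 for p in entries)
    # Compare with a tolerance, since decimal fractions are not exact in floating point.
    sums_to_one = math.isclose(sum(entries), 1.0, abs_tol=tol)
    return in_range and sums_to_one

# Hypothetical 2x2 joint tables, not taken from the text.
good = [[0.1, 0.2],
        [0.3, 0.4]]
bad = [[0.5, 0.2],
       [0.3, 0.4]]   # entries sum to 1.4

print(is_valid_joint_pmf(good))  # True
print(is_valid_joint_pmf(bad))   # False
```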

Then the marginal probabilities $P_X (X=x)$ and $P_Y(Y=y)$, abbreviated as $P_X(x)$ and $P_Y(y)$, the expected values $E(X)$ and $E(Y)$, and the variances $Var(X)$ and $Var(Y)$ can be found by the following formulas.

\begin{align} P_X (X=x) &= \sum\limits_{\text{all }y} P_{XY}(x,y) \\ P_Y (Y=y) &= \sum\limits_{\text{all }x} P_{XY}(x,y) \\ E(X) &= \sum\limits_{\text{all }x} x P_X (x) \\ E(Y) &= \sum\limits_{\text{all }y} y P_Y (y) \\ Var(X) &= \sum\limits_{\text{all }x} x^2 P_X (x) - (E(X))^2 \\ Var(Y) &= \sum\limits_{\text{all }y} y^2 P_Y (y) - (E(Y))^2 \end{align}

As always, the standard deviations $\sigma_X$ and $\sigma_Y$ are the square roots of their respective variances.
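
The variance formulas above are in the computational (shortcut) form. As a brief sketch of where that form comes from, write $\mu_X = E(X)$ (a symbol introduced here only for brevity) and start from the definition of variance as the expected squared deviation from the mean. The derivation uses the facts that $\sum x P_X(x) = \mu_X$ and that the marginal probabilities sum to 1. The same steps apply to $Var(Y)$.

\begin{align} Var(X) &= \sum\limits_{\text{all }x} (x - \mu_X)^2 P_X (x) \\ &= \sum\limits_{\text{all }x} x^2 P_X (x) - 2\mu_X \sum\limits_{\text{all }x} x P_X (x) + \mu_X^2 \sum\limits_{\text{all }x} P_X (x) \\ &= \sum\limits_{\text{all }x} x^2 P_X (x) - 2\mu_X^2 + \mu_X^2 \\ &= \sum\limits_{\text{all }x} x^2 P_X (x) - (E(X))^2 \end{align}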

To measure any relationship between two random variables, we use the covariance, defined by the following formula.

$Cov(X,Y) = \sum\limits_{\text{all }x} \sum\limits_{\text{all }y} xy P_{XY}(x,y) - E(X)E(Y)$

However, the covariance is measured in the product of the units of $X$ and the units of $Y$, so its size is affected by the units chosen to measure the data. Therefore, we also define the correlation, which is unitless.

$\rho_{XY} = \dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}$
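
As a brief sketch of why the correlation is unitless, suppose the units are changed so that $X$ becomes $aX$ and $Y$ becomes $bY$, where $a > 0$ and $b > 0$ are hypothetical conversion factors (not part of the example that follows). The covariance is scaled by $ab$, but so is the product of the standard deviations, and the factors cancel.

\begin{align} Cov(aX,bY) &= ab \, Cov(X,Y) \\ \sigma_{aX} &= a \sigma_X, \quad \sigma_{bY} = b \sigma_Y \\ \rho_{aX,bY} &= \dfrac{ab \, Cov(X,Y)}{(a \sigma_X)(b \sigma_Y)} = \rho_{XY} \end{align}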

The correlation $\rho_{XY}$ always takes values on the interval $[-1,1]$. The sign of the correlation indicates whether the relationship is direct or inverse (if the data were graphed, whether the slope would be positive or negative). A correlation of zero indicates that there is no linear relationship between the variables, and the variables are then called uncorrelated. Uncorrelated variables are not necessarily independent, since a nonlinear relationship may still exist, but independent variables always have a correlation of zero. Correlations of positive or negative one indicate a perfect linear relationship. Values of $\rho_{XY}$ near $\pm 1$ are called strong correlations, values near $0$ are called weak correlations, and intermediate values are called moderate correlations.
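
A correlation of zero rules out only a linear relationship. The following minimal sketch uses a made-up distribution (not from the text) in which $Y = X^2$, so $Y$ is completely determined by $X$, yet the covariance, and therefore the correlation, is exactly zero. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical distribution, not from the text: X is -1, 0, or 1 with
# probability 1/3 each, and Y = X**2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would require P(x, y) = P_X(x) * P_Y(y) for every pair.
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_00 = joint[(0, 0)]

print(cov)                  # 0
print(p_00 == p_x0 * p_y0)  # False: 1/3 is not 1/9
```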

An Example

An apartment manager decides to see if timely payment of rent is related to unpaid credit card balances. Using $X$ as the number of unpaid credit card balances, and $Y$ as the number of timely rent payments in the last four months, he examines his records and the credit reports of his renters and obtains the following results.

	X=0	X=1	X=2	X=3	Totals
Y=0	0.01	0.01	0.06	0.03	0.11
Y=1	0.02	0.04	0.07	0.04	0.17
Y=2	0.03	0.08	0.09	0.04	0.24
Y=3	0.05	0.11	0.08	0.02	0.26
Y=4	0.13	0.05	0.03	0.01	0.22
Totals	0.24	0.29	0.33	0.14	1.00

By inspection of the table, we can answer some basic probability questions.

The proportion of renters with exactly 3 timely payments and 1 unpaid credit card balance is $P_{XY}(X=1, Y=3) = 0.11$, or 11%.
The proportion of renters with exactly 3 timely payments is $P_Y (Y=3) = 0.26$, or 26%.
The proportion of renters with exactly 1 unpaid credit card balance is $P_X (X=1) = 0.29$, or 29%.
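
The marginal totals in the table can be reproduced by summing rows and columns. This is a minimal sketch, assuming the body of the table is stored as a list of rows (one per value of $Y$); the variable names are chosen for illustration.

```python
# Joint probabilities from the table: rows are Y = 0..4, columns are X = 0..3.
joint = [
    [0.01, 0.01, 0.06, 0.03],
    [0.02, 0.04, 0.07, 0.04],
    [0.03, 0.08, 0.09, 0.04],
    [0.05, 0.11, 0.08, 0.02],
    [0.13, 0.05, 0.03, 0.01],
]

# Marginal of Y: sum across each row. Marginal of X: sum down each column.
# Rounding to 2 places removes floating-point noise from the sums.
p_y = [round(sum(row), 2) for row in joint]
p_x = [round(sum(col), 2) for col in zip(*joint)]

print(joint[3][1])  # P(X=1, Y=3)
print(p_y[3])       # P(Y=3)
print(p_x[1])       # P(X=1)
```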

The expected value (or mean) of each random variable can be found by use of the formulas.

\begin{align} E(X) &= \sum\limits_{\text{all }x} x P_X (x) = 0(0.24) + 1(0.29) + 2(0.33) + 3(0.14) = 1.37 \\ E(Y) &= \sum\limits_{\text{all }y} y P_Y (y) = 0(0.11) + 1(0.17) + 2(0.24) + 3(0.26) + 4(0.22) = 2.31 \end{align}

The renters at this apartment complex have a mean of 1.37 unpaid credit card balances, and a mean of 2.31 timely rent payments in the last four months. These are long-run averages: if renters were randomly selected many times, the average number of unpaid balances would approach 1.37 and the average number of timely payments would approach 2.31. No individual renter has these values, since both counts are whole numbers.

We can also use the formulas to compute the variance and standard deviation of each random variable.

\begin{align} Var(X) &= \sum\limits_{\text{all }x} x^2 P_X (x) - (E(X))^2 \\ &= 0^2(0.24) + 1^2(0.29) + 2^2(0.33) + 3^2(0.14) - 1.37^2 = 0.9931 \\ \sigma_X &= \sqrt{0.9931} = 0.9965 \\ Var(Y) &= \sum\limits_{\text{all }y} y^2 P_Y (y) - (E(Y))^2 \\ &= 0^2(0.11) + 1^2(0.17) + 2^2(0.24) + 3^2(0.26) + 4^2(0.22) - 2.31^2 = 1.6539 \\ \sigma_Y &= \sqrt{1.6539} = 1.2860 \end{align}

Interpreting these results, we find variances of 0.9931 squared balances and 1.6539 squared rent payments. But since the squared units of variance are somewhat unclear, we generally favor the standard deviations of 0.9965 balances and 1.2860 rent payments. The standard deviations give a measure of the typical deviation from the means reported earlier.
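
The means, variances, and standard deviations can be checked with a short script. This is a minimal sketch that starts from the marginal totals of the table; the helper name `mean_var_sd` is an assumption made for illustration.

```python
import math

# Marginal distributions read from the "Totals" row and column of the table.
x_vals = [0, 1, 2, 3]
p_x = [0.24, 0.29, 0.33, 0.14]
y_vals = [0, 1, 2, 3, 4]
p_y = [0.11, 0.17, 0.24, 0.26, 0.22]

def mean_var_sd(values, probs):
    """Mean, variance (shortcut formula), and standard deviation."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum(v**2 * p for v, p in zip(values, probs)) - mean**2
    return mean, var, math.sqrt(var)

mean_x, var_x, sd_x = mean_var_sd(x_vals, p_x)
mean_y, var_y, sd_y = mean_var_sd(y_vals, p_y)

print(f"E(X) = {mean_x:.2f}, Var(X) = {var_x:.4f}, sd = {sd_x:.4f}")
print(f"E(Y) = {mean_y:.2f}, Var(Y) = {var_y:.4f}, sd = {sd_y:.4f}")
```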

To examine the relationship between these variables, we first compute the covariance.

\begin{align} Cov(X,Y) &= \sum\limits_{\text{all }x} \sum\limits_{\text{all }y} xy P_{XY}(x,y) - E(X)E(Y) \\ &= 0(0)(0.01) + 0(1)(0.02) + 0(2)(0.03) + 0(3)(0.05) + 0(4)(0.13) \\ &\phantom{=} + 1(0)(0.01) + 1(1)(0.04) + 1(2)(0.08) + 1(3)(0.11) + 1(4)(0.05) \\ &\phantom{=} + 2(0)(0.06) + 2(1)(0.07) + 2(2)(0.09) + 2(3)(0.08) + 2(4)(0.03) \\ &\phantom{=} + 3(0)(0.03) + 3(1)(0.04) + 3(2)(0.04) + 3(3)(0.02) + 3(4)(0.01) - (1.37)(2.31) \\ &= -0.5547 \end{align}

Since the covariance is nonzero, the variables are not independent, but do have some sort of relationship. Since the covariance is negative, the relationship is an inverse relationship: renters with more unpaid credit card balances tend to have fewer timely rent payments. This describes an association and does not by itself show that one causes the other. To determine the strength of the relationship, we need to evaluate the correlation.

$\rho_{XY} = \dfrac{Cov(X,Y)}{\sigma_X \sigma_Y} = \dfrac{-0.5547}{(0.9965)(1.2860)} = -0.4329$

With $\rho_{XY} = -0.4329$, we find that there is a moderate negative linear relationship between the number of unpaid credit card balances and the timeliness of rent payments.
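
The covariance and correlation can also be computed directly from the body of the table. This is a minimal sketch with variable names chosen for illustration. Because the script does not round the standard deviations before dividing, the fourth decimal place of the correlation may differ slightly from the hand calculation above.

```python
import math

x_vals = [0, 1, 2, 3]
y_vals = [0, 1, 2, 3, 4]

# Joint probabilities from the table: rows are Y = 0..4, columns are X = 0..3.
joint = [
    [0.01, 0.01, 0.06, 0.03],
    [0.02, 0.04, 0.07, 0.04],
    [0.03, 0.08, 0.09, 0.04],
    [0.05, 0.11, 0.08, 0.02],
    [0.13, 0.05, 0.03, 0.01],
]

# Flatten the table into (x, y, probability) triples.
pairs = [(x, y, joint[j][i])
         for j, y in enumerate(y_vals)
         for i, x in enumerate(x_vals)]

e_x = sum(x * p for x, y, p in pairs)
e_y = sum(y * p for x, y, p in pairs)
e_xy = sum(x * y * p for x, y, p in pairs)
var_x = sum(x * x * p for x, y, p in pairs) - e_x**2
var_y = sum(y * y * p for x, y, p in pairs) - e_y**2

cov = e_xy - e_x * e_y
rho = cov / math.sqrt(var_x * var_y)

print(f"Cov(X,Y) = {cov:.4f}")
print(f"rho = {rho:.4f}")
```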