Appendix A — Exercise Solutions

A.1 Solution for paired differences exercise 24.3.1

We suggested that you start by ignoring the pairing of the before and after samples, and that is what we will do here. Then we will extend the treatment to take the pairing into account.

Note A.1: Notebook: Paired differences solution
import numpy as np
import pandas as pd

rnd = np.random.default_rng()

df = pd.read_csv('data/hamilton.csv')
before = np.array(df['score_before'])
after = np.array(df['score_after'])

observed_diff = np.mean(after) - np.mean(before)

# Let us start with a permutation test.
both = np.concatenate([before, after])
n_before = len(before)

# Samples in the null world.
n_trials = 10_000
results = np.zeros(n_trials)
for i in range(n_trials):
    shuffled = rnd.permuted(both)
    fake_before = shuffled[:n_before]
    fake_after = shuffled[n_before:]
    fake_diff = np.mean(fake_after) - np.mean(fake_before)
    results[i] = fake_diff

# We are interested in fake differences that are larger
# in magnitude than the observed difference (hence "abs").
# Here we have no prior hypothesis about which direction the difference
# will go.
k = np.sum(np.abs(results) >= np.abs(observed_diff))
kk = k / n_trials
print('Permutation p null-world abs >= abs observed:', kk)
Permutation p null-world abs >= abs observed: 0.2554
# Next a bootstrap test.
n_after = len(after)  # Of course, in our case, this will be == n_before
results = np.zeros(n_trials)
for i in range(n_trials):
    fake_before = rnd.choice(both, size=n_before)
    fake_after = rnd.choice(both, size=n_after)
    fake_diff = np.mean(fake_after) - np.mean(fake_before)
    results[i] = fake_diff

k = np.sum(np.abs(results) >= np.abs(observed_diff))
kk = k / n_trials
print('Bootstrap p null-world abs >= abs observed:', kk)
Bootstrap p null-world abs >= abs observed: 0.2217

Finally we take the pairing into account. We have some reason to think that the patients (or cars) vary in some substantial way in their level of depression (or their tendency to break down), but we believe that the value of interest is the patients’ response to treatment (or the difference between the mechanics).

In that case, we are interested in the differences within the pairs. In the null world, where there is no systematic difference between before and after (or between mechanics A and B), these within-pair differences are just as likely to go one way as the other, so we can flip the before / after (A / B) values within a pair and still be in the same world.

Notice that flipping the before / after or A / B in the pair just changes the sign of the difference.

So we will simulate the effect of flipping the values within each pair by choosing a random sign for the pair, where -1 means the pair is flipped and 1 means the pair stays in its original order. We recalculate the mean difference with these random signs (flips) applied, and these recalculated means will be our values in the null world.

# A test of paired difference with sign flips for the null world.
differences = after - before
observed_mdiff = np.mean(differences)
n_both = len(differences)

results = np.zeros(n_trials)
for i in range(n_trials):
    # Choose random signs to perform random flips of the pairs.
    signs = rnd.choice([-1, 1], size=n_both)
    # Do flips.
    fake_differences = signs * differences
    # Calculate mean difference and store result.
    results[i] = np.mean(fake_differences)

k = np.sum(np.abs(results) >= np.abs(observed_mdiff))
kk = k / n_trials
print('Sign-flip p null-world abs >= abs observed:', kk)
Sign-flip p null-world abs >= abs observed: 0.0271

Notice that the sign-flip test, in which we preserve the information about the patients / cars, is much more convincing than the permutation or bootstrap tests above, where we choose to ignore that information.

This can occur when the values within the pairs (rows) are similar to each other, but less similar across different pairs (rows).
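
To see why, here is a minimal sketch, not part of the original exercise, using invented data: each pair shares a large random offset, plus a small consistent effect. The unpaired permutation null has to absorb the big between-pair variation, while the sign-flip null works only with the within-pair differences, so its spread is much smaller. All of the names below (rnd_demo, demo_before and so on) are new for this sketch.

# Sketch with invented data: large variation between pairs, small
# consistent effect within pairs.
import numpy as np

rnd_demo = np.random.default_rng()

n_pairs = 20
offsets = rnd_demo.normal(50, 10, size=n_pairs)  # shared within each pair
demo_before = offsets + rnd_demo.normal(0, 1, size=n_pairs)
demo_after = offsets + 2 + rnd_demo.normal(0, 1, size=n_pairs)  # effect of 2

# Null distribution ignoring the pairing (permutation of all values).
demo_both = np.concatenate([demo_before, demo_after])
unpaired_nulls = np.zeros(10_000)
for i in range(10_000):
    shuffled = rnd_demo.permuted(demo_both)
    unpaired_nulls[i] = np.mean(shuffled[n_pairs:]) - np.mean(shuffled[:n_pairs])

# Null distribution respecting the pairing (sign flips of the differences).
demo_diffs = demo_after - demo_before
flip_nulls = np.zeros(10_000)
for i in range(10_000):
    signs = rnd_demo.choice([-1, 1], size=n_pairs)
    flip_nulls[i] = np.mean(signs * demo_diffs)

print('Spread (SD) of unpaired null:', np.std(unpaired_nulls))
print('Spread (SD) of sign-flip null:', np.std(flip_nulls))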

End of notebook: Paired differences solution


A.2 Solution to seatbelt proportions exercise 24.3.2

Note A.2: Notebook: Seatbelt proportion solution
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

pittsburgh = np.repeat(['seatbelt', 'none'], [36, 36])
n_pitts = len(pittsburgh)
chicago = np.repeat(['seatbelt', 'none'], [77, 52])
n_chicago = len(chicago)

n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    fake_pitts = rnd.choice(pittsburgh, size=n_pitts)
    fake_chicago = rnd.choice(chicago, size=n_chicago)
    fake_p_pitts = np.sum(fake_pitts == 'seatbelt') / n_pitts
    fake_p_chicago = np.sum(fake_chicago == 'seatbelt') / n_chicago
    fake_p_diff = fake_p_pitts - fake_p_chicago
    results[i] = fake_p_diff

plt.hist(results, bins=25)
plt.title('Bootstrap distribution of p differences')
plt.xlabel('Bootstrap p differences')

p_limits = np.quantile(results, [0.025, 0.975])

print('95% percent limits for p differences:', p_limits)
95% percent limits for p differences: [-0.24386305  0.04521964]

End of notebook: Seatbelt proportion solution


A.3 Solution for unemployment percentage 27.8.1

In a sample of 200 people, 7 percent are found to be unemployed. Determine a 95 percent confidence interval for the true population proportion.

Note A.3: Notebook: Unemployment percent solution
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    fake_people = rnd.choice(['no job', 'job'], size=200, p=[0.07, 0.93])
    p_unemployed = np.sum(fake_people == 'no job') / 200
    results[i] = p_unemployed

plt.hist(results, bins=25)
plt.title('Bootstrap distribution p unemployed')
plt.xlabel('Bootstrap p unemployed')

p_limits = np.quantile(results, [0.025, 0.975])

print('95% percent limits for p unemployed:', p_limits)
95% percent limits for p unemployed: [0.035 0.105]
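
As an aside, and not part of the original solution, the same null world can be simulated more compactly by drawing the count of unemployed people directly from a binomial distribution. This is a sketch reusing rnd and n_trials from above; because the two simulations describe the same process, the percentile limits come out essentially the same as the limits printed above.

# Alternative sketch: draw the "no job" counts directly from a binomial.
fake_counts = rnd.binomial(200, 0.07, size=n_trials)
fake_props = fake_counts / 200
print('Binomial 95% limits:', np.quantile(fake_props, [0.025, 0.975]))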

End of notebook: Unemployment percent solution


A.4 Solution for battery lifetime 27.8.2

A sample of 20 batteries is tested, and the average lifetime is 28.85 months. Establish a 95 percent confidence interval for the true average value. The sample values (lifetimes in months) are listed below.

We use the “bootstrap” technique of drawing many bootstrap re-samples with replacement from the original sample, and observing how the re-sample means are distributed.

Note A.4: Notebook: Battery lifetime solution
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

lifetimes = np.array([30, 32, 31, 28, 31, 29, 29, 24, 30, 31,
                      28, 28, 32, 31, 24, 23, 31, 27, 27, 31])

print('Mean is:', np.mean(lifetimes))
Mean is: 28.85
n_lifetimes = len(lifetimes)
n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    # Draw 20 lifetimes from "lifetimes", randomly and with replacement.
    fake_lifetimes = rnd.choice(lifetimes, size=n_lifetimes)
    # Find the average lifetime of the 20.
    fake_mean = np.mean(fake_lifetimes)
    # Keep score.
    results[i] = fake_mean

plt.hist(results, bins=25)
plt.title('Bootstrap distribution of mean battery lifetimes')
plt.xlabel('Bootstrap mean battery lifetime')

mean_limits = np.quantile(results, [0.025, 0.975])

print('95% percent limits for mean lifetimes:', mean_limits)
95% percent limits for mean lifetimes: [27.65 29.95]

End of notebook: Battery lifetime solution


A.5 Solution for optical density 27.8.3

Note A.5: Notebook: Optical density solution
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

Suppose we have 10 measurements of Optical Density on a batch of HIV negative control samples:

density = np.array(
    [.02, .026, .023, .017, .022, .019, .018, .018, .017, .022])

Derive a 95 percent confidence interval for the sample mean. Are there enough measurements to produce a satisfactory answer?

n_density = len(density)

n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    fake_density = rnd.choice(density, size=n_density)
    results[i] = np.mean(fake_density)

plt.hist(results, bins=25)
plt.title('Bootstrap distribution of density means')
plt.xlabel('Bootstrap density means')

mean_limits = np.quantile(results, [0.025, 0.975])

print('95% percent limits for density mean:', mean_limits)
95% percent limits for density mean: [0.0185 0.022 ]

End of notebook: Optical density solution


A.6 Solution for voter participation 29.7.1

The observed correlation coefficient between voter participation and spread is moderate and negative. Is this more negative than what might occur by chance, if no correlation exists in the underlying population from which this sample was taken?

  1. Create two groups of paper cards: 25 with participation rates, and 25 with the spread values. Arrange the cards in pairs in accordance with the table, and compute the correlation coefficient between the participation and spread variables.
  2. Shuffle one of the sets, say that with participation, and compute correlation between shuffled participation and spread.
  3. Repeat step 2 many, say 1000, times. Compute the proportion of the trials in which correlation was at least as negative as that for the original data.
Note A.6: Notebook: Voter participation in 1844 election
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

voter_df = pd.read_csv('data/election_1844.csv')
participation = np.array(voter_df['Participation'])
spread = np.array(voter_df['Spread'])
# Compute correlation.  It's -0.425.
actual_r = np.corrcoef(participation, spread)[0, 1]
actual_r
np.float64(-0.4249067562318352)
n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    # Shuffle the participation rates.
    shuffled = rnd.permuted(participation)
    # Compute re-sampled correlation.
    fake_r = np.corrcoef(shuffled, spread)[0, 1]
    # Keep the value in the results.
    results[i] = fake_r

plt.hist(results, bins=25)
plt.title('Distribution of shuffled correlations')
plt.xlabel('Correlation with shuffled participation')

# Count the trials when result <= observed.
k = np.sum(results <= actual_r)
# Compute the proportion of such trials.
kk = k / n_trials

print('Proportion of shuffled r <= observed:', np.round(kk, 2))
Proportion of shuffled r <= observed: 0.03

End of notebook: Voter participation in 1844 election


From this we may conclude that the voter participation rates probably are negatively related to the vote spread in the election. The actual value of the correlation (-.425) is unlikely to be explained by chance alone: in our Monte Carlo simulation of the null hypothesis, a correlation that negative is found only about 3 percent of the time.


A.7 Solution for association of runs and strikeouts 29.7.2

We are looking at the correlation of home runs and strikeouts for major-league baseball players.

The instructions ask us to start here with the sum-of-products measure.
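
In other words, the test statistic is the sum of products of the paired values, \(\sum_i x_i y_i\), where \(x_i\) and \(y_i\) are the home run and strikeout counts for player \(i\). This is the quantity that np.sum(homeruns * strikeout) computes in the notebook below.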

Note A.7: Notebook: Homeruns and strikeout sum of products.
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

homeruns = np.array([14, 20, 0, 38, 9, 38, 22, 31, 33,
                     11, 40, 5, 15, 32, 3, 29, 5, 32])
strikeout = np.array([135, 153, 120, 161, 138, 175, 126, 200, 205,
                      147, 165, 124, 169, 156, 36, 98, 82, 131])
# The sum of products approach.
actual_sop = np.sum(homeruns * strikeout)

n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    shuffled_runs = rnd.permuted(homeruns)
    fake_sop = np.sum(shuffled_runs * strikeout)
    results[i] = fake_sop

plt.hist(results, bins=25)
plt.title('Distribution of shuffled sum of products')
plt.xlabel('Sum of products for shuffled homeruns')

k = np.sum(results >= actual_sop)
kk = k / n_trials

print('p shuffled sum of products >= actual:', np.round(kk, 3))
p shuffled sum of products >= actual: 0.003

Interpretation: In 10,000 simulations, random shuffling very rarely produced a value as high as observed. Therefore, we conclude that random chance could not reasonably be responsible for the observed degree of correlation.

End of notebook: Homeruns and strikeout sum of products.


A.8 Solution for runs, strikeouts and correlation coefficient 29.7.3

Again, we are looking at the correlation of home runs and strikeouts for major-league baseball players. This time we will use the correlation coefficient (\(r\)) measure.
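
As a reminder, the correlation coefficient used here is the usual Pearson \(r\):

\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
\]

This is the value that np.corrcoef returns in the off-diagonal elements of its 2 by 2 correlation matrix.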

Note A.8: Notebook: Homeruns and strikeout correlation
import numpy as np
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

homeruns = np.array([14, 20, 0, 38, 9, 38, 22, 31, 33,
                     11, 40, 5, 15, 32, 3, 29, 5, 32])
strikeout = np.array([135, 153, 120, 161, 138, 175, 126, 200, 205,
                      147, 165, 124, 169, 156, 36, 98, 82, 131])
# The correlation approach.
actual_r = np.corrcoef(homeruns, strikeout)[0, 1]

n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    shuffled_runs = rnd.permuted(homeruns)
    fake_r = np.corrcoef(shuffled_runs, strikeout)[0, 1]
    results[i] = fake_r

plt.hist(results, bins=25)
plt.title('Distribution of shuffled r')
plt.xlabel('r for shuffled homeruns')

k = np.sum(results >= actual_r)
kk = k / n_trials

print('p shuffled r >= actual:', np.round(kk, 3))
p shuffled r >= actual: 0.003

Interpretation: a correlation coefficient as high as the observed value (.62) occurred only about 0.3% of the time by chance. Hence, provisionally, we choose to reject chance as an explanation for such a high value of the correlation coefficient.

Notice that we get the same answer for the correlation coefficient as we do for the sum of products. In fact the two must give the same answer (apart from random variation in the permutations), because shuffling does not change the means or standard deviations of the two sequences, so \(r\) is an increasing linear function of the sum of products, and the two statistics order the shuffles in exactly the same way.
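
Here is a quick check of that claim. It is not in the original notebook, and it reuses homeruns, strikeout and rnd from the cells above; the names check_sops and check_rs are new for this sketch.

# Check: across shuffles, r and the sum of products move in lockstep.
check_sops = np.zeros(1000)
check_rs = np.zeros(1000)
for i in range(1000):
    shuffled_runs = rnd.permuted(homeruns)
    check_sops[i] = np.sum(shuffled_runs * strikeout)
    check_rs[i] = np.corrcoef(shuffled_runs, strikeout)[0, 1]

# The correlation between the two statistics over the shuffles is 1 (to
# within floating point error).
print(np.corrcoef(check_sops, check_rs)[0, 1])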

End of notebook: Homeruns and strikeout correlation


A.9 Solution for money and exchange rate 29.7.4

Note A.9: Notebook: Exchange rates and money supply
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rnd = np.random.default_rng()

exchange_df = pd.read_csv('data/exchange_rates.csv')
exchange_rates = np.array(exchange_df['exchange_rate'])
money_supply = np.array(exchange_df['money_supply'])
# Correlation.
actual_r = np.corrcoef(exchange_rates, money_supply)[0, 1]
actual_r
np.float64(0.4206068712932843)
n_trials = 10_000
results = np.zeros(n_trials)

for i in range(n_trials):
    shuffled_rates = rnd.permuted(exchange_rates)
    fake_r = np.corrcoef(shuffled_rates, money_supply)[0, 1]
    results[i] = fake_r

plt.hist(results, bins=25)
plt.title('Distribution of shuffled exchange rates r values')
plt.xlabel('r for shuffled exchange rate')

k = np.sum(results >= actual_r)
kk = k / n_trials

print('p shuffled r >= actual:', np.round(kk, 3))
p shuffled r >= actual: 0.0

End of notebook: Exchange rates and money supply


Interpretation: None of our 10,000 random shufflings of these data produced a correlation as large as the observed correlation (.42) between the exchange rate and the money supply. Thus, the observed result 0.42 cannot reasonably be explained by chance alone, and we conclude that it is statistically surprising.