Chapter 5 introduced a problem in probability that was also a problem in statistics. We asked how surprised we should be at the results of a trial of a new cancer treatment regime.
Here we study another urgent problem in the real world: racial bias and the death penalty.
This example comes from the excellent Berkeley introduction to data science (Ani Adhikari and Wagner 2021).
Robert Swain was a young black man who was sentenced to death in the early 1960s. Swain’s trial was held in Talladega County, Alabama. At the time, 26% of the eligible jurors in that county were black, but every member of Swain’s jury was white. Swain and his legal team appealed to the Alabama Supreme Court, and then to the US Supreme Court, arguing that there was racial bias in the jury selection. They noted that there had been no black jurors in Talladega County since 1950, even though black people made up about a quarter of the eligible pool of jurors. The US Supreme Court rejected this argument, in a 6 to 3 opinion, writing that “The overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of Negroes.”
Swain’s team presented a variety of evidence on bias in jury selection, but here we will look at the obvious and apparently surprising fact that Swain’s jury was entirely white. The Supreme Court decided that the “disparity” between selection of white and black jurors “has been small” — but how would they, and how would we, make a rational decision about whether this disparity really was “small”?
You might reasonably be worried about the result of this decision for Robert Swain. In fact his death sentence was invalidated by a later, unrelated decision and he served a long prison sentence instead. In 1986, the Supreme Court overturned the precedent set by Swain’s case, in Batson v. Kentucky, 476 U.S. 79.
To answer the question that the Supreme Court asked, we return to the method we used in the last chapter.
Let us imagine a hypothetical world, in which each individual black or white person had an equal chance of being selected for the jury. Call this world Hypothetical County, Alabama.
Just as in 1960s Talladega County, 26% of eligible jurors in Hypothetical County are black. Hypothetical County jury selection has no bias against black people, so we expect around 26% of the jury to be black. 0.26 * 12 = 3.12, so we expect that, on average, just over 3 out of 12 jurors in a Hypothetical County jury will be black. But if we select each juror at random from the population, then sometimes, by chance, we will have fewer than 3 black jurors, and sometimes we will have more than 3. And, by chance, sometimes we will have no black jurors at all. If the jurors really are selected at random, how often would we expect there to be no black jurors? We would like to estimate the probability that we will get no black jurors. If that probability is small, then we have some evidence that the disparity in selection between black and white jurors was not “small”.
What is the probability of an all white jury being randomly selected out of a population having 26% black people?
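Before we start, we need to figure out three things:
1. What do we mean by one trial?
2. What is the outcome of interest from the trial?
3. How do we simulate one trial?
We then take three steps to calculate the desired probability:
1. Repeat the simulated trial many times.
2. Count the number of trials whose outcome matches the outcome we are interested in.
3. Divide that count by the number of trials. This proportion is our estimate of the probability.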
For this problem, our task is made a little easier by the fact that our trial (in the resampling sense) is a simulated trial (in the legal sense). One trial requires 12 simulated jurors, each labeled by race (white or black).
The outcome we are interested in is the number of black jurors.
Now comes the harder part. How do we simulate one trial?
One trial requires 12 jurors, and we are interested only in the race of each juror. In Hypothetical County, where selection by race is entirely random, each juror has a 26% chance of being black.
We need a way of simulating a 26% chance.
One way of doing this is by getting a random number from 0 through 99 (inclusive). There are 100 numbers in the range 0 through 99 (inclusive).
We will arbitrarily say that the juror is white if the random number is in the range from 0 through 73. 74 of the 100 numbers are in this range, so the juror has a 74/100 = 74% chance of getting the label “white”. We will say the juror is black if the random number is in the range 74 through 99. There are 26 such numbers, so the juror has a 26% chance of getting the label “black”.
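As a quick check on this labeling rule, we can ask R to count the numbers in each range, and to label a few example numbers. This is only a sketch for illustration. The names all_numbers, n_white, n_black, example_numbers and juror_labels are our own inventions for this example, and ifelse is a function the chapter has not otherwise used.

```r
# All the numbers that a juror could be given.
all_numbers <- 0:99
# Count the numbers that lead to each label.
n_white <- sum(all_numbers <= 73)
n_black <- sum(all_numbers >= 74)
n_white
n_black

# Label some example numbers on either side of the boundary.
example_numbers <- c(0, 73, 74, 99)
juror_labels <- ifelse(example_numbers <= 73, 'white', 'black')
juror_labels
```

The two counts should add up to 100, with 73 being the last number labeled “white” and 74 the first labeled “black”.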
Next we need a way of getting a random number in the range 0 through 99. This is an easy job for the computer, but if we had to do this with a physical device, we could get a single number by throwing two 10-sided dice, say a blue die and a green die. The face of the blue die will be the 10s digit, and the green face will be the ones digit. So, if the blue die comes up with 8 and the green die has 4, then the random number is 84.
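If it helps to see the two-dice idea written down, here is one possible sketch of it in R. We assume each 10-sided die shows a digit from 0 through 9. The names blue, green and random_number are made up for this example.

```r
# Throw the blue die (the 10s digit) and the green die (the ones digit).
# We assume each 10-sided die shows a digit from 0 through 9.
blue <- sample(0:9, size=1)
green <- sample(0:9, size=1)
# Combine the two digits into one number from 0 through 99.
random_number <- blue * 10 + green
random_number

# The worked example from the text: blue shows 8, green shows 4.
worked_example <- 8 * 10 + 4
worked_example
```

In practice, asking for a number from 0:99 directly does the same job in one step.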
We could then simulate 12 jurors by repeating this process 12 times, each time writing down “white” if the number is from 0 through 73, and “black” otherwise. The trial outcome is the number of times we wrote “black” for these 12 simulated jurors.
We use the same logic to simulate a trial with the computer. A little code makes the job easier, because we can ask R to give us 12 random numbers from 0 through 99, and to count how many of these numbers are in the range from 74 through 99. Numbers in the range from 74 through 99 correspond to black jurors.
We can now use R and sample from the last chapter to get 12 random numbers from 0 through 99.
# Get 12 random numbers from 0 through 99
a <- sample(0:99, size=12, replace=TRUE)
# Show the result
a
[1] 44 22 75 62 46 30 67 72 68 4 23 78
We use comparison and sum to count how many numbers are greater than 73, and are therefore in the range from 74 through 99:
# How many numbers are greater than 73?
b <- sum(a > 73)
# Show the result
b
[1] 2
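To see how the comparison and sum work together, it can help to look at the intermediate step. The sketch below types in the 12 example numbers shown above, rather than drawing new random numbers, so the result is the same every time. The name is_black is our own choice for this illustration.

```r
# The 12 numbers from the example output above.
a <- c(44, 22, 75, 62, 46, 30, 67, 72, 68, 4, 23, 78)
# The comparison gives one TRUE or FALSE value for each number.
is_black <- a > 73
is_black
# sum treats each TRUE as 1 and each FALSE as 0,
# so it counts the TRUE values.
b <- sum(is_black)
b
```

Only 75 and 78 are greater than 73, so the count is 2.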
We assemble the pieces from the last few sections to make a chunk that simulates a single trial:
# Get 12 random numbers from 0 through 99
a <- sample(0:99, size=12, replace=TRUE)
# How many are greater than 73?
b <- sum(a > 73)
# Show the result
b
[1] 2
Now we come back to the details of how we:
- repeat the simulated trial many times; and
- record the result of each trial.
Repeating the trial many times is the job of the for loop, and we will come to that soon.
In order to record the results, we will store each trial result in a vector.
Since we will be working with vectors a lot, it is worth knowing more about them.
A vector is a container that stores many elements of the same type. You have already seen, in Chapter 2, how we can create a vector from a sequence of numbers using the c() function.
# Make a vector of numbers, store with the name "some_numbers".
some_numbers <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
# Show the value of "some_numbers"
some_numbers
[1] 0 1 2 3 4 5 6 7 8 9
Another way that we can create vectors is to use the numeric function to make a new vector where all the elements are 0.
# Make a new vector containing 5 zeros.
z <- numeric(5)
# Show the value of "z"
z
[1] 0 0 0 0 0
Notice the argument 5 to the numeric function. This tells the function how many zeros we want in the vector that the function will return.
There are various useful things we can do with this vector container. One is to ask how many elements there are in the vector container. We can use the length function to calculate the number of elements in a vector:
# Show the number of elements in "z"
length(z)
[1] 5
Another thing we can do with vectors is set the value for a particular element. To do this, we use square brackets following the vector value, on the left-hand side of the assignment, like this:
# Set the value of the first element in the vector.
z[1] <- 99
# Show the new contents of the vector.
z
[1] 99 0 0 0 0
Read the first line of code as “the element at position 1 gets a value of 99”.
For practice, let us also set the value of the third element in the vector:
# Set the value of the third element in the vector.
z[3] <- 99
# Show the new contents of the vector.
z
[1] 99 0 99 0 0
Read the first code line above as “set the value at position 3 in the vector to have the value 99”.
We can also get the value of the element at a given position, using the same square-bracket notation:
# Get the value of the *first* element in the vector.
# Store the value with name "v"
v <- z[1]
# Show the value we got
v
[1] 99
Read the first code line here as “v gets the value at position 1 in the vector”.
Using square brackets to get and set element values is called indexing into the vector.
As a preview, let us now imagine that we want to do 50 simulated trials of Robert Swain’s jury in Hypothetical County. We will want to store the count for each trial, to give 50 counts.
In order to do this, we make a vector to hold the 50 counts. Call this vector z.
# A vector to hold the 50 count values.
z <- numeric(50)
We could run a single trial to get a single simulated count. Here we just repeat the code chunk you saw above. Notice that we can get a different result each time we run this code, because the numbers in a are random choices from the range 0 through 99, and different random numbers will give different counts.
# Get 12 random numbers from 0 through 99
a <- sample(0:99, size=12, replace=TRUE)
# How many are greater than 73?
b <- sum(a > 73)
# Show the result
b
[1] 0
Now that we have the result of a single trial, we can store it as the first number in the z vector:
# Store the single trial count as the first value in the "z" vector.
z[1] <- b
# Show all the values in the "z" vector.
z
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
Of course we could just keep doing this: run the chunk corresponding to a trial, above, to get a new count, and then store it at the next position in the z vector. For example, we could store the counts for the first three trials with:
# First trial
a <- sample(0:99, size=12, replace=TRUE)
b <- sum(a > 73)
# Store the result at the first position in z
z[1] <- b

# Second trial
a <- sample(0:99, size=12, replace=TRUE)
b <- sum(a > 73)
# Store the result at the second position in z
z[2] <- b

# Third trial
a <- sample(0:99, size=12, replace=TRUE)
b <- sum(a > 73)
# Store the result at the third position in z
z[3] <- b

# And so on ...
This would get terribly long and boring to type for 50 trials. Luckily computer code is very good at repeating the same procedure many times. For example, R can do this using a for loop. You have already seen a preview of the for loop in Chapter 2. Here we dive into for loops in more depth.
A for-loop is a way of asking R to:
- take each item from a sequence, one by one; and
- do the same task on each item.
We often use this idea when we are trying to explain a repeating procedure. For example, imagine we wanted to explain what the supermarket checkout person does for the items in your shopping basket. You might say that they do this:
For each item of shopping in your basket, they take the item off the conveyor belt, scan it, and put it on the other side of the till.
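You could also break this description up into bullet points with indentation, to say the same thing:
- For each item of shopping in your basket, they:
  - take the item off the conveyor belt;
  - scan it;
  - put it on the other side of the till.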
Notice the logic; the checkout person is repeating the same procedure for each of a series of items.
This is the logic of the for loop in R. The procedure that R repeats is called the body of the for loop. In the example of the checkout person above, the repeating procedure is:
- take the item off the conveyor belt;
- scan it;
- put it on the other side of the till.
Now imagine we wanted to use R to print out the year of birth for each of the authors for the third edition of this book:
| Author | Year of birth |
|---|---|
| Julian Lincoln Simon | 1932 |
| Matthew Brett | 1964 |
| Stéfan van der Walt | 1980 |
We want to see this output:
Author birth year is 1932
Author birth year is 1964
Author birth year is 1980
Of course, we could just ask R to print out these exact lines, like this:
message('Author birth year is 1932')
Author birth year is 1932
message('Author birth year is 1964')
Author birth year is 1964
message('Author birth year is 1980')
Author birth year is 1980
We might instead notice that we are repeating the same procedure for each of the three birth years, and decide to do the same thing using a for loop:
author_birth_years <- c(1932, 1964, 1980)

# For each birth year
for (birth_year in author_birth_years) {
    # Repeat this procedure ...
    message('Author birth year is ', birth_year)
}
Author birth year is 1932
Author birth year is 1964
Author birth year is 1980
The for loop starts with a line where we tell it what items we want to repeat the procedure for:
for (birth_year in author_birth_years) {
This initial line of the for loop ends with an opening curly brace {. The opening curly brace tells R that what follows, up until the matching closing curly brace }, is the procedure R should follow for each item. The lines between the opening { and closing } curly braces are the body of the for loop.
The initial line of the for loop above tells R that it should take each item in author_birth_years, one by one: first 1932, then 1964, then 1980. For each of these numbers it will:
- put the number into the variable birth_year, then
- run the code in the body of the loop.
Just as the person at the supermarket checkout takes each item in turn, for each iteration (repeat) of the for loop, birth_year gets a new value from the sequence in author_birth_years. birth_year is called the loop variable, because it is the variable that gets a new value each time we begin a new iteration of the for loop procedure. As for any variable in R, we can call our loop variable anything we like. We used birth_year here, but we could have used y or year or some other name.
Notice that R insists we put parentheses (round brackets) around all three of: the loop variable; in; and the sequence that will fill the loop variable. For example:
for (birth_year in author_birth_years) {
Do not forget these round brackets — R insists on them.
Now that you know what the for loop is doing, you can see that the for loop above is equivalent to the following code:
birth_year <- 1932  # Set the loop variable to contain the first value.
message('Author birth year is ', birth_year)  # Use the first value.
Author birth year is 1932
birth_year <- 1964  # Set the loop variable to contain the next value.
message('Author birth year is ', birth_year)  # Use the second value.
Author birth year is 1964
birth_year <- 1980
message('Author birth year is ', birth_year)
Author birth year is 1980
Writing the steps in the for loop out like this is called unrolling the loop. It can be a useful exercise to do this when you come across a for loop, in order to work through the logic of the loop. For example, you may want to write out the unrolled equivalent of the first couple of iterations, to see what the loop variable will be, and what will happen in the body of the loop.
We often use for loops with ranges (see Section 5.9). Here we use a loop to print out the numbers 1 through 4:
for (n in 1:4) {
message('The loop variable n is ', n)
}
The loop variable n is 1
The loop variable n is 2
The loop variable n is 3
The loop variable n is 4
Notice that the range ended at 4, and that means we repeat the loop body 4 times. We can also use the loop variable value from the range as an index, to get or set the first, second, etc values from a vector.
For example, maybe we would like to show the author position and the author year of birth.
Remember our author birth years:
author_birth_years
[1] 1932 1964 1980
We can get (for example) the second author birth year with:
author_birth_years[2]
[1] 1964
Using the combination of looping over a range, and vector indexing, we can print out the author position and the author birth year:
for (n in 1:3) {
    year <- author_birth_years[n]
    message('Birth year of author position ', n, ' is ', year)
}
Birth year of author position 1 is 1932
Birth year of author position 2 is 1964
Birth year of author position 3 is 1980
Just for practice, let us unroll the three iterations through this for loop, to remind ourselves what the code is doing:
# Unrolling the for loop.
n <- 1
year <- author_birth_years[n]  # Will be 1932
message('Birth year of author position ', n, ' is ', year)
Birth year of author position 1 is 1932
n <- 2
year <- author_birth_years[n]  # Will be 1964
message('Birth year of author position ', n, ' is ', year)
Birth year of author position 2 is 1964
n <- 3
year <- author_birth_years[n]  # Will be 1980
message('Birth year of author position ', n, ' is ', year)
Birth year of author position 3 is 1980
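In the loop above we typed the 3 in 1:3 by hand. One possible variation, sketched below, is to ask R for the number of elements with length, and use that to build the range. The name n_authors is our own choice for this example.

```r
author_birth_years <- c(1932, 1964, 1980)
# Get the number of authors, so we do not have to type 3 by hand.
n_authors <- length(author_birth_years)
# Loop over the positions 1 through n_authors.
for (n in 1:n_authors) {
    year <- author_birth_years[n]
    message('Birth year of author position ', n, ' is ', year)
}
```

This version should keep working if we later add another birth year to the vector, because the range follows the length of the vector.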
Here is the code we worked out above, to implement a single trial:
# Get 12 random numbers from 0 through 99
a <- sample(0:99, size=12, replace=TRUE)
# How many are greater than 73?
b <- sum(a > 73)
# Show the result
b
[1] 0
We found that we could use vectors to store the results of these trials, and that we could use for loops to repeat the same procedure many times.
Now we can put these parts together to do 50 simulated trials:
# Procedure for 50 simulated trials.
# A vector to store the counts for each trial.
z <- numeric(50)

# Repeat the trial procedure 50 times.
for (i in 1:50) {
    # Get 12 random numbers from 0 through 99
    a <- sample(0:99, size=12, replace=TRUE)
    # How many are greater than 73?
    b <- sum(a > 73)
    # Store the result at the next position in the "z" vector.
    z[i] <- b
    # Now go back and do the next trial until finished.
}
# Show the result of all 50 trials.
z
[1] 4 1 1 4 2 3 4 3 1 2 3 2 5 3 2 3 4 3 1 5 5 2 1 1 2 2 2 3 0 2 6 2 2 3 4 0 3 4
[39] 2 5 3 2 3 3 3 4 2 2 4 4
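It can be hard to take in 50 numbers at a glance. As an aside, one way to summarize them is to tabulate how many trials gave each count. The sketch below reruns the 50 trials and then uses R's table function, which this chapter has not otherwise introduced. Because the numbers are random, the table will differ from run to run.

```r
# Run 50 simulated trials, as above.
z <- numeric(50)
for (i in 1:50) {
    a <- sample(0:99, size=12, replace=TRUE)
    z[i] <- sum(a > 73)
}
# Tabulate how many trials gave each count of black jurors.
count_table <- table(z)
count_table
```

The top row of the table shows the counts of black jurors that came up, and the bottom row shows how many of the 50 trials gave each count.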
Finally, we need to count how many of the trials in z ended up with all-white juries. These are the trials with a z (count) value of 0.
To do this, we can ask a vector which elements match a certain condition. E.g.:
x <- c(2, 1, 3, 0)
y <- x < 2
# Show the result
y
[1] FALSE TRUE FALSE TRUE
We now use that same technique to ask whether each of the 50 counts in the vector z is equal to 0, like this:
# Is the value of z equal to 0?
all_white <- z == 0
# Show the result of the comparison.
all_white
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
We need to get the number of TRUE values in all_white, to find how many simulated trials gave all-white juries.
# Count the number of TRUE values in "all_white"
# This is the same as the number of values in "z" that are equal to 0.
n_all_white <- sum(all_white)
# Show the result of the count.
n_all_white
[1] 2
n_all_white is the number of simulated trials for which all the jury members were white. It only remains to get the proportion of trials for which this was true, and to do this, we divide by the number of trials.
# Proportion of trials where all jury members were white.
p <- n_all_white / 50
# Show the result
p
[1] 0.04
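To check these last steps by hand, the sketch below types in the 50 counts from the example output above, so the answer does not depend on random numbers. It also shows one possible shortcut: taking the mean of the TRUE and FALSE values gives the proportion of TRUE values directly. The name p_short is our own, and the shortcut is an alternative we are suggesting here, not a step the procedure needs.

```r
# The 50 counts from the example output above.
z <- c(4, 1, 1, 4, 2, 3, 4, 3, 1, 2, 3, 2, 5, 3, 2, 3, 4, 3, 1, 5,
       5, 2, 1, 1, 2, 2, 2, 3, 0, 2, 6, 2, 2, 3, 4, 0, 3, 4, 2, 5,
       3, 2, 3, 3, 3, 4, 2, 2, 4, 4)
# Count the trials with no black jurors.
n_all_white <- sum(z == 0)
# Divide by the number of trials to get the proportion.
p <- n_all_white / length(z)
p
# Shortcut: the mean of TRUE / FALSE values is the proportion of TRUE.
p_short <- mean(z == 0)
p_short
```

Two of the 50 counts are 0, so both calculations give 2 / 50 = 0.04.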
From this initial simulation, it seems there is around a 4% chance that a jury selected randomly from the population, which was 26% black, would have no black jurors.
Our experiment above is only 50 simulated trials. The higher the number of trials, the more confident we can be of our estimate for p, the proportion of trials where we get an all-white jury.
It is no extra trouble for us to tell the computer to do a very large number of trials. For example, we might want to run 10,000 trials instead of 50. All we have to do is to run the loop 10,000 times instead of 50 times. The computer has to do more work, but it is more than up to the job.
Here is exactly the same code we ran above, but collected into one chunk, and using 10,000 trials instead of 50. We have left out the comments, to make the code more compact.
# Full simulation procedure, with 10,000 trials.
z <- numeric(10000)
for (i in 1:10000) {
    a <- sample(0:99, size=12, replace=TRUE)
    b <- sum(a > 73)
    z[i] <- b
}
all_white <- z == 0
n_all_white <- sum(all_white)
p <- n_all_white / 10000
p
[1] 0.0317
We now have a new, more accurate estimate of the proportion of Hypothetical County juries that are all white. The proportion is 0.032, and so 3.2%.
This proportion means that, for any one jury from Hypothetical County, there is a less than one in 20 chance that the jury would be all white.
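To get a feel for why more trials give a steadier estimate, we can repeat the whole simulation a few times at each size and compare the estimates. The sketch below is one way to do that. It wraps the simulation in a function, a tool this chapter has not introduced, and the names estimate_p, small_estimates and large_estimates are hypothetical names for this illustration only.

```r
# Hypothetical helper: run "n_trials" simulated trials, and return
# the proportion of trials that gave an all-white jury.
estimate_p <- function(n_trials) {
    z <- numeric(n_trials)
    for (i in 1:n_trials) {
        a <- sample(0:99, size=12, replace=TRUE)
        z[i] <- sum(a > 73)
    }
    sum(z == 0) / n_trials
}

# Three separate estimates, each from 50 trials.
small_estimates <- c(estimate_p(50), estimate_p(50), estimate_p(50))
small_estimates
# Three separate estimates, each from 10,000 trials.
large_estimates <- c(estimate_p(10000), estimate_p(10000), estimate_p(10000))
large_estimates
```

When you run this, you should typically find that the three estimates from 50 trials differ from each other much more than the three estimates from 10,000 trials do.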
As we will see in more detail later, we might consider using the results from this experiment in Hypothetical County to reflect on the result we saw in the real Talladega County. We might conclude, for example, that there was likely some systematic difference between Hypothetical County and Talladega County. Maybe the difference was that there was, in fact, some bias in the jury selection in Talladega County, and that the Supreme Court was wrong to reject this. You will hear more of this line of reasoning later in the book.
In this chapter we studied a real life-and-death question, on racial bias and the death penalty. We continued our exploration of the ways we can use probability, and resampling, to draw conclusions about real events. Along the way, we went into more detail on vectors in R, and for loops, two basic tools in resampling.
In the next chapter, we will work through some more problems in probability, to show how we can use resampling, to answer questions about chance. We will add some more tools for writing code in R, to make your programs easier to write, read, and understand.