# Simulating juries

Find this notebook on the web at
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_with_code2.html#nte-life_and_death">Note <span>6.1</span></a>.

### 6.3.2 Using code to simulate a trial

We use the same logic to simulate a trial with the computer. A little
code makes the job easier, because we can ask Python to give us 12
random numbers from 0 through 99, and to count how many of these numbers
are in the range from 75 through 99. Numbers in the range from 75
through 99 correspond to black jurors.

### 6.3.3 Random numbers from 0 through 99

We can now use NumPy and the random number functions from the last
chapter to get 12 random numbers from 0 through 99.

In [None]:
# Import the NumPy library, rename as "np"
import numpy as np

# Ask NumPy for a random number generator.
rnd = np.random.default_rng()

# All the integers from 0 up to, but not including 100.
zero_thru_99 = np.arange(100)

# Get 12 random numbers from 0 through 99
a = rnd.choice(zero_thru_99, size=12)

# Show the result
a
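
Because the numbers are random, `a` will differ each time the cell runs. If you would like repeatable numbers while you experiment, one option is to give `default_rng` a seed. The following is only a sketch: the seed value `2024` and the names `rng_one`, `rng_two`, `first` and `second` are arbitrary choices for illustration, and are not part of the notebook.

```python
import numpy as np

zero_thru_99 = np.arange(100)

# Two generators started from the same (arbitrary) seed.
rng_one = np.random.default_rng(2024)
rng_two = np.random.default_rng(2024)

# Ask each generator for 12 random numbers from 0 through 99.
first = rng_one.choice(zero_thru_99, size=12)
second = rng_two.choice(zero_thru_99, size=12)

# Show the first set of numbers.
print(first)
# The same seed gives the same 12 numbers, so this shows True.
print(np.array_equal(first, second))
```

Leaving the seed out, as the notebook does, gives fresh random numbers on every run, which is what we want for the simulation itself.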

#### 6.3.3.1 Counting the jurors

We use *comparison* and `np.sum` to count how many numbers are greater
than 74, and therefore, in the range from 75 through 99:

In [None]:
# How many numbers are greater than 74?
b = np.sum(a > 74)
# Show the result
b
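
To see why this counts the jurors, it can help to look at the intermediate result of the comparison. The sketch below uses a small fixed array (the name `example` and its values are made up for illustration) so that the result is the same every time.

```python
import numpy as np

# A fixed example array, so the result does not change between runs.
example = np.array([12, 80, 74, 75, 99, 3])

# The comparison gives one True or False value per element.
is_black_juror = example > 74
print(is_black_juror)

# np.sum treats True as 1 and False as 0, so the sum is a count of Trues.
n_black = np.sum(is_black_juror)
print(n_black)
```

Here three of the six example numbers (80, 75 and 99) are greater than 74, so the sum is 3. Notice that 74 itself does not count.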

#### 6.3.3.2 A single simulated trial

We assemble the pieces from the last few sections to make a cell that
simulates a single trial:

In [None]:
rnd = np.random.default_rng()
zero_thru_99 = np.arange(100)

# Get 12 random numbers from 0 through 99
a = rnd.choice(zero_thru_99, size=12)

# How many numbers are greater than 74?
b = np.sum(a > 74)

# Show the result
b

## 6.4 Three simulation steps

Now we come back to the details of how we:

1.  repeat the simulated trial many times;
2.  record the results for each trial;
3.  calculate the required proportion as an estimate of the probability
    we seek.

Repeating the trial many times is the job of the `for` loop, and we will
come to that soon.

In order to record the results, we will store each trial result in an
array.

### 6.4.1 More on arrays

Since we will be working with arrays a lot, it is worth knowing more
about them.

A NumPy array is a *container* that stores many elements of the same
type. You have already seen, in
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_method.html"><span>Chapter 2</span></a>, how we
can create an array from a sequence of numbers using the `np.array`
function.

In [None]:
# Make an array of numbers, store with the name "some_numbers".
some_numbers = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Show the value of "some_numbers"
some_numbers

Another way that we can create arrays is to use the `np.zeros` function
to make a new array where all the elements are 0.

In [None]:
# Make a new array containing 5 zeros.
# store with the name "z".
z = np.zeros(5)
# Show the value of "z"
z

Notice the argument `5` to the `np.zeros` function. This tells the
function how many zeros we want in the array that the function will
return.
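
One detail that explains the way these arrays display: by default, `np.zeros` makes *floating point* zeros, shown as `0.` rather than `0`. The sketch below checks the element type with the array's `dtype` attribute. The names `float_zeros` and `int_zeros` are ours, for illustration only, and the `dtype=int` option is shown as an alternative rather than something the notebook uses.

```python
import numpy as np

# By default the zeros are floating point values.
float_zeros = np.zeros(5)
print(float_zeros.dtype)

# If you prefer whole-number storage, you can ask for integers instead.
int_zeros = np.zeros(5, dtype=int)
print(int_zeros.dtype)
```

For the simulations in this chapter the default floating point array works well, because a count such as 3 is stored exactly as `3.0`.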

## 6.5 Array length

There are various useful things we can do with this array container. One
is to ask how many elements there are in the array container. We can use
the `len` function to calculate the number of elements in an array:

In [None]:
# Show the number of elements in "z"
len(z)

## 6.6 Indexing into arrays with integers

Another thing we can do with arrays is *set* the value for a particular
element. To do this, we use square brackets following the array value,
on the left hand side of the equals sign, like this:

In [None]:
# Set the value of the *first* element in the array.
z[0] = 99
# Show the new contents of the array.
z

Read the first line of code as “the element at position 0 gets a value
of 99”.

Notice that the position number of the first element in the array is 0,
and the position number of the second element is 1. Think of the
position as an *offset* from the beginning of the array. The first
element is at the beginning of the array, and so it is at offset
(position) 0. This can be a little difficult to get used to at first,
but you will find that thinking of positions as offsets in this way
soon starts to come naturally, and later you will also find that it
helps you to avoid some common mistakes when using positions for getting
and setting values.

For practice, let us also set the value of the third element in the
array:

In [None]:
# Set the value of the *third* element in the array.
z[2] = 99
# Show the new contents of the array.
z

Read the first code line above as “set the value at position 2 in the
array to have the value 99”.

We can also *get* the value of the element at a given position, using
the same square-bracket notation:

In [None]:
# Get the value of the *first* element in the array.
# Store the value with name "v"
v = z[0]
# Show the value we got
v

Read the first code line here as “v gets the value at position 0 in the
array”.

Using square brackets to get and set element values is called *indexing*
into the array.
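
Thinking in offsets also tells us where an array ends: an array with `n` elements has valid positions 0 through `n - 1`. The following sketch illustrates this with a made-up array called `values`. The `try` / `except` statement is standard Python for catching an error; we have not covered it in this chapter, and it is here only so the example runs to completion.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])
n = len(values)

# The first element is at offset 0, the last is at offset n - 1.
first = values[0]
last = values[n - 1]
print(first, last)

# Offset n is one step past the end, so NumPy raises an IndexError.
try:
    values[n]
except IndexError as err:
    print('IndexError:', err)
```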

### 6.6.1 Repeating trials

As a preview, let us now imagine that we want to do 50 simulated trials
of Robert Swain’s jury in Hypothetical County. We will want to store the
count for each trial, to give 50 counts.

In order to do this, we make an array to hold the 50 counts. Call this
array `z`.

In [None]:
# An array to hold the 50 count values.
z = np.zeros(50)

We could run a single trial to get a single simulated count. Here we
just repeat the code cell you saw above. Notice that we can get a
different result each time we run this code, because the numbers in `a`
are *random* choices from the range 0 through 99, and different random
numbers will give different counts.

In [None]:
rnd = np.random.default_rng()
zero_thru_99 = np.arange(100)
# Get 12 random numbers from 0 through 99
a = rnd.choice(zero_thru_99, size=12)
# How many numbers are greater than 74?
b = np.sum(a > 74)
# Show the result
b

Now we have the result of a single trial, we can store it as the first
number in the `z` array:

In [None]:
# Store the single trial count as the first value in the "z" array.
z[0] = b
# Show all the values in the "z" array.
z

Of course we could just keep doing this: run the cell corresponding to a
trial, above, to get a new count, and then store it at the next position
in the `z` array. For example, we could store the counts for the first
three trials with:

In [None]:
# First trial
a = rnd.choice(zero_thru_99, size=12)
b = np.sum(a > 74)
# Store the result at the first position in z
# Remember, the first position is offset 0.
z[0] = b
# Second trial
a = rnd.choice(zero_thru_99, size=12)
b = np.sum(a > 74)
# Store the result at the second position in z
z[1] = b
# Third trial
a = rnd.choice(zero_thru_99, size=12)
b = np.sum(a > 74)
# Store the result at the third position in z
z[2] = b

# And so on ...

This would get terribly long and boring to type for 50 trials. Luckily
computer code is very good at repeating the same procedure many times.
For example, Python can do this using a `for` loop. You have already
seen a preview of the `for` loop in
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_method.html"><span>Chapter 2</span></a> and
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_with_code.html"><span>Chapter 5</span></a>. Here we
dive into `for` loops in more depth.

### 6.6.2 For-loops in Python

A for-loop is a way of asking Python to:

- Take a sequence of things, one by one, and
- Do the same task on each one.

We often use this idea when we are trying to explain a repeating
procedure. For example, imagine we wanted to explain what the
supermarket checkout person does for the items in your shopping basket.
You might say that they do this:

> For each item of shopping in your basket, they take the item off the
> conveyor belt, scan it, and put it on the other side of the till.

You could also break this description up into bullet points with
indentation, to say the same thing:

- For each item from your shopping basket, they:
  - Take the item off the conveyor belt.
  - Scan the item.
  - Put it on the other side of the till.

Notice the logic; the checkout person is repeating the same procedure
for each of a series of items.

This is the logic of the `for` loop in Python. The procedure that Python
repeats is called the *body of the for loop*. In the example of the
checkout person above, the repeating procedure is:

- Take the item off the conveyor belt.
- Scan the item.
- Put it on the other side of the till.

Now imagine we wanted to use Python to print out the year of birth for
each of the authors for the third edition of this book:

| Author               | Year of birth |
|----------------------|---------------|
| Julian Lincoln Simon | 1932          |
| Matthew Brett        | 1964          |
| Stéfan van der Walt  | 1980          |

We want to see this output:

    Author birth year is 1932
    Author birth year is 1964
    Author birth year is 1980

Of course, we could just ask Python to print out these exact lines, like
this:

In [None]:
print('Author birth year is 1932')

In [None]:
print('Author birth year is 1964')

In [None]:
print('Author birth year is 1980')

We might instead notice that we are repeating the same procedure for
each of the three birth years, and decide to do the same thing using a
`for` loop:

In [None]:
author_birth_years = np.array([1932, 1964, 1980])

# For each birth year
for birth_year in author_birth_years:
    # Repeat this procedure ...
    print('Author birth year is', birth_year)

The `for` loop starts with a line where we tell it what items we want to
repeat the procedure for:

    for birth_year in author_birth_years:

This *initial line* of the `for` loop ends with a colon.

The next thing in the `for` loop is the procedure Python should follow
for each item. Python knows that the following lines are the procedure
it should repeat, because the lines are *indented*. The *indented* lines
are the *body of the for loop*.

The initial line of the `for` loop above tells Python that it should
take *each item* in `author_birth_years`, one by one — first 1932, then
1964, then 1980. For each of these numbers it will:

- Put the number into the variable `birth_year`, then
- Run the indented code.

Just as the person at the supermarket checkout takes each item in turn,
for each iteration (repeat) of the `for` loop, `birth_year` gets a new
value from the sequence in `author_birth_years`. `birth_year` is called
the *loop variable*, because it is the variable that gets a new value
each time we begin a new iteration of the `for` loop procedure. As for
any variable in Python, we can call our loop variable anything we like.
We used `birth_year` here, but we could have used `y` or `year` or some
other name.

Now you know what the `for` loop is doing, you can see that the `for`
loop above is equivalent to the following code:

In [None]:
birth_year = 1932  # Set the loop variable to contain the first value.
print('Author birth year is', birth_year)  # Use it.

In [None]:
birth_year = 1964  # Set the loop variable to contain the next value.
print('Author birth year is', birth_year)  # Use the second value.

In [None]:
birth_year = 1980
print('Author birth year is', birth_year)

Writing the steps in the `for` loop out like this is called *unrolling*
the loop. It can be a useful exercise to do this when you come across a
`for` loop, in order to work through the logic of the loop. For example,
you may want to write out the unrolled equivalent of the first couple of
iterations, to see what the loop variable will be, and what will happen
in the body of the loop.

We often use `for` loops with ranges (see
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_with_code.html#sec-ranges"><span>Section 5.9</span></a>). Here we use a loop
to print out the numbers 0 through 3:

In [None]:
for n in np.arange(4):
    print('The loop variable n is', n)

Notice that the range ended at (the number before) 4, and that means we
repeat the loop body 4 times. We can also use the loop variable value
from the range as an *index*, to get or set the first, second, etc
values from an array.

For example, maybe we would like to show the author position *and* the
author year of birth.

Remember our author birth years:

In [None]:
author_birth_years

We can get (for example) the second author birth year with:

In [None]:
author_birth_years[1]

Remember, for Python, the first element is position 0, so the second
element is position 1.

Using the combination of looping over a range, and array indexing, we
can print out the author position *and* the author birth year:

In [None]:
for n in np.arange(3):
    year = author_birth_years[n]
    print('Birth year of author position', n, 'is', year)

Again, remember Python considers 0 as the first position.

Just for practice, let us unroll the three iterations through this `for`
loop, to remind ourselves what the code is doing:

In [None]:
# Unrolling the for loop.
n = 0
year = author_birth_years[n]  # Will be 1932
print('Birth year of author position', n, 'is', year)

In [None]:
n = 1
year = author_birth_years[n]  # Will be 1964
print('Birth year of author position', n, 'is', year)

In [None]:
n = 2
year = author_birth_years[n]  # Will be 1980
print('Birth year of author position', n, 'is', year)
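
The loops above use the loop variable to *get* values from an array. The same index can also *set* values in another array, which is the pattern we need for storing trial results. Here is a minimal sketch; the array name `years_after_1900` and the calculation are invented for illustration.

```python
import numpy as np

author_birth_years = np.array([1932, 1964, 1980])

# An array to hold one result per author.
years_after_1900 = np.zeros(3)

for n in np.arange(3):
    # Get the value at position n ...
    year = author_birth_years[n]
    # ... and set the matching position in the results array.
    years_after_1900[n] = year - 1900

# Show the filled-in results array.
print(years_after_1900)
```

After the loop, `years_after_1900` holds 32, 64 and 80, one value for each pass through the loop body.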

### 6.6.3 `range` in Python `for` loops

So far we have used `np.arange` to give us the sequence of integers that
we feed into the `for` loop. But — as you saw in
<a class="quarto-xref" href="https://resampling-stats.github.io/edition-3-python/resampling_with_code.html#sec-python-range"><span>Section 5.10</span></a> — we can also
get a range of numbers from Python’s `range` function. `range` is a
common and useful alternative way to provide a range of numbers to a
`for` loop.

You have just seen how we would use `np.arange` to send the numbers 0,
1, and 2 to a `for` loop, in the example above, repeated here:

In [None]:
for n in np.arange(3):
    year = author_birth_years[n]
    print('Birth year of author position', n, 'is', year)

We could also use `range` instead of `np.arange` to do the same task:

In [None]:
for n in range(3):
    year = author_birth_years[n]
    print('Birth year of author position', n, 'is', year)

In fact, you will see this pattern throughout the book, where we use
`for` statements like `for value in range(10000):` to ask Python to put
each number in the range 0 up to (not including) 10000 into the variable
`value`, and then do something in the body of the loop. Just to be
clear, we could always, and almost as easily, write
`for value in np.arange(10000):` to do the same task. However, we
generally prefer `range` in our Python `for` loops, because it is just a
little less typing (without the `np.a` of `np.arange`), and because it
is a more common pattern in standard Python code.[^1]
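
As a quick check of these claims, the sketch below confirms that `range` and `np.arange` give the same sequence, and compares how much memory each needs, as discussed in the footnote. The variable names are ours, and we use 1,000,000 elements rather than the footnote's 10,000,000 to keep the example light. `sys.getsizeof` reports the size of the `range` object itself, and `nbytes` reports the size of the array's data; the exact numbers depend on your platform, so we only compare them.

```python
import sys
import numpy as np

# Both give the same sequence of integers.
same = list(range(5)) == np.arange(5).tolist()
print(same)

n = 1_000_000
# A range only keeps track of its start, stop and step.
range_bytes = sys.getsizeof(range(n))
# An array stores every one of its elements in memory.
array_bytes = np.arange(n).nbytes
print(range_bytes, array_bytes)
print(range_bytes < array_bytes)
```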

### 6.6.4 Putting it all together

Here is the code we worked out above, to implement a single trial:

In [None]:
rnd = np.random.default_rng()
zero_thru_99 = np.arange(100)
# Get 12 random numbers from 0 through 99
a = rnd.choice(zero_thru_99, size=12)
# How many numbers are greater than 74?
b = np.sum(a > 74)
# Show the result
b

We found that we could use arrays to store the results of these trials,
and that we could use `for` loops to repeat the same procedure many
times.

Now we can put these parts together to do 50 simulated trials:

In [None]:
# Procedure for 50 simulated trials.

# The NumPy random number generator.
rnd = np.random.default_rng()

# All the numbers from 0 through 99.
zero_through_99 = np.arange(100)

# An array to store the counts for each trial.
z = np.zeros(50)

# Repeat the trial procedure 50 times.
for i in np.arange(50):
    # Get 12 random numbers from 0 through 99
    a = rnd.choice(zero_through_99, size=12)
    # How many numbers are greater than 74?
    b = np.sum(a > 74)
    # Store the result at the next position in the "z" array.
    z[i] = b
    # Now go back and do the next trial until finished.
# Show the result of all 50 trials.
z

Finally, we need to count how many of the trials in `z` ended up with
all-white juries. These are the trials with a `z` (count) value of 0.

To do this, we can ask an array which elements match a certain
condition. E.g.:

In [None]:
x = np.array([2, 1, 3, 0])
y = x < 2
# Show the result
y

We now use that same technique to ask, of *each of the 50 counts* in the
array `z`, whether that count is equal to 0, like this:

In [None]:
# Is the value of z equal to 0?
all_white = z == 0
# Show the result of the comparison.
all_white

We need to get the number of `True` values in `all_white`, to find how
many simulated trials gave all-white juries.

In [None]:
# Count the number of True values in "all_white"
# This is the same as the number of values in "z" that are equal to 0.
n_all_white = np.sum(all_white)
# Show the count.
n_all_white

`n_all_white` is the number of simulated trials for which all the jury
members were white. It only remains to get the proportion of trials for
which this was true, and to do this, we divide by the number of trials.

In [None]:
# Proportion of trials where all jury members were white.
p = n_all_white / 50
# Show the result
p

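In the run reported here, this initial simulation gave an estimate of
around 0% for the chance that a jury selected randomly from the
population, which was 26% black (modeled in our code by the 25 numbers
from 75 through 99), would have no black jurors. Because the trials are
random, and there are only 50 of them, the estimate can only change in
steps of 2%, and your own run may well give a different value.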

## 6.7 Many many trials

Our experiment above is only 50 simulated trials. The higher the number
of trials, the more confident we can be of our estimate for `p` — the
proportion of trials where we get an all-white jury.
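
To get a feel for how much a 50-trial estimate can move around, we could repeat the whole 50-trial experiment several times and compare the estimates. The following is a sketch of that idea, using a loop inside a loop. The seed value and the names `n_experiments` and `estimates` are arbitrary choices for illustration.

```python
import numpy as np

# An arbitrary seed, so the sketch gives the same output each time.
rnd = np.random.default_rng(1)
zero_through_99 = np.arange(100)

# Repeat the whole 50-trial experiment this many times.
n_experiments = 5
estimates = np.zeros(n_experiments)

for j in range(n_experiments):
    # One experiment: 50 simulated trials, as before.
    z = np.zeros(50)
    for i in range(50):
        a = rnd.choice(zero_through_99, size=12)
        z[i] = np.sum(a > 74)
    # Proportion of all-white juries in this experiment.
    estimates[j] = np.sum(z == 0) / 50

# Show the five estimates.
print(estimates)
```

Each estimate is a whole number of fiftieths (0, 0.02, 0.04, and so on), so estimates from 50 trials are necessarily coarse.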

It is no extra trouble for us to tell the computer to do a very large
number of trials. For example, we might want to run 10,000 trials
instead of 50. All we have to do is to run the loop 10,000 times instead
of 50 times. The computer has to do more work, but it is more than up to
the job.

Here is the same code we ran above, but collected into one cell, and
using 10,000 trials instead of 50. We have left out most of the
comments, to make the code more compact.

In [None]:
# Full simulation procedure, with 10,000 trials.
rnd = np.random.default_rng()
zero_through_99 = np.arange(100)
# 10,000 trials.
z = np.zeros(10000)
for i in np.arange(10000):
    a = rnd.choice(zero_through_99, size=12)
    b = np.sum(a > 74)
    z[i] = b
all_white = z == 0
n_all_white = np.sum(all_white)
p = n_all_white / 10000
p

We now have a new, more accurate estimate of the proportion of
Hypothetical County juries that are all white. The proportion is about
0.03, or 3%, although the exact value will vary a little from run to run.

This proportion means that, for any one jury from Hypothetical County,
there is a less than one in 20 chance that the jury would be all white.
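
As a rough cross-check on the simulation, we can also work out this chance directly. This assumes, as the simulation code does, that each of the 12 jurors is chosen independently, and is white with probability 75 in 100 (the numbers 0 through 74). The variable names below are ours.

```python
# Chance that one randomly chosen juror is white, as modeled by the code
# (the numbers 0 through 74, out of 0 through 99).
p_white = 75 / 100

# Chance that all 12 independently chosen jurors are white.
p_all_white_exact = p_white ** 12

# Show the result, rounded to 4 decimal places.
print(round(p_all_white_exact, 4))
```

The result is close to 0.0317, which agrees well with the simulation estimate of about 0.03.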

As we will see in more detail later, we might consider using the results
from this experiment in Hypothetical County, to reflect on the result we
saw in the real Talladega County. We might conclude, for example, that
there was likely some systematic difference between Hypothetical County
and Talladega County. Maybe the difference was that there was, in fact,
some bias in the jury selection in Talladega County, and that the
Supreme Court was wrong to reject this. You will hear more of this line
of reasoning later in the book.


[^1]: Actually, there is a reason why many Python programmers prefer
    `range` to `np.arange` in the headers for their `for` loops. `range`
    is a very efficient container, in that it doesn’t need to take up
    all the memory required to create the full array, it just needs to
    keep track of the number to give you next. For example, consider
    `for i in np.arange(10000000):` — in this case Python has to make an
    array with 10,000,000 elements, and then, from that array, it passes
    each value one by one to the `for` loop. On the other hand,
    `for i in range(10000000):` will do the job just as well, passing
    the same sequence of 0 through 9,999,999 to `i`, one by one, but
    `range(10000000)` never has to make the whole 10,000,000 element
    array — it just needs to keep track of which number to give up next.
    Therefore `range` is very quick, and very efficient in memory. This
    doesn’t have any great practical impact for the arrays we are using
    here, typically of 10,000 elements or so, but it can be important
    for larger arrays.