7 Tools for samples and sampling

7.1 Introduction

Now you have some experience with Python, probabilities and resampling, it is time to introduce some useful tools for our experiments and programs.

Note 7.1: Notebook: Sampling tools

Download notebook Interact

7.2 Samples and labels and strings

Thus far we have used numbers such as 1 and 0 and 10 to represent the elements we are sampling from. For example, in Chapter 6, we were simulating the chance of a particular juror being black, given that 26% of the eligible jurors in the county were black. We used integers for that task, where we started with all the integers from 0 through 99, and asked NumPy to select values at random from those integers. When NumPy selected an integer from 0 through 25, we chose to label the resulting simulated juror as black — there are 26 integers in the range 0 through 25, so there is a 26% chance that any one integer will be in that range. If the integer was from 26 through 99, the simulated juror was white (there are 74 integers in the range 26 through 99).

Here is the process of simulating a single juror, adapted from Section 6.3.3:

import numpy as np
# Ask Numpy for a random number generator.
rnd = np.random.default_rng()

# All the integers from 0 up to, but not including 100.
zero_thru_99 = np.arange(100)

# Get one random numbers from 0 through 99
a = rnd.choice(zero_thru_99)

# Show the result
a

np.int64(59)

After that, we have to unpack our labeling of 0 through 25 as being “black” and 26 through 99 as being “white”. We might do that like this:

this_juror_is_black = a < 26
this_juror_is_black

np.False_

Numpy display of single Boolean values

You saw in Note 2.3 and Note 5.3 that Numpy has a particular way of displaying single values. Note 2.3 was about display of integer (int) values, and Note 5.3 was about display of floating point (float) values. The display above is Numpy’s version of the Boolean value False. As you’d expect, Numpy shows the single value True as np.True_.

This all works as we want it to, but it’s just a little bit difficult to remember the coding (less than 26 means “black”, greater than 25 means “white”). We had to use that coding because we committed ourselves to using random numbers to simulate the outcomes.

However, Python can also store bits of text, called strings. Values that are bits of text can be very useful because the text values can be memorable labels for the entities we are sampling from, in our simulations.

Before we get to strings, let us consider the different types of value we have seen so far.

7.3 Types of values in Python

You have already come across the idea that Python values can be integers (positive or negative whole numbers), like this:

v = 10
v

Here the variable v holds the value. We can see what type of value v holds by using the type function:

type(v)

<class 'int'>

Python can also have floating point values. These are values with a decimal point — numbers that do not have to be integers, but also can be any value between the integers. These floating point values are of type float:

f = 10.1
type(f)

<class 'float'>

7.3.1 Numpy arrays

You have also seen that Numpy contains another type, the array. An array is a value that contains a sequence of values. For example, here is an array of integers:

arr = np.array([0, 10, 99, 4])
arr

array([ 0, 10, 99,  4])

Notice that this value arr is of type np.ndarray:

type(arr)

<class 'numpy.ndarray'>

The array has its own internal record of what type of values it holds. This is called the array dtype:

arr.dtype

dtype('int64')

The array dtype records the type of value stored in the array. All values in the array must be of this type, and all values in the array are therefore of the same type.

The array above contains integers, but we can also make arrays containing floating point values:

float_arr = np.array([0.1, 10.1, 99.0, 4.3])
float_arr

array([ 0.1, 10.1, 99. ,  4.3])

float_arr.dtype

dtype('float64')

7.3.2 Lists

We have elided past another Python type, the list. In fact we have already used lists in making arrays. For example, here we make an array with four values:

np.array([0, 10, 99, 4])

array([ 0, 10, 99,  4])

We could also write the statement above in two steps:

my_list = [0, 10, 99, 4]
np.array(my_list)

array([ 0, 10, 99,  4])

In the first statement — my_list = [0, 10, 99, 4] — we construct a list — a container for the four values. Let’s look at the my_list value:

my_list

[0, 10, 99, 4]

Notice that we do not see array in the display — this is not an array but a list:

type(my_list)

<class 'list'>

A list is a basic Python type. We can construct it by using the square brackets notation that you see above; we start with [, then we put the values we want to go in the list, separated by commas, followed by ]. Here is another list:

# Creating another list.
list_2 = [5, 10, 20]

As you saw, we have been building arrays by building lists, and then passing the list to the np.array function, to create an array.

list_again = [100, 10, 0]
np.array(list_again)

array([100,  10,   0])

Of course, we can do this one line, as we have been doing up till now, by constructing the list inside the parentheses of the function. So, the following cell has just the same output as the cell above:

# Constructing the list inside the function brackets.
np.array([100, 10, 0])

array([100,  10,   0])

Lists are like arrays in that they are values that contain values, but they are unlike arrays in various ways — that we will not go into now. We often use lists to construct sequences into lists to turn them into arrays. For our purposes, and particularly for our calculations, arrays are much more useful and efficient than lists.

7.4 String values

So far, all the values you have seen in Python arrays have been numbers. Now we get on to values that are bits of text. These are called strings.

Here is a single Python string value:

s = "Resampling"
s

'Resampling'

What is the type of the new bit-of-text value s?

type(s)

<class 'str'>

The Python str value is a bit of text, and therefore consists of a sequence of characters.

As arrays are containers for other things, such as numbers, strings are containers for characters.

As we can find the number of elements in an array (Section 6.5), we can find the number of characters in a string with the len function:

# Number of characters in s
len(s)

As we can index into array values to get individual elements (Section 6.6), we can index into string values to get individual characters:

# Get the second character of the string
# Remember, Python's index positions start at 0.
second_char = s[1]
second_char

'e'

7.5 Strings in arrays

As we can store numbers as elements in arrays, we can also store strings as array elements.

# Just for clarity, make the list first.
# Lists can also contain strings.
list_of_strings = ['Julian', 'Lincoln', 'Simon']
# Then pass the list to np.array to make the array.
arr_of_strings = np.array(list_of_strings)
arr_of_strings

array(['Julian', 'Lincoln', 'Simon'], dtype='<U7')

# We can also create the list and the array in one line,
# as we have been doing up til now.
arr_of_strings = np.array(['Julian', 'Lincoln', 'Simon'])
arr_of_strings

array(['Julian', 'Lincoln', 'Simon'], dtype='<U7')

Notice the array dtype:

arr_of_strings.dtype

dtype('<U7')

The U in the dtype tells you that the elements in the array are Unicode strings (Unicode is a computer representation of text characters). The number after the U gives the maximum number of characters for any string in the array, here set to the length of the longest string when we created the array.

Take care with Numpy string arrays

It is easy to run into trouble with Numpy string arrays where the elements have a maximum length, as here. Remember, the dtype of the array tells you what type of element the array can hold. Here the dtype is telling you that the array can hold strings of maximum length 7 characters. Now imagine trying to put a longer string into the array — what do you think would happen?

This happens:

# An array of small strings.
small_strings = np.array(['six', 'one', 'two'])
small_strings.dtype

dtype('<U3')

# Set a new value for the first element (first string).
small_strings[0] = 'seven'
small_strings

array(['sev', 'one', 'two'], dtype='<U3')

Numpy truncates the new string to match the original maximum length.

For that reason, it is often useful to instruct Numpy that you want to use effectively infinite length strings, by specifying the array dtype as object when you make the array, like this:

# An array of small strings, but this time, tell Numpy
# that the strings should be of effectively infinite length.
small_strings_better = np.array(['six', 'one', 'two'], dtype=object)
small_strings_better

array(['six', 'one', 'two'], dtype=object)

Notice that the code uses a named function argument (Section 5.8), to specify to np.array that the array elements should be of type object. This type can store any Python value, and so, when the array is storing strings, it will use Python’s own string values as elements, rather than the more efficient but more fragile Unicode strings that Numpy uses by default.

# Set a new value for the first element in the new array.
small_strings_better[0] = 'seven'
small_strings_better

array(['seven', 'one', 'two'], dtype=object)

As for any array, you can select elements with indexing. When you select an element with a given position (index), you get the string at at that position:

# Julian Lincoln Simon's second name.
# (Remember, Python's positions start at 0).
middle_name = arr_of_strings[1]
middle_name

np.str_('Lincoln')

Numpy display of single string-type values

np.str_ in the output shows us that the result was a Numpy single value of string (str_) type.

Notice the output from this cell:

As for numbers, we can compare strings with, for example, the == operator, that asks whether the two strings are equal:

middle_name == 'Lincoln'

True

7.6 Repeating elements

Now let us go back to the problem of selecting black and white jurors.

We started with the strategy of using numbers 0 through 25 to mean “black” jurors, and 26 through 99 to mean “white” jurors. We selected values at random from 0 through 99, and then worked out whether the number meant a “black” juror (was less than 26) or a “white” juror (was greater than 25).

It would be good to use strings instead of numbers to identify the potential jurors. Then we would not have to remember our coding of 0 through 25 and 26 through 99.

If only there was a way to make an array of 100 strings, where 26 of the strings were “black” and 74 were “white”. Then we could select randomly from that array, and it would be immediately obvious that we had a “black” or “white” juror.

Luckily we can do that, by using the np.repeat function to construct the array.

np.repeat

np.repeat is a useful Numpy function that makes a new output array by repeating elements from the input arrays.

More specifically, np.repeat accepts two arguments (inputs), and sends back an array as output. The first input is an array or list or some other sequence for which you want to repeat elements. The second input tells np.repeat how many repeats to do, for each element.

This is most easy to see by examples.

Here we repeat all three elements in the first sequence (in this case a list) by a count (number) given in the second argument.

# Repeat each element in the first list (or array) 4 times.
to_repeat = [1, 2, 3]  # First input argument.
n_repeats = 4  # Second input argument.
np.repeat(to_repeat, n_repeats)

array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

Of course we could also write this without the variables:

# Repeat each element in the first list (or array) 4 times.
np.repeat([1, 2, 3], 4)

array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

The second argument (number of repeats) can also be a sequence, such as an array or list, with one element for each element of the first array. In that case, it gives the number of repeats for each element. We will use string elements for this example.

# Repeat the first element 3 times, the second 2 times, and the third 4 times.
# Second input is number of repeats for each element of the first argument.
np.repeat(['one', 'two', 'three'], [3, 2, 4])

array(['one', 'one', 'one', 'two', 'two', 'three', 'three', 'three',
       'three'], dtype='<U5')

Here we use np.repeat to make up our simulated jury pool.

# The values that we will repeat to fill up the larger array.
# Use a list to store the sequence of values.
juror_types = ['black', 'white']
# The number of times we want to repeat "black" and "white".
# Use a list to store the sequence of values.
repeat_nos = [26, 74]
# Repeat "black" 26 times and "white" 74 times.
# We have passed two lists here, but we could also have passed
# arrays - the Numpy repeat function converts the lists to arrays
# before it builds the repeats.
jury_pool = np.repeat(juror_types, repeat_nos)
# Show the result
jury_pool

array(['black', 'black', 'black', 'black', 'black', 'black', 'black',
       'black', 'black', 'black', 'black', 'black', 'black', 'black',
       'black', 'black', 'black', 'black', 'black', 'black', 'black',
       'black', 'black', 'black', 'black', 'black', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white'], dtype='<U5')

We can use this array of repeats of strings, to sample from. The result is easier to grasp, because we are using the string labels, instead of numbers:

# Select one juror at random from the black / white pool.
one_juror = rnd.choice(jury_pool)
one_juror

np.str_('white')

We can select our full jury of 12 jurors, and see the results in a more obvious form:

# Select 12 jurors at random from the black / white pool.
one_jury = rnd.choice(jury_pool, 12)
one_jury

array(['white', 'white', 'white', 'white', 'black', 'white', 'black',
       'white', 'white', 'black', 'black', 'white'], dtype='<U5')

Using the size argument to rnd.choice

In the code above, we have specified the size of the sample we want (12) with the second argument to rnd.choice. As you saw in Section 5.8, we can also give names to the function arguments, in this case, to make it clearer what we mean by “12” in the code above. In fact, from now on, that is what we will do; we will specify the size of our sample by using the name for the function argument to rnd.choice — size — like this:

# Select 12 jurors at random from the black / white pool.
# Specify the sample size using the "size" named argument.
one_jury = rnd.choice(jury_pool, size=12)
one_jury

array(['black', 'white', 'white', 'white', 'black', 'white', 'black',
       'white', 'white', 'white', 'white', 'white'], dtype='<U5')

We can use == on the array to get True values where the juror was “black” and False values otherwise:

are_black = one_jury == 'black'
are_black

array([ True, False, False, False,  True, False,  True, False, False,
       False, False, False])

Finally, we can np.sum to find the number of black jurors (Section 5.15):

# Number of black jurors in this simulated jury.
n_black = np.sum(are_black)
n_black

np.int64(3)

Putting that all together, this is our new procedure to select one jury and count the number of black jurors:

one_jury = rnd.choice(jury_pool, size=12)
are_black = one_jury == 'black'
n_black = np.sum(are_black)
n_black

np.int64(3)

Or we can be even more compact by putting several statements together into one line:

# The same as above, but on one line.
n_black = np.sum(rnd.choice(jury_pool, size=12) == 'black')
n_black

np.int64(1)

7.7 Resampling with and without replacement

Now let us return to the details of Robert Swain’s case, that you first saw in Chapter 6.

We looked at the composition of Robert Swain’s 12-person jury — but in fact, by law, that does not have to be representative of the eligible jurors. The 12-person jury is drawn from a jury panel, of 100 people, and this should, in turn, be drawn from the population of all eligible jurors in the county, consisting, at the time, of “all male citizens in the community over 21 who are reputed to be honest, intelligent men and are esteemed for their integrity, good character and sound judgment.” So, unless there was some bias against black jurors, we might expect the 100-person jury panel to be a plausibly random sample of the eligible jurors, of whom 26% were black. See the Supreme Court case judgement for details.

In fact, in Robert Swain’s trial, there were 8 black members in the 100-person jury panel. We will leave it to you to adapt the simulation from Chapter 6 to ask the question — is 8% surprising as a random sample from a population with 26% black people?

But we have a different question: given that 8 out of 100 of the jury panel were black, is it surprising that none of the 12-person jury were black? As usual, we can answer that question with simulation.

Let’s think about what a single simulated jury selection would look like.

First we compile a representation of the actual jury panel, using the tools we have used above.

juror_types = ['black', 'white']
# in fact there were 8 black jurors and 92 white jurors.
panel_nos = [8, 92]
jury_panel = np.repeat(juror_types, panel_nos)
# Show the result
jury_panel

array(['black', 'black', 'black', 'black', 'black', 'black', 'black',
       'black', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white', 'white', 'white',
       'white', 'white'], dtype='<U5')

Now consider taking a 12-person jury at random from this panel. We select the first juror at random, so that juror has an 8 out of 100 chance of being black. But when we select the second jury member, the situation has changed slightly. We can’t select the first juror again, so our panel is now 99 people. If our first juror was black, then the chances of selecting another black juror next are not 8 out of 100, but 7 out of 99 — a smaller chance. The problem is, as we shall see in more detail later, the chances of getting a black juror as the second, and third and fourth members of the jury depend on whether we selected a black juror as the first and second and third jury members. At its most extreme, imagine we had already selected eight jurors, and by some strange chance, all eight were black. Now our chances of selecting a black juror as the ninth juror are zero — there are no black jurors left to select from the panel.

In this case we are selecting jurors from the panel without replacement, meaning, that once we have selected a particular juror, we cannot select them again, and we do not put them back into the panel when we select our next juror.

This is the probability equivalent of the situation when you are dealing a hand of cards. Let’s say someone is dealing you, and you only, a hand of five cards. You get an ace as your first card. Your chances of getting an ace as your first card were just the number of aces in the deck divided by the number of cards — four in 52 – \(\frac{4}{52}\). But for your second card, the probability has changed, because there is one less ace remaining in the pack, and one less card, so your chances of getting an ace as your second card are now \(\frac{3}{51}\). This is sampling without replacement — in a normal game, you can’t get the same card twice. Of course, you could imagine getting a hand where you sampled with replacement. In that case, you’d get a card, you’d write down what it was, and you’d give the card back to the dealer, who would replace the card in the deck, shuffle again, and give you another card.

As you can see, the chances change if you are sampling with or without replacement, and the kind of sampling you do, will dictate how you model your chances in your simulations.

Because this distinction is so common, and so important, the machinery you have already seen in rnd.choice has simple ways for you to select your sampling type. You have already seen sampling with replacement, and it looks like this:

# Take a sample of 12 jurors from the panel *with replacement*
# With replacement is the default for `rnd.choice`.
strange_jury = rnd.choice(jury_panel, size=12)
strange_jury

array(['white', 'white', 'white', 'black', 'white', 'white', 'white',
       'white', 'white', 'white', 'white', 'white'], dtype='<U5')

This is a strange jury, because it can select any member of the jury pool more than once. Perhaps that juror would have to fill two (or more!) seats, or run quickly between them. But of course, that is not how juries are selected. They are selected without replacement:

# Take a sample of 12 jurors from the panel *without replacement*
ok_jury = rnd.choice(jury_panel, 12, replace=False)
ok_jury

array(['white', 'white', 'white', 'white', 'black', 'white', 'white',
       'white', 'white', 'white', 'white', 'white'], dtype='<U5')

Note 7.2: Comments at the end of lines

You have already seen comment lines. These are lines beginning with #, to signal to Python that the rest of the line is text for humans to read, but Python to ignore.

# This is a comment.  Python ignores this line.

You can also put comments at the end of code lines, by finishing the code part of the line, and then putting a #, followed by more text. Again, Python will ignore everything after the # as a text for humans, but not for Python.

print('Hello')  # This is a comment at the end of the line.

Hello

To finish the procedure for simulating a single jury selection, we count the number of black jurors:

n_black = np.sum(ok_jury == 'black')  # How many black jurors?
n_black

np.int64(1)

Now we have the procedure for one simulated trial, here is the procedure for 10000 simulated trials.

counts = np.zeros(10000)
for i in np.arange(10000):
    # Single trial procedure
    jury = rnd.choice(jury_panel, size=12, replace=False)
    n_black = np.sum(jury == 'black')  # How many black jurors?
    # Store the result
    counts[i] = n_black

# Number of juries with 0 black jurors.
zero_black = np.sum(counts == 0)
# Proportion
p_zero_black = zero_black / 10000
print(p_zero_black)

0.3421

We have found that, when there are only 8% black jurors in the jury panel, having no black jurors in the final jury happens about 34% of the time, even in this case, where the jury is selected completely at random from the jury panel.

We should look for the main source of bias in the initial selection of the jury panel, not in the selection of the jury from the panel.

With or without replacement for the original jury selection

You may have noticed in Chapter 6 that we were sampling Robert Swain’s jury from the eligible pool of jurors, with replacement. You might reasonably ask whether we should have selected from the eligible jurors without replacement, given that the same juror cannot serve more than once in the same jury, and therefore, the same argument applies there as here.

The trick there was that we were selecting from a very large pool of many thousand eligible jurors, of whom 26% were black. Let’s say there were 10,000 eligible jurors, of whom 2,600 were black. When selecting the first juror, there is exactly a 2,600 in 10,000 chance of getting a black juror — 26%. If we do get a black juror first, then the chance that the second juror will be black has changed slightly, 2,599 in 9,999. But these changes are very small; even if we select eleven black jurors out of eleven, when we come to the twelfth juror, we still have a 2,589 out of 9,989 chance of getting another black juror, and that works out at a 25.92% chance — hardly changed from the original 26%. So yes, you’d be right, we really should have compiled our population of 2,600 black jurors and 7,400 white jurors, and then sampled without replacement from that population, but as the resulting sample probabilities will be very similar to the simpler sampling with replacement, we chose to try and slide that one quietly past you, in the hope you would forgive us when you realized.

7.8 Conclusion

This chapter introduced you to the idea of strings — values in Python that store bits of text. Strings are very useful as labels for the entities we are sampling from, when we do our simulations. Strings are particularly useful when we use them with arrays, and one way we often do that is to build up arrays of strings to sample from, using the np.repeat function.

There is a fundamental distinction between two different types of sampling — sampling with replacement, where we draw an element from a larger pool, then put that element back before drawing again, and sampling without replacement, where we remove the element from the remaining pool when we draw it into the sample. As we will see later, it is often a judgment call which of these two types of sampling is a more reasonable model of the world you are trying to simulate.

End of notebook: Sampling tools

sampling_tools starts at Note 7.1.