Now that you have some experience with R, probabilities, and resampling, it is time to introduce some useful tools for our experiments and programs.
Thus far we have used numbers such as 1 and 0 and 10 to represent the elements we are sampling from. For example, in Chapter 6, we were simulating the chance of a particular juror being black, given that 26% of the eligible jurors in the county were black. We used integers for that task, where we started with all the integers from 0 through 99, and asked R to select values at random from those integers. When R selected an integer from 0 through 25, we chose to label the resulting simulated juror as black — there are 26 integers in the range 0 through 25, so there is a 26% chance that any one integer will be in that range. If the integer was from 26 through 99, the simulated juror was white (there are 74 integers in the range 26 through 99).
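As a quick check on that coding, here is a minimal sketch that counts how many of the integers 0 through 99 fall below 26. The variable names are ours, chosen for illustration.

```r
# All the integers we sample from.
all_numbers <- 0:99
# TRUE where the integer would be labeled "black" (0 through 25).
codes_for_black <- all_numbers < 26
# Number of integers coded as "black".
n_black_codes <- sum(codes_for_black)
# Proportion of integers coded as "black".
p_black <- mean(codes_for_black)
# Show the count and the proportion.
n_black_codes
p_black
```

The count is 26 and the proportion is 0.26, matching the 26% we want to simulate.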
Here is the process of simulating a single juror, adapted from Section 6.3.3:
# Get 1 random number from 0 through 99
# replace=TRUE is redundant here (why?), but we leave it for consistency.
a <- sample(0:99, 1, replace=TRUE)
# Show the result
a
[1] 44
After that, we have to unpack our labeling of 0 through 25 as being “black” and 26 through 99 as being “white”. We might do that like this:
this_juror_is_black <- a < 26
this_juror_is_black
[1] FALSE
This all works as we want it to, but it’s just a little bit difficult to remember the coding (less than 26 means “black”, greater than 25 means “white”). We had to use that coding because we committed ourselves to using random numbers to simulate the outcomes.
However, R can also store bits of text, called strings. Values that are bits of text can be very useful because the text values can be memorable labels for the entities we are sampling from, in our simulations.
Before we get to strings, let us consider the type of the values we have seen so far.
So far, all the values we have seen in R are numeric — integers or floating point values. This is an integer value:
v <- 10
v
[1] 10
Here the variable v holds the value. We can see what type of value v holds by using the class function:
class(v)
[1] "numeric"
The value contained by the variable v is of 'numeric' type (class). This is the type of value that can store either integer values (positive or negative whole numbers) or floating point values (values that can have digits after a decimal point). Here’s a floating point value:
f <- 10.1
class(f)
[1] "numeric"
Notice that R also sees this as a "numeric" type of value. However, we are about to see that R values can be of other types that are not numeric.
So far, all the values you have seen in R vectors have been numbers. Now we get on to values that are bits of text. These are called strings.
Here is a single R string value:
s <- "Resampling"
s
[1] "Resampling"
What is the class of the new bit-of-text value s?
class(s)
[1] "character"
The R character value is a bit of text, and therefore consists of a sequence of characters.
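As well as asking for the class of a value, we can ask R directly whether a value is of a particular type. This short sketch uses the base R functions is.numeric and is.character on the two kinds of value we have seen so far.

```r
# A numeric value and a string value.
v <- 10
s <- "Resampling"
# Is each value numeric?
v_is_numeric <- is.numeric(v)
s_is_numeric <- is.numeric(s)
# Is the string a character value?
s_is_character <- is.character(s)
# Show the three answers.
c(v_is_numeric, s_is_numeric, s_is_character)
```

The number is numeric, the string is not numeric, and the string is a character value.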
As vectors are containers for other things, such as numbers, strings are containers for characters.
To get the length of a string, use the nchar function (Number of Characters):
# Number of characters in s
nchar(s)
[1] 10
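It is easy to confuse nchar with the length function that we use for vectors. As a small sketch of the difference: in R, a single string is itself a vector with one element, so length gives 1, while nchar counts the characters inside the string.

```r
s <- "Resampling"
# Number of characters inside the string.
n_chars <- nchar(s)
# Number of elements in the vector that holds the string.
n_elements <- length(s)
# Show both values.
n_chars
n_elements
```

Here n_chars is 10 and n_elements is 1.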
R has a substring function that allows you to select individual characters or sequences of characters from a string. The arguments to substring are: first, the string; second, the index of the first character you want to select; and third, the index of the last character you want to select. For example, to select the second character in the string you would specify 2 as the starting index, and 2 as the ending index, like this:
# Get the second character of the string
second_char <- substring(s, 2, 2)
second_char
[1] "e"
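The same function selects a sequence of characters when the starting and ending indices differ. Here is a brief sketch; the variable names first_three and middle_part are ours, for illustration.

```r
s <- "Resampling"
# Characters 1 through 3.
first_three <- substring(s, 1, 3)
# Characters 3 through 6.
middle_part <- substring(s, 3, 6)
# Show the two pieces of the string.
first_three
middle_part
```

The first call gives "Res" and the second gives "samp".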
As we can store numbers as elements in vectors, we can also store strings as vector elements.
vector_of_strings <- c('Julian', 'Lincoln', 'Simon')
vector_of_strings
[1] "Julian" "Lincoln" "Simon"
As for any vector, you can select elements with indexing. When you select an element with a given position (index), you get the string at that position. Notice the output from this chunk:
# Julian Lincoln Simon's second name
middle_name <- vector_of_strings[2]
middle_name
[1] "Lincoln"
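Indexing a vector of strings works with more than one position at a time, just as it does for a vector of numbers. As a small sketch, we can select the first and third elements together; the name first_and_last is ours.

```r
vector_of_strings <- c('Julian', 'Lincoln', 'Simon')
# Select the elements at positions 1 and 3.
first_and_last <- vector_of_strings[c(1, 3)]
first_and_last
```

The result is a vector of two strings, "Julian" and "Simon".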
As for numbers, we can compare strings with, for example, the == operator, which asks whether the two strings are equal:
middle_name == 'Lincoln'
[1] TRUE
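The comparison also works on a whole vector of strings at once, giving one TRUE or FALSE value for each element. The sketch below shows this, and also that the comparison is sensitive to upper and lower case. The names is_lincoln and lower_matches are ours.

```r
vector_of_strings <- c('Julian', 'Lincoln', 'Simon')
# Compare every element to 'Lincoln'.
is_lincoln <- vector_of_strings == 'Lincoln'
is_lincoln
# Case matters: 'lincoln' is not the same string as 'Lincoln'.
lower_matches <- 'lincoln' == 'Lincoln'
lower_matches
```

Only the second element matches, so is_lincoln is FALSE, TRUE, FALSE, and lower_matches is FALSE.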
Now let us go back to the problem of selecting black and white jurors.
We started with the strategy of using numbers 0 through 25 to mean “black” jurors, and 26 through 99 to mean “white” jurors. We selected values at random from 0 through 99, and then worked out whether the number meant a “black” juror (was less than 26) or a “white” juror (was greater than 25).
It would be good to use strings instead of numbers to identify the potential jurors. Then we would not have to remember our coding of 0 through 25 and 26 through 99.
If only there was a way to make a vector of 100 strings, where 26 of the strings were “black” and 74 were “white”. Then we could select randomly from that vector, and it would be immediately obvious whether we had a “black” or “white” juror.
Luckily we can do that, by using the rep function to construct the vector.
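One way this might look is sketched below. We assume the form of rep that takes a vector of values and a matching vector of repeat counts; the names juror_types and this_juror are ours, for illustration.

```r
# 26 copies of 'black' followed by 74 copies of 'white'.
juror_types <- rep(c('black', 'white'), c(26, 74))
# Check the vector has 100 elements, 26 of them 'black'.
length(juror_types)
sum(juror_types == 'black')
# Select one simulated juror at random.
this_juror <- sample(juror_types, 1)
this_juror
```

The result of the last line is a string, either "black" or "white", so there is no numeric coding to remember.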
You may have noticed in Chapter 6 that we were sampling Robert Swain’s jury from the eligible pool of jurors, with replacement. You might reasonably ask whether we should have selected from the eligible jurors without replacement, given that the same juror cannot serve more than once in the same jury, and therefore, the same argument applies there as here.
The trick there was that we were selecting from a very large pool of many thousand eligible jurors, of whom 26% were black. Let’s say there were 10,000 eligible jurors, of whom 2,600 were black. When selecting the first juror, there is exactly a 2,600 in 10,000 chance of getting a black juror — 26%. If we do get a black juror first, then the chance that the second juror will be black has changed slightly, 2,599 in 9,999. But these changes are very small; even if we select eleven black jurors out of eleven, when we come to the twelfth juror, we still have a 2,589 out of 9,989 chance of getting another black juror, and that works out at a 25.92% chance — hardly changed from the original 26%. So yes, you’d be right, we really should have compiled our population of 2,600 black jurors and 7,400 white jurors, and then sampled without replacement from that population, but as the resulting sample probabilities will be very similar to the simpler sampling with replacement, we chose to try and slide that one quietly past you, in the hope you would forgive us when you realized.
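We can check that arithmetic in R. This is a sketch using the illustrative pool of 10,000 eligible jurors from the paragraph above, not the real number of eligible jurors; the variable names are ours.

```r
# Illustrative pool: 10,000 eligible jurors, 2,600 of them black.
n_eligible <- 10000
n_black <- 2600
# Chance the first juror is black.
p_first <- n_black / n_eligible
# Chance the twelfth juror is black, if the first eleven were all black.
p_twelfth <- (n_black - 11) / (n_eligible - 11)
# Show the twelfth-juror chance as a percentage, to 2 decimal places.
p_twelfth_percent <- round(p_twelfth * 100, 2)
p_twelfth_percent
# Chance of twelve black jurors in a row, with and without replacement.
p_all_with <- 0.26 ^ 12
p_all_without <- prod((n_black - 0:11) / (n_eligible - 0:11))
```

The twelfth-juror chance comes out at 25.92%, as in the text. The chance of twelve black jurors in a row is a little smaller without replacement than with replacement, because each successive chance is at most 26%.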
This chapter introduced you to the idea of strings: values in R that store bits of text. Strings are very useful as labels for the entities we are sampling from, when we do our simulations. Strings are particularly useful when we use them with vectors, and one way we often do that is to build up vectors of strings to sample from, using the rep function.
There is a fundamental distinction between two different types of sampling — sampling with replacement, where we draw an element from a larger pool, then put that element back before drawing again, and sampling without replacement, where we remove the element from the remaining pool when we draw it into the sample. As we will see later, it is often a judgment call which of these two types of sampling is a more reasonable model of the world you are trying to simulate.
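The difference between the two types of sampling shows up directly in the replace argument to sample. Here is a minimal sketch with a small made-up pool of five labels; pool and the two result names are ours.

```r
# A small pool of five labeled elements.
pool <- c('A', 'B', 'C', 'D', 'E')
# Without replacement: each element can appear at most once.
without_replacement <- sample(pool, 5, replace=FALSE)
# With replacement: the same element can appear more than once.
with_replacement <- sample(pool, 5, replace=TRUE)
without_replacement
with_replacement
```

Drawing all five elements without replacement always gives the five labels in some shuffled order. Drawing five with replacement will usually repeat some labels and miss others.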
sampling_tools starts at Note 7.1.