# Plotting histograms

Find this notebook on the web at
<a class="quarto-xref" href="https://resampling-stats.github.io/latest-python/probability_theory_3.html#nte-on_histograms">Note <span>12.6</span></a>.

A histogram is a visual way to show the *distribution* of a sequence of
values.

We now enter the world of *plotting* in Python. Just as Numpy is a Python
library for working with arrays, Matplotlib is a Python library for making
and showing plots.

To use the Numpy library, we `import` it. As you have seen, the usual
convention is to make the standard `numpy` library name easier to read
and type, by renaming the library to `np` on `import`, like this:

In [None]:
# Import numpy library and rename to "np"
import numpy as np

In a similar way, we need to import the Matplotlib library. In fact we
will be using a particular part of the Matplotlib library, called
`pyplot`.

We use the following standard convention to import the `pyplot` part of
the Matplotlib library and give it the shorter name of `plt`:

In [None]:
import matplotlib.pyplot as plt

<div __quarto_custom="true" __quarto_custom_context="Block" __quarto_custom_id="45" __quarto_custom_type="Callout">
<div __quarto_custom_scaffold="true">

Modules and submodules

</div>
<div __quarto_custom_scaffold="true">

We have been calling Numpy and Matplotlib *libraries*, but technically,
Python calls these *modules*. Modules are collections of code and data
that you can `import` into Python. For example, Numpy (now renamed as
`np`) is a module:</div></div>

In [None]:
# Show type for the imported Numpy module (renamed as "np").
type(np)

We can get elements contained in (attached to) a module using the `.`
syntax. For example, here we get the value of the `pi` variable,
attached to the Numpy module.

In [None]:
# Get and show the value of the variable "pi" attached to (contained within)
# the Numpy module.
np.pi

One type of thing a module can contain is other modules. These
modules-attached-to-modules are called *submodules*. Perhaps without
knowing, you have already used the `random` submodule attached to the
Numpy module:

In [None]:
# "random" is itself a module, attached to (contained within) the Numpy
# module.  It is therefore a "submodule" of Numpy.
type(np.random)

We used the `default_rng` function from the `random` submodule to create
random number generators:

In [None]:
rng = np.random.default_rng()

`pyplot` is a submodule of Matplotlib.

In [None]:
# Reimport the module to remind ourselves of the import line.
import matplotlib.pyplot as plt
# plt is a new name we have set for the "pyplot" submodule of Matplotlib.
type(plt)

The `pyplot` submodule of Matplotlib has many useful functions for
making and displaying plots.
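
As a quick check of the ideas above, the sketch below confirms that `np`, `np.random` and `plt` are all modules, and that `hist` is a function attached to `plt`. The `types` module and the `callable` function are standard Python; using them here is our own illustration, and nothing else in this notebook depends on them.

```python
import types

import numpy as np
import matplotlib.pyplot as plt

# Modules and submodules all have the same type: "module".
print(isinstance(np, types.ModuleType))
print(isinstance(np.random, types.ModuleType))
print(isinstance(plt, types.ModuleType))

# The full name of the module that "plt" refers to.
print(plt.__name__)

# "hist" is a function attached to the pyplot submodule.
print(callable(plt.hist))
```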





The easiest way to explain histograms is to show one.

Let’s start with a sequence of values we are interested in:

Here are the 26 values for whiskey prices in states that did not have a
liquor monopoly (`priv`).

In [None]:
priv = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29])

These are the 16 values for states with a liquor monopoly (`govt`):

In [None]:
govt = np.array([
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

We concatenate these values to get a sequence (an array) of all 42
liquor prices:

In [None]:
prices = np.concatenate([priv, govt])
prices
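
Before plotting, it can be worth checking that each array has the number of values we expect. Here is a minimal sketch, repeating the arrays from the cells above so that it runs on its own:

```python
import numpy as np

priv = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29])
govt = np.array([
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])
prices = np.concatenate([priv, govt])

# Number of values in each array.
print(len(priv))    # 26
print(len(govt))    # 16
print(len(prices))  # 42
```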

We are interested in the distribution of these 42 values. To show the
distribution, we can make and show a histogram of these prices, using
the `hist` function attached to the `plt` submodule.

In [None]:
hist_res = plt.hist(prices)

`plt.hist` has calculated an array of suitable intervals (*bins*) to
divide up the range of values, and then counted how many values in
`prices` fall into each interval (bin).

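You will notice that `plt.hist` has sent back some results from the
process of making the histogram. In fact, the results are in the form of
a list-like sequence of three elements (strictly, a Python tuple, which
we can index in the same way as a list).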

The first result of interest to us is the definition of the intervals
(bins) into which the histogram has divided the range of `prices`
values.

In fact, `plt.hist` sent back the edges of these bins in the second
element of `hist_res`:

In [None]:
# The second element in the results list is the array of bin edges.
bin_edges = hist_res[1]
bin_edges

Think of this array as the 10 values that start each of the 10 bins,
followed by a final value that ends the final bin.

This means that the first bin was from (including) 3.8 up to, but not
including, 3.95; the second bin was from (including) 3.95 up to, but not
including, 4.1; and so on. The last bin is from (including) 5.15 through
(including) 5.3. Notice there are 11 edges, forming 10 bins.

Put another way, the edges that `plt.hist` sent back are the 10 left
hand (inclusive) edges of the 10 bins, and a final right hand
(inclusive) edge of the final (10<sup>th</sup>) bin.
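
To see where these edges come from, here is a sketch that builds 11 equally spaced edges by hand, from the smallest to the largest value in `prices`, and compares them with the edges from `np.histogram`. We use `np.histogram` because, according to the Matplotlib documentation, `plt.hist` uses it to do the binning. The use of `np.linspace` and the name `edges_by_hand` are our own choices for this illustration.

```python
import numpy as np

prices = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29,
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

# 10 equal-width bins need 11 edges, from the minimum to the maximum value.
edges_by_hand = np.linspace(prices.min(), prices.max(), 11)
print(edges_by_hand)

# np.histogram returns the counts and the edges, without making a plot.
counts, edges = np.histogram(prices)

# Are the two sets of edges the same (allowing for tiny rounding differences)?
print(np.allclose(edges_by_hand, edges))
```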

The first element that comes back in the list of results is the array of
counts of the values in `prices` that fall within each bin.

In [None]:
# The first element in the results list is the counts of values falling into
# each bin.
counts = hist_res[0]
counts

The values tell us that 1 value from `prices` fell in the range 3.8 up
to (not including) 3.95 (was within the first bin), 2 values fell in
the range 3.95 up to (not including) 4.1, and so on.

The counts correspond to the heights of the bars on the histogram,
so the first bar has height 1, the second bar has height 2, and so on.
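
We can check counts like these by counting directly. The sketch below uses `np.histogram` (which gives the counts and edges without drawing a plot, and which we assume matches the binning in `plt.hist`) and Boolean comparisons to count the values in the first and last bins. The variable names are ours.

```python
import numpy as np

prices = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29,
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

counts, edges = np.histogram(prices)

# First bin: from (including) the first edge up to (not including) the second.
in_first = (prices >= edges[0]) & (prices < edges[1])
print(np.sum(in_first), counts[0])

# Last bin: includes both its left edge and its right edge.
in_last = (prices >= edges[-2]) & (prices <= edges[-1])
print(np.sum(in_last), counts[-1])

# Each value falls in exactly one bin, so the counts add up to the
# number of values.
print(counts.sum() == len(prices))
```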

By default, `plt.hist` assumes you want 10 bins, and uses its default
method of calculation to work out the edges for those 10 bins. You can
specify another number of bins, by sending a number to the `bins`
argument of `plt.hist`. For example, you might want 20 bins:

In [None]:
results_20 = plt.hist(prices, bins=20)

We now have 21 new edge values, the first 20 values giving the
(inclusive) left-hand edges, and the last giving the (inclusive) right
hand edge of the last bin.

In [None]:
bin_edges_20 = results_20[1]
bin_edges_20
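
Doubling the number of bins from 10 to 20, over the same range of values, halves the width of each bin. Here is a small sketch, again using `np.histogram` to get the edges without drawing a plot, on the assumption that it gives the same edges as `plt.hist`:

```python
import numpy as np

prices = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29,
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

counts_20, edges_20 = np.histogram(prices, bins=20)
print(len(counts_20))  # 20
print(len(edges_20))   # 21

# np.diff gives the differences between neighboring edges: the bin widths.
widths_20 = np.diff(edges_20)
print(widths_20)
```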

We can also specify our own edges, in order to bypass `plt.hist`'s
default algorithm to calculate edges. For example, we might prefer 15
bins of width 0.1, starting at 3.8, giving 16 edges like this:

In [None]:
our_edges = 3.8 + np.arange(16) * 0.1
our_edges

We can send these directly to `plt.hist` to set the edges:

In [None]:
results_16 = plt.hist(prices, bins=our_edges)
# Show the edges that come back (these are the edges we sent).
results_16[1]
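
When building edges by hand, it is worth checking how many bins they define: `n` edges make `n - 1` bins. The sketch below also shows `np.linspace` as an alternative way of building the same edges. `np.linspace` sets the first and last edge exactly, which avoids the small floating point differences that can arise from multiplying and adding. The name `edges_linspace` is ours, and we again use `np.histogram` as a stand-in for the binning that `plt.hist` does.

```python
import numpy as np

prices = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29,
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

our_edges = 3.8 + np.arange(16) * 0.1
# Number of edges, and the number of bins they define.
print(len(our_edges), len(our_edges) - 1)

# An alternative: 16 equally spaced edges from 3.8 through 5.3.
edges_linspace = np.linspace(3.8, 5.3, 16)
print(np.allclose(our_edges, edges_linspace))

counts_15, edges_back = np.histogram(prices, bins=edges_linspace)
# The edges that come back are the edges we sent.
print(np.array_equal(edges_back, edges_linspace))
# One count for each bin.
print(len(counts_15))
```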

If you are running the notebook interactively in Jupyter, running
`plt.hist` on its own, as below, will show the values as the result of
the cell, along with the plot. (You won’t see these results displayed in
the textbook, because we use different software to show outputs when we
build the textbook.)

In [None]:
# If we don't collect the results, Jupyter shows them to us,
# if this is the last expression in the cell.
# (You won't see the results displayed in the textbook).
plt.hist(prices)

Interactive Jupyter will display the returned list of results, because
we have not collected the results by assigning them to a variable. More
technically, on its own, the `plt.hist` line is an *expression* (code
that results in a value), and Jupyter will, by default, display the
results of an expression that ends the code in a cell.

Here we see that the result of the `plt.hist(prices)` expression is a
list with three elements. As you saw before, the first element is the
array with the counts for each of the (by default) 10 bins. The second
is the array with the bin edges (10 left edges and last right edge). The
last is a reference to the values that make up the graphical display;
you can use this last value to do some advanced configuration of the
histogram display, but we won’t cover that further in this book.
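
Because the result has three elements, we can also unpack it into three named variables in one step, instead of indexing. Here is a minimal sketch, repeating the `prices` values so that it runs on its own; the variable names are our choice:

```python
import numpy as np
import matplotlib.pyplot as plt

prices = np.array([
    4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54, 4.75,
    4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75, 5.20, 5.10,
    4.80, 4.29,
    4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74, 4.50,
    4.10, 4.00, 5.05, 4.20])

# Unpack the three results into separate variables.
counts, bin_edges, patches = plt.hist(prices)

print(len(counts))     # 10
print(len(bin_edges))  # 11
# One graphical element (bar) for each bin.
print(len(patches))    # 10
```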

It can be distracting to see a display of the results list from a
plotting cell, so from now on we will suppress Jupyter’s default
behavior of displaying the results list from `plt.hist`, by adding a
semi-colon at the end of the code line, as in the cell below. (Remember,
in the textbook, but not in Jupyter, this will give the same result as
`plt.hist(prices)` above, because of the display system we use for the
textbook.)

In [None]:
plt.hist(prices);  # Note the semi-colon

The semi-colon is a standard indicator to Jupyter that it should not
display the results that came back from the function call. We will use
it to suppress the display of various values that come back from these
functions, as they are usually not of immediate interest.

