Preface to the third edition

The book in your hands, or on your screen, is the third edition of a book originally called “Resampling: the new statistics”, by Julian Lincoln Simon (1992).

One of the pleasures of writing a new edition of a work by another author, is that we can praise the previous version of our own book. We will do that, in the next section. Next we talk about the resampling methods in this book, and their place at the heart of “data science”. We then discuss what we have changed, what we haven’t, and why. Finally, we make some suggestions about where this book could fit into your learning and teaching.

What Simon saw

Simon gives the early history of this book in the original preface. He starts with the following observation:

In the mid-1960’s, I noticed that most graduate students — among them many who had had several advanced courses in statistics — were unable to apply statistical methods correctly…

Simon then applied his striking capacity for independent thought to the problem — and came to two essential conclusions.

The first was that introductory courses in statistics use far too much mathematics. Most students cannot follow along and quickly get lost, reducing the subject to — as Simon puts it — “mumbo-jumbo”.

On its own, this was not a new realization. Simon quotes a classic textbook by Wallis and Roberts (1956), to the effect that teaching statistics through mathematics is like teaching philosophy in ancient Greek. More recently, other teachers of statistics have come to the same conclusion. Cobb (2007) argues that it is not practical to teach students the level of mathematics they would need to understand standard introductory courses. As you will see below, Cobb also agrees with Simon about the solution.

Simon’s great contribution was to see how we can replace the mathematics, to better reveal the true heart of statistical thinking. His starting point appears in the original preface: “Beneath the logic of a statistical inference there necessarily lies a physical process”. Drawing conclusions from noisy data means building a model of the noisy world, and seeing how that model behaves. That model can be physical, where we generate the noisiness of the world using physical devices like dice and spinners and coin-tosses. Simon used exactly these kinds of devices in his first experiments in teaching (Simon 1969). He then saw that it was much more efficient to build these models with simple computer code, and the result was the first and second editions of this book, with their associated software, the Resampling Stats language.

Simon’s second conclusion follows from the first. Now he had found a path round the unnecessary barrier of mathematics, he had got to the heart of what is interesting and difficult in statistics. Drawing conclusions from noisy data involves a lot of hard, clear thinking. We should be honest with our students about that; statistics is hard, not because it is obscure (it need not be), but because it deals with difficult problems. It is exactly that hard logical thinking that can make statistics so interesting to our best students; “statistics” is just reasoning about the world when the world is noisy. Simon writes eloquently about this in a section in the introduction — “Why is statistics such a difficult subject” (1.6 Why is Statistics Such a Difficult Subject?).

We need both of Simon’s conclusions to make progress. We cannot hope to teach two hard subjects at the same time; mathematics, and statistical reasoning. He replaced the mathematics with something that is much easier for most of us to reason about. By doing that, he can concentrate on the real, interesting problem — the hard thinking about data, and the world it comes from. To quote from a later section in this book (2.4 How resampling differs from the conventional approach): “Once we get rid of the formulas and tables, we can see that statistics is a matter of clear thinking, not fancy mathematics.” Instead of asking “where would I look up the right recipe for this?”, you find yourself asking “what kind of world do these data come from?” and “How can I reason about that world?”. Like Simon, we have found that this way of thinking and teaching brings rich rewards — for insight and practice. We hope and believe that you will find the same.

Resampling and data science

The ideas in Simon’s book, first published in 1992, have found themselves at the center of the modern movement of data science.

In the section above, we described Simon’s path in discovering physical models as a way of teaching and explaining statistical tests. He saw that code was the right way to express these physical models, and therefore, to build and explain statistical tests.

Meanwhile, the wider world of data analysis has been coming to the same conclusion, but from the opposite direction. Simon saw the power of resampling for explanation, and then that code was the right way to express these explanations. The data science movement discovered first that code was essential for data analysis, and then that code was the right way to explain statistics.

The modern use of the phrase “data science” comes from the technology industry. From around 2007, companies such as LinkedIn and Facebook began to notice that there was a new type of data analyst that was much more effective than their predecessors. They came to call these analysts “data scientists”, because they had learned how to deal with large and difficult data while working in scientific fields such as ecology, biology, or astrophysics. They had done this by learning to use code:

Data scientists’ most basic, universal skill is the ability to write code. (Davenport and Patil 2012)

Further reflection (Donoho 2017) suggested that something deep was going on: that data science was the expression of a radical change in the way we analyze data, in academia, and in industry. At the center of this change — was code. Code is the language that allows us to tell the computer what it should do with data; it is the native language of data analysis.

This insight transforms the way with think of code. In the past, we have thought of code as a separate, specialized skill, that some of us learn. We take coding courses — we “learn to code”. But if we us code as the fundamental language for analyzing data, then we need code to express what data analysis does, and explain how it works. Here we “code to learn”. Code is not an aim in itself, but a language we can use to express the simple ideas behind data analysis and statistics.

Thus the data science movement started from code as the foundation for data analysis, to using code to explain statistics. It ends at the same place as this book, from the other side of the problem.

The growth of data science is the inevitable result of taking computing seriously in education and research. We have already cited Cobb (2007) on the impossibility of teaching the mathematics students would need in order to understand traditional statistics courses. He goes on to explain why there is so much mathematics, and why we should remove it. In the age before ubiquitous computing, we needed mathematics to simplify calculations that we could not practically do by hand. Now we have great computing power in our phones and laptops, we do not have this constraint, and we can use simpler ideas from resampling methods to solve the same problems. As Simon shows, these are much easier to describe and understand. Data science, and teaching with resampling, are the obvious consequences of ubiquitous computing.

What we changed

This diversion, through data science, leads us to the changes that we have made for the new edition. The previous edition of this book is still excellent, and you can read it freely at http://www.resample.com/intro-text-online. It continues to be ahead of its time, and ahead of our time. Its one major drawback is that Simon bases much of the book around code written in a special language that he developed with Dan Weidenfeld, called Resampling Stats¹. The Resampling Stats language is well designed for expressing the steps in simulating worlds that include elements of randomness, and it was a useful contribution at the time that it was written. Since then, and particularly in the last decade, there have been many improvements in more powerful and general languages, such as Python and R. These languages are particularly suitable for beginners in data analysis, and they come with a huge range of tools and libraries for many tasks in data analysis, including the kinds of models and simulations you will see in this book. We have updated the book to use Python, instead of Resampling Stats. If you already know Python or a similar language, such as R, you will have a big head start in reading this book, but even if you do not, we have written the book so it will be possible to pick up the Python code that you need to understand and build the kind of models that Simon uses. The advantage to us, your authors, is that we can use the very powerful tools associated with Python to make it easier to run and explain the code. The advantage to you, our readers, is that you can also learn these tools, and the Python language. They will serve you well for the rest of your career in data analysis.

Our second and minor change is that we have added some content that Simon specifically left out. Simon knew that his approach was radical for its time, and designed his book as a commentary, correction, and addition to traditional courses in statistics. He assumes some familiarity with the older world of normal distributions, standard deviations, and correlation. We want this book to be useful to the true beginner, so we have added some explanation of standard deviation, standard scores and the correlation coefficient. We have also updated some of the examples.

In this third edition, we have deliberately been light in our edits, to preserve the fresh and creative flavor of Simon’s book, as he worked through the landscape of traditional statistics with his radical eye.

The third edition is the director’s cut, where Simon is the director

The largest change for this edition is to update the code sections to use Python. We intend this edition to be as close as possible to the book that Simon intended, but updated to use modern tools and a standard, widely-used programming language. Read this edition as our service to Simon for his visionary work — this is Simon’s book, and, for this edition, we (Matthew and Stéfan) intend to serve as his editors and interpreters. We release this edition so you can see Simon’s ideas updated to current technology.

Who should read this book, and when

As you have seen in the previous sections, this book uses a radical approach to explaining statistical inference — the science of drawing conclusions from noisy data. This approach is quickly becoming the standard in teaching of data science, partly because it is so much easier to explain, and partly because of the increasing role of code in data analysis.

Our book teaches the basics of using the Python language, some ideas in probability, statistical inference through simulation and resampling, confidence intervals, and principles of Bayesian reasoning, all through the use of model building in simple code.

Statistical inference is an important part of research methods for many subjects; so much so, that research methods courses may even be called “statistics” courses, or include “statistics” components. This book covers the basic ideas behind statistical inference, and how you can apply these ideas to draw practical statistical conclusions. We recommend it to you as an introduction to statistics. If you are a teacher, we suggest you consider this book as a primary text for first statistics courses. We hope you will find, as we have, that this method of explaining through building is much more productive and satisfying than the traditional method of trying to convey some “intuitive” understanding of fairly complicated mathematics. We hope you will see the relationship of these resampling techniques to traditional methods. Even if you do need to teach your students t-tests, and analysis of variance, we hope you will share our experience that this way of explaining the underlying ideas is much more compelling than the traditional approach.

Simon wrote this book for students and teachers who were interested to discover a radical new method of explanation in statistics and probability. The book will still work well for that purpose. If you have done a statistics course, but you kept feeling that you did not really understand it, or there was something fundamental missing that you could not put your finger on — well done for sensing the problem! — then, please, read this book. There is a good chance that it will give you deeper understanding, and reveal the logic behind the often arcane formulations of traditional statistics.

Our book is only part of a data science course. There are several important aspects to data science. A data science course needs all the elements we list above, but it should also cover the process of reading, cleaning, and reorganizing data using Python, or another language, such as R. It may also discuss the problems of experimental design, and cover prediction techniques, such as classification with machine learning, as well as data exploration with plots, tables, and summary measures. We do not cover those here. If you are teaching a full data science course, we suggest that you use this book as your first text, as an introduction to code, and statistical inference, and then add some of the many excellent resources on these other aspects of data science that assume some knowledge of statistics and programming.

The book as a public resource

Simon was passionate about this approach to teaching, as you can see from his preface to the second edition, and from his generosity in publishing the second edition on the web. We share his passion, and we have released the third edition in the same way. Technology and technical culture have evolved since the second edition, and we can now give you the tools to edit and improve this book, for the benefit of all our readers. If you see an error in the book, or you have thought of a better way of explaining something, please send us a fix or an edit. See Appendix E — Errors and suggestions for the procedure, and accept our thanks in advance.

Welcome to resampling

We hope you will agree that Simon’s insights for understanding and explaining are — really extraordinary. We are catching up slowly. If you are like us, your humble authors, you will find that Simon has succeeded in explaining what statistics is, and exactly how it works, to anyone with the patience to work through the examples, and think hard about the problems. If you have that patience, the rewards are great. Not only will you understand statistics down to its deepest foundations, but you will be able to think of your own tests, for your own problems, and have the tools to implement them yourself.

Matthew Brett

Stéfan van der Walt

If you are interested, https://statistics101.sourceforge.io has a free modern version of the original Resampling Stats language.↩︎