Statistics: A Very Short Introduction
David J. Hand
Modern statistics is very different from the dry and dusty discipline of the popular imagination. In its place is an exciting subject which uses deep theory and powerful software tools to shed light and enable understanding. And it sheds this light on all aspects of our lives, enabling astronomers to explore the origins of the universe, archaeologists to investigate ancient civilisations, governments to understand how to benefit and improve society, and businesses to learn how best to provide goods and services.
Aimed at readers with no prior mathematical knowledge, this Very Short Introduction explores and explains how statistics work, and how we can decipher them.
• Reveals the power of statistics as an essential tool for understanding modern life
• Shows how rapid advances in computers and number-crunching software have revolutionised the discipline
• Looks at many real-world examples, from the Challenger space-shuttle disaster, to the spread of modern epidemics, governmental elections, and business and finance
• Accessibly written: explaining fascinating concepts while assuming no prior mathematical knowledge
Statistics: A Very Short Introduction VERY SHORT INTRODUCTIONS are for anyone wanting a stimulating and accessible way in to a new subject. They are written by experts, and have been published in more than 25 languages worldwide. The series began in 1995, and now represents a wide variety of topics in history, philosophy, religion, science, and the humanities. Over the next few years it will grow to a library of around 200 volumes – a Very Short Introduction to everything from ancient Egypt and Indian philosophy to conceptual art and cosmology. Very Short Introductions available now: AFRICAN HISTORY John Parker and Richard Rathbone AMERICAN POLITICAL PARTIES AND ELECTIONS L. Sandy Maisel THE AMERICAN PRESIDENCY Charles O. Jones ANARCHISM Colin Ward ANCIENT EGYPT Ian Shaw ANCIENT PHILOSOPHY Julia Annas ANCIENT WARFARE Harry Sidebottom ANGLICANISM Mark Chapman THE ANGLO-SAXON AGE John Blair ANIMAL RIGHTS David DeGrazia Antisemitism Steven Beller ARCHAEOLOGY Paul Bahn ARCHITECTURE Andrew Ballantyne ARISTOTLE Jonathan Barnes ART HISTORY Dana Arnold ART THEORY Cynthia Freeland THE HISTORY OF ASTRONOMY Michael Hoskin ATHEISM Julian Baggini AUGUSTINE Henry Chadwick AUTISM Uta Frith BARTHES Jonathan Culler BESTSELLERS John Sutherland THE BIBLE John Riches THE BRAIN Michael O’Shea BRITISH POLITICS Anthony Wright BUDDHA Michael Carrithers BUDDHISM Damien Keown BUDDHIST ETHICS Damien Keown CAPITALISM James Fulcher CATHOLICISM Gerald O’Collins THE CELTS Barry Cunliffe CHAOS Leonard Smith CHOICE THEORY Michael Allingham CHRISTIAN ART Beth Williamson CHRISTIANITY Linda Woodhead CITIZENSHIP Richard Bellamy CLASSICS Mary Beard and John Henderson CLASSICAL MYTHOLOGY Helen Morales CLAUSEWITZ Michael Howard THE COLD WAR Robert McMahon CONSCIOUSNESS Susan Blackmore CONTEMPORARY ART Julian Stallabrass CONTINENTAL PHILOSOPHY Simon Critchley COSMOLOGY Peter Coles THE CRUSADES Christopher Tyerman CRYPTOGRAPHY Fred Piper and Sean Murphy DADA AND SURREALISM David Hopkins DARWIN Jonathan Howard THE DEAD SEA SCROLLS Timothy Lim D; EMOCRACY Bernard Crick DESCARTES Tom Sorell DESIGN John Heskett DINOSAURS David Norman DOCUMENTARY FILM Patricia Aufderheide DREAMING J. Allan Hobson DRUGS Leslie Iversen THE EARTH Martin Redfern ECONOMICS Partha Dasgupta EGYPTIAN MYTH Geraldine Pinch EIGHTEENTH-CENTURY BRITAIN Paul Langford THE ELEMENTS Philip Ball EMOTION Dylan Evans EMPIRE Stephen Howe ENGELS Terrell Carver ETHICS Simon Blackburn THE EUROPEAN UNION John Pinder and Simon Usherwood EVOLUTION Brian and Deborah Charlesworth EXISTENTIALISM Thomas Flynn FASCISM Kevin Passmore FEMINISM Margaret Walters THE FIRST WORLD WAR Michael Howard FOSSILS Keith Thomson FOUCAULT Gary Gutting FREE WILL Thomas Pink THE FRENCH REVOLUTION William Doyle FREUD Anthony Storr FUNDAMENTALISM Malise Ruthven GALAXIES John Gribbin GALILEO Stillman Drake Game Theory Ken Binmore GANDHI Bhikhu Parekh GEOGRAPHY John A. Matthews and David T. Herbert GEOPOLITICS Klaus Dodds GERMAN LITERATURE Nicholas Boyle GLOBAL CATASTROPHES Bill McGuire GLOBALIZATION Manfred Steger GLOBAL WARMING Mark Maslin THE GREAT DEPRESSION AND THE NEW DEAL Eric Rauchway HABERMAS James Gordon Finlayson HEGEL Peter Singer HEIDEGGER Michael Inwood HIEROGLYPHS Penelope Wilson HINDUISM Kim Knott HISTORY John H. Arnold HISTORY of Life Michael Benton THE HISTORY OF MEDICINE William Bynum HIV/AIDS Alan Whiteside HOBBES Richard Tuck HUMAN EVOLUTION Bernard Wood HUMAN RIGHTS Andrew Clapham HUME A. J. Ayer IDEOLOGY Michael Freeden INDIAN PHILOSOPHY Sue Hamilton INTELLIGENCE Ian J. 
Deary INTERNATIONAL MIGRATION Khalid Koser INTERNATIONAL RELATIONS Paul Wilkinson ISLAM Malise Ruthven JOURNALISM Ian Hargreaves JUDAISM Norman Solomon JUNG Anthony Stevens KABBALAH Joseph Dan KAFKA Ritchie Robertson KANT Roger Scruton KIERKEGAARD Patrick Gardiner THE KORAN Michael Cook LAW Raymond Wacks LINGUISTICS Peter Matthews LITERARY THEORY Jonathan Culler LOCKE John Dunn LOGIC Graham Priest MACHIAVELLI Quentin Skinner THE MARQUIS DE SADE John Phillips MARX Peter Singer MATHEMATICS Timothy Gowers THE MEANING OF LIFE Terry Eagleton MEDICAL ETHICS Tony Hope MEDIEVAL BRITAIN John Gillingham and Ralph A. Griffiths MEMORY Jonathan Foster MODERN ART David Cottington MODERN CHINA Rana Mitter MODERN IRELAND Senia Pašeta MOLECULES Philip Ball MORMONISM Richard Lyman Bushman MUSIC Nicholas Cook MYTH Robert A. Segal NATIONALISM Steven Grosby NELSON MANDELA Elleke Boehmer THE NEW TESTAMENT AS LITERATURE Kyle Keefer NEWTON Robert Iliffe NIETZSCHE Michael Tanner NINETEENTH-CENTURY BRITAIN Christopher Harvie and H. C. G. Matthew NORTHERN IRELAND Marc Mulholland NUCLEAR WEAPONS Joseph M. Siracusa THE OLD TESTAMENT Michael D. Coogan PARTICLE PHYSICS Frank Close PAUL E. P. Sanders PHILOSOPHY Edward Craig PHILOSOPHY OF LAW Raymond Wacks PHILOSOPHY OF SCIENCE Samir Okasha PHOTOGRAPHY Steve Edwards PLATO Julia Annas POLITICAL PHILOSOPHY David Miller POLITICS Kenneth Minogue POSTCOLONIALISM Robert Young POSTMODERNISM Christopher Butler POSTSTRUCTURALISM Catherine Belsey PREHISTORY Chris Gosden PRESOCRATIC PHILOSOPHY Catherine Osborne PSYCHIATRY Tom Burns PSYCHOLOGY Gillian Butler and Freda McManus THE QUAKERS Pink Dandelion QUANTUM THEORY John Polkinghorne RACISM Ali Rattansi RELATIVITY Russell Stannard RELIGION IN AMERICA Timothy Beal THE RENAISSANCE Jerry Brotton RENAISSANCE ART Geraldine A. Johnson ROMAN BRITAIN Peter Salway THE ROMAN EMPIRE Christopher Kelly ROUSSEAU Robert Wokler RUSSELL A. C. Grayling RUSSIAN LITERATURE Catriona Kelly THE RUSSIAN REVOLUTION S. A. Smith SCHIZOPHRENIA Chris Frith and Eve Johnstone SCHOPENHAUER Christopher Janaway SCIENCE AND RELIGION Thomas Dixon SCOTLAND Rab Houston SEXUALITY Véronique Mottier SHAKESPEARE Germaine Greer SIKHISM Eleanor Nesbitt SOCIAL AND CULTURAL ANTHROPOLOGY John Monaghan and Peter Just SOCIALISM Michael Newman SOCIOLOGY Steve Bruce SOCRATES C. C. W. Taylor THE SPANISH CIVIL WAR Helen Graham SPINOZA Roger Scruton STATISTICS David J. Hand STUART BRITAIN John Morrill TERRORISM Charles Townshend THEOLOGY David F. Ford THE HISTORY OF TIME Leofranc Holford-Strevens TRAGEDY Adrian Poole THE TUDORS John Guy TWENTIETH-CENTURY BRITAIN Kenneth O. Morgan THE UNITED NATIONS Jussi M. Hanhimäki THE VIETNAM WAR Mark Atwood Lawrence THE VIKINGS Julian Richards WITTGENSTEIN A. C. Grayling WORLD MUSIC Philip Bohlman THE WORLD TRADE ORGANIZATION Amrita Narlikar Available Soon: APOCRYPHAL GOSPELS Paul Foster BEAUTY Roger Scruton Expressionism Katerina Reed-Tsocha FREE SPEECH Nigel Warburton MODERN JAPAN Christopher Goto-Jones NOTHING Frank Close PHILOSOPHY OF RELIGION Jack Copeland and Diane Proudfoot SUPERCONDUCTIVITY Stephen Blundell For more information visit our websites www.oup.com/uk/vsi www.oup.com/us David J. Hand Statistics A Very Short Introduction 1 1 Great Clarendon Street, Oxford OX2 6DP Oxford University Press is a department of the University of Oxford. 
It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries Published in the United States by Oxford University Press Inc., New York c David J. Hand 2008 The moral rights of the author have been asserted Database right Oxford University Press (maker) First Published 2008 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer British Library Cataloguing in Publication Data Data available Library of Congress Cataloging in Publication Data Data available ISBN 978–0–19–923356–4 1 3 5 7 9 10 8 6 4 2 Typeset by SPI Publisher Services, Pondicherry, India Printed in Great Britain by Ashford Colour Press Ltd, Gosport, Hampshire Contents Preface ix List of illustrations xi 1 2 3 4 5 6 7 Surrounded by statistics 1 Simple descriptions 21 Collecting good data 36 Probability 55 Estimation and inference 75 Statistical models and methods 92 Statistical computing 110 Further reading 115 Endnote 117 Index 119 This page intentionally left blank Preface Statistical ideas and methods underlie just about every aspect of modern life. Sometimes the role of statistics is obvious, but often the statistical ideas and tools are hidden in the background. In either case, because of the ubiquity of statistical ideas, it is clearly extremely useful to have some understanding of them. The aim of this book is to provide such understanding. Statistics suffers from an unfortunate but fundamental misconception which misleads people about its essential nature. This mistaken belief is that it requires extensive tedious arithmetic manipulation, and that, as a consequence, it is a dry and dusty discipline, devoid of imagination, creativity, or excitement. But this is a completely false image of the modern discipline of statistics. It is an image based on a perception dating from more than half a century ago. In particular, it entirely ignores the fact that the computer has transformed the discipline, changing it from one hinging around arithmetic to one based on the use of advanced software tools to probe data in a search for understanding and enlightenment. That is what the modern discipline is all about: the use of tools to aid perception and provide ways to shed light, routes to understanding, instruments for monitoring and guiding, and systems to assist decision-making. All of these, and more, are aspects of the modern discipline. The aim of this book is to give the reader some understanding of this modern discipline. Now, clearly, in a book as short as this one, I cannot go into detail. 
Instead of detail, I have taken a high-level view, a bird’s eye view, of the entire discipline, trying to convey the nature of statistical philosophy, ideas, tools, and methods. I hope the book will give the reader some understanding of how the modern discipline works, how important it is, and, indeed, why it is so important. The first chapter presents some basic definitions, along with illustrations to convey some of the power, importance, and, indeed, excitement of statistics. The second chapter introduces some of the most elementary of statistical ideas, ideas which the reader may well have already encountered, concerned with basic summaries of data. Chapter 3 cautions us that the validity of any conclusions we draw depends critically on the quality of the raw data, and also describes strategies for efficient collection of data. If data provide one of the legs on which statistics stands, the other is probability, and Chapter 4 introduces basic concepts of probability. Proceeding from the two legs of data and probability, in Chapter 5 statistics starts to walk, with a description of how one draws conclusions and makes inferences from data. Chapter 6 presents a lightning overview of some important statistical methods, showing how they form part of an interconnected network of ideas and methods for extracting understanding from data. Finally, Chapter 7 looks at just some of the ways the computer has impacted the discipline. I would like to thank Emily Kenway, Shelley Channon, Martin Crowder, and an anonymous reader for commenting on drafts of this book. Their comments have materially improved it, and helped to iron out obscurities in the explanations. Of course, any such which remain are entirely my own fault. David J. Hand Imperial College, London List of illustrations 1. Distribution of American baseball players’ salaries 31 c David Hand 2. A cumulative probability distribution 66 c David Hand 3. A probability density function 67 c David Hand 4. The normal distribution 70 c David Hand 5. Fitting a line to data 100 c David Hand 6. A ‘scatterplot matrix’ of two types of athletic events 107 c David Hand 7. A time series plot of ATM withdrawals 108 c David Hand 8. Distribution of the light scatter values from phytoplankton cells of different species 108 c David Hand This page intentionally left blank Chapter 1 Surrounded by statistics To those who say ‘there are lies, damned lies, and statistics’, I often quote Frederick Mosteller, who said that ‘it is easy to lie with statistics, but easier to lie without them’. Modern statistics I want to begin with an assertion that many readers might find surprising: statistics is the most exciting of disciplines. My aim in this book is to show you that this assertion is true and to show you why it is true. I hope to dispel some of the old misconceptions of the nature of statistics, and to show what the modern discipline looks like, as well as to illustrate some of its awesome power, as well as its ubiquity. In particular, in this introductory chapter I want to convey two things. The first is a flavour of the revolution that has taken place in the past few decades. I want to explain how statistics has been transformed from a dry Victorian discipline concerned with the manual manipulation of columns of numbers, to a highly sophisticated modern technology involving the use of the most advanced of software tools. 
I want to illustrate how today’s statisticians use these tools to probe data in the search for structures and patterns, and how they use this technology to peel back the layers of mystification and obscurity, revealing the truths 1 Statistics beneath. Modern statistics, like telescopes, microscopes, X-rays, radar, and medical scans, enables us to see things invisible to the naked eye. Modern statistics enables us to see through the mists and confusion of the world about us, to grasp the underlying reality. So that is the first thing I want to convey in this chapter: the sheer power and excitement of the modern discipline, where it has come from, and what it can do. The second thing I hope to convey is the ubiquity of statistics. No aspect of modern life is untouched by it. Modern medicine is built on statistics: for example, the randomized controlled trial has been described as ‘one of the simplest, most powerful, and revolutionary tools of research’. Understanding the processes by which plagues spread prevent them from decimating humanity. Effective government hinges on careful statistical analysis of data describing the economy and society: perhaps that is an argument for insisting that all those in government should take mandatory statistics courses. Farmers, food technologists, and supermarkets all implicitly use statistics to decide what to grow, how to process it, and how to package and distribute it. Hydrologists decide how high to build flood defences by analysing meteorological statistics. Engineers building computer systems use the statistics of reliability to ensure that they do not crash too often. Air traffic control systems are built on complex statistical models, working in real time. Although you may not recognize it, statistical ideas and tools are hidden in just about every aspect of modern life. Some definitions One good working definition of statistics might be that it is the technology of extracting meaning from data. However, no definition is perfect. In particular, this definition makes no reference to chance and probability, which are the mainstays of many applications of statistics. So another working definition might be that it is the technology of handling uncertainty. Yet 2 So far in this book, and in particular in the preceding paragraph, I have referred to the discipline of statistics, but the word ‘statistics’ also has another meaning: it is the plural of ‘statistic’. A statistic is a numerical fact or summary. For example, a summary of the data describing some population: perhaps its size, the birth rate, or the crime rate. So in one sense this book is about individual numerical facts. But in a very real sense it is about much more than that. It is about how to collect, manipulate, analyse, and deduce things from those numerical facts. It is about the technology itself. This means that a reader hoping to find tables of numbers in this book (e.g. ‘sports statistics’) will be disappointed. But a reader hoping to gain understanding of how businesses make decisions, of how astronomers discover new types of stars, of how medical researchers identify the genes associated with a particular disease, of how banks decide whether or not to give someone a credit card, of how insurance companies decide on the cost of a premium, of how to construct spam filters which 3 Surrounded by statistics other definitions, or more precise definitions, might put more emphasis on the roles that statistics plays. 
Thus we might say that statistics is the key discipline for predicting the future or for making inferences about the unknown, or for producing convenient summaries of data. Taken together these definitions broadly cover the essence of the discipline, though different applications will provide very different manifestations. For example, decision-making, forecasting, real-time monitoring, fraud detection, census enumeration, and analysis of gene sequences are all applications of statistics, and yet may require very different methods and tools. One thing to note about these definitions is that I have deliberately chosen the word ‘technology’ rather than science. A technology is the application of science and its discoveries, and that is what statistics is: the application of our understanding of how to extract information from data, and our understanding of uncertainty. Nevertheless, statistics is sometimes referred to as a science. Indeed, one of the most stimulating statistical journals is called just that: Statistical Science. prevent obscene advertisements reaching your email inbox, and so on and on, will be rewarded. Statistics All of this explains why ‘statistics’ can be both singular and plural: there is one discipline which is statistics, but there are many numbers which are statistics. So much for the word ‘statistics’. My first working definition also used the word ‘data’. The word ‘data’ is the plural of the Latin word ‘datum’, meaning ‘something given’, from dare, meaning ‘to give’. As such, one might imagine that it should be treated as a plural word: ‘the data are poor’ and ‘these data show that . . . ’, rather than ‘the data is poor’ and ‘this data shows that’. However, the English language changes over time. Increasingly, nowadays ‘data’ is treated as describing a continuum, as in ‘the water is wet’ rather than ‘the water are wet’. My own inclination is to adopt whatever sounds more euphonious in any particular context. Usually, to my ears, this means sticking to the plural usage, but occasionally I may lapse. Data are typically numbers: the results of measurements, counts, or other processes. We can think of such data as providing a simplified representation of whatever we are studying. If we are concerned with school children, and in particular their academic ability and suitability for different kinds of careers, we might choose to study the numbers giving their results in various tests and examinations. These numbers would provide an indication of their abilities and inclinations. Admittedly, the representation would not be perfect. A low score might simply indicate that someone was feeling ill during the examination. A missing value does not tell us much about their ability, but merely that they did not sit the examination. I will say more about data quality later. It matters because of the general principle (which applies throughout life, not merely in statistics) that if we have poor material to work with then the results will be poor. Statisticians 4 can perform amazing feats in extracting understanding from numbers, but they cannot perform miracles. Lies, damned lies, and setting the record straight The remark that there are ‘lies, damned lies, and statistics’, which was quoted at the start of this chapter, has been variously attributed to Mark Twain and Benjamin Disraeli, among others. Several people have made similar remarks. Thus ‘like dreams, statistics are a form of wish fulfilment’ (Jean Baudrillard, in Cool Memories, Chapter 4); ‘. . . 
the worship of statistics has had the particularly unfortunate result of making the job of the plain, outright liar that much easier’ (Tom Burnan, in The Dictionary of Misinformation, p. 246); ‘statistics is “hocuspocus” with numbers’ 5 Surrounded by statistics Of course, many situations do not appear to produce numerical data directly. Much raw data appears to be in the form of pictures, words, or even things such as electronic or acoustic signals. Thus, satellite images of crops or rain forest coverage, verbal descriptions of side effects suffered when taking medication, and sounds uttered when speaking, do not appear to be numbers. However, close examination shows that, when these things are measured and recorded, they are translated into numerical representations or into representations which can themselves be further translated into numbers. Satellite pictures and other photographs, for example, are represented as millions of tiny elements, called pixels, each of which is described in terms of the (numerical) intensities of the different colours making it up. Text can be processed into word counts or measures of similarity between words and phrases; this is the sort of representation used by web search engines, such as Google. Spoken words are represented by the numerical intensities of the waveforms making up the individual parts of speech. In general, although not all data are numerical, most data are translated into numerical form at some stage. And most of statistics deals with numerical data. (Audrey Habera and Richard Runyon, in General Statistics, p. 3); ‘legal proceedings are like statistics. If you manipulate them, you can prove anything’ (Arthur Hailey, in Airport, p. 385). And so on. Statistics Clearly there is much suspicion of statistics. We might also wonder if there is an element of fear of the discipline. It is certainly true that the statistician often plays the role of someone who must exercise caution, possibly even being the bearer of bad news. Statisticians working in research environments, for example in medical schools or social contexts, may well have to explain that the data are inadequate to answer a particular question, or simply that the answer is not what the researcher wanted to hear. That may be unfortunate from the researcher’s perspective, but it is a little unfair then to blame the statistical messenger. In many cases, suspicion is generated by those who selectively choose statistics. If there is more than one way to summarize a set of data, all looking at slightly different aspects, then different people can choose to emphasize different summaries. A particular example is in crime statistics. In Britain, perhaps the most important source of crime statistics is the British Crime Survey. This estimates the level of crime by directly asking a sample of people of which crimes they have been victims over the past year. In contrast, the Recorded Crime Statistics series includes all offences notifiable to the Home Office which have been recorded by the police. By definition, this excludes certain minor offences. More importantly, of course, it excludes crimes which are not reported to the police in the first place. With such differences, it is no wonder that the figures can differ between the two sets of statistics, even to the extent that certain categories of crime may appear to be decreasing over time according to one set of figures but increasing according to the other. 
The crime statistics figures also illustrate another potential cause of suspicion of statistics. When a particular measure is used as an indicator of the performance of a system, people may choose to 6 target that measure, improving its value but at the cost of other aspects of the system. The chosen measure then improves disproportionately, and becomes useless as a measure of performance of the system. For example, the police could reduce the rate of shoplifting by focusing all their resources on it, at the cost of allowing other kinds of crime to rise. As a result, the rate of shoplifting becomes useless as an indicator of crime rate. This phenomenon has been termed ‘Goodhart’s law’, named after Charles Goodhart, a former Chief Adviser to the Bank of England. Yet another cause of suspicion arises in a fundamental way as a consequence of the very nature of scientific advance. Thus, one day we might read in the newspaper of a scientific study appearing to show that a particular kind of food is bad for us, and the next day that it is good. Naturally enough this generates confusion, the feeling that the scientists do not know the answer, and perhaps that they are not to be trusted. Inevitably, such scientific investigations make heavy use of statistical analyses, so some of this suspicion transfers to statistics. But it is the very essence of scientific advance that new discoveries are made that change our understanding. Where we once might have thought simply that dietary fat was bad for us, further studies may have led us to recognize that there are different kinds of fats, some beneficial and some detrimental. The picture is more complicated than we first thought, so it is hardly surprising that the initial studies led to conflicting and apparently contradictory conclusions. A fourth cause of suspicion arises from elementary misunderstandings of basic statistics. As an exercise, the reader 7 Surrounded by statistics The point to all this is that the problem lies not with the statistics per se, but with the use made of those statistics, and the misunderstanding of how the statistics are produced and what they really mean. Perhaps it is perfectly natural to be suspicious of things we do not understand. The solution is to dispel that lack of understanding. might try to decide what is suspicious about each of the following statements (the answers are in the endnote at the back of the book). 1) We read in a report that earlier diagnosis of a medical condition leads to longer survival times, so that screening programmes are beneficial. 2) We are told that a stated price has already been reduced by a 25% discount for eligible customers, but we are not eligible so we have to pay 25% more than the stated price. 3) We hear of a prediction that life expectancy will reach 150 years in the next century, based on simple extrapolation from increases over the past 100 years. Statistics 4) We are told that ‘every year since 1950, the number of American children gunned down has doubled’. Sometimes the misunderstandings are not so elementary, or, at least, they arise from relatively deep statistical concepts. It would be surprising if, after more than a century of development, there were not some deep counter-intuitive ideas in statistics. One such is known as the Prosecutor’s Fallacy. It describes confusion between the probability that something will be true (e.g. the defendant is guilty) if you have some evidence (e.g. 
the defendant’s gloves at the scene of the crime), with the probability of finding that evidence if you assume that the defendant is guilty. This is a common confusion, not merely in the courts, and we will examine it more closely later. If there is suspicion and mistrust of statistics, it is clear that the blame lies not with the statistics or how they were calculated, but rather with the use made of those statistics. It is unfair to blame the discipline, or the statistician who extracts the meaning from the data. Rather, the blame lies with those who do not understand what the numbers are saying, or who wilfully misuse the results. 8 We do not blame a gun for murdering someone: rather it is the person firing the gun who is blamed. Data One way of looking at data is to regard it as evidence. Without data, our ideas and theories about the world around us are mere speculations. Data provide a grounding, linking our ideas and theories to reality, and allowing us to validate and test our 9 Surrounded by statistics We have seen that data are the raw material on which the discipline of statistics is built, as well as the raw material from which individual statistics themselves are calculated, and that these data are typically numbers. In fact, however, data are more than merely numbers. To be useful, that is to enable us to carry out some meaningful statistical analysis, the numbers must be associated with some meaning. For example, we need to know what the measurements are measurements of, and just what has been counted when we are presented with a count. To produce valid and accurate results when we carry out our statistical analysis, we also need to know something about how the values have been obtained. Did everyone we asked give answers to a questionnaire, or did only some people answer? If only some answered, are they properly representative of the population of people we wish to make a statement about, or is the sample distorted in some way? Does, for example, our sample disproportionately exclude young people? Likewise, we need to know if patients dropped out of a clinical trial. And whether the data are up to date. We need to know if a measuring instrument is reliable, or if it has a maximum value which is recorded when the true value is excessively high. Can we assume that a pulse rate recorded by a nurse is accurate, or is it only a rough value? There is an infinite number of such questions which could be asked, and we need to be alert for any which could influence the conclusions we draw. Or else suspicions of the kind described above might be entirely legitimate. Statistics understanding. Statistical methods are then used to compare the data with our ideas and theories, to see how good a match there is. A poor match leads us to think again, to re-evaluate our ideas and reformulate them so that they better match what we actually observe to be the case. But perhaps I should insert a cautionary note here. This is that a poor match could also be a consequence of poor data quality. We must be alert for this possibility: our theories may be sound but our measuring instruments may be lacking in some way. In general, however, a good match between the observed data and what our theories say the data should be like reassures us that we are on the right track. It reassures us that our ideas really do reflect the truth of what is going on. Implicit in this is that, to be meaningful, our ideas and theories must yield predictions, which we can compare with our data. 
If they do not tell us what we should expect to observe, or if the predictions are so general that any data will conform with our theories, then the theories are not much use: anything would do. Psychoanalysis and astrology have been criticized on such grounds. Data also allow us to steer our way through a complex world – to make decisions about the best actions to take. We take our measurements, count our totals, and we use statistical methods to extract information from these data to describe how the world is behaving and what we should do to make it behave how we want. These principles are illustrated by aircraft autopilots, automobile SatNav systems, economic indicators such as inflation rate and GDP, monitoring patients in intensive care units, and evaluations of complex social policies. Given the fundamental role of data as tying observations about the world around us to our ideas and understanding of that world, it is not stretching things too far to describe data, and the technology of extracting meaning from it, as the cornerstone of modern civilization. That is why I used the subtitle ‘how data rule our 10 world’, for my book Information Generation (see Further reading). Greater statistics As it developed, so the discipline of statistics went through several phases. The first, leading up to around the end of the 19th century, was characterized by discursive explorations of data. Then the first half of the 20th century saw the discipline becoming mathematicized, to the extent that many saw it as a branch of mathematics (it deals with numbers, doesn’t it?). Indeed, university statisticians are still often based within mathematics departments. The second half of the 20th century saw the advent of the computer, and it was this change which elevated statistics from drudgery to excitement. The computer removed the need for practitioners to have special arithmetic skills – they no longer needed to spend endless hours on numerical manipulation. It is analogous to the change from having to walk everywhere to being 11 Surrounded by statistics Although the roots can be traced as far back as we like, the discipline of statistics itself is really only a couple of centuries old. The Royal Statistical Society was established in 1834, and the American Statistical Association in 1839, whilst the world’s first university statistics department was set up in 1911, at University College, London. Early statistics had several strands, which eventually combined to become the modern discipline. One of these strands was the understanding of probability, dating from the mid-17th century, which emerged in part from questions concerning gambling. Another was the appreciation that measurements are rarely error free, so that some analysis was needed to extract sensible meaning from them. In the early years, this was especially important in astronomy. Yet another strand was the gradual use of statistical data to enable governments to run their country. In fact, it is this usage which led to the word ‘statistics’: data about ‘the State’. Every advanced country now has its own national statistical office. Statistics able to drive: journeys which would have previously taken days now take a matter of minutes; journeys which would have been too lengthy to contemplate now become feasible. The second half of the 20th century also saw the appearance of other schools of data analysis, with origins not in classical statistics but in other areas, especially computer science. 
These include machine learning, pattern recognition, and data mining. As these other disciplines developed, so there were sometimes tensions between the different schools and statistics. The truth is, however, that the varying perspectives provided by these different schools all have something to contribute to the analysis of data, to the extent that nowadays modern statisticians pick freely from the tools provided by all these areas. I will describe some of these tools later on. With this in mind, in this book I take a broad definition of statistics, following the definition of ‘greater statistics’ given by the eminent statistician John Chambers, who said: ‘Greater statistics can be defined simply, if loosely, as everything related to learning from data, from the first planning or collection to the last presentation or report.’ Trying to define boundaries between the different data-analytic disciplines is both pointless and futile. So, modern statistics is not about calculation, it is about investigation. Some have even described statistics as the scientific method in action. Although, as I noted above, one still often finds many statisticians based in mathematics departments in universities, one also finds them in medical schools, social science departments, including economics, and many other departments, ranging from engineering to psychology. And outside universities large numbers work in government and in industry, in the pharmaceutical sector, marketing, telecoms, banking, and a host of other areas. All managers rely on statistical skills to help them interpret the data describing their department, corporation, production, personnel, etc. These people are not manipulating mathematical symbols and formulae, but are using statistical tools and methods to gain insight and understanding from evidence, 12 from data. In doing so, they need to consider a wide variety of intrinsically non-mathematical issues such as data quality, how the data were or should be collected, defining the problem, identifying the broader objective of the analysis (understanding, prediction, decision, etc.), determining how much uncertainty is associated with the conclusion, and a host of other issues. As I hope is clear from the above, statistics is ubiquitous, in that it is applied in all walks of life. This has had a reciprocal impact on the development of statistics itself. As statistical methods were applied in new areas, so the particular problems, requirements, and characteristics of those areas led to the development of new statistical methods and tools. And then, once they had been developed, these new methods and tools spread out, finding applications in other areas. Example 1: Spam filtering ‘Spam’ is the term used to describe unsolicited bulk email messages automatically sent out to many recipients, typically many millions of recipients. These messages will be advertising messages, often offensive, and they may be fronts for confidence tricksters. They include things such as debt consolidation offers, get-rich-quick schemes, prescription drugs, stock market tips, and dubious sexual aids. The principle underlying them is that if you email enough people, some are likely to be interested in – or taken in by – your offer. Unless the messages are from organizations specifically asked for information, most of them will be of no interest, and nobody will want to waste time reading and deleting them. Which brings us to spam filters. 
These are computer programs that automatically scan incoming email messages and decide which are likely to be spam. The filters can be set up so that the program deletes the spam messages automatically, sends them to a holding folder for later examination, or takes some other appropriate action. There are various estimates of the amount of 13 Surrounded by statistics Some examples spam sent out, but at the time of writing, one estimate is that over 90 billion spam messages are sent each day – and since this number has been rising dramatically month on month, it is likely to be substantially greater by the time you read this. Statistics There are various techniques for preventing spam. Very simple approaches just check for the occurrence of keywords in the message. For example, if a message includes the word ‘viagra’ it might be blocked. However, one of the characteristics of spam detection is that it is something of an arms race. Once those responsible become aware that their messages are being blocked by a particular method, they seek ways round that method. For example, they might seek deliberately to misspell ‘viagra’ as ‘v1agra’ or ‘v-iagra’, so that you can recognize it but the automatic program cannot. More sophisticated spam detection tools are based on statistical models of the word content of spam messages. For example, they might use estimates of the probabilities of particular words or word combinations arising in spam messages. Then a message that contains too many high-probability words is suspect. More sophisticated tools build models for the probability that one word will follow another, in a sequence, hence enabling the detection of suspicious phrases and sets of words. Yet other methods use statistical models of images, to detect such things as skin tones in an emailed picture. Example 2: The Sally Clark case In 1999, Sally Clark, a young British lawyer, was tried, convicted, and given a life sentence for murdering her two baby sons. Her first child died in 1996, aged 11 weeks, and her second died in 1998, aged 8 weeks. The verdict depended on what has become a byword for the misunderstanding and misuse of statistics, when the paediatrician Sir Roy Meadow, in his role as expert witness for the prosecution, claimed that the chance of two children dying 14 from cot death was 1 in 73 million. He obtained this figure by simply multiplying together the chance for the two deaths separately. In doing so, in his ignorance of basic statistics, he entirely ignored the fact that one such death in a family is likely to mean that another such death is more likely. In the Sally Clark case, there was more evidence suggesting that she was innocent, and eventually it became clear that her second son had a bacterial infection known to predispose towards sudden infant death. Ms Clark was subsequently released on appeal in 2003. Tragically, she died in March 2007, aged just 42. More details of this terrible misunderstanding and misuse of statistics are given in an excellent article by Helen Joyce and on the website listed in the Further reading at the end of this book. Example 3: Star clusters As our ability to probe further and further into the universe has increased, so it has become apparent that astronomic objects tend 15 Surrounded by statistics Study of past data shows that the probability of a randomly selected baby suffering a cot death in a family such as the Clarks’ is about 1 in 8,500. 
If one then makes the assumption that the occurrence of one such death does not change the probability of another, then the chances of two such deaths in the same family would be 1/8,500 times 1/8,500; that is, about one in 73 million. But the assumption here is a big one, and careful statistical analysis of past data suggests that, in fact, the chance of a second cot death is substantially increased when one has already occurred. Indeed, the calculations suggest that several such multiple deaths should be expected to occur each year in a nation the size of the UK. The website of the Foundation for the Study of Infant Death says ‘it is very rare for cot death to occur twice in the same family, though occasionally an inherited disorder, such as a metabolic defect, may cause more than one infant to die unexpectedly’. Statistics to cluster together, and do so in a hierarchical way, so that stars form clusters, clusters of stars themselves form higher level clusters, and these then cluster in turn. In particular, our own galaxy, which is a cluster of stars, is a member of the Local Group of about thirty galaxies, and this in turn is a member of the Local Supercluster. At the largest scale, the Universe looks rather like a foam, with filaments consisting of Superclusters lying on the edges of vast empty spaces. But how was all this discovered? Even if we use powerful telescopes to look out from the Earth, we simply see a sky of stars. The answer is that teasing out this clustering structure, and indeed discovering it in the first place, required statistical techniques. One class of techniques involves calculating the distances from each star to its few closest stars. Stars which have more stars closer than expected by chance are in locally dense regions – local clusters. Of course, there is much more to it than that. Interstellar dust clouds will obscure the view of distant objects, and these dust clouds are not distributed uniformly in space. Likewise, faint objects will only be seen if they are near enough to the Earth. A thin filament of galaxies seen end on from the Earth could appear to be a dense cluster. And so on. Sophisticated statistical corrections need to be applied so that we can discern the underlying truth from the apparent distributions of objects. Understanding the structure of the universe sheds light both on how it came to be, and on its future development. Example 4: Manufacturing chemicals I have already remarked that while statisticians may be able to perform amazing feats, they cannot perform miracles. In particular, the quality of their conclusions will be moderated by the quality of the data. Given this, it is hardly surprising that there are important subdisciplines of statistics concerned with how best to collect data. These are discussed in Chapter 3. One of these 16 subdisciplines is experimental design. Experimental design techniques are used in situations where it is possible to control or manipulate some of the ‘variables’ being studied. The tools of experimental design enable us to extract maximum information for a given use of resources. For example, in producing a particular chemical polymer we might be able to set the temperature, pressure, and time of the chemical reaction to any values we want. Different values of these three variables will lead to variations in the quality of the final product. The question is, what is the best set of values? But what if the manufacturing process is such that it takes several days to make each batch? 
Making many such batches, just to work out the best way of doing so, may be infeasible. Making 100 batches, each of which takes three days, would take the better part of a year. Fortunately, cleverly designed experiments allow us to extract the same information from far fewer carefully chosen sets of values. Sometimes a tiny fraction of batches can yield enough information for us to determine the best set of values, provided those batches are properly selected. Example 5: Customer satisfaction To run any retail organization effectively, so that it makes a profit and grows over time, requires paying careful attention to the customers, and giving them the product or service that they want. Failing to do so will mean that they go to a competitor who does provide what is wanted. The bottom line here is that failure will be indicated by declining revenues. We can try to avoid that by 17 Surrounded by statistics In principle, this is an easy question to answer. We simply make many batches of the polymer, each with different values of the three variables. This allows us to estimate the ‘response surface’, showing the quality of the polymer at each set of three values of the variables, and we can then choose the particular triple which maximizes the quality. Statistics collecting data on how the customers feel before they begin voting with their wallets. We can carry out surveys of customer satisfaction, asking customers if they are happy with the product or service and in what ways these might be improved. At first glance, it might look as if, to obtain reliable conclusions which reflect the behaviour of the entire customer base, it is necessary to give questionnaires to all the customers. This could clearly be an expensive and time-consuming exercise. Fortunately, however, there are statistical methods which enable sufficiently accurate results to be obtained from just a sample of customers. Indeed, the results can sometimes be even more accurate than surveying all customers. Needless to say, great care is needed in such an exercise. It is necessary to be wary of basing conclusions on a distorted sample: the results would be useless as a description of how customers behaved in general if only those who spent large sums of money were interviewed. Once again, statistical methods have been developed which enable us to avoid such mistakes – and so to draw valid conclusions. Example 6: Detecting credit card fraud Not all credit card transactions are legitimate. Fraudulent transactions cost the bank money, and also cost the bank’s customers money. Detecting and preventing fraud is thus very important. Many readers of this book will have had the experience of their bank telephoning them to check that they made certain transactions. These calls are based on the predictions made by statistical models which describe how legitimate customers behave. Departures from the behaviour predicted by these models suggest that something suspicious is going on, deserving investigation. There are various kinds of model. Some are based simply on intrinsically suspicious patterns of behaviour: simultaneous use of a single card in geographically distant locations, for example. 18 Others are based on more elaborate models of the kinds of transactions someone habitually makes, when they tend to make them, for how much money, at what kinds of outlets, for which kinds of products, and so on. Of course, no such predictive model is perfect. 
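To make the first, rule-based kind of fraud check concrete, here is a minimal sketch in Python of an 'impossible travel' rule: flag a card if two of its transactions are so far apart in space, and so close together in time, that no legitimate cardholder could have made both. This is not any bank's actual system; the function names, the speed threshold, and the two example transactions are all invented for illustration.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

def km_between(a, b):
    """Great-circle (haversine) distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def impossible_travel(card_transactions, max_speed_kmh=900):
    """Flag consecutive uses of one card that would require the cardholder to
    travel faster than max_speed_kmh (an invented threshold) between them."""
    txs = sorted(card_transactions, key=lambda t: t["time"])
    suspicious = []
    for prev, curr in zip(txs, txs[1:]):
        hours = (curr["time"] - prev["time"]).total_seconds() / 3600
        speed = km_between(prev["loc"], curr["loc"]) / max(hours, 1e-9)
        if speed > max_speed_kmh:
            suspicious.append((prev["where"], curr["where"]))
    return suspicious

# A card apparently used in London and then in New York forty minutes later.
transactions = [
    {"where": "London",   "time": datetime(2008, 1, 1, 12, 0),  "loc": (51.5, -0.1)},
    {"where": "New York", "time": datetime(2008, 1, 1, 12, 40), "loc": (40.7, -74.0)},
]
print(impossible_travel(transactions))   # [('London', 'New York')]: worth a phone call
```

Even a crude rule like this needs a threshold, and choosing it is itself a statistical question: set it too low and legitimate travellers are pestered with phone calls, set it too high and fraud slips through.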
Credit card transactions patterns are often varied, with people suddenly making purchases of a kind they have never made before. Moreover, only a tiny percentage of transactions are fraudulent – perhaps around one in a thousand. This makes detection especially difficult. Example 7: Inflation We are all familiar with the notion that things become more expensive as time passes. But how can we compare today’s cost of living with yesterday’s? To do so, we need to compare the same things bought on the two dates. Unfortunately, there are complications: different shops charge different prices for the same things, different people buy different things, the same people change in their purchasing patterns, new products appear on the market and old ones vanish, and so on. How can we allow for changes such as these in determining whether life is more expensive nowadays? Statisticians and economists construct indicators such as the Retail Price Index and the Consumer Price Index to measure the cost of living. These are based on a notional ‘basket’ of (hundreds of ) goods that people buy, along with surveys to discover what prices are being charged for each item in the basket. Sophisticated 19 Surrounded by statistics Detecting and preventing fraud is a constant battle: when one fraud avenue is stopped, fraudsters tend not to abandon their chosen career path and take up a legitimate occupation, but switch to other methods of fraud, so requiring the development of further statistical models. statistical models are used to combine the prices of the different items to yield a single overall number which can be compared over time. As well as serving as an indicator of inflation, such indices are also used to adjust tax thresholds and index-linked salaries, pensions, and so on. Statistics Conclusion It may not always be apparent to the untutored eye, but statistics and statistical methods lie at the heart of scientific discovery, commercial operations, government, social policy, manufacturing, medicine, and most other aspects of human endeavour. Furthermore, as the world progresses, so this role is becoming more and more important. For example, the development of new medicines has long had a legal requirement for statisticians to be involved and something similar is now happening in the banking industry, with new international agreements requiring statistical risk models to be built. Given this pivotal role, it is clearly important that no educated citizen should be unaware of basic statistical principles. Modern statistics, with its use of sophisticated software tools to probe data, permits us to make voyages of discovery paralleling those of pre-20th-century explorers, investigating new and exciting realms. This recognition – that real statistics is about exploring the unknown, not about tedious arithmetic manipulation – is central to an appreciation of the modern discipline. 20 Chapter 2 Simple descriptions Data are nature’s evidence Introduction In this chapter, I aim to introduce some of the basic concepts and tools which form the foundation of statistics, and which enable it to play its many roles. In Chapter 1, I noted that modern statistics suffered from many misconceptions and misunderstandings. Yet another such misunderstanding is often (probably inadvertently) propagated by textbooks which describe statistical methods for experts in other disciplines. 
This is that statistics is a bag of tools, with the role of the statistician or user of statistics being to pick one tool to match the question, and then to apply it. The problem with this view of statistics is that it gives the impression that the discipline is simply a collection of disconnected methods of manipulating numbers. It fails to convey the truth that statistics is a connected whole, built on deep philosophical principles, so that the data analytic tools are linked and related: some may generalize to others, some may appear to differ simply because they work with different kinds of data, even though they search for the same kind of structures, and so on. I 21 suspect that this impression of a collection of isolated methods may be another reason why newcomers find statistics rather tedious and hard to learn (apart from any fear of numbers they may have). Learning a disconnected and apparently quite distinct set of methods is much tougher than learning about such methods through their relationship of derivation from underlying principles. It is rather like the difficulty of learning a random collection of unrelated words, compared with learning words in a meaningful sentence. I have endeavoured, in this chapter and throughout the book, to convey the relationships between statistical ideas, to show that the discipline is really an interconnected whole. Statistics Data again Whatever else it does, and whatever the details of the definition we adopt, statistics begins with data. Data describe the universe we wish to study. I am using the word ‘universe’ here in a very general sense. It could be the physical world about us, but it could be the world of credit card transactions, of microarray experiments in genetics, of schools and their teaching and examination performance, of trade between countries, of how people behave when exposed to different advertisements, of subatomic particles, and so on. There is no end to the worlds which can be studied, and therefore of the worlds represented by data. Of course, no finite data set can tell us about all of the infinite complexities of the real world, just as no verbal description, even that written by the most eloquent of authors, can convey everything about every facet of the world around us. That means we must be specially aware of any potential shortcomings or gaps in our data. It means that, when collecting data, we need to take special care to ensure that they do cover the aspects we are interested in, or about which we wish to draw conclusions. There is also a more positive way of looking at this: by collecting only a 22 finite set of descriptive aspects, we are forced to eliminate the irrelevant ones. When studying the safety of different designs of cars, we might decide not to record the colour of the fabric covering the seats. In fact, in any one study we might be interested in multiple kinds of objects. We might want to understand and make statements not only about school children, but also about the schools themselves, and perhaps about the teachers, the styles of teaching, and different kinds of school management structure, all in one study. Moreover, we will typically not be interested in any single characteristic of the objects being studied, but in relationships between characteristics, and indeed, perhaps relationships between characteristics for objects of different kinds and at different levels. 
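Before going on, it may help to see the simplest version of this 'objects and their characteristics' picture laid out explicitly. The tiny table below is purely illustrative (the children and all of their values are invented): each row is an object of one kind, and each column holds the values of one characteristic recorded for every object.

```python
# Objects as rows, characteristics (variables) as columns: an invented example
# in the spirit of the school-children illustration above.
children = [
    {"child": "A", "spelling_score": 17, "height_cm": 132, "weight_kg": 29},
    {"child": "B", "spelling_score": 12, "height_cm": 127, "weight_kg": 26},
    {"child": "C", "spelling_score": 19, "height_cm": 138, "weight_kg": 33},
]

# The value of a variable for a particular object is one cell of this table.
print(children[0]["spelling_score"])      # child A's value on the test-score variable

# A question about a relationship between characteristics looks at columns jointly,
# for example pairing each child's height with their weight.
print([(row["height_cm"], row["weight_kg"]) for row in children])
```

Real studies are rarely this tidy, of course: there may be several kinds of object at once, and characteristics recorded at several levels.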
We see that things are often really quite complicated, as we might expect, given the complexity of the subjects we might be studying.

Broadly speaking, it is convenient to regard data as having two aspects. One aspect is concerned with the objects we wish to study, and the other aspect is concerned with the characteristics of those objects we wish to study. For example, our objects might be children at school and the characteristics might be their test scores. Or perhaps the objects might be children, but we are studying their diet and physical development, in which case the characteristics might be the children's height and weight. Or our objects might be physical materials, with the characteristics of interest being their electrical and magnetic properties. In statistics, it is common to call the characteristics variables, with each object having a value of a variable (the child's score in a spelling test would be the value of the test variable, the magnitude of a material's electrical conductivity would be the value of the conductivity variable, etc.). In other data-analytic disciplines, alternative words are sometimes used (such as 'feature', 'characteristic', or 'attribute'), but when I get on to discussing the technical aspects I shall usually stick to 'variable'.

Many people are resistant to the notion that numerical data can convey the beauty of the real world. They feel that somehow converting things to numbers strips away the magic. In fact, they could not be more wrong. Numbers have the potential to allow us to perceive that beauty, that magic, more clearly and more deeply, and to appreciate it more fully. Admittedly, ambiguity may be removed by couching things in numerical form: if I say that there are four people in the room, you know exactly what I mean, whereas, in contrast, if I say that someone is attractive you may not be entirely sure what I mean. You may even disagree with my view that someone is attractive, but you are unlikely to disagree with my view that there are four people in the room (barring errors in our counting, of course, but that's a different matter). Numbers are universally understood, regardless of nationality, religion, gender, age, or any other human characteristic. Removing ambiguity, and with it removing the risk of misunderstanding, can only be beneficial when trying to understand something – when trying to see to its heart.

This lack of ambiguity in the interpretation of numbers is closely tied to the fact that numbers have only one property: their value or magnitude. Contrary to what fortune tellers may have us believe, numbers are not lucky or unlucky – in just the same way that numbers do not have a colour, or a flavour, or an odour. They have no properties but their intrinsic numerical value. (Admittedly, some people experience synaesthesia, in which they do associate a particular colour or sensation with particular numbers. However, the associated sensations are different for different people, and cannot be regarded as properties of the numbers themselves.)

Numerical data give us a more direct and immediate link to the phenomena we are studying than do words, because numerical data are typically produced by measuring instruments with a more direct link to those phenomena than are words. Numbers come directly from the things being studied, whereas words are filtered by a human brain.
Of course, things are more complicated if our data-collection procedure is mediated by words (as would be the case if the data are collected by questionnaires), but the principle still holds good. While measuring instruments may not be perfect, the data are a proper representation of the results of applying those instruments to the phenomenon being investigated. I sometimes summarize this by the comment at the start of this chapter: data are nature's evidence, seen through the lens of the measuring instrument.

On top of all this, numbers have practical consequences in terms of societal advance. It is the civilized world's facility with manipulating the representations of reality provided by numbers that has led to such awesome material progress in the past few centuries.

Although numbers have only one property, their numerical value, we might choose to use that property in different ways. For example, when deciding on the order of merit of students in a class, we might rank them according to their examination scores. That is, we might care only about whether one score is higher than another, and not about the precise numerical difference. When we are concerned only with the order of the values in this way we say we are treating the data as lying on an 'ordinal' scale.

On the other hand, when a farmer measures the amount of corn he has produced, he does not simply want to know whether he has grown more than he grew last year. He also wants to know how much he has produced: its actual weight. It is on this basis, after all, that it will be sold in the market. In this situation, the farmer is really comparing the weight of corn he has produced with a standard weight, such as a ton, so that he can say how many tons of corn he has produced. Implicit in this is the calculation of the ratio of the weight of the corn the farmer has produced to the weight of one ton of corn. For this reason, when we use the values in this way, we say we are treating the data as lying on a 'ratio' scale. Note that in this case we could choose to change the basic unit of measurement: we could calculate the weight in pounds or kilograms rather than tons. As long as we say what unit we have used, then it is easy for anyone else to convert back, or to convert to whatever unit they normally use.

In yet another situation, we might want to know how many patients have suffered from a particular side effect of a medicine. If the number is large enough we might want to withdraw the drug from the market as being too risky. In this case, we are simply counting discrete well-defined units (patients). No rescaling by changing units would be meaningful (we would not contemplate counting the number of 'half patients'!), so we say we are treating the data as lying on an 'absolute' scale.
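By way of illustration (this sketch is mine, not the book's), the following Python fragment uses values in the three ways just described: order only (ordinal), size relative to a rescalable unit (ratio), and a plain count (absolute). The corn weight is invented, and the conversion treats a 'ton' as a metric tonne for simplicity.

```python
# Using values on different measurement scales (illustrative values only).

exam_scores = [78, 63, 53, 91, 55]
order_of_merit = sorted(exam_scores, reverse=True)  # ordinal use: only the ordering matters
print(order_of_merit)

corn_in_tons = 12.4                                 # ratio use: a multiple of a chosen unit
corn_in_kg = corn_in_tons * 1000                    # rescaling the unit (here, tonnes to kilograms)
print(corn_in_tons, "tonnes =", corn_in_kg, "kg")

patients_with_side_effect = 17                      # absolute use: a simple count of patients
print(patients_with_side_effect, "patients reported the side effect")
```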
Simple summary statistics

Whilst simple numbers constitute the elements of data, in order for them to be useful we need to look at the relationships between them, and perhaps combine them in some way. And this is where statistics comes in. Later chapters will explore more complex ways of comparing and combining numbers, but this chapter serves to introduce the ideas. Here we look at some of the most straightforward ways: we will not explore relationships between different variables in this chapter, but simply look at information and insights which can be extracted from relationships between values measured on the same variable. For example, we might have recorded the ages of the applicants for a place at a university, the luminosity of the stars in a cluster, the monthly expenditures of families in a town, the weights of cows in a herd at the time of sending them to market, and so on. In each case, a single numerical value is recorded for each 'object' in a population of objects. The individual values in the collection, when taken together, are said to form a 'distribution' of values. Summary statistics are ways of characterizing that distribution: of saying whether the values are very similar, whether there are some exceptionally large or small values, what a 'typical' value is like, and so on.

Averages

One of the most basic kinds of descriptions, or summary statistics, of a set of numbers is an 'average'. An average is a representative value; it is close, in some sense, to the numbers in the set. The need for such a thing is most apparent when the set of numbers is large. For example, suppose we had a table recording the ages of each of the people in a large city – perhaps with a million inhabitants. For administrative and business purposes it would obviously be useful to know the average age of the inhabitants. Very different services would be needed and sales opportunities would arise if the average age was 16 instead of 60. We could try to get a ball-park feel for the general size of the numbers in the table, the ages, by looking at each of the values. But this would clearly be a tough exercise. Indeed, if it took only one second to look at each number, it would take over 270 hours to look through a table of a million numbers, and that's ignoring the actual business of trying to remember and compare them. But we can use our computer to help us.

First, we need to be clear about exactly what we mean by 'average', because the word has several meanings. Perhaps the most widely used type of average is the arithmetic mean, or just mean for short. If people use the word 'average' without saying how they interpret it, then they probably intend the arithmetic mean.

Before I show how to calculate the arithmetic mean, imagine another table of a million numbers. Only, in this second table, suppose that all the numbers are identical to each other. That is, suppose that they all have the same value. Now add up all the numbers in the first table, to find their total (this takes but a split second using a computer). And add up all the numbers in the second table, to find their total. If the two totals are the same, then the number which is repeated a million times in the second table is capturing some sort of essence of the numbers in the first table. This single number, for which a million copies add up to the same total as the first table, is called the arithmetic mean (of the numbers in the first table).

In fact, the arithmetic mean is most easily calculated simply by dividing the total of the million numbers in the first table by a million. In general, the arithmetic mean of a set of numbers is found by adding all the numbers up and dividing by how many there are. Here is a further example. In a test, the percentage scores for five students in a class were 78, 63, 53, 91, and 55. The total is 78 + 63 + 53 + 91 + 55 = 340. The arithmetic mean is then simply given by dividing 340 by 5. It is 68. We would get the same total of 340 if all five students each scored the mean value, 68.
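For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of that calculation (my own illustration, not part of the original text), using the same five scores:

```python
# Arithmetic mean: add up the values and divide by how many there are.
scores = [78, 63, 53, 91, 55]

total = sum(scores)             # 340
mean = total / len(scores)      # 340 / 5 = 68.0
print(total, mean)

# Five copies of the mean add up to the same total as the original scores.
print(sum([mean] * len(scores)) == total)   # True
```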
The arithmetic mean has many attractive properties. It always takes a value between the largest and smallest values in the set of numbers. Moreover, it balances the numbers in the set, in the sense that the sum of the differences between the arithmetic mean and those values larger than it is exactly equal to the sum of the differences between the arithmetic mean and those values smaller than it. In that sense, it is a 'central' value. Those of a mechanical turn of mind might like to picture a set of 1kg weights placed at various positions along a (weightless) plank of wood. The distances of the weights from one end of the plank represent the values in the set of numbers. The mean is the distance from the end such that a pivot placed there would perfectly balance the plank.

The arithmetic mean is a statistic. It summarizes the entire set of values in our collection to a single value. It follows from this that it also throws away information: we should not expect to represent a million (or five, or however many) different numbers by a single number without sacrificing something. We shall explore this sacrifice later. But since it is a central value in the sense illustrated above it can be a useful summary. We can compare the average class size in different schools, the average test score of different students, the average time it takes different people to get to work, the average daily temperature in different years, and so on.

The arithmetic mean is one important statistic, a summary of a set of numbers. Another important summary is the median. The mean was the pivotal value, a sort of central point balancing the sum of differences between it and the numbers in the set. The median balances the set in another way: it is the value such that half the numbers in the data set are larger and half are smaller. Returning to the class of five students illustrated above, their scores, in order from smallest to largest, are 53, 55, 63, 78, and 91. The middle score here is 63, so this is the median.

Presented with these two summary statistics, both providing representative values, how should we choose which to use in any particular situation? Since they are defined in different ways, combining the numerical values differently, they are likely to produce different values, so any conclusions based on them may well be different. A full answer to the question of which to choose would get us into technicalities beyond the level of this book, but a short answer is that the choice will depend on the precise details of the question one wishes to answer.

Obviously complications arise if there are equal values in the data set (e.g. suppose it consists of 99 copies of the value 0 and a single copy of the value 1), but these can be overcome. In any case, once again the median is a representative value in some sense, although in a different sense from the mean. Because of this difference, we should expect it to take a value different from the mean. Obviously the median is easier to calculate than the mean. We do not even have to add up any values to reach it, let alone divide by the number of values in the set. All we have to do is order the numbers, and locate the one in the middle. But in fact this computational advantage is essentially irrelevant in the computer age: in real statistical analyses the computer takes over the tedium of arithmetic juggling.

Here is an illustration. Suppose that a small company has five staff, each in a different grade and earning, respectively, $10,000, $10,001, $10,002, $10,003, and $99,999.
The mean of these is $28,001, while the median is $10,002. Now suppose that the company intends to recruit five new employees, one to each grade. The employer might argue that in this case, 'on average', she would have to pay the newcomers a salary of $28,001, so that this is the average salary she states in the advertisement. The employees, however, might feel that this is dishonest, since as many new employees will be paid less than $10,002 as will be paid more than $10,002. They might feel it is more honest to put this figure in the advertisement. Sometimes it requires careful thought to decide which measure is appropriate. (And in case you think this argument is contrived, Figure 1 shows the distribution of American baseball players' salaries prior to the 1994 strike. The arithmetic mean was $1.2 million, but the median was only $0.5 million.)

This example also illustrates the relative impact of extreme values on the mean and the median. In the pay example above, the mean is nearly three times the median. But suppose the largest value had been $10,004 instead of $99,999. Then the median would remain as $10,002 (half the values above and half below), but the mean would shrink to $10,002. The size of just a single value can have a dramatic effect on the mean, but leave the median untouched. This sensitivity of the mean to extreme values is one reason why the median may sometimes be chosen in preference to the mean.

The mean and the median are not the only two representative value summaries. Another important one is the mode. This is the value taken most frequently in a sample. For example, suppose that I count the number of children per family for families in a certain population. I might find that some families have one child, some two, some three, and so on, and, in particular, I might find that more families have two children than any other value. In this case, the mode of the number of children per family would be two.
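The following short Python sketch (an illustration of mine, not from the original text) reproduces these comparisons; the children-per-family counts are invented simply to show the mode:

```python
# Mean, median, and mode for the small company's salaries.
from statistics import mean, median, mode

salaries = [10_000, 10_001, 10_002, 10_003, 99_999]
print(mean(salaries))    # 28001  (pulled upwards by the single large salary)
print(median(salaries))  # 10002  (unaffected by how extreme the largest value is)

# Replacing the extreme value barely changes the median but drags the mean right down.
salaries_capped = [10_000, 10_001, 10_002, 10_003, 10_004]
print(mean(salaries_capped), median(salaries_capped))   # 10002 10002

# The mode: the most frequently occurring value, e.g. children per family.
children_per_family = [1, 2, 2, 3, 2, 1, 4, 2]
print(mode(children_per_family))                        # 2
```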
[Figure 1. Distribution of American baseball players' salaries in 1994. The horizontal axis shows salaries in millions of dollars, and the vertical axis the numbers in each salary range.]

Dispersion

Averages, such as the mean and the median, provide single numerical summaries of collections of numerical values. They are useful because they can give an indication of the general size of the values in the data. But, as we have seen in the example above, single summary values can be misleading. In particular, single values might deviate substantially from individual values in a set of numbers. To illustrate, suppose that we have a set of a million and one numbers, taking the values 0, 1, 2, 3, 4, . . . , 1,000,000. Both the mean and the median of this set of values are 500,000. But it is readily apparent that this is not a very 'representative' value of the set. At the extremes, one value in the set is half a million larger and one value is half a million smaller than the mean (and median).

What is missing when we rely solely on an average to summarize a set of data is some indication of how widely dispersed the data are around that average. Are some data points much larger than the average? Are some much smaller? Or are they all tightly bunched about the average? In general, how different are the values in the data set from each other? Statistical measures of dispersion provide precisely this information, and as with averages there is more than one such measure.

The simplest measure of dispersion is the range. This is defined as the difference between the largest and smallest values in the data set. In our data set of a million and one numbers, the range is 1,000,000 − 0 = 1,000,000. In our example of five salaries, the range is $99,999 − $10,000 = $89,999. Both of these examples, with large ranges, show that there are substantial departures from the mean. For example, if the employees had been earning the respective salaries of $27,999, $28,000, $28,001, $28,002, $28,003 then the mean would also be $28,001, but the range would be only $4. This paints a very different picture, telling us that the employees with these new salaries earn much the same as each other. The large range of the earlier example, $89,999, immediately tells us that there are gross differences.

The range is all very well, and has many attractive properties as a measure of dispersion, not least its simplicity and ready interpretability. However, we might feel that it is not ideal. After all, it ignores most of the data, being based on only the largest and smallest values. To illustrate, consider two data sets, each consisting of a thousand values. One data set has one value of 0, 998 values of 500, and one value of 1000. The other data set has 500 values of 0 and 500 values of 1000. Both of these data sets have a range of 1000 (and, incidentally, both also have a mean of 500), but they are clearly very different in character. By focusing solely on the largest and smallest values, the range has failed to detect the fact that the first data set is mostly densely concentrated around the mean. This shortcoming can be overcome by using a measure of dispersion which takes all of the values into account.

One common way to do this is to take the differences between the (arithmetic) mean and each number in the data set, square these differences, and then find the mean of these squared differences. (Squaring the differences makes the values all positive, otherwise positive and negative differences would cancel out when we calculated the mean.) If the resulting mean of the squared differences is small, it tells us that, on average, the numbers are not too different from their mean. That is, they are not widely dispersed. This mean squared difference measure is called the variance of the data – or, in some disciplines, simply the mean squared deviation. Illustrating with our five students, their test scores were 78, 63, 53, 91, and 55 and their mean was 68. The squared difference between the first score and the mean is (78 − 68)² = 100, and so on. The sum of the squared differences is 100 + 25 + 225 + 529 + 169 = 1048, so that the mean of the squared differences is 1048 ÷ 5 = 209.6. This is the variance.

One slight complication arises from the fact that the variance involves squared values. This implies that the variance itself is measured in 'square units'. If we measure the productivity of farms in terms of tons of corn, the variance of the values is measured in 'tons squared'. It is not obvious what to make of this. Because of this difficulty, it is common to take the square root of the variance. This changes the units back to the original units, and produces the measure of dispersion called the standard deviation. In the example above, the standard deviation of the students' test scores is the square root of 209.6, or 14.5.
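As a check on this arithmetic, here is a small Python sketch (again my own illustration). It computes the range, the variance, and the standard deviation exactly as defined above, dividing by the number of values; note that some software, including Python's statistics.variance, divides by one fewer than the number of values, which gives a slightly larger answer.

```python
# Range, variance, and standard deviation for the five test scores.
import math

scores = [78, 63, 53, 91, 55]
mean = sum(scores) / len(scores)                    # 68.0

value_range = max(scores) - min(scores)             # 91 - 53 = 38
squared_diffs = [(x - mean) ** 2 for x in scores]   # 100, 25, 225, 529, 169
variance = sum(squared_diffs) / len(scores)         # 1048 / 5 = 209.6
std_dev = math.sqrt(variance)                       # about 14.5

print(value_range, variance, round(std_dev, 1))     # 38 209.6 14.5
```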
The standard deviation overcomes the problem that we identified with the range: it uses all of the data. If most of the data points are clustered very closely together, with just a few outlying points, this will be recognized by the standard deviation being small. In contrast, if the data points take very different values, even if they have the same largest and smallest value, the standard deviation will be much larger.

Skewness

Measures of dispersion tell us how much the individual values deviate from each other. But they do not tell us in what way they deviate. In particular, they do not tell us if the larger deviations tend to be for the larger values or the smaller values in the data set. Recall our example of the five company employees, in which four earned about $10,000 per year, and one earned around ten times that. A measure of dispersion (the standard deviation, for example) would tell us that the values were quite widely spread out, but would not tell us that one of the values was much larger than the others. Indeed, the standard deviation for the five values $90,000, $89,999, $89,998, $89,997, and $1 is exactly the same as for the original five values. What is different is that the anomalous value (the $1 value) is now very small instead of very large. To detect this difference, we need another statistic to summarize the data, one which picks up on and measures the asymmetry in the distribution of values.

One kind of asymmetry in distributions of values is called skewness. Our original employee salary example, with one anomalously large value of $99,999, is right skewed because the distribution of values has a long 'tail' stretching out to the single very large value of $99,999. This distribution has many smaller values and very few larger values. In contrast, the distribution of values given above, in which $1 is the anomaly, is left skewed, because the bulk of the values bunch together and there is a long tail stretching down to the single very small value. Right skewed distributions are very common. A classic example is the distribution of wealth, in which there are many individuals with small sums and just a few individuals with many billions of dollars. The distribution of baseball players' salaries in Figure 1 is heavily right skewed.

Quantiles

Averages, measures of dispersion, and measures of skewness provide overall summary statistics, condensing the values in a distribution down to a few convenient numbers. We might, however, be interested in just parts of a distribution. For example, we might be concerned with just the largest or smallest few – say, the largest 5% – values in the data set. We have already met the median, the value which is in the middle of the data in the sense that 50% of the values are larger and 50% are smaller. This idea can be generalized. For example, the upper quartile of a set of numbers is that value such that 25% (i.e. a quarter) of the data values are larger, and the lower quartile is that value such that 25% of the data values are smaller. This is taken further to produce deciles (dividing the data set into tenths, from the lowest tenth through to the highest tenth) and percentiles (dividing the data into 100ths). Thus someone might be described as scoring above the 95th percentile, meaning that they are in the top 5% of the set of scores. The general term, including quartiles, deciles, percentiles, etc., as special cases, is quantile.
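To see quartiles and percentiles computed in practice, here is a Python sketch using invented exam scores; the exact cut points depend on the interpolation convention used, so treat the printed numbers as illustrative rather than definitive.

```python
# Quartiles and percentiles of a made-up set of 1,000 exam scores.
import random
import statistics

random.seed(1)
scores = [random.gauss(60, 15) for _ in range(1000)]   # invented data for illustration

# statistics.quantiles returns the cut points dividing the data into n equal groups.
quartiles = statistics.quantiles(scores, n=4)       # lower quartile, median, upper quartile
percentiles = statistics.quantiles(scores, n=100)   # ninety-nine cut points
print("Lower quartile:", round(quartiles[0], 1))
print("Median:        ", round(quartiles[1], 1))
print("Upper quartile:", round(quartiles[2], 1))
print("95th percentile:", round(percentiles[94], 1))  # scores above this are in the top 5%
```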
Chapter 3 Collecting good data

Raw data, like raw potatoes, usually require cleaning before use.
Ronald A. Thisted

Data provide a window to the world, but it is important that they give us a clear view. A window with scratches, distortions, or with marks on the glass is likely to mislead us about what lies beyond, and it is the same with data. If data are distorted or corrupted in some way then mistaken conclusions can easily arise. In general, not all data are of high quality. Indeed, I might go further than this and suggest that it is rare to meet a data set which does not have quality problems of some kind, perhaps to the extent that if you encounter such a 'perfect' data set you should be suspicious. Perhaps you should ask what preprocessing the data set has been subjected to which makes it look so perfect. We will return to the question of preprocessing later.

Standard textbook descriptions of statistical ideas and methods tend to assume that the data have no problems (statisticians say the data are 'clean', as opposed to 'dirty' or 'messy'). This is understandable, since the aim in such books is to describe the methods, and it detracts from the clarity of the description to say what to do if the data are not what they should be. However, this book is rather different. The aim here is not to teach the mechanics of statistical methods, but rather to introduce and convey the flavour of the real discipline. And the real discipline of statistics has to cope with dirty data. In order to develop our discussion, we need to understand what could be meant by 'bad data', how to recognize them, and what to do about them. Unfortunately, data are like people: they can 'go bad' in an unlimited number of different ways. However, many of those ways can be classified as either incomplete or incorrect.

Incomplete data

A data set is incomplete if some of the observations are missing. Data may be randomly missing, for reasons entirely unrelated to the study. For example, perhaps a chemist dropped a test tube, or a patient in a clinical trial of a skin cream missed an appointment because of a delayed plane, or someone moved house and so could not be contacted for a follow-up questionnaire. But the fact that a data item is missing can also in itself be informative. For example, people completing an application form or questionnaire may wish to conceal something, and, rather than lie outright, may simply not answer that question. Or perhaps only people with a particular view bother to complete a questionnaire. For example, if customers are asked to complete forms evaluating the service they have received, those with axes to grind may be more inclined to complete them. If this is not recognized in the analysis, a distorted view of customers' opinions will result. Internet surveys are especially vulnerable to this kind of thing, with people often simply being invited to respond. There is no control over how representative the respondents are of the overall population, or even if the same people respond multiple times.

Other examples of this sort of 'selection bias' abound, and can be quite subtle. For example, it is not uncommon for patients to drop out of clinical trials of medicines. Suppose that patients who recovered while using the medicine failed to return for their next appointment, because they felt it was unnecessary (since they had recovered). Then we could easily draw the conclusion that the medicine did not work, since we would see only patients who were still sick.
A classic case of this sort of bias arose when the Literary Digest incorrectly predicted that Landon would overwhelmingly defeat Roosevelt in the 1936 US presidential election. Unfortunately, the questionnaires were mailed only to people who had both telephones and cars, and in 1936 these people were wealthier on average than the overall population. The people sent questionnaires were not properly representative of the overall population. As it turned out, the bulk of the others supported Roosevelt.

Another, rather different kind of case of incorrect conclusions arising from failure to take account of missing data has become a minor statistical classic. This is the case of the Challenger space shuttle, which blew up on launch in 1986, killing everyone on board. The night before the launch, a meeting was held to discuss whether to go ahead, since the forecast temperature for the launch date was exceptionally low. Data were produced showing that there was apparently no relationship between air temperature and damage to certain seals on the booster rockets. However, the data were incomplete, and did not include all those launches involving no damage. This was unfortunate because the launches when no damage occurred were predominantly made at higher temperatures. A plot of all of the data shows a clear relationship, with damage being more likely at lower temperatures.

As a final example, people applying for bank loans, credit cards, and so on, have a 'credit score' calculated, which is essentially an estimate of the probability that they will fail to repay. These estimates are derived from statistical models built (as described in Chapter 6) using data from previous customers who have already repaid or failed to repay. But there is a problem. Previous customers are not representative of all people who applied for a loan. After all, previous customers were chosen because they were thought to be good risks. Those applicants thought to be intrinsically poor risks and likely to default would not have been accepted in the first place, and would therefore not be included in the data. Any statistical model which fails to take account of this distortion of the data set is likely to lead to mistaken conclusions. In this case, it could well mean the bank collapsing.

If only some values are missing for each record (e.g. some of the answers to a questionnaire), then there are two common elementary approaches to analysis. One is simply to discard any incomplete records. This has two potentially serious weaknesses. The first is that it can lead to selection bias distortions of the kind discussed above. If records of a particular kind are more likely to have some values missing, then deleting these records will leave a distorted data set. The second serious weakness is that it can lead to a dramatic reduction in the size of the data set available for analysis. For example, suppose a questionnaire contains 100 questions. It is entirely possible that no respondent answered every question, so that all records may have something missing. This means that dropping incomplete responses would lead to dropping all of the data.

The second popular approach to handling missing values is to insert substitute values. For example, suppose age is missing from some records. Then we could replace the missing values by the average of the ages which had been recorded. Although this results in a complete(d) data set, it also has disadvantages. Essentially we would be making up data. If there is reason to suspect that the fact that a number is missing is related to the value it would have had (for example, if older people are less likely to give their age) then more elaborate statistical techniques are needed. We need to construct a statistical model, perhaps of the kind discussed in Chapter 6, of the probability of being missing, as well as for the other relationships in the data.
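A minimal sketch of these two elementary approaches, using a handful of invented questionnaire records in which age is sometimes missing, might look like this (the records and field names are assumptions made purely for the illustration, not data from the book):

```python
# Two elementary ways of handling missing values: drop the record, or impute the mean.
# None marks a missing age in these invented records.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 58},
    {"id": 4, "age": 41},
    {"id": 5, "age": None},
]

# Approach 1: discard incomplete records (risks selection bias and shrinks the data set).
complete = [r for r in records if r["age"] is not None]
print(len(complete), "of", len(records), "records kept")

# Approach 2: replace missing ages with the mean of the recorded ages.
recorded_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(recorded_ages) / len(recorded_ages)            # (34 + 58 + 41) / 3
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age) for r in records]
print([r["age"] for r in imputed])
```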
It is also worth mentioning that it is necessary to allow for the fact that not all values have been recorded. It is common practice to use a special symbol to indicate that a value is missing. For example, N/A, for 'not available'. But sometimes numerical codes are used, such as 9999 for age. In this case, failure to let the computer know that 9999 represents missing values can lead to a wildly inaccurate result. Imagine the estimated average age when there are many values of 9999 included in the calculation . . .

In general, and perhaps this should be expected, there is no perfect solution to missing data. All methods to handle it require some kind of additional assumptions to be made. The best solution is to minimize the problem during the data collection phase.

Incorrect data

Incomplete data is one kind of data problem, but data may be incorrect in any number of ways and for any number of reasons. There are both high and low level reasons for such problems. One high level reason is the difficulty of deciding on suitable (and universally agreed) definitions. Crime rate, referred to in Chapter 1, provides an example of this. Suicide rate provides another. Typically, suicide is a solitary activity, so that no one else can know for certain that it was suicide. Often a note is left, but not in all cases, and then evidence must be adduced that the death was in fact suicide. This moves us to murky ground, since it raises the question of what evidence is relevant and how much is needed. Moreover, many suicides disguise the fact that they took their own life; for example, so that the family can collect on the life insurance.

In a different, but even more complicated situation, the National Patient Safety Agency in the UK is responsible for collating reports of accidents which have occurred in hospitals. The Agency then tries to classify them to identify commonalities, so that steps can be taken to prevent accidents happening in the future. The difficulty is that accidents are reported by many thousands of different people, and described in different ways. Even the same incident can be described very differently.

At a lower level, mistakes are often made in reading instruments or recording values. For example, a common tendency in reading instruments is to subconsciously round to the nearest whole number. Distributions of blood pressure measurements recorded using old-fashioned (non-electronic) sphygmomanometers show a clear tendency for more values to be recorded at 60, 70, and 80mm of mercury than at neighbouring values, such as 69 or 72. As far as recording errors go, digits may be transposed (28, instead of 82); the handwritten digit 7 may be mistaken for 1 (less likely in continental Europe, where 7 is written with a crossbar); data may be put in the wrong column on a form, so accidentally multiplying values by 10; the US style of date (month/day/year) might be confused with the UK style (day/month/year) or vice versa; and so on.
In 1796, the Astronomer Royal Nevil Maskelyne dismissed his assistant, David Kinnebrook, on the grounds that the latter's observations of the times at which a chosen star crossed the meridian wire in a telescope at Greenwich were too inaccurate. This mattered, because the accuracy of the clock at Greenwich hinged on accurate measurements of the transit times, estimates of the longitude of the nation's ships depended on the clock, and the British Empire depended on its ships. However, later investigators have explained the inaccuracies in terms of psychological reaction time delays and the subconscious rounding phenomenon mentioned above. And, as a final example from the many I could have chosen, the 1970 US Census said there were 289 girls who had been both widowed and divorced by the age of 14.

We should also note the general point that the larger the data set, the more hands involved in its compilation, and the more stages involved in its processing, the more likely it is to contain errors.

Other low level examples of data errors often arise with units of measurement, such as recording height in metres rather than feet, or weight in pounds rather than kilograms. In 1999, the Mars Climate Orbiter probe was lost when it failed to enter the Martian atmosphere at the correct angle because of confusion between measurements based on pounds and on newtons. In another example of confusion of units, this time in a medical context, an elderly lady usually had normal blood calcium levels, in the range 8.6 to 9.1, which suddenly appeared to drop to a much lower value of 4.8. The nurse in charge was about to begin infusing calcium, when Dr Salvatore Benvenga discovered that the apparent drop was simply because the laboratory had changed the units in which it reported its results (from milligrams per decilitre to milliequivalents per litre).

Error propagation

Once made, errors can propagate with serious consequences. For example, budget shortfalls and possible job layoffs in Northwest Indiana in 2006 were attributed to the effect of a mistake in just one number working its way up through the system. A house that should have been valued at $121,900 had its value accidentally changed to $400 million. Unfortunately, this mistaken value was used in calculating tax rates. In another case, the Times of 2 December 2004 reported how 66,500 of around 170,000 firms were accidentally removed from a list used to compile official estimates of construction output in the UK. This led to a reported fall of 2.6% in construction growth in the first quarter, rather than the correct value of an increase of 0.5%, followed by a reported growth of 5.3% rather than the correct 2.1% in the second quarter.

Preprocessing

As must be obvious from the examples above, an essential initial component of any statistical analysis is a close examination of the data, checking for errors, and correcting them if possible. In some contexts, this initial stage can take longer than the later analysis stages.
A key concept in data cleaning is that of an outlier. An outlier is a value that is very different from the others, or from what is expected. It is way out in the tail of a distribution. Sometimes such extreme values occur by chance. For example, although most weather is fairly mild, we do get occasional severe storms. But in other instances anomalies arise because of the sorts of errors illustrated above, such as the anemometer which apparently reported a sudden huge gust of wind every midnight, coincidentally at the same time that it automatically reset its calibration. So one good general strategy for detecting errors in data is to look for outliers, which can then be checked by a human. These might be outliers on single variables (e.g. the man with a reported age of 210), or on multiple variables, neither of which is anomalous in itself (e.g. the 5-year-old girl with 3 children).

Of course, outlier detection is not a universal solution to detecting data errors. After all, errors can be made that lead to values which appear perfectly normal. Someone's sex may mistakenly be coded as male instead of female. The best answer is to adopt data-entry practices that minimize the number of errors. I say a little more about this below.

If an apparent error is detected, there is then the problem of what to do about it. We could drop the value, regarding it as missing, and then try to use one of the missing value procedures mentioned above. Sometimes we can make an intelligent guess as to what the value should have been. For example, suppose that, in recording the ages of a group of students, one had obtained the string of values 18, 19, 17, 21, 23, 19, 210, 18, 18, 23. Studying these, we might think it likely that the 210 had been entered into a wrong column, and that it should be 21. By the way, note the phrase 'intelligent guess' used above. As with all statistical data analysis, careful thought is crucial. It is not simply a question of choosing a particular statistical method and letting the computer do the work. The computer only does the arithmetic.

The example of student ages in the previous paragraph was very small, just involving ten numbers, so it was easy to look through them, identify the outlier, and make an intelligent guess about what it should have been. But we are increasingly faced with larger and larger data sets. Data sets of many billions of values are commonplace nowadays in scientific applications (e.g. particle experiments), commercial applications (e.g. telecommunications), and other areas. It will often be quite infeasible to explore all the values manually. We have to rely on the computer. Statisticians have developed automatic procedures for detecting outliers, but these do not completely solve the problem. Automatic procedures may raise flags about certain kinds of strange values, but they will ignore peculiarities they have not been told about. And then there is the question of what to do about an apparent anomaly detected by the computer. This is fine if only 1 in those billion numbers is flagged as suspicious, but what if 100,000 are so flagged? Again, human examination and correction is impracticable. To cope with such situations, statisticians have again developed automated procedures. Some of the earliest such automated editing and correcting methods were developed in the context of censuses and large surveys. But they are not foolproof.

The bottom line is, I am afraid, once again, that statisticians cannot work miracles. Poor data risk yielding poor (meaning inaccurate, mistaken, error-prone) results. The best strategy for avoiding this is to ensure good-quality data from the start.
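Pulling the student-ages example together with the idea of automatic screening, here is a toy Python sketch; the plausible age limits and the decision to read 210 as a misplaced 21 are my own assumptions for the illustration, and in practice a human would make that judgement:

```python
# Toy screening of recorded ages: flag values outside a plausible range for checking.
ages = [18, 19, 17, 21, 23, 19, 210, 18, 18, 23]

PLAUSIBLE = range(15, 100)            # illustrative limits for university applicants
flagged = [a for a in ages if a not in PLAUSIBLE]
print("Flagged for human inspection:", flagged)   # [210]

# One possible correction, after human judgement: 210 looks like 21 shifted a column,
# so replace it; alternatively we could treat it as missing instead.
cleaned = [21 if a == 210 else a for a in ages]
print(cleaned)
```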
Many strategies have been developed for avoiding errors in data in the first place. They vary according to the application domain and the mode of data capture. For example, when clinical trial data are copied from hand-completed case record forms, there is a danger of introducing errors in the transcription phase. This is reduced by arranging for the exercise to be repeated twice, by different people working independently, and then checking any differences. When applying for a loan, the application data (e.g. age, income, other debts, and so on) may be entered directly into a computer, and interactive computer software can cross-check the answers as they are given (e.g. if a house owner, do the debts include a mortgage?). In general, forms should be designed so as to minimize errors. They should not be excessively complicated, and all questions should be unambiguous. It is obviously a good idea to conduct a small pilot survey to pick up any problems with the data capture exercise before going live.

Incidentally, the phrase 'computer error' is a familiar one, and the computer is a popular scapegoat when data mistakes are made. But the computer is just doing what it is told, using the data provided. When errors are made, it is not the computer's fault.

Observational versus experimental data

It is often useful to distinguish between observational and experimental studies, and similarly between observational and experimental data. The word 'observational' refers to situations in which one cannot interfere or intervene in the process of capturing the data. Thus, for example, in a survey (see below) of people's attitudes to politicians, an appropriate sample of people would be asked how they felt. Or, in a study of the properties of distant galaxies, those properties would be observed and recorded. In both of these examples, the researchers simply chose who or what to study and then recorded the properties of those people or objects. There is no notion of doing something to the people or galaxies before measuring them. In contrast, in an experimental study the researchers would actually manipulate the objects in some way. For example, in a clinical trial they might expose volunteers to a particular medication, before taking the measurements. In a manufacturing experiment to find the conditions which yield the strongest finished product, they would try different conditions.

One fundamental difference between observational and experimental studies is that experimental studies are much more effective at sorting out what causes what. For example, we might conjecture that a particular way of teaching children to read (method A, say) is much more effective than another (method B). In an observational study, we will look at children who have been taught by each method, and compare their reading ability. But we will not be able to influence who is taught by method A and who by method B; this is determined by someone else. This raises a potential problem. It means that it is possible that there are other differences between the two reading groups, as well as teaching method. For example, to take an extreme illustration, a teacher may have assigned all the faster learners to method A. Or perhaps the children themselves were allowed to choose, and those already more advanced in reading tended to choose method A. If we are a little more sophisticated in statistics, we might use statistical methods to try to control for any pre-existing differences between the children, as well as other factors we think are likely to influence how quickly they would learn to read.
But there will always remain the possibility that there are other influences we have not thought of which cause the difference. Experimental studies overcome this possibility by deliberately choosing which child is taught by each method. If we did know all the possible factors, in addition to teaching method, which could influence reading ability, we could make sure that the assignment to teaching method was 'balanced'. For example, if we thought that reading ability was influenced by age, we could assign the same number of young children to each method. By this means, any differences in reading ability arising from age would have no impact on the difference between our two groups: if age did influence reading ability, the impact would be the same in each group. However, as it happens, experimental studies have an even more powerful way of choosing which child receives which method, called randomization. I discuss this below. The upshot of this is that, in an experimental study we can be more confident of the cause of any observed effect. In the experiment comparing teaching reading, we can be more confident that any difference between the reading ability in the two groups is a consequence of the teaching method, rather than of some other factor.

In general, when collecting data with the aim of answering or exploring certain questions, the more data that are collected, the more accurate an answer that can be obtained. This is a consequence of