Data Journalism: Lie With Bad Sample Surveys
Darrell Huff's 1954 book “How to Lie with Statistics” should be compulsory reading for anyone who wants to do data-based journalism. The book is very short and covers the ways data is used to lie, aka statisticulation. In 2016, 62 years later, we still see the same mistakes and manipulations by journalists and news organizations.
“Misinforming people by the use of statistical material might be called statistical manipulation; in a word (though not a very good one), statisticulation.” - Darrell Huff
It's really difficult to run sample surveys well because, after all, they rely on a small chosen sample. Unless that sample is chosen carefully, a survey can easily have built-in bias and a skewed distribution. As someone with a deep interest in data, I like to see a lot of detail in stories that are based on sample surveys. In this post I want to focus on two stories published by two reasonably popular news outlets. Both stories are based on sample surveys.
Case study 1: Indian Express
This story in The Indian Express boldly states “70 per cent Indians want Modi back as PM in 2019: Poll” (archive). The story has the subtitle “In the survey, conducted between July 25 and August 7, 80 per cent of the respondents were aged below 35.” With that title and subtitle, one would think the younger generation wants Modi back as PM in 2019. Only later in the story do you learn that the poll had 63,141 participants and was conducted on an app. Now can you see the bias in this sample survey? The story doesn’t share any raw data or process to explore, but goes on to make many more oversimplified, skewed, exaggerated statements.
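To illustrate why an opt-in app poll can mislead even with tens of thousands of participants, here is a minimal simulation sketch. Every number in it is an assumption I made up for illustration (a population split 50/50, supporters twice as likely to answer the poll); none of it comes from the Indian Express story, and the function names are hypothetical.

```python
import random

def opt_in_poll(population, rate_if_yes, rate_if_no, rng):
    """Self-selected poll: each person decides whether to answer.
    The response rates are assumed and differ by opinion."""
    answers = []
    for supports in population:
        rate = rate_if_yes if supports else rate_if_no
        if rng.random() < rate:
            answers.append(supports)
    return answers

def random_poll(population, sample_size, rng):
    """Simple random sample: everyone is equally likely to be picked."""
    return rng.sample(population, sample_size)

def share(answers):
    """Fraction of answers that are 'yes'."""
    return sum(answers) / len(answers)

rng = random.Random(2016)   # fixed seed so the run is repeatable
TRUE_SUPPORT = 0.50         # assumed true support, not a real figure

# Hypothetical population of 200,000 people, True = supports.
population = [rng.random() < TRUE_SUPPORT for _ in range(200_000)]

# Assumption: supporters are twice as likely to take the app poll.
opt_in = opt_in_poll(population, rate_if_yes=0.08, rate_if_no=0.04, rng=rng)
randomized = random_poll(population, 1_000, rng)

print(f"true support in population: {share(population):.1%}")
print(f"opt-in poll, n={len(opt_in)}: {share(opt_in):.1%}")
print(f"random sample, n={len(randomized)}: {share(randomized):.1%}")
```

Under these assumed rates, the opt-in poll collects many more responses than the random sample yet lands well above the true figure, while the much smaller random sample stays close to it. The point is only that size does not correct for who chooses to respond; the real poll's response behaviour is unknown because the story gives no process details.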
Case study 2: Zee News
This story is by Zee News India. The headline of the story is “70% want Narendra Modi to be PM till 2024, 62% happy with his performance: Survey” (archive). To a regular reader it seems like 70% of India wants Modi to continue as PM till 2024. Only if you read the complete story will you learn that the sample survey had 4,000 respondents across rural and urban areas, and that the participants were from 15 states. The article doesn’t publish any details about the states, cities or villages where the survey was conducted, or how they were chosen. Nor does it publish any raw data or process. Can you see how skewed the survey is? A standard case of statisticulation.
Have you read stories where data was used to lie, in the sense Darrell Huff uses the term? Please comment here. I would like to have a look at them.
[Update 6/Sep/2016]: I received some comments yesterday. I am adding more information here to address them.
- Comment: The sample size is good enough
Response: If you consider just the size, yes, I agree that the sample size is good enough. But a sample is not good or bad just because of its size. I emphasized the sample sizes because they are the only data points available in the articles, and they are important. I have removed that emphasis now. My argument was not just about the size of the samples but about how biased and skewed they were. I have deleted "billion people" and the word "only", as they don't add anything in that context and can lead readers to concentrate on less important things. The rest of my comments remain the same. I have added a couple of words to express my views clearly.
- Comment: What are the biases?
Response: I am going to name a few biases which I think these surveys suffer from.
IE: A survey conducted on an app suffers from self-selection bias, opportunity sampling, etc.
Zee: The article doesn't say how the cities and rural areas were chosen. 15 states out of 29 states and 7 union territories seems skewed to me, unless they tell us which states were chosen and how. I think it's not a representative sample.
- Comment: Tell me what's the best sample, then?
Response: It needs time and money.
- Comment: Your post is biased or ignorant or stupid
Response: Write your own blog post with an explanation. I will read it.
- Comment: I have more to add
Response: Comment below
- Note: I ask the following five questions of any data-based story. That's how I try to validate a data story.
Who says so? How does he know? What’s missing? Did somebody change the subject? Does it make sense?
Of course, none of the comments mentioned my biggest complaint: the lack of raw data and process, without which readers cannot verify and reproduce the results.
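To put rough numbers on the "sample size is good enough" comment above, here is a small sketch of the textbook margin-of-error formula for a proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case p = 0.5. Neither article says its sample was random, so treat these as best-case figures, not as the surveys' actual error.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion from a simple random sample.
    Assumes random sampling; it says nothing about selection bias."""
    return z * math.sqrt(p * (1 - p) / n)

# Respondent counts as reported in the two stories.
surveys = [("Zee News survey", 4_000), ("Indian Express app poll", 63_141)]

for label, n in surveys:
    moe = margin_of_error(n)
    print(f"{label}: n={n}, margin of error = +/- {moe * 100:.2f} points")
```

On size alone, both samples would give a margin of error of under two percentage points. That is exactly why the size is not my objection: the formula only holds when respondents are picked at random, and it cannot account for self-selection or for an unexplained choice of states.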