How to effectively identify cognitive traps in data analysis models
This article looks at how to avoid being misled when others manipulate data.
First, a clarification: although we use the phrase "manipulating data", we are not concerned with motive. Whether the other party is deliberately deceiving, deliberately misleading, or simply making mistakes out of incompetence, we make no judgment and draw no distinction.
We only look at how to avoid being misled, from the perspective of data thinking, whether the other party acted intentionally or not.
A second clarification: the cases discussed here involve data that is true but has been misused in a way that misleads. Tampering with the data itself is out of scope.
Here is an example of the kind of tampering we exclude:
An Indian contractor was commissioned by the Indian government to supply refugees with food and daily necessities.
Because there was no exact count of the refugees, the government had to pay whatever the contractor claimed. The bills looked too large, however, and someone suggested asking statisticians for help.
The statisticians focused on three items: rice, beans and salt.
If the number of people is stable, consumption of these three foods is roughly stable too, so the headcount implied by each item can be cross-checked against the others. Salt turned out to imply the fewest people and rice the most. Nobody bothers to inflate the salt figure, because salt is cheap and the total quantity is small. Rice is expensive and the total quantity is large, so that is where the incentive to falsify the accounts lies.
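The cross-check can be sketched as follows. All quantities and per-person rations below are made up for illustration (the case as told gives no figures); the idea is only that each item's claimed consumption implies a headcount, and the implied headcounts should roughly agree.

```python
# Hypothetical figures for illustration only; the case as told gives no numbers.
claimed_kg_per_day = {"rice": 9_000, "beans": 1_500, "salt": 100}
ration_kg_per_person = {"rice": 0.45, "beans": 0.10, "salt": 0.01}

# Each item's claimed consumption implies its own headcount.
implied_headcount = {
    item: claimed_kg_per_day[item] / ration_kg_per_person[item]
    for item in claimed_kg_per_day
}

for item, people in sorted(implied_headcount.items(), key=lambda kv: kv[1]):
    print(f"{item:>5}: about {people:,.0f} people")

lowest = min(implied_headcount, key=implied_headcount.get)
highest = max(implied_headcount, key=implied_headcount.get)
print(f"lowest estimate from {lowest}, highest from {highest}")
```

A large gap between the lowest and highest implied headcount is the signal worth investigating; the cheap, low-volume item is the more trustworthy baseline.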
This is a case of forging data for an illegal purpose, the least sophisticated kind of fraud. It is not what we are discussing here.
Misleading people with real data, through various manipulations, takes far more skill. That is the main subject here.
There are three ways to manipulate data to mislead an audience: manipulating how the data is used, how it is generated, and how it is interpreted.
Manipulating how data is used
Examples of this abound. Here are a few:
Hiding the distribution behind an average:
"A company has 3,003 shareholders, with an average of 660 shares each." The truth being hidden is this: the company has 2 million shares, of which 3 major shareholders hold three quarters and the remaining 3,000 shareholders hold one quarter.
Hiding the scale behind a percentage:
"One third of the women at Johns Hopkins University married faculty members." But in fact only three women were enrolled at the time, and one of them married a faculty member.
Substituting a short-term fluctuation for the long-term picture:
"The Ministry of Health recently announced that the death toll in the London area soared to 2,800 during the week of fog." Was that because of the fog? What is the usual weekly death toll there? And what happened to the death toll over the following weeks?
Leaving out the reasons for a change:
"Over the past 25 years, the number of cancer deaths has increased." It sounds frightening, but several other factors explain much of the rise. Many cases whose cause used to be recorded as unknown are now diagnosed as cancer; autopsies have become a common way to confirm a diagnosis; medical statistics are more complete; and more people now live into the susceptible age groups. Besides, the population is simply much larger than it used to be.
Swapping one concept for another:
"One member proposed that it would be cheaper to let prisoners leave the prison and stay in a hotel, because a prisoner costs 8 dollars a day while a hotel costs only 7." But the $8 covers all of a prisoner's living expenses, while the member compared it only with the hotel's room rate.
Inconsistent definitions:
Several platforms each claim to rank first in traffic, citing as evidence that a TV series they carry has the top ratings. But each uses a different definition: some use average ratings, some use the peak rating of a single episode, and some use the combined ratings of the first run and the reruns.
Ignoring measurement error:
"Li Lei's IQ is 101 and Han Meimei's is 99, so Li Lei is smarter than Han Meimei." But every measurement has error, and the result should be reported with an interval, say ±3%. The IQ ranges of Li Lei and Han Meimei then overlap, and it is impossible to say who is smarter.
A difference too small to matter:
"A large-scale IQ test found that boys averaged 106.1 and girls averaged 105.9." Even if such a difference is statistically significant, it is too small to have any practical meaning.
An unclear reference point:
"This juicer's juicing performance has been improved by 26%." Compared with what? What if the comparison is with an old hand-operated juicer?
Ignoring the base when comparing:
"There are four times as many accidents on the expressway at 7 p.m. as at 7 a.m., so your chance of surviving is four times higher in the morning." In fact, there are more accidents in the evening simply because there are more cars on the expressway then.
Forcing a comparison between different populations:
"During the Spanish-American War, the death rate in the US Navy was 9‰, while the death rate among New York residents was 16‰, so sailors were safer." In fact, the two groups are not comparable. The navy consists mainly of strong young men, while city residents include babies, the elderly and the sick, whose mortality is high everywhere.
A changing base creates illusions:
A 50% discount followed by a further 20% discount feels like 70% off. In fact the total discount is only 60%, because the second 20% is taken off the price after the first 50% discount.
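The compounding can be checked directly: each discount applies to the price left after the previous one, so 50% and then 20% leaves 0.5 × 0.8 = 40% of the original price. A minimal sketch (the function name `combined_discount` is made up for illustration):

```python
def combined_discount(*discounts):
    """Total discount when each discount is applied to the already reduced price."""
    remaining = 1.0
    for d in discounts:
        remaining *= 1 - d
    return 1 - remaining

total = combined_discount(0.50, 0.20)
print(f"total discount: {total:.0%}")
```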
Playing number games to steer the audience's feelings:
The return on investment was 3% in the first year and 6% in the second. Both of the following statements are true: 1. It rose by 3 percentage points; 2. It grew by as much as 100%. Which one gets presented depends on how the presenter wants the audience to feel.
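Both framings come from the same two numbers, as this small sketch shows (the function name `describe_change` is hypothetical):

```python
def describe_change(old_rate, new_rate):
    """Return (percentage-point change, relative change) for two rates given in percent."""
    point_change = new_rate - old_rate
    relative_change = (new_rate - old_rate) / old_rate
    return point_change, relative_change

points, relative = describe_change(3.0, 6.0)
print(f"+{points:.0f} percentage points, or {relative:.0%} growth")
```

When a report quotes only one of the two, it is worth computing the other before reacting.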
Manipulating how data is generated
There are many examples here too, such as:
Different rules built into the algorithm:
An experiment used two algorithms to judge traffic violations. One "strictly follows the letter of the law" and is called the provisions version: as soon as the speed exceeds the limit, a ticket is issued. The other follows the safety principle: if the speed was safe in the circumstances, there is no penalty. For example, there may be no other cars around, or everyone may be driving fast, in which case slowing down turns you into a moving obstacle and makes the road less safe. Because this rule "accurately reflects the intent of the law", it is called the intention version.
In the experiment, under the same traffic conditions, the provisions version issued 500 tickets while the intention version issued only 1. So how serious is the traffic-violation problem?
Badly designed experimental conditions:
One paper that won a provincial prize claimed that Ejiao has a good nutritional effect. The method was to make mice malnourished and then feed them Ejiao. Every measure came out better than in the control group. Ejiao looks effective, until you check the control group: those malnourished mice were given only water. The experiment therefore compares being fed with not being fed, not Ejiao with ordinary nutrition, so its conclusion is unreliable.
To head off quibbles: my point is only that this paper's conclusion is unreliable. I am not discussing whether Ejiao is nutritious.
The order of questions affects respondents' answers:
Surveys show that if women are asked about clothing advertisements first and about advertising in general afterwards, they report a more positive attitude towards advertising.
Surveys of the general public show similar order effects. For example, if respondents are first asked whether their married life is happy and then whether their life overall is happy, they automatically set aside their feelings about married life when rating their life overall. Reverse the order and the answers come out differently.
Manipulating how data is interpreted
A few examples give the flavor:
Attribution error:
A flight instructor declared with great confidence: "Criticism makes people improve, while praise makes them slip back." He had noticed that whenever he praised a student, the student performed worse the next day, and whenever he criticized a student, the student performed better the next day.
In fact, this is regression to the mean. A student who is praised today has performed above his own average, and it is normal for him to fall back towards that average the next day. Likewise, a student who is criticized has performed below his average and will tend to do better the next day.
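A small simulation makes the regression effect visible. It is a sketch under a deliberately extreme assumption: every student has the same true skill, daily scores are pure independent noise, and feedback has no effect whatsoever. The group sizes (top and bottom 20%), the number of students and the seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_STUDENTS = 10_000
# Assumption: every student has the same true skill (0) and each day's
# score is skill plus independent noise; feedback has no effect at all.
day1 = [random.gauss(0, 1) for _ in range(N_STUDENTS)]
day2 = [random.gauss(0, 1) for _ in range(N_STUDENTS)]

ranked = sorted(range(N_STUDENTS), key=lambda i: day1[i])
cut = N_STUDENTS // 5
criticized = ranked[:cut]   # worst 20% on day 1
praised = ranked[-cut:]     # best 20% on day 1

def mean_change(group):
    """Average day-2 score minus day-1 score for the given students."""
    return sum(day2[i] - day1[i] for i in group) / len(group)

praised_change = mean_change(praised)
criticized_change = mean_change(criticized)
print(f"praised:    {praised_change:+.2f}")
print(f"criticized: {criticized_change:+.2f}")
```

Under these assumptions the praised group still scores lower on day 2 and the criticized group higher, although neither praise nor criticism did anything.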
Causality not established:
In India, researchers found that people who watch TV hold more positive attitudes towards gender equality. Does this mean television should be promoted to change attitudes towards women in rural India?
The fact is that better-educated people are more likely to be able to afford a TV, and better-educated people are also more open to gender equality. Watching TV often and holding positive views on gender equality are correlated, not causally linked.
Misapplying a theory:
A joke circulating online says that 8,000 people supported 1 civil servant in the Han Dynasty, 3,000 in the Tang Dynasty, 2,000 in the Ming Dynasty and 1,000 in the Qing Dynasty, but only 18 do today, implying that there are too many civil servants now.
The error is scaling a ratio up or down unconditionally. As the population grows, the demand for public services grows not linearly but geometrically. Only within a sound theoretical framework can we judge whether 18 people per civil servant is many or few. ...