15 data mistakes you should avoid
Challenges in business intelligence
Data analyses are crucial for valuable insights and sustained success, irrespective of company size and industry. Unfortunately, the results of these analyses are often sobering, as numerous factors can introduce inaccuracies into the findings. We explain which stumbling blocks in data analysis you should be aware of and how you can avoid them to utilise the full potential of your data. Join us in exploring data aberrations such as the “Anchoring Effect”, “Simpson’s Paradox” or the infamous “Gambler’s Fallacy” to gain a deeper understanding of the most common misconceptions in business intelligence.
Data errors and how to avoid them
1. Cherry Picking:
Cherry picking is a common data trap where only certain data points or information are selectively chosen to support a thesis, while other relevant data is deliberately ignored. This misleading approach can completely misrepresent the results of a data analysis, as it distorts the overall picture of the data. Cherry picking can, for example, lead to a situation being presented as significantly better or worse than it actually is. Imagine your marketing department wants to analyse the efficiency of a product. If only positive customer reviews or success stories are used for this purpose, you can assume the analysis will show a distorted picture of reality. In this particular case, the analysis will paint a very positive picture of product efficiency. However, if there are many negative reviews or critical voices that are not considered in the analysis, it could be that your product is not particularly efficient at all. So, to optimise or further develop your product, you should include these negative statements in the analysis.
Solution: To prevent cherry-picking, it is crucial to carry out a systematic and transparent data analysis. Therefore, all available data should be collected, objective statistical methods should be applied and all data should be published. Take special care to include data that do not fit the hypothesis you are trying to prove. Peer reviews and external reviews by independent experts can also help to identify and correct such bias. By presenting the data honestly and completely, you ensure analyses are based on a solid foundation and are not influenced by cherry-picking.
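The effect of cherry picking is easy to make visible with a few lines of code. The minimal sketch below uses invented review scores to show how averaging only the hand-picked positive reviews paints a very different picture than averaging all of them:

```python
import statistics

# Hypothetical 1-5 star product reviews (invented for illustration)
reviews = [5, 5, 4, 5, 1, 2, 1, 3, 2, 1]

# Cherry-picked view: only reviews with 4 stars or more
positive_only = [r for r in reviews if r >= 4]

print(f"cherry-picked average: {statistics.mean(positive_only)}")  # 4.75
print(f"honest average:        {statistics.mean(reviews)}")        # 2.9
```

The selective average suggests an excellent product, while the full data set tells a much more mixed story.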
2. Survivorship Bias:
Survivorship bias is a bias that occurs when only the successful or surviving cases are considered in an analysis, while the unsuccessful or non-surviving cases are omitted. This leads to an unrealistic representation of the chances of success, as important data on failure is missing. This data bias can thus lead to false conclusions, as the omitted data can make up an important part of the overall picture. Survivorship bias is often found, for example, in studies on successful companies or famous personalities. The stories of successful companies or people are often analysed, while failed companies or unknown people are not considered. This leads to a distorted assessment of success factors. A particularly frequently cited case of survivorship bias is the study of aircraft in the Second World War. To decide where armour should be reinforced, returned aircraft with bullet holes were examined first. Reinforcing aircraft based on the distribution of those bullet holes, however, contained a critical flaw: the analysis overlooked the aircraft that had crashed. Including those in the analysis led to the seemingly paradoxical conclusion that the parts with the fewest bullet holes should be reinforced, as most aircraft hit in those areas did not make it back.
Solution: Use a comprehensive database that includes all successful as well as all failed cases to prevent this phenomenon. Since, as in the example above, it is not always guaranteed a complete set of data is available, you should take a critical look at the data before analysing it to avoid drawing false conclusions. You should therefore always be aware of possible missing data and look specifically for such cases to minimise or prevent distortion due to survivorship bias.
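A small simulation shows how strongly the survivors alone distort the picture. The sketch below uses invented numbers: 1,000 ventures whose true average return is zero, of which only the strong performers "survive" long enough to be studied:

```python
import random
import statistics

random.seed(3)

# Hypothetical: 1,000 ventures; the true average return is 0 %.
all_returns = [random.gauss(0.0, 0.3) for _ in range(1000)]

# Only strong performers "survive" long enough to be analysed.
survivor_returns = [r for r in all_returns if r > 0.2]

print(f"true mean return:       {statistics.mean(all_returns):+.1%}")
print(f"survivors' mean return: {statistics.mean(survivor_returns):+.1%}")
```

Analysing only the survivors yields a strongly positive average return even though, across all ventures, the expected return is zero.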
3. Cobra effect / perverse incentive:
The cobra effect refers to a situation in which a proposed solution to a problem has undesirable side effects which exacerbate the problem or create new problems. It is therefore a false incentive. The term originates from an anecdote from colonial times in India: at that time, many people in India were dying of cobra bites. In order to rid the population of cobras, the British colonial rulers offered a reward for every cobra caught. Unfortunately, they did not realise this could provide the wrong incentive. In response, the locals began to breed cobras in exchange for the reward. After the government ended the initiative, these bred cobras were often released into the wild, leading to a drastic increase in the cobra population rather than a decrease.
We can also often observe the cobra effect in economies: For example, if a government tries to reduce inflation by drastically reducing the money supply, this can lead to a deterioration in economic conditions. The population then has less money to invest and spend. This in turn can lead to a drop in economic activity.
Solution: To avoid the cobra effect, it is crucial to carefully consider the long-term impact of any proposed solution to ensure unwanted side effects are avoided. Engaging with experts and stakeholders can help to consider different perspectives and recognise unforeseen consequences before a solution is implemented. Continuous monitoring and adjustment of measures are also important to ensure the cobra effect and similar undesirable consequences are avoided.
4. False causality:
False causality is an error that occurs when it is assumed a cause-effect relationship exists between two events, even though they only show a random correlation or other hidden variables explain the relationship. A classic example is the correlation between the increase in ice cream sales and the increase in swimming pool accidents in summer. A quick glance at such an analysis might suggest that swimming pool accidents are caused by higher ice cream consumption. However, both events are caused by the warm season.
Solution: Distinguish carefully between correlation and causality to avoid this error. A correlation measures the statistical relationship between two variables. Causal relationships, on the other hand, provide information about cause and effect. A correlation can therefore indicate a causal relationship, but this does not necessarily have to be the case. Statistical methods such as experiments and control groups can help to identify actual cause-and-effect relationships. Therefore, analyse all available data and check alternative explanations for observed correlations. In addition, in-depth knowledge of the specific subject area can help to better understand relevant correlations and avoid unfounded assumptions. A consciously critical analysis and an open attitude towards different possible interpretations are crucial to prevent incorrect conclusions based on false causality.
5. Data Fishing:
Data fishing, also known as p-hacking or data dredging, refers to the practice of searching large amounts of data for statistically significant results or patterns without testing a specific hypothesis. This can lead to misleading results, as some statistically significant findings are to be expected if enough tests are performed, even if there is no actual effect. Researchers might, for instance, test hundreds of variables against a specific target and then only present the results that appear statistically significant. If a drug trial tests the effect of different doses of a drug on a variety of symptoms, researchers should consider all results. However, if data fishing is used to select only the dosage that shows a statistically significant effect on one symptom, without taking the other tests into account, this leads to a distorted presentation of the results.
Solution: To prevent data fishing, it is important to define a clear hypothesis before data collection and to plan the analysis methods in advance. If multiple tests are performed, a multiple-comparison adjustment such as the Bonferroni correction should be applied to reduce the risk of false positives. Transparency and openness are also crucial. You should document all tests performed and their results, even if they are not significant. This enables a comprehensive assessment and prevents selective reporting of results that could be biased by data fishing.
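The danger of uncorrected multiple testing can be simulated directly. Under the null hypothesis, p-values are uniformly distributed, so the sketch below draws 100 such p-values (i.e. no real effect anywhere) and compares a naive 0.05 threshold with the Bonferroni-corrected one:

```python
import random

random.seed(0)

# 100 simulated tests in which no real effect exists:
# under the null hypothesis, p-values are uniform on [0, 1].
n_tests = 100
alpha = 0.05
p_values = [random.random() for _ in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)                # false positives
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)  # corrected

print(f"'significant' at alpha = 0.05: {naive_hits}")
print(f"significant after Bonferroni:  {bonferroni_hits}")
```

With 100 tests and no real effect at all, around five spurious "hits" are expected at the 0.05 level, while the corrected threshold of 0.0005 filters almost all of them out.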
6. Confirmation Bias:
Confirmation bias is the tendency to preferentially seek information or data that confirms existing beliefs or hypotheses, while ignoring or rejecting contradictory information. This is because people unconsciously seek confirmation for what they already believe instead of objectively evaluating all available information. This can lead to a biased interpretation of data. A real-life example would be an investor who tends to only pay attention to news and analyses supporting their positive view of a stock, while ignoring negative reports or warnings of potential risks.
Solution: To prevent confirmation bias, it is important to be aware of this tendency and actively combat it. The first step is to promote an open and critical mindset. In science, methods such as double-blind studies and peer reviews help to ensure objective assessments. In your organisation, you can seek opinions and feedback from people with different views and experiences to challenge and expand your point of view. It is also helpful to regularly check yourself to see whether you are remaining objective when evaluating information or unconsciously seeking confirmation. The influence of confirmation bias can be minimised through conscious self-reflection and the use of different perspectives.
7. Regression to the Mean:
Regression to the mean describes the phenomenon where extremely high or low values in a measurement tend to return to less extreme values when the measurement is repeated. This happens independently of any intervention or change and is based on random fluctuations in the data. An example of this is academic performance: students who perform exceptionally well in one test are likely to achieve less outstanding results when they retake a comparable test later. This is due to normal fluctuations, for example in the students’ day-to-day form.
Solution: To avoid misinterpreting regression to the mean, it is important to understand that extreme values can often occur by chance and do not necessarily indicate a cause-and-effect relationship. Therefore, when evaluating performance or outcomes, you should not overreact to extreme values, as they will tend to revert to less extreme values when the measurement is repeated. It is advisable to use statistical methods to recognise the random nature of extreme values and to always consider the context when interpreting data. Regular checks and critical analysis can help to draw reliable conclusions without being influenced by random fluctuations.
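The test-score example is easy to simulate. In the sketch below (invented numbers), each student's score is a stable ability plus random day-to-day noise; the top 10 percent from the first test score noticeably lower, on average, on the retest, without any intervention at all:

```python
import random
import statistics

random.seed(1)

# Each student's score = stable ability + random day-to-day noise.
ability = [random.gauss(70, 10) for _ in range(1000)]
test1 = [a + random.gauss(0, 10) for a in ability]
test2 = [a + random.gauss(0, 10) for a in ability]

# Select the top 10 % of students on the first test ...
cutoff = sorted(test1)[-100]
top = [i for i, score in enumerate(test1) if score >= cutoff]

# ... and compare their averages on both tests.
mean_first = statistics.mean(test1[i] for i in top)
mean_retest = statistics.mean(test2[i] for i in top)

print(f"top group, first test: {mean_first:.1f}")
print(f"top group, retest:     {mean_retest:.1f}")  # regressed toward the mean
```

The top group's retest average drops towards the population mean even though their underlying ability is unchanged, because part of their first result was simply luck.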
8. Anchoring effect:
The anchoring effect, also known as anchoring bias, refers to the tendency to be strongly influenced by an initial value or piece of information when making decisions. Even if this anchor is irrelevant or based on a false assumption, people tend to orientate themselves strongly towards it. For example, the first price quoted in a price negotiation is an anchor that has been shown to strongly influence the outcome of the negotiation: if a seller sets the opening price very high, buyers will tend to orientate their own offers closer to this high price.
Solution: Realise how anchors can influence our decisions. To do this, actively distance yourself from an initially named value and use objective evaluation criteria. It can be helpful to consider alternative anchor values that are based on objective data and use these as the basis for decisions. For example, in negotiations, it can be useful to focus on relevant facts and comparative prices in order to be less influenced by an arbitrary starting point. Conscious decision-making based on sound data and analyses can help to minimise the impact of the anchoring heuristic. The reverse is also true, for example, if you want to collect data. For instance, if you are designing a survey, you should be aware respondents may be influenced by the anchoring effect, which in turn may affect the validity of the survey. In such cases, choose anchor values very carefully or do not use them if possible.
9. Simpson’s Paradox:
Simpson’s paradox describes a statistical illusion in which a trend in the overall data occurs in the opposite direction to the trend in the individual groups. This means an observation that appears in an overall analysis can be reversed when the data is split into different subgroups. A practical example could be a study on the treatment success of two different hospitals. In the overall analysis, one hospital might have a higher survival rate. However, when the data is broken down by severity of illness, the other hospital might be found to have a higher survival rate at all levels of severity.
Solution: To avoid the Simpson’s paradox, it is important to pay careful attention to possible interactions between variables in statistical analyses. It is advisable to look more closely at significant differences in the overall data to see if these differences are consistent across subgroups. A more in-depth analysis considering different variables and investigating possible interactions between them can help to recognise and understand the paradox. For very complex data, collaboration with experienced statisticians or data analysts is often advisable to ensure accurate and reliable interpretation of the results.
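The hospital example can be checked with a handful of invented patient counts. In the sketch below, Hospital A has the better overall survival rate, yet Hospital B is better for both mild and severe cases, because A treats mostly mild cases while B treats mostly severe ones:

```python
# Hypothetical counts: (survivors, patients) per hospital and case severity.
data = {
    "Hospital A": {"mild": (870, 900), "severe": (30, 100)},
    "Hospital B": {"mild": (99, 100), "severe": (315, 900)},
}

for hospital, groups in data.items():
    survivors = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    per_group = ", ".join(f"{g}: {s / n:.1%}" for g, (s, n) in groups.items())
    print(f"{hospital}: overall {survivors / patients:.1%} ({per_group})")
```

Hospital A wins overall (90.0 % vs 41.4 %) yet loses in every subgroup (96.7 % vs 99.0 % for mild, 30.0 % vs 35.0 % for severe): the case mix, not the quality of care, drives the aggregate numbers.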
10. Ecological Fallacy:
The ecological fallacy refers to the incorrect conclusion about individual characteristics based on aggregated, group-based data. This bias occurs when statistical correlations at the group level are applied to individuals without taking individual differences into account. For example, if you look at a wealthy city where the average income of the inhabitants is very high, you might conclude that all inhabitants of the city are wealthy. In reality, it is more likely that even in such a city there are considerable differences in income among the individual inhabitants, so that some inhabitants could be very rich, while others could be very poor.
Solution: To avoid the ecological fallacy, it is important to distinguish between aggregated and individual characteristics when interpreting data. Data analyses should therefore not only be conducted at the group level, but also at the individual level to get a more accurate idea of the actual differences. Be aware that aggregated data cannot necessarily be transferred to individual experiences or characteristics, pay attention to the context of the data, and rely on appropriate data sources and methods of analysis to avoid drawing false conclusions.
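A tiny example shows how an aggregate figure can misrepresent almost every individual behind it. In the sketch below (invented incomes), one very rich resident pushes the city's mean income far above what a typical resident earns:

```python
import statistics

# Hypothetical annual incomes in a "wealthy" city (invented numbers).
incomes = [25_000, 30_000, 28_000, 32_000, 27_000, 2_000_000]

print(f"mean income:   {statistics.mean(incomes):>12,.0f}")    # looks wealthy
print(f"median income: {statistics.median(incomes):>12,.0f}")  # typical resident
```

The mean of 357,000 suggests a city full of wealthy residents, while the median of 29,000 reflects what most individuals actually earn.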
11. Goodhart’s Law:
Goodhart’s Law, named after the British economist Charles Goodhart, states that an observed statistical relationship loses its predictive power as soon as it is made the basis for decision-making. Simply put, this means when a particular ratio or metric is made the basis for rewards or sanctions, people or organisations develop strategies to optimise that ratio. This often leads to undesirable side effects. For example, if a company uses the sales figures of a product as a performance indicator for its sales staff and as the basis for a bonus, the staff may tend to use short-term sales strategies to receive the bonus. They may sell a lot of products, but this strategy could have a detrimental effect on your company in the long term.
Solution: To prevent Goodhart’s Law, it is important to develop a holistic and balanced performance appraisal. This can be done by using multiple performance metrics to evaluate different aspects of performance. It is advisable to consider different angles to assess the overall performance of an individual or organisation. In addition, it is important to regularly review and adjust the metrics and indicators to ensure they continue to provide relevant and meaningful information without creating incentives for undesirable behaviour. A critical review of the performance metrics used and their potential impact on behaviour can help to minimise the negative effects of Goodhart’s Law.
12. Gambler’s Fallacy:
The gambler’s fallacy is a cognitive bias in which people believe random events are influenced by their previous outcomes or frequencies. They incorrectly assume a certain series of events, such as a long losing streak in gambling, must lead to a future positive outcome to restore equilibrium. A simple example is the assumption that, after a series of heads when tossing a coin, tails becomes more likely. Statistically speaking, the proportions of heads and tails will indeed approach 50 percent each in the long run. Nevertheless, each toss is independent of the previous one and therefore has a 50 percent probability for each possible outcome. The situation is similar in sales, for example. You should not assume the probability of a salesperson selling your product at the next customer meeting will increase if they were unsuccessful in the previous meetings. Rather, the salesperson has the same statistical probability of selling at each meeting.
Solution: To avoid the gambler’s fallacy, it is important to realise random events are not influenced by previous outcomes. Statistically, odds do not change based on past outcomes. Understanding the basic principles of probability can help to develop realistic expectations and overcome the gambler’s fallacy.
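The coin example can be verified empirically. The simulation below checks what happens directly after a streak of five heads: the next toss still comes up heads about half the time:

```python
import random

random.seed(7)

# Simulate a long sequence of fair coin tosses (True = heads).
flips = [random.random() < 0.5 for _ in range(200_000)]

# Collect the toss that follows every run of five heads in a row.
after_streak = [flips[i] for i in range(5, len(flips)) if all(flips[i - 5:i])]

p_heads = sum(after_streak) / len(after_streak)
print(f"tosses following five heads: {len(after_streak)}")
print(f"P(heads | five heads in a row) ~ {p_heads:.3f}")  # close to 0.5
```

However long the streak, the conditional frequency stays near 50 percent, because each toss is independent of the ones before it.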
13. Regression Bias:
Regression bias, often referred to as omitted-variable bias, occurs when not all relevant variables are taken into account when analysing data, which leads to an incorrect estimate of the relationship between the variables. This can lead to inaccurate predictions or false conclusions. For example, a study analysing the relationship between chocolate consumption and life expectancy without taking into account factors such as diet, exercise or genetic predisposition would not be meaningful. If only chocolate consumption and life expectancy are analysed without considering the other influencing factors, a distorted picture of reality emerges.
Solution: To prevent regression bias, it is important to consider all relevant variables which could influence the relationship between the analysed variables when analysing the data. This requires a thorough preliminary investigation and a great understanding of the subject area to identify potential influencing factors. The use of statistical techniques such as multivariate regression can help to analyse several variables simultaneously and isolate their individual effects. For complex analyses, it is also helpful to consult experts and specialists in the relevant field to ensure all relevant variables are considered. Careful and comprehensive data analysis which includes all influencing factors is crucial to minimise the risk of regression bias and obtain accurate results.
14. Data Mining Bias:
Data mining bias refers to distortions in the results of data analyses that can arise from the inappropriate selection or interpretation of data. It can occur, for example, when analyses are carried out on large data sets to identify patterns, correlations or trends and certain groups are unintentionally favoured or disadvantaged. A practical example would be an algorithm for job application selection using historical data which unintentionally favours or disadvantages candidates from certain groups due to existing gender or racial biases.
Solution: To prevent data mining bias, it is important to be careful when selecting and interpreting data. Thorough data analysis should ensure all relevant factors and groups are adequately represented. Regular reviews of the analyses can help to identify and correct biases at an early stage. Transparent and ethical guidelines for data use should be developed to ensure data analyses are conducted in a fair and balanced manner. Training and awareness-raising measures for data analysts and decision-makers can help to raise awareness of data mining bias and ensure analyses are objective and fair. Finally, it is important to critically scrutinise the results and look for alternative explanations for the observed patterns to identify and correct possible biases.
15. Disposition Effect:
The disposition effect is a cognitive bias in which people tend to attribute positive outcomes to their own abilities and wise decisions, while attributing negative outcomes to external circumstances or bad luck. This leads to an imbalance in self-perception and can result in irrational decisions. This error can often be observed on the stock market, for example: many investors see a profit as the result of their own clever analyses, while losses are blamed on unpredictable market fluctuations.
Solution: As with most other data errors, the first step in avoiding the disposition effect is to realise it exists and can occur in many situations. Self-reflection on decisions and a willingness to view failures as learning opportunities can help mitigate it. It is also helpful to obtain external perspectives, whether through peer reviews, feedback from colleagues or advice from experts. Objectively analysing successes and failures, taking into account all relevant factors, can help to develop a more realistic self-perception and prevent irrational decisions. Regular reflection and awareness of one’s thought patterns are crucial to recognising the disposition effect and actively tackling it.
True to the motto “never trust a statistic you haven’t faked yourself”, you should be aware there can be many pitfalls and stumbling blocks when analysing data with business intelligence. Once you are aware of the various data errors, from cherry picking to the disposition effect, you can deal critically with the results of analyses and thus ensure the right decisions are made. By taking a transparent approach and considering different perspectives, analysis methods and techniques, data errors can be avoided.
The business intelligence software myPARM BIact offers an optimal solution to overcome these challenges. With its advanced data analysis functionality, transparent reporting options and integrated mechanisms for reviewing your data, myPARM BIact enables precise and reliable data analysis and thus provides a solid basis for data-based decision-making processes. In addition, myPARM BIact allows you to immediately translate the decisions you make into action.
Learn more about the Business Intelligence Software myPARM BIact:
Would you like to get to know myPARM BIact in a demo presentation? Then make an appointment with us right away!