class: center, middle, inverse, title-slide # How uncertainty is represented ### IBiG ###
February 11th, 2020
Points of Significance - Nature Methods --- .center[ # Riassunto della scorsa puntata ] 1. Statistics do not tell us whether we are right. It tells us the **chances of being wrong**. 1. Samples are our **windows to the population**, and their statistics are used to estimate those of the population. 1. **Larger samples** approximate the population better. 1. Always keep in mind that your **measurements are estimates**, which you should not endow with an aura of exactitude and finality. 1. The **omnipresence of variability** will ensure that each sample will be different. -- .content-box-grey[.tiny[ - The **mean** is calculated as the arithmetic average of values and can be unduly influenced by extreme values - The **standard deviation** is calculated based on the square of the distance of each value from the mean - The **median** is a more robust measure of location and more suitable for distributions that are skewed or otherwise irregularly shaped ]] --- .center[ ## Anscomb quartet ] -- <table> <thead> <tr> <th style="text-align:right;"> xVal </th> <th style="text-align:right;"> yVal </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8.04 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 7.58 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8.81 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8.33 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9.96 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7.24 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4.26 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 10.84 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 4.82 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5.68 </td> </tr> </tbody> </table> --- .center[ ## Anscomb quartet ] .middle[ <table> <thead> <tr> <th style="text-align:left;"> mygroup </th> <th style="text-align:right;"> MEDIA </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> VAR </th> <th style="text-align:right;"> COR </th> <th style="text-align:right;"> CC </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 4.13 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 4.13 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 4.12 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 4.12 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> ] --- .center[ ## Anscomb quartet ![](20200211_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] --- class: inverse, center, middle # Part I ### Use boxplot to illustrate the spread and differences of samples --- ## Box plots 1. Visualization methods **enhance our understanding** of sample data and help us make comparisons across samples 1. Box plot is a method for graphically depicting groups of numerical data through their **quartiles** - A quartile is a type of quantile which **divides the number of data points into four more or less equal parts**, or quarters. - The **first quartile** (Q1) is defined as the middle number between the smallest number and the median of the data set - The **interquartile range** (IQR = Q3 − Q1), which covers the central 50% of the data - Quartiles are **insensitive to outliers** and preserve information about the center and spread --- ## Box plots .center[ .middle[ <img src="sziS9F7.png" width="850"> ] ] ??? - The **core element** that gives the box plot its name is a box whose length is the **IQR** and whose width is arbitrary - A line inside the box shows the **median**, which is not necessarily central - The plot may be oriented **vertically or horizontally** - **Whiskers** are conventionally extended to the most extreme data point that is no more than 1.5 × IQR from the edge of the box (Tukey style) or all the way to minimum and maximum of the data values (Spear style) - The **1.5 multiplier** corresponds to 99.3% coverage of the data for a normal distribution - **Outliers** beyond the whiskers may be individually plotted. - Box plot construction requires a **sample of at least n = 5** (preferably larger) --- ## Final considerations 1. For small sample sizes boxplots do not necessarily represent the distribution well .center[ <img src="wuR2nUi.png" width="240"> ] 1. Strongly discourage using bar plots with error bars - These charts continue to be prevalent (we counted 100 figures that used them in 2013 Nature Methods papers, compared to only 20 that used box plots) - The choice of baseline can interfere with assessing relative sizes of means and their error bars ##### Useful link: http://shiny.chemgrid.org/boxplotr/ ??? - Barplot nascondono la distribuzione del campione --- class: inverse, center, middle # Part II ### How uncertainty is represented --- ## Error bars 1. The **uncertainty** in estimates is customarily represented using error bars 1. A lot of **misconceptions** persist about how error bars relate to statistical significance 1. When asked to estimate the required separation between two points with error bars for a difference at significance P = 0.05, **only 22% of respondents were correct** -- <br> <br> .center[ .content-box-red[ .Large[WHAT ABOUT YOU?] ] <br> .Large[http://bit.ly/2ufXHSR] ] ??? - Idee sbagliate --- ## Error bars - **Standard deviation**: it shows how the data are spread - **Standard error**: it describes how accurate the mean estimate is - **Confident intervals**: it captures the true mean μ on 95% of occasions .center[ <img src="QtpJsdK.png" width="740"> ] --- ## Confident intervals .center[ <img src="20200211_Fig2.png" width="200%"> ] --- ## Error bars <br> <br> .center[ <img src="a1.png" width="1050"> ] --- ## Error bars <br> <br> .center[ <img src="a2.png" width="1050"> ] --- ## Error bars <br> <br> .center[ <img src="aaa.png" width="1050"> ] --- ## Error bars <br> .center[ <img src="20200211_Fig3.png" width="1050"> ] --- ## Error bars <br> <br> .center[ .content-box-red[ Be wary of error bars for small sample sizes THEY ARE NOT ROBUST it is better to show individual data values ] ] --- # Error bar rules -- .small[ - When showing error bars, always describe in the **figure legends** what they are. ] -- .small[ - The value of n must be stated in the **figure legend**. ] .center[ <img src="20200211_Fig4.png" width="600"> ] --- # Error bar rules .small[ - When showing error bars, always describe in the **figure legends** what they are. ] .small[ - The value of n must be stated in the **figure legend**. ] .small[ - Error bars and statistics should only be shown for **independently repeated experiments**, and never for replicates ] ??? - Consider trying to determine whether **deletion of a gene in mice affects tail length**. - We could choose **one mutant mouse and one wild type** - **Perform 20 replicate measurements** of each of their tails. - We could **calculate the means, SDs, and SEs** of the replicate measurements, but these would **not permit us to answer the central question** of whether gene deletion affects tail length, **because n would equal 1 for each genotype**, no matter how often each tail was measured -- .small[ - Experimental biologists are usually trying to compare experimental results with controls, it is usually **appropriate to show inferential error bars**, such as SE or CI, rather than SD ] -- .small[ - 95% CIs capture μ on 95% of occasions, so you can be 95% confident your interval includes μ ] -- .small[ - When n = 3, and double the SE bars don’t overlap, P < 0.05, and if double the SE bars just touch, P is close to 0.05 ] --- # Conclusion 1. Error bars can be valuable for understanding results and deciding whether the authors’ **conclusions are justified by the data** -- 1. When first seeing a figure with error bars, ask yourself: - What is n? - Are they independent experiments, or just replicates? - What kind of error bars are they? -- <br> .content-box-red[.center[Error bars and other statistics **can only be a guide** You need to **use your biological understanding** to appreciate the meaning of the numbers shown in any figure ] ] --- class: inverse, center, middle .Huge[ END ]