ggplot2 tutorial: Statistics outside Geoms

Name: ggplot2 tutorial: Statistics outside Geoms
Uploaded: Nov 10, 2016
Duration: 312 s

Channel: DataCamp (209K subscribers)


Learn more about Multiple Statistics outside Geoms in ggplot2: https://www.datacamp.com/courses/data-visualization-with-ggplot2-part-2

Welcome to the second ggplot2 course on data visualisation! Here, we're going to build on the skills you learned in the first course to develop a wide variety of not only appealing, but meaningful, plots. We'll examine the last four layers - statistics, coordinates, facets, and themes - in detail. We'll also go over some tips for data vis best practices so you can make proper use of your new skill-set. At the end we'll wrap up with a case study using the California Health Information Survey, which will guide us through a complete process from exploratory to flexible explanatory visualisations.

R has risen dramatically in popularity over the past years, not only because it offers excellent visualisation, but also because it can combine this with a strong statistical foundation. In this chapter we'll explore how this happens in ggplot2's statistics layer. There are two broad categories of functions in this family: those that are called from within a geom and those that are called independently. As you may have guessed, all the statistical functions begin with "stat_".

Actually, we've already encountered stats functions when we used geom_histogram. Recall that under the hood, this function called stat_bin to summarise the total count in each group. You may also remember that when we discussed geom_bar, I mentioned that its default stat is set to "bin", so we could have produced the same result if we had used geom_bar. Or we could have just called stat_bin directly and produced the same plot. You also saw this in the exercises, where stat_bin also took grouping by fill into account. stat_bin simply counts the number of observations in a given group, so you can see why it's the default stat for the histogram, bar and frequency polygon geoms that we saw earlier.
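The relationship between geom_histogram and stat_bin described above can be sketched in a few lines of R. This is only an illustration, not the code shown in the video: the iris data set, the Sepal.Width variable and the bin width of 0.1 are assumptions chosen for the example, and layer_data() is a ggplot2 helper (available from version 2.0.0) used here just to inspect what the statistic computed. Note also that in ggplot2 2.0.0 and later, geom_bar defaults to stat_count rather than stat_bin, so the geom_bar equivalence mentioned in the video applies to the older versions the course was recorded with.

```r
library(ggplot2)

# Base plot: sepal width from the built-in iris data set (assumed example data)
p <- ggplot(iris, aes(x = Sepal.Width))

# A histogram geom, which calls stat_bin() in the background
p_geom <- p + geom_histogram(binwidth = 0.1)

# Calling the statistic directly; its default geom is a bar
p_stat <- p + stat_bin(binwidth = 0.1)

# Extract the binned counts that each layer computed
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count

# Compare the two sets of counts
same_counts <- identical(counts_geom, counts_stat)

# print(p_geom) or print(p_stat) draws the corresponding plot
```

Setting binwidth explicitly also avoids the message about the default bin width, since the statistic no longer has to pick one for us.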
Now you can appreciate what the warning message is telling us when we use these functions: "stat_bin: binwidth defaulted to range/30" is referring to a stats layer, and its associated argument, that was called in the background.

What are some other statistics that are typically called in this way? stat_smooth is a commonly used statistic that we can access with geom_smooth(). Recall that one of our previous scatter plots compared sepal width to length in three iris species. On top of geom_point, we can call geom_smooth, which in turn calls stat_smooth(). The uncertainty around the fit, shown as a gray ribbon behind our smooth, is by default the 95% confidence interval. We can remove this by setting the se argument to FALSE. We know we are calling stat_smooth because of another warning message: "geom_smooth: method="auto" and size of largest group is <1000, so using loess."

LOESS is a non-parametric smoothing algorithm that basically works by passing a sliding window along the x axis. Within each window a weighted mean is calculated for the model. The span argument controls alpha, which is the degree of smoothing; you can think of this as the size of the window. Smaller spans are more noisy, as we can see here.

From within ggplot2 we can use the method argument to call parametric models, such as lm, glm, rlm and gam. For groups larger than 1000, the method defaults to gam. Here's an example of basic ordinary least squares regression using lm. Notice that in both the LOESS and lm examples the model is calculated on the different groups defined by colour. We'll look at how to override this in the exercises. We can again remove the 95% CI by setting se to FALSE. Notice that the linear model is not used for prediction here, although it can be. By default, it is bound to the limits of the data on the x-axis. If we are using a parametric method, we can ask for predictions by using the fullrange argument.
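The smoothing options discussed above (se, span, method and fullrange) can be tried out with a short sketch like the following. It is an illustration under assumptions rather than the code from the video: the aesthetic mapping, the span value of 0.7 and the object names are chosen for the example, and layer_data() is used only to inspect the fitted values of the smooth layer.

```r
library(ggplot2)

# Scatter plot of sepal width against sepal length, coloured by species
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# LOESS smooth per species without the confidence ribbon;
# a smaller span gives a smaller window and a noisier curve
p_loess <- p + geom_smooth(method = "loess", span = 0.7, se = FALSE)

# Ordinary least squares fit per species, again without the ribbon;
# each line is bound to the x limits of its own group
p_lm <- p + geom_smooth(method = "lm", se = FALSE)

# fullrange = TRUE extends each fit across the full x range of the plot,
# and the ribbon (se is TRUE by default) is kept to show the uncertainty
p_full <- p + geom_smooth(method = "lm", fullrange = TRUE)

# Fitted values of the smooth layer (layer 2; layer 1 is geom_point)
fit_lm <- layer_data(p_lm, 2)
fit_full <- layer_data(p_full, 2)

# print(p_loess), print(p_lm) or print(p_full) draws the corresponding plot
```

Comparing the x values in fit_lm and fit_full for a single group is one way to see how fullrange changes the extent of the fitted line.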
Just as we'd expect, the error increases the further away from our data set we attempt to use our estimate.

There are a variety of other stats functions which we will encounter throughout the rest of the data visualisation courses, some of which are particularly useful for dealing with large data sets, such as bindot, binhex, bin2d and contour - we'll encounter those in the next course when we consider graphics of large datasets. In general you won't have to call these functions directly, but it is worth knowing about the relationship between geoms and their respective statistics. You'll understand warning and error messages better, and the help pages for the stats functions are often more informative if you need to adjust any parameters. Ok, let's see how stat functions work in practice in the exercises.
