
R Tutorial: Impact of weights

7.6K views
Mar 10, 2020
4:57

Want to learn more? Take the full course at https://learn.datacamp.com/courses/analyzing-survey-data-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

Now that we have a handle on common survey design structures, let's look at a real-world survey to see how the survey design and the weights impact our analyses. We will explore the National Health and Nutrition Examination Survey (NHANES). The goal of NHANES is to assess the health of people in the US. Because the survey includes a medical exam in a mobile health vehicle, the researchers have put a lot of care into developing a cost-effective, representative sampling design.

The data are collected in four stages. First, the US is stratified by geography and the distribution of minority populations. Then counties are randomly selected within each stratum, where more populated counties are more likely to be sampled. From the sampled counties, city blocks are randomly selected, where again more populated blocks are more likely to be sampled. From the sampled city blocks, households are randomly selected based on demographic information. And lastly, within the sampled households, people are randomly selected for inclusion in the sample.

The 2009 to 2012 sample data, called NHANESraw, can be found in the NHANES package. Running the dim() command returns the number of rows, or observations, and the number of columns, or variables, contained in the dataset. We see that the NHANESraw dataset contains 78 variables on 20,293 people.

Before specifying the design, we need to modify the survey weights variable, WTMEC2YR. WTMEC2YR provides the number of people in the US each sampled person represents. Therefore, summing up the weights via the summarize() command should provide a rough estimate of the total number of people in the US. However, we get an estimate of 608 million people, about twice as many as we should!
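As a rough sketch of the two checks just described, the following R code loads NHANESraw, inspects its dimensions, and sums the two-year weights. It assumes the NHANES and dplyr packages are installed; the column name total_2yr is an illustrative choice, not something taken from the video.

```r
# Assumes the NHANES and dplyr packages are installed
library(NHANES)
library(dplyr)

data(NHANESraw)

# Number of rows (people) and columns (variables)
nhanes_dim <- dim(NHANESraw)
nhanes_dim

# Sum of the two-year weights: a rough estimate of the US population
# if the weights were scaled correctly (total_2yr is an illustrative name)
weight_total <- NHANESraw %>%
  summarize(total_2yr = sum(WTMEC2YR))
weight_total
```

If the total comes out near 608 million, as reported in the video, the weights are on the wrong scale for four years of data.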
That is because these weights were constructed assuming you have two years of data. Since we have four years, we need to divide each weight by 2. To do that, we use the mutate() function to create a new column, WTMEC4YR, where each value is half the value in WTMEC2YR.

Let's specify the design with the R function svydesign(). In the arguments, we need to provide the dataset, NHANESraw, and the strata column, SDMVSTRA. Remember, id is where we specify the variables that represent the clusters. While the design actually had three levels of clustering (counties, city blocks, and households), it's common in practice to only specify the first level, denoted here by SDMVPSU. Running the distinct() function on SDMVPSU, we see it only takes on three values: 1, 2, and 3. This is because 1 to 3 counties were sampled within each stratum, and the same labels are reused in every stratum. Therefore, we must include nest = TRUE, which tells svydesign() that the cluster ids are nested within the strata and refer to different clusters in different strata. Lastly, the survey weights are given in WTMEC4YR.

Now for some analyses! Suppose we want to estimate the distribution of race in the US. I created these two plots using the race variable in the NHANES dataset. In the top graph, I accounted for the survey weights, and in the bottom graph, I didn't. Notice how different the distribution of race is between these two plots. The survey weights account for the sampling design, in which minority groups are over-sampled; they also adjust for non-response and are calibrated to known information about the population. In essence, if we ignore them, we will get a very wrong graph!

The moral of the story is that survey weights cannot be ignored. In the rest of this course, we will learn how to ensure that the graphs, the models, and the analyses properly handle the weights. Time for some practice with the NHANES weights!
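The design specification and the weighted versus unweighted comparison described above might look something like the sketch below. It assumes the NHANES, dplyr, and survey packages are installed. The video does not name the race column, so Race1 is an assumption, and the object names nhanes_design, unweighted_props, and weighted_props are illustrative. The video's plots are not reproduced; the code only tabulates the proportions behind them.

```r
# Assumes the NHANES, dplyr, and survey packages are installed
library(NHANES)
library(dplyr)
library(survey)

data(NHANESraw)

# Four years of data, so halve the two-year weights
NHANESraw <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2)

# The first-stage cluster labels are reused across strata
distinct(NHANESraw, SDMVPSU)

# Stratified cluster design. The full argument name is ids; the
# transcript's "id" reaches it through R's partial argument matching.
nhanes_design <- svydesign(
  data    = NHANESraw,
  strata  = ~SDMVSTRA,
  ids     = ~SDMVPSU,
  nest    = TRUE,      # PSU labels repeat within each stratum
  weights = ~WTMEC4YR
)

# Race1 is assumed here to be the race column used in the video
unweighted_props <- prop.table(table(NHANESraw$Race1))
weighted_props   <- prop.table(svytable(~Race1, design = nhanes_design))

# Compare sample shares with weighted population shares
round(unweighted_props, 3)
round(weighted_props, 3)
```

Comparing the two tables side by side shows how far the raw sample shares can drift from the weighted population estimates when some groups are over-sampled.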

