R intro part 3

Oh god hi,

this is going to be my last post in the intro to R series. This one will be more theoretical, and I will try to cover the data cleaning process, the types of variables, and sampling when performing studies. So let me start with… :)

DATA CLEANING

1.Exploring raw data: class, dim, names, str, glimpse (from the dplyr library), summary, head, tail, hist, plot.
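A quick sketch of how these calls look in practice. I am using the built-in iris data frame here only as a stand-in for your own raw data, and glimpse is guarded so the snippet still runs when dplyr is not installed.

```r
# Built-in example data; any data frame would work the same way.
df <- iris

class(df)    # type of the object
dim(df)      # number of rows and columns
names(df)    # column names
str(df)      # compact structure: type and first values of each column
summary(df)  # per-column summary statistics
head(df, 3)  # first 3 rows
tail(df, 3)  # last 3 rows

# glimpse() is the dplyr counterpart of str(); run it only if dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(df)
}

# Quick plots: distribution of one variable, relationship between two.
hist(df$Sepal.Length)
plot(df$Sepal.Length, df$Sepal.Width)
```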

2.Tidy data rules: a dataset is a collection of values, and each value belongs to a variable and an observation. A variable contains all values that measure the same attribute across units, and an observation contains all values measured on the same unit across attributes.

Symptoms of messy data: multiple variables stored in one column, a single observational unit stored in multiple tables, column headers that are values, not variable names.

Tidyr: gather collapses columns that are not variables into key-value pairs, spread takes key-value pairs and spreads them across multiple columns, separate splits one column into multiple columns, unite takes multiple columns and pastes them together.
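Here is a minimal sketch of the four tidyr verbs on tiny made-up data frames (the country, year and sales names are my own, just for illustration). Note that in newer tidyr versions gather and spread are superseded by pivot_longer and pivot_wider, but they still work.

```r
library(tidyr)

# Made-up messy data: the column headers 2019 and 2020 are values, not variable names.
wide <- data.frame(
  country = c("PL", "DE"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather: collapse the year columns into key-value pairs (year, sales).
long <- gather(wide, key = "year", value = "sales", -country)

# spread: the inverse, one column per year again.
wide_again <- spread(long, key = year, value = sales)

# separate: split one column into several.
dates <- data.frame(date = c("2020-01-15", "2020-02-20"), stringsAsFactors = FALSE)
parts <- separate(dates, date, into = c("year", "month", "day"), sep = "-")

# unite: paste several columns back into one.
joined <- unite(parts, "date", year, month, day, sep = "-")
```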

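3.Preparing data for analysis: type conversions using the as.*() functions (as.numeric, as.character and so on), parsing char to date with functions named after a combination of y, m, d, h, m, s (such as ymd or ymd_hms from the lubridate library), string manipulation with stringr (str_trim, str_pad, str_replace, str_detect) and with base R's toupper and tolower.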
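A short sketch of these steps. I am assuming the date parsers meant here are the lubridate ones, and all the input strings are made up.

```r
library(lubridate)
library(stringr)

# Type conversions with the as.*() family.
num <- as.numeric("3.14")     # character -> numeric
int <- as.integer("42")       # character -> integer
chr <- as.character(2020)     # numeric -> character
lgl <- as.logical("TRUE")     # character -> logical
fct <- as.factor(c("a", "b")) # character -> factor

# Parsing dates: pick the function whose name matches the order of the parts.
d1 <- ymd("2020-03-15")               # year, month, day
d2 <- dmy("15/03/2020")               # day, month, year
d3 <- ymd_hms("2020-03-15 14:30:00")  # date plus hours, minutes, seconds

# String cleaning.
trimmed  <- str_trim("  szarki9  ")                             # drop surrounding whitespace
padded   <- str_pad("42", width = 5, side = "left", pad = "0")  # pad to a fixed width
upper    <- toupper("low income")                               # base R
lower    <- tolower("HIGH INCOME")                              # base R
replaced <- str_replace("low-income", "-", " ")                 # replace the first match
detected <- str_detect(c("vegan", "not vegan"), "^vegan")       # does the pattern match?
```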

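Special values in R: NA (missing value), NaN (not a number), Inf and -Inf (infinite values). Missing values imported from other tools might also be encoded as strings, like Excel's #N/A.

Finding missing values: is.na, any(is.na(x)), complete.cases – shows which rows have no missing values, na.omit – removes all rows with any missing values. Identifying outliers: summary, hist, or boxplot.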
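The missing-value helpers can be tried out on a small made-up data frame like this one (base R only, the names and numbers are invented):

```r
# Made-up data with two missing values and one suspiciously large income.
df <- data.frame(
  name   = c("Ann", "Bob", "Cid", "Dee"),
  income = c(3000, NA, 2500, 90000),
  age    = c(25, 31, NA, 40)
)

is.na(df)                   # logical matrix: TRUE where a value is missing
any(is.na(df))              # is anything missing at all?
colSums(is.na(df))          # number of missing values per column
complete <- complete.cases(df)  # TRUE for rows with no missing values
clean <- na.omit(df)        # keep only the complete rows

# Special values behave differently from NA.
is.na(NaN)        # NaN also counts as missing for is.na()
is.finite(1 / 0)  # 1 / 0 is Inf, so this is FALSE

# Outlier check: the summary and the boxplot both make 90000 stand out.
summary(df$income)
boxplot(df$income)
```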

DATA in R

Types of variables:

numerical – numerical values, might be continuous (often measured, like weight) or discrete (often counted, like the number of cars).

categorical – a limited number of distinct categories, might be ordinal (the categories have a natural order, like socioeconomic status: “low income”, “middle income”, “high income”) or simply categorical (no inherent order, like whether you are a vegan or not).

table counts the observations in each category, droplevels drops unused levels from a factor or a data frame.
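A tiny base R sketch of why droplevels is needed, using a made-up diet factor:

```r
# Made-up categorical variable.
diet <- factor(c("vegan", "omnivore", "omnivore", "vegetarian", "vegan"))
table(diet)  # counts per category

# Filtering removes the observations, but the level itself stays...
no_veg <- diet[diet != "vegetarian"]
table(no_veg)  # "vegetarian" is still listed, with a count of 0

# ...so droplevels() removes the levels that no longer occur.
no_veg <- droplevels(no_veg)
levels(no_veg)
```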

Discretizing a variable: converting a numerical variable to a categorical variable based on a chosen criterion, e.g. adding a new variable to the dataset where 1 corresponds to an income above the average and 0 to an income at or below the average. We can do it using mutate from the dplyr library.
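The income example could look roughly like this. The incomes are made up, and the cut thresholds for the three income classes are arbitrary values I picked only to show an ordinal variant.

```r
library(dplyr)

# Made-up incomes; the average is 5625.
people <- data.frame(income = c(2000, 3500, 5000, 12000))

people <- people %>%
  mutate(
    # 1 if above the average income, 0 otherwise.
    above_avg = ifelse(income > mean(income), 1, 0),
    # Ordinal version with arbitrary, hypothetical thresholds.
    income_class = cut(
      income,
      breaks = c(-Inf, 3000, 6000, Inf),
      labels = c("low income", "middle income", "high income"),
      ordered_result = TRUE
    )
  )
```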

Observational studies and experiments:

Observational study: collect data in a way that does not directly interfere with how the data arise, only correlation can be inferred.

Experiment: randomly assign subjects to various treatments, causation can be inferred.

Example: drinking beetroot juice and physical capacity

So in the observational study, we would select two groups of people, the ones that drink beetroot juice and the ones that do not, and then compare the physical capacity results for both groups.

In the experiment, we would sample a group of people from the population and then randomly split them into a group that will drink beetroot juice and a group that won't (the decision about drinking is imposed by the researchers). After a while, we would compare the physical capacity results for both groups.

If we find a difference in the observational study, we cannot attribute it only to beetroot juice, as there might be other variables responsible for the result, e.g. people drinking the juice might also be the ones that exercise more or eat more healthily in general.

In the experiment, because the subjects are randomly assigned to the two groups, these additional variables that we could not account for in the observational study will on average be equally distributed between the groups. So when we find a difference in physical capacity, we can make a causal statement that drinking beetroot juice affects physical capacity.

Random sampling and random assignment:

Random sampling – subjects are selected for the study at random from the population, which helps generalize the results to that population.

Random assignment – only in experiments, where we randomly assign subjects to the various treatments, which helps infer causation from the results.
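To make the difference concrete, here is a base R sketch for the beetroot juice example. The population of 1000 ids and the group size of 20 are invented numbers.

```r
set.seed(9)  # only so that the sketch is reproducible

# Hypothetical population of 1000 people.
population <- data.frame(id = 1:1000)

# Random sampling: every person has the same chance of entering the study.
study <- population[sample(nrow(population), 40), , drop = FALSE]

# Random assignment: shuffle the two treatment labels across the sampled subjects.
study$group <- sample(rep(c("beetroot juice", "no juice"), each = 20))

table(study$group)  # 20 subjects per group
```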

Simpson’s paradox

Explanatory variables (x) and the response variable (y): the relationship might be multivariate, with several explanatory variables for one response.

Example: study results as the response and the number of books read, exercises done, and lecture attendance as the explanatory variables.

Simpson’s paradox is the situation where including a third variable in the analysis changes, or even reverses, the previously found relationship between the explanatory variable and the response.

Example: A study shows that, overall, men were accepted to a university at a higher rate than women, but when looking into each of the departments, women were usually accepted at a higher rate than men. Why is that? Because women tended to apply to departments with harder admission criteria, which admit fewer people.
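The reversal is easier to believe with numbers. The counts below are completely made up (two hypothetical departments, A being easy and B being hard to get into), they are not from any real study.

```r
# Made-up admission counts for two hypothetical departments.
admissions <- data.frame(
  dept     = c("A", "A", "B", "B"),
  gender   = c("men", "women", "men", "women"),
  applied  = c(80, 20, 20, 80),
  admitted = c(48, 14, 4, 24)
)

# Rate within each department: women are ahead in both A and B.
admissions$rate <- admissions$admitted / admissions$applied

# Overall rate per gender, ignoring the department: men are ahead.
totals <- aggregate(cbind(applied, admitted) ~ gender, data = admissions, FUN = sum)
totals$rate <- totals$admitted / totals$applied

admissions
totals
```

Most women applied to the hard department B, so their overall rate is pulled down even though they do better than men inside each department.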

SAMPLING STRATEGIES:

Simple random sampling: we randomly select a sample from the population so that each case is equally likely to be selected. Example: taking a random sample of dog owners.

Stratified sampling: we divide the population into homogeneous groups called strata, and then we randomly sample from each stratum. Example: if we want to have people from each education level, we create one stratum per level and then randomly sample from each of them.

Cluster sampling: we divide the population into clusters, randomly select a few clusters, and then sample all observations within these clusters. Clusters are heterogeneous within themselves, and each cluster is similar to the others. Example: An NGO wants to create a sample of girls across 5 neighboring towns to provide education. Using single-stage cluster sampling, the NGO can randomly select towns (clusters) to form a sample and extend help to the girls deprived of education in those towns.

Multistage sampling: we divide the population into clusters, randomly sample a few clusters, and then randomly sample observations from within those clusters.

Sampling in R: sample_n from dplyr.
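Below is a rough sketch of all four strategies with dplyr, on a made-up population of 300 people in 5 towns (the column names and sample sizes are my own choices). In newer dplyr versions sample_n is superseded by slice_sample, but it still works.

```r
library(dplyr)

set.seed(9)  # only so that the sketch is reproducible

# Hypothetical population: 5 towns with 60 people each, 3 education levels.
pop <- data.frame(
  id        = 1:300,
  town      = rep(paste0("town_", 1:5), each = 60),
  education = rep(c("primary", "secondary", "higher"), times = 100)
)

# Simple random sampling: 30 rows, each equally likely to be picked.
srs <- sample_n(pop, 30)

# Stratified sampling: 10 random rows from each education level (stratum).
stratified <- pop %>%
  group_by(education) %>%
  sample_n(10) %>%
  ungroup()

# Cluster sampling: pick 2 towns at random and keep everyone in them.
towns <- sample(unique(pop$town), 2)
cluster <- pop %>% filter(town %in% towns)

# Multistage sampling: the same 2 towns, then 15 random rows within each.
multistage <- pop %>%
  filter(town %in% towns) %>%
  group_by(town) %>%
  sample_n(15) %>%
  ungroup()
```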

PRINCIPLES OF EXPERIMENTAL DESIGN:

1.Control: compare the treatment of interest to a control group

2.Randomize: randomly assign subjects to treatments

3.Replicate: collect a sufficiently large sample within a study, or replicate the entire study

4.Block: account for the potential effect of confounding variables (first we group subjects into blocks based on these variables and then we randomize within each block to treatment groups)
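Blocking plus randomizing can be sketched in base R like this. The 40 subjects and the trains variable (how often someone exercises, my pick for a confounder in the beetroot juice example) are hypothetical.

```r
set.seed(9)  # only so that the sketch is reproducible

# Hypothetical subjects; "trains" is the suspected confounding variable.
subjects <- data.frame(
  id     = 1:40,
  trains = rep(c("regularly", "rarely"), each = 20)
)

# Block on the confounder, then randomize the treatment within each block.
subjects$treatment <- NA_character_
for (b in unique(subjects$trains)) {
  rows <- which(subjects$trains == b)
  labels <- rep(c("juice", "control"), length.out = length(rows))
  subjects$treatment[rows] <- sample(labels)  # shuffle labels inside the block
}

# Each block ends up with an even split between treatment and control.
table(subjects$trains, subjects$treatment)
```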

Okay, that was a long one :D. Hope it is clear, have a good night all!

xoxo,

szarki9