Welcome to the cogmaster-stats class

R

General announcements about this class are available on Twitter. Follow @cogmasterstats to stay current with the latest news.

Time and place: ENS, 29 rue d'Ulm, Salle Prestige 1; 9:00–11:00, every Tuesday (starting September 29)

You should bring your personal laptop to all sessions. Internet is available on demand, but you won't need it to follow the course. It is recommended to check out the project page, or update your own Git repository, each week. If you want to check R data files individually, please visit the slides or labs directories. Data sets are available in the data directory, or directly as a ZIP file.

Remark: To compile the slides at home, you will need this custom CSS file, although some pictures will be missing in the final output. The slides are based on ioslides. Some keyboard shortcuts are available: 'f' enables fullscreen mode; 'w' toggles widescreen mode; 'o' enables overview mode. For better rendering, it is recommended to toggle at least the widescreen mode. You can navigate through the slideshow using the arrow keys (left/right) on your keyboard.

R lectures

The R statistical software will be used during the first seven sessions. The Setup section contains additional information on installing R and RStudio. The following open-access textbooks will be used throughout the course.

Diez, DM, Barr, CD, and Çetinkaya-Rundel, M (2012). OpenIntro Statistics (2nd edition).

Navarro, D (2015). Learning Statistics with R (v 0.5).

Other resources are provided at the bottom of this page; see Textbooks and Online resources to learn R. Another textbook will be suggested for the Python part.

For a general description of this workshop, see this Overview.

Assignments will be provided online on a regular basis (every 2 or 3 weeks) as multiple-choice questions. They aim to assess your understanding of R for data analysis and of the general statistical principles discussed in the textbook or during the class. You will generally have 1 week to complete them. Solutions will be posted on this website.

Here is a tentative timetable for the R part (7 x 2 = 14 hours in total), with slides and practicals (R code available on the main GitHub repository):

1. Introduction to R

  1. Working with data (slides)
  2. Getting started with R (lab notes), with selected answers

Readings: OpenIntro Statistics 1.3-1.7, 4.1, 4.3, 4.6, 5.4; Learning Statistics with R 3.6, 4.2-4.5
Homework: None

2. Descriptive statistics and two-group comparisons

  1. Exploratory Data Analysis and Statistical Inference (slides)
  2. Data munging and basic statistical tests with R (lab notes)

Readings: OpenIntro Statistics 4.2, 5.2, 5.5; Learning Statistics with R 10.3-10.5, 11.8
Extra readings: Tukey, J.W. We Need Both Exploratory and Confirmatory, The American Statistician (1980) 34(1):23-25; Wickham, H. The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software (2011) 40(1).

3. Inferential statistics and two-group comparisons (cont'd)

  1. Same lab notes
  2. Exercises 3, 4 and 7

Readings: OpenIntro Statistics 6.2, 6.3; Learning Statistics with R 13.2-13.5
Extra readings: Wickham, H. Tidy Data, Journal of Statistical Software (2014) 59(1).

Note: If after this session you are still unsure about what you have learned regarding basic data manipulation in R, you are encouraged to complete the first 5 chapters on http://tryr.codeschool.com at home.

4. ANOVA and design of experiments

  1. Analysis of variance 1 (slides)
  2. Exercises 9 and 10

Readings: OpenIntro Statistics 5.5; Learning Statistics with R 14.1, 14.2, 14.5
Extra readings: Lang, T. Common statistical errors even you can find - Part 1: Errors in descriptive statistics and in interpreting probability values, AMWA Journal (2003) 18(2):67-71; Lang, T. Common statistical errors even you can find - Part 2: Errors in multivariate analyses and in interpreting differences between groups, AMWA Journal (2003) 18(3):103-107.

5. ANOVA and design of experiments

  1. Analysis of variance 2 (slides)
  2. Analysis of variance with R (lab notes), with selected answers

Readings: OpenIntro Statistics 7.1-7.4; Learning Statistics with R 14.4, 14.8, 14.10, 16.1, 16.2
Extra readings: Gelman, A. Analysis of variance -- why it is more important than ever, The Annals of Statistics (2005) 33(1):1–53.

6/7. Regression models

  1. Regression modeling 1 (slides)
  2. Regression modeling 2 (slides)
  3. Regression analysis with R (lab notes), with selected answers

Readings: None
Extra readings: Gardner, M.J. and Altman, D.G. Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal (1986) 292: 746; Simmons, J.P., Nelson, L.D., and Simonsohn, U. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science (2011) 22(11):1359-1366; Vul, E., Harris, C., Winkielman, P., and Pashler, H. Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition, Perspectives on Psychological Science (2009) 4(3):274-290.

Assignments

Project

The project (70% of the total mark) consists of writing a short analysis report for one of the data sets given below. The report should preferably be written in Markdown/reST with R or Python code, and submitted by email before the deadline. Custom templates are available for R/Markdown and Python/Pweave. IPython notebooks can also be used.

The project can be done with one or two colleagues (not more).

The due date is 2016/01/31 (January 31).

Instructions for submitting projects

  1. If you want to work with a colleague, prepare a list with students' names (3 students max.); otherwise skip to step 2.
  2. Pick one project, and relevant files on GitHub (or directly from this page, see below). For each project, there is a brief description of the data set and how to load data in R or Python, and a list of questions specific to that data set.
  3. Send an email to ch[dot]lalanne[at]gmail[dot]com with (a) your name, and that of your coworker(s) if you're not working on the project alone, (b) the project you choose, and (c) the software you will be using.
  4. Work on the project and try to address all questions using Python or R. Only one language is allowed for the whole project. Write your answer using one of the custom templates (R Markdown or Pweave) that are available on GitHub.
  5. Submit your work by email to ch[dot]lalanne[at]gmail[dot]com before the deadline.

In any case, you will be assessed on the quality of your report, the accuracy of the results and of your responses.

  1. We expect correct and reproducible R or Python code.
  2. In addition, we ask you to provide a clear and concise interpretation of the results for each question, in plain English or French.

In other words, you should not only provide R or Python code to answer the questions, but also comment on computer output in relation to the study at hand.

If you are unable to report your code and results in the provided template, consider sending a simple text file with R/Python commands and plain text comments to address the questions.
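
As an illustration of what such a plain script with commands and comments might look like, here is a minimal Python sketch. The two groups and their values are made up for this example and do not come from any of the course data sets; the helper name summarize is also only illustrative. The point is the layout: code that can be rerun as is, followed by a short interpretation in plain language.

```python
# Hypothetical example: the numbers below are invented for illustration
# and are NOT taken from any of the course data sets.
from statistics import mean, stdev

group_a = [72, 75, 78, 80, 69]  # made-up measurements, condition A
group_b = [85, 88, 79, 91, 84]  # made-up measurements, condition B


def summarize(values):
    """Return (n, mean, sample standard deviation) for a list of numbers."""
    return len(values), mean(values), stdev(values)


n_a, mean_a, sd_a = summarize(group_a)
n_b, mean_b, sd_b = summarize(group_b)
diff = mean_b - mean_a

print(f"A: n={n_a}, mean={mean_a:.1f}, sd={sd_a:.1f}")
print(f"B: n={n_b}, mean={mean_b:.1f}, sd={sd_b:.1f}")
print(f"Difference in means (B - A): {diff:.1f}")

# Interpretation (plain text, one short paragraph per question):
# In this made-up sample, the mean of group B is higher than the mean of
# group A. A real report would go on to say whether such a difference is
# compatible with chance and what it means for the study at hand.
```

Keeping the interpretation as comments right after the relevant output makes it easy to see which result each remark refers to.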

Available projects

  1. Study of the relationships between parity and anthropometric characteristics and weights of babies one month after birth.
    data: 01-weights.sav | description: 01-description.txt | questions: 01-questions.txt

  2. Effect of estrogen receptor genotype on age at diagnosis among breast cancer patients.
    data: 02-estrogen.txt | description: 02-description.txt | questions: 02-questions.txt

  3. Study on subjects' heart rate and the frequency of stepping up and down on steps of various heights.
    data: 03-stepping.dat | description: 03-description.txt | questions: 03-questions.txt

  4. Forced expiratory volume and smoking status.
    data: 04-fev.dat | description: 04-description.txt | questions: 04-questions.txt

  5. Role taking in young children.
    data: 05-role.dat | description: 05-description.txt | questions: 05-questions.txt

All data files can be accessed directly from the main GitHub repository.
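
For those working in Python, reading a small text data file might be sketched as follows. This is only an assumption-laden illustration: the column names and values are invented, and the sketch assumes a space-delimited file with a header row, which may not match the actual files. The description file of each project remains the reference for how to load its data.

```python
import csv
import io

# Hypothetical stand-in for a small text data file. The real files may use
# a different separator, header, or missing-value code (see the project's
# description file); these columns and values are invented.
sample = """id group score
1 a 12.5
2 b 9.0
3 a 11.2
"""


def read_table(handle):
    """Read a space-delimited table with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=" ", skipinitialspace=True)
    return list(reader)


rows = read_table(io.StringIO(sample))

# Every field is read as text, so numeric columns must be converted explicitly.
scores = [float(row["score"]) for row in rows]

print(len(rows), "rows; columns:", list(rows[0].keys()))
print("scores:", scores)
```

With a file on disk, the same function could be called on an open file handle instead of the in-memory sample, provided the file really follows the assumed layout.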

Setup

R setup

You should already have R and RStudio. If they are not installed on your system, please follow these instructions:

  1. Go to http://cran.r-project.org/bin/ and select a binary package for your system.
  2. Go to http://www.rstudio.com/ide/download and download RStudio Desktop. There is often a Preview Release that includes additional features but is still considered to be in beta.

Install both programs as you usually would on your system. At this point, you may want to test that everything is OK: launch RStudio and check that the R prompt is available in the Console panel (by default, it sits on the left). You can further check that R is working as expected by typing (without the prompt sign, >)

> 1/3 == 0.3

at the R prompt. Don't worry if the result (FALSE) doesn't match your expectation; we will learn why later.

You can customize the RStudio layout under File ▹ Preferences; see this very brief tutorial.

At some point, we will make use of the R Markdown language to document our working sessions. You don't need to install anything other than RStudio, which now comes with a built-in Pandoc processor. Pandoc is a tool for converting among markup formats and generating HTML or PDF documents from a Markdown file. As an alternative to RStudio's built-in facilities, you can install an enhanced Markdown processor: MultiMarkdown.

Pick up a good editor

Sometimes, it is easier to work directly from a text editor. There are several good editors available for free and working on all platforms (Windows, Un*x, and OS X), including these ones:

However, RStudio already provides many features to interact with R and write R scripts.

Git

If you want to experiment with Git and if it is not already installed on your system, go download the latest version of Git. When installation is completed, check that it is available on your system by typing (again, without the prompt sign, $)

$ git --version

in a Terminal. Windows users may have to use a specific GUI and follow these instructions.

You can request an educational account on GitHub. This will allow you to create private repositories that you can use for your course work. Alternatively, you can create an account on Bitbucket, which offers unlimited private repositories for personal use.

In any case, you can follow the course and download the latest archive of the course from GitHub.

Useful readings

Textbooks

Here are some recommended textbooks. References 1 and 2 include theoretical considerations (no high-level math background required) and various applications in social and psychological science. References 4, 6, and 8 are particularly relevant for those interested in the design of experiments. Reference 3 is an old textbook, but it contains interesting material for those interested in multivariate data analysis and psychometric theory. References 5 and 7 are useful for applied statistics using R; many other books are listed on the CRAN website.

  1. Baguley, T. Serious Stats, A guide to Advanced Statistics for the Behavioral Sciences. Palgrave Macmillan, 2012.
  2. Wilcox, R. Modern Statistics for the Social and Behavioral Sciences. Chapman & Hall/CRC, 2012.
  3. Tinsley, H.E.A. and Brown, S.D. (eds.). Handbook of Applied Multivariate Statistics and Mathematical Modeling. Academic Press, 2000.
  4. Christensen, R. Plane Answers to Complex Questions, The Theory of Linear Models (3rd ed.). Springer, 2002.
  5. Logan, M. Biostatistical Design and Analysis using R. Wiley, 2010.
  6. Quinn, G.P. and Keough, M.J. Experimental Design and Data Analysis for Biologists. Cambridge University Press, 2002.
  7. Dalgaard, P. Introductory Statistics with R (2nd ed.). Springer, 2008.
  8. Maxwell, S.E. and Delaney, H.D. Designing Experiments and Analyzing Data, A Model Comparison Perspective (2nd ed.). Lawrence Erlbaum Associates, 2004.

Online resources to learn R

Some interactive tutorials are available on the internet, e.g., Code School -- Try R, swirl, mosaic, or DataCamp. Online textbooks or interactive modules can also be found on statsTeachR and OpenIntro Statistics.