1. Basic statistics¶

Prerequisites

This course assumes that you already know Python. The scipy-lectures give an introduction to Python.

Tip

This page gives an introduction to statistics with Python. It is useful to get acquainted with data representations in Python. In terms of experimental psychology, the patterns demonstrated here can be applied to simple dataset that arise from psychophysics, or reaction time experiments. We will see how to process efficiently more complex data as produced by brain imaging in the next course.

Contents

Interacting with data
Hypothesis testing: two-group comparisons
Linear models, ANOVA

1.1. Interacting with data ¶

1.1.1. Work environment: IPython¶

Tip

We will be working in the interactive Python prompt, IPython. You can either start it by itself, or in the spyder environment.

In this document, the Python prompts are represented with the sign “>>>”. If you need to copy-paste code, you can click on the top right of the code blocks, to hide the prompts and the outputs.

1.1.2. Basic array manipulation: numpy¶

We have the reaction times in a psychophysical experiment:

474.688 506.445 524.081 530.672 530.869 566.984 582.311 582.940 603.574 792.358

In Python, we will represent them using a numerical “array”, with the numpy module:

>>> importnumpyasnp
>>> x=np.array([474.688,506.445,524.081,530.672,530.869,566.984,582.311,582.940,603.574,792.358])

We can access the elements of x with indexing, starting at 0:

>>> x[0]# First element
474.68799999999999
>>> x[:3]# Up to the third element
array([ 474.688,  506.445,  524.081])
>>> x[-1]# Last element
792.35799999999995

Mathematical operations on arrays are done using the numpy module:

>>> np.max(x)
792.35799999999995
>>> np.sum(x[:3])
1505.2139999999999
>>> np.mean(x)
569.49219999999991
>>> np.log(x)
array([ 6.16265775,  6.22741573,  6.26164625,  6.27414413,  6.27451529,
        6.34033108,  6.36700467,  6.36808427,  6.40286865,  6.67501331])

Finding help

The reference documentation can be found on http://docs.scipy.org
http://scipy-lectures.github.io contains tutorials on Python, numpy, and the scientific Python ecosystem.
Typing np.mean? in IPython will display help on numpy.mean().
numpy.lookfor() can be used to search for keywords in functions help.

Exercise

Display the last two entries of x.
Compute the standard deviation of x (how will you find the function to do so?).

1.1.3. Basic plotting: pylab¶

1.1.3.1. The box plot¶

Basic plotting is done with matplotlib:

>>> frommatplotlibimportpyplotasplt
>>> plt.boxplot(x)

Hint

If a window doesn’t display, you need to call plt.show().

Under IPython, type %matplotlib to have plots display automatically.

1.1.3.2. More plots¶

1.1.4. Mixed-type data: pandas¶

1.1.4.1. Inputing data¶

We have a CSV file giving observations of brain size and weight and IQ (Willerman et al. 1991):

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353

The data are a mixture of numerical and categorical values. We will use pandas to manipulate them:

>>> importpandas
>>> data=pandas.read_csv('examples/brain_size.csv',sep=';',na_values=".")
>>> printdata
    Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0            1  Female   133  132  124     118    64.5     816932
1            2    Male   140  150  124     NaN    72.5    1001121
2            3    Male   139  123  150     143    73.3    1038437
3            4    Male   133  129  128     172    68.8     965353
4            5  Female   137  132  134     147    65.0     951545
...

Warning

Missing values

The weight of the second individual is missing in the CSV file. If we don’t specify the missing value (NA = not available) marker, we will not be able to do statistics on the weight.

1.1.4.2. Manipulating data¶

data is a pandas dataframe, that resembles R’s dataframe:

>>> printdata['Gender']
0     Female
1       Male
2       Male
3       Male
4     Female
...
>>> gender_data=data.groupby('Gender')
>>> printgender_data.mean()
        Unnamed: 0   FSIQ     VIQ     PIQ      Weight     Height  MRI_Count
Gender
Female       19.65  111.9  109.45  110.45  137.200000  65.765000   862654.6
Male         21.35  115.0  115.25  111.60  166.444444  71.431579   954855.4
>>> # More manual, but more versatile
>>> forname,valueingender_data['VIQ']:
... printname,np.mean(value)
Female 109.45
Male 115.25
>>> # Simpler selector
>>> data[data['Gender']=='Female']['VIQ'].mean()
109.45

Exercise

What is the mean value for VIQ for the full population?
How many males/females were included in this study?

Hint use ‘tab completion’ to find out the methods that can be called, instead of ‘mean’ in the above example.
What is the average value of MRI counts expressed in log units, for males and females?

1.1.4.3. Plotting data¶

Pandas comes with some plotting tools (that use matplotlib behind the scene) to display statistics on dataframes:

>>> frompandas.toolsimportplotting
>>> plotting.scatter_matrix(data[['Weight','Height','MRI_Count']])

>>> plotting.scatter_matrix(data[['PIQ','VIQ','FSIQ']])

Exercise

Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations correspond to gender?

1.2. Hypothesis testing: two-group comparisons ¶

For simple statistical tests, we will use the stats sub-module of scipy:

>>> fromscipyimportstats

1.2.1. Student’s t-test¶

1.2.1.1. 1-sample t-test¶

scipy.stats.ttest_1samp() tests if observations are drawn from a Gaussian distributions of given population mean. It returns the T statistic, and the p-value (see the function’s help):

>>> stats.ttest_1samp(data['VIQ'],0)
(30.088099970849338, 1.3289196468727784e-28)

Tip

With a p-value of 10^-28 we can claim that the population mean for the IQ (VIQ measure) is not 0.

Exercise

Is the test performed above one-sided or two-sided? Which one should we use, and what is the corresponding p-value?

1.2.1.2. 2-sample t-test¶

We have seen above that the mean VIQ in the male and female populations were different. To test if this is significant, we do a 2-sample t-test with scipy.stats.ttest_ind():

>>> female_viq=data[data['Gender']=='Female']['VIQ']
>>> male_viq=data[data['Gender']=='Male']['VIQ']
>>> stats.ttest_ind(female_viq,male_viq)
(-0.77261617232750124, 0.44452876778583217)

1.2.2. Paired tests¶

PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FISQ and PIQ are significantly different. We need to use a 2 sample test:

>>> stats.ttest_ind(data['FSIQ'],data['PIQ'])
(0.46563759638096403, 0.64277250094148408)

The problem with this approach is that is forgets that there are links between observations: FSIQ and PIQ are measure on the same individuals. Thus the variance due to inter-subject variability is confounding, and can be removed, using a “paired test”, or “repeated measure test”:

>>> stats.ttest_rel(data['FSIQ'],data['PIQ'])
(1.7842019405859857, 0.082172638183642358)

This is equivalent to a 1-sample test on the difference:

>>> stats.ttest_1samp(data['FSIQ']-data['PIQ'],0)
(1.7842019405859857, 0.082172638183642358)

T-tests assume Gaussian errors. We can use a Wilcoxon signed-rank test, that relaxes this assumption:

>>> stats.wilcoxon(data['FSIQ'],data['PIQ'])
(274.5, 0.034714577290489719)

Note

The corresponding test in the non paired case is the Mann–Whitney U test, scipy.stats.mannwhitneyu().

Exercice

Test the difference between weights in males and females.
Use non parametric statistics to test the difference between VIQ in males and females.

1.3. Linear models, ANOVA ¶

1.3.1. A simple linear regression¶

Given two set of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other terms:

$y = x * coef + intercept + e$

where e is observation noise. We will use the statmodels module to:

Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
Test that coef is non zero.

First, we generate simulated data according to the model:

>>> x=np.linspace(-5,5,20)
>>> np.random.seed(1)
>>> # normal distributed noise
>>> y=-5+3*x+4*np.random.normal(size=x.shape)
>>> # Create a data frame containing all the relevant variables
>>> data=pandas.DataFrame({'x':x,'y':y})

Specify an OLS model and fit it:

>>> # The following will work only with a recent version of statsmodels
>>> fromstatsmodels.formula.apiimportols
>>> model=ols("y ~ x",data).fit()
>>> print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     74.03
Date:                ...                Prob (F-statistic):           8.56e-08
Time:                        ...        Log-Likelihood:                -57.988
No. Observations:                  20   AIC:                             120.0
Df Residuals:                      18   BIC:                             122.0
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -5.5335      1.036     -5.342      0.000        -7.710    -3.357
x              2.9369      0.341      8.604      0.000         2.220     3.654
==============================================================================
Omnibus:                        0.100   Durbin-Watson:                   2.956
Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
Skew:                          -0.058   Prob(JB):                        0.851
Kurtosis:                       2.390   Cond. No.                         3.03
==============================================================================

Exercise

Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevent attribute.

1.3.2. Multiple Regression¶

Consider a linear model explaining a variable z (the dependent variable) with 2 variables x and y:

$z = x \, c_1 + y \, c_2 + i + e$

Such a model can be seen in 3D as fitting a plane to a cloud of (x, y, z) points.

Example: the iris data

>>> data=pandas.read_csv('examples/iris.csv')
>>> model=ols('sepal_width ~ name + petal_length',data).fit()
>>> print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                ...                Prob (F-statistic):           1.58e-20
Time:                        ...        Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3
===============================================================================...
                         coef    std err          t     P>|t|  [95.0% Conf. Int.]
-------------------------------------------------------------------------------...
Intercept              2.9813      0.099     29.989     0.000      2.785     3.178
name[T.versicolor]    -1.4821      0.181     -8.190     0.000     -1.840    -1.124
name[T.virginica]     -1.6635      0.256     -6.502     0.000     -2.169    -1.158
petal_length           0.2983      0.061      4.920     0.000      0.178     0.418
==============================================================================
Omnibus:                        2.868   Durbin-Watson:                   1.753
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.885
Skew:                          -0.082   Prob(JB):                        0.236
Kurtosis:                       3.659   Cond. No.                         54.0
==============================================================================

1.3.3. Post-hoc ANOVA: contrast vectors¶

In the above iris example, we wish to test if the petal length is different between versicolor and virginica, after removing the effect of sepal width. This can be formulated as testing the difference between the coefficient associated to versicolor and virginica in the linear model estimated above (it is an Analysis of Variance, ANOVA). For this, we write a vector of ‘contrast’ on the parameters estimated: we want to test “name[T.versicolor] - name[T.virginica]”, with an ‘F-test’:

>>> print(model.f_test([0,1,-1,0]))
<F test: F=array([[ 3.24533535]]), p=[[ 0.07369059]], df_denom=146, df_num=1>

Is this difference significant?

Exercice

Going back to the brain size + IQ data, test if the VIQ of male and female are different after removing the effect of brain size, height and weight.

Navigation

1. Basic statistics¶

1.1. Interacting with data¶

1.1.1. Work environment: IPython¶

1.1.2. Basic array manipulation: numpy¶

1.1.3. Basic plotting: pylab¶

1.1.3.1. The box plot¶

1.1.3.2. More plots¶

1.1.4. Mixed-type data: pandas¶

1.1.4.1. Inputing data¶

1.1.4.2. Manipulating data¶

1.1.4.3. Plotting data¶

1.2. Hypothesis testing: two-group comparisons¶

1.2.1. Student’s t-test¶

1.2.1.1. 1-sample t-test¶

1.2.1.2. 2-sample t-test¶

1.2.2. Paired tests¶

1.3. Linear models, ANOVA¶

1.3.1. A simple linear regression¶

1.3.2. Multiple Regression¶

1.3.3. Post-hoc ANOVA: contrast vectors¶

Navigation

1.1. Interacting with data ¶

1.2. Hypothesis testing: two-group comparisons ¶

1.3. Linear models, ANOVA ¶