Assignment 3

This is the third (and last) quiz for the cogmaster-stats course. There are 10 multiple-choice questions in total, each with only one correct response. Rather than answering at random, please check the "Don't know" option. When you are done with the quiz, you can validate your responses by clicking the Validate button at the bottom of the page.
Responses are not saved as you type them, so be sure to review all questions before submitting your answers.
If you have any questions, you can ask on Twitter (@cogmasterstats) or by email.
Due date: January 15.


1. With the following code, we can create a dummy dataframe for an experiment. First, let's consider that we have 60 scores recorded on different subjects in each of 3 conditions:

```python
>>> import numpy as np
>>> import pandas as pd
>>> n = 60
>>> x = np.array([23, 20, 32])
>>> x = x.repeat(n)
>>> x = x + 2 * np.random.randn(x.size)
>>> df = pd.DataFrame({"Score": x,
...                    "Condition": np.repeat(['verbal shadowing', 'working memory load', 'control'], n)})
```

How would you compute the mean score for each condition?

- `df['Condition'].apply(np.mean)`
- `np.mean(df.groupby('Condition'))`
- `df.groupby('Condition').mean()`
- `df.groupby('Condition').np.mean()`
- Don't know.

2. Now, if you were to visually check the homoscedasticity of the data in each cell of the design, which command would produce the most suitable graph?

- `df.boxplot(by='Condition')`
- `df.groupby('Condition').plot(kind='box')`
- `df.groupby('Condition').boxplot()`
- Don't know.

3. Actually, in each condition half of the subjects were tested in the morning and the other half in the afternoon. How would you add this factor to the current dataframe?

- `df.append(pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3)),column=['Time'])`
- `df.addcolumn(np.tile(np.repeat(['morning','afternoon'],30),3),name='Time')`
- `df['Time']=pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3),index=df.index)`
- Don't know.

4. How would you write the formula (shown as `xxxxxxx`) in the following code in order to assess whether the "Time" factor should be kept in the analysis of the experiment data?
```python
>>> import statsmodels as sm
>>> from statsmodels.formula.api import ols
>>> from statsmodels.graphics.api import interaction_plot
>>> from statsmodels.stats.anova import anova_lm
>>> formula = 'xxxxxxx'
>>> m = ols(formula, df)
>>> r = m.fit()
>>> print(anova_lm(r))
>>> r.resid.plot()
>>> interaction_plot(df['Condition'], df['Time'], df['Score'])
```

- `Score ~ Condition * Time`
- `Score ~ Condition + Time`
- `Score(Condition,Time)`
- Don't know.

5. In a study on monozygotic twins, researchers are interested in comparing average reading speed between the two members of each twin pair. Which `scipy.stats` test function would you use in this case?

- `ttest_1samp()`
- `ttest_ind()`
- `ttest_rel()`
- Any of the above.
- Don't know.

6. Which `scipy.stats` test function would you use if the study on those monozygotic twins was done with reaction-time recordings, which usually follow a skewed, Poisson-like distribution?

- `mannwhitneyu()`
- `wilcoxon()`
- `ranksums()`
- Any of the above.
- Don't know.

7. A data file with some missing values has been loaded using Pandas `read_csv()` as follows:

```python
>>> import numpy as np
>>> import pandas as pd
>>> d = pd.read_csv("readings2.csv", na_values=".")
>>> d.count()
Treatment    44
Response     38
dtype: int64
>>> d.head()
  Treatment  Response
0   Treated        24
1   Treated        43
2   Treated        58
3   Treated        71
4   Treated        43
>>> d.Treatment[d.Response.isnull()]
8     Treated
17    Treated
29    Control
34    Control
41    Control
42    Control
Name: Treatment, dtype: object
>>> grp = d.groupby('Treatment')
>>> grp.agg([np.size, np.mean, np.std])
          Response
              size       mean        std
Treatment
Control         23  44.368421  15.724046
Treated         21  51.631579  10.111657
```

Suppose we have defined a function `f` that replaces every missing value with the mean, like this:

```python
>>> f = lambda x: x.fillna(x.mean())
```

How would you use this function to fill in the missing values within each treatment group separately in the above data set?

- `grp.transform(f)`
- `d.transform(f)`
- `grp.filter(f)`
- `d.filter(f)`
- Don't know.

8. Using the `statsmodels` package, we fitted a regression line to a bivariate series of 20 observations. The results are shown below:

```python
>>> import numpy as np
>>> from scipy import stats
>>> import statsmodels.api as sm
>>> n = 20
>>> x = np.random.uniform(0, 10, n)
>>> y = 1.1 + 0.8*x + np.random.normal(0, 5, n)
>>> X = sm.add_constant(x, prepend=False)
>>> m = sm.OLS(y, X)
>>> r = m.fit()
>>> print(r.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.489
Model:                            OLS   Adj. R-squared:                  0.460
Method:                 Least Squares   F-statistic:                     17.20
Date:                Mon, 19 Jan 2015   Prob (F-statistic):            xxxxxxx
Time:                        12:12:13   Log-Likelihood:                -47.392
No. Observations:                  20   AIC:                             98.78
Df Residuals:                      18   BIC:                             100.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.7781      0.188      4.147      0.001         0.384     1.172
const          2.1638      1.038      2.084      0.052        -0.018     4.345
==============================================================================
Omnibus:                        0.372   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.830   Jarque-Bera (JB):                0.515
Skew:                           0.137   Prob(JB):                        0.773
Kurtosis:                       2.263   Cond. No.                         9.63
==============================================================================
```

What is the p-value for the F-test (shown as `xxxxxxx` in the above output)?

- `1 - scipy.stats.f.cdf(r.fvalue, 1, 18)`
- `1 - scipy.stats.f.ppf(17.20, 1, 18)`
- `r.f_pvalue`
- Any of the above.
- Don't know.

9. In order to check whether a die is loaded, we recorded the outcome of throws until each face appeared at least 5 times. The results are stored in the following Python dictionary:

```python
>>> rolls = {"value": ["one", "two", "three", "four", "five", "six"],
...          "count": [9, 5, 10, 8, 11, 14]}
```

Which command would you use, assuming that you have imported the stats module from SciPy with `from scipy import stats`?
- `stats.fisher_exact(rolls['count'])`
- `stats.chisquare(rolls['count'])`
- `stats.chi2(rolls['count'])`
- Any of the above.
- Don't know.

10. The 'low birth weight' study, available as `birthwt` in R's MASS package, is a retrospective cohort study in which data from 189 mothers were recorded during 1986. Medical scientists were interested in potential risk factors associated with low infant birth weight. The data can be imported as follows (they are fetched from the web):

```python
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> d = sm.datasets.get_rdataset('birthwt', package='MASS').data
>>> d.head()
    low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
85    0   19  182     2      0    0   0   1    0  2523
86    0   33  155     3      0    0   0   0    3  2551
87    0   20  105     1      1    0   0   0    1  2557
88    0   21  108     1      1    0   0   1    2  2594
89    0   18  107     1      1    0   0   1    0  2600
```

A baby is considered underweight if its weight is below 2.5 kg; this is recorded in the binary variable `low`. The explanatory variables of interest are smoking status (`smoke=1` means that the mother smoked during pregnancy), `age` (in years), and ethnicity (`race`, where 1=white, 2=black, 3=other). Using `smf.logit()`, you can obtain the parameter estimates from a logistic regression model that includes all three predictors. What is the value of the odds ratio for the `smoke` variable?

- 3.006
- 3.033
- 1.095
- Don't know.
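Background note for the last question: in a logistic regression, the odds ratio for a predictor is the exponential of its fitted coefficient. The sketch below illustrates this with a fictitious 2×2 table (the counts are made up, not taken from `birthwt`), using the fact that for a single binary predictor the model-based odds ratio coincides with the sample odds ratio (a·d)/(b·c).

```python
import numpy as np

# Fictitious 2x2 table (NOT the birthwt data):
# rows = smoker yes/no, columns = low birth weight yes/no.
a, b = 20, 30   # smokers: low / not low
c, d = 10, 40   # non-smokers: low / not low

# The logit coefficient for the binary predictor is the difference
# of log-odds between the two groups...
beta_smoke = np.log(a / b) - np.log(c / d)

# ...and exponentiating it gives the odds ratio, which equals the
# cross-product ratio (a*d) / (b*c) of the table.
odds_ratio = np.exp(beta_smoke)
print(round(odds_ratio, 3))  # (20*40)/(30*10) = 2.667
```

Note that in the quiz question the model also adjusts for `age` and `race`, so the reported odds ratio must come from the fitted `smf.logit()` model, not from a raw 2×2 table.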