Consider the data in the file CollegeDistance.csv, which contains data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1100 high schools. A complete description of the data is given in the file CollegeDistance_Data Description, which can be found on Blackboard.

Using this data, carry out the following empirical exercises:

1. For all the continuous variables in the dataset (ed, bytest, cue80, stwmfg80, dist, tuition), report a summary statistics table that includes the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. Are there outliers in these variables? Explain. (Hint: Look at the differences between the mean and the median.)

2. For all the continuous variables in the dataset, create a correlation table. Identify which variables could be important in explaining years of education.

3. Looking at all the dummy variables, what is the difference in the average years of education between the groups defined by each dummy variable? Explain carefully which group has more years of education. Are these differences statistically significant? (Hint: Run a separate regression of years of education on each dummy variable and check the statistical significance of the coefficient to verify the difference in group means. For example, a regression ed~female would tell you the average difference between males and females.)

4. Now run a complete regression of years of education on all the dummy variables at the same time. Which ones are still significant? In addition, can you identify any econometric issues when looking at the coefficients of the complete regression vs. the individual regressions in the previous problem?

5. What percentage of the variation in years of schooling is explained by base year composite test scores alone? What is the marginal change in years of schooling given a one-point change in the test score? Is this variable significant?

6. What percentage of the variation in years of schooling is explained by all the continuous variables together? Can you identify any econometric issues with the effect of test scores on years of schooling in the full regression? What is the gain in model fit from adding the rest of the continuous variables?

7. Regarding the regression in (6), compute the heteroskedasticity-corrected (robust) t-statistics. Are there any changes in the significance of the variables?

8. Regarding the regression in (6), test the overall significance of the regression using an F-test. Report the F-statistic and the critical F values at the 1%, 5%, and 10% levels, and state the conclusion of the test at each significance level.

9. Run a regression of years of education on all variables in the data. Report the significance of each one. Which variable appears to be the strongest predictor of years of education?

10. Define a variable male=1-female and run a regression of years of education on female (the original variable) and male (the one you just created). Report the results. What is the econometric problem with such a regression?
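
As a minimal sketch of the hint in question 3, the Python snippet below regresses a made-up outcome on a single 0/1 dummy and compares the fitted coefficients with the two group means. The numbers and the helper `ols_simple` are hypothetical and serve only as an illustration; they are not taken from CollegeDistance.csv, and the snippet is not a model answer.

```python
# Synthetic illustration only: these numbers are made up and are NOT
# taken from CollegeDistance.csv.
from statistics import mean


def ols_simple(x, y):
    """Return (intercept, slope) from an OLS regression of y on a single x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical years of education and a 0/1 dummy (1 = female, 0 = male).
ed = [12, 13, 14, 16, 12, 15, 16, 17]
female = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = ols_simple(female, ed)

# Group means computed directly, without any regression.
mean_male = mean(e for e, f in zip(ed, female) if f == 0)
mean_female = mean(e for e, f in zip(ed, female) if f == 1)

print(f"intercept = {intercept:.2f}, mean of dummy=0 group = {mean_male:.2f}")
print(f"slope = {slope:.2f}, difference in group means = {mean_female - mean_male:.2f}")
```

In this sketch the intercept matches the mean of the group with dummy = 0 and the slope matches the difference between the two group means. Whether that difference is statistically significant depends on the standard error of the slope, which this snippet does not compute.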


Instructions: Please email me an individual Word file with your answers WELL ORGANIZED. Incomplete answers will not receive any points. You can answer the questions using either Excel or R. Do not submit any Excel or R files, because I will not grade a spreadsheet or look for your answers there. You can use those tools to answer the questions, but the submitted file should be a well-organized Word file where I can see all your answers to each of my questions.

Create a new Word file and write your name and the date of submission at the top. Then answer each question in the order I asked.