This project is worth 25 points and is due in Canvas on May 10th at midnight. The objective of this project is to engage you in a hands-on exploration of statistical analysis. You will demonstrate your understanding of the main tenets of statistical description and inference that we have learned in this class.
For this project, you will need to find data related to a topic that interests you, such as finance, medicine, sports, or ecology. From this data, choose one or two variables that relate to a statistical question that interests you, and analyze them using the appropriate methods. I have also provided three datasets on the course website. Complete the instructions for steps 1-3 to conduct a thorough statistical analysis of your chosen variables, and write up your results as a short report (1-5 pages).
You may use any statistical software available to you to analyze your chosen variables (for example, Microsoft Excel, R, SAS, Python, MATLAB, or SPSS). The applications in the “resources” tab on the course website may also be used. I have also provided several links to online software that can be used to complete your analyses, as well as several websites (links below) that can be used to explore publicly available datasets.
Kaggle is a platform for data science and machine learning. It hosts a variety of datasets contributed by the community, covering topics such as finance, healthcare, sports, and more.
data.gov is the official U.S. government open data platform. It provides access to a vast array of datasets from various government agencies.
Google Dataset Search helps you find datasets stored across the web. It is a useful tool for discovering datasets from various domains.
The UCI Machine Learning Repository is a collection of datasets for machine learning research. It includes datasets on a wide range of topics.
Federal Reserve Economic Data (FRED) offers economic data from the Federal Reserve Bank of St. Louis, including economic indicators, financial data, and more.
Eurostat is the statistical office of the European Union. It provides access to statistical information on the European Union and its member states.
Plotly for online data visualization
RAWgraphs Online Data Visualization Tool
National Center For Education Statistics
Evan’s Awesome AB tool (for \(t\)-test and \(\chi^2\)-test)
Purdue University Department of Psychology Free 2-proportion Z-test Calculator
LibreTexts Chi-Square Goodness of Fit Test Calculator
Gain proficiency in selecting and handling real-world datasets. Develop skills in using descriptive statistics to characterize the distribution of variables. Formulate testable hypotheses based on observed patterns or relationships in the data. Apply appropriate statistical tests to evaluate hypotheses and draw meaningful conclusions. Enhance data visualization skills through the creation of descriptive plots.
Please read all instructions carefully.
Choose one or two variables from a dataset that interests you or one of the provided datasets.
(A) Provide some background on the dataset you chose, such as the topic it relates to, how it was gathered, and the sampling method, if available. Provide a link or an appropriate reference.
(B) Articulate a statistical question related to the dataset you chose and select one or two variables that relate to this question.
(C) Write a short description of the variable(s) and how they relate to the question you are interested in exploring.
Statistic | Value |
---|---|
minimum | |
Q1 | |
median | |
mean | |
Q3 | |
maximum | |
interquartile range | |
standard deviation | |
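As a rough illustration of how the statistics in the table above could be computed for a quantitative variable, here is a minimal Python sketch that uses only the standard library. The `describe` helper and the `sample` values are made up for this example and are not part of the assignment. Quartile conventions differ between software packages, so the values you get from Excel, R, or another tool may not match exactly. Whichever tool you use, state it in your report.

```python
import statistics


def describe(values):
    """Return the summary statistics listed in the table for one variable.

    Hypothetical helper written for this example only.
    """
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    # method="inclusive" is one of several quartile conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "minimum": data[0],
        "Q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "Q3": q3,
        "maximum": data[-1],
        "interquartile range": q3 - q1,
        # stdev() is the sample standard deviation (divides by n - 1).
        "standard deviation": statistics.stdev(data),
    }


# Made-up values standing in for one column of a real dataset.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
summary = describe(sample)

# Print rows in the same "Statistic | Value |" layout as the table.
for name, value in summary.items():
    print(f"{name} | {value:.2f} |")
```

For a categorical variable, most of these statistics do not apply; counts and proportions per category are the usual summary instead.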
(B) Create at least one descriptive plot for each variable (e.g., histogram, boxplot, dot plot, stem-and-leaf plot, Pareto chart, pie chart, or bar chart) to visualize its distribution. Be sure to include a caption for each plot.
(C) Using the descriptive statistics and plots from parts A and B, write a short description of the shape, center, and variability of your chosen variables. Make note of any outliers and how they may influence your results.
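Any of the plot types listed in part (B) is acceptable, and most statistical software will draw them for you. As one small illustration, the sketch below builds a text stem-and-leaf plot in plain Python and prints it with a caption. The `stem_and_leaf` helper and the `scores` values are hypothetical, and the helper assumes non-negative whole numbers with the tens digit as the stem.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build a text stem-and-leaf plot (stem = tens, leaf = ones).

    Hypothetical helper; assumes non-negative integers.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)

    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return "\n".join(lines)


# Made-up values standing in for one column of a real dataset.
scores = [12, 15, 21, 23, 23, 28, 34, 41, 47]
plot = stem_and_leaf(scores)
print(plot)
print("Figure 1: Stem-and-leaf plot of example scores (2 | 1 means 21).")
```

Reading down the rows gives a quick sense of shape (where the leaves pile up), center, and spread, which is the kind of description part (C) asks for.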
Building on the statistical description in Part 2, formulate a testable hypothesis (or several, if you analyze the variables separately) related to your question in Part 1. The hypothesis should either explore the relationship between the chosen variables or compare one of them to a population parameter. Be sure to choose the statistical test or model that is appropriate for your question. This may be any of the tests we have talked about in class, such as one- and two-sample tests of means and proportions, categorical tests such as one of the \(\chi^2\)-tests, non-parametric tests, bootstrap tests, regression, or ANOVA.
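To make the idea of a testable hypothesis concrete, here is a minimal sketch of one of the listed options, a two-sided two-proportion z-test of \(H_0: p_1 = p_2\) against \(H_a: p_1 \neq p_2\), written with the Python standard library. The function name and the counts are made up for illustration. The test relies on the usual large-sample normal approximation, so it is only one possible choice; your own question may call for a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion.

    Hypothetical helper; assumes both samples are large enough for
    the normal approximation to be reasonable.
    """
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    # Under H0 the two groups share one proportion, so pool them.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: 45 of 100 in group 1 and 30 of 100 in group 2.
z, p_value = two_proportion_z_test(45, 100, 30, 100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

In a report, you would state the hypotheses, the significance level chosen in advance, the test statistic and p-value, and a conclusion written in the context of your question.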