Objective: The goal of this challenge is to analyze a dataset containing student performance metrics to uncover insights that can help improve educational outcomes.
Dataset: We’ll use the “Student Performance” dataset, which is publicly available on the UCI Machine Learning Repository. This dataset includes information about students’ academic achievements and various personal, social, and school-related factors.
Dataset Overview:
- Name: Student Performance Dataset
- Source: UCI Machine Learning Repository
- Link: Student Performance Dataset
- Description: This dataset contains student achievement data from two Portuguese secondary schools. It includes student grades along with demographic, social, and school-related features.
- Files Included:
  - student-mat.csv: Data related to the Mathematics course
  - student-por.csv: Data related to the Portuguese language course
- Number of Instances: 649 in student-por.csv (Portuguese language course); 395 in student-mat.csv (Mathematics course)
- Number of Attributes: 33
- Attribute Information: Includes features like school, sex, age, address, family size, parental education, study time, failures, and grades (G1, G2, G3), among others.
You can download the dataset directly from the UCI repository or access it via Kaggle:
- Student Performance Dataset on Kaggle
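Once downloaded, the files can be read with pandas. One detail that often trips up beginners is that the UCI CSV files separate fields with semicolons rather than commas. The sketch below is a minimal illustration, not a prescribed solution: the helper names `load_students` and `summarize_quality` are hypothetical, and the sample rows are made up to mimic the file layout so the snippet runs without the download.

```python
import io

import pandas as pd

# Made-up rows in the same semicolon-separated layout as the UCI files.
# These are NOT real records from the dataset.
SAMPLE_CSV = """school;sex;age;studytime;failures;G1;G2;G3
GP;F;18;2;0;5;6;6
GP;M;17;2;0;14;15;15
GP;M;17;2;0;14;15;15
MS;F;16;3;1;10;;11
"""


def load_students(source):
    """Read a student file; `source` may be a path or a file-like object."""
    # The UCI files use ';' as the field separator, not ','.
    return pd.read_csv(source, sep=";")


def summarize_quality(df):
    """Return basic data-quality counts: size, missing values, duplicate rows."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


df = load_students(io.StringIO(SAMPLE_CSV))
report = summarize_quality(df)
print(report)
```

With the real data, the same helper would be called with a file path, for example `load_students("student-por.csv")`, assuming the file sits in the working directory.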
Figma design link: https://youtu.be/ScRA9dkm4WY
Tasks:
- Data Cleaning and Preparation:
  - Load the dataset into a DataFrame.
  - Check for and handle any missing or inconsistent data.
- Exploratory Data Analysis (EDA):
  - Analyze the distribution of students’ final grades (G3).
  - Examine correlations between G3 and other numerical features.
  - Explore the impact of categorical variables (e.g., gender, parental education) on G3.
- Data Visualization:
  - Create visualizations to illustrate findings from the EDA.
  - Use histograms, box plots, and scatter plots to represent data distributions and relationships.
- Predictive Modeling:
  - Develop a regression model to predict students’ final grades (G3) based on the available features.
  - Evaluate the model’s performance using appropriate metrics (e.g., RMSE, R²).
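The EDA and visualization tasks above could be approached along the following lines. This is a sketch under stated assumptions, not a reference solution: it runs on a synthetic stand-in table (generated at random, not real student data) that borrows a few column names from the dataset (`sex`, `Medu`, `studytime`, `failures`, `G1`, `G2`, `G3`), and all function names are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def make_synthetic_students(n=200, seed=0):
    """Build a synthetic stand-in table (NOT real data) with dataset-like columns."""
    rng = np.random.default_rng(seed)
    g1 = rng.integers(0, 21, size=n)
    g2 = np.clip(g1 + rng.integers(-2, 3, size=n), 0, 20)
    g3 = np.clip(g2 + rng.integers(-2, 3, size=n), 0, 20)
    return pd.DataFrame({
        "sex": rng.choice(["F", "M"], size=n),
        "Medu": rng.integers(0, 5, size=n),
        "studytime": rng.integers(1, 5, size=n),
        "failures": rng.integers(0, 4, size=n),
        "G1": g1, "G2": g2, "G3": g3,
    })


def correlations_with_g3(df):
    """Pearson correlation of each numeric column with G3, strongest first."""
    corr = df.select_dtypes("number").corr()["G3"].drop("G3")
    return corr.sort_values(key=abs, ascending=False)


def g3_by_category(df, column):
    """Mean, median, and count of G3 for each level of a categorical column."""
    return df.groupby(column)["G3"].agg(["mean", "median", "count"])


def plot_g3_overview(df):
    """Histogram of G3, box plot of G3 by sex, and scatter plot of G2 against G3."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].hist(df["G3"], bins=range(0, 22))
    axes[0].set(title="Distribution of G3", xlabel="G3", ylabel="Students")
    labels = sorted(df["sex"].unique())
    groups = [df.loc[df["sex"] == s, "G3"].to_numpy() for s in labels]
    axes[1].boxplot(groups)
    axes[1].set_xticks(range(1, len(labels) + 1))
    axes[1].set_xticklabels(labels)
    axes[1].set(title="G3 by sex", ylabel="G3")
    axes[2].scatter(df["G2"], df["G3"], alpha=0.5)
    axes[2].set(title="G2 vs. G3", xlabel="G2", ylabel="G3")
    fig.tight_layout()
    return fig


df = make_synthetic_students()
print(df["G3"].describe())            # distribution summary of final grades
corr = correlations_with_g3(df)
print(corr)                           # numeric features ranked by |correlation|
print(g3_by_category(df, "sex"))      # G3 summary per category level
fig = plot_g3_overview(df)            # call fig.savefig(...) to write it to disk
```

On the real files, the same functions would take the loaded DataFrame in place of the synthetic one; any patterns seen in the synthetic table are artifacts of how it was generated and say nothing about actual students.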
Expected Outputs:
- A summary report detailing the data cleaning process and any issues encountered.
- Insights from the exploratory data analysis, highlighting key factors that influence student performance.
- Visualizations that effectively communicate the relationships between variables and their impact on final grades.
- A trained regression model capable of predicting student final grades, along with an evaluation of its accuracy and reliability.
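For the regression model and its evaluation, one possible starting point is a scikit-learn pipeline that one-hot encodes categorical columns and fits an ordinary linear regression. The sketch below is an assumption about how a beginner might structure this, not the required approach: the data is a synthetic stand-in (not real student records), the helper names are hypothetical, and linear regression is simply a convenient baseline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_synthetic_students(n=300, seed=0):
    """Build a synthetic stand-in table (NOT real data) with dataset-like columns."""
    rng = np.random.default_rng(seed)
    g1 = rng.integers(0, 21, size=n)
    g2 = np.clip(g1 + rng.integers(-2, 3, size=n), 0, 20)
    g3 = np.clip(g2 + rng.integers(-2, 3, size=n), 0, 20)
    return pd.DataFrame({
        "sex": rng.choice(["F", "M"], size=n),
        "studytime": rng.integers(1, 5, size=n),
        "failures": rng.integers(0, 4, size=n),
        "G1": g1, "G2": g2, "G3": g3,
    })


def build_model(features):
    """Pipeline: one-hot encode non-numeric columns, then linear regression."""
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",  # numeric columns are passed through unchanged
    )
    return Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])


def train_and_evaluate(df, target="G3", seed=0):
    """Fit on 80% of the rows and report RMSE and R² on the held-out 20%."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = build_model(X_train).fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
    }
    return model, metrics


model, metrics = train_and_evaluate(make_synthetic_students())
print(metrics)
```

Because G1 and G2 are earlier period grades that track G3 closely, a model that includes them will tend to score well almost by construction. Repeating the evaluation with those two columns dropped gives a harder but arguably more useful picture of how the remaining factors relate to final grades.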
Note: This challenge is designed for beginners in data science and analytics. It aims to provide hands-on experience with data cleaning, analysis, visualization, and predictive modeling. Participants are encouraged to document their process and findings thoroughly, as this practice is valuable for developing a strong data science portfolio.