Data Expedition: Travel Through Data Preprocessing, EDA And PCA

Kaushik Paul; Dipankar Basu

doi:10.53555/kuey.v30i6.5828

Published: Jun 11, 2024

Keywords:

pre-processing, exploratory data analysis (EDA), machine learning (ML), principal component analysis (PCA), Matplotlib, Seaborn, NumPy, Pandas, Jupyter Notebook

Abstract

Pre-processing is essential for improving data quality and making data suitable for specific tasks such as data mining. It covers the steps taken to prepare data for analysis, such as cleaning, converting, and integrating it. This chapter presents a comprehensive analysis of data collected through web scraping techniques, preprocessed with Python methods, and finally analyzed using exploratory data analysis (EDA). Different data collection methods are outlined first, followed by preprocessing steps, including summary statistics that describe the overall structure of the dataset under consideration. To build a machine learning (ML) model, several data pre-processing schemes are considered, such as handling missing or null values (N-V), detecting outliers, and removing duplicates. EDA is conducted at various levels, including univariate, bivariate, and multivariate analysis, to understand the relationships within the dataset. A dataset may contain a large number of feature variables, which can be combined into a smaller number of variables using principal component analysis (PCA). PCA reduces the complexity of the model that will be built using ML algorithms such as logistic regression and linear regression. This chapter provides insights into the entire process of data analysis, from data collection to model evaluation, demonstrating the effectiveness of web scraping in extracting valuable information for predictive modeling. Both logistic regression and linear regression models are then constructed to predict target variables. Feature selection techniques are employed to identify the most influential variables, and PCA is used for dimensionality reduction. Finally, model performance is evaluated using confusion matrices for the logistic regression model and root-mean-squared error for the linear regression model.
In this work, Python is used; it is an object-oriented, interpreted, and interactive programming language. It is open source and has a rich set of libraries such as Pandas, NumPy, Matplotlib, and Seaborn. The Python code is executed in Jupyter Notebook, a web-based application that provides rich media representations of objects.
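
The preprocessing and PCA steps named in the abstract can be illustrated with a minimal sketch in Pandas and NumPy. This is not the authors' code: the function names `preprocess` and `pca`, the toy dataset, median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, and PCA is computed here by eigendecomposition of the covariance matrix.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, fill missing numeric values, and remove IQR outliers."""
    df = df.drop_duplicates().copy()
    numeric = df.select_dtypes(include="number").columns
    # Assumption: impute missing values with the column median.
    df[numeric] = df[numeric].fillna(df[numeric].median())
    # Assumption: a row is an outlier if any numeric value lies
    # more than 1.5 * IQR beyond the first or third quartile.
    q1 = df[numeric].quantile(0.25)
    q3 = df[numeric].quantile(0.75)
    iqr = q3 - q1
    inside = (
        (df[numeric] >= q1 - 1.5 * iqr) & (df[numeric] <= q3 + 1.5 * iqr)
    ).all(axis=1)
    return df[inside].reset_index(drop=True)


def pca(X: np.ndarray, n_components: int):
    """Project X onto its leading principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    explained = eigvals[order] / eigvals.sum()
    return Xc @ eigvecs[:, order], explained


# Hypothetical toy data: "b" is nearly a multiple of "a", "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=50),
    "c": rng.normal(size=50),
})
toy.loc[3, "b"] = np.nan                                  # a missing value
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)  # a duplicate row
toy.loc[len(toy)] = [100.0, 200.0, 0.0]                   # an obvious outlier

clean = preprocess(toy)
scores, explained = pca(clean.to_numpy(), n_components=2)
print(clean.shape, scores.shape, explained)
```

Because two of the three toy features are strongly correlated, most of the variance should fall on the first component, which is the situation in which replacing many features with a few components is useful before fitting a regression model.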

How to Cite

Kaushik Paul, & Dipankar Basu. (2024). Data Expedition: Travel Through Data Preprocessing, EDA And PCA. Educational Administration: Theory and Practice, 30(6), 2576–2590. https://doi.org/10.53555/kuey.v30i6.5828

Issue

Vol. 30 No. 6 (2024)

Section

Articles

Author Biographies

Kaushik Paul

Assistant Professor, Department: CSE, Brainware University, Barasat, West Bengal, India

Dipankar Basu

Assistant Professor (HOD), Department: Computer Application, Swami Vivekananda Institute of Modern Studies, Barrackpore, West Bengal, India
