Data Expedition: Travel Through Data Preprocessing, EDA And PCA

Kaushik Paul
Dipankar Basu

Abstract

Preprocessing is essential for improving data quality and making data suitable for tasks such as data mining. It comprises the steps taken to prepare data for analysis, such as cleaning, converting, and integrating it. This chapter presents a comprehensive analysis of a workflow in which data are collected through web scraping, preprocessed with Python methods, and then examined through exploratory data analysis (EDA). Different data collection methods are outlined first, followed by preprocessing steps, including summary statistics that describe the overall structure of the dataset under consideration. To build a machine learning (ML) model, several preprocessing schemes are considered, such as handling missing or null values (N-V), outlier detection, and duplicate removal. EDA is conducted at the univariate, bivariate, and multivariate levels to understand the relationships within the dataset. A dataset may contain a large number of feature variables, which principal component analysis (PCA) can combine into a smaller number of variables. PCA thereby reduces the complexity of the model built with ML algorithms such as logistic regression and linear regression. The chapter covers the entire process of data analysis, from data collection to model evaluation, and demonstrates the effectiveness of web scraping in extracting valuable information for predictive modeling. Both logistic regression and linear regression models are then constructed to predict target variables. Feature selection techniques are employed to identify the most influential variables, and PCA is used for dimensionality reduction. Finally, model performance is evaluated using confusion matrices for the logistic regression model and root-mean-squared error for the linear regression model.
This work uses Python, an object-oriented, interpreted, and interactive programming language. It is open source and has a rich set of libraries such as Pandas, NumPy, Matplotlib, and Seaborn. The Python code is executed in Jupyter Notebook, a web-based application that provides rich media representations of objects.
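As a rough illustration of the preprocessing and PCA steps named in the abstract, the following minimal sketch uses only Pandas and NumPy. The abstract does not state which imputation rule, outlier criterion, or PCA implementation the chapter uses, so median imputation, the 1.5 x IQR rule, and an SVD-based PCA are assumptions made here, as are the function names `preprocess` and `pca` and the synthetic toy data.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, fill missing numeric values, and drop outlier rows.

    Assumptions (not taken from the chapter): median imputation and the
    1.5 * IQR rule for outlier detection.
    """
    df = df.drop_duplicates().copy()
    num_cols = df.select_dtypes(include="number").columns
    # Fill each numeric column's missing values with that column's median.
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    q1 = df[num_cols].quantile(0.25)
    q3 = df[num_cols].quantile(0.75)
    iqr = q3 - q1
    # Keep a row only if every numeric value lies inside the IQR fences.
    inside = (
        (df[num_cols] >= q1 - 1.5 * iqr) & (df[num_cols] <= q3 + 1.5 * iqr)
    ).all(axis=1)
    return df[inside].reset_index(drop=True)


def pca(X: np.ndarray, n_components: int):
    """Project standardized data onto its leading principal components (SVD)."""
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = Xc / std
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:n_components].T, explained_ratio[:n_components]


# Synthetic toy data (hypothetical): "b" is nearly a multiple of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=50)
raw = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=50),
        "c": rng.normal(size=50),
    }
)
raw.loc[3, "c"] = np.nan                                   # a missing value
raw.loc[7, "a"] = 100.0                                    # an obvious outlier
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)   # a duplicate row

clean = preprocess(raw)
scores, ratio = pca(clean.to_numpy(dtype=float), n_components=2)
print(clean.shape, scores.shape, ratio)
```

Because two of the three toy features are almost perfectly correlated, the first component captures most of the variance, which is the situation in which reducing the feature count before fitting a regression model is most useful.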

Article Details

How to Cite
Kaushik Paul, & Dipankar Basu. (2024). Data Expedition: Travel Through Data Preprocessing, EDA And PCA. Educational Administration: Theory and Practice, 30(6), 2576–2590. https://doi.org/10.53555/kuey.v30i6.5828
Section
Articles
Author Biographies

Kaushik Paul

Assistant Professor, Department: CSE, Brainware University, Barasat, West Bengal, India

Dipankar Basu

Assistant Professor (HOD), Department: Computer Application, Swami Vivekananda Institute of Modern Studies, Barrackpore, West Bengal, India
