You will implement a command-line interface (CLI) that preprocesses your dataset and saves you time.
Machine Learning is a subset of the larger field of artificial intelligence (AI) that focuses on teaching computers how to learn without the need to be programmed for specific tasks. In fact, the key idea behind ML is that it is possible to create algorithms that learn from and make predictions on the data.
Examples of Machine Learning are present everywhere including the spam filter that flags messages in your email, the recommendation engine Netflix uses to suggest content you might like, and the self-driving cars being developed by Google and other companies.
But before applying Machine Learning to any dataset, you need to convert the dataset into a form that the algorithms can understand. These conversion steps are called preprocessing steps.
To know more about preprocessing, refer to this article.
Data preprocessing is an integral step in Machine Learning because the quality of the data, and of the useful information that can be derived from it, directly affects your model's ability to learn. It is therefore crucial to preprocess your data before feeding it into your model.
Preprocessing is also considered time-consuming by many machine learning developers. This simple CLI tool will save you time, which you can then spend applying different machine learning algorithms.
You will apply the following preprocessing steps:
To understand different preprocessing steps, refer to this article.
Finally, you will also be able to download your preprocessed dataset.
The Product Architecture consists of 6 parts which are as follows:
This project consists of the following milestones:
pandas and scikit-learn will be used throughout the project to perform the preprocessing steps.
The desired end result of this project is like this:
There are several types of Machine Learning, such as supervised learning and unsupervised learning. Here, you will write Python scripts that produce a preprocessed dataset for supervised learning.
Supervised learning consists of mapping input data (independent variables) to known targets (dependent variable), which humans have provided. Predicting house prices is a good example.
Important: To test the application as you build it, you will perform the preprocessing on a very popular ML dataset, the Titanic survival dataset. You need to download the train.csv file from the mentioned website.
The idea of this milestone is to correctly take the dataset as input.
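A minimal sketch of how the dataset input could be handled is given here. It assumes the CLI receives the CSV path as a single positional argument; the names `parse_args`, `load_dataset`, and the argument name `dataset` are hypothetical and not prescribed by the project.

```python
import argparse

import pandas as pd


def parse_args(argv=None):
    """Parse the command line; `dataset` is a hypothetical argument name."""
    parser = argparse.ArgumentParser(description="Preprocess a CSV dataset.")
    parser.add_argument("dataset", help="path to a CSV file, e.g. train.csv")
    return parser.parse_args(argv)


def load_dataset(source):
    """Read a CSV file (a path or a file-like object) into a DataFrame."""
    # Reject obviously wrong paths early so the user gets a clear message.
    if isinstance(source, str) and not source.lower().endswith(".csv"):
        raise ValueError(f"expected a .csv file, got: {source}")
    return pd.read_csv(source)


# Typical use inside the CLI entry point (not executed here):
#   args = parse_args()
#   df = load_dataset(args.dataset)
#   print(df.head())
```

Checking the file extension before reading is one simple way to validate the input; a real tool might also catch `FileNotFoundError` and ask the user for the path again.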
At the end of this milestone, the project should work like this.
Now that you are done with the initial step, you can implement various components of your project.
In this milestone, you need to implement the functionality that lets users describe the dataset's properties, such as mean, maximum, and standard deviation.
The idea of this milestone is to correctly show some basic statistical details (mean, standard deviation, percentiles, total number of values, maximum, minimum) and the datatype of each column of the dataset, using methods provided by the pandas library.
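One way to produce these details is sketched below with `DataFrame.describe()` and `DataFrame.dtypes`. The function names and the small sample frame are made up for illustration; in the project the frame would come from the loaded dataset.

```python
import pandas as pd


def describe_dataset(df, column=None):
    """Return summary statistics for one column, or for all numeric columns."""
    if column is not None:
        if column not in df.columns:
            raise KeyError(f"no such column: {column}")
        return df[column].describe()
    # By default describe() reports count, mean, std, min, percentiles and max
    # for the numeric columns only.
    return df.describe()


def column_types(df):
    """Return the datatype of every column as a pandas Series."""
    return df.dtypes


# Small made-up frame standing in for the real dataset.
sample = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                       "Sex": ["male", "female", "female"]})
stats = describe_dataset(sample)
print(stats)
print(column_types(sample))
```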
At the end of this milestone, the project should work like this.
The next step of data preprocessing is to handle missing data in the dataset. Many machine learning algorithms cannot handle missing values, so missing data can cause errors or degrade your model. Hence it is necessary to handle the missing values present in the dataset.
Filling in missing values with substitute values is called data imputation.
The idea of this milestone is to remove all the NULL values from the dataset.
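The sketch below shows two common ways to get rid of NULL values with pandas: filling a column with its mean, median, or mode, and dropping a column altogether. The helper names and the `strategy` parameter are assumptions made for this example, and the sample values are invented.

```python
import pandas as pd


def count_nulls(df):
    """Return the number of missing values in each column."""
    return df.isnull().sum()


def fill_nulls(df, column, strategy="mean"):
    """Return a copy of df with the missing values of one column filled in."""
    result = df.copy()
    if strategy == "mean":
        value = result[column].mean()
    elif strategy == "median":
        value = result[column].median()
    elif strategy == "mode":
        # mode() can return several values; take the first one.
        value = result[column].mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    result[column] = result[column].fillna(value)
    return result


def drop_column(df, column):
    """Return a copy of df without the given column."""
    return df.drop(columns=[column])


# Made-up sample with gaps in every column.
sample = pd.DataFrame({"Age": [22.0, None, 26.0],
                       "Embarked": ["S", None, "S"],
                       "Cabin": [None, None, "C85"]})
filled = fill_nulls(sample, "Age", "mean")       # numeric column: use the mean
filled = fill_nulls(filled, "Embarked", "mode")  # text column: most frequent value
cleaned = drop_column(filled, "Cabin")           # mostly empty column: drop it
print(count_nulls(cleaned))
```

The mean and median only make sense for numeric columns, which is why the text column is filled with its mode here.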
At the end of this milestone, the project should work like this.
Categorical data is data whose values fall into a fixed set of categories, such as the Sex column (male or female) of the Titanic dataset. Machine learning models work entirely on mathematics and numbers, so a categorical variable in your dataset can cause trouble while building the model. It is therefore necessary to encode these categorical variables as numbers.
The idea of this milestone is to map all the categorical columns into numbers.
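As an illustration, the sketch below uses one-hot encoding through `pandas.get_dummies`. This is only one possible encoding; the project does not name a specific one. Treating every non-numeric column as categorical is a simplifying assumption, and the function names are hypothetical.

```python
import pandas as pd


def categorical_columns(df):
    """List the non-numeric columns (assumed here to be the categorical ones)."""
    return list(df.select_dtypes(exclude="number").columns)


def encode_categorical(df, columns=None):
    """Return a copy of df with the given columns one-hot encoded as 0/1."""
    if columns is None:
        columns = categorical_columns(df)
    if not columns:
        return df.copy()
    # Each category becomes its own column named <column>_<category>.
    return pd.get_dummies(df, columns=columns, dtype=int)


# Made-up sample with one numeric and one categorical column.
sample = pd.DataFrame({"Age": [22.0, 38.0], "Sex": ["male", "female"]})
encoded = encode_categorical(sample)
print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category.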
At the end of this milestone, the project should work like this.
Feature scaling is a method used to normalize the range of independent variables or columns of data. It is done to handle highly varying magnitudes among different columns.
If feature scaling is not done, a machine learning algorithm tends to give more weight to larger values and less weight to smaller ones, regardless of the units of the values. For example, a column measured in thousands can dominate a column measured in single digits. Feature scaling avoids this.
There are 2 main ways of doing feature scaling:
The idea of this milestone is that you are able to correctly scale the dataset.
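The two scaling methods are not spelled out here, so the sketch below assumes they are min-max normalization, x' = (x - min) / (max - min), and standardization, z = (x - mean) / std. It uses scikit-learn's `MinMaxScaler` and `StandardScaler`; the function name `scale_columns`, its `method` values, and the sample numbers are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def scale_columns(df, columns, method="normalize"):
    """Return a copy of df with the given numeric columns rescaled."""
    if method == "normalize":
        scaler = MinMaxScaler()      # maps each column to the range [0, 1]
    elif method == "standardize":
        scaler = StandardScaler()    # zero mean, unit (population) std
    else:
        raise ValueError(f"unknown method: {method}")
    result = df.copy()
    result[columns] = scaler.fit_transform(result[columns])
    return result


# Made-up sample whose columns have very different magnitudes.
sample = pd.DataFrame({"Age": [20.0, 30.0, 40.0],
                       "Fare": [100.0, 200.0, 400.0]})
normalized = scale_columns(sample, ["Age", "Fare"], "normalize")
standardized = scale_columns(sample, ["Age", "Fare"], "standardize")
print(normalized)
print(standardized)
```

In a supervised learning setup the target column is usually left unscaled, so the caller passes only the feature columns.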
At the end of this milestone, the project should work like this.
Now that all the preprocessing is done, you can implement the functionality to download the preprocessed dataset.
The idea of this milestone is to correctly download the dataset in the correct file format.
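A minimal sketch of the download step follows, assuming the output format is CSV (the same format as the input train.csv). The function name `save_dataset` is hypothetical.

```python
import pandas as pd


def save_dataset(df, filename):
    """Write df to a CSV file and return the name of the file written."""
    # Make sure the file ends up with the expected extension.
    if not filename.lower().endswith(".csv"):
        filename += ".csv"
    # index=False keeps the row index out of the file, so reading it back
    # gives the same columns that were written.
    df.to_csv(filename, index=False)
    return filename


# Typical use at the end of the CLI session (not executed here):
#   path = save_dataset(df, "train_preprocessed")
#   print(f"Saved to {path}")
```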
At the end of this milestone, the project should work like this.