You will be reading data from CSV files and transforming it to generate final output tables to be stored in a traditional DBMS.
We have the New York Yellow Taxi trip dataset available, and let us assume that the client wants to see an analysis of the overall data. The dataset is very large (it may run to billions of rows), so using a traditional DBMS alone is not feasible. Instead, we will use a Big Data tool such as Apache Spark to transform the data, generate the necessary aggregated output tables, and store them in a MySQL database. With this architecture, the UI can fetch reports and charts much faster from MySQL than it could by querying the raw data directly.
Finally, the batch job we use to analyze the data can be automated to run daily at a fixed time.
The project consists of the following stages:
Set up the environment and install all the tools required for the project.
Read data from CSV files and store the data in HDFS (Hadoop Distributed File System) in a compressed format.
Transform the raw data and build multiple tables by performing the required aggregations.
Load the final tables into MySQL.
Automate the full flow using a shell script.
The video below provides an overview of the project.
The first task to begin our project is to set up the system environment.
It is expected that you are using a Linux distribution. (A cloud system can be a substitute.)
We have to install all the tools and set up the environment (if you have already installed the required tools, you can skip this task). For simplicity, make sure you install all the required software in one location.
Install Hadoop on your system using this tutorial.
Once Hadoop is set up, start the services using the start-all.sh command and run jps to check whether the services are up. The screenshot below shows the services that should be running after a successful installation.
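For reference, the check looks like this (the exact list of daemons can vary slightly with your Hadoop version and configuration).
Commands:
start-all.sh
jps
On a typical pseudo-distributed setup, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, along with Jps itself.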
Now, you can install Apache Spark using this link.
Once Spark is installed, we will install Anaconda. Download the Anaconda bash installer from the Anaconda website, then install and initialize it.
Finally, install MySQL.
Now, by default Spark starts in the terminal; to use Jupyter Notebook for development, we have to set a few properties in the ~/.bashrc file.
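A minimal sketch of the entries is shown below, assuming Spark and Anaconda are already installed; the SPARK_HOME path is a placeholder for your actual install location.
Commands:
export SPARK_HOME=/path/to/spark                 # placeholder: your Spark install directory
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter             # launch the PySpark driver inside Jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Run source ~/.bashrc (or open a new terminal) for the changes to take effect.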
Finally, you can run the pyspark command in the terminal, which should start Spark in a Jupyter Notebook.
The main objective of this milestone is to make sure that you have all the software set up in your Linux distro. On successful completion of the milestone, you should have your Jupyter Notebook running and all the Hadoop services running on your machine.
Download the NYC Yellow Taxi dataset from its website. The dataset is huge, containing roughly 80-90 million records per year, so downloading a single year's data is enough for this project. The dataset is in CSV format.
When you are done with this milestone, you should have the dataset stored in HDFS in a compressed file format.
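A minimal PySpark sketch of this step is shown below; the local file name, the HDFS URL, and the output path are placeholders, and Parquet (compressed with Snappy by default) is just one reasonable choice of compressed format.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nyc_taxi_ingest").getOrCreate()

# Read the downloaded CSV from the local file system (placeholder path)
trips = spark.read.csv("file:///home/user/data/yellow_tripdata_2019.csv",
                       header=True, inferSchema=True)

# Store the raw data in HDFS in a compressed columnar format (placeholder path)
trips.write.mode("overwrite").parquet("hdfs://localhost:9000/nyc_taxi/raw/")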
You will need a payments mapping table for certain analyses; you can set it up by running the commands below in MySQL.
Initialize MySQL.
Commands:
create database source;
use source;
create table payment_otc(payment_type int NOT NULL, payment_name varchar(255) NOT NULL);
INSERT INTO payment_otc VALUES(1, 'Cash');
INSERT INTO payment_otc VALUES(2, 'Net Banking');
INSERT INTO payment_otc VALUES(3, 'UPI');
INSERT INTO payment_otc VALUES(4, 'Net Banking');
INSERT INTO payment_otc VALUES(5, 'Debit Card');
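To give an idea of how this mapping table is used later, here is a hedged PySpark sketch (continuing the ingest sketch above) that reads payment_otc from MySQL over JDBC and joins it to the trip data on payment_type. The JDBC URL, the credentials, and the presence of the MySQL Connector/J jar on Spark's classpath are assumptions, not part of the original instructions.

# Read the payment mapping table from MySQL (placeholder connection details)
payment_otc = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/source")
    .option("dbtable", "payment_otc")
    .option("user", "root")
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load())

# Attach the readable payment name to every trip record for the payment-based analyses
trips_with_payment = trips.join(payment_otc, on="payment_type", how="left")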
Use the tpep_pickup_datetime column whenever date information is needed.
In this milestone, we will load the data from HDFS to MySQL.
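As one hedged example of this step, the sketch below derives a trip date from tpep_pickup_datetime, builds a small daily aggregate, and loads it into MySQL over JDBC. The table name daily_summary, the connection details, and the total_amount column (a standard column in the yellow taxi data) are assumptions for illustration; each of the seven output tables would be written the same way.

from pyspark.sql import functions as F

# Read the raw data back from HDFS (placeholder path)
trips = spark.read.parquet("hdfs://localhost:9000/nyc_taxi/raw/")

# Example aggregate: trips and revenue per pickup date
daily_summary = (trips
    .withColumn("trip_date", F.to_date("tpep_pickup_datetime"))
    .groupBy("trip_date")
    .agg(F.count("*").alias("total_trips"),
         F.sum("total_amount").alias("total_revenue")))

# Load the aggregated table into MySQL (placeholder connection details)
(daily_summary.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/source")
    .option("dbtable", "daily_summary")
    .option("user", "root")
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("overwrite")
    .save())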
At the end of the milestone, you should have all seven tables in MySQL. You can verify this by querying and checking the data in MySQL.
Who doesn't like automation? This milestone is all about it: we will automate all of our previous manual effort.
In this milestone, we will automate the flow using a shell script.
Make sure the --master argument is given the Spark master URL when you submit the job. At the end of the milestone, you will have a shell script that runs the full batch at a fixed time. You should be able to monitor the Spark jobs on the Spark UI.
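A minimal sketch of such a script, assuming the Spark job lives in a file such as nyc_taxi_batch.py and the cluster runs a standalone master on the default port (all paths and the master URL below are placeholders):

#!/bin/bash
# run_batch.sh -- submit the daily NYC taxi batch to the Spark cluster
spark-submit \
  --master spark://localhost:7077 \
  /home/user/project/nyc_taxi_batch.py >> /home/user/project/logs/batch_$(date +%F).log 2>&1

The script can then be scheduled with cron, for example with the entry 0 2 * * * /home/user/project/run_batch.sh to run the batch every day at 2 AM, and the running jobs can be monitored on the Spark UI.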