WHY PYSPARK AND MACHINE LEARNING
In a world where data is generated at an ever-increasing rate, analyzing that data correctly and at the right time is extremely valuable. One
of the best frameworks for handling big data in real time and performing
analysis is Apache Spark. As a result, Python
for Spark, or PySpark, has become one of the most sought-after
skills, giving Scala for Spark a run for its money. So in this PySpark Tutorial blog, I'll
discuss the following topics:
What is PySpark?
Apache Spark is a fast cluster computing framework
used for processing, querying and analyzing big data. Because it is based on
in-memory computation, it has an advantage over several other big data
frameworks.
Spark was originally written in the Scala programming language, and
the open source community has developed an amazing tool to support Python for
Apache Spark. PySpark lets data scientists interface with RDDs in Apache
Spark from Python through the Py4J library. Several
features make PySpark a compelling framework:
- Speed: Spark can be up to 100x faster than traditional MapReduce-style frameworks for in-memory workloads
- Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities
- Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark's own standalone cluster manager
- Real Time: Real-time computation and low latency thanks to in-memory processing
- Polyglot: Supports programming in Scala, Java, Python and R
PySpark in the Industry
Every industry revolves around big data, and where there's big data there's analysis involved. So let's have a look at the various industries where Apache Spark is used.
Media is one
of the biggest industries moving toward online streaming. Netflix uses Apache Spark for
real-time stream processing to provide personalized recommendations to
its customers. It processes 450 billion
events per day, which flow to server-side applications.
Finance is
another sector where Apache Spark's real-time processing plays an important
role. Banks use Spark to access and analyze social media
profiles, gaining insights that help them make the right business decisions
for credit risk assessment,
targeted ads and customer segmentation. Customer churn
is also reduced using Spark. Fraud detection
is one of the most widely used areas of machine learning where Spark is
involved.
Healthcare
providers are using Apache Spark to analyze patient records
along with past clinical data to identify which patients are likely to face
health issues after being discharged. Apache Spark is also used in genomic sequencing to reduce the
time needed to process genome data.
Retail and e-commerce is an
industry one can't imagine running without analysis and
targeted advertising. Alibaba, one of the largest e-commerce platforms today,
runs some of the largest Spark jobs in the world to analyze
petabytes of data; for example, it performs feature extraction on
image data. eBay uses Apache Spark to provide targeted
offers, enhance customer experience and optimize overall performance.
Travel
industries also use Apache Spark. TripAdvisor,
a leading travel website that helps users plan the perfect trip, uses Apache
Spark to speed up its personalized customer
recommendations, providing advice
to millions of travellers by comparing hundreds of
websites to find the best hotel prices for its customers.
Why Go for Python?
Easy to Learn: Python
is comparatively easy to learn because of its syntax and standard
libraries. Moreover, it's a dynamically typed language, which means RDDs can
hold objects of multiple types.
A Vast Set of Libraries: Scala
does not have as many data science tools and libraries as Python for
machine learning and natural language processing, and it lacks comparable options for
visualization and local data transformations.
Huge Community Support: Python
has a global community with millions of developers that interact online and
offline in thousands of virtual and physical locations.
One of the most important topics in this PySpark Tutorial is the use of RDDs. Let's understand what RDDs are.
Spark RDDs
When it comes to iterative distributed computing,
i.e. processing data over multiple jobs, we need to reuse or
share data among those jobs. Earlier frameworks like Hadoop MapReduce had
problems when dealing with multiple operations/jobs:
- Data had to be stored in intermediate storage such as HDFS
- Multiple I/O jobs made the computations slow
- Replication and serialization made the process even slower
RDDs try to solve all these problems by enabling
fault-tolerant, distributed, in-memory computation. RDD is short for Resilient
Distributed Dataset: a distributed memory abstraction
that lets programmers perform in-memory computations on large clusters in a
fault-tolerant manner. RDDs are read-only collections of
objects partitioned across a set of machines that can be rebuilt if a
partition is lost. Two kinds of operations are performed on RDDs:
- Transformations: create a new dataset from an existing one. Transformations are lazily evaluated, so no work happens when one is defined
- Actions: Spark forces the calculations to execute only when an action is invoked on an RDD
Machine Learning with PySpark
Continuing our PySpark Tutorial blog, let's
analyze some basketball data and make some predictions. Here we are
going to use data on all NBA players since 1980 [the year
the 3-pointer was introduced].
Here we are analyzing the average number of 3-point
attempts for each season, normalized to 36 minutes [an
interval corresponding to an approximate full NBA game with adequate
rest]. We compute this metric using the number of 3-point field goal
attempts (fg3a) and minutes played (mp), and then plot the result using matplotlib.
Linear Regression and VectorAssembler:
We can fit a linear regression model to this
curve to model the number of shot attempts for the next 5 years. We have
to transform our data using the VectorAssembler transformer into a single
vector column. This is a requirement for the linear regression
API in MLlib.