Spark

Apache Spark was developed in 2009 by Matei Zaharia at UC Berkeley's AMPLab. The first users of Spark were groups inside UC Berkeley, including machine learning researchers who used Spark to monitor and predict traffic congestion in the San Francisco Bay Area. Spark was open-sourced in 2010 under a BSD license, was donated to the Apache Software Foundation in 2013, and is now one of the foundation's most active projects. Apache Spark is an open-source cluster computing framework which is setting the world of Big Data on fire. According to Spark Certified Experts, Spark's performance is up to 100 times faster in memory and 10 times faster on disk when compared to Hadoop.

Spark & its Features
Apache Spark is an open source cluster computing framework for real-time data processing. The main feature of Apache Spark is its in-memory cluster computing that increases the processing speed of an application. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.

Features of Spark
  • Polyglot
  • Speed
  • Multiple Format Support
  • Lazy Evaluation
  • Real Time Computation
  • Hadoop Integration
  • Machine Learning
Let us look at these features in detail:

Polyglot: Spark provides high-level APIs in Java, Scala, Python and R, so Spark code can be written in any of these four languages. It also provides interactive shells for Scala and Python: the Scala shell is launched with ./bin/spark-shell and the Python shell with ./bin/pyspark from the installation directory.
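As a small illustration, here is a word count typed into the Scala shell (a sketch; the shell pre-creates the SparkContext as `sc`, and the input file name is hypothetical). The equivalent PySpark code looks almost identical:

```scala
// Typed into ./bin/spark-shell; `sc` is the SparkContext the shell provides.
// "input.txt" is a hypothetical file used only for illustration.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(5).foreach(println)
```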

Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark achieves this speed through controlled partitioning: it manages data using partitions that help parallelize distributed data processing with minimal network traffic.
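A minimal sketch of controlled partitioning from the Scala shell (`sc` is the shell's SparkContext); the partition count decides how many tasks can run in parallel and how data is shuffled:

```scala
// Start with 4 partitions, then change the partitioning explicitly.
val data = sc.parallelize(1 to 1000000, numSlices = 4)
println(data.getNumPartitions)            // 4

val wider = data.repartition(16)          // full shuffle into 16 partitions
val narrower = wider.coalesce(8)          // merge partitions without a full shuffle
println(narrower.getNumPartitions)        // 8
```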

Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
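For instance, reading a few of these formats through the DataFrame reader (a sketch; `spark` is the SparkSession created by spark-shell, and the paths, table name and join column are hypothetical):

```scala
// Each call goes through the pluggable Data Sources API.
val users  = spark.read.parquet("hdfs:///data/users.parquet")   // Parquet files
val events = spark.read.json("hdfs:///data/events.json")        // JSON files
val logs   = spark.sql("SELECT * FROM logs")                    // Hive table (requires Hive support enabled)

users.join(events, "user_id").show()                            // "user_id" is a hypothetical common column
```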

Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Transformations are only added to a DAG of computation, and the DAG is actually executed only when the driver requests a result through an action.
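A small sketch of the difference between transformations and actions (run from the Scala shell, where `sc` is predefined):

```scala
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: only recorded in the DAG
val squares = evens.map(n => n * n)        // transformation: still nothing computed
val total   = squares.reduce(_ + _)        // action: the whole DAG executes here
println(total)
```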

Real Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability; the Spark team has documented users running production clusters with thousands of nodes, and Spark supports several computational models.

Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop, which is a great boon for Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for Hadoop's MapReduce functions, and it can also run on top of an existing Hadoop cluster, using YARN for resource scheduling.

Machine Learning: Spark’s MLlib is its machine learning component, which comes in handy for big data processing. It eliminates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

Benefits of Spark over MapReduce
  • Spark processes data around 10 to 100 times faster than Hadoop MapReduce because it avoids MapReduce's disk I/O latency.
  • Spark handles batch processing, streaming, machine learning and interactive SQL queries, whereas MapReduce supports only batch processing.
  • MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
  • Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and there is no iterative computing implemented in MapReduce (see the sketch after this list).
  • Spark can process real-time data through Spark Streaming, whereas with MapReduce you can only do batch processing.
  • Spark has its own cluster managers (Standalone, Mesos, YARN), whereas MapReduce depends on an external scheduler such as Oozie.
  • Apache Spark can process every record exactly once and hence eliminate duplicates; MapReduce does not support this feature.
  • In Spark you need to write fewer lines of code and the code is easy to debug, whereas in MapReduce you need to write more lines of code and the code is difficult to debug.
  • Spark runs SQL queries through Spark SQL, whereas MapReduce does so through Hive.
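Here is a minimal sketch of the iterative-computation point, run from the Scala shell: cache() keeps the dataset in memory, so each repeated pass avoids the disk round-trips MapReduce would incur (the input path is hypothetical):

```scala
// Parse the points once and keep them in memory across iterations.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

var threshold = 10.0
for (i <- 1 to 5) {                          // repeated passes over the same cached data
  val kept = points.filter(_.sum > threshold).count()
  println(s"iteration $i: $kept points above $threshold")
  threshold *= 1.5
}
```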
What file systems does Spark support?
The following three file systems are supported by Spark (a short path sketch follows the list):
  1. Hadoop Distributed File System (HDFS)
  2. Local file system
  3. Amazon S3
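In each case the storage system is selected by the URI scheme of the path, for example (a sketch; all hosts, buckets and paths are hypothetical, and S3 access assumes the hadoop-aws connector is on the classpath):

```scala
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")  // HDFS
val fromLocal = sc.textFile("file:///tmp/input.txt")                // local file system
val fromS3    = sc.textFile("s3a://my-bucket/data/input.txt")       // Amazon S3
```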
Spark Framework


Apache Spark Components

Spark Core:- Spark Core is the central point of Spark; it provides the execution platform for all Spark applications. To support a wide array of applications, Spark provides a generalized platform.
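At the Spark Core level everything is expressed through the RDD API; here is a tiny sketch run from the Scala shell (`sc` is the shell's SparkContext):

```scala
val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))   // distribute a local collection
val doubled = rdd.map(_ * 2)                             // transformation
val sum = doubled.reduce(_ + _)                          // action executed on the cluster
println(sum)                                             // 62
```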

Spark SQL:- On top of Spark Core, Spark SQL enables users to run SQL/HQL queries. Using Spark SQL we can process structured as well as semi-structured data, and it can run unmodified queries up to 100 times faster on existing deployments.
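A minimal Spark SQL sketch: register a DataFrame as a temporary view and query it with plain SQL (`spark` is the SparkSession from spark-shell; the data is made up for illustration):

```scala
val people = spark.createDataFrame(Seq(("Alice", 34), ("Bob", 45), ("Cara", 29)))
  .toDF("name", "age")

people.createOrReplaceTempView("people")                 // make it queryable by name
spark.sql("SELECT name FROM people WHERE age > 30").show()
```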

Spark Streaming:- Spark Streaming enables powerful interactive and analytical applications across live streaming data. The live streams are converted into micro-batches, which are executed on top of Spark Core.
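For example, a word count over 5-second micro-batches read from a TCP socket (a sketch; the host and port are hypothetical, and `sc` is the shell's SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))           // batch interval = micro-batch size
val lines = ssc.socketTextStream("localhost", 9999)      // live input stream
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.print()                                            // runs once per micro-batch
ssc.start()
ssc.awaitTermination()
```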

Spark MLlib:- MLlib is Spark's machine learning library; it delivers both efficiency and high-quality algorithms, and it is a very popular choice for data scientists. Since it is capable of in-memory data processing, it improves the performance of iterative algorithms drastically.
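As a small illustration, fitting a k-means model on a tiny in-memory dataset (a sketch using the DataFrame-based spark.ml API; the points are made up):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val points = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(0.1, 0.1)),
  Tuple1(Vectors.dense(9.0, 9.0)),
  Tuple1(Vectors.dense(9.1, 9.1))
)).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(points)  // iterative algorithm, data stays in memory
model.clusterCenters.foreach(println)
```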

Spark GraphX:- Spark GraphX is the graph computation engine built on top of Apache Spark that enables graph data to be processed at scale.
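A tiny GraphX sketch: build a property graph from vertex and edge RDDs and inspect it (`sc` is the shell's SparkContext; the data is made up):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Cara")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)

graph.inDegrees.collect().foreach { case (id, deg) =>
  println(s"vertex $id has in-degree $deg")
}
```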

SparkR:- SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It allows data scientists to analyze large datasets and run jobs on them interactively from the R shell. The main idea behind SparkR was to explore different techniques for combining the usability of R with the scalability of Spark.
