Big Data: Apache Spark
This open-source technology has been around, and popular, for a few years. But 2016 was the year in which Spark went from a prominent technology to a bona fide superstar.
Apache Spark has become popular because it provides data engineers and data scientists with a powerful, unified engine that is both fast (by the project's own claims, up to 100x faster than Hadoop MapReduce for large-scale data processing) and easy to use.
In this article, we will discuss some of the key points one encounters when working with Apache Spark.
WHAT SPARK IS ALL ABOUT
Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark can run on Hadoop clusters and read Hadoop data, but it has its own execution engine rather than running on top of Hadoop MapReduce. It extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
Spark has several advantages compared with other big data and MapReduce technologies such as Hadoop and Storm.
First, Spark gives us a comprehensive, unified framework for managing big data processing requirements across a wide variety of data sets that are diverse in nature.
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster when running on disk.
Spark lets users quickly write applications in Java, Scala, and Python. It comes with a built-in set of more than 80 high-level operators, and it can be used interactively to query data from within the shell. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. These capabilities can be used on their own or combined in a single data pipeline.
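To make the Map and Reduce operations mentioned above concrete, here is a minimal word-count sketch. It is written in plain Python with the standard library, not with Spark's API, and the sample lines are made up for illustration. Spark applies the same idea, but distributes the map and reduce steps across a cluster.

```python
from collections import Counter
from functools import reduce

# Illustrative input; in Spark this would be a distributed dataset.
lines = [
    "spark makes big data simple",
    "big data needs fast engines",
    "spark is fast",
]

# Map step: turn each line into a partial word count.
partial_counts = map(lambda line: Counter(line.split()), lines)

# Reduce step: merge the partial counts into a single total.
word_counts = reduce(lambda a, b: a + b, partial_counts, Counter())

print(word_counts["spark"], word_counts["data"])
```

Each line is counted independently in the map step, which is what makes this kind of job easy to spread across many machines.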
FEATURES OF SPARK
Spark takes MapReduce to the next level with less expensive shuffles during data processing. With capabilities such as in-memory data storage and near real-time processing, performance can be several times faster than with other big data technologies.
Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when you need to work on the same data set multiple times. It can also store part of a data set in memory and the remaining data on disk. Users have to look at their data and use cases to assess the memory requirements. This in-memory data storage is the source of many of Spark's performance advantages.
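The benefit of keeping intermediate results in memory can be illustrated with a small sketch. This is plain Python, not Spark's API, and `CachedDataset` is a hypothetical helper invented for this example. It computes a derived data set once and then serves later uses from memory.

```python
class CachedDataset:
    """Hypothetical helper: computes a derived data set once, then reuses it."""

    def __init__(self, source, transform):
        self._source = source
        self._transform = transform
        self._cached = None
        self.compute_count = 0  # how many times the transform actually ran

    def get(self):
        # Compute on first access only; later calls reuse the in-memory result.
        if self._cached is None:
            self.compute_count += 1
            self._cached = [self._transform(x) for x in self._source]
        return self._cached


squares = CachedDataset(range(1, 6), lambda x: x * x)

total = sum(squares.get())    # first use: computes and caches
largest = max(squares.get())  # second use: served from memory

print(total, largest, squares.compute_count)
```

Without the cache, every pass over the data would repeat the transformation, which is roughly the cost a disk-based pipeline pays when it rereads intermediate output.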
Other Spark features include:
- Supports more than just Map and Reduce functions.
- Provides concise and consistent APIs in Scala, Java, and Python.
- Offers an interactive shell for Scala and Python.
Spark is written in the Scala programming language and runs on the JVM (Java Virtual Machine). Currently, it supports the following languages for developing applications:
- Scala
- Java
- Python
- Clojure (through third-party bindings)
SPARK ECOSYSTEM
Beyond the Spark core API, there are additional libraries that are part of the Spark ecosystem and provide added capabilities in the big data analytics and machine learning areas. These libraries include the following.
- Spark Streaming: Processes real-time streaming data. It is based on a micro-batch style of computing and processing.
- Spark SQL: Exposes Spark data sets over the JDBC API and allows SQL queries to be run on Spark data using traditional BI and visualization tools.
- Spark MLlib: Spark’s scalable machine learning library, consisting of common learning algorithms and utilities such as regression, clustering, collaborative filtering, and underlying optimization primitives.
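The micro-batch style mentioned for Spark Streaming above can be sketched in a few lines. This is plain Python rather than the Spark Streaming API; the `micro_batches` function and the sample events are hypothetical. As a simplification, it groups events by count, whereas Spark Streaming groups them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group a stream of events into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Illustrative event stream; a real stream would arrive continuously.
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Each small batch is processed with an ordinary batch computation.
batch_sums = [sum(batch) for batch in micro_batches(events, batch_size=3)]

print(batch_sums)
```

The point of the design is that the streaming job reuses the same batch operations as any other job, applied repeatedly to small slices of the incoming data.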