Apache Spark is an open-source distributed general-purpose cluster-computing framework.

  • Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and open-sourced in 2010 under a BSD license.
  • In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.
  • In November 2014, Databricks, the company founded by Spark creator Matei Zaharia, set a new world record in large-scale sorting using Spark.
  • Spark had more than 1000 contributors in 2015, making it one of the most active projects in the Apache Software Foundation and one of the most active open source big data projects.
  • Given the popularity of the platform by 2014, paid programs like General Assembly and free fellowships like The Data Incubator started offering customized training courses.


Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The architecture below shows how Apache Spark is composed and how it interacts with other components.

...

The diagram above resembles the MapReduce architecture; the table below shows how the two differ:

| Item | MapReduce | Apache Spark |
|---|---|---|
| Data Processing | batch processing | batch processing + real-time data processing |
| Processing Speed | slower than Apache Spark because of disk I/O latency | up to 100x faster in memory and up to 10x faster on disk |
| Category | data processing engine | data processing engine |
| Costs | less costly than Apache Spark | more costly because of the large amount of RAM required |
| Scalability | scalable, limited to 1000 nodes in a single cluster | scalable, limited to 1000 nodes in a single cluster |
| Machine Learning | relies on Apache Mahout for machine learning integration | has built-in machine learning APIs |
| Compatibility | compatible with most data sources and file formats | integrates with all data sources and file formats supported by a Hadoop cluster |
| Security | more secure than Apache Spark | security features are still evolving and maturing |
| Scheduler | depends on an external scheduler | has its own scheduler |
| Fault Tolerance | uses replication for fault tolerance | uses RDDs and other data storage models for fault tolerance |
| Ease of Use | somewhat more complex than Apache Spark because of its Java APIs | easier to use because of its rich APIs |
| Duplicate Elimination | not supported | processes every record exactly once, which eliminates duplication |
| Language Support | primary language is Java, but C, C++, Ruby, Python, Perl and Groovy are also supported | supports Java, Scala, Python and R |
| Latency | very high latency | much lower latency than the MapReduce framework |
| Complexity | hard to write and debug code | easy to write and debug code |
| Apache Community | open source framework for processing data | open source framework for processing data at higher speed |
| Coding | more lines of code | fewer lines of code |
| Interactive Mode | not interactive | interactive |
| Infrastructure | commodity hardware | mid- to high-end hardware |
| SQL | supported through Hive Query Language | supported through Spark SQL |
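
The Fault Tolerance row can be made a little more concrete. Instead of keeping extra copies of the data, an RDD remembers how each partition was derived and can recompute a lost one from its parent. The sketch below is a toy model in plain Python, not the Spark API: `ToyDataset`, `step` and `recover` are hypothetical names made up for illustration, and the two in-memory partitions are assumed sample data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Partition = List[int]


@dataclass
class ToyDataset:
    """Hypothetical stand-in for an RDD: partitions plus the step that built them."""
    partitions: List[Optional[Partition]]
    parent: Optional["ToyDataset"] = None
    step: Optional[Callable[[Partition], Partition]] = None

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Record the per-partition step so it can be replayed later.
        def step(part: Partition) -> Partition:
            return [fn(x) for x in part]

        built = [step(self.recover(i)) for i in range(len(self.partitions))]
        return ToyDataset(built, parent=self, step=step)

    def recover(self, index: int) -> Partition:
        # Return the partition, recomputing it from the parent if it was lost.
        part = self.partitions[index]
        if part is None:
            if self.parent is None or self.step is None:
                raise RuntimeError("source partition lost and no lineage to replay")
            part = self.step(self.parent.recover(index))
            self.partitions[index] = part
        return part


source = ToyDataset([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)
squared.partitions[1] = None   # simulate a failed node losing partition 1
print(squared.recover(1))      # [9, 16]
```

Only the lost partition is rebuilt, and only by replaying the recorded step on its parent partition. A replication-based design would instead read a second stored copy of the same data.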


Key differences between MapReduce and Apache Spark

...