The media generates generous buzz around every emerging technology, and Spark, as a promising one, contributes a lot to this stream. However, it is not always easy to look past descriptions of technical features and get a clear understanding of how the technology can contribute to an organisation's business capability.
Typical announcements about Spark look like these:
Spark is being used by a number of customers for streaming and data-warehousing use cases. Spark can be set up on top of the Hadoop project but is quite different from Hadoop's MapReduce.
It is a processing system that stores objects in memory (RDDs) and processes them using Scala closures. This is much more powerful than the MapReduce framework and enables graph processing, data warehousing, machine learning and stream processing on top.
The Shark project is the Hive equivalent in the Spark ecosystem and is used in production by companies like Yahoo.
Spark is amazingly useful, high-performing and powerful. It needs help with stability, especially if you are deploying in frameworks not maintained by the core teams, but overall the speedup and performance are worth the effort.
The business part of the message is the following:
- Spark can serve several purposes, but DWH is the primary one;
- It uses the MapReduce algorithm, but its own version (the same idea with a different composition) – it still has bottlenecks, but benchmarks show good results (it is interesting that attempts to optimise MapReduce continue even though there was a trend to replace it with non-distributed algorithms, which seems not very successful);
- In the case of deep data processing (typically the result of complicated business requirements), Shark allows Scala code to be embedded in queries (i.e. insertion of low-level programming code into high-level scripts);
- Spark allows smooth migration from the DWH-tailored Hive (an engine for analysing massive datasets with high-level scripts) – a useful feature for the many companies that have already adopted Hive;
- Stability issues should be taken into account if Spark is used outside of the Azure cloud.
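The "closures over in-memory objects" idea behind RDDs can be sketched with a toy model. This is a hypothetical illustration in plain Python, not the real Spark API: a dataset held in memory is transformed by user-supplied closures (here eagerly, whereas real RDDs evaluate lazily).

```python
# Toy model of Spark's core idea: an in-memory dataset ("RDD") transformed
# by user-supplied closures. Illustrative only - not the real Spark API,
# and evaluation here is eager, while real RDDs are lazy.

class ToyRDD:
    def __init__(self, data):
        self._data = list(data)  # held in memory, like a cached RDD

    def map(self, fn):
        # apply a closure to every element, producing a new dataset
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # keep only the elements for which the closure returns True
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):
        return self._data

rdd = ToyRDD(range(10))
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)  # squares of the even numbers: [0, 4, 16, 36, 64]
```

The point for the business reader: the "queries" are ordinary code, so arbitrary logic can be pushed into the pipeline, which is exactly what Shark's Scala embedding offers on top of SQL-like scripts.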
The use case where it truly shines is iterative jobs, where a job has to go over the same data set many times. Compared to Hadoop, the improvement is of several orders of magnitude.
- A very telling note: no doubt Spark's caching is the key to fast re-runs. For DWH this functionality is excessive unless real-time analytics is a goal, but for exploratory analysis or operational analytics it can become a touchstone.
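Why caching pays off for iterative jobs can be shown with a toy sketch (plain Python, not Spark code; `load_dataset` is a hypothetical stand-in for an expensive read from distributed storage):

```python
# Toy illustration of the caching advantage for iterative jobs:
# without a cache, every pass reloads the data; with a cache, it is
# loaded once. load_dataset is a hypothetical stand-in for an
# expensive read from storage.

read_count = 0

def load_dataset():
    global read_count
    read_count += 1              # count every "storage" access
    return list(range(1000))

# Hadoop-style: each of the 5 iterations re-reads the data set.
for _ in range(5):
    total = sum(load_dataset())
uncached_reads = read_count

# Spark-style: load once, keep in memory, iterate over the cached copy.
read_count = 0
cached = load_dataset()
for _ in range(5):
    total = sum(cached)
cached_reads = read_count

print(uncached_reads, cached_reads)  # 5 reads vs 1 read
```

With real cluster I/O instead of a counter, that 5-to-1 difference in reads is where the "orders of magnitude" speedup for iterative workloads comes from.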
Spark is an accelerated Hadoop implementation that makes better use of RAM. Shark is an accelerated Hive implementation built on top of Spark.
- To simplify, "Shark" is just a language used on Spark, similar to HiveQL on Hive and Pig Latin on Pig. So Spark is an implementation, and Shark is a language.
Summarising all the above and other positive reviews not quoted here, it is possible to conclude that Spark is definitely a kind of Swiss Army knife for business: it can be used both for regular DWH work and for fine-tuned research.