SHARK on SPARK – business capability increment or just another ICT extension?

The media generates generous buzz around every emerging technology, and the promising Spark contributes a lot to this stream. However, it is not always easy to separate out the descriptions of technical features and get a clear understanding of how the technology can contribute to an organisation's business capability.


Typical announcements for Spark look like these:

Spark is being used by a number of customers for streaming and data-warehousing use cases. Spark can be set up on top of the Hadoop project but is quite different from Hadoop's MapReduce.
It is a processing system that stores objects in memory (RDDs) along with an ability to process them using Scala closures. This is much more powerful than the MapReduce framework and enables graph processing, data warehousing, machine learning and stream processing on top.
The Shark project is the Hive equivalent of the Spark ecosystem and is used in production by companies such as Yahoo.
Spark is amazingly useful, high-performing and powerful. It needs help with stability, especially if you are deploying on frameworks not maintained by the core teams, but overall the speedup and performance are worth the effort.
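The RDD-plus-closures idea from the announcement above can be illustrated without a cluster. The sketch below is plain Python, not Spark's actual API: a toy class with illustrative names that mimics how Spark chains user-supplied closures over an in-memory collection, in the style of `rdd.filter(...).map(...).reduce(...)`.

```python
from functools import reduce

# A toy stand-in for an RDD: an in-memory collection plus chainable
# transformations driven by user-supplied closures.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):        # closure applied to every element
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):   # closure used as a predicate
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):     # closure folds the collection to one value
        return reduce(fn, self.data)

# Total length of the "long" words, Spark-style chaining:
words = ToyRDD(["spark", "shark", "hive", "rdd"])
total = (words.filter(lambda w: len(w) > 3)
              .map(lambda w: len(w))
              .reduce(lambda a, b: a + b))
# total == 5 + 5 + 4 = 14
```

The point of the sketch is only the programming model: each step takes an arbitrary closure, which is what makes the approach more flexible than fixed map and reduce phases.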

The business part of the message is the following:

  • Spark can serve several purposes, but DWH is the primary one;
  • It uses the MapReduce algorithm, but its own version (same idea, different composition); it still has bottlenecks, but benchmarks show good results. It is interesting that attempts to optimise MapReduce continue even though there was a trend to replace it with non-distributed algorithms, a trend that seems not very successful;
  • For deep data processing (typically the result of complicated business requirements) Shark allows Scala encapsulation in queries, i.e. the insertion of low-level programming code into high-level scripts;
  • Spark allows a smooth migration from the DWH-tailored Hive (an engine for analysing massive datasets that supports high-level scripts), a useful feature for the many companies that have already adopted Hive;
  • Stability issues should be considered when Spark is deployed on frameworks not maintained by the core teams.
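The second bullet's point, that Spark reuses the MapReduce idea in a different composition, can be made concrete with the classic word-count job. Below is a minimal, pure-Python sketch of the three phases (map, shuffle, reduce); no framework is involved and all names are illustrative.

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts per word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark on hadoop", "shark on spark"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"spark": 2, "on": 2, "hadoop": 1, "shark": 1}
```

Hadoop materialises the intermediate pairs to disk between phases, which is one of the bottlenecks the bullet alludes to; Spark's composition keeps them in memory where it can.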

The use case where Spark truly shines is iterative jobs, where a job has to go over the same data set many times. Compared to Hadoop, the improvement is several orders of magnitude.

  • An important note: no doubt Spark’s caching is the key to re-runs. For DWH this is excess functionality unless real-time analytics is a goal, but for exploratory analysis or operational analytics it can become a touchstone.
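The caching point can be sketched the same way: an iterative job that re-reads its input on every pass versus one that keeps the data set in memory. This is plain Python, with a hypothetical `load_dataset` standing in for an expensive read from disk or HDFS; the counter only shows how often that read happens, which is the role Spark's `cache()` plays for an RDD.

```python
load_calls = 0

def load_dataset():
    # stands in for an expensive read from disk/HDFS
    global load_calls
    load_calls += 1
    return [1, 2, 3, 4]

# Without caching: the data set is re-loaded on every iteration.
total = 0
for _ in range(3):
    total += sum(load_dataset())
assert load_calls == 3   # three expensive reads

# With "caching": load once, then iterate over the in-memory copy.
load_calls = 0
cached = load_dataset()  # analogous to calling cache() on an RDD
total = 0
for _ in range(3):
    total += sum(cached)
assert load_calls == 1   # only one expensive read
```

With a real cluster the gap is not one read versus three but one disk scan versus many, which is where the orders-of-magnitude figure for iterative jobs comes from.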

Spark is an accelerated Hadoop implementation that makes better use of RAM. Shark is an accelerated Hive implementation built on top of Spark.

  • Simplifying, “Shark” is just a language used on Spark, similar to HiveQL on Hive and Pig Latin on Pig. So Spark is an implementation, and Shark is a language.

Summarising all of the above and other positive reviews not quoted here, it is possible to conclude that Spark is a kind of Swiss Army knife for business that can be used both for regular DWH work and for fine-tuned research.



About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in the design, development and implementation of business requirements as a Microsoft SQL Server 6.5–2014 developer | Tabular/DAX | SSAS/MDX | Certified Tableau designer
This entry was posted in Big Data, Business Capability, Data to Knowledge, R&D. Bookmark the permalink.
