Hadoop Data Processing: Battle For Speed

The idea behind Hadoop is brilliant and revolutionary: a single algorithm, MapReduce, onto which all major data processing tasks (grouping, statistical, graph, etc.) can be decomposed. However, its reliance on input files and its lack of schema support ruled out the performance improvements that common database features such as B-trees and hash partitioning provide. Businesses demanded immediate improvement, and Hadoop vendors had to move on. They faced a choice between two paths: speed up MapReduce, or get rid of it and risk losing the scalability and fault tolerance that made it so valuable.
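To make the decomposition idea concrete, here is a minimal single-process sketch of how a grouping task (summing values per key) splits into map, shuffle and reduce steps. It is only an illustration in plain Python, not Hadoop's API; the function names and the sample records are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    # Map: turn one input record into (key, value) pairs.
    # Here a record is a hypothetical (department, amount) tuple.
    department, amount = record
    yield department, amount


def shuffle(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values of one key into a single result.
    return key, sum(values)


def map_reduce(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Illustrative input, not real data.
records = [("sales", 100), ("ops", 50), ("sales", 25)]
totals = map_reduce(records)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network and through files, which is where much of the classic MapReduce overhead comes from.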


And here is what they chose:

  1. Cloudera abandoned MapReduce and ended up with a product called Impala.
  2. Apache offered two options: Spark, with caching-based optimisation (originally presented by Berkeley’s lab), and Hive on Tez, the same Hive but running on an optimised version of MapReduce.

The question is: who is winning this competition today? As always, a benchmark is the only way to find out.

As mentioned above, Hive on classic MapReduce demonstrates really poor performance:


Comparison on large queries

However, on Tez it performs very well:


These are the conclusions that Yahoo! JAPAN’s staff reached after benchmarking two promising engines, Impala and Hive on Tez:

  • Impala returns a very fast response when the cluster is at low load state, but is not suitable for use such as running the SQL in parallel.
  • We have decided to adopt Hive on Tez because it can process the 15,000 SQL per hour that is being requested from our service.
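To put the 15,000 SQL per hour figure in perspective, the short calculation below converts it to a per-second rate and, using Little's law, to the number of queries in flight at once. The average latency value is purely a hypothetical assumption for illustration; the source does not state one.

```python
QUERIES_PER_HOUR = 15_000   # figure quoted in the conclusions above
SECONDS_PER_HOUR = 3_600


def required_rate(queries_per_hour):
    # Average number of queries the cluster must complete per second.
    return queries_per_hour / SECONDS_PER_HOUR


def queries_in_flight(rate_per_second, avg_latency_seconds):
    # Little's law: concurrency = arrival rate * average time in system.
    return rate_per_second * avg_latency_seconds


rate = required_rate(QUERIES_PER_HOUR)      # roughly 4.17 queries per second

# ASSUMPTION: a 30-second average query latency, chosen only as an example.
assumed_latency = 30.0
concurrent = queries_in_flight(rate, assumed_latency)
```

Under that assumed latency the engine would have to sustain on the order of a hundred concurrent queries, which is why behaviour under parallel load matters more here than single-query response time.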

And a benchmark of Spark and Impala gives the following results:

Comparison on large queries

Comparison on cache queries

Engine query performance vs Concurrent users

These are the main conclusions Trystan Leftwich makes:

  • Different engines perform well for different types of queries
    In general, Spark SQL and Impala perform best on small data sets (an interesting characteristic for a Big Data engine)
  • Impala scales with concurrency better than Hive and Spark (but not better than Hive on Tez; see the first benchmark)
  • A successful BI-on-Hadoop architecture will likely require more than one SQL-on-Hadoop engine. But since Azure does not provide Tez or Impala, Spark appears to be the only winner for companies relying on Microsoft infrastructure.

Before praising Spark as the winner, it is necessary to compare it with Data Lake Analytics, a native, non-open-source Microsoft technology (a direct benchmark would be hard because the two live in different environments). If it beats Spark on performance or, more significantly, on price, then it might take the medal. That is a matter for the next round of research.



About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in design, development and implementation of business requirements as a Microsoft SQL Server 6.5-2014 | Tabular/DAX | SSAS/MDX | Certified Tableau designer