The idea behind Hadoop is brilliant and revolutionary: an algorithm, MapReduce, onto which all major data processing tasks (grouping, statistical, graph, etc.) can be decomposed. However, its reliance on plain input files and its lack of schema support ruled out the performance improvements that common database features such as B-trees and hash partitioning provide. Business demanded immediate improvement, and Hadoop vendors had to move on. They faced a choice between two paths: speed up MapReduce, or get rid of it and risk losing the scalability and fault tolerance that made it so important.
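To make the decomposition idea concrete, here is a minimal single-process sketch of a grouping task expressed as map, shuffle, and reduce steps. It is an illustration in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `total_by_region`) and the `region,amount` input format are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (Hadoop does this across the cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def sales_mapper(line):
    # Hypothetical input format: "region,amount" text lines with no schema,
    # so every line has to be parsed on every run.
    region, amount = line.split(",")
    yield region, float(amount)


def sum_reducer(key, values):
    return sum(values)


def total_by_region(lines):
    """Rough equivalent of: SELECT region, SUM(amount) ... GROUP BY region."""
    return reduce_phase(shuffle(map_phase(lines, sales_mapper)), sum_reducer)


if __name__ == "__main__":
    print(total_by_region(["east,10.0", "west,5.5", "east,2.5"]))
```

Note that the sketch reads and parses every input line to answer the query. With nothing but raw files and no schema, there is no index or partition to narrow the scan, which is the cost described above.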
Here is what the vendors chose:
- Cloudera abandoned MapReduce and ended up with a product called Impala.
- Apache offered two options: Spark, with caching-based optimisation (originally developed at UC Berkeley's AMPLab), and Hive on Tez, the same Hive but running on Tez, an optimised, DAG-based successor to MapReduce.
The question is: who is winning this competition today? As always, a benchmark is the only way to find out.
As mentioned above, Hive on classic MapReduce demonstrates really poor performance:
However, on Tez it performs very well:
These are the conclusions Yahoo! JAPAN's staff reached after benchmarking the two promising engines, Impala and Hive on Tez:
- Impala returns a very fast response when the cluster is at low load state, but is not suitable for use such as running the SQL in parallel.
- We have decided to adopt Hive on Tez because it can process the 15,000 SQL per hour that is being requested from our service.
A benchmark of Spark and Impala gives the following result:
These are the main conclusions Trystan Leftwich makes:
- Different engines perform well for different types of queries
- In general, Spark SQL and Impala perform best on small data sets (an interesting characteristic for a Big Data engine)
- Impala scales with concurrency better than Hive and Spark (but not better than Hive-on-Tez – see the first benchmark)
- A successful BI on Hadoop architecture will likely require more than one SQL on Hadoop engine.
But since Azure does not provide Tez or Impala, Spark appears to be the only winner for companies relying on Microsoft infrastructure.
Before declaring Spark the winner, it is necessary to compare it with Data Lake Analytics, Microsoft's native, non-open-source technology (a direct benchmark would be hard because the two live in different environments). If it beats Spark on performance or, more significantly, on price, it may well take the medal. That is a matter for the next round of research.