It has been a long time since Cloudera introduced Impala as an alternative to the slow MapReduce-based Hive and positioned the new product as a tool for ad-hoc querying (I recall Mr Isotov expressed interest in this; now I can state that only a benchmark can give an answer).
These days Microsoft is striking back with Spark. It even provides a special ODBC connector to link Power BI directly to the Spark engine that thrashes somewhere above the clouds in Azure’s HDInsight cluster. A year-old Power BI blog article by program manager Theresa Palmer demonstrates how easily Power BI can query, for example, Twitter data or other massive(?) data, thus extending business capability “for scenarios such as iterative machine learning and interactive data analysis”. It is a fact that Spark is good for “iterative machine learning” thanks to its extensive caching functionality, but what about the second part? That is a good subject for a benchmark.
What should happen in theory: 1) smooth connectivity to the HDInsight cluster, 2) the ability to design charts without fully loading the massive data from Azure onto a local machine, 3) correct generation of GROUP BY queries after a report refresh or a user action such as slicing.
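To make points 2) and 3) concrete, here is a minimal sketch of the difference between pulling every row to the client and pushing a GROUP BY down to the engine. It is only an illustration: an in-memory SQLite database stands in for the remote Spark engine, and the table `clicks`, its columns and both function names are hypothetical, not anything Power BI or HDInsight actually exposes.

```python
import sqlite3

# In-memory SQLite stands in for the remote engine (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (site TEXT, clicks INTEGER)")  # hypothetical table
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)


def import_mode_totals(conn):
    """Pull every row to the client, then aggregate locally."""
    fetched = conn.execute("SELECT site, clicks FROM clicks").fetchall()
    totals = {}
    for site, clicks in fetched:
        totals[site] = totals.get(site, 0) + clicks
    # Second value: how many rows had to travel to the client.
    return totals, len(fetched)


def pushdown_totals(conn, site_filter=None):
    """Send a GROUP BY to the engine; only aggregated rows come back."""
    sql = "SELECT site, SUM(clicks) FROM clicks"
    params = ()
    if site_filter is not None:
        # A slicer selection becomes a WHERE clause in the generated query.
        sql += " WHERE site = ?"
        params = (site_filter,)
    sql += " GROUP BY site"
    fetched = conn.execute(sql, params).fetchall()
    return dict(fetched), len(fetched)


local_totals, rows_moved_import = import_mode_totals(conn)
remote_totals, rows_moved_pushdown = pushdown_totals(conn)
```

Both functions return the same totals, but the first moves every source row to the client while the second moves one row per group. On a 20 GB table that difference is the whole point of the test.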
For the test I picked real data from a Kaggle competition, about 20 GB in size.
The second one is that the Preview screen, after a pretty long 45-second wait, finally shows the first(?) N rows:
Then, however, a window appears with the caption “Evaluating…”:
That soon changes to “22 rows from Azure/hdinside”:
And after a lengthy 5-minute wait everything crashes with the error:
The error is reproducible: it appeared on every attempt. The issue is clearly a result of data size: when the same task is performed on a sample table, the dashboard builds without trouble:
What is noticeable is that any change on the dashboard gets an immediate response, which reveals that all the data has actually been loaded onto the local machine. Obviously the DirectQuery approach is not being used in this case. Microsoft's response recommends switching it on:
However, this option is available only in limited cases:
The final conclusion is: we should definitely expect ad-hoc querying support (in effect, support for Exploratory Analysis) in the near future, but for the time being the only option in this area is the relational pair: Azure SQL and Azure PDW.