Big Data ad-hoc querying with Power BI?

It has been a long time since Cloudera introduced Impala as an alternative to the slow MapReduce-based Hive and positioned the new product as a tool for ad-hoc querying (I recall Mr Isotov expressed interest in this; now I can state that only a benchmark can give an answer).

Nowadays Microsoft is striking back with Spark. It even provides a special ODBC connector to link Power BI directly to the Spark engine that thrashes somewhere above the clouds in Azure’s HDInsight cluster. A one-year-old Power BI blog article by program manager Theresa Palmer demonstrates how easily Power BI can query, for example, Twitter stuff or other massive(?) data, thus increasing business capability “for scenarios such as iterative machine learning and interactive data analysis”. It is a fact that Spark is good for “iterative machine learning” thanks to its extensive caching functionality, but how about the second part? That is a good subject for a benchmark.

What should happen in theory: 1) smooth connectivity to the HDInsight cluster, 2) the ability to design charts without fully loading the massive data from Azure onto a local machine, 3) correct generation of GROUP BY queries after a report refresh or a user action such as slicing.
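To make point 3 concrete, here is a minimal sketch of the kind of query a BI front end would need to generate when a user picks values in a slicer. It is not what Power BI actually emits; the function `build_group_by_query` and the table and column names (`clicks`, `device_type`, `is_click`, `site_category`) are hypothetical and chosen only for illustration.

```python
from typing import List, Mapping, Sequence, Tuple


def build_group_by_query(
    table: str,
    dimension: str,
    measure: str,
    slicer: Mapping[str, Sequence[object]],
) -> Tuple[str, List[object]]:
    """Return (sql, params) for one chart: SUM(measure) per dimension value.

    Assumption: identifiers come from trusted report metadata, so they are
    placed into the SQL text directly; slicer values go in as parameters.
    """
    where_parts: List[str] = []
    params: List[object] = []
    for column, values in slicer.items():
        if not values:
            # An empty slicer selection is treated here as "no filter".
            continue
        placeholders = ", ".join("?" for _ in values)
        where_parts.append(f"{column} IN ({placeholders})")
        params.extend(values)

    where_clause = f" WHERE {' AND '.join(where_parts)}" if where_parts else ""
    sql = (
        f"SELECT {dimension}, SUM({measure}) AS total "
        f"FROM {table}{where_clause} "
        f"GROUP BY {dimension}"
    )
    return sql, params


# Hypothetical slicer state: the user selected two site categories.
sql, params = build_group_by_query(
    "clicks", "device_type", "is_click", {"site_category": ["news", "sports"]}
)
print(sql)
print(params)
```

The point of the sketch is that only the small aggregated result, one row per `device_type`, would have to travel from the cluster to the desktop after every slicer change.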

For the test I picked real data from a Kaggle competition, around 20 GB in size.

The first good thing discovered is that there is no need to install a 3rd-party connector:

Spark HDInsight Connector

The second is that the Preview screen, after a pretty long 45-second wait, finally shows the first(?) N rows:

spark_adhoc02

However, next comes a window with the caption “Evaluating…”:

spark_adhoc03

That soon changes to “22 rows from Azure/hdinside”:

spark_adhoc04

And after a lengthy 5-minute wait everything crashes with the error:

spark_adhoc05

The error is stable: it came up on every attempt. The issue is definitely a result of the data size; when the same task is performed on a sample table, the dashboard builds just fine:

power bi desktop on sample table

What is noticeable is that any change on the dashboard gets an immediate response, revealing that all the data is actually loaded onto the local machine. Obviously the DirectQuery approach is not being used in this case. Microsoft, in its response, recommends switching it on:

spark_adhoc06_DirectQuery

However, this option is available only in limited cases:

spark_adhoc07_DirectQuery
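The difference between loading everything locally and pushing the aggregation down to the engine can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database purely as a stand-in for the Spark cluster (an assumption for the sake of a runnable example), and the `clicks` table, its columns, and the two helper functions are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the remote engine (assumption: any SQL
# engine that can evaluate GROUP BY illustrates the same point).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (device_type TEXT, is_click INTEGER)")
rows = [("phone" if i % 3 else "desktop", i % 2) for i in range(30000)]
conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)


def import_mode(connection):
    """Fetch every row, then aggregate on the client (what the test above showed)."""
    fetched = connection.execute(
        "SELECT device_type, is_click FROM clicks"
    ).fetchall()
    totals = {}
    for device, clicked in fetched:
        totals[device] = totals.get(device, 0) + clicked
    return totals, len(fetched)


def pushdown_mode(connection):
    """Let the engine aggregate and fetch only one row per group."""
    fetched = connection.execute(
        "SELECT device_type, SUM(is_click) FROM clicks GROUP BY device_type"
    ).fetchall()
    return dict(fetched), len(fetched)


import_totals, import_rows = import_mode(conn)
pushdown_totals, pushdown_rows = pushdown_mode(conn)

# Same answer, very different amount of data crossing the wire.
print(import_totals == pushdown_totals)
print(import_rows, pushdown_rows)
```

Both functions return the same totals, but the first transfers 30000 rows and the second only 2. On a 20 GB table that gap is the difference between a crash and a usable chart.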

The final conclusion: we should definitely expect ad-hoc querying support (in effect, support for Exploratory Analysis) in the near future, but for the time being the only option in this area is the relational pair: Azure SQL and Azure PDW.

About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in design, development and implementation of business requirements as a Microsoft SQL Server 6.5-2014 | Tabular/DAX | SSAS/MDX | Certified Tableau designer