NoSQL and Data Diversity

For 2016, as for 2015, Gartner states:

“Addressing the state of data quality and its business impact must remain a key priority for chief data officers and data and analytics leaders.”

The question is the cost of acting on this strategic advice, and NoSQL tools can lend a helping hand.

In SQL Server and PDW, the following two queries take noticeably different times to execute:

SELECT
    cust_id, category, company, brand,
    SUM(purchasequantity) AS TotalQty,
    SUM(purchaseamount) AS TotalAmount,
    COUNT(*) AS TotalTransCount
FROM transactions
GROUP BY cust_id, category, company, brand

SELECT
    cust_id, category, company, brand,
    SUM(CAST(purchasequantity AS decimal(18,4))) AS TotalQty,
    SUM(CAST(purchaseamount AS decimal(18,4))) AS TotalAmount,
    COUNT(*) AS TotalTransCount
FROM transactions
GROUP BY cust_id, category, company, brand

The second one would probably be about 10 times slower because of the per-row type conversion.
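
The cost of converting inside the aggregation can be illustrated outside SQL Server too. Below is a minimal Python sketch (hypothetical data, not the post's benchmark) contrasting a sum over already-typed values with a sum that parses each value on the fly, the analogue of `SUM(CAST(...))`:

```python
import time
from decimal import Decimal

# Hypothetical transaction rows: (customer, quantity, amount) as strings
rows = [("c1", "3", "19.99")] * 200_000

# Case 1: values typed once, before aggregation
typed = [Decimal(q) for _, q, _ in rows]
start = time.perf_counter()
total_qty = sum(typed)
t_native = time.perf_counter() - start

# Case 2: conversion happens inside the aggregation (analogue of SUM(CAST(...)))
start = time.perf_counter()
total_qty2 = sum(Decimal(q) for _, q, _ in rows)
t_cast = time.perf_counter() - start

print(total_qty == total_qty2)  # True: same result either way
print(f"typed sum: {t_native:.4f}s, parse-and-sum: {t_cast:.4f}s")
```

The exact ratio depends on the engine and data, but the pattern is the same: per-row conversion work multiplies with row count.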

Surprisingly, the same effect does not occur with NoSQL tooling. The following two Azure Data Lake Analytics (U-SQL) scripts execute in essentially the same time:

Script 1:

@input =
EXTRACT
...
purchasequantity decimal,
purchaseamount decimal
FROM "/transactions"
USING Extractors.Csv(silent:true);

@rows = SELECT id,
category,
company,
brand,
purchasequantity,
purchaseamount
FROM @input WHERE !id.StartsWith("id");

@res = SELECT
id,category,company,brand,
SUM(purchasequantity) AS TotalQty,
SUM(purchaseamount) AS TotalAmnt,
COUNT(*) AS TotalCnt
FROM @rows
GROUP BY id,category,company,brand;

Script 2:

@input =
EXTRACT
...
purchasequantity string,
purchaseamount string
FROM "/transactions"
USING Extractors.Csv(silent:true);

@rows = SELECT id,
category,
company,
brand,
decimal.Parse(purchasequantity) AS purchasequantity,
decimal.Parse(purchaseamount) AS purchaseamount
FROM @input WHERE !id.StartsWith("id");

@res = SELECT
id,category,company,brand,
SUM(purchasequantity) AS TotalQty,
SUM(purchaseamount) AS TotalAmnt,
COUNT(*) AS TotalCnt
FROM @rows
GROUP BY id,category,company,brand;

This native support for data diversity (for example, numbers stored as strings, as above, or differing date formats), together with tolerance to rogue data (such as stray commas inside numbers), can be very valuable to the business: it removes the need to spend resources on full data cleansing at the moment data is added to an Enterprise Data Warehouse or a private Data Lake. For example, if an enterprise receives another data stream with tables of 200-300 fields each, there is no need to assure semantic consistency with the master data on arrival; the ability to deal with non-ideal data in each field remains available later. In a relational system this approach would end in performance degradation, while in NoSQL it doesn't cost a penny.
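
The "clean it later" pattern described above can be sketched in a few lines of Python. The helper names and accepted formats here are illustrative assumptions, not anything from the post; the point is that tolerant parsing lets non-ideal fields survive ingestion and be handled per field later:

```python
from decimal import Decimal, InvalidOperation
from datetime import datetime

def parse_amount(raw):
    """Parse a numeric field that may carry rogue commas or be unparseable."""
    cleaned = raw.strip().replace(",", "")  # drop stray thousands separators
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # keep the row; flag the field for later cleansing

def parse_date(raw, formats=("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")):
    """Try several date formats before giving up."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None

print(parse_amount("1,234.56"))  # 1234.56
print(parse_amount("n/a"))       # None
print(parse_date("2016-03-01"))  # 2016-03-01
```

Rows with unparseable fields come through as `None` rather than failing the load, which is exactly the trade-off the paragraph describes: ingest first, resolve per-field quality issues later.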


The significant cost reduction and added robustness of NoSQL tools increase the IT capability of the business and can become a foundation for implementing a "lean MDM" approach.


About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in design, development and implementation of business requirements as a Microsoft SQL Server 6.5-2014 | Tabular/DAX | SSAS/MDX | Certified Tableau designer
This entry was posted in Big Data, Business Capability, R&D.
