Problem unsolvable for Data Lake Analytics

One of the tasks of Database Forensics is to detect events (or categories of events) that began too frequent. From programming perspective, this means that for each date there should be searches categories appearing more than certain number of times during next certain number of days.

In the next example are searched categories appearing more than 3 times during 20 subsequent days.

SET DATEFORMAT DMY

IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t
IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1

SELECT EventCategoryID, CAST(EventDate as date) AS EventDate
INTO #t
FROM
( select 707 as EventCategoryID,
 '2 mar 2012' as EventDate
 union select 708 , ' 2 mar 2012 '
 union select 9753 , ' 2 mar 2012 '
 union select 2509 , ' 2 mar 2012 '
 union select 5555 , ' 2 mar 2012 '
 union select 9753 , ' 2 mar 2012 '
 union select 9909 , ' 2 mar 2012 '
 union select 707 , ' 3 mar 2012 '
 union select 708 , ' 3 mar 2012 '
 union select 9753 , ' 3 mar 2012 '
 union select 4107 , ' 3 mar 2012 '
 union select 2106 , ' 3 mar 2012 '
 union select 814 , ' 3 mar 2012 '
 union select 9122 , ' 3 mar 2012 '
 union select 4120 , ' 3 mar 2012 '
 union select 707 , ' 4 mar 2012 '
 union select 907 , ' 4 mar 2012 '
 union select 9753 , ' 4 mar 2012 '
 union select 4509 , ' 4 mar 2012 '
 union select 2630 , ' 4 mar 2012 '
 union select 815 , ' 4 mar 2012 '
 union select 8101 , ' 4 mar 2012 '
 union select 9753 , ' 10 mar 2012 '
 union select 40000 , ' 10 mar 2012 ') t

DECLARE @period_days int = 20,
 @number_of_repeats_MIN int = 3,
 @number_of_repeats_MAX int = 1000

-- Solution via INNER JOIN with less-than-or-equal comparison:
SELECT t1.EventCategoryID, t1.EventDate, COUNT(*)
FROM #t t1
JOIN #t t2
ON t1.EventCategoryID = t2.EventCategoryID
 AND
 t2.EventDate >= t1.EventDate
 AND
 t2.EventDate <= DATEADD(DAY, @period_days, t1.EventDate) GROUP BY t1.EventCategoryID, t1.EventDate HAVING COUNT(*) BETWEEN @number_of_repeats_MIN AND @number_of_repeats_MAX ORDER BY 1,2,3 -- Much less efficient solution using a sub-query: SELECT DISTINCT EventDate, EventCategoryID, (SELECT COUNT(*)  FROM #t ti  WHERE ti.EventDate>=t.EventDate
 AND
 ti.EventDate<=DATEADD(DAY, @period_days, t.EventDate)
 AND
 t.EventCategoryID=ti.EventCategoryID
 ) as [RepeatNumber]
INTO #t1
FROM #t t

SELECT * FROM #t1
WHERE [RepeatNumber] BETWEEN
 @number_of_repeats_MIN
 AND
 @number_of_repeats_MAX

Both of the T-SQL queries cannot be executed in Azure: first one because INNER JOIN in U-SQL doesn’t support any other condition rather than “==”, second – because sub-queries in U-SQL are just static tables and there is no opportunity to send to a sub-query value from an outer query.

Very comfortable for technicians would be to implement pre-process – some tiny C#-based program – that searches for suspicious categories and stores results in a temporary table. However, it’s not suitable for ad-hoc analysis when business user moves slicers for number of days and number of repeats willing to get resultindg analytics immediately.

It seems that providing required analysis in ad-hoc manner in Azure would take implementing whether Hive on HDInsight Cluster (that might be expensive because the cluster needs to be kept on, at least during working hours), or Azure SQL DB (cheaper), or Azure PDW (again expensive solution).

Advertisements

About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in design, development and implementation of business requirements as a Microsoft SQL Server 6.5-2014 | Tabular/DAX | SSAS/MDX | Certified Tableau designer
This entry was posted in Big Data, Business Capability, R&D, Uncategorized and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s