5 questions for a time-limited interview for a BI developer position that touch almost all areas of the modern Microsoft BI stack

1. Suppose it is necessary to allow a user to input a worker’s salary while ensuring that it stays within the range defined for the worker’s category. The ranges for categories are stored in a separate table, and the data is inserted using T-SQL. Where is the best place to put the check? (One possible answer is sketched after the question list.)

2. A Power BI report contains two slicers, “Product Category” and “Customer”, and one grid with sales data. The business asks for cross-filtered slicers: after a customer is chosen, the “Product Category” slicer must show only the categories that customer has ever purchased. How can such cross-filtering be enabled in Power BI?

3. Two VMs are allocated for BI reporting. How would you distribute the four components of the MS BI stack (SQL Server, SSAS, SSIS, SSRS) between them to balance the load evenly?

4. There is a 1 GB fact table in the source system. It should be transferred to the DWH with a slight transformation. What combination of SSIS and T-SQL would you use, and why? (See the sketch after the list.)

5. Power BI reports are deployed together with their datasets. Suppose a fact table in the DWH is 1 GB in size, but the business requires all of this data to be available in a Power BI report. Propose an architecture that allows Power BI reports to operate on such large amounts of data.
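
For question 1, one common answer is to put the check as close to the data as possible. Below is a minimal T-SQL sketch, assuming hypothetical tables dbo.Worker and dbo.CategorySalaryRange: an AFTER trigger that validates the salary against the category’s range (a CHECK constraint wrapping a scalar UDF is a frequent alternative).

CREATE TRIGGER trg_Worker_SalaryRange ON dbo.Worker
AFTER INSERT, UPDATE
AS
BEGIN
    -- Reject the whole statement if any inserted/updated salary
    -- falls outside the range defined for the worker's category.
    IF EXISTS (SELECT 1
               FROM inserted AS i
               JOIN dbo.CategorySalaryRange AS r
                 ON r.CategoryId = i.CategoryId
               WHERE i.Salary NOT BETWEEN r.MinSalary AND r.MaxSalary)
    BEGIN
        RAISERROR('Salary is out of the range allowed for the worker''s category.', 16, 1)
        ROLLBACK TRANSACTION
    END
END

For question 4, a typical answer is ELT: a plain SSIS data flow moves the rows into a staging table, and the slight transformation is done set-based in T-SQL, where the engine can use indexes and statistics. A sketch of the T-SQL half, with purely illustrative table and column names:

INSERT INTO dwh.FactSales (DateKey, ProductKey, SalesAmount)
SELECT CONVERT(INT, CONVERT(CHAR(8), s.SaleDate, 112)), -- yyyymmdd date key
       p.ProductKey,                                    -- surrogate-key lookup
       s.Amount
FROM staging.Sales AS s
JOIN dwh.DimProduct AS p ON p.ProductNaturalKey = s.ProductId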


Problem unsolvable for Data Lake Analytics

One of the tasks of database forensics is to detect events (or categories of events) that have started occurring too frequently. From a programming perspective, this means that for each date we need to find the categories that appear more than a certain number of times during the following certain number of days.

In the next example we search for categories appearing more than 3 times during the 20 subsequent days.
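
For comparison, the same logic takes only a few lines of T-SQL (the table and column names here are assumptions); the point of the post’s title is that expressing the same date-range correlation in U-SQL (Azure Data Lake Analytics) is problematic:

-- Categories that appear more than 3 times within the 20 days
-- following each event date.
SELECT DISTINCT e.EventDate, e.Category
FROM dbo.Events AS e
CROSS APPLY (SELECT COUNT(*) AS Cnt
             FROM dbo.Events AS n
             WHERE n.Category = e.Category
               AND n.EventDate >= e.EventDate
               AND n.EventDate < DATEADD(DAY, 20, e.EventDate)) AS w
WHERE w.Cnt > 3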



Lake and Blob – how far from each other?

Microsoft defines its two competing cloud storage products as follows:

  • Azure Blob Storage is a general purpose, scalable object store that is designed for a wide variety of storage scenarios.
  • Azure Data Lake Store is a hyper-scale repository that is optimised for big data analytics workloads.


Let’s go deeper and list the major differences between them:

| | Azure Data Lake Store | Azure Blob Storage |
| --- | --- | --- |
| Purpose | Optimized storage for big data analytics workloads | General purpose object store for a wide variety of storage scenarios |
| Use Cases | Batch, interactive, streaming analytics and machine learning data such as log files, IoT data, click streams, large datasets | Any type of text or binary data, such as application back ends, backup data, media storage for streaming and general purpose data |
| Structure | Hierarchical file system: a Data Lake Store account contains folders, which in turn contain data stored as files | Object store with a flat namespace: there is actually a single layer of containers. You can virtually create a “file-system”-like layered storage, but in reality everything stays in one layer, the container it is in |
| API | REST API over HTTPS | REST API over HTTP/HTTPS |
| Server-side API | WebHDFS-compatible REST API | Azure Blob Storage REST API |
| Hadoop File System Client | Yes | Yes |
| Data Operations – Authentication | Based on Azure Active Directory identities | Based on shared secrets: Account Access Keys and Shared Access Signature keys |
| Data Operations – Authentication Protocol | OAuth 2.0; calls must contain a valid JWT (JSON Web Token) issued by Azure Active Directory | Hash-based Message Authentication Code (HMAC); calls must contain a Base64-encoded SHA-256 hash over part of the HTTP request |
| Data Operations – Authorization | POSIX Access Control Lists (ACLs); ACLs based on Azure Active Directory identities can be set at file and folder level | Account-level authorization via Account Access Keys; account-, container-, or blob-level authorization via Shared Access Signature keys |
| Data Operations – Auditing | Available | Available |
| Encryption of Data at Rest | Transparent, server-side: with service-managed keys, or with customer-managed keys in Azure Key Vault | Transparent, server-side: with service-managed keys, or with customer-managed keys in Azure Key Vault (coming soon); client-side encryption is also available |
| Developer SDKs | .NET, Java, Python, Node.js | .NET, Java, Python, Node.js, C++, Ruby |
| Analytics Workload Performance | Optimized performance for parallel analytics workloads; high throughput and IOPS | Not optimized for analytics workloads |
| Geo-redundancy | Locally redundant (multiple copies of the data in one Azure region) | Locally redundant (LRS), globally redundant (GRS), read-access globally redundant (RA-GRS) |

What is not mentioned here is that the U-SQL engine generates different query plans for Data Lake Store and Blob Storage. This means that for some types of solutions it would be more reasonable to base the choice not on load optimisation but on read optimisation.



Hive 2.0 – The Solution For Data Warehousing

Slides from a presentation by Hortonworks co-founder Alan Gates describing, at a high level, the new features of Hive 2.0 tailored for the kinds of queries typical of data warehousing:

Hive with LLAP is available in Azure:

http://www.zdnet.com/article/microsoft-hdinsight-gets-spark-2-0-faster-hive-and-better-security/


BIModal IT: Unheeded Silver Bullet?

Gartner gives the following definition of its new bimodal approach:


The essence of an enterprise bimodal capability is marrying more predictable evolution of products and technologies (Mode 1) with the new and innovative (Mode 2):

  • Mode 1 is optimized for areas that are more predictable and well-understood. It focuses on exploiting what is known, while renovating the legacy environment into a state that is fit for a digital world.
  • Mode 2 is exploratory, experimenting to solve new problems, and optimized for areas of uncertainty.



Restoring objects in SQL Server

This is code googled in panic after the accidental deletion of a stored procedure.


It does real magic, saving hours of work that would otherwise be spent on rewriting:

-- fn_dblog reads the transaction log of the CURRENT database,
-- so run this script in the context of the database that lost the object.
DECLARE
@Database_Name NVARCHAR(MAX) = 'SecretDbName',
@Date_From DATETIME = '21 Oct 2015',
@Date_To DATETIME = GETDATE()

DECLARE @Compatibility_Level INT

SELECT @Compatibility_Level = dtb.compatibility_level
FROM master.sys.databases AS dtb WHERE dtb.name = @Database_Name

IF ISNULL(@Compatibility_Level, 0) <= 80 BEGIN
 RAISERROR('The compatibility level should be equal to or greater than that of SQL Server 2005 (90)', 16, 1)
 RETURN
END

-- The definition of a dropped object is logged as deleted rows of
-- sys.sysobjvalues; skip the 32-byte row header and cast the rest to text.
SELECT
@Database_Name AS [Database Name],
CONVERT(VARCHAR(MAX),
SUBSTRING([RowLog Contents 0], 33, LEN([RowLog Contents 0]))) AS [Script]
FROM fn_dblog(NULL, NULL)
WHERE [Operation] = 'LOP_DELETE_ROWS' AND [Context] = 'LCX_MARK_AS_GHOST'
AND [AllocUnitName] = 'sys.sysobjvalues.clst'
-- Keep only rows deleted by DROP transactions within the given date range
AND [Transaction ID] IN
(SELECT DISTINCT [Transaction ID] FROM fn_dblog(NULL, NULL)
 WHERE [Context] = 'LCX_NULL' AND [Operation] = 'LOP_BEGIN_XACT'
 AND [Transaction Name] = 'DROPOBJ'
 AND CONVERT(NVARCHAR(11), [Begin Time]) BETWEEN @Date_From AND @Date_To)
AND SUBSTRING([RowLog Contents 0], 33, LEN([RowLog Contents 0])) <> 0


Governed Data Discovery vs BI

Is there a best strategy?


Governed Data Discovery has been one of the trending buzzwords since 2014.

The term made its debut in Gartner’s 2014 Magic Quadrant report: Gartner first popularized the term “data discovery” and is now adding “governed” to it.

Today’s Problem

In many cases even successfully delivered Business Intelligence projects contribute to a company’s IT capability but don’t increase its business capability. This happens mainly because the resulting analytics doesn’t match the context of the business, or, simply put, reports and dashboards don’t answer business questions. Knowledge is presented in a way from which wisdom cannot be extracted.

The GDD Solution

“Governed Data Discovery is the ability to meet the dual demands of enterprise IT and business users,” asserts Gartner.

GDD differs from BI quite a lot. Here are the key points describing their major characteristics:

BI Process

  • Defining business goals metrics
  • Analyzing business processes
  • Identifying operational data
  • Defining business logic and rules
  • Defining mapping and quality rules

Analysis -> Design -> Develop

  • Reports: Define what happened
  • OLAP: Defined metrics/dimensions
  • Dashboards: Defined metrics/goals
  • Scorecards: Defined KPIs/variance

Governed Discovery Process

  • Clear business goal (mission)
  • Inherent business knowledge
  • Access data and work with context
  • Ability to iterate analytic models
  • Ability to share and verify findings

Discover -> Verify -> Govern

  • Discovered context promoted to governed context
  • Discovered analytic models
  • Data shows how the business works
  • Data shows how the market works

Realisation

To be a true Governed Data Discovery solution, the BI platform needs to offer four main components:

1) Self-Service Central Control

A built-in, self-service, and centralized administrative toolset to govern an organization’s BI. This robust and extensive administrative backend is required to manage every user’s experience, security, content, and data access from a single, intuitive interface. Power BI is a good approximation of a platform that can support a GDD solution.

2) Data Governance

The data needs to remain centralised so that there is one version of the truth.

A Data Lineage capability that tracks the lifecycle of the model is also highly recommended, especially in large-scale projects.

3) The Content Lifecycle

A content repository in a centralized, shared paradigm, one that also tracks the content lifecycle, is needed. This ensures content integrity and makes it easy to find and implement any changes or upgrades.

4) Secure & Protected

A strong security model is vital, so that the data is not only secured against external threats but also kept confidential internally.

Approaches

All approaches in data management can be split up into 4 cases:

[Figure: the four data-management cases arranged as quadrants]

Although theoretically good results are achievable in all the cases, some of them seem much harder to realise than others. It is barely possible to imagine that Big Data (the bottom-right variant) could be provided to end users without cleaning and basic transformation, which inevitably involves semantic integration and effectively adds context to the data. Realised this way, the solution moves from the bottom-right quadrant to the top-right one.

For GDD, the most realisable variant is the combination of well-prepared data (the “information” level) with full flexibility for end users to extract “wisdom” from it:

[Figure: the information-to-wisdom pyramid]

To conclude: in the real world there is no way to avoid structuring data before delivering it to the business for decision support. The only choice is whether to do it in a rigid way (BI) or in a flexible one (GDD).

Source1, Source2
