Lake and Blob – how far apart are they?

Microsoft defines its two competing cloud storage products as follows:

  • Azure Blob Storage is a general purpose, scalable object store that is designed for a wide variety of storage scenarios.
  • Azure Data Lake Store is a hyper-scale repository that is optimised for big data analytics workloads.

Let’s go deeper and list the major differences between them:

Purpose
  • Azure Data Lake Store: Optimized storage for big data analytics workloads.
  • Azure Blob Storage: General-purpose object store for a wide variety of storage scenarios.

Use Cases
  • Azure Data Lake Store: Batch, interactive and streaming analytics, and machine learning data such as log files, IoT data, click streams and large datasets.
  • Azure Blob Storage: Any type of text or binary data, such as application back ends, backup data, media storage for streaming and general-purpose data.

Structure
  • Azure Data Lake Store: Hierarchical file system. A Data Lake Store account contains folders, which in turn contain data stored as files.
  • Azure Blob Storage: Object store with a flat namespace. There is only a single layer of containers. You can create a virtual “file system” with layered storage, but in reality every blob sits in one layer: the container that holds it.

Server-side API
  • Azure Data Lake Store: WebHDFS-compatible REST API.
  • Azure Blob Storage: Azure Blob Storage REST API.

Hadoop File System Client
  • Azure Data Lake Store: Yes.
  • Azure Blob Storage: Yes.

Data Operations – Authentication
  • Azure Data Lake Store: Based on Azure Active Directory identities.
  • Azure Blob Storage: Based on shared secrets: Account Access Keys and Shared Access Signature keys.

Data Operations – Authentication Protocol
  • Azure Data Lake Store: OAuth 2.0. Calls must contain a valid JWT (JSON Web Token) issued by Azure Active Directory.
  • Azure Blob Storage: Hash-based Message Authentication Code (HMAC). Calls must contain a Base64-encoded HMAC-SHA256 signature computed over part of the HTTP request.

Data Operations – Authorization
  • Azure Data Lake Store: POSIX Access Control Lists (ACLs). ACLs based on Azure Active Directory identities can be set at file and folder level.
  • Azure Blob Storage: For account-level authorization, use Account Access Keys. For account, container or blob authorization, use Shared Access Signature keys.

Data Operations – Auditing
  • Azure Data Lake Store: Available.
  • Azure Blob Storage: Available.

Encryption of data at rest
  • Azure Data Lake Store: Transparent, server-side, with service-managed keys or with customer-managed keys in Azure Key Vault.
  • Azure Blob Storage: Transparent, server-side, with service-managed keys or with customer-managed keys in Azure Key Vault (coming soon). Client-side encryption is also available.

Developer SDKs
  • Azure Data Lake Store: .NET, Java, Python, Node.js.
  • Azure Blob Storage: .NET, Java, Python, Node.js, C++, Ruby.

Analytics Workload Performance
  • Azure Data Lake Store: Optimized performance for parallel analytics workloads. High throughput and IOPS.
  • Azure Blob Storage: Not optimized for analytics workloads.

Geo-redundancy
  • Azure Data Lake Store: Locally redundant (multiple copies of data in one Azure region).
  • Azure Blob Storage: Locally redundant (LRS), geo-redundant (GRS), read-access geo-redundant (RA-GRS).
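
To make the Structure difference more concrete, here is a minimal Python sketch of how a flat namespace can look like a folder tree. It does not call any Azure API. The function name list_virtual_dir and the sample blob names are made up for illustration; the idea it imitates is listing blobs by name prefix and delimiter.

```python
def list_virtual_dir(blob_names, prefix="", delimiter="/"):
    """Emulate a prefix/delimiter listing over a flat set of blob names.

    Returns (virtual_subdirs, blobs) found directly under `prefix`.
    Nothing here is a real folder: every name lives in one flat layer.
    """
    subdirs, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a sub-folder.
            subdirs.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return sorted(subdirs), sorted(blobs)


# Hypothetical contents of a single container (one flat layer of names).
container = [
    "logs/2017/01/app.log",
    "logs/2017/02/app.log",
    "logs/readme.txt",
    "backup.bak",
]

root_dirs, root_blobs = list_virtual_dir(container)
log_dirs, log_blobs = list_virtual_dir(container, prefix="logs/")
```

In this sketch the "folders" are derived from the blob names at listing time, so renaming a virtual folder would mean rewriting every blob name that shares the prefix. In a hierarchical file system such as Data Lake Store, folders are real objects.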
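
The shared-key authentication protocol listed for Blob Storage can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not the full protocol: the real string-to-sign has a strict, documented layout of HTTP verb, headers and canonicalized resource, while the string below is a simplified placeholder, and the account name and key are invented.

```python
import base64
import hashlib
import hmac


def sign_request(string_to_sign: str, account_key_b64: str) -> str:
    """Return a Base64-encoded HMAC-SHA256 signature of the given string."""
    key = base64.b64decode(account_key_b64)  # account keys are Base64 text
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(account: str, signature: str) -> str:
    """Build the value of the Authorization header for shared-key auth."""
    return f"SharedKey {account}:{signature}"


# Invented values for illustration only; not a real account or key.
demo_account = "mystorageaccount"
demo_key = base64.b64encode(b"not-a-real-account-key").decode("ascii")

# Simplified placeholder, NOT the real string-to-sign layout.
demo_string_to_sign = "GET\n/mystorageaccount/mycontainer/myblob.txt"

signature = sign_request(demo_string_to_sign, demo_key)
header = authorization_header(demo_account, signature)
```

The service holds the same key, recomputes the signature from the incoming request and compares the two. Data Lake Store takes a different route: the caller presents a token issued by Azure Active Directory instead of proving knowledge of a shared secret.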

What the comparison above does not mention is that the U-SQL engine generates different query plans for Data Lake Store and Blob Storage. For some types of solutions it is therefore more reasonable to choose a store based on how well it is optimised for reads rather than on how well it is optimised for loads.



About fdtki

Sr. BI Developer | An accomplished, quality-driven IT professional with over 16 years of experience in design, development and implementation of business requirements as a Microsoft SQL Server 6.5-2014 | Tabular/DAX | SSAS/MDX | Certified Tableau designer
