Recently I was stunned when loading bulky data into PDW turned out an order of magnitude faster than expected. Loading a 20 GB file from my local machine into Azure Blob Storage took 15 minutes, and copying it from Blob Storage into Data Lake (i.e. from one hump in the cloud to another) took 5 minutes – but copying that unstructured data into the relational DB took only 4 minutes 20 seconds, which was genuinely shocking. It was supposed to take much longer because of the transformation into relational form – expecting around 20 minutes overall, I had even planned "to put the billy on" ("to make tea" in Aussie slang 😉) – so for several seconds I simply couldn't believe my eyes and assumed the transfer had been interrupted and the success message was a mistake. No, it was one of those rare moments when software works more than brilliantly.
So how does the magic work?
This is what the SQL Server Customer Advisory Team says about PolyBase's loading functionality:
PolyBase data loading (i.e. loading through EXTERNAL TABLES) is not limited by the Control node, so as you scale out your DWU, your data transfer throughput also increases. By mapping the external files as external tables, the data files can be accessed using standard T-SQL commands.
As the following architecture diagrams show, each Compute node connects to an external resource independently, whereas "old school" methods still go through the Control node. That is precisely why PolyBase provides a superior load rate: its data transfer is not bottlenecked by the Control node. If using PolyBase is not currently an option, however, the following technologies and methods can be used for loading into PDW:
- Bulk Insert
- Azure Data Factory (ADF) (uses SQLBulkCopy)
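For the "old school" route, a load via BULK INSERT looks roughly like this (the table name, file path and delimiters are illustrative, not from an actual setup):

```sql
-- Illustrative sketch: load a CSV into a staging table.
-- All traffic here funnels through a single entry point,
-- which is exactly the bottleneck PolyBase avoids.
BULK INSERT dbo.SalesStaging
FROM 'C:\loads\sales.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2   -- skip the header row
);
```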
Finally, the MS team gives a general recommendation that I completely agree with:
As a general rule, we recommend making PolyBase your first choice for loading data into SQL Data Warehouse unless you can’t accommodate PolyBase-supported file formats.
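To make that recommendation concrete, a minimal PolyBase load is sketched below. Every name here – the data source, credential, container, file layout and table columns – is made up for illustration; the pattern is: map the files as an external table, then land them in a distributed table with CTAS so every Compute node reads the storage in parallel.

```sql
-- Illustrative names throughout; adjust credential, location and format.
CREATE EXTERNAL DATA SOURCE AzureBlobSrc
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageacct.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Map the blob files as an external table: no data moves yet.
CREATE EXTERNAL TABLE dbo.SalesExternal (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureBlobSrc,
    FILE_FORMAT = CsvFormat
);

-- CTAS pulls the data in, with each Compute node reading independently.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId))
AS SELECT * FROM dbo.SalesExternal;
```

The external table itself is just metadata; the actual parallel transfer happens at CTAS (or INSERT…SELECT) time, which is where the load rate I saw comes from.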