1. Introduction
Microsoft Fabric and Azure Databricks are both powerhouses in the data analytics field. Both platforms can be used end-to-end in a medallion architecture, from data ingestion to creating data products for end users. Azure Databricks excels in the initial stages thanks to its strength in processing large datasets and populating the different zones of the lakehouse. Microsoft Fabric performs well in the later stages, when data is consumed. Coming from Power BI, its SaaS setup is easy to use and it provides self-service capabilities to end users.
Given the different strengths of these products, and given that many customers do not have a greenfield situation, a strategic decision can be to integrate the products. You then need to find a logical integration point where both products "meet". This should be done with security in mind, as this is a top priority for all enterprises.
This blog post first explores three different integration options: lakehouse split, virtualization with shortcuts, and exposing via SQL API. The SQL API is a common integration point between back end and front end, and the security architecture of this integration is discussed in more detail in chapter 3. See the architecture diagram below.
2. Azure Databricks - Microsoft Fabric integration overview
Before diving into the details of securing the SQL API architecture, it is helpful to briefly discuss the different options for integrating Azure Databricks and Microsoft Fabric. This chapter outlines three options, highlighting their advantages and disadvantages. For a more extensive overview, refer to this blog.
2.1 Lakehouse split: Bronze, silver zone in Databricks | gold zone in Fabric
In this architecture, data is processed by Databricks up to the silver zone. Fabric copies and processes the data to the gold zone in Fabric using V-Ordering. Gold zone data is exposed via a Fabric lakehouse so that data products can be created for end users, see the image below.
The advantage of this architecture is that data is optimized for consumption in Fabric. The disadvantage is that the lakehouse is split over two tools, which adds complexity and can make data governance harder (Unity Catalog for bronze/silver, but not for gold).
This architecture is most applicable to companies that place a strong emphasis on data analytics in Microsoft Fabric and may even want to eventually migrate the entire lakehouse to Microsoft Fabric.
2.2 Virtualization: Lakehouse in Databricks | shortcuts to Fabric
In this architecture, all data in the lakehouse is processed by Databricks. Data is virtualized to a Microsoft Fabric lakehouse using ADLSgen2 shortcuts or even a mirrored Azure Databricks Unity Catalog in Fabric, see also the image below.
The advantage of this architecture is that the lakehouse is owned by a single tool, which gives fewer challenges in integration and governance. The disadvantage is that data is not optimized for Fabric consumption. You may then require additional copies in Fabric to apply V-Ordering and so optimize for Fabric consumption.
This architecture is most applicable to companies that want to keep the lakehouse Databricks-owned and want to enable end users to do analytics in Fabric, where the lack of V-Ordering is not much of a concern. The latter could be true if the data sizes are not too big and/or end users need a data copy anyway.
2.3 Exposing SQL API: Lakehouse in Databricks | SQL API to Fabric
In this architecture, all data in the lakehouse is again processed by Databricks. However, the data is exposed to Fabric using the SQL API. For this, you can decide to use a dedicated Databricks SQL warehouse or serverless SQL. The main difference with the shortcut architecture in the previous section is that data is processed in Databricks rather than in Fabric. This can be compared to a web app firing a SQL query at a database: the query is executed in the database.
The advantage of this architecture is that the lakehouse is owned by a single tool, which gives fewer challenges in integration and governance. Also, the SQL API provides a clean interface between Azure Databricks and Microsoft Fabric, with less coupling compared to shortcuts. The disadvantage is that end users in Fabric are limited to Databricks SQL, and Fabric is merely used as a reporting tool rather than an analytics tool.
This architecture is most applicable to companies that want to keep the lakehouse Databricks-owned and are looking to enhance Azure Databricks with the Power BI capabilities that Microsoft Fabric offers.
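To make the "query is executed in Databricks" idea concrete, here is a minimal sketch of how any client could hand a SQL statement to a Databricks SQL warehouse over HTTPS. It builds, but does not send, a request for the Databricks SQL Statement Execution REST API. This is an illustration only: the connector used by Fabric and Power BI talks to the warehouse through its own driver, not through this code. The host name, token, warehouse ID and table name are hypothetical placeholders.

```python
import json
import urllib.request


def build_statement_request(
    host: str, token: str, warehouse_id: str, sql: str
) -> urllib.request.Request:
    """Build (but do not send) a SQL Statement Execution API request.

    The caller only ships the SQL text; the warehouse identified by
    `warehouse_id` does the actual processing on the Databricks side.
    """
    payload = {
        "warehouse_id": warehouse_id,  # SQL warehouse that runs the query
        "statement": sql,              # SQL text executed in Databricks
        "wait_timeout": "30s",         # wait up to 30s for an inline result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_statement_request(
    host="adb-1234567890123456.7.azuredatabricks.net",
    token="<token>",
    warehouse_id="<warehouse-id>",
    sql="SELECT region, SUM(amount) FROM main.gold.sales GROUP BY region",
)
# Sending it would be: urllib.request.urlopen(request)
```

Only the SQL text and the result set cross the boundary between the two products, which is what keeps the coupling low compared to shortcuts.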
In the next chapter, a security architecture is discussed for this SQL API integration.
3. Exposing SQL API: security architecture
In this chapter, a security architecture is discussed for the SQL API integration. The rationale is that the SQL API is a common touch point where back end and front end meet. Furthermore, most security recommendations also apply to the other architectures discussed earlier.
3.1 Advanced SQL API architecture
To achieve defense in depth, network isolation and identity-based access control are the two most important steps. You can find this in the diagram below, which was already presented in the introduction of this blog.
In this diagram, three key connectivities that need to be secured are highlighted: ADLSgen2 to Databricks connectivity, Azure Databricks to Microsoft Fabric connectivity, and Microsoft Fabric to end user connectivity. In the remainder of this section, the connectivity between the resources is discussed, focusing on networking and access control.
It is not in scope to discuss how ADLSgen2, Databricks or Microsoft Fabric can be secured as products themselves. The rationale is that all three resources are major Azure products and offer extensive documentation on how to achieve this. This blog focuses on the integration points.
3.2 ADLSgen2 - Azure Databricks connectivity
Azure Databricks needs to fetch data from ADLSgen2 with Hierarchical Namespace (HNS) enabled. ADLSgen2 is used as storage since it provides the best disaster recovery capabilities. This includes point-in-time recovery integration with Azure Backup coming in 2025, which offers better protection against malware attacks and accidental deletions. The following networking and access control practices apply.
Networking: Azure Storage public access is disabled. To make sure that Databricks can access the storage account, private endpoints are created in the Databricks VNET. This makes sure that the storage account cannot be accessed from outside the company network and that data stays on the Azure backbone.
Identity-based access control: The storage account can only be accessed via identities, and access keys are disabled. To allow Databricks Unity Catalog access to the data, the Databricks access connector identity needs to be granted access using an external location. Depending on the data architecture, this can be an RBAC role on the whole container or a fine-grained ACL/POSIX access rule on the data folder.
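For the fine-grained ACL option, it helps to see which entries are actually needed. With POSIX-style ACLs on ADLSgen2, an identity that should read a folder needs execute (--x) on every parent folder in order to traverse the path, plus read and execute (r-x) on the folder itself. The sketch below only computes that plan and does not call Azure. The object ID and folder path are hypothetical, and files already inside the folder, as well as default ACLs for new files, would need to be handled separately.

```python
def read_acl_plan(object_id: str, folder: str) -> list[tuple[str, str]]:
    """Return (path, acl_entry) pairs that let `object_id` read `folder`.

    Parent folders get traverse-only access (--x); the target folder gets
    read + execute (r-x) so its content can be listed.
    """
    parts = [p for p in folder.strip("/").split("/") if p]
    if not parts:
        # The target is the container root itself.
        return [("/", f"user:{object_id}:r-x")]

    plan = [("/", f"user:{object_id}:--x")]
    for depth in range(1, len(parts)):
        parent = "/" + "/".join(parts[:depth])
        plan.append((parent, f"user:{object_id}:--x"))
    plan.append(("/" + "/".join(parts), f"user:{object_id}:r-x"))
    return plan


# Hypothetical object ID of the Databricks access connector identity.
for path, entry in read_acl_plan("00000000-0000-0000-0000-000000000000", "silver/sales"):
    print(path, entry)
```

Compared to an RBAC role on the whole container, this limits the access connector to the folders it needs, at the cost of more entries to manage.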
3.3 Azure Databricks - Microsoft Fabric connectivity
Microsoft Fabric needs to fetch data from Azure Databricks. This data is used by Fabric to serve end users. In this architecture, the SQL API is used. The networking and identity access control points are also mostly applicable to the shortcut architecture discussed in section 2.2.
Networking: Azure Databricks public access is disabled. This is true for both the front end and the back end, such that clusters are deployed without a public IP address. To make sure that Microsoft Fabric can access data exposed via the SQL API from a network perspective, a data gateway needs to be deployed. You could decide to deploy a virtual machine in the Databricks VNET; however, that is an IaaS component that needs to be maintained, which brings security challenges of its own. A better option is to use a managed virtual network data gateway, which is Microsoft-managed and provides the connectivity.
Identity-based access control: Data in Azure Databricks is exposed via Unity Catalog. Data in Unity Catalog shall only be exposed via identities, using fine-grained access control on tables and row-level security. It is not yet possible to use Microsoft Fabric workspace identities to access the Databricks SQL API. Instead, a service principal shall be granted access to the data in Unity Catalog, and a personal access token based on this service principal shall be used in the Databricks connector in Microsoft Fabric.
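As a rough illustration of what "granted access to the data in Unity Catalog" can look like, the sketch below generates the Unity Catalog SQL statements that give a service principal read access to selected tables, plus a statement that attaches a row filter function to a table. It only builds the SQL text, which would then be run in Databricks by someone with sufficient privileges. The application ID, catalog, schema, table and function names are hypothetical, and identifiers are not escaped here.

```python
def grant_statements(
    principal: str, catalog: str, schema: str, tables: list[str]
) -> list[str]:
    """SQL that gives `principal` read-only access to the listed tables."""
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
    ]
    statements += [
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`"
        for table in tables
    ]
    return statements


def row_filter_statement(table: str, function: str, columns: list[str]) -> str:
    """SQL that attaches an existing row filter function to a table."""
    return f"ALTER TABLE {table} SET ROW FILTER {function} ON ({', '.join(columns)})"


# Hypothetical application ID of the service principal used by Fabric.
sp_app_id = "11111111-2222-3333-4444-555555555555"
for statement in grant_statements(sp_app_id, "main", "gold", ["sales", "customers"]):
    print(statement)
print(row_filter_statement("main.gold.sales", "main.gold.region_filter", ["region"]))
```

Granting SELECT per table rather than on the whole schema keeps the service principal limited to the data products that are meant to be consumed in Fabric.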
3.4 Microsoft Fabric - end user connectivity
In this architecture, end users connect to Microsoft Fabric to access reports and to do self-service BI. Within Microsoft Fabric, different types of reports can be created based on Power BI. You can apply the following networking and identity-based access controls.
Networking: Microsoft Fabric public access is disabled. Currently, this can only be done at tenant level; more granular workspace private access will become available in 2025. That will make sure that a company can differentiate between private and public workspaces. To make sure that end users can access Fabric, private endpoints for Fabric are created in the workplace VNET. This workplace VNET can be connected to the corporate on-premises network using VPN or ExpressRoute. The separation of different networks ensures isolation between the different resources.
Identity-based access control: End users should get access to reports on a need-to-know basis. This can be done by creating a separate workspace where reports are stored and to which users get access. Also, users shall only be allowed to log in to Microsoft Fabric with conditional access policies applied. This way, it can be ensured that users can only log in from hardened devices to prevent data exfiltration.
3.5 Closing remarks
In the previous sections, an architecture is described where everything is made private and multiple VNETs and jump hosts are used. To get your hands dirty and test this architecture faster, you can decide to test with the simplified architecture below.
In this architecture, Fabric is configured with public access enabled. The rationale is that the Fabric public access setting is currently a tenant-wide setting. This implies that you have to make all workspaces in a company either private or public. More granular workspace private access will become available in 2025. Also, a single subnet is used to deploy all resources, to avoid peering between VNETs and/or deploying multiple jump hosts for connectivity.
4. Conclusion
Microsoft Fabric and Azure Databricks are both powerhouses in the data analytics field. Both tools can cover all parts of the lakehouse architecture, but both tools also have their own strengths. A strategic decision could be to integrate the tools, especially if there is no greenfield situation and both tools are already used in a company.
Three different integration architectures are discussed: lakehouse split, virtualization with shortcuts, and exposing via SQL API. The first two architectures are more relevant if you want to put more emphasis on the Fabric analytics capabilities, whereas the SQL API architecture is more relevant if you want to focus on the Fabric Power BI reporting capabilities.
In the remainder of the blog, a security architecture is provided for the SQL API architecture, with a focus on network isolation, private endpoints and identity. Although this architecture focuses on exposing data from Databricks SQL, the security principles also apply to the other architectures.
In short: there are numerous things to take into account when deciding if and where to integrate Azure Databricks with Microsoft Fabric. However, this shall always be done with security in mind. This blog aimed to give you an in-depth overview using the SQL API as a practical example.