Prescriptive Practices for Deploying and Running Azure Databricks at Scale
Planning, deploying, and running Azure Databricks (ADB) at scale requires making many architectural decisions. While each ADB deployment is unique to an organization’s needs, we have found that some patterns are common across most successful ADB projects. This guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks.
We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks applications, and finally, running Azure Databricks in production. The audience for this guide includes system architects, field engineers, and development teams from customers, Microsoft, and Databricks.
Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning
Azure Databricks is a cloud-optimized, managed PaaS offering designed to hide the underlying distributed-systems and networking complexity from the end user as much as possible. It is backed by a support team that monitors its health and debugs tickets filed via Azure. This allows ADB users to focus on developing value-generating applications rather than stressing over infrastructure management.
We recommend:
- Map workspaces to business divisions to streamline access control and align with chargeback models
- Deploy workspaces in multiple subscriptions to honor Azure and ADB workspace limits
- Consider isolating each workspace in its own VNet and use the hub-and-spoke model
- Select the largest VNet CIDR when using the Bring Your Own VNet feature
- Deploy ADB with limited private IP addresses by using a disconnected VNet for the Databricks workspace
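The VNet CIDR you select directly caps how many cluster nodes a workspace can run. The sketch below estimates that cap, assuming each node consumes one IP address in each of the two Databricks-managed subnets (host and container) and that Azure reserves five addresses per subnet; the helper name and example prefixes are illustrative:

```python
def max_cluster_nodes(subnet_prefix: int) -> int:
    """Rough upper bound on total cluster nodes for one Databricks subnet.

    Assumes each node takes one IP in each of the two Databricks-managed
    subnets, and that Azure reserves 5 addresses per subnet.
    """
    return 2 ** (32 - subnet_prefix) - 5

for prefix in (26, 24, 22):
    print(f"/{prefix} subnets -> up to {max_cluster_nodes(prefix)} nodes")
```

Running the loop shows why the guide recommends the largest CIDR you can afford: a /26 subnet pair supports only 59 nodes, while a /22 pair supports 1019.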
Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Cluster Performance
ADB clusters are divided into “types” (Interactive and Jobs) and “modes” (Standard and High Concurrency). We recommend:
- Use shared High Concurrency clusters for interactive analytics to minimize cost and optimize for latency
- Use single user ephemeral Standard clusters for batch ETL workloads to improve security and throughput
- Favor cluster-scoped init scripts over global and named scripts to avoid launch failures
- Use the Cluster Log Delivery feature to manage logs in a blob store under your control
- Choose VMs that match the workload class (ML, Streaming, ETL, Interactive)
- Arrive at the correct cluster size through iterative performance testing
- Tune shuffle parameters for optimal performance
- Partition your data to take advantage of partition pruning and data skipping
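As a starting point for the shuffle-tuning recommendation above, a common rule of thumb is to size `spark.sql.shuffle.partitions` so that each partition holds roughly 100–200 MB of shuffle data. The helper below sketches that heuristic; the 200 MB target and the function name are assumptions, not ADB defaults:

```python
def shuffle_partitions(total_shuffle_bytes: int,
                       target_partition_bytes: int = 200 * 1024 ** 2) -> int:
    """Heuristic partition count: one partition per ~200 MB of shuffle data.

    Feed the result to spark.conf.set("spark.sql.shuffle.partitions", n),
    then re-measure and iterate, as the guide recommends.
    """
    # Ceiling division, so a partial partition still gets counted.
    return max(1, -(-total_shuffle_bytes // target_partition_bytes))

# Example: a stage that shuffles about 50 GB.
print(shuffle_partitions(50 * 1024 ** 3))  # 256
```

Treat the result as a first guess to refine through the iterative performance testing described above, not a final setting.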
Running ADB Applications Smoothly: Guidelines on Observability and Monitoring
We focus on collecting resource utilization metrics from ADB clusters into a Log Analytics workspace, as this is the most common customer request. This includes:
- Configuring the Log Analytics Agent on each cluster node to stream VM metrics
- Querying the collected VM metrics in Log Analytics to understand utilization
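To illustrate the querying step, here is a minimal sketch of a Kusto (KQL) query over the `Perf` table that the Log Analytics Agent populates; the counter names assume the agent's default performance counters, and the commented SDK call shows one way to submit it (using the `azure-monitor-query` package):

```python
# A KQL query over the Perf table populated by the Log Analytics Agent,
# averaging CPU utilization per node in 5-minute buckets.
query = """
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
""".strip()

print(query)

# To run it against a workspace (requires azure-monitor-query and credentials):
#   from datetime import timedelta
#   from azure.identity import DefaultAzureCredential
#   from azure.monitor.query import LogsQueryClient
#   client = LogsQueryClient(DefaultAzureCredential())
#   result = client.query_workspace("<workspace-id>", query,
#                                   timespan=timedelta(hours=1))
```

The same pattern extends to memory and disk counters by changing `ObjectName` and `CounterName`.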
Cost Management, Chargeback, and Analysis
This section covers Azure Databricks billing, tools to manage and analyze cost, and how to charge back to teams. Key recommendations include:
- Leverage Azure tags (default and custom) to enable chargeback by filtering resource usage per team
- Use the Azure Cost Management service and Power BI to create custom cost reports and visualizations
- Be aware of pricing differences across regions and the options for pre-purchasing DBUs
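As a sketch of tag-based chargeback, the snippet below aggregates hypothetical usage records by a team tag. The record fields and DBU rates are illustrative only, not the actual Azure Cost Management export schema or current pricing:

```python
from collections import defaultdict

# Hypothetical records, as if filtered from a Cost Management export
# by a custom "team" tag; field names and rates are made up.
usage = [
    {"team": "marketing", "dbu": 120.0, "rate_per_dbu": 0.40},
    {"team": "finance",   "dbu": 300.0, "rate_per_dbu": 0.55},
    {"team": "marketing", "dbu":  80.0, "rate_per_dbu": 0.40},
]

# Sum cost per team: DBUs consumed times the per-DBU rate.
costs = defaultdict(float)
for rec in usage:
    costs[rec["team"]] += rec["dbu"] * rec["rate_per_dbu"]

for team, cost in sorted(costs.items()):
    print(f"{team}: ${cost:.2f}")
```

In practice the same grouping is done directly in Azure Cost Management or Power BI, with tags as the grouping key; the point is that consistent tagging is what makes the aggregation possible at all.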