Prescriptive Practices for Deploying and Running Azure Databricks at Scale
Planning, deploying, and running Azure Databricks (ADB) at scale requires making many architectural decisions. While each ADB deployment is unique to an organization’s needs, we have found that some patterns are common across most successful ADB projects. This guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks.
We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks applications, and finally, running Azure Databricks in production. The audience for this guide includes system architects, field engineers, and development teams from customers, Microsoft, and Databricks.
Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning
Azure Databricks is a cloud-optimized, managed PaaS offering designed to hide the underlying distributed-systems and networking complexity from the end user as much as possible. It is backed by a support team that monitors its health and debugs tickets filed via Azure. This allows ADB users to focus on developing value-generating applications rather than stressing over infrastructure management.
We recommend:
- Map workspaces to business divisions to streamline access control and align with chargeback models
- Deploy workspaces in multiple subscriptions to honor Azure and ADB workspace limits
- Consider isolating each workspace in its own VNet and use the hub-and-spoke model
- Select the largest VNet CIDR when using the Bring Your Own VNet feature
- Deploy ADB with limited private IP addresses by using a disconnected VNet for the Databricks workspace
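The VNet CIDR you select directly caps how many cluster nodes a workspace can run. The sketch below estimates that cap, assuming each node consumes one IP address in each of the two Databricks-managed subnets (host and container) and that Azure reserves five addresses per subnet; the helper name and example prefixes are illustrative:

```python
def max_cluster_nodes(subnet_prefix: int) -> int:
    """Rough upper bound on total cluster nodes for one Databricks subnet.

    Assumes each node takes one IP in each of the two Databricks-managed
    subnets, and that Azure reserves 5 addresses per subnet.
    """
    return 2 ** (32 - subnet_prefix) - 5

for prefix in (26, 24, 22):
    print(f"/{prefix} subnets -> up to {max_cluster_nodes(prefix)} nodes")
```

Running the loop shows why the guide recommends the largest CIDR you can afford: a /26 subnet pair supports only 59 nodes, while a /22 pair supports 1019.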
Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Cluster Performance
ADB clusters are divided into “types” (Interactive and Jobs) and “modes” (Standard and High Concurrency). We recommend:
- Use shared High Concurrency clusters for interactive analytics to minimize cost and optimize for latency
- Use single user ephemeral Standard clusters for batch ETL workloads to improve security and throughput
- Favor cluster-scoped init scripts over global and named scripts to avoid launch failures
- Use the Cluster Log Delivery feature to manage logs in a blob store under your control
- Choose VMs that match the workload class (ML, Streaming, ETL, Interactive)
- Arrive at the correct cluster size through iterative performance testing
- Tune shuffle parameters for optimal performance
- Partition your data to take advantage of partition pruning and data skipping
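As a starting point for the shuffle-tuning recommendation above, a common rule of thumb is to size `spark.sql.shuffle.partitions` so that each partition holds roughly 100–200 MB of shuffle data. The helper below sketches that heuristic; the 200 MB target and the function name are assumptions, not ADB defaults:

```python
def shuffle_partitions(total_shuffle_bytes: int,
                       target_partition_bytes: int = 200 * 1024 ** 2) -> int:
    """Heuristic partition count: one partition per ~200 MB of shuffle data.

    Feed the result to spark.conf.set("spark.sql.shuffle.partitions", n),
    then re-measure and iterate, as the guide recommends.
    """
    # Ceiling division, so a partial partition still gets counted.
    return max(1, -(-total_shuffle_bytes // target_partition_bytes))

# Example: a stage that shuffles about 50 GB.
print(shuffle_partitions(50 * 1024 ** 3))  # 256
```

Treat the result as a first guess to refine through the iterative performance testing described above, not a final setting.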
Running ADB Applications Smoothly: Guidelines on Observability and Monitoring
We focus on collecting resource utilization metrics from ADB clusters into a Log Analytics workspace, as this is the most common customer request. This includes:
- Configuring the Log Analytics Agent on each cluster node to stream VM metrics
- Querying the collected VM metrics in Log Analytics to understand utilization
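To illustrate the querying step, here is a minimal sketch of a Kusto (KQL) query over the `Perf` table that the Log Analytics Agent populates; the counter names assume the agent's default performance counters, and the commented SDK call shows one way to submit it (using the `azure-monitor-query` package):

```python
# A KQL query over the Perf table populated by the Log Analytics Agent,
# averaging CPU utilization per node in 5-minute buckets.
query = """
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
""".strip()

print(query)

# To run it against a workspace (requires azure-monitor-query and credentials):
#   from datetime import timedelta
#   from azure.identity import DefaultAzureCredential
#   from azure.monitor.query import LogsQueryClient
#   client = LogsQueryClient(DefaultAzureCredential())
#   result = client.query_workspace("<workspace-id>", query,
#                                   timespan=timedelta(hours=1))
```

The same pattern extends to memory and disk counters by changing `ObjectName` and `CounterName`.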
Cost Management, Chargeback, and Analysis
This section covers Azure Databricks billing, tools to manage and analyze cost, and how to charge back to teams. Key recommendations include:
- Leverage Azure tags (default and custom) to enable chargeback by filtering resource usage per team
- Use the Azure Cost Management service and Power BI to create custom cost reports and visualizations
- Be aware of pricing differences across regions and the options for pre-purchasing DBUs
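As a sketch of tag-based chargeback, the snippet below aggregates hypothetical usage records by a team tag. The record fields and DBU rates are illustrative only, not the actual Azure Cost Management export schema or current pricing:

```python
from collections import defaultdict

# Hypothetical records, as if filtered from a Cost Management export
# by a custom "team" tag; field names and rates are made up.
usage = [
    {"team": "marketing", "dbu": 120.0, "rate_per_dbu": 0.40},
    {"team": "finance",   "dbu": 300.0, "rate_per_dbu": 0.55},
    {"team": "marketing", "dbu":  80.0, "rate_per_dbu": 0.40},
]

# Sum cost per team: DBUs consumed times the per-DBU rate.
costs = defaultdict(float)
for rec in usage:
    costs[rec["team"]] += rec["dbu"] * rec["rate_per_dbu"]

for team, cost in sorted(costs.items()):
    print(f"{team}: ${cost:.2f}")
```

In practice the same grouping is done directly in Azure Cost Management or Power BI, with tags as the grouping key; the point is that consistent tagging is what makes the aggregation possible at all.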