Mastering Continuous Integration and Delivery in Azure Data Factory
Continuous integration is the practice of automatically testing each change made to your codebase as early as possible. Continuous delivery then takes the tested changes and pushes them to a staging or production environment.
In Azure Data Factory, continuous integration and delivery (CI/CD) refers to the process of moving your data pipelines from one environment (like development) to another (like test or production). Azure Data Factory uses Azure Resource Manager templates to store the configuration of your various data factory entities, such as pipelines, datasets, and data flows.
There are two main approaches to promoting a data factory to another environment:
- Automated deployment using Azure Pipelines: Azure Data Factory integrates with Azure Pipelines to enable automated CI/CD (a minimal deployment step is sketched after this list).
- Manual Resource Manager template deployment: You can manually upload a Resource Manager template using the Data Factory user interface’s integration with Azure Resource Manager.
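To make the automated approach concrete, here is a minimal sketch of an Azure Pipelines step that deploys a factory's exported Resource Manager template. The service connection name, resource group, factory name, and artifact paths are placeholder assumptions, not values prescribed by Data Factory.

```yaml
# Minimal sketch of the automated approach: deploy the exported ARM template.
# 'azure-sc', 'rg-adf-uat', 'adf-uat', and the artifact paths are placeholders.
steps:
  - task: AzureResourceManagerTemplateDeployment@3
    displayName: Deploy Data Factory ARM template
    inputs:
      deploymentScope: Resource Group
      azureResourceManagerConnection: azure-sc    # placeholder service connection
      subscriptionId: $(subscriptionId)
      action: Create Or Update Resource Group
      resourceGroupName: rg-adf-uat               # placeholder target resource group
      location: $(location)
      templateLocation: Linked artifact
      csmFile: $(Pipeline.Workspace)/adf/ARMTemplateForFactory.json
      csmParametersFile: $(Pipeline.Workspace)/adf/ARMTemplateParametersForFactory.json
      overrideParameters: -factoryName adf-uat    # point the template at the target factory
      deploymentMode: Incremental
```

The `overrideParameters` input is where environment-specific values (the factory name, linked service URLs, and so on) are swapped in at deployment time.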
The CI/CD Lifecycle
Let’s walk through a sample CI/CD lifecycle for an Azure data factory that’s configured with Azure Repos Git:
- Development Environment: A development data factory is created and configured with Azure Repos Git. Developers have permission to author Data Factory resources like pipelines and datasets.
- Feature Development: Developers create feature branches to make changes, then debug their pipeline runs with the latest changes. See Iterative development and debugging with Azure Data Factory.
- Pull Requests and Merging: After a developer is satisfied with their changes, they create a pull request from their feature branch to the main or collaboration branch to get their changes reviewed by peers. Once the pull request is approved and merged, the changes are published to the development factory.
- Deployment to Test/UAT: When the team is ready to deploy the changes to a test or UAT (User Acceptance Testing) factory, they use an Azure Pipelines release to deploy the desired version of the development factory to UAT. This deployment uses Resource Manager template parameters to apply the appropriate configuration (a staged release is sketched after the note below).
- Deployment to Production: After the changes have been verified in the test factory, the team can deploy to the production factory using the next task in their Azure Pipelines release.
[!NOTE] Only the development factory is associated with a Git repository. The test and production factories should not have a Git repository associated with them and should only be updated via an Azure DevOps pipeline or a Resource Manager template.
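This lifecycle maps naturally onto a multi-stage pipeline. Below is a hedged sketch of a two-stage release that deploys the same published template artifact to UAT and then, only after UAT succeeds, to production. Every name here (stages, environments, resource groups, factory names) is an illustrative assumption.

```yaml
# Sketch: the same ARM template artifact is promoted from UAT to production.
# All names are placeholders; per-environment values arrive via overrideParameters.
stages:
  - stage: DeployUAT
    jobs:
      - deployment: DeployDataFactory
        environment: uat                 # placeholder Azure DevOps environment
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureResourceManagerTemplateDeployment@3
                  inputs:
                    deploymentScope: Resource Group
                    azureResourceManagerConnection: azure-sc
                    subscriptionId: $(subscriptionId)
                    action: Create Or Update Resource Group
                    resourceGroupName: rg-adf-uat
                    location: $(location)
                    templateLocation: Linked artifact
                    csmFile: $(Pipeline.Workspace)/adf/ARMTemplateForFactory.json
                    csmParametersFile: $(Pipeline.Workspace)/adf/ARMTemplateParametersForFactory.json
                    overrideParameters: -factoryName adf-uat
                    deploymentMode: Incremental
  - stage: DeployProd
    dependsOn: DeployUAT                 # production waits for a successful UAT deployment
    jobs:
      - deployment: DeployDataFactory
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureResourceManagerTemplateDeployment@3
                  inputs:
                    deploymentScope: Resource Group
                    azureResourceManagerConnection: azure-sc
                    subscriptionId: $(subscriptionId)
                    action: Create Or Update Resource Group
                    resourceGroupName: rg-adf-prod
                    location: $(location)
                    templateLocation: Linked artifact
                    csmFile: $(Pipeline.Workspace)/adf/ARMTemplateForFactory.json
                    csmParametersFile: $(Pipeline.Workspace)/adf/ARMTemplateParametersForFactory.json
                    overrideParameters: -factoryName adf-prod   # only parameter values change
                    deploymentMode: Incremental
```

Approvals and checks configured on the `production` environment provide the verification gate between the two stages.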
Best Practices for CI/CD
Here are some recommended best practices when using Git integration and CI/CD pipelines to move changes from development to test and production:
- Git Integration: Configure only your development data factory with Git integration. Changes to test and production are deployed via CI/CD and don’t need Git integration.
- Pre- and Post-Deployment Scripts: Use PowerShell scripts before and after the Resource Manager deployment step in CI/CD to handle tasks like stopping and restarting triggers and performing cleanup. The Azure Data Factory team provides a sample script you can use (see the sketch after this list).
- Integration Runtime Sharing: Use the same name, type, and sub-type of integration runtime across all stages of CI/CD. You can share integration runtimes across environments by using a separate ‘ternary’ data factory just for the shared integration runtimes.
- Managed Private Endpoint Deployment: If a managed private endpoint already exists in a factory, a deployment that includes a private endpoint with the same name but different properties will fail. To vary such properties between environments, parameterize them in your ARM template and supply environment-specific values at deployment time.
- Key Vault Management: Use separate key vaults for each environment and keep the same secret names across all stages to avoid the need to parameterize connection strings.
- Resource Naming: Avoid spaces in resource names due to ARM template constraints. Use ‘_’ or ‘-’ instead.
- Repository Integrity: Avoid manually altering or adding unrelated files in the ADF Git repository, as this can cause resource loading errors.
- Exposure Control and Feature Flags: Combine global parameters and If Condition activities to conditionally show or hide logic based on environment flags, allowing you to merge changes without immediately deploying them to higher environments.
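As a sketch of the pre- and post-deployment practice, the steps below wrap an ARM deployment with the Microsoft-provided sample script (commonly shipped as PrePostDeploymentScript.ps1): triggers are stopped before deploying, then restarted afterwards, with removed resources cleaned up. The paths, service connection, and resource names are placeholder assumptions; check the sample script's documentation for its current parameters.

```yaml
# Sketch: stop triggers before the ARM deployment, restart and clean up after it.
# Script path, service connection, and resource names are placeholders.
steps:
  - task: AzurePowerShell@5
    displayName: Pre-deployment (stop triggers)
    inputs:
      azureSubscription: azure-sc
      ScriptType: FilePath
      ScriptPath: $(Pipeline.Workspace)/adf/PrePostDeploymentScript.ps1
      ScriptArguments: >-
        -armTemplate "$(Pipeline.Workspace)/adf/ARMTemplateForFactory.json"
        -ResourceGroupName rg-adf-uat
        -DataFactoryName adf-uat
        -predeployment $true
        -deleteDeployment $false
      azurePowerShellVersion: LatestVersion

  # ... the AzureResourceManagerTemplateDeployment@3 step from the earlier sketches ...

  - task: AzurePowerShell@5
    displayName: Post-deployment (restart triggers, clean up)
    inputs:
      azureSubscription: azure-sc
      ScriptType: FilePath
      ScriptPath: $(Pipeline.Workspace)/adf/PrePostDeploymentScript.ps1
      ScriptArguments: >-
        -armTemplate "$(Pipeline.Workspace)/adf/ARMTemplateForFactory.json"
        -ResourceGroupName rg-adf-uat
        -DataFactoryName adf-uat
        -predeployment $false
        -deleteDeployment $true
      azurePowerShellVersion: LatestVersion
```

Running the script with `-deleteDeployment $true` in the post-deployment step also removes resources that exist in the target factory but are no longer present in the template, keeping environments in sync.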
Unsupported Features
There are a few key limitations and unsupported features to be aware of:
- Data Factory does not allow cherry-picking of commits or selective publishing of resources. All changes made in the data factory will be included in a publish.
- The Azure Data Factory team does not recommend assigning Azure RBAC controls to individual entities like pipelines or datasets. Instead, consider deploying a second data factory if you need more granular access controls.
- You cannot publish from private branches, host projects on Bitbucket, or export/import alerts and metrics as parameters.
- The ‘PartialArmTemplates’ folder that was previously added to the ‘adf_publish’ branch will no longer be published after November 1, 2021. Use the ‘ARMTemplateForFactory.json’ or ‘linkedTemplates’ files for deployments instead.
Learn More
To dive deeper into CI/CD for Azure Data Factory, check out these additional resources: