Configuring a Self-Hosted Integration Runtime for Azure Data Factory and Synapse Analytics // Cloud architect insights

The integration runtime (IR) is the compute infrastructure that Azure Data Factory and Synapse pipelines use to provide data-integration capabilities across different network environments. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, as well as dispatch transform activities against compute resources in an on-premises network or an Azure virtual network.

This guide covers the key aspects of setting up and managing a self-hosted integration runtime, summarized in the sections below.

Considerations for Using a Self-Hosted IR

You can use a single self-hosted integration runtime for multiple on-premises data sources, and even share it with another data factory within the same tenant.
The self-hosted integration runtime doesn’t need to be on the same machine as the data source, but keeping it close reduces connection time.
You can have multiple self-hosted integration runtimes connecting to the same on-premises data source.
Self-hosted integration runtimes support data integration within an Azure virtual network and can be used even if the data store is in the cloud on an Azure IaaS VM.
Prerequisites include a 64-bit operating system with .NET Framework 4.7.2 or later, along with recommended minimum hardware specifications for the host machine.
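
The prerequisite check above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function names are hypothetical, the architecture check assumes an x64 host, and the .NET check relies on the standard "Release" registry value, where 461808 is the lowest value that corresponds to .NET Framework 4.7.2.

```python
import platform
import sys
from typing import Optional

# Lowest "Release" registry value that corresponds to .NET Framework 4.7.2.
MIN_RELEASE_472 = 461808
NDP_KEY = r"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"


def is_64bit_os(machine: str) -> bool:
    """Return True if the reported architecture is x64 (assumption: x64 host)."""
    return machine.lower() in {"amd64", "x86_64"}


def meets_dotnet_requirement(release: Optional[int]) -> bool:
    """Return True if the Release value indicates .NET Framework 4.7.2 or later."""
    return release is not None and release >= MIN_RELEASE_472


def read_dotnet_release() -> Optional[int]:
    """Read the Release value from the registry; None if unavailable."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only module, imported lazily

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NDP_KEY) as key:
            value, _ = winreg.QueryValueEx(key, "Release")
            return int(value)
    except OSError:
        return None


if __name__ == "__main__":
    print("64-bit OS:", is_64bit_os(platform.machine()))
    print(".NET Framework 4.7.2+:", meets_dotnet_requirement(read_dotnet_release()))
```

Run on the intended host before installing the runtime; on a non-Windows machine the .NET check simply reports False.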

Setting Up a Self-Hosted IR

You can create and configure a self-hosted IR using either Azure PowerShell or the Azure Data Factory/Synapse UI. The key steps include:

Create the self-hosted IR in your data factory or Synapse workspace.
Download and install the self-hosted integration runtime software on a local machine.
Retrieve the authentication key and register the self-hosted IR with it.

The guide also covers advanced setup options, such as using an Azure Resource Manager template to automate self-hosted IR setup on an Azure VM and using PowerShell to manage an existing self-hosted IR.
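
As a rough illustration of the template-based approach, the sketch below builds the Resource Manager definition for the logical self-hosted IR resource only. It does not provision the VM or install the runtime software, which the full automation would also need. The helper names and the sample factory and IR names are hypothetical; the resource type, API version, and the "SelfHosted" type value are the ones Data Factory uses for integration runtime resources.

```python
import json


def self_hosted_ir_resource(factory_name: str, ir_name: str, description: str = "") -> dict:
    """Build the ARM resource entry for a self-hosted IR in a data factory."""
    return {
        "type": "Microsoft.DataFactory/factories/integrationRuntimes",
        "apiVersion": "2018-06-01",
        # Child resources are named "<parent>/<child>".
        "name": f"{factory_name}/{ir_name}",
        "properties": {
            "type": "SelfHosted",
            "description": description,
        },
    }


def arm_template(resources: list) -> dict:
    """Wrap resource entries in a minimal ARM deployment template."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": resources,
    }


if __name__ == "__main__":
    # "my-factory" and "onprem-ir" are placeholder names.
    template = arm_template([self_hosted_ir_resource("my-factory", "onprem-ir", "Self-hosted IR")])
    print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and deployed with whichever Resource Manager tooling you already use; the authentication key still has to be retrieved afterwards to register a node.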

Installation Best Practices

To ensure optimal performance and reliability, the guide recommends:

Configuring a power plan on the host machine so that it does not hibernate (a hibernating host takes the self-hosted IR offline)
Regularly backing up the credentials associated with the self-hosted IR
Automating self-hosted IR setup operations using PowerShell

Proxy Server Considerations

If your corporate network uses a proxy server, you’ll need to configure the self-hosted IR to use the appropriate proxy settings. The guide explains the different proxy configuration options and how to update the necessary configuration files.
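
To make the configuration-file edit concrete, here is a hedged sketch that rewrites the standard .NET `system.net/defaultProxy` section of a config document. It assumes the runtime's config files (for example `diahost.exe.config`, which is an assumption here, not something the text above names) follow the usual .NET Framework configuration schema. The function name and the proxy address are hypothetical, and the sketch works on an XML string so that you can apply it to a backup copy first.

```python
import xml.etree.ElementTree as ET


def set_default_proxy(config_xml: str, proxy_address: str, bypass_on_local: bool = True) -> str:
    """Return config_xml with a single <defaultProxy> pointing at proxy_address."""
    root = ET.fromstring(config_xml)

    # Locate <system.net>, creating it when the file does not have one yet.
    system_net = next((child for child in root if child.tag == "system.net"), None)
    if system_net is None:
        system_net = ET.SubElement(root, "system.net")

    # Remove any existing <defaultProxy> so exactly one remains afterwards.
    for old in [child for child in system_net if child.tag == "defaultProxy"]:
        system_net.remove(old)

    default_proxy = ET.SubElement(system_net, "defaultProxy", {"enabled": "true"})
    ET.SubElement(
        default_proxy,
        "proxy",
        {
            "bypassonlocal": "true" if bypass_on_local else "false",
            "proxyaddress": proxy_address,
        },
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<configuration><system.net>"
        '<defaultProxy useDefaultCredentials="true" />'
        "</system.net></configuration>"
    )
    print(set_default_proxy(sample, "http://proxy.example.org:8888/"))
```

Note that `xml.etree.ElementTree` drops comments and the XML declaration when it re-serializes, so compare the output with the original before replacing a live file, and restart the runtime service after any change.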

Ports, Firewalls, and Connectivity

Ensuring proper firewall configuration, both at the corporate and Windows firewall levels, is critical for the self-hosted IR to successfully connect to data sources and sinks. The guide covers the required domains and outbound ports, as well as tips for configuring firewall rules.
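
A quick way to sanity-check outbound firewall rules from the host is a plain TCP connection test, sketched below. This is only an approximation: it confirms that a TCP connection can be opened, not that TLS or proxy traversal works. The function names are hypothetical, and the endpoint list is a placeholder that you should replace with the domains and ports your own factory or workspace requires (outbound port 443 is assumed here).

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(endpoints) -> dict:
    """Map 'host:port' to a reachability flag for each (host, port) pair."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}


if __name__ == "__main__":
    # Placeholder list; substitute the endpoints required in your environment.
    required = [("download.microsoft.com", 443)]
    for endpoint, ok in check_endpoints(required).items():
        print(f"{endpoint}: {'reachable' if ok else 'BLOCKED or unreachable'}")
```

Running this from the machine that hosts the runtime helps tell a corporate firewall block apart from a Windows firewall or DNS problem before you dig into the runtime logs.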

Credentials Storage

There are two recommended approaches for storing credentials when using a self-hosted IR: using Azure Key Vault or storing them locally in an encrypted format. The guide discusses the tradeoffs and considerations for each method.
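
For the Key Vault option, the sketch below shows roughly what a linked service definition looks like when the password is referenced from Azure Key Vault and the connection is routed through a self-hosted IR. The helper name, the sample names, and the choice of a SQL Server source are all illustrative assumptions; the `AzureKeyVaultSecret`, `LinkedServiceReference`, and `IntegrationRuntimeReference` type values are the ones Data Factory uses in linked service JSON.

```python
import json


def sql_server_linked_service(
    name: str,
    connection_string: str,
    ir_name: str,
    key_vault_linked_service: str,
    secret_name: str,
) -> dict:
    """Build a SQL Server linked service whose password lives in Key Vault."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {
                # Connection string without the password.
                "connectionString": connection_string,
                # The password is resolved from Key Vault at run time.
                "password": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": key_vault_linked_service,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
            },
            # Route the connection through the self-hosted IR.
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


if __name__ == "__main__":
    # All names below are placeholders.
    definition = sql_server_linked_service(
        name="OnPremSqlServer",
        connection_string="Data Source=myserver;Initial Catalog=mydb;User ID=myuser;",
        ir_name="onprem-ir",
        key_vault_linked_service="MyKeyVault",
        secret_name="sql-password",
    )
    print(json.dumps(definition, indent=2))
```

With this layout the secret never appears in the linked service definition itself, which is the main practical difference from storing encrypted credentials locally on the runtime host.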

Overall, the guide gives a detailed overview of self-hosted integration runtimes, from setup and configuration to advanced management and troubleshooting. It is a useful reference for anyone running self-hosted IRs in Azure Data Factory or Synapse Analytics workflows.