How to make Azure Databricks cluster outbound connectivity consistent with 1 public outgoing IP address?
I've set up an Azure Databricks service that should get outbound connectivity through an Azure Firewall, which in turn makes sure that all outbound traffic is routed through a single public IP address.
As suggested by a Microsoft auto-generated solution, I have done the following:
- Used a VNet injected workspace, as suggested in this article (referenced by Microsoft):
- And applied this article (referenced by Microsoft):
https://kb.databricks.com/cloud/azure-vnet-single-ip
This works! The same public IP address is also used after stopping and starting the Databricks cluster.
However, when stopping and starting the firewall, the Databricks cluster occasionally cannot get outbound connectivity at all. Sometimes it works, sometimes it doesn't, and I can't find a reason or any inconsistency in the way things are invoked.
The method I'm using to stop and start the firewall (and Databricks cluster) is:
- Azure Automation invokes a script (via runbooks) at around 00:00 UTC that starts (allocates) the Azure Firewall (https://video2.skills-academy.com/en-us/azure/firewall/firewall-faq#how-can-i-stop-and-start-azure-firewall)
- A Databricks workflow is invoked at 00:15 UTC that starts the Databricks cluster and runs a notebook (taking at most 20 minutes)
- Azure Automation invokes a script (via runbooks) at around 01:00 UTC that stops (deallocates) the Azure Firewall (https://video2.skills-academy.com/en-us/azure/firewall/firewall-faq#how-can-i-stop-and-start-azure-firewall)
Yesterday this worked perfectly; today it failed, and nothing has changed in the Azure setup.
The output of the start and stop firewall scripts is 99% the same every time, with the only difference being new e-tags. Sometimes when I manually stop and start the firewall it suddenly starts working, and the next time it doesn't. So basically: inconsistent behaviour without infra changes.
Things I've tried to fix the problem:
- One method:
- Stopping the firewall "manually" via the automation runbook
- Waiting for x amount of minutes to make sure it's actually stopped
- Starting the firewall manually via the automation runbook
- Starting the Databricks cluster either manually or via the workflow
- Checking for outbound connectivity with a simple http request
- Another method:
- Stopping and starting the firewall as described above
- Running a databricks notebook, which invokes the cluster start
- Checking for outbound connectivity with a simple http request
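For reference, the "simple http request" check in the steps above looks roughly like the sketch below. It is a minimal illustration, not the exact notebook code: the function name `check_outbound`, the `opener` parameter, and the echo endpoint `https://api.ipify.org` are assumptions made for the example; any endpoint that returns the caller's public IP would do.

```python
import urllib.error
import urllib.request

# Assumption: an illustrative public "what is my IP" endpoint, not part of the original setup.
IP_ECHO_URL = "https://api.ipify.org"


def check_outbound(url=IP_ECHO_URL, timeout=10, opener=urllib.request.urlopen):
    """Try one outbound HTTP request.

    Returns (True, response body) on success, or (False, error text) when the
    request fails, for example because no outbound route is available.
    The `opener` parameter is hypothetical and only exists so the call can be
    replaced in tests.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return True, response.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError) as exc:
        # URLError and socket timeouts are both OSError subclasses.
        return False, str(exc)
```

Running `ok, detail = check_outbound()` in a notebook cell gives `ok == True` with `detail` holding the public IP seen by the endpoint when the firewall route works, and `ok == False` with the error text when there is no outbound connectivity.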
As said: sometimes it works, sometimes it doesn't. When it doesn't, no outbound connectivity is possible at all.
Any suggestions as to where I could investigate the possible cause of this are much appreciated.