Andreas' Blog

Adventures of a software engineer/architect

Deploy Datascience infrastructure on Azure using Terraform

2018-01-23 6 min read anoff

In this article I will talk about my experience building my first infrastructure deployment using Terraform that does (a little) more than combining off-the-shelf resources.

The stack we will deploy ๐Ÿ“ฆ

Lately Iโ€™ve been looking at a lot of Microsoft Azure services in the big data area. I am looking for something to replace a Hadoop based ๐Ÿ˜ data analytics environment consisting mainly of HDFS, Spark & Jupyter.

How to Datascience on Azure?

The most obvious solution is to use a HDInsight cluster which is basically a managed Hadoop that you can pick in different flavours. However with the elasticity of the cloud at my hands I wanted to go for a more diverse setup that also allows a pure #python ๐Ÿ based data science stack without the need to use Spark. One reason for this is that many use cases do not require a calculation on all of the data but just use spark to mask the data. For the actual analysis/training the amount of data often fits in RAMโ€Šโ€”โ€Šif I get a bit more than my MacBook Pro has to offer ๐Ÿ™„. The solution described in this article consists of a Data Science Virtual Machine ๐Ÿ–ฅ and Azure Files ๐Ÿ“„ for common data store. As the file storage account has a cap on 5TB you might need something different if you really have a lot of dataโ€Šโ€”โ€Šor use multiple fileshares.

Multiple VMs accessing a shared storage

Target setup with multiple data scientist VM accessing a common data pool

Quick intro to Terraform ๐Ÿ‘€

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. [โ€ฆ] Terraform generates an execution plan describing what it will do to reach the desired state, and then executes it to build the described infrastructure. As the configuration changes, Terraform is able to determine what changed and create incremental execution plans which can be applied.

Infrastructure as Code (IaC) is an important aspect for me when managing a cloud solution. I want to be able to automate โš™๏ธ everything from infrastructure provisioning to code deployment. Amazon Web Services has Cloudformation, Azure has the Azure Resource Manager (ARM) templates but Hashicorps Terraform provides a somewhat cloud agnostic layer on top of those proprietary tools. ‘Somewhat’ because the recipes you write are specific to a certain cloud environment but it allows you to use common syntax to deploy a multi-cloud solution. The main reason for me to use Terraform instead of ARM is that it offers a better way to create modular recipes and not worry about handling the state during incremental deployments.

Building the Terraform recipe ๐Ÿ“œ

When deploying any higher level components the first thing to do is figure out what underlying infrastructure is used. The easiest way to do this is click yourself through the Portal to create the resourceโ€Šโ€”โ€Šin this case the Data Scientist VMโ€Šโ€”โ€Šand look at the ARM template that drops out of this process. You could use it to deploy this VM automatically using either the ARM tooling or the azurerm_template_deployment in Terraform. However the ARM templates are way too complex to maintain to my liking.

Access ARM template from portal

Overview of ARM resources

Resource view of the ARM template in Azure Portal

In the case of the Data Scientist VM you can see that five different resources are deployed to bring up an Azure VM. The machine itself which consists of compute and memory allocations. A storage account that holds the disk images and a couple of networking components.

If you look at the Virtual Machine recipe that is available on the Terraform docs you may see similar components. There they are called _azurerm_virtualnetwork, _azurermsubnet, _azurerm_networkinterface, _azurerm_virtualmachine. I took the Terraform example as a base for my recipe and tried to modify this vanilla Ubuntu VM into the Data Science VM that I clicked together using the Portal. So I was mostly interested in figuring out what part of the VM deployment makes the VM a Data Science VM with all those fancy software packages pre-installed.

  "resources": [
      "name": "[parameters('virtualMachineName')]",
      "type": "Microsoft.Compute/virtualMachines",
      "apiVersion": "2016-04-30-preview",
      "location": "[parameters('location')]",
      "dependsOn": [
          "[concat('Microsoft.Network/networkInterfaces/', parameters('networkInterfaceName'))]",
          "[concat('Microsoft.Storage/storageAccounts/', parameters('diagnosticsStorageAccountName'))]"
      "properties": {
          "osProfile": {
              "computerName": "[parameters('virtualMachineName')]",
              "adminUsername": "[parameters('adminUsername')]",
              "adminPassword": "[parameters('adminPassword')]"
          "hardwareProfile": {
              "vmSize": "[parameters('virtualMachineSize')]"
          "storageProfile": {
              "imageReference": {
                  "publisher": "microsoft-ads",
                  "offer": "linux-data-science-vm-ubuntu",
                  "sku": "linuxdsvmubuntu",
                  "version": "latest"
              "osDisk": {
                  "createOption": "fromImage",
                  "managedDisk": {
                      "storageAccountType": "Premium_LRS"
              "dataDisks": [
                      "createOption": "fromImage",
                      "lun": 0,
                      "managedDisk": {
                          "storageAccountType": "Premium_LRS"
      "plan": {
          "name": "linuxdsvmubuntu",
          "publisher": "microsoft-ads",
          "product": "linux-data-science-vm-ubuntu"

The part of the ARM template that specifies the OS setup

The important fields to look at are the way the imagReference, osDisk and dataDisks are created and the plan that is required if you want to deploy a marketplace VM. These differ from the vanilla setup that we get from the Terraform documentation. By going through the Terraform docs on the VM provider you can identify the fields necessary to turn the example VM into a Data Science VM. The main changes are to create a storage_data_disk that has the create_option = fromImageThis seems to be required as the DSVM ships with some data according to the ARM template. The second thing to add is the plan property into your VM recipe. This should be set with the same parameters as shown above in the ARM snippet.

You can find my final resulting code on github ๐Ÿ‘ฏโ€

Thatโ€˜s it for the Data Scientist VM, now on to the File share ๐Ÿ“„

Once you understand that the File service on Azure is part of the Storage suite you can either follow along the example above and look at the ARM template that the Portal generates or hop right into the Terraform documentation and look for possible ways to deploy storage resources.

_Note: Before writing this article I didnโ€™t realize there is an option to create a Fileshare using Terraform. So I initially built a custom Azure CLI script and hooked that into the storage recipe. Take a look at the code history if you want to learn more about fusing Terraform with the Azure CLI._

Terraform resources for storage

Storage resources in the Azure Provider for Terraform

Creating a recipe for the Fileshare is literally just copying the example as it does not have any customised properties. Make sure you give the whole setup pretty names and the correct quote and youโ€™re done.

Terraform output during creation

Run terraform apply -auto-approve to execute the recipe. In my latest run it took 2min 42sto spin up all the required resources.

Killing the components took twice as long ๐Ÿ™„

The full recipe will provision the following 6 resources for you. You might notice that Terraform mentions 7 added resources, the difference of 1 comes from the resource group that is not listed below. If you want to clean up just run terraform destroy.

Azure resource group overview

Fully provisioned Datascience Setup

Next steps ๐Ÿ‘Ÿ

Thereโ€™s a bunch of things I want to improve on this setup:

  1. Create a Terraform module for the Virtual Machine setup to easily create multiple VMs without cloning the 50+ lines recipe. The goal would be that the name of the VM, the size and the network can be defined. Ideally multiple data scientists would work in a common network.

  2. Auto-mount the file share into the VMs during creation. The remote-exec provisioner might be a good way to start.

Feel free to discuss this approach or even implement improvements for this via a PR on or on twitter ๐Ÿฆ. Hope you enjoyed my first article ๐Ÿ•บ.

comments powered by Disqus