Implementing scalable and manageable data solutions in the cloud can be difficult. Organizations need a strategy that not only succeeds technically but also fits their team's skills and preferences. There are a number of Platform as a Service (PaaS) and Software as a Service (SaaS) products that make it easy to connect to, transform, and move data in your network. However, the surplus of tools can make it difficult to figure out which ones to use, and often these tools can only do a fraction of what an engineer can do with a scripting language. Many of the engineers I work with love functional languages when working with data. My preferred data language is Python; however, there can be a barrier when moving from a local desktop to the cloud. When developing data pipelines in a language like Python, I recommend using Docker containers.
Historically, it has not been a simple task to deploy code to different environments and have it run reliably. This issue arises most often when a data scientist or data engineer moves code from local development to a test or production environment. Containers ship with their own runtime environment and all required dependencies, which eliminates environment variability at deployment. Containers make it easy to develop in the same environment as production and remove much of the risk of deploying.
Creating Data Pipeline Containers
My preferred Python distribution is Anaconda because of how easy it is to create and use different virtual environments, allowing me to ensure that there are no Python or dependency conflicts when working on different solutions. Virtual environments are extremely popular with Python developers, so the transition to deploying with containers should feel familiar. If you are unfamiliar with Anaconda virtual environments, check out this separate blog post where I talk about best practices and how to use these environments when working with Visual Studio Code.
Data pipelines always start with data extraction. As a best practice, the engineer should land raw data into a data store as quickly as possible. The raw data gives organizations an untouched source, allowing a developer to reprocess data as needed to solve different business problems. Once the data is in the raw store, the developer transforms and manipulates it as needed. In Azure, my favorite data store for raw, transformed, and business data is the Azure Data Lake Store. Below is a general flow diagram of data pipelines, where the transformations can be as complicated as machine learning models or as simple as normalizing the data. In this scenario each intermediate pipe could be a container, or the entire data pipeline could be a single container. At each stage the data may be read from a data source or chained from a previous transform; this flexibility is left up to the developer. Containers make versioning and deploying data applications easy because they allow an engineer to develop how they prefer and deploy quickly with a few configuration steps and commands.
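To make the "land raw data first" step concrete, here is a minimal sketch. The names (`land_raw`, `transform`) are hypothetical, and the local filesystem stands in for a store like Azure Data Lake Store:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_raw(record: dict, raw_dir: str = "raw") -> Path:
    """Write an extracted record to the raw zone untouched, with a
    timestamped name so the data can be reprocessed later as needed."""
    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = Path(raw_dir) / f"extract_{stamp}.json"
    path.write_text(json.dumps(record))
    return path

def transform(record: dict) -> dict:
    """Hypothetical downstream transform: normalize keys to lower case."""
    return {k.lower(): v for k, v in record.items()}
```

Downstream pipes then read from the raw zone (or are chained from a previous transform) rather than hitting the source system again.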
Most engineers prefer to develop locally on their laptops using notebooks (like Jupyter notebooks) or a code editor (like Visual Studio Code). Therefore, when a new data source is identified, engineers should simply start developing locally in an Anaconda environment and iterate on their solution before packaging it up as a container. If the engineer is using Python to extract data, they will need to track all dependencies in a requirements.txt file and make note of any special installations (like SQL drivers) required to extract data and write it to a raw data lake store. Once the initial development is complete, the engineer will need to get their code ready for deployment! This workflow is ideal for small to medium size data sources, because the velocity of true big data can often be an issue for batch extraction, and a streaming solution (e.g., Apache Spark) is preferred.
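One simple way to capture the dependency list is with pip from inside the activated environment (this assumes pip is installed in that conda environment, which it is by default):

```shell
# From the project root, with the project's conda environment activated,
# write the currently installed packages (with pinned versions) to a file:
pip freeze > requirements.txt
```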
Deploying Data Pipeline Containers in Azure
To set the stage: you are a developer, and you have written a Python data extraction application using a virtual environment on your machine. Since you started with a fresh Python interpreter and added requirements as you went, you have compiled a list of the installed libraries, drivers, and other dependencies needed to solve the problem. How does a developer get from running the extraction on a local machine to the cloud?
First we will create and run a Docker container locally for testing purposes. Then we will deploy the container to Azure Container Instance, the fastest and simplest way to run a container in Azure. Data extractors deployed as containers are usually batch jobs that the developer wants to run on a specific cadence. There are two ways to achieve this cron-style scheduling: have the application "sleep" after each data extraction, or have a centralized enterprise scheduler (like Apache Airflow) kick off the process as needed. I recommend the latter because it provides a central location to monitor all data pipeline jobs and avoids having to redeploy or make code changes if the developer wishes to change the schedule.
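The "sleep" approach can be sketched in a few lines of Python. `run_on_cadence` is a hypothetical helper, not part of any framework; `max_runs` exists only so the loop can be exercised in tests, while a real container would loop forever:

```python
import time

def run_on_cadence(extract, interval_seconds=3600, max_runs=None):
    """Run `extract`, then sleep, repeating on a fixed cadence.
    With max_runs=None the loop never exits, which is the behavior
    the long-running container would have."""
    runs = 0
    while max_runs is None or runs < max_runs:
        extract()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs
```

With a centralized scheduler like Apache Airflow, the container instead runs the extraction once and exits, and the scheduler owns the cadence.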
Before deploying a Docker container, there are a few things the engineer must do to get it ready:
- Create a requirements.txt file in the solution's root directory
- Create a Dockerfile in the solution's root directory
- Make sure the data extractor is in an “application” folder off the root directory
- Write automated tests using the popular pytest Python package (not required, but I recommend it for automated testing; it is not included in the walkthrough provided here)
- Build an image locally
- Build and run the container locally for testing
- Deploy to Azure Container Instance (or Azure Kubernetes Service)
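As a sketch of the optional pytest step, a test file might look like the following. The `normalize` transform is hypothetical, standing in for logic from the extractor:

```python
# test_extract.py -- discovered and run automatically by `pytest`
# when invoked from the project's root directory

def normalize(record: dict) -> dict:
    """Hypothetical transform under test: lower-case the record's keys."""
    return {k.lower(): v for k, v in record.items()}

def test_normalize_lowercases_keys():
    assert normalize({"Temp": 21, "City": "Oslo"}) == {"temp": 21, "city": "Oslo"}
```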
Here is an example requirements.txt file for the sample application available here:
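(The exact contents depend on the sample application; as an illustration only, a requirements.txt for a small extractor might look like this, with hypothetical packages and pinned versions:)

```text
requests==2.21.0
pandas==0.24.2
```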
Here is an example Dockerfile that starts from a Python 3.6 base image, copies our application into the working directory, installs dependencies, and runs our data extraction. In this case we have a Python script, extract_data.py, in the application folder:
FROM python:3.6
RUN mkdir /src
COPY . /src/
WORKDIR /src
RUN pip install -r requirements.txt
CMD [ "python", "./application/extract_data.py" ]
To build an image locally you will need Docker installed. If you do not have it installed, please download it here; otherwise, make sure Docker is currently running on your machine. Open a command prompt, navigate to your project's root directory, and run the following commands:
## Build an image from the current directory
docker build -t my-image-name .
## Run the container using the newly created image
docker run my-image-name
To deploy the container to Azure Container Instance, you must first create an Azure Container Registry and push your image to the registry. Next, you will deploy that image to Azure Container Instance using the Azure CLI. Note that the Azure CLI can be used to automate these deployments in the future, or an engineer can take advantage of Azure DevOps Build and Release tasks.
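As a sketch, the manual deployment might look like the following Azure CLI and Docker commands. Names such as my-rg, myregistry, and my-image-name are placeholders, and the credential placeholders must be filled in from your registry:

```shell
# Create a container registry in an existing resource group
az acr create --resource-group my-rg --name myregistry --sku Basic

# Log in to the registry, then tag and push the locally built image
az acr login --name myregistry
docker tag my-image-name myregistry.azurecr.io/my-image-name:v1
docker push myregistry.azurecr.io/my-image-name:v1

# Run the pushed image as an Azure Container Instance
az container create --resource-group my-rg --name my-extractor \
    --image myregistry.azurecr.io/my-image-name:v1 \
    --registry-login-server myregistry.azurecr.io \
    --registry-username <username> --registry-password <password>
```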
Now that you have deployed the container manually to Azure Container Instance, it is important to manage these applications. Oftentimes data extractors run on a schedule, and will likely require external triggers and monitoring for the data pipelines they feed. Stay tuned for a future blog on how to manage your data containers!
Developing data solutions using containers is an excellent way to develop, manage, and orchestrate scalable analytics and artificial intelligence applications. This walkthrough showed engineers how to create a weather data source extractor, wrap it up as a container, and deploy that container both locally and in the cloud.