Data scientists come from many different backgrounds, but in today’s agile environment we all share one pressure: responding quickly to customer needs and delivering value. Faster delivery of value means more wins for the customer, and hence more wins for the organization.
Information Technology is always under immense pressure to increase agility and speed up the delivery of new functionality to the business. A particular point of pressure is deploying new or enhanced application code at the frequency and immediacy that digital transformation demands. Under the covers, this problem is not simple, and it is compounded by infrastructure challenges: how long it takes to provision a platform for the development team, or how difficult it is to build a test system that adequately emulates the production environment (ref: IBM). Docker and containers exploded onto the scene in 2013; they have shaped software development ever since and are causing a structural change in the cloud computing world.
It is essential for data scientists to be self-sufficient and to participate in continuous deployment. Building an effective model requires multiple iterations of deployment, so the ability to make small changes, deploy, and test frequently matters a great deal. Based on the questions I have received recently, I wanted to write this blog to help people understand what Docker and containers are, and how they enable continuous deployment and help the business.
In this blog, I am writing about Docker and covering the following:
Why do we need Docker?
Where does Docker operate in Data Science?
What is Docker?
How does Docker work?
Advantages of using Docker
Why do we need Docker?
We see this all the time in our work: whenever you develop a model, write code, or build an application, it works fine on your laptop, but it runs into issues when you try to run the same model or application in the testing or production environment. This happens because the computing environments differ between the developer’s platform and the production platform. For example, you might have developed on Windows with recently upgraded software, while production runs Linux or a different software version.
Ideally, the developer’s system and the production environment should be consistent. In practice, this is very difficult to achieve: each person has their own preferences and cannot be forced into a uniform setup. This is where Docker comes into the picture and solves the problem.
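To make this concrete, here is a minimal sketch of how Docker captures an environment. The file names, the Python version, and the train.py script are hypothetical for this example; the point is that the Dockerfile pins the OS, the language runtime, and the library versions, so the container behaves the same on your laptop and in production:

```
# Hypothetical example: pin the runtime and libraries for a model.
# Start from a fixed Python base image so the OS and interpreter
# version are identical everywhere the container runs.
FROM python:3.10-slim

WORKDIR /app

# Install the exact library versions recorded in requirements.txt,
# instead of whatever happens to be on the developer's machine.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model code into the image.
COPY train.py .

# The same command runs in development, testing, and production.
CMD ["python", "train.py"]
```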
Where does Docker operate in Data Science?
In the data science or software development life cycle, Docker comes in at the deployment stage.
Docker makes the deployment process easy and efficient, and it eliminates the environment-related issues that typically surface when deploying applications.
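As a rough illustration of what deployment looks like with Docker (the image name model-api and port 5000 are made up for this sketch), building and running a container comes down to two commands:

```
# Build an image from the Dockerfile in the current directory.
docker build -t model-api .

# Run the container, mapping container port 5000 to the host,
# so the same artifact runs identically on any machine with Docker.
docker run -p 5000:5000 model-api
```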
What is Docker?
Docker is the world’s leading software container platform. Let’s take a real example: as we know, data science is a team effort that must be coordinated with other areas such as the client side (front-end development), the backend (server), the database, and the environment/library dependencies needed to run the model. The model is never deployed alone; it is deployed along with these other software components to form the final product.
From the above picture, we can see a technology stack with different components and platforms with different environments. We need to make sure that each component in the stack is compatible with every possible platform (hardware). In reality, working across all the platforms becomes complex because each component has its own computing environment. This is a central problem in the industry, and we know that Docker can solve it. But how?
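One common way to express such a multi-component stack with Docker is a Compose file. The service names, images, and ports below are hypothetical, but the sketch shows the idea: each component (front end, backend, model, database) runs in its own container with its own pinned environment, yet they are wired together and started as one unit:

```
# Hypothetical docker-compose.yml for the stack described above.
services:
  frontend:
    image: my-frontend:1.0    # client-side (front-end) app
    ports:
      - "80:80"
  backend:
    image: my-backend:1.0     # server / API layer
    depends_on:
      - db
      - model
  model:
    image: model-api:1.0      # the data science model, built earlier
  db:
    image: postgres:15        # database, pinned to a fixed version
    environment:
      POSTGRES_PASSWORD: example
```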
Let’s take one more practical use case from the Shipping industry.