Containers: Deep dive part 1

Containers, containers, containers, containers. I’m hearing this word more than my own name these days. What is so special about them? Is it magic, or just a fugazi?

This is what I’ll try to answer in this series of articles.

What is expected

To keep the subject simple and less boring, it is divided into several articles. The main subjects are:

  • What is a container: in this section you’ll find out what a container really is and which features are used to create one.
  • Container networking: this section will be dedicated to container networking; we’ll discover how a container can connect to another container or to the internet.
  • CNI (Container Network Interface): together we’ll understand how CNI works in detail.

In all articles, we won’t stick to the theoretical side of each subject. I believe that to fully understand a subject you must go under the hood and do everything yourself, and this is what we’ll try to do in each section.

So what do I need?

For this series of articles you’ll need 2 Linux VMs with a distribution of your choice. I’ll be using Ubuntu 20.04.

Let’s goooo!

What is a container?

A container is a process. Wait, where are you going? We’re not finished yet. Where were we? Ah yes, a container is a process, but a special one. How special? Let’s find out. But before understanding what makes it so special, let’s pay a visit to kernel land.

Kernel Land

As you probably know (or not), Linux memory is split into two spaces where code runs: kernel space and user space. Kernel space is protected and only kernel code is allowed to access it. User space, on the other hand, is where non-kernel applications such as a browser or a text editor run.

Kernel and User Space

So if user space can’t access kernel space, how can a user space application open a file located on a disk or send a ping?

The answer is syscalls. Syscalls are used by applications running in user space to ask the kernel to do something on their behalf, like opening a file, sending a network packet or creating a new process.
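To make this a bit more concrete, here is a small Python sketch. It assumes a Linux machine with Python 3, and the function name write_and_read_back and the file name demo.txt are made up for the example. The os.open, os.write, os.read and os.close functions are thin wrappers around the syscalls of the same name, so each call is the program asking the kernel to act on its behalf.

```python
import os
import tempfile


def write_and_read_back(message: bytes) -> bytes:
    """Write message to a scratch file, then read it back, using syscall wrappers."""
    # Hypothetical scratch location, only used for this demo.
    scratch_dir = tempfile.mkdtemp()
    path = os.path.join(scratch_dir, "demo.txt")

    # Ask the kernel to create and open the file, then write to it.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.write(fd, message)
    finally:
        os.close(fd)

    # Ask the kernel to open the same file again and read the bytes back.
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, len(message))
    finally:
        os.close(fd)

    # Clean up the scratch file and directory.
    os.unlink(path)
    os.rmdir(scratch_dir)
    return data


if __name__ == "__main__":
    print(write_and_read_back(b"hello kernel"))
```

If you run this script under a tracer such as strace, you should see the corresponding file syscalls (openat, write, read, close) go by.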

System calls

So as we can see, a process running in user space asks the kernel to perform multiple actions while executing, but this doesn’t explain how a process is created.

So how are processes created?

Each process (apart from the very first one, which the kernel starts itself) is a fork of another process. For those who don’t know what fork is: fork is yet another syscall, which processes (parents) in user space can call to create new processes (children).

Process creation

Each process is created as in the diagram above:

  1. The parent process calls the clone syscall. Depending on the flags, the kernel copies the memory of app1, so at this stage the child and the parent have the same code to run. For example, when you run ls in bash, at this point the child still points to the bash code.
  2. The child then calls the exec syscall to load the code of app2 in place of the copied one. Continuing the example from step 1, this is where the ls code is loaded.
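The two steps above can be sketched in a few lines of Python. This is only an illustration, assuming Linux and Python 3: os.fork goes through the libc fork function rather than calling clone directly, and run_in_child is a name made up for the example.

```python
import os
import sys


def run_in_child(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Step 1 is done: we are the child, still running a copy of this
        # Python program (like the child of bash still pointing to bash code).
        try:
            # Step 2: replace the child's code with the new program.
            os.execvp(argv[0], argv)
        except OSError:
            pass
        # Only reached if exec failed; 127 mimics the shell's "not found" code.
        os._exit(127)

    # Parent: wait for the child and translate its status into an exit code.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print("ls exited with", run_in_child(["ls", "/"]))
```

Note that the parent keeps running its own code the whole time; only the child is replaced by the new program.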


Some of you are familiar with the fork syscall as the way to create new processes. To clarify, the clone syscall is the newer fork: it does the same job but allows more control over the execution context of the new process. Nowadays the glibc fork function calls the clone syscall with flags that provide the same effect as the traditional fork syscall. But you don’t need to know these boring details.

Now we know how a process is created, but what makes a container more special than a normal process? The difference is that a container is a process isolated from the other processes. This isolation can be at one or multiple levels; some of the well-known ones are network, mount, IPC (Inter-Process Communication), PID and so on. But what makes this isolation possible? The answer is the kernel, using its namespaces feature.

So what will change if we draw the same diagram again?

Container creation

As we can see, the procedure is the same. The difference is in the flags passed to the clone syscall. Some of the well-known flags used to create new namespaces are:

  • CLONE_NEWIPC: create the process in a new IPC namespace. The process gets its own System V IPC objects and POSIX message queues, so it can’t use them to talk to processes in the host namespace or in other namespaces.
  • CLONE_NEWNET: create the process in a new network namespace. By doing so the process will have its own network tables (routing, ARP), its own network interfaces and its own network configuration.
  • CLONE_NEWNS: create the process in a new mount namespace. The process starts with a copy of its parent’s mount list, and the mounts it then adds or removes can be kept separate from the host’s, which is what lets a container see its own filesystem tree instead of the host’s.
  • CLONE_NEWPID: create the process in a new PID namespace. This prevents the process from seeing (and therefore sending signals like kill to) processes outside its namespace, and lets it reuse process IDs already used on the host or in other namespaces.
  • CLONE_NEWUSER: create the process in a new user namespace. With this, a user in the container is not the same as the user with the same ID on the host; for example, root inside the container can be mapped to an unprivileged user on the host.
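You can check which namespaces a process currently lives in without any special privileges. The sketch below assumes a Linux system where /proc is mounted; the function name current_namespaces is made up for the example. Each entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and an identifier, and two processes showing the same identifier are in the same namespace.

```python
import os


def current_namespaces(pid="self"):
    """Return a dict mapping namespace name to its identifier for a process."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


if __name__ == "__main__":
    # Prints one line per namespace, in the form "net  net:[<identifier>]".
    for name, ident in current_namespaces().items():
        print(f"{name:20} {ident}")
```

Running it once on the host and once inside a container should show different identifiers for the namespaces the container runtime has isolated.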

Clone is not the only syscall used to isolate a process; there are other syscalls too:

  • unshare: this syscall takes the same namespace flags as clone, but the difference is that unshare creates new namespaces and moves the current process into them, while clone creates a new process in new namespaces. (One exception: with CLONE_NEWPID, it is the children of the caller that land in the new PID namespace, not the caller itself.)
  • setns: this syscall allows the running process to join an existing namespace.
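As a rough illustration of unshare, the sketch below forks a child, has the child call unshare with CLONE_NEWNET through ctypes, and compares the child's network namespace identifier before and after. It assumes Linux, Python 3 and a libc that exports unshare; the function name try_unshare_net is made up, and the flag value is copied from the kernel's sched.h header. The call needs the CAP_SYS_ADMIN capability, so without root (or inside a restricted sandbox) expect the "denied" result. Python 3.12 and later also expose os.unshare directly.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # value from <linux/sched.h>


def try_unshare_net():
    """Return 'isolated' if a child could move itself into a new network
    namespace, 'denied' if the kernel refused, 'unexpected' otherwise."""
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        code = 3  # stays 3 if something unexpected goes wrong
        try:
            before = os.readlink("/proc/self/ns/net")
            if libc.unshare(CLONE_NEWNET) != 0:
                code = 1  # kernel refused, usually for lack of privileges
            else:
                after = os.readlink("/proc/self/ns/net")
                code = 0 if after != before else 2
        finally:
            # Always leave the child here so it never runs the parent's code.
            os._exit(code)

    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return {0: "isolated", 1: "denied"}.get(code, "unexpected")


if __name__ == "__main__":
    print(try_unshare_net())
```

The unshare happens in a forked child on purpose: the parent stays in its original network namespace and keeps its network access, whatever the child does.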

Now the devil behind containers is unveiled: for the kernel, containers do not exist. It is all a bunch of namespaces that isolate a process, like in the movie Inception, where a character thinks he is in the real world while he is actually in a dream. And we’re not going to talk about level-two inception (containers in containers), which is possible as well with namespaces.

In the end, I think this tweet from Jérôme Petazzoni sums it all up.


What is next

In the next article we will take the notions we learned today and use them to create a container from scratch. It is not that hard, you’ll see. Until next time.
