
Containers: Deep dive part 2

In the last blog post, we explored what containers are and how they are created in the Linux kernel. If you haven’t read it yet, I encourage you to start there first.

https://medium.com/@abdellahtdj/containers-deep-dive-part-1-dd5a56743a65

Requirements

For this series of articles, you'll need two Linux VMs running a distribution of your choice. I'll be using Ubuntu 22.04.

Let's goooo!

What is chroot?

Chroot (short for change root) is a system call that changes the apparent root directory (/) for a running process and its children. The process is then restricted to the specified directory subtree and cannot access files outside of it.

Think of chroot like putting a process in a playpen and saying:

“Hey little process, from now on, this directory is your whole world. You see nothing beyond it. Have fun!”

Having a separate filesystem is the first step to creating your own environment. When using tools like Docker or Podman, we typically pull images first, which include the application and its libraries installed within their own filesystem. This avoids the need to install or download these components directly on the host’s root filesystem.

First, we need to download a root filesystem; for simplicity, we can go with the Alpine minirootfs:

root@testing:~/blog/containers# wget http://dl-cdn.alpinelinux.org/alpine/v3.22/releases/x86_64/alpine-minirootfs-3.22.0-x86_64.tar.gz
root@testing:~/blog/containers# mkdir alpine-container
root@testing:~/blog/containers# tar xfz alpine-minirootfs-3.22.0-x86_64.tar.gz -C alpine-container/

Once the extraction finishes, we should have a minimal root filesystem that we can use to launch applications within it using chroot. In the example below, we’re running the sh command from the Alpine container, not the one on the host system. The -l option is used to instruct sh to behave as a login shell—this means it reads login-related startup files (such as /etc/profile), and sets up environment variables like PATH, USER, and others.

root@testing:~/blog/containers# chroot alpine-container/ /bin/sh -l
testing:/# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.22.0
PRETTY_NAME="Alpine Linux v3.22"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

The host OS we’re running is Ubuntu 22.04. After launching a new shell with chroot, the sh process can only see the filesystem of the Alpine image we downloaded. This means we can run commands that exist only on Alpine, such as the apk package manager.
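The effect of the -l flag can be observed on the host as well. Here's a minimal sketch, assuming /bin/sh is a POSIX shell: the login invocation reads /etc/profile before running the command, which typically (re)sets variables like PATH.

```shell
# A login shell (-l) reads login startup files such as /etc/profile;
# a plain shell does not. Compare the PATH each one ends up with:
sh -c 'echo "non-login PATH: $PATH"'
sh -l -c 'echo "login PATH:     $PATH"'
```

On many systems the two values are identical, but inside a fresh chroot the login shell is what gives you a sane PATH in the first place.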

But what happens when we try to list processes using ps inside the new container?

testing:/# ps aux 
PID USER TIME COMMAND
testing:/# ls /dev/
null

Yes, it’s expected that there are no processes or devices in the output — not even the shell we ran the command from. But why?

In Linux, process information is exposed via the /proc pseudo-filesystem, a virtual interface provided by the kernel. It is not a standard disk-backed filesystem but a dynamic, in-memory representation of process and system data. Since the /proc filesystem has not been explicitly mounted within the chroot environment, the /proc directory is empty, resulting in ps returning no output. The same goes for /dev and /sys, which are used to access device nodes (e.g., disks, terminals) and kernel parameters and hardware, respectively.
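Since ps is really just reading /proc, we can emulate a tiny version of it by hand. A minimal sketch, assuming a Linux host with /proc mounted:

```shell
# Every numeric directory under /proc is a PID, and /proc/<pid>/comm
# holds that process's command name -- this is what ps itself reads.
for pid in /proc/[0-9]*; do
    printf '%s\t%s\n' "${pid#/proc/}" "$(cat "$pid/comm" 2>/dev/null)"
done | head -n 5
```

Run this inside the chroot before mounting /proc and the loop finds nothing, which is exactly why ps printed an empty table.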

testing:/# mount -t proc proc /proc
testing:/# mount -t devtmpfs dev /dev/
testing:/# mount -t sysfs sys /sys
testing:/# ps aux
PID USER TIME COMMAND
1 root 25:34 /lib/systemd/systemd --system --deserialize 59
2 root 0:24 [kthreadd]
3 root 0:00 [rcu_gp]
4 root 0:00 [rcu_par_gp]
5 root 0:00 [slub_flushwq]
6 root 0:00 [netns]
8 root 0:00 [kworker/0:0H-ev]
10 root 0:00 [mm_percpu_wq]
11 root 0:00 [rcu_tasks_rude_]
12 root 0:00 [rcu_tasks_trace]

Nope, we’re not there yet. Another important aspect when dealing with containers is user management. It’s generally discouraged to run containers as the root user. Let’s try to find another user (other than root) to switch to.

testing:/# cat /etc/passwd 
root:x:0:0:root:/root:/bin/sh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/mail:/sbin/nologin
news:x:9:13:news:/usr/lib/news:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucppublic:/sbin/nologin
cron:x:16:16:cron:/var/spool/cron:/sbin/nologin
ftp:x:21:21::/var/lib/ftp:/sbin/nologin
sshd:x:22:22:sshd:/dev/null:/sbin/nologin
games:x:35:35:games:/usr/games:/sbin/nologin
ntp:x:123:123:NTP:/var/empty:/sbin/nologin
guest:x:405:100:guest:/dev/null:/sbin/nologin
nobody:x:65534:65534:nobody:/:/sbin/nologin

Hmm, there are no other users on this system besides root. The rest are system users, which we can't log in as. To fix this, we need to create our own user inside the container.

testing:/# echo "container::1001:1001:user:/home/container:/bin/sh" >> /etc/passwd
testing:/# mkdir -p /home/container
testing:/# chown -R 1001:1001 /home/container/
testing:/# echo "container:x:1001:" >> /etc/group
testing:/# su - container
testing:~$ echo "Hello from inside container" > hello
testing:~$ ls -alh hello
-rw-r--r-- 1 container container 28 Jul 25 07:23 hello
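The line we appended follows the standard seven-field /etc/passwd layout. A quick way to see which field is which, using an entry like the one above:

```shell
# /etc/passwd format: name:password:UID:GID:GECOS:home:shell
# The empty second field means no password is set (fine for a demo;
# real systems put 'x' there and keep hashes in /etc/shadow).
entry="container::1001:1001:user:/home/container:/bin/sh"
IFS=: read -r name pass uid gid gecos home shell <<EOF
$entry
EOF
echo "user=$name uid=$uid gid=$gid home=$home shell=$shell"
# → user=container uid=1001 gid=1001 home=/home/container shell=/bin/sh
```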

Since we have access to the container’s filesystem from the host, we can see that the user we created exists only inside the container: the host has no user with UID 1001, so ls shows the raw number. The group ID 1001, however, corresponds to a different group on the host system (microk8s).

root@testing:~/blog/containers# ls -alh alpine-container/home/container/hello 
-rw-r--r-- 1 1001 microk8s 28 Jul 25 07:23 alpine-container/home/container/hello
root@testing:~/blog/containers# grep 1001 /etc/group
microk8s:x:1001:

Note:

You’re probably wondering why the files in my pseudo-container aren’t deleted. Normally, when you stop a Docker container, any changes you’ve made are lost. That’s expected behavior: Docker containers are built on layered images, and any changes happen in a temporary overlay layer. Here, by contrast, we’re writing directly to a plain directory on the host’s disk, so changes persist. Overlays could be a whole blog post by themselves, but let’s not worry about them for now.

Are we finished yet? Not quite. The container we created has an isolated filesystem, but it’s on the same network as the host, shares the host’s hostname configuration, and can see all the processes running on the host. In other words, it still has significant visibility into the host environment. So, our job isn’t done yet; we’ll address these issues in the next blog post.

Bye!!!!!
