The jailer is a program designed to isolate the Firecracker process in order to enhance Firecracker’s security posture. It is meant to address the security needs of Firecracker only and is not intended to work with other binaries. Additionally, each jailer binary should be used with a statically linked Firecracker binary (with the default musl toolchain) of the same version. Experimental gnu builds are not supported.
The jailer is invoked in this manner:
jailer --id <id> \
--exec-file <exec_file> \
--uid <uid> \
--gid <gid>
[--parent-cgroup <relative_path>]
[--cgroup-version <cgroup-version>]
[--cgroup <cgroup>]
[--chroot-base-dir <chroot_base>]
[--netns <netns>]
[--resource-limit <resource=value>]
[--daemonize]
[--new-pid-ns]
[--...extra arguments for Firecracker]
- id is the unique VM identification string, which may contain alphanumeric characters and hyphens. The maximum id length is currently 64 characters.
- exec_file is the path to the Firecracker binary that will be exec-ed by the jailer. The filename must include the string firecracker. This is enforced because the interaction with the jailer is Firecracker-specific.
- uid and gid are the uid and gid the jailer switches to as it execs the target binary.
- parent-cgroup is used to allow the placement of microVM cgroups in custom nested hierarchies. When this parameter is specified, the jailer creates a new cgroup named id for the microVM in the <cgroup_base>/<parent_cgroup> subfolder. cgroup_base is the cgroup controller root for cgroup v1 (e.g. /sys/fs/cgroup/cpu) or the unified controller hierarchy for cgroup v2 (e.g. /sys/fs/cgroup/unified). <parent_cgroup> is a relative path within that hierarchy. For example, if --parent-cgroup all_uvms/external_uvms is specified, the jailer will write all cgroup parameters specified through --cgroup in /sys/fs/cgroup/<controller_name>/all_uvms/external_uvms/<id>. By default, the parent cgroup is <exec_file_name>, the last path component of exec_file.
  If no --cgroup parameters are specified and --cgroup-version=2 was passed, the jailer moves the process into the specified parent cgroup.
- cgroup-version is used to select which type of cgroup hierarchy to use for the creation of cgroups. The default value is “1”, which means that cgroups specified with the cgroup argument will be created within a v1 hierarchy. Supported options are “1” for cgroup-v1 and “2” for cgroup-v2.
- cgroup: cgroups can be passed to the jailer to let it set the values when the microVM process is spawned. The --cgroup argument must follow this format: <cgroup_file>=<value> (e.g. cpuset.cpus=0). This argument can be used multiple times to set multiple cgroups. This is useful to avoid providing privileged permissions to another process for setting the cgroups before or after the jailer is executed. The --cgroup flag can also help to set Firecracker process cgroups before the VM starts running, with no need to create the entire cgroup hierarchy manually (which requires privileged permissions).
- chroot_base represents the base folder where chroot jails are built. The default is /srv/jailer.
- netns represents the path to a network namespace handle. If present, the jailer will use this to join the associated network namespace.
- resource-limit can be used to set bounds to the process resources. The --resource-limit argument must follow this format: <resource>=<value> (e.g. no-file=1024) and can be used multiple times to set multiple bounds. The resources that can currently be limited using this argument are:
  - fsize: The maximum size in bytes for files created by the process.
  - no-file: Specifies a value one greater than the maximum file descriptor number that can be opened by this process.

  Here is an example of how to set multiple resource limits using this argument:

  --resource-limit fsize=250000000 --resource-limit no-file=1024
- The --daemonize flag causes the jailer to call setsid() and redirect all three standard I/O file descriptors to /dev/null.
- The --new-pid-ns flag causes the jailer to spawn the provided binary into a new PID namespace. It makes use of the libc clone() function with the CLONE_NEWPID flag. As a result, the jailer and the process running the exec file have different PIDs. The PID of the child process is stored in the jail root directory inside <exec_file_name>.pid.
- Arguments specified after -- are forwarded to Firecracker. For example, this can be paired with the --config-file Firecracker argument to specify a configuration file when starting Firecracker via the jailer (the file path and the resources referenced within must be valid relative to a jailed Firecracker). Please note that the jailer already passes the --id parameter to the Firecracker process.

After starting, the jailer goes through the following operations:
- Validate all provided paths and the VM id.
- Close all open file descriptors listed in /proc/<jailer-pid>/fd except input, output and error.
- Create the <chroot_base>/<exec_file_name>/<id>/root folder, which will be henceforth referred to as chroot_dir. exec_file_name is the last path component of exec_file (for example, that would be firecracker for /usr/bin/firecracker). Nothing is done if the path already exists (it should not, since id is supposed to be unique).
- Copy exec_file to <chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>. This ensures the new process will not share memory with any other Firecracker process.
- Set resource bounds through the --resource-limit argument, by calling the setrlimit() system call with the specific resource argument. If no limits are provided, the jailer bounds no-file to a maximum default value of 2048.
- Create the cgroup sub-folders. The jailer can use either cgroup v1 or cgroup v2. On most systems, this is mounted by default in /sys/fs/cgroup (should be mounted by the user otherwise). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<parent_cgroup>/<id> subfolder, and writes the current pid to <cgroup_base>/<parent_cgroup>/<id>/tasks. Also, the value passed for each <cgroup_file> is written to the corresponding file. If --node is used the corresponding values are written to the appropriate cpuset.mems and cpuset.cpus files.
- Call unshare() into a new mount namespace, use pivot_root() to switch the old system root mount point with a new one based in chroot_dir, switch the current working directory to the new root, unmount the old root mount point, and call chroot into the current directory.
- Use mknod to create a /dev/net/tun equivalent inside the jail.
- Use mknod to create a /dev/kvm equivalent inside the jail.
- Use chown to change ownership of the chroot_dir (root path / as seen by the jailed firecracker), /dev/net/tun, /dev/kvm. The ownership is changed to the provided uid:gid.
- If --netns <netns> is present, attempt to join the specified network namespace.
- If --daemonize is specified, call setsid() and redirect STDIN, STDOUT, and STDERR to /dev/null.
- If --new-pid-ns is specified, call clone() with the CLONE_NEWPID flag to spawn a new process within a new PID namespace. The new process will assume the role of init(1) in the new namespace. The parent will store the child’s PID inside <exec_file_name>.pid, while the child drops privileges and exec()s into the <exec_file_name>, as described below.
- Drop privileges by switching to the provided uid and gid.
- Exec into <exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque> (and also forward any extra arguments provided to the jailer after --, as mentioned in the Jailer Usage section), where:
  - id (string): The id argument provided to the jailer.
  - opaque (number): Time calculated by the jailer that it spent doing its work.

Let’s assume Firecracker is available as /usr/bin/firecracker, and the jailer can be found at /usr/bin/jailer. We pick the unique id 551e7604-e35c-42b3-b825-416853441234, and we choose to run on NUMA node 0 (in order to isolate the process in the 0th NUMA node we need to set cpuset.mems=0 and cpuset.cpus equal to the CPUs of that NUMA node), using uid 123 and gid 100. For this example, we are content with the default /srv/jailer chroot base dir.
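The id picked here satisfies the constraints stated for the id parameter (alphanumeric characters and hyphens, at most 64 characters). A caller could check this before invoking the jailer with something like the sketch below. The function name is hypothetical, and two readings are assumptions rather than statements from the jailer documentation: "alphanumeric" is taken to mean ASCII letters and digits, and an empty id is rejected.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits, and the id is non-empty.
_VM_ID_PATTERN = re.compile(r"[A-Za-z0-9-]{1,64}")


def is_valid_vm_id(vm_id: str) -> bool:
    """Return True if vm_id has 1..64 characters, all ASCII alphanumerics or hyphens."""
    return _VM_ID_PATTERN.fullmatch(vm_id) is not None


if __name__ == "__main__":
    print(is_valid_vm_id("551e7604-e35c-42b3-b825-416853441234"))
```

This is only a pre-flight check on the caller's side; the jailer performs its own validation regardless.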
We start by running:
/usr/bin/jailer --id 551e7604-e35c-42b3-b825-416853441234 \
    --cgroup cpuset.mems=0 --cgroup cpuset.cpus=$(cat /sys/devices/system/node/node0/cpulist) \
    --exec-file /usr/bin/firecracker --uid 123 --gid 100 \
    --netns /var/run/netns/my_netns --daemonize
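When the jailer is launched from a program rather than a shell, the same command line can be assembled as an argument list. The sketch below is one possible way to do that. The helper name and its parameters are hypothetical, it uses only flags listed in the usage synopsis, and the CPU list "0-3" is a placeholder for the contents of /sys/devices/system/node/node0/cpulist.

```python
import shlex


def build_jailer_argv(jailer, vm_id, exec_file, uid, gid,
                      cgroups=(), netns=None, daemonize=False, extra_args=()):
    """Assemble a jailer command line as a list suitable for subprocess.run()."""
    argv = [jailer, "--id", vm_id, "--exec-file", exec_file,
            "--uid", str(uid), "--gid", str(gid)]
    for cgroup_file, value in cgroups:
        # Each --cgroup argument has the form <cgroup_file>=<value>.
        argv += ["--cgroup", f"{cgroup_file}={value}"]
    if netns is not None:
        argv += ["--netns", netns]
    if daemonize:
        argv.append("--daemonize")
    if extra_args:
        # Everything after "--" is forwarded to Firecracker.
        argv += ["--", *extra_args]
    return argv


node0_cpus = "0-3"  # placeholder; read the real value from the node0 cpulist file
argv = build_jailer_argv(
    "/usr/bin/jailer",
    "551e7604-e35c-42b3-b825-416853441234",
    "/usr/bin/firecracker",
    uid=123,
    gid=100,
    cgroups=[("cpuset.mems", "0"), ("cpuset.cpus", node0_cpus)],
    netns="/var/run/netns/my_netns",
    daemonize=True,
)
print(shlex.join(argv))
```

Passing a list to subprocess.run() avoids shell quoting issues, which is why the CPU list is read separately instead of relying on $(cat ...) expansion.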
After opening the file descriptors mentioned in the previous section, the jailer will create the following resources (and all their prerequisites, such as the path which contains them):
/srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root/firecracker (copied from /usr/bin/firecracker)

We are going to refer to /srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root as <chroot_dir>.
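The layout <chroot_base>/<exec_file_name>/<id>/root can be computed ahead of time, for example to stage kernel or disk images into the jail. A minimal sketch follows; the function name is hypothetical and the default base simply mirrors the documented /srv/jailer.

```python
from pathlib import PurePosixPath


def chroot_dir(exec_file, vm_id, chroot_base="/srv/jailer"):
    """Return <chroot_base>/<exec_file_name>/<id>/root for the given exec file and id."""
    exec_file_name = PurePosixPath(exec_file).name  # last path component of exec_file
    return PurePosixPath(chroot_base) / exec_file_name / vm_id / "root"


root = chroot_dir("/usr/bin/firecracker", "551e7604-e35c-42b3-b825-416853441234")
print(root)                  # the jail root as seen from the host
print(root / "firecracker")  # where the copied binary lives
```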
Let’s also assume the cpuset cgroups are mounted at /sys/fs/cgroup/cpuset. The jailer will create the following subfolder (which will inherit settings from the parent cgroup):

/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234
It’s worth noting that, whenever a folder already exists, nothing will be done, and we move on to the next directory that needs to be created. This should only happen for the common firecracker subfolder (but, as for creating the chroot path before, we do not issue an error if folders directly associated with the supposedly unique id already exist).
The jailer then writes the current pid to /sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/tasks. It also writes 0 to /sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/cpuset.mems, and the corresponding CPUs to /sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/cpuset.cpus.
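The cgroup v1 writes described above can be summarized as a list of (file, value) pairs. The sketch below only computes that list and touches nothing on disk. The function name is hypothetical, the mounts mapping stands in for what the jailer derives from /proc/mounts, the controller is assumed to be the part of <cgroup_file> before the first dot, and the pid 4242 and CPU list "0-3" are placeholders. The pairs are listed in the order the text mentions them, which is not a claim about the order the jailer itself uses.

```python
from pathlib import PurePosixPath


def cgroup_v1_writes(mounts, parent_cgroup, vm_id, pid, cgroups):
    """Return (path, value) pairs for a cgroup v1 setup.

    mounts maps a controller name to its mount point (hypothetical input).
    cgroups is a sequence of (cgroup_file, value) pairs from --cgroup arguments.
    """
    writes = []
    seen_dirs = []
    for cgroup_file, value in cgroups:
        controller = cgroup_file.split(".", 1)[0]  # assumption: "cpuset.mems" -> "cpuset"
        cg_dir = PurePosixPath(mounts[controller]) / parent_cgroup / vm_id
        if cg_dir not in seen_dirs:
            # The pid is written once per <cgroup_base>/<parent_cgroup>/<id> folder.
            seen_dirs.append(cg_dir)
            writes.append((str(cg_dir / "tasks"), str(pid)))
        writes.append((str(cg_dir / cgroup_file), value))
    return writes


writes = cgroup_v1_writes(
    mounts={"cpuset": "/sys/fs/cgroup/cpuset"},
    parent_cgroup="firecracker",  # default: the exec file name
    vm_id="551e7604-e35c-42b3-b825-416853441234",
    pid=4242,  # placeholder pid
    cgroups=[("cpuset.mems", "0"), ("cpuset.cpus", "0-3")],
)
for path, value in writes:
    print(f"{value!r} -> {path}")
```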
Since the --netns parameter is specified in our example, the jailer opens /var/run/netns/my_netns to get a file descriptor fd, uses setns(fd, CLONE_NEWNET) to join the associated network namespace, and then closes fd.
The --daemonize flag is also present, so the jailer opens /dev/null as RW and keeps the associated file descriptor as dev_null_fd (we do this before going inside the jail), to be used later.
Build the chroot jail. First, the jailer uses unshare() to enter a new mount namespace, and changes the propagation of all mount points in the new namespace to private using mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), as a prerequisite to pivot_root(). Another required operation is to bind mount <chroot_dir> on top of itself using mount(<chroot_dir>, <chroot_dir>, NULL, MS_BIND, NULL). At this point, the jailer creates the folder <chroot_dir>/old_root, changes the current directory to <chroot_dir>, and calls syscall(SYS_pivot_root, ".", "old_root"). The final steps of building the jail are unmounting old_root using umount2("old_root", MNT_DETACH), deleting old_root with rmdir, and finally calling chroot(".") for good measure. From now on, the process is jailed in <chroot_dir>.
Create the special file /dev/net/tun, using mknod("/dev/net/tun", S_IFCHR | S_IRUSR | S_IWUSR, makedev(10, 200)), and then call chown("/dev/net/tun", 123, 100), so Firecracker can use it after dropping privileges. This is required to use multiple TAP interfaces when running jailed. Do the same for /dev/kvm.
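To make the mknod arguments concrete, the sketch below computes the mode and device number used for /dev/net/tun without creating anything, since the actual mknod and chown calls need privileges and must run inside the jail. The helper name is hypothetical; os.makedev, os.major and os.minor are available on Unix only.

```python
import os
import stat


def char_device_spec(major, minor):
    """Return (mode, device) for a character device readable and writable by its owner."""
    mode = stat.S_IFCHR | stat.S_IRUSR | stat.S_IWUSR
    return mode, os.makedev(major, minor)


tun_mode, tun_dev = char_device_spec(10, 200)
print(oct(tun_mode), os.major(tun_dev), os.minor(tun_dev))

# With sufficient privileges, inside the jail, the calls from the text would be:
#   os.mknod("/dev/net/tun", tun_mode, tun_dev)
#   os.chown("/dev/net/tun", 123, 100)
```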
Change ownership of <chroot_dir> to uid:gid so that Firecracker can create its API socket there.
Since the --daemonize flag is present, call setsid() to join a new session, a new process group, and to detach from the controlling terminal. Then, redirect standard file descriptors to /dev/null by calling dup2(dev_null_fd, STDIN), dup2(dev_null_fd, STDOUT), and dup2(dev_null_fd, STDERR). Close dev_null_fd, because it is no longer necessary.
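The setsid() and dup2() sequence can be sketched as follows. This is an illustration of the steps in the text, not the jailer's own code; both function names are hypothetical. The demo forks first so that only the child is detached and silenced, and it opens /dev/null inside the child for convenience, whereas the jailer opens it before entering the jail.

```python
import os


def daemonize(dev_null_fd):
    """Start a new session and point stdin, stdout and stderr at dev_null_fd."""
    os.setsid()  # fails if the caller is already a process group leader
    for std_fd in (0, 1, 2):
        os.dup2(dev_null_fd, std_fd)
    if dev_null_fd > 2:
        # Guard added for this sketch: do not close one of the standard descriptors.
        os.close(dev_null_fd)


def demo_in_child():
    """Run daemonize() in a forked child and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            daemonize(os.open("/dev/null", os.O_RDWR))
            null_rdev = os.stat("/dev/null").st_rdev
            ok = os.getsid(0) == os.getpid() and all(
                os.fstat(fd).st_rdev == null_rdev for fd in (0, 1, 2)
            )
            os._exit(0 if ok else 1)
        except OSError:
            os._exit(2)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("child exit code:", demo_in_child())
```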
Finally, the jailer switches the uid to 123 and the gid to 100, and execs:
./firecracker \
--id="551e7604-e35c-42b3-b825-416853441234" \
--start-time-us=<opaque> \
--start-time-cpu-us=<opaque>
Now firecracker creates the socket at /srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root/<api-sock> to interact with the VM.

Note: default value for <api-sock> is /run/firecracker.socket.
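Because Firecracker sees <chroot_dir> as /, a client on the host has to prefix jailed paths with the jail root to reach the socket. A small sketch of that translation follows; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def host_path(chroot_dir, jailed_path):
    """Map an absolute path as seen inside the jail to the matching host path."""
    # relative_to("/") raises ValueError when jailed_path is not absolute.
    return PurePosixPath(chroot_dir) / PurePosixPath(jailed_path).relative_to("/")


api_socket = host_path(
    "/srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root",
    "/run/firecracker.socket",  # documented default for <api-sock>
)
print(api_socket)
```

The same mapping applies to any file path passed to Firecracker through the API or a configuration file, since those paths are resolved inside the jail.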
- [...] --cgroup command line argument.
- [...] notify_on_release mechanism, while being wary about potential race conditions (the instance crashing before the subscription process is complete, for example).
- The --new-pid-ns flag enables the jailer to exec the binary file in a new PID namespace, in order to become a pseudo-init process. Alternatively, the user can spawn the jailer in a new PID namespace via a combination of clone() with the CLONE_NEWPID flag and exec().
- [...] root user; it actually requires a more restricted set of capabilities, but that’s to be determined as features stabilize.
- [...] --daemonize runs towards the end, instead of the very beginning. We are working on adding better logging capabilities.

- If all the cgroup controllers are bunched up on a single mount point using the “all” option, our current program logic will complain it cannot detect individual controller mount points.
- #4287: When starting a jailer with --parent-cgroup specified but no --cgroup flags, the rules in the parent cgroup folder are ignored. To work around this, use a dummy cgroup parameter such as --cgroup=memory.max=max.