[PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container
@ 2020-07-22 13:02 ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, rmohr, Dr. David Alan Gilbert, Stefan Hajnoczi, vromanso

Container runtimes handle namespace setup and remove privileges needed by
virtiofsd to perform sandboxing. Luckily the container environment already
provides most of the sandbox that virtiofsd needs for security.

Introduce a new "virtiofsd -o chroot" option that uses chroot(2) instead of
namespaces. This option allows virtiofsd to work inside a container.

Please see the individual patches for details on the changes and security
implications.

Given that people are starting to attempt running virtiofsd in containers I
think this should go into QEMU 5.1.

Stefan Hajnoczi (3):
  virtiofsd: drop CAP_DAC_READ_SEARCH
  virtiofsd: add container-friendly -o chroot sandboxing option
  virtiofsd: probe unshare(CLONE_FS) and print an error

 tools/virtiofsd/fuse_virtio.c    | 13 +++++++++
 tools/virtiofsd/helper.c         |  3 +++
 tools/virtiofsd/passthrough_ll.c | 45 +++++++++++++++++++++++++++++---
 3 files changed, 58 insertions(+), 3 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Virtio-fs] [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container
@ 2020-07-22 13:02 ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, rmohr, vromanso

Container runtimes handle namespace setup and remove privileges needed by
virtiofsd to perform sandboxing. Luckily the container environment already
provides most of the sandbox that virtiofsd needs for security.

Introduce a new "virtiofsd -o chroot" option that uses chroot(2) instead of
namespaces. This option allows virtiofsd to work inside a container.

Please see the individual patches for details on the changes and security
implications.

Given that people are starting to attempt running virtiofsd in containers I
think this should go into QEMU 5.1.

Stefan Hajnoczi (3):
  virtiofsd: drop CAP_DAC_READ_SEARCH
  virtiofsd: add container-friendly -o chroot sandboxing option
  virtiofsd: probe unshare(CLONE_FS) and print an error

 tools/virtiofsd/fuse_virtio.c    | 13 +++++++++
 tools/virtiofsd/helper.c         |  3 +++
 tools/virtiofsd/passthrough_ll.c | 45 +++++++++++++++++++++++++++++---
 3 files changed, 58 insertions(+), 3 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH for-5.1 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH
  2020-07-22 13:02 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, rmohr, Dr. David Alan Gilbert, Stefan Hajnoczi, vromanso

virtiofsd does not need CAP_DAC_READ_SEARCH because it already has
the more powerful CAP_DAC_OVERRIDE. Drop it from the list of
capabilities.

This is important because container runtimes may not include
CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce
its capabilities when running inside a Docker container.

Note that CAP_DAC_READ_SEARCH may be necessary again in the future if
virtiofsd starts using open_by_handle_at(2).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 94e0de2d2b..50a164a599 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in)
     if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
             CAP_CHOWN,
             CAP_DAC_OVERRIDE,
-            CAP_DAC_READ_SEARCH,
             CAP_FOWNER,
             CAP_FSETID,
             CAP_SETGID,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Virtio-fs] [PATCH for-5.1 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, rmohr, vromanso

virtiofsd does not need CAP_DAC_READ_SEARCH because it already has
the more powerful CAP_DAC_OVERRIDE. Drop it from the list of
capabilities.

This is important because container runtimes may not include
CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce
its capabilities when running inside a Docker container.

Note that CAP_DAC_READ_SEARCH may be necessary again in the future if
virtiofsd starts using open_by_handle_at(2).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 94e0de2d2b..50a164a599 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in)
     if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
             CAP_CHOWN,
             CAP_DAC_OVERRIDE,
-            CAP_DAC_READ_SEARCH,
             CAP_FOWNER,
             CAP_FSETID,
             CAP_SETGID,
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 13:02 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, rmohr, Dr. David Alan Gilbert, Stefan Hajnoczi, vromanso

virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
is required to create namespaces.

Introduce a weaker sandbox that is sufficient in container environments
because the container runtime already sets up namespaces. Use chroot to
restrict path traversal to the shared directory.

virtiofsd loses the following:

1. Mount namespace. The process chroots to the shared directory but
   leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
   syscalls.

2. Pid namespace. This should be fine because virtiofsd is the only
   process running in the container.

3. Network namespace. This should be fine because seccomp already
   rejects the connect(2) syscall, but an additional layer of security
   is lost. Container runtime-specific network security policies can be
   used drop network traffic (except for the vhost-user UNIX domain
   socket).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/helper.c         |  3 +++
 tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 3105b6c23a..7421c9ca1a 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
            "    -o cache=<mode>            cache mode. could be one of \"auto, "
            "always, none\"\n"
            "                               default: auto\n"
+           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
+           "                               of stronger mount namespace sandbox\n"
+           "                               default: false\n"
            "    -o flock|no_flock          enable/disable flock\n"
            "                               default: no_flock\n"
            "    -o log_level=<level>       log level, default to \"info\"\n"
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 50a164a599..990c0a8a70 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -139,6 +139,7 @@ enum {
 
 struct lo_data {
     pthread_mutex_t mutex;
+    int chroot; /* 1 - use chroot, 0 - use mount namespace */
     int debug;
     int writeback;
     int flock;
@@ -162,6 +163,8 @@ struct lo_data {
 };
 
 static const struct fuse_opt lo_opts[] = {
+    { "chroot", offsetof(struct lo_data, chroot), 1 },
+    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
     { "writeback", offsetof(struct lo_data, writeback), 1 },
     { "no_writeback", offsetof(struct lo_data, writeback), 0 },
     { "source=%s", offsetof(struct lo_data, source), 0 },
@@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
     pthread_mutex_unlock(&cap.mutex);
 }
 
+/*
+ * Use chroot as a weaker sandbox for environment where the process is launched
+ * without CAP_SYS_ADMIN.
+ */
+static void setup_chroot(struct lo_data *lo)
+{
+    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+    if (lo->proc_self_fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
+        exit(1);
+    }
+
+    /*
+     * Make the shared directory the file system root so that FUSE_OPEN
+     * (lo_open()) cannot escape the shared directory by opening a symlink.
+     *
+     * It's still possible to escape the chroot via lo->proc_self_fd but that
+     * requires gaining control of the process first.
+     */
+    if (chroot(lo->source) != 0) {
+        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
+        exit(1);
+    }
+
+    /* Move into the chroot */
+    if (chdir("/") != 0) {
+        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
+        exit(1);
+    }
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
                           bool enable_syslog)
 {
-    setup_namespaces(lo, se);
-    setup_mounts(lo->source);
+    if (lo->chroot) {
+        setup_chroot(lo);
+    } else {
+        setup_namespaces(lo, se);
+        setup_mounts(lo->source);
+    }
+
     setup_seccomp(enable_syslog);
     setup_capabilities(g_strdup(lo->modcaps));
 }
@@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
     struct fuse_session *se;
     struct fuse_cmdline_opts opts;
     struct lo_data lo = {
+        .chroot = 0,
         .debug = 0,
         .writeback = 0,
         .posix_lock = 1,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, rmohr, vromanso

virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
is required to create namespaces.

Introduce a weaker sandbox that is sufficient in container environments
because the container runtime already sets up namespaces. Use chroot to
restrict path traversal to the shared directory.

virtiofsd loses the following:

1. Mount namespace. The process chroots to the shared directory but
   leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
   syscalls.

2. Pid namespace. This should be fine because virtiofsd is the only
   process running in the container.

3. Network namespace. This should be fine because seccomp already
   rejects the connect(2) syscall, but an additional layer of security
   is lost. Container runtime-specific network security policies can be
   used drop network traffic (except for the vhost-user UNIX domain
   socket).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/helper.c         |  3 +++
 tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 3105b6c23a..7421c9ca1a 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
            "    -o cache=<mode>            cache mode. could be one of \"auto, "
            "always, none\"\n"
            "                               default: auto\n"
+           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
+           "                               of stronger mount namespace sandbox\n"
+           "                               default: false\n"
            "    -o flock|no_flock          enable/disable flock\n"
            "                               default: no_flock\n"
            "    -o log_level=<level>       log level, default to \"info\"\n"
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 50a164a599..990c0a8a70 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -139,6 +139,7 @@ enum {
 
 struct lo_data {
     pthread_mutex_t mutex;
+    int chroot; /* 1 - use chroot, 0 - use mount namespace */
     int debug;
     int writeback;
     int flock;
@@ -162,6 +163,8 @@ struct lo_data {
 };
 
 static const struct fuse_opt lo_opts[] = {
+    { "chroot", offsetof(struct lo_data, chroot), 1 },
+    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
     { "writeback", offsetof(struct lo_data, writeback), 1 },
     { "no_writeback", offsetof(struct lo_data, writeback), 0 },
     { "source=%s", offsetof(struct lo_data, source), 0 },
@@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
     pthread_mutex_unlock(&cap.mutex);
 }
 
+/*
+ * Use chroot as a weaker sandbox for environment where the process is launched
+ * without CAP_SYS_ADMIN.
+ */
+static void setup_chroot(struct lo_data *lo)
+{
+    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+    if (lo->proc_self_fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
+        exit(1);
+    }
+
+    /*
+     * Make the shared directory the file system root so that FUSE_OPEN
+     * (lo_open()) cannot escape the shared directory by opening a symlink.
+     *
+     * It's still possible to escape the chroot via lo->proc_self_fd but that
+     * requires gaining control of the process first.
+     */
+    if (chroot(lo->source) != 0) {
+        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
+        exit(1);
+    }
+
+    /* Move into the chroot */
+    if (chdir("/") != 0) {
+        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
+        exit(1);
+    }
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
                           bool enable_syslog)
 {
-    setup_namespaces(lo, se);
-    setup_mounts(lo->source);
+    if (lo->chroot) {
+        setup_chroot(lo);
+    } else {
+        setup_namespaces(lo, se);
+        setup_mounts(lo->source);
+    }
+
     setup_seccomp(enable_syslog);
     setup_capabilities(g_strdup(lo->modcaps));
 }
@@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
     struct fuse_session *se;
     struct fuse_cmdline_opts opts;
     struct lo_data lo = {
+        .chroot = 0,
         .debug = 0,
         .writeback = 0,
         .posix_lock = 1,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-22 13:02 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: vromanso, Misono Tomohiro, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, rmohr

An assertion failure is raised during request processing if
unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
be detected right away.

Unfortunately Docker/Moby does not include unshare in the seccomp.json
list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
include unshare (e.g. podman is unaffected):
https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json

Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
default seccomp.json is missing unshare.

Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 3b6d16a041..ebeb352514 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
 {
     int ret;
 
+    /*
+     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
+     * an unprivileged system call but some Docker/Moby versions are known to
+     * reject it via seccomp when CAP_SYS_ADMIN is not given.
+     */
+    ret = unshare(CLONE_FS);
+    if (ret == -1 && errno == EPERM) {
+        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
+                "running in a container please check that the container "
+                "runtime seccomp policy allows unshare.\n");
+        return -1;
+    }
+
     ret = fv_create_listen_socket(se);
     if (ret < 0) {
         return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
@ 2020-07-22 13:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-22 13:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: vromanso, virtio-fs, rmohr

An assertion failure is raised during request processing if
unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
be detected right away.

Unfortunately Docker/Moby does not include unshare in the seccomp.json
list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
include unshare (e.g. podman is unaffected):
https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json

Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
default seccomp.json is missing unshare.

Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 3b6d16a041..ebeb352514 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
 {
     int ret;
 
+    /*
+     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
+     * an unprivileged system call but some Docker/Moby versions are known to
+     * reject it via seccomp when CAP_SYS_ADMIN is not given.
+     */
+    ret = unshare(CLONE_FS);
+    if (ret == -1 && errno == EPERM) {
+        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
+                "running in a container please check that the container "
+                "runtime seccomp policy allows unshare.\n");
+        return -1;
+    }
+
     ret = fv_create_listen_socket(se);
     if (ret < 0) {
         return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 16:51     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 16:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd does not need CAP_DAC_READ_SEARCH because it already has
> the more powerful CAP_DAC_OVERRIDE. Drop it from the list of
> capabilities.
> 
> This is important because container runtimes may not include
> CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce
> its capabilities when running inside a Docker container.
> 
> Note that CAP_DAC_READ_SEARCH may be necessary again in the future if
> virtiofsd starts using open_by_handle_at(2).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Yeh that seems to make sense, and is probably worth having irrespective
of the rest of the series.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  tools/virtiofsd/passthrough_ll.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 94e0de2d2b..50a164a599 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in)
>      if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
>              CAP_CHOWN,
>              CAP_DAC_OVERRIDE,
> -            CAP_DAC_READ_SEARCH,
>              CAP_FOWNER,
>              CAP_FSETID,
>              CAP_SETGID,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH
@ 2020-07-22 16:51     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 16:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd does not need CAP_DAC_READ_SEARCH because it already has
> the more powerful CAP_DAC_OVERRIDE. Drop it from the list of
> capabilities.
> 
> This is important because container runtimes may not include
> CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce
> its capabilities when running inside a Docker container.
> 
> Note that CAP_DAC_READ_SEARCH may be necessary again in the future if
> virtiofsd starts using open_by_handle_at(2).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Yeh that seems to make sense, and is probably worth having irrespective
of the rest of the series.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  tools/virtiofsd/passthrough_ll.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 94e0de2d2b..50a164a599 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in)
>      if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
>              CAP_CHOWN,
>              CAP_DAC_OVERRIDE,
> -            CAP_DAC_READ_SEARCH,
>              CAP_FOWNER,
>              CAP_FSETID,
>              CAP_SETGID,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 16:58     ` Daniel P. Berrangé
  -1 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-22 16:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, vromanso, qemu-devel, Dr. David Alan Gilbert, rmohr

On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).

IIUC this relies on the fact that the container will still have 
CAP_SYS_CHROOT IOW, we still don't have a solution for running
virtiofsd as an unprivileged user.

Bearing in mind possibly need for future improvements to handle
unprivileged users....

> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"

A simple boolean like this feels like it lacks future-proofing.

Should we have an option that takes an enumerated arg string:

   -o sandbox=[namespace|chroot|....]

so we can add more options later with the same syntax.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-22 16:58     ` Daniel P. Berrangé
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-22 16:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).

IIUC this relies on the fact that the container will still have 
CAP_SYS_CHROOT IOW, we still don't have a solution for running
virtiofsd as an unprivileged user.

Bearing in mind possibly need for future improvements to handle
unprivileged users....

> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"

A simple boolean like this feels like it lacks future-proofing.

Should we have an option that takes an enumerated arg string:

   -o sandbox=[namespace|chroot|....]

so we can add more options later with the same syntax.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 17:03     ` Daniel P. Berrangé
  -1 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-22 17:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: vromanso, Misono Tomohiro, qemu-devel, virtio-fs, rmohr,
	Dr. David Alan Gilbert

On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> An assertion failure is raised during request processing if
> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> be detected right away.
> 
> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> include unshare (e.g. podman is unaffected):
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> 
> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> default seccomp.json is missing unshare.
> 
> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 3b6d16a041..ebeb352514 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
>  {
>      int ret;
>  
> +    /*
> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> +     * an unprivileged system call but some Docker/Moby versions are known to
> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> +     */
> +    ret = unshare(CLONE_FS);
> +    if (ret == -1 && errno == EPERM) {
> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> +                "running in a container please check that the container "
> +                "runtime seccomp policy allows unshare.\n");
> +        return -1;
> +    }
> +

This describes the unshare() call as a "probe" and a "test", but that's
misleading IMHO. A "probe" / "test" implies that after it has completed,
there's no lingering side-effect, which isn't the case here.

This is actively changing the process' namespace environment in the
success case, and not putting it back how it was originally.

May be this is in fact OK, but if so I think the commit message and
comment should explain/justify what its fine to have this lingering
side-effect.

If we want to avoid the side-effect then we need to fork() and run
unshare() in the child, and use a check of exit status of the child
to determine the result.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
@ 2020-07-22 17:03     ` Daniel P. Berrangé
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-22 17:03 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: vromanso, qemu-devel, virtio-fs, rmohr

On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> An assertion failure is raised during request processing if
> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> be detected right away.
> 
> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> include unshare (e.g. podman is unaffected):
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> 
> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> default seccomp.json is missing unshare.
> 
> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 3b6d16a041..ebeb352514 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
>  {
>      int ret;
>  
> +    /*
> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> +     * an unprivileged system call but some Docker/Moby versions are known to
> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> +     */
> +    ret = unshare(CLONE_FS);
> +    if (ret == -1 && errno == EPERM) {
> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> +                "running in a container please check that the container "
> +                "runtime seccomp policy allows unshare.\n");
> +        return -1;
> +    }
> +

This describes the unshare() call as a "probe" and a "test", but that's
misleading IMHO. A "probe" / "test" implies that after it has completed,
there's no lingering side-effect, which isn't the case here.

This is actively changing the process' namespace environment in the
success case, and not putting it back how it was originally.

May be this is in fact OK, but if so I think the commit message and
comment should explain/justify what its fine to have this lingering
side-effect.

If we want to avoid the side-effect then we need to fork() and run
unshare() in the child, and use a check of exit status of the child
to determine the result.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 17:58     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 17:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.

OK, I'm guessing the behaviour of what happens if the host adds another
mount afterwards might be different?

> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.

Is it ? Isn't the qemu and any other vhost-user processes also in the
same container?

> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).

Should this be tied to the same flag - this feels different from the
chroot specific problem.

> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"

I agree with Dan that something more enum like feels right

>             "    -o flock|no_flock          enable/disable flock\n"
>             "                               default: no_flock\n"
>             "    -o log_level=<level>       log level, default to \"info\"\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..990c0a8a70 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -139,6 +139,7 @@ enum {
>  
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int chroot; /* 1 - use chroot, 0 - use mount namespace */
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +163,8 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "chroot", offsetof(struct lo_data, chroot), 1 },
> +    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environment where the process is launched
> + * without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> +     * requires gaining control of the process first.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }
> +
> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +

This looks OK to me, but I would prefer a check from someone who has the
experience of why the mount based sandboxing is so much more common than
old chroot.

Dave

>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->chroot) {
> +        setup_chroot(lo);
> +    } else {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .chroot = 0,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-22 17:58     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 17:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.

OK, I'm guessing the behaviour of what happens if the host adds another
mount afterwards might be different?

> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.

Is it ? Isn't the qemu and any other vhost-user processes also in the
same container?

> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).

Should this be tied to the same flag - this feels different from the
chroot specific problem.

> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"

I agree with Dan that something more enum like feels right

>             "    -o flock|no_flock          enable/disable flock\n"
>             "                               default: no_flock\n"
>             "    -o log_level=<level>       log level, default to \"info\"\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..990c0a8a70 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -139,6 +139,7 @@ enum {
>  
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int chroot; /* 1 - use chroot, 0 - use mount namespace */
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +163,8 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "chroot", offsetof(struct lo_data, chroot), 1 },
> +    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environment where the process is launched
> + * without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> +     * requires gaining control of the process first.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }
> +
> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +

This looks OK to me, but I would prefer a check from someone who has the
experience of why the mount based sandboxing is so much more common than
old chroot.

Dave

>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->chroot) {
> +        setup_chroot(lo);
> +    } else {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .chroot = 0,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
                     ` (2 preceding siblings ...)
  (?)
@ 2020-07-22 18:17   ` Vivek Goyal
  2020-07-23 12:29     ` Stefan Hajnoczi
  -1 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-07-22 18:17 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"

This option name disabling namespace setup is little confusing to me.

Will it make sense to provide another option to disable/enable
namespaces. "-o no-namespaces" and that disables setting up
namespaces.

Thanks
Vivek

>             "    -o flock|no_flock          enable/disable flock\n"
>             "                               default: no_flock\n"
>             "    -o log_level=<level>       log level, default to \"info\"\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..990c0a8a70 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -139,6 +139,7 @@ enum {
>  
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int chroot; /* 1 - use chroot, 0 - use mount namespace */
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +163,8 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "chroot", offsetof(struct lo_data, chroot), 1 },
> +    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environment where the process is launched
> + * without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> +     * requires gaining control of the process first.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }
> +
> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->chroot) {
> +        setup_chroot(lo);
> +    } else {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .chroot = 0,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> -- 
> 2.26.2
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container
  2020-07-22 13:02 ` [Virtio-fs] " Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  (?)
@ 2020-07-22 18:19 ` Vivek Goyal
  2020-07-23 12:46   ` Stefan Hajnoczi
  -1 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-07-22 18:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

On Wed, Jul 22, 2020 at 02:02:03PM +0100, Stefan Hajnoczi wrote:
> Container runtimes handle namespace setup and remove privileges needed by
> virtiofsd to perform sandboxing. Luckily the container environment already
> provides most of the sandbox that virtiofsd needs for security.
> 
> Introduce a new "virtiofsd -o chroot" option that uses chroot(2) instead of
> namespaces. This option allows virtiofsd to work inside a container.
> 
> Please see the individual patches for details on the changes and security
> implications.
> 
> Given that people are starting to attempt running virtiofsd in containers I
> think this should go into QEMU 5.1.

Hi Stefan,

I have written a document to help with testing virtiofs with any changes.

https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/virtio-fs-testing-requirement.txt

Will be good to run some of these tests to make sure there are no
regressions due to these changes.

Thanks
Vivek

> 
> Stefan Hajnoczi (3):
>   virtiofsd: drop CAP_DAC_READ_SEARCH
>   virtiofsd: add container-friendly -o chroot sandboxing option
>   virtiofsd: probe unshare(CLONE_FS) and print an error
> 
>  tools/virtiofsd/fuse_virtio.c    | 13 +++++++++
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 45 +++++++++++++++++++++++++++++---
>  3 files changed, 58 insertions(+), 3 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-22 19:03     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 19:03 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"
>             "    -o flock|no_flock          enable/disable flock\n"
>             "                               default: no_flock\n"
>             "    -o log_level=<level>       log level, default to \"info\"\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..990c0a8a70 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -139,6 +139,7 @@ enum {
>  
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int chroot; /* 1 - use chroot, 0 - use mount namespace */
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +163,8 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "chroot", offsetof(struct lo_data, chroot), 1 },
> +    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environment where the process is launched
> + * without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> +     * requires gaining control of the process first.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }

I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
chroot'ing to stop an old escape (where you create another jail inside
the jail and the kernel then lets you walk outside of the old one).

Dave

> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->chroot) {
> +        setup_chroot(lo);
> +    } else {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .chroot = 0,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-22 19:03     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-22 19:03 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> is required to create namespaces.
> 
> Introduce a weaker sandbox that is sufficient in container environments
> because the container runtime already sets up namespaces. Use chroot to
> restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/helper.c         |  3 +++
>  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..7421c9ca1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
>             "    -o cache=<mode>            cache mode. could be one of \"auto, "
>             "always, none\"\n"
>             "                               default: auto\n"
> +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> +           "                               of stronger mount namespace sandbox\n"
> +           "                               default: false\n"
>             "    -o flock|no_flock          enable/disable flock\n"
>             "                               default: no_flock\n"
>             "    -o log_level=<level>       log level, default to \"info\"\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..990c0a8a70 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -139,6 +139,7 @@ enum {
>  
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int chroot; /* 1 - use chroot, 0 - use mount namespace */
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +163,8 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "chroot", offsetof(struct lo_data, chroot), 1 },
> +    { "no_chroot", offsetof(struct lo_data, chroot), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2668,37 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environment where the process is launched
> + * without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> +     * requires gaining control of the process first.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }

I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
chroot'ing to stop an old escape (where you create another jail inside
the jail and the kernel then lets you walk outside of the old one).

Dave

> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2706,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->chroot) {
> +        setup_chroot(lo);
> +    } else {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2859,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .chroot = 0,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 16:58     ` [Virtio-fs] " Daniel P. Berrangé
@ 2020-07-23 12:17       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:17 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, vromanso, qemu-devel, Dr. David Alan Gilbert, rmohr

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

On Wed, Jul 22, 2020 at 05:58:11PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > is required to create namespaces.
> > 
> > Introduce a weaker sandbox that is sufficient in container environments
> > because the container runtime already sets up namespaces. Use chroot to
> > restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> > 
> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> > 
> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> 
> IIUC this relies on the fact that the container will still have 
> CAP_SYS_CHROOT IOW, we still don't have a solution for running
> virtiofsd as an unprivileged user.

Yes, this still requires root in the container.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 12:17       ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:17 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

On Wed, Jul 22, 2020 at 05:58:11PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > is required to create namespaces.
> > 
> > Introduce a weaker sandbox that is sufficient in container environments
> > because the container runtime already sets up namespaces. Use chroot to
> > restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> > 
> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> > 
> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> 
> IIUC this relies on the fact that the container will still have 
> CAP_SYS_CHROOT IOW, we still don't have a solution for running
> virtiofsd as an unprivileged user.

Yes, this still requires root in the container.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 17:58     ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-07-23 12:28       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > is required to create namespaces.
> > 
> > Introduce a weaker sandbox that is sufficient in container environments
> > because the container runtime already sets up namespaces. Use chroot to
> > restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> 
> OK, I'm guessing the behaviour of what happens if the host adds another
> mount afterwards might be different?

Running inside a container with -o chroot:
 - The container has its own mount namespace and it is therefore not
   affected, modulo shared subtrees (see mount(8)).

Running outside a container with -o chroot:
 - Path traversal can only reach mounts that are made within the shared
   directory tree. Technically other mounts are still there but it is
   not possible to reach them via file system paths.

> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> 
> Is it ? Isn't the qemu and any other vhost-user processes also in the
> same container?

No. QEMU, virtiofsd, and other vhost-user processes should run in their
own containers. Application container images are typically designed to
run a single program per container. It's technically possible to launch
multiple programs but that is considered bad practice for application
containers.

Kubernetes:
Containers in a pod do not share a single pid namespace by default.
Pods do share a single network namespace so they can communicate via
UNIX domain sockets.

> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> 
> Should this be tied to the same flag - this feels different from the
> chroot specific problem.

Good point. Daniel Berrange has suggested another command-line syntax
that makes sandbox configuration more modular. I'll try to implement
something like that.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 12:28       ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > is required to create namespaces.
> > 
> > Introduce a weaker sandbox that is sufficient in container environments
> > because the container runtime already sets up namespaces. Use chroot to
> > restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> 
> OK, I'm guessing the behaviour of what happens if the host adds another
> mount afterwards might be different?

Running inside a container with -o chroot:
 - The container has its own mount namespace and it is therefore not
   affected, modulo shared subtrees (see mount(8)).

Running outside a container with -o chroot:
 - Path traversal can only reach mounts that are made within the shared
   directory tree. Technically other mounts are still there but it is
   not possible to reach them via file system paths.

> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> 
> Is it ? Isn't the qemu and any other vhost-user processes also in the
> same container?

No. QEMU, virtiofsd, and other vhost-user processes should run in their
own containers. Application container images are typically designed to
run a single program per container. It's technically possible to launch
multiple programs but that is considered bad practice for application
containers.

Kubernetes:
Containers in a pod do not share a single pid namespace by default.
Pods do share a single network namespace so they can communicate via
UNIX domain sockets.

> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> 
> Should this be tied to the same flag - this feels different from the
> chroot specific problem.

Good point. Daniel Berrange has suggested another command-line syntax
that makes sandbox configuration more modular. I'll try to implement
something like that.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 18:17   ` Vivek Goyal
@ 2020-07-23 12:29     ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

[-- Attachment #1: Type: text/plain, Size: 2315 bytes --]

On Wed, Jul 22, 2020 at 02:17:10PM -0400, Vivek Goyal wrote:
> On Wed, Jul 22, 2020 at 02:02:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > is required to create namespaces.
> > 
> > Introduce a weaker sandbox that is sufficient in container environments
> > because the container runtime already sets up namespaces. Use chroot to
> > restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> > 
> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> > 
> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/helper.c         |  3 +++
> >  tools/virtiofsd/passthrough_ll.c | 44 ++++++++++++++++++++++++++++++--
> >  2 files changed, 45 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index 3105b6c23a..7421c9ca1a 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -151,6 +151,9 @@ void fuse_cmdline_help(void)
> >             "    -o cache=<mode>            cache mode. could be one of \"auto, "
> >             "always, none\"\n"
> >             "                               default: auto\n"
> > +           "    -o chroot|no_chroot        use container-friendly chroot instead\n"
> > +           "                               of stronger mount namespace sandbox\n"
> > +           "                               default: false\n"
> 
> This option name disabling namespace setup is little confusing to me.
> 
> Will it make sense to provide another option to disable/enable
> namespaces. "-o no-namespaces" and that disables setting up
> namespaces.

Thanks, I'll propose a new syntax.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-22 19:03     ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-07-23 12:32       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 973 bytes --]

On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > +    /*
> > +     * Make the shared directory the file system root so that FUSE_OPEN
> > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > +     *
> > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > +     * requires gaining control of the process first.
> > +     */
> > +    if (chroot(lo->source) != 0) {
> > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > +        exit(1);
> > +    }
> 
> I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> chroot'ing to stop an old escape (where you create another jail inside
> the jail and the kernel then lets you walk outside of the old one).

That's already the case:

1. setup_seccomp() blocks further chroot(2) calls.
2. setup_capabilities() drops CAP_SYS_CHROOT.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 12:32       ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 973 bytes --]

On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > +    /*
> > +     * Make the shared directory the file system root so that FUSE_OPEN
> > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > +     *
> > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > +     * requires gaining control of the process first.
> > +     */
> > +    if (chroot(lo->source) != 0) {
> > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > +        exit(1);
> > +    }
> 
> I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> chroot'ing to stop an old escape (where you create another jail inside
> the jail and the kernel then lets you walk outside of the old one).

That's already the case:

1. setup_seccomp() blocks further chroot(2) calls.
2. setup_capabilities() drops CAP_SYS_CHROOT.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-22 17:03     ` [Virtio-fs] " Daniel P. Berrangé
@ 2020-07-23 12:46       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:46 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: vromanso, Misono Tomohiro, qemu-devel, virtio-fs, rmohr,
	Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 2798 bytes --]

On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> > 
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > 
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
> > 
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..ebeb352514 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> >  
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > +                "running in a container please check that the container "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> 
> This describes the unshare() call as a "probe" and a "test", but that's
> misleading IMHO. A "probe" / "test" implies that after it has completed,
> there's no lingering side-effect, which isn't the case here.
> 
> This is actively changing the process' namespace environment in the
> success case, and not putting it back how it was originally.
> 
> May be this is in fact OK, but if so I think the commit message and
> comment should explain/justify what its fine to have this lingering
> side-effect.
> 
> If we want to avoid the side-effect then we need to fork() and run
> unshare() in the child, and use a check of exit status of the child
> to determine the result.

Thanks for pointing this out. I'll add a comment explaining that
virtiofsd is single-threaded at this point. No other threads share the
file system attributes so the call has no observable side-effects.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
@ 2020-07-23 12:46       ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:46 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: vromanso, qemu-devel, virtio-fs, rmohr

[-- Attachment #1: Type: text/plain, Size: 2798 bytes --]

On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> > 
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > 
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
> > 
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..ebeb352514 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> >  
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > +                "running in a container please check that the container "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> 
> This describes the unshare() call as a "probe" and a "test", but that's
> misleading IMHO. A "probe" / "test" implies that after it has completed,
> there's no lingering side-effect, which isn't the case here.
> 
> This is actively changing the process' namespace environment in the
> success case, and not putting it back how it was originally.
> 
> May be this is in fact OK, but if so I think the commit message and
> comment should explain/justify what its fine to have this lingering
> side-effect.
> 
> If we want to avoid the side-effect then we need to fork() and run
> unshare() in the child, and use a check of exit status of the child
> to determine the result.

Thanks for pointing this out. I'll add a comment explaining that
virtiofsd is single-threaded at this point. No other threads share the
file system attributes so the call has no observable side-effects.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container
  2020-07-22 18:19 ` [Virtio-fs] [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container Vivek Goyal
@ 2020-07-23 12:46   ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 12:46 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, vromanso, qemu-devel, rmohr

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

On Wed, Jul 22, 2020 at 02:19:14PM -0400, Vivek Goyal wrote:
> On Wed, Jul 22, 2020 at 02:02:03PM +0100, Stefan Hajnoczi wrote:
> > Container runtimes handle namespace setup and remove privileges needed by
> > virtiofsd to perform sandboxing. Luckily the container environment already
> > provides most of the sandbox that virtiofsd needs for security.
> > 
> > Introduce a new "virtiofsd -o chroot" option that uses chroot(2) instead of
> > namespaces. This option allows virtiofsd to work inside a container.
> > 
> > Please see the individual patches for details on the changes and security
> > implications.
> > 
> > Given that people are starting to attempt running virtiofsd in containers I
> > think this should go into QEMU 5.1.
> 
> Hi Stefan,
> 
> I have written a document to help with testing virtiofs with any changes.
> 
> https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/virtio-fs-testing-requirement.txt
> 
> Will be good to run some of these tests to make sure there are no
> regressions due to these changes.

Thank you! I will run them and post the results.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-23 12:46       ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-23 12:50         ` Daniel P. Berrangé
  -1 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-23 12:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: vromanso, Misono Tomohiro, qemu-devel, virtio-fs, rmohr,
	Dr. David Alan Gilbert

On Thu, Jul 23, 2020 at 01:46:11PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > > 
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > 
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> > > 
> > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > > 
> > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > > index 3b6d16a041..ebeb352514 100644
> > > --- a/tools/virtiofsd/fuse_virtio.c
> > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> > >  {
> > >      int ret;
> > >  
> > > +    /*
> > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > > +     * an unprivileged system call but some Docker/Moby versions are known to
> > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > +     */
> > > +    ret = unshare(CLONE_FS);
> > > +    if (ret == -1 && errno == EPERM) {
> > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > > +                "running in a container please check that the container "
> > > +                "runtime seccomp policy allows unshare.\n");
> > > +        return -1;
> > > +    }
> > > +
> > 
> > This describes the unshare() call as a "probe" and a "test", but that's
> > misleading IMHO. A "probe" / "test" implies that after it has completed,
> > there's no lingering side-effect, which isn't the case here.
> > 
> > This is actively changing the process' namespace environment in the
> > success case, and not putting it back how it was originally.
> > 
> > May be this is in fact OK, but if so I think the commit message and
> > comment should explain/justify what its fine to have this lingering
> > side-effect.
> > 
> > If we want to avoid the side-effect then we need to fork() and run
> > unshare() in the child, and use a check of exit status of the child
> > to determine the result.
> 
> Thanks for pointing this out. I'll add a comment explaining that
> virtiofsd is single-threaded at this point. No other threads share the
> file system attributes so the call has no observable side-effects.

Also, if we do an unshare() here, do we still need the unshare() later
on in the code ?


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
@ 2020-07-23 12:50         ` Daniel P. Berrangé
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel P. Berrangé @ 2020-07-23 12:50 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: vromanso, qemu-devel, virtio-fs, rmohr

On Thu, Jul 23, 2020 at 01:46:11PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > > 
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > 
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> > > 
> > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > > 
> > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > > index 3b6d16a041..ebeb352514 100644
> > > --- a/tools/virtiofsd/fuse_virtio.c
> > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> > >  {
> > >      int ret;
> > >  
> > > +    /*
> > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > > +     * an unprivileged system call but some Docker/Moby versions are known to
> > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > +     */
> > > +    ret = unshare(CLONE_FS);
> > > +    if (ret == -1 && errno == EPERM) {
> > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > > +                "running in a container please check that the container "
> > > +                "runtime seccomp policy allows unshare.\n");
> > > +        return -1;
> > > +    }
> > > +
> > 
> > This describes the unshare() call as a "probe" and a "test", but that's
> > misleading IMHO. A "probe" / "test" implies that after it has completed,
> > there's no lingering side-effect, which isn't the case here.
> > 
> > This is actively changing the process' namespace environment in the
> > success case, and not putting it back how it was originally.
> > 
> > May be this is in fact OK, but if so I think the commit message and
> > comment should explain/justify what its fine to have this lingering
> > side-effect.
> > 
> > If we want to avoid the side-effect then we need to fork() and run
> > unshare() in the child, and use a check of exit status of the child
> > to determine the result.
> 
> Thanks for pointing this out. I'll add a comment explaining that
> virtiofsd is single-threaded at this point. No other threads share the
> file system attributes so the call has no observable side-effects.

Also, if we do an unshare() here, do we still need the unshare() later
on in the code ?


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-23 12:28       ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-23 13:47         ` Vivek Goyal
  -1 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-07-23 13:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, qemu-devel, vromanso, Dr. David Alan Gilbert, rmohr

On Thu, Jul 23, 2020 at 01:28:50PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > > is required to create namespaces.
> > > 
> > > Introduce a weaker sandbox that is sufficient in container environments
> > > because the container runtime already sets up namespaces. Use chroot to
> > > restrict path traversal to the shared directory.
> > > 
> > > virtiofsd loses the following:
> > > 
> > > 1. Mount namespace. The process chroots to the shared directory but
> > >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> > >    syscalls.
> > 
> > OK, I'm guessing the behaviour of what happens if the host adds another
> > mount afterwards might be different?
> 
> Running inside a container with -o chroot:
>  - The container has its own mount namespace and it is therefore not
>    affected, modulo shared subtrees (see mount(8)).

How does virtiofsd inside container gets the directory it wants to
export to guest? I am assuming that its probably a volume mount
inside container. If yes, volume mounts can set propagation
properties say "slave" and that means mounts done later will
propagate inside container in that volume (and hopefully be
visible inside guest once submount patches are upstream).

Thanks
Vivek



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 13:47         ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-07-23 13:47 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vromanso, rmohr

On Thu, Jul 23, 2020 at 01:28:50PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > > is required to create namespaces.
> > > 
> > > Introduce a weaker sandbox that is sufficient in container environments
> > > because the container runtime already sets up namespaces. Use chroot to
> > > restrict path traversal to the shared directory.
> > > 
> > > virtiofsd loses the following:
> > > 
> > > 1. Mount namespace. The process chroots to the shared directory but
> > >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> > >    syscalls.
> > 
> > OK, I'm guessing the behaviour of what happens if the host adds another
> > mount afterwards might be different?
> 
> Running inside a container with -o chroot:
>  - The container has its own mount namespace and it is therefore not
>    affected, modulo shared subtrees (see mount(8)).

How does virtiofsd inside container gets the directory it wants to
export to guest? I am assuming that its probably a volume mount
inside container. If yes, volume mounts can set propagation
properties say "slave" and that means mounts done later will
propagate inside container in that volume (and hopefully be
visible inside guest once submount patches are upstream).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-23 12:50         ` [Virtio-fs] " Daniel P. Berrangé
@ 2020-07-23 13:56           ` Vivek Goyal
  -1 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-07-23 13:56 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, rmohr, qemu-devel, Stefan Hajnoczi, vromanso

On Thu, Jul 23, 2020 at 01:50:35PM +0100, Daniel P. BerrangÃ© wrote:
> On Thu, Jul 23, 2020 at 01:46:11PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. BerrangÃ© wrote:
> > > On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > be detected right away.
> > > > 
> > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > > 
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > default seccomp.json is missing unshare.
> > > > 
> > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> > > >  1 file changed, 13 insertions(+)
> > > > 
> > > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > > > index 3b6d16a041..ebeb352514 100644
> > > > --- a/tools/virtiofsd/fuse_virtio.c
> > > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> > > >  {
> > > >      int ret;
> > > >  
> > > > +    /*
> > > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > > > +     * an unprivileged system call but some Docker/Moby versions are known to
> > > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > > +     */
> > > > +    ret = unshare(CLONE_FS);
> > > > +    if (ret == -1 && errno == EPERM) {
> > > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > > > +                "running in a container please check that the container "
> > > > +                "runtime seccomp policy allows unshare.\n");
> > > > +        return -1;
> > > > +    }
> > > > +
> > > 
> > > This describes the unshare() call as a "probe" and a "test", but that's
> > > misleading IMHO. A "probe" / "test" implies that after it has completed,
> > > there's no lingering side-effect, which isn't the case here.
> > > 
> > > This is actively changing the process' namespace environment in the
> > > success case, and not putting it back how it was originally.
> > > 
> > > May be this is in fact OK, but if so I think the commit message and
> > > comment should explain/justify what its fine to have this lingering
> > > side-effect.
> > > 
> > > If we want to avoid the side-effect then we need to fork() and run
> > > unshare() in the child, and use a check of exit status of the child
> > > to determine the result.
> > 
> > Thanks for pointing this out. I'll add a comment explaining that
> > virtiofsd is single-threaded at this point. No other threads share the
> > file system attributes so the call has no observable side-effects.
> 
> Also, if we do an unshare() here, do we still need the unshare() later
> on in the code ?

I think so. That unshare() later is to isolate one thread from other.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
@ 2020-07-23 13:56           ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-07-23 13:56 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

On Thu, Jul 23, 2020 at 01:50:35PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jul 23, 2020 at 01:46:11PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > be detected right away.
> > > > 
> > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > > 
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > default seccomp.json is missing unshare.
> > > > 
> > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> > > >  1 file changed, 13 insertions(+)
> > > > 
> > > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > > > index 3b6d16a041..ebeb352514 100644
> > > > --- a/tools/virtiofsd/fuse_virtio.c
> > > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> > > >  {
> > > >      int ret;
> > > >  
> > > > +    /*
> > > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > > > +     * an unprivileged system call but some Docker/Moby versions are known to
> > > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > > +     */
> > > > +    ret = unshare(CLONE_FS);
> > > > +    if (ret == -1 && errno == EPERM) {
> > > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > > > +                "running in a container please check that the container "
> > > > +                "runtime seccomp policy allows unshare.\n");
> > > > +        return -1;
> > > > +    }
> > > > +
> > > 
> > > This describes the unshare() call as a "probe" and a "test", but that's
> > > misleading IMHO. A "probe" / "test" implies that after it has completed,
> > > there's no lingering side-effect, which isn't the case here.
> > > 
> > > This is actively changing the process' namespace environment in the
> > > success case, and not putting it back how it was originally.
> > > 
> > > May be this is in fact OK, but if so I think the commit message and
> > > comment should explain/justify what its fine to have this lingering
> > > side-effect.
> > > 
> > > If we want to avoid the side-effect then we need to fork() and run
> > > unshare() in the child, and use a check of exit status of the child
> > > to determine the result.
> > 
> > Thanks for pointing this out. I'll add a comment explaining that
> > virtiofsd is single-threaded at this point. No other threads share the
> > file system attributes so the call has no observable side-effects.
> 
> Also, if we do an unshare() here, do we still need the unshare() later
> on in the code ?

I think so. That unshare() later is to isolate one thread from other.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-23 13:56           ` Vivek Goyal
  (?)
@ 2020-07-23 15:19           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 15:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, rmohr, Daniel P. Berrangé, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 3786 bytes --]

On Thu, Jul 23, 2020 at 09:56:03AM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2020 at 01:50:35PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Jul 23, 2020 at 01:46:11PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Jul 22, 2020 at 06:03:28PM +0100, Daniel P. Berrangé wrote:
> > > > On Wed, Jul 22, 2020 at 02:02:06PM +0100, Stefan Hajnoczi wrote:
> > > > > An assertion failure is raised during request processing if
> > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > > be detected right away.
> > > > > 
> > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > > include unshare (e.g. podman is unaffected):
> > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > > > 
> > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > > default seccomp.json is missing unshare.
> > > > > 
> > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > ---
> > > > >  tools/virtiofsd/fuse_virtio.c | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > > 
> > > > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > > > > index 3b6d16a041..ebeb352514 100644
> > > > > --- a/tools/virtiofsd/fuse_virtio.c
> > > > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > > > @@ -949,6 +949,19 @@ int virtio_session_mount(struct fuse_session *se)
> > > > >  {
> > > > >      int ret;
> > > > >  
> > > > > +    /*
> > > > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > > > > +     * an unprivileged system call but some Docker/Moby versions are known to
> > > > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > > > +     */
> > > > > +    ret = unshare(CLONE_FS);
> > > > > +    if (ret == -1 && errno == EPERM) {
> > > > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > > > > +                "running in a container please check that the container "
> > > > > +                "runtime seccomp policy allows unshare.\n");
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > 
> > > > This describes the unshare() call as a "probe" and a "test", but that's
> > > > misleading IMHO. A "probe" / "test" implies that after it has completed,
> > > > there's no lingering side-effect, which isn't the case here.
> > > > 
> > > > This is actively changing the process' namespace environment in the
> > > > success case, and not putting it back how it was originally.
> > > > 
> > > > May be this is in fact OK, but if so I think the commit message and
> > > > comment should explain/justify what its fine to have this lingering
> > > > side-effect.
> > > > 
> > > > If we want to avoid the side-effect then we need to fork() and run
> > > > unshare() in the child, and use a check of exit status of the child
> > > > to determine the result.
> > > 
> > > Thanks for pointing this out. I'll add a comment explaining that
> > > virtiofsd is single-threaded at this point. No other threads share the
> > > file system attributes so the call has no observable side-effects.
> > 
> > Also, if we do an unshare() here, do we still need the unshare() later
> > on in the code ?
> 
> I think so. That unshare() later is to isolate one thread from other.

Yes. Each thread that calls unshare(CLONE_FS) gets its own current
working directory and other file system attributes.

The purpose of the call is to allow worker threads to change current
working directory without affecting each other.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-23 13:47         ` Vivek Goyal
@ 2020-07-23 15:36           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 15:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, qemu-devel, vromanso, Dr. David Alan Gilbert, rmohr

[-- Attachment #1: Type: text/plain, Size: 2453 bytes --]

On Thu, Jul 23, 2020 at 09:47:33AM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2020 at 01:28:50PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > > > is required to create namespaces.
> > > > 
> > > > Introduce a weaker sandbox that is sufficient in container environments
> > > > because the container runtime already sets up namespaces. Use chroot to
> > > > restrict path traversal to the shared directory.
> > > > 
> > > > virtiofsd loses the following:
> > > > 
> > > > 1. Mount namespace. The process chroots to the shared directory but
> > > >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> > > >    syscalls.
> > > 
> > > OK, I'm guessing the behaviour of what happens if the host adds another
> > > mount afterwards might be different?
> > 
> > Running inside a container with -o chroot:
> >  - The container has its own mount namespace and it is therefore not
> >    affected, modulo shared subtrees (see mount(8)).
> 
> How does virtiofsd inside container gets the directory it wants to
> export to guest? I am assuming that its probably a volume mount
> inside container. If yes, volume mounts can set propagation
> properties say "slave" and that means mounts done later will
> propagate inside container in that volume (and hopefully be
> visible inside guest once submount patches are upstream).

Yes, the shared directory is a volume mount and propagation works as
expected:

  $ docker run -v path/to/shared-dir:/shared-dir:slave \
	       -v /tmp/socket-dir:/var/run \
	       virtiofsd

When I mount things in path/to/shared-dir on the host they appear in
/shared-dir inside the guest. When I omit the "slave" option (the
default is "private") then the mount does not propagate.

The following Dockerfile was used:

  $ cp ~/src/qemu/virtiofsd . # copy the executable
  $ cat >Dockerfile
  FROM fedora:32
  COPY --chown=0:0 virtiofsd /usr/bin/virtiofsd
  RUN dnf update -qy && dnf install -qy libseccomp liburing zlib glib2 nettle gnutls libcap-ng
  VOLUME ["/shared-dir", "/var/run"]
  CMD ["/usr/bin/virtiofsd", "--socket-path=/var/run/virtiofsd.sock", "-o", "source=/shared-dir", "-o", "cache=auto", "-o", "no_posix_lock", "-o", "xattr", "-o", "chroot"]

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 15:36           ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23 15:36 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, vromanso, rmohr

[-- Attachment #1: Type: text/plain, Size: 2453 bytes --]

On Thu, Jul 23, 2020 at 09:47:33AM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2020 at 01:28:50PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Jul 22, 2020 at 06:58:20PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > virtiofsd cannot run in an unprivileged container because CAP_SYS_ADMIN
> > > > is required to create namespaces.
> > > > 
> > > > Introduce a weaker sandbox that is sufficient in container environments
> > > > because the container runtime already sets up namespaces. Use chroot to
> > > > restrict path traversal to the shared directory.
> > > > 
> > > > virtiofsd loses the following:
> > > > 
> > > > 1. Mount namespace. The process chroots to the shared directory but
> > > >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> > > >    syscalls.
> > > 
> > > OK, I'm guessing the behaviour of what happens if the host adds another
> > > mount afterwards might be different?
> > 
> > Running inside a container with -o chroot:
> >  - The container has its own mount namespace and it is therefore not
> >    affected, modulo shared subtrees (see mount(8)).
> 
> How does virtiofsd inside container gets the directory it wants to
> export to guest? I am assuming that its probably a volume mount
> inside container. If yes, volume mounts can set propagation
> properties say "slave" and that means mounts done later will
> propagate inside container in that volume (and hopefully be
> visible inside guest once submount patches are upstream).

Yes, the shared directory is a volume mount and propagation works as
expected:

  $ docker run -v path/to/shared-dir:/shared-dir:slave \
	       -v /tmp/socket-dir:/var/run \
	       virtiofsd

When I mount things in path/to/shared-dir on the host they appear in
/shared-dir inside the guest. When I omit the "slave" option (the
default is "private") then the mount does not propagate.

The following Dockerfile was used:

  $ cp ~/src/qemu/virtiofsd . # copy the executable
  $ cat >Dockerfile
  FROM fedora:32
  COPY --chown=0:0 virtiofsd /usr/bin/virtiofsd
  RUN dnf update -qy && dnf install -qy libseccomp liburing zlib glib2 nettle gnutls libcap-ng
  VOLUME ["/shared-dir", "/var/run"]
  CMD ["/usr/bin/virtiofsd", "--socket-path=/var/run/virtiofsd.sock", "-o", "source=/shared-dir", "-o", "cache=auto", "-o", "no_posix_lock", "-o", "xattr", "-o", "chroot"]

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-23 12:32       ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-07-23 17:55         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-23 17:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > +    /*
> > > +     * Make the shared directory the file system root so that FUSE_OPEN
> > > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > > +     *
> > > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > > +     * requires gaining control of the process first.
> > > +     */
> > > +    if (chroot(lo->source) != 0) {
> > > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > > +        exit(1);
> > > +    }
> > 
> > I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> > chroot'ing to stop an old escape (where you create another jail inside
> > the jail and the kernel then lets you walk outside of the old one).
> 
> That's already the case:
> 
> 1. setup_seccomp() blocks further chroot(2) calls.
> 2. setup_capabilities() drops CAP_SYS_CHROOT.

Ah yes; could you add a comment if you respin; it's not obvious that
the capability to chroot allows you to break out of an existing chroot
you're in.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-23 17:55         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 43+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-23 17:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > +    /*
> > > +     * Make the shared directory the file system root so that FUSE_OPEN
> > > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > > +     *
> > > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > > +     * requires gaining control of the process first.
> > > +     */
> > > +    if (chroot(lo->source) != 0) {
> > > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > > +        exit(1);
> > > +    }
> > 
> > I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> > chroot'ing to stop an old escape (where you create another jail inside
> > the jail and the kernel then lets you walk outside of the old one).
> 
> That's already the case:
> 
> 1. setup_seccomp() blocks further chroot(2) calls.
> 2. setup_capabilities() drops CAP_SYS_CHROOT.

Ah yes; could you add a comment if you respin; it's not obvious that
the capability to chroot allows you to break out of an existing chroot
you're in.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
  2020-07-23 17:55         ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-07-24 12:22           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-24 12:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]

On Thu, Jul 23, 2020 at 06:55:38PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > +    /*
> > > > +     * Make the shared directory the file system root so that FUSE_OPEN
> > > > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > > > +     *
> > > > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > > > +     * requires gaining control of the process first.
> > > > +     */
> > > > +    if (chroot(lo->source) != 0) {
> > > > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > > > +        exit(1);
> > > > +    }
> > > 
> > > I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> > > chroot'ing to stop an old escape (where you create another jail inside
> > > the jail and the kernel then lets you walk outside of the old one).
> > 
> > That's already the case:
> > 
> > 1. setup_seccomp() blocks further chroot(2) calls.
> > 2. setup_capabilities() drops CAP_SYS_CHROOT.
> 
> Ah yes; could you add a comment if you respin; it's not obvious that
> the capability to chroot allows you to break out of an existing chroot
> you're in.

Sure.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Virtio-fs] [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option
@ 2020-07-24 12:22           ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2020-07-24 12:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, rmohr, qemu-devel, vromanso

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]

On Thu, Jul 23, 2020 at 06:55:38PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Jul 22, 2020 at 08:03:18PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > +    /*
> > > > +     * Make the shared directory the file system root so that FUSE_OPEN
> > > > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > > > +     *
> > > > +     * It's still possible to escape the chroot via lo->proc_self_fd but that
> > > > +     * requires gaining control of the process first.
> > > > +     */
> > > > +    if (chroot(lo->source) != 0) {
> > > > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > > > +        exit(1);
> > > > +    }
> > > 
> > > I'm seeing suggestions that you should drop CAP_SYS_CHROOT after
> > > chroot'ing to stop an old escape (where you create another jail inside
> > > the jail and the kernel then lets you walk outside of the old one).
> > 
> > That's already the case:
> > 
> > 1. setup_seccomp() blocks further chroot(2) calls.
> > 2. setup_capabilities() drops CAP_SYS_CHROOT.
> 
> Ah yes; could you add a comment if you respin; it's not obvious that
> the capability to chroot allows you to break out of an existing chroot
> you're in.

Sure.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2020-07-24 12:23 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-22 13:02 [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
2020-07-22 13:02 ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 13:02 ` [PATCH for-5.1 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi
2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 16:51   ` Dr. David Alan Gilbert
2020-07-22 16:51     ` [Virtio-fs] " Dr. David Alan Gilbert
2020-07-22 13:02 ` [PATCH for-5.1 2/3] virtiofsd: add container-friendly -o chroot sandboxing option Stefan Hajnoczi
2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 16:58   ` Daniel P. Berrangé
2020-07-22 16:58     ` [Virtio-fs] " Daniel P. Berrangé
2020-07-23 12:17     ` Stefan Hajnoczi
2020-07-23 12:17       ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 17:58   ` Dr. David Alan Gilbert
2020-07-22 17:58     ` [Virtio-fs] " Dr. David Alan Gilbert
2020-07-23 12:28     ` Stefan Hajnoczi
2020-07-23 12:28       ` [Virtio-fs] " Stefan Hajnoczi
2020-07-23 13:47       ` Vivek Goyal
2020-07-23 13:47         ` Vivek Goyal
2020-07-23 15:36         ` Stefan Hajnoczi
2020-07-23 15:36           ` Stefan Hajnoczi
2020-07-22 18:17   ` Vivek Goyal
2020-07-23 12:29     ` Stefan Hajnoczi
2020-07-22 19:03   ` Dr. David Alan Gilbert
2020-07-22 19:03     ` [Virtio-fs] " Dr. David Alan Gilbert
2020-07-23 12:32     ` Stefan Hajnoczi
2020-07-23 12:32       ` [Virtio-fs] " Stefan Hajnoczi
2020-07-23 17:55       ` Dr. David Alan Gilbert
2020-07-23 17:55         ` [Virtio-fs] " Dr. David Alan Gilbert
2020-07-24 12:22         ` Stefan Hajnoczi
2020-07-24 12:22           ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 13:02 ` [PATCH for-5.1 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi
2020-07-22 13:02   ` [Virtio-fs] " Stefan Hajnoczi
2020-07-22 17:03   ` Daniel P. Berrangé
2020-07-22 17:03     ` [Virtio-fs] " Daniel P. Berrangé
2020-07-23 12:46     ` Stefan Hajnoczi
2020-07-23 12:46       ` [Virtio-fs] " Stefan Hajnoczi
2020-07-23 12:50       ` Daniel P. Berrangé
2020-07-23 12:50         ` [Virtio-fs] " Daniel P. Berrangé
2020-07-23 13:56         ` Vivek Goyal
2020-07-23 13:56           ` Vivek Goyal
2020-07-23 15:19           ` Stefan Hajnoczi
2020-07-22 18:19 ` [Virtio-fs] [PATCH for-5.1 0/3] virtiofsd: allow virtiofsd to run in a container Vivek Goyal
2020-07-23 12:46   ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.