qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container
@ 2020-07-27 19:02 Stefan Hajnoczi
  2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs,
	Stefan Hajnoczi, mpatel, vgoyal

v2:
 * Update virtiofsd.rst documentation on sandboxing modes
 * Change syntax to -o sandbox=namespace|chroot
 * Add comment explaining that unshare(CLONE_FS) has no visible side-effect
   while single-threaded
 * xfstests and pjdfstest pass. Did not run tests on overlayfs because required
   xattrs do not work without CAP_SYS_ADMIN.

Mrunal and Dan: This patch series adds a sandboxing mode where virtiofsd relies
on the container runtime for isolation. It only does
chroot("path/to/shared-dir"), seccomp, and drops Linux capabilities. Previously
it created a new mount, pid, and net namespace but cannot do this without
CAP_SYS_ADMIN when run inside a container. pivot_root("path/to/shared-dir") has
been replaced with chroot("path/to/shared-dir"), again because CAP_SYS_ADMIN is
unavailable. The point of the chroot() is to prevent escapes from the shared
directory during path traversal. Does this ring any alarm bells or does it
sound sane?

Container runtimes handle namespace setup and remove privileges needed by
virtiofsd to perform sandboxing. Luckily the container environment already
provides most of the sandbox that virtiofsd needs for security.

Introduce a new "virtiofsd -o sandbox=chroot" option that uses chroot(2)
instead of namespaces. This option allows virtiofsd to work inside a container.

Please see the individual patches for details on the changes and security
implications.

Given that people are starting to attempt running virtiofsd in containers I
think this should go into QEMU 5.1.

Stefan Hajnoczi (3):
  virtiofsd: drop CAP_DAC_READ_SEARCH
  virtiofsd: add container-friendly -o sandbox=chroot option
  virtiofsd: probe unshare(CLONE_FS) and print an error

 tools/virtiofsd/fuse_virtio.c    | 16 +++++++++
 tools/virtiofsd/helper.c         |  8 +++++
 tools/virtiofsd/passthrough_ll.c | 58 ++++++++++++++++++++++++++++++--
 docs/tools/virtiofsd.rst         | 32 ++++++++++++++----
 4 files changed, 104 insertions(+), 10 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH
  2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
@ 2020-07-27 19:02 ` Stefan Hajnoczi
  2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs,
	Stefan Hajnoczi, mpatel, vgoyal

virtiofsd does not need CAP_DAC_READ_SEARCH because it already has
the more powerful CAP_DAC_OVERRIDE. Drop it from the list of
capabilities.

This is important because container runtimes may not include
CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce
its capabilities when running inside a Docker container.

Note that CAP_DAC_READ_SEARCH may be necessary again in the future if
virtiofsd starts using open_by_handle_at(2).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 94e0de2d2b..50a164a599 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in)
     if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
             CAP_CHOWN,
             CAP_DAC_OVERRIDE,
-            CAP_DAC_READ_SEARCH,
             CAP_FOWNER,
             CAP_FSETID,
             CAP_SETGID,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option
  2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
  2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi
@ 2020-07-27 19:02 ` Stefan Hajnoczi
  2020-08-07 15:36   ` Dr. David Alan Gilbert
  2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi
  2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert
  3 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs,
	Stefan Hajnoczi, mpatel, vgoyal

virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
create namespaces.

Introduce a weaker sandbox mode that is sufficient in container
environments because the container runtime already sets up namespaces.
Use chroot to restrict path traversal to the shared directory.

virtiofsd loses the following:

1. Mount namespace. The process chroots to the shared directory but
   leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
   syscalls.

2. Pid namespace. This should be fine because virtiofsd is the only
   process running in the container.

3. Network namespace. This should be fine because seccomp already
   rejects the connect(2) syscall, but an additional layer of security
   is lost. Container runtime-specific network security policies can be
   used drop network traffic (except for the vhost-user UNIX domain
   socket).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/helper.c         |  8 +++++
 tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++--
 docs/tools/virtiofsd.rst         | 32 ++++++++++++++----
 3 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 3105b6c23a..91dcb23664 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -168,6 +168,14 @@ void fuse_cmdline_help(void)
            "                               enable/disable readirplus\n"
            "                               default: readdirplus except with "
            "cache=none\n"
+           "    -o sandbox=namespace|chroot\n"
+           "                               sandboxing mode:\n"
+           "                               - namespace: mount, pid, and net\n"
+           "                                 namespaces with pivot_root(2)\n"
+           "                                 into shared directory\n"
+           "                               - chroot: chroot(2) into shared\n"
+           "                                 directory (use in containers)\n"
+           "                               default: namespace\n"
            "    -o timeout=<number>        I/O timeout (seconds)\n"
            "                               default: depends on cache= option.\n"
            "    -o writeback|no_writeback  enable/disable writeback cache\n"
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 50a164a599..a7894c3e7c 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -137,8 +137,14 @@ enum {
     CACHE_ALWAYS,
 };
 
+enum {
+    SANDBOX_NAMESPACE,
+    SANDBOX_CHROOT,
+};
+
 struct lo_data {
     pthread_mutex_t mutex;
+    int sandbox;
     int debug;
     int writeback;
     int flock;
@@ -162,6 +168,12 @@ struct lo_data {
 };
 
 static const struct fuse_opt lo_opts[] = {
+    { "sandbox=namespace",
+      offsetof(struct lo_data, sandbox),
+      SANDBOX_NAMESPACE },
+    { "sandbox=chroot",
+      offsetof(struct lo_data, sandbox),
+      SANDBOX_CHROOT },
     { "writeback", offsetof(struct lo_data, writeback), 1 },
     { "no_writeback", offsetof(struct lo_data, writeback), 0 },
     { "source=%s", offsetof(struct lo_data, source), 0 },
@@ -2665,6 +2677,41 @@ static void setup_capabilities(char *modcaps_in)
     pthread_mutex_unlock(&cap.mutex);
 }
 
+/*
+ * Use chroot as a weaker sandbox for environments where the process is
+ * launched without CAP_SYS_ADMIN.
+ */
+static void setup_chroot(struct lo_data *lo)
+{
+    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+    if (lo->proc_self_fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
+        exit(1);
+    }
+
+    /*
+     * Make the shared directory the file system root so that FUSE_OPEN
+     * (lo_open()) cannot escape the shared directory by opening a symlink.
+     *
+     * The chroot(2) syscall is later disabled by seccomp and the
+     * CAP_SYS_CHROOT capability is dropped so that tampering with the chroot
+     * is not possible.
+     *
+     * However, it's still possible to escape the chroot via lo->proc_self_fd
+     * but that requires first gaining control of the process.
+     */
+    if (chroot(lo->source) != 0) {
+        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
+        exit(1);
+    }
+
+    /* Move into the chroot */
+    if (chdir("/") != 0) {
+        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
+        exit(1);
+    }
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -2672,8 +2719,13 @@ static void setup_capabilities(char *modcaps_in)
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
                           bool enable_syslog)
 {
-    setup_namespaces(lo, se);
-    setup_mounts(lo->source);
+    if (lo->sandbox == SANDBOX_NAMESPACE) {
+        setup_namespaces(lo, se);
+        setup_mounts(lo->source);
+    } else {
+        setup_chroot(lo);
+    }
+
     setup_seccomp(enable_syslog);
     setup_capabilities(g_strdup(lo->modcaps));
 }
@@ -2820,6 +2872,7 @@ int main(int argc, char *argv[])
     struct fuse_session *se;
     struct fuse_cmdline_opts opts;
     struct lo_data lo = {
+        .sandbox = SANDBOX_NAMESPACE,
         .debug = 0,
         .writeback = 0,
         .posix_lock = 1,
diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index 824e713491..40629f95ae 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -17,13 +17,24 @@ This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
 but should work with any virtual machine monitor (VMM) that supports
 vhost-user.  See the Examples section below.
 
-This program must be run as the root user.  Upon startup the program will
-switch into a new file system namespace with the shared directory tree as its
-root.  This prevents "file system escapes" due to symlinks and other file
-system objects that might lead to files outside the shared directory.  The
-program also sandboxes itself using seccomp(2) to prevent ptrace(2) and other
-vectors that could allow an attacker to compromise the system after gaining
-control of the virtiofsd process.
+This program must be run as the root user.  The program drops privileges where
+possible during startup although it must be able to create and access files
+with any uid/gid:
+
+* The ability to invoke syscalls is limited using seccomp(2).
+* Linux capabilities(7) are dropped.
+
+In "namespace" sandbox mode the program switches into a new file system
+namespace and invokes pivot_root(2) to make the shared directory tree its root.
+A new pid and net namespace is also created to isolate the process.
+
+In "chroot" sandbox mode the program invokes chroot(2) to make the shared
+directory tree its root. This mode is intended for container environments where
+the container runtime has already set up the namespaces and the program does
+not have permission to create namespaces itself.
+
+Both sandbox modes prevent "file system escapes" due to symlinks and other file
+system objects that might lead to files outside the shared directory.
 
 Options
 -------
@@ -72,6 +83,13 @@ Options
   * readdirplus|no_readdirplus -
     Enable/disable readdirplus.  The default is ``readdirplus``.
 
+  * sandbox=namespace|chroot -
+    Sandbox mode:
+    - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
+    the shared directory.
+    - chroot: chroot(2) into shared directory (use in containers).
+    The default is "namespace".
+
   * source=PATH -
     Share host directory tree located at PATH.  This option is required.
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
  2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi
  2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi
@ 2020-07-27 19:02 ` Stefan Hajnoczi
  2020-07-28  1:05   ` misono.tomohiro
  2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert
  3 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs,
	Misono Tomohiro, Stefan Hajnoczi, mpatel, vgoyal

An assertion failure is raised during request processing if
unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
be detected right away.

Unfortunately Docker/Moby does not include unshare in the seccomp.json
list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
include unshare (e.g. podman is unaffected):
https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json

Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
default seccomp.json is missing unshare.

Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 3b6d16a041..9e5537506c 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
 {
     int ret;
 
+    /*
+     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
+     * an unprivileged system call but some Docker/Moby versions are known to
+     * reject it via seccomp when CAP_SYS_ADMIN is not given.
+     *
+     * Note that the program is single-threaded here so this syscall has no
+     * visible effect and is safe to make.
+     */
+    ret = unshare(CLONE_FS);
+    if (ret == -1 && errno == EPERM) {
+        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
+                "running in a container please check that the container "
+                "runtime seccomp policy allows unshare.\n");
+        return -1;
+    }
+
     ret = fv_create_listen_socket(se);
     if (ret < 0) {
         return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi
@ 2020-07-28  1:05   ` misono.tomohiro
  2020-07-28 10:00     ` Roman Mohr
  2020-08-07 15:29     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 23+ messages in thread
From: misono.tomohiro @ 2020-07-28  1:05 UTC (permalink / raw)
  To: 'Stefan Hajnoczi', qemu-devel
  Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs,
	mpatel, vgoyal

> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
> 
> An assertion failure is raised during request processing if
> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> be detected right away.
> 
> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> include unshare (e.g. podman is unaffected):
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> 
> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> default seccomp.json is missing unshare.

Hi, sorry for a bit late.

unshare() was added to fix xattr problem: 
  https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know
if xattr is enabled or disabled in fv_queue_worker(), right?

So, it looks good to me.
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

Regards,
Misono

> 
> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 3b6d16a041..9e5537506c 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
>  {
>      int ret;
> 
> +    /*
> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> +     * an unprivileged system call but some Docker/Moby versions are known to
> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> +     *
> +     * Note that the program is single-threaded here so this syscall has no
> +     * visible effect and is safe to make.
> +     */
> +    ret = unshare(CLONE_FS);
> +    if (ret == -1 && errno == EPERM) {
> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> +                "running in a container please check that the container "
> +                "runtime seccomp policy allows unshare.\n");
> +        return -1;
> +    }
> +
>      ret = fv_create_listen_socket(se);
>      if (ret < 0) {
>          return ret;
> --
> 2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28  1:05   ` misono.tomohiro
@ 2020-07-28 10:00     ` Roman Mohr
  2020-07-28 13:12       ` Vivek Goyal
  2020-07-28 15:32       ` Stefan Hajnoczi
  2020-08-07 15:29     ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 23+ messages in thread
From: Roman Mohr @ 2020-07-28 10:00 UTC (permalink / raw)
  To: misono.tomohiro
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, mpatel, vgoyal

[-- Attachment #1: Type: text/plain, Size: 3114 bytes --]

On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
misono.tomohiro@fujitsu.com> wrote:

> > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> error
> >
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> >
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> >
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> >
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
>
> Hi, sorry for a bit late.
>
> unshare() was added to fix xattr problem:
>
> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> In theory we don't need to call unshare if xattr is disabled, but it is
> hard to get to know
> if xattr is enabled or disabled in fv_queue_worker(), right?
>
>
In kubevirt we want to run virtiofsd in containers. We would already not
have xattr support for e.g. overlayfs in the VM after this patch series (an
acceptable con at least for us right now).
If we can get rid of the unshare (and potentially of needing root) that
would be great. We always assume that everything which we run in containers
should work for cri-o and docker.

"Just" pointing docker to a different seccomp.json file is something which
k8s users/admin in many cases can't do.

Best Regards,
Roman


> So, it looks good to me.
> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>
> Regards,
> Misono
>
> >
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/tools/virtiofsd/fuse_virtio.c
> b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..9e5537506c 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> >
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are
> known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     *
> > +     * Note that the program is single-threaded here so this syscall
> has no
> > +     * visible effect and is safe to make.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> "
> > +                "running in a container please check that the container
> "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> >      ret = fv_create_listen_socket(se);
> >      if (ret < 0) {
> >          return ret;
> > --
> > 2.26.2
>
>

[-- Attachment #2: Type: text/html, Size: 4612 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 10:00     ` Roman Mohr
@ 2020-07-28 13:12       ` Vivek Goyal
  2020-07-28 15:52         ` Daniel P. Berrangé
                           ` (2 more replies)
  2020-07-28 15:32       ` Stefan Hajnoczi
  1 sibling, 3 replies; 23+ messages in thread
From: Vivek Goyal @ 2020-07-28 13:12 UTC (permalink / raw)
  To: Roman Mohr
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, misono.tomohiro, mpatel

On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> misono.tomohiro@fujitsu.com> wrote:
> 
> > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > error
> > >
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > >
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > >
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > >
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> >
> > Hi, sorry for a bit late.
> >
> > unshare() was added to fix xattr problem:
> >
> > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > In theory we don't need to call unshare if xattr is disabled, but it is
> > hard to get to know
> > if xattr is enabled or disabled in fv_queue_worker(), right?
> >
> >
> In kubevirt we want to run virtiofsd in containers. We would already not
> have xattr support for e.g. overlayfs in the VM after this patch series (an
> acceptable con at least for us right now).
> If we can get rid of the unshare (and potentially of needing root) that
> would be great. We always assume that everything which we run in containers
> should work for cri-o and docker.

But cri-o and docker containers run as root, isn't it? (or atleast have
the capability to run as root). Havind said that, it will be nice to be able
to run virtiofsd without root. 

There are few hurdles though.

- For file creation, we switch uid/gid (seteuid/setegid) and that seems
  to require root. If we were to run unpriviliged, probably all files
  on host will have to be owned by unpriviliged user and guest visible
  uid/gid will have to be stored in xattrs. I think virtfs supports
  something similar.

I am sure there are other restrictions but this probably is the biggest
one to overcome.

 > 
> "Just" pointing docker to a different seccomp.json file is something which
> k8s users/admin in many cases can't do.

Or may be issue is that standard seccomp.json does not allow unshare()
and hence you are forced to use a non-standar seccomp.json.

Vivek

> 
> Best Regards,
> Roman
> 
> 
> > So, it looks good to me.
> > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >
> > Regards,
> > Misono
> >
> > >
> > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> > >  1 file changed, 16 insertions(+)
> > >
> > > diff --git a/tools/virtiofsd/fuse_virtio.c
> > b/tools/virtiofsd/fuse_virtio.c
> > > index 3b6d16a041..9e5537506c 100644
> > > --- a/tools/virtiofsd/fuse_virtio.c
> > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> > >  {
> > >      int ret;
> > >
> > > +    /*
> > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> > it. It's
> > > +     * an unprivileged system call but some Docker/Moby versions are
> > known to
> > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > +     *
> > > +     * Note that the program is single-threaded here so this syscall
> > has no
> > > +     * visible effect and is safe to make.
> > > +     */
> > > +    ret = unshare(CLONE_FS);
> > > +    if (ret == -1 && errno == EPERM) {
> > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> > "
> > > +                "running in a container please check that the container
> > "
> > > +                "runtime seccomp policy allows unshare.\n");
> > > +        return -1;
> > > +    }
> > > +
> > >      ret = fv_create_listen_socket(se);
> > >      if (ret < 0) {
> > >          return ret;
> > > --
> > > 2.26.2
> >
> >



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 10:00     ` Roman Mohr
  2020-07-28 13:12       ` Vivek Goyal
@ 2020-07-28 15:32       ` Stefan Hajnoczi
  2020-07-28 19:15         ` Daniel Walsh
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-28 15:32 UTC (permalink / raw)
  To: Roman Mohr
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, misono.tomohiro, mpatel, vgoyal

[-- Attachment #1: Type: text/plain, Size: 2102 bytes --]

On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> misono.tomohiro@fujitsu.com> wrote:
> 
> > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > error
> > >
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > >
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > >
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > >
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> >
> > Hi, sorry for a bit late.
> >
> > unshare() was added to fix xattr problem:
> >
> > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > In theory we don't need to call unshare if xattr is disabled, but it is
> > hard to get to know
> > if xattr is enabled or disabled in fv_queue_worker(), right?
> >
> >
> In kubevirt we want to run virtiofsd in containers. We would already not
> have xattr support for e.g. overlayfs in the VM after this patch series (an
> acceptable con at least for us right now).
> If we can get rid of the unshare (and potentially of needing root) that
> would be great. We always assume that everything which we run in containers
> should work for cri-o and docker.

Root is required to access files with any uid/gid.

Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may
be able to find a way to drop unshare (at least in containers).

> "Just" pointing docker to a different seccomp.json file is something which
> k8s users/admin in many cases can't do.

There is a Moby PR to change the default seccomp.json file here but it's
unclear if it will be merged:
https://github.com/moby/moby/pull/41244

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 13:12       ` Vivek Goyal
@ 2020-07-28 15:52         ` Daniel P. Berrangé
  2020-07-28 20:54           ` Vivek Goyal
  2020-07-28 19:12         ` Daniel Walsh
  2020-07-29  7:59         ` Roman Mohr
  2 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2020-07-28 15:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: vromanso, mpatel, Daniel Walsh, qemu-devel,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	misono.tomohiro, Roman Mohr

On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > misono.tomohiro@fujitsu.com> wrote:
> > 
> > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > > error
> > > >
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > be detected right away.
> > > >
> > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > >
> > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > >
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > default seccomp.json is missing unshare.
> > >
> > > Hi, sorry for a bit late.
> > >
> > > unshare() was added to fix xattr problem:
> > >
> > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > hard to get to know
> > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > >
> > >
> > In kubevirt we want to run virtiofsd in containers. We would already not
> > have xattr support for e.g. overlayfs in the VM after this patch series (an
> > acceptable con at least for us right now).
> > If we can get rid of the unshare (and potentially of needing root) that
> > would be great. We always assume that everything which we run in containers
> > should work for cri-o and docker.
> 
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be able
> to run virtiofsd without root. 
> 
> There are few hurdles though.
> 
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.

I think I've mentioned before, 9p virtfs supports different modes,
passthrough, squashed or remapped.

passthrough should be reasonably straightforward to support in virtiofs.
The guest sees all the host UID/GIDs ownership as normal, and can read
any files the host user can read, but are obviously restricted to write
to only the files that host user can write too. No DAC-OVERRIDE facility
in essence. You'll just get EPERM, which is fine. This simple passthrough
scenario would be just what's desired for a typical desktop virt use
cases, where you want to share part/all of your home dir with a guest for
easy file access. Personally this is the mode I'd be most interested in
seeing provided for unprivileged virtiofsd usage.

squash is similar to passthrough, except the guest sees everything
as owned by the same user. This can be surprising as the guest might
see a file owned by them, but not be able to write to it, as on the
host its actually owned by some other user. Fairly niche use case
I think.

remapping would be needed for a more general purpose use cases
allowing the guest to do arbitrary UID/GID changes, but on the host
everything is still stored as one user and remapped somehow.

The main challenge for all the unprivileged scenarios is safety of
the sandbox, to avoid risk of guests escaping to access files outside
of the exported dir via symlink attacks or similar.



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 13:12       ` Vivek Goyal
  2020-07-28 15:52         ` Daniel P. Berrangé
@ 2020-07-28 19:12         ` Daniel Walsh
  2020-07-28 21:01           ` Vivek Goyal
  2020-07-29  7:59         ` Roman Mohr
  2 siblings, 1 reply; 23+ messages in thread
From: Daniel Walsh @ 2020-07-28 19:12 UTC (permalink / raw)
  To: Vivek Goyal, Roman Mohr
  Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, misono.tomohiro, mpatel

On 7/28/20 09:12, Vivek Goyal wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>> misono.tomohiro@fujitsu.com> wrote:
>>
>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
>>> error
>>>> An assertion failure is raised during request processing if
>>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
>>>> be detected right away.
>>>>
>>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
>>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
>>>> include unshare (e.g. podman is unaffected):
>>>>
>>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
>>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
>>>> default seccomp.json is missing unshare.
>>> Hi, sorry for a bit late.
>>>
>>> unshare() was added to fix xattr problem:
>>>
>>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
>>> In theory we don't need to call unshare if xattr is disabled, but it is
>>> hard to get to know
>>> if xattr is enabled or disabled in fv_queue_worker(), right?
>>>
>>>
>> In kubevirt we want to run virtiofsd in containers. We would already not
>> have xattr support for e.g. overlayfs in the VM after this patch series (an
>> acceptable con at least for us right now).
>> If we can get rid of the unshare (and potentially of needing root) that
>> would be great. We always assume that everything which we run in containers
>> should work for cri-o and docker.
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be able
> to run virtiofsd without root. 
>
> There are few hurdles though.
>
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.
>
> I am sure there are other restrictions but this probably is the biggest
> one to overcome.
>
>  > 
You should be able to run it within a user namespace with Namespaces
capabilities.
>> "Just" pointing docker to a different seccomp.json file is something which
>> k8s users/admin in many cases can't do.
> Or may be issue is that standard seccomp.json does not allow unshare()
> and hence you are forced to use a non-standar seccomp.json.
>
> Vivek
>
>> Best Regards,
>> Roman
>>
>>
>>> So, it looks good to me.
>>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>>>
>>> Regards,
>>> Misono
>>>
>>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
>>>>  1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/tools/virtiofsd/fuse_virtio.c
>>> b/tools/virtiofsd/fuse_virtio.c
>>>> index 3b6d16a041..9e5537506c 100644
>>>> --- a/tools/virtiofsd/fuse_virtio.c
>>>> +++ b/tools/virtiofsd/fuse_virtio.c
>>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
>>>>  {
>>>>      int ret;
>>>>
>>>> +    /*
>>>> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
>>> it. It's
>>>> +     * an unprivileged system call but some Docker/Moby versions are
>>> known to
>>>> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
>>>> +     *
>>>> +     * Note that the program is single-threaded here so this syscall
>>> has no
>>>> +     * visible effect and is safe to make.
>>>> +     */
>>>> +    ret = unshare(CLONE_FS);
>>>> +    if (ret == -1 && errno == EPERM) {
>>>> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
>>> "
>>>> +                "running in a container please check that the container
>>> "
>>>> +                "runtime seccomp policy allows unshare.\n");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>>      ret = fv_create_listen_socket(se);
>>>>      if (ret < 0) {
>>>>          return ret;
>>>> --
>>>> 2.26.2
>>>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 15:32       ` Stefan Hajnoczi
@ 2020-07-28 19:15         ` Daniel Walsh
  2020-07-29 14:29           ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Walsh @ 2020-07-28 19:15 UTC (permalink / raw)
  To: Stefan Hajnoczi, Roman Mohr
  Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs,
	misono.tomohiro, mpatel, vgoyal


[-- Attachment #1.1: Type: text/plain, Size: 2166 bytes --]

On 7/28/20 11:32, Stefan Hajnoczi wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>> misono.tomohiro@fujitsu.com> wrote:
>>
>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
>>> error
>>>> An assertion failure is raised during request processing if
>>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
>>>> be detected right away.
>>>>
>>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
>>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
>>>> include unshare (e.g. podman is unaffected):
>>>>
>>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
>>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
>>>> default seccomp.json is missing unshare.
>>> Hi, sorry for a bit late.
>>>
>>> unshare() was added to fix xattr problem:
>>>
>>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
>>> In theory we don't need to call unshare if xattr is disabled, but it is
>>> hard to get to know
>>> if xattr is enabled or disabled in fv_queue_worker(), right?
>>>
>>>
>> In kubevirt we want to run virtiofsd in containers. We would already not
>> have xattr support for e.g. overlayfs in the VM after this patch series (an
>> acceptable con at least for us right now).
>> If we can get rid of the unshare (and potentially of needing root) that
>> would be great. We always assume that everything which we run in containers
>> should work for cri-o and docker.
> Root is required to access files with any uid/gid.
>
> Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may
> be able to find a way to drop unshare (at least in containers).
>
>> "Just" pointing docker to a different seccomp.json file is something which
>> k8s users/admin in many cases can't do.
> There is a Moby PR to change the default seccomp.json file here but it's
> unclear if it will be merged:
> https://github.com/moby/moby/pull/41244
>
> Stefan

Why not try Podman?



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 15:52         ` Daniel P. Berrangé
@ 2020-07-28 20:54           ` Vivek Goyal
  0 siblings, 0 replies; 23+ messages in thread
From: Vivek Goyal @ 2020-07-28 20:54 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: vromanso, mpatel, Daniel Walsh, qemu-devel,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	misono.tomohiro, Roman Mohr

On Tue, Jul 28, 2020 at 04:52:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > misono.tomohiro@fujitsu.com> wrote:
> > > 
> > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > > > error
> > > > >
> > > > > An assertion failure is raised during request processing if
> > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > > be detected right away.
> > > > >
> > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > > include unshare (e.g. podman is unaffected):
> > > > >
> > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > > >
> > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > > default seccomp.json is missing unshare.
> > > >
> > > > Hi, sorry for a bit late.
> > > >
> > > > unshare() was added to fix xattr problem:
> > > >
> > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > > hard to get to know
> > > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > > >
> > > >
> > > In kubevirt we want to run virtiofsd in containers. We would already not
> > > have xattr support for e.g. overlayfs in the VM after this patch series (an
> > > acceptable con at least for us right now).
> > > If we can get rid of the unshare (and potentially of needing root) that
> > > would be great. We always assume that everything which we run in containers
> > > should work for cri-o and docker.
> > 
> > But cri-o and docker containers run as root, isn't it? (or atleast have
> > the capability to run as root). Havind said that, it will be nice to be able
> > to run virtiofsd without root. 
> > 
> > There are few hurdles though.
> > 
> > - For file creation, we switch uid/gid (seteuid/setegid) and that seems
> >   to require root. If we were to run unpriviliged, probably all files
> >   on host will have to be owned by unpriviliged user and guest visible
> >   uid/gid will have to be stored in xattrs. I think virtfs supports
> >   something similar.
> 
> I think I've mentioned before, 9p virtfs supports different modes,
> passthrough, squashed or remapped.
> 
> passthrough should be reasonably straightforward to support in virtiofs.
> The guest sees all the host UID/GIDs ownership as normal, and can read
> any files the host user can read, but are obviously restricted to write
> to only the files that host user can write too. No DAC-OVERRIDE facility
> in essence. You'll just get EPERM, which is fine. This simple passthrough
> scenario would be just what's desired for a typical desktop virt use
> cases, where you want to share part/all of your home dir with a guest for
> easy file access. Personally this is the mode I'd be most interested in
> seeing provided for unprivileged virtiofsd usage.

Interesting. So passthrough will have two sub modes. priviliged and
unpriviliged. As of now we support priviliged passthrough. 

I guess it does make sense to look into unpriviliged passthrough
and see what other operations will not be allowed.

Thanks
Vivek

> 
> squash is similar to passthrough, except the guest sees everything
> as owned by the same user. This can be surprising as the guest might
> see a file owned by them, but not be able to write to it, as on the
> host its actually owned by some other user. Fairly niche use case
> I think.
> 
> remapping would be needed for a more general purpose use cases
> allowing the guest to do arbitrary UID/GID changes, but on the host
> everything is still stored as one user and remapped somehow.
> 
> The main challenge for all the unprivileged scenarios is safety of
> the sandbox, to avoid risk of guests escaping to access files outside
> of the exported dir via symlink attacks or similar.
> 
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 19:12         ` Daniel Walsh
@ 2020-07-28 21:01           ` Vivek Goyal
  0 siblings, 0 replies; 23+ messages in thread
From: Vivek Goyal @ 2020-07-28 21:01 UTC (permalink / raw)
  To: Daniel Walsh
  Cc: vromanso, mpatel, Dr. David Alan Gilbert, qemu-devel, virtio-fs,
	Stefan Hajnoczi, misono.tomohiro, Roman Mohr

On Tue, Jul 28, 2020 at 03:12:54PM -0400, Daniel Walsh wrote:
> On 7/28/20 09:12, Vivek Goyal wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >> misono.tomohiro@fujitsu.com> wrote:
> >>
> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> >>> error
> >>>> An assertion failure is raised during request processing if
> >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> >>>> be detected right away.
> >>>>
> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> >>>> include unshare (e.g. podman is unaffected):
> >>>>
> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> >>>> default seccomp.json is missing unshare.
> >>> Hi, sorry for a bit late.
> >>>
> >>> unshare() was added to fix xattr problem:
> >>>
> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> >>> In theory we don't need to call unshare if xattr is disabled, but it is
> >>> hard to get to know
> >>> if xattr is enabled or disabled in fv_queue_worker(), right?
> >>>
> >>>
> >> In kubevirt we want to run virtiofsd in containers. We would already not
> >> have xattr support for e.g. overlayfs in the VM after this patch series (an
> >> acceptable con at least for us right now).
> >> If we can get rid of the unshare (and potentially of needing root) that
> >> would be great. We always assume that everything which we run in containers
> >> should work for cri-o and docker.
> > But cri-o and docker containers run as root, isn't it? (or atleast have
> > the capability to run as root). Havind said that, it will be nice to be able
> > to run virtiofsd without root. 
> >
> > There are few hurdles though.
> >
> > - For file creation, we switch uid/gid (seteuid/setegid) and that seems
> >   to require root. If we were to run unpriviliged, probably all files
> >   on host will have to be owned by unpriviliged user and guest visible
> >   uid/gid will have to be stored in xattrs. I think virtfs supports
> >   something similar.
> >
> > I am sure there are other restrictions but this probably is the biggest
> > one to overcome.
> >
> >  > 
> You should be able to run it within a user namespace with Namespaces
> capabilities.

User namespaces has always been a one TODO item. Few challenges are
how to manage uid/gid mapping for existing directories which will be
shared. We will have to pick a mapping range and then chown the
files accordingly. And chowning itself will require priviliges.

Now Stefan is adding support to run virtiofsd inside container. So
podman should be able virtiofsd inside a user namespace (And even
give CAP_SYS_ADMIN). That way we don't have to worry about setting
a user namespace and select a mapping etc. But persistent data
volumes will continue to be an issue with user namespace, right?

How are you dealing with it in podman. Containers running in 
user namespace and with volume mounts.

Vivek

> >> "Just" pointing docker to a different seccomp.json file is something which
> >> k8s users/admin in many cases can't do.
> > Or may be issue is that standard seccomp.json does not allow unshare()
> > and hence you are forced to use a non-standar seccomp.json.
> >
> > Vivek
> >
> >> Best Regards,
> >> Roman
> >>
> >>
> >>> So, it looks good to me.
> >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >>>
> >>> Regards,
> >>> Misono
> >>>
> >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> ---
> >>>>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >>>>  1 file changed, 16 insertions(+)
> >>>>
> >>>> diff --git a/tools/virtiofsd/fuse_virtio.c
> >>> b/tools/virtiofsd/fuse_virtio.c
> >>>> index 3b6d16a041..9e5537506c 100644
> >>>> --- a/tools/virtiofsd/fuse_virtio.c
> >>>> +++ b/tools/virtiofsd/fuse_virtio.c
> >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >>>>  {
> >>>>      int ret;
> >>>>
> >>>> +    /*
> >>>> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> >>> it. It's
> >>>> +     * an unprivileged system call but some Docker/Moby versions are
> >>> known to
> >>>> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> >>>> +     *
> >>>> +     * Note that the program is single-threaded here so this syscall
> >>> has no
> >>>> +     * visible effect and is safe to make.
> >>>> +     */
> >>>> +    ret = unshare(CLONE_FS);
> >>>> +    if (ret == -1 && errno == EPERM) {
> >>>> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> >>> "
> >>>> +                "running in a container please check that the container
> >>> "
> >>>> +                "runtime seccomp policy allows unshare.\n");
> >>>> +        return -1;
> >>>> +    }
> >>>> +
> >>>>      ret = fv_create_listen_socket(se);
> >>>>      if (ret < 0) {
> >>>>          return ret;
> >>>> --
> >>>> 2.26.2
> >>>
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 13:12       ` Vivek Goyal
  2020-07-28 15:52         ` Daniel P. Berrangé
  2020-07-28 19:12         ` Daniel Walsh
@ 2020-07-29  7:59         ` Roman Mohr
  2020-07-29 14:40           ` Stefan Hajnoczi
  2 siblings, 1 reply; 23+ messages in thread
From: Roman Mohr @ 2020-07-29  7:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, misono.tomohiro, mpatel

[-- Attachment #1: Type: text/plain, Size: 5457 bytes --]

On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > misono.tomohiro@fujitsu.com> wrote:
> >
> > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> an
> > > error
> > > >
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem
> can
> > > > be detected right away.
> > > >
> > > > Unfortunately Docker/Moby does not include unshare in the
> seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > >
> > >
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > >
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if
> the
> > > > default seccomp.json is missing unshare.
> > >
> > > Hi, sorry for a bit late.
> > >
> > > unshare() was added to fix xattr problem:
> > >
> > >
> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > hard to get to know
> > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > >
> > >
> > In kubevirt we want to run virtiofsd in containers. We would already not
> > have xattr support for e.g. overlayfs in the VM after this patch series
> (an
> > acceptable con at least for us right now).
> > If we can get rid of the unshare (and potentially of needing root) that
> > would be great. We always assume that everything which we run in
> containers
> > should work for cri-o and docker.
>
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be
> able
> to run virtiofsd without root.
>

Yes they can run as root. I can tell you what we plan to do with the
containerized virtiofsd: We run it as part of the user-owned pod (a set of
containers).
One of our main goals at the moment is to run VMs in a user-owned pod
without additional privileges.
So that in case the user (VM-creator/owner) enters the pod or something
breaks out of the VM they are just in the unprivileged container sandbox.
As part of that we try to get also rid of running containers in the
user-context with the root user.

One possible scenario which I could think of as being desirable from a
kubevirt perspective:
We would run the VM in one container and have an unprivileged
virtiofsd container in parallel.
This container already has its own mount namespace and it is not that
critical if something manages to enter this sandbox.

But we are not as far yet as getting completely rid of root right now in
kubevirt, so if as a temporary step it needs root, the current proposed
changes would still be very useful for us.

Best Regards,
Roman

There are few hurdles though.
>
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.


> I am sure there are other restrictions but this probably is the biggest
> one to overcome.
>
>  >
> > "Just" pointing docker to a different seccomp.json file is something
> which
> > k8s users/admin in many cases can't do.
>
> Or may be issue is that standard seccomp.json does not allow unshare()
> and hence you are forced to use a non-standar seccomp.json.
>
> Vivek
>
> >
> > Best Regards,
> > Roman
> >
> >
> > > So, it looks good to me.
> > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > >
> > > Regards,
> > > Misono
> > >
> > > >
> > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/tools/virtiofsd/fuse_virtio.c
> > > b/tools/virtiofsd/fuse_virtio.c
> > > > index 3b6d16a041..9e5537506c 100644
> > > > --- a/tools/virtiofsd/fuse_virtio.c
> > > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session
> *se)
> > > >  {
> > > >      int ret;
> > > >
> > > > +    /*
> > > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will
> need
> > > it. It's
> > > > +     * an unprivileged system call but some Docker/Moby versions are
> > > known to
> > > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > > +     *
> > > > +     * Note that the program is single-threaded here so this syscall
> > > has no
> > > > +     * visible effect and is safe to make.
> > > > +     */
> > > > +    ret = unshare(CLONE_FS);
> > > > +    if (ret == -1 && errno == EPERM) {
> > > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with
> EPERM. If
> > > "
> > > > +                "running in a container please check that the
> container
> > > "
> > > > +                "runtime seccomp policy allows unshare.\n");
> > > > +        return -1;
> > > > +    }
> > > > +
> > > >      ret = fv_create_listen_socket(se);
> > > >      if (ret < 0) {
> > > >          return ret;
> > > > --
> > > > 2.26.2
> > >
> > >
>
>

[-- Attachment #2: Type: text/html, Size: 7863 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28 19:15         ` Daniel Walsh
@ 2020-07-29 14:29           ` Stefan Hajnoczi
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-29 14:29 UTC (permalink / raw)
  To: Daniel Walsh
  Cc: vromanso, mpatel, Dr. David Alan Gilbert, qemu-devel, virtio-fs,
	misono.tomohiro, Roman Mohr, vgoyal

[-- Attachment #1: Type: text/plain, Size: 921 bytes --]

On Tue, Jul 28, 2020 at 03:15:25PM -0400, Daniel Walsh wrote:
> On 7/28/20 11:32, Stefan Hajnoczi wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >> misono.tomohiro@fujitsu.com> wrote:
> >>
> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> >>> error
> >> "Just" pointing docker to a different seccomp.json file is something which
> >> k8s users/admin in many cases can't do.
> > There is a Moby PR to change the default seccomp.json file here but it's
> > unclear if it will be merged:
> > https://github.com/moby/moby/pull/41244
> >
> > Stefan
> 
> Why not try Podman?

Absolutely, Podman allows unshare(2) in its default seccomp policy so it
does not have this problem.

I think Roman's point was mainly about the upstream user experience
where Docker is common.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-29  7:59         ` Roman Mohr
@ 2020-07-29 14:40           ` Stefan Hajnoczi
  2020-07-30 22:21             ` Daniel Walsh
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-29 14:40 UTC (permalink / raw)
  To: Roman Mohr
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, misono.tomohiro, mpatel, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]

On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > misono.tomohiro@fujitsu.com> wrote:
> > >
> > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> > an
> > > > error
> Yes they can run as root. I can tell you what we plan to do with the
> containerized virtiofsd: We run it as part of the user-owned pod (a set of
> containers).
> One of our main goals at the moment is to run VMs in a user-owned pod
> without additional privileges.
> So that in case the user (VM-creator/owner) enters the pod or something
> breaks out of the VM they are just in the unprivileged container sandbox.
> As part of that we try to get also rid of running containers in the
> user-context with the root user.
> 
> One possible scenario which I could think of as being desirable from a
> kubevirt perspective:
> We would run the VM in one container and have an unprivileged
> virtiofsd container in parallel.
> This container already has its own mount namespace and it is not that
> critical if something manages to enter this sandbox.
> 
> But we are not as far yet as getting completely rid of root right now in
> kubevirt, so if as a temporary step it needs root, the current proposed
> changes would still be very useful for us.

What is the issue with root in user namespaces?

I remember a few years ago it was seen as a major security issue but
don't remember if container runtimes were already using user namespaces
back then.

I guess the goal might be simply to minimize Linux capabilities as much
as possible?

virtiofsd could nominally run with an arbitrary uid/gid but it still
needs the Linux capabilities that allow it to change uid/gid and
override file system permission checks just like the root user. Not sure
if there is any advantage to running with uid 1000 when you still have
these Linux capabilities.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-29 14:40           ` Stefan Hajnoczi
@ 2020-07-30 22:21             ` Daniel Walsh
  2020-07-31  8:26               ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Walsh @ 2020-07-30 22:21 UTC (permalink / raw)
  To: Stefan Hajnoczi, Roman Mohr
  Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs,
	misono.tomohiro, mpatel, Vivek Goyal


[-- Attachment #1.1: Type: text/plain, Size: 2553 bytes --]

On 7/29/20 10:40, Stefan Hajnoczi wrote:
> On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>>>> misono.tomohiro@fujitsu.com> wrote:
>>>>
>>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
>>> an
>>>>> error
>> Yes they can run as root. I can tell you what we plan to do with the
>> containerized virtiofsd: We run it as part of the user-owned pod (a set of
>> containers).
>> One of our main goals at the moment is to run VMs in a user-owned pod
>> without additional privileges.
>> So that in case the user (VM-creator/owner) enters the pod or something
>> breaks out of the VM they are just in the unprivileged container sandbox.
>> As part of that we try to get also rid of running containers in the
>> user-context with the root user.
>>
>> One possible scenario which I could think of as being desirable from a
>> kubevirt perspective:
>> We would run the VM in one container and have an unprivileged
>> virtiofsd container in parallel.
>> This container already has its own mount namespace and it is not that
>> critical if something manages to enter this sandbox.
>>
>> But we are not as far yet as getting completely rid of root right now in
>> kubevirt, so if as a temporary step it needs root, the current proposed
>> changes would still be very useful for us.
> What is the issue with root in user namespaces?
>
> I remember a few years ago it was seen as a major security issue but
> don't remember if container runtimes were already using user namespaces
> back then.
>
> I guess the goal might be simply to minimize Linux capabilities as much
> as possible?
>
> virtiofsd could nominally run with an arbitrary uid/gid but it still
> needs the Linux capabilities that allow it to change uid/gid and
> override file system permission checks just like the root user. Not sure
> if there is any advantage to running with uid 1000 when you still have
> these Linux capabilities.
>
> Stefan

When you run in a user namespace, virtiofsd would only have
setuid/setgid over the range of UIDs mapped into the user namespace.  So
if UID=0 on the host is not mapped, then the container can not create
real UID=0 files on disk.

Similarly you can protect the user directories and any content by
running the containers in a really high UID Mapping.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-30 22:21             ` Daniel Walsh
@ 2020-07-31  8:26               ` Stefan Hajnoczi
  2020-07-31  8:39                 ` Roman Mohr
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-31  8:26 UTC (permalink / raw)
  To: Roman Mohr
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, misono.tomohiro, mpatel, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 2811 bytes --]

On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >>
> >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >>>> misono.tomohiro@fujitsu.com> wrote:
> >>>>
> >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> >>> an
> >>>>> error
> >> Yes they can run as root. I can tell you what we plan to do with the
> >> containerized virtiofsd: We run it as part of the user-owned pod (a set of
> >> containers).
> >> One of our main goals at the moment is to run VMs in a user-owned pod
> >> without additional privileges.
> >> So that in case the user (VM-creator/owner) enters the pod or something
> >> breaks out of the VM they are just in the unprivileged container sandbox.
> >> As part of that we try to get also rid of running containers in the
> >> user-context with the root user.
> >>
> >> One possible scenario which I could think of as being desirable from a
> >> kubevirt perspective:
> >> We would run the VM in one container and have an unprivileged
> >> virtiofsd container in parallel.
> >> This container already has its own mount namespace and it is not that
> >> critical if something manages to enter this sandbox.
> >>
> >> But we are not as far yet as getting completely rid of root right now in
> >> kubevirt, so if as a temporary step it needs root, the current proposed
> >> changes would still be very useful for us.
> > What is the issue with root in user namespaces?
> >
> > I remember a few years ago it was seen as a major security issue but
> > don't remember if container runtimes were already using user namespaces
> > back then.
> >
> > I guess the goal might be simply to minimize Linux capabilities as much
> > as possible?
> >
> > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > needs the Linux capabilities that allow it to change uid/gid and
> > override file system permission checks just like the root user. Not sure
> > if there is any advantage to running with uid 1000 when you still have
> > these Linux capabilities.
> >
> > Stefan
> 
> When you run in a user namespace, virtiofsd would only have
> setuid/setgid over the range of UIDs mapped into the user namespace.  So
> if UID=0 on the host is not mapped, then the container can not create
> real UID=0 files on disk.
> 
> Similarly you can protect the user directories and any content by
> running the containers in a really high UID Mapping.

Roman, do user namespaces address your concerns about uid 0 in
containers?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-31  8:26               ` Stefan Hajnoczi
@ 2020-07-31  8:39                 ` Roman Mohr
  2020-07-31 14:11                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Mohr @ 2020-07-31  8:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, misono.tomohiro, mpatel, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 4089 bytes --]

On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com>
wrote:

> On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> > On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com>
> wrote:
> > >>
> > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > >>>> misono.tomohiro@fujitsu.com> wrote:
> > >>>>
> > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and
> print
> > >>> an
> > >>>>> error
> > >> Yes they can run as root. I can tell you what we plan to do with the
> > >> containerized virtiofsd: We run it as part of the user-owned pod (a
> set of
> > >> containers).
> > >> One of our main goals at the moment is to run VMs in a user-owned pod
> > >> without additional privileges.
> > >> So that in case the user (VM-creator/owner) enters the pod or
> something
> > >> breaks out of the VM they are just in the unprivileged container
> sandbox.
> > >> As part of that we try to get also rid of running containers in the
> > >> user-context with the root user.
> > >>
> > >> One possible scenario which I could think of as being desirable from a
> > >> kubevirt perspective:
> > >> We would run the VM in one container and have an unprivileged
> > >> virtiofsd container in parallel.
> > >> This container already has its own mount namespace and it is not that
> > >> critical if something manages to enter this sandbox.
> > >>
> > >> But we are not as far yet as getting completely rid of root right now
> in
> > >> kubevirt, so if as a temporary step it needs root, the current
> proposed
> > >> changes would still be very useful for us.
> > > What is the issue with root in user namespaces?
> > >
> > > I remember a few years ago it was seen as a major security issue but
> > > don't remember if container runtimes were already using user namespaces
> > > back then.
> > >
> > > I guess the goal might be simply to minimize Linux capabilities as much
> > > as possible?
> > >
> > > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > > needs the Linux capabilities that allow it to change uid/gid and
> > > override file system permission checks just like the root user. Not
> sure
> > > if there is any advantage to running with uid 1000 when you still have
> > > these Linux capabilities.
> > >
> > > Stefan
> >
> > When you run in a user namespace, virtiofsd would only have
> > setuid/setgid over the range of UIDs mapped into the user namespace.  So
> > if UID=0 on the host is not mapped, then the container can not create
> > real UID=0 files on disk.
> >
> > Similarly you can protect the user directories and any content by
> > running the containers in a really high UID Mapping.
>
> Roman, do user namespaces address your concerns about uid 0 in
> containers?
>

They may eventually solve it. I would not let us hang up on this right now,
since as said at least in kubevirt we can't get rid right now of root
anyway.
Even if it is at some point in the future save and supported on
bleeding-edge managed k8s clusters to allow ordinary users to run with uid
0, from my perspective it is right now common to restrict namespaces with
PodSecurityPolicies or SecurityContexts to not allow running pods as root
for normal users.
It is also common that a significant part of the community users run docker
and/or run on managed k8s clusters where they can not influence if
user-namespaces are enabled, if they can run pods as root, if the runtime
points to a seccomp file they like or if the runtime they prefer is used.

But let me repeat again that we require root right now anyway and that we
don't run the pods right now with the user privileges (but we should and we
aim for that). Right now PSPs and SCCs restrict access to these pods by the
users.
So for our use case, at this exact moment root is acceptable, the unshare
call is a little bit more problematic.

Best Regards,
Roman



>
> Stefan
>

[-- Attachment #2: Type: text/html, Size: 5424 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-31  8:39                 ` Roman Mohr
@ 2020-07-31 14:11                   ` Stefan Hajnoczi
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2020-07-31 14:11 UTC (permalink / raw)
  To: Roman Mohr
  Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert,
	virtio-fs, misono.tomohiro, mpatel, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 4428 bytes --]

On Fri, Jul 31, 2020 at 10:39:37AM +0200, Roman Mohr wrote:
> On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> 
> > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> > > On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com>
> > wrote:
> > > >>
> > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > >>>> misono.tomohiro@fujitsu.com> wrote:
> > > >>>>
> > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and
> > print
> > > >>> an
> > > >>>>> error
> > > >> Yes they can run as root. I can tell you what we plan to do with the
> > > >> containerized virtiofsd: We run it as part of the user-owned pod (a
> > set of
> > > >> containers).
> > > >> One of our main goals at the moment is to run VMs in a user-owned pod
> > > >> without additional privileges.
> > > >> So that in case the user (VM-creator/owner) enters the pod or
> > something
> > > >> breaks out of the VM they are just in the unprivileged container
> > sandbox.
> > > >> As part of that we try to get also rid of running containers in the
> > > >> user-context with the root user.
> > > >>
> > > >> One possible scenario which I could think of as being desirable from a
> > > >> kubevirt perspective:
> > > >> We would run the VM in one container and have an unprivileged
> > > >> virtiofsd container in parallel.
> > > >> This container already has its own mount namespace and it is not that
> > > >> critical if something manages to enter this sandbox.
> > > >>
> > > >> But we are not as far yet as getting completely rid of root right now
> > in
> > > >> kubevirt, so if as a temporary step it needs root, the current
> > proposed
> > > >> changes would still be very useful for us.
> > > > What is the issue with root in user namespaces?
> > > >
> > > > I remember a few years ago it was seen as a major security issue but
> > > > don't remember if container runtimes were already using user namespaces
> > > > back then.
> > > >
> > > > I guess the goal might be simply to minimize Linux capabilities as much
> > > > as possible?
> > > >
> > > > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > > > needs the Linux capabilities that allow it to change uid/gid and
> > > > override file system permission checks just like the root user. Not
> > sure
> > > > if there is any advantage to running with uid 1000 when you still have
> > > > these Linux capabilities.
> > > >
> > > > Stefan
> > >
> > > When you run in a user namespace, virtiofsd would only have
> > > setuid/setgid over the range of UIDs mapped into the user namespace.  So
> > > if UID=0 on the host is not mapped, then the container can not create
> > > real UID=0 files on disk.
> > >
> > > Similarly you can protect the user directories and any content by
> > > running the containers in a really high UID Mapping.
> >
> > Roman, do user namespaces address your concerns about uid 0 in
> > containers?
> >
> 
> They may eventually solve it. I would not let us hang up on this right now,
> since as said at least in kubevirt we can't get rid right now of root
> anyway.
> Even if it is at some point in the future save and supported on
> bleeding-edge managed k8s clusters to allow ordinary users to run with uid
> 0, from my perspective it is right now common to restrict namespaces with
> PodSecurityPolicies or SecurityContexts to not allow running pods as root
> for normal users.
> It is also common that a significant part of the community users run docker
> and/or run on managed k8s clusters where they can not influence if
> user-namespaces are enabled, if they can run pods as root, if the runtime
> points to a seccomp file they like or if the runtime they prefer is used.
> 
> But let me repeat again that we require root right now anyway and that we
> don't run the pods right now with the user privileges (but we should and we
> aim for that). Right now PSPs and SCCs restrict access to these pods by the
> users.
> So for our use case, at this exact moment root is acceptable, the unshare
> call is a little bit more problematic.

Okay, thanks for explaining.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
  2020-07-28  1:05   ` misono.tomohiro
  2020-07-28 10:00     ` Roman Mohr
@ 2020-08-07 15:29     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2020-08-07 15:29 UTC (permalink / raw)
  To: misono.tomohiro
  Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs,
	'Stefan Hajnoczi',
	mpatel, vgoyal

* misono.tomohiro@fujitsu.com (misono.tomohiro@fujitsu.com) wrote:
> > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
> > 
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> > 
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > 
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
> 
> Hi, sorry for a bit late.
> 
> unshare() was added to fix xattr problem: 
>   https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know
> if xattr is enabled or disabled in fv_queue_worker(), right?
> 
> So, it looks good to me.
> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

OK, I think it might also be OK to just fail the xattr operation on a
non-file/directory in this case.

Dave

> Regards,
> Misono
> 
> > 
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..9e5537506c 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> > 
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     *
> > +     * Note that the program is single-threaded here so this syscall has no
> > +     * visible effect and is safe to make.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > +                "running in a container please check that the container "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> >      ret = fv_create_listen_socket(se);
> >      if (ret < 0) {
> >          return ret;
> > --
> > 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option
  2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi
@ 2020-08-07 15:36   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2020-08-07 15:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs, mpatel, vgoyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
> create namespaces.
> 
> Introduce a weaker sandbox mode that is sufficient in container
> environments because the container runtime already sets up namespaces.
> Use chroot to restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>    syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>    process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>    rejects the connect(2) syscall, but an additional layer of security
>    is lost. Container runtime-specific network security policies can be
>    used drop network traffic (except for the vhost-user UNIX domain
>    socket).
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Is there anyone with a bit more FS semantics expertise who could check
this;  I already surprised myself a few times reading about chroot
escapes, so I'd appreciate a 2nd pair of eyes.

Dave

> ---
>  tools/virtiofsd/helper.c         |  8 +++++
>  tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++--
>  docs/tools/virtiofsd.rst         | 32 ++++++++++++++----
>  3 files changed, 88 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 3105b6c23a..91dcb23664 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -168,6 +168,14 @@ void fuse_cmdline_help(void)
>             "                               enable/disable readirplus\n"
>             "                               default: readdirplus except with "
>             "cache=none\n"
> +           "    -o sandbox=namespace|chroot\n"
> +           "                               sandboxing mode:\n"
> +           "                               - namespace: mount, pid, and net\n"
> +           "                                 namespaces with pivot_root(2)\n"
> +           "                                 into shared directory\n"
> +           "                               - chroot: chroot(2) into shared\n"
> +           "                                 directory (use in containers)\n"
> +           "                               default: namespace\n"
>             "    -o timeout=<number>        I/O timeout (seconds)\n"
>             "                               default: depends on cache= option.\n"
>             "    -o writeback|no_writeback  enable/disable writeback cache\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 50a164a599..a7894c3e7c 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -137,8 +137,14 @@ enum {
>      CACHE_ALWAYS,
>  };
>  
> +enum {
> +    SANDBOX_NAMESPACE,
> +    SANDBOX_CHROOT,
> +};
> +
>  struct lo_data {
>      pthread_mutex_t mutex;
> +    int sandbox;
>      int debug;
>      int writeback;
>      int flock;
> @@ -162,6 +168,12 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +    { "sandbox=namespace",
> +      offsetof(struct lo_data, sandbox),
> +      SANDBOX_NAMESPACE },
> +    { "sandbox=chroot",
> +      offsetof(struct lo_data, sandbox),
> +      SANDBOX_CHROOT },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2665,6 +2677,41 @@ static void setup_capabilities(char *modcaps_in)
>      pthread_mutex_unlock(&cap.mutex);
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environments where the process is
> + * launched without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +    if (lo->proc_self_fd == -1) {
> +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +        exit(1);
> +    }
> +
> +    /*
> +     * Make the shared directory the file system root so that FUSE_OPEN
> +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> +     *
> +     * The chroot(2) syscall is later disabled by seccomp and the
> +     * CAP_SYS_CHROOT capability is dropped so that tampering with the chroot
> +     * is not possible.
> +     *
> +     * However, it's still possible to escape the chroot via lo->proc_self_fd
> +     * but that requires first gaining control of the process.
> +     */
> +    if (chroot(lo->source) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> +        exit(1);
> +    }
> +
> +    /* Move into the chroot */
> +    if (chdir("/") != 0) {
> +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> +        exit(1);
> +    }
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2672,8 +2719,13 @@ static void setup_capabilities(char *modcaps_in)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>                            bool enable_syslog)
>  {
> -    setup_namespaces(lo, se);
> -    setup_mounts(lo->source);
> +    if (lo->sandbox == SANDBOX_NAMESPACE) {
> +        setup_namespaces(lo, se);
> +        setup_mounts(lo->source);
> +    } else {
> +        setup_chroot(lo);
> +    }
> +
>      setup_seccomp(enable_syslog);
>      setup_capabilities(g_strdup(lo->modcaps));
>  }
> @@ -2820,6 +2872,7 @@ int main(int argc, char *argv[])
>      struct fuse_session *se;
>      struct fuse_cmdline_opts opts;
>      struct lo_data lo = {
> +        .sandbox = SANDBOX_NAMESPACE,
>          .debug = 0,
>          .writeback = 0,
>          .posix_lock = 1,
> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> index 824e713491..40629f95ae 100644
> --- a/docs/tools/virtiofsd.rst
> +++ b/docs/tools/virtiofsd.rst
> @@ -17,13 +17,24 @@ This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
>  but should work with any virtual machine monitor (VMM) that supports
>  vhost-user.  See the Examples section below.
>  
> -This program must be run as the root user.  Upon startup the program will
> -switch into a new file system namespace with the shared directory tree as its
> -root.  This prevents "file system escapes" due to symlinks and other file
> -system objects that might lead to files outside the shared directory.  The
> -program also sandboxes itself using seccomp(2) to prevent ptrace(2) and other
> -vectors that could allow an attacker to compromise the system after gaining
> -control of the virtiofsd process.
> +This program must be run as the root user.  The program drops privileges where
> +possible during startup although it must be able to create and access files
> +with any uid/gid:
> +
> +* The ability to invoke syscalls is limited using seccomp(2).
> +* Linux capabilities(7) are dropped.
> +
> +In "namespace" sandbox mode the program switches into a new file system
> +namespace and invokes pivot_root(2) to make the shared directory tree its root.
> +A new pid and net namespace is also created to isolate the process.
> +
> +In "chroot" sandbox mode the program invokes chroot(2) to make the shared
> +directory tree its root. This mode is intended for container environments where
> +the container runtime has already set up the namespaces and the program does
> +not have permission to create namespaces itself.
> +
> +Both sandbox modes prevent "file system escapes" due to symlinks and other file
> +system objects that might lead to files outside the shared directory.
>  
>  Options
>  -------
> @@ -72,6 +83,13 @@ Options
>    * readdirplus|no_readdirplus -
>      Enable/disable readdirplus.  The default is ``readdirplus``.
>  
> +  * sandbox=namespace|chroot -
> +    Sandbox mode:
> +    - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
> +    the shared directory.
> +    - chroot: chroot(2) into shared directory (use in containers).
> +    The default is "namespace".
> +
>    * source=PATH -
>      Share host directory tree located at PATH.  This option is required.
>  
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container
  2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi
@ 2020-08-27 18:40 ` Dr. David Alan Gilbert
  3 siblings, 0 replies; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2020-08-27 18:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs, mpatel, vgoyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> v2:
>  * Update virtiofsd.rst documentation on sandboxing modes
>  * Change syntax to -o sandbox=namespace|chroot
>  * Add comment explaining that unshare(CLONE_FS) has no visible side-effect
>    while single-threaded
>  * xfstests and pjdfstest pass. Did not run tests on overlayfs because required
>    xattrs do not work without CAP_SYS_ADMIN.
> 
> Mrunal and Dan: This patch series adds a sandboxing mode where virtiofsd relies
> on the container runtime for isolation. It only does
> chroot("path/to/shared-dir"), seccomp, and drops Linux capabilities. Previously
> it created a new mount, pid, and net namespace but cannot do this without
> CAP_SYS_ADMIN when run inside a container. pivot_root("path/to/shared-dir") has
> been replaced with chroot("path/to/shared-dir"), again because CAP_SYS_ADMIN is
> unavailable. The point of the chroot() is to prevent escapes from the shared
> directory during path traversal. Does this ring any alarm bells or does it
> sound sane?
> 
> Container runtimes handle namespace setup and remove privileges needed by
> virtiofsd to perform sandboxing. Luckily the container environment already
> provides most of the sandbox that virtiofsd needs for security.
> 
> Introduce a new "virtiofsd -o sandbox=chroot" option that uses chroot(2)
> instead of namespaces. This option allows virtiofsd to work inside a container.
> 
> Please see the individual patches for details on the changes and security
> implications.
> 
> Given that people are starting to attempt running virtiofsd in containers I
> think this should go into QEMU 5.1.

I've queued 1 and 3; waiting for someone with better knowledge of chroot
to review 2.

Dave

> Stefan Hajnoczi (3):
>   virtiofsd: drop CAP_DAC_READ_SEARCH
>   virtiofsd: add container-friendly -o sandbox=chroot option
>   virtiofsd: probe unshare(CLONE_FS) and print an error
> 
>  tools/virtiofsd/fuse_virtio.c    | 16 +++++++++
>  tools/virtiofsd/helper.c         |  8 +++++
>  tools/virtiofsd/passthrough_ll.c | 58 ++++++++++++++++++++++++++++++--
>  docs/tools/virtiofsd.rst         | 32 ++++++++++++++----
>  4 files changed, 104 insertions(+), 10 deletions(-)
> 
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-08-27 18:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi
2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi
2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi
2020-08-07 15:36   ` Dr. David Alan Gilbert
2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi
2020-07-28  1:05   ` misono.tomohiro
2020-07-28 10:00     ` Roman Mohr
2020-07-28 13:12       ` Vivek Goyal
2020-07-28 15:52         ` Daniel P. Berrangé
2020-07-28 20:54           ` Vivek Goyal
2020-07-28 19:12         ` Daniel Walsh
2020-07-28 21:01           ` Vivek Goyal
2020-07-29  7:59         ` Roman Mohr
2020-07-29 14:40           ` Stefan Hajnoczi
2020-07-30 22:21             ` Daniel Walsh
2020-07-31  8:26               ` Stefan Hajnoczi
2020-07-31  8:39                 ` Roman Mohr
2020-07-31 14:11                   ` Stefan Hajnoczi
2020-07-28 15:32       ` Stefan Hajnoczi
2020-07-28 19:15         ` Daniel Walsh
2020-07-29 14:29           ` Stefan Hajnoczi
2020-08-07 15:29     ` Dr. David Alan Gilbert
2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).