* [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container @ 2020-07-27 19:02 Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi ` (3 more replies) 0 siblings, 4 replies; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw) To: qemu-devel Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs, Stefan Hajnoczi, mpatel, vgoyal v2: * Update virtiofsd.rst documentation on sandboxing modes * Change syntax to -o sandbox=namespace|chroot * Add comment explaining that unshare(CLONE_FS) has no visible side-effect while single-threaded * xfstests and pjdfstest pass. Did not run tests on overlayfs because required xattrs do not work without CAP_SYS_ADMIN. Mrunal and Dan: This patch series adds a sandboxing mode where virtiofsd relies on the container runtime for isolation. It only does chroot("path/to/shared-dir"), seccomp, and drops Linux capabilities. Previously it created a new mount, pid, and net namespace but cannot do this without CAP_SYS_ADMIN when run inside a container. pivot_root("path/to/shared-dir") has been replaced with chroot("path/to/shared-dir"), again because CAP_SYS_ADMIN is unavailable. The point of the chroot() is to prevent escapes from the shared directory during path traversal. Does this ring any alarm bells or does it sound sane? Container runtimes handle namespace setup and remove privileges needed by virtiofsd to perform sandboxing. Luckily the container environment already provides most of the sandbox that virtiofsd needs for security. Introduce a new "virtiofsd -o sandbox=chroot" option that uses chroot(2) instead of namespaces. This option allows virtiofsd to work inside a container. Please see the individual patches for details on the changes and security implications. Given that people are starting to attempt running virtiofsd in containers I think this should go into QEMU 5.1. Stefan Hajnoczi (3): virtiofsd: drop CAP_DAC_READ_SEARCH virtiofsd: add container-friendly -o sandbox=chroot option virtiofsd: probe unshare(CLONE_FS) and print an error tools/virtiofsd/fuse_virtio.c | 16 +++++++++ tools/virtiofsd/helper.c | 8 +++++ tools/virtiofsd/passthrough_ll.c | 58 ++++++++++++++++++++++++++++++-- docs/tools/virtiofsd.rst | 32 ++++++++++++++---- 4 files changed, 104 insertions(+), 10 deletions(-) -- 2.26.2 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH 2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi @ 2020-07-27 19:02 ` Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw) To: qemu-devel Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs, Stefan Hajnoczi, mpatel, vgoyal virtiofsd does not need CAP_DAC_READ_SEARCH because it already has the more powerful CAP_DAC_OVERRIDE. Drop it from the list of capabilities. This is important because container runtimes may not include CAP_DAC_READ_SEARCH by default. This patch allows virtiofsd to reduce its capabilities when running inside a Docker container. Note that CAP_DAC_READ_SEARCH may be necessary again in the future if virtiofsd starts using open_by_handle_at(2). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> --- tools/virtiofsd/passthrough_ll.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 94e0de2d2b..50a164a599 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2596,7 +2596,6 @@ static void setup_capabilities(char *modcaps_in) if (capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE, CAP_CHOWN, CAP_DAC_OVERRIDE, - CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_FSETID, CAP_SETGID, -- 2.26.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option 2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi @ 2020-07-27 19:02 ` Stefan Hajnoczi 2020-08-07 15:36 ` Dr. David Alan Gilbert 2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi 2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert 3 siblings, 1 reply; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw) To: qemu-devel Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs, Stefan Hajnoczi, mpatel, vgoyal virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to create namespaces. Introduce a weaker sandbox mode that is sufficient in container environments because the container runtime already sets up namespaces. Use chroot to restrict path traversal to the shared directory. virtiofsd loses the following: 1. Mount namespace. The process chroots to the shared directory but leaves the mounts in place. Seccomp rejects mount(2)/umount(2) syscalls. 2. Pid namespace. This should be fine because virtiofsd is the only process running in the container. 3. Network namespace. This should be fine because seccomp already rejects the connect(2) syscall, but an additional layer of security is lost. Container runtime-specific network security policies can be used drop network traffic (except for the vhost-user UNIX domain socket). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- tools/virtiofsd/helper.c | 8 +++++ tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++-- docs/tools/virtiofsd.rst | 32 ++++++++++++++---- 3 files changed, 88 insertions(+), 9 deletions(-) diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c index 3105b6c23a..91dcb23664 100644 --- a/tools/virtiofsd/helper.c +++ b/tools/virtiofsd/helper.c @@ -168,6 +168,14 @@ void fuse_cmdline_help(void) " enable/disable readirplus\n" " default: readdirplus except with " "cache=none\n" + " -o sandbox=namespace|chroot\n" + " sandboxing mode:\n" + " - namespace: mount, pid, and net\n" + " namespaces with pivot_root(2)\n" + " into shared directory\n" + " - chroot: chroot(2) into shared\n" + " directory (use in containers)\n" + " default: namespace\n" " -o timeout=<number> I/O timeout (seconds)\n" " default: depends on cache= option.\n" " -o writeback|no_writeback enable/disable writeback cache\n" diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 50a164a599..a7894c3e7c 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -137,8 +137,14 @@ enum { CACHE_ALWAYS, }; +enum { + SANDBOX_NAMESPACE, + SANDBOX_CHROOT, +}; + struct lo_data { pthread_mutex_t mutex; + int sandbox; int debug; int writeback; int flock; @@ -162,6 +168,12 @@ struct lo_data { }; static const struct fuse_opt lo_opts[] = { + { "sandbox=namespace", + offsetof(struct lo_data, sandbox), + SANDBOX_NAMESPACE }, + { "sandbox=chroot", + offsetof(struct lo_data, sandbox), + SANDBOX_CHROOT }, { "writeback", offsetof(struct lo_data, writeback), 1 }, { "no_writeback", offsetof(struct lo_data, writeback), 0 }, { "source=%s", offsetof(struct lo_data, source), 0 }, @@ -2665,6 +2677,41 @@ static void setup_capabilities(char *modcaps_in) pthread_mutex_unlock(&cap.mutex); } +/* + * Use chroot as a weaker sandbox for environments where the process is + * launched without CAP_SYS_ADMIN. + */ +static void setup_chroot(struct lo_data *lo) +{ + lo->proc_self_fd = open("/proc/self/fd", O_PATH); + if (lo->proc_self_fd == -1) { + fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n"); + exit(1); + } + + /* + * Make the shared directory the file system root so that FUSE_OPEN + * (lo_open()) cannot escape the shared directory by opening a symlink. + * + * The chroot(2) syscall is later disabled by seccomp and the + * CAP_SYS_CHROOT capability is dropped so that tampering with the chroot + * is not possible. + * + * However, it's still possible to escape the chroot via lo->proc_self_fd + * but that requires first gaining control of the process. + */ + if (chroot(lo->source) != 0) { + fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source); + exit(1); + } + + /* Move into the chroot */ + if (chdir("/") != 0) { + fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n"); + exit(1); + } +} + /* * Lock down this process to prevent access to other processes or files outside * source directory. This reduces the impact of arbitrary code execution bugs. @@ -2672,8 +2719,13 @@ static void setup_capabilities(char *modcaps_in) static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, bool enable_syslog) { - setup_namespaces(lo, se); - setup_mounts(lo->source); + if (lo->sandbox == SANDBOX_NAMESPACE) { + setup_namespaces(lo, se); + setup_mounts(lo->source); + } else { + setup_chroot(lo); + } + setup_seccomp(enable_syslog); setup_capabilities(g_strdup(lo->modcaps)); } @@ -2820,6 +2872,7 @@ int main(int argc, char *argv[]) struct fuse_session *se; struct fuse_cmdline_opts opts; struct lo_data lo = { + .sandbox = SANDBOX_NAMESPACE, .debug = 0, .writeback = 0, .posix_lock = 1, diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst index 824e713491..40629f95ae 100644 --- a/docs/tools/virtiofsd.rst +++ b/docs/tools/virtiofsd.rst @@ -17,13 +17,24 @@ This program is designed to work with QEMU's ``--device vhost-user-fs-pci`` but should work with any virtual machine monitor (VMM) that supports vhost-user. See the Examples section below. -This program must be run as the root user. Upon startup the program will -switch into a new file system namespace with the shared directory tree as its -root. This prevents "file system escapes" due to symlinks and other file -system objects that might lead to files outside the shared directory. The -program also sandboxes itself using seccomp(2) to prevent ptrace(2) and other -vectors that could allow an attacker to compromise the system after gaining -control of the virtiofsd process. +This program must be run as the root user. The program drops privileges where +possible during startup although it must be able to create and access files +with any uid/gid: + +* The ability to invoke syscalls is limited using seccomp(2). +* Linux capabilities(7) are dropped. + +In "namespace" sandbox mode the program switches into a new file system +namespace and invokes pivot_root(2) to make the shared directory tree its root. +A new pid and net namespace is also created to isolate the process. + +In "chroot" sandbox mode the program invokes chroot(2) to make the shared +directory tree its root. This mode is intended for container environments where +the container runtime has already set up the namespaces and the program does +not have permission to create namespaces itself. + +Both sandbox modes prevent "file system escapes" due to symlinks and other file +system objects that might lead to files outside the shared directory. Options ------- @@ -72,6 +83,13 @@ Options * readdirplus|no_readdirplus - Enable/disable readdirplus. The default is ``readdirplus``. + * sandbox=namespace|chroot - + Sandbox mode: + - namespace: Create mount, pid, and net namespaces and pivot_root(2) into + the shared directory. + - chroot: chroot(2) into shared directory (use in containers). + The default is "namespace". + * source=PATH - Share host directory tree located at PATH. This option is required. -- 2.26.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option 2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi @ 2020-08-07 15:36 ` Dr. David Alan Gilbert 0 siblings, 0 replies; 23+ messages in thread From: Dr. David Alan Gilbert @ 2020-08-07 15:36 UTC (permalink / raw) To: Stefan Hajnoczi Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs, mpatel, vgoyal * Stefan Hajnoczi (stefanha@redhat.com) wrote: > virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to > create namespaces. > > Introduce a weaker sandbox mode that is sufficient in container > environments because the container runtime already sets up namespaces. > Use chroot to restrict path traversal to the shared directory. > > virtiofsd loses the following: > > 1. Mount namespace. The process chroots to the shared directory but > leaves the mounts in place. Seccomp rejects mount(2)/umount(2) > syscalls. > > 2. Pid namespace. This should be fine because virtiofsd is the only > process running in the container. > > 3. Network namespace. This should be fine because seccomp already > rejects the connect(2) syscall, but an additional layer of security > is lost. Container runtime-specific network security policies can be > used drop network traffic (except for the vhost-user UNIX domain > socket). > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Is there anyone with a bit more FS semantics expertise who could check this; I already surprised myself a few times reading about chroot escapes, so I'd appreciate a 2nd pair of eyes. Dave > --- > tools/virtiofsd/helper.c | 8 +++++ > tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++-- > docs/tools/virtiofsd.rst | 32 ++++++++++++++---- > 3 files changed, 88 insertions(+), 9 deletions(-) > > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c > index 3105b6c23a..91dcb23664 100644 > --- a/tools/virtiofsd/helper.c > +++ b/tools/virtiofsd/helper.c > @@ -168,6 +168,14 @@ void fuse_cmdline_help(void) > " enable/disable readirplus\n" > " default: readdirplus except with " > "cache=none\n" > + " -o sandbox=namespace|chroot\n" > + " sandboxing mode:\n" > + " - namespace: mount, pid, and net\n" > + " namespaces with pivot_root(2)\n" > + " into shared directory\n" > + " - chroot: chroot(2) into shared\n" > + " directory (use in containers)\n" > + " default: namespace\n" > " -o timeout=<number> I/O timeout (seconds)\n" > " default: depends on cache= option.\n" > " -o writeback|no_writeback enable/disable writeback cache\n" > diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c > index 50a164a599..a7894c3e7c 100644 > --- a/tools/virtiofsd/passthrough_ll.c > +++ b/tools/virtiofsd/passthrough_ll.c > @@ -137,8 +137,14 @@ enum { > CACHE_ALWAYS, > }; > > +enum { > + SANDBOX_NAMESPACE, > + SANDBOX_CHROOT, > +}; > + > struct lo_data { > pthread_mutex_t mutex; > + int sandbox; > int debug; > int writeback; > int flock; > @@ -162,6 +168,12 @@ struct lo_data { > }; > > static const struct fuse_opt lo_opts[] = { > + { "sandbox=namespace", > + offsetof(struct lo_data, sandbox), > + SANDBOX_NAMESPACE }, > + { "sandbox=chroot", > + offsetof(struct lo_data, sandbox), > + SANDBOX_CHROOT }, > { "writeback", offsetof(struct lo_data, writeback), 1 }, > { "no_writeback", offsetof(struct lo_data, writeback), 0 }, > { "source=%s", offsetof(struct lo_data, source), 0 }, > @@ -2665,6 +2677,41 @@ static void setup_capabilities(char *modcaps_in) > pthread_mutex_unlock(&cap.mutex); > } > > +/* > + * Use chroot as a weaker sandbox for environments where the process is > + * launched without CAP_SYS_ADMIN. > + */ > +static void setup_chroot(struct lo_data *lo) > +{ > + lo->proc_self_fd = open("/proc/self/fd", O_PATH); > + if (lo->proc_self_fd == -1) { > + fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n"); > + exit(1); > + } > + > + /* > + * Make the shared directory the file system root so that FUSE_OPEN > + * (lo_open()) cannot escape the shared directory by opening a symlink. > + * > + * The chroot(2) syscall is later disabled by seccomp and the > + * CAP_SYS_CHROOT capability is dropped so that tampering with the chroot > + * is not possible. > + * > + * However, it's still possible to escape the chroot via lo->proc_self_fd > + * but that requires first gaining control of the process. > + */ > + if (chroot(lo->source) != 0) { > + fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source); > + exit(1); > + } > + > + /* Move into the chroot */ > + if (chdir("/") != 0) { > + fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n"); > + exit(1); > + } > +} > + > /* > * Lock down this process to prevent access to other processes or files outside > * source directory. This reduces the impact of arbitrary code execution bugs. > @@ -2672,8 +2719,13 @@ static void setup_capabilities(char *modcaps_in) > static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, > bool enable_syslog) > { > - setup_namespaces(lo, se); > - setup_mounts(lo->source); > + if (lo->sandbox == SANDBOX_NAMESPACE) { > + setup_namespaces(lo, se); > + setup_mounts(lo->source); > + } else { > + setup_chroot(lo); > + } > + > setup_seccomp(enable_syslog); > setup_capabilities(g_strdup(lo->modcaps)); > } > @@ -2820,6 +2872,7 @@ int main(int argc, char *argv[]) > struct fuse_session *se; > struct fuse_cmdline_opts opts; > struct lo_data lo = { > + .sandbox = SANDBOX_NAMESPACE, > .debug = 0, > .writeback = 0, > .posix_lock = 1, > diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst > index 824e713491..40629f95ae 100644 > --- a/docs/tools/virtiofsd.rst > +++ b/docs/tools/virtiofsd.rst > @@ -17,13 +17,24 @@ This program is designed to work with QEMU's ``--device vhost-user-fs-pci`` > but should work with any virtual machine monitor (VMM) that supports > vhost-user. See the Examples section below. > > -This program must be run as the root user. Upon startup the program will > -switch into a new file system namespace with the shared directory tree as its > -root. This prevents "file system escapes" due to symlinks and other file > -system objects that might lead to files outside the shared directory. The > -program also sandboxes itself using seccomp(2) to prevent ptrace(2) and other > -vectors that could allow an attacker to compromise the system after gaining > -control of the virtiofsd process. > +This program must be run as the root user. The program drops privileges where > +possible during startup although it must be able to create and access files > +with any uid/gid: > + > +* The ability to invoke syscalls is limited using seccomp(2). > +* Linux capabilities(7) are dropped. > + > +In "namespace" sandbox mode the program switches into a new file system > +namespace and invokes pivot_root(2) to make the shared directory tree its root. > +A new pid and net namespace is also created to isolate the process. > + > +In "chroot" sandbox mode the program invokes chroot(2) to make the shared > +directory tree its root. This mode is intended for container environments where > +the container runtime has already set up the namespaces and the program does > +not have permission to create namespaces itself. > + > +Both sandbox modes prevent "file system escapes" due to symlinks and other file > +system objects that might lead to files outside the shared directory. > > Options > ------- > @@ -72,6 +83,13 @@ Options > * readdirplus|no_readdirplus - > Enable/disable readdirplus. The default is ``readdirplus``. > > + * sandbox=namespace|chroot - > + Sandbox mode: > + - namespace: Create mount, pid, and net namespaces and pivot_root(2) into > + the shared directory. > + - chroot: chroot(2) into shared directory (use in containers). > + The default is "namespace". > + > * source=PATH - > Share host directory tree located at PATH. This option is required. > > -- > 2.26.2 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi @ 2020-07-27 19:02 ` Stefan Hajnoczi 2020-07-28 1:05 ` misono.tomohiro 2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert 3 siblings, 1 reply; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-27 19:02 UTC (permalink / raw) To: qemu-devel Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs, Misono Tomohiro, Stefan Hajnoczi, mpatel, vgoyal An assertion failure is raised during request processing if unshare(CLONE_FS) fails. Implement a probe at startup so the problem can be detected right away. Unfortunately Docker/Moby does not include unshare in the seccomp.json list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always include unshare (e.g. podman is unaffected): https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the default seccomp.json is missing unshare. Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c index 3b6d16a041..9e5537506c 100644 --- a/tools/virtiofsd/fuse_virtio.c +++ b/tools/virtiofsd/fuse_virtio.c @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) { int ret; + /* + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's + * an unprivileged system call but some Docker/Moby versions are known to + * reject it via seccomp when CAP_SYS_ADMIN is not given. + * + * Note that the program is single-threaded here so this syscall has no + * visible effect and is safe to make. + */ + ret = unshare(CLONE_FS); + if (ret == -1 && errno == EPERM) { + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " + "running in a container please check that the container " + "runtime seccomp policy allows unshare.\n"); + return -1; + } + ret = fv_create_listen_socket(se); if (ret < 0) { return ret; -- 2.26.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* RE: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi @ 2020-07-28 1:05 ` misono.tomohiro 2020-07-28 10:00 ` Roman Mohr 2020-08-07 15:29 ` Dr. David Alan Gilbert 0 siblings, 2 replies; 23+ messages in thread From: misono.tomohiro @ 2020-07-28 1:05 UTC (permalink / raw) To: 'Stefan Hajnoczi', qemu-devel Cc: vromanso, Daniel Walsh, Dr. David Alan Gilbert, rmohr, virtio-fs, mpatel, vgoyal > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error > > An assertion failure is raised during request processing if > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > be detected right away. > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > include unshare (e.g. podman is unaffected): > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > default seccomp.json is missing unshare. Hi, sorry for a bit late. unshare() was added to fix xattr problem: https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know if xattr is enabled or disabled in fv_queue_worker(), right? So, it looks good to me. Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Regards, Misono > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c > index 3b6d16a041..9e5537506c 100644 > --- a/tools/virtiofsd/fuse_virtio.c > +++ b/tools/virtiofsd/fuse_virtio.c > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > { > int ret; > > + /* > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's > + * an unprivileged system call but some Docker/Moby versions are known to > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > + * > + * Note that the program is single-threaded here so this syscall has no > + * visible effect and is safe to make. > + */ > + ret = unshare(CLONE_FS); > + if (ret == -1 && errno == EPERM) { > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " > + "running in a container please check that the container " > + "runtime seccomp policy allows unshare.\n"); > + return -1; > + } > + > ret = fv_create_listen_socket(se); > if (ret < 0) { > return ret; > -- > 2.26.2 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 1:05 ` misono.tomohiro @ 2020-07-28 10:00 ` Roman Mohr 2020-07-28 13:12 ` Vivek Goyal 2020-07-28 15:32 ` Stefan Hajnoczi 2020-08-07 15:29 ` Dr. David Alan Gilbert 1 sibling, 2 replies; 23+ messages in thread From: Roman Mohr @ 2020-07-28 10:00 UTC (permalink / raw) To: misono.tomohiro Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, mpatel, vgoyal [-- Attachment #1: Type: text/plain, Size: 3114 bytes --] On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < misono.tomohiro@fujitsu.com> wrote: > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > error > > > > An assertion failure is raised during request processing if > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > be detected right away. > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > include unshare (e.g. podman is unaffected): > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > default seccomp.json is missing unshare. > > Hi, sorry for a bit late. > > unshare() was added to fix xattr problem: > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > In theory we don't need to call unshare if xattr is disabled, but it is > hard to get to know > if xattr is enabled or disabled in fv_queue_worker(), right? > > In kubevirt we want to run virtiofsd in containers. We would already not have xattr support for e.g. overlayfs in the VM after this patch series (an acceptable con at least for us right now). If we can get rid of the unshare (and potentially of needing root) that would be great. We always assume that everything which we run in containers should work for cri-o and docker. "Just" pointing docker to a different seccomp.json file is something which k8s users/admin in many cases can't do. Best Regards, Roman > So, it looks good to me. > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Regards, > Misono > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > b/tools/virtiofsd/fuse_virtio.c > > index 3b6d16a041..9e5537506c 100644 > > --- a/tools/virtiofsd/fuse_virtio.c > > +++ b/tools/virtiofsd/fuse_virtio.c > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > { > > int ret; > > > > + /* > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > it. It's > > + * an unprivileged system call but some Docker/Moby versions are > known to > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > + * > > + * Note that the program is single-threaded here so this syscall > has no > > + * visible effect and is safe to make. > > + */ > > + ret = unshare(CLONE_FS); > > + if (ret == -1 && errno == EPERM) { > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > " > > + "running in a container please check that the container > " > > + "runtime seccomp policy allows unshare.\n"); > > + return -1; > > + } > > + > > ret = fv_create_listen_socket(se); > > if (ret < 0) { > > return ret; > > -- > > 2.26.2 > > [-- Attachment #2: Type: text/html, Size: 4612 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 10:00 ` Roman Mohr @ 2020-07-28 13:12 ` Vivek Goyal 2020-07-28 15:52 ` Daniel P. Berrangé ` (2 more replies) 2020-07-28 15:32 ` Stefan Hajnoczi 1 sibling, 3 replies; 23+ messages in thread From: Vivek Goyal @ 2020-07-28 13:12 UTC (permalink / raw) To: Roman Mohr Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, misono.tomohiro, mpatel On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > misono.tomohiro@fujitsu.com> wrote: > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > error > > > > > > An assertion failure is raised during request processing if > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > be detected right away. > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > include unshare (e.g. podman is unaffected): > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > default seccomp.json is missing unshare. > > > > Hi, sorry for a bit late. > > > > unshare() was added to fix xattr problem: > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > In theory we don't need to call unshare if xattr is disabled, but it is > > hard to get to know > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > have xattr support for e.g. overlayfs in the VM after this patch series (an > acceptable con at least for us right now). > If we can get rid of the unshare (and potentially of needing root) that > would be great. We always assume that everything which we run in containers > should work for cri-o and docker. But cri-o and docker containers run as root, isn't it? (or atleast have the capability to run as root). Havind said that, it will be nice to be able to run virtiofsd without root. There are few hurdles though. - For file creation, we switch uid/gid (seteuid/setegid) and that seems to require root. If we were to run unpriviliged, probably all files on host will have to be owned by unpriviliged user and guest visible uid/gid will have to be stored in xattrs. I think virtfs supports something similar. I am sure there are other restrictions but this probably is the biggest one to overcome. > > "Just" pointing docker to a different seccomp.json file is something which > k8s users/admin in many cases can't do. Or may be issue is that standard seccomp.json does not allow unshare() and hence you are forced to use a non-standar seccomp.json. Vivek > > Best Regards, > Roman > > > > So, it looks good to me. > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > Regards, > > Misono > > > > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > > 1 file changed, 16 insertions(+) > > > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > > b/tools/virtiofsd/fuse_virtio.c > > > index 3b6d16a041..9e5537506c 100644 > > > --- a/tools/virtiofsd/fuse_virtio.c > > > +++ b/tools/virtiofsd/fuse_virtio.c > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > > { > > > int ret; > > > > > > + /* > > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > > it. It's > > > + * an unprivileged system call but some Docker/Moby versions are > > known to > > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > > + * > > > + * Note that the program is single-threaded here so this syscall > > has no > > > + * visible effect and is safe to make. > > > + */ > > > + ret = unshare(CLONE_FS); > > > + if (ret == -1 && errno == EPERM) { > > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > > " > > > + "running in a container please check that the container > > " > > > + "runtime seccomp policy allows unshare.\n"); > > > + return -1; > > > + } > > > + > > > ret = fv_create_listen_socket(se); > > > if (ret < 0) { > > > return ret; > > > -- > > > 2.26.2 > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 13:12 ` Vivek Goyal @ 2020-07-28 15:52 ` Daniel P. Berrangé 2020-07-28 20:54 ` Vivek Goyal 2020-07-28 19:12 ` Daniel Walsh 2020-07-29 7:59 ` Roman Mohr 2 siblings, 1 reply; 23+ messages in thread From: Daniel P. Berrangé @ 2020-07-28 15:52 UTC (permalink / raw) To: Vivek Goyal Cc: vromanso, mpatel, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, misono.tomohiro, Roman Mohr On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > misono.tomohiro@fujitsu.com> wrote: > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > > error > > > > > > > > An assertion failure is raised during request processing if > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > > be detected right away. > > > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > include unshare (e.g. podman is unaffected): > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > > default seccomp.json is missing unshare. > > > > > > Hi, sorry for a bit late. > > > > > > unshare() was added to fix xattr problem: > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > hard to get to know > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > have xattr support for e.g. overlayfs in the VM after this patch series (an > > acceptable con at least for us right now). > > If we can get rid of the unshare (and potentially of needing root) that > > would be great. We always assume that everything which we run in containers > > should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be able > to run virtiofsd without root. > > There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. I think I've mentioned before, 9p virtfs supports different modes, passthrough, squashed or remapped. passthrough should be reasonably straightforward to support in virtiofs. The guest sees all the host UID/GIDs ownership as normal, and can read any files the host user can read, but are obviously restricted to write to only the files that host user can write too. No DAC-OVERRIDE facility in essence. You'll just get EPERM, which is fine. This simple passthrough scenario would be just what's desired for a typical desktop virt use cases, where you want to share part/all of your home dir with a guest for easy file access. Personally this is the mode I'd be most interested in seeing provided for unprivileged virtiofsd usage. squash is similar to passthrough, except the guest sees everything as owned by the same user. This can be surprising as the guest might see a file owned by them, but not be able to write to it, as on the host its actually owned by some other user. Fairly niche use case I think. remapping would be needed for a more general purpose use cases allowing the guest to do arbitrary UID/GID changes, but on the host everything is still stored as one user and remapped somehow. The main challenge for all the unprivileged scenarios is safety of the sandbox, to avoid risk of guests escaping to access files outside of the exported dir via symlink attacks or similar. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 15:52 ` Daniel P. Berrangé @ 2020-07-28 20:54 ` Vivek Goyal 0 siblings, 0 replies; 23+ messages in thread From: Vivek Goyal @ 2020-07-28 20:54 UTC (permalink / raw) To: Daniel P. Berrangé Cc: vromanso, mpatel, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, misono.tomohiro, Roman Mohr On Tue, Jul 28, 2020 at 04:52:33PM +0100, Daniel P. Berrangé wrote: > On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > misono.tomohiro@fujitsu.com> wrote: > > > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > > > error > > > > > > > > > > An assertion failure is raised during request processing if > > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > > > be detected right away. > > > > > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > > include unshare (e.g. podman is unaffected): > > > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > > > default seccomp.json is missing unshare. > > > > > > > > Hi, sorry for a bit late. > > > > > > > > unshare() was added to fix xattr problem: > > > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > > hard to get to know > > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > > have xattr support for e.g. overlayfs in the VM after this patch series (an > > > acceptable con at least for us right now). > > > If we can get rid of the unshare (and potentially of needing root) that > > > would be great. We always assume that everything which we run in containers > > > should work for cri-o and docker. > > > > But cri-o and docker containers run as root, isn't it? (or atleast have > > the capability to run as root). Havind said that, it will be nice to be able > > to run virtiofsd without root. > > > > There are few hurdles though. > > > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > > to require root. If we were to run unpriviliged, probably all files > > on host will have to be owned by unpriviliged user and guest visible > > uid/gid will have to be stored in xattrs. I think virtfs supports > > something similar. > > I think I've mentioned before, 9p virtfs supports different modes, > passthrough, squashed or remapped. > > passthrough should be reasonably straightforward to support in virtiofs. > The guest sees all the host UID/GIDs ownership as normal, and can read > any files the host user can read, but are obviously restricted to write > to only the files that host user can write too. No DAC-OVERRIDE facility > in essence. You'll just get EPERM, which is fine. This simple passthrough > scenario would be just what's desired for a typical desktop virt use > cases, where you want to share part/all of your home dir with a guest for > easy file access. Personally this is the mode I'd be most interested in > seeing provided for unprivileged virtiofsd usage. Interesting. So passthrough will have two sub modes. priviliged and unpriviliged. As of now we support priviliged passthrough. I guess it does make sense to look into unpriviliged passthrough and see what other operations will not be allowed. Thanks Vivek > > squash is similar to passthrough, except the guest sees everything > as owned by the same user. This can be surprising as the guest might > see a file owned by them, but not be able to write to it, as on the > host its actually owned by some other user. Fairly niche use case > I think. > > remapping would be needed for a more general purpose use cases > allowing the guest to do arbitrary UID/GID changes, but on the host > everything is still stored as one user and remapped somehow. > > The main challenge for all the unprivileged scenarios is safety of > the sandbox, to avoid risk of guests escaping to access files outside > of the exported dir via symlink attacks or similar. > > > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 13:12 ` Vivek Goyal 2020-07-28 15:52 ` Daniel P. Berrangé @ 2020-07-28 19:12 ` Daniel Walsh 2020-07-28 21:01 ` Vivek Goyal 2020-07-29 7:59 ` Roman Mohr 2 siblings, 1 reply; 23+ messages in thread From: Daniel Walsh @ 2020-07-28 19:12 UTC (permalink / raw) To: Vivek Goyal, Roman Mohr Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, misono.tomohiro, mpatel On 7/28/20 09:12, Vivek Goyal wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >> misono.tomohiro@fujitsu.com> wrote: >> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an >>> error >>>> An assertion failure is raised during request processing if >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can >>>> be detected right away. >>>> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always >>>> include unshare (e.g. podman is unaffected): >>>> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the >>>> default seccomp.json is missing unshare. >>> Hi, sorry for a bit late. >>> >>> unshare() was added to fix xattr problem: >>> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# >>> In theory we don't need to call unshare if xattr is disabled, but it is >>> hard to get to know >>> if xattr is enabled or disabled in fv_queue_worker(), right? >>> >>> >> In kubevirt we want to run virtiofsd in containers. We would already not >> have xattr support for e.g. overlayfs in the VM after this patch series (an >> acceptable con at least for us right now). >> If we can get rid of the unshare (and potentially of needing root) that >> would be great. We always assume that everything which we run in containers >> should work for cri-o and docker. > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be able > to run virtiofsd without root. > > There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. > > I am sure there are other restrictions but this probably is the biggest > one to overcome. > > > You should be able to run it within a user namespace with Namespaces capabilities. >> "Just" pointing docker to a different seccomp.json file is something which >> k8s users/admin in many cases can't do. > Or may be issue is that standard seccomp.json does not allow unshare() > and hence you are forced to use a non-standar seccomp.json. > > Vivek > >> Best Regards, >> Roman >> >> >>> So, it looks good to me. >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> >>> >>> Regards, >>> Misono >>> >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> >>>> --- >>>> tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ >>>> 1 file changed, 16 insertions(+) >>>> >>>> diff --git a/tools/virtiofsd/fuse_virtio.c >>> b/tools/virtiofsd/fuse_virtio.c >>>> index 3b6d16a041..9e5537506c 100644 >>>> --- a/tools/virtiofsd/fuse_virtio.c >>>> +++ b/tools/virtiofsd/fuse_virtio.c >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) >>>> { >>>> int ret; >>>> >>>> + /* >>>> + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need >>> it. It's >>>> + * an unprivileged system call but some Docker/Moby versions are >>> known to >>>> + * reject it via seccomp when CAP_SYS_ADMIN is not given. >>>> + * >>>> + * Note that the program is single-threaded here so this syscall >>> has no >>>> + * visible effect and is safe to make. >>>> + */ >>>> + ret = unshare(CLONE_FS); >>>> + if (ret == -1 && errno == EPERM) { >>>> + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If >>> " >>>> + "running in a container please check that the container >>> " >>>> + "runtime seccomp policy allows unshare.\n"); >>>> + return -1; >>>> + } >>>> + >>>> ret = fv_create_listen_socket(se); >>>> if (ret < 0) { >>>> return ret; >>>> -- >>>> 2.26.2 >>> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 19:12 ` Daniel Walsh @ 2020-07-28 21:01 ` Vivek Goyal 0 siblings, 0 replies; 23+ messages in thread From: Vivek Goyal @ 2020-07-28 21:01 UTC (permalink / raw) To: Daniel Walsh Cc: vromanso, mpatel, Dr. David Alan Gilbert, qemu-devel, virtio-fs, Stefan Hajnoczi, misono.tomohiro, Roman Mohr On Tue, Jul 28, 2020 at 03:12:54PM -0400, Daniel Walsh wrote: > On 7/28/20 09:12, Vivek Goyal wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >> misono.tomohiro@fujitsu.com> wrote: > >> > >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > >>> error > >>>> An assertion failure is raised during request processing if > >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > >>>> be detected right away. > >>>> > >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json > >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > >>>> include unshare (e.g. podman is unaffected): > >>>> > >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > >>>> default seccomp.json is missing unshare. > >>> Hi, sorry for a bit late. > >>> > >>> unshare() was added to fix xattr problem: > >>> > >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > >>> In theory we don't need to call unshare if xattr is disabled, but it is > >>> hard to get to know > >>> if xattr is enabled or disabled in fv_queue_worker(), right? > >>> > >>> > >> In kubevirt we want to run virtiofsd in containers. We would already not > >> have xattr support for e.g. overlayfs in the VM after this patch series (an > >> acceptable con at least for us right now). > >> If we can get rid of the unshare (and potentially of needing root) that > >> would be great. We always assume that everything which we run in containers > >> should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > > the capability to run as root). Havind said that, it will be nice to be able > > to run virtiofsd without root. > > > > There are few hurdles though. > > > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > > to require root. If we were to run unpriviliged, probably all files > > on host will have to be owned by unpriviliged user and guest visible > > uid/gid will have to be stored in xattrs. I think virtfs supports > > something similar. > > > > I am sure there are other restrictions but this probably is the biggest > > one to overcome. > > > > > > You should be able to run it within a user namespace with Namespaces > capabilities. User namespaces has always been a one TODO item. Few challenges are how to manage uid/gid mapping for existing directories which will be shared. We will have to pick a mapping range and then chown the files accordingly. And chowning itself will require priviliges. Now Stefan is adding support to run virtiofsd inside container. So podman should be able virtiofsd inside a user namespace (And even give CAP_SYS_ADMIN). That way we don't have to worry about setting a user namespace and select a mapping etc. But persistent data volumes will continue to be an issue with user namespace, right? How are you dealing with it in podman. Containers running in user namespace and with volume mounts. Vivek > >> "Just" pointing docker to a different seccomp.json file is something which > >> k8s users/admin in many cases can't do. > > Or may be issue is that standard seccomp.json does not allow unshare() > > and hence you are forced to use a non-standar seccomp.json. > > > > Vivek > > > >> Best Regards, > >> Roman > >> > >> > >>> So, it looks good to me. > >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > >>> > >>> Regards, > >>> Misono > >>> > >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>> --- > >>>> tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > >>>> 1 file changed, 16 insertions(+) > >>>> > >>>> diff --git a/tools/virtiofsd/fuse_virtio.c > >>> b/tools/virtiofsd/fuse_virtio.c > >>>> index 3b6d16a041..9e5537506c 100644 > >>>> --- a/tools/virtiofsd/fuse_virtio.c > >>>> +++ b/tools/virtiofsd/fuse_virtio.c > >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > >>>> { > >>>> int ret; > >>>> > >>>> + /* > >>>> + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > >>> it. It's > >>>> + * an unprivileged system call but some Docker/Moby versions are > >>> known to > >>>> + * reject it via seccomp when CAP_SYS_ADMIN is not given. > >>>> + * > >>>> + * Note that the program is single-threaded here so this syscall > >>> has no > >>>> + * visible effect and is safe to make. > >>>> + */ > >>>> + ret = unshare(CLONE_FS); > >>>> + if (ret == -1 && errno == EPERM) { > >>>> + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > >>> " > >>>> + "running in a container please check that the container > >>> " > >>>> + "runtime seccomp policy allows unshare.\n"); > >>>> + return -1; > >>>> + } > >>>> + > >>>> ret = fv_create_listen_socket(se); > >>>> if (ret < 0) { > >>>> return ret; > >>>> -- > >>>> 2.26.2 > >>> > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 13:12 ` Vivek Goyal 2020-07-28 15:52 ` Daniel P. Berrangé 2020-07-28 19:12 ` Daniel Walsh @ 2020-07-29 7:59 ` Roman Mohr 2020-07-29 14:40 ` Stefan Hajnoczi 2 siblings, 1 reply; 23+ messages in thread From: Roman Mohr @ 2020-07-29 7:59 UTC (permalink / raw) To: Vivek Goyal Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi, misono.tomohiro, mpatel [-- Attachment #1: Type: text/plain, Size: 5457 bytes --] On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > misono.tomohiro@fujitsu.com> wrote: > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > an > > > error > > > > > > > > An assertion failure is raised during request processing if > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem > can > > > > be detected right away. > > > > > > > > Unfortunately Docker/Moby does not include unshare in the > seccomp.json > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > include unshare (e.g. podman is unaffected): > > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if > the > > > > default seccomp.json is missing unshare. > > > > > > Hi, sorry for a bit late. > > > > > > unshare() was added to fix xattr problem: > > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > hard to get to know > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > have xattr support for e.g. overlayfs in the VM after this patch series > (an > > acceptable con at least for us right now). > > If we can get rid of the unshare (and potentially of needing root) that > > would be great. We always assume that everything which we run in > containers > > should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be > able > to run virtiofsd without root. > Yes they can run as root. I can tell you what we plan to do with the containerized virtiofsd: We run it as part of the user-owned pod (a set of containers). One of our main goals at the moment is to run VMs in a user-owned pod without additional privileges. So that in case the user (VM-creator/owner) enters the pod or something breaks out of the VM they are just in the unprivileged container sandbox. As part of that we try to get also rid of running containers in the user-context with the root user. One possible scenario which I could think of as being desirable from a kubevirt perspective: We would run the VM in one container and have an unprivileged virtiofsd container in parallel. This container already has its own mount namespace and it is not that critical if something manages to enter this sandbox. But we are not as far yet as getting completely rid of root right now in kubevirt, so if as a temporary step it needs root, the current proposed changes would still be very useful for us. Best Regards, Roman There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. > I am sure there are other restrictions but this probably is the biggest > one to overcome. > > > > > "Just" pointing docker to a different seccomp.json file is something > which > > k8s users/admin in many cases can't do. > > Or may be issue is that standard seccomp.json does not allow unshare() > and hence you are forced to use a non-standar seccomp.json. > > Vivek > > > > > Best Regards, > > Roman > > > > > > > So, it looks good to me. > > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > > > Regards, > > > Misono > > > > > > > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > --- > > > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > > > 1 file changed, 16 insertions(+) > > > > > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > > > b/tools/virtiofsd/fuse_virtio.c > > > > index 3b6d16a041..9e5537506c 100644 > > > > --- a/tools/virtiofsd/fuse_virtio.c > > > > +++ b/tools/virtiofsd/fuse_virtio.c > > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session > *se) > > > > { > > > > int ret; > > > > > > > > + /* > > > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will > need > > > it. It's > > > > + * an unprivileged system call but some Docker/Moby versions are > > > known to > > > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > > > + * > > > > + * Note that the program is single-threaded here so this syscall > > > has no > > > > + * visible effect and is safe to make. > > > > + */ > > > > + ret = unshare(CLONE_FS); > > > > + if (ret == -1 && errno == EPERM) { > > > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with > EPERM. If > > > " > > > > + "running in a container please check that the > container > > > " > > > > + "runtime seccomp policy allows unshare.\n"); > > > > + return -1; > > > > + } > > > > + > > > > ret = fv_create_listen_socket(se); > > > > if (ret < 0) { > > > > return ret; > > > > -- > > > > 2.26.2 > > > > > > > > [-- Attachment #2: Type: text/html, Size: 7863 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-29 7:59 ` Roman Mohr @ 2020-07-29 14:40 ` Stefan Hajnoczi 2020-07-30 22:21 ` Daniel Walsh 0 siblings, 1 reply; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-29 14:40 UTC (permalink / raw) To: Roman Mohr Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 2110 bytes --] On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > misono.tomohiro@fujitsu.com> wrote: > > > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > > an > > > > error > Yes they can run as root. I can tell you what we plan to do with the > containerized virtiofsd: We run it as part of the user-owned pod (a set of > containers). > One of our main goals at the moment is to run VMs in a user-owned pod > without additional privileges. > So that in case the user (VM-creator/owner) enters the pod or something > breaks out of the VM they are just in the unprivileged container sandbox. > As part of that we try to get also rid of running containers in the > user-context with the root user. > > One possible scenario which I could think of as being desirable from a > kubevirt perspective: > We would run the VM in one container and have an unprivileged > virtiofsd container in parallel. > This container already has its own mount namespace and it is not that > critical if something manages to enter this sandbox. > > But we are not as far yet as getting completely rid of root right now in > kubevirt, so if as a temporary step it needs root, the current proposed > changes would still be very useful for us. What is the issue with root in user namespaces? I remember a few years ago it was seen as a major security issue but don't remember if container runtimes were already using user namespaces back then. I guess the goal might be simply to minimize Linux capabilities as much as possible? virtiofsd could nominally run with an arbitrary uid/gid but it still needs the Linux capabilities that allow it to change uid/gid and override file system permission checks just like the root user. Not sure if there is any advantage to running with uid 1000 when you still have these Linux capabilities. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-29 14:40 ` Stefan Hajnoczi @ 2020-07-30 22:21 ` Daniel Walsh 2020-07-31 8:26 ` Stefan Hajnoczi 0 siblings, 1 reply; 23+ messages in thread From: Daniel Walsh @ 2020-07-30 22:21 UTC (permalink / raw) To: Stefan Hajnoczi, Roman Mohr Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, Vivek Goyal [-- Attachment #1.1: Type: text/plain, Size: 2553 bytes --] On 7/29/20 10:40, Stefan Hajnoczi wrote: > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: >> >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >>>> misono.tomohiro@fujitsu.com> wrote: >>>> >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print >>> an >>>>> error >> Yes they can run as root. I can tell you what we plan to do with the >> containerized virtiofsd: We run it as part of the user-owned pod (a set of >> containers). >> One of our main goals at the moment is to run VMs in a user-owned pod >> without additional privileges. >> So that in case the user (VM-creator/owner) enters the pod or something >> breaks out of the VM they are just in the unprivileged container sandbox. >> As part of that we try to get also rid of running containers in the >> user-context with the root user. >> >> One possible scenario which I could think of as being desirable from a >> kubevirt perspective: >> We would run the VM in one container and have an unprivileged >> virtiofsd container in parallel. >> This container already has its own mount namespace and it is not that >> critical if something manages to enter this sandbox. >> >> But we are not as far yet as getting completely rid of root right now in >> kubevirt, so if as a temporary step it needs root, the current proposed >> changes would still be very useful for us. > What is the issue with root in user namespaces? > > I remember a few years ago it was seen as a major security issue but > don't remember if container runtimes were already using user namespaces > back then. > > I guess the goal might be simply to minimize Linux capabilities as much > as possible? > > virtiofsd could nominally run with an arbitrary uid/gid but it still > needs the Linux capabilities that allow it to change uid/gid and > override file system permission checks just like the root user. Not sure > if there is any advantage to running with uid 1000 when you still have > these Linux capabilities. > > Stefan When you run in a user namespace, virtiofsd would only have setuid/setgid over the range of UIDs mapped into the user namespace. So if UID=0 on the host is not mapped, then the container can not create real UID=0 files on disk. Similarly you can protect the user directories and any content by running the containers in a really high UID Mapping. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-30 22:21 ` Daniel Walsh @ 2020-07-31 8:26 ` Stefan Hajnoczi 2020-07-31 8:39 ` Roman Mohr 0 siblings, 1 reply; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-31 8:26 UTC (permalink / raw) To: Roman Mohr Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 2811 bytes --] On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > >> > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >>>> misono.tomohiro@fujitsu.com> wrote: > >>>> > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > >>> an > >>>>> error > >> Yes they can run as root. I can tell you what we plan to do with the > >> containerized virtiofsd: We run it as part of the user-owned pod (a set of > >> containers). > >> One of our main goals at the moment is to run VMs in a user-owned pod > >> without additional privileges. > >> So that in case the user (VM-creator/owner) enters the pod or something > >> breaks out of the VM they are just in the unprivileged container sandbox. > >> As part of that we try to get also rid of running containers in the > >> user-context with the root user. > >> > >> One possible scenario which I could think of as being desirable from a > >> kubevirt perspective: > >> We would run the VM in one container and have an unprivileged > >> virtiofsd container in parallel. > >> This container already has its own mount namespace and it is not that > >> critical if something manages to enter this sandbox. > >> > >> But we are not as far yet as getting completely rid of root right now in > >> kubevirt, so if as a temporary step it needs root, the current proposed > >> changes would still be very useful for us. > > What is the issue with root in user namespaces? > > > > I remember a few years ago it was seen as a major security issue but > > don't remember if container runtimes were already using user namespaces > > back then. > > > > I guess the goal might be simply to minimize Linux capabilities as much > > as possible? > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > needs the Linux capabilities that allow it to change uid/gid and > > override file system permission checks just like the root user. Not sure > > if there is any advantage to running with uid 1000 when you still have > > these Linux capabilities. > > > > Stefan > > When you run in a user namespace, virtiofsd would only have > setuid/setgid over the range of UIDs mapped into the user namespace. So > if UID=0 on the host is not mapped, then the container can not create > real UID=0 files on disk. > > Similarly you can protect the user directories and any content by > running the containers in a really high UID Mapping. Roman, do user namespaces address your concerns about uid 0 in containers? Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-31 8:26 ` Stefan Hajnoczi @ 2020-07-31 8:39 ` Roman Mohr 2020-07-31 14:11 ` Stefan Hajnoczi 0 siblings, 1 reply; 23+ messages in thread From: Roman Mohr @ 2020-07-31 8:39 UTC (permalink / raw) To: Stefan Hajnoczi Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 4089 bytes --] On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com> wrote: > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> > wrote: > > >> > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > >>>> misono.tomohiro@fujitsu.com> wrote: > > >>>> > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and > print > > >>> an > > >>>>> error > > >> Yes they can run as root. I can tell you what we plan to do with the > > >> containerized virtiofsd: We run it as part of the user-owned pod (a > set of > > >> containers). > > >> One of our main goals at the moment is to run VMs in a user-owned pod > > >> without additional privileges. > > >> So that in case the user (VM-creator/owner) enters the pod or > something > > >> breaks out of the VM they are just in the unprivileged container > sandbox. > > >> As part of that we try to get also rid of running containers in the > > >> user-context with the root user. > > >> > > >> One possible scenario which I could think of as being desirable from a > > >> kubevirt perspective: > > >> We would run the VM in one container and have an unprivileged > > >> virtiofsd container in parallel. > > >> This container already has its own mount namespace and it is not that > > >> critical if something manages to enter this sandbox. > > >> > > >> But we are not as far yet as getting completely rid of root right now > in > > >> kubevirt, so if as a temporary step it needs root, the current > proposed > > >> changes would still be very useful for us. > > > What is the issue with root in user namespaces? > > > > > > I remember a few years ago it was seen as a major security issue but > > > don't remember if container runtimes were already using user namespaces > > > back then. > > > > > > I guess the goal might be simply to minimize Linux capabilities as much > > > as possible? > > > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > > needs the Linux capabilities that allow it to change uid/gid and > > > override file system permission checks just like the root user. Not > sure > > > if there is any advantage to running with uid 1000 when you still have > > > these Linux capabilities. > > > > > > Stefan > > > > When you run in a user namespace, virtiofsd would only have > > setuid/setgid over the range of UIDs mapped into the user namespace. So > > if UID=0 on the host is not mapped, then the container can not create > > real UID=0 files on disk. > > > > Similarly you can protect the user directories and any content by > > running the containers in a really high UID Mapping. > > Roman, do user namespaces address your concerns about uid 0 in > containers? > They may eventually solve it. I would not let us hang up on this right now, since as said at least in kubevirt we can't get rid right now of root anyway. Even if it is at some point in the future save and supported on bleeding-edge managed k8s clusters to allow ordinary users to run with uid 0, from my perspective it is right now common to restrict namespaces with PodSecurityPolicies or SecurityContexts to not allow running pods as root for normal users. It is also common that a significant part of the community users run docker and/or run on managed k8s clusters where they can not influence if user-namespaces are enabled, if they can run pods as root, if the runtime points to a seccomp file they like or if the runtime they prefer is used. But let me repeat again that we require root right now anyway and that we don't run the pods right now with the user privileges (but we should and we aim for that). Right now PSPs and SCCs restrict access to these pods by the users. So for our use case, at this exact moment root is acceptable, the unshare call is a little bit more problematic. Best Regards, Roman > > Stefan > [-- Attachment #2: Type: text/html, Size: 5424 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-31 8:39 ` Roman Mohr @ 2020-07-31 14:11 ` Stefan Hajnoczi 0 siblings, 0 replies; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-31 14:11 UTC (permalink / raw) To: Roman Mohr Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, Vivek Goyal [-- Attachment #1: Type: text/plain, Size: 4428 bytes --] On Fri, Jul 31, 2020 at 10:39:37AM +0200, Roman Mohr wrote: > On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com> > wrote: > > > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > > > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> > > wrote: > > > >> > > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > >>>> misono.tomohiro@fujitsu.com> wrote: > > > >>>> > > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and > > print > > > >>> an > > > >>>>> error > > > >> Yes they can run as root. I can tell you what we plan to do with the > > > >> containerized virtiofsd: We run it as part of the user-owned pod (a > > set of > > > >> containers). > > > >> One of our main goals at the moment is to run VMs in a user-owned pod > > > >> without additional privileges. > > > >> So that in case the user (VM-creator/owner) enters the pod or > > something > > > >> breaks out of the VM they are just in the unprivileged container > > sandbox. > > > >> As part of that we try to get also rid of running containers in the > > > >> user-context with the root user. > > > >> > > > >> One possible scenario which I could think of as being desirable from a > > > >> kubevirt perspective: > > > >> We would run the VM in one container and have an unprivileged > > > >> virtiofsd container in parallel. > > > >> This container already has its own mount namespace and it is not that > > > >> critical if something manages to enter this sandbox. > > > >> > > > >> But we are not as far yet as getting completely rid of root right now > > in > > > >> kubevirt, so if as a temporary step it needs root, the current > > proposed > > > >> changes would still be very useful for us. > > > > What is the issue with root in user namespaces? > > > > > > > > I remember a few years ago it was seen as a major security issue but > > > > don't remember if container runtimes were already using user namespaces > > > > back then. > > > > > > > > I guess the goal might be simply to minimize Linux capabilities as much > > > > as possible? > > > > > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > > > needs the Linux capabilities that allow it to change uid/gid and > > > > override file system permission checks just like the root user. Not > > sure > > > > if there is any advantage to running with uid 1000 when you still have > > > > these Linux capabilities. > > > > > > > > Stefan > > > > > > When you run in a user namespace, virtiofsd would only have > > > setuid/setgid over the range of UIDs mapped into the user namespace. So > > > if UID=0 on the host is not mapped, then the container can not create > > > real UID=0 files on disk. > > > > > > Similarly you can protect the user directories and any content by > > > running the containers in a really high UID Mapping. > > > > Roman, do user namespaces address your concerns about uid 0 in > > containers? > > > > They may eventually solve it. I would not let us hang up on this right now, > since as said at least in kubevirt we can't get rid right now of root > anyway. > Even if it is at some point in the future save and supported on > bleeding-edge managed k8s clusters to allow ordinary users to run with uid > 0, from my perspective it is right now common to restrict namespaces with > PodSecurityPolicies or SecurityContexts to not allow running pods as root > for normal users. > It is also common that a significant part of the community users run docker > and/or run on managed k8s clusters where they can not influence if > user-namespaces are enabled, if they can run pods as root, if the runtime > points to a seccomp file they like or if the runtime they prefer is used. > > But let me repeat again that we require root right now anyway and that we > don't run the pods right now with the user privileges (but we should and we > aim for that). Right now PSPs and SCCs restrict access to these pods by the > users. > So for our use case, at this exact moment root is acceptable, the unshare > call is a little bit more problematic. Okay, thanks for explaining. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 10:00 ` Roman Mohr 2020-07-28 13:12 ` Vivek Goyal @ 2020-07-28 15:32 ` Stefan Hajnoczi 2020-07-28 19:15 ` Daniel Walsh 1 sibling, 1 reply; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-28 15:32 UTC (permalink / raw) To: Roman Mohr Cc: vromanso, Daniel Walsh, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, vgoyal [-- Attachment #1: Type: text/plain, Size: 2102 bytes --] On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > misono.tomohiro@fujitsu.com> wrote: > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > error > > > > > > An assertion failure is raised during request processing if > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > be detected right away. > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > include unshare (e.g. podman is unaffected): > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > default seccomp.json is missing unshare. > > > > Hi, sorry for a bit late. > > > > unshare() was added to fix xattr problem: > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > In theory we don't need to call unshare if xattr is disabled, but it is > > hard to get to know > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > have xattr support for e.g. overlayfs in the VM after this patch series (an > acceptable con at least for us right now). > If we can get rid of the unshare (and potentially of needing root) that > would be great. We always assume that everything which we run in containers > should work for cri-o and docker. Root is required to access files with any uid/gid. Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may be able to find a way to drop unshare (at least in containers). > "Just" pointing docker to a different seccomp.json file is something which > k8s users/admin in many cases can't do. There is a Moby PR to change the default seccomp.json file here but it's unclear if it will be merged: https://github.com/moby/moby/pull/41244 Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 15:32 ` Stefan Hajnoczi @ 2020-07-28 19:15 ` Daniel Walsh 2020-07-29 14:29 ` Stefan Hajnoczi 0 siblings, 1 reply; 23+ messages in thread From: Daniel Walsh @ 2020-07-28 19:15 UTC (permalink / raw) To: Stefan Hajnoczi, Roman Mohr Cc: vromanso, qemu-devel, Dr. David Alan Gilbert, virtio-fs, misono.tomohiro, mpatel, vgoyal [-- Attachment #1.1: Type: text/plain, Size: 2166 bytes --] On 7/28/20 11:32, Stefan Hajnoczi wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >> misono.tomohiro@fujitsu.com> wrote: >> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an >>> error >>>> An assertion failure is raised during request processing if >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can >>>> be detected right away. >>>> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always >>>> include unshare (e.g. podman is unaffected): >>>> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the >>>> default seccomp.json is missing unshare. >>> Hi, sorry for a bit late. >>> >>> unshare() was added to fix xattr problem: >>> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# >>> In theory we don't need to call unshare if xattr is disabled, but it is >>> hard to get to know >>> if xattr is enabled or disabled in fv_queue_worker(), right? >>> >>> >> In kubevirt we want to run virtiofsd in containers. We would already not >> have xattr support for e.g. overlayfs in the VM after this patch series (an >> acceptable con at least for us right now). >> If we can get rid of the unshare (and potentially of needing root) that >> would be great. We always assume that everything which we run in containers >> should work for cri-o and docker. > Root is required to access files with any uid/gid. > > Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may > be able to find a way to drop unshare (at least in containers). > >> "Just" pointing docker to a different seccomp.json file is something which >> k8s users/admin in many cases can't do. > There is a Moby PR to change the default seccomp.json file here but it's > unclear if it will be merged: > https://github.com/moby/moby/pull/41244 > > Stefan Why not try Podman? [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 19:15 ` Daniel Walsh @ 2020-07-29 14:29 ` Stefan Hajnoczi 0 siblings, 0 replies; 23+ messages in thread From: Stefan Hajnoczi @ 2020-07-29 14:29 UTC (permalink / raw) To: Daniel Walsh Cc: vromanso, mpatel, Dr. David Alan Gilbert, qemu-devel, virtio-fs, misono.tomohiro, Roman Mohr, vgoyal [-- Attachment #1: Type: text/plain, Size: 921 bytes --] On Tue, Jul 28, 2020 at 03:15:25PM -0400, Daniel Walsh wrote: > On 7/28/20 11:32, Stefan Hajnoczi wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >> misono.tomohiro@fujitsu.com> wrote: > >> > >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > >>> error > >> "Just" pointing docker to a different seccomp.json file is something which > >> k8s users/admin in many cases can't do. > > There is a Moby PR to change the default seccomp.json file here but it's > > unclear if it will be merged: > > https://github.com/moby/moby/pull/41244 > > > > Stefan > > Why not try Podman? Absolutely, Podman allows unshare(2) in its default seccomp policy so it does not have this problem. I think Roman's point was mainly about the upstream user experience where Docker is common. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error 2020-07-28 1:05 ` misono.tomohiro 2020-07-28 10:00 ` Roman Mohr @ 2020-08-07 15:29 ` Dr. David Alan Gilbert 1 sibling, 0 replies; 23+ messages in thread From: Dr. David Alan Gilbert @ 2020-08-07 15:29 UTC (permalink / raw) To: misono.tomohiro Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs, 'Stefan Hajnoczi', mpatel, vgoyal * misono.tomohiro@fujitsu.com (misono.tomohiro@fujitsu.com) wrote: > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error > > > > An assertion failure is raised during request processing if > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > be detected right away. > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > include unshare (e.g. podman is unaffected): > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > default seccomp.json is missing unshare. > > Hi, sorry for a bit late. > > unshare() was added to fix xattr problem: > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know > if xattr is enabled or disabled in fv_queue_worker(), right? > > So, it looks good to me. > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> OK, I think it might also be OK to just fail the xattr operation on a non-file/directory in this case. Dave > Regards, > Misono > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c > > index 3b6d16a041..9e5537506c 100644 > > --- a/tools/virtiofsd/fuse_virtio.c > > +++ b/tools/virtiofsd/fuse_virtio.c > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > { > > int ret; > > > > + /* > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's > > + * an unprivileged system call but some Docker/Moby versions are known to > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > + * > > + * Note that the program is single-threaded here so this syscall has no > > + * visible effect and is safe to make. > > + */ > > + ret = unshare(CLONE_FS); > > + if (ret == -1 && errno == EPERM) { > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " > > + "running in a container please check that the container " > > + "runtime seccomp policy allows unshare.\n"); > > + return -1; > > + } > > + > > ret = fv_create_listen_socket(se); > > if (ret < 0) { > > return ret; > > -- > > 2.26.2 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container 2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi ` (2 preceding siblings ...) 2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi @ 2020-08-27 18:40 ` Dr. David Alan Gilbert 3 siblings, 0 replies; 23+ messages in thread From: Dr. David Alan Gilbert @ 2020-08-27 18:40 UTC (permalink / raw) To: Stefan Hajnoczi Cc: vromanso, Daniel Walsh, qemu-devel, rmohr, virtio-fs, mpatel, vgoyal * Stefan Hajnoczi (stefanha@redhat.com) wrote: > v2: > * Update virtiofsd.rst documentation on sandboxing modes > * Change syntax to -o sandbox=namespace|chroot > * Add comment explaining that unshare(CLONE_FS) has no visible side-effect > while single-threaded > * xfstests and pjdfstest pass. Did not run tests on overlayfs because required > xattrs do not work without CAP_SYS_ADMIN. > > Mrunal and Dan: This patch series adds a sandboxing mode where virtiofsd relies > on the container runtime for isolation. It only does > chroot("path/to/shared-dir"), seccomp, and drops Linux capabilities. Previously > it created a new mount, pid, and net namespace but cannot do this without > CAP_SYS_ADMIN when run inside a container. pivot_root("path/to/shared-dir") has > been replaced with chroot("path/to/shared-dir"), again because CAP_SYS_ADMIN is > unavailable. The point of the chroot() is to prevent escapes from the shared > directory during path traversal. Does this ring any alarm bells or does it > sound sane? > > Container runtimes handle namespace setup and remove privileges needed by > virtiofsd to perform sandboxing. Luckily the container environment already > provides most of the sandbox that virtiofsd needs for security. > > Introduce a new "virtiofsd -o sandbox=chroot" option that uses chroot(2) > instead of namespaces. This option allows virtiofsd to work inside a container. > > Please see the individual patches for details on the changes and security > implications. > > Given that people are starting to attempt running virtiofsd in containers I > think this should go into QEMU 5.1. I've queued 1 and 3; waiting for someone with better knowledge of chroot to review 2. Dave > Stefan Hajnoczi (3): > virtiofsd: drop CAP_DAC_READ_SEARCH > virtiofsd: add container-friendly -o sandbox=chroot option > virtiofsd: probe unshare(CLONE_FS) and print an error > > tools/virtiofsd/fuse_virtio.c | 16 +++++++++ > tools/virtiofsd/helper.c | 8 +++++ > tools/virtiofsd/passthrough_ll.c | 58 ++++++++++++++++++++++++++++++-- > docs/tools/virtiofsd.rst | 32 ++++++++++++++---- > 4 files changed, 104 insertions(+), 10 deletions(-) > > -- > 2.26.2 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2020-08-27 18:41 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-27 19:02 [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 1/3] virtiofsd: drop CAP_DAC_READ_SEARCH Stefan Hajnoczi 2020-07-27 19:02 ` [PATCH v2 2/3] virtiofsd: add container-friendly -o sandbox=chroot option Stefan Hajnoczi 2020-08-07 15:36 ` Dr. David Alan Gilbert 2020-07-27 19:02 ` [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error Stefan Hajnoczi 2020-07-28 1:05 ` misono.tomohiro 2020-07-28 10:00 ` Roman Mohr 2020-07-28 13:12 ` Vivek Goyal 2020-07-28 15:52 ` Daniel P. Berrangé 2020-07-28 20:54 ` Vivek Goyal 2020-07-28 19:12 ` Daniel Walsh 2020-07-28 21:01 ` Vivek Goyal 2020-07-29 7:59 ` Roman Mohr 2020-07-29 14:40 ` Stefan Hajnoczi 2020-07-30 22:21 ` Daniel Walsh 2020-07-31 8:26 ` Stefan Hajnoczi 2020-07-31 8:39 ` Roman Mohr 2020-07-31 14:11 ` Stefan Hajnoczi 2020-07-28 15:32 ` Stefan Hajnoczi 2020-07-28 19:15 ` Daniel Walsh 2020-07-29 14:29 ` Stefan Hajnoczi 2020-08-07 15:29 ` Dr. David Alan Gilbert 2020-08-27 18:40 ` [PATCH v2 0/3] virtiofsd: allow virtiofsd to run in a container Dr. David Alan Gilbert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).