All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-04-16 16:49 ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, Stefan Hajnoczi, Dr. David Alan Gilbert, Vivek Goyal

virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
whitelisted set of capabilities that we require.  This improves security in
case virtiofsd is compromised by making it hard for an attacker to gain further
access to the system.

Stefan Hajnoczi (2):
  virtiofsd: only retain file system capabilities
  virtiofsd: drop all capabilities in the wait parent process

 tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-04-16 16:49 ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Vivek Goyal

virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
whitelisted set of capabilities that we require.  This improves security in
case virtiofsd is compromised by making it hard for an attacker to gain further
access to the system.

Stefan Hajnoczi (2):
  virtiofsd: only retain file system capabilities
  virtiofsd: drop all capabilities in the wait parent process

 tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 1/2] virtiofsd: only retain file system capabilities
  2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-04-16 16:49   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, Stefan Hajnoczi, Dr. David Alan Gilbert, Vivek Goyal

virtiofsd runs as root but only needs a subset of root's Linux
capabilities(7).  As a file server its purpose is to create and access
files on behalf of a client.  It needs to be able to access files with
arbitrary uid/gid owners.  It also needs to be create device nodes.

Introduce a Linux capabilities(7) whitelist and drop all capabilities
that we don't need, making the virtiofsd process less powerful than a
regular uid root process.

  # cat /proc/PID/status
  ...
          Before           After
  CapInh: 0000000000000000 0000000000000000
  CapPrm: 0000003fffffffff 00000000880000df
  CapEff: 0000003fffffffff 00000000880000df
  CapBnd: 0000003fffffffff 0000000000000000
  CapAmb: 0000000000000000 0000000000000000

Note that file capabilities cannot be used to achieve the same effect on
the virtiofsd executable because mount is used during sandbox setup.
Therefore we drop capabilities programmatically at the right point
during startup.

This patch only affects the sandboxed child process.  The parent process
that sits in waitpid(2) still has full root capabilities and will be
addressed in the next patch.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 38 ++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b25..af97ba1c41 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2695,6 +2695,43 @@ static void setup_mounts(const char *source)
     close(oldroot);
 }
 
+/*
+ * Only keep whitelisted capabilities that are needed for file system operation
+ */
+static void setup_capabilities(void)
+{
+    pthread_mutex_lock(&cap.mutex);
+    capng_restore_state(&cap.saved);
+
+    /*
+     * Whitelist file system-related capabilities that are needed for a file
+     * server to act like root.  Drop everything else like networking and
+     * sysadmin capabilities.
+     *
+     * Exclusions:
+     * 1. CAP_LINUX_IMMUTABLE is not included because it's only used via ioctl
+     *    and we don't support that.
+     * 2. CAP_MAC_OVERRIDE is not included because it only seems to be
+     *    used by the Smack LSM.  Omit it until there is demand for it.
+     */
+    capng_setpid(syscall(SYS_gettid));
+    capng_clear(CAPNG_SELECT_BOTH);
+    capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
+            CAP_CHOWN,
+            CAP_DAC_OVERRIDE,
+            CAP_DAC_READ_SEARCH,
+            CAP_FOWNER,
+            CAP_FSETID,
+            CAP_SETGID,
+            CAP_SETUID,
+            CAP_MKNOD,
+            CAP_SETFCAP);
+    capng_apply(CAPNG_SELECT_BOTH);
+
+    cap.saved = capng_save_state();
+    pthread_mutex_unlock(&cap.mutex);
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -2705,6 +2742,7 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
     setup_namespaces(lo, se);
     setup_mounts(lo->source);
     setup_seccomp(enable_syslog);
+    setup_capabilities();
 }
 
 /* Raise the maximum number of open file descriptors */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [Virtio-fs] [PATCH 1/2] virtiofsd: only retain file system capabilities
@ 2020-04-16 16:49   ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Vivek Goyal

virtiofsd runs as root but only needs a subset of root's Linux
capabilities(7).  As a file server its purpose is to create and access
files on behalf of a client.  It needs to be able to access files with
arbitrary uid/gid owners.  It also needs to be create device nodes.

Introduce a Linux capabilities(7) whitelist and drop all capabilities
that we don't need, making the virtiofsd process less powerful than a
regular uid root process.

  # cat /proc/PID/status
  ...
          Before           After
  CapInh: 0000000000000000 0000000000000000
  CapPrm: 0000003fffffffff 00000000880000df
  CapEff: 0000003fffffffff 00000000880000df
  CapBnd: 0000003fffffffff 0000000000000000
  CapAmb: 0000000000000000 0000000000000000

Note that file capabilities cannot be used to achieve the same effect on
the virtiofsd executable because mount is used during sandbox setup.
Therefore we drop capabilities programmatically at the right point
during startup.

This patch only affects the sandboxed child process.  The parent process
that sits in waitpid(2) still has full root capabilities and will be
addressed in the next patch.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 38 ++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b25..af97ba1c41 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2695,6 +2695,43 @@ static void setup_mounts(const char *source)
     close(oldroot);
 }
 
+/*
+ * Only keep whitelisted capabilities that are needed for file system operation
+ */
+static void setup_capabilities(void)
+{
+    pthread_mutex_lock(&cap.mutex);
+    capng_restore_state(&cap.saved);
+
+    /*
+     * Whitelist file system-related capabilities that are needed for a file
+     * server to act like root.  Drop everything else like networking and
+     * sysadmin capabilities.
+     *
+     * Exclusions:
+     * 1. CAP_LINUX_IMMUTABLE is not included because it's only used via ioctl
+     *    and we don't support that.
+     * 2. CAP_MAC_OVERRIDE is not included because it only seems to be
+     *    used by the Smack LSM.  Omit it until there is demand for it.
+     */
+    capng_setpid(syscall(SYS_gettid));
+    capng_clear(CAPNG_SELECT_BOTH);
+    capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
+            CAP_CHOWN,
+            CAP_DAC_OVERRIDE,
+            CAP_DAC_READ_SEARCH,
+            CAP_FOWNER,
+            CAP_FSETID,
+            CAP_SETGID,
+            CAP_SETUID,
+            CAP_MKNOD,
+            CAP_SETFCAP);
+    capng_apply(CAPNG_SELECT_BOTH);
+
+    cap.saved = capng_save_state();
+    pthread_mutex_unlock(&cap.mutex);
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -2705,6 +2742,7 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
     setup_namespaces(lo, se);
     setup_mounts(lo->source);
     setup_seccomp(enable_syslog);
+    setup_capabilities();
 }
 
 /* Raise the maximum number of open file descriptors */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 2/2] virtiofsd: drop all capabilities in the wait parent process
  2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-04-16 16:49   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: virtio-fs, Stefan Hajnoczi, Dr. David Alan Gilbert, Vivek Goyal

All this process does is wait for its child.  No capabilities are
needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index af97ba1c41..0c3f33b074 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2530,6 +2530,17 @@ static void print_capabilities(void)
     printf("}\n");
 }
 
+/*
+ * Drop all Linux capabilities because the wait parent process only needs to
+ * sit in waitpid(2) and terminate.
+ */
+static void setup_wait_parent_capabilities(void)
+{
+    capng_setpid(syscall(SYS_gettid));
+    capng_clear(CAPNG_SELECT_BOTH);
+    capng_apply(CAPNG_SELECT_BOTH);
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -2561,6 +2572,8 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
         pid_t waited;
         int wstatus;
 
+        setup_wait_parent_capabilities();
+
         /* The parent waits for the child */
         do {
             waited = waitpid(child, &wstatus, 0);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [Virtio-fs] [PATCH 2/2] virtiofsd: drop all capabilities in the wait parent process
@ 2020-04-16 16:49   ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-16 16:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Vivek Goyal

All this process does is wait for its child.  No capabilities are
needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index af97ba1c41..0c3f33b074 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2530,6 +2530,17 @@ static void print_capabilities(void)
     printf("}\n");
 }
 
+/*
+ * Drop all Linux capabilities because the wait parent process only needs to
+ * sit in waitpid(2) and terminate.
+ */
+static void setup_wait_parent_capabilities(void)
+{
+    capng_setpid(syscall(SYS_gettid));
+    capng_clear(CAPNG_SELECT_BOTH);
+    capng_apply(CAPNG_SELECT_BOTH);
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -2561,6 +2572,8 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
         pid_t waited;
         int wstatus;
 
+        setup_wait_parent_capabilities();
+
         /* The parent waits for the child */
         do {
             waited = waitpid(child, &wstatus, 0);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/2] virtiofsd: drop all capabilities in the wait parent process
  2020-04-16 16:49   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-04-16 17:50     ` Philippe Mathieu-Daudé
  -1 siblings, 0 replies; 71+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-04-16 17:50 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: virtio-fs, Dr. David Alan Gilbert, Vivek Goyal

On 4/16/20 6:49 PM, Stefan Hajnoczi wrote:
> All this process does is wait for its child.  No capabilities are
> needed.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   tools/virtiofsd/passthrough_ll.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index af97ba1c41..0c3f33b074 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,17 @@ static void print_capabilities(void)
>       printf("}\n");
>   }
>   
> +/*
> + * Drop all Linux capabilities because the wait parent process only needs to
> + * sit in waitpid(2) and terminate.
> + */
> +static void setup_wait_parent_capabilities(void)
> +{
> +    capng_setpid(syscall(SYS_gettid));

Maybe worth a /* Drop all capabilities */ comment here.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> +    capng_clear(CAPNG_SELECT_BOTH);
> +    capng_apply(CAPNG_SELECT_BOTH);
> +}
> +
>   /*
>    * Move to a new mount, net, and pid namespaces to isolate this process.
>    */
> @@ -2561,6 +2572,8 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>           pid_t waited;
>           int wstatus;
>   
> +        setup_wait_parent_capabilities();
> +
>           /* The parent waits for the child */
>           do {
>               waited = waitpid(child, &wstatus, 0);
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofsd: drop all capabilities in the wait parent process
@ 2020-04-16 17:50     ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 71+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-04-16 17:50 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel; +Cc: virtio-fs, Vivek Goyal

On 4/16/20 6:49 PM, Stefan Hajnoczi wrote:
> All this process does is wait for its child.  No capabilities are
> needed.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   tools/virtiofsd/passthrough_ll.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index af97ba1c41..0c3f33b074 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,17 @@ static void print_capabilities(void)
>       printf("}\n");
>   }
>   
> +/*
> + * Drop all Linux capabilities because the wait parent process only needs to
> + * sit in waitpid(2) and terminate.
> + */
> +static void setup_wait_parent_capabilities(void)
> +{
> +    capng_setpid(syscall(SYS_gettid));

Maybe worth a /* Drop all capabilities */ comment here.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> +    capng_clear(CAPNG_SELECT_BOTH);
> +    capng_apply(CAPNG_SELECT_BOTH);
> +}
> +
>   /*
>    * Move to a new mount, net, and pid namespaces to isolate this process.
>    */
> @@ -2561,6 +2572,8 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>           pid_t waited;
>           int wstatus;
>   
> +        setup_wait_parent_capabilities();
> +
>           /* The parent waits for the child */
>           do {
>               waited = waitpid(child, &wstatus, 0);
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-04-16 20:10   ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-04-16 20:10 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Dr. David Alan Gilbert

On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Hi Stefan,

Good to see this patch. We needed to limit capabilities to reduce attack
surface.

What tests have you run to make sure this current set of whitelisted
capabilities is good enough.

Vivek

> 
> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-04-16 20:10   ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-04-16 20:10 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel

On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Hi Stefan,

Good to see this patch. We needed to limit capabilities to reduce attack
surface.

What tests have you run to make sure this current set of whitelisted
capabilities is good enough.

Vivek

> 
> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-04-16 20:10   ` [Virtio-fs] " Vivek Goyal
@ 2020-04-17  9:42     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-17  9:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On Thu, Apr 16, 2020 at 04:10:22PM -0400, Vivek Goyal wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> Good to see this patch. We needed to limit capabilities to reduce attack
> surface.
> 
> What tests have you run to make sure this current set of whitelisted
> capabilities is good enough.

Booting and light usage of Fedora 29 and running blogbench.

I would appreciate it if others could try it out with their
tests/workloads.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-04-17  9:42     ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2020-04-17  9:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On Thu, Apr 16, 2020 at 04:10:22PM -0400, Vivek Goyal wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> Good to see this patch. We needed to limit capabilities to reduce attack
> surface.
> 
> What tests have you run to make sure this current set of whitelisted
> capabilities is good enough.

Booting and light usage of Fedora 29 and running blogbench.

I would appreciate it if others could try it out with their
tests/workloads.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] virtiofsd: only retain file system capabilities
  2020-04-16 16:49   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-04-28 11:48     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-28 11:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd runs as root but only needs a subset of root's Linux
> capabilities(7).  As a file server its purpose is to create and access
> files on behalf of a client.  It needs to be able to access files with
> arbitrary uid/gid owners.  It also needs to be create device nodes.
> 
> Introduce a Linux capabilities(7) whitelist and drop all capabilities
> that we don't need, making the virtiofsd process less powerful than a
> regular uid root process.
> 
>   # cat /proc/PID/status
>   ...
>           Before           After
>   CapInh: 0000000000000000 0000000000000000
>   CapPrm: 0000003fffffffff 00000000880000df
>   CapEff: 0000003fffffffff 00000000880000df
>   CapBnd: 0000003fffffffff 0000000000000000
>   CapAmb: 0000000000000000 0000000000000000
> 
> Note that file capabilities cannot be used to achieve the same effect on
> the virtiofsd executable because mount is used during sandbox setup.
> Therefore we drop capabilities programmatically at the right point
> during startup.
> 
> This patch only affects the sandboxed child process.  The parent process
> that sits in waitpid(2) still has full root capabilities and will be
> addressed in the next patch.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Looks reasonable to me; I can't see any capabilities in the manpage that
you're missing that make sense.
They also look old enough not to be a problem with reasonably old
systems.



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  tools/virtiofsd/passthrough_ll.c | 38 ++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..af97ba1c41 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2695,6 +2695,43 @@ static void setup_mounts(const char *source)
>      close(oldroot);
>  }
>  
> +/*
> + * Only keep whitelisted capabilities that are needed for file system operation
> + */
> +static void setup_capabilities(void)
> +{
> +    pthread_mutex_lock(&cap.mutex);
> +    capng_restore_state(&cap.saved);
> +
> +    /*
> +     * Whitelist file system-related capabilities that are needed for a file
> +     * server to act like root.  Drop everything else like networking and
> +     * sysadmin capabilities.
> +     *
> +     * Exclusions:
> +     * 1. CAP_LINUX_IMMUTABLE is not included because it's only used via ioctl
> +     *    and we don't support that.
> +     * 2. CAP_MAC_OVERRIDE is not included because it only seems to be
> +     *    used by the Smack LSM.  Omit it until there is demand for it.
> +     */
> +    capng_setpid(syscall(SYS_gettid));
> +    capng_clear(CAPNG_SELECT_BOTH);
> +    capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
> +            CAP_CHOWN,
> +            CAP_DAC_OVERRIDE,
> +            CAP_DAC_READ_SEARCH,
> +            CAP_FOWNER,
> +            CAP_FSETID,
> +            CAP_SETGID,
> +            CAP_SETUID,
> +            CAP_MKNOD,
> +            CAP_SETFCAP);
> +    capng_apply(CAPNG_SELECT_BOTH);
> +
> +    cap.saved = capng_save_state();
> +    pthread_mutex_unlock(&cap.mutex);
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2705,6 +2742,7 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>      setup_namespaces(lo, se);
>      setup_mounts(lo->source);
>      setup_seccomp(enable_syslog);
> +    setup_capabilities();
>  }
>  
>  /* Raise the maximum number of open file descriptors */
> -- 
> 2.25.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 1/2] virtiofsd: only retain file system capabilities
@ 2020-04-28 11:48     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-28 11:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd runs as root but only needs a subset of root's Linux
> capabilities(7).  As a file server its purpose is to create and access
> files on behalf of a client.  It needs to be able to access files with
> arbitrary uid/gid owners.  It also needs to be create device nodes.
> 
> Introduce a Linux capabilities(7) whitelist and drop all capabilities
> that we don't need, making the virtiofsd process less powerful than a
> regular uid root process.
> 
>   # cat /proc/PID/status
>   ...
>           Before           After
>   CapInh: 0000000000000000 0000000000000000
>   CapPrm: 0000003fffffffff 00000000880000df
>   CapEff: 0000003fffffffff 00000000880000df
>   CapBnd: 0000003fffffffff 0000000000000000
>   CapAmb: 0000000000000000 0000000000000000
> 
> Note that file capabilities cannot be used to achieve the same effect on
> the virtiofsd executable because mount is used during sandbox setup.
> Therefore we drop capabilities programmatically at the right point
> during startup.
> 
> This patch only affects the sandboxed child process.  The parent process
> that sits in waitpid(2) still has full root capabilities and will be
> addressed in the next patch.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Looks reasonable to me; I can't see any capabilities in the manpage that
you're missing that make sense.
They also look old enough not to be a problem with reasonably old
systems.



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  tools/virtiofsd/passthrough_ll.c | 38 ++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..af97ba1c41 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2695,6 +2695,43 @@ static void setup_mounts(const char *source)
>      close(oldroot);
>  }
>  
> +/*
> + * Only keep whitelisted capabilities that are needed for file system operation
> + */
> +static void setup_capabilities(void)
> +{
> +    pthread_mutex_lock(&cap.mutex);
> +    capng_restore_state(&cap.saved);
> +
> +    /*
> +     * Whitelist file system-related capabilities that are needed for a file
> +     * server to act like root.  Drop everything else like networking and
> +     * sysadmin capabilities.
> +     *
> +     * Exclusions:
> +     * 1. CAP_LINUX_IMMUTABLE is not included because it's only used via ioctl
> +     *    and we don't support that.
> +     * 2. CAP_MAC_OVERRIDE is not included because it only seems to be
> +     *    used by the Smack LSM.  Omit it until there is demand for it.
> +     */
> +    capng_setpid(syscall(SYS_gettid));
> +    capng_clear(CAPNG_SELECT_BOTH);
> +    capng_updatev(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
> +            CAP_CHOWN,
> +            CAP_DAC_OVERRIDE,
> +            CAP_DAC_READ_SEARCH,
> +            CAP_FOWNER,
> +            CAP_FSETID,
> +            CAP_SETGID,
> +            CAP_SETUID,
> +            CAP_MKNOD,
> +            CAP_SETFCAP);
> +    capng_apply(CAPNG_SELECT_BOTH);
> +
> +    cap.saved = capng_save_state();
> +    pthread_mutex_unlock(&cap.mutex);
> +}
> +
>  /*
>   * Lock down this process to prevent access to other processes or files outside
>   * source directory.  This reduces the impact of arbitrary code execution bugs.
> @@ -2705,6 +2742,7 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>      setup_namespaces(lo, se);
>      setup_mounts(lo->source);
>      setup_seccomp(enable_syslog);
> +    setup_capabilities();
>  }
>  
>  /* Raise the maximum number of open file descriptors */
> -- 
> 2.25.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-05-01 18:28   ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-01 18:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Queued.

> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-05-01 18:28   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-01 18:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Queued.

> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  (?)
@ 2020-06-18 19:08 ` Vivek Goyal
  2020-06-18 19:16     ` Dr. David Alan Gilbert
  2020-06-19 14:16     ` Miklos Szeredi
  -1 siblings, 2 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-18 19:08 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Hi Stefan,

I just noticed that this patch set breaks overlayfs on top of virtiofs.

overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
need CAP_SYS_ADMIN.

man xattr says.

   Trusted extended attributes
       Trusted  extended  attributes  are  visible and accessible only to pro‐
       cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
       class are used to implement mechanisms in user space (i.e., outside the
       kernel) which keep information in extended attributes to which ordinary
       processes should not have access.

There is a chance that overlay moves away from trusted xattr in future.
But for now we need to make it work. This is an important use case for
kata docker in docker build.

May be we can add an option to virtiofsd say "--add-cap <capability>" and
ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
with this capaibility.

Thanks
Vivek

> 
> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-18 19:08 ` Vivek Goyal
@ 2020-06-18 19:16     ` Dr. David Alan Gilbert
  2020-06-19 14:16     ` Miklos Szeredi
  1 sibling, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-18 19:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> need CAP_SYS_ADMIN.
> 
> man xattr says.
> 
>    Trusted extended attributes
>        Trusted  extended  attributes  are  visible and accessible only to pro‐
>        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
>        class are used to implement mechanisms in user space (i.e., outside the
>        kernel) which keep information in extended attributes to which ordinary
>        processes should not have access.
> 
> There is a chance that overlay moves away from trusted xattr in future.
> But for now we need to make it work. This is an important use case for
> kata docker in docker build.
> 
> May be we can add an option to virtiofsd say "--add-cap <capability>" and
> ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> with this capaibility.

I'll admit I don't like the idea of giving it cap_sys_admin.
Can you explain:
  a) What overlayfs uses trusted for?
  b) If something nasty was to write junk into the trusted attributes,
    what would happen?
  c) I see overlayfs has a fallback check if xattr isn't supported at
all - what is the consequence?

Dave

> Thanks
> Vivek
> 
> > 
> > Stefan Hajnoczi (2):
> >   virtiofsd: only retain file system capabilities
> >   virtiofsd: drop all capabilities in the wait parent process
> > 
> >  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > -- 
> > 2.25.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-18 19:16     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-18 19:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> need CAP_SYS_ADMIN.
> 
> man xattr says.
> 
>    Trusted extended attributes
>        Trusted  extended  attributes  are  visible and accessible only to pro‐
>        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
>        class are used to implement mechanisms in user space (i.e., outside the
>        kernel) which keep information in extended attributes to which ordinary
>        processes should not have access.
> 
> There is a chance that overlay moves away from trusted xattr in future.
> But for now we need to make it work. This is an important use case for
> kata docker in docker build.
> 
> May be we can add an option to virtiofsd say "--add-cap <capability>" and
> ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> with this capaibility.

I'll admit I don't like the idea of giving it cap_sys_admin.
Can you explain:
  a) What overlayfs uses trusted for?
  b) If something nasty was to write junk into the trusted attributes,
    what would happen?
  c) I see overlayfs has a fallback check if xattr isn't supported at
all - what is the consequence?

Dave

> Thanks
> Vivek
> 
> > 
> > Stefan Hajnoczi (2):
> >   virtiofsd: only retain file system capabilities
> >   virtiofsd: drop all capabilities in the wait parent process
> > 
> >  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > -- 
> > 2.25.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-18 19:16     ` Dr. David Alan Gilbert
@ 2020-06-18 19:27       ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-18 19:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> > 
> > Hi Stefan,
> > 
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > 
> > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > need CAP_SYS_ADMIN.
> > 
> > man xattr says.
> > 
> >    Trusted extended attributes
> >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> >        class are used to implement mechanisms in user space (i.e., outside the
> >        kernel) which keep information in extended attributes to which ordinary
> >        processes should not have access.
> > 
> > There is a chance that overlay moves away from trusted xattr in future.
> > But for now we need to make it work. This is an important use case for
> > kata docker in docker build.
> > 
> > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > with this capaibility.
> 
> I'll admit I don't like the idea of giving it cap_sys_admin.
> Can you explain:
>   a) What overlayfs uses trusted for?

overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

>   b) If something nasty was to write junk into the trusted attributes,
>     what would happen?

This directory is owned by guest. So it should be able to write
anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

>   c) I see overlayfs has a fallback check if xattr isn't supported at
> all - what is the consequence?

It falls back to I think read only mode. 

For a moment forget about overlayfs. Say a user process in guest with
CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
passthrough filesystem, so it should go through. But currently it
wont.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-18 19:27       ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-18 19:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> > 
> > Hi Stefan,
> > 
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > 
> > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > need CAP_SYS_ADMIN.
> > 
> > man xattr says.
> > 
> >    Trusted extended attributes
> >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> >        class are used to implement mechanisms in user space (i.e., outside the
> >        kernel) which keep information in extended attributes to which ordinary
> >        processes should not have access.
> > 
> > There is a chance that overlay moves away from trusted xattr in future.
> > But for now we need to make it work. This is an important use case for
> > kata docker in docker build.
> > 
> > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > with this capaibility.
> 
> I'll admit I don't like the idea of giving it cap_sys_admin.
> Can you explain:
>   a) What overlayfs uses trusted for?

overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

>   b) If something nasty was to write junk into the trusted attributes,
>     what would happen?

This directory is owned by guest. So it should be able to write
anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

>   c) I see overlayfs has a fallback check if xattr isn't supported at
> all - what is the consequence?

It falls back to I think read only mode. 

For a moment forget about overlayfs. Say a user process in guest with
CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
passthrough filesystem, so it should go through. But currently it
wont.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-18 19:27       ` Vivek Goyal
@ 2020-06-19  4:46         ` Chirantan Ekbote
  -1 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-06-19  4:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > >

Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
user namespace and it cannot set any trusted or security xattrs.  The
security xattrs check is at least namespace aware so you only need
CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
help us much.

We ended up working around it by prefixing "user.virtiofs." to the
xattr name[2], which has its own problems but there was pretty much no
chance that we would be able to give the fs device CAP_SYS_ADMIN in
the init namespace.


[1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116
[2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19  4:46         ` Chirantan Ekbote
  0 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-06-19  4:46 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > >

Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
user namespace and it cannot set any trusted or security xattrs.  The
security xattrs check is at least namespace aware so you only need
CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
help us much.

We ended up working around it by prefixing "user.virtiofs." to the
xattr name[2], which has its own problems but there was pretty much no
chance that we would be able to give the fs device CAP_SYS_ADMIN in
the init namespace.


[1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116
[2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-18 19:27       ` Vivek Goyal
@ 2020-06-19  8:27         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19  8:27 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > > 
> > > Hi Stefan,
> > > 
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > 
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > > 
> > > man xattr says.
> > > 
> > >    Trusted extended attributes
> > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > >        class are used to implement mechanisms in user space (i.e., outside the
> > >        kernel) which keep information in extended attributes to which ordinary
> > >        processes should not have access.
> > > 
> > > There is a chance that overlay moves away from trusted xattr in future.
> > > But for now we need to make it work. This is an important use case for
> > > kata docker in docker build.
> > > 
> > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > with this capaibility.
> > 
> > I'll admit I don't like the idea of giving it cap_sys_admin.
> > Can you explain:
> >   a) What overlayfs uses trusted for?
> 
> overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

Tell me more about this metadata.
Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

> >   b) If something nasty was to write junk into the trusted attributes,
> >     what would happen?
> 
> This directory is owned by guest. So it should be able to write
> anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

Well, we shouldn't be able to break/crash/escape into the host; how
much does overlayfs validate trusted.* it uses?

> >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > all - what is the consequence?
> 
> It falls back to I think read only mode. 

It looks like the fallback is more subtle to me:
        /*
         * Check if upper/work fs supports trusted.overlay.* xattr
         */
        err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
        if (err) {
                ofs->noxattr = true;
                ofs->config.index = false;
                ofs->config.metacopy = false;
                pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");

but I don't know what index and metacopy are.

> For a moment forget about overlayfs. Say a user process in guest with
> CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> passthrough filesystem, so it should go through. But currently it
> wont.

As long as any effects of what it writes are contained to the area of
the filesystem exposed to the guest, yes - however it worries me what
the consequences of broken trusted metadata is.  If it's delicate enough
that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19  8:27         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19  8:27 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > > 
> > > Hi Stefan,
> > > 
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > 
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > > 
> > > man xattr says.
> > > 
> > >    Trusted extended attributes
> > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > >        class are used to implement mechanisms in user space (i.e., outside the
> > >        kernel) which keep information in extended attributes to which ordinary
> > >        processes should not have access.
> > > 
> > > There is a chance that overlay moves away from trusted xattr in future.
> > > But for now we need to make it work. This is an important use case for
> > > kata docker in docker build.
> > > 
> > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > with this capaibility.
> > 
> > I'll admit I don't like the idea of giving it cap_sys_admin.
> > Can you explain:
> >   a) What overlayfs uses trusted for?
> 
> overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

Tell me more about this metadata.
Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

> >   b) If something nasty was to write junk into the trusted attributes,
> >     what would happen?
> 
> This directory is owned by guest. So it should be able to write
> anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

Well, we shouldn't be able to break/crash/escape into the host; how
much does overlayfs validate trusted.* it uses?

> >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > all - what is the consequence?
> 
> It falls back to I think read only mode. 

It looks like the fallback is more subtle to me:
        /*
         * Check if upper/work fs supports trusted.overlay.* xattr
         */
        err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
        if (err) {
                ofs->noxattr = true;
                ofs->config.index = false;
                ofs->config.metacopy = false;
                pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");

but I don't know what index and metacopy are.

> For a moment forget about overlayfs. Say a user process in guest with
> CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> passthrough filesystem, so it should go through. But currently it
> wont.

As long as any effects of what it writes are contained to the area of
the filesystem exposed to the guest, yes - however it worries me what
the consequences of broken trusted metadata is.  If it's delicate enough
that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  4:46         ` Chirantan Ekbote
  (?)
@ 2020-06-19  8:39         ` Dr. David Alan Gilbert
  2020-06-19  9:17           ` Chirantan Ekbote
  -1 siblings, 1 reply; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19  8:39 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, qemu-devel, Vivek Goyal, Miklos Szeredi

* Chirantan Ekbote (chirantan@chromium.org) wrote:
> On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > >
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > >
> 
> Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> user namespace and it cannot set any trusted or security xattrs.  The
> security xattrs check is at least namespace aware so you only need
> CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> help us much.

It would have been good if you'd mentioned that; it would have saved
Vivek some confusion!

> We ended up working around it by prefixing "user.virtiofs." to the
> xattr name[2], which has its own problems but there was pretty much no
> chance that we would be able to give the fs device CAP_SYS_ADMIN in
> the init namespace.


What problems did you hit with that?  We should standardise the renaming
so we make an on-disc format that's compatible.

Dave

> 
> [1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116
> [2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  8:39         ` Dr. David Alan Gilbert
@ 2020-06-19  9:17           ` Chirantan Ekbote
  2020-06-19 11:12             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 71+ messages in thread
From: Chirantan Ekbote @ 2020-06-19  9:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs-list, qemu-devel, Vivek Goyal, Miklos Szeredi

On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Chirantan Ekbote (chirantan@chromium.org) wrote:
>
> > We ended up working around it by prefixing "user.virtiofs." to the
> > xattr name[2], which has its own problems but there was pretty much no
> > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > the init namespace.
>
>
> What problems did you hit with that?  We should standardise the renaming
> so we make an on-disc format that's compatible.
>

I guess what I meant by problems is that it made what was previously a
simple and straightforward implementation into something more complex
and added some limitations.  For example, we now need to parse the
result of the listxattr system call and strip out the prefix from any
name in the list.  It also means that we cannot allow the guest to
directly set or remove any "user.virtiofs." xattr as this would allow
an unprivileged process in the vm to modify an xattr that it wouldn't
otherwise be allowed to modify.  On top of being a somewhat arbitrary
restriction this also means that you can't have stacked virtiofs
instances as the lower instance would reject attempts by the upper one
to set those xattrs.  These limitations aren't really a problem for us
but I can see how they might be a problem for others.

The change was also merged just yesterday so there may be other
problems with it that haven't surfaced yet.

I didn't mention it before because I figured this was something that
we brought upon ourselves as chrome os is a bit extreme about
sandboxing.  If we can come up with a standardized way to handle this
I think we'll gladly switch the chrome os implementation to use it.

Chirantan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  9:17           ` Chirantan Ekbote
@ 2020-06-19 11:12             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 11:12 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, qemu-devel, Vivek Goyal, Miklos Szeredi

* Chirantan Ekbote (chirantan@chromium.org) wrote:
> On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Chirantan Ekbote (chirantan@chromium.org) wrote:
> >
> > > We ended up working around it by prefixing "user.virtiofs." to the
> > > xattr name[2], which has its own problems but there was pretty much no
> > > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > > the init namespace.
> >
> >
> > What problems did you hit with that?  We should standardise the renaming
> > so we make an on-disc format that's compatible.
> >
> 
> I guess what I meant by problems is that it made what was previously a
> simple and straightforward implementation into something more complex
> and added some limitations.

Yeh.

>  For example, we now need to parse the
> result of the listxattr system call and strip out the prefix from any
> name in the list.  It also means that we cannot allow the guest to
> directly set or remove any "user.virtiofs." xattr as this would allow
> an unprivileged process in the vm to modify an xattr that it wouldn't
> otherwise be allowed to modify.  On top of being a somewhat arbitrary
> restriction this also means that you can't have stacked virtiofs
> instances as the lower instance would reject attempts by the upper one
> to set those xattrs.  These limitations aren't really a problem for us
> but I can see how they might be a problem for others.

Isn't this a classic escaping problem?
Would it work if you prepended  'user.virtiofs.' onto any xattr
that started with 'trusted' or 'user.virtiofs.' ?

> The change was also merged just yesterday so there may be other
> problems with it that haven't surfaced yet.
> 
> I didn't mention it before because I figured this was something that
> we brought upon ourselves as chrome os is a bit extreme about
> sandboxing. 

I think we're trying to be as extreme in virtiofsd, but it is causing us
similar problems.

> If we can come up with a standardized way to handle this
> I think we'll gladly switch the chrome os implementation to use it.

Great!

Dave

> 
> Chirantan
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  8:27         ` Dr. David Alan Gilbert
@ 2020-06-19 11:39           ` Daniel P. Berrangé
  -1 siblings, 0 replies; 71+ messages in thread
From: Daniel P. Berrangé @ 2020-06-19 11:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, Stefan Hajnoczi, qemu-devel, Vivek Goyal, Miklos Szeredi

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?
> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.
> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
mechanism for applications to control access. The host kernel doesn'
tuse this namespace itself. Linux has four namespaces for xattrs:

 -  user - for userspace apps. accessible based on read/write permissions
 -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
 -  system - for kernel only. used by ACLs
 -  security - for kernel only. used by SELinux

The use case for "trusted" xattrs is thus where a privileged management
application or service wants to store metadata against the file, but
also needs to grant an unprivileged process access to write to this file
while not allowing that unprivileged process the ability to change the
metadata. This is mentioned in the man page:

[man xattr(7)]
   Trusted extended attributes
       Trusted  extended attributes are visible and accessible only to pro‐
       cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
       class  are used to implement mechanisms in user space (i.e., outside
       the kernel) which keep information in extended attributes  to  which
       ordinary processes should not have access.

   User extended attributes
       User  extended  attributes  may be assigned to files and directories
       for storing arbitrary additional information such as the mime  type,
       character  set  or  encoding  of a file.  The access permissions for
       user attributes are defined by the file permission bits:  read  per‐
       mission is required to retrieve the attribute value, and writer per‐
       mission is required to change it.
[/man]

Libvirtd uses the "trusted." xattr namespace to record information against
disk images for QEMU, because we need to grant QEMU access to read/write
the disk iamges, but don't want QEMU to be able to alter our xattrs.

It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
It really ought to have had its own dedicated capability :-( Such is
life with anything that uses CAP_SYS_ADMIN...

With this in mind we really should have both trusted. & user. xattrs
allowed to the guest by default.

Conversely, we'll need to block usage of the security. and system.
namespaces.

Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
have a "--trusted-xattrs=on|off" flag to allow people to run in a more
locked down state if they knowingly want to block trusted xattrs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 11:39           ` Daniel P. Berrangé
  0 siblings, 0 replies; 71+ messages in thread
From: Daniel P. Berrangé @ 2020-06-19 11:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Vivek Goyal, Miklos Szeredi

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?
> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.
> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
mechanism for applications to control access. The host kernel doesn'
tuse this namespace itself. Linux has four namespaces for xattrs:

 -  user - for userspace apps. accessible based on read/write permissions
 -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
 -  system - for kernel only. used by ACLs
 -  security - for kernel only. used by SELinux

The use case for "trusted" xattrs is thus where a privileged management
application or service wants to store metadata against the file, but
also needs to grant an unprivileged process access to write to this file
while not allowing that unprivileged process the ability to change the
metadata. This is mentioned in the man page:

[man xattr(7)]
   Trusted extended attributes
       Trusted  extended attributes are visible and accessible only to pro‐
       cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
       class  are used to implement mechanisms in user space (i.e., outside
       the kernel) which keep information in extended attributes  to  which
       ordinary processes should not have access.

   User extended attributes
       User  extended  attributes  may be assigned to files and directories
       for storing arbitrary additional information such as the mime  type,
       character  set  or  encoding  of a file.  The access permissions for
       user attributes are defined by the file permission bits:  read  per‐
       mission is required to retrieve the attribute value, and writer per‐
       mission is required to change it.
[/man]

Libvirtd uses the "trusted." xattr namespace to record information against
disk images for QEMU, because we need to grant QEMU access to read/write
the disk iamges, but don't want QEMU to be able to alter our xattrs.

It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
It really ought to have had its own dedicated capability :-( Such is
life with anything that uses CAP_SYS_ADMIN...

With this in mind we really should have both trusted. & user. xattrs
allowed to the guest by default.

Conversely, we'll need to block usage of the security. and system.
namespaces.

Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
have a "--trusted-xattrs=on|off" flag to allow people to run in a more
locked down state if they knowingly want to block trusted xattrs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 11:39           ` Daniel P. Berrangé
@ 2020-06-19 11:49             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 11:49 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, Stefan Hajnoczi, qemu-devel, Vivek Goyal, Miklos Szeredi

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.

overlayfs inside the kernel is using trusted though which is the case
Vivek ran into.

>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.
> 
> Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> locked down state if they knowingly want to block trusted xattrs.

Does the trick described by Chirantan work for you, where the clients
'trusted' view is different from the host?

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 11:49             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 11:49 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, qemu-devel, Vivek Goyal, Miklos Szeredi

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.

overlayfs inside the kernel is using trusted though which is the case
Vivek ran into.

>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.
> 
> Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> locked down state if they knowingly want to block trusted xattrs.

Does the trick described by Chirantan work for you, where the clients
'trusted' view is different from the host?

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 11:49             ` Dr. David Alan Gilbert
@ 2020-06-19 12:05               ` Daniel P. Berrangé
  -1 siblings, 0 replies; 71+ messages in thread
From: Daniel P. Berrangé @ 2020-06-19 12:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, Stefan Hajnoczi, qemu-devel, Vivek Goyal, Miklos Szeredi

On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > access to the system.
> > > > > > 
> > > > > > Hi Stefan,
> > > > > > 
> > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > 
> > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > need CAP_SYS_ADMIN.
> > > > > > 
> > > > > > man xattr says.
> > > > > > 
> > > > > >    Trusted extended attributes
> > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > >        processes should not have access.
> > > > > > 
> > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > But for now we need to make it work. This is an important use case for
> > > > > > kata docker in docker build.
> > > > > > 
> > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > with this capaibility.
> > > > > 
> > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > Can you explain:
> > > > >   a) What overlayfs uses trusted for?
> > > > 
> > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > 
> > > Tell me more about this metadata.
> > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > 
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > > 
> > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > all - what is the consequence?
> > > > 
> > > > It falls back to I think read only mode. 
> > > 
> > > It looks like the fallback is more subtle to me:
> > >         /*
> > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > >          */
> > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > >         if (err) {
> > >                 ofs->noxattr = true;
> > >                 ofs->config.index = false;
> > >                 ofs->config.metacopy = false;
> > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > 
> > > but I don't know what index and metacopy are.
> > > 
> > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > passthrough filesystem, so it should go through. But currently it
> > > > wont.
> > > 
> > > As long as any effects of what it writes are contained to the area of
> > > the filesystem exposed to the guest, yes - however it worries me what
> > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > 
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> 
> overlayfs inside the kernel is using trusted though which is the case
> Vivek ran into.

IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
on the upper layer won't clash with overlayfs in the host.

If you're running overlayfs in the guest, then virtiofs is the lower layer
so needs to support the "trusted." on the FS its exposing. Again this should
not clash with anything on the host.

> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> > 
> > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > locked down state if they knowingly want to block trusted xattrs.
> 
> Does the trick described by Chirantan work for you, where the clients
> 'trusted' view is different from the host?

That works as long as you don't require any interoperability with
processes that have setup the exposed file tree on the host, or
might need to concurrently access it.

This probably suggests a need various modes of exposing xattrs

 - "passthrough"  - run CAP_SYS_ADMIN and pass directly
 - "mapped"       - no capabilities, rewrite everything into user. namespace
 - "disable"      - don't allow xattr

As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
in the initial namespace, so to support "passthrough" we'd have to NOT
use user namespace too. Something to be aware of when we accept the
user namespace patches.

Such "mapping" of attributes could also be used if virtiofs was running
as an unprivileged user account, to expose a full range of file owner,
group and permissions to the guest, while having everything be owned
by the normal user on the host.   QEMU's current 9p filesystem has
this kind of mapping support for owner/permissions, so guest view can
be distinct from the host view.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 12:05               ` Daniel P. Berrangé
  0 siblings, 0 replies; 71+ messages in thread
From: Daniel P. Berrangé @ 2020-06-19 12:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Vivek Goyal, Miklos Szeredi

On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > access to the system.
> > > > > > 
> > > > > > Hi Stefan,
> > > > > > 
> > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > 
> > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > need CAP_SYS_ADMIN.
> > > > > > 
> > > > > > man xattr says.
> > > > > > 
> > > > > >    Trusted extended attributes
> > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > >        processes should not have access.
> > > > > > 
> > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > But for now we need to make it work. This is an important use case for
> > > > > > kata docker in docker build.
> > > > > > 
> > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > with this capaibility.
> > > > > 
> > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > Can you explain:
> > > > >   a) What overlayfs uses trusted for?
> > > > 
> > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > 
> > > Tell me more about this metadata.
> > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > 
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > > 
> > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > all - what is the consequence?
> > > > 
> > > > It falls back to I think read only mode. 
> > > 
> > > It looks like the fallback is more subtle to me:
> > >         /*
> > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > >          */
> > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > >         if (err) {
> > >                 ofs->noxattr = true;
> > >                 ofs->config.index = false;
> > >                 ofs->config.metacopy = false;
> > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > 
> > > but I don't know what index and metacopy are.
> > > 
> > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > passthrough filesystem, so it should go through. But currently it
> > > > wont.
> > > 
> > > As long as any effects of what it writes are contained to the area of
> > > the filesystem exposed to the guest, yes - however it worries me what
> > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > 
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> 
> overlayfs inside the kernel is using trusted though which is the case
> Vivek ran into.

IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
on the upper layer won't clash with overlayfs in the host.

If you're running overlayfs in the guest, then virtiofs is the lower layer
so needs to support the "trusted." on the FS its exposing. Again this should
not clash with anything on the host.

> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> > 
> > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > locked down state if they knowingly want to block trusted xattrs.
> 
> Does the trick described by Chirantan work for you, where the clients
> 'trusted' view is different from the host?

That works as long as you don't require any interoperability with
processes that have setup the exposed file tree on the host, or
might need to concurrently access it.

This probably suggests a need various modes of exposing xattrs

 - "passthrough"  - run CAP_SYS_ADMIN and pass directly
 - "mapped"       - no capabilities, rewrite everything into user. namespace
 - "disable"      - don't allow xattr

As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
in the initial namespace, so to support "passthrough" we'd have to NOT
use user namespace too. Something to be aware of when we accept the
user namespace patches.

Such "mapping" of attributes could also be used if virtiofs was running
as an unprivileged user account, to expose a full range of file owner,
group and permissions to the guest, while having everything be owned
by the normal user on the host.   QEMU's current 9p filesystem has
this kind of mapping support for owner/permissions, so guest view can
be distinct from the host view.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-18 19:08 ` Vivek Goyal
@ 2020-06-19 14:16     ` Miklos Szeredi
  2020-06-19 14:16     ` Miklos Szeredi
  1 sibling, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2020-06-19 14:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
>
> Hi Stefan,
>
> I just noticed that this patch set breaks overlayfs on top of virtiofs.

How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
requires CAP_SYS_ADMIN, not the accessor.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 14:16     ` Miklos Szeredi
  0 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2020-06-19 14:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel

On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
>
> Hi Stefan,
>
> I just noticed that this patch set breaks overlayfs on top of virtiofs.

How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
requires CAP_SYS_ADMIN, not the accessor.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 14:16     ` Miklos Szeredi
@ 2020-06-19 14:25       ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 14:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
fails in lo_setxattr().

This is triggered when we mount overlayfs on top of virtiofs and overlayfs
tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
supported or not.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 14:25       ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 14:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
fails in lo_setxattr().

This is triggered when we mount overlayfs on top of virtiofs and overlayfs
tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
supported or not.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 14:16     ` Miklos Szeredi
@ 2020-06-19 14:29       ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 14:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

Are you thinking of virtiofs on top of overlayfs? I was referring to
mounting overlayfs on top of virtiofs inside VM.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 14:29       ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 14:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

Are you thinking of virtiofs on top of overlayfs? I was referring to
mounting overlayfs on top of virtiofs inside VM.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 14:25       ` Vivek Goyal
@ 2020-06-19 15:26         ` Miklos Szeredi
  -1 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2020-06-19 15:26 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> >
> > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > requires CAP_SYS_ADMIN, not the accessor.
>
> virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> fails in lo_setxattr().
>
> This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> supported or not.

Ah, right.

Plan is to use "user.*" xattr for unprivileged overlay.  This would be
a good way to eliminate this attack surface in the overlay on virtiofs
case as well.

Other ways to minimize risk is to separate operations requiring
CAP_SYS_ADMIN into a separate process, preferably a separate
executable, that communicates with virtiofsd using a pipe and contains
the minimum amount of code.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 15:26         ` Miklos Szeredi
  0 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2020-06-19 15:26 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel

On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> >
> > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > requires CAP_SYS_ADMIN, not the accessor.
>
> virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> fails in lo_setxattr().
>
> This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> supported or not.

Ah, right.

Plan is to use "user.*" xattr for unprivileged overlay.  This would be
a good way to eliminate this attack surface in the overlay on virtiofs
case as well.

Other ways to minimize risk is to separate operations requiring
CAP_SYS_ADMIN into a separate process, preferably a separate
executable, that communicates with virtiofsd using a pipe and contains
the minimum amount of code.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 15:26         ` Miklos Szeredi
@ 2020-06-19 15:57           ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 15:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Fri, Jun 19, 2020 at 05:26:37PM +0200, Miklos Szeredi wrote:
> On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > > requires CAP_SYS_ADMIN, not the accessor.
> >
> > virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> > fails in lo_setxattr().
> >
> > This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> > tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> > supported or not.
> 
> Ah, right.
> 
> Plan is to use "user.*" xattr for unprivileged overlay.  This would be
> a good way to eliminate this attack surface in the overlay on virtiofs
> case as well.

So unpriviliged overlay is one which is mounted from inside a user
namespace. But in this case we might be mounting it from init_user_ns
of guest. So from overlayfs perspective this will still be treated
as priviliged overlay instance and it will use trusted xattrs, IIUC?

Thanks
Vivek

> 
> Other ways to minimize risk is to separate operations requiring
> CAP_SYS_ADMIN into a separate process, preferably a separate
> executable, that communicates with virtiofsd using a pipe and contains
> the minimum amount of code.

> 
> Thanks,
> Miklos
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 15:57           ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 15:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list, qemu-devel

On Fri, Jun 19, 2020 at 05:26:37PM +0200, Miklos Szeredi wrote:
> On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > > requires CAP_SYS_ADMIN, not the accessor.
> >
> > virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> > fails in lo_setxattr().
> >
> > This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> > tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> > supported or not.
> 
> Ah, right.
> 
> Plan is to use "user.*" xattr for unprivileged overlay.  This would be
> a good way to eliminate this attack surface in the overlay on virtiofs
> case as well.

So unpriviliged overlay is one which is mounted from inside a user
namespace. But in this case we might be mounting it from init_user_ns
of guest. So from overlayfs perspective this will still be treated
as priviliged overlay instance and it will use trusted xattrs, IIUC?

Thanks
Vivek

> 
> Other ways to minimize risk is to separate operations requiring
> CAP_SYS_ADMIN into a separate process, preferably a separate
> executable, that communicates with virtiofsd using a pipe and contains
> the minimum amount of code.

> 
> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  8:27         ` Dr. David Alan Gilbert
@ 2020-06-19 16:09           ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 16:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?

It contains path information which is used for lookup into lower layer.

> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

Overlay is storing its metadata in trusted.* xattrs. If a user modifies
metadata, then various kind of bad things can happen. I think one can
do some kind of checks on metadata to make sure it does not crash
atleast.

And that's true for any filesystem. Isn't. If user manages to modify
metadata outside of filesystem, then lot of bad things can happen. I
thought that's the reason that people are not comfortable with the
idea of allowing mounting filesystem from inside user namespace because
it makes it easy to mount a hand crafted filesystem.

Anyway, I think overlayfs is just one use case of trusted xattr. Even
if overlayfs moves away from trusted xattr, what about other users.
We need to have a story around how will we support trusted xattrs
safely.


> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?

I thought qemu and kvm are the one who should ensure we should not be
able to break out of sandbox. Kernel implementation could be as 
buggy as it wanted to be. We are working with this security model
that kernel is completely untrusted.

> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.

They enable certain features in overlayfs. In fact, we fall back to
lesser capability on if we are running on ext4/xfs. For virtiofs, 
we deny the mount completely.

        /*
         * We allowed sub-optimal upper fs configuration and don't want to break
         * users over kernel upgrade, but we never allowed remote upper fs, so
         * we can enforce strict requirements for remote upper fs.
         */
        if (ovl_dentry_remote(ofs->workdir) &&
            (!d_type || !rename_whiteout || ofs->noxattr)) {
                pr_err("upper fs missing required features.\n");
                err = -EINVAL;
                goto out;
        }

> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Agreed that we need to look into whether having CAP_SYS_ADMIN allow
virtiofsd to break out of jail. 

May be we need to provide that remapping trusted xattr feature so
that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
provide this emulation even when running in user namespace.

Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 16:09           ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 16:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?

It contains path information which is used for lookup into lower layer.

> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

Overlay is storing its metadata in trusted.* xattrs. If a user modifies
metadata, then various kind of bad things can happen. I think one can
do some kind of checks on metadata to make sure it does not crash
atleast.

And that's true for any filesystem. Isn't. If user manages to modify
metadata outside of filesystem, then lot of bad things can happen. I
thought that's the reason that people are not comfortable with the
idea of allowing mounting filesystem from inside user namespace because
it makes it easy to mount a hand crafted filesystem.

Anyway, I think overlayfs is just one use case of trusted xattr. Even
if overlayfs moves away from trusted xattr, what about other users.
We need to have a story around how will we support trusted xattrs
safely.


> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?

I thought qemu and kvm are the one who should ensure we should not be
able to break out of sandbox. Kernel implementation could be as 
buggy as it wanted to be. We are working with this security model
that kernel is completely untrusted.

> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.

They enable certain features in overlayfs. In fact, we fall back to
lesser capability on if we are running on ext4/xfs. For virtiofs, 
we deny the mount completely.

        /*
         * We allowed sub-optimal upper fs configuration and don't want to break
         * users over kernel upgrade, but we never allowed remote upper fs, so
         * we can enforce strict requirements for remote upper fs.
         */
        if (ovl_dentry_remote(ofs->workdir) &&
            (!d_type || !rename_whiteout || ofs->noxattr)) {
                pr_err("upper fs missing required features.\n");
                err = -EINVAL;
                goto out;
        }

> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Agreed that we need to look into whether having CAP_SYS_ADMIN allow
virtiofsd to break out of jail. 

May be we need to provide that remapping trusted xattr feature so
that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
provide this emulation even when running in user namespace.

Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 16:09           ` Vivek Goyal
@ 2020-06-19 16:16             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 16:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> 
> It contains path information which is used for lookup into lower layer.
> 
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> Overlay is storing its metadata in trusted.* xattrs. If a user modifies
> metadata, then various kind of bad things can happen. I think one can
> do some kind of checks on metadata to make sure it does not crash
> atleast.
> 
> And that's true for any filesystem. Isn't. If user manages to modify
> metadata outside of filesystem, then lot of bad things can happen. I
> thought that's the reason that people are not comfortable with the
> idea of allowing mounting filesystem from inside user namespace because
> it makes it easy to mount a hand crafted filesystem.
> 
> Anyway, I think overlayfs is just one use case of trusted xattr. Even
> if overlayfs moves away from trusted xattr, what about other users.
> We need to have a story around how will we support trusted xattrs
> safely.
> 
> 
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> 
> I thought qemu and kvm are the one who should ensure we should not be
> able to break out of sandbox. Kernel implementation could be as 
> buggy as it wanted to be. We are working with this security model
> that kernel is completely untrusted.

But with virtiofs we allow the guest to do a lot of filesystem
operations on the host.  It's the virtiofsd that has to ensure that
these are safe and contained within the fs it's exposed; the qemu/kvm
can't protect us from that.

That's why we sandbox the virtiofsd like we do - if we allow a
priviliged guest to perform calls to an unconstrained virtiofsd it would
be able to escape.  That's what I want to check.

Dave

> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> 
> They enable certain features in overlayfs. In fact, we fall back to
> lesser capability on if we are running on ext4/xfs. For virtiofs, 
> we deny the mount completely.
> 
>         /*
>          * We allowed sub-optimal upper fs configuration and don't want to break
>          * users over kernel upgrade, but we never allowed remote upper fs, so
>          * we can enforce strict requirements for remote upper fs.
>          */
>         if (ovl_dentry_remote(ofs->workdir) &&
>             (!d_type || !rename_whiteout || ofs->noxattr)) {
>                 pr_err("upper fs missing required features.\n");
>                 err = -EINVAL;
>                 goto out;
>         }
> 
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> Agreed that we need to look into whether having CAP_SYS_ADMIN allow
> virtiofsd to break out of jail. 
> 
> May be we need to provide that remapping trusted xattr feature so
> that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
> provide this emulation even when running in user namespace.
> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 16:16             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 16:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> 
> It contains path information which is used for lookup into lower layer.
> 
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> Overlay is storing its metadata in trusted.* xattrs. If a user modifies
> metadata, then various kind of bad things can happen. I think one can
> do some kind of checks on metadata to make sure it does not crash
> atleast.
> 
> And that's true for any filesystem. Isn't. If user manages to modify
> metadata outside of filesystem, then lot of bad things can happen. I
> thought that's the reason that people are not comfortable with the
> idea of allowing mounting filesystem from inside user namespace because
> it makes it easy to mount a hand crafted filesystem.
> 
> Anyway, I think overlayfs is just one use case of trusted xattr. Even
> if overlayfs moves away from trusted xattr, what about other users.
> We need to have a story around how will we support trusted xattrs
> safely.
> 
> 
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> 
> I thought qemu and kvm are the one who should ensure we should not be
> able to break out of sandbox. Kernel implementation could be as 
> buggy as it wanted to be. We are working with this security model
> that kernel is completely untrusted.

But with virtiofs we allow the guest to do a lot of filesystem
operations on the host.  It's the virtiofsd that has to ensure that
these are safe and contained within the fs it's exposed; the qemu/kvm
can't protect us from that.

That's why we sandbox the virtiofsd like we do - if we allow a
priviliged guest to perform calls to an unconstrained virtiofsd it would
be able to escape.  That's what I want to check.

Dave

> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> 
> They enable certain features in overlayfs. In fact, we fall back to
> lesser capability on if we are running on ext4/xfs. For virtiofs, 
> we deny the mount completely.
> 
>         /*
>          * We allowed sub-optimal upper fs configuration and don't want to break
>          * users over kernel upgrade, but we never allowed remote upper fs, so
>          * we can enforce strict requirements for remote upper fs.
>          */
>         if (ovl_dentry_remote(ofs->workdir) &&
>             (!d_type || !rename_whiteout || ofs->noxattr)) {
>                 pr_err("upper fs missing required features.\n");
>                 err = -EINVAL;
>                 goto out;
>         }
> 
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> Agreed that we need to look into whether having CAP_SYS_ADMIN allow
> virtiofsd to break out of jail. 
> 
> May be we need to provide that remapping trusted xattr feature so
> that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
> provide this emulation even when running in user namespace.
> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 16:16             ` Dr. David Alan Gilbert
@ 2020-06-19 17:11               ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 17:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:

[..]
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > 
> > I thought qemu and kvm are the one who should ensure we should not be
> > able to break out of sandbox. Kernel implementation could be as 
> > buggy as it wanted to be. We are working with this security model
> > that kernel is completely untrusted.
> 
> But with virtiofs we allow the guest to do a lot of filesystem
> operations on the host.  It's the virtiofsd that has to ensure that
> these are safe and contained within the fs it's exposed; the qemu/kvm
> can't protect us from that.

Fair enough. I should have added virtiofsd to list. Its an attack
vector and is of concern.

> 
> That's why we sandbox the virtiofsd like we do - if we allow a
> priviliged guest to perform calls to an unconstrained virtiofsd it would
> be able to escape.  That's what I want to check.

Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
the jail. If it does we need to implement what crossvm folks did,
remapping of trusted xattr. That will also allow us to run inside
user namespace and still be able to support trusted xattr emulation
for guest.

Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 17:11               ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 17:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:

[..]
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > 
> > I thought qemu and kvm are the one who should ensure we should not be
> > able to break out of sandbox. Kernel implementation could be as 
> > buggy as it wanted to be. We are working with this security model
> > that kernel is completely untrusted.
> 
> But with virtiofs we allow the guest to do a lot of filesystem
> operations on the host.  It's the virtiofsd that has to ensure that
> these are safe and contained within the fs it's exposed; the qemu/kvm
> can't protect us from that.

Fair enough. I should have added virtiofsd to list. Its an attack
vector and is of concern.

> 
> That's why we sandbox the virtiofsd like we do - if we allow a
> priviliged guest to perform calls to an unconstrained virtiofsd it would
> be able to escape.  That's what I want to check.

Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
the jail. If it does we need to implement what crossvm folks did,
remapping of trusted xattr. That will also allow us to run inside
user namespace and still be able to support trusted xattr emulation
for guest.

Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 17:11               ` Vivek Goyal
@ 2020-06-19 17:16                 ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 17:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:
> 
> [..]
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > 
> > > I thought qemu and kvm are the one who should ensure we should not be
> > > able to break out of sandbox. Kernel implementation could be as 
> > > buggy as it wanted to be. We are working with this security model
> > > that kernel is completely untrusted.
> > 
> > But with virtiofs we allow the guest to do a lot of filesystem
> > operations on the host.  It's the virtiofsd that has to ensure that
> > these are safe and contained within the fs it's exposed; the qemu/kvm
> > can't protect us from that.
> 
> Fair enough. I should have added virtiofsd to list. Its an attack
> vector and is of concern.
> 
> > 
> > That's why we sandbox the virtiofsd like we do - if we allow a
> > priviliged guest to perform calls to an unconstrained virtiofsd it would
> > be able to escape.  That's what I want to check.
> 
> Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
> the jail.

So that's *my* question - what bad things can someone do by setting
attributes (trusted/system/security) - it's fine if they break they
screwup the security inside the container, because they'd need to be
CAP_SYS_ADMIN inside the container to do it - as long as they can't
break the host.  So what happens if someone starts doing bad things to
trusted.* attributes on an overlayfs - no other fs uses them as far as I
know.

> If it does we need to implement what crossvm folks did,
> remapping of trusted xattr. That will also allow us to run inside
> user namespace and still be able to support trusted xattr emulation
> for guest.

I think we need to do that anyway, it's just going to take a while to
figure out.

Dave

> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 17:16                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 17:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:
> 
> [..]
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > 
> > > I thought qemu and kvm are the one who should ensure we should not be
> > > able to break out of sandbox. Kernel implementation could be as 
> > > buggy as it wanted to be. We are working with this security model
> > > that kernel is completely untrusted.
> > 
> > But with virtiofs we allow the guest to do a lot of filesystem
> > operations on the host.  It's the virtiofsd that has to ensure that
> > these are safe and contained within the fs it's exposed; the qemu/kvm
> > can't protect us from that.
> 
> Fair enough. I should have added virtiofsd to list. Its an attack
> vector and is of concern.
> 
> > 
> > That's why we sandbox the virtiofsd like we do - if we allow a
> > priviliged guest to perform calls to an unconstrained virtiofsd it would
> > be able to escape.  That's what I want to check.
> 
> Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
> the jail.

So that's *my* question - what bad things can someone do by setting
attributes (trusted/system/security) - it's fine if they break they
screwup the security inside the container, because they'd need to be
CAP_SYS_ADMIN inside the container to do it - as long as they can't
break the host.  So what happens if someone starts doing bad things to
trusted.* attributes on an overlayfs - no other fs uses them as far as I
know.

> If it does we need to implement what crossvm folks did,
> remapping of trusted xattr. That will also allow us to run inside
> user namespace and still be able to support trusted xattr emulation
> for guest.

I think we need to do that anyway, it's just going to take a while to
figure out.

Dave

> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 12:05               ` Daniel P. Berrangé
@ 2020-06-19 17:41                 ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 17:41 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, Miklos Szeredi, Dr. David Alan Gilbert,
	Stefan Hajnoczi, qemu-devel

On Fri, Jun 19, 2020 at 01:05:47PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > > access to the system.
> > > > > > > 
> > > > > > > Hi Stefan,
> > > > > > > 
> > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > > 
> > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > > need CAP_SYS_ADMIN.
> > > > > > > 
> > > > > > > man xattr says.
> > > > > > > 
> > > > > > >    Trusted extended attributes
> > > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > > >        processes should not have access.
> > > > > > > 
> > > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > > But for now we need to make it work. This is an important use case for
> > > > > > > kata docker in docker build.
> > > > > > > 
> > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > > with this capaibility.
> > > > > > 
> > > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > > Can you explain:
> > > > > >   a) What overlayfs uses trusted for?
> > > > > 
> > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > > 
> > > > Tell me more about this metadata.
> > > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > > 
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > > 
> > > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > > all - what is the consequence?
> > > > > 
> > > > > It falls back to I think read only mode. 
> > > > 
> > > > It looks like the fallback is more subtle to me:
> > > >         /*
> > > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > > >          */
> > > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > > >         if (err) {
> > > >                 ofs->noxattr = true;
> > > >                 ofs->config.index = false;
> > > >                 ofs->config.metacopy = false;
> > > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > > 
> > > > but I don't know what index and metacopy are.
> > > > 
> > > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > > passthrough filesystem, so it should go through. But currently it
> > > > > wont.
> > > > 
> > > > As long as any effects of what it writes are contained to the area of
> > > > the filesystem exposed to the guest, yes - however it worries me what
> > > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > > 
> > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > > mechanism for applications to control access. The host kernel doesn'
> > > tuse this namespace itself. Linux has four namespaces for xattrs:
> > > 
> > >  -  user - for userspace apps. accessible based on read/write permissions
> > >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> > >  -  system - for kernel only. used by ACLs
> > >  -  security - for kernel only. used by SELinux
> > > 
> > > The use case for "trusted" xattrs is thus where a privileged management
> > > application or service wants to store metadata against the file, but
> > > also needs to grant an unprivileged process access to write to this file
> > > while not allowing that unprivileged process the ability to change the
> > > metadata. This is mentioned in the man page:
> > > 
> > > [man xattr(7)]
> > >    Trusted extended attributes
> > >        Trusted  extended attributes are visible and accessible only to pro‐
> > >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> > >        class  are used to implement mechanisms in user space (i.e., outside
> > >        the kernel) which keep information in extended attributes  to  which
> > >        ordinary processes should not have access.
> > 
> > overlayfs inside the kernel is using trusted though which is the case
> > Vivek ran into.
> 
> IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
> host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
> on the upper layer won't clash with overlayfs in the host.
> 
> If you're running overlayfs in the guest, then virtiofs is the lower layer
> so needs to support the "trusted." on the FS its exposing. Again this should
> not clash with anything on the host.
> 
> > >    User extended attributes
> > >        User  extended  attributes  may be assigned to files and directories
> > >        for storing arbitrary additional information such as the mime  type,
> > >        character  set  or  encoding  of a file.  The access permissions for
> > >        user attributes are defined by the file permission bits:  read  per‐
> > >        mission is required to retrieve the attribute value, and writer per‐
> > >        mission is required to change it.
> > > [/man]
> > > 
> > > Libvirtd uses the "trusted." xattr namespace to record information against
> > > disk images for QEMU, because we need to grant QEMU access to read/write
> > > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > > 
> > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > > It really ought to have had its own dedicated capability :-( Such is
> > > life with anything that uses CAP_SYS_ADMIN...
> > > 
> > > With this in mind we really should have both trusted. & user. xattrs
> > > allowed to the guest by default.
> > > 
> > > Conversely, we'll need to block usage of the security. and system.
> > > namespaces.
> > > 
> > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > > locked down state if they knowingly want to block trusted xattrs.
> > 
> > Does the trick described by Chirantan work for you, where the clients
> > 'trusted' view is different from the host?
> 
> That works as long as you don't require any interoperability with
> processes that have setup the exposed file tree on the host, or
> might need to concurrently access it.
> 
> This probably suggests a need various modes of exposing xattrs
> 
>  - "passthrough"  - run CAP_SYS_ADMIN and pass directly
>  - "mapped"       - no capabilities, rewrite everything into user. namespace
>  - "disable"      - don't allow xattr
> 

We already do "passthrough" (this is our default) and "disable" (-o no_xattr).
We probably need to implement this "mapped" mode and that will allow us
to run without CAP_SYS_ADMIN in init_user_ns.

> As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
> in the initial namespace, so to support "passthrough" we'd have to NOT
> use user namespace too. Something to be aware of when we accept the
> user namespace patches.

> 
> Such "mapping" of attributes could also be used if virtiofs was running
> as an unprivileged user account, to expose a full range of file owner,
> group and permissions to the guest, while having everything be owned
> by the normal user on the host.   QEMU's current 9p filesystem has
> this kind of mapping support for owner/permissions, so guest view can
> be distinct from the host view.

If we figure out a way to run in a user namespace, then we don't
necessarily require this mapping functionality, right? User namespace will
automatically provide that mapping. We can just run as root of that user
namespace.

When 9p wrote this remapping method, we probably did not have user
namespace as popular as they are now. So we probably should implement
remapping of trusted xattr and then run in a user namesapce.

And complete passthrough mode will have to run in init user namespace
and this will be least securre mode (from host security point of view).

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 17:41                 ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 17:41 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: virtio-fs, Miklos Szeredi, qemu-devel

On Fri, Jun 19, 2020 at 01:05:47PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > > access to the system.
> > > > > > > 
> > > > > > > Hi Stefan,
> > > > > > > 
> > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > > 
> > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > > need CAP_SYS_ADMIN.
> > > > > > > 
> > > > > > > man xattr says.
> > > > > > > 
> > > > > > >    Trusted extended attributes
> > > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > > >        processes should not have access.
> > > > > > > 
> > > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > > But for now we need to make it work. This is an important use case for
> > > > > > > kata docker in docker build.
> > > > > > > 
> > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > > with this capaibility.
> > > > > > 
> > > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > > Can you explain:
> > > > > >   a) What overlayfs uses trusted for?
> > > > > 
> > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > > 
> > > > Tell me more about this metadata.
> > > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > > 
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > > 
> > > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > > all - what is the consequence?
> > > > > 
> > > > > It falls back to I think read only mode. 
> > > > 
> > > > It looks like the fallback is more subtle to me:
> > > >         /*
> > > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > > >          */
> > > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > > >         if (err) {
> > > >                 ofs->noxattr = true;
> > > >                 ofs->config.index = false;
> > > >                 ofs->config.metacopy = false;
> > > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > > 
> > > > but I don't know what index and metacopy are.
> > > > 
> > > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > > passthrough filesystem, so it should go through. But currently it
> > > > > wont.
> > > > 
> > > > As long as any effects of what it writes are contained to the area of
> > > > the filesystem exposed to the guest, yes - however it worries me what
> > > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > > 
> > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > > mechanism for applications to control access. The host kernel doesn'
> > > tuse this namespace itself. Linux has four namespaces for xattrs:
> > > 
> > >  -  user - for userspace apps. accessible based on read/write permissions
> > >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> > >  -  system - for kernel only. used by ACLs
> > >  -  security - for kernel only. used by SELinux
> > > 
> > > The use case for "trusted" xattrs is thus where a privileged management
> > > application or service wants to store metadata against the file, but
> > > also needs to grant an unprivileged process access to write to this file
> > > while not allowing that unprivileged process the ability to change the
> > > metadata. This is mentioned in the man page:
> > > 
> > > [man xattr(7)]
> > >    Trusted extended attributes
> > >        Trusted  extended attributes are visible and accessible only to pro‐
> > >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> > >        class  are used to implement mechanisms in user space (i.e., outside
> > >        the kernel) which keep information in extended attributes  to  which
> > >        ordinary processes should not have access.
> > 
> > overlayfs inside the kernel is using trusted though which is the case
> > Vivek ran into.
> 
> IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
> host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
> on the upper layer won't clash with overlayfs in the host.
> 
> If you're running overlayfs in the guest, then virtiofs is the lower layer
> so needs to support the "trusted." on the FS its exposing. Again this should
> not clash with anything on the host.
> 
> > >    User extended attributes
> > >        User  extended  attributes  may be assigned to files and directories
> > >        for storing arbitrary additional information such as the mime  type,
> > >        character  set  or  encoding  of a file.  The access permissions for
> > >        user attributes are defined by the file permission bits:  read  per‐
> > >        mission is required to retrieve the attribute value, and writer per‐
> > >        mission is required to change it.
> > > [/man]
> > > 
> > > Libvirtd uses the "trusted." xattr namespace to record information against
> > > disk images for QEMU, because we need to grant QEMU access to read/write
> > > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > > 
> > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > > It really ought to have had its own dedicated capability :-( Such is
> > > life with anything that uses CAP_SYS_ADMIN...
> > > 
> > > With this in mind we really should have both trusted. & user. xattrs
> > > allowed to the guest by default.
> > > 
> > > Conversely, we'll need to block usage of the security. and system.
> > > namespaces.
> > > 
> > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > > locked down state if they knowingly want to block trusted xattrs.
> > 
> > Does the trick described by Chirantan work for you, where the clients
> > 'trusted' view is different from the host?
> 
> That works as long as you don't require any interoperability with
> processes that have setup the exposed file tree on the host, or
> might need to concurrently access it.
> 
> This probably suggests a need various modes of exposing xattrs
> 
>  - "passthrough"  - run CAP_SYS_ADMIN and pass directly
>  - "mapped"       - no capabilities, rewrite everything into user. namespace
>  - "disable"      - don't allow xattr
> 

We already do "passthrough" (this is our default) and "disable" (-o no_xattr).
We probably need to implement this "mapped" mode and that will allow us
to run without CAP_SYS_ADMIN in init_user_ns.

> As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
> in the initial namespace, so to support "passthrough" we'd have to NOT
> use user namespace too. Something to be aware of when we accept the
> user namespace patches.

> 
> Such "mapping" of attributes could also be used if virtiofs was running
> as an unprivileged user account, to expose a full range of file owner,
> group and permissions to the guest, while having everything be owned
> by the normal user on the host.   QEMU's current 9p filesystem has
> this kind of mapping support for owner/permissions, so guest view can
> be distinct from the host view.

If we figure out a way to run in a user namespace, then we don't
necessarily require this mapping functionality, right? User namespace will
automatically provide that mapping. We can just run as root of that user
namespace.

When 9p wrote this remapping method, we probably did not have user
namespace as popular as they are now. So we probably should implement
remapping of trusted xattr and then run in a user namesapce.

And complete passthrough mode will have to run in init user namespace
and this will be least securre mode (from host security point of view).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 11:39           ` Daniel P. Berrangé
@ 2020-06-19 19:12             ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 19:12 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: virtio-fs, Miklos Szeredi, Dr. David Alan Gilbert,
	Stefan Hajnoczi, qemu-devel

On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
[..]
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.
> 
>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.

I am wondering can we block usage of "system" and "security"?  What
about guest setting acls over virtiofs files. These will have to
go through and that means we need to allow system xattrs.

Similarly setting file capabilities inside should trigger
setxattr(security.capability) and that means we need to allow security
xattr as well.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 19:12             ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 19:12 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: virtio-fs, Miklos Szeredi, qemu-devel

On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
[..]
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.
> 
>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.

I am wondering can we block usage of "system" and "security"?  What
about guest setting acls over virtiofs files. These will have to
go through and that means we need to allow system xattrs.

Similarly setting file capabilities inside should trigger
setxattr(security.capability) and that means we need to allow security
xattr as well.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19  4:46         ` Chirantan Ekbote
@ 2020-06-19 19:15           ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 19:15 UTC (permalink / raw)
  To: Chirantan Ekbote
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > >
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > >
> 
> Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> user namespace and it cannot set any trusted or security xattrs.  The
> security xattrs check is at least namespace aware so you only need
> CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> help us much.
> 
> We ended up working around it by prefixing "user.virtiofs." to the
> xattr name[2], which has its own problems but there was pretty much no
> chance that we would be able to give the fs device CAP_SYS_ADMIN in
> the init namespace.

Chirantan, 

So you ended up renaming all "trusted", "security" and "system" xattrs?
Only "user" xattrs are complete passthrough?

IOW, security.selinux will be renamed to user.virtiofs.security.selinux
on host?

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-19 19:15           ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-19 19:15 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > >
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > >
> 
> Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> user namespace and it cannot set any trusted or security xattrs.  The
> security xattrs check is at least namespace aware so you only need
> CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> help us much.
> 
> We ended up working around it by prefixing "user.virtiofs." to the
> xattr name[2], which has its own problems but there was pretty much no
> chance that we would be able to give the fs device CAP_SYS_ADMIN in
> the init namespace.

Chirantan, 

So you ended up renaming all "trusted", "security" and "system" xattrs?
Only "user" xattrs are complete passthrough?

IOW, security.selinux will be renamed to user.virtiofs.security.selinux
on host?

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 19:15           ` Vivek Goyal
@ 2020-06-25  3:19             ` Chirantan Ekbote
  -1 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-06-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Sat, Jun 20, 2020 at 4:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > >
> > > > > Hi Stefan,
> > > > >
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > >
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > >
> >
> > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> > have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> > user namespace and it cannot set any trusted or security xattrs.  The
> > security xattrs check is at least namespace aware so you only need
> > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> > help us much.
> >
> > We ended up working around it by prefixing "user.virtiofs." to the
> > xattr name[2], which has its own problems but there was pretty much no
> > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > the init namespace.
>
> Chirantan,
>
> So you ended up renaming all "trusted", "security" and "system" xattrs?
> Only "user" xattrs are complete passthrough?
>

No, we only rename "security" xattrs (except for selinux).

>
> IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> on host?
>

We don't relabel security.selinux because it only requires CAP_FOWNER
in the process's user namespace and it also does its own MAC-based
checks.  Also we have some tools that label files beforehand so it
seemed easier to leave them unchanged.

Chirantan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-25  3:19             ` Chirantan Ekbote
  0 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-06-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Sat, Jun 20, 2020 at 4:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > >
> > > > > Hi Stefan,
> > > > >
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > >
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > >
> >
> > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> > have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> > user namespace and it cannot set any trusted or security xattrs.  The
> > security xattrs check is at least namespace aware so you only need
> > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> > help us much.
> >
> > We ended up working around it by prefixing "user.virtiofs." to the
> > xattr name[2], which has its own problems but there was pretty much no
> > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > the init namespace.
>
> Chirantan,
>
> So you ended up renaming all "trusted", "security" and "system" xattrs?
> Only "user" xattrs are complete passthrough?
>

No, we only rename "security" xattrs (except for selinux).

>
> IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> on host?
>

We don't relabel security.selinux because it only requires CAP_FOWNER
in the process's user namespace and it also does its own MAC-based
checks.  Also we have some tools that label files beforehand so it
seemed easier to leave them unchanged.

Chirantan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-25  3:19             ` Chirantan Ekbote
@ 2020-06-25 12:55               ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-25 12:55 UTC (permalink / raw)
  To: Chirantan Ekbote
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
[..]
> > Chirantan,
> >
> > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > Only "user" xattrs are complete passthrough?
> >
> 
> No, we only rename "security" xattrs (except for selinux).
> 
> >
> > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > on host?
> >
> 
> We don't relabel security.selinux because it only requires CAP_FOWNER
> in the process's user namespace and it also does its own MAC-based
> checks.  Also we have some tools that label files beforehand so it
> seemed easier to leave them unchanged.

If we rename selinux xattr also, then we can support selinux both in
guest and host and they both can have their own independent policies.

Otherwise we either have to disable selinux on host (if we want to
support it in guest) or somehow guest and how policies will have
to know about each other and be able to work together (which will
be hard for a generic use case).

Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-25 12:55               ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-06-25 12:55 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
[..]
> > Chirantan,
> >
> > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > Only "user" xattrs are complete passthrough?
> >
> 
> No, we only rename "security" xattrs (except for selinux).
> 
> >
> > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > on host?
> >
> 
> We don't relabel security.selinux because it only requires CAP_FOWNER
> in the process's user namespace and it also does its own MAC-based
> checks.  Also we have some tools that label files beforehand so it
> seemed easier to leave them unchanged.

If we rename selinux xattr also, then we can support selinux both in
guest and host and they both can have their own independent policies.

Otherwise we either have to disable selinux on host (if we want to
support it in guest) or somehow guest and how policies will have
to know about each other and be able to work together (which will
be hard for a generic use case).

Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-19 19:12             ` Vivek Goyal
@ 2020-06-26 11:26               ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 11:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, Daniel P. Berrangé,
	qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
> [..]
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> > 
> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> 
> I am wondering can we block usage of "system" and "security"?  What
> about guest setting acls over virtiofs files. These will have to
> go through and that means we need to allow system xattrs.
> 
> Similarly setting file capabilities inside should trigger
> setxattr(security.capability) and that means we need to allow security
> xattr as well.

Yep, we see that when people install Fedora packages, when rpm
unpackgs /usr/bin/newgidmap which has:

$ getfattr -d '--match=.*' /usr/bin/newgidmap 
getfattr: Removing leading '/' from absolute path names
# file: usr/bin/newgidmap
security.capability=0sAQAAAkAAAAAAAAAAAAAAAAAAAAA=
security.selinux="system_u:object_r:bin_t:s0"


Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-06-26 11:26               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 71+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 11:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, Daniel P. Berrangé, qemu-devel, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
> [..]
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> > 
> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> 
> I am wondering can we block usage of "system" and "security"?  What
> about guest setting acls over virtiofs files. These will have to
> go through and that means we need to allow system xattrs.
> 
> Similarly setting file capabilities inside should trigger
> setxattr(security.capability) and that means we need to allow security
> xattr as well.

Yep, we see that when people install Fedora packages, when rpm
unpackgs /usr/bin/newgidmap which has:

$ getfattr -d '--match=.*' /usr/bin/newgidmap 
getfattr: Removing leading '/' from absolute path names
# file: usr/bin/newgidmap
security.capability=0sAQAAAkAAAAAAAAAAAAAAAAAAAAA=
security.selinux="system_u:object_r:bin_t:s0"


Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-06-25 12:55               ` Vivek Goyal
@ 2020-07-13  8:54                 ` Chirantan Ekbote
  -1 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-07-13  8:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> [..]
> > > Chirantan,
> > >
> > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > Only "user" xattrs are complete passthrough?
> > >
> >
> > No, we only rename "security" xattrs (except for selinux).
> >
> > >
> > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > on host?
> > >
> >
> > We don't relabel security.selinux because it only requires CAP_FOWNER
> > in the process's user namespace and it also does its own MAC-based
> > checks.  Also we have some tools that label files beforehand so it
> > seemed easier to leave them unchanged.
>
> If we rename selinux xattr also, then we can support selinux both in
> guest and host and they both can have their own independent policies.
>

This works as long as you don't need to support setfscreatecon, which
exists to guarantee that an inode is atomically created with the
correct selinux context.  It's kind of possible for files with
O_TMPFILE + fsetxattr + linkat but there is no equivalent for
directories.  You're basically forced to create the directory and then
set the xattr and while it's possible to prevent other threads in the
server from seeing the unfinished directories with a rwlock or similar
there is no protection against power loss.  If the machine loses power
after the directory is created but before the selinux xattr is set
then that directory will have the wrong selinux context and the guest
would need to run restorecon at boot to assign the correct label.

> Otherwise we either have to disable selinux on host (if we want to
> support it in guest) or somehow guest and how policies will have
> to know about each other and be able to work together (which will
> be hard for a generic use case).

Yes, I agree this is hard to do for a generic case but unfortunately
the more I understand how selinux works the less I feel that it works
well with a passthrough style file system.  As you said it either
needs to be turned off on the host or the host and guest need to work
together.

Chirantan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-07-13  8:54                 ` Chirantan Ekbote
  0 siblings, 0 replies; 71+ messages in thread
From: Chirantan Ekbote @ 2020-07-13  8:54 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> [..]
> > > Chirantan,
> > >
> > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > Only "user" xattrs are complete passthrough?
> > >
> >
> > No, we only rename "security" xattrs (except for selinux).
> >
> > >
> > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > on host?
> > >
> >
> > We don't relabel security.selinux because it only requires CAP_FOWNER
> > in the process's user namespace and it also does its own MAC-based
> > checks.  Also we have some tools that label files beforehand so it
> > seemed easier to leave them unchanged.
>
> If we rename selinux xattr also, then we can support selinux both in
> guest and host and they both can have their own independent policies.
>

This works as long as you don't need to support setfscreatecon, which
exists to guarantee that an inode is atomically created with the
correct selinux context.  It's kind of possible for files with
O_TMPFILE + fsetxattr + linkat but there is no equivalent for
directories.  You're basically forced to create the directory and then
set the xattr and while it's possible to prevent other threads in the
server from seeing the unfinished directories with a rwlock or similar
there is no protection against power loss.  If the machine loses power
after the directory is created but before the selinux xattr is set
then that directory will have the wrong selinux context and the guest
would need to run restorecon at boot to assign the correct label.

> Otherwise we either have to disable selinux on host (if we want to
> support it in guest) or somehow guest and how policies will have
> to know about each other and be able to work together (which will
> be hard for a generic use case).

Yes, I agree this is hard to do for a generic case but unfortunately
the more I understand how selinux works the less I feel that it works
well with a passthrough style file system.  As you said it either
needs to be turned off on the host or the host and guest need to work
together.

Chirantan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-07-13  8:54                 ` Chirantan Ekbote
@ 2020-07-13 13:39                   ` Vivek Goyal
  -1 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-07-13 13:39 UTC (permalink / raw)
  To: Chirantan Ekbote
  Cc: virtio-fs-list, Miklos Szeredi, Dr. David Alan Gilbert, qemu-devel

On Mon, Jul 13, 2020 at 05:54:19PM +0900, Chirantan Ekbote wrote:
> On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> > [..]
> > > > Chirantan,
> > > >
> > > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > > Only "user" xattrs are complete passthrough?
> > > >
> > >
> > > No, we only rename "security" xattrs (except for selinux).
> > >
> > > >
> > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > > on host?
> > > >
> > >
> > > We don't relabel security.selinux because it only requires CAP_FOWNER
> > > in the process's user namespace and it also does its own MAC-based
> > > checks.  Also we have some tools that label files beforehand so it
> > > seemed easier to leave them unchanged.
> >
> > If we rename selinux xattr also, then we can support selinux both in
> > guest and host and they both can have their own independent policies.
> >
> 
> This works as long as you don't need to support setfscreatecon, which
> exists to guarantee that an inode is atomically created with the
> correct selinux context.  It's kind of possible for files with
> O_TMPFILE + fsetxattr + linkat but there is no equivalent for
> directories.  You're basically forced to create the directory and then
> set the xattr and while it's possible to prevent other threads in the
> server from seeing the unfinished directories with a rwlock or similar
> there is no protection against power loss.  If the machine loses power
> after the directory is created but before the selinux xattr is set
> then that directory will have the wrong selinux context and the guest
> would need to run restorecon at boot to assign the correct label.

Overlayfs has this notion of work directory where they create directory
(and file if O_TMPFILE is not supported), and then rename it to correct
place. I am wondering if we could do something similar. If server is
given a work directory where.

A. Create new directory
B. set selinux lables
C. rename directory

That way if machine crashes after step A or step B, that directory
never becomes visible to guest and no relabeling is required.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
@ 2020-07-13 13:39                   ` Vivek Goyal
  0 siblings, 0 replies; 71+ messages in thread
From: Vivek Goyal @ 2020-07-13 13:39 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, Miklos Szeredi, qemu-devel

On Mon, Jul 13, 2020 at 05:54:19PM +0900, Chirantan Ekbote wrote:
> On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> > [..]
> > > > Chirantan,
> > > >
> > > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > > Only "user" xattrs are complete passthrough?
> > > >
> > >
> > > No, we only rename "security" xattrs (except for selinux).
> > >
> > > >
> > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > > on host?
> > > >
> > >
> > > We don't relabel security.selinux because it only requires CAP_FOWNER
> > > in the process's user namespace and it also does its own MAC-based
> > > checks.  Also we have some tools that label files beforehand so it
> > > seemed easier to leave them unchanged.
> >
> > If we rename selinux xattr also, then we can support selinux both in
> > guest and host and they both can have their own independent policies.
> >
> 
> This works as long as you don't need to support setfscreatecon, which
> exists to guarantee that an inode is atomically created with the
> correct selinux context.  It's kind of possible for files with
> O_TMPFILE + fsetxattr + linkat but there is no equivalent for
> directories.  You're basically forced to create the directory and then
> set the xattr and while it's possible to prevent other threads in the
> server from seeing the unfinished directories with a rwlock or similar
> there is no protection against power loss.  If the machine loses power
> after the directory is created but before the selinux xattr is set
> then that directory will have the wrong selinux context and the guest
> would need to run restorecon at boot to assign the correct label.

Overlayfs has this notion of work directory where they create directory
(and file if O_TMPFILE is not supported), and then rename it to correct
place. I am wondering if we could do something similar. If server is
given a work directory where.

A. Create new directory
B. set selinux lables
C. rename directory

That way if machine crashes after step A or step B, that directory
never becomes visible to guest and no relabeling is required.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-07-13  8:54                 ` Chirantan Ekbote
  (?)
  (?)
@ 2020-07-13 21:39                 ` Daniel Walsh
  2020-07-14 12:33                   ` Vivek Goyal
  -1 siblings, 1 reply; 71+ messages in thread
From: Daniel Walsh @ 2020-07-13 21:39 UTC (permalink / raw)
  To: virtio-fs

On 7/13/20 04:54, Chirantan Ekbote wrote:
> On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
>> [..]
>>>> Chirantan,
>>>>
>>>> So you ended up renaming all "trusted", "security" and "system" xattrs?
>>>> Only "user" xattrs are complete passthrough?
>>>>
>>> No, we only rename "security" xattrs (except for selinux).
>>>
>>>> IOW, security.selinux will be renamed to user.virtiofs.security.selinux
>>>> on host?
>>>>
>>> We don't relabel security.selinux because it only requires CAP_FOWNER
>>> in the process's user namespace and it also does its own MAC-based
>>> checks.  Also we have some tools that label files beforehand so it
>>> seemed easier to leave them unchanged.
>> If we rename selinux xattr also, then we can support selinux both in
>> guest and host and they both can have their own independent policies.
>>
> This works as long as you don't need to support setfscreatecon, which
> exists to guarantee that an inode is atomically created with the
> correct selinux context.  It's kind of possible for files with
> O_TMPFILE + fsetxattr + linkat but there is no equivalent for
> directories.  You're basically forced to create the directory and then
> set the xattr and while it's possible to prevent other threads in the
> server from seeing the unfinished directories with a rwlock or similar
> there is no protection against power loss.  If the machine loses power
> after the directory is created but before the selinux xattr is set
> then that directory will have the wrong selinux context and the guest
> would need to run restorecon at boot to assign the correct label.
>
>> Otherwise we either have to disable selinux on host (if we want to
>> support it in guest) or somehow guest and how policies will have
>> to know about each other and be able to work together (which will
>> be hard for a generic use case).
> Yes, I agree this is hard to do for a generic case but unfortunately
> the more I understand how selinux works the less I feel that it works
> well with a passthrough style file system.  As you said it either
> needs to be turned off on the host or the host and guest need to work
> together.

Correct both kernels need to understand the labels, or one of the
kernels has to have SELinux disabled.

That is the bottom line.  Same issue exists for labeled NFS so I don't
see this as a problem.

> Chirantan
>
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-07-13 21:39                 ` Daniel Walsh
@ 2020-07-14 12:33                   ` Vivek Goyal
  2020-07-14 17:16                     ` Daniel Walsh
  0 siblings, 1 reply; 71+ messages in thread
From: Vivek Goyal @ 2020-07-14 12:33 UTC (permalink / raw)
  To: Daniel Walsh; +Cc: virtio-fs

On Mon, Jul 13, 2020 at 05:39:05PM -0400, Daniel Walsh wrote:

[..]
> >> Otherwise we either have to disable selinux on host (if we want to
> >> support it in guest) or somehow guest and how policies will have
> >> to know about each other and be able to work together (which will
> >> be hard for a generic use case).
> > Yes, I agree this is hard to do for a generic case but unfortunately
> > the more I understand how selinux works the less I feel that it works
> > well with a passthrough style file system.  As you said it either
> > needs to be turned off on the host or the host and guest need to work
> > together.
> 
> Correct both kernels need to understand the labels, or one of the
> kernels has to have SELinux disabled.
> 
> That is the bottom line.  Same issue exists for labeled NFS so I don't
> see this as a problem.

Dan,

So what does labeled NFS do. Server disables SELinux so that it can
be enabled on client?

Thanks
Vivek


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)
  2020-07-14 12:33                   ` Vivek Goyal
@ 2020-07-14 17:16                     ` Daniel Walsh
  0 siblings, 0 replies; 71+ messages in thread
From: Daniel Walsh @ 2020-07-14 17:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On 7/14/20 08:33, Vivek Goyal wrote:
> On Mon, Jul 13, 2020 at 05:39:05PM -0400, Daniel Walsh wrote:
>
> [..]
>>>> Otherwise we either have to disable selinux on host (if we want to
>>>> support it in guest) or somehow guest and how policies will have
>>>> to know about each other and be able to work together (which will
>>>> be hard for a generic use case).
>>> Yes, I agree this is hard to do for a generic case but unfortunately
>>> the more I understand how selinux works the less I feel that it works
>>> well with a passthrough style file system.  As you said it either
>>> needs to be turned off on the host or the host and guest need to work
>>> together.
>> Correct both kernels need to understand the labels, or one of the
>> kernels has to have SELinux disabled.
>>
>> That is the bottom line.  Same issue exists for labeled NFS so I don't
>> see this as a problem.
> Dan,
>
> So what does labeled NFS do. Server disables SELinux so that it can
> be enabled on client?
>
> Thanks
> Vivek

Labeled NFS does all three

Server Disabled, Client Enabled - Server does not care about the labels

Sever Enabled, Client Disabled  - Server selects the labels, Client
ignores them

Sever Enabled, Client Enabled  - All labels are seen on the client and
server and must be understood by the Server, if the client does not
understand the label, it will be marked as unlabeled_t on the client, if
the Client tries to create a label that is not understood on the Server,
the Server will block the creation.

The most usual case of labeled NFS is the Client and Server use the same
SELinux policy. Or one or the other is disabled.  The cases where each
is enabled with different policy is rare, in my experience.


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2020-07-14 17:16 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-16 16:49 [PATCH 0/2] virtiofsd: drop Linux capabilities(7) Stefan Hajnoczi
2020-04-16 16:49 ` [Virtio-fs] " Stefan Hajnoczi
2020-04-16 16:49 ` [PATCH 1/2] virtiofsd: only retain file system capabilities Stefan Hajnoczi
2020-04-16 16:49   ` [Virtio-fs] " Stefan Hajnoczi
2020-04-28 11:48   ` Dr. David Alan Gilbert
2020-04-28 11:48     ` [Virtio-fs] " Dr. David Alan Gilbert
2020-04-16 16:49 ` [PATCH 2/2] virtiofsd: drop all capabilities in the wait parent process Stefan Hajnoczi
2020-04-16 16:49   ` [Virtio-fs] " Stefan Hajnoczi
2020-04-16 17:50   ` Philippe Mathieu-Daudé
2020-04-16 17:50     ` [Virtio-fs] " Philippe Mathieu-Daudé
2020-04-16 20:10 ` [PATCH 0/2] virtiofsd: drop Linux capabilities(7) Vivek Goyal
2020-04-16 20:10   ` [Virtio-fs] " Vivek Goyal
2020-04-17  9:42   ` Stefan Hajnoczi
2020-04-17  9:42     ` [Virtio-fs] " Stefan Hajnoczi
2020-05-01 18:28 ` Dr. David Alan Gilbert
2020-05-01 18:28   ` [Virtio-fs] " Dr. David Alan Gilbert
2020-06-18 19:08 ` Vivek Goyal
2020-06-18 19:16   ` Dr. David Alan Gilbert
2020-06-18 19:16     ` Dr. David Alan Gilbert
2020-06-18 19:27     ` Vivek Goyal
2020-06-18 19:27       ` Vivek Goyal
2020-06-19  4:46       ` Chirantan Ekbote
2020-06-19  4:46         ` Chirantan Ekbote
2020-06-19  8:39         ` Dr. David Alan Gilbert
2020-06-19  9:17           ` Chirantan Ekbote
2020-06-19 11:12             ` Dr. David Alan Gilbert
2020-06-19 19:15         ` Vivek Goyal
2020-06-19 19:15           ` Vivek Goyal
2020-06-25  3:19           ` Chirantan Ekbote
2020-06-25  3:19             ` Chirantan Ekbote
2020-06-25 12:55             ` Vivek Goyal
2020-06-25 12:55               ` Vivek Goyal
2020-07-13  8:54               ` Chirantan Ekbote
2020-07-13  8:54                 ` Chirantan Ekbote
2020-07-13 13:39                 ` Vivek Goyal
2020-07-13 13:39                   ` Vivek Goyal
2020-07-13 21:39                 ` Daniel Walsh
2020-07-14 12:33                   ` Vivek Goyal
2020-07-14 17:16                     ` Daniel Walsh
2020-06-19  8:27       ` Dr. David Alan Gilbert
2020-06-19  8:27         ` Dr. David Alan Gilbert
2020-06-19 11:39         ` Daniel P. Berrangé
2020-06-19 11:39           ` Daniel P. Berrangé
2020-06-19 11:49           ` Dr. David Alan Gilbert
2020-06-19 11:49             ` Dr. David Alan Gilbert
2020-06-19 12:05             ` Daniel P. Berrangé
2020-06-19 12:05               ` Daniel P. Berrangé
2020-06-19 17:41               ` Vivek Goyal
2020-06-19 17:41                 ` Vivek Goyal
2020-06-19 19:12           ` Vivek Goyal
2020-06-19 19:12             ` Vivek Goyal
2020-06-26 11:26             ` Dr. David Alan Gilbert
2020-06-26 11:26               ` Dr. David Alan Gilbert
2020-06-19 16:09         ` Vivek Goyal
2020-06-19 16:09           ` Vivek Goyal
2020-06-19 16:16           ` Dr. David Alan Gilbert
2020-06-19 16:16             ` Dr. David Alan Gilbert
2020-06-19 17:11             ` Vivek Goyal
2020-06-19 17:11               ` Vivek Goyal
2020-06-19 17:16               ` Dr. David Alan Gilbert
2020-06-19 17:16                 ` Dr. David Alan Gilbert
2020-06-19 14:16   ` Miklos Szeredi
2020-06-19 14:16     ` Miklos Szeredi
2020-06-19 14:25     ` Vivek Goyal
2020-06-19 14:25       ` Vivek Goyal
2020-06-19 15:26       ` Miklos Szeredi
2020-06-19 15:26         ` Miklos Szeredi
2020-06-19 15:57         ` Vivek Goyal
2020-06-19 15:57           ` Vivek Goyal
2020-06-19 14:29     ` Vivek Goyal
2020-06-19 14:29       ` Vivek Goyal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.