* [RFC] Expose request_module via syscall
@ 2021-09-15 15:49 Thomas Weißschuh
  2021-09-15 16:02 ` Greg KH
  2021-09-15 16:47 ` Andy Lutomirski
  0 siblings, 2 replies; 18+ messages in thread
From: Thomas Weißschuh @ 2021-09-15 15:49 UTC
To: linux-api, linux-kernel, Luis Chamberlain, Jessica Yu

Hi,

I would like to propose a new syscall that exposes the functionality of
request_module() to userspace.

Proposed signature: request_module(char *module_name, char **args, int flags);
Where args and flags have to be NULL and 0 for the time being.

Rationale:

We are using nested, privileged containers which are loading kernel modules.
Currently we have to always pass around the contents of /lib/modules from the
root namespace which contains the modules.
(Also the containers need to have userspace components for module loading
installed)

The syscall would remove the need for this bookkeeping work.

If this has a chance of getting accepted I would be happy to provide an
implementation.

Thanks,
Thomas

* Re: [RFC] Expose request_module via syscall
  2021-09-15 15:49 [RFC] Expose request_module via syscall Thomas Weißschuh
@ 2021-09-15 16:02 ` Greg KH
  2021-09-15 16:28   ` Thomas Weißschuh
  2021-09-15 16:47 ` Andy Lutomirski
  1 sibling, 1 reply; 18+ messages in thread
From: Greg KH @ 2021-09-15 16:02 UTC
To: Thomas Weißschuh
Cc: linux-api, linux-kernel, Luis Chamberlain, Jessica Yu

On Wed, Sep 15, 2021 at 05:49:34PM +0200, Thomas Weißschuh wrote:
> Hi,
>
> I would like to propose a new syscall that exposes the functionality of
> request_module() to userspace.
>
> Proposed signature: request_module(char *module_name, char **args, int flags);
> Where args and flags have to be NULL and 0 for the time being.
>
> Rationale:
>
> We are using nested, privileged containers which are loading kernel modules.
> Currently we have to always pass around the contents of /lib/modules from the
> root namespace which contains the modules.
> (Also the containers need to have userspace components for module loading
> installed)
>
> The syscall would remove the need for this bookkeeping work.

So you want any container to have the ability to "bust through" the
containers and load a module from the "root" of the system?

That feels dangerous, why not just allow a mount of /lib/modules into
the containers that you want to be able to load a module?

Why are modules somehow "special" here, they are just a resource that
has to be allowed (or not) to be accessed by a container like anything
else on a filesystem.

thanks,

greg k-h

* Re: [RFC] Expose request_module via syscall
  2021-09-15 16:02 ` Greg KH
@ 2021-09-15 16:28   ` Thomas Weißschuh
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Weißschuh @ 2021-09-15 16:28 UTC
To: Greg KH
Cc: linux-api, linux-kernel, Luis Chamberlain, Jessica Yu

On 2021-09-15T18:02+0200, Greg KH wrote:
> On Wed, Sep 15, 2021 at 05:49:34PM +0200, Thomas Weißschuh wrote:
> > Hi,
> >
> > I would like to propose a new syscall that exposes the functionality of
> > request_module() to userspace.
> >
> > Proposed signature: request_module(char *module_name, char **args, int flags);
> > Where args and flags have to be NULL and 0 for the time being.
> >
> > Rationale:
> >
> > We are using nested, privileged containers which are loading kernel modules.
> > Currently we have to always pass around the contents of /lib/modules from the
> > root namespace which contains the modules.
> > (Also the containers need to have userspace components for module loading
> > installed)
> >
> > The syscall would remove the need for this bookkeeping work.
>
> So you want any container to have the ability to "bust through" the
> containers and load a module from the "root" of the system?

Only those with CAP_SYS_MODULE.
Having this capability would also allow them to load the module normally
when it is mounted in or potentially downloaded from the internet.

> That feels dangerous, why not just allow a mount of /lib/modules into
> the containers that you want to be able to load a module?

This is what we are currently doing.
But sometimes this gets forgotten at some point in the chain of nested
containers/namespaces and things break.

> Why are modules somehow "special" here, they are just a resource that
> has to be allowed (or not) to be accessed by a container like anything
> else on a filesystem.

They are special insofar as they always have to match the running kernel,
which is managed by the root namespace.
The biggest problems would probably arise if the root namespace has
non-standard modules available which the container would normally not
have access to.

I think this is a big potential problem, and one that would not be
justified by the quality-of-life improvement.

Sorry for the noise.

Thomas

* Re: [RFC] Expose request_module via syscall
  2021-09-15 15:49 [RFC] Expose request_module via syscall Thomas Weißschuh
  2021-09-15 16:02 ` Greg KH
@ 2021-09-15 16:47 ` Andy Lutomirski
  2021-09-16 9:27   ` Christian Brauner
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-15 16:47 UTC
To: Thomas Weißschuh
Cc: Linux API, LKML, Luis Chamberlain, Jessica Yu

On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
>
> Hi,
>
> I would like to propose a new syscall that exposes the functionality of
> request_module() to userspace.
>
> Proposed signature: request_module(char *module_name, char **args, int flags);
> Where args and flags have to be NULL and 0 for the time being.
>
> Rationale:
>
> We are using nested, privileged containers which are loading kernel modules.
> Currently we have to always pass around the contents of /lib/modules from the
> root namespace which contains the modules.
> (Also the containers need to have userspace components for module loading
> installed)
>
> The syscall would remove the need for this bookkeeping work.

I feel like I'm missing something, and I don't understand the purpose
of this syscall. Wouldn't the right solution be for the container to
have a stub module loader (maybe doable with a special /sbin/modprobe
or maybe a kernel patch would be needed, depending on the exact use
case) and have the stub call out to the container manager to request
the module? The container manager would check its security policy and
load the module or not load it as appropriate.

--Andy

* Re: [RFC] Expose request_module via syscall
  2021-09-15 16:47 ` Andy Lutomirski
@ 2021-09-16 9:27   ` Christian Brauner
  2021-09-18 18:47     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-09-16 9:27 UTC
To: Andy Lutomirski
Cc: Thomas Weißschuh, Linux API, LKML, Luis Chamberlain, Jessica Yu

On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> >
> > Hi,
> >
> > I would like to propose a new syscall that exposes the functionality of
> > request_module() to userspace.
> >
> > Proposed signature: request_module(char *module_name, char **args, int flags);
> > Where args and flags have to be NULL and 0 for the time being.
> >
> > Rationale:
> >
> > We are using nested, privileged containers which are loading kernel modules.
> > Currently we have to always pass around the contents of /lib/modules from the
> > root namespace which contains the modules.
> > (Also the containers need to have userspace components for module loading
> > installed)
> >
> > The syscall would remove the need for this bookkeeping work.
>
> I feel like I'm missing something, and I don't understand the purpose
> of this syscall. Wouldn't the right solution be for the container to
> have a stub module loader (maybe doable with a special /sbin/modprobe
> or maybe a kernel patch would be needed, depending on the exact use
> case) and have the stub call out to the container manager to request
> the module? The container manager would check its security policy and
> load the module or not load it as appropriate.

I don't see the need for a syscall like this yet either.

This should be the job of the container manager. modprobe just calls the
init_module() syscall, right?

If so the seccomp notifier can be used to intercept this system call for
the container and verify the module against an allowlist similar to how
we currently handle mount.

Christian

* Re: [RFC] Expose request_module via syscall
  2021-09-16 9:27 ` Christian Brauner
@ 2021-09-18 18:47   ` Andy Lutomirski
  2021-09-19 7:56     ` Thomas Weißschuh
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-18 18:47 UTC
To: Christian Brauner
Cc: Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Luis Chamberlain, Jessica Yu

On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a new syscall that exposes the functionality of
> > > request_module() to userspace.
> > >
> > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > Where args and flags have to be NULL and 0 for the time being.
> > >
> > > Rationale:
> > >
> > > We are using nested, privileged containers which are loading kernel modules.
> > > Currently we have to always pass around the contents of /lib/modules from the
> > > root namespace which contains the modules.
> > > (Also the containers need to have userspace components for module loading
> > > installed)
> > >
> > > The syscall would remove the need for this bookkeeping work.
> >
> > I feel like I'm missing something, and I don't understand the purpose
> > of this syscall. Wouldn't the right solution be for the container to
> > have a stub module loader (maybe doable with a special /sbin/modprobe
> > or maybe a kernel patch would be needed, depending on the exact use
> > case) and have the stub call out to the container manager to request
> > the module? The container manager would check its security policy and
> > load the module or not load it as appropriate.
>
> I don't see the need for a syscall like this yet either.
>
> This should be the job of the container manager. modprobe just calls the
> init_module() syscall, right?

Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.

But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?

>
> If so the seccomp notifier can be used to intercept this system call for
> the container and verify the module against an allowlist similar to how
> we currently handle mount.
>
> Christian
>

* Re: [RFC] Expose request_module via syscall
  2021-09-18 18:47 ` Andy Lutomirski
@ 2021-09-19 7:56   ` Thomas Weißschuh
  2021-09-19 14:37     ` Andy Lutomirski
  2021-10-24 9:38     ` Thomas Weißschuh
  0 siblings, 2 replies; 18+ messages in thread
From: Thomas Weißschuh @ 2021-09-19 7:56 UTC
To: Andy Lutomirski
Cc: Christian Brauner, Linux API, Linux Kernel Mailing List, Luis Chamberlain, Jessica Yu

On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I would like to propose a new syscall that exposes the functionality of
> > > > request_module() to userspace.
> > > >
> > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > Where args and flags have to be NULL and 0 for the time being.
> > > >
> > > > Rationale:
> > > >
> > > > We are using nested, privileged containers which are loading kernel modules.
> > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > root namespace which contains the modules.
> > > > (Also the containers need to have userspace components for module loading
> > > > installed)
> > > >
> > > > The syscall would remove the need for this bookkeeping work.
> > >
> > > I feel like I'm missing something, and I don't understand the purpose
> > > of this syscall. Wouldn't the right solution be for the container to
> > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > or maybe a kernel patch would be needed, depending on the exact use
> > > case) and have the stub call out to the container manager to request
> > > the module? The container manager would check its security policy and
> > > load the module or not load it as appropriate.
> >
> > I don't see the need for a syscall like this yet either.
> >
> > This should be the job of the container manager. modprobe just calls the
> > init_module() syscall, right?
>
> Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
>
> But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?

The container is running an instance of the docker daemon in swarm mode.
That needs the "ip_vs" module (amongst others) and explicitly tries to load it
via modprobe.

> > If so the seccomp notifier can be used to intercept this system call for
> > the container and verify the module against an allowlist similar to how
> > we currently handle mount.
> >
> > Christian
> >

* Re: [RFC] Expose request_module via syscall
  2021-09-19 7:56 ` Thomas Weißschuh
@ 2021-09-19 14:37   ` Andy Lutomirski
  2021-09-20 14:51     ` Thomas Weißschuh
  2021-10-24 9:38   ` Thomas Weißschuh
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-19 14:37 UTC
To: Thomas Weißschuh
Cc: Andy Lutomirski, Christian Brauner, Linux API, Linux Kernel Mailing List, Luis Chamberlain, Jessica Yu

On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
>
> On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > request_module() to userspace.
> > > > >
> > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > >
> > > > > Rationale:
> > > > >
> > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > root namespace which contains the modules.
> > > > > (Also the containers need to have userspace components for module loading
> > > > > installed)
> > > > >
> > > > > The syscall would remove the need for this bookkeeping work.
> > > >
> > > > I feel like I'm missing something, and I don't understand the purpose
> > > > of this syscall. Wouldn't the right solution be for the container to
> > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > case) and have the stub call out to the container manager to request
> > > > the module? The container manager would check its security policy and
> > > > load the module or not load it as appropriate.
> > >
> > > I don't see the need for a syscall like this yet either.
> > >
> > > This should be the job of the container manager. modprobe just calls the
> > > init_module() syscall, right?
> >
> > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> >
> > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
>
> The container is running an instance of the docker daemon in swarm mode.
> That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> via modprobe.
>

Do you mean it literally invokes /sbin/modprobe? If so, hooking this
at /sbin/modprobe and calling out to the container manager seems like
a decent solution.

> > > If so the seccomp notifier can be used to intercept this system call for
> > > the container and verify the module against an allowlist similar to how
> > > we currently handle mount.
> > >
> > > Christian
> > >

* Re: [RFC] Expose request_module via syscall
  2021-09-19 14:37 ` Andy Lutomirski
@ 2021-09-20 14:51   ` Thomas Weißschuh
  2021-09-20 16:59     ` Luis Chamberlain
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Weißschuh @ 2021-09-20 14:51 UTC
To: Andy Lutomirski
Cc: Christian Brauner, Linux API, Linux Kernel Mailing List, Luis Chamberlain, Jessica Yu

On 2021-09-19T07:37-0700, Andy Lutomirski wrote:
> On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> >
> > On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > > request_module() to userspace.
> > > > > >
> > > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > > >
> > > > > > Rationale:
> > > > > >
> > > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > > root namespace which contains the modules.
> > > > > > (Also the containers need to have userspace components for module loading
> > > > > > installed)
> > > > > >
> > > > > > The syscall would remove the need for this bookkeeping work.
> > > > >
> > > > > I feel like I'm missing something, and I don't understand the purpose
> > > > > of this syscall. Wouldn't the right solution be for the container to
> > > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > > case) and have the stub call out to the container manager to request
> > > > > the module? The container manager would check its security policy and
> > > > > load the module or not load it as appropriate.
> > > >
> > > > I don't see the need for a syscall like this yet either.
> > > >
> > > > This should be the job of the container manager. modprobe just calls the
> > > > init_module() syscall, right?
> > >
> > > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> > >
> > > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
> >
> > The container is running an instance of the docker daemon in swarm mode.
> > That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> > via modprobe.
> >
>
> Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> at /sbin/modprobe and calling out to the container manager seems like
> a decent solution.

Yes it does. Thanks for the idea, I'll see how this works out.

> > > > If so the seccomp notifier can be used to intercept this system call for
> > > > the container and verify the module against an allowlist similar to how
> > > > we currently handle mount.
> > > >
> > > > Christian
> > > >

* Re: [RFC] Expose request_module via syscall
  2021-09-20 14:51 ` Thomas Weißschuh
@ 2021-09-20 16:59   ` Luis Chamberlain
  2021-09-20 18:36     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Luis Chamberlain @ 2021-09-20 16:59 UTC
To: Thomas Weißschuh
Cc: Andy Lutomirski, Christian Brauner, Linux API, Linux Kernel Mailing List, Jessica Yu

On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> On 2021-09-19T07:37-0700, Andy Lutomirski wrote:
> > On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > >
> > > On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > > > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > > > request_module() to userspace.
> > > > > > >
> > > > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > > > >
> > > > > > > Rationale:
> > > > > > >
> > > > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > > > root namespace which contains the modules.
> > > > > > > (Also the containers need to have userspace components for module loading
> > > > > > > installed)
> > > > > > >
> > > > > > > The syscall would remove the need for this bookkeeping work.
> > > > > >
> > > > > > I feel like I'm missing something, and I don't understand the purpose
> > > > > > of this syscall. Wouldn't the right solution be for the container to
> > > > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > > > case) and have the stub call out to the container manager to request
> > > > > > the module? The container manager would check its security policy and
> > > > > > load the module or not load it as appropriate.
> > > > >
> > > > > I don't see the need for a syscall like this yet either.
> > > > >
> > > > > This should be the job of the container manager. modprobe just calls the
> > > > > init_module() syscall, right?
> > > >
> > > > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> > > >
> > > > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
> > >
> > > The container is running an instance of the docker daemon in swarm mode.
> > > That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> > > via modprobe.
> > >
> >
> > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > at /sbin/modprobe and calling out to the container manager seems like
> > a decent solution.
>
> Yes it does. Thanks for the idea, I'll see how this works out.

Would documentation guiding you in that way have helped? If so
I welcome a patch that does just that.

  Luis

* Re: [RFC] Expose request_module via syscall
  2021-09-20 16:59 ` Luis Chamberlain
@ 2021-09-20 18:36   ` Andy Lutomirski
  2021-09-22 12:25     ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-20 18:36 UTC
To: Luis Chamberlain
Cc: Thomas Weißschuh, Andy Lutomirski, Christian Brauner, Linux API, Linux Kernel Mailing List, Jessica Yu

On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:

> > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > > at /sbin/modprobe and calling out to the container manager seems like
> > > a decent solution.
> >
> > Yes it does. Thanks for the idea, I'll see how this works out.
>
> Would documentation guiding you in that way have helped? If so
> I welcome a patch that does just that.

If someone wants to make this classy, we should probably have the
container counterpart of a standardized paravirt interface. There
should be a way for a container to, in a runtime-agnostic way, issue
requests to its manager, and requesting a module by (name, Linux
kernel version for which that name makes sense) seems like an
excellent use of such an interface.

--Andy

>
> Luis

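A minimal sketch of the kind of request described in the message above, written in Python. Everything specific here is an assumption made for illustration: the socket path `/run/container-manager.sock`, the one-line JSON message with `op`, `name` and `kernel_release` fields, and the `ok` field in the reply are all hypothetical, since no such standard interface exists in the thread. The stub only forwards the (name, kernel release) pair; the policy decision and the actual load stay with the manager.

```python
import json
import os
import socket
import sys

# Hypothetical path: no container runtime standardizes this socket.
MANAGER_SOCKET = "/run/container-manager.sock"


def build_request(module_name, kernel_release=None):
    """Encode a module request as one newline-terminated JSON line."""
    if kernel_release is None:
        # The kernel release for which the module name makes sense.
        kernel_release = os.uname().release
    message = {
        "op": "request_module",  # hypothetical operation name
        "name": module_name,
        "kernel_release": kernel_release,
    }
    return (json.dumps(message, sort_keys=True) + "\n").encode("utf-8")


def request_module(module_name, socket_path=MANAGER_SOCKET):
    """Send the request and return the manager's decoded JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(socket_path)
        conn.sendall(build_request(module_name))
        reply = conn.makefile("r", encoding="utf-8").readline()
    return json.loads(reply)


def main(argv):
    """Entry point for a stub installed in place of /sbin/modprobe."""
    if len(argv) != 2:
        print("usage: modprobe MODULE", file=sys.stderr)
        return 2
    reply = request_module(argv[1])
    # "ok" is a hypothetical reply field set by the manager.
    return 0 if reply.get("ok") else 1
```

A stub script would end with `sys.exit(main(sys.argv))`, so that a caller running `modprobe ip_vs` inside the container gets a zero exit status only if the manager accepted and loaded the module.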
* Re: [RFC] Expose request_module via syscall
  2021-09-20 18:36 ` Andy Lutomirski
@ 2021-09-22 12:25   ` Christian Brauner
  2021-09-22 15:34     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-09-22 12:25 UTC
To: Andy Lutomirski
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>
> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > > > at /sbin/modprobe and calling out to the container manager seems like
> > > > a decent solution.
> > >
> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >
> > Would documentation guiding you in that way have helped? If so
> > I welcome a patch that does just that.
>
> If someone wants to make this classy, we should probably have the
> container counterpart of a standardized paravirt interface. There
> should be a way for a container to, in a runtime-agnostic way, issue
> requests to its manager, and requesting a module by (name, Linux
> kernel version for which that name makes sense) seems like an
> excellent use of such an interface.

I always thought of this in two ways we currently do this:

1. Caller transparent container manager requests.
   This is the seccomp notifier where we transparently handle syscalls
   including intercepting init_module() where we parse out the module to
   be loaded from the syscall args of the container and if it is
   allow-listed load it for the container otherwise continue the syscall
   letting it fail or failing directly through seccomp return value.

2. A process in the container explicitly calling out to the container
   manager.
   One example how this happens is systemd-nspawn via dbus messages
   between systemd in the container and systemd outside the container to
   e.g. allocate a new terminal in the container (kinda insecure but
   that's another issue) or other stuff.

So what was your idea: would it be like a device file that could be
exposed to the container where it writes requests to the container
manager? What would be the advantage to just standardizing a socket
protocol which is what we do for example (it doesn't do module loading
of course as we handle that differently):

## Container to host communication

LXD sets up a socket at `/dev/lxd/sock` which root in the container can
use to communicate with LXD on the host.

In LXD, this feature is implemented through a /dev/lxd/sock node which
is created and setup for all LXD instances.

This file is a Unix socket which processes inside the instance can
connect to. It's multi-threaded so multiple clients can be connected at
the same time.

## Implementation details

LXD on the host binds /var/lib/lxd/devlxd/sock and starts listening for
new connections on it.

This socket is then exposed into every single instance started by LXD at
/dev/lxd/sock.

The single socket is required so we can exceed 4096 instances,
otherwise, LXD would have to bind a different socket for every instance,
quickly reaching the FD limit.

## Authentication

Queries on /dev/lxd/sock will only return information related to the
requesting instance. To figure out where a request comes from, LXD will
extract the initial socket ucred and compare that to the list of
instances it manages.

## Protocol

The protocol on /dev/lxd/sock is plain-text HTTP with JSON messaging, so
very similar to the local version of the LXD protocol.

Unlike the main LXD API, there is no background operation and no
authentication support in the /dev/lxd/sock API.

Christian

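A minimal sketch of a client for a socket like the one described above, written in Python: it issues a plain HTTP GET over a Unix socket and decodes the JSON body. The default socket path comes from the text; the `/1.0` endpoint used in the usage note is an assumption, and the helper names `UnixHTTPConnection` and `query` are made up for this example. Check the LXD documentation for the endpoints that actually exist.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix socket instead of TCP."""

    def __init__(self, socket_path):
        # The host name is only used to fill in the Host: header.
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        # Replace the TCP connect with a Unix stream socket connect.
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def query(path, socket_path="/dev/lxd/sock"):
    """GET `path` over the Unix socket and return the decoded JSON body."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"{path}: HTTP {response.status}")
        return json.loads(body)
    finally:
        conn.close()
```

Inside an instance, `query("/1.0")` would be the natural first call, assuming that endpoint is present. Because the manager identifies the caller from the socket credentials, the client sends no authentication of its own.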
* Re: [RFC] Expose request_module via syscall
  2021-09-22 12:25 ` Christian Brauner
@ 2021-09-22 15:34   ` Andy Lutomirski
  2021-09-22 15:52     ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-22 15:34 UTC
To: Christian Brauner
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
> On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>> >
>> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>>
>> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
>> > > > at /sbin/modprobe and calling out to the container manager seems like
>> > > > a decent solution.
>> > >
>> > > Yes it does. Thanks for the idea, I'll see how this works out.
>> >
>> > Would documentation guiding you in that way have helped? If so
>> > I welcome a patch that does just that.
>>
>> If someone wants to make this classy, we should probably have the
>> container counterpart of a standardized paravirt interface. There
>> should be a way for a container to, in a runtime-agnostic way, issue
>> requests to its manager, and requesting a module by (name, Linux
>> kernel version for which that name makes sense) seems like an
>> excellent use of such an interface.
>
> I always thought of this in two ways we currently do this:
>
> 1. Caller transparent container manager requests.
>    This is the seccomp notifier where we transparently handle syscalls
>    including intercepting init_module() where we parse out the module to
>    be loaded from the syscall args of the container and if it is
>    allow-listed load it for the container otherwise continue the syscall
>    letting it fail or failing directly through seccomp return value.

Specific problems here include aliases and dependencies. My modules.alias file, for example, has:

alias net-pf-16-proto-16-family-wireguard wireguard

If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.

>
> 2. A process in the container explicitly calling out to the container
>    manager.
>    One example how this happens is systemd-nspawn via dbus messages
>    between systemd in the container and systemd outside the container to
>    e.g. allocate a new terminal in the container (kinda insecure but
>    that's another issue) or other stuff.
>
> So what was your idea: would it be like a device file that could be
> exposed to the container where it writes requests to the container
> manager? What would be the advantage to just standardizing a socket
> protocol which is what we do for example (it doesn't do module loading
> of course as we handle that differently):

My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.

I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.

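The alias step described in the message above can be sketched in a few lines of Python. This is an illustration only, not how kmod is implemented: it handles just `alias <pattern> <module>` lines, matches patterns with shell-style wildcards, and ignores dependency resolution, `-`/`_` normalization and configuration under /etc. The function names `parse_aliases` and `resolve` are made up for this example.

```python
import fnmatch


def parse_aliases(text):
    """Return (pattern, module) pairs from modules.alias-style text."""
    aliases = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and anything that is not
        # "alias <pattern> <module>".
        if len(fields) == 3 and fields[0] == "alias":
            aliases.append((fields[1], fields[2]))
    return aliases


def resolve(name, aliases):
    """Map a requested name to module names via the alias table."""
    matches = [module for pattern, module in aliases
               if fnmatch.fnmatchcase(name, pattern)]
    # A name without a matching alias is assumed to be a module name.
    return matches or [name]


ALIASES = parse_aliases(
    "# Aliases extracted from modules themselves.\n"
    "alias net-pf-16-proto-16-family-wireguard wireguard\n"
)
print(resolve("net-pf-16-proto-16-family-wireguard", ALIASES))
print(resolve("ip_vs", ALIASES))
```

The first call prints `['wireguard']` and the second prints `['ip_vs']`. The point of the sketch is that this lookup needs the alias table for the running kernel, which is why the name has to be resolved before any init_module() call is made.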
* Re: [RFC] Expose request_module via syscall
  2021-09-22 15:34 ` Andy Lutomirski
@ 2021-09-22 15:52 ` Christian Brauner
  2021-09-22 20:06 ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-09-22 15:52 UTC
To: Andy Lutomirski
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On Wed, Sep 22, 2021 at 08:34:23AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
> > On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> >> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >> >
> >> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> >>
> >> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> >> > > > at /sbin/modprobe and calling out to the container manager seems like
> >> > > > a decent solution.
> >> > >
> >> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >> >
> >> > Would documentation guiding you in that way have helped? If so
> >> > I welcome a patch that does just that.
> >>
> >> If someone wants to make this classy, we should probably have the
> >> container counterpart of a standardized paravirt interface. There
> >> should be a way for a container to, in a runtime-agnostic way, issue
> >> requests to its manager, and requesting a module by (name, Linux
> >> kernel version for which that name makes sense) seems like an
> >> excellent use of such an interface.
> >
> > I always thought of this in two ways we currently do this:
> >
> > 1. Caller transparent container manager requests.
> > This is the seccomp notifier where we transparently handle syscalls
> > including intercepting init_module() where we parse out the module to
> > be loaded from the syscall args of the container and if it is
> > allow-listed load it for the container otherwise continue the syscall
> > letting it fail or failing directly through seccomp return value.
>
> Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
>
> alias net-pf-16-proto-16-family-wireguard wireguard
>
> If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.

You can't use the container's .ko module. For this you would need to
trust the image that the container wants you to load. The container
manager should always load a host module.

>
> >
> > 2. A process in the container explicitly calling out to the container
> > manager.
> > One example how this happens is systemd-nspawn via dbus messages
> > between systemd in the container and systemd outside the container to
> > e.g. allocate a new terminal in the container (kinda insecure but
> > that's another issue) or other stuff.
> >
> > So what was your idea: would it be like a device file that could be
> > exposed to the container where it writes requests to the container
> > manager? What would be the advantage to just standardizing a socket
> > protocol which is what we do for example (it doesn't do module loading
> > of course as we handle that differently):
>
> My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
>
> I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.

I don't see this as a big issue because that is fairly trivial.
I think we never want to trust the container's modules.
What probably should be happening is that the manager exposes a list of
modules the container can request in some form. We have precedent for
doing something like this.
So now modprobe and similar tools can be made aware that if they are in
a container they should request that module from the container manager
be it via a socket request or something else.
Nesting will be a bit funny but can probably be made to work by just
bind-mounting the outermost socket into the container or relaying the
request.
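To make the "request the module from the container manager over a socket" idea above more concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: no such protocol is defined in this thread, and the JSON message shape, the `ALLOWED_MODULES` policy, the function names and the `load_host_module` callback are all hypothetical. The request carries a (name, kernel release) pair as suggested earlier in the thread, and the manager only ever loads a module from the host, never a .ko supplied by the container.

```python
import json
import socket

# Hypothetical manager-side policy: modules this container may request.
ALLOWED_MODULES = {"wireguard", "ip_vs"}


def handle_request(conn, host_release, load_host_module):
    """Manager side: answer one request, only ever loading a host module."""
    request = json.loads(conn.recv(4096).decode())
    name = request.get("module")
    # Refuse names outside the policy and requests made for another kernel.
    allowed = name in ALLOWED_MODULES and request.get("kernel") == host_release
    loaded = bool(load_host_module(name)) if allowed else False
    conn.sendall(json.dumps({"module": name, "loaded": loaded}).encode())


def send_request(conn, name, release):
    """Container side: ask the manager for a module by (name, kernel release)."""
    conn.sendall(json.dumps({"module": name, "kernel": release}).encode())


def read_reply(conn):
    """Container side: the caller either gets the module or does not."""
    return json.loads(conn.recv(4096).decode())["loaded"]


def demo(name, release="5.14.0", host_release="5.14.0"):
    # A socketpair stands in for a socket the manager would expose to the
    # container; the loader is a stub, where a real manager would presumably
    # run the host's own module loading tooling.
    container_end, manager_end = socket.socketpair()
    with container_end, manager_end:
        send_request(container_end, name, release)
        handle_request(manager_end, host_release,
                       load_host_module=lambda _name: True)
        return read_reply(container_end)


if __name__ == "__main__":
    print(demo("wireguard"))
    print(demo("some_other_module"))
```

In this sketch the container never learns which modules exist on the host; it only learns whether its own request succeeded. For nesting, an inner manager could relay the same message to the outer one unchanged.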
* Re: [RFC] Expose request_module via syscall
  2021-09-22 15:52 ` Christian Brauner
@ 2021-09-22 20:06 ` Andy Lutomirski
  2021-09-24 13:19 ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-22 20:06 UTC
To: Christian Brauner
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On Wed, Sep 22, 2021, at 8:52 AM, Christian Brauner wrote:
> On Wed, Sep 22, 2021 at 08:34:23AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
>> > On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
>> >> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>> >> >
>> >> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>> >>
>> >> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
>> >> > > > at /sbin/modprobe and calling out to the container manager seems like
>> >> > > > a decent solution.
>> >> > >
>> >> > > Yes it does. Thanks for the idea, I'll see how this works out.
>> >> >
>> >> > Would documentation guiding you in that way have helped? If so
>> >> > I welcome a patch that does just that.
>> >>
>> >> If someone wants to make this classy, we should probably have the
>> >> container counterpart of a standardized paravirt interface. There
>> >> should be a way for a container to, in a runtime-agnostic way, issue
>> >> requests to its manager, and requesting a module by (name, Linux
>> >> kernel version for which that name makes sense) seems like an
>> >> excellent use of such an interface.
>> >
>> > I always thought of this in two ways we currently do this:
>> >
>> > 1. Caller transparent container manager requests.
>> > This is the seccomp notifier where we transparently handle syscalls
>> > including intercepting init_module() where we parse out the module to
>> > be loaded from the syscall args of the container and if it is
>> > allow-listed load it for the container otherwise continue the syscall
>> > letting it fail or failing directly through seccomp return value.
>>
>> Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
>>
>> alias net-pf-16-proto-16-family-wireguard wireguard
>>
>> If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.
>
> You can't use the container's .ko module. For this you would need to
> trust the image that the container wants you to load. The container
> manager should always load a host module.
>

Agreed.

>>
>> >
>> > 2. A process in the container explicitly calling out to the container
>> > manager.
>> > One example how this happens is systemd-nspawn via dbus messages
>> > between systemd in the container and systemd outside the container to
>> > e.g. allocate a new terminal in the container (kinda insecure but
>> > that's another issue) or other stuff.
>> >
>> > So what was your idea: would it be like a device file that could be
>> > exposed to the container where it writes requests to the container
>> > manager? What would be the advantage to just standardizing a socket
>> > protocol which is what we do for example (it doesn't do module loading
>> > of course as we handle that differently):
>>
>> My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
>>
>> I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.
>
> I don't see this as a big issue because that is fairly trivial.
> I think we never want to trust the container's modules.
> What probably should be happening is that the manager exposes a list of
> modules the container can request in some form. We have precedent for
> doing something like this.
> So now modprobe and similar tools can be made aware that if they are in
> a container they should request that module from the container manager
> be it via a socket request or something else.
> Nesting will be a bit funny but can probably be made to work by just
> bind-mounting the outermost socket into the container or relaying the
> request.

Why bother with a list? I think it should be sufficient for the container to ask for a module and either get it or not get it.
* Re: [RFC] Expose request_module via syscall
  2021-09-22 20:06 ` Andy Lutomirski
@ 2021-09-24 13:19 ` Christian Brauner
  2021-09-24 23:04 ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-09-24 13:19 UTC
To: Andy Lutomirski
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On Wed, Sep 22, 2021 at 01:06:49PM -0700, Andy Lutomirski wrote:
>
>
> On Wed, Sep 22, 2021, at 8:52 AM, Christian Brauner wrote:
> > On Wed, Sep 22, 2021 at 08:34:23AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
> >> > On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> >> >> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >> >> >
> >> >> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> >> >>
> >> >> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> >> >> > > > at /sbin/modprobe and calling out to the container manager seems like
> >> >> > > > a decent solution.
> >> >> > >
> >> >> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >> >> >
> >> >> > Would documentation guiding you in that way have helped? If so
> >> >> > I welcome a patch that does just that.
> >> >>
> >> >> If someone wants to make this classy, we should probably have the
> >> >> container counterpart of a standardized paravirt interface. There
> >> >> should be a way for a container to, in a runtime-agnostic way, issue
> >> >> requests to its manager, and requesting a module by (name, Linux
> >> >> kernel version for which that name makes sense) seems like an
> >> >> excellent use of such an interface.
> >> >
> >> > I always thought of this in two ways we currently do this:
> >> >
> >> > 1. Caller transparent container manager requests.
> >> > This is the seccomp notifier where we transparently handle syscalls
> >> > including intercepting init_module() where we parse out the module to
> >> > be loaded from the syscall args of the container and if it is
> >> > allow-listed load it for the container otherwise continue the syscall
> >> > letting it fail or failing directly through seccomp return value.
> >>
> >> Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
> >>
> >> alias net-pf-16-proto-16-family-wireguard wireguard
> >>
> >> If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.
> >
> > You can't use the container's .ko module. For this you would need to
> > trust the image that the container wants you to load. The container
> > manager should always load a host module.
> >
>
> Agreed.
>
> >>
> >> >
> >> > 2. A process in the container explicitly calling out to the container
> >> > manager.
> >> > One example how this happens is systemd-nspawn via dbus messages
> >> > between systemd in the container and systemd outside the container to
> >> > e.g. allocate a new terminal in the container (kinda insecure but
> >> > that's another issue) or other stuff.
> >> >
> >> > So what was your idea: would it be like a device file that could be
> >> > exposed to the container where it writes requests to the container
> >> > manager? What would be the advantage to just standardizing a socket
> >> > protocol which is what we do for example (it doesn't do module loading
> >> > of course as we handle that differently):
> >>
> >> My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
> >>
> >> I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.
> >
> > I don't see this as a big issue because that is fairly trivial.
> > I think we never want to trust the container's modules.
> > What probably should be happening is that the manager exposes a list of
> > modules the container can request in some form. We have precedent for
> > doing something like this.
> > So now modprobe and similar tools can be made aware that if they are in
> > a container they should request that module from the container manager
> > be it via a socket request or something else.
> > Nesting will be a bit funny but can probably be made to work by just
> > bind-mounting the outermost socket into the container or relaying the
> > request.
>
> Why bother with a list? I think it should be sufficient for the container to ask for a module and either get it or not get it.

I just meant that the programs in the container can see the modules
available on the host. Simplest thing could be bind-mounting in the
host's module folder with suitable protection (locked read-only mount).
But yeah, it can likely be as simple as allowing it to ask for a module
and not bother telling it about what is available.
* Re: [RFC] Expose request_module via syscall
  2021-09-24 13:19 ` Christian Brauner
@ 2021-09-24 23:04 ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-09-24 23:04 UTC
To: Christian Brauner
Cc: Luis Chamberlain, Thomas Weißschuh, Linux API, Linux Kernel Mailing List, Jessica Yu

On 9/24/21 06:19, Christian Brauner wrote:
> On Wed, Sep 22, 2021 at 01:06:49PM -0700, Andy Lutomirski wrote:

> I just meant that the programs in the container can see the modules
> available on the host. Simplest thing could be bind-mounting in the
> host's module folder with suitable protection (locked read-only mount).
> But yeah, it can likely be as simple as allowing it to ask for a module
> and not bother telling it about what is available.
>

If the container gets to see host modules, interesting races when containers are migrated CRIU-style will result.
* Re: [RFC] Expose request_module via syscall
  2021-09-19 7:56 ` Thomas Weißschuh
  2021-09-19 14:37 ` Andy Lutomirski
@ 2021-10-24 9:38 ` Thomas Weißschuh
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Weißschuh @ 2021-10-24 9:38 UTC
To: Andy Lutomirski
Cc: Christian Brauner, Linux API, Linux Kernel Mailing List, Luis Chamberlain, Jessica Yu

On 2021-09-19 09:56+0200, Thomas Weißschuh wrote:
> On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
>
> The container is running an instance of the docker daemon in swarm mode.
> That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> via modprobe.

If somebody stumbles upon this specific issue:
The "ip_vs" module will be autoloaded in future kernel versions with
https://lore.kernel.org/lkml/20211021130255.4177-1-linux@weissschuh.net/ applied.

> > > If so the seccomp notifier can be used to intercept this system call for
> > > the container and verify the module against an allowlist similar to how
> > > we currently handle mount.
> > >
> > > Christian
end of thread, other threads:[~2021-10-24 9:38 UTC]

Thread overview: 18+ messages
2021-09-15 15:49 [RFC] Expose request_module via syscall Thomas Weißschuh
2021-09-15 16:02 ` Greg KH
2021-09-15 16:28 ` Thomas Weißschuh
2021-09-15 16:47 ` Andy Lutomirski
2021-09-16 9:27 ` Christian Brauner
2021-09-18 18:47 ` Andy Lutomirski
2021-09-19 7:56 ` Thomas Weißschuh
2021-09-19 14:37 ` Andy Lutomirski
2021-09-20 14:51 ` Thomas Weißschuh
2021-09-20 16:59 ` Luis Chamberlain
2021-09-20 18:36 ` Andy Lutomirski
2021-09-22 12:25 ` Christian Brauner
2021-09-22 15:34 ` Andy Lutomirski
2021-09-22 15:52 ` Christian Brauner
2021-09-22 20:06 ` Andy Lutomirski
2021-09-24 13:19 ` Christian Brauner
2021-09-24 23:04 ` Andy Lutomirski
2021-10-24 9:38 ` Thomas Weißschuh