* [PATCH v1] sysctl: Allow change system v ipc sysctls inside ipc namespace @ 2022-07-12 16:17 Alexey Gladkov 2022-07-25 16:16 ` Eric W. Biederman 0 siblings, 1 reply; 23+ messages in thread From: Alexey Gladkov @ 2022-07-12 16:17 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Rootless containers are not allowed to modify kernel IPC parameters such as kernel.msgmnb. It seems to me that we can allow customization of these parameters if the user has CAP_SYS_RESOURCE in that ipc namespace. CAP_SYS_RESOURCE is already needed in order to overcome mqueue limits (msg_max and msgsize_max). Signed-off-by: Alexey Gladkov <legion@kernel.org> --- ipc/ipc_sysctl.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index ef313ecfb53a..e79452867720 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -193,16 +193,19 @@ static int set_is_seen(struct ctl_table_set *set) static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) { int mode = table->mode; - -#ifdef CONFIG_CHECKPOINT_RESTORE struct ipc_namespace *ns = current->nsproxy->ipc_ns; +#ifdef CONFIG_CHECKPOINT_RESTORE if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || (table->data == &ns->ids[IPC_MSG_IDS].next_id) || (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && checkpoint_restore_ns_capable(ns->user_ns)) mode = 0666; + else #endif + if (ns_capable(ns->user_ns, CAP_SYS_RESOURCE)) + mode = 0666; + return mode; } -- 2.33.3 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v1] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-07-12 16:17 [PATCH v1] sysctl: Allow change system v ipc sysctls inside ipc namespace Alexey Gladkov @ 2022-07-25 16:16 ` Eric W. Biederman 2022-08-16 15:42 ` Alexey Gladkov 0 siblings, 1 reply; 23+ messages in thread From: Eric W. Biederman @ 2022-07-25 16:16 UTC (permalink / raw) To: Alexey Gladkov Cc: LKML, Linux Containers, Andrew Morton, Christian Brauner, Kees Cook, Manfred Spraul Alexey Gladkov <legion@kernel.org> writes: > Rootless containers are not allowed to modify kernel IPC parameters such > as kernel.msgmnb. > > It seems to me that we can allow customization of these parameters if > the user has CAP_SYS_RESOURCE in that ipc namespace. > > CAP_SYS_RESOURCE is already needed in order to overcome mqueue limits > (msg_max and msgsize_max). For changing the permissions on who can modify the SysV limits, I don't think this change is safe. I don't see anything that will prevent abuse if anyone can modify these limits. Replacing the ordinary unix DAC permission check with ns_capable will allow anyone to modify the limits. That said there is RLIMIT_MSGQUEUE that limits the posix message queues so those should be safe to allow anyone to modify their limits. The code in mqueue_get_inode is where that limiting happens. For the posix message queues all that should be needed is to change the owner of the sysctl files from the global root to the user namespace root. There are also two capable calls in ipc/mqueue.c that can probably be changed to ns_capable calls. The only posix message queue limit that I don't immediately see something that will prevent abuse of is /proc/sys/fs/mqueue/queues_max. That probably still runs into RLIMIT_MSGQUEUE somewhere but it was not immediately obvious at first glance. 
Eric > > Signed-off-by: Alexey Gladkov <legion@kernel.org> > --- > ipc/ipc_sysctl.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c > index ef313ecfb53a..e79452867720 100644 > --- a/ipc/ipc_sysctl.c > +++ b/ipc/ipc_sysctl.c > @@ -193,16 +193,19 @@ static int set_is_seen(struct ctl_table_set *set) > static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) > { > int mode = table->mode; > - > -#ifdef CONFIG_CHECKPOINT_RESTORE > struct ipc_namespace *ns = current->nsproxy->ipc_ns; > > +#ifdef CONFIG_CHECKPOINT_RESTORE > if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || > (table->data == &ns->ids[IPC_MSG_IDS].next_id) || > (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && > checkpoint_restore_ns_capable(ns->user_ns)) > mode = 0666; > + else > #endif > + if (ns_capable(ns->user_ns, CAP_SYS_RESOURCE)) > + mode = 0666; > + > return mode; > } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v1] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-07-25 16:16 ` Eric W. Biederman @ 2022-08-16 15:42 ` Alexey Gladkov 2022-08-16 15:42 ` [PATCH v1 1/3] " Alexey Gladkov ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Alexey Gladkov @ 2022-08-16 15:42 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul On Mon, Jul 25, 2022 at 11:16:07AM -0500, Eric W. Biederman wrote: > Alexey Gladkov <legion@kernel.org> writes: > > > Rootless containers are not allowed to modify kernel IPC parameters such > > as kernel.msgmnb. > > > > It seems to me that we can allow customization of these parameters if > > the user has CAP_SYS_RESOURCE in that ipc namespace. > > > > CAP_SYS_RESOURCE is already needed in order to overcome mqueue limits > > (msg_max and msgsize_max). > > > For changing the permissions on who can modify the SysV limits, I don't > think this change is safe. I don't see anything that will prevent abuse > if anyone can modify these limits. Replacing the ordinary unix DAC > permission check with ns_capable will allow anyone to modify the limits. All limits are set to almost maximum values - ULONG_MAX. Limit values are not inherited and are counted in each ipc namespace (shm_tot is not global and is located in ipc_ns). In fact, limits are disabled by default. They can only be reduced. > That said there is RLIMIT_MSGQUEUE that limits the posix messages queues > so those should be safe to allow anyone to modify their limits. > > The code in mqueue_get_inode is where that limiting happens. > > For the posix message queues all that should be needed is to change the > owner of the sysctl files from the global root to the user namespace > root. There are also two capable calls in ipc/mqueue.c that can > probably be changed to ns_capable calls. 
> > > The only posix message queue limit that I don't immediately see > something that will prevent abuse of is /proc/sys/fs/mqueue/queues_max. > That probably still runs into RLIMIT_MSGQUEUE somewhere but it was > not immediately obvious at first glance. Everything always ends in mqueue_get_inode. In mqueue_create_attr we check mq_queues_max and call mqueue_get_inode almost immediately. I suggest allowing root in user namespace to change ipc namespace limits. -- Alexey Gladkov (3): sysctl: Allow change system v ipc sysctls inside ipc namespace sysctl: Allow to change limits for posix messages queues docs: Add information about ipc sysctls limitations Documentation/admin-guide/sysctl/kernel.rst | 14 ++++++-- ipc/ipc_sysctl.c | 34 ++++++++++++++++--- ipc/mq_sysctl.c | 36 +++++++++++++++++++++ 3 files changed, 76 insertions(+), 8 deletions(-) -- 2.33.4 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v1 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-08-16 15:42 ` Alexey Gladkov @ 2022-08-16 15:42 ` Alexey Gladkov 2022-09-19 15:26 ` Eric W. Biederman 2022-08-16 15:42 ` [PATCH v1 2/3] sysctl: Allow to change limits for posix messages queues Alexey Gladkov 2022-08-16 15:42 ` [PATCH v1 3/3] docs: Add information about ipc sysctls limitations Alexey Gladkov 2 siblings, 1 reply; 23+ messages in thread From: Alexey Gladkov @ 2022-08-16 15:42 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Rootless containers are not allowed to modify kernel IPC parameters. All default limits are set to such high values that in fact there are no limits at all. All limits are not inherited and are initialized to default values when a new ipc_namespace is created. For new ipc_namespace: size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) int ipc_ns.shm_rmid_forced = 0; unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 The shm_tot (total amount of shared pages) has also ceased to be global, it is located in ipc_namespace and is not inherited from anywhere. In such conditions, it cannot be said that these limits limit anything. The real limiter for them is cgroups. If we allow rootless containers to change these parameters, then it can only be reduced. 
Signed-off-by: Alexey Gladkov <legion@kernel.org> --- ipc/ipc_sysctl.c | 34 +++++++++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 5 deletions(-) diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index ef313ecfb53a..87eb1b1e42fa 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -192,23 +192,47 @@ static int set_is_seen(struct ctl_table_set *set) static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) { - int mode = table->mode; - -#ifdef CONFIG_CHECKPOINT_RESTORE struct ipc_namespace *ns = current->nsproxy->ipc_ns; +#ifdef CONFIG_CHECKPOINT_RESTORE if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || (table->data == &ns->ids[IPC_MSG_IDS].next_id) || (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && checkpoint_restore_ns_capable(ns->user_ns)) - mode = 0666; + return 0666; #endif - return mode; + if (ns->user_ns != &init_user_ns) { + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); + + if (uid_valid(ns_root_uid) && uid_eq(current_euid(), ns_root_uid)) + return table->mode >> 6; + + if (gid_valid(ns_root_gid) && in_egroup_p(ns_root_gid)) + return table->mode >> 3; + } + + return table->mode; +} + +static void ipc_set_ownership(struct ctl_table_header *head, + struct ctl_table *table, + kuid_t *uid, kgid_t *gid) +{ + struct ipc_namespace *ns = + container_of(head->set, struct ipc_namespace, ipc_set); + + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); + + *uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID; + *gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID; } static struct ctl_table_root set_root = { .lookup = set_lookup, .permissions = ipc_permissions, + .set_ownership = ipc_set_ownership, }; bool setup_ipc_sysctls(struct ipc_namespace *ns) -- 2.33.4 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v1 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-08-16 15:42 ` [PATCH v1 1/3] " Alexey Gladkov @ 2022-09-19 15:26 ` Eric W. Biederman 2022-09-20 16:15 ` Alexey Gladkov 0 siblings, 1 reply; 23+ messages in thread From: Eric W. Biederman @ 2022-09-19 15:26 UTC (permalink / raw) To: Alexey Gladkov Cc: LKML, Linux Containers, Andrew Morton, Christian Brauner, Kees Cook, Manfred Spraul Alexey Gladkov <legion@kernel.org> writes: > Rootless containers are not allowed to modify kernel IPC parameters. > > All default limits are set to such high values that in fact there are no > limits at all. All limits are not inherited and are initialized to > default values when a new ipc_namespace is created. > > For new ipc_namespace: > > size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) > size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) > int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) > int ipc_ns.shm_rmid_forced = 0; > unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 > unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 > unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 > > The shm_tot (total amount of shared pages) has also ceased to be > global, it is located in ipc_namespace and is not inherited from > anywhere. > > In such conditions, it cannot be said that these limits limit anything. > The real limiter for them is cgroups. > > If we allow rootless containers to change these parameters, then it can > only be reduced. Manfred, does that analysis sound correct to you? Do you have any concerns about allowing the users who create the ipc namespace to be able to set its limits? At a quick skim through everything Alex's analysis above appears correct to me. From 10,000 feet this patch looks good. I do see a couple of nits that should be fixed before we merge this into Linus's tree. 
> Signed-off-by: Alexey Gladkov <legion@kernel.org> > --- > ipc/ipc_sysctl.c | 34 +++++++++++++++++++++++++++++----- > 1 file changed, 29 insertions(+), 5 deletions(-) > > diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c > index ef313ecfb53a..87eb1b1e42fa 100644 > --- a/ipc/ipc_sysctl.c > +++ b/ipc/ipc_sysctl.c > @@ -192,23 +192,47 @@ static int set_is_seen(struct ctl_table_set *set) > > static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) > { > - int mode = table->mode; > - > -#ifdef CONFIG_CHECKPOINT_RESTORE > struct ipc_namespace *ns = current->nsproxy->ipc_ns; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Historically that was the best we could do. But now that we have an ipc_set member in struct ipc_namespace you can use container_of to compute this value. For a permission check that is much safer. > +#ifdef CONFIG_CHECKPOINT_RESTORE > if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || > (table->data == &ns->ids[IPC_MSG_IDS].next_id) || > (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && > checkpoint_restore_ns_capable(ns->user_ns)) > - mode = 0666; > + return 0666; > #endif > - return mode; > + if (ns->user_ns != &init_user_ns) { > + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); > + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); > + > + if (uid_valid(ns_root_uid) && uid_eq(current_euid(), ns_root_uid)) > + return table->mode >> 6; > + > + if (gid_valid(ns_root_gid) && in_egroup_p(ns_root_gid)) > + return table->mode >> 3; From 10,000 feet this is fine. But this has to interact with test_perm in proc_sysctl.c. So can you please do what net_ctl_permissions does and replicate the chosen mode all through the mode line. Perhaps something like: kuid_t ns_root_uid; kgid_t ns_root_gid; ipc_set_ownership(head, table, &ns_root_uid, &ns_root_gid); #ifdef CONFIG_CHECKPOINT_RESTORE if (...) 
mode = 0666; else #endif if (uid_eq(current_euid(), ns_root_uid)) mode >>= 6; else if (in_egroup_p(ns_root_gid)) mode >>= 3; mode &= 7; mode = (mode << 6) | (mode << 3) | mode; return mode; If we always pass through the same logic there is the advantage that we will always test it, and there is less room for bugs to slip through. I added a couple of unnecessary simplifications in there that I just saw as I was writing my example code. Eric > + } > + > + return table->mode; > +} > + > +static void ipc_set_ownership(struct ctl_table_header *head, > + struct ctl_table *table, > + kuid_t *uid, kgid_t *gid) > +{ > + struct ipc_namespace *ns = > + container_of(head->set, struct ipc_namespace, ipc_set); > + > + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); > + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); > + > + *uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID; > + *gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID; > } > > static struct ctl_table_root set_root = { > .lookup = set_lookup, > .permissions = ipc_permissions, > + .set_ownership = ipc_set_ownership, > }; > > bool setup_ipc_sysctls(struct ipc_namespace *ns) I can't see anything wrong with your proposed ipc_set_ownership. Eric ^ permalink raw reply [flat|nested] 23+ messages in thread
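The review point about replicating the chosen mode bits can be illustrated outside the kernel. The following is only a userspace sketch: `struct caller`, `select_and_replicate`, `model_test_perm` and `sysctl_mode_demo` are hypothetical names made up for illustration, and `model_test_perm` is a simplified model that assumes test_perm picks the owner, group or other triplet by comparing the caller against the global root uid and gid.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric values as the kernel's MAY_* flags. */
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/* Hypothetical summary of the caller's credentials (illustration only). */
struct caller {
	bool is_ns_root;           /* euid is root of the owning user namespace */
	bool in_ns_root_group;     /* member of that namespace's root group */
	bool is_global_root;       /* euid is the global root uid */
	bool in_global_root_group; /* member of the global root group */
};

/* Pick one permission triplet, then copy it into owner, group and other. */
int select_and_replicate(int table_mode, const struct caller *c)
{
	int mode = table_mode;

	if (c->is_ns_root)
		mode >>= 6;
	else if (c->in_ns_root_group)
		mode >>= 3;

	mode &= 7;
	return (mode << 6) | (mode << 3) | mode;
}

/* Simplified model (an assumption) of the generic sysctl permission test. */
bool model_test_perm(int mode, int op, const struct caller *c)
{
	if (c->is_global_root)
		mode >>= 6;
	else if (c->in_global_root_group)
		mode >>= 3;

	return (op & ~mode & (MAY_READ | MAY_WRITE)) == 0;
}

void sysctl_mode_demo(void)
{
	struct caller ns_root = { .is_ns_root = true };
	struct caller other = { 0 };
	int mode;

	/* Namespace root gets the owner bits of 0644 in every position. */
	mode = select_and_replicate(0644, &ns_root);
	assert(mode == 0666);
	assert(model_test_perm(mode, MAY_WRITE, &ns_root));

	/* An unrelated user gets only the "other" bits: read, no write. */
	mode = select_and_replicate(0644, &other);
	assert(mode == 0444);
	assert(!model_test_perm(mode, MAY_WRITE, &other));
	assert(model_test_perm(mode, MAY_READ, &other));
}
```

Because the selected triplet fills all three positions, the result no longer depends on which triplet the generic check looks at, which appears to be the reason for the request above.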
* Re: [PATCH v1 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-09-19 15:26 ` Eric W. Biederman @ 2022-09-20 16:15 ` Alexey Gladkov 2022-09-20 18:08 ` [PATCH v2 0/3] Allow to change ipc/mq " Alexey Gladkov 0 siblings, 1 reply; 23+ messages in thread From: Alexey Gladkov @ 2022-09-20 16:15 UTC (permalink / raw) To: Eric W. Biederman Cc: LKML, Linux Containers, Andrew Morton, Christian Brauner, Kees Cook, Manfred Spraul On Mon, Sep 19, 2022 at 10:26:39AM -0500, Eric W. Biederman wrote: > > > > diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c > > index ef313ecfb53a..87eb1b1e42fa 100644 > > --- a/ipc/ipc_sysctl.c > > +++ b/ipc/ipc_sysctl.c > > @@ -192,23 +192,47 @@ static int set_is_seen(struct ctl_table_set *set) > > > > static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) > > { > > - int mode = table->mode; > > - > > -#ifdef CONFIG_CHECKPOINT_RESTORE > > struct ipc_namespace *ns = current->nsproxy->ipc_ns; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Historically that was the best we could do. But now that we have > an ipc_set member in struct ipc_namespace you can use container_of > to compute this value. > > For a permission check that is much safer. Yes. It makes sense. > > +#ifdef CONFIG_CHECKPOINT_RESTORE > > if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || > > (table->data == &ns->ids[IPC_MSG_IDS].next_id) || > > (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && > > checkpoint_restore_ns_capable(ns->user_ns)) > > - mode = 0666; > > + return 0666; > > #endif > > - return mode; > > + if (ns->user_ns != &init_user_ns) { > > + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); > > + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); > > + > > + if (uid_valid(ns_root_uid) && uid_eq(current_euid(), ns_root_uid)) > > + return table->mode >> 6; > > + > > + if (gid_valid(ns_root_gid) && in_egroup_p(ns_root_gid)) > > + return table->mode >> 3; > > > From 10,000 feet this is fine. 
But this has to interact with > test_perm in proc_sysctl.c. So can you please do what > net_ctl_permissions does and replicate the chosen mode all through > the mode line. > > Perhaps something like: > > kuid_t ns_root_uid; > kgid_t ns_root_gid; > > ipc_set_ownership(head, table, &ns_root_uid, &ns_root_gid); > > #ifdef CONFIG_CHECKPOINT_RESTORE > if (...) > mode = 0666; > else > #endif > if (uid_eq(current_euid(), ns_root_uid)) > mode >>= 6; > > else if (in_egroup_p(ns_root_gid)) > mode >>= 3; > > mode &= 7; > mode = (mode << 6) | (mode << 3) | mode; > return mode; > > > If we always pass through the same logic there is the advantage that we > will always test it, and there is less room for bugs to slip through. > > I added a couple of unnecessary simplifications in there that I just > saw as I was writing my example code. Thanks! It looks better. I'll fix it and send a new version. -- Rgrds, legion ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v2 0/3] Allow to change ipc/mq sysctls inside ipc namespace 2022-09-20 16:15 ` Alexey Gladkov @ 2022-09-20 18:08 ` Alexey Gladkov 2022-09-20 18:08 ` [PATCH v2 1/3] sysctl: Allow change system v ipc " Alexey Gladkov ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Alexey Gladkov @ 2022-09-20 18:08 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Right now ipc and mq limits count as per ipc namespace, but only real root can change them. By default, the current values of these limits are such that it can only be reduced. Since only root can change the values, it is impossible to reduce these limits in the rootless container. We can allow limit changes within ipc namespace because mq parameters are limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything other than cgroups. -- Alexey Gladkov (3): sysctl: Allow change system v ipc sysctls inside ipc namespace sysctl: Allow to change limits for posix messages queues docs: Add information about ipc sysctls limitations Documentation/admin-guide/sysctl/kernel.rst | 14 ++++++-- ipc/ipc_sysctl.c | 34 ++++++++++++++++-- ipc/mq_sysctl.c | 38 +++++++++++++++++++++ 3 files changed, 80 insertions(+), 6 deletions(-) -- 2.33.4 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v2 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-09-20 18:08 ` [PATCH v2 0/3] Allow to change ipc/mq " Alexey Gladkov @ 2022-09-20 18:08 ` Alexey Gladkov 2022-09-21 9:38 ` kernel test robot 2022-09-20 18:08 ` [PATCH v2 2/3] " Alexey Gladkov 2022-09-20 18:08 ` [PATCH v2 3/3] docs: Add information about ipc sysctls limitations Alexey Gladkov 2 siblings, 1 reply; 23+ messages in thread From: Alexey Gladkov @ 2022-09-20 18:08 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Rootless containers are not allowed to modify kernel IPC parameters. All default limits are set to such high values that in fact there are no limits at all. All limits are not inherited and are initialized to default values when a new ipc_namespace is created. For new ipc_namespace: size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) int ipc_ns.shm_rmid_forced = 0; unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 The shm_tot (total amount of shared pages) has also ceased to be global, it is located in ipc_namespace and is not inherited from anywhere. In such conditions, it cannot be said that these limits limit anything. The real limiter for them is cgroups. If we allow rootless containers to change these parameters, then it can only be reduced. 
Signed-off-by: Alexey Gladkov <legion@kernel.org> --- ipc/ipc_sysctl.c | 34 +++++++++++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index ef313ecfb53a..a6a9d7f680dd 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -190,25 +190,53 @@ static int set_is_seen(struct ctl_table_set *set) return &current->nsproxy->ipc_ns->ipc_set == set; } +static void ipc_set_ownership(struct ctl_table_header *head, + struct ctl_table *table, + kuid_t *uid, kgid_t *gid) +{ + struct ipc_namespace *ns = + container_of(head->set, struct ipc_namespace, ipc_set); + + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); + + *uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID; + *gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID; +} + static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) { + struct ipc_namespace *ns = + container_of(head->set, struct ipc_namespace, ipc_set); int mode = table->mode; + kuid_t ns_root_uid; + kgid_t ns_root_gid; -#ifdef CONFIG_CHECKPOINT_RESTORE - struct ipc_namespace *ns = current->nsproxy->ipc_ns; + ipc_set_ownership(head, table, &ns_root_uid, ns_root_gid); +#ifdef CONFIG_CHECKPOINT_RESTORE if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || (table->data == &ns->ids[IPC_MSG_IDS].next_id) || (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && checkpoint_restore_ns_capable(ns->user_ns)) mode = 0666; + else #endif - return mode; + if (uid_eq(current_euid(), ns_root_uid)) + mode >>= 6; + + else if (in_egroup_p(ns_root_gid)) + mode >>= 3; + + mode &= 7; + + return (mode << 6) | (mode << 3) | mode; } static struct ctl_table_root set_root = { .lookup = set_lookup, .permissions = ipc_permissions, + .set_ownership = ipc_set_ownership, }; bool setup_ipc_sysctls(struct ipc_namespace *ns) -- 2.33.4 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v2 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-09-20 18:08 ` [PATCH v2 1/3] sysctl: Allow change system v ipc " Alexey Gladkov @ 2022-09-21 9:38 ` kernel test robot 2022-09-21 10:41 ` [PATCH v3 0/3] Allow to change ipc/mq " Alexey Gladkov 0 siblings, 1 reply; 23+ messages in thread From: kernel test robot @ 2022-09-21 9:38 UTC (permalink / raw) To: Alexey Gladkov, LKML, Linux Containers Cc: llvm, kbuild-all, Andrew Morton, Linux Memory Management List, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Hi Alexey, Thank you for the patch! Yet something to improve: [auto build test ERROR on akpm-mm/mm-everything] [also build test ERROR on kees/for-next/pstore linus/master v6.0-rc6 next-20220920] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Alexey-Gladkov/sysctl-Allow-change-system-v-ipc-sysctls-inside-ipc-namespace/20220921-030939 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything config: hexagon-randconfig-r041-20220921 (https://download.01.org/0day-ci/archive/20220921/202209211737.0Bu0F40t-lkp@intel.com/config) compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 791a7ae1ba3efd6bca96338e10ffde557ba83920) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/eb972fb9aad60123519d8dd32df26cb58985ce4a git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Alexey-Gladkov/sysctl-Allow-change-system-v-ipc-sysctls-inside-ipc-namespace/20220921-030939 git checkout eb972fb9aad60123519d8dd32df26cb58985ce4a # save the config file mkdir build_dir && cp config 
build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> ipc/ipc_sysctl.c:215:47: error: passing 'kgid_t' to parameter of incompatible type 'kgid_t *'; take the address with & ipc_set_ownership(head, table, &ns_root_uid, ns_root_gid); ^~~~~~~~~~~ & ipc/ipc_sysctl.c:195:31: note: passing argument to parameter 'gid' here kuid_t *uid, kgid_t *gid) ^ >> ipc/ipc_sysctl.c:225:13: error: call to undeclared function 'current_euid'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] if (uid_eq(current_euid(), ns_root_uid)) ^ ipc/ipc_sysctl.c:225:13: note: did you mean 'current_work'? include/linux/workqueue.h:467:28: note: 'current_work' declared here extern struct work_struct *current_work(void); ^ >> ipc/ipc_sysctl.c:225:13: error: passing 'int' to parameter of incompatible type 'kuid_t' if (uid_eq(current_euid(), ns_root_uid)) ^~~~~~~~~~~~~~ include/linux/uidgid.h:61:34: note: passing argument to parameter 'left' here static inline bool uid_eq(kuid_t left, kuid_t right) ^ >> ipc/ipc_sysctl.c:228:11: error: call to undeclared function 'in_egroup_p'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] else if (in_egroup_p(ns_root_gid)) ^ 4 errors generated. 
vim +215 ipc/ipc_sysctl.c 206 207 static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) 208 { 209 struct ipc_namespace *ns = 210 container_of(head->set, struct ipc_namespace, ipc_set); 211 int mode = table->mode; 212 kuid_t ns_root_uid; 213 kgid_t ns_root_gid; 214 > 215 ipc_set_ownership(head, table, &ns_root_uid, ns_root_gid); 216 217 #ifdef CONFIG_CHECKPOINT_RESTORE 218 if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || 219 (table->data == &ns->ids[IPC_MSG_IDS].next_id) || 220 (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && 221 checkpoint_restore_ns_capable(ns->user_ns)) 222 mode = 0666; 223 else 224 #endif > 225 if (uid_eq(current_euid(), ns_root_uid)) 226 mode >>= 6; 227 > 228 else if (in_egroup_p(ns_root_gid)) 229 mode >>= 3; 230 231 mode &= 7; 232 233 return (mode << 6) | (mode << 3) | mode; 234 } 235 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v3 0/3] Allow to change ipc/mq sysctls inside ipc namespace 2022-09-21 9:38 ` kernel test robot @ 2022-09-21 10:41 ` Alexey Gladkov 2022-09-21 10:41 ` [PATCH v3 1/3] sysctl: Allow change system v ipc " Alexey Gladkov ` (3 more replies) 0 siblings, 4 replies; 23+ messages in thread From: Alexey Gladkov @ 2022-09-21 10:41 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Right now ipc and mq limits count as per ipc namespace, but only real root can change them. By default, the current values of these limits are such that it can only be reduced. Since only root can change the values, it is impossible to reduce these limits in the rootless container. We can allow limit changes within ipc namespace because mq parameters are limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything other than cgroups. -- Alexey Gladkov (3): sysctl: Allow change system v ipc sysctls inside ipc namespace sysctl: Allow to change limits for posix messages queues docs: Add information about ipc sysctls limitations Documentation/admin-guide/sysctl/kernel.rst | 14 ++++++-- ipc/ipc_sysctl.c | 36 +++++++++++++++++++-- ipc/mq_sysctl.c | 36 +++++++++++++++++++++ 3 files changed, 81 insertions(+), 5 deletions(-) -- 2.33.4 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v3 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace 2022-09-21 10:41 ` [PATCH v3 0/3] Allow to change ipc/mq " Alexey Gladkov @ 2022-09-21 10:41 ` Alexey Gladkov 2022-09-21 10:41 ` [PATCH v3 2/3] sysctl: Allow to change limits for posix messages queues Alexey Gladkov ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Alexey Gladkov @ 2022-09-21 10:41 UTC (permalink / raw) To: LKML, Linux Containers Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul Rootless containers are not allowed to modify kernel IPC parameters. All default limits are set to such high values that in fact there are no limits at all. All limits are not inherited and are initialized to default values when a new ipc_namespace is created. For new ipc_namespace: size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) int ipc_ns.shm_rmid_forced = 0; unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 The shm_tot (total amount of shared pages) has also ceased to be global, it is located in ipc_namespace and is not inherited from anywhere. In such conditions, it cannot be said that these limits limit anything. The real limiter for them is cgroups. If we allow rootless containers to change these parameters, then it can only be reduced. 
Signed-off-by: Alexey Gladkov <legion@kernel.org> --- ipc/ipc_sysctl.c | 36 ++++++++++++++++++++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index ef313ecfb53a..31282e0a630d 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -190,25 +190,57 @@ static int set_is_seen(struct ctl_table_set *set) return &current->nsproxy->ipc_ns->ipc_set == set; } +static void ipc_set_ownership(struct ctl_table_header *head, + struct ctl_table *table, + kuid_t *uid, kgid_t *gid) +{ + struct ipc_namespace *ns = + container_of(head->set, struct ipc_namespace, ipc_set); + + kuid_t ns_root_uid = make_kuid(ns->user_ns, 0); + kgid_t ns_root_gid = make_kgid(ns->user_ns, 0); + + *uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID; + *gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID; +} + static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table) { int mode = table->mode; #ifdef CONFIG_CHECKPOINT_RESTORE - struct ipc_namespace *ns = current->nsproxy->ipc_ns; + struct ipc_namespace *ns = + container_of(head->set, struct ipc_namespace, ipc_set); if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || (table->data == &ns->ids[IPC_MSG_IDS].next_id) || (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && checkpoint_restore_ns_capable(ns->user_ns)) mode = 0666; + else #endif - return mode; + { + kuid_t ns_root_uid; + kgid_t ns_root_gid; + + ipc_set_ownership(head, table, &ns_root_uid, &ns_root_gid); + + if (uid_eq(current_euid(), ns_root_uid)) + mode >>= 6; + + else if (in_egroup_p(ns_root_gid)) + mode >>= 3; + } + + mode &= 7; + + return (mode << 6) | (mode << 3) | mode; } static struct ctl_table_root set_root = { .lookup = set_lookup, .permissions = ipc_permissions, + .set_ownership = ipc_set_ownership, }; bool setup_ipc_sysctls(struct ipc_namespace *ns) -- 2.33.4 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v3 2/3] sysctl: Allow to change limits for posix messages queues
From: Alexey Gladkov @ 2022-09-21 10:41 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul

All parameters of posix messages queues (queues_max/msg_max/msgsize_max)
end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is
where that limiting happens.

The RLIMIT_MSGQUEUE is bound to the user namespace and is counted
hierarchically.

We can allow root in the user namespace to modify the posix messages
queues parameters.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 ipc/mq_sysctl.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index fbf6a8b93a26..ff1054fbbacc 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -12,6 +12,7 @@
 #include <linux/stat.h>
 #include <linux/capability.h>
 #include <linux/slab.h>
+#include <linux/cred.h>
 
 static int msg_max_limit_min = MIN_MSGMAX;
 static int msg_max_limit_max = HARD_MSGMAX;
@@ -76,8 +77,43 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->mq_set == set;
 }
 
+static void mq_set_ownership(struct ctl_table_header *head,
+			     struct ctl_table *table,
+			     kuid_t *uid, kgid_t *gid)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+
+	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
+}
+
+static int mq_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+	int mode = table->mode;
+	kuid_t ns_root_uid;
+	kgid_t ns_root_gid;
+
+	mq_set_ownership(head, table, &ns_root_uid, &ns_root_gid);
+
+	if (uid_eq(current_euid(), ns_root_uid))
+		mode >>= 6;
+
+	if (in_egroup_p(ns_root_gid))
+		mode >>= 3;
+
+	mode &= 7;
+
+	return (mode << 6) | (mode << 3) | mode;
+}
+
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
+	.permissions = mq_permissions,
+	.set_ownership = mq_set_ownership,
 };
 
 bool setup_mq_sysctls(struct ipc_namespace *ns)
--
2.33.4
* [PATCH v3 3/3] docs: Add information about ipc sysctls limitations
From: Alexey Gladkov @ 2022-09-21 10:41 UTC
To: LKML, Linux Containers, linux-doc, linux-man
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul

After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc
("[PATCH] IPC namespace - shm") the shared memory page count stopped
being global and started counting per ipc namespace. The documentation
and shmget(2) still says that shmall is a global option.

shmget(2):

SHMALL System-wide limit on the total amount of shared memory, measured
in units of the system page size. On Linux, this limit can be read and
modified via /proc/sys/kernel/shmall.

I think the changes made in 2006 should be documented.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index ee6572b1edad..c8b89bd8f004 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -541,6 +541,9 @@ default (``MSGMNB``).
 ``msgmni`` is the maximum number of IPC queues. 32000 by default
 (``MSGMNI``).
 
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in the each user namespace.
 
 msg_next_id, sem_next_id, and shm_next_id (System V IPC)
 ========================================================
@@ -1181,15 +1184,20 @@ are doing anyway :)
 shmall
 ======
 
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total amount of shared memory pages that can be used
+inside ipc namespace. The shared memory pages counting occurs for each ipc
+namespace separately and is not inherited. Hence, ``shmall`` should always be at
+least ``ceil(shmmax/PAGE_SIZE)``.
 
 If you are not sure what the default ``PAGE_SIZE`` is on your Linux
 system, you can run the following command::
 
 	# getconf PAGE_SIZE
 
+To reduce or disable the ability to allocate shared memory, you must create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of a new ipc namespace in the current user namespace or cgroups can
+be used.
 
 shmmax
 ======
--
2.33.4
* [RESEND PATCH v3 0/3] Allow to change ipc/mq sysctls inside ipc namespace
From: Alexey Gladkov @ 2024-01-15 15:46 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Joel Granados, Kees Cook, Luis Chamberlain, Manfred Spraul

Right now ipc and mq limits count as per ipc namespace, but only real
root can change them. By default, the current values of these limits are
such that it can only be reduced. Since only root can change the values,
it is impossible to reduce these limits in the rootless container.

We can allow limit changes within ipc namespace because mq parameters are
limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything
other than cgroups.

This is just a rebase of patches on v6.7-6264-g70d201a40823.

---

Alexey Gladkov (3):
  sysctl: Allow change system v ipc sysctls inside ipc namespace
  docs: Add information about ipc sysctls limitations
  sysctl: Allow to change limits for posix messages queues

 Documentation/admin-guide/sysctl/kernel.rst | 14 ++++++--
 ipc/ipc_sysctl.c                            | 37 +++++++++++++++++++--
 ipc/mq_sysctl.c                             | 36 ++++++++++++++++++++
 3 files changed, 82 insertions(+), 5 deletions(-)

--
2.43.0
* [RESEND PATCH v3 1/3] sysctl: Allow change system v ipc sysctls inside ipc namespace
From: Alexey Gladkov @ 2024-01-15 15:46 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Joel Granados, Kees Cook, Luis Chamberlain, Manfred Spraul

Rootless containers are not allowed to modify kernel IPC parameters.

All default limits are set to such high values that in fact there are no
limits at all. All limits are not inherited and are initialized to
default values when a new ipc_namespace is created.

For new ipc_namespace:

size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24))
size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24))
int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15)
int ipc_ns.shm_rmid_forced = 0;
unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192
unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000
unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384

The shm_tot (total amount of shared pages) has also ceased to be
global, it is located in ipc_namespace and is not inherited from
anywhere.

In such conditions, it cannot be said that these limits limit anything.
The real limiter for them is cgroups.

If we allow rootless containers to change these parameters, then it can
only be reduced.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 ipc/ipc_sysctl.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8c62e443f78b..01c4a50d22b2 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -14,6 +14,7 @@
 #include <linux/ipc_namespace.h>
 #include <linux/msg.h>
 #include <linux/slab.h>
+#include <linux/cred.h>
 #include "util.h"
 
 static int proc_ipc_dointvec_minmax_orphans(struct ctl_table *table, int write,
@@ -190,25 +191,57 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->ipc_set == set;
 }
 
+static void ipc_set_ownership(struct ctl_table_header *head,
+			      struct ctl_table *table,
+			      kuid_t *uid, kgid_t *gid)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, ipc_set);
+
+	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
+}
+
 static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
 {
 	int mode = table->mode;
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
-	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, ipc_set);
 
 	if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
 	     (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
 	     (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
 	    checkpoint_restore_ns_capable(ns->user_ns))
 		mode = 0666;
+	else
 #endif
-	return mode;
+	{
+		kuid_t ns_root_uid;
+		kgid_t ns_root_gid;
+
+		ipc_set_ownership(head, table, &ns_root_uid, &ns_root_gid);
+
+		if (uid_eq(current_euid(), ns_root_uid))
+			mode >>= 6;
+
+		else if (in_egroup_p(ns_root_gid))
+			mode >>= 3;
+	}
+
+	mode &= 7;
+
+	return (mode << 6) | (mode << 3) | mode;
 }
 
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
 	.permissions = ipc_permissions,
+	.set_ownership = ipc_set_ownership,
 };
 
 bool setup_ipc_sysctls(struct ipc_namespace *ns)
--
2.43.0
* [RESEND PATCH v3 2/3] docs: Add information about ipc sysctls limitations
From: Alexey Gladkov @ 2024-01-15 15:46 UTC
To: LKML, Linux Containers, linux-doc
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Joel Granados, Kees Cook, Luis Chamberlain, Manfred Spraul

After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc
("[PATCH] IPC namespace - shm") the shared memory page count stopped
being global and started counting per ipc namespace. The documentation
and shmget(2) still says that shmall is a global option.

shmget(2):

SHMALL System-wide limit on the total amount of shared memory, measured
in units of the system page size. On Linux, this limit can be read and
modified via /proc/sys/kernel/shmall.

I think the changes made in 2006 should be documented.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/ede20ddf7be48b93e8084c3be2e920841ee1a641.1663756794.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 6584a1f9bfe3..bc578663619d 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -594,6 +594,9 @@ default (``MSGMNB``).
 ``msgmni`` is the maximum number of IPC queues. 32000 by default
 (``MSGMNI``).
 
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in the each user namespace.
 
 msg_next_id, sem_next_id, and shm_next_id (System V IPC)
 ========================================================
@@ -1274,15 +1277,20 @@ are doing anyway :)
 shmall
 ======
 
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total amount of shared memory pages that can be used
+inside ipc namespace. The shared memory pages counting occurs for each ipc
+namespace separately and is not inherited. Hence, ``shmall`` should always be at
+least ``ceil(shmmax/PAGE_SIZE)``.
 
 If you are not sure what the default ``PAGE_SIZE`` is on your Linux
 system, you can run the following command::
 
 	# getconf PAGE_SIZE
 
+To reduce or disable the ability to allocate shared memory, you must create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of a new ipc namespace in the current user namespace or cgroups can
+be used.
 
 shmmax
 ======
--
2.43.0
* [RESEND PATCH v3 3/3] sysctl: Allow to change limits for posix messages queues
From: Alexey Gladkov @ 2024-01-15 15:46 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Joel Granados, Kees Cook, Luis Chamberlain, Manfred Spraul

All parameters of posix messages queues (queues_max/msg_max/msgsize_max)
end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is
where that limiting happens.

The RLIMIT_MSGQUEUE is bound to the user namespace and is counted
hierarchically.

We can allow root in the user namespace to modify the posix messages
queues parameters.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/7eb21211c8622e91d226e63416b1b93c079f60ee.1663756794.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 ipc/mq_sysctl.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index ebb5ed81c151..21fba3a6edaf 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -12,6 +12,7 @@
 #include <linux/stat.h>
 #include <linux/capability.h>
 #include <linux/slab.h>
+#include <linux/cred.h>
 
 static int msg_max_limit_min = MIN_MSGMAX;
 static int msg_max_limit_max = HARD_MSGMAX;
@@ -76,8 +77,43 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->mq_set == set;
 }
 
+static void mq_set_ownership(struct ctl_table_header *head,
+			     struct ctl_table *table,
+			     kuid_t *uid, kgid_t *gid)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+
+	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
+}
+
+static int mq_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+	int mode = table->mode;
+	kuid_t ns_root_uid;
+	kgid_t ns_root_gid;
+
+	mq_set_ownership(head, table, &ns_root_uid, &ns_root_gid);
+
+	if (uid_eq(current_euid(), ns_root_uid))
+		mode >>= 6;
+
+	else if (in_egroup_p(ns_root_gid))
+		mode >>= 3;
+
+	mode &= 7;
+
+	return (mode << 6) | (mode << 3) | mode;
+}
+
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
+	.permissions = mq_permissions,
+	.set_ownership = mq_set_ownership,
 };
 
 bool setup_mq_sysctls(struct ipc_namespace *ns)
--
2.43.0
* [PATCH v2 2/3] sysctl: Allow to change limits for posix messages queues
From: Alexey Gladkov @ 2022-09-20 18:08 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul

All parameters of posix messages queues (queues_max/msg_max/msgsize_max)
end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is
where that limiting happens.

The RLIMIT_MSGQUEUE is bound to the user namespace and is counted
hierarchically.

We can allow root in the user namespace to modify the posix messages
queues parameters.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 ipc/mq_sysctl.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index fbf6a8b93a26..5e573c657775 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -12,6 +12,7 @@
 #include <linux/stat.h>
 #include <linux/capability.h>
 #include <linux/slab.h>
+#include <linux/cred.h>
 
 static int msg_max_limit_min = MIN_MSGMAX;
 static int msg_max_limit_max = HARD_MSGMAX;
@@ -76,8 +77,45 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->mq_set == set;
 }
 
+static void mq_set_ownership(struct ctl_table_header *head,
+			     struct ctl_table *table,
+			     kuid_t *uid, kgid_t *gid)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+
+	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
+}
+
+static int mq_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+	int mode = table->mode;
+	kuid_t ns_root_uid;
+	kgid_t ns_root_gid;
+
+	mq_set_ownership(head, table, &ns_root_uid, &ns_root_gid);
+
+	if (uid_eq(current_euid(), ns_root_uid))
+		mode >>= 6;
+
+	if (in_egroup_p(ns_root_gid))
+		mode >>= 3;
+
+	mode &= 7;
+
+	return (mode << 6) | (mode << 3) | mode;
+}
+
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
+	.permissions = mq_permissions,
+	.set_ownership = mq_set_ownership,
 };
 
 bool setup_mq_sysctls(struct ipc_namespace *ns)
--
2.33.4
* [PATCH v2 3/3] docs: Add information about ipc sysctls limitations
From: Alexey Gladkov @ 2022-09-20 18:08 UTC
To: LKML, Linux Containers, linux-doc, linux-man
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul, Jonathan Corbet

After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc
("[PATCH] IPC namespace - shm") the shared memory page count stopped
being global and started counting per ipc namespace. The documentation
and shmget(2) still says that shmall is a global option.

shmget(2):

SHMALL System-wide limit on the total amount of shared memory, measured
in units of the system page size. On Linux, this limit can be read and
modified via /proc/sys/kernel/shmall.

I think the changes made in 2006 should be documented.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index ee6572b1edad..c8b89bd8f004 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -541,6 +541,9 @@ default (``MSGMNB``).
 ``msgmni`` is the maximum number of IPC queues. 32000 by default
 (``MSGMNI``).
 
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in the each user namespace.
 
 msg_next_id, sem_next_id, and shm_next_id (System V IPC)
 ========================================================
@@ -1181,15 +1184,20 @@ are doing anyway :)
 shmall
 ======
 
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total amount of shared memory pages that can be used
+inside ipc namespace. The shared memory pages counting occurs for each ipc
+namespace separately and is not inherited. Hence, ``shmall`` should always be at
+least ``ceil(shmmax/PAGE_SIZE)``.
 
 If you are not sure what the default ``PAGE_SIZE`` is on your Linux
 system, you can run the following command::
 
 	# getconf PAGE_SIZE
 
+To reduce or disable the ability to allocate shared memory, you must create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of a new ipc namespace in the current user namespace or cgroups can
+be used.
 
 shmmax
 ======
--
2.33.4
* [PATCH v1 2/3] sysctl: Allow to change limits for posix messages queues
From: Alexey Gladkov @ 2022-08-16 15:42 UTC
To: LKML, Linux Containers
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul

All parameters of posix messages queues (queues_max/msg_max/msgsize_max)
end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is
where that limiting happens.

The RLIMIT_MSGQUEUE is bound to the user namespace and is counted
hierarchically.

We can allow root in the user namespace to modify the posix messages
queues parameters.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 ipc/mq_sysctl.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index fbf6a8b93a26..39dcf086b7c2 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -12,6 +12,7 @@
 #include <linux/stat.h>
 #include <linux/capability.h>
 #include <linux/slab.h>
+#include <linux/cred.h>
 
 static int msg_max_limit_min = MIN_MSGMAX;
 static int msg_max_limit_max = HARD_MSGMAX;
@@ -76,8 +77,43 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->mq_set == set;
 }
 
+static int mq_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+
+	if (ns->user_ns != &init_user_ns) {
+		kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+		kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+		if (uid_valid(ns_root_uid) && uid_eq(current_euid(), ns_root_uid))
+			return table->mode >> 6;
+
+		if (gid_valid(ns_root_gid) && in_egroup_p(ns_root_gid))
+			return table->mode >> 3;
+	}
+
+	return table->mode;
+}
+
+static void mq_set_ownership(struct ctl_table_header *head,
+			     struct ctl_table *table,
+			     kuid_t *uid, kgid_t *gid)
+{
+	struct ipc_namespace *ns =
+		container_of(head->set, struct ipc_namespace, mq_set);
+
+	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
+
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
+}
+
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
+	.permissions = mq_permissions,
+	.set_ownership = mq_set_ownership,
 };
 
 bool setup_mq_sysctls(struct ipc_namespace *ns)
--
2.33.4
* Re: [PATCH v1 2/3] sysctl: Allow to change limits for posix messages queues
From: Eric W. Biederman @ 2022-09-19 15:27 UTC
To: Alexey Gladkov
Cc: LKML, Linux Containers, Andrew Morton, Christian Brauner, Kees Cook, Manfred Spraul

Alexey Gladkov <legion@kernel.org> writes:

> All parameters of posix messages queues (queues_max/msg_max/msgsize_max)
> end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is
> where that limiting happens.
>
> The RLIMIT_MSGQUEUE is bound to the user namespace and is counted
> hierarchically.
>
> We can allow root in the user namespace to modify the posix messages
> queues parameters.

This looks good from 10,000 feet. But the same nits with setting mode
in mq_permissions as in ipc_set_permissions in your other patch.

Eric

> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
>  ipc/mq_sysctl.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
> index fbf6a8b93a26..39dcf086b7c2 100644
> --- a/ipc/mq_sysctl.c
> +++ b/ipc/mq_sysctl.c
> @@ -12,6 +12,7 @@
>  #include <linux/stat.h>
>  #include <linux/capability.h>
>  #include <linux/slab.h>
> +#include <linux/cred.h>
>
>  static int msg_max_limit_min = MIN_MSGMAX;
>  static int msg_max_limit_max = HARD_MSGMAX;
> @@ -76,8 +77,43 @@ static int set_is_seen(struct ctl_table_set *set)
>  	return &current->nsproxy->ipc_ns->mq_set == set;
>  }
>
> +static int mq_permissions(struct ctl_table_header *head, struct ctl_table *table)
> +{
> +	struct ipc_namespace *ns =
> +		container_of(head->set, struct ipc_namespace, mq_set);
> +
> +	if (ns->user_ns != &init_user_ns) {
> +		kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
> +		kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
> +
> +		if (uid_valid(ns_root_uid) && uid_eq(current_euid(), ns_root_uid))
> +			return table->mode >> 6;
> +
> +		if (gid_valid(ns_root_gid) && in_egroup_p(ns_root_gid))
> +			return table->mode >> 3;
> +	}
> +
> +	return table->mode;
> +}
> +
> +static void mq_set_ownership(struct ctl_table_header *head,
> +			     struct ctl_table *table,
> +			     kuid_t *uid, kgid_t *gid)
> +{
> +	struct ipc_namespace *ns =
> +		container_of(head->set, struct ipc_namespace, mq_set);
> +
> +	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
> +	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);
> +
> +	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
> +	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
> +}
> +
>  static struct ctl_table_root set_root = {
>  	.lookup = set_lookup,
> +	.permissions = mq_permissions,
> +	.set_ownership = mq_set_ownership,
>  };
>
>  bool setup_mq_sysctls(struct ipc_namespace *ns)
* [PATCH v1 3/3] docs: Add information about ipc sysctls limitations
From: Alexey Gladkov @ 2022-08-16 15:42 UTC
To: LKML, Linux Containers, linux-doc
Cc: Andrew Morton, Christian Brauner, Eric W . Biederman, Kees Cook, Manfred Spraul, Jonathan Corbet

After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc
("[PATCH] IPC namespace - shm") the shared memory page count stopped
being global and started counting per ipc namespace. The documentation
and shmget(2) still says that shmall is a global option.

shmget(2):

SHMALL System-wide limit on the total amount of shared memory, measured
in units of the system page size. On Linux, this limit can be read and
modified via /proc/sys/kernel/shmall.

I think the changes made in 2006 should be documented.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index ddccd1077462..9ad344b5e7a1 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -541,6 +541,9 @@ default (``MSGMNB``).
 ``msgmni`` is the maximum number of IPC queues. 32000 by default
 (``MSGMNI``).
 
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in the each user namespace.
 
 msg_next_id, sem_next_id, and shm_next_id (System V IPC)
 ========================================================
@@ -1169,15 +1172,20 @@ are doing anyway :)
 shmall
 ======
 
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total amount of shared memory pages that can be used
+inside ipc namespace. The shared memory pages counting occurs for each ipc
+namespace separately and is not inherited. Hence, ``shmall`` should always be at
+least ``ceil(shmmax/PAGE_SIZE)``.
 
 If you are not sure what the default ``PAGE_SIZE`` is on your Linux
 system, you can run the following command::
 
 	# getconf PAGE_SIZE
 
+To reduce or disable the ability to allocate shared memory, you must create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of a new ipc namespace in the current user namespace or cgroups can
+be used.
 
 shmmax
 ======
--
2.33.4
* Re: [PATCH v1 3/3] docs: Add information about ipc sysctls limitations
From: Eric W. Biederman @ 2022-09-19 15:29 UTC
To: Alexey Gladkov
Cc: LKML, Linux Containers, linux-doc, Andrew Morton, Christian Brauner, Kees Cook, Manfred Spraul, Jonathan Corbet

Alexey Gladkov <legion@kernel.org> writes:

> After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc
> ("[PATCH] IPC namespace - shm") the shared memory page count stopped
> being global and started counting per ipc namespace. The documentation
> and shmget(2) still says that shmall is a global option.
>
> shmget(2):
>
> SHMALL System-wide limit on the total amount of shared memory, measured
> in units of the system page size. On Linux, this limit can be read and
> modified via /proc/sys/kernel/shmall.
>
> I think the changes made in 2006 should be documented.

Agreed. Documenting these limits only apply to their ipc namespace is
overdue.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index ddccd1077462..9ad344b5e7a1 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -541,6 +541,9 @@ default (``MSGMNB``).
>  ``msgmni`` is the maximum number of IPC queues. 32000 by default
>  (``MSGMNI``).
>
> +All of these parameters are set per ipc namespace. The maximum number of bytes
> +in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
> +respected hierarchically in the each user namespace.
>
>  msg_next_id, sem_next_id, and shm_next_id (System V IPC)
>  ========================================================
> @@ -1169,15 +1172,20 @@ are doing anyway :)
>  shmall
>  ======
>
> -This parameter sets the total amount of shared memory pages that
> -can be used system wide. Hence, ``shmall`` should always be at least
> -``ceil(shmmax/PAGE_SIZE)``.
> +This parameter sets the total amount of shared memory pages that can be used
> +inside ipc namespace. The shared memory pages counting occurs for each ipc
> +namespace separately and is not inherited. Hence, ``shmall`` should always be at
> +least ``ceil(shmmax/PAGE_SIZE)``.
>
>  If you are not sure what the default ``PAGE_SIZE`` is on your Linux
>  system, you can run the following command::
>
>  	# getconf PAGE_SIZE
>
> +To reduce or disable the ability to allocate shared memory, you must create a
> +new ipc namespace, set this parameter to the required value and prohibit the
> +creation of a new ipc namespace in the current user namespace or cgroups can
> +be used.
>
>  shmmax
>  ======
end of thread, other threads: [~2024-01-15 15:49 UTC]

Thread overview: 23+ messages
2022-07-12 16:17 [PATCH v1] sysctl: Allow change system v ipc sysctls inside ipc namespace Alexey Gladkov
2022-07-25 16:16 ` Eric W. Biederman
2022-08-16 15:42 ` Alexey Gladkov
2022-08-16 15:42 ` [PATCH v1 1/3] " Alexey Gladkov
2022-09-19 15:26 ` Eric W. Biederman
2022-09-20 16:15 ` Alexey Gladkov
2022-09-20 18:08 ` [PATCH v2 0/3] Allow to change ipc/mq " Alexey Gladkov
2022-09-20 18:08 ` [PATCH v2 1/3] sysctl: Allow change system v ipc " Alexey Gladkov
2022-09-21 9:38 ` kernel test robot
2022-09-21 10:41 ` [PATCH v3 0/3] Allow to change ipc/mq " Alexey Gladkov
2022-09-21 10:41 ` [PATCH v3 1/3] sysctl: Allow change system v ipc " Alexey Gladkov
2022-09-21 10:41 ` [PATCH v3 2/3] sysctl: Allow to change limits for posix messages queues Alexey Gladkov
2022-09-21 10:41 ` [PATCH v3 3/3] docs: Add information about ipc sysctls limitations Alexey Gladkov
2024-01-15 15:46 ` [RESEND PATCH v3 0/3] Allow to change ipc/mq sysctls inside ipc namespace Alexey Gladkov
2024-01-15 15:46 ` [RESEND PATCH v3 1/3] sysctl: Allow change system v ipc " Alexey Gladkov
2024-01-15 15:46 ` [RESEND PATCH v3 2/3] docs: Add information about ipc sysctls limitations Alexey Gladkov
2024-01-15 15:46 ` [RESEND PATCH v3 3/3] sysctl: Allow to change limits for posix messages queues Alexey Gladkov
2022-09-20 18:08 ` [PATCH v2 2/3] " Alexey Gladkov
2022-09-20 18:08 ` [PATCH v2 3/3] docs: Add information about ipc sysctls limitations Alexey Gladkov
2022-08-16 15:42 ` [PATCH v1 2/3] sysctl: Allow to change limits for posix messages queues Alexey Gladkov
2022-09-19 15:27 ` Eric W. Biederman
2022-08-16 15:42 ` [PATCH v1 3/3] docs: Add information about ipc sysctls limitations Alexey Gladkov
2022-09-19 15:29 ` Eric W. Biederman