* [PATCH v2] userns: automatically split user namespace extent @ 2020-12-03 15:02 Giuseppe Scrivano 2021-05-10 17:23 ` Serge E. Hallyn 0 siblings, 1 reply; 7+ messages in thread From: Giuseppe Scrivano @ 2020-12-03 15:02 UTC (permalink / raw) To: linux-kernel; +Cc: ebiederm, christian.brauner, serge writing to the id map fails when an extent overlaps multiple mappings in the parent user namespace, e.g.: $ cat /proc/self/uid_map 0 1000 1 1 100000 65536 $ unshare -U sleep 100 & [1] 1029703 $ printf "0 0 100\n" | tee /proc/$!/uid_map 0 0 100 tee: /proc/1029703/uid_map: Operation not permitted To prevent it from happening, automatically split an extent so that each portion fits in one extent in the parent user namespace. $ cat /proc/self/uid_map 0 1000 1 1 110000 65536 $ unshare -U sleep 100 & [1] 1552 $ printf "0 0 100\n" | tee /proc/$!/uid_map 0 0 100 $ cat /proc/$!/uid_map 0 0 1 1 1 99 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --- v2: - move the split logic when the extent are mapped to the parent map to reduce lookup complexity. v1: https://lkml.kernel.org/lkml/20201126100839.381415-1-gscrivan@redhat.com kernel/user_namespace.c | 79 +++++++++++++++++++++++++++++++++++------ 1 file changed, 68 insertions(+), 11 deletions(-) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 87804e0371fe..550612c6e794 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -312,6 +312,55 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id) return map_id_range_down(map, id, 1); } +/** + * find_and_split_extent_down - Find lower_first for the target extent + * using the specified map. + * If the extent doesn't fit in a single lower extent, split target and + * write the remaining IDs (first and count) to the overflow extent. + * If no overflow happens, overflow->count is set to 0. + */ +static int find_and_split_extent_down(struct uid_gid_map *map, + struct uid_gid_extent *target, + struct uid_gid_extent *overflow) +{ + unsigned int extents = map->nr_extents; + u32 lower_first = target->lower_first; + struct uid_gid_extent *extent; + u32 off, available; + + overflow->count = 0; + + /* Find the lower extent that includes the first ID. */ + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) + extent = map_id_range_down_base(extents, map, lower_first, 1); + else + extent = map_id_range_down_max(extents, map, lower_first, 1); + + /* Could not map the first ID in the extent. */ + if (extent == NULL) + return -EPERM; + + /* Offset of lower_first in the lower extent. */ + off = target->lower_first - extent->first; + + /* How many IDs are available in the lower extent? */ + available = extent->count - off; + + /* Requesting more IDs than available in the lower extent. */ + if (target->count > available) { + /* Move the remaining IDs to the overflow extent. */ + overflow->first = target->first + available; + overflow->lower_first = target->lower_first + available; + overflow->count = target->count - available; + + /* Shrink the initial extent to what is available. */ + target->count = available; + } + + target->lower_first = extent->lower_first + off; + return 0; +} + /** * map_id_up_base - Find idmap via binary search in static extent array. * Can only be called if number of mappings is equal or less than @@ -749,6 +798,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map, * insert_extent - Safely insert a new idmap extent into struct uid_gid_map. * Takes care to allocate a 4K block of memory if the number of mappings exceeds * UID_GID_MAP_MAX_BASE_EXTENTS. + * The extent is appended at the position map->nr_extents. */ static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) { @@ -968,30 +1018,37 @@ static ssize_t map_write(struct file *file, const char __user *buf, if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) goto out; - ret = -EPERM; /* Map the lower ids from the parent user namespace to the * kernel global id space. */ for (idx = 0; idx < new_map.nr_extents; idx++) { + struct uid_gid_extent overflow; struct uid_gid_extent *e; - u32 lower_first; if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) e = &new_map.extent[idx]; else e = &new_map.forward[idx]; - lower_first = map_id_range_down(parent_map, - e->lower_first, - e->count); - - /* Fail if we can not map the specified extent to - * the kernel global id space. - */ - if (lower_first == (u32) -1) + ret = find_and_split_extent_down(parent_map, e, &overflow); + if (ret < 0) goto out; - e->lower_first = lower_first; + /* If the extent doesn't fit in a single lower extent, + * move what could not be mapped to a new extent. + * The new extent is appended to the existing ones in + * new_map, it will be checked again and if necessary it + * is split further. + */ + if (overflow.count > 0) { + if (new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) { + ret = -EINVAL; + goto out; + } + ret = insert_extent(&new_map, &overflow); + if (ret < 0) + goto out; + } } /* -- 2.28.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2020-12-03 15:02 [PATCH v2] userns: automatically split user namespace extent Giuseppe Scrivano @ 2021-05-10 17:23 ` Serge E. Hallyn 2021-05-10 18:57 ` Serge E. Hallyn 0 siblings, 1 reply; 7+ messages in thread From: Serge E. Hallyn @ 2021-05-10 17:23 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: linux-kernel, ebiederm, christian.brauner, serge On Thu, Dec 03, 2020 at 04:02:52PM +0100, Giuseppe Scrivano wrote: > writing to the id map fails when an extent overlaps multiple mappings > in the parent user namespace, e.g.: > > $ cat /proc/self/uid_map > 0 1000 1 > 1 100000 65536 > $ unshare -U sleep 100 & > [1] 1029703 > $ printf "0 0 100\n" | tee /proc/$!/uid_map > 0 0 100 > tee: /proc/1029703/uid_map: Operation not permitted > > To prevent it from happening, automatically split an extent so that > each portion fits in one extent in the parent user namespace. > > $ cat /proc/self/uid_map > 0 1000 1 > 1 110000 65536 > $ unshare -U sleep 100 & > [1] 1552 > $ printf "0 0 100\n" | tee /proc/$!/uid_map > 0 0 100 > $ cat /proc/$!/uid_map > 0 0 1 > 1 1 99 > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> The patch on the whole looks great, easy to reason about. But I have one question below: > --- > v2: > - move the split logic when the extent are mapped to the parent map to > reduce lookup complexity. > > v1: https://lkml.kernel.org/lkml/20201126100839.381415-1-gscrivan@redhat.com > > kernel/user_namespace.c | 79 +++++++++++++++++++++++++++++++++++------ > 1 file changed, 68 insertions(+), 11 deletions(-) > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 87804e0371fe..550612c6e794 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -312,6 +312,55 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id) > return map_id_range_down(map, id, 1); > } > > +/** > + * find_and_split_extent_down - Find lower_first for the target extent > + * using the specified map. > + * If the extent doesn't fit in a single lower extent, split target and > + * write the remaining IDs (first and count) to the overflow extent. > + * If no overflow happens, overflow->count is set to 0. > + */ > +static int find_and_split_extent_down(struct uid_gid_map *map, > + struct uid_gid_extent *target, > + struct uid_gid_extent *overflow) > +{ > + unsigned int extents = map->nr_extents; > + u32 lower_first = target->lower_first; > + struct uid_gid_extent *extent; > + u32 off, available; > + > + overflow->count = 0; > + > + /* Find the lower extent that includes the first ID. */ > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > + extent = map_id_range_down_base(extents, map, lower_first, 1); > + else > + extent = map_id_range_down_max(extents, map, lower_first, 1); > + > + /* Could not map the first ID in the extent. */ > + if (extent == NULL) > + return -EPERM; > + > + /* Offset of lower_first in the lower extent. */ > + off = target->lower_first - extent->first; > + > + /* How many IDs are available in the lower extent? */ > + available = extent->count - off; > + > + /* Requesting more IDs than available in the lower extent. */ > + if (target->count > available) { > + /* Move the remaining IDs to the overflow extent. */ > + overflow->first = target->first + available; > + overflow->lower_first = target->lower_first + available; > + overflow->count = target->count - available; > + > + /* Shrink the initial extent to what is available. */ > + target->count = available; > + } > + > + target->lower_first = extent->lower_first + off; > + return 0; > +} > + > /** > * map_id_up_base - Find idmap via binary search in static extent array. > * Can only be called if number of mappings is equal or less than > @@ -749,6 +798,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map, > * insert_extent - Safely insert a new idmap extent into struct uid_gid_map. > * Takes care to allocate a 4K block of memory if the number of mappings exceeds > * UID_GID_MAP_MAX_BASE_EXTENTS. > + * The extent is appended at the position map->nr_extents. > */ > static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) > { > @@ -968,30 +1018,37 @@ static ssize_t map_write(struct file *file, const char __user *buf, > if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) > goto out; > > - ret = -EPERM; > /* Map the lower ids from the parent user namespace to the > * kernel global id space. > */ > for (idx = 0; idx < new_map.nr_extents; idx++) { > + struct uid_gid_extent overflow; > struct uid_gid_extent *e; > - u32 lower_first; > > if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > e = &new_map.extent[idx]; > else > e = &new_map.forward[idx]; > > - lower_first = map_id_range_down(parent_map, > - e->lower_first, > - e->count); > - > - /* Fail if we can not map the specified extent to > - * the kernel global id space. > - */ > - if (lower_first == (u32) -1) > + ret = find_and_split_extent_down(parent_map, e, &overflow); > + if (ret < 0) > goto out; > > - e->lower_first = lower_first; > + /* If the extent doesn't fit in a single lower extent, > + * move what could not be mapped to a new extent. > + * The new extent is appended to the existing ones in > + * new_map, it will be checked again and if necessary it > + * is split further. > + */ > + if (overflow.count > 0) { > + if (new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) { Why are you doing this? The insert_extent() will automatically extend it if needed, right? So this condition should be fine? > + ret = -EINVAL; > + goto out; > + } > + ret = insert_extent(&new_map, &overflow); > + if (ret < 0) > + goto out; > + } > } > > /* > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2021-05-10 17:23 ` Serge E. Hallyn @ 2021-05-10 18:57 ` Serge E. Hallyn 2021-06-04 14:41 ` Giuseppe Scrivano 0 siblings, 1 reply; 7+ messages in thread From: Serge E. Hallyn @ 2021-05-10 18:57 UTC (permalink / raw) To: Serge E. Hallyn Cc: Giuseppe Scrivano, linux-kernel, ebiederm, christian.brauner On Mon, May 10, 2021 at 12:23:51PM -0500, Serge E. Hallyn wrote: > On Thu, Dec 03, 2020 at 04:02:52PM +0100, Giuseppe Scrivano wrote: > > writing to the id map fails when an extent overlaps multiple mappings > > in the parent user namespace, e.g.: > > > > $ cat /proc/self/uid_map > > 0 1000 1 > > 1 100000 65536 > > $ unshare -U sleep 100 & > > [1] 1029703 > > $ printf "0 0 100\n" | tee /proc/$!/uid_map > > 0 0 100 > > tee: /proc/1029703/uid_map: Operation not permitted > > > > To prevent it from happening, automatically split an extent so that > > each portion fits in one extent in the parent user namespace. > > > > $ cat /proc/self/uid_map > > 0 1000 1 > > 1 110000 65536 > > $ unshare -U sleep 100 & > > [1] 1552 > > $ printf "0 0 100\n" | tee /proc/$!/uid_map > > 0 0 100 > > $ cat /proc/$!/uid_map > > 0 0 1 > > 1 1 99 > > > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> > > The patch on the whole looks great, easy to reason about. But I have > one question below: As you pointed out, I was misreading the variable name, thank you :) Reviewed-by: Serge Hallyn <serge@hallyn.com> > > > --- > > v2: > > - move the split logic when the extent are mapped to the parent map to > > reduce lookup complexity. > > > > v1: https://lkml.kernel.org/lkml/20201126100839.381415-1-gscrivan@redhat.com > > > > kernel/user_namespace.c | 79 +++++++++++++++++++++++++++++++++++------ > > 1 file changed, 68 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > > index 87804e0371fe..550612c6e794 100644 > > --- a/kernel/user_namespace.c > > +++ b/kernel/user_namespace.c > > @@ -312,6 +312,55 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id) > > return map_id_range_down(map, id, 1); > > } > > > > +/** > > + * find_and_split_extent_down - Find lower_first for the target extent > > + * using the specified map. > > + * If the extent doesn't fit in a single lower extent, split target and > > + * write the remaining IDs (first and count) to the overflow extent. > > + * If no overflow happens, overflow->count is set to 0. > > + */ > > +static int find_and_split_extent_down(struct uid_gid_map *map, > > + struct uid_gid_extent *target, > > + struct uid_gid_extent *overflow) > > +{ > > + unsigned int extents = map->nr_extents; > > + u32 lower_first = target->lower_first; > > + struct uid_gid_extent *extent; > > + u32 off, available; > > + > > + overflow->count = 0; > > + > > + /* Find the lower extent that includes the first ID. */ > > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > > + extent = map_id_range_down_base(extents, map, lower_first, 1); > > + else > > + extent = map_id_range_down_max(extents, map, lower_first, 1); > > + > > + /* Could not map the first ID in the extent. */ > > + if (extent == NULL) > > + return -EPERM; > > + > > + /* Offset of lower_first in the lower extent. */ > > + off = target->lower_first - extent->first; > > + > > + /* How many IDs are available in the lower extent? */ > > + available = extent->count - off; > > + > > + /* Requesting more IDs than available in the lower extent. */ > > + if (target->count > available) { > > + /* Move the remaining IDs to the overflow extent. */ > > + overflow->first = target->first + available; > > + overflow->lower_first = target->lower_first + available; > > + overflow->count = target->count - available; > > + > > + /* Shrink the initial extent to what is available. */ > > + target->count = available; > > + } > > + > > + target->lower_first = extent->lower_first + off; > > + return 0; > > +} > > + > > /** > > * map_id_up_base - Find idmap via binary search in static extent array. > > * Can only be called if number of mappings is equal or less than > > @@ -749,6 +798,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map, > > * insert_extent - Safely insert a new idmap extent into struct uid_gid_map. > > * Takes care to allocate a 4K block of memory if the number of mappings exceeds > > * UID_GID_MAP_MAX_BASE_EXTENTS. > > + * The extent is appended at the position map->nr_extents. > > */ > > static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) > > { > > @@ -968,30 +1018,37 @@ static ssize_t map_write(struct file *file, const char __user *buf, > > if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) > > goto out; > > > > - ret = -EPERM; > > /* Map the lower ids from the parent user namespace to the > > * kernel global id space. > > */ > > for (idx = 0; idx < new_map.nr_extents; idx++) { > > + struct uid_gid_extent overflow; > > struct uid_gid_extent *e; > > - u32 lower_first; > > > > if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > > e = &new_map.extent[idx]; > > else > > e = &new_map.forward[idx]; > > > > - lower_first = map_id_range_down(parent_map, > > - e->lower_first, > > - e->count); > > - > > - /* Fail if we can not map the specified extent to > > - * the kernel global id space. > > - */ > > - if (lower_first == (u32) -1) > > + ret = find_and_split_extent_down(parent_map, e, &overflow); > > + if (ret < 0) > > goto out; > > > > - e->lower_first = lower_first; > > + /* If the extent doesn't fit in a single lower extent, > > + * move what could not be mapped to a new extent. > > + * The new extent is appended to the existing ones in > > + * new_map, it will be checked again and if necessary it > > + * is split further. > > + */ > > + if (overflow.count > 0) { > > + if (new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) { > > Why are you doing this? The insert_extent() will automatically extend it > if needed, right? So this condition should be fine? > > > + ret = -EINVAL; > > + goto out; > > + } > > + ret = insert_extent(&new_map, &overflow); > > + if (ret < 0) > > + goto out; > > + } > > } > > > > /* > > -- > > 2.28.0 > > Cheers, > Balint > > > > > -serge > > > > -- > Balint Reczey > Ubuntu & Debian Developer > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2021-05-10 18:57 ` Serge E. Hallyn @ 2021-06-04 14:41 ` Giuseppe Scrivano 2021-06-05 13:00 ` Christian Brauner 0 siblings, 1 reply; 7+ messages in thread From: Giuseppe Scrivano @ 2021-06-04 14:41 UTC (permalink / raw) To: ebiederm, christian.brauner; +Cc: Serge E. Hallyn, linux-kernel Christian, Eric, are you fine with this patch or is there anything more you'd like me to change? Thanks, Giuseppe "Serge E. Hallyn" <serge@hallyn.com> writes: > On Mon, May 10, 2021 at 12:23:51PM -0500, Serge E. Hallyn wrote: >> On Thu, Dec 03, 2020 at 04:02:52PM +0100, Giuseppe Scrivano wrote: >> > writing to the id map fails when an extent overlaps multiple mappings >> > in the parent user namespace, e.g.: >> > >> > $ cat /proc/self/uid_map >> > 0 1000 1 >> > 1 100000 65536 >> > $ unshare -U sleep 100 & >> > [1] 1029703 >> > $ printf "0 0 100\n" | tee /proc/$!/uid_map >> > 0 0 100 >> > tee: /proc/1029703/uid_map: Operation not permitted >> > >> > To prevent it from happening, automatically split an extent so that >> > each portion fits in one extent in the parent user namespace. >> > >> > $ cat /proc/self/uid_map >> > 0 1000 1 >> > 1 110000 65536 >> > $ unshare -U sleep 100 & >> > [1] 1552 >> > $ printf "0 0 100\n" | tee /proc/$!/uid_map >> > 0 0 100 >> > $ cat /proc/$!/uid_map >> > 0 0 1 >> > 1 1 99 >> > >> > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> >> >> The patch on the whole looks great, easy to reason about. But I have >> one question below: > > As you pointed out, I was misreading the variable name, thank you :) > > Reviewed-by: Serge Hallyn <serge@hallyn.com> > >> >> > --- >> > v2: >> > - move the split logic when the extent are mapped to the parent map to >> > reduce lookup complexity. >> > >> > v1: https://lkml.kernel.org/lkml/20201126100839.381415-1-gscrivan@redhat.com >> > >> > kernel/user_namespace.c | 79 +++++++++++++++++++++++++++++++++++------ >> > 1 file changed, 68 insertions(+), 11 deletions(-) >> > >> > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c >> > index 87804e0371fe..550612c6e794 100644 >> > --- a/kernel/user_namespace.c >> > +++ b/kernel/user_namespace.c >> > @@ -312,6 +312,55 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id) >> > return map_id_range_down(map, id, 1); >> > } >> > >> > +/** >> > + * find_and_split_extent_down - Find lower_first for the target extent >> > + * using the specified map. >> > + * If the extent doesn't fit in a single lower extent, split target and >> > + * write the remaining IDs (first and count) to the overflow extent. >> > + * If no overflow happens, overflow->count is set to 0. >> > + */ >> > +static int find_and_split_extent_down(struct uid_gid_map *map, >> > + struct uid_gid_extent *target, >> > + struct uid_gid_extent *overflow) >> > +{ >> > + unsigned int extents = map->nr_extents; >> > + u32 lower_first = target->lower_first; >> > + struct uid_gid_extent *extent; >> > + u32 off, available; >> > + >> > + overflow->count = 0; >> > + >> > + /* Find the lower extent that includes the first ID. */ >> > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >> > + extent = map_id_range_down_base(extents, map, lower_first, 1); >> > + else >> > + extent = map_id_range_down_max(extents, map, lower_first, 1); >> > + >> > + /* Could not map the first ID in the extent. */ >> > + if (extent == NULL) >> > + return -EPERM; >> > + >> > + /* Offset of lower_first in the lower extent. */ >> > + off = target->lower_first - extent->first; >> > + >> > + /* How many IDs are available in the lower extent? */ >> > + available = extent->count - off; >> > + >> > + /* Requesting more IDs than available in the lower extent. */ >> > + if (target->count > available) { >> > + /* Move the remaining IDs to the overflow extent. */ >> > + overflow->first = target->first + available; >> > + overflow->lower_first = target->lower_first + available; >> > + overflow->count = target->count - available; >> > + >> > + /* Shrink the initial extent to what is available. */ >> > + target->count = available; >> > + } >> > + >> > + target->lower_first = extent->lower_first + off; >> > + return 0; >> > +} >> > + >> > /** >> > * map_id_up_base - Find idmap via binary search in static extent array. >> > * Can only be called if number of mappings is equal or less than >> > @@ -749,6 +798,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map, >> > * insert_extent - Safely insert a new idmap extent into struct uid_gid_map. >> > * Takes care to allocate a 4K block of memory if the number of mappings exceeds >> > * UID_GID_MAP_MAX_BASE_EXTENTS. >> > + * The extent is appended at the position map->nr_extents. >> > */ >> > static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) >> > { >> > @@ -968,30 +1018,37 @@ static ssize_t map_write(struct file *file, const char __user *buf, >> > if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) >> > goto out; >> > >> > - ret = -EPERM; >> > /* Map the lower ids from the parent user namespace to the >> > * kernel global id space. >> > */ >> > for (idx = 0; idx < new_map.nr_extents; idx++) { >> > + struct uid_gid_extent overflow; >> > struct uid_gid_extent *e; >> > - u32 lower_first; >> > >> > if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >> > e = &new_map.extent[idx]; >> > else >> > e = &new_map.forward[idx]; >> > >> > - lower_first = map_id_range_down(parent_map, >> > - e->lower_first, >> > - e->count); >> > - >> > - /* Fail if we can not map the specified extent to >> > - * the kernel global id space. >> > - */ >> > - if (lower_first == (u32) -1) >> > + ret = find_and_split_extent_down(parent_map, e, &overflow); >> > + if (ret < 0) >> > goto out; >> > >> > - e->lower_first = lower_first; >> > + /* If the extent doesn't fit in a single lower extent, >> > + * move what could not be mapped to a new extent. >> > + * The new extent is appended to the existing ones in >> > + * new_map, it will be checked again and if necessary it >> > + * is split further. >> > + */ >> > + if (overflow.count > 0) { >> > + if (new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) { >> >> Why are you doing this? The insert_extent() will automatically extend it >> if needed, right? So this condition should be fine? >> >> > + ret = -EINVAL; >> > + goto out; >> > + } >> > + ret = insert_extent(&new_map, &overflow); >> > + if (ret < 0) >> > + goto out; >> > + } >> > } >> > >> > /* >> > -- >> > 2.28.0 >> >> Cheers, >> Balint >> >> > >> > -serge >> >> >> >> -- >> Balint Reczey >> Ubuntu & Debian Developer >> > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2021-06-04 14:41 ` Giuseppe Scrivano @ 2021-06-05 13:00 ` Christian Brauner 2021-06-07 9:31 ` Giuseppe Scrivano 0 siblings, 1 reply; 7+ messages in thread From: Christian Brauner @ 2021-06-05 13:00 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: ebiederm, Serge E. Hallyn, linux-kernel On Fri, Jun 04, 2021 at 04:41:19PM +0200, Giuseppe Scrivano wrote: > Christian, Eric, > > are you fine with this patch or is there anything more you'd like me to > change? Before being a little bit of a party pooper thanks for your patches! I appreciate the work you're doing! So my concern is that this may cause silent regressions/security issues for tools in userspace by making this work automagically. For example we have a go library that calculates idmap ranges and extents. Those idmappings are stored in the database and in the container's config and for backups and so on. The calculated extents match exactly with how these lines look in /proc/<pid>/*id_map. If we miscalculate the extents and we try to write them to /proc/<pid>/*id_map we get told to go away and immediately recognize the bug. With this patch however we may succeed and then we record misleading extents in the db or the config. Turning this into a general concern, I think it is a non-trivial semantic change to break up the 1:1 correspondence between mappings written and mappings applied that we had for such a long time now. In general I'm not sure it should be the kernel that has the idmapping ranges smarts. I'd rather see a generic userspace library that container runtimes make use of that also breaks up idmapping ranges. We can certainly accomodate this in https://pkg.go.dev/github.com/lxc/lxd/shared/idmap Is that a reasonable concern? Christian > > Thanks, > Giuseppe > > > > "Serge E. Hallyn" <serge@hallyn.com> writes: > > > On Mon, May 10, 2021 at 12:23:51PM -0500, Serge E. Hallyn wrote: > >> On Thu, Dec 03, 2020 at 04:02:52PM +0100, Giuseppe Scrivano wrote: > >> > writing to the id map fails when an extent overlaps multiple mappings > >> > in the parent user namespace, e.g.: > >> > > >> > $ cat /proc/self/uid_map > >> > 0 1000 1 > >> > 1 100000 65536 > >> > $ unshare -U sleep 100 & > >> > [1] 1029703 > >> > $ printf "0 0 100\n" | tee /proc/$!/uid_map > >> > 0 0 100 > >> > tee: /proc/1029703/uid_map: Operation not permitted > >> > > >> > To prevent it from happening, automatically split an extent so that > >> > each portion fits in one extent in the parent user namespace. > >> > > >> > $ cat /proc/self/uid_map > >> > 0 1000 1 > >> > 1 110000 65536 > >> > $ unshare -U sleep 100 & > >> > [1] 1552 > >> > $ printf "0 0 100\n" | tee /proc/$!/uid_map > >> > 0 0 100 > >> > $ cat /proc/$!/uid_map > >> > 0 0 1 > >> > 1 1 99 > >> > > >> > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> > >> > >> The patch on the whole looks great, easy to reason about. But I have > >> one question below: > > > > As you pointed out, I was misreading the variable name, thank you :) > > > > Reviewed-by: Serge Hallyn <serge@hallyn.com> > > > >> > >> > --- > >> > v2: > >> > - move the split logic when the extent are mapped to the parent map to > >> > reduce lookup complexity. > >> > > >> > v1: https://lkml.kernel.org/lkml/20201126100839.381415-1-gscrivan@redhat.com > >> > > >> > kernel/user_namespace.c | 79 +++++++++++++++++++++++++++++++++++------ > >> > 1 file changed, 68 insertions(+), 11 deletions(-) > >> > > >> > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > >> > index 87804e0371fe..550612c6e794 100644 > >> > --- a/kernel/user_namespace.c > >> > +++ b/kernel/user_namespace.c > >> > @@ -312,6 +312,55 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id) > >> > return map_id_range_down(map, id, 1); > >> > } > >> > > >> > +/** > >> > + * find_and_split_extent_down - Find lower_first for the target extent > >> > + * using the specified map. > >> > + * If the extent doesn't fit in a single lower extent, split target and > >> > + * write the remaining IDs (first and count) to the overflow extent. > >> > + * If no overflow happens, overflow->count is set to 0. > >> > + */ > >> > +static int find_and_split_extent_down(struct uid_gid_map *map, > >> > + struct uid_gid_extent *target, > >> > + struct uid_gid_extent *overflow) > >> > +{ > >> > + unsigned int extents = map->nr_extents; > >> > + u32 lower_first = target->lower_first; > >> > + struct uid_gid_extent *extent; > >> > + u32 off, available; > >> > + > >> > + overflow->count = 0; > >> > + > >> > + /* Find the lower extent that includes the first ID. */ > >> > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > >> > + extent = map_id_range_down_base(extents, map, lower_first, 1); > >> > + else > >> > + extent = map_id_range_down_max(extents, map, lower_first, 1); > >> > + > >> > + /* Could not map the first ID in the extent. */ > >> > + if (extent == NULL) > >> > + return -EPERM; > >> > + > >> > + /* Offset of lower_first in the lower extent. */ > >> > + off = target->lower_first - extent->first; > >> > + > >> > + /* How many IDs are available in the lower extent? */ > >> > + available = extent->count - off; > >> > + > >> > + /* Requesting more IDs than available in the lower extent. */ > >> > + if (target->count > available) { > >> > + /* Move the remaining IDs to the overflow extent. */ > >> > + overflow->first = target->first + available; > >> > + overflow->lower_first = target->lower_first + available; > >> > + overflow->count = target->count - available; > >> > + > >> > + /* Shrink the initial extent to what is available. */ > >> > + target->count = available; > >> > + } > >> > + > >> > + target->lower_first = extent->lower_first + off; > >> > + return 0; > >> > +} > >> > + > >> > /** > >> > * map_id_up_base - Find idmap via binary search in static extent array. > >> > * Can only be called if number of mappings is equal or less than > >> > @@ -749,6 +798,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map, > >> > * insert_extent - Safely insert a new idmap extent into struct uid_gid_map. > >> > * Takes care to allocate a 4K block of memory if the number of mappings exceeds > >> > * UID_GID_MAP_MAX_BASE_EXTENTS. > >> > + * The extent is appended at the position map->nr_extents. > >> > */ > >> > static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) > >> > { > >> > @@ -968,30 +1018,37 @@ static ssize_t map_write(struct file *file, const char __user *buf, > >> > if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) > >> > goto out; > >> > > >> > - ret = -EPERM; > >> > /* Map the lower ids from the parent user namespace to the > >> > * kernel global id space. > >> > */ > >> > for (idx = 0; idx < new_map.nr_extents; idx++) { > >> > + struct uid_gid_extent overflow; > >> > struct uid_gid_extent *e; > >> > - u32 lower_first; > >> > > >> > if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > >> > e = &new_map.extent[idx]; > >> > else > >> > e = &new_map.forward[idx]; > >> > > >> > - lower_first = map_id_range_down(parent_map, > >> > - e->lower_first, > >> > - e->count); > >> > - > >> > - /* Fail if we can not map the specified extent to > >> > - * the kernel global id space. > >> > - */ > >> > - if (lower_first == (u32) -1) > >> > + ret = find_and_split_extent_down(parent_map, e, &overflow); > >> > + if (ret < 0) > >> > goto out; > >> > > >> > - e->lower_first = lower_first; > >> > + /* If the extent doesn't fit in a single lower extent, > >> > + * move what could not be mapped to a new extent. > >> > + * The new extent is appended to the existing ones in > >> > + * new_map, it will be checked again and if necessary it > >> > + * is split further. > >> > + */ > >> > + if (overflow.count > 0) { > >> > + if (new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) { > >> > >> Why are you doing this? The insert_extent() will automatically extend it > >> if needed, right? So this condition should be fine? > >> > >> > + ret = -EINVAL; > >> > + goto out; > >> > + } > >> > + ret = insert_extent(&new_map, &overflow); > >> > + if (ret < 0) > >> > + goto out; > >> > + } > >> > } > >> > > >> > /* > >> > -- > >> > 2.28.0 > >> > >> Cheers, > >> Balint > >> > >> > > >> > -serge > >> > >> > >> > >> -- > >> Balint Reczey > >> Ubuntu & Debian Developer > >> > > > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2021-06-05 13:00 ` Christian Brauner @ 2021-06-07 9:31 ` Giuseppe Scrivano 2021-06-07 11:08 ` Christian Brauner 0 siblings, 1 reply; 7+ messages in thread From: Giuseppe Scrivano @ 2021-06-07 9:31 UTC (permalink / raw) To: Christian Brauner; +Cc: ebiederm, Serge E. Hallyn, linux-kernel Christian Brauner <christian.brauner@ubuntu.com> writes: > On Fri, Jun 04, 2021 at 04:41:19PM +0200, Giuseppe Scrivano wrote: >> Christian, Eric, >> >> are you fine with this patch or is there anything more you'd like me to >> change? > > Before being a little bit of a party pooper thanks for your patches! I > appreciate the work you're doing! > > So my concern is that this may cause silent regressions/security issues > for tools in userspace by making this work automagically. > > For example we have a go library that calculates idmap ranges and > extents. Those idmappings are stored in the database and in the > container's config and for backups and so on. > > The calculated extents match exactly with how these lines look in > /proc/<pid>/*id_map. > If we miscalculate the extents and we try to write them to > /proc/<pid>/*id_map we get told to go away and immediately recognize the > bug. > With this patch however we may succeed and then we record misleading > extents in the db or the config. > > Turning this into a general concern, I think it is a non-trivial > semantic change to break up the 1:1 correspondence between mappings > written and mappings applied that we had for such a long time now. > > In general I'm not sure it should be the kernel that has the idmapping > ranges smarts. > > I'd rather see a generic userspace library that container runtimes make > use of that also breaks up idmapping ranges. We can certainly accomodate > this in > https://pkg.go.dev/github.com/lxc/lxd/shared/idmap > > Is that a reasonable concern? I've ended up adding a similar logic to Podman for the same reason as above. In our use case, containers are created within a user namespace that usually has two extents, the current unprivileged ID mapped to root, and any additional ID allocated to the user through /etc/sub?id mapped to 1. Within this user namespace, other user namespaces can be created and we let users specify the mappings. It is a common mistake to specify a mapping that overlaps multiple extents in the parent userns e.g: 0:0:IDS_AVAILABLE. To avoid the problem we have to first parse /proc/self/?id_map and then split the specified extents when they overlap. In our case this is not an issue anymore, moving the logic to the kernel would just avoid a open syscall. IMHO the 1:1 mapping is just an implementation detail, that is not obvious for users. Having the split in the kernel will also avoid that this same check is added to each container runtimes that uses nested user namespaces. Thanks, Giuseppe ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] userns: automatically split user namespace extent 2021-06-07 9:31 ` Giuseppe Scrivano @ 2021-06-07 11:08 ` Christian Brauner 0 siblings, 0 replies; 7+ messages in thread From: Christian Brauner @ 2021-06-07 11:08 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: ebiederm, Serge E. Hallyn, linux-kernel On Mon, Jun 07, 2021 at 11:31:37AM +0200, Giuseppe Scrivano wrote: > Christian Brauner <christian.brauner@ubuntu.com> writes: > > > On Fri, Jun 04, 2021 at 04:41:19PM +0200, Giuseppe Scrivano wrote: > >> Christian, Eric, > >> > >> are you fine with this patch or is there anything more you'd like me to > >> change? > > > > Before being a little bit of a party pooper thanks for your patches! I > > appreciate the work you're doing! > > > > So my concern is that this may cause silent regressions/security issues > > for tools in userspace by making this work automagically. > > > > For example we have a go library that calculates idmap ranges and > > extents. Those idmappings are stored in the database and in the > > container's config and for backups and so on. > > > > The calculated extents match exactly with how these lines look in > > /proc/<pid>/*id_map. > > If we miscalculate the extents and we try to write them to > > /proc/<pid>/*id_map we get told to go away and immediately recognize the > > bug. > > With this patch however we may succeed and then we record misleading > > extents in the db or the config. > > > > Turning this into a general concern, I think it is a non-trivial > > semantic change to break up the 1:1 correspondence between mappings > > written and mappings applied that we had for such a long time now. > > > > In general I'm not sure it should be the kernel that has the idmapping > > ranges smarts. > > > > I'd rather see a generic userspace library that container runtimes make > > use of that also breaks up idmapping ranges. We can certainly accomodate > > this in > > https://pkg.go.dev/github.com/lxc/lxd/shared/idmap > > > > Is that a reasonable concern? > > I've ended up adding a similar logic to Podman for the same reason as > above. > > In our use case, containers are created within a user namespace that > usually has two extents, the current unprivileged ID mapped to root, > and any additional ID allocated to the user through /etc/sub?id mapped > to 1. > > Within this user namespace, other user namespaces can be created and we > let users specify the mappings. It is a common mistake to specify a > mapping that overlaps multiple extents in the parent userns e.g: > 0:0:IDS_AVAILABLE. > > To avoid the problem we have to first parse /proc/self/?id_map and then > split the specified extents when they overlap. > > In our case this is not an issue anymore, moving the logic to the kernel > would just avoid a open syscall. > > IMHO the 1:1 mapping is just an implementation detail, that is not With proc such details tend to have the unfortunate tendency to become part of the api which is what makes me a bit uneasy. For non-lxd users that use the low-level lxc library directly we allow callers to specify the idmappings in the exact same format they will appear in procfs. This means you can reason 1:1 about the extents used. A change like this on the kernel level would break that assumption meaning e.g. that container configs would suddenly start that would otherwise fail because of miscalculated or "wrong" extents. Since we all have solved that problem in userspace already saving a single open system call is not a good enough win maybe. It feels like the new libsubid library in shadow that we added should grow that feature to split extents instead. Potentially making new*idmap handle that case by default in newer versions. That's easier to revert if anything breaks, allows us to see whether this causes trouble, and users that write to new*idmap directly aren't affected at all. Christian ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-06-07 11:08 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-12-03 15:02 [PATCH v2] userns: automatically split user namespace extent Giuseppe Scrivano 2021-05-10 17:23 ` Serge E. Hallyn 2021-05-10 18:57 ` Serge E. Hallyn 2021-06-04 14:41 ` Giuseppe Scrivano 2021-06-05 13:00 ` Christian Brauner 2021-06-07 9:31 ` Giuseppe Scrivano 2021-06-07 11:08 ` Christian Brauner
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.