linux-mm.kvack.org archive mirror
* KASAN: slab-out-of-bounds Write in mpol_parse_str
@ 2020-01-15  2:24 syzbot
  2020-01-15  5:54 ` [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str() Dan Carpenter
  0 siblings, 1 reply; 13+ messages in thread
From: syzbot @ 2020-01-15  2:24 UTC (permalink / raw)
  To: aarcange, akpm, hughd, linux-kernel, linux-mm, mhocko,
	syzkaller-bugs, vbabka, viro, yang.shi

Hello,

syzbot found the following crash on:

HEAD commit:    e69ec487 Merge branch 'for-linus' of git://git.kernel.org/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=143045c6e00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=18698c0c240ba616
dashboard link: https://syzkaller.appspot.com/bug?extid=e64a13c5369a194d67df
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15ddc8d1e00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15851c85e00000

The bug was bisected to:

commit 626c3920aeb4575f53c96b0d4ad4e651a21cbb66
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Sep 9 00:28:06 2019 +0000

     shmem_parse_one(): switch to use of fs_parse()

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1643b3fee00000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=1543b3fee00000
console output: https://syzkaller.appspot.com/x/log.txt?x=1143b3fee00000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 626c3920aeb4 ("shmem_parse_one(): switch to use of fs_parse()")

==================================================================
BUG: KASAN: slab-out-of-bounds in mpol_parse_str+0x87b/0xa50 mm/mempolicy.c:2922
Write of size 1 at addr ffff8880a4513abf by task syz-executor950/9591

CPU: 0 PID: 9591 Comm: syz-executor950 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x197/0x210 lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
  kasan_report+0x12/0x20 mm/kasan/common.c:639
  __asan_report_store1_noabort+0x17/0x20 mm/kasan/generic_report.c:137
  mpol_parse_str+0x87b/0xa50 mm/mempolicy.c:2922
  shmem_parse_one+0x71e/0xa40 mm/shmem.c:3472
  vfs_parse_fs_param+0x2ca/0x540 fs/fs_context.c:145
  vfs_parse_fs_string+0x105/0x170 fs/fs_context.c:188
  shmem_parse_options+0x168/0x250 mm/shmem.c:3522
  parse_monolithic_mount_data+0x69/0x90 fs/fs_context.c:704
  do_new_mount fs/namespace.c:2818 [inline]
  do_mount+0x1310/0x1b50 fs/namespace.c:3142
  __do_sys_mount fs/namespace.c:3351 [inline]
  __se_sys_mount fs/namespace.c:3328 [inline]
  __x64_sys_mount+0x192/0x230 fs/namespace.c:3328
  do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446a9a
Code: b8 08 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 7d ae fb ff c3 66 2e 0f  
1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 5a ae fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:00007fffb59b03c8 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007fffb59b03d0 RCX: 0000000000446a9a
RDX: 00007fffb59b03d0 RSI: 00000000200000c0 RDI: 00007fffb59b03f0
RBP: 0000000000000003 R08: 00007fffb59b0430 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 00007fffb59b0430
R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 9564:
  save_stack+0x23/0x90 mm/kasan/common.c:72
  set_track mm/kasan/common.c:80 [inline]
  __kasan_kmalloc mm/kasan/common.c:513 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:527
  __do_kmalloc mm/slab.c:3656 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3665
  kmalloc include/linux/slab.h:561 [inline]
  tomoyo_add_entry security/tomoyo/common.c:2031 [inline]
  tomoyo_supervisor+0xd3e/0xef0 security/tomoyo/common.c:2103
  tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
  tomoyo_path_permission security/tomoyo/file.c:587 [inline]
  tomoyo_path_permission+0x263/0x360 security/tomoyo/file.c:573
  tomoyo_path_perm+0x318/0x430 security/tomoyo/file.c:838
  tomoyo_inode_getattr+0x1d/0x30 security/tomoyo/tomoyo.c:129
  security_inode_getattr+0xf2/0x150 security/security.c:1222
  vfs_getattr+0x25/0x70 fs/stat.c:115
  vfs_statx_fd+0x71/0xc0 fs/stat.c:145
  vfs_fstat include/linux/fs.h:3265 [inline]
  __do_sys_newfstat+0x9b/0x120 fs/stat.c:378
  __se_sys_newfstat fs/stat.c:375 [inline]
  __x64_sys_newfstat+0x54/0x80 fs/stat.c:375
  do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9564:
  save_stack+0x23/0x90 mm/kasan/common.c:72
  set_track mm/kasan/common.c:80 [inline]
  kasan_set_free_info mm/kasan/common.c:335 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
  __cache_free mm/slab.c:3426 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3757
  tomoyo_add_entry security/tomoyo/common.c:2045 [inline]
  tomoyo_supervisor+0xc2c/0xef0 security/tomoyo/common.c:2103
  tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
  tomoyo_path_permission security/tomoyo/file.c:587 [inline]
  tomoyo_path_permission+0x263/0x360 security/tomoyo/file.c:573
  tomoyo_path_perm+0x318/0x430 security/tomoyo/file.c:838
  tomoyo_inode_getattr+0x1d/0x30 security/tomoyo/tomoyo.c:129
  security_inode_getattr+0xf2/0x150 security/security.c:1222
  vfs_getattr+0x25/0x70 fs/stat.c:115
  vfs_statx_fd+0x71/0xc0 fs/stat.c:145
  vfs_fstat include/linux/fs.h:3265 [inline]
  __do_sys_newfstat+0x9b/0x120 fs/stat.c:378
  __se_sys_newfstat fs/stat.c:375 [inline]
  __x64_sys_newfstat+0x54/0x80 fs/stat.c:375
  do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a4513a80
  which belongs to the cache kmalloc-32 of size 32
The buggy address is located 31 bytes to the right of
  32-byte region [ffff8880a4513a80, ffff8880a4513aa0)
The buggy address belongs to the page:
page:ffffea00029144c0 refcount:1 mapcount:0 mapping:ffff8880aa4001c0 index:0xffff8880a4513fc1
raw: 00fffe0000000200 ffffea0002a2e388 ffffea0002581cc8 ffff8880aa4001c0
raw: ffff8880a4513fc1 ffff8880a4513000 0000000100000028 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8880a4513980: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
  ffff8880a4513a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
> ffff8880a4513a80: fb fb fb fb fc fc fc fc 00 05 fc fc fc fc fc fc
                                         ^
  ffff8880a4513b00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
  ffff8880a4513b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
==================================================================
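
As a cross-check of the report against the patch description further down, the arithmetic below recomputes the "31 bytes to the right" figure and decodes the shadow row marked with '>'. It assumes generic KASAN's scale of one shadow byte per 8 bytes of memory and its usual encoding (0x00 = all 8 bytes accessible, 0x01..0x07 = only that many leading bytes accessible, 0xfb = freed kmalloc object, 0xfc = kmalloc redzone). The helper names are hypothetical; only the addresses and shadow bytes come from the report. Reading the live object it finds as the mount-option buffer is an inference, not something the report states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: generic KASAN maps each 8 bytes of memory to one shadow byte. */
#define SHADOW_SCALE 8u

/* Values copied from the report above. */
static const uint64_t obj_end   = 0xffff8880a4513aa0ull; /* end of the 32-byte object */
static const uint64_t bad_addr  = 0xffff8880a4513abfull; /* the 1-byte write */
static const uint64_t row_start = 0xffff8880a4513a80ull; /* shadow row marked '>' */
static const uint8_t row[16] = {
	0xfb, 0xfb, 0xfb, 0xfb, 0xfc, 0xfc, 0xfc, 0xfc,
	0x00, 0x05, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
};

/* Distance past the reported object: 0x...abf - 0x...aa0. */
static uint64_t bytes_right_of_object(void)
{
	return bad_addr - obj_end;
}

/* Index of the shadow byte (within the row) that covers 'addr'. */
static unsigned shadow_index(uint64_t addr)
{
	return (unsigned)((addr - row_start) / SHADOW_SCALE);
}

/*
 * Find the first accessible run in the row: 0x00 counts as 8 bytes,
 * 0x01..0x07 as that many bytes, anything larger is poisoned.
 */
static void find_live_object(uint64_t *start, unsigned *size)
{
	assert(start != NULL && size != NULL);
	*start = 0;
	*size = 0;
	for (unsigned i = 0; i < 16; i++) {
		if (row[i] <= 0x07) {
			if (*size == 0)
				*start = row_start + (uint64_t)i * SHADOW_SCALE;
			*size += row[i] == 0x00 ? SHADOW_SCALE : row[i];
		} else if (*size != 0) {
			break;	/* first poisoned byte after the live object */
		}
	}
}
```

Under these assumptions the write lands in the last redzone byte (0xfc) directly in front of a live 13-byte object at ffff8880a4513ac0, i.e. one byte before that object's start, which matches the "one element before the start of the buffer" explanation in the patch.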


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches



* [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15  2:24 KASAN: slab-out-of-bounds Write in mpol_parse_str syzbot
@ 2020-01-15  5:54 ` Dan Carpenter
  2020-01-15 12:54   ` Vlastimil Babka
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2020-01-15  5:54 UTC (permalink / raw)
  To: Andrew Morton, Lee Schermerhorn
  Cc: linux-mm, linux-kernel, syzbot, aarcange, hughd, mhocko,
	syzkaller-bugs, vbabka, viro, yang.shi

What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='.  The
problem is there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.  We end up putting
the '=' in the wrong place (possibly one element before the start of
the buffer).

Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 mm/mempolicy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 067cf7d3daf5..1340c5c496b5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2817,6 +2817,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	char *flags = strchr(str, '=');
 	int err = 1, mode;
 
+	if (flags)
+		*flags++ = '\0';	/* terminate mode string */
+
 	if (nodelist) {
 		/* NUL-terminate mode or flags string */
 		*nodelist++ = '\0';
@@ -2827,9 +2830,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	} else
 		nodes_clear(nodes);
 
-	if (flags)
-		*flags++ = '\0';	/* terminate mode string */
-
 	mode = match_string(policy_modes, MPOL_MAX, str);
 	if (mode < 0)
 		goto out;
-- 
2.11.0
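
To make the failure mode in the patch description concrete, here is a minimal userspace sketch of the split-and-restore pattern, assuming the "mode[=flags][:nodelist]" string that mpol_parse_str() parses in place. parse_policy() and its rule that an empty nodelist is an error are hypothetical stand-ins for the kernel's nodelist_parse()/match_string() logic; only the ordering of the NUL-termination and restore statements mirrors the patched function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical model of mpol_parse_str()'s in-place parsing.
 * Returns 0 on success (mode copied to 'mode'), 1 on error.
 * The caller's string is always restored before returning.
 */
static int parse_policy(char *str, char *mode, size_t mode_len)
{
	char *nodelist = strchr(str, ':');
	char *flags = strchr(str, '=');
	int err = 1;

	assert(str != NULL && mode != NULL && mode_len > 0);

	/* As in the patch: step 'flags' past '=' before any error exit. */
	if (flags)
		*flags++ = '\0';	/* terminate mode string */

	if (nodelist) {
		*nodelist++ = '\0';	/* NUL-terminate mode or flags string */
		if (*nodelist == '\0')	/* hypothetical stand-in for nodelist_parse() failing */
			goto out;
	}

	snprintf(mode, mode_len, "%s", str);
	err = 0;
out:
	/*
	 * Both pointers now sit one past their separator, so "*--p" writes
	 * back exactly where ':' and '=' were. With the pre-patch ordering,
	 * an early "goto out" left 'flags' on the '=' itself, and "*--flags"
	 * wrote one byte before it.
	 */
	if (nodelist)
		*--nodelist = ':';
	if (flags)
		*--flags = '=';
	return err;
}
```

With an input such as "=static:", the '=' is the first byte of the buffer. Under the pre-patch ordering, the empty-nodelist error path would have jumped to the restore with `flags` still pointing at `str[0]`, so `*--flags = '='` would have written to `str[-1]`, the same kind of one-byte write before an object that KASAN reported.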




* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15  5:54 ` [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str() Dan Carpenter
@ 2020-01-15 12:54   ` Vlastimil Babka
  2020-01-15 12:57     ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Vlastimil Babka @ 2020-01-15 12:54 UTC (permalink / raw)
  To: Dan Carpenter, Andrew Morton, Lee Schermerhorn
  Cc: linux-mm, linux-kernel, syzbot, aarcange, hughd, mhocko,
	syzkaller-bugs, viro, yang.shi

On 1/15/20 6:54 AM, Dan Carpenter wrote:
> What we are trying to do is change the '=' character to a NUL terminator
> and then at the end of the function we restore it back to an '='.  The
> problem is there are two error paths where we jump to the end of the
> function before we have replaced the '=' with NUL.  We end up putting
> the '=' in the wrong place (possibly one element before the start of
> the buffer).

Bleh.

> Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
part of unprivileged operation in some scenarios?

> ---
>  mm/mempolicy.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 067cf7d3daf5..1340c5c496b5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2817,6 +2817,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  	char *flags = strchr(str, '=');
>  	int err = 1, mode;
>  
> +	if (flags)
> +		*flags++ = '\0';	/* terminate mode string */
> +
>  	if (nodelist) {
>  		/* NUL-terminate mode or flags string */
>  		*nodelist++ = '\0';
> @@ -2827,9 +2830,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  	} else
>  		nodes_clear(nodes);
>  
> -	if (flags)
> -		*flags++ = '\0';	/* terminate mode string */
> -
>  	mode = match_string(policy_modes, MPOL_MAX, str);
>  	if (mode < 0)
>  		goto out;
> 




* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15 12:54   ` Vlastimil Babka
@ 2020-01-15 12:57     ` Dmitry Vyukov
  2020-01-15 15:03       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2020-01-15 12:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dan Carpenter, Andrew Morton, Lee Schermerhorn, Linux-MM, LKML,
	syzbot, Andrea Arcangeli, Hugh Dickins, Michal Hocko,
	syzkaller-bugs, Al Viro, yang.shi

On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > What we are trying to do is change the '=' character to a NUL terminator
> > and then at the end of the function we restore it back to an '='.  The
> > problem is there are two error paths where we jump to the end of the
> > function before we have replaced the '=' with NUL.  We end up putting
> > the '=' in the wrong place (possibly one element before the start of
> > the buffer).
>
> Bleh.
>
> > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> part of unprivileged operation in some scenarios?

Yes, tmpfs can be mounted by any user inside of a user namespace.
Also I suspect there are cases where an unprivileged attacker can
trick some utility to mount tmpfs on their behalf and provide their
own mount options.

> > ---
> >  mm/mempolicy.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 067cf7d3daf5..1340c5c496b5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2817,6 +2817,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> >       char *flags = strchr(str, '=');
> >       int err = 1, mode;
> >
> > +     if (flags)
> > +             *flags++ = '\0';        /* terminate mode string */
> > +
> >       if (nodelist) {
> >               /* NUL-terminate mode or flags string */
> >               *nodelist++ = '\0';
> > @@ -2827,9 +2830,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> >       } else
> >               nodes_clear(nodes);
> >
> > -     if (flags)
> > -             *flags++ = '\0';        /* terminate mode string */
> > -
> >       mode = match_string(policy_modes, MPOL_MAX, str);
> >       if (mode < 0)
> >               goto out;
> >
>



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15 12:57     ` Dmitry Vyukov
@ 2020-01-15 15:03       ` Michal Hocko
  2020-01-15 15:14         ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-15 15:03 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Wed 15-01-20 13:57:47, Dmitry Vyukov wrote:
> On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > What we are trying to do is change the '=' character to a NUL terminator
> > > and then at the end of the function we restore it back to an '='.  The
> > > problem is there are two error paths where we jump to the end of the
> > > function before we have replaced the '=' with NUL.  We end up putting
> > > the '=' in the wrong place (possibly one element before the start of
> > > the buffer).
> >
> > Bleh.
> >
> > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > part of unprivileged operation in some scenarios?
> 
> Yes, tmpfs can be mounted by any user inside of a user namespace.

Huh, is there any restriction though? It is certainly not nice to have
an arbitrary memory allocated without a way of reclaiming it and OOM
killer wouldn't help for shmem.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15 15:03       ` Michal Hocko
@ 2020-01-15 15:14         ` Dmitry Vyukov
  2020-01-15 19:05           ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2020-01-15 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Wed, Jan 15, 2020 at 4:03 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 15-01-20 13:57:47, Dmitry Vyukov wrote:
> > On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > > What we are trying to do is change the '=' character to a NUL terminator
> > > > and then at the end of the function we restore it back to an '='.  The
> > > > problem is there are two error paths where we jump to the end of the
> > > > function before we have replaced the '=' with NUL.  We end up putting
> > > > the '=' in the wrong place (possibly one element before the start of
> > > > the buffer).
> > >
> > > Bleh.
> > >
> > > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> > >
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > >
> > > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > > part of unprivileged operation in some scenarios?
> >
> > Yes, tmpfs can be mounted by any user inside of a user namespace.
>
> Huh, is there any restriction though? It is certainly not nice to have
> an arbitrary memory allocated without a way of reclaiming it and OOM
> killer wouldn't help for shmem.

The last time I checked there were hundreds of ways to allocate
arbitrary amounts of memory without any restrictions by any user. The
example at hand was setting up GB-sized netfilter tables in netns
under userns. It's not subject to ulimit/memcg. Most kmalloc/vmalloc's
are not accounted and can be abused. Is tmpfs even worse than these?



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15 15:14         ` Dmitry Vyukov
@ 2020-01-15 19:05           ` Michal Hocko
  2020-01-16  5:41             ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-15 19:05 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Wed 15-01-20 16:14:43, Dmitry Vyukov wrote:
> On Wed, Jan 15, 2020 at 4:03 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 15-01-20 13:57:47, Dmitry Vyukov wrote:
> > > On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >
> > > > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > > > What we are trying to do is change the '=' character to a NUL terminator
> > > > > and then at the end of the function we restore it back to an '='.  The
> > > > > problem is there are two error paths where we jump to the end of the
> > > > > function before we have replaced the '=' with NUL.  We end up putting
> > > > > the '=' in the wrong place (possibly one element before the start of
> > > > > the buffer).
> > > >
> > > > Bleh.
> > > >
> > > > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> > > >
> > > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > >
> > > > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > > > part of unprivileged operation in some scenarios?
> > >
> > > Yes, tmpfs can be mounted by any user inside of a user namespace.
> >
> > Huh, is there any restriction though? It is certainly not nice to have
> > an arbitrary memory allocated without a way of reclaiming it and OOM
> > killer wouldn't help for shmem.
> 
> The last time I checked there were hundreds of ways to allocate
> arbitrary amounts of memory without any restrictions by any user. The
> example at hand was setting up GB-sized netfilter tables in netns
> under userns. It's not subject to ulimit/memcg.

That's bad!

> Most kmalloc/vmalloc's are not accounted and can be abused.

Many of those should be bound to some objects and if those are directly
controllable by userspace then we should account at least. And if they
are not bound to a process life time then restricted.

> Is tmpfs even worse than these?

Well, tmpfs is accounted and restricted by memcg at least. The problem
is that the memory is not really bound to a process life time which
makes it effectively unreclaimable once the swap space is depleted.
Still bad.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-15 19:05           ` Michal Hocko
@ 2020-01-16  5:41             ` Dmitry Vyukov
  2020-01-16  7:39               ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2020-01-16  5:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Wed, Jan 15, 2020 at 8:05 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >
> > > > > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > > > > What we are trying to do is change the '=' character to a NUL terminator
> > > > > > and then at the end of the function we restore it back to an '='.  The
> > > > > > problem is there are two error paths where we jump to the end of the
> > > > > > function before we have replaced the '=' with NUL.  We end up putting
> > > > > > the '=' in the wrong place (possibly one element before the start of
> > > > > > the buffer).
> > > > >
> > > > > Bleh.
> > > > >
> > > > > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > > > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > > > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> > > > >
> > > > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > > >
> > > > > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > > > > part of unprivileged operation in some scenarios?
> > > >
> > > > Yes, tmpfs can be mounted by any user inside of a user namespace.
> > >
> > > Huh, is there any restriction though? It is certainly not nice to have
> > > an arbitrary memory allocated without a way of reclaiming it and OOM
> > > killer wouldn't help for shmem.
> >
> > The last time I checked there were hundreds of ways to allocate
> > arbitrary amounts of memory without any restrictions by any user. The
> > example at hand was setting up GB-sized netfilter tables in netns
> > under userns. It's not subject to ulimit/memcg.
>
> That's bad!
>
> > Most kmalloc/vmalloc's are not accounted and can be abused.
>
> Many of those should be bound to some objects and if those are directly
> controllable by userspace then we should account at least. And if they
> are not bound to a process life time then restricted.

I see you actually added one GFP_ACCOUNT in netfilter in "netfilter:
x_tables: do not fail xt_alloc_table_info too easilly". But it seems
there are more:

$ grep vmalloc\( net/netfilter/*.c
net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
net/netfilter/x_tables.c: mem = vmalloc(len);
net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));

These are not bound to processes/threads as namespaces are orthogonal to tasks.

Somebody told me that it's not good to use GFP_ACCOUNT if the
allocation is not tied to the lifetime of the process. Is it still
true?

In the end if user controls either size or number allocations, they
should be accounted, and it seems we still have thousands of
unaccounted ones. There are dozens of kmalloc's in netfilter code and
none of them use GFP_ACCOUNT...



> > Is tmpfs even worse than these?
>
> Well, tmpfs is accounted and restricted by memcg at least. The problem
> is that the memory is not really bound to a process life time which
> makes it effectively unreclaimable once the swap space is depleted.
> Still bad.

I see. If I understand it correctly, this one is actually better than
all these non-GFP_ACCOUNT allocations.
If I were to DoS a box (intentionally or unintentionally, just a bug in
my program) I would probably go for one of these easier ones without
GFP_ACCOUNT.



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-16  5:41             ` Dmitry Vyukov
@ 2020-01-16  7:39               ` Michal Hocko
  2020-01-16 10:13                 ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-16  7:39 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Thu 16-01-20 06:41:46, Dmitry Vyukov wrote:
> On Wed, Jan 15, 2020 at 8:05 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Wed, Jan 15, 2020 at 1:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > >
> > > > > > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > > > > > What we are trying to do is change the '=' character to a NUL terminator
> > > > > > > and then at the end of the function we restore it back to an '='.  The
> > > > > > > problem is there are two error paths where we jump to the end of the
> > > > > > > function before we have replaced the '=' with NUL.  We end up putting
> > > > > > > the '=' in the wrong place (possibly one element before the start of
> > > > > > > the buffer).
> > > > > >
> > > > > > Bleh.
> > > > > >
> > > > > > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > > > > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > > > > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> > > > > >
> > > > > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > >
> > > > > > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > > > > > part of unprivileged operation in some scenarios?
> > > > >
> > > > > Yes, tmpfs can be mounted by any user inside of a user namespace.
> > > >
> > > > Huh, is there any restriction though? It is certainly not nice to have
> > > > an arbitrary memory allocated without a way of reclaiming it and OOM
> > > > killer wouldn't help for shmem.
> > >
> > > The last time I checked there were hundreds of ways to allocate
> > > arbitrary amounts of memory without any restrictions by any user. The
> > > example at hand was setting up GB-sized netfilter tables in netns
> > > under userns. It's not subject to ulimit/memcg.
> >
> > That's bad!
> >
> > > Most kmalloc/vmalloc's are not accounted and can be abused.
> >
> > Many of those should be bound to some objects and if those are directly
> > controllable by userspace then we should account at least. And if they
> > are not bound to a process life time then restricted.
> 
> I see you actually added one GFP_ACCOUNT in netfilter in "netfilter:
> x_tables: do not fail xt_alloc_table_info too easilly". But it seems
> there are more:
> 
> $ grep vmalloc\( net/netfilter/*.c
> net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
> net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
> net/netfilter/x_tables.c: mem = vmalloc(len);
> net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
> net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));
> 
> These are not bound to processes/threads as namespaces are orthogonal to tasks.

I cannot really comment on those. This is for networking people to
examine and find out whether they allow an untrusted user to runaway.

> Somebody told me that it's not good to use GFP_ACCOUNT if the
> allocation is not tied to the lifetime of the process. Is it still
> true?

Those are more tricky. Mostly because there is no way to reclaim the
memory once the hard limit is hit. Even the memcg oom killer will not
help much. So care should be taken when adding GFP_ACCOUNT for those.
On the other hand it would prevent unbounded allocations at least
so the DoS would be reduced to the hard limited memcg.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-16  7:39               ` Michal Hocko
@ 2020-01-16 10:13                 ` Dmitry Vyukov
  2020-01-16 11:51                   ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2020-01-16 10:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Thu, Jan 16, 2020 at 8:39 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > On 1/15/20 6:54 AM, Dan Carpenter wrote:
> > > > > > > > What we are trying to do is change the '=' character to a NUL terminator
> > > > > > > > and then at the end of the function we restore it back to an '='.  The
> > > > > > > > problem is there are two error paths where we jump to the end of the
> > > > > > > > function before we have replaced the '=' with NUL.  We end up putting
> > > > > > > > the '=' in the wrong place (possibly one element before the start of
> > > > > > > > the buffer).
> > > > > > >
> > > > > > > Bleh.
> > > > > > >
> > > > > > > > Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
> > > > > > > > Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
> > > > > > > > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> > > > > > >
> > > > > > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > > >
> > > > > > > CC stable perhaps? Can this (tmpfs mount options parsing AFAICS?) become
> > > > > > > part of unprivileged operation in some scenarios?
> > > > > >
> > > > > > Yes, tmpfs can be mounted by any user inside of a user namespace.
> > > > >
> > > > > Huh, is there any restriction though? It is certainly not nice to have
> > > > > an arbitrary memory allocated without a way of reclaiming it and OOM
> > > > > killer wouldn't help for shmem.
> > > >
> > > > The last time I checked there were hundreds of ways to allocate
> > > > arbitrary amounts of memory without any restrictions by any user. The
> > > > example at hand was setting up GB-sized netfilter tables in netns
> > > > under userns. It's not subject to ulimit/memcg.
> > >
> > > That's bad!
> > >
> > > > Most kmalloc/vmalloc's are not accounted and can be abused.
> > >
> > > Many of those should be bound to some objects, and if those are directly
> > > controllable by userspace then we should at least account them. And if
> > > they are not bound to a process lifetime, they should be restricted.
> >
> > I see you actually added one __GFP_ACCOUNT in netfilter in "netfilter:
> > x_tables: do not fail xt_alloc_table_info too easilly". But it seems
> > there are more:
> >
> > $ grep vmalloc\( net/netfilter/*.c
> > net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
> > net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
> > net/netfilter/x_tables.c: mem = vmalloc(len);
> > net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> > net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
> > net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));
> >
> > These are not bound to processes/threads as namespaces are orthogonal to tasks.
>
> I cannot really comment on those. This is for networking people to
> examine and find out whether they let an untrusted user's allocations
> run away.

Unless I am missing an elephant in this whole picture, kernel code
contains 20K+ unaccounted allocations and, if I am not mistaken, few of
them were audited and are intentionally unaccounted rather than
unaccounted just because it's the default. So if we want DoS
protection, it's really up to every kernel developer/maintainer to audit
and fix these allocation sites. And since we have a monolithic kernel, a
single unaccounted allocation may compromise the whole kernel. I
assume we would need something like GFP_UNACCOUNTED to mark audited
allocations that don't need accounting and then slowly reduce the number
of allocations marked neither ACCOUNTED nor UNACCOUNTED.


> > Somebody told me that it's not good to use __GFP_ACCOUNT if the
> > allocation is not tied to the lifetime of the process. Is it still
> > true?
>
> Those are more tricky, mostly because there is no way to reclaim the
> memory once the hard limit is hit. Even the memcg OOM killer will not
> help much. So care should be taken when adding __GFP_ACCOUNT for those.
> On the other hand, it would at least prevent unbounded allocations,
> so the DoS would be confined to the hard-limited memcg.

What exactly is this care in practice?
It seems that in a148ce15375fc664ad64762c751c0c2aecb2cafe you just
added it, and the allocation is not tied to the process. At least I
don't see any explanation as to why that one is safe while accounting
other similar allocations is not...



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-16 10:13                 ` Dmitry Vyukov
@ 2020-01-16 11:51                   ` Michal Hocko
  2020-01-16 12:41                     ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-16 11:51 UTC
  To: Dmitry Vyukov
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Thu 16-01-20 11:13:09, Dmitry Vyukov wrote:
> On Thu, Jan 16, 2020 at 8:39 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > $ grep vmalloc\( net/netfilter/*.c
> > > net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
> > > net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
> > > net/netfilter/x_tables.c: mem = vmalloc(len);
> > > net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> > > net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
> > > net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));
> > >
> > > These are not bound to processes/threads as namespaces are orthogonal to tasks.
> >
> > I cannot really comment on those. This is for networking people to
> > examine and find out whether they let an untrusted user's allocations
> > run away.
> 
> Unless I am missing an elephant in this whole picture, kernel code
> contains 20K+ unaccounted allocations and, if I am not mistaken, few of
> them were audited and are intentionally unaccounted rather than
> unaccounted just because it's the default. So if we want DoS
> protection, it's really up to every kernel developer/maintainer to audit
> and fix these allocation sites. And since we have a monolithic kernel, a
> single unaccounted allocation may compromise the whole kernel. I
> assume we would need something like GFP_UNACCOUNTED to mark audited
> allocations that don't need accounting and then slowly reduce the number
> of allocations marked neither ACCOUNTED nor UNACCOUNTED.

This is the original approach, which led to all sorts of problems, and so
we switched from opt-out to opt-in. Have a look at a9bb7e620efd ("memcg:
only account kmem allocations marked as __GFP_ACCOUNT").
Our protection will never be perfect because that would require
designing the system with the protection in mind.
 
> > > Somebody told me that it's not good to use __GFP_ACCOUNT if the
> > > allocation is not tied to the lifetime of the process. Is it still
> > > true?
> >
> > Those are more tricky, mostly because there is no way to reclaim the
> > memory once the hard limit is hit. Even the memcg OOM killer will not
> > help much. So care should be taken when adding __GFP_ACCOUNT for those.
> > On the other hand, it would at least prevent unbounded allocations,
> > so the DoS would be confined to the hard-limited memcg.
> 
> What exactly is this care in practice?
> It seems that in a148ce15375fc664ad64762c751c0c2aecb2cafe you just
> added it, and the allocation is not tied to the process. At least I
> don't see any explanation as to why that one is safe while accounting
> other similar allocations is not...

My memory is dim, but AFAIR the memcg accounting was a compromise between
usability and whole-system stability. Really large tables could be
allocated by untrusted users and that was seen as a _real_ problem. The
previous solution added _some_ protection, which led to regressions
even for reasonable cases though. Memcg accounting was deemed a
reasonable middle ground.

The result is that a completely depleted memcg requires admin
intervention, and the admin has to know what to do to tear it down.
The kernel cannot do anything about that. And that is the trickiness I've
had in mind. Listing page tables is something admins can do quite
easily, right? There are many other objects which are much harder to act
upon. E.g. what are you going to do with tmpfs mounts? Are you going to
remove them and cause potential data loss? That being said, some objects
really have to be limited even before they start consuming memory IMHO.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-16 11:51                   ` Michal Hocko
@ 2020-01-16 12:41                     ` Dmitry Vyukov
  2020-01-16 14:05                       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2020-01-16 12:41 UTC
  To: Michal Hocko
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Thu, Jan 16, 2020 at 12:51 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 16-01-20 11:13:09, Dmitry Vyukov wrote:
> > On Thu, Jan 16, 2020 at 8:39 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > > $ grep vmalloc\( net/netfilter/*.c
> > > > net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
> > > > net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
> > > > net/netfilter/x_tables.c: mem = vmalloc(len);
> > > > net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> > > > net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
> > > > net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));
> > > >
> > > > These are not bound to processes/threads as namespaces are orthogonal to tasks.
> > >
> > > I cannot really comment on those. This is for networking people to
> > > examine and find out whether they let an untrusted user's allocations
> > > run away.
> >
> > Unless I am missing an elephant in this whole picture, kernel code
> > contains 20K+ unaccounted allocations and, if I am not mistaken, few of
> > them were audited and are intentionally unaccounted rather than
> > unaccounted just because it's the default. So if we want DoS
> > protection, it's really up to every kernel developer/maintainer to audit
> > and fix these allocation sites. And since we have a monolithic kernel, a
> > single unaccounted allocation may compromise the whole kernel. I
> > assume we would need something like GFP_UNACCOUNTED to mark audited
> > allocations that don't need accounting and then slowly reduce the number
> > of allocations marked neither ACCOUNTED nor UNACCOUNTED.
>
> This is the original approach, which led to all sorts of problems, and so
> we switched from opt-out to opt-in. Have a look at a9bb7e620efd ("memcg:
> only account kmem allocations marked as __GFP_ACCOUNT").
> Our protection will never be perfect because that would require
> designing the system with the protection in mind.

I don't mean to switch the default. I mean adding a way to distinguish
between reviewed, intentionally unaccounted allocations and
unreviewed allocations that are unaccounted just because that's the
default. This would allow us to progress incrementally rather than redo
the same work again and again.

> > > > Somebody told me that it's not good to use __GFP_ACCOUNT if the
> > > > allocation is not tied to the lifetime of the process. Is it still
> > > > true?
> > >
> > > Those are more tricky, mostly because there is no way to reclaim the
> > > memory once the hard limit is hit. Even the memcg OOM killer will not
> > > help much. So care should be taken when adding __GFP_ACCOUNT for those.
> > > On the other hand, it would at least prevent unbounded allocations,
> > > so the DoS would be confined to the hard-limited memcg.
> >
> > What exactly is this care in practice?
> > It seems that in a148ce15375fc664ad64762c751c0c2aecb2cafe you just
> > added it, and the allocation is not tied to the process. At least I
> > don't see any explanation as to why that one is safe while accounting
> > other similar allocations is not...
>
> My memory is dim, but AFAIR the memcg accounting was a compromise between
> usability and whole-system stability. Really large tables could be
> allocated by untrusted users and that was seen as a _real_ problem. The
> previous solution added _some_ protection, which led to regressions
> even for reasonable cases though. Memcg accounting was deemed a
> reasonable middle ground.
>
> The result is that a completely depleted memcg requires admin
> intervention, and the admin has to know what to do to tear it down.
> The kernel cannot do anything about that. And that is the trickiness I've
> had in mind. Listing page tables is something admins can do quite
> easily, right? There are many other objects which are much harder to act
> upon. E.g. what are you going to do with tmpfs mounts? Are you going to
> remove them and cause potential data loss? That being said, some objects
> really have to be limited even before they start consuming memory IMHO.

Interesting. But there is really no admin today, or at least nothing
should rely on one in any way. Either because of the scale (you have
thousands/millions of machines and spending human time on each of them
individually is not going to fly) and/or because there is nobody
qualified enough around (e.g. who is the admin of a median Android
phone, and what do they know about tearing down namespaces and
mounts?) or there is nobody interested enough (it's fun sometimes, but
not always)... but I agree that retrofitting this level of resource
control into a large existing complex system is close to impossible.

Thanks a lot for bearing with me and answering my questions. I have a
better understanding of this now.



* Re: [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str()
  2020-01-16 12:41                     ` Dmitry Vyukov
@ 2020-01-16 14:05                       ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2020-01-16 14:05 UTC
  To: Dmitry Vyukov
  Cc: Vlastimil Babka, Dan Carpenter, Andrew Morton, Lee Schermerhorn,
	Linux-MM, LKML, syzbot, Andrea Arcangeli, Hugh Dickins,
	syzkaller-bugs, Al Viro, yang.shi

On Thu 16-01-20 13:41:39, Dmitry Vyukov wrote:
> On Thu, Jan 16, 2020 at 12:51 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 16-01-20 11:13:09, Dmitry Vyukov wrote:
> > > On Thu, Jan 16, 2020 at 8:39 AM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > > > > $ grep vmalloc\( net/netfilter/*.c
> > > > > net/netfilter/nf_tables_api.c: return kvmalloc(alloc, GFP_KERNEL);
> > > > > net/netfilter/x_tables.c: xt[af].compat_tab = vmalloc(mem);
> > > > > net/netfilter/x_tables.c: mem = vmalloc(len);
> > > > > net/netfilter/x_tables.c: info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> > > > > net/netfilter/xt_hashlimit.c: /* FIXME: don't use vmalloc() here or anywhere else -HW */
> > > > > net/netfilter/xt_hashlimit.c: hinfo = vmalloc(struct_size(hinfo, hash, size));
> > > > >
> > > > > These are not bound to processes/threads as namespaces are orthogonal to tasks.
> > > >
> > > > I cannot really comment on those. This is for networking people to
> > > > examine and find out whether they let an untrusted user's allocations
> > > > run away.
> > >
> > > Unless I am missing an elephant in this whole picture, kernel code
> > > contains 20K+ unaccounted allocations and, if I am not mistaken, few of
> > > them were audited and are intentionally unaccounted rather than
> > > unaccounted just because it's the default. So if we want DoS
> > > protection, it's really up to every kernel developer/maintainer to audit
> > > and fix these allocation sites. And since we have a monolithic kernel, a
> > > single unaccounted allocation may compromise the whole kernel. I
> > > assume we would need something like GFP_UNACCOUNTED to mark audited
> > > allocations that don't need accounting and then slowly reduce the number
> > > of allocations marked neither ACCOUNTED nor UNACCOUNTED.
> >
> > This is the original approach, which led to all sorts of problems, and so
> > we switched from opt-out to opt-in. Have a look at a9bb7e620efd ("memcg:
> > only account kmem allocations marked as __GFP_ACCOUNT").
> > Our protection will never be perfect because that would require
> > designing the system with the protection in mind.
> 
> I don't mean to switch the default. I mean adding a way to distinguish
> between reviewed, intentionally unaccounted allocations and
> unreviewed allocations that are unaccounted just because that's the
> default. This would allow us to progress incrementally rather than redo
> the same work again and again.

I am not really sure this would be viable just because of the sheer
number of allocations we have in the kernel. But I would be more than
happy to be proven wrong ;)

> > > > > Somebody told me that it's not good to use __GFP_ACCOUNT if the
> > > > > allocation is not tied to the lifetime of the process. Is it still
> > > > > true?
> > > >
> > > > Those are more tricky, mostly because there is no way to reclaim the
> > > > memory once the hard limit is hit. Even the memcg OOM killer will not
> > > > help much. So care should be taken when adding __GFP_ACCOUNT for those.
> > > > On the other hand, it would at least prevent unbounded allocations,
> > > > so the DoS would be confined to the hard-limited memcg.
> > >
> > > What exactly is this care in practice?
> > > It seems that in a148ce15375fc664ad64762c751c0c2aecb2cafe you just
> > > added it, and the allocation is not tied to the process. At least I
> > > don't see any explanation as to why that one is safe while accounting
> > > other similar allocations is not...
> >
> > My memory is dim, but AFAIR the memcg accounting was a compromise between
> > usability and whole-system stability. Really large tables could be
> > allocated by untrusted users and that was seen as a _real_ problem. The
> > previous solution added _some_ protection, which led to regressions
> > even for reasonable cases though. Memcg accounting was deemed a
> > reasonable middle ground.
> >
> > The result is that a completely depleted memcg requires admin
> > intervention, and the admin has to know what to do to tear it down.
> > The kernel cannot do anything about that. And that is the trickiness I've
> > had in mind. Listing page tables is something admins can do quite
> > easily, right? There are many other objects which are much harder to act
> > upon. E.g. what are you going to do with tmpfs mounts? Are you going to
> > remove them and cause potential data loss? That being said, some objects
> > really have to be limited even before they start consuming memory IMHO.
> 
> Interesting. But there is really no admin today, or at least nothing
> should rely on one in any way. Either because of the scale (you have
> thousands/millions of machines and spending human time on each of them
> individually is not going to fly) and/or because there is nobody
> qualified enough around (e.g. who is the admin of a median Android
> phone, and what do they know about tearing down namespaces and
> mounts?) or there is nobody interested enough (it's fun sometimes, but
> not always)...

If there is nobody in control, then at least a reasonably safe policy
should be defined. This is what I mentioned earlier when saying that
not all objects can be easily accounted. There has to be a strategy
defined for corner cases. And user namespaces really seem to beg for
that when you are allowed to control resources such as tmpfs and
similar ones.

> but I agree that retrofitting this level of resource
> control into a large existing complex system is close to impossible.
> 
> Thanks a lot for bearing with me and answering my questions. I have a
> better understanding of this now.

I am glad I could help.

Thanks for giving some scary examples ;)
-- 
Michal Hocko
SUSE Labs



end of thread

Thread overview: 13+ messages
2020-01-15  2:24 KASAN: slab-out-of-bounds Write in mpol_parse_str syzbot
2020-01-15  5:54 ` [PATCH] mm/mempolicy.c: Fix out of bounds write in mpol_parse_str() Dan Carpenter
2020-01-15 12:54   ` Vlastimil Babka
2020-01-15 12:57     ` Dmitry Vyukov
2020-01-15 15:03       ` Michal Hocko
2020-01-15 15:14         ` Dmitry Vyukov
2020-01-15 19:05           ` Michal Hocko
2020-01-16  5:41             ` Dmitry Vyukov
2020-01-16  7:39               ` Michal Hocko
2020-01-16 10:13                 ` Dmitry Vyukov
2020-01-16 11:51                   ` Michal Hocko
2020-01-16 12:41                     ` Dmitry Vyukov
2020-01-16 14:05                       ` Michal Hocko
