memcg, kmem: do not fail __GFP_NOFAIL charges
diff mbox series

Message ID 20190906125608.32129-1-mhocko@kernel.org
State Accepted
Commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8
Headers show
Series
  • memcg, kmem: do not fail __GFP_NOFAIL charges
Related show

Commit Message

Michal Hocko Sept. 6, 2019, 12:56 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
 create_page_buffers+0x4d/0x60
 __block_write_begin_int+0x8e/0x5a0
 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
 ? jbd2__journal_start+0xd7/0x1f0
 ext4_da_write_begin+0x112/0x3d0
 generic_perform_write+0xf1/0x1b0
 ? file_update_time+0x70/0x140
 __generic_file_write_iter+0x141/0x1a0
 ext4_file_write_iter+0xef/0x3b0
 __vfs_write+0x17e/0x1e0
 vfs_write+0xa5/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x55/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

 Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
 fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
 wrong behavior because nofail allocations are not allowed to fail.
 Normal charge path simply forces the charge even if that means to cross
 the limit. Kmem accounting should be doing the same.

Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Shakeel Butt Sept. 6, 2019, 6:24 p.m. UTC | #1
On Fri, Sep 6, 2019 at 5:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> Thomas has noticed the following NULL ptr dereference when using cgroup
> v1 kmem limit:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> PGD 0
> P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> RIP: 0010:create_empty_buffers+0x24/0x100
> Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> Call Trace:
>  create_page_buffers+0x4d/0x60
>  __block_write_begin_int+0x8e/0x5a0
>  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
>  ? jbd2__journal_start+0xd7/0x1f0
>  ext4_da_write_begin+0x112/0x3d0
>  generic_perform_write+0xf1/0x1b0
>  ? file_update_time+0x70/0x140
>  __generic_file_write_iter+0x141/0x1a0
>  ext4_file_write_iter+0xef/0x3b0
>  __vfs_write+0x17e/0x1e0
>  vfs_write+0xa5/0x1a0
>  ksys_write+0x57/0xd0
>  do_syscall_64+0x55/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
>  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
>  wrong behavior because nofail allocations are not allowed to fail.
>  Normal charge path simply forces the charge even if that means to cross
>  the limit. Kmem accounting should be doing the same.
>
> Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I wonder what has changed since
<http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.

> ---
>  mm/memcontrol.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9ec5e12486a7..e18108b2b786 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>             !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> +
> +               /*
> +                * Enforce __GFP_NOFAIL allocation because callers are not
> +                * prepared to see failures and likely do not have any failure
> +                * handling code.
> +                */
> +               if (gfp & __GFP_NOFAIL) {
> +                       page_counter_charge(&memcg->kmem, nr_pages);
> +                       return 0;
> +               }
>                 cancel_charge(memcg, nr_pages);
>                 return -ENOMEM;
>         }
> --
> 2.20.1
>
Michal Hocko Sept. 9, 2019, 11:22 a.m. UTC | #2
On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> On Fri, Sep 6, 2019 at 5:56 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Thomas has noticed the following NULL ptr dereference when using cgroup
> > v1 kmem limit:
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> > PGD 0
> > P4D 0
> > Oops: 0000 [#1] PREEMPT SMP PTI
> > CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> > Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> > RIP: 0010:create_empty_buffers+0x24/0x100
> > Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> > RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> > RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> > RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> > RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> > R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> > R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> > FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> > Call Trace:
> >  create_page_buffers+0x4d/0x60
> >  __block_write_begin_int+0x8e/0x5a0
> >  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
> >  ? jbd2__journal_start+0xd7/0x1f0
> >  ext4_da_write_begin+0x112/0x3d0
> >  generic_perform_write+0xf1/0x1b0
> >  ? file_update_time+0x70/0x140
> >  __generic_file_write_iter+0x141/0x1a0
> >  ext4_file_write_iter+0xef/0x3b0
> >  __vfs_write+0x17e/0x1e0
> >  vfs_write+0xa5/0x1a0
> >  ksys_write+0x57/0xd0
> >  do_syscall_64+0x55/0x160
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> >  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
> >  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
> >  wrong behavior because nofail allocations are not allowed to fail.
> >  Normal charge path simply forces the charge even if that means to cross
> >  the limit. Kmem accounting should be doing the same.
> >
> > Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> > Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> > Cc: stable
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> I wonder what has changed since
> <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.

I have completely forgot about that one. It seems that we have just
repeated the same discussion again. This time we have a poor user who
actually enabled the kmem limit.

I guess there was no real objection to the change back then. The primary
discussion revolved around the fact that the accounting will stay broken
even when this particular part was fixed. Considering this leads to easy
to trigger crash (with the limit enabled) then I guess we should just
make it less broken and backport to stable trees and have a serious
discussion about discontinuing of the limit. Start by simply failing to
set any limit in the current upstream kernels.
Michal Hocko Sept. 11, 2019, noon UTC | #3
On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
[...]
> > I wonder what has changed since
> > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> 
> I have completely forgot about that one. It seems that we have just
> repeated the same discussion again. This time we have a poor user who
> actually enabled the kmem limit.
> 
> I guess there was no real objection to the change back then. The primary
> discussion revolved around the fact that the accounting will stay broken
> even when this particular part was fixed. Considering this leads to easy
> to trigger crash (with the limit enabled) then I guess we should just
> make it less broken and backport to stable trees and have a serious
> discussion about discontinuing of the limit. Start by simply failing to
> set any limit in the current upstream kernels.

Any more concerns/objections to the patch? I can add a reference to your
earlier post Shakeel if you want or to credit you the way you prefer.

Also are there any objections to start deprecating process of kmem
limit? I would see it in two stages
- 1st warn in the kernel log
	pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
	        "Please report your usecase to linux-mm@kvack.org if you "
		"depend on this functionality."
- 2nd fail any write to kmem.limit_in_bytes
- 3rd remove the control file completely
Andrew Morton Sept. 11, 2019, 2:37 p.m. UTC | #4
On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> [...]
> > > I wonder what has changed since
> > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > 
> > I have completely forgot about that one. It seems that we have just
> > repeated the same discussion again. This time we have a poor user who
> > actually enabled the kmem limit.
> > 
> > I guess there was no real objection to the change back then. The primary
> > discussion revolved around the fact that the accounting will stay broken
> > even when this particular part was fixed. Considering this leads to easy
> > to trigger crash (with the limit enabled) then I guess we should just
> > make it less broken and backport to stable trees and have a serious
> > discussion about discontinuing of the limit. Start by simply failing to
> > set any limit in the current upstream kernels.
> 
> Any more concerns/objections to the patch? I can add a reference to your
> earlier post Shakeel if you want or to credit you the way you prefer.
> 
> Also are there any objections to start deprecating process of kmem
> limit? I would see it in two stages
> - 1st warn in the kernel log
> 	pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> 	        "Please report your usecase to linux-mm@kvack.org if you "
> 		"depend on this functionality."

pr_warn_once() :)

> - 2nd fail any write to kmem.limit_in_bytes
> - 3rd remove the control file completely

Sounds good to me.
Michal Hocko Sept. 11, 2019, 3:16 p.m. UTC | #5
On Wed 11-09-19 07:37:40, Andrew Morton wrote:
> On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> > [...]
> > > > I wonder what has changed since
> > > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > > 
> > > I have completely forgot about that one. It seems that we have just
> > > repeated the same discussion again. This time we have a poor user who
> > > actually enabled the kmem limit.
> > > 
> > > I guess there was no real objection to the change back then. The primary
> > > discussion revolved around the fact that the accounting will stay broken
> > > even when this particular part was fixed. Considering this leads to easy
> > > to trigger crash (with the limit enabled) then I guess we should just
> > > make it less broken and backport to stable trees and have a serious
> > > discussion about discontinuing of the limit. Start by simply failing to
> > > set any limit in the current upstream kernels.
> > 
> > Any more concerns/objections to the patch? I can add a reference to your
> > earlier post Shakeel if you want or to credit you the way you prefer.
> > 
> > Also are there any objections to start deprecating process of kmem
> > limit? I would see it in two stages
> > - 1st warn in the kernel log
> > 	pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> > 	        "Please report your usecase to linux-mm@kvack.org if you "
> > 		"depend on this functionality."
> 
> pr_warn_once() :)
> 
> > - 2nd fail any write to kmem.limit_in_bytes
> > - 3rd remove the control file completely
> 
> Sounds good to me.

Here we go

From 512822e551fe2960040c23b12c7b27a5fdab9013 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 11 Sep 2019 17:02:33 +0200
Subject: [PATCH] memcg, kmem: deprecate kmem.limit_in_bytes

Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
which turned out to be really a bad idea because there are paths which
cannot shrink the kernel memory usage enough to get below the limit
(e.g. because the accounted memory is not reclaimable). There are cases
when the failure is even not allowed (e.g. __GFP_NOFAIL). This means
that the kmem limit is in excess to the hard limit without any way to
shrink and thus completely useless. OOM killer cannot be invoked to
handle the situation because that would lead to a premature oom killing.

As a result many places might see ENOMEM returning from kmalloc and
result in unexpected errors. E.g. a global OOM killer when there is a
lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF
path and therefore pagefault_out_of_memory would result in OOM killer.

Please note that the kernel memory is still accounted to the overall
limit along with the user memory so removing the kmem specific limit
should still allow to contain kernel memory consumption. Unlike the kmem
one, though, it invokes memory reclaim and targeted memcg oom killing if
necessary.

Start the deprecation process by crying to the kernel log. Let's see
whether there are relevant usecases and simply return to EINVAL in the
second stage if nobody complains in few releases.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 3 +++
 mm/memcontrol.c                                | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 41bdc038dad9..e53fc2f31549 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -87,6 +87,9 @@ Brief summary of control files.
 				     node
 
  memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
+                                     This knob is deprecated it shouldn't be
+                                     used. It is planned to be removed in
+                                     a foreseeable future.
  memory.kmem.usage_in_bytes          show current kernel memory allocation
  memory.kmem.failcnt                 show the number of kernel memory usage
 				     hits limits
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e18108b2b786..113969bc57e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3518,6 +3518,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
 			break;
 		case _KMEM:
+			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+				     "Please report your usecase to linux-mm@kvack.org if you "
+				     "depend on this functionality.\n");
 			ret = memcg_update_kmem_max(memcg, nr_pages);
 			break;
 		case _TCP:
Shakeel Butt Sept. 13, 2019, 2:46 a.m. UTC | #6
On Wed, Sep 11, 2019 at 8:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 11-09-19 07:37:40, Andrew Morton wrote:
> > On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> >
> > > On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > > > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> > > [...]
> > > > > I wonder what has changed since
> > > > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > > >
> > > > I have completely forgot about that one. It seems that we have just
> > > > repeated the same discussion again. This time we have a poor user who
> > > > actually enabled the kmem limit.
> > > >
> > > > I guess there was no real objection to the change back then. The primary
> > > > discussion revolved around the fact that the accounting will stay broken
> > > > even when this particular part was fixed. Considering this leads to easy
> > > > to trigger crash (with the limit enabled) then I guess we should just
> > > > make it less broken and backport to stable trees and have a serious
> > > > discussion about discontinuing of the limit. Start by simply failing to
> > > > set any limit in the current upstream kernels.
> > >
> > > Any more concerns/objections to the patch? I can add a reference to your
> > > earlier post Shakeel if you want or to credit you the way you prefer.
> > >
> > > Also are there any objections to start deprecating process of kmem
> > > limit? I would see it in two stages
> > > - 1st warn in the kernel log
> > >     pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> > >             "Please report your usecase to linux-mm@kvack.org if you "
> > >             "depend on this functionality."
> >
> > pr_warn_once() :)
> >
> > > - 2nd fail any write to kmem.limit_in_bytes
> > > - 3rd remove the control file completely
> >
> > Sounds good to me.
>
> Here we go
>
> From 512822e551fe2960040c23b12c7b27a5fdab9013 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 11 Sep 2019 17:02:33 +0200
> Subject: [PATCH] memcg, kmem: deprecate kmem.limit_in_bytes
>
> Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
> which turned out to be really a bad idea because there are paths which
> cannot shrink the kernel memory usage enough to get below the limit
> (e.g. because the accounted memory is not reclaimable). There are cases
> when the failure is even not allowed (e.g. __GFP_NOFAIL). This means
> that the kmem limit is in excess to the hard limit without any way to
> shrink and thus completely useless. OOM killer cannot be invoked to
> handle the situation because that would lead to a premature oom killing.
>
> As a result many places might see ENOMEM returning from kmalloc and
> result in unexpected errors. E.g. a global OOM killer when there is a
> lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF
> path and therefore pagefault_out_of_memory would result in OOM killer.
>
> Please note that the kernel memory is still accounted to the overall
> limit along with the user memory so removing the kmem specific limit
> should still allow to contain kernel memory consumption. Unlike the kmem
> one, though, it invokes memory reclaim and targeted memcg oom killing if
> necessary.
>
> Start the deprecation process by crying to the kernel log. Let's see
> whether there are relevant usecases and simply return to EINVAL in the
> second stage if nobody complains in few releases.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst | 3 +++
>  mm/memcontrol.c                                | 3 +++
>  2 files changed, 6 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 41bdc038dad9..e53fc2f31549 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -87,6 +87,9 @@ Brief summary of control files.
>                                      node
>
>   memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
> +                                     This knob is deprecated it shouldn't be
> +                                     used. It is planned to be removed in
> +                                     a foreseeable future.
>   memory.kmem.usage_in_bytes          show current kernel memory allocation
>   memory.kmem.failcnt                 show the number of kernel memory usage
>                                      hits limits
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e18108b2b786..113969bc57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3518,6 +3518,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>                         ret = mem_cgroup_resize_max(memcg, nr_pages, true);
>                         break;
>                 case _KMEM:
> +                       pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
> +                                    "Please report your usecase to linux-mm@kvack.org if you "
> +                                    "depend on this functionality.\n");
>                         ret = memcg_update_kmem_max(memcg, nr_pages);
>                         break;
>                 case _TCP:
> --
> 2.20.1
>
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Sept. 24, 2019, 10:53 a.m. UTC | #7
Andrew, do you plan to send this patch to Linus as ell?

On Fri 06-09-19 14:56:08, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Thomas has noticed the following NULL ptr dereference when using cgroup
> v1 kmem limit:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> PGD 0
> P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> RIP: 0010:create_empty_buffers+0x24/0x100
> Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> Call Trace:
>  create_page_buffers+0x4d/0x60
>  __block_write_begin_int+0x8e/0x5a0
>  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
>  ? jbd2__journal_start+0xd7/0x1f0
>  ext4_da_write_begin+0x112/0x3d0
>  generic_perform_write+0xf1/0x1b0
>  ? file_update_time+0x70/0x140
>  __generic_file_write_iter+0x141/0x1a0
>  ext4_file_write_iter+0xef/0x3b0
>  __vfs_write+0x17e/0x1e0
>  vfs_write+0xa5/0x1a0
>  ksys_write+0x57/0xd0
>  do_syscall_64+0x55/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
>  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
>  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
>  wrong behavior because nofail allocations are not allowed to fail.
>  Normal charge path simply forces the charge even if that means to cross
>  the limit. Kmem accounting should be doing the same.
> 
> Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9ec5e12486a7..e18108b2b786 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>  
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>  	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> +
> +		/*
> +		 * Enforce __GFP_NOFAIL allocation because callers are not
> +		 * prepared to see failures and likely do not have any failure
> +		 * handling code.
> +		 */
> +		if (gfp & __GFP_NOFAIL) {
> +			page_counter_charge(&memcg->kmem, nr_pages);
> +			return 0;
> +		}
>  		cancel_charge(memcg, nr_pages);
>  		return -ENOMEM;
>  	}
> -- 
> 2.20.1
Andrew Morton Sept. 24, 2019, 11:06 p.m. UTC | #8
On Tue, 24 Sep 2019 12:53:55 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> Andrew, do you plan to send this patch to Linus as ell?

I suppose so.  We don't actually have any reviewed-bys or acked-bys but
I expect that's because Shakeel forgot to type them in.

The followup deprecation warning patch I parked for 5.4-rc1.  Best to
give it a spin in -next and see if anyone complains before we go
bothering mainline users.


From: Michal Hocko <mhocko@suse.com>
Subject: memcg, kmem: do not fail __GFP_NOFAIL charges

Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
 create_page_buffers+0x4d/0x60
 __block_write_begin_int+0x8e/0x5a0
 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
 ? jbd2__journal_start+0xd7/0x1f0
 ext4_da_write_begin+0x112/0x3d0
 generic_perform_write+0xf1/0x1b0
 ? file_update_time+0x70/0x140
 __generic_file_write_iter+0x141/0x1a0
 ext4_file_write_iter+0xef/0x3b0
 __vfs_write+0x17e/0x1e0
 vfs_write+0xa5/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x55/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
behavior because nofail allocations are not allowed to fail.  Normal
charge path simply forces the charge even if that means to cross the
limit.  Kmem accounting should be doing the same.

Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/mm/memcontrol.c~memcg-kmem-do-not-fail-__gfp_nofail-charges
+++ a/mm/memcontrol.c
@@ -2943,6 +2943,16 @@ int __memcg_kmem_charge_memcg(struct pag
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+
+		/*
+		 * Enforce __GFP_NOFAIL allocation because callers are not
+		 * prepared to see failures and likely do not have any failure
+		 * handling code.
+		 */
+		if (gfp & __GFP_NOFAIL) {
+			page_counter_charge(&memcg->kmem, nr_pages);
+			return 0;
+		}
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}

Patch
diff mbox series

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ec5e12486a7..e18108b2b786 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2821,6 +2821,16 @@  int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+
+		/*
+		 * Enforce __GFP_NOFAIL allocation because callers are not
+		 * prepared to see failures and likely do not have any failure
+		 * handling code.
+		 */
+		if (gfp & __GFP_NOFAIL) {
+			page_counter_charge(&memcg->kmem, nr_pages);
+			return 0;
+		}
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}