[PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
@ 2019-10-29  0:54 Shakeel Butt
  2019-10-29  9:03 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29  0:54 UTC (permalink / raw)
  To: Roman Gushchin, Johannes Weiner, Michal Hocko, Andrew Morton
  Cc: linux-mm, cgroups, linux-kernel, Shakeel Butt, Eric Dumazet,
	Greg Thelen, syzbot+13f93c99c06988391efe

Syzbot reported the following bug:

BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node

write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
 mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
 mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

mem_cgroup_select_victim_node() can be called concurrently which reads
and modifies memcg->last_scanned_node without any synchrnonization. So,
read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
to stop potential reordering.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4c555055a72..5a06739dd3e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	int node;
 
 	mem_cgroup_may_update_nodemask(memcg);
-	node = memcg->last_scanned_node;
+	node = READ_ONCE(memcg->last_scanned_node);
 
 	node = next_node_in(node, memcg->scan_nodes);
 	/*
@@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
 
-	memcg->last_scanned_node = node;
+	WRITE_ONCE(memcg->last_scanned_node, node);
 	return node;
 }
 #else
-- 
2.24.0.rc0.303.g954a862665-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29  0:54 [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node Shakeel Butt
@ 2019-10-29  9:03 ` Michal Hocko
  2019-10-29 18:09   ` Shakeel Butt
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-10-29  9:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, linux-mm,
	cgroups, linux-kernel, Eric Dumazet, Greg Thelen,
	syzbot+13f93c99c06988391efe

On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> Syzbot reported the following bug:
> 
> BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> 
> write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
>  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> 
> read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
>  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> 
> mem_cgroup_select_victim_node() can be called concurrently which reads
> and modifies memcg->last_scanned_node without any synchrnonization. So,
> read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> to stop potential reordering.

I am sorry but I do not understand the problem and the fix. Why does the
race happen and why does _ONCE fixes it? There is still no
synchronization. Do you want to prevent from memcg->last_scanned_node
reloading?

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Cc: Greg Thelen <gthelen@google.com>
> Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4c555055a72..5a06739dd3e4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	int node;
>  
>  	mem_cgroup_may_update_nodemask(memcg);
> -	node = memcg->last_scanned_node;
> +	node = READ_ONCE(memcg->last_scanned_node);
>  
>  	node = next_node_in(node, memcg->scan_nodes);
>  	/*
> @@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	if (unlikely(node == MAX_NUMNODES))
>  		node = numa_node_id();
>  
> -	memcg->last_scanned_node = node;
> +	WRITE_ONCE(memcg->last_scanned_node, node);
>  	return node;
>  }
>  #else
> -- 
> 2.24.0.rc0.303.g954a862665-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29  9:03 ` Michal Hocko
@ 2019-10-29 18:09   ` Shakeel Butt
  2019-10-29 18:28     ` Marco Elver
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, Linux MM,
	Cgroups, LKML, Eric Dumazet, Greg Thelen,
	syzbot+13f93c99c06988391efe, elver

+Marco

On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > Syzbot reported the following bug:
> >
> > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> >
> > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> >
> > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> >
> > mem_cgroup_select_victim_node() can be called concurrently which reads
> > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > to stop potential reordering.
>
> I am sorry but I do not understand the problem and the fix. Why does the
> race happen and why does _ONCE fixes it? There is still no
> synchronization. Do you want to prevent from memcg->last_scanned_node
> reloading?
>

The problem is memcg->last_scanned_node can read and modified
concurrently. Though to me it seems like a tolerable race and not
worth to add an explicit lock. My aim was to make KCSAN happy here to
look elsewhere for the concurrency bugs. However I see that it might
complain next on memcg->scan_nodes.

Now taking a step back, I am questioning the whole motivation behind
mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
zonelist to the reclaimer, the shrink_node will be called for all
potential nodes. Also we don't short circuit the traversal of
shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
priority) for all nodes, I don't see the reason behind having round
robin order of node traversal.

I am thinking of removing the whole mem_cgroup_select_victim_node()
heuristic. Please let me know if there are any objections.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29 18:09   ` Shakeel Butt
@ 2019-10-29 18:28     ` Marco Elver
  2019-10-29 18:46       ` Shakeel Butt
  2019-10-29 18:34     ` Johannes Weiner
  2019-10-29 18:47     ` Michal Hocko
  2 siblings, 1 reply; 8+ messages in thread
From: Marco Elver @ 2019-10-29 18:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, Andrew Morton,
	Linux MM, Cgroups, LKML, Eric Dumazet, Greg Thelen,
	syzbot+13f93c99c06988391efe



On Tue, 29 Oct 2019, Shakeel Butt wrote:

> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.

Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
optimizations, including store tearing, load tearing, etc. This does not
add memory barriers to constrain memory ordering.  (If this code needs
some memory ordering guarantees w.r.t. previous loads/stores then this
alone is not enough.)

> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
> 
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock. My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.

The plain concurrent reads/writes are a data race, which may manifest in
various undefined behaviour due to compiler optimizations. The _ONCE
will prevent these (KCSAN only reports data races).  Note that, "data
race" does not necessarily imply "race condition"; some data races are
race conditions (usually the more interesting bugs) -- but not *all*
data races are race conditions. If there is no race condition here that
warrants heavier synchronization (locking etc.), then this patch is all
that should be needed.

I can't comment on the rest.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29 18:28     ` Marco Elver
@ 2019-10-29 18:46       ` Shakeel Butt
  0 siblings, 0 replies; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29 18:46 UTC (permalink / raw)
  To: Marco Elver
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, Andrew Morton,
	Linux MM, Cgroups, LKML, Eric Dumazet, Greg Thelen,
	syzbot+13f93c99c06988391efe

On Tue, Oct 29, 2019 at 11:28 AM Marco Elver <elver@google.com> wrote:
>
>
>
> On Tue, 29 Oct 2019, Shakeel Butt wrote:
>
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
>
> Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
> optimizations, including store tearing, load tearing, etc. This does not
> add memory barriers to constrain memory ordering.  (If this code needs
> some memory ordering guarantees w.r.t. previous loads/stores then this
> alone is not enough.)
>
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fixes it? There is still no
> > > synchronization. Do you want to prevent from memcg->last_scanned_node
> > > reloading?
> > >
> >
> > The problem is memcg->last_scanned_node can read and modified
> > concurrently. Though to me it seems like a tolerable race and not
> > worth to add an explicit lock. My aim was to make KCSAN happy here to
> > look elsewhere for the concurrency bugs. However I see that it might
> > complain next on memcg->scan_nodes.
>
> The plain concurrent reads/writes are a data race, which may manifest in
> various undefined behaviour due to compiler optimizations. The _ONCE
> will prevent these (KCSAN only reports data races).  Note that, "data
> race" does not necessarily imply "race condition"; some data races are
> race conditions (usually the more interesting bugs) -- but not *all*
> data races are race conditions. If there is no race condition here that
> warrants heavier synchronization (locking etc.), then this patch is all
> that should be needed.
>
> I can't comment on the rest.
>

Thanks Marco for the explanation.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29 18:09   ` Shakeel Butt
  2019-10-29 18:28     ` Marco Elver
@ 2019-10-29 18:34     ` Johannes Weiner
  2019-10-29 18:47       ` Shakeel Butt
  2019-10-29 18:47     ` Michal Hocko
  2 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2019-10-29 18:34 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Roman Gushchin, Andrew Morton, Linux MM, Cgroups,
	LKML, Eric Dumazet, Greg Thelen, syzbot+13f93c99c06988391efe,
	elver

On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
> 
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock. My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.
> 
> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> zonelist to the reclaimer, the shrink_node will be called for all
> potential nodes. Also we don't short circuit the traversal of
> shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> priority) for all nodes, I don't see the reason behind having round
> robin order of node traversal.

It's actually only very recently that we don't bail out of the reclaim
loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
("mm: vmscan: do not share cgroup iteration between reclaimers") that
removed the last bailout condition on sc->nr_reclaimed.

> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

In the current state, I don't see any reason to keep it, either. We
can always just start the zonelist walk from the current node.

A nice cleanup, actually. Good catch!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29 18:34     ` Johannes Weiner
@ 2019-10-29 18:47       ` Shakeel Butt
  0 siblings, 0 replies; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29 18:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Andrew Morton, Linux MM, Cgroups,
	LKML, Eric Dumazet, Greg Thelen, syzbot+13f93c99c06988391efe,
	elver

On Tue, Oct 29, 2019 at 11:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
> > >
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fixes it? There is still no
> > > synchronization. Do you want to prevent from memcg->last_scanned_node
> > > reloading?
> > >
> >
> > The problem is memcg->last_scanned_node can read and modified
> > concurrently. Though to me it seems like a tolerable race and not
> > worth to add an explicit lock. My aim was to make KCSAN happy here to
> > look elsewhere for the concurrency bugs. However I see that it might
> > complain next on memcg->scan_nodes.
> >
> > Now taking a step back, I am questioning the whole motivation behind
> > mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> > zonelist to the reclaimer, the shrink_node will be called for all
> > potential nodes. Also we don't short circuit the traversal of
> > shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> > priority) for all nodes, I don't see the reason behind having round
> > robin order of node traversal.
>
> It's actually only very recently that we don't bail out of the reclaim
> loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
> ("mm: vmscan: do not share cgroup iteration between reclaimers") that
> removed the last bailout condition on sc->nr_reclaimed.
>
> > I am thinking of removing the whole mem_cgroup_select_victim_node()
> > heuristic. Please let me know if there are any objections.
>
> In the current state, I don't see any reason to keep it, either. We
> can always just start the zonelist walk from the current node.
>
> A nice cleanup, actually. Good catch!

Thanks, I will follow up with the removal of this heuristic.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
  2019-10-29 18:09   ` Shakeel Butt
  2019-10-29 18:28     ` Marco Elver
  2019-10-29 18:34     ` Johannes Weiner
@ 2019-10-29 18:47     ` Michal Hocko
  2 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2019-10-29 18:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, Linux MM,
	Cgroups, LKML, Eric Dumazet, Greg Thelen,
	syzbot+13f93c99c06988391efe, elver

On Tue 29-10-19 11:09:29, Shakeel Butt wrote:
> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
> 
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock.

Agreed

> My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.

I would really refrain from adding whatever measure to silence some
tool without a deeper understanding of why that is needed. $FOO_ONCE
will prevent compiler from making funcy stuff. But this is an int and
I would be really surprised if $FOO_ONCE made any practical difference.

> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> zonelist to the reclaimer, the shrink_node will be called for all
> potential nodes. Also we don't short circuit the traversal of
> shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> priority) for all nodes, I don't see the reason behind having round
> robin order of node traversal.
> 
> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

I would have to think more about this but this surely sounds like a
preferable way than adding $FOO_ONCE to silence the tool.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-10-29 18:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29  0:54 [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node Shakeel Butt
2019-10-29  9:03 ` Michal Hocko
2019-10-29 18:09   ` Shakeel Butt
2019-10-29 18:28     ` Marco Elver
2019-10-29 18:46       ` Shakeel Butt
2019-10-29 18:34     ` Johannes Weiner
2019-10-29 18:47       ` Shakeel Butt
2019-10-29 18:47     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).