* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
2019-10-29 18:09 ` Shakeel Butt
@ 2019-10-29 18:28 ` Marco Elver
2019-10-29 18:46 ` Shakeel Butt
2019-10-29 18:34 ` Johannes Weiner
2019-10-29 18:47 ` Michal Hocko
2 siblings, 1 reply; 8+ messages in thread
From: Marco Elver @ 2019-10-29 18:28 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, Andrew Morton,
Linux MM, Cgroups, LKML, Eric Dumazet, Greg Thelen,
syzbot+13f93c99c06988391efe
On Tue, 29 Oct 2019, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
optimizations, including store tearing, load tearing, etc. This does not
add memory barriers to constrain memory ordering. (If this code needs
some memory ordering guarantees w.r.t. previous loads/stores then this
alone is not enough.)
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
>
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock. My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.
The plain concurrent reads/writes are a data race, which may manifest in
various undefined behaviour due to compiler optimizations. The _ONCE
will prevent these (KCSAN only reports data races). Note that, "data
race" does not necessarily imply "race condition"; some data races are
race conditions (usually the more interesting bugs) -- but not *all*
data races are race conditions. If there is no race condition here that
warrants heavier synchronization (locking etc.), then this patch is all
that should be needed.
I can't comment on the rest.
Thanks,
-- Marco
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
2019-10-29 18:28 ` Marco Elver
@ 2019-10-29 18:46 ` Shakeel Butt
0 siblings, 0 replies; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29 18:46 UTC (permalink / raw)
To: Marco Elver
Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, Andrew Morton,
Linux MM, Cgroups, LKML, Eric Dumazet, Greg Thelen,
syzbot+13f93c99c06988391efe
On Tue, Oct 29, 2019 at 11:28 AM Marco Elver <elver@google.com> wrote:
>
>
>
> On Tue, 29 Oct 2019, Shakeel Butt wrote:
>
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > > mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > > mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
>
> Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
> optimizations, including store tearing, load tearing, etc. This does not
> add memory barriers to constrain memory ordering. (If this code needs
> some memory ordering guarantees w.r.t. previous loads/stores then this
> alone is not enough.)
>
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fixes it? There is still no
> > > synchronization. Do you want to prevent from memcg->last_scanned_node
> > > reloading?
> > >
> >
> > The problem is memcg->last_scanned_node can read and modified
> > concurrently. Though to me it seems like a tolerable race and not
> > worth to add an explicit lock. My aim was to make KCSAN happy here to
> > look elsewhere for the concurrency bugs. However I see that it might
> > complain next on memcg->scan_nodes.
>
> The plain concurrent reads/writes are a data race, which may manifest in
> various undefined behaviour due to compiler optimizations. The _ONCE
> will prevent these (KCSAN only reports data races). Note that, "data
> race" does not necessarily imply "race condition"; some data races are
> race conditions (usually the more interesting bugs) -- but not *all*
> data races are race conditions. If there is no race condition here that
> warrants heavier synchronization (locking etc.), then this patch is all
> that should be needed.
>
> I can't comment on the rest.
>
Thanks Marco for the explanation.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
2019-10-29 18:09 ` Shakeel Butt
2019-10-29 18:28 ` Marco Elver
@ 2019-10-29 18:34 ` Johannes Weiner
2019-10-29 18:47 ` Shakeel Butt
2019-10-29 18:47 ` Michal Hocko
2 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2019-10-29 18:34 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Hocko, Roman Gushchin, Andrew Morton, Linux MM, Cgroups,
LKML, Eric Dumazet, Greg Thelen, syzbot+13f93c99c06988391efe,
elver
On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
>
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock. My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.
>
> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> zonelist to the reclaimer, the shrink_node will be called for all
> potential nodes. Also we don't short circuit the traversal of
> shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> priority) for all nodes, I don't see the reason behind having round
> robin order of node traversal.
It's actually only very recently that we don't bail out of the reclaim
loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
("mm: vmscan: do not share cgroup iteration between reclaimers") that
removed the last bailout condition on sc->nr_reclaimed.
> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.
In the current state, I don't see any reason to keep it, either. We
can always just start the zonelist walk from the current node.
A nice cleanup, actually. Good catch!
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
2019-10-29 18:34 ` Johannes Weiner
@ 2019-10-29 18:47 ` Shakeel Butt
0 siblings, 0 replies; 8+ messages in thread
From: Shakeel Butt @ 2019-10-29 18:47 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Roman Gushchin, Andrew Morton, Linux MM, Cgroups,
LKML, Eric Dumazet, Greg Thelen, syzbot+13f93c99c06988391efe,
elver
On Tue, Oct 29, 2019 at 11:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > > mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > > mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
> > >
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fixes it? There is still no
> > > synchronization. Do you want to prevent from memcg->last_scanned_node
> > > reloading?
> > >
> >
> > The problem is memcg->last_scanned_node can read and modified
> > concurrently. Though to me it seems like a tolerable race and not
> > worth to add an explicit lock. My aim was to make KCSAN happy here to
> > look elsewhere for the concurrency bugs. However I see that it might
> > complain next on memcg->scan_nodes.
> >
> > Now taking a step back, I am questioning the whole motivation behind
> > mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> > zonelist to the reclaimer, the shrink_node will be called for all
> > potential nodes. Also we don't short circuit the traversal of
> > shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> > priority) for all nodes, I don't see the reason behind having round
> > robin order of node traversal.
>
> It's actually only very recently that we don't bail out of the reclaim
> loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
> ("mm: vmscan: do not share cgroup iteration between reclaimers") that
> removed the last bailout condition on sc->nr_reclaimed.
>
> > I am thinking of removing the whole mem_cgroup_select_victim_node()
> > heuristic. Please let me know if there are any objections.
>
> In the current state, I don't see any reason to keep it, either. We
> can always just start the zonelist walk from the current node.
>
> A nice cleanup, actually. Good catch!
Thanks, I will follow up with the removal of this heuristic.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm: memcontrol: fix data race in mem_cgroup_select_victim_node
2019-10-29 18:09 ` Shakeel Butt
2019-10-29 18:28 ` Marco Elver
2019-10-29 18:34 ` Johannes Weiner
@ 2019-10-29 18:47 ` Michal Hocko
2 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2019-10-29 18:47 UTC (permalink / raw)
To: Shakeel Butt
Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, Linux MM,
Cgroups, LKML, Eric Dumazet, Greg Thelen,
syzbot+13f93c99c06988391efe, elver
On Tue 29-10-19 11:09:29, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchrnonization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fixes it? There is still no
> > synchronization. Do you want to prevent from memcg->last_scanned_node
> > reloading?
> >
>
> The problem is memcg->last_scanned_node can read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth to add an explicit lock.
Agreed
> My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.
I would really refrain from adding whatever measure to silence some
tool without a deeper understanding of why that is needed. $FOO_ONCE
will prevent compiler from making funcy stuff. But this is an int and
I would be really surprised if $FOO_ONCE made any practical difference.
> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass ZONELIST_FALLBACK
> zonelist to the reclaimer, the shrink_node will be called for all
> potential nodes. Also we don't short circuit the traversal of
> shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> priority) for all nodes, I don't see the reason behind having round
> robin order of node traversal.
>
> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.
I would have to think more about this but this surely sounds like a
preferable way than adding $FOO_ONCE to silence the tool.
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 8+ messages in thread