* [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko; +Cc: hannes, vdavydov.dev, cgroups, linux-mm, sean.j.christopherson
This patch set contains two bug fixes for mem_cgroup_iter(). The bugs
were found by code inspection and were confirmed via synthetic testing
that forcefully setup the failing conditions.
Bug #1 is a race condition where mem_cgroup_iter() incorrectly returns
the same memcg to multiple threads reclaiming from the same root, zone,
priority and generation. mem_cgroup_iter() doesn't check the result of
cmpxchg(iter->pos...) when setting the new pos, and so fails to detect
that it will return the same memcg as the thread that successfully set
iter->position. If multiple threads read the same iter->position value,
then they will call css_next_descendant_pre() with the same css and will
compute the same memcg (unless they see different versions of the tree
due to an RCU update).
Bug #2 is also a race condition of sorts, with the same setup conditions
as bug #1. If a reclaimer's initial call to mem_cgroup_iter() triggers
a restart of the hierarchy walk, i.e. css_next_descendant_pre() returns
NULL and prev == NULL, mem_cgroup_iter() fails to increment iter->gen...
even though it has started a new walk of the hierarchy. This technically
isn't a bug for the thread that triggered the restart as it's reasonable
for that thread to perform a full walk of the tree, but other threads
in the current reclaim generation will incorrectly continue to walk the
tree since iter->generation won't be updated until one of the reclaimers
reaches the end of the hierarchy a second time.
The two patches can be applied independently, but I included them in a
single series as the fix for bug #1 can theoretically exacerbate bug #2,
and bug #2 is likely more serious as it results in a duplicate walk of
the entire tree as opposed to a duplicate reclaim of a single memcg.
Sean Christopherson (2):
mm/memcontrol: check cmpxchg(iter->pos...) result in mem_cgroup_iter()
mm/memcontrol: inc reclaim gen if restarting walk in mem_cgroup_iter()
mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 9 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko-DgEjT+Ai2ygdnm+yROfE0A
Cc: hannes-druUgvl0LCNAfugRpC6u6w,
vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w
This patch set contains two bug fixes for mem_cgroup_iter(). The bugs
were found by code inspection and were confirmed via synthetic testing
that forcefully setup the failing conditions.
Bug #1 is a race condition where mem_cgroup_iter() incorrectly returns
the same memcg to multiple threads reclaiming from the same root, zone,
priority and generation. mem_cgroup_iter() doesn't check the result of
cmpxchg(iter->pos...) when setting the new pos, and so fails to detect
that it will return the same memcg as the thread that successfully set
iter->position. If multiple threads read the same iter->position value,
then they will call css_next_descendant_pre() with the same css and will
compute the same memcg (unless they see different versions of the tree
due to an RCU update).
Bug #2 is also a race condition of sorts, with the same setup conditions
as bug #1. If a reclaimer's initial call to mem_cgroup_iter() triggers
a restart of the hierarchy walk, i.e. css_next_descendant_pre() returns
NULL and prev == NULL, mem_cgroup_iter() fails to increment iter->gen...
even though it has started a new walk of the hierarchy. This technically
isn't a bug for the thread that triggered the restart as it's reasonable
for that thread to perform a full walk of the tree, but other threads
in the current reclaim generation will incorrectly continue to walk the
tree since iter->generation won't be updated until one of the reclaimers
reaches the end of the hierarchy a second time.
The two patches can be applied independently, but I included them in a
single series as the fix for bug #1 can theoretically exacerbate bug #2,
and bug #2 is likely more serious as it results in a duplicate walk of
the entire tree as opposed to a duplicate reclaim of a single memcg.
Sean Christopherson (2):
mm/memcontrol: check cmpxchg(iter->pos...) result in mem_cgroup_iter()
mm/memcontrol: inc reclaim gen if restarting walk in mem_cgroup_iter()
mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 9 deletions(-)
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] mm/memcontrol: check cmpxchg(iter->pos...) result in mem_cgroup_iter()
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko; +Cc: hannes, vdavydov.dev, cgroups, linux-mm, sean.j.christopherson
Check the return value of cmpxchg when updating iter->position in
mem_cgroup_iter(). If cmpxchg failed, i.e. a different thread won
the race to update iter->position, then restart the entire flow of
reading, processing and updating iter->position. Simply ensuring
that there aren't multiple writes to iter->position doesn't avoid
redundant reclaims of a memcg, as competing threads will compute
the same memcg given the same iter->position.
The cmpxchg will only fail if a different thread saw the same value
of iter->position, meaning it called css_next_descendant_pre() with
the same css and therefore computed the same memcg (ignoring the
corner case where the threads see different versions of the tree).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
mm/memcontrol.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16c556a..6a7ca3c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -758,6 +758,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
rcu_read_lock();
+start:
if (reclaim) {
struct mem_cgroup_per_node *mz;
@@ -818,11 +819,27 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (reclaim) {
/*
- * The position could have already been updated by a competing
- * thread, so check that the value hasn't changed since we read
- * it to avoid reclaiming from the same cgroup twice.
+ * Competing reclaim threads may attempt to consume the same
+ * iter->position, check that the value hasn't changed since
+ * we read it to avoid reclaiming from the same cgroup twice.
+ * Note that just avoiding multiple writes to iter->position
+ * does not prevent redundant reclaims to memcg. Given the
+ * same input css on competing threads, the css returned by
+ * css_next_descendant_pre will also be the same (unless the
+ * tree itself changes). So, if a different thread read the
+ * same iter->position, then it also computed the same memcg.
+ * If we lost the race, put our css references and restart
+ * the entire process of reading and updating iter->position.
*/
- (void)cmpxchg(&iter->position, pos, memcg);
+ if (cmpxchg(&iter->position, pos, memcg) != pos) {
+ if (memcg && memcg != root)
+ css_put(&memcg->css);
+ if (pos)
+ css_put(&pos->css);
+ css = NULL;
+ memcg = NULL;
+ goto start;
+ }
if (pos)
css_put(&pos->css);
--
2.7.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 1/2] mm/memcontrol: check cmpxchg(iter->pos...) result in mem_cgroup_iter()
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko-DgEjT+Ai2ygdnm+yROfE0A
Cc: hannes-druUgvl0LCNAfugRpC6u6w,
vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w
Check the return value of cmpxchg when updating iter->position in
mem_cgroup_iter(). If cmpxchg failed, i.e. a different thread won
the race to update iter->position, then restart the entire flow of
reading, processing and updating iter->position. Simply ensuring
that there aren't multiple writes to iter->position doesn't avoid
redundant reclaims of a memcg, as competing threads will compute
the same memcg given the same iter->position.
The cmpxchg will only fail if a different thread saw the same value
of iter->position, meaning it called css_next_descendant_pre() with
the same css and therefore computed the same memcg (ignoring the
corner case where the threads see different versions of the tree).
Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
mm/memcontrol.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16c556a..6a7ca3c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -758,6 +758,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
rcu_read_lock();
+start:
if (reclaim) {
struct mem_cgroup_per_node *mz;
@@ -818,11 +819,27 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (reclaim) {
/*
- * The position could have already been updated by a competing
- * thread, so check that the value hasn't changed since we read
- * it to avoid reclaiming from the same cgroup twice.
+ * Competing reclaim threads may attempt to consume the same
+ * iter->position, check that the value hasn't changed since
+ * we read it to avoid reclaiming from the same cgroup twice.
+ * Note that just avoiding multiple writes to iter->position
+ * does not prevent redundant reclaims to memcg. Given the
+ * same input css on competing threads, the css returned by
+ * css_next_descendant_pre will also be the same (unless the
+ * tree itself changes). So, if a different thread read the
+ * same iter->position, then it also computed the same memcg.
+ * If we lost the race, put our css references and restart
+ * the entire process of reading and updating iter->position.
*/
- (void)cmpxchg(&iter->position, pos, memcg);
+ if (cmpxchg(&iter->position, pos, memcg) != pos) {
+ if (memcg && memcg != root)
+ css_put(&memcg->css);
+ if (pos)
+ css_put(&pos->css);
+ css = NULL;
+ memcg = NULL;
+ goto start;
+ }
if (pos)
css_put(&pos->css);
--
2.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] mm/memcontrol: inc reclaim gen if restarting walk in mem_cgroup_iter()
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko; +Cc: hannes, vdavydov.dev, cgroups, linux-mm, sean.j.christopherson
Increment iter->generation if a reclaimer reaches the end of the tree,
even if it restarts the hierarchy walk instead of returning NULL, i.e.
this is the reclaimer's initial call to mem_cgroup_iter(). If we don't
increment the generation, other threads that are part of the current
reclaim generation will incorrectly continue to walk the tree since
iter->generation won't be updated until one of the reclaimers reaches
the end of the hierarchy a second time.
Move the put_css(&pos->css) call below the iter->generation update
to minimize the window where a thread can see a stale generation but
consume an updated position, as iter->generation and iter->position
are not updated atomically.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
mm/memcontrol.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a7ca3c..b858245 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -740,6 +740,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct cgroup_subsys_state *css = NULL;
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *pos = NULL;
+ bool inc_gen = false;
if (mem_cgroup_disabled())
return NULL;
@@ -791,6 +792,14 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
css = css_next_descendant_pre(css, &root->css);
if (!css) {
/*
+ * Increment the generation as the next call to
+ * css_next_descendant_pre will restart at root.
+ * Do not update iter->generation directly as we
+ * should only do so if we update iter->position.
+ */
+ inc_gen = true;
+
+ /*
* Reclaimers share the hierarchy walk, and a
* new one might jump in right at the end of
* the hierarchy - make sure they see at least
@@ -838,16 +847,28 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
css_put(&pos->css);
css = NULL;
memcg = NULL;
+ inc_gen = false;
goto start;
}
- if (pos)
- css_put(&pos->css);
-
- if (!memcg)
+ /*
+ * Update iter->generation asap to minimize the window where
+ * a different thread compares against a stale generation but
+ * consumes an updated position.
+ */
+ if (inc_gen)
iter->generation++;
- else if (!prev)
+
+ /*
+ * Initialize the reclaimer's generation after the potential
+ * update to iter->generation; if we restarted the hierarchy
+ * walk then we are part of the new generation.
+ */
+ if (!prev)
reclaim->generation = iter->generation;
+
+ if (pos)
+ css_put(&pos->css);
}
out_unlock:
--
2.7.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] mm/memcontrol: inc reclaim gen if restarting walk in mem_cgroup_iter()
@ 2017-04-28 21:55 ` Sean Christopherson
0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2017-04-28 21:55 UTC (permalink / raw)
To: mhocko-DgEjT+Ai2ygdnm+yROfE0A
Cc: hannes-druUgvl0LCNAfugRpC6u6w,
vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w
Increment iter->generation if a reclaimer reaches the end of the tree,
even if it restarts the hierarchy walk instead of returning NULL, i.e.
this is the reclaimer's initial call to mem_cgroup_iter(). If we don't
increment the generation, other threads that are part of the current
reclaim generation will incorrectly continue to walk the tree since
iter->generation won't be updated until one of the reclaimers reaches
the end of the hierarchy a second time.
Move the put_css(&pos->css) call below the iter->generation update
to minimize the window where a thread can see a stale generation but
consume an updated position, as iter->generation and iter->position
are not updated atomically.
Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
mm/memcontrol.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a7ca3c..b858245 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -740,6 +740,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct cgroup_subsys_state *css = NULL;
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *pos = NULL;
+ bool inc_gen = false;
if (mem_cgroup_disabled())
return NULL;
@@ -791,6 +792,14 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
css = css_next_descendant_pre(css, &root->css);
if (!css) {
/*
+ * Increment the generation as the next call to
+ * css_next_descendant_pre will restart at root.
+ * Do not update iter->generation directly as we
+ * should only do so if we update iter->position.
+ */
+ inc_gen = true;
+
+ /*
* Reclaimers share the hierarchy walk, and a
* new one might jump in right at the end of
* the hierarchy - make sure they see at least
@@ -838,16 +847,28 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
css_put(&pos->css);
css = NULL;
memcg = NULL;
+ inc_gen = false;
goto start;
}
- if (pos)
- css_put(&pos->css);
-
- if (!memcg)
+ /*
+ * Update iter->generation asap to minimize the window where
+ * a different thread compares against a stale generation but
+ * consumes an updated position.
+ */
+ if (inc_gen)
iter->generation++;
- else if (!prev)
+
+ /*
+ * Initialize the reclaimer's generation after the potential
+ * update to iter->generation; if we restarted the hierarchy
+ * walk then we are part of the new generation.
+ */
+ if (!prev)
reclaim->generation = iter->generation;
+
+ if (pos)
+ css_put(&pos->css);
}
out_unlock:
--
2.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter
2017-04-28 21:55 ` Sean Christopherson
` (2 preceding siblings ...)
(?)
@ 2017-05-02 14:03 ` Michal Hocko
2017-05-02 14:20 ` Christopherson, Sean J
-1 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2017-05-02 14:03 UTC (permalink / raw)
To: Sean Christopherson; +Cc: hannes, vdavydov.dev, cgroups, linux-mm
On Fri 28-04-17 14:55:45, Sean Christopherson wrote:
> This patch set contains two bug fixes for mem_cgroup_iter(). The bugs
> were found by code inspection and were confirmed via synthetic testing
> that forcefully setup the failing conditions.
I assume that you added some artificial sleeps to make those races more
probable, right? Or did you manage to hit those issue solely from the
userspace? I will have a look at those patches. It has been some time
since I've had it cached. It is pretty subtle code so I would like to
understand the urgency before I dive into this further.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter
@ 2017-05-02 14:20 ` Christopherson, Sean J
0 siblings, 0 replies; 9+ messages in thread
From: Christopherson, Sean J @ 2017-05-02 14:20 UTC (permalink / raw)
To: 'Michal Hocko'; +Cc: hannes, vdavydov.dev, cgroups, linux-mm
Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 28-04-17 14:55:45, Sean Christopherson wrote:
> > This patch set contains two bug fixes for mem_cgroup_iter(). The bugs
> > were found by code inspection and were confirmed via synthetic testing
> > that forcefully setup the failing conditions.
>
> I assume that you added some artificial sleeps to make those races more
> probable, right? Or did you manage to hit those issue solely from the
> userspace? I will have a look at those patches. It has been some time
> since I've had it cached. It is pretty subtle code so I would like to
> understand the urgency before I dive into this further.
> --
> Michal Hocko
> SUSE Labs
The code to prove the bugs is completely artificial, it's basically a
unit test for mem_cgroup_iter() that uses a thread barrier to all but
guarantee two threads will call mem_cgroup_iter() simultaneously. I
haven't even attempted to hit this via the actual userspace flow.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter
@ 2017-05-02 14:20 ` Christopherson, Sean J
0 siblings, 0 replies; 9+ messages in thread
From: Christopherson, Sean J @ 2017-05-02 14:20 UTC (permalink / raw)
To: 'Michal Hocko'
Cc: hannes-druUgvl0LCNAfugRpC6u6w,
vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Fri 28-04-17 14:55:45, Sean Christopherson wrote:
> > This patch set contains two bug fixes for mem_cgroup_iter(). The bugs
> > were found by code inspection and were confirmed via synthetic testing
> > that forcefully setup the failing conditions.
>
> I assume that you added some artificial sleeps to make those races more
> probable, right? Or did you manage to hit those issue solely from the
> userspace? I will have a look at those patches. It has been some time
> since I've had it cached. It is pretty subtle code so I would like to
> understand the urgency before I dive into this further.
> --
> Michal Hocko
> SUSE Labs
The code to prove the bugs is completely artificial, it's basically a
unit test for mem_cgroup_iter() that uses a thread barrier to all but
guarantee two threads will call mem_cgroup_iter() simultaneously. I
haven't even attempted to hit this via the actual userspace flow.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2017-05-02 14:21 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-28 21:55 [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter Sean Christopherson
2017-04-28 21:55 ` Sean Christopherson
2017-04-28 21:55 ` [PATCH 1/2] mm/memcontrol: check cmpxchg(iter->pos...) result in mem_cgroup_iter() Sean Christopherson
2017-04-28 21:55 ` Sean Christopherson
2017-04-28 21:55 ` [PATCH 2/2] mm/memcontrol: inc reclaim gen if restarting walk " Sean Christopherson
2017-04-28 21:55 ` Sean Christopherson
2017-05-02 14:03 ` [PATCH 0/2] mm/memcontrol: fix reclaim bugs in mem_cgroup_iter Michal Hocko
2017-05-02 14:20 ` Christopherson, Sean J
2017-05-02 14:20 ` Christopherson, Sean J
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.