* [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
@ 2019-01-10 17:44 ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2019-01-10 17:44 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Johannes Weiner, Vladimir Davydov
  Cc: cgroups, linux-mm, linux-kernel, Shakeel Butt

If a memcg is over its high limit, memory reclaim is scheduled to run
on return-to-userland.  However, it is assumed that the memcg is the
current process's memcg.  With remote memcg charging for kmem, or when
swapping in a page charged to a remote memcg, the current process can
push a remote memcg over its high limit.  Scheduling reclaim on
return-to-userland, which reclaims the current task's memcg, would then
skip the high reclaim for the remote memcg altogether.  So, record the
memcg needing high reclaim and trigger high reclaim for that memcg on
return-to-userland.  However, if a memcg is already recorded for high
reclaim and the recorded memcg is not a descendant of the memcg needing
high reclaim, punt the high reclaim to the work queue.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changelog since v2:
- TIF_NOTIFY_RESUME can be set from places other than try_charge(), in
  which case current->memcg_high_reclaim will be null.  Correctly handle
  such scenarios.

Changelog since v1:
- Punt the high reclaim of a memcg to the work queue only if the
  recorded memcg is not its descendant.
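
To summarize, the over-high handling in try_charge() after this patch
is as below (the same logic as the hunk further down, restated with
comments; reclaim_high() walks from the given memcg up through its
parents, which is why an already recorded descendant also covers the
newly over-high memcg):

	if (in_interrupt()) {
		/* No return-to-userland to hook into: punt. */
		schedule_work(&memcg->high_work);
		break;
	} else if (!current->memcg_high_reclaim) {
		/* Nothing recorded yet: pin and record this memcg. */
		css_get(&memcg->css);
		current->memcg_high_reclaim = memcg;
	} else if (!mem_cgroup_is_descendant(current->memcg_high_reclaim,
					     memcg)) {
		/* An unrelated memcg is already recorded: punt. */
		schedule_work(&memcg->high_work);
		break;
	}
	/* The recorded memcg is this memcg or a descendant of it, so
	 * the return-to-userland reclaim covers it: arm the hook. */
	current->memcg_nr_pages_over_high += batch;
	set_notify_resume(current);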

 include/linux/sched.h |  3 +++
 kernel/fork.c         |  1 +
 mm/memcontrol.c       | 22 ++++++++++++++++------
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d08562eeec7..5e6690042497 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1172,6 +1172,9 @@ struct task_struct {
 
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
+
+	/* Used by memcontrol for high reclaim: */
+	struct mem_cgroup		*memcg_high_reclaim;
 #endif
 
 #ifdef CONFIG_BLK_CGROUP
diff --git a/kernel/fork.c b/kernel/fork.c
index 1b0fde63d831..85da44137847 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -918,6 +918,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+	tsk->memcg_high_reclaim = NULL;
 #endif
 	return tsk;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953d4ba8a595..18f4aefbe0bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2168,14 +2168,17 @@ static void high_work_func(struct work_struct *work)
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = current->memcg_high_reclaim;
 
 	if (likely(!nr_pages))
 		return;
 
-	memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
+	current->memcg_high_reclaim = NULL;
 	current->memcg_nr_pages_over_high = 0;
 }
 
@@ -2329,10 +2332,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland.  We can perform reclaim here
 	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
-	 * GFP_KERNEL can consistently be used during reclaim.  @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime.  As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
+	 * GFP_KERNEL can consistently be used during reclaim.  Record the
+	 * memcg for the return-to-userland high reclaim.  If a memcg is
+	 * already recorded and the recorded memcg is not a descendant of the
+	 * memcg needing high reclaim, punt the high reclaim to the work queue.
 	 */
 	do {
 		if (page_counter_read(&memcg->memory) > memcg->high) {
@@ -2340,6 +2343,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
 				break;
+			} else if (!current->memcg_high_reclaim) {
+				css_get(&memcg->css);
+				current->memcg_high_reclaim = memcg;
+			} else if (!mem_cgroup_is_descendant(
+					current->memcg_high_reclaim, memcg)) {
+				schedule_work(&memcg->high_work);
+				break;
 			}
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
-- 
2.20.1.97.g81188d93c3-goog


* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-10 17:44 ` Shakeel Butt
@ 2019-01-11 20:59 ` Johannes Weiner
  2019-01-11 22:54     ` Shakeel Butt
  -1 siblings, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2019-01-11 20:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Andrew Morton, Vladimir Davydov, cgroups, linux-mm,
	linux-kernel

Hi Shakeel,

On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> If a memcg is over its high limit, memory reclaim is scheduled to run
> on return-to-userland.  However, it is assumed that the memcg is the
> current process's memcg.  With remote memcg charging for kmem, or when
> swapping in a page charged to a remote memcg, the current process can
> push a remote memcg over its high limit.  Scheduling reclaim on
> return-to-userland, which reclaims the current task's memcg, would then
> skip the high reclaim for the remote memcg altogether.  So, record the
> memcg needing high reclaim and trigger high reclaim for that memcg on
> return-to-userland.  However, if a memcg is already recorded for high
> reclaim and the recorded memcg is not a descendant of the memcg needing
> high reclaim, punt the high reclaim to the work queue.

The idea behind remote charging is that the thread allocating the
memory is not responsible for that memory, but a different cgroup
is. Why would the same thread then have to work off any high excess
this could produce in that unrelated group?

Say you have an inotify/dnotify listener that is restricted in its
memory use - now everybody sending notification events from outside
that listener's group would get throttled on a cgroup over which it
has no control. That sounds like a recipe for priority inversions.

It seems to me we should only do reclaim-on-return when current is in
the ill-behaved cgroup, and punt everything else - interrupts and
remote charges - to the workqueue.
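
Roughly, something like this in try_charge() (an untested sketch that
reuses the existing helpers, not a posted patch; mm_match_cgroup()
tests whether the mm's memcg lies within @memcg's subtree):

	do {
		if (page_counter_read(&memcg->memory) > memcg->high) {
			/* Throttle current only for its own misbehaving
			 * hierarchy; interrupts, kthreads and remote
			 * charges all go to the workqueue. */
			if (in_interrupt() || !current->mm ||
			    !mm_match_cgroup(current->mm, memcg)) {
				schedule_work(&memcg->high_work);
				break;
			}
			current->memcg_nr_pages_over_high += batch;
			set_notify_resume(current);
			break;
		}
	} while ((memcg = parent_mem_cgroup(memcg)));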


* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-11 20:59 ` Johannes Weiner
@ 2019-01-11 22:54     ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2019-01-11 22:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Vladimir Davydov, Cgroups, Linux MM, LKML

Hi Johannes,

On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Shakeel,
>
> On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > If a memcg is over its high limit, memory reclaim is scheduled to run
> > on return-to-userland.  However, it is assumed that the memcg is the
> > current process's memcg.  With remote memcg charging for kmem, or when
> > swapping in a page charged to a remote memcg, the current process can
> > push a remote memcg over its high limit.  Scheduling reclaim on
> > return-to-userland, which reclaims the current task's memcg, would then
> > skip the high reclaim for the remote memcg altogether.  So, record the
> > memcg needing high reclaim and trigger high reclaim for that memcg on
> > return-to-userland.  However, if a memcg is already recorded for high
> > reclaim and the recorded memcg is not a descendant of the memcg needing
> > high reclaim, punt the high reclaim to the work queue.
>
> The idea behind remote charging is that the thread allocating the
> memory is not responsible for that memory, but a different cgroup
> is. Why would the same thread then have to work off any high excess
> this could produce in that unrelated group?
>
> Say you have an inotify/dnotify listener that is restricted in its
> memory use - now everybody sending notification events from outside
> that listener's group would get throttled on a cgroup over which it
> has no control. That sounds like a recipe for priority inversions.
>
> It seems to me we should only do reclaim-on-return when current is in
> the ill-behaved cgroup, and punt everything else - interrupts and
> remote charges - to the workqueue.

This is what v1 of this patch was doing, but Michal suggested doing
what this version does. Michal's argument was that the current process
is already charging, and maybe reclaiming, a remote memcg, so why not
do the high excess reclaim as well.

Personally, I don't have a strong opinion either way. What I actually
wanted was to punt this high reclaim to some process in that remote
memcg. However, I didn't explore that direction much, unsure whether
the complexity is worth it. Maybe I should at least explore it so we
can compare the solutions. What do you think?
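
Roughly, the direction I had in mind is something like the below (just
a sketch: it ignores which task to pick, the races with that task, and
memcg offlining, which is exactly the complexity I was unsure about):

	/* Sketch: hand the over-high work to some task in the remote
	 * memcg instead of current.  Assumes the chosen task has
	 * nothing recorded yet and ignores all the locking issues. */
	static void punt_high_to_remote_task(struct mem_cgroup *memcg,
					     unsigned int nr_pages)
	{
		struct css_task_iter it;
		struct task_struct *task;

		css_task_iter_start(&memcg->css, 0, &it);
		task = css_task_iter_next(&it);	/* any task will do */
		if (task) {
			css_get(&memcg->css);
			task->memcg_high_reclaim = memcg;
			task->memcg_nr_pages_over_high += nr_pages;
			set_notify_resume(task);
		}
		css_task_iter_end(&it);
	}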

Shakeel

* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-11 22:54     ` Shakeel Butt
@ 2019-01-13 18:34     ` Michal Hocko
  2019-01-14 20:18         ` Shakeel Butt
  -1 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-01-13 18:34 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups,
	Linux MM, LKML

On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> Hi Johannes,
> 
> On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hi Shakeel,
> >
> > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > If a memcg is over its high limit, memory reclaim is scheduled to run
> > > on return-to-userland.  However, it is assumed that the memcg is the
> > > current process's memcg.  With remote memcg charging for kmem, or when
> > > swapping in a page charged to a remote memcg, the current process can
> > > push a remote memcg over its high limit.  Scheduling reclaim on
> > > return-to-userland, which reclaims the current task's memcg, would then
> > > skip the high reclaim for the remote memcg altogether.  So, record the
> > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > return-to-userland.  However, if a memcg is already recorded for high
> > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > high reclaim, punt the high reclaim to the work queue.
> >
> > The idea behind remote charging is that the thread allocating the
> > memory is not responsible for that memory, but a different cgroup
> > is. Why would the same thread then have to work off any high excess
> > this could produce in that unrelated group?
> >
> > Say you have an inotify/dnotify listener that is restricted in its
> > memory use - now everybody sending notification events from outside
> > that listener's group would get throttled on a cgroup over which it
> > has no control. That sounds like a recipe for priority inversions.
> >
> > It seems to me we should only do reclaim-on-return when current is in
> > the ill-behaved cgroup, and punt everything else - interrupts and
> > remote charges - to the workqueue.
> 
> This is what v1 of this patch was doing, but Michal suggested doing
> what this version does. Michal's argument was that the current process
> is already charging, and maybe reclaiming, a remote memcg, so why not
> do the high excess reclaim as well.

Johannes has a good point about the priority inversion problems which I
haven't thought about.

> Personally, I don't have a strong opinion either way. What I actually
> wanted was to punt this high reclaim to some process in that remote
> memcg. However, I didn't explore that direction much, unsure whether
> the complexity is worth it. Maybe I should at least explore it so we
> can compare the solutions. What do you think?

My question would be whether we really care all that much. Do we know of
workloads which would generate a large high limit excess?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-13 18:34     ` Michal Hocko
@ 2019-01-14 20:18         ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2019-01-14 20:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups,
	Linux MM, LKML

On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > Hi Johannes,
> >
> > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hi Shakeel,
> > >
> > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > If a memcg is over its high limit, memory reclaim is scheduled to run
> > > > on return-to-userland.  However, it is assumed that the memcg is the
> > > > current process's memcg.  With remote memcg charging for kmem, or when
> > > > swapping in a page charged to a remote memcg, the current process can
> > > > push a remote memcg over its high limit.  Scheduling reclaim on
> > > > return-to-userland, which reclaims the current task's memcg, would then
> > > > skip the high reclaim for the remote memcg altogether.  So, record the
> > > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > > return-to-userland.  However, if a memcg is already recorded for high
> > > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > > high reclaim, punt the high reclaim to the work queue.
> > >
> > > The idea behind remote charging is that the thread allocating the
> > > memory is not responsible for that memory, but a different cgroup
> > > is. Why would the same thread then have to work off any high excess
> > > this could produce in that unrelated group?
> > >
> > > Say you have an inotify/dnotify listener that is restricted in its
> > > memory use - now everybody sending notification events from outside
> > > that listener's group would get throttled on a cgroup over which it
> > > has no control. That sounds like a recipe for priority inversions.
> > >
> > > It seems to me we should only do reclaim-on-return when current is in
> > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > remote charges - to the workqueue.
> >
> > This is what v1 of this patch was doing, but Michal suggested doing
> > what this version does. Michal's argument was that the current process
> > is already charging, and maybe reclaiming, a remote memcg, so why not
> > do the high excess reclaim as well.
>
> Johannes has a good point about the priority inversion problems which I
> haven't thought about.
>
> > Personally, I don't have a strong opinion either way. What I actually
> > wanted was to punt this high reclaim to some process in that remote
> > memcg. However, I didn't explore that direction much, unsure whether
> > the complexity is worth it. Maybe I should at least explore it so we
> > can compare the solutions. What do you think?
>
> My question would be whether we really care all that much. Do we know of
> workloads which would generate a large high limit excess?
>

The current semantics of memory.high is that it can be breached under
extreme conditions. However, in any workload where memory.high is used
and a lot of remote memcg charging happens (the inotify/dnotify example
given by Johannes, or swapping in a tmpfs file or shared memory
region), breaching memory.high will become common.

Shakeel

* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-14 20:18         ` Shakeel Butt
@ 2019-01-15  7:25         ` Michal Hocko
  2019-01-15 19:38             ` Shakeel Butt
  -1 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-01-15  7:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups,
	Linux MM, LKML

On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > Hi Johannes,
> > >
> > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > Hi Shakeel,
> > > >
> > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > If a memcg is over its high limit, memory reclaim is scheduled to run
> > > > > on return-to-userland.  However, it is assumed that the memcg is the
> > > > > current process's memcg.  With remote memcg charging for kmem, or when
> > > > > swapping in a page charged to a remote memcg, the current process can
> > > > > push a remote memcg over its high limit.  Scheduling reclaim on
> > > > > return-to-userland, which reclaims the current task's memcg, would then
> > > > > skip the high reclaim for the remote memcg altogether.  So, record the
> > > > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > > > return-to-userland.  However, if a memcg is already recorded for high
> > > > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > > > high reclaim, punt the high reclaim to the work queue.
> > > >
> > > > The idea behind remote charging is that the thread allocating the
> > > > memory is not responsible for that memory, but a different cgroup
> > > > is. Why would the same thread then have to work off any high excess
> > > > this could produce in that unrelated group?
> > > >
> > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > memory use - now everybody sending notification events from outside
> > > > that listener's group would get throttled on a cgroup over which it
> > > > has no control. That sounds like a recipe for priority inversions.
> > > >
> > > > It seems to me we should only do reclaim-on-return when current is in
> > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > remote charges - to the workqueue.
> > >
> > > This is what v1 of this patch was doing, but Michal suggested doing
> > > what this version does. Michal's argument was that the current process
> > > is already charging, and maybe reclaiming, a remote memcg, so why not
> > > do the high excess reclaim as well.
> >
> > Johannes has a good point about the priority inversion problems which I
> > haven't thought about.
> >
> > > Personally, I don't have a strong opinion either way. What I actually
> > > wanted was to punt this high reclaim to some process in that remote
> > > memcg. However, I didn't explore that direction much, unsure whether
> > > the complexity is worth it. Maybe I should at least explore it so we
> > > can compare the solutions. What do you think?
> >
> > My question would be whether we really care all that much. Do we know of
> > workloads which would generate a large high limit excess?
> >
> 
> > The current semantics of memory.high is that it can be breached under
> > extreme conditions. However, in any workload where memory.high is used
> > and a lot of remote memcg charging happens (the inotify/dnotify example
> > given by Johannes, or swapping in a tmpfs file or shared memory
> > region), breaching memory.high will become common.

This is exactly what I am asking about. Is this something that can
happen easily? Remote charges by themselves should be rare, no?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-15  7:25         ` Michal Hocko
@ 2019-01-15 19:38             ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2019-01-15 19:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups,
	Linux MM, LKML

On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > Hi Johannes,
> > > >
> > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > Hi Shakeel,
> > > > >
> > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > If a memcg is over its high limit, memory reclaim is scheduled to run
> > > > > > on return-to-userland.  However, it is assumed that the memcg is the
> > > > > > current process's memcg.  With remote memcg charging for kmem, or when
> > > > > > swapping in a page charged to a remote memcg, the current process can
> > > > > > push a remote memcg over its high limit.  Scheduling reclaim on
> > > > > > return-to-userland, which reclaims the current task's memcg, would then
> > > > > > skip the high reclaim for the remote memcg altogether.  So, record the
> > > > > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > > > > return-to-userland.  However, if a memcg is already recorded for high
> > > > > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > > > > high reclaim, punt the high reclaim to the work queue.
> > > > >
> > > > > The idea behind remote charging is that the thread allocating the
> > > > > memory is not responsible for that memory, but a different cgroup
> > > > > is. Why would the same thread then have to work off any high excess
> > > > > this could produce in that unrelated group?
> > > > >
> > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > memory use - now everybody sending notification events from outside
> > > > > that listener's group would get throttled on a cgroup over which it
> > > > > has no control. That sounds like a recipe for priority inversions.
> > > > >
> > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > remote charges - to the workqueue.
> > > >
> > > > This is what v1 of this patch was doing, but Michal suggested doing
> > > > what this version does. Michal's argument was that the current process
> > > > is already charging, and maybe reclaiming, a remote memcg, so why not
> > > > do the high excess reclaim as well.
> > >
> > > Johannes has a good point about the priority inversion problems which I
> > > haven't thought about.
> > >
> > > > Personally, I don't have a strong opinion either way. What I actually
> > > > wanted was to punt this high reclaim to some process in that remote
> > > > memcg. However, I didn't explore that direction much, unsure whether
> > > > the complexity is worth it. Maybe I should at least explore it so we
> > > > can compare the solutions. What do you think?
> > >
> > > My question would be whether we really care all that much. Do we know of
> > > workloads which would generate a large high limit excess?
> > >
> >
> > > The current semantics of memory.high is that it can be breached under
> > > extreme conditions. However, in any workload where memory.high is used
> > > and a lot of remote memcg charging happens (the inotify/dnotify example
> > > given by Johannes, or swapping in a tmpfs file or shared memory
> > > region), breaching memory.high will become common.
>
> This is exactly what I am asking about. Is this something that can
> happen easily? Remote charges by themselves should be rare, no?
>

At the moment, for kmem we can do remote charging for fanotify,
inotify and buffer_head, and for anon pages we can do remote charging
on swap-in. Depending on the workload's cgroup setup, the remote
charging can be very frequent or rare.
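
For the kmem cases, the remote charge goes through the scoped
memalloc_use_memcg() API. The pattern is roughly the sketch below,
where get_target_memcg() and event_cachep are stand-in names (fsnotify,
for example, uses the memcg stashed in the notification group):

	struct mem_cgroup *memcg = get_target_memcg();	/* stand-in */
	void *event;

	memalloc_use_memcg(memcg);	/* sets current->active_memcg */
	/* __GFP_ACCOUNT allocations now charge memcg, not current's own */
	event = kmem_cache_alloc(event_cachep, GFP_KERNEL | __GFP_ACCOUNT);
	memalloc_unuse_memcg();		/* clears current->active_memcg */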

At Google, remote charging is very frequent, but since we are still on
cgroup-v1 and do not use memory.high, the issue this patch is fixing
is not observed. However, for the adoption of cgroup-v2, this fix is
needed.

Shakeel

* Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
  2019-01-15 19:38             ` Shakeel Butt
@ 2019-01-16  7:02             ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2019-01-16  7:02 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups,
	Linux MM, LKML

On Tue 15-01-19 11:38:23, Shakeel Butt wrote:
> On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > > Hi Johannes,
> > > > >
> > > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > >
> > > > > > Hi Shakeel,
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > > If a memcg is over its high limit, memory reclaim is scheduled to run
> > > > > > > on return-to-userland.  However, it is assumed that the memcg is the
> > > > > > > current process's memcg.  With remote memcg charging for kmem, or when
> > > > > > > swapping in a page charged to a remote memcg, the current process can
> > > > > > > push a remote memcg over its high limit.  Scheduling reclaim on
> > > > > > > return-to-userland, which reclaims the current task's memcg, would then
> > > > > > > skip the high reclaim for the remote memcg altogether.  So, record the
> > > > > > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > > > > > return-to-userland.  However, if a memcg is already recorded for high
> > > > > > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > > > > > high reclaim, punt the high reclaim to the work queue.
> > > > > >
> > > > > > The idea behind remote charging is that the thread allocating the
> > > > > > memory is not responsible for that memory, but a different cgroup
> > > > > > is. Why would the same thread then have to work off any high excess
> > > > > > this could produce in that unrelated group?
> > > > > >
> > > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > > memory use - now everybody sending notification events from outside
> > > > > > that listener's group would get throttled on a cgroup over which it
> > > > > > has no control. That sounds like a recipe for priority inversions.
> > > > > >
> > > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > > remote charges - to the workqueue.
> > > > >
> > > > > This is what v1 of this patch was doing, but Michal suggested doing
> > > > > what this version does. Michal's argument was that the current process
> > > > > is already charging, and maybe reclaiming, a remote memcg, so why not
> > > > > do the high excess reclaim as well.
> > > >
> > > > Johannes has a good point about the priority inversion problems which I
> > > > haven't thought about.
> > > >
> > > > > Personally, I don't have a strong opinion either way. What I actually
> > > > > wanted was to punt this high reclaim to some process in that remote
> > > > > memcg. However, I didn't explore that direction much, unsure whether
> > > > > the complexity is worth it. Maybe I should at least explore it so we
> > > > > can compare the solutions. What do you think?
> > > >
> > > > My question would be whether we really care all that much. Do we know of
> > > > workloads which would generate a large high limit excess?
> > > >
> > >
> > > > The current semantics of memory.high is that it can be breached under
> > > > extreme conditions. However, in any workload where memory.high is used
> > > > and a lot of remote memcg charging happens (the inotify/dnotify example
> > > > given by Johannes, or swapping in a tmpfs file or shared memory
> > > > region), breaching memory.high will become common.
> >
> > This is exactly what I am asking about. Is this something that can
> > happen easily? Remote charges by themselves should be rare, no?
> >
> 
> At the moment, for kmem we can do remote charging for fanotify,
> inotify and buffer_head, and for anon pages we can do remote charging
> on swap-in. Depending on the workload's cgroup setup, the remote
> charging can be very frequent or rare.
>
> At Google, remote charging is very frequent, but since we are still on
> cgroup-v1 and do not use memory.high, the issue this patch is fixing
> is not observed. However, for the adoption of cgroup-v2, this fix is
> needed.

Adding some numbers to the changelog would be really valuable for
judging the urgency and the scale of the problem. If we are going via
kworker, then it is also important to evaluate what kind of effect this
has on the system. How big an excess can we get? Why don't those memcgs
resolve the excess by themselves on the first direct charge? Is it
possible that kworkers simply swamp the system with many parallel
memcgs with remote charges?

In other words, we need a deeper analysis of the problem and the
solution.
-- 
Michal Hocko
SUSE Labs
