linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
@ 2019-07-28 12:29 Konstantin Khlebnikov
  2019-07-29  9:17 ` Michal Hocko
  2019-07-29 15:49 ` Johannes Weiner
  0 siblings, 2 replies; 22+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-28 12:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: cgroups, Michal Hocko, Vladimir Davydov, Johannes Weiner

The high memory limit in a memory cgroup allows batching memory reclaim and
deferring it until return to userland. This moves it out of any locks.

A fixed gap between the high and max limits works pretty well (we are using
64 * NR_CPUS pages) except in cases where one syscall allocates tons of
memory. This affects all other tasks in the cgroup because they might hit
the max memory limit in inconvenient places and/or under hot locks.

For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
of pages and push memory cgroup usage far beyond the high memory limit.

This patch uses the point halfway between the high and max limits as a
threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
is called with only_severe = true; otherwise reclaim is deferred until
return to userland. If the high limit isn't set, nothing changes.

Now long-running get_user_pages will periodically reclaim cgroup memory.
Other possible targets are the generic file read/write iter loops.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/memcontrol.h |    4 ++--
 include/linux/tracehook.h  |    2 +-
 mm/gup.c                   |    5 ++++-
 mm/memcontrol.c            |   17 ++++++++++++++++-
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44c41462be33..eca2bf9560f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	return mz->lru_zone_size[zone_idx][lru];
 }
 
-void mem_cgroup_handle_over_high(void);
+void mem_cgroup_handle_over_high(bool only_severe);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 
@@ -969,7 +969,7 @@ static inline void unlock_page_memcg(struct page *page)
 {
 }
 
-static inline void mem_cgroup_handle_over_high(void)
+static inline void mem_cgroup_handle_over_high(bool only_severe)
 {
 }
 
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 36fb3bbed6b2..8845fb65353f 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -194,7 +194,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	}
 #endif
 
-	mem_cgroup_handle_over_high();
+	mem_cgroup_handle_over_high(false);
 	blkcg_maybe_throttle_current();
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..42b93fffe824 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = -ERESTARTSYS;
 			goto out;
 		}
-		cond_resched();
 
+		/* Reclaim memory over high limit before stocking too much */
+		mem_cgroup_handle_over_high(true);
+
+		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdbb7a84cb6e..15fa664ce98c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2317,11 +2317,16 @@ static void high_work_func(struct work_struct *work)
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
 }
 
+#define MEMCG_SEVERE_OVER_HIGH	(1 << 31)
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
+ *
+ * Long allocation loops should call periodically with only_severe = true
+ * to reclaim memory if usage already over halfway to the max limit.
  */
-void mem_cgroup_handle_over_high(void)
+void mem_cgroup_handle_over_high(bool only_severe)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	struct mem_cgroup *memcg;
@@ -2329,6 +2334,11 @@ void mem_cgroup_handle_over_high(void)
 	if (likely(!nr_pages))
 		return;
 
+	if (nr_pages & MEMCG_SEVERE_OVER_HIGH)
+		nr_pages -= MEMCG_SEVERE_OVER_HIGH;
+	else if (only_severe)
+		return;
+
 	memcg = get_mem_cgroup_from_mm(current->mm);
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
@@ -2493,6 +2503,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			/* Mark as severe if over halfway to the max limit */
+			if (page_counter_read(&memcg->memory) >
+			    (memcg->high >> 1) + (memcg->memory.max >> 1))
+				current->memcg_nr_pages_over_high |=
+						MEMCG_SEVERE_OVER_HIGH;
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
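[Editorial note: the patch above packs the MEMCG_SEVERE_OVER_HIGH flag into bit 31 of the per-task memcg_nr_pages_over_high counter. The following is a standalone toy model of that packing in Python — names mirror the patch, but none of this is kernel code and the page numbers are invented for illustration:]

```python
# Toy model of the severe-flag packing: bit 31 of the per-task
# deferred-page counter doubles as the MEMCG_SEVERE_OVER_HIGH flag.
MEMCG_SEVERE_OVER_HIGH = 1 << 31

def charge_over_high(pending, batch, usage, high, max_limit):
    """Model of the try_charge() hunk: accumulate deferred pages and set
    the severe flag once usage passes halfway between high and max."""
    if usage > (high >> 1) + (max_limit >> 1):
        pending |= MEMCG_SEVERE_OVER_HIGH
    return pending + batch

def handle_over_high(pending, only_severe):
    """Model of mem_cgroup_handle_over_high(): returns a tuple of
    (pages reclaimed now, pages left pending)."""
    if pending == 0:
        return 0, 0
    if pending & MEMCG_SEVERE_OVER_HIGH:
        pending -= MEMCG_SEVERE_OVER_HIGH
    elif only_severe:
        return 0, pending          # not severe: defer to userland return
    return pending, 0              # drain everything deferred so far
```

With high = 800 and max = 1000 pages, the severe threshold is 900: a gup loop calling with only_severe = True reclaims only after usage crosses it, while the return-to-userland path (only_severe = False) always drains the counter.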


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-28 12:29 [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop Konstantin Khlebnikov
@ 2019-07-29  9:17 ` Michal Hocko
  2019-07-29  9:40   ` Konstantin Khlebnikov
  2019-07-29 15:49 ` Johannes Weiner
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-07-29  9:17 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov, Johannes Weiner

On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> The high memory limit in a memory cgroup allows batching memory reclaim and
> deferring it until return to userland. This moves it out of any locks.
> 
> A fixed gap between the high and max limits works pretty well (we are using
> 64 * NR_CPUS pages) except in cases where one syscall allocates tons of
> memory. This affects all other tasks in the cgroup because they might hit
> the max memory limit in inconvenient places and/or under hot locks.
> 
> For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> of pages and push memory cgroup usage far beyond the high memory limit.
> 
> This patch uses the point halfway between the high and max limits as a
> threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
> is called with only_severe = true; otherwise reclaim is deferred until
> return to userland. If the high limit isn't set, nothing changes.
> 
> Now long-running get_user_pages will periodically reclaim cgroup memory.
> Other possible targets are the generic file read/write iter loops.

I do see how gup can lead to a large high limit excess, but could you be
more specific about why that is a problem? We should be reclaiming a similar
number of pages cumulatively.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29  9:17 ` Michal Hocko
@ 2019-07-29  9:40   ` Konstantin Khlebnikov
  2019-07-29 10:33     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-29  9:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov, Johannes Weiner

On 29.07.2019 12:17, Michal Hocko wrote:
> On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
>> The high memory limit in a memory cgroup allows batching memory reclaim and
>> deferring it until return to userland. This moves it out of any locks.
>>
>> A fixed gap between the high and max limits works pretty well (we are using
>> 64 * NR_CPUS pages) except in cases where one syscall allocates tons of
>> memory. This affects all other tasks in the cgroup because they might hit
>> the max memory limit in inconvenient places and/or under hot locks.
>>
>> For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
>> of pages and push memory cgroup usage far beyond the high memory limit.
>>
>> This patch uses the point halfway between the high and max limits as a
>> threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
>> is called with only_severe = true; otherwise reclaim is deferred until
>> return to userland. If the high limit isn't set, nothing changes.
>>
>> Now long-running get_user_pages will periodically reclaim cgroup memory.
>> Other possible targets are the generic file read/write iter loops.
> 
> I do see how gup can lead to a large high limit excess, but could you be
> more specific about why that is a problem? We should be reclaiming a similar
> number of pages cumulatively.
> 

A large gup might push usage close to the limit and keep it there for some time.
As a result, concurrent allocations will enter direct reclaim right at
charging much more frequently.


Right now, deferred reclaim after passing the high limit works like a
distributed memcg kswapd which reclaims memory in the "background" and
prevents completely synchronous direct reclaim.

Does anybody have plans for a real kswapd for memcg?


I've put mem_cgroup_handle_over_high in gup next to cond_resched(), and
later that gave me the idea that this is a good place for running any
deferred work, like a bottom half for tasks. Right now this happens
only when switching into userspace.


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29  9:40   ` Konstantin Khlebnikov
@ 2019-07-29 10:33     ` Michal Hocko
  2019-07-29 11:24       ` Konstantin Khlebnikov
  2019-07-29 17:28       ` Yang Shi
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2019-07-29 10:33 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov, Johannes Weiner

On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> On 29.07.2019 12:17, Michal Hocko wrote:
> > On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > > The high memory limit in a memory cgroup allows batching memory reclaim and
> > > deferring it until return to userland. This moves it out of any locks.
> > > 
> > > A fixed gap between the high and max limits works pretty well (we are using
> > > 64 * NR_CPUS pages) except in cases where one syscall allocates tons of
> > > memory. This affects all other tasks in the cgroup because they might hit
> > > the max memory limit in inconvenient places and/or under hot locks.
> > > 
> > > For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > > of pages and push memory cgroup usage far beyond the high memory limit.
> > > 
> > > This patch uses the point halfway between the high and max limits as a
> > > threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
> > > is called with only_severe = true; otherwise reclaim is deferred until
> > > return to userland. If the high limit isn't set, nothing changes.
> > > 
> > > Now long-running get_user_pages will periodically reclaim cgroup memory.
> > > Other possible targets are the generic file read/write iter loops.
> > 
> > I do see how gup can lead to a large high limit excess, but could you be
> > more specific about why that is a problem? We should be reclaiming a similar
> > number of pages cumulatively.
> > 
> 
> A large gup might push usage close to the limit and keep it there for some time.
> As a result, concurrent allocations will enter direct reclaim right at
> charging much more frequently.

Yes, this is indeed possible. On the other hand, even reclaim from
the charge path doesn't really prevent that from happening because the
context might get preempted or blocked on locks. So I guess we need
more detailed information about an actual real-world problem here.
 
> Right now, deferred reclaim after passing the high limit works like a
> distributed memcg kswapd which reclaims memory in the "background" and
> prevents completely synchronous direct reclaim.
> 
> Does anybody have plans for a real kswapd for memcg?

I am not aware of any. The primary problem back then was that we simply
cannot have a kernel thread for each memcg because that doesn't scale.
Using kthreads and a dynamic pool of threads tends to be quite tricky -
e.g. proper accounting, and scaling again.
 
> I've put mem_cgroup_handle_over_high in gup next to cond_resched(), and
> later that gave me the idea that this is a good place for running any
> deferred work, like a bottom half for tasks. Right now this happens
> only when switching into userspace.

I am not against pushing high limit reclaim into the charge path in
principle. I just want to hear how big of a problem this really is in
practice. If this is mostly a theoretical problem, though, I would
rather stick with the existing code.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 10:33     ` Michal Hocko
@ 2019-07-29 11:24       ` Konstantin Khlebnikov
  2019-07-29 17:28       ` Yang Shi
  1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-29 11:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov, Johannes Weiner

On 29.07.2019 13:33, Michal Hocko wrote:
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
>> On 29.07.2019 12:17, Michal Hocko wrote:
>>> On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
>>>> The high memory limit in a memory cgroup allows batching memory reclaim and
>>>> deferring it until return to userland. This moves it out of any locks.
>>>>
>>>> A fixed gap between the high and max limits works pretty well (we are using
>>>> 64 * NR_CPUS pages) except in cases where one syscall allocates tons of
>>>> memory. This affects all other tasks in the cgroup because they might hit
>>>> the max memory limit in inconvenient places and/or under hot locks.
>>>>
>>>> For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
>>>> of pages and push memory cgroup usage far beyond the high memory limit.
>>>>
>>>> This patch uses the point halfway between the high and max limits as a
>>>> threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
>>>> is called with only_severe = true; otherwise reclaim is deferred until
>>>> return to userland. If the high limit isn't set, nothing changes.
>>>>
>>>> Now long-running get_user_pages will periodically reclaim cgroup memory.
>>>> Other possible targets are the generic file read/write iter loops.
>>>
>>> I do see how gup can lead to a large high limit excess, but could you be
>>> more specific about why that is a problem? We should be reclaiming a similar
>>> number of pages cumulatively.
>>>
>>
>> A large gup might push usage close to the limit and keep it there for some time.
>> As a result, concurrent allocations will enter direct reclaim right at
>> charging much more frequently.
> 
> Yes, this is indeed possible. On the other hand, even reclaim from
> the charge path doesn't really prevent that from happening because the
> context might get preempted or blocked on locks. So I guess we need
> more detailed information about an actual real-world problem here.
>   
>> Right now, deferred reclaim after passing the high limit works like a
>> distributed memcg kswapd which reclaims memory in the "background" and
>> prevents completely synchronous direct reclaim.
>>
>> Does anybody have plans for a real kswapd for memcg?
> 
> I am not aware of any. The primary problem back then was that we simply
> cannot have a kernel thread for each memcg because that doesn't scale.
> Using kthreads and a dynamic pool of threads tends to be quite tricky -
> e.g. proper accounting, and scaling again.

Yep, for containers proper accounting is important, especially cpu usage.

We're using manual kswapd-style reclaim in userspace via MADV_STOCKPILE
within containers where memory allocation latency is critical.

This patch is about less extreme cases which would be nice to handle
automatically, without custom tuning.

>   
>> I've put mem_cgroup_handle_over_high in gup next to cond_resched(), and
>> later that gave me the idea that this is a good place for running any
>> deferred work, like a bottom half for tasks. Right now this happens
>> only when switching into userspace.
> 
> I am not against pushing high limit reclaim into the charge path in
> principle. I just want to hear how big of a problem this really is in
> practice. If this is mostly a theoretical problem, though, I would
> rather stick with the existing code.
> 

Besides latency, which might not be so important for everybody, I see these:

The first problem is fairness within the cgroup - the task that generates
the allocation flow isn't throttled after passing the high limit, as the
documentation says it should be. It will feel memory pressure only after
hitting the max limit, while other tasks with smaller allocations go into
direct reclaim right away.

The second is accumulating too much deferred reclaim - after a large gup, a
task might call direct reclaim with a target amount much larger than the gap
between the high and max limits, or even larger than the max limit itself.


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-28 12:29 [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop Konstantin Khlebnikov
  2019-07-29  9:17 ` Michal Hocko
@ 2019-07-29 15:49 ` Johannes Weiner
  2019-07-29 18:55   ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2019-07-29 15:49 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Vladimir Davydov

On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			ret = -ERESTARTSYS;
>  			goto out;
>  		}
> -		cond_resched();
>  
> +		/* Reclaim memory over high limit before stocking too much */
> +		mem_cgroup_handle_over_high(true);

I'd rather this remained part of the try_charge() call. The code
comment in try_charge says this:

	 * We can perform reclaim here if __GFP_RECLAIM but let's
	 * always punt for simplicity and so that GFP_KERNEL can
	 * consistently be used during reclaim.

The simplicity argument doesn't hold true anymore once we have to add
manual calls into allocation sites. We should instead fix try_charge()
to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
return when actually needed.


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 10:33     ` Michal Hocko
  2019-07-29 11:24       ` Konstantin Khlebnikov
@ 2019-07-29 17:28       ` Yang Shi
  2019-07-29 18:48         ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Yang Shi @ 2019-07-29 17:28 UTC (permalink / raw)
  To: Michal Hocko, Konstantin Khlebnikov
  Cc: Linux MM, Linux Kernel Mailing List, cgroups, Vladimir Davydov,
	Johannes Weiner

On Mon, Jul 29, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> > On 29.07.2019 12:17, Michal Hocko wrote:
> > > On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > > > The high memory limit in a memory cgroup allows batching memory reclaim and
> > > > deferring it until return to userland. This moves it out of any locks.
> > > >
> > > > A fixed gap between the high and max limits works pretty well (we are using
> > > > 64 * NR_CPUS pages) except in cases where one syscall allocates tons of
> > > > memory. This affects all other tasks in the cgroup because they might hit
> > > > the max memory limit in inconvenient places and/or under hot locks.
> > > >
> > > > For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > > > of pages and push memory cgroup usage far beyond the high memory limit.
> > > >
> > > > This patch uses the point halfway between the high and max limits as a
> > > > threshold: beyond it, memory reclaim starts when mem_cgroup_handle_over_high()
> > > > is called with only_severe = true; otherwise reclaim is deferred until
> > > > return to userland. If the high limit isn't set, nothing changes.
> > > >
> > > > Now long-running get_user_pages will periodically reclaim cgroup memory.
> > > > Other possible targets are the generic file read/write iter loops.
> > >
> > > I do see how gup can lead to a large high limit excess, but could you be
> > > more specific about why that is a problem? We should be reclaiming a similar
> > > number of pages cumulatively.
> > >
> >
> > A large gup might push usage close to the limit and keep it there for some time.
> > As a result, concurrent allocations will enter direct reclaim right at
> > charging much more frequently.
>
> Yes, this is indeed possible. On the other hand, even reclaim from
> the charge path doesn't really prevent that from happening because the
> context might get preempted or blocked on locks. So I guess we need
> more detailed information about an actual real-world problem here.
>
> > Right now, deferred reclaim after passing the high limit works like a
> > distributed memcg kswapd which reclaims memory in the "background" and
> > prevents completely synchronous direct reclaim.
> >
> > Does anybody have plans for a real kswapd for memcg?
>
> I am not aware of any. The primary problem back then was that we simply
> cannot have a kernel thread for each memcg because that doesn't scale.
> Using kthreads and a dynamic pool of threads tends to be quite tricky -
> e.g. proper accounting, and scaling again.

We did discuss this topic at last year's LSF/MM, please see:
https://lwn.net/Articles/753162/. We (Alibaba) have had the memcg
kswapd thing working in our production environment for a while, and it
worked well for our 11.11 shopping festival flood.

I did plan to post the patches upstream, but I was distracted by
something else for a long time. Now I have redesigned it and already
have some preliminary patches working; if you are interested in this I
would like to post the patches soon to gather some comments early.

However, some of the issues mentioned by Michal do still exist, i.e.
accounting. I have not thought too much about accounting yet. I
recall Johannes mentioned they were working on accounting for kswapd
back then. But I'm not sure if they are still working on that or not,
or whether he just meant some throttling solved by commit
2cf855837b89d92996cf264713f3bed2bf9b0b4f ("memcontrol: schedule
throttling if we are congested"). But I vaguely recall that accounting
did not sound very critical.

I don't worry too much about scale since the scale issue is not unique
to background reclaim; direct reclaim may run into the same problem.
If you run into extreme memory pressure, a lot of memcgs run into direct
reclaim and global reclaim is also running at the same time, so your
machine is definitely already unusable. And our use case is that memcg
background reclaim is mainly used by some latency-sensitive memcgs
(running latency-sensitive applications) to try to minimize direct
reclaim; other memcgs had better be throttled by direct reclaim
if they consume too much memory.

Regards,
Yang

>
> > I've put mem_cgroup_handle_over_high in gup next to cond_resched(), and
> > later that gave me the idea that this is a good place for running any
> > deferred work, like a bottom half for tasks. Right now this happens
> > only when switching into userspace.
>
> I am not against pushing high limit reclaim into the charge path in
> principle. I just want to hear how big of a problem this really is in
> practice. If this is mostly a theoretical problem, though, I would
> rather stick with the existing code.
>
> --
> Michal Hocko
> SUSE Labs
>


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 17:28       ` Yang Shi
@ 2019-07-29 18:48         ` Michal Hocko
  2019-08-01 21:00           ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-07-29 18:48 UTC (permalink / raw)
  To: Yang Shi
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Mon 29-07-19 10:28:43, Yang Shi wrote:
[...]
> I don't worry too much about scale since the scale issue is not unique
> to background reclaim; direct reclaim may run into the same problem.

Just to clarify. By scaling problem I mean a 1:1 kswapd thread to memcg.
You can have thousands of memcgs and I do not think we really want
to create one kswapd for each. Once we have a kswapd thread pool, we
get into tricky territory where determinism/fairness would be non-trivial
to achieve. Direct reclaim, on the other hand, is bound by the workload
itself.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 15:49 ` Johannes Weiner
@ 2019-07-29 18:55   ` Michal Hocko
  2019-08-02  9:40     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-07-29 18:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, cgroups, Vladimir Davydov

On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  			ret = -ERESTARTSYS;
> >  			goto out;
> >  		}
> > -		cond_resched();
> >  
> > +		/* Reclaim memory over high limit before stocking too much */
> > +		mem_cgroup_handle_over_high(true);
> 
> I'd rather this remained part of the try_charge() call. The code
> comment in try_charge says this:
> 
> 	 * We can perform reclaim here if __GFP_RECLAIM but let's
> 	 * always punt for simplicity and so that GFP_KERNEL can
> 	 * consistently be used during reclaim.
> 
> The simplicity argument doesn't hold true anymore once we have to add
> manual calls into allocation sites. We should instead fix try_charge()
> to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> return when actually needed.

Agreed. If we want to do direct reclaim on a high limit breach then it
should go into try_charge the same way we do hard limit reclaim there. I am
not yet sure about how/whether to scale the excess. The only reason to
move reclaim to the return-to-userspace path was GFP_NOWAIT charges. As you
say, maybe we should start by always performing the reclaim for
sleepable contexts first and only defer for non-sleeping requests.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 18:48         ` Michal Hocko
@ 2019-08-01 21:00           ` Yang Shi
  2019-08-02  9:35             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2019-08-01 21:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 29-07-19 10:28:43, Yang Shi wrote:
> [...]
> > I don't worry too much about scale since the scale issue is not unique
> > to background reclaim; direct reclaim may run into the same problem.
>
> Just to clarify. By scaling problem I mean a 1:1 kswapd thread to memcg.
> You can have thousands of memcgs and I do not think we really want
> to create one kswapd for each. Once we have a kswapd thread pool, we
> get into tricky territory where determinism/fairness would be non-trivial
> to achieve. Direct reclaim, on the other hand, is bound by the workload
> itself.

Yes, I agree a thread pool would introduce more latency than a dedicated
kswapd thread. But it doesn't look that bad in our tests. When memory
allocation is fast, even a dedicated kswapd thread can't catch
up. So such background reclaim is best effort, not guaranteed.

I don't quite get what you mean about fairness. Do you mean they may
spend excessive cpu time and cause starvation of other processes? I
think this could be mitigated by properly organizing and configuring
the groups. But I agree this is tricky.

Typically, processes are placed into different cgroups according
to their importance and priority. For example, in our cluster, system
processes go into a system group, then latency-sensitive jobs and
batch jobs (usually second-class citizens) go into different
groups. The memcg kswapd would be enabled for latency-sensitive groups
only. The memcg kswapd threads would have the same priority as the
global kswapd.

> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-01 21:00           ` Yang Shi
@ 2019-08-02  9:35             ` Michal Hocko
  2019-08-02 18:56               ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-02  9:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Thu 01-08-19 14:00:51, Yang Shi wrote:
> On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > [...]
> > > I don't worry too much about scale since the scale issue is not unique
> > > to background reclaim; direct reclaim may run into the same problem.
> >
> > Just to clarify. By scaling problem I mean a 1:1 kswapd thread to memcg.
> > You can have thousands of memcgs and I do not think we really want
> > to create one kswapd for each. Once we have a kswapd thread pool, we
> > get into tricky territory where determinism/fairness would be non-trivial
> > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > itself.
> 
> Yes, I agree a thread pool would introduce more latency than a dedicated
> kswapd thread. But it doesn't look that bad in our tests. When memory
> allocation is fast, even a dedicated kswapd thread can't catch
> up. So such background reclaim is best effort, not guaranteed.
> 
> I don't quite get what you mean about fairness. Do you mean they may
> spend excessive cpu time and cause starvation of other processes? I
> think this could be mitigated by properly organizing and configuring
> the groups. But I agree this is tricky.

No, I meant that the cost of reclaiming a unit of charges (e.g.
SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
on LRUs. Therefore any thread pool mechanism would lead to unfair
reclaim and non-deterministic behavior.

I can imagine a middle ground where the background reclaim would have to
be an opt-in feature and a dedicated kernel thread would be assigned to
the particular memcg (hierarchy).
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-07-29 18:55   ` Michal Hocko
@ 2019-08-02  9:40     ` Michal Hocko
  2019-08-02 10:01       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-02  9:40 UTC (permalink / raw)
  To: Johannes Weiner, Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov

On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> > On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > >  			ret = -ERESTARTSYS;
> > >  			goto out;
> > >  		}
> > > -		cond_resched();
> > >  
> > > +		/* Reclaim memory over high limit before stocking too much */
> > > +		mem_cgroup_handle_over_high(true);
> > 
> > I'd rather this remained part of the try_charge() call. The code
> > comment in try_charge says this:
> > 
> > 	 * We can perform reclaim here if __GFP_RECLAIM but let's
> > 	 * always punt for simplicity and so that GFP_KERNEL can
> > 	 * consistently be used during reclaim.
> > 
> > The simplicity argument doesn't hold true anymore once we have to add
> > manual calls into allocation sites. We should instead fix try_charge()
> > to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> > return when actually needed.
> 
> Agreed. If we want to do direct reclaim on the high limit breach then it
> should go into try_charge same way we do hard limit reclaim there. I am
> not yet sure about how/whether to scale the excess. The only reason to
> move reclaim to return-to-userspace path was GFP_NOWAIT charges. As you
> say, maybe we should start by always performing the reclaim for
> sleepable contexts first and only defer for non-sleeping requests.

In other words, something like the patch below (completely untested). Could
you give it a try, Konstantin?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a4a1de..53a35c526e43 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				schedule_work(&memcg->high_work);
 				break;
 			}
-			current->memcg_nr_pages_over_high += batch;
-			set_notify_resume(current);
+			if (gfpflags_allow_blocking(gfp_mask)) {
+				reclaim_high(memcg, nr_pages, GFP_KERNEL);
+			} else {
+				current->memcg_nr_pages_over_high += batch;
+				set_notify_resume(current);
+			}
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-02  9:40     ` Michal Hocko
@ 2019-08-02 10:01       ` Konstantin Khlebnikov
  2019-08-02 11:44         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Khlebnikov @ 2019-08-02 10:01 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: linux-mm, linux-kernel, cgroups, Vladimir Davydov



On 02.08.2019 12:40, Michal Hocko wrote:
> On Mon 29-07-19 20:55:09, Michal Hocko wrote:
>> On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
>>> On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>>>   			ret = -ERESTARTSYS;
>>>>   			goto out;
>>>>   		}
>>>> -		cond_resched();
>>>>   
>>>> +		/* Reclaim memory over high limit before stocking too much */
>>>> +		mem_cgroup_handle_over_high(true);
>>>
>>> I'd rather this remained part of the try_charge() call. The code
>>> comment in try_charge says this:
>>>
>>> 	 * We can perform reclaim here if __GFP_RECLAIM but let's
>>> 	 * always punt for simplicity and so that GFP_KERNEL can
>>> 	 * consistently be used during reclaim.
>>>
>>> The simplicity argument doesn't hold true anymore once we have to add
>>> manual calls into allocation sites. We should instead fix try_charge()
>>> to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
>>> return when actually needed.
>>
>> Agreed. If we want to do direct reclaim on the high limit breach then it
>> should go into try_charge same way we do hard limit reclaim there. I am
>> not yet sure about how/whether to scale the excess. The only reason to
>> move reclaim to return-to-userspace path was GFP_NOWAIT charges. As you
>> say, maybe we should start by always performing the reclaim for
>> sleepable contexts first and only defer for non-sleeping requests.
> 
> In other words. Something like patch below (completely untested). Could
> you give it a try Konstantin?

This should work, but it also eliminates all the benefits of deferred reclaim:
bigger batching and running without holding any locks.

After that, the gap between high and max will serve merely as a reserve for atomic allocations.

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba9138a4a1de..53a35c526e43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   				schedule_work(&memcg->high_work);
>   				break;
>   			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
> +			if (gfpflags_allow_blocking(gfp_mask)) {
> +				reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +			} else {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
> +			}
>   			break;
>   		}
>   	} while ((memcg = parent_mem_cgroup(memcg)));
> 

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-02 10:01       ` Konstantin Khlebnikov
@ 2019-08-02 11:44         ` Michal Hocko
  2019-08-06  7:07           ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-02 11:44 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Johannes Weiner, linux-mm, linux-kernel, cgroups, Vladimir Davydov

On Fri 02-08-19 13:01:07, Konstantin Khlebnikov wrote:
> 
> 
> On 02.08.2019 12:40, Michal Hocko wrote:
> > On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> > > On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> > > > On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > > > > --- a/mm/gup.c
> > > > > +++ b/mm/gup.c
> > > > > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > > > >   			ret = -ERESTARTSYS;
> > > > >   			goto out;
> > > > >   		}
> > > > > -		cond_resched();
> > > > > +		/* Reclaim memory over high limit before stocking too much */
> > > > > +		mem_cgroup_handle_over_high(true);
> > > > 
> > > > I'd rather this remained part of the try_charge() call. The code
> > > > comment in try_charge says this:
> > > > 
> > > > 	 * We can perform reclaim here if __GFP_RECLAIM but let's
> > > > 	 * always punt for simplicity and so that GFP_KERNEL can
> > > > 	 * consistently be used during reclaim.
> > > > 
> > > > The simplicity argument doesn't hold true anymore once we have to add
> > > > manual calls into allocation sites. We should instead fix try_charge()
> > > > to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> > > > return when actually needed.
> > > 
> > > Agreed. If we want to do direct reclaim on the high limit breach then it
> > > should go into try_charge same way we do hard limit reclaim there. I am
> > > not yet sure about how/whether to scale the excess. The only reason to
> > > move reclaim to return-to-userspace path was GFP_NOWAIT charges. As you
> > > say, maybe we should start by always performing the reclaim for
> > > sleepable contexts first and only defer for non-sleeping requests.
> > 
> > In other words. Something like patch below (completely untested). Could
> > you give it a try Konstantin?
> 
> This should work but also eliminate all benefits from deferred reclaim:
> bigger batching and running without of any locks.

Yes, but we already have to deal with that for hard limit reclaim. Also, I
would like to see actual data backing any more complex solution.
We should definitely start simple.

> After that gap between high and max will work just as reserve for atomic allocations.
> 
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ba9138a4a1de..53a35c526e43 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >   				schedule_work(&memcg->high_work);
> >   				break;
> >   			}
> > -			current->memcg_nr_pages_over_high += batch;
> > -			set_notify_resume(current);
> > +			if (gfpflags_allow_blocking(gfp_mask)) {
> > +				reclaim_high(memcg, nr_pages, GFP_KERNEL);

Oops, this should be s@GFP_KERNEL@gfp_mask@

> > +			} else {
> > +				current->memcg_nr_pages_over_high += batch;
> > +				set_notify_resume(current);
> > +			}
> >   			break;
> >   		}
> >   	} while ((memcg = parent_mem_cgroup(memcg)));
> > 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-02  9:35             ` Michal Hocko
@ 2019-08-02 18:56               ` Yang Shi
  2019-08-05 14:32                 ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2019-08-02 18:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > [...]
> > > > I don't worry too much about scale since the scale issue is not unique
> > > > to background reclaim, direct reclaim may run into the same problem.
> > >
> > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > You can have thousands of memcgs and I do not think we really do want
> > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > get into a tricky land where a determinism/fairness would be non trivial
> > > to achieve. Direct reclaim, on the other hand is bound by the workload
> > > itself.
> >
> > Yes, I agree thread pool would introduce more latency than dedicated
> > kswapd thread. But, it looks not that bad in our test. When memory
> > allocation is fast, even though dedicated kswapd thread can't catch
> > up. So, such background reclaim is best effort, not guaranteed.
> >
> > I don't quite get what you mean about fairness. Do you mean they may
> > spend excessive cpu time then cause other processes starvation? I
> > think this could be mitigated by properly organizing and setting
> > groups. But, I agree this is tricky.
>
> No, I meant that the cost of reclaiming a unit of charges (e.g.
> SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> on LRUs. Therefore any thread pool mechanism would lead to unfair
> reclaim and non-deterministic behavior.

Yes, the cost depends on the state of pages, but I still don't quite
understand what "unfair" refers to in this context. Do you mean
some cgroups may reclaim much more than others? Or that the work may take
too long, so it cannot serve other cgroups in time?

>
> I can imagine a middle ground where the background reclaim would have to
> be an opt-in feature and a dedicated kernel thread would be assigned to
> the particular memcg (hierarchy).

Yes, it is opt-in by defining a proper "water mark". As long as the "water
mark" is set within (0, 100), the "kswapd" work is queued once the
usage exceeds the "water mark", and it exits once the usage
drops back under it. If the "water mark" is 0, no background
reclaim work is ever queued.

We did use a dedicated kernel thread for each cgroup, but it turned out
to be tricky and error prone to manage those threads; a workqueue is
much simpler and less error prone.

> --
> Michal Hocko
> SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-02 18:56               ` Yang Shi
@ 2019-08-05 14:32                 ` Michal Hocko
  2019-08-05 19:24                   ` Shakeel Butt
  2019-08-06  3:28                   ` Yang Shi
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2019-08-05 14:32 UTC (permalink / raw)
  To: Yang Shi
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Fri 02-08-19 11:56:28, Yang Shi wrote:
> On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > [...]
> > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > to background reclaim, direct reclaim may run into the same problem.
> > > >
> > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > You can have thousands of memcgs and I do not think we really do want
> > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > get into a tricky land where a determinism/fairness would be non trivial
> > > > to achieve. Direct reclaim, on the other hand is bound by the workload
> > > > itself.
> > >
> > > Yes, I agree thread pool would introduce more latency than dedicated
> > > kswapd thread. But, it looks not that bad in our test. When memory
> > > allocation is fast, even though dedicated kswapd thread can't catch
> > > up. So, such background reclaim is best effort, not guaranteed.
> > >
> > > I don't quite get what you mean about fairness. Do you mean they may
> > > spend excessive cpu time then cause other processes starvation? I
> > > think this could be mitigated by properly organizing and setting
> > > groups. But, I agree this is tricky.
> >
> > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > reclaim and non-deterministic behavior.
> 
> Yes, the cost depends on the state of pages, but I still don't quite
> understand what does "unfair" refer to in this context. Do you mean
> some cgroups may reclaim much more than others?

> Or the work may take too long so it can't not serve other cgroups in time?

exactly.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-05 14:32                 ` Michal Hocko
@ 2019-08-05 19:24                   ` Shakeel Butt
  2019-08-06  3:28                   ` Yang Shi
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2019-08-05 19:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, Konstantin Khlebnikov, Linux MM,
	Linux Kernel Mailing List, Cgroups, Vladimir Davydov,
	Johannes Weiner

On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > [...]
> > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > to background reclaim, direct reclaim may run into the same problem.
> > > > >
> > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > You can have thousands of memcgs and I do not think we really do want
> > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > get into a tricky land where a determinism/fairness would be non trivial
> > > > > to achieve. Direct reclaim, on the other hand is bound by the workload
> > > > > itself.
> > > >
> > > > Yes, I agree thread pool would introduce more latency than dedicated
> > > > kswapd thread. But, it looks not that bad in our test. When memory
> > > > allocation is fast, even though dedicated kswapd thread can't catch
> > > > up. So, such background reclaim is best effort, not guaranteed.
> > > >
> > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > spend excessive cpu time then cause other processes starvation? I
> > > > think this could be mitigated by properly organizing and setting
> > > > groups. But, I agree this is tricky.
> > >
> > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > reclaim and non-deterministic behavior.
> >
> > Yes, the cost depends on the state of pages, but I still don't quite
> > understand what does "unfair" refer to in this context. Do you mean
> > some cgroups may reclaim much more than others?
>
> > Or the work may take too long so it can't not serve other cgroups in time?
>
> exactly.
>

How about allowing users to implement their own user-space kswapd?
A memcg interface similar to MADV_PAGEOUT. Users could register for the
MEMCG_HIGH notification (it needs some modification) and, on receiving
the notification, the uswapd (user's kswapd) would trigger reclaim
through memory.pageout (or memory.try_to_free_pages). One can argue:
why not just use MADV_PAGEOUT? In a real workload, a job can be a
combination of different sub-jobs and most probably does not know the
importance of the memory layout of the tasks of those sub-jobs, so a
memcg-level interface makes more sense there.

Shakeel

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-05 14:32                 ` Michal Hocko
  2019-08-05 19:24                   ` Shakeel Butt
@ 2019-08-06  3:28                   ` Yang Shi
  2019-08-06  7:05                     ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Yang Shi @ 2019-08-06  3:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > [...]
> > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > to background reclaim, direct reclaim may run into the same problem.
> > > > >
> > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > You can have thousands of memcgs and I do not think we really do want
> > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > get into a tricky land where a determinism/fairness would be non trivial
> > > > > to achieve. Direct reclaim, on the other hand is bound by the workload
> > > > > itself.
> > > >
> > > > Yes, I agree thread pool would introduce more latency than dedicated
> > > > kswapd thread. But, it looks not that bad in our test. When memory
> > > > allocation is fast, even though dedicated kswapd thread can't catch
> > > > up. So, such background reclaim is best effort, not guaranteed.
> > > >
> > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > spend excessive cpu time then cause other processes starvation? I
> > > > think this could be mitigated by properly organizing and setting
> > > > groups. But, I agree this is tricky.
> > >
> > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > reclaim and non-deterministic behavior.
> >
> > Yes, the cost depends on the state of pages, but I still don't quite
> > understand what does "unfair" refer to in this context. Do you mean
> > some cgroups may reclaim much more than others?
>
> > Or the work may take too long so it can't not serve other cgroups in time?
>
> exactly.

Actually, I'm not very concerned by this. In our design each memcg has
its own dedicated work item (memcg->wmark_work), so the reclaim work for
different memcgs can run in parallel: they are *different* work items
even though they run the same function. And we could queue
them to a dedicated unbound workqueue, which may allow a maximum of 512
active works or scale that with the number of CPUs. Although the system may
have thousands of online memcgs, I suppose it should be rare for all
of them to trigger reclaim at the same time.

> --
> Michal Hocko
> SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-06  3:28                   ` Yang Shi
@ 2019-08-06  7:05                     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2019-08-06  7:05 UTC (permalink / raw)
  To: Yang Shi
  Cc: Konstantin Khlebnikov, Linux MM, Linux Kernel Mailing List,
	cgroups, Vladimir Davydov, Johannes Weiner

On Mon 05-08-19 20:28:40, Yang Shi wrote:
> On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > > [...]
> > > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > > to background reclaim, direct reclaim may run into the same problem.
> > > > > >
> > > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > > You can have thousands of memcgs and I do not think we really do want
> > > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > > get into a tricky land where a determinism/fairness would be non trivial
> > > > > > to achieve. Direct reclaim, on the other hand is bound by the workload
> > > > > > itself.
> > > > >
> > > > > Yes, I agree thread pool would introduce more latency than dedicated
> > > > > kswapd thread. But, it looks not that bad in our test. When memory
> > > > > allocation is fast, even though dedicated kswapd thread can't catch
> > > > > up. So, such background reclaim is best effort, not guaranteed.
> > > > >
> > > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > > spend excessive cpu time then cause other processes starvation? I
> > > > > think this could be mitigated by properly organizing and setting
> > > > > groups. But, I agree this is tricky.
> > > >
> > > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > > reclaim and non-deterministic behavior.
> > >
> > > Yes, the cost depends on the state of pages, but I still don't quite
> > > understand what does "unfair" refer to in this context. Do you mean
> > > some cgroups may reclaim much more than others?
> >
> > > Or the work may take too long so it can't not serve other cgroups in time?
> >
> > exactly.
> 
> Actually, I'm not very concerned by this. In our design each memcg has
> its dedicated work (memcg->wmark_work), so the reclaim work for
> different memcgs could be run in parallel since they are *different*
> work in fact although they run the same function. And, We could queue
> them to a dedicated unbound workqueue which may have maximum 512 or
> scale with nr cpus active works. Although the system may have
> thousands of online memcgs, I'm supposed it should be rare to have all
> of them trigger reclaim at the same time.

I do believe that it might work for your particular usecase but I do not
think this is robust enough for the upstream kernel, I am afraid.

As I've said I am open to discuss an opt-in per memcg pro-active reclaim
(a kernel thread that belongs to the memcg) but it has to be a dedicated
worker bound by all the cgroup resource restrictions.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-02 11:44         ` Michal Hocko
@ 2019-08-06  7:07           ` Michal Hocko
  2019-08-06  7:19             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-06  7:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Johannes Weiner, linux-mm, linux-kernel, cgroups, Vladimir Davydov

On Fri 02-08-19 13:44:38, Michal Hocko wrote:
[...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ba9138a4a1de..53a35c526e43 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >   				schedule_work(&memcg->high_work);
> > >   				break;
> > >   			}
> > > -			current->memcg_nr_pages_over_high += batch;
> > > -			set_notify_resume(current);
> > > +			if (gfpflags_allow_blocking(gfp_mask)) {
> > > +				reclaim_high(memcg, nr_pages, GFP_KERNEL);
> 
> ups, this should be s@GFP_KERNEL@gfp_mask@
> 
> > > +			} else {
> > > +				current->memcg_nr_pages_over_high += batch;
> > > +				set_notify_resume(current);
> > > +			}
> > >   			break;
> > >   		}
> > >   	} while ((memcg = parent_mem_cgroup(memcg)));
> > > 

Should I send an official patch for this?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-06  7:07           ` Michal Hocko
@ 2019-08-06  7:19             ` Konstantin Khlebnikov
  2019-08-06  7:36               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Khlebnikov @ 2019-08-06  7:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, cgroups, Vladimir Davydov

On 8/6/19 10:07 AM, Michal Hocko wrote:
> On Fri 02-08-19 13:44:38, Michal Hocko wrote:
> [...]
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index ba9138a4a1de..53a35c526e43 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>>    				schedule_work(&memcg->high_work);
>>>>    				break;
>>>>    			}
>>>> -			current->memcg_nr_pages_over_high += batch;
>>>> -			set_notify_resume(current);
>>>> +			if (gfpflags_allow_blocking(gfp_mask)) {
>>>> +				reclaim_high(memcg, nr_pages, GFP_KERNEL);
>>
>> ups, this should be s@GFP_KERNEL@gfp_mask@
>>
>>>> +			} else {
>>>> +				current->memcg_nr_pages_over_high += batch;
>>>> +				set_notify_resume(current);
>>>> +			}
>>>>    			break;
>>>>    		}
>>>>    	} while ((memcg = parent_mem_cgroup(memcg)));
>>>>
> 
> Should I send an official patch for this?
> 

I prefer to keep it as is while we have no better solution.

* Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
  2019-08-06  7:19             ` Konstantin Khlebnikov
@ 2019-08-06  7:36               ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2019-08-06  7:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Johannes Weiner, linux-mm, linux-kernel, cgroups, Vladimir Davydov

On Tue 06-08-19 10:19:49, Konstantin Khlebnikov wrote:
> On 8/6/19 10:07 AM, Michal Hocko wrote:
> > On Fri 02-08-19 13:44:38, Michal Hocko wrote:
> > [...]
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index ba9138a4a1de..53a35c526e43 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > >    				schedule_work(&memcg->high_work);
> > > > >    				break;
> > > > >    			}
> > > > > -			current->memcg_nr_pages_over_high += batch;
> > > > > -			set_notify_resume(current);
> > > > > +			if (gfpflags_allow_blocking(gfp_mask)) {
> > > > > +				reclaim_high(memcg, nr_pages, GFP_KERNEL);
> > > 
> > > ups, this should be s@GFP_KERNEL@gfp_mask@
> > > 
> > > > > +			} else {
> > > > > +				current->memcg_nr_pages_over_high += batch;
> > > > > +				set_notify_resume(current);
> > > > > +			}
> > > > >    			break;
> > > > >    		}
> > > > >    	} while ((memcg = parent_mem_cgroup(memcg)));
> > > > > 
> > 
> > Should I send an official patch for this?
> > 
> 
> I prefer to keep it as is while we have no better solution.

Fine with me.

-- 
Michal Hocko
SUSE Labs

end of thread, other threads:[~2019-08-06  7:36 UTC | newest]

Thread overview: 22+ messages
2019-07-28 12:29 [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop Konstantin Khlebnikov
2019-07-29  9:17 ` Michal Hocko
2019-07-29  9:40   ` Konstantin Khlebnikov
2019-07-29 10:33     ` Michal Hocko
2019-07-29 11:24       ` Konstantin Khlebnikov
2019-07-29 17:28       ` Yang Shi
2019-07-29 18:48         ` Michal Hocko
2019-08-01 21:00           ` Yang Shi
2019-08-02  9:35             ` Michal Hocko
2019-08-02 18:56               ` Yang Shi
2019-08-05 14:32                 ` Michal Hocko
2019-08-05 19:24                   ` Shakeel Butt
2019-08-06  3:28                   ` Yang Shi
2019-08-06  7:05                     ` Michal Hocko
2019-07-29 15:49 ` Johannes Weiner
2019-07-29 18:55   ` Michal Hocko
2019-08-02  9:40     ` Michal Hocko
2019-08-02 10:01       ` Konstantin Khlebnikov
2019-08-02 11:44         ` Michal Hocko
2019-08-06  7:07           ` Michal Hocko
2019-08-06  7:19             ` Konstantin Khlebnikov
2019-08-06  7:36               ` Michal Hocko
