linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
@ 2019-10-04 13:09 Konstantin Khlebnikov
  2019-10-04 13:12 ` Michal Hocko
  2019-10-05 19:35 ` Andrew Morton
  0 siblings, 2 replies; 8+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-04 13:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton; +Cc: Michal Hocko, linux-kernel, Matthew Wilcox

This is a very slow operation. There is no reason to do it again if somebody
else already drained all per-cpu vectors while we waited for the lock.

Piggyback on a drain that started and finished while we waited for the lock:
all pages that were pending at the time we entered have already been drained
from the vectors.

Callers like POSIX_FADV_DONTNEED retry their operations once after
draining the per-cpu vectors when pages have unexpected references.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/swap.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 38c3fa4308e2..5ba948a9d82a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 void lru_add_drain_all(void)
 {
+	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu;
+	int cpu, seq;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -719,7 +720,19 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	seq = raw_read_seqcount_latch(&seqcount);
+
 	mutex_lock(&lock);
+
+	/*
+	 * Piggyback on drain started and finished while we waited for lock:
+	 * all pages pended at the time of our enter were drained from vectors.
+	 */
+	if (__read_seqcount_retry(&seqcount, seq))
+		goto done;
+
+	raw_write_seqcount_latch(&seqcount);
+
 	cpumask_clear(&has_work);
 
 	for_each_online_cpu(cpu) {
@@ -740,6 +753,7 @@ void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+done:
 	mutex_unlock(&lock);
 }
 #else
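
For illustration only, here is a userspace analogy of the piggyback scheme,
using a plain pthread mutex plus a C11 atomic generation counter in place of
the seqcount above (drain_all() and drain_everything() are made-up names for
this sketch, not kernel interfaces):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint generation;		/* bumped when a drain starts */

static void drain_everything(void)
{
	puts("draining");		/* stands in for the per-cpu flush */
}

void drain_all(void)
{
	/* snapshot the generation before sleeping on the mutex */
	unsigned int seq = atomic_load(&generation);

	pthread_mutex_lock(&lock);

	/*
	 * If the generation moved while we waited, a drain started after
	 * our snapshot and, since it ran under the same mutex, finished
	 * before we got here: everything that was pending when we entered
	 * is already gone.
	 */
	if (atomic_load(&generation) != seq)
		goto done;

	atomic_fetch_add(&generation, 1);	/* a new drain starts now */
	drain_everything();
done:
	pthread_mutex_unlock(&lock);
}

Concurrent callers still serialize on the mutex, but any caller that slept
through a complete drain returns without repeating the slow part.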


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 13:09 [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls Konstantin Khlebnikov
@ 2019-10-04 13:12 ` Michal Hocko
  2019-10-04 13:32   ` Konstantin Khlebnikov
  2019-10-05 19:35 ` Andrew Morton
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-10-04 13:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Matthew Wilcox

On Fri 04-10-19 16:09:22, Konstantin Khlebnikov wrote:
> This is very slow operation. There is no reason to do it again if somebody
> else already drained all per-cpu vectors while we waited for lock.
> 
> Piggyback on drain started and finished while we waited for lock:
> all pages pended at the time of our enter were drained from vectors.
> 
> Callers like POSIX_FADV_DONTNEED retry their operations once after
> draining per-cpu vectors when pages have unexpected references.

This describes why we need to wait for preexisting pages on the pvecs, but
the changelog doesn't say anything about the improvements this leads to.
In other words, what kind of workloads benefit from it?

> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  mm/swap.c |   16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 38c3fa4308e2..5ba948a9d82a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {
> +	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
>  	static DEFINE_MUTEX(lock);
>  	static struct cpumask has_work;
> -	int cpu;
> +	int cpu, seq;
>  
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -719,7 +720,19 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>  
> +	seq = raw_read_seqcount_latch(&seqcount);
> +
>  	mutex_lock(&lock);
> +
> +	/*
> +	 * Piggyback on drain started and finished while we waited for lock:
> +	 * all pages pended at the time of our enter were drained from vectors.
> +	 */
> +	if (__read_seqcount_retry(&seqcount, seq))
> +		goto done;
> +
> +	raw_write_seqcount_latch(&seqcount);
> +
>  	cpumask_clear(&has_work);
>  
>  	for_each_online_cpu(cpu) {
> @@ -740,6 +753,7 @@ void lru_add_drain_all(void)
>  	for_each_cpu(cpu, &has_work)
>  		flush_work(&per_cpu(lru_add_drain_work, cpu));
>  
> +done:
>  	mutex_unlock(&lock);
>  }
>  #else
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 13:12 ` Michal Hocko
@ 2019-10-04 13:32   ` Konstantin Khlebnikov
  2019-10-04 13:39     ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-04 13:32 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, linux-kernel, Matthew Wilcox

On 04/10/2019 16.12, Michal Hocko wrote:
> On Fri 04-10-19 16:09:22, Konstantin Khlebnikov wrote:
>> This is very slow operation. There is no reason to do it again if somebody
>> else already drained all per-cpu vectors while we waited for lock.
>>
>> Piggyback on drain started and finished while we waited for lock:
>> all pages pended at the time of our enter were drained from vectors.
>>
>> Callers like POSIX_FADV_DONTNEED retry their operations once after
>> draining per-cpu vectors when pages have unexpected references.
> 
> This describes why we need to wait for preexisted pages on the pvecs but
> the changelog doesn't say anything about improvements this leads to.
> In other words what kind of workloads benefit from it?

Right now POSIX_FADV_DONTNEED is the top user, because it has to freeze the
page reference when removing a page from the cache. invalidate_bdev calls it
for the same reason. Both are triggered from userspace, so it's easy to
generate a storm; a sketch of one follows below.

mlock/mlockall no longer call lru_add_drain_all - I've seen a serious
slowdown here on older kernels.

There are also some less obvious paths in memory migration/CMA/offlining,
but those shouldn't be called frequently.
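
A minimal userspace sketch of such a storm (the file path is made up; any
large, already page-cached file works):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical big, already page-cached file */
	int fd = open("/var/log/big.log", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* each call may end up in lru_add_drain_all() for pages that
	 * still sit in per-cpu vectors */
	for (int i = 0; i < 100000; i++)
		posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	close(fd);
	return 0;
}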

> 
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> ---
>>   mm/swap.c |   16 +++++++++++++++-
>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 38c3fa4308e2..5ba948a9d82a 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>    */
>>   void lru_add_drain_all(void)
>>   {
>> +	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
>>   	static DEFINE_MUTEX(lock);
>>   	static struct cpumask has_work;
>> -	int cpu;
>> +	int cpu, seq;
>>   
>>   	/*
>>   	 * Make sure nobody triggers this path before mm_percpu_wq is fully
>> @@ -719,7 +720,19 @@ void lru_add_drain_all(void)
>>   	if (WARN_ON(!mm_percpu_wq))
>>   		return;
>>   
>> +	seq = raw_read_seqcount_latch(&seqcount);
>> +
>>   	mutex_lock(&lock);
>> +
>> +	/*
>> +	 * Piggyback on drain started and finished while we waited for lock:
>> +	 * all pages pended at the time of our enter were drained from vectors.
>> +	 */
>> +	if (__read_seqcount_retry(&seqcount, seq))
>> +		goto done;
>> +
>> +	raw_write_seqcount_latch(&seqcount);
>> +
>>   	cpumask_clear(&has_work);
>>   
>>   	for_each_online_cpu(cpu) {
>> @@ -740,6 +753,7 @@ void lru_add_drain_all(void)
>>   	for_each_cpu(cpu, &has_work)
>>   		flush_work(&per_cpu(lru_add_drain_work, cpu));
>>   
>> +done:
>>   	mutex_unlock(&lock);
>>   }
>>   #else
>>
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 13:32   ` Konstantin Khlebnikov
@ 2019-10-04 13:39     ` Michal Hocko
  2019-10-04 14:06       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-10-04 13:39 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Matthew Wilcox

On Fri 04-10-19 16:32:39, Konstantin Khlebnikov wrote:
> On 04/10/2019 16.12, Michal Hocko wrote:
> > On Fri 04-10-19 16:09:22, Konstantin Khlebnikov wrote:
> > > This is very slow operation. There is no reason to do it again if somebody
> > > else already drained all per-cpu vectors while we waited for lock.
> > > 
> > > Piggyback on drain started and finished while we waited for lock:
> > > all pages pended at the time of our enter were drained from vectors.
> > > 
> > > Callers like POSIX_FADV_DONTNEED retry their operations once after
> > > draining per-cpu vectors when pages have unexpected references.
> > 
> > This describes why we need to wait for preexisted pages on the pvecs but
> > the changelog doesn't say anything about improvements this leads to.
> > In other words what kind of workloads benefit from it?
> 
> Right now POSIX_FADV_DONTNEED is top user because it have to freeze page
> reference when removes it from cache. invalidate_bdev calls it for same reason.
> Both are triggered from userspace, so it's easy to generate storm.
> 
> mlock/mlockall no longer calls lru_add_drain_all - I've seen here
> serious slowdown on older kernel.
> 
> There are some less obvious paths in memory migration/CMA/offlining
> which shouldn't be called frequently.

Can you back those claims by any numbers?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 13:39     ` Michal Hocko
@ 2019-10-04 14:06       ` Konstantin Khlebnikov
  2019-10-07 12:50         ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-04 14:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, linux-kernel, Matthew Wilcox

On 04/10/2019 16.39, Michal Hocko wrote:
> On Fri 04-10-19 16:32:39, Konstantin Khlebnikov wrote:
>> On 04/10/2019 16.12, Michal Hocko wrote:
>>> On Fri 04-10-19 16:09:22, Konstantin Khlebnikov wrote:
>>>> This is very slow operation. There is no reason to do it again if somebody
>>>> else already drained all per-cpu vectors while we waited for lock.
>>>>
>>>> Piggyback on drain started and finished while we waited for lock:
>>>> all pages pended at the time of our enter were drained from vectors.
>>>>
>>>> Callers like POSIX_FADV_DONTNEED retry their operations once after
>>>> draining per-cpu vectors when pages have unexpected references.
>>>
>>> This describes why we need to wait for preexisted pages on the pvecs but
>>> the changelog doesn't say anything about improvements this leads to.
>>> In other words what kind of workloads benefit from it?
>>
>> Right now POSIX_FADV_DONTNEED is top user because it have to freeze page
>> reference when removes it from cache. invalidate_bdev calls it for same reason.
>> Both are triggered from userspace, so it's easy to generate storm.
>>
>> mlock/mlockall no longer calls lru_add_drain_all - I've seen here
>> serious slowdown on older kernel.
>>
>> There are some less obvious paths in memory migration/CMA/offlining
>> which shouldn't be called frequently.
> 
> Can you back those claims by any numbers?
> 

Well, the worst case requires a non-trivial workload, because lru_add_drain_all
skips cpus where the vectors are empty. Something must constantly generate a
flow of pages on each cpu. The cpus also have to be busy, so that scheduling
the per-cpu works is slower. And the machine must be big enough (64+ cpus in
our case).

In our case it was a massive series of mlock calls in map-reduce while other
tasks wrote logs (and so kept generating new pages in the per-cpu vectors).
The mlock calls were serialized by a mutex and accumulated latencies of up to
10 seconds and more. A rough sketch of that workload is below.

The kernel has not called lru_add_drain_all on the mlock paths since 4.15, but
the same scenario can be triggered by fadvise(POSIX_FADV_DONTNEED) or any
other remaining user.
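
A rough reproducer sketch, with made-up sizes and a hypothetical log path (on
kernels that no longer drain on mlock, the locker loop would have to use
fadvise(POSIX_FADV_DONTNEED) on a cached file instead):

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NLOCKERS	64
#define CHUNK		(1 << 20)

/* many threads mlock freshly faulted memory */
static void *locker(void *arg)
{
	for (;;) {
		void *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			continue;
		memset(p, 1, CHUNK);
		mlock(p, CHUNK);	/* used to call lru_add_drain_all() */
		munlock(p, CHUNK);
		munmap(p, CHUNK);
	}
	return NULL;
}

/* one writer keeps refilling per-cpu vectors with new page cache pages */
static void *log_writer(void *arg)
{
	char buf[4096];
	int fd = open("/tmp/fake.log", O_WRONLY | O_CREAT | O_APPEND, 0600);

	memset(buf, 'x', sizeof(buf));
	while (fd >= 0)
		write(fd, buf, sizeof(buf));
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, log_writer, NULL);
	for (int i = 0; i < NLOCKERS; i++)
		pthread_create(&tid, NULL, locker, NULL);
	pause();
	return 0;
}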

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 13:09 [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls Konstantin Khlebnikov
  2019-10-04 13:12 ` Michal Hocko
@ 2019-10-05 19:35 ` Andrew Morton
  2019-10-05 20:03   ` Konstantin Khlebnikov
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2019-10-05 19:35 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Michal Hocko, linux-kernel, Matthew Wilcox

On Fri, 04 Oct 2019 16:09:22 +0300 Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:

> This is very slow operation. There is no reason to do it again if somebody
> else already drained all per-cpu vectors while we waited for lock.
> 
> Piggyback on drain started and finished while we waited for lock:
> all pages pended at the time of our enter were drained from vectors.
> 
> Callers like POSIX_FADV_DONTNEED retry their operations once after
> draining per-cpu vectors when pages have unexpected references.
> 
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {
> +	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
>  	static DEFINE_MUTEX(lock);
>  	static struct cpumask has_work;
> -	int cpu;
> +	int cpu, seq;
>  
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -719,7 +720,19 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>  
> +	seq = raw_read_seqcount_latch(&seqcount);
> +
>  	mutex_lock(&lock);
> +
> +	/*
> +	 * Piggyback on drain started and finished while we waited for lock:
> +	 * all pages pended at the time of our enter were drained from vectors.
> +	 */
> +	if (__read_seqcount_retry(&seqcount, seq))
> +		goto done;
> +
> +	raw_write_seqcount_latch(&seqcount);
> +
>  	cpumask_clear(&has_work);
>  
>  	for_each_online_cpu(cpu) {
> @@ -740,6 +753,7 @@ void lru_add_drain_all(void)
>  	for_each_cpu(cpu, &has_work)
>  		flush_work(&per_cpu(lru_add_drain_work, cpu));
>  
> +done:
>  	mutex_unlock(&lock);
>  }

I'm not sure this works as intended.

Suppose CPU #30 is presently executing the for_each_online_cpu() loop
and has reached CPU #15's per-cpu data.

Now CPU #2 comes along, adds some pages to its per-cpu vectors then
calls lru_add_drain_all().  AFAICT the code will assume that CPU #30
has flushed out all of the pages which CPU #2 just added, but that
isn't the case.

Moving the raw_write_seqcount_latch() to the point where all processing
has completed might fix this?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-05 19:35 ` Andrew Morton
@ 2019-10-05 20:03   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 8+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-05 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, linux-mm, Michal Hocko,
	Linux Kernel Mailing List, Matthew Wilcox

On Sat, Oct 5, 2019 at 10:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 04 Oct 2019 16:09:22 +0300 Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
>
> > This is very slow operation. There is no reason to do it again if somebody
> > else already drained all per-cpu vectors while we waited for lock.
> >
> > Piggyback on drain started and finished while we waited for lock:
> > all pages pended at the time of our enter were drained from vectors.
> >
> > Callers like POSIX_FADV_DONTNEED retry their operations once after
> > draining per-cpu vectors when pages have unexpected references.
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >   */
> >  void lru_add_drain_all(void)
> >  {
> > +     static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> >       static DEFINE_MUTEX(lock);
> >       static struct cpumask has_work;
> > -     int cpu;
> > +     int cpu, seq;
> >
> >       /*
> >        * Make sure nobody triggers this path before mm_percpu_wq is fully
> > @@ -719,7 +720,19 @@ void lru_add_drain_all(void)
> >       if (WARN_ON(!mm_percpu_wq))
> >               return;
> >
> > +     seq = raw_read_seqcount_latch(&seqcount);
> > +
> >       mutex_lock(&lock);
> > +
> > +     /*
> > +      * Piggyback on drain started and finished while we waited for lock:
> > +      * all pages pended at the time of our enter were drained from vectors.
> > +      */
> > +     if (__read_seqcount_retry(&seqcount, seq))
> > +             goto done;
> > +
> > +     raw_write_seqcount_latch(&seqcount);
> > +
> >       cpumask_clear(&has_work);
> >
> >       for_each_online_cpu(cpu) {
> > @@ -740,6 +753,7 @@ void lru_add_drain_all(void)
> >       for_each_cpu(cpu, &has_work)
> >               flush_work(&per_cpu(lru_add_drain_work, cpu));
> >
> > +done:
> >       mutex_unlock(&lock);
> >  }
>
> I'm not sure this works as intended.
>
> Suppose CPU #30 is presently executing the for_each_online_cpu() loop
> and has reached CPU #15's per-cpu data.
>
> Now CPU #2 comes along, adds some pages to its per-cpu vectors then
> calls lru_add_drain_all().  AFAICT the code will assume that CPU #30
> has flushed out all of the pages which CPU #2 just added, but that
> isn't the case.
>
> Moving the raw_write_seqcount_latch() to the point where all processing
> has completed might fix?
>
>

No, raw_write_seqcount_latch() should be exactly before draining.

Here the seqcount works as a generation counter for the pages that could be
in the vectors. All steps are serialized by the mutex: only after taking the
lock can we be sure that all previous generations are gone.

In your example CPU #2 will see the same generation at entry and after taking
the lock, so it will not piggyback and will drain its own pages.
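
A possible interleaving for that example, with the latch taken before the
drain (the sequence values are just illustrative):

  CPU #30                                  CPU #2
  seq = read(seqcount)        /* 4 */
  mutex_lock()
  write_latch(seqcount)       /* 5 */      adds pages to its pvecs
  drains cpus ... #15 ...                  seq = read(seqcount)  /* 5 */
                                           mutex_lock()  /* blocks */
  mutex_unlock()
                                           retry check: 5 == seq
                                           -> no piggyback, drains itself

CPU #2 can only skip the drain if some later caller latched, took the mutex
and drained completely after CPU #2's own snapshot.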

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls
  2019-10-04 14:06       ` Konstantin Khlebnikov
@ 2019-10-07 12:50         ` Michal Hocko
  0 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2019-10-07 12:50 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Matthew Wilcox

On Fri 04-10-19 17:06:13, Konstantin Khlebnikov wrote:
> On 04/10/2019 16.39, Michal Hocko wrote:
> > On Fri 04-10-19 16:32:39, Konstantin Khlebnikov wrote:
> > > On 04/10/2019 16.12, Michal Hocko wrote:
> > > > On Fri 04-10-19 16:09:22, Konstantin Khlebnikov wrote:
> > > > > This is very slow operation. There is no reason to do it again if somebody
> > > > > else already drained all per-cpu vectors while we waited for lock.
> > > > > 
> > > > > Piggyback on drain started and finished while we waited for lock:
> > > > > all pages pended at the time of our enter were drained from vectors.
> > > > > 
> > > > > Callers like POSIX_FADV_DONTNEED retry their operations once after
> > > > > draining per-cpu vectors when pages have unexpected references.
> > > > 
> > > > This describes why we need to wait for preexisted pages on the pvecs but
> > > > the changelog doesn't say anything about improvements this leads to.
> > > > In other words what kind of workloads benefit from it?
> > > 
> > > Right now POSIX_FADV_DONTNEED is top user because it have to freeze page
> > > reference when removes it from cache. invalidate_bdev calls it for same reason.
> > > Both are triggered from userspace, so it's easy to generate storm.
> > > 
> > > mlock/mlockall no longer calls lru_add_drain_all - I've seen here
> > > serious slowdown on older kernel.
> > > 
> > > There are some less obvious paths in memory migration/CMA/offlining
> > > which shouldn't be called frequently.
> > 
> > Can you back those claims by any numbers?
> > 
> 
> Well, worst case requires non-trivial workload because lru_add_drain_all
> skips cpus where vectors are empty. Something must constantly generates
> flow of pages at each cpu. Also cpus must be busy to make scheduling per-cpu
> works slower. And machine must be big enough (64+ cpus in our case).
> 
> In our case that was massive series of mlock calls in map-reduce while other
> tasks writes log (and generates flow of new pages in per-cpu vectors). Mlock
> calls were serialized by mutex and accumulated latency up to 10 second and more.

This is very useful information!

> Kernel does not call lru_add_drain_all on mlock paths since 4.15, but same scenario
> could be triggered by fadvise(POSIX_FADV_DONTNEED) or any other remaining user.

OK, so I read it as: you are unlikely to hit problems with the current
tree, but they are still possible in principle. That is useful
information as well. All of that belongs in the changelog. Do not leave
us guessing, or make future generations scratch their heads over WTH is
going on with that weird code.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-10-07 12:50 UTC | newest]

Thread overview: 8+ messages
2019-10-04 13:09 [PATCH v2] mm/swap: piggyback lru_add_drain_all() calls Konstantin Khlebnikov
2019-10-04 13:12 ` Michal Hocko
2019-10-04 13:32   ` Konstantin Khlebnikov
2019-10-04 13:39     ` Michal Hocko
2019-10-04 14:06       ` Konstantin Khlebnikov
2019-10-07 12:50         ` Michal Hocko
2019-10-05 19:35 ` Andrew Morton
2019-10-05 20:03   ` Konstantin Khlebnikov
