* Lock overhead in shrink_inactive_list / Slow page reclamation
@ 2019-01-11  5:52 Baptiste Lepers
  2019-01-11 13:59 ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Baptiste Lepers @ 2019-01-11  5:52 UTC (permalink / raw)
  To: mgorman, akpm, dhowells, linux-mm, hannes

Hello,

We have a performance issue with the page cache. One of our workloads
spends more than 50% of its time on the lru_lock taken by
shrink_inactive_list in mm/vmscan.c.

The workload is simple but stresses the page cache a lot: a big file
is mmapped and multiple threads stream chunks of the file; the chunk
sizes range from a few KB to a few MB. The file is about 1TB and is
stored on a very fast SSD (2.6GB/s bandwidth). Our machine has 64GB of
RAM. We rely on the page cache to cache data, but obviously pages have
to be reclaimed quite often to make room for new data. The workload is
*read only* so we would expect page reclamation to be fast, but it's
not. In some runs the page cache only reclaims pages at 500-600MB/s.
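
In case it is useful, the access pattern looks roughly like this (a
minimal single-threaded sketch; the file name and chunk size are
placeholders, error handling omitted):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t file_size = 1UL << 40;   /* ~1TB file */
            size_t chunk = 2UL << 20;       /* a few KB..MB per chunk */
            int fd = open("/data/bigfile", O_RDONLY);
            char *map = mmap(NULL, file_size, PROT_READ, MAP_SHARED,
                             fd, 0);
            volatile char sink;

            /* in the real program, several threads each stream their
             * own ranges of chunks like this: */
            for (size_t off = 0; off < file_size; off += chunk)
                    for (size_t i = 0; i < chunk; i += 4096)
                            sink = map[off + i]; /* fault the page in */

            munmap(map, file_size);
            close(fd);
            return 0;
    }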

We have tried playing with fadvise to speed up page reclamation (e.g.,
using POSIX_FADV_DONTNEED), but that didn't help.
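
For reference, what we tried looks roughly like this (simplified; the
offset and length are whatever chunk a thread just finished):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    /* hint that a consumed chunk can be dropped from the page cache */
    static void drop_chunk(int fd, off_t offset, off_t length)
    {
            posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
    }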

Increasing the value of SWAP_CLUSTER_MAX to 256UL helped (as suggested
here https://lkml.org/lkml/2015/7/6/440), but we are still spending
most of the time waiting for the page cache to reclaim pages.
Increasing the value beyond 256 doesn't help -- the
shrink_inactive_list function never reclaims more than a few
hundred pages at a time. (I don't know why, and I'm not sure how to
profile why this is the case, but I'm willing to spend time debugging
the issue if you have ideas.)
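
For reference, the change is the one-liner below in
include/linux/swap.h (32UL is the default; the exact context may
differ between kernel versions):

    -#define SWAP_CLUSTER_MAX 32UL
    +#define SWAP_CLUSTER_MAX 256UL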

Any idea of anything else we could try to speed up page reclamation?

Thanks,
Baptiste.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
  2019-01-11  5:52 Lock overhead in shrink_inactive_list / Slow page reclamation Baptiste Lepers
@ 2019-01-11 13:59 ` Michal Hocko
  2019-01-11 17:53   ` Daniel Jordan
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2019-01-11 13:59 UTC (permalink / raw)
  To: Baptiste Lepers; +Cc: mgorman, akpm, dhowells, linux-mm, hannes

On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> Hello,
> 
> We have a performance issue with the page cache. One of our workloads
> spends more than 50% of its time on the lru_lock taken by
> shrink_inactive_list in mm/vmscan.c.

Who does contend on the lock? Are there direct reclaimers or is it
solely kswapd with paths that are faulting the new page cache in?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
  2019-01-11 13:59 ` Michal Hocko
@ 2019-01-11 17:53   ` Daniel Jordan
  2019-01-13 23:12       ` Baptiste Lepers
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Jordan @ 2019-01-11 17:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Baptiste Lepers, mgorman, akpm, dhowells, linux-mm, hannes

On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
> On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> > Hello,
> > 
> > We have a performance issue with the page cache. One of our workloads
> > spends more than 50% of its time on the lru_lock taken by
> > shrink_inactive_list in mm/vmscan.c.
> 
> Who does contend on the lock? Are there direct reclaimers or is it
> solely kswapd with paths that are faulting the new page cache in?

Yes, and could you please post your performance data showing the time in
lru_lock?  Whatever you have is fine, but using perf with -g would give
callstacks and help answer Michal's question about who's contending.
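(For example: "perf record -a -g -- sleep 10" while the workload is
running, then "perf report --no-children" on the result.)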

Happy to help profile and debug offline.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
@ 2019-01-13 23:12       ` Baptiste Lepers
  0 siblings, 0 replies; 10+ messages in thread
From: Baptiste Lepers @ 2019-01-13 23:12 UTC (permalink / raw)
  To: Daniel Jordan; +Cc: Michal Hocko, mgorman, akpm, dhowells, linux-mm, hannes

On Sat, Jan 12, 2019 at 4:53 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
> > On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> > > Hello,
> > >
> > > We have a performance issue with the page cache. One of our workloads
> > > spends more than 50% of its time on the lru_lock taken by
> > > shrink_inactive_list in mm/vmscan.c.
> >
> > Who does contend on the lock? Are there direct reclaimers or is it
> > solely kswapd with paths that are faulting the new page cache in?
>
> Yes, and could you please post your performance data showing the time in
> lru_lock?  Whatever you have is fine, but using perf with -g would give
> callstacks and help answer Michal's question about who's contending.

Thanks for the quick answer.

The time spent in the lru_lock is mainly due to direct reclaimers
(reading an mmapped page triggers some readahead). We have tried
playing with the readahead values, but that doesn't change performance
much. We have disabled swap on the machine, so kswapd doesn't run.

Our programs run in memory cgroups, but I don't think that the issue
directly comes from cgroups (I might be wrong though).

Here is the callchain I get using perf report --no-children (also
pasted here: https://pastebin.com/151x4QhR ):

    44.30%  swapper      [kernel.vmlinux]  [k] intel_idle
    # The machine is mostly idle because it is waiting on that lru_lock,
which is the 2nd function in the report:
    10.98%  testradix    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
               |--10.33%--_raw_spin_lock_irq
               |          |
               |           --10.12%--shrink_inactive_list
               |                     shrink_node_memcg
               |                     shrink_node
               |                     do_try_to_free_pages
               |                     try_to_free_mem_cgroup_pages
               |                     try_charge
               |                     mem_cgroup_try_charge
               |                     __add_to_page_cache_locked
               |                     add_to_page_cache_lru
               |                     |
               |                     |--5.39%--ext4_mpage_readpages
               |                     |          ext4_readpages
               |                     |          __do_page_cache_readahead
               |                     |          |
               |                     |           --5.37%--ondemand_readahead
               |                     |                     page_cache_async_readahead
               |                     |                     filemap_fault
               |                     |                     ext4_filemap_fault
               |                     |                     __do_fault
               |                     |                     handle_pte_fault
               |                     |                     __handle_mm_fault
               |                     |                     handle_mm_fault
               |                     |                     __do_page_fault
               |                     |                     do_page_fault
               |                     |                     page_fault
               |                     |                     |
               |                     |                     |--4.23%-- <our app>


Thanks,

Baptiste.

>
> Happy to help profile and debug offline.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
  2019-01-13 23:12       ` Baptiste Lepers
  (?)
@ 2019-01-14  7:06       ` Michal Hocko
  2019-01-14  7:25           ` Baptiste Lepers
  -1 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2019-01-14  7:06 UTC (permalink / raw)
  To: Baptiste Lepers; +Cc: Daniel Jordan, mgorman, akpm, dhowells, linux-mm, hannes

On Mon 14-01-19 10:12:37, Baptiste Lepers wrote:
> On Sat, Jan 12, 2019 at 4:53 AM Daniel Jordan
> <daniel.m.jordan@oracle.com> wrote:
> >
> > On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
> > > On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> > > > Hello,
> > > >
> > > > We have a performance issue with the page cache. One of our workloads
> > > > spends more than 50% of its time on the lru_lock taken by
> > > > shrink_inactive_list in mm/vmscan.c.
> > >
> > > Who does contend on the lock? Are there direct reclaimers or is it
> > > solely kswapd with paths that are faulting the new page cache in?
> >
> > Yes, and could you please post your performance data showing the time in
> > lru_lock?  Whatever you have is fine, but using perf with -g would give
> > callstacks and help answer Michal's question about who's contending.
> 
> Thanks for the quick answer.
> 
> The time spent in the lru_lock is mainly due to direct reclaimers
> (reading an mmapped page triggers some readahead). We have tried
> playing with the readahead values, but that doesn't change performance
> much. We have disabled swap on the machine, so kswapd doesn't run.

kswapd runs even without swap storage.

> Our programs run in memory cgroups, but I don't think that the issue
> directly comes from cgroups (I might be wrong though).

Do you use a hard/high limit on those cgroups? Those would be a
source of the reclaim.

> Here is the callchain I get using perf report --no-children (also
> pasted here: https://pastebin.com/151x4QhR ):
> 
>     44.30%  swapper      [kernel.vmlinux]  [k] intel_idle
>     # The machine is mostly idle because it is waiting on that lru_lock,
> which is the 2nd function in the report:
>     10.98%  testradix    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>                |--10.33%--_raw_spin_lock_irq
>                |          |
>                |           --10.12%--shrink_inactive_list
>                |                     shrink_node_memcg
>                |                     shrink_node
>                |                     do_try_to_free_pages
>                |                     try_to_free_mem_cgroup_pages
>                |                     try_charge
>                |                     mem_cgroup_try_charge

And here it shows this is indeed the case. You are hitting the hard
limit and that causes direct reclaim to shrink the memcg.

If you do not really need strong isolation between cgroups then I
would suggest not setting the hard limit and instead relying on the
global memory reclaim to do the background reclaim, which is less
aggressive and more proactive.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
@ 2019-01-14  7:25           ` Baptiste Lepers
  0 siblings, 0 replies; 10+ messages in thread
From: Baptiste Lepers @ 2019-01-14  7:25 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Daniel Jordan, mgorman, akpm, dhowells, linux-mm, hannes

On Mon, Jan 14, 2019 at 6:06 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 14-01-19 10:12:37, Baptiste Lepers wrote:
> > On Sat, Jan 12, 2019 at 4:53 AM Daniel Jordan
> > <daniel.m.jordan@oracle.com> wrote:
> > >
> > > On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
> > > > On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> > > > > Hello,
> > > > >
> > > > > We have a performance issue with the page cache. One of our workloads
> > > > > spends more than 50% of its time on the lru_lock taken by
> > > > > shrink_inactive_list in mm/vmscan.c.
> > > >
> > > > Who does contend on the lock? Are there direct reclaimers or is it
> > > > solely kswapd with paths that are faulting the new page cache in?
> > >
> > > Yes, and could you please post your performance data showing the time in
> > > lru_lock?  Whatever you have is fine, but using perf with -g would give
> > > callstacks and help answer Michal's question about who's contending.
> >
> > Thanks for the quick answer.
> >
> > The time spent in the lru_lock is mainly due to direct reclaimers
> > (reading an mmapped page triggers some readahead). We have tried
> > playing with the readahead values, but that doesn't change performance
> > much. We have disabled swap on the machine, so kswapd doesn't run.
>
> kswapd runs even without swap storage.
>
> > Our programs run in memory cgroups, but I don't think that the issue
> > directly comes from cgroups (I might be wrong though).
>
> Do you use a hard/high limit on those cgroups? Those would be a
> source of the reclaim.
>
> > Here is the callchain I get using perf report --no-children (also
> > pasted here: https://pastebin.com/151x4QhR ):
> >
> >     44.30%  swapper      [kernel.vmlinux]  [k] intel_idle
> >     # The machine is mostly idle because it is waiting on that lru_lock,
> > which is the 2nd function in the report:
> >     10.98%  testradix    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
> >                |--10.33%--_raw_spin_lock_irq
> >                |          |
> >                |           --10.12%--shrink_inactive_list
> >                |                     shrink_node_memcg
> >                |                     shrink_node
> >                |                     do_try_to_free_pages
> >                |                     try_to_free_mem_cgroup_pages
> >                |                     try_charge
> >                |                     mem_cgroup_try_charge
>
> And here it shows this is indeed the case. You are hitting the hard
> limit and that causes direct reclaim to shrink the memcg.
>
> If you do not really need strong isolation between cgroups then I
> would suggest not setting the hard limit and instead relying on the
> global memory reclaim to do the background reclaim, which is less
> aggressive and more proactive.

Thanks for the suggestion.
We actually need the hard limit in that case, but the problem occurs
even without cgroups (we mmap a 1TB file and we only have 64GB of
RAM). Basically the page cache fills up quickly and then reading the
mmapped file becomes "slow" (400-500MB/s instead of the initial
2.6GB/s). I'm just wondering if there is a way to make page
reclamation a bit faster, especially given that our workload is read
only.

shrink_inactive_list only seems to reclaim 32 pages with the default
setting and takes the lru_lock twice to do that, so that's a lot of
locking per KB reclaimed. Increasing the SWAP_CLUSTER_MAX value helped
a bit, but this is still quite slow.
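
(With 4KB pages, 32 pages is only 128KB per batch; sustaining the
SSD's 2.6GB/s would mean on the order of 20,000 batches -- and twice
that many lru_lock acquisitions -- per second.)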

And thanks for the clarification about kswapd, I didn't know it was
running even without swap :)

Baptiste.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lock overhead in shrink_inactive_list / Slow page reclamation
  2019-01-14  7:25           ` Baptiste Lepers
  (?)
@ 2019-01-14  7:44           ` Michal Hocko
  -1 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2019-01-14  7:44 UTC (permalink / raw)
  To: Baptiste Lepers; +Cc: Daniel Jordan, mgorman, akpm, dhowells, linux-mm, hannes

On Mon 14-01-19 18:25:45, Baptiste Lepers wrote:
> On Mon, Jan 14, 2019 at 6:06 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 14-01-19 10:12:37, Baptiste Lepers wrote:
> > > On Sat, Jan 12, 2019 at 4:53 AM Daniel Jordan
> > > <daniel.m.jordan@oracle.com> wrote:
> > > >
> > > > On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
> > > > > On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
> > > > > > Hello,
> > > > > >
> > > > > > We have a performance issue with the page cache. One of our workloads
> > > > > > spends more than 50% of its time on the lru_lock taken by
> > > > > > shrink_inactive_list in mm/vmscan.c.
> > > > >
> > > > > Who does contend on the lock? Are there direct reclaimers or is it
> > > > > solely kswapd with paths that are faulting the new page cache in?
> > > >
> > > > Yes, and could you please post your performance data showing the time in
> > > > lru_lock?  Whatever you have is fine, but using perf with -g would give
> > > > callstacks and help answer Michal's question about who's contending.
> > >
> > > Thanks for the quick answer.
> > >
> > > The time spent in the lru_lock is mainly due to direct reclaimers
> > > (reading an mmapped page triggers some readahead). We have tried
> > > playing with the readahead values, but that doesn't change performance
> > > much. We have disabled swap on the machine, so kswapd doesn't run.
> >
> > kswapd runs even without swap storage.
> >
> > > Our programs run in memory cgroups, but I don't think that the issue
> > > directly comes from cgroups (I might be wrong though).
> >
> > Do you use a hard/high limit on those cgroups? Those would be a
> > source of the reclaim.
> >
> > > Here is the callchain I get using perf report --no-children (also
> > > pasted here: https://pastebin.com/151x4QhR ):
> > >
> > >     44.30%  swapper      [kernel.vmlinux]  [k] intel_idle
> > >     # The machine is mostly idle because it is waiting on that lru_lock,
> > > which is the 2nd function in the report:
> > >     10.98%  testradix    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
> > >                |--10.33%--_raw_spin_lock_irq
> > >                |          |
> > >                |           --10.12%--shrink_inactive_list
> > >                |                     shrink_node_memcg
> > >                |                     shrink_node
> > >                |                     do_try_to_free_pages
> > >                |                     try_to_free_mem_cgroup_pages
> > >                |                     try_charge
> > >                |                     mem_cgroup_try_charge
> >
> > And here it shows this is indeed the case. You are hitting the hard
> > limit and that causes direct reclaim to shrink the memcg.
> >
> > If you do not really need strong isolation between cgroups then I
> > would suggest not setting the hard limit and instead relying on the
> > global memory reclaim to do the background reclaim, which is less
> > aggressive and more proactive.
> 
> Thanks for the suggestion.
> We actually need the hard limit in that case, but the problem occurs
> even without cgroups (we mmap a 1TB file and we only have 64GB of
> RAM). Basically the page cache fills up quickly and then reading the
> mmapped file becomes "slow" (400-500MB/s instead of the initial
> 2.6GB/s). I'm just wondering if there is a way to make page
> reclamation a bit faster, especially given that our workload is read
> only.

Well, the clean page cache should be the simplest reclaim scenario, so
I would be curious about a performance profile of this run.

> shrink_inactive_list only seems to reclaim 32 pages with the default
> setting and takes the lru_lock twice to do that, so that's a lot of
> locking per KB reclaimed. Increasing the SWAP_CLUSTER_MAX value helped
> a bit, but this is still quite slow.

Yes, SWAP_CLUSTER_MAX is a bit arbitrary, but it is hard to guess what
the bottleneck is without some more data. Please note that this
batching controls how many pages are isolated from the LRU at once,
and that is where we take the lock. So the larger the number, the more
pages we isolate. This might indeed have a positive effect on
performance, but on the other hand we do not want to isolate way too
much, for various reasons (e.g. over-reclaim).
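
Schematically the loop looks like this (a simplified sketch of the
locking shape, not the exact mm/vmscan.c code):

    while (nr_reclaimed < nr_to_reclaim) {
            spin_lock_irq(&pgdat->lru_lock);
            /* take at most SWAP_CLUSTER_MAX pages off the inactive LRU */
            nr_taken = isolate_lru_pages(SWAP_CLUSTER_MAX, lruvec,
                                         &page_list, ...);
            spin_unlock_irq(&pgdat->lru_lock);

            /* reclaim the isolated batch without holding the lock */
            nr_reclaimed += shrink_page_list(&page_list, ...);

            /* put back whatever could not be freed -- the second
             * lru_lock acquisition per batch */
            spin_lock_irq(&pgdat->lru_lock);
            putback_inactive_pages(lruvec, &page_list);
            spin_unlock_irq(&pgdat->lru_lock);
    }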

Also, getting back to your specific use case: you've said you have
played with MADV_DONTNEED but it didn't help much. I am really curious
about the details. Have you called madvise too late? What is the
actual resident portion of the file that you really need?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Lock overhead in shrink_inactive_list / Slow page reclamation
  2019-01-13 23:12       ` Baptiste Lepers
  (?)
  (?)
@ 2019-01-14 15:22       ` Kirill Tkhai
  -1 siblings, 0 replies; 10+ messages in thread
From: Kirill Tkhai @ 2019-01-14 15:22 UTC (permalink / raw)
  To: Baptiste Lepers, Daniel Jordan
  Cc: Michal Hocko, mgorman, akpm, dhowells, linux-mm, hannes

On 14.01.2019 02:12, Baptiste Lepers wrote:
> On Sat, Jan 12, 2019 at 4:53 AM Daniel Jordan
> <daniel.m.jordan@oracle.com> wrote:
>>
>> On Fri, Jan 11, 2019 at 02:59:38PM +0100, Michal Hocko wrote:
>>> On Fri 11-01-19 16:52:17, Baptiste Lepers wrote:
>>>> Hello,
>>>>
>>>> We have a performance issue with the page cache. One of our workloads
>>>> spends more than 50% of its time on the lru_lock taken by
>>>> shrink_inactive_list in mm/vmscan.c.
>>>
>>> Who does contend on the lock? Are there direct reclaimers or is it
>>> solely kswapd with paths that are faulting the new page cache in?
>>
>> Yes, and could you please post your performance data showing the time in
>> lru_lock?  Whatever you have is fine, but using perf with -g would give
>> callstacks and help answer Michal's question about who's contending.
> 
> Thanks for the quick answer.
> 
> The time spent in the lru_lock is mainly due to direct reclaimers
> (reading an mmapped page triggers some readahead). We have tried
> playing with the readahead values, but that doesn't change performance
> much. We have disabled swap on the machine, so kswapd doesn't run.
> 
> Our programs run in memory cgroups, but I don't think that the issue
> directly comes from cgroups (I might be wrong though).
> 
> Here is the callchain I get using perf report --no-children (also
> pasted here: https://pastebin.com/151x4QhR ):
> 
>     44.30%  swapper      [kernel.vmlinux]  [k] intel_idle
>     # The machine is mostly idle because it is waiting on that lru_lock,
> which is the 2nd function in the report:
>     10.98%  testradix    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>                |--10.33%--_raw_spin_lock_irq
>                |          |
>                |           --10.12%--shrink_inactive_list
>                |                     shrink_node_memcg
>                |                     shrink_node
>                |                     do_try_to_free_pages
>                |                     try_to_free_mem_cgroup_pages
>                |                     try_charge
>                |                     mem_cgroup_try_charge
>                |                     __add_to_page_cache_locked
>                |                     add_to_page_cache_lru
>                |                     |
>                |                     |--5.39%--ext4_mpage_readpages
>                |                     |          ext4_readpages
>                |                     |          __do_page_cache_readahead
>                |                     |          |
>                |                     |           --5.37%--ondemand_readahead
>                |                     |                     page_cache_async_readahead

Does MADV_RANDOM make the trace better or worse?
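
I.e. something like the following, with addr/length covering the
mmapped range:

    /* disable readahead on this mapping (needs <sys/mman.h>) */
    madvise(addr, length, MADV_RANDOM);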

>                |                     |                     filemap_fault
>                |                     |                     ext4_filemap_fault
>                |                     |                     __do_fault
>                |                     |                     handle_pte_fault
>                |                     |                     __handle_mm_fault
>                |                     |                     handle_mm_fault
>                |                     |                     __do_page_fault
>                |                     |                     do_page_fault
>                |                     |                     page_fault
>                |                     |                     |
>                |                     |                     |--4.23%-- <our app>
> 
> 
> Thanks,
> 
> Baptiste.
> 
>>
>> Happy to help profile and debug offline.
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-01-14 15:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-11  5:52 Lock overhead in shrink_inactive_list / Slow page reclamation Baptiste Lepers
2019-01-11 13:59 ` Michal Hocko
2019-01-11 17:53   ` Daniel Jordan
2019-01-13 23:12     ` Baptiste Lepers
2019-01-14  7:06       ` Michal Hocko
2019-01-14  7:25         ` Baptiste Lepers
2019-01-14  7:44           ` Michal Hocko
2019-01-14 15:22       ` Kirill Tkhai
