linux-kernel.vger.kernel.org archive mirror
From: Yu Zhao <yuzhao@google.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>,
	SeongJae Park <sj38.park@gmail.com>,
	Linux-MM <linux-mm@kvack.org>, Andi Kleen <ak@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Manes <ben.manes@gmail.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>,
	Michael Larabel <michael@michaellarabel.com>,
	Michal Hocko <mhocko@suse.com>,
	Michel Lespinasse <michel@lespinasse.org>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Rong Chen <rong.a.chen@intel.com>,
	SeongJae Park <sjpark@amazon.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org,
	Kernel Page Reclaim v2 <page-reclaim@google.com>
Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework
Date: Wed, 14 Apr 2021 13:42:30 -0600
Message-ID: <CAOUHufZxz9ucfLexutAKi8EjHKrT4NfMhsvbk+3DDymaEhp-Rg@mail.gmail.com>
In-Reply-To: <91146ee7-3054-a81a-296e-e75c24f4e290@kernel.dk>

On Wed, Apr 14, 2021 at 8:43 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park <sjpark@amazon.de>
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>>> What's new in v2
> >>>> ================
> >>>> Special thanks to Jens Axboe for reporting a regression in buffered
> >>>> I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06%     0.00%  [kernel]               [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk
> >
> > And these are the processes consuming CPU:
> >
> >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
>
> Here's my profile when memory reclaim is active for the above mentioned
> test case. This is a single node system, so just kswapd. It's using around
> 40-45% CPU:
>
>     43.69%  kswapd0  [kernel.vmlinux]  [k] xas_create
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                __delete_from_page_cache
>                xas_store
>                xas_create
>
>     16.88%  kswapd0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                |
>                 --16.82%--shrink_inactive_list
>                           |
>                            --16.55%--shrink_page_list
>                                      |
>                                       --16.26%--_raw_spin_lock_irqsave
>                                                 queued_spin_lock_slowpath
>
>      9.89%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>
>      5.46%  kswapd0  [kernel.vmlinux]  [k] xas_init_marks
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                |
>                 --5.41%--__delete_from_page_cache
>                           xas_init_marks
>
>      4.42%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                |
>                 --4.40%--shrink_page_list
>                           __delete_from_page_cache
>
>      2.82%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                |
>                |--1.43%--shrink_active_list
>                |          isolate_lru_pages
>                |
>                 --1.39%--shrink_inactive_list
>                           isolate_lru_pages
>
>      1.99%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                free_unref_page_list
>                free_unref_page_commit
>                free_pcppages_bulk
>
>      1.79%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                |
>                 --1.76%--shrink_node
>                           shrink_lruvec
>                           shrink_inactive_list
>                           |
>                            --1.72%--shrink_page_list
>                                      _raw_spin_lock_irqsave
>
>      1.02%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                |
>                 --1.00%--shrink_page_list
>                           workingset_eviction
>
> > i.e. when memory reclaim kicks in, the read process has 20% less
> > time with exclusive access to the mapping tree to insert new pages.
> > Hence buffered read performance goes down quite substantially when
> > memory reclaim kicks in, and this really has nothing to do with the
> > memory reclaim LRU scanning algorithm.
> >
> > I can actually get this machine to pin those 5 processes to 100% CPU
> > under certain conditions. Each process is spinning all that extra
> > time on the mapping tree lock, and performance degrades further.
> > Changing the LRU reclaim algorithm won't fix this - the workload is
> > solidly bound by the exclusive nature of the mapping tree lock and
> > the number of tasks trying to obtain it exclusively...
>
> I've seen way worse than the above as well, it's just my go-to easy test
> case for "man I wish buffered IO didn't suck so much".
>
> >> The initial posting of this patchset did no better, in fact it did a bit
> >> worse. Performance dropped to the same levels and kswapd was using as
> >> much CPU as before, but on top of that we also got excessive swapping.
> >> Not at a high rate, but 5-10MB/sec continually.
> >>
> >> I had some back and forths with Yu Zhao and tested a few new revisions,
> >> and the current series does much better in this regard. Performance
> >> still dips a bit when page cache fills, but not nearly as much, and
> >> kswapd is using less CPU than before.
> >
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> >
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
>
> See above - let me know if you want to see more specific profiling as
> well.

Hi Jens,

Thanks for the profiles.

Does the code path I've demonstrated seem clear to you?

Recap:

When randomly accessing a (not infinitely) large file long enough,
some blocks are bound to be accessed multiple times. In the buffered
I/O access path, mark_page_accessed() activates them, i.e., moves them
to the active list. Once memory is filled and kswapd starts
reclaiming, shrink_active_list() deactivates them, i.e., moves them
back to the inactive list. Both take the lru lock to add/remove pages
to/from the active/inactive lists.
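
To put a rough number on "bound to be accessed multiple times", here is
a small sketch. It assumes reads are uniformly random and independent
over the file's blocks, which is an assumption of the sketch rather
than something measured in this thread. The function name and its
parameters are made up for illustration.

```python
def expected_repeated_blocks(num_blocks: int, num_reads: int) -> float:
    """Expected number of blocks read at least twice.

    Assumes each read picks one of num_blocks uniformly at random,
    independently of all other reads (a modeling assumption).
    """
    if num_blocks < 1 or num_reads < 0:
        raise ValueError("need num_blocks >= 1 and num_reads >= 0")

    p = 1.0 / num_blocks
    # Probability that a given block is never read.
    p_never = (1.0 - p) ** num_reads
    # Probability that a given block is read exactly once (binomial term).
    p_once = num_reads * p * (1.0 - p) ** (num_reads - 1) if num_reads else 0.0
    # Everything else was read two or more times.
    return num_blocks * (1.0 - p_never - p_once)
```

Under this model, issuing as many reads as there are blocks already
leaves roughly a quarter of the blocks read more than once, so
candidates for activation show up well before the whole file has been
touched.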

IOW, pages accessed multiple times bounce between the active and the
inactive lists when random accesses put a system under memory
pressure. For random accesses, pages accessed multiple times are not
different from those accessed once, in terms of page reclaim.
(Statistically speaking, they would be less unlikely to be used
again.)
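
The bouncing can be illustrated with a toy two-list model. This is a
deliberately simplified sketch, not kernel code: a repeated access to
an inactive page activates it right away (standing in for
mark_page_accessed()), and reclaim deactivates from the tail of the
active list whenever it is longer than the inactive list (standing in
for shrink_active_list()). The balancing rule, the function name and
all parameters are assumptions made for illustration.

```python
import random
from collections import OrderedDict


def simulate_two_list_lru(file_blocks, mem_pages, reads, seed=0):
    """Count list moves for uniformly random reads over a file.

    Both lists are ordered oldest first. Every activation and every
    deactivation is a list move that, in the real kernel, would be
    done under the lru lock.
    """
    if file_blocks < 1 or mem_pages < 1:
        raise ValueError("need file_blocks >= 1 and mem_pages >= 1")

    rng = random.Random(seed)
    inactive = OrderedDict()
    active = OrderedDict()
    stats = {"activations": 0, "deactivations": 0, "evictions": 0}

    for _ in range(reads):
        block = rng.randrange(file_blocks)

        if block in active:
            # Already active: just refresh its position.
            active.move_to_end(block)
            continue

        if block in inactive:
            # Repeated access: move from the inactive to the active list.
            del inactive[block]
            active[block] = None
            stats["activations"] += 1
            continue

        # Cache miss. If memory is full, reclaim one page first.
        if len(active) + len(inactive) >= mem_pages:
            # Toy balancing rule: shrink the active list while it is
            # longer than the inactive list.
            while len(active) > len(inactive):
                victim = active.popitem(last=False)[0]
                inactive[victim] = None
                stats["deactivations"] += 1
            # Evict the oldest inactive page.
            inactive.popitem(last=False)
            stats["evictions"] += 1

        inactive[block] = None

    return stats
```

For example, simulate_two_list_lru(file_blocks=1000, mem_pages=100,
reads=20000) models a file ten times larger than memory. In this model
the two counters can never differ by more than the size of the active
list, so in a long run nearly every activation is later undone by a
deactivation, which is the bouncing described above.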

I'd be happy to give it another try if there is anything unclear.

Thanks.
