linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: SeongJae Park <sj38.park@gmail.com>, Yu Zhao <yuzhao@google.com>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Manes <ben.manes@gmail.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>,
	Michael Larabel <michael@michaellarabel.com>,
	Michal Hocko <mhocko@suse.com>,
	Michel Lespinasse <michel@lespinasse.org>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Rong Chen <rong.a.chen@intel.com>,
	SeongJae Park <sjpark@amazon.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
	linux-kernel@vger.kernel.org, lkp@lists.01.org,
	page-reclaim@google.com
Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework
Date: Wed, 14 Apr 2021 08:43:36 -0600	[thread overview]
Message-ID: <91146ee7-3054-a81a-296e-e75c24f4e290@kernel.dk> (raw)
In-Reply-To: <20210413231436.GF63242@dread.disaster.area>

On 4/13/21 5:14 PM, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> On 4/13/21 1:51 AM, SeongJae Park wrote:
>>> From: SeongJae Park <sjpark@amazon.de>
>>>
>>> Hello,
>>>
>>>
>>> Very interesting work, thank you for sharing this :)
>>>
>>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
>>>
>>>> What's new in v2
>>>> ================
>>>> Special thanks to Jens Axboe for reporting a regression in buffered
>>>> I/O and helping test the fix.
>>>
>>> Is the discussion open?  If so, could you please give me a link?
>>
>> I wasn't on the initial post (or any of the lists it was posted to), but
>> it's on the google page reclaim list. Not sure if that is public or not.
>>
>> tldr is that I was pretty excited about this work, as buffered IO tends
>> to suck (a lot) for high throughput applications. My test case was
>> pretty simple:
>>
>> Randomly read a fast device, using 4k buffered IO, and watch what
>> happens when the page cache gets filled up. For this particular test,
>> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
>> with kswapd using a lot of CPU trying to keep up. That's mainline
>> behavior.
> 
> I see this exact same behaviour here, too, but I RCA'd it to
> contention between the inode and memory reclaim for the mapping
> structure that indexes the page cache. Basically the mapping tree
> lock is the contention point here - you can either be adding pages
> to the mapping during IO, or memory reclaim can be removing pages
> from the mapping, but we can't do both at once.
> 
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
> 
> -   20.06%     0.00%  [kernel]               [k] kswapd                                                                                                        ▒
>    - 20.06% kswapd                                                                                                                                             ▒
>       - 20.05% balance_pgdat                                                                                                                                   ▒
>          - 20.03% shrink_node                                                                                                                                  ▒
>             - 19.92% shrink_lruvec                                                                                                                             ▒
>                - 19.91% shrink_inactive_list                                                                                                                   ▒
>                   - 19.22% shrink_page_list                                                                                                                    ▒
>                      - 17.51% __remove_mapping                                                                                                                 ▒
>                         - 14.16% _raw_spin_lock_irqsave                                                                                                        ▒
>                            - 14.14% do_raw_spin_lock                                                                                                           ▒
>                                 __pv_queued_spin_lock_slowpath                                                                                                 ▒
>                         - 1.56% __delete_from_page_cache                                                                                                       ▒
>                              0.63% xas_store                                                                                                                   ▒
>                         - 0.78% _raw_spin_unlock_irqrestore                                                                                                    ▒
>                            - 0.69% do_raw_spin_unlock                                                                                                          ▒
>                                 __raw_callee_save___pv_queued_spin_unlock                                                                                      ▒
>                      - 0.82% free_unref_page_list                                                                                                              ▒
>                         - 0.72% free_unref_page_commit                                                                                                         ▒
>                              0.57% free_pcppages_bulk                                                                                                          ▒
> 
> And these are the processes consuming CPU:
> 
>    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
>    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
>    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
>    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
>    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2

Here's my profile when memory reclaim is active for the above mentioned
test case. This is a single node system, so just kswapd. It's using around
40-45% CPU:

    43.69%  kswapd0  [kernel.vmlinux]  [k] xas_create
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               __delete_from_page_cache
               xas_store
               xas_create

    16.88%  kswapd0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               |          
                --16.82%--shrink_inactive_list
                          |          
                           --16.55%--shrink_page_list
                                     |          
                                      --16.26%--_raw_spin_lock_irqsave
                                                queued_spin_lock_slowpath

     9.89%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list

     5.46%  kswapd0  [kernel.vmlinux]  [k] xas_init_marks
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               |          
                --5.41%--__delete_from_page_cache
                          xas_init_marks

     4.42%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               |          
                --4.40%--shrink_page_list
                          __delete_from_page_cache

     2.82%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               |          
               |--1.43%--shrink_active_list
               |          isolate_lru_pages
               |          
                --1.39%--shrink_inactive_list
                          isolate_lru_pages

     1.99%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               free_unref_page_list
               free_unref_page_commit
               free_pcppages_bulk

     1.79%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               |          
                --1.76%--shrink_node
                          shrink_lruvec
                          shrink_inactive_list
                          |          
                           --1.72%--shrink_page_list
                                     _raw_spin_lock_irqsave

     1.02%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               |          
                --1.00%--shrink_page_list
                          workingset_eviction

> i.e. when memory reclaim kicks in, the read process has 20% less
> time with exclusive access to the mapping tree to insert new pages.
> Hence buffered read performance goes down quite substantially when
> memory reclaim kicks in, and this really has nothing to do with the
> memory reclaim LRU scanning algorithm.
> 
> I can actually get this machine to pin those 5 processes to 100% CPU
> under certain conditions. Each process is spinning all that extra
> time on the mapping tree lock, and performance degrades further.
> Changing the LRU reclaim algorithm won't fix this - the workload is
> solidly bound by the exclusive nature of the mapping tree lock and
> the number of tasks trying to obtain it exclusively...

I've seen way worse than the above as well, it's just my go-to easy test
case for "man I wish buffered IO didn't suck so much".

>> The initial posting of this patchset did no better, in fact it did a bit
>> worse. Performance dropped to the same levels and kswapd was using as
>> much CPU as before, but on top of that we also got excessive swapping.
>> Not at a high rate, but 5-10MB/sec continually.
>>
>> I had some back and forths with Yu Zhao and tested a few new revisions,
>> and the current series does much better in this regard. Performance
>> still dips a bit when page cache fills, but not nearly as much, and
>> kswapd is using less CPU than before.
> 
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
> 
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

See above - let me know if you want to see more specific profiling as
well.

-- 
Jens Axboe


  parent reply	other threads:[~2021-04-14 14:43 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-13  6:56 Yu Zhao
2021-04-13  6:56 ` [PATCH v2 01/16] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
2021-04-13  6:56 ` [PATCH v2 02/16] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
2021-04-13  6:56 ` [PATCH v2 03/16] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
2021-04-13  6:56 ` [PATCH v2 04/16] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
2021-04-13  6:56 ` [PATCH v2 05/16] mm/swap.c: export activate_page() Yu Zhao
2021-04-13  6:56 ` [PATCH v2 06/16] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
2021-04-13  6:56 ` [PATCH v2 07/16] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-04-13  6:56 ` [PATCH v2 08/16] mm: multigenerational lru: groundwork Yu Zhao
2021-04-13  6:56 ` [PATCH v2 09/16] mm: multigenerational lru: activation Yu Zhao
2021-04-13  6:56 ` [PATCH v2 10/16] mm: multigenerational lru: mm_struct list Yu Zhao
2021-04-14 14:36   ` Matthew Wilcox
2021-04-13  6:56 ` [PATCH v2 11/16] mm: multigenerational lru: aging Yu Zhao
2021-04-13  6:56 ` [PATCH v2 12/16] mm: multigenerational lru: eviction Yu Zhao
2021-04-13  6:56 ` [PATCH v2 13/16] mm: multigenerational lru: page reclaim Yu Zhao
2021-04-13  6:56 ` [PATCH v2 14/16] mm: multigenerational lru: user interface Yu Zhao
2021-04-13  6:56 ` [PATCH v2 15/16] mm: multigenerational lru: Kconfig Yu Zhao
2021-04-13  6:56 ` [PATCH v2 16/16] mm: multigenerational lru: documentation Yu Zhao
2021-04-13  7:51 ` [PATCH v2 00/16] Multigenerational LRU Framework SeongJae Park
2021-04-13 16:13   ` Jens Axboe
2021-04-13 16:42     ` SeongJae Park
2021-04-13 23:14     ` Dave Chinner
2021-04-14  2:29       ` Rik van Riel
     [not found]         ` <CAOUHufafMcaG8sOS=1YMy2P_6p0R1FzP16bCwpUau7g1-PybBQ@mail.gmail.com>
2021-04-14  6:15           ` Huang, Ying
2021-04-14  7:58             ` Yu Zhao
2021-04-14  8:27               ` Huang, Ying
2021-04-14 13:51                 ` Rik van Riel
2021-04-14 15:56                   ` Andi Kleen
2021-04-14 15:58                   ` [page-reclaim] " Shakeel Butt
2021-04-14 18:45                   ` Yu Zhao
2021-04-14 15:51           ` Andi Kleen
2021-04-14 15:58             ` Rik van Riel
2021-04-14 19:14               ` Yu Zhao
2021-04-14 19:41                 ` Rik van Riel
2021-04-14 20:08                   ` Yu Zhao
2021-04-14 19:04             ` Yu Zhao
2021-04-15  3:00               ` Andi Kleen
2021-04-15  7:13                 ` Yu Zhao
2021-04-15  8:19                   ` Huang, Ying
2021-04-15  9:57                   ` Michel Lespinasse
2021-04-24  2:33                     ` Yu Zhao
2021-04-24  3:30                       ` Andi Kleen
2021-04-24  4:16                         ` Yu Zhao
2021-04-14  3:40       ` Yu Zhao
2021-04-14  4:50         ` Dave Chinner
2021-04-14  7:16           ` Yu Zhao
2021-04-14 10:00             ` Yu Zhao
2021-04-15  1:36             ` Dave Chinner
2021-04-24 21:21               ` Yu Zhao
2021-04-14 14:43       ` Jens Axboe [this message]
2021-04-14 19:42         ` Yu Zhao
2021-04-15  1:21         ` Dave Chinner
2021-04-14 17:43 ` Johannes Weiner
2021-04-27 10:35   ` Yu Zhao
2021-04-29 23:46 ` Konstantin Kharlamov
2021-04-30  6:37   ` Konstantin Kharlamov
2021-04-30 19:31     ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=91146ee7-3054-a81a-296e-e75c24f4e290@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=ben.manes@gmail.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@fromorbit.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@lists.01.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=michael@michaellarabel.com \
    --cc=michel@lespinasse.org \
    --cc=page-reclaim@google.com \
    --cc=riel@surriel.com \
    --cc=rong.a.chen@intel.com \
    --cc=shy828301@gmail.com \
    --cc=sj38.park@gmail.com \
    --cc=sjpark@amazon.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    --cc=ziy@nvidia.com \
    --subject='Re: [PATCH v2 00/16] Multigenerational LRU Framework' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).