All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>,
	Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
Date: Tue, 27 Feb 2024 20:17:18 +0100	[thread overview]
Message-ID: <2934125a-f2e2-417c-a9f9-3cb1e074a44f@redhat.com> (raw)
In-Reply-To: <ba9101ae-a618-4afc-9515-a61f25376390@arm.com>

On 27.02.24 18:10, Ryan Roberts wrote:
> Hi David,
> 
> On 26/02/2024 17:41, Ryan Roberts wrote:
>> On 22/02/2024 10:20, David Hildenbrand wrote:
>>> On 22.02.24 11:19, David Hildenbrand wrote:
>>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>>> the cluster size.
>>>>>
>>>>> The only use of the flag was to determine whether a swap entry refers to
>>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>>> originates from the folio's number of pages. This allows the logic to
>>>>> work for folios of any order.
>>>>>
>>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>>> sites does not have the folio. But it was only being called there to
>>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>>> call site and believe the new logic should be equivalent.
>>>>
>>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>>
>>>> The "difference" is that you will now (1) get another temporary
>>>> reference on the folio and (2) (try)lock the folio every time you
>>>> discard a single PTE of a (possibly) large THP.
>>>>
>>>
>>> Thinking about it, your change will not only affect THP, but any call to
>>> free_swap_and_cache().
>>>
>>> Likely that's not what we want. :/
>>>
>>
>> Is folio_trylock() really that expensive given the code path is already locking
>> multiple spinlocks, and I don't think we would expect the folio lock to be very
>> contended?
>>
>> I guess filemap_get_folio() could be a bit more expensive, but again, is this
>> really a deal-breaker?
>>
>>
>> I'm just trying to refamiliarize myself with this series, but I think I ended up
>> allocating a cluster per cpu per order. So one potential solution would be to
>> turn the flag into a size and store it in the cluster info. (In fact I think I
>> was doing that in an early version of this series - will have to look at why I
>> got rid of that). Then we could avoid needing to figure out nr_pages from the folio.
> 
> I ran some microbenchmarks to see if these extra operations cause a performance
> issue - it all looks OK to me.

Sorry, I'm drowning in reviews right now. I was hoping to get some of my own
stuff figured out today ... maybe tomorrow.

> 
> I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
> mode, which prepares the 1G memory mapping by first paging it out with
> MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
> swap slots have 2 references, then it measures the duration of munmap() in the
> parent on the entire range. The idea is that free_swap_and_cache() is called for
> each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
> will return true, due to the child's references, and __try_to_reclaim_swap() is
> not called. After my change, we no longer have this short cut.
> 
> In both cases the results are within 1% (confirmed across multiple runs of 20
> seconds each):
> 
> mm-stable: Average: 0.004997
>   + change: Average: 0.005037
> 
> (these numbers are for Ampere Altra. I also tested on M2 VM - no regression
> their either).
> 
> Do you still have a concern about this change?

The main concern I had was not about overhead due to atomic operations in the
non-concurrent case that you are measuring.

We might now unnecessarily be incrementing the folio refcount and taking
the folio lock. That will affects large folios in the swapcache now IIUC.
Small folios should be unaffected.

The side effects of that can be:
* Code checking for additional folio reference could now detect some and
   back out. (the "mapcount + swapcache*folio_nr_pages != folio_refcount"
   stuff)
* Code that might really benefit from trylocking the folio might fail to
   do so.

For example, splitting a large folio might now fail more often simply
because some process zaps a swap entry and the additional reference+page
lock were optimized out previously.

How relevant is it? Relevant enough that someone decided to put that
optimization in? I don't know :)

Arguably, zapping a present PTE also leaves the refcount elevated for a while
until the mapcount was freed. But here, it could be avoided.

Digging a bit, it was introduced in:

commit e07098294adfd03d582af7626752255e3d170393
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Sep 6 16:22:16 2017 -0700

     mm, THP, swap: support to reclaim swap space for THP swapped out
     
     The normal swap slot reclaiming can be done when the swap count reaches
     SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
     slots backing one THP must be reclaimed together, because the swap slot
     may be used again when the THP is swapped out again later.  So the swap
     slots backing one THP can be reclaimed together when the swap count for
     all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
     functions to check whether the swap count for all swap slots backing one
     THP reached SWAP_HAS_CACHE are implemented and used when checking
     whether a swap slot can be reclaimed.
     
     To make it easier to determine whether a swap slot is backing a THP, a
     new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
     cluster which is backing a THP (Transparent Huge Page).  Because THP
     swap in as a whole isn't supported now.  After deleting the THP from the
     swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE
     flag will be cleared.  So that, the normal pages inside THP can be
     swapped in individually.


With your change, if we have a swapped out THP with 512 entries and exit(), we
would now 512 times in a row grab a folio reference and trylock the folio. In the
past, we would have done that at most once.

That doesn't feel quite right TBH ... so I'm wondering if there are any low-hanging
fruits to avoid that.

-- 
Cheers,

David / dhildenb


  reply	other threads:[~2024-02-27 19:17 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-02-22 10:19   ` David Hildenbrand
2024-02-22 10:20     ` David Hildenbrand
2024-02-26 17:41       ` Ryan Roberts
2024-02-27 17:10         ` Ryan Roberts
2024-02-27 19:17           ` David Hildenbrand [this message]
2024-02-28  9:37             ` Ryan Roberts
2024-02-28 12:12               ` David Hildenbrand
2024-02-28 14:57                 ` Ryan Roberts
2024-02-28 15:12                   ` David Hildenbrand
2024-02-28 15:18                     ` Ryan Roberts
2024-03-01 16:27                     ` Ryan Roberts
2024-03-01 16:31                       ` Matthew Wilcox
2024-03-01 16:44                         ` Ryan Roberts
2024-03-01 17:00                           ` David Hildenbrand
2024-03-01 17:14                             ` Ryan Roberts
2024-03-01 17:18                               ` David Hildenbrand
2024-03-01 17:06                           ` Ryan Roberts
2024-03-04  4:52                             ` Barry Song
2024-03-04  5:42                               ` Barry Song
2024-03-05  7:41                                 ` Ryan Roberts
2024-03-01 16:31                       ` Ryan Roberts
2024-03-01 16:32                       ` David Hildenbrand
2024-03-04 16:03                 ` Ryan Roberts
2024-03-04 17:30                   ` David Hildenbrand
2024-03-04 18:38                     ` Ryan Roberts
2024-03-04 20:50                       ` David Hildenbrand
2024-03-04 21:55                         ` Ryan Roberts
2024-03-04 22:02                           ` David Hildenbrand
2024-03-04 22:34                             ` Ryan Roberts
2024-03-05  6:11                               ` Huang, Ying
2024-03-05  8:35                                 ` David Hildenbrand
2024-03-05  8:46                                   ` Ryan Roberts
2024-02-28 13:33               ` Matthew Wilcox
2024-02-28 14:24                 ` Ryan Roberts
2024-02-28 14:59                   ` Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-30  8:18   ` Huang, Ying
2023-10-30 13:59     ` Ryan Roberts
2023-10-31  8:12       ` Huang, Ying
2023-11-03 11:42         ` Ryan Roberts
2023-11-02  7:40   ` Barry Song
2023-11-02 10:21     ` Ryan Roberts
2023-11-02 22:36       ` Barry Song
2023-11-03 11:31         ` Ryan Roberts
2023-11-03 13:57           ` Steven Price
2023-11-04  9:34             ` Barry Song
2023-11-06 10:12               ` Steven Price
2023-11-06 21:39                 ` Barry Song
2023-11-08 11:51                   ` Steven Price
2023-11-07 12:46               ` Ryan Roberts
2023-11-07 18:05                 ` Barry Song
2023-11-08 11:23                   ` Barry Song
2023-11-08 20:20                     ` Ryan Roberts
2023-11-08 21:04                       ` Barry Song
2023-11-04  5:49           ` Barry Song
2024-02-05  9:51   ` Barry Song
2024-02-05 12:14     ` Ryan Roberts
2024-02-18 23:40       ` Barry Song
2024-02-20 20:03         ` Ryan Roberts
2024-03-05  9:00         ` Ryan Roberts
2024-03-05  9:54           ` Barry Song
2024-03-05 10:44             ` Ryan Roberts
2024-02-27 12:28     ` Ryan Roberts
2024-02-27 13:37     ` Ryan Roberts
2024-02-28  2:46       ` Barry Song
2024-02-22  7:05   ` Barry Song
2024-02-22 10:09     ` David Hildenbrand
2024-02-23  9:46       ` Barry Song
2024-02-27 12:05         ` Ryan Roberts
2024-02-28  1:23           ` Barry Song
2024-02-28  9:34             ` David Hildenbrand
2024-02-28 23:18               ` Barry Song
2024-02-28 15:57             ` Ryan Roberts
2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
2023-11-29 12:06   ` Ryan Roberts
2023-11-29 20:38     ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
2024-01-26 23:14     ` Chris Li
2024-02-26  2:59       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
2024-01-26 23:17     ` Chris Li
2024-02-26  4:47       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-01-26 23:22     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
2024-01-27 19:53     ` Chris Li
2024-02-26  7:29       ` Barry Song
2024-01-27 20:06     ` Chris Li
2024-02-26  7:31       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
2024-01-18 11:54     ` David Hildenbrand
2024-01-23  6:49       ` Barry Song
2024-01-29  3:25         ` Chris Li
2024-01-29 10:06           ` David Hildenbrand
2024-01-29 16:31             ` Chris Li
2024-02-26  5:05               ` Barry Song
2024-04-06 23:27             ` Barry Song
2024-01-27 23:41     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
2024-01-29  2:15     ` Chris Li
2024-02-26  6:39       ` Barry Song
2024-02-27 12:22     ` Ryan Roberts
2024-02-27 22:39       ` Barry Song
2024-02-27 14:40     ` Ryan Roberts
2024-02-27 18:57       ` Barry Song
2024-02-28  3:49         ` Barry Song
2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
2024-01-18 23:54     ` Barry Song
2024-01-19 13:25       ` Ryan Roberts
2024-01-27 14:27         ` Barry Song
2024-01-29  9:05   ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2934125a-f2e2-417c-a9f9-3cb1e074a44f@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=ryan.roberts@arm.com \
    --cc=shy828301@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=xiang@kernel.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.