From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>
Cc: steven.price@arm.com, akpm@linux-foundation.org,
	david@redhat.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, mhocko@suse.com, shy828301@gmail.com,
	wangkefeng.wang@huawei.com, willy@infradead.org,
	xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com,
	Barry Song <v-songbaohua@oppo.com>,
	nd@arm.com
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
Date: Wed, 8 Nov 2023 20:20:58 +0000
Message-ID: <2c98be67-657e-4c65-bf6b-3d70ff596c64@arm.com>
In-Reply-To: <CAGsJ_4xmBAcApyK8NgVQeX_Znp5e8D4fbbhGguOkNzmh1Veocg@mail.gmail.com>

On 08/11/2023 11:23, Barry Song wrote:
> On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 04/11/2023 09:34, Barry Song wrote:
>>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>>>>> and if failing then arch_prepare_to_swap() would need to put things back
>>>>> how they were with calls to mte_invalidate_tags() (although I think
>>>>> you'd actually want to refactor to create a function which takes a
>>>>> struct page *).
>>>>>
>>>>> Steve
>>>>
>>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
>>>> One tricky thing is that we are restoring one page rather than folio
>>>> in arch_restore_swap() as we are only swapping in one page at this
>>>> stage.
>>>>
>>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>>>
>>>> This patch makes MTE tags saving and restoring support large folios,
>>>> then we don't need to split them into base pages for swapping on
>>>> ARM64 SoCs with MTE.
>>>>
>>>> This patch moves arch_prepare_to_swap() to take folio rather than
>>>> page, as we support THP swap-out as a whole. And this patch also
>>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
>>>> needs it.
>>>>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>>>>  include/linux/huge_mm.h          | 12 ---------
>>>>  include/linux/pgtable.h          |  2 +-
>>>>  mm/page_io.c                     |  2 +-
>>>>  mm/swap_slots.c                  |  2 +-
>>>>  6 files changed, 51 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index b19a8aee684c..d8f523dc41e7 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -45,12 +45,6 @@
>>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>
>>>> -static inline bool arch_thp_swp_supported(void)
>>>> -{
>>>> -     return !system_supports_mte();
>>>> -}
>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>>> -
>>>>  /*
>>>>   * Outside of a few very special situations (e.g. hibernation), we always
>>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>>  #ifdef CONFIG_ARM64_MTE
>>>>
>>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>>>> -static inline int arch_prepare_to_swap(struct page *page)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             return mte_save_tags(page);
>>>> -     return 0;
>>>> -}
>>>> +#define arch_prepare_to_swap arch_prepare_to_swap
>>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>>
>>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>>>>  }
>>>>
>>>>  #define __HAVE_ARCH_SWAP_RESTORE
>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             mte_restore_tags(entry, &folio->page);
>>>> -}
>>>> +#define arch_swap_restore arch_swap_restore
>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>>
>>>>  #endif /* CONFIG_ARM64_MTE */
>>>>
>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>>> index a31833e3ddc5..14a479e4ea8e 100644
>>>> --- a/arch/arm64/mm/mteswap.c
>>>> +++ b/arch/arm64/mm/mteswap.c
>>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>>       mte_free_tag_storage(tags);
>>>>  }
>>>>
>>>> +static inline void __mte_invalidate_tags(struct page *page)
>>>> +{
>>>> +     swp_entry_t entry = page_swap_entry(page);
>>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>>> +}
>>>> +
>>>>  void mte_invalidate_tags_area(int type)
>>>>  {
>>>>       swp_entry_t entry = swp_entry(type, 0);
>>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>>>>       }
>>>>       xa_unlock(&mte_pages);
>>>>  }
>>>> +
>>>> +int arch_prepare_to_swap(struct folio *folio)
>>>> +{
>>>> +     int err;
>>>> +     long i;
>>>> +
>>>> +     if (system_supports_mte()) {
>>>> +             long nr = folio_nr_pages(folio);
>>>
>>> nit: there should be a clear line between variable declarations and logic.
>>
>> right.
>>
>>>
>>>> +             for (i = 0; i < nr; i++) {
>>>> +                     err = mte_save_tags(folio_page(folio, i));
>>>> +                     if (err)
>>>> +                             goto out;
>>>> +             }
>>>> +     }
>>>> +     return 0;
>>>> +
>>>> +out:
>>>> +     while (--i)
>>>
>>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
>>> then it will wrap and run ~forever. I think you meant `while (i--)`?
>>
>> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
>> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
>> saved, we restore 0,1,2 and we don't restore 3.
> 
> I am terribly sorry for my previous noise. You are right, Ryan. i
> actually meant i--.

No problem - it saves me from writing a long response explaining why --i is
wrong, at least!
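
For the record, the difference can be shown with a minimal standalone
sketch. This is plain userspace C, not kernel code: save_one() and
undo_one() are hypothetical stand-ins for mte_save_tags() and
__mte_invalidate_tags(), and saved[] is an invented array that just records
which pages currently hold saved tags. With `while (--i)` and i == 3, only
pages 2 and 1 are undone; with i == 0 the pre-decrement yields -1, which is
non-zero, so the loop runs with negative indices. `while (i--)` undoes
exactly pages i-1 down to 0, and does nothing when i == 0.

```c
#include <assert.h>

#define NR_PAGES 4

/* Hypothetical bookkeeping: saved[n] != 0 means page n has saved tags. */
static int saved[NR_PAGES];

/* Stand-in for mte_save_tags(): fails for the page at index fail_at. */
static int save_one(long idx, long fail_at)
{
	if (idx == fail_at)
		return -1;
	saved[idx] = 1;
	return 0;
}

/* Stand-in for __mte_invalidate_tags(): drops the saved tags of one page. */
static void undo_one(long idx)
{
	saved[idx] = 0;
}

/* Save every page; if page i fails, roll back pages i-1 .. 0. */
static int save_all(long nr, long fail_at)
{
	long i;
	int err;

	assert(nr <= NR_PAGES);

	for (i = 0; i < nr; i++) {
		err = save_one(i, fail_at);
		if (err)
			goto out;
	}
	return 0;

out:
	/* Post-decrement: tests i, then undoes index i - 1. Stops after 0. */
	while (i--)
		undo_one(i);
	return err;
}

/* Number of pages that still hold saved tags. */
static long count_saved(void)
{
	long i, n = 0;

	for (i = 0; i < NR_PAGES; i++)
		n += saved[i];
	return n;
}
```

Calling save_all(NR_PAGES, 3) fails on page 3 and leaves count_saved() at
zero, which is the behaviour the error path needs.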

> 
>>
>>>
>>>> +             __mte_invalidate_tags(folio_page(folio, i));
>>>> +     return err;
>>>> +}
>>>> +
>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> +{
>>>> +     if (system_supports_mte()) {
>>>> +             /*
>>>> +              * We don't support large folios swap in as whole yet, but
>>>> +              * we can hit a large folio which is still in swapcache
>>>> +              * after those related processes' PTEs have been unmapped
>>>> +              * but before the swapcache folio  is dropped, in this case,
>>>> +              * we need to find the exact page which "entry" is mapping
>>>> +              * to. If we are not hitting swapcache, this folio won't be
>>>> +              * large
>>>> +              */
>>>
>>> So the currently defined API allows a large folio to be passed but the caller is
>>> supposed to find the single correct page using the swap entry? That feels quite
>>> nasty to me. And that's not what the old version of the function was doing; it
>>> always assumed that the folio was small and passed the first page (which also
>>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
>>> to fix that. If the old version is correct, then I guess this version is wrong.
>>
>> the original version(mainline) is wrong but it works as once we find the SoCs
>> support MTE, we will split large folios into small pages. so only small pages
>> will be added into swapcache successfully.
>>
>> but now we want to swap out large folios even on SoCs with MTE as a whole,
>> we don't split, so this breaks the assumption do_swap_page() will always get
>> small pages.
> 
> Let me clarify this further. Current mainline assumes arch_swap_restore()
> always gets a folio with only one page. This holds because we split large
> folios when the SoC has MTE. But since we are dropping the split now, a
> large folio can reach do_swap_page(). There is a window in which
> try_to_unmap_one() has completed but the folio has not yet been put, so the
> PTEs hold swap entries while the folio is still present; do_swap_page()
> then hits the cache directly and the folio is not released.
> 
> But after getting the large folio in do_swap_page(), it still takes only
> the one base page for the faulted PTE and maps only that 4KB PTE. So it
> passes the faulted swap entry and the folio to arch_swap_restore(), which
> can look something like:
> 
> do_swap_page()
> {
>         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> }

OK, I understand what's going on, but it seems like a bad API decision. I think
Steve is saying the same thing: if it's only intended to operate on a single
page, it would be much clearer to pass the actual page rather than the folio,
i.e. leave the complexity of figuring out the target page to the caller, which
understands all this.
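
As a rough illustration of that split of responsibilities, here is a toy
userspace model. Everything in it is hypothetical: struct toy_page, struct
toy_folio and the toy_*() functions are invented for this sketch and are
not the kernel's types or API. It also assumes the pages of a large folio
are backed by consecutive swap offsets, so the target page index is the
faulting offset minus the offset of page 0.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PAGES 16

/* Hypothetical, simplified models; not struct page / struct folio. */
struct toy_page {
	int tags_restored;
};

struct toy_folio {
	unsigned long first_swap_offset;	/* swap offset backing page 0 */
	unsigned long nr_pages;
	struct toy_page pages[TOY_MAX_PAGES];
};

/* Callee: operates on exactly one page and knows nothing about folios. */
static void toy_swap_restore(unsigned long offset, struct toy_page *page)
{
	(void)offset;	/* a real version would look up saved tags by entry */
	page->tags_restored = 1;
}

/* Caller-side helper: map the faulting swap offset to the page it backs. */
static struct toy_page *toy_fault_page(struct toy_folio *folio,
				       unsigned long fault_offset)
{
	unsigned long idx;

	if (fault_offset < folio->first_swap_offset)
		return NULL;

	idx = fault_offset - folio->first_swap_offset;
	if (idx >= folio->nr_pages)
		return NULL;

	return &folio->pages[idx];
}

/* Caller: resolves the target page, then hands only that page down. */
static int toy_do_swap_page(struct toy_folio *folio,
			    unsigned long fault_offset)
{
	struct toy_page *page = toy_fault_page(folio, fault_offset);

	if (!page)
		return -1;

	toy_swap_restore(fault_offset, page);
	return 0;
}
```

The point is only that the index arithmetic lives in the caller, which
already knows whether it hit a large folio in the swapcache, while the
restore function keeps a single-page contract.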

As a side note, if the folio is still in the cache, doesn't that imply that the
tags haven't been torn down yet? So perhaps you can avoid even making the call
in this case?

>>
>>>
>>> Thanks,
>>> Ryan
> 
> Thanks
> Barry

