All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ryan Roberts <ryan.roberts@arm.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>,
	Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	Barry Song <21cnbao@gmail.com>, Chris Li <chrisl@kernel.org>,
	Lance Yang <ioworker0@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Date: Wed, 3 Apr 2024 08:21:14 +0100	[thread overview]
Message-ID: <4020c532-d20d-4624-8ea6-607de423396c@arm.com> (raw)
In-Reply-To: <B0E526FD-64CF-4653-B624-1AFA5B7AA245@nvidia.com>

On 03/04/2024 01:30, Zi Yan wrote:
> On 27 Mar 2024, at 10:45, Ryan Roberts wrote:
> 
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/pgtable.h | 28 +++++++++++++++
>>  include/linux/swap.h    | 12 +++++--
>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>  mm/madvise.c            | 12 ++++---
>>  mm/memory.c             | 13 +++----
>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 09c85c7bf9c2..8185939df1e8 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>  }
>>  #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	for (;;) {
>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +}
>> +#endif
>> +
> 
> Would the code below be better?
> 
> for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> 	pte_clear_not_present_full(mm, addr, ptep, full);

I certainly agree that this is cleaner and more standard. But I'm copying the
pattern used by the other batch helpers. I believe this pattern was first done
by Willy for set_ptes(), then continued by DavidH for wrprotect_ptes() and
clear_full_ptes().

I guess the benefit is that ptep and addr are only incremented if we are going
around the loop again. I'd rather continue to be consistent with those other
helpers.


> 
> --
> Best Regards,
> Yan, Zi


  parent reply	other threads:[~2024-04-03  7:21 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-03-29  1:56   ` Huang, Ying
2024-04-05  9:22   ` David Hildenbrand
2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
2024-04-01  5:52   ` Huang, Ying
2024-04-02 11:15     ` Ryan Roberts
2024-04-03  3:57       ` Huang, Ying
2024-04-03  7:16         ` Ryan Roberts
2024-04-03  0:30   ` Zi Yan
2024-04-03  0:47     ` Lance Yang
2024-04-03  7:21     ` Ryan Roberts [this message]
2024-04-05  9:24       ` David Hildenbrand
2024-03-27 14:45 ` [PATCH v5 3/6] mm: swap: Simplify struct percpu_cluster Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
2024-04-01  3:15   ` Huang, Ying
2024-04-02 11:18     ` Ryan Roberts
2024-04-03  3:07       ` Huang, Ying
2024-04-03  7:48         ` Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
2024-03-28  8:18   ` Barry Song
2024-03-28  8:48     ` Ryan Roberts
2024-04-02 13:10     ` Ryan Roberts
2024-04-02 13:22       ` Lance Yang
2024-04-02 13:22       ` Ryan Roberts
2024-04-02 22:54         ` Barry Song
2024-04-05  4:06       ` Barry Song
2024-04-05  7:28         ` Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
2024-04-01 12:25   ` Lance Yang
2024-04-02 11:20     ` Ryan Roberts
2024-04-02 11:30       ` Lance Yang
2024-04-02 10:16   ` Barry Song
2024-04-02 10:56     ` Ryan Roberts
2024-04-02 11:01       ` Ryan Roberts

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4020c532-d20d-4624-8ea6-607de423396c@arm.com \
    --to=ryan.roberts@arm.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=ioworker0@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=shy828301@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=xiang@kernel.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.