From: David Hildenbrand <david@redhat.com>
To: Barry Song <21cnbao@gmail.com>, ryan.roberts@arm.com
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, mhocko@suse.com, shy828301@gmail.com,
	wangkefeng.wang@huawei.com, willy@infradead.org,
	xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com,
	chrisl@kernel.org, surenb@google.com, hanchuanhua@oppo.com
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
Date: Thu, 22 Feb 2024 11:09:33 +0100	[thread overview]
Message-ID: <1a9fcdcd-c0dd-46dd-9c03-265a6988eeb2@redhat.com> (raw)
In-Reply-To: <20240222070544.133673-1-21cnbao@gmail.com>

On 22.02.24 08:05, Barry Song wrote:
> Hi Ryan,
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>   					if (!can_split_folio(folio, NULL))
>>   						goto activate_locked;
>>   					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>   					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>   					    split_folio_to_list(folio,
>>   								folio_list))
>>   						goto activate_locked;
> 
> I ran a test to investigate what happens while reclaiming a partially
> unmapped large folio. For example, for a 64KiB large folio, MADV_DONTNEED
> the range 4KiB~64KiB and keep only the first subpage, 0~4KiB, mapped.
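> 
> A minimal userspace sketch of that scenario looks roughly like the below
> (hypothetical and simplified, not my exact test; whether a 64KiB mTHP is
> actually allocated depends on the anon mTHP sysfs settings):
> 
> #include <stdint.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/mman.h>
> 
> int main(void)
> {
> 	size_t sz = 64 * 1024;
> 	/* over-allocate so we can pick a 64KiB-aligned start inside the VMA */
> 	char *raw = mmap(NULL, 2 * sz, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	char *p;
> 
> 	if (raw == MAP_FAILED)
> 		return 1;
> 	p = (char *)(((uintptr_t)raw + sz - 1) & ~(uintptr_t)(sz - 1));
> 
> 	memset(p, 1, sz);			/* fault in one 64KiB large folio */
> 	madvise(p + 4096, sz - 4096, MADV_DONTNEED);	/* zap subpages 1..15 */
> 	pause();	/* now apply memory pressure so reclaim picks the folio */
> 	return 0;
> }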

IOW, something that already happens with ordinary THP, IIRC.

>   
> My test aims to address three concerns of mine:
> a. whether we leak swap slots
> b. whether we do redundant I/O
> c. whether we cause races on the swapcache
> 
> What I have done is print folio->_nr_pages_mapped and dump the 16 swap_map[]
> entries at some specific stages:
> 1. just after add_to_swap   (swap slots are allocated)
> 2. before and after try_to_unmap   (PTEs are set to swap entries)
> 3. before and after pageout (also added a printk in the zram driver to dump all I/O writes)
> 4. before and after remove_mapping
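> 
> The instrumentation was roughly like the below (an untested sketch from
> memory; the helper is hypothetical and the folio fields it touches may be
> named differently in your tree):
> 
> /* debug-only helper, called at each of the four stages above */
> static void dump_folio_swap_state(struct folio *folio, const char *stage)
> {
> 	swp_entry_t entry = folio->swap;
> 	struct swap_info_struct *si;
> 	long i, nr = folio_nr_pages(folio);
> 
> 	pr_info("vmscan: %s mapnr:%d\n", stage,
> 		atomic_read(&folio->_nr_pages_mapped));
> 	if (!entry.val)
> 		return;
> 	si = swp_swap_info(entry);
> 	pr_info("vmscan: offset:%lx swp_map", swp_offset(entry));
> 	for (i = 0; i < nr; i++)
> 		pr_cont(" %02x", si->swap_map[swp_offset(entry) + i]);
> 	pr_cont("\n");
> }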
> 
> Below is the dumped info for one particular large folio:
> 
> 1. after add_to_swap
> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> As you can see, _nr_pages_mapped is 1 and all 16 swap_map[] entries are
> SWAP_HAS_CACHE (0x40).
> 
> 
> 2. before and after try_to_unmap
> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> As you can see, one PTE is set to a swap entry, and _nr_pages_mapped drops
> from 1 to 0. The first swap_map entry becomes 0x41, i.e. SWAP_HAS_CACHE + 1.
> 
> 3. before and after pageout
> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> 
> As you can see, we have clearly done redundant I/O - 16 zram_write_page calls.
> Although 4KiB~64KiB was already zapped by zap_pte_range() earlier, we still
> write those subpages out to zRAM.
> 
> 4. before and after remove_mapping
> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> 
> As you can see, swap_map entries 1-15 become 0 and only the first one is 1;
> all SWAP_HAS_CACHE bits have been cleared. This is perfect, and there is no
> swap slot leak at all!
> 
> Thus, only two concerns are left for me:
> 1. as we don't split anyway, we do 15 unnecessary page writes if a large folio
> is partially unmapped.
> 2. the large folio is added to the swapcache as a whole, covering a range that
> has partially been zapped. I am not quite sure whether this causes problems
> when concurrent do_anon_page, swap-in and swap-out occur between stages 3 and
> 4 on the zapped subpages 1~15. Still struggling... my brain is exploding...

Just noting: I was running into something different in the past with 
THP. And it's effectively the same scenario, just swapout and 
MADV_DONTNEED reversed.

Imagine you swapped out a THP and the THP is still in the swapcache.

Then you unmap/zap some PTEs, freeing up the swap slots.

In zap_pte_range(), we'll call free_swap_and_cache(). There, we fail the 
"!swap_page_trans_huge_swapped(p, entry)" check, because other entries of the 
huge swap range are still swapped, so we won't call __try_to_reclaim_swap().

So we won't split the large folio that is in the swapcache, and it will 
continue consuming "more memory" than intended until it is fully evicted.
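
The relevant path is roughly the following (paraphrased from memory, not a 
verbatim copy of mm/swapfile.c, so names and flags may be slightly off):

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		/*
		 * Only reclaim the swapcache folio if no entry of the
		 * (possibly huge) swap range is still swapped by a PTE.
		 * For a partially zapped large folio this check keeps
		 * failing, so the folio stays in the swapcache unsplit.
		 */
		if (count == SWAP_HAS_CACHE &&
		    !swap_page_trans_huge_swapped(p, entry))
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}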

> 
> To me, it seems safer to split, or to do some other similar optimization, if
> we find that a large folio is partially mapped and partially unmapped.

I'm hoping that we can avoid any new direct users of _nr_pages_mapped if 
possible.

If we find that the folio is on the deferred split list, we might as 
well just split it right away, before swapping it out. That might be a 
reasonable optimization for the case you describe.
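
Something like the below, on top of your hunk (an untested sketch; directly 
peeking at folio->_deferred_list is only meant to illustrate the idea and 
would want a proper helper plus the appropriate locking):

			/*
			 * Partially-unmapped large folios sit on the deferred
			 * split list; split them before swapout so we don't
			 * write out subpages that were already zapped.
			 */
			if (folio_test_large(folio) &&
			    !list_empty(&folio->_deferred_list) &&
			    split_folio_to_list(folio, folio_list))
				goto activate_locked;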

-- 
Cheers,

David / dhildenb


