All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: "zhangliang (AG)" <zhangliang5@huawei.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	wangzhigang17@huawei.com, Matthew Wilcox <willy@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page
Date: Thu, 20 Jan 2022 20:55:12 +0100	[thread overview]
Message-ID: <9071d5a8-ed2d-5cf5-5526-43fe7dd377ec@redhat.com> (raw)
In-Reply-To: <6225EAFF-B323-4DC5-AC4C-885B29ED7261@gmail.com>

On 20.01.22 19:11, Nadav Amit wrote:
> 
> 
>> On Jan 20, 2022, at 10:00 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.01.22 18:48, Nadav Amit wrote:
>>>
>>>> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>>>> Sure, I will do that :)
>>>>
>>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>>>
>>>
>>> Sorry for being late for the discussion.
>>>
>>> David, does any of it regards the lru_cache_add() reference issue that I
>>> mentioned? [1]
>>
>> No, unfortunately not in that part of my work. *Maybe* we could also try
>> to handle that reference similarly to the swapcache, but the question is
>> if we can't wait for PageAnonExclusive.
>>
>> Right now I have the following in mind to get most parts working as
>> exptected:
>>
>> 1. Optimize reuse logic for the swapcache as it seems to be easy
>> 2. Streamline COW logic and remove reuse_swap_page() -- fix the CVE for
>>   THP.
>> 3. Introduce PageAnonExclusive and allow FOLL_PIN only on
>>   PageAnonExclusive pages.
>> 4. Convert O_DIRECT to FOLL_PIN
>>
>> We will never ever have to copy a page PageAnonExclusive page in the COW
>> handler and can immediately reuse it without even locking the page. The
>> existing reuse logic is essentially then used to reset PageAnonExclusive
>> on a page (thus it makes sense to work on it) where the flag is not set
>> anymore -- or on a fresh page if we have to copy.
>>
>> That implies that all these additional references won't care if your app
>> doesn't fork() or KSM isn't active. Consequently, anything that
>> read-protects anonymous pages will work as expected and should be as
>> fast as it gets.
>>
>> Sounds good? At least to me. If only swap/migration entries wouldn't be
>> harder to handle than I'd wish, that's why it's taking a little and will
>> take a little longer.
> 
> Thanks for the quick response. I would have to see the logic to set/clear
> PageAnonExclusive to fully understand how things are handled.
> 
> BTW, I just saw this patch form PeterZ [1] that seems to be related, as
> it deals with changing protection on pinned pages.

Hi Nadav,

I'm trying to see how effective the following patch is with your forceswap.c [1] reproducer.

commit b08d494deb319a63b7c776636b960258c48775e1
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Jan 14 09:29:52 2022 +0100

    mm: optimize do_wp_page() for exclusive pages in the swapcache
    
    Let's optimize for a page with a single user that has been added to the
    swapcache. Try removing the swapcache reference if there is hope that
    we're the exclusive user, but keep the page_count(page) == 1 check in
    place.
    
    Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
    users next.
    
    While at it, remove the superfluous page_mapcount() check: it's
    implicitly covered by the page_count() for ordinary anon pages.
    
    Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/memory.c b/mm/memory.c
index f306e698a1e3..d9186981662a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
        if (PageAnon(vmf->page)) {
                struct page *page = vmf->page;
 
-               /* PageKsm() doesn't necessarily raise the page refcount */
-               if (PageKsm(page) || page_count(page) != 1)
+               /*
+                * PageKsm() doesn't necessarily raise the page refcount.
+                *
+                * These checks are racy as long as we haven't locked the page;
+                * they are a pure optimization to avoid trying to lock the page
+                * and trying to free the swap cache when there is little hope
+                * it will actually result in a refcount of 1.
+                */
+               if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
                        goto copy;
                if (!trylock_page(page))
                        goto copy;
-               if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+               if (PageSwapCache(page))
+                       try_to_free_swap(page);
+               if (PageKsm(page) || page_count(page) != 1) {
                        unlock_page(page);
                        goto copy;
                }
                /*
-                * Ok, we've got the only map reference, and the only
-                * page count reference, and the page is locked,
-                * it's dark out, and we're wearing sunglasses. Hit it.
+                * Ok, we've got the only page reference from our mapping
+                * and the page is locked, it's dark out, and we're wearing
+                * sunglasses. Hit it.
                 */
                unlock_page(page);
                wp_page_reuse(vmf);


I added some vmstats that monitor various paths. After one run of
	./forceswap 2 1000000 1
I'm left with a rough delta (including some noise) of
	anon_wp_copy_count 1799
	anon_wp_copy_count_early 1
	anon_wp_copy_lock 983396
	anon_wp_reuse 0

The relevant part of your reproducer is

	for (i = 0; i < nops; i++) {
		if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
			perror("madvise");
			exit(-1);
		}

		for (j = 0; j < npages; j++) {
			c = p[j * PAGE_SIZE];
			c++;
			time -= rdtscp();
			p[j * PAGE_SIZE] = c;
			time += rdtscp();
		}
	}

For this specific reproducer at least, the page lock seems to be the thingy that prohibits
reuse if I interpret the numbers correctly. We pass the initial page_count() check.

Haven't looked into the details, and I would be curious how that performs with actual
workloads, if we can reproduce similar behavior.


[1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2022-01-20 19:55 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-13 14:03 [PATCH] mm: reuse the unshared swapcache page in do_wp_page Liang Zhang
2022-01-13 14:39 ` Matthew Wilcox
2022-01-13 14:46   ` David Hildenbrand
2022-01-13 15:02     ` Matthew Wilcox
2022-01-13 15:04       ` David Hildenbrand
2022-01-13 16:37   ` Linus Torvalds
2022-01-13 16:48     ` David Hildenbrand
2022-01-13 17:14       ` Linus Torvalds
2022-01-13 17:25         ` David Hildenbrand
2022-01-13 17:44           ` Linus Torvalds
2022-01-13 17:55             ` David Hildenbrand
2022-01-13 18:55               ` Linus Torvalds
2022-01-13 21:07             ` Matthew Wilcox
2022-01-13 22:21               ` Linus Torvalds
2022-01-14  5:00       ` zhangliang (AG)
2022-01-14 11:23         ` David Hildenbrand
2022-01-17  2:11           ` zhangliang (AG)
2022-01-17 12:58             ` David Hildenbrand
2022-01-17 13:31               ` zhangliang (AG)
2022-01-20 14:15                 ` David Hildenbrand
2022-01-20 14:39                   ` Matthew Wilcox
2022-01-20 15:26                     ` David Hildenbrand
2022-01-20 15:36                       ` Matthew Wilcox
2022-01-20 15:39                         ` David Hildenbrand
2022-01-20 15:45                           ` Matthew Wilcox
2022-01-20 15:51                             ` David Hildenbrand
2022-01-20 16:09                               ` Matthew Wilcox
2022-01-20 16:35                                 ` David Hildenbrand
2022-01-20 15:37                       ` Linus Torvalds
2022-01-20 15:46                         ` David Hildenbrand
2022-01-20 17:22                           ` Linus Torvalds
2022-01-20 17:49                             ` David Hildenbrand
2022-01-20 17:48                   ` Nadav Amit
2022-01-20 18:00                     ` David Hildenbrand
2022-01-20 18:11                       ` Nadav Amit
2022-01-20 18:19                         ` David Hildenbrand
2022-01-20 19:55                         ` David Hildenbrand [this message]
2022-01-20 20:07                           ` Matthew Wilcox
2022-01-20 20:09                             ` David Hildenbrand
2022-01-20 20:37                               ` David Hildenbrand
2022-01-20 20:46                                 ` Nadav Amit
2022-01-20 20:49                                   ` David Hildenbrand
2022-01-21  9:01                                     ` David Hildenbrand
2022-01-21 17:43                                       ` Nadav Amit
2022-01-20 20:18                           ` David Hildenbrand
2022-01-14  3:29   ` zhangliang (AG)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9071d5a8-ed2d-5cf5-5526-43fe7dd377ec@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wangzhigang17@huawei.com \
    --cc=willy@infradead.org \
    --cc=zhangliang5@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.