From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Zhao <yuzhao@google.com>,  Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	 Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	 "Kirill A. Shutemov" <kirill@shutemov.name>,
	Matthew Wilcox <willy@infradead.org>,
	 Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	 John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,  Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Nadav Amit <nadav.amit@gmail.com>,  Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Date: Sat, 9 Jan 2021 17:19:51 -0800
Message-ID: <CAHk-=wj5=1DKbQut1-21EwQbMSghNL3KOSd82rNrBhuG9+eekA@mail.gmail.com>
In-Reply-To: <CAHk-=wghqNywtf=sRv_5FmG=+hPGqj=KWakw34tNeoZ1wPuaHg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1680 bytes --]

On Sat, Jan 9, 2021 at 4:55 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What part of "clear_refs is the _least_ important of the three cases"
> are you not willing to understand?

In fact, I couldn't even turn on that code with my normal config,
because it depends on CONFIG_CHECKPOINT_RESTORE that I didn't even
have enabled.

IOW, that code is some special-case stuff, and instead of messing up
the rest of the VM, it should be made to conform to all the normal VM
rules and requirements.

Here's two patches to basically start doing that.

The first one is the same one I already sent out earlier, fixing the
locking. And yes, it can be improved upon, but before improving on it,
let's _fix_ the code.

The second is a trivial "oh, look, I can see that the page is pinned,
soft-dirty cannot work so don't do it then". Again, it can be improved
upon, most particularly by doing the same (simple) tests for the
hugepage case too, which I didn't do.

Note: I have not a single actual user of this code that I can test
with, so this is all ENTIRELY untested.

IOW, I am in no way claiming that these patches are perfect and
correct, and the only way to do things.

But what I _am_ claiming is that this clear_refs code (and the UFFD
code) is of secondary importance, and instead of messing up the core
VM, we should fix these special cases to not do bad things.

It really is that simple.

And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
for writing. For whoever wants to look at that, it's
mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
turn the read-lock (and unlock) into a write-lock (and unlock).

           Linus
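
The soft-dirty interface this message discusses can be exercised from userspace. The following is only a minimal sketch, not part of the original message or patches. It assumes a Linux kernel built with CONFIG_MEM_SOFT_DIRTY, and relies on the documented behaviour that writing "4" to /proc/PID/clear_refs clears the soft-dirty bits and that bit 55 of a /proc/PID/pagemap entry reports the soft-dirty state. The helper names (pagemap_entry_soft_dirty, clear_soft_dirty_self, page_soft_dirty, soft_dirty_demo) are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag (per the kernel's
 * pagemap/soft-dirty documentation). */
#define PM_SOFT_DIRTY (1ULL << 55)

/* Pure helper: decode the soft-dirty flag from a raw pagemap entry. */
static int pagemap_entry_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}

/* Write "4" to /proc/self/clear_refs; this is the request that ends up
 * in clear_refs_write() with type CLEAR_REFS_SOFT_DIRTY.
 * Returns 0 on success, -1 on failure. */
static int clear_soft_dirty_self(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/* Returns 1 or 0 for the soft-dirty state of the page containing addr,
 * or -1 if the pagemap entry could not be read. */
static int page_soft_dirty(const void *addr)
{
	long psz = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t off;
	ssize_t n;
	int fd;

	assert(psz > 0);
	off = (off_t)((uintptr_t)addr / (uintptr_t)psz) * (off_t)sizeof(entry);
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, &entry, sizeof(entry), off);
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return pagemap_entry_soft_dirty(entry);
}

/* Map one private anonymous page, clear soft-dirty, then dirty it again
 * and print the flag before and after. Returns 0 if the sequence ran. */
int soft_dirty_demo(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *p;

	assert(psz > 0);
	p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	*(volatile char *)p = 1;		/* fault the page in */
	if (clear_soft_dirty_self() != 0) {
		munmap(p, (size_t)psz);
		return -1;
	}
	printf("after clear_refs: soft-dirty=%d\n", page_soft_dirty(p));

	*(volatile char *)p = 2;		/* write-protect fault sets the bit again */
	printf("after write:      soft-dirty=%d\n", page_soft_dirty(p));

	munmap(p, (size_t)psz);
	return 0;
}
```

Calling soft_dirty_demo() from a small main() is enough to hit the clear_refs path. On a kernel with soft-dirty tracking enabled, one would expect the flag to read 0 after the clear and 1 after the second write. The values depend on the kernel configuration, so the sketch prints them and does not assert them.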

[-- Attachment #2: 0001-mm-fix-clear_refs_write-locking.patch --]
[-- Type: text/x-patch, Size: 2315 bytes --]

From dacb5de62b654f1f5df1147e263b5b4e5fe2af44 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 8 Jan 2021 13:13:41 -0800
Subject: [PATCH 1/2] mm: fix clear_refs_write locking

Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back.  Just get the write lock.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/proc/task_mmu.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..ab7d700b2caa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1215,41 +1215,26 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.type = type,
 		};
 
+		if (mmap_write_lock_killable(mm)) {
+			count = -EINTR;
+			goto out_mm;
+		}
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
-			if (mmap_write_lock_killable(mm)) {
-				count = -EINTR;
-				goto out_mm;
-			}
-
 			/*
 			 * Writing 5 to /proc/pid/clear_refs resets the peak
 			 * resident set size to this mm's current rss value.
 			 */
 			reset_mm_hiwater_rss(mm);
-			mmap_write_unlock(mm);
-			goto out_mm;
+			goto out_unlock;
 		}
 
-		if (mmap_read_lock_killable(mm)) {
-			count = -EINTR;
-			goto out_mm;
-		}
 		tlb_gather_mmu(&tlb, mm, 0, -1);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
-				mmap_read_unlock(mm);
-				if (mmap_write_lock_killable(mm)) {
-					count = -EINTR;
-					goto out_mm;
-				}
-				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-					vma_set_page_prot(vma);
-				}
-				mmap_write_downgrade(mm);
-				break;
+				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vma_set_page_prot(vma);
 			}
 
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
@@ -1261,7 +1246,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb, 0, -1);
-		mmap_read_unlock(mm);
+out_unlock:
+		mmap_write_unlock(mm);
 out_mm:
 		mmput(mm);
 	}
-- 
2.29.2.157.g1d47791a39


[-- Attachment #3: 0002-mm-don-t-play-games-with-pinned-pages-in-clear_page_.patch --]
[-- Type: text/x-patch, Size: 1818 bytes --]

From b40950b647509f7222e1f7174d61045d15f56f1c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 9 Jan 2021 17:09:10 -0800
Subject: [PATCH 2/2] mm: don't play games with pinned pages in clear_page_refs

Turning a pinned page read-only breaks the pinning after COW. Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/proc/task_mmu.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ab7d700b2caa..0377081021b7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1035,6 +1035,25 @@ struct clear_refs_private {
 };
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
+
+#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
+
+static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
+{
+	struct page *page;
+
+	if (!is_cow_mapping(vma->vm_flags))
+		return false;
+	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
+		return false;
+	page = vm_normal_page(vma, addr, pte);
+	if (!page)
+		return false;
+	if (page_mapcount(page) != 1)
+		return false;
+	return page_maybe_dma_pinned(page);
+}
+
 static inline void clear_soft_dirty(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *pte)
 {
@@ -1049,6 +1068,8 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 	if (pte_present(ptent)) {
 		pte_t old_pte;
 
+		if (pte_is_pinned(vma, addr, ptent))
+			return;
 		old_pte = ptep_modify_prot_start(vma, addr, pte);
 		ptent = pte_wrprotect(old_pte);
 		ptent = pte_clear_soft_dirty(ptent);
-- 
2.29.2.157.g1d47791a39
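
The order of the checks in pte_is_pinned() above can be modelled in plain userspace C. This is only a sketch for illustration, not kernel code. The is_cow_mapping() macro is copied from the patch, and the VM_SHARED and VM_MAYWRITE values are the ones from include/linux/mm.h. struct toy_pte_state and toy_pte_is_pinned() are hypothetical stand-ins: each field replaces a kernel query (mm->has_pinned, vm_normal_page(), page_mapcount(), page_maybe_dma_pinned()) with a precomputed answer.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in include/linux/mm.h. */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Same predicate as in the patch: a private mapping that may be written. */
#define is_cow_mapping(flags) \
	(((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)

/* Hypothetical stand-in for the state the kernel helper inspects. */
struct toy_pte_state {
	unsigned long vm_flags;	/* vma->vm_flags */
	bool mm_has_pinned;	/* atomic_read(&mm->has_pinned) != 0 */
	bool has_normal_page;	/* vm_normal_page() returned a page */
	int mapcount;		/* page_mapcount(page) */
	bool maybe_dma_pinned;	/* page_maybe_dma_pinned(page) */
};

/* Mirrors the order of the early returns in pte_is_pinned(). */
static bool toy_pte_is_pinned(const struct toy_pte_state *s)
{
	if (!is_cow_mapping(s->vm_flags))
		return false;	/* shared or never-writable: COW cannot happen */
	if (!s->mm_has_pinned)
		return false;	/* common case: this mm never pinned anything */
	if (!s->has_normal_page)
		return false;	/* special mapping without a struct page */
	if (s->mapcount != 1)
		return false;	/* page is mapped more than once */
	return s->maybe_dma_pinned;
}
```

In this model only a private, writable mapping of an exclusively mapped page, in an mm that has pinned something, can report true. That is the case where clear_soft_dirty() in the patch returns early and does not write-protect the PTE.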

