linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Date: Thu, 7 Jan 2021 16:25:54 -0800	[thread overview]
Message-ID: <CAHk-=whTCBa6Frpbveuy7Hnz17P+g03yQvynkApFbBjV5rVrsA@mail.gmail.com> (raw)
In-Reply-To: <X/edsWgguQDgsOtx@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 3091 bytes --]

On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> > The alternate fix remains "make sure we always flush the TLB before
> > releasing the page table lock, and make COW do the copy under the page
> > table lock". But I really would prefer to just have this code follow

> The copy under PT lock isn't enough.
>
> Flush TLB before releasing is enough of course.

Right. That's why I said "and". You need both, afaik.

But if we just do the mmap write lock, you need neither - then you
just need to flush before you release the write lock.

> Note also the patch in 2/2 patch that I sent:

Yes, yes, and that's what I'm objecting to.

All these stupid games with "flush_pending(" counts are complete garbage.

They all come from the fact that this code doesn't hold the right lock.

I don't understand you: in one breath you seem to say "yes, taking the
write lock is the right thing to do", and then in the next one you
point to this patch that adds all this garbage *because* it's not
holding the write lock.

All of those "tlb_flush_pending" things are wrong. They should not
exist. The code in clear_refs_write() should hold the mmap_sem for
writing, and do the TLB flush before it releases the mmap sem, and
then it *cannot* race with the page faults.

See what I'm saying? I refuse to apply your patch 2/2, because it all
seems entirely wrong.

What's doubly ludicrous about that is that the coide already _took_
the mmap_sem for writing, and spent extra cycles to downgrade it -
INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't
even do anything expensive inside that (now downgraded) region, so the
downgrading was

 (a) buggy

 (b) slower than just keeping the lock the way it had

and (b) is because it had already done the expensive part (which was
to get the lock in the first place).

Just as an example, the whole "Rollback wrprotect_tlb_flush_pending"
is all because it got the lock - again wrongly - as a read-lock
initially, then it says "oh, I need to get a write lock", releases it,
re-takes it as a write lock, does a tiny amount of work, and then -
again incorrectly - downgrades it to a read-lock.

To make it even worse (if that is possible) it actually had YET
ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap
sem for writing, did its thing, and then released it.

So there's like *four* different locking mistakes in that single
function. And it's not even an important function to begin with.

It shgould just have done a single

        mmap_write_lock_killable(mm);
        ...
        mmap_write_unlock(mm);

around the whole thing, instead of _any_ of that crazy stuff.

That code is WRONG.

And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is
that magic fs/proc/ interface allowed to get VM internals so wrong,
and make things so much worse?

Can you not see why I'm arguing with you?

Please. Why is the correct patch not the attached one (apart from the
obvious fact that I haven't tested it and maybe just screwed up
completely - but you get the idea)?

            Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1811 bytes --]

 fs/proc/task_mmu.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..ab7d700b2caa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1215,41 +1215,26 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.type = type,
 		};
 
+		if (mmap_write_lock_killable(mm)) {
+			count = -EINTR;
+			goto out_mm;
+		}
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
-			if (mmap_write_lock_killable(mm)) {
-				count = -EINTR;
-				goto out_mm;
-			}
-
 			/*
 			 * Writing 5 to /proc/pid/clear_refs resets the peak
 			 * resident set size to this mm's current rss value.
 			 */
 			reset_mm_hiwater_rss(mm);
-			mmap_write_unlock(mm);
-			goto out_mm;
+			goto out_unlock;
 		}
 
-		if (mmap_read_lock_killable(mm)) {
-			count = -EINTR;
-			goto out_mm;
-		}
 		tlb_gather_mmu(&tlb, mm, 0, -1);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
-				mmap_read_unlock(mm);
-				if (mmap_write_lock_killable(mm)) {
-					count = -EINTR;
-					goto out_mm;
-				}
-				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-					vma_set_page_prot(vma);
-				}
-				mmap_write_downgrade(mm);
-				break;
+				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vma_set_page_prot(vma);
 			}
 
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
@@ -1261,7 +1246,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb, 0, -1);
-		mmap_read_unlock(mm);
+out_unlock:
+		mmap_write_unlock(mm);
 out_mm:
 		mmput(mm);
 	}

  reply	other threads:[~2021-01-08  0:34 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds [this message]
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
     [not found]                     ` <20210109034958.6928-1-hdanton@sina.com>
2021-01-11 14:39                       ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=whTCBa6Frpbveuy7Hnz17P+g03yQvynkApFbBjV5rVrsA@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=aarcange@redhat.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=keescook@chromium.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).