linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <nadav.amit@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Nadav Amit <namit@vmware.com>, Yu Zhao <yuzhao@google.com>,
	Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Mon, 4 Jan 2021 14:24:38 -0500	[thread overview]
Message-ID: <X/NrdnoDHgFd0Ku1@redhat.com> (raw)
In-Reply-To: <20210104122227.GL3021@hirez.programming.kicks-ass.net>

Hello,

On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> 
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > 
> > cpu0				cpu1			cpu2
> > ----				----			----
> > 							[ Writable PTE
> > 							  cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> > 
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> > 				[ page-fault ]
> > 				...
> > 				wp_page_copy()
> > 				 cow_user_page()
> > 				  [ copy page ]
> > 							[ write to old
> > 							  page ]
> > 				...
> > 				 set_pte_at_notify()
> 
> Yuck!
> 

Note, the above was posted before we figured out the details so it
wasn't showing the real deferred tlb flush that caused problems (the
one showed on the left causes zero issues).

The problematic one not pictured is the one of the wrprotect that has
to be running in another CPU which is also isn't picture above. More
accurate traces are posted later in the thread.

> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
> 
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead

How do intend you wait in PT lock while the writer also has to take PT
lock repeatedly before it can do wake_up_var?

If you release the PT lock before calling wait_tlb_flush_pending it
all falls apart again.

This I guess explains why a local pte/hugepmd smp local invlpg is the
only working solution for this issue, similarly to how it's done in rmap.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>  	 *
>  	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>  	 */
> -	atomic_dec(&mm->tlb_flush_pending);
> +	if (atomic_dec_and_test(&mm->tlb_flush_pending))
> +		wake_up_var(&mm->tlb_flush_pending);
>  }
>  
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>  
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	wait_var_event(&mm->tlb_flush_pending,
> +		       atomic_read(&mm->tlb_flush_pending) == 0);
> +}

I appreciate the effort in not regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for
writing.

At the same time what was posted so far wasn't clean enough but it
wasn't even tested... if we abstract it in some clean way and we mark
all connected points (soft dirty, uffd-wp, the wrprotect page fault),
then I can be optimistic it will remain understandable when we look at
it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.

flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out. But I'd rather keep that review in a
separate thread since any bug in that code has zero connection to this
issue. The basic idea is similar but the methods and logic are
different and our flush here will be granular and it's going to be
only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or
if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide,
unconditional etc.. Pretty much all different.

Thanks,
Andrea


  reply	other threads:[~2021-01-04 19:26 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli [this message]
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
     [not found]                     ` <20210109034958.6928-1-hdanton@sina.com>
2021-01-11 14:39                       ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/NrdnoDHgFd0Ku1@redhat.com \
    --to=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=namit@vmware.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).