Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork

From: Jason Gunthorpe <jgg@nvidia.com>
To: Peter Xu <peterx@redhat.com>,
	"Ahmed S. Darwish" <a.darwish@linutronix.de>
Cc: <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>, Hugh Dickins <hughd@google.com>,
	Jan Kara <jack@suse.cz>, Jann Horn <jannh@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@suse.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork
Date: Fri, 30 Oct 2020 20:51:21 -0300	[thread overview]
Message-ID: <20201030235121.GQ2620339@nvidia.com> (raw)
In-Reply-To: <20201030225250.GB6357@xz-x1>

On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote:
> Hi, Jason,
> 
> I think majorly the patch looks good to me, but I have a few pure questions
> majorly not directly related to the patch itself, but around the contexts.
> Since I _feel_ like there'll be a new version to update the comments below,
> maybe I can still ask aloud... Please bare with me. :)

No problem

> On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote:
> > Slow GUP is safe against this race because copy_page_range() is only
> > called while holding the exclusive side of the mmap_lock on the src
> > mm_struct.
> 
> Pure question: I understand that this patch requires this, but... Could anyone
> remind me why read lock of mmap_sem is not enough for fork() before this one?

I do not know why fork uses the exclusive lock, however the write side
of the seqcount must be exclusive or it doesn't work properly. Was
happy to find we already had the locking to support this.

> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index 5e5480a0a32d7d..2520f6e05f4d44 100644
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
> >  	.mm_rb			= RB_ROOT,
> >  	.mm_users		= ATOMIC_INIT(2),
> >  	.mm_count		= ATOMIC_INIT(1),
> > +	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
> >  	MMAP_LOCK_INITIALIZER(efi_mm)
> >  	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
> >  	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
> 
> Another pure question: I'm just curious how you find all the statically
> definied mm_structs, and to make sure all of them are covered (just in case
> un-initialized seqcount could fail strangely).

I searched for all MMAP_LOCK_INITIALIZER() places and assumed that
Michel got them all when he added it :)

> Actually I'm thinking whether we should have one place to keep all the init
> vars for all the statically definied mm_structs, so we don't need to find them
> everytime, but only change that one place.

I was thinking that as well, most of the places are all the same

> > diff --git a/mm/memory.c b/mm/memory.c
> > index c48f8df6e50268..294c2c3c4fe00d 100644
> > +++ b/mm/memory.c
> > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> >  					0, src_vma, src_mm, addr, end);
> >  		mmu_notifier_invalidate_range_start(&range);
> > +		/*
> > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > +		 * raw version is used to avoid disabling preemption here
> > +		 */
> > +		mmap_assert_write_locked(src_mm);
> > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> 
> Would raw_write_seqcount_begin() be better here?

Hum..

I felt no because it had the preempt stuff added into it, however it
would work - __seqcount_lock_preemptible() == false for the seqcount_t
case (see below)

Looking more closely, maybe the right API to pick is
write_seqcount_t_begin() and write_seqcount_t_end() ??

However, no idea what the intention of the '*_seqcount_t_*' family is
- it only seems to be used to implement the seqlock..

Lets add Amhed, perhaps he can give some guidance (see next section)?

> My understanding is that we used raw_write_seqcount_t_begin() because we're
> with spin lock so assuming we disabled preemption already.

Here we rely on the exclusive mmap_lock, not a spinlock. This ensures
only one write side is running concurrently as required by seqcount.

The concern about preemption disable is that it wasn't held for fork()
before, and we don't need it.. I understand preemption disable regions
must be short or the RT people will not be happy, holding one across
all of copy_page_range() sounds bad.

Ahmed explained in commit 8117ab508f the reason the seqcount_t write
side has preemption disabled is because it can livelock RT kernels if
the read side is spinning after preempting the write side. eg look at
how __read_seqcount_begin() is implemented:

	while ((seq = __seqcount_sequence(s)) & 1)			\
		cpu_relax();						\

However, in this patch, we don't spin on the read side.

If the read side collides with a writer it immediately goes to the
mmap_lock, which is sleeping, and so it will sort itself out properly,
even if it was preempted.

> An even further pure question on __seqcount_preemptible() (feel free to ignore
> this question!): I saw that __seqcount_preemptible() seems to have been
> constantly defined as "return false".  Not sure what happened there..

The new code has a range of seqcount_t types see
Documentation/locking/seqlock.rst 'Sequence counters with associated
locks'

It uses _Generic to do a bit of meta-programming and creates a compile
time table of lock properties:

SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))

As well as as default set of properties for normal seqcount_t. The
__seqcount_preemptible() is selected by the _Generic for seqcount_t:

#define __seqprop(s, prop) _Generic(*(s),				\
	seqcount_t:		__seqprop_##prop((void *)(s)),		\

And it says preemption must be disabled before using the lock:

static inline void __seqprop_assert(const seqcount_t *s)
{
	lockdep_assert_preemption_disabled();
}

And thus no need to have an automatic disable preemption:

static inline bool __seqprop_preemptible(const seqcount_t *s)
{
	return false;
}

Other lock subtypes are different, eg the codegen for mutex will use
lockdep_assert_held(s->lock) for _assert and true for _preemptible()

So if we map the 'write begin' entry points:

 write_seqcount_begin - Enforces preemption off
 raw_write_seqcount_begin - Auto disable preemption if required (false)
 raw_write_seqcount_t_begin - No preemption stuff
 write_seqcount_t_begin - No preemption stuff

Jason