Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork

From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Ahmed S. Darwish" <a.darwish@linutronix.de>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>, Hugh Dickins <hughd@google.com>,
	Jan Kara <jack@suse.cz>, Jann Horn <jannh@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@suse.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork
Date: Sat, 31 Oct 2020 11:26:05 -0400
Message-ID: <20201031152605.GD6357@xz-x1>
In-Reply-To: <20201030235121.GQ2620339@nvidia.com>

On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote:
> > Another pure question: I'm just curious how you found all the statically
> > defined mm_structs, and how to make sure all of them are covered (just in
> > case an uninitialized seqcount could fail strangely).
> 
> I searched for all MMAP_LOCK_INITIALIZER() places and assumed that
> Michel got them all when he added it :)

Hmm, I should have noticed that before asking.. :)

> 
> > Actually I'm thinking whether we should have one place to keep all the init
> > vars for all the statically defined mm_structs, so we don't need to find them
> > every time, but only change that one place.
> 
> I was thinking that as well, most of the places are all the same

Yes, we can work on top.
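
For illustration only, the "one place" idea could look roughly like the userspace
sketch below. Everything in it is an assumption made up for this example:
struct toy_mm, its fields, and TOY_MM_STATIC_INIT() are hypothetical names and
not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for mm_struct, with just a few fields. */
struct toy_mm {
	int mmap_lock;			/* placeholder for the real lock */
	unsigned int write_protect_seq;	/* the new sequence counter */
	int users;
};

/*
 * Hypothetical macro: the single place that lists every field a
 * statically defined instance must initialize.  Adding a new field
 * means touching only this macro, not every static instance.
 */
#define TOY_MM_STATIC_INIT		\
	.mmap_lock = 0,			\
	.write_protect_seq = 0,		\
	.users = 1

/* Two static instances sharing the same initializer list. */
static struct toy_mm toy_init_mm = { TOY_MM_STATIC_INIT };
static struct toy_mm toy_efi_mm  = { TOY_MM_STATIC_INIT };
```

With something of this shape, a newly added field such as the seqcount would
be picked up by all static instances at once.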

> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > +++ b/mm/memory.c
> > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > >  					0, src_vma, src_mm, addr, end);
> > >  		mmu_notifier_invalidate_range_start(&range);
> > > +		/*
> > > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > > +		 * raw version is used to avoid disabling preemption here
> > > +		 */
> > > +		mmap_assert_write_locked(src_mm);
> > > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> > 
> > Would raw_write_seqcount_begin() be better here?
> 
> Hum..
> 
> I felt no because it had the preempt stuff added into it, however it
> would work - __seqcount_lock_preemptible() == false for the seqcount_t
> case (see below)
> 
> Looking more closely, maybe the right API to pick is
> write_seqcount_t_begin() and write_seqcount_t_end() ??
> 
> However, no idea what the intention of the '*_seqcount_t_*' family is
> - it only seems to be used to implement the seqlock..
> 
> Let's add Ahmed, perhaps he can give some guidance (see next section)?

IMHO we shouldn't directly use these helpers since they seem to only be used by
lock-associated versions of seqcount types.  But yeah, Ahmed would be the best
one to answer...

> 
> > My understanding is that we used raw_write_seqcount_t_begin() because we
> > hold a spinlock there, so preemption is assumed to be disabled already.
> 
> Here we rely on the exclusive mmap_lock, not a spinlock. This ensures
> only one write side is running concurrently as required by seqcount.

So imho here we have these things to consider during one thread updating the
seqcount_t:

  0. Concurrent read is perfectly welcome, for sure.

  1. Concurrent writes on seqcount_t: mm sem protects it.

  2. Preempted write (if possible, maybe on RT?): I think it's also protected
     by mm sem, so looks ok too to me.

  3. Preempted/interrupted read on seqcount_t.  Seems to be the one discussed
     below.  Looks safe to me now with below explanation.  However...

> 
> The concern about preemption disable is that it wasn't held for fork()
> before, and we don't need it.. I understand preemption disable regions
> must be short or the RT people will not be happy, holding one across
> all of copy_page_range() sounds bad.
> 
> Ahmed explained in commit 8117ab508f the reason the seqcount_t write
> side has preemption disabled is because it can livelock RT kernels if
> the read side is spinning after preempting the write side. eg look at
> how __read_seqcount_begin() is implemented:
> 
> 	while ((seq = __seqcount_sequence(s)) & 1)			\
> 		cpu_relax();						\
> 
> However, in this patch, we don't spin on the read side.

... Shall we document this explicitly (if this patch still needs a repost)?
It's not straightforward, since this isn't the usual way to use seqcount.
Not sure whether I'm the only one who feels this way, though.
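
To make the non-spinning read side concrete, here is a minimal userspace sketch.
It is not the patch's code: struct demo_mm, the demo_* helpers, and the pthread
mutex standing in for mmap_lock are all hypothetical, and the atomics use the
default sequentially consistent ordering rather than the kernel's barriers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct. */
struct demo_mm {
	pthread_mutex_t lock;		/* plays the role of mmap_lock */
	atomic_uint write_protect_seq;	/* odd while a write is in progress */
	atomic_int value;		/* data published by the writer */
};

/* Caller must hold mm->lock, which serializes all writers. */
static inline void demo_write_begin(struct demo_mm *mm)
{
	atomic_fetch_add(&mm->write_protect_seq, 1);
}

static inline void demo_write_end(struct demo_mm *mm)
{
	/* A write section must be open, so the count must be odd here. */
	assert(atomic_load(&mm->write_protect_seq) & 1);
	atomic_fetch_add(&mm->write_protect_seq, 1);
}

/* Fast path: returns false instead of spinning when a writer is seen. */
static inline bool demo_read_fast(struct demo_mm *mm, int *out)
{
	unsigned int seq = atomic_load(&mm->write_protect_seq);

	if (seq & 1)
		return false;		/* writer in progress: bail out */

	int v = atomic_load(&mm->value);

	if (atomic_load(&mm->write_protect_seq) != seq)
		return false;		/* a writer ran meanwhile: bail out */

	*out = v;
	return true;
}

/* Full read: try the fast path once, then fall back to the sleeping lock. */
static inline int demo_read(struct demo_mm *mm)
{
	int v;

	if (demo_read_fast(mm, &v))
		return v;

	pthread_mutex_lock(&mm->lock);
	v = atomic_load(&mm->value);
	pthread_mutex_unlock(&mm->lock);
	return v;
}
```

Because demo_read_fast() has no retry loop, a reader that preempts a writer
cannot livelock on it. It fails once and then sleeps on the lock until the
writer finishes.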

> 
> If the read side collides with a writer it immediately goes to the
> mmap_lock, which is sleeping, and so it will sort itself out properly,
> even if it was preempted.
> 
> > An even further pure question on __seqcount_preemptible() (feel free to ignore
> > this question!): I saw that __seqcount_preemptible() seems to have been
> > constantly defined as "return false".  Not sure what happened there..
> 
> The new code has a range of seqcount_t types see
> Documentation/locking/seqlock.rst 'Sequence counters with associated
> locks'
> 
> It uses _Generic to do a bit of meta-programming and creates a compile
> time table of lock properties:
> 
> SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
> SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
> SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))
> 
> As well as a default set of properties for normal seqcount_t. The
> __seqcount_preemptible() is selected by the _Generic for seqcount_t:
> 
> #define __seqprop(s, prop) _Generic(*(s),				\
> 	seqcount_t:		__seqprop_##prop((void *)(s)),		\
> 
> And it says preemption must be disabled before using the lock:
> 
> static inline void __seqprop_assert(const seqcount_t *s)
> {
> 	lockdep_assert_preemption_disabled();
> }
> 
> And thus no need to have an automatic disable preemption:
> 
> static inline bool __seqprop_preemptible(const seqcount_t *s)
> {
> 	return false;
> }
> 
> Other lock subtypes are different, eg the codegen for mutex will use
> lockdep_assert_held(s->lock) for _assert and true for _preemptible()
> 
> So if we map the 'write begin' entry points:
> 
>  write_seqcount_begin - Enforces preemption off
>  raw_write_seqcount_begin - Auto disable preemption if required (false)
>  raw_write_seqcount_t_begin - No preemption stuff
>  write_seqcount_t_begin - No preemption stuff

Thanks for listing these details.
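
For anyone else following along, the _Generic dispatch can be reproduced in a few
lines of standalone C11. This is only a toy under stated assumptions: the toy_*
types and helpers are hypothetical and merely mimic the shape of the seqcount_t
family. They are not the definitions from include/linux/seqlock.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types, one per "lock flavour". */
typedef struct { unsigned int sequence; } toy_seqcount_t;
typedef struct { toy_seqcount_t seqcount; void *lock; } toy_seqcount_raw_spinlock_t;
typedef struct { toy_seqcount_t seqcount; void *lock; } toy_seqcount_mutex_t;

/* Per-type property: may the write side be preempted? */
static inline bool toy_prop_preemptible_plain(const void *s)
{
	(void)s;
	return false;	/* plain counter: caller handles preemption */
}

static inline bool toy_prop_preemptible_raw_spinlock(const void *s)
{
	(void)s;
	return false;	/* raw spinlock already disables preemption */
}

static inline bool toy_prop_preemptible_mutex(const void *s)
{
	(void)s;
	return true;	/* mutex holders can be preempted */
}

/* Compile-time selection of the property function by pointee type. */
#define toy_seqcount_preemptible(s) _Generic(*(s),			\
	toy_seqcount_t:			toy_prop_preemptible_plain,	\
	toy_seqcount_raw_spinlock_t:	toy_prop_preemptible_raw_spinlock, \
	toy_seqcount_mutex_t:		toy_prop_preemptible_mutex	\
	)((const void *)(s))

/* Small self-check of the property table above. */
static inline void toy_check_table(void)
{
	toy_seqcount_t plain = { 0 };
	toy_seqcount_raw_spinlock_t raw = { { 0 }, 0 };
	toy_seqcount_mutex_t mtx = { { 0 }, 0 };

	assert(!toy_seqcount_preemptible(&plain));
	assert(!toy_seqcount_preemptible(&raw));
	assert(toy_seqcount_preemptible(&mtx));
}
```

The selection happens entirely at compile time, which is why the plain
toy_seqcount_t case folds to a constant "false", matching the observation
about the plain seqcount_t property above.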

As a summary, I think I'm convinced we can have this work without disabling
preemption.  It's just that some more comments would make it even better.

The other thing is, considering this use of seqcount seems to be quite special
as explained above, I'm just not sure whether it would confuse lockdep or
kcsan, etc., if we decide to use write_seqcount_t_begin().

Thanks,

-- 
Peter Xu