Re: [PATCH] fs/direct-io: avoid data race on ->s_dio_done_wq

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Eric Biggers <ebiggers@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH] fs/direct-io: avoid data race on ->s_dio_done_wq
Date: Wed, 15 Jul 2020 09:41:18 -0700	[thread overview]
Message-ID: <20200715164118.GB848607@magnolia> (raw)
In-Reply-To: <20200715161342.GA1167@sol.localdomain>

On Wed, Jul 15, 2020 at 09:13:42AM -0700, Eric Biggers wrote:
> On Wed, Jul 15, 2020 at 06:01:44PM +1000, Dave Chinner wrote:
> > > > >  /* direct-io.c: */
> > > > > -int sb_init_dio_done_wq(struct super_block *sb);
> > > > > +int __sb_init_dio_done_wq(struct super_block *sb);
> > > > > +static inline int sb_init_dio_done_wq(struct super_block *sb)
> > > > > +{
> > > > > +	/* pairs with cmpxchg() in __sb_init_dio_done_wq() */
> > > > > +	if (likely(READ_ONCE(sb->s_dio_done_wq)))
> > > > > +		return 0;
> > > > > +	return __sb_init_dio_done_wq(sb);
> > > > > +}
> > > > 
> > > > Ummm, why don't you just add this check in sb_init_dio_done_wq(). I
> > > > don't see any need for adding another level of function call
> > > > abstraction in the source code?
> > > 
> > > This keeps the fast path doing no function calls and one fewer branch, as it was
> > > before.  People care a lot about minimizing direct I/O overhead, so it seems
> > > desirable to keep this simple optimization.  Would you rather it be removed?
> > 
> > No.
> > 
> > What I'm trying to say is that I'd prefer fast path checks don't get
> > hidden away in a static inline function wrappers that require the
> > reader to go look up code in a different file to understand that
> > code in yet another different file is conditionally executed.
> > 
> > Going from obvious, easy to read fast path code to spreading the
> > fast path logic over functions in 3 different files is not an
> > improvement in the code - it is how we turn good code into an
> > unmaintainable mess...
> 
> The alternative would be to duplicate the READ_ONCE() at all 3 call sites --
> including the explanatory comment.  That seems strictly worse.
> 
> And the code before was broken, so I disagree it was "obvious" or "good".
> 
> > 
> > > > Also, you need to explain the reason for the READ_ONCE() existing
> > > > rather than just saying "it pairs with <some other operation>".
> > > > Knowing what operation it pairs with doesn't explain why the pairing
> > > > is necessary in the first place, and that leads to nobody reading
> > > > the code being able to understand what this is protecting against.
> > > > 
> > > 
> > > How about this?
> > > 
> > > 	/*
> > > 	 * Nothing to do if ->s_dio_done_wq is already set.  But since another
> > > 	 * process may set it concurrently, we need to use READ_ONCE() rather
> > > 	 * than a plain read to avoid a data race (undefined behavior) and to
> > > 	 * ensure we observe the pointed-to struct to be fully initialized.
> > > 	 */
> > > 	if (likely(READ_ONCE(sb->s_dio_done_wq)))
> > > 		return 0;
> > 
> > You still need to document what it pairs with, as "data race" doesn't
> > describe the actual dependency we are synchronising against is.
> > 
> > AFAICT from your description, the data race is not on
> > sb->s_dio_done_wq itself, but on seeing the contents of the
> > structure being pointed to incorrectly. i.e. we need to ensure that
> > writes done before the cmpxchg are ordered correctly against
> > reads done after the pointer can be seen here.
> > 
> 
> No, the data race is on ->s_dio_done_wq itself.  How about this:
> 
>         /*
>          * Nothing to do if ->s_dio_done_wq is already set.  The READ_ONCE()
>          * here pairs with the cmpxchg() in __sb_init_dio_done_wq().  Since the
>          * cmpxchg() may set ->s_dio_done_wq concurrently, a plain load would be
>          * a data race (undefined behavior), so READ_ONCE() is needed.
>          * READ_ONCE() also includes any needed read data dependency barrier to
>          * ensure that the pointed-to struct is seen to be fully initialized.
>          */
> 
> FWIW, long-term we really need to get developers to understand these sorts of
> issues, so that the code is written correctly in the first place and we don't
> need to annotate common patterns like one-time-init with a long essay and have a
> long discussion.  Recently KCSAN was merged upstream
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/dev-tools/kcsan.rst)
> and the memory model documentation was improved
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt?h=v5.8-rc5#n1922),
> so hopefully that will raise awareness...

I tried to understand that, but TBH this whole topic area seems very
complex and difficult to understand.  I more or less understand what
READ_ONCE and WRITE_ONCE do wrt restricting compiler optimizations, but
I wouldn't say that I understand all that machinery.

Granted, my winning strategy so far is to write a simple version with
big dumb locks and let the rest of you argue over slick optimizations.
:P

<shrug> If using READ_ONCE and cmpxchg for pointer initialization (or I
guess smp_store_release and smp_load_acquire?) are a commonly used
paradigm, then maybe that should get its own section in
tools/memory-model/Documentation/recipes.txt and then any code that uses
it can point readers at that?

--D

> > If so, can't we just treat this as a normal
> > store-release/load-acquire ordering pattern and hence use more
> > relaxed memory barriers instead of have to patch up what we have now
> > to specifically make ancient platforms that nobody actually uses
> > with weird and unusual memory models work correctly?
> 
> READ_ONCE() is already as relaxed as it can get, as it includes a read data
> dependency barrier only (which is no-op on everything other than Alpha).
> 
> If anything it should be upgraded to smp_load_acquire(), which handles control
> dependencies too.  I didn't see anything obvious in the workqueue code that
> would need that (i.e. accesses to some global structure that isn't transitively
> reachable via the workqueue_struct itself).  But we could use it to be safe if
> we're okay with any performance implications of the additional memory barrier it
> would add.
> 
> - Eric