Re: [PATCH v2 2/4] cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations

From: Jason Gunthorpe <jgg@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-cxl@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"Schofield, Alison" <alison.schofield@intel.com>
Subject: Re: [PATCH v2 2/4] cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations
Date: Tue, 30 Mar 2021 16:26:08 -0300	[thread overview]
Message-ID: <20210330192608.GA1430856@nvidia.com> (raw)
In-Reply-To: <CAPcyv4igMvwfZNgi-Uap_QUJi+uocMUD3KZBhXUy56AuHZQtqw@mail.gmail.com>

On Tue, Mar 30, 2021 at 12:00:23PM -0700, Dan Williams wrote:

> > > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> > >
> > > This usage of srcu is functionally equivalent to replacing
> > > srcu_read_lock() with down_read() and the shutdown path with:
> >
> > Sort of, but the rules for load/store under RCU are different than for
> > load/store under a normal barriered lock. All the data is unstable for
> > instance and minimially needs READ_ONCE.
> 
> The data is unstable under the srcu_read_lock until the end of the
> next rcu grace period, synchronize_rcu() ensures all active
> srcu_read_lock() sections have completed.

No, that isn't how I would phrase it. *any* write side data under RCU
is always unstable by definition in the read side because the write
side can always race with any reader. Thus one should use the RCU
accessors to deal with that data race, and get some acquire/release
semantics when pointer chasing (though this doesn't matter here)

> Unless Paul and I misunderstood each other, this scheme of
> synchronizing object state is also used in kill_dax(), and I put
> that comment there the last time this question was raised. If srcu
> was being used to synchronize the liveness of an rcu object like
> @cxlm or a new ops object then I would expect rcu_dereference +
> rcu_assign_pointer around usages of that object. The liveness of the
> object in this case is handled by kobject reference, or inode
> reference in the case of kill_dax() outside of srcu.

It was probably a mis-understanding as I don't think Paul would say
you should read data in thread A and write to it in B without using
READ_ONCE/WRITE_ONCE or a stronger atomic to manage the data race.

The LWN articles on the "big bad compiler" are informative here. You
don't want the compiler to do a transformation where it loads
state_in_sysfs multiple times and gets different answers. This is what
READ_ONCE is aiming to prevent.

Here is it just a boolean flag, and the flag is only cleared, so risks
are low, but it still isn't a technically correct way to use RCU.

(and yes the kernel is full of examples of not following the memory
model strictly)

> > > cdev_device_del(...);
> > > down_write(...):
> > > up_write(...);
> >
> > The lock would have to enclose the store to state_in_sysfs, otherwise
> > as written it has the same data race problems.
> 
> There's no race above. The rule is that any possible observation of
> ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> flushed.

It is not about the flushing.

Jason