All of lore.kernel.org
 help / color / mirror / Atom feed
* How to abuse RCU to scan the children of a mount?
@ 2020-03-04 17:45 David Howells
  2020-03-04 19:28 ` Paul E. McKenney
  2020-03-05 12:47 ` David Howells
  0 siblings, 2 replies; 5+ messages in thread
From: David Howells @ 2020-03-04 17:45 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: dhowells, Al Viro, mszeredi, linux-fsdevel

Hi Paul,

As part of trying to implement a filesystem information querying system call,
I need to be able to return a list of child mounts of a mount.  Children,
however, may be moved with "mount --move", which means that the list can't be
accessed with normal RCU practices.

For reference, I'm dealing with struct mount and mnt_mounts and mnt_child and
struct mount is released via call_rcu().

What does rcu_dereference() guarantee, exactly?  It's just that the next hop
is set up when you follow the pointer, right?

Can I do something like the attached?  The mount 'm' is pinned, but I need to
trawl m->mnt_mounts.  mount_lock is a seqlock that ticks when mount topology
is rearranged.  I *think* it covers ->mnt_mounts.  Whilst trawling in
non-locked mode, I keep an eye on the seq counter and if it changes, the list
may have been altered and I need to get the real lock and restart.

The objects shouldn't disappear or be destroyed, so I think I'm safe.

Thanks,
David
---

int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
{
	struct fsinfo_mount_child record = {};
	struct mount *m, *child;
	int seq = 0;

	m = real_mount(path->mnt);

	rcu_read_lock();
	do {
		ctx->usage = 0;
		read_seqbegin_or_lock(&mount_lock, &seq);
		list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
			if (need_seqretry(&mount_lock, seq))
				break;
			if (child->mnt_parent != m)
				continue;
			record.mnt_unique_id = child->mnt_unique_id;
			record.mnt_id = child->mnt_id;
			record.notify_sum = 0;
#ifdef CONFIG_SB_NOTIFICATIONS
			record.notify_sum +=
				atomic_read(&child->mnt.mnt_sb->s_change_counter) +
				atomic_read(&child->mnt.mnt_sb->s_notify_counter);
#endif
#ifdef CONFIG_MOUNT_NOTIFICATIONS
			record.notify_sum +=
				atomic_read(&child->mnt_attr_changes) +
				atomic_read(&child->mnt_topology_changes) +
				atomic_read(&child->mnt_subtree_notifications);
#endif
			store_mount_fsinfo(ctx, &record);
		}
	} while (need_seqretry(&mount_lock, seq));
	done_seqretry(&mount_lock, seq);

	rcu_read_unlock();

	/* End the list with a copy of the parameter mount's details so that
	 * userspace can quickly check for changes.
	 */
	record.mnt_unique_id = m->mnt_unique_id;
	record.mnt_id = m->mnt_id;
	record.notify_sum = 0;
#ifdef CONFIG_SB_NOTIFICATIONS
	record.notify_sum +=
		atomic_read(&m->mnt.mnt_sb->s_change_counter) +
		atomic_read(&m->mnt.mnt_sb->s_notify_counter);
#endif
#ifdef CONFIG_MOUNT_NOTIFICATIONS
	record.notify_sum +=
		atomic_read(&m->mnt_attr_changes) +
		atomic_read(&m->mnt_topology_changes) +
		atomic_read(&m->mnt_subtree_notifications);
#endif
	store_mount_fsinfo(ctx, &record);
	return ctx->usage;
}


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to abuse RCU to scan the children of a mount?
  2020-03-04 17:45 How to abuse RCU to scan the children of a mount? David Howells
@ 2020-03-04 19:28 ` Paul E. McKenney
  2020-03-04 19:44   ` Al Viro
  2020-03-05 12:47 ` David Howells
  1 sibling, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2020-03-04 19:28 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, mszeredi, linux-fsdevel

On Wed, Mar 04, 2020 at 05:45:16PM +0000, David Howells wrote:
> Hi Paul,
> 
> As part of trying to implement a filesystem information querying system call,
> I need to be able to return a list of child mounts of a mount.  Children,
> however, may be moved with "mount --move", which means that the list can't be
> accessed with normal RCU practices.
> 
> For reference, I'm dealing with struct mount and mnt_mounts and mnt_child and
> struct mount is released via call_rcu().
> 
> What does rcu_dereference() guarantee, exactly?  It's just that the next hop
> is set up when you follow the pointer, right?
> 
> Can I do something like the attached?  The mount 'm' is pinned, but I need to
> trawl m->mnt_mounts.  mount_lock is a seqlock that ticks when mount topology
> is rearranged.  I *think* it covers ->mnt_mounts.

I took a quick but non-exhaustive look, and didn't find any exceptions.
However, a stress test with lockdep enabled and with the addition of
appropriate lockdep assertions might be helpful here.  I trust lockdep
quite a bit more than I trust my quick looks.  ;-)

>                                                    Whilst trawling in
> non-locked mode, I keep an eye on the seq counter and if it changes, the list
> may have been altered and I need to get the real lock and restart.
> 
> The objects shouldn't disappear or be destroyed, so I think I'm safe.

Famous last words!  ;-)

Huh.  The mount structure isn't suffering from a shortage of list_head
structures, is it?

So the following can happen, then?

o	The __attach_mnt() function adds a struct mount to its parent
	list, but in a non-RCU manner.	Unless there is some other
	safeguard, the list_add_tail() in this function needs to be
	list_add_tail_rcu().

o	I am assuming that the various non-RCU traversals that I see,
	for example, next_mnt(), are protected by lock_mount_hash().
	Especially skip_mnt_tree(), which uses mnt_mounts.prev.  (I didn't
	find any exceptions, but I don't claim an exhaustive search.)

o	The umount_tree() function's use of list_del_init() looks like
	it could trap an RCU reader in the newly singular list formed
	by the removal.  It appears that there are other functions that
	use list_del_init() on this list, though I cannot claim any sort
	of familiarity with this code.

	So, do you need to add a check for child->mnt_child being in this
	self-referential state within fsinfo_generic_mount_children()?

	Plus list_del_init() doesn't mark its stores, though
	some would argue that unmarked stores are OK in this situation.

o	There might be other operations in need of RCU-ification.

	Maybe the list_add_tail() in umount_tree(), but it is not
	immediately clear that this is adding a new element instead of
	re-inserting an element already exposed to readers.

> Thanks,
> David
> ---
> 
> int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
> {
> 	struct fsinfo_mount_child record = {};
> 	struct mount *m, *child;
> 	int seq = 0;
> 
> 	m = real_mount(path->mnt);

What keeps the thing referenced by "m" from going away?  Presumably the
caller has nailed it down somehow, but I have to ask...

> 	rcu_read_lock();
> 	do {
> 		ctx->usage = 0;
> 		read_seqbegin_or_lock(&mount_lock, &seq);

Aside: If there was an updater holding the lock along with a flood of
readers, everyone would hereafter acquire the lock.  Or am I missing
a trick here?  (I would have expected it to try a few times, and only
then acquire the lock.  Yeah, I know, more state would be required.)

> 		list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> 			if (need_seqretry(&mount_lock, seq))
> 				break;
> 			if (child->mnt_parent != m)
> 				continue;
> 			record.mnt_unique_id = child->mnt_unique_id;
> 			record.mnt_id = child->mnt_id;
> 			record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> 			record.notify_sum +=
> 				atomic_read(&child->mnt.mnt_sb->s_change_counter) +
> 				atomic_read(&child->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> 			record.notify_sum +=
> 				atomic_read(&child->mnt_attr_changes) +
> 				atomic_read(&child->mnt_topology_changes) +
> 				atomic_read(&child->mnt_subtree_notifications);
> #endif
> 			store_mount_fsinfo(ctx, &record);
> 		}
> 	} while (need_seqretry(&mount_lock, seq));
> 	done_seqretry(&mount_lock, seq);
> 
> 	rcu_read_unlock();
> 
> 	/* End the list with a copy of the parameter mount's details so that
> 	 * userspace can quickly check for changes.
> 	 */
> 	record.mnt_unique_id = m->mnt_unique_id;
> 	record.mnt_id = m->mnt_id;
> 	record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> 	record.notify_sum +=
> 		atomic_read(&m->mnt.mnt_sb->s_change_counter) +
> 		atomic_read(&m->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> 	record.notify_sum +=
> 		atomic_read(&m->mnt_attr_changes) +
> 		atomic_read(&m->mnt_topology_changes) +
> 		atomic_read(&m->mnt_subtree_notifications);
> #endif
> 	store_mount_fsinfo(ctx, &record);
> 	return ctx->usage;
> }

Other than that, looks legit!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to abuse RCU to scan the children of a mount?
  2020-03-04 19:28 ` Paul E. McKenney
@ 2020-03-04 19:44   ` Al Viro
  2020-03-08 14:32     ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Al Viro @ 2020-03-04 19:44 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: David Howells, mszeredi, linux-fsdevel

On Wed, Mar 04, 2020 at 11:28:16AM -0800, Paul E. McKenney wrote:

> Huh.  The mount structure isn't suffering from a shortage of list_head
> structures, is it?
> 
> So the following can happen, then?
> 
> o	The __attach_mnt() function adds a struct mount to its parent
> 	list, but in a non-RCU manner.	Unless there is some other
> 	safeguard, the list_add_tail() in this function needs to be
> 	list_add_tail_rcu().
> 
> o	I am assuming that the various non-RCU traversals that I see,
> 	for example, next_mnt(), are protected by lock_mount_hash().
> 	Especially skip_mnt_tree(), which uses mnt_mounts.prev.  (I didn't
> 	find any exceptions, but I don't claim an exhaustive search.)
> 
> o	The umount_tree() function's use of list_del_init() looks like
> 	it could trap an RCU reader in the newly singular list formed
> 	by the removal.  It appears that there are other functions that
> 	use list_del_init() on this list, though I cannot claim any sort
> 	of familiarity with this code.
> 
> 	So, do you need to add a check for child->mnt_child being in this
> 	self-referential state within fsinfo_generic_mount_children()?
> 
> 	Plus list_del_init() doesn't mark its stores, though
> 	some would argue that unmarked stores are OK in this situation.
> 
> o	There might be other operations in need of RCU-ification.
> 
> 	Maybe the list_add_tail() in umount_tree(), but it is not
> 	immediately clear that this is adding a new element instead of
> 	re-inserting an element already exposed to readers.

IMO all of that is a good argument *against* trying to pull any kind of RCU
games here.  Access to these lists is assumed to be serialized on
mount_lock spinlock component held exclusive and unless there is a very
good reason to go for something trickier, let's not.

Hash chains are supposed to be walked under rcu_read_lock(), requiring
to recheck the mount_lock seqcount *AND* with use of legitimize_mnt()
if you are to try and get a reference out of that.  ->mnt_parent chains
also can be walked in the same conditions (subject to the same
requirements).  The policy with everything else is
	* get mount_lock spinlock exclusive or
	* get namespace_sem or
	* just fucking don't do it.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to abuse RCU to scan the children of a mount?
  2020-03-04 17:45 How to abuse RCU to scan the children of a mount? David Howells
  2020-03-04 19:28 ` Paul E. McKenney
@ 2020-03-05 12:47 ` David Howells
  1 sibling, 0 replies; 5+ messages in thread
From: David Howells @ 2020-03-05 12:47 UTC (permalink / raw)
  To: paulmck; +Cc: dhowells, Al Viro, mszeredi, linux-fsdevel

Paul E. McKenney <paulmck@kernel.org> wrote:

> o	The __attach_mnt() function adds a struct mount to its parent
> 	list, but in a non-RCU manner.	Unless there is some other
> 	safeguard, the list_add_tail() in this function needs to be
> 	list_add_tail_rcu().

Yeah.  I think the first time it appears on mnt_mounts so that we can see it
would have to be protected with an smp_store_release() barrier
(list_add_tail_rcu()), but thereafter it should be fine.  Freeing is staved
off by holding the RCU read lock.

The fields that I'm interested in retrieving are all immutable IDs or atomic
event counters.  That said, I do need to follow the mnt_sb pointer to the
superblock to retrieve more event counters, but mnt_sb is immutable also.

It might be possible to mirror the sb event counters into every mountpoint
that points to it, but that would mean that the sb event generator would have
to traverse that list - requiring appropriate locking there.

> 	So, do you need to add a check for child->mnt_child being in this
> 	self-referential state within fsinfo_generic_mount_children()?

I think what I would need is a counter on the specified mount that gets
incremented every time its child list gets rearranged in some way.  The
->mnt_topology_changes counter that I added would suffice for this, though it
also notes if the specified mount itself got moved.

I would then need to note the counter value before the loop and abort the RCU
traversal if it changes during any iteration and switch to a locked traveral.

> 	Plus list_del_init() doesn't mark its stores, though
> 	some would argue that unmarked stores are OK in this situation.

That should be okay.  The next pointer still points *somewhere* valid.

> > 	m = real_mount(path->mnt);
> 
> What keeps the thing referenced by "m" from going away?  Presumably the
> caller has nailed it down somehow, but I have to ask...

That's just a container_of() wrapper.  path->mnt has to be pinned by the
caller.

> > 	rcu_read_lock();
> > 	do {
> > 		ctx->usage = 0;
> > 		read_seqbegin_or_lock(&mount_lock, &seq);
> 
> Aside: If there was an updater holding the lock along with a flood of
> readers, everyone would hereafter acquire the lock.  Or am I missing
> a trick here?  (I would have expected it to try a few times, and only
> then acquire the lock.  Yeah, I know, more state would be required.)

I think it would probably be more work than it's worth to do the extra
repetitions.

So how about the attached?  I've also made sure that all the calls to
notify_mount() (which updated the topology counter) are after the events
they're reporting on and I've made __attach_mnt() use list_add_tail_rcu() -
though there's a barrier in the preceding line that affects mnt_mountpoint.

Anyway, this is mostly theoretical.  I think Al would rather I just used a
lock.

David
---

/*
 * Store a mount record into the fsinfo buffer.
 */
static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p,
			       unsigned int mnt_topology_changes)
{
	struct fsinfo_mount_child record = {};
	unsigned int usage = ctx->usage;

	if (ctx->usage >= INT_MAX)
		return;
	ctx->usage = usage + sizeof(record);

	if (ctx->buffer && ctx->usage <= ctx->buf_size) {
		record.mnt_unique_id	= p->mnt_unique_id;
		record.mnt_id		= p->mnt_id;
		record.notify_sum	= 0;
#ifdef CONFIG_SB_NOTIFICATIONS
		record.notify_sum +=
			atomic_read(&p->mnt.mnt_sb->s_change_counter) +
			atomic_read(&p->mnt.mnt_sb->s_notify_counter);
#endif
#ifdef CONFIG_MOUNT_NOTIFICATIONS
		record.notify_sum +=
			atomic_read(&p->mnt_attr_changes) +
			mnt_topology_changes +
			atomic_read(&p->mnt_subtree_notifications);
#endif
		memcpy(ctx->buffer + usage, &record, sizeof(record));
	}
}

/*
 * Return information about the submounts relative to path.
 */
int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
{
	struct mount *m, *child;
	bool rcu_mode = true, changed;
	int topology_check;

	m = real_mount(path->mnt);

	rcu_read_lock();

retry:
	topology_check = atomic_read(&m->mnt_topology_changes);

	ctx->usage = 0;
	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
		if (atomic_read(&m->mnt_topology_changes) != topology_check)
			break;
		if (child->mnt_parent != m)
			continue;
		fsinfo_store_mount(ctx, child,
				   atomic_read(&child->mnt_topology_changes));
	}

	changed = (atomic_read(&m->mnt_topology_changes) != topology_check);
	if (rcu_mode) {
		rcu_read_unlock();
		if (changed) {
			rcu_mode = false;
			read_seqlock_excl(&mount_lock);
			goto retry;
		}
	} else {
		read_sequnlock_excl(&mount_lock);
		if (changed) {
			read_sequnlock_excl(&mount_lock);
			pr_err("Topology changed whilst lock held!\n");
			return -EAGAIN;
		}
	}

	/* End the list with a copy of the parameter mount's details so that
	 * userspace can quickly check for changes.  Note that we include the
	 * topology counter we got at the start of the function rather than the
	 * current value.
	 */
	fsinfo_store_mount(ctx, m, topology_check);
	return ctx->usage;
}


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to abuse RCU to scan the children of a mount?
  2020-03-04 19:44   ` Al Viro
@ 2020-03-08 14:32     ` Paul E. McKenney
  0 siblings, 0 replies; 5+ messages in thread
From: Paul E. McKenney @ 2020-03-08 14:32 UTC (permalink / raw)
  To: Al Viro; +Cc: David Howells, mszeredi, linux-fsdevel

On Wed, Mar 04, 2020 at 07:44:51PM +0000, Al Viro wrote:
> On Wed, Mar 04, 2020 at 11:28:16AM -0800, Paul E. McKenney wrote:
> 
> > Huh.  The mount structure isn't suffering from a shortage of list_head
> > structures, is it?
> > 
> > So the following can happen, then?
> > 
> > o	The __attach_mnt() function adds a struct mount to its parent
> > 	list, but in a non-RCU manner.	Unless there is some other
> > 	safeguard, the list_add_tail() in this function needs to be
> > 	list_add_tail_rcu().
> > 
> > o	I am assuming that the various non-RCU traversals that I see,
> > 	for example, next_mnt(), are protected by lock_mount_hash().
> > 	Especially skip_mnt_tree(), which uses mnt_mounts.prev.  (I didn't
> > 	find any exceptions, but I don't claim an exhaustive search.)
> > 
> > o	The umount_tree() function's use of list_del_init() looks like
> > 	it could trap an RCU reader in the newly singular list formed
> > 	by the removal.  It appears that there are other functions that
> > 	use list_del_init() on this list, though I cannot claim any sort
> > 	of familiarity with this code.
> > 
> > 	So, do you need to add a check for child->mnt_child being in this
> > 	self-referential state within fsinfo_generic_mount_children()?
> > 
> > 	Plus list_del_init() doesn't mark its stores, though
> > 	some would argue that unmarked stores are OK in this situation.
> > 
> > o	There might be other operations in need of RCU-ification.
> > 
> > 	Maybe the list_add_tail() in umount_tree(), but it is not
> > 	immediately clear that this is adding a new element instead of
> > 	re-inserting an element already exposed to readers.
> 
> IMO all of that is a good argument *against* trying to pull any kind of RCU
> games here.  Access to these lists is assumed to be serialized on
> mount_lock spinlock component held exclusive and unless there is a very
> good reason to go for something trickier, let's not.
> 
> Hash chains are supposed to be walked under rcu_read_lock(), requiring
> to recheck the mount_lock seqcount *AND* with use of legitimize_mnt()
> if you are to try and get a reference out of that.  ->mnt_parent chains
> also can be walked in the same conditions (subject to the same
> requirements).  The policy with everything else is
> 	* get mount_lock spinlock exclusive or
> 	* get namespace_sem or
> 	* just fucking don't do it.

I have no idea what David's performance requirements are.  But yes,
of course, if locking suffices, use locking, be happy, and get on with
your life.

It looks like David would otherwise have been gathering information
while holding some sort of lock (as you say above), then releasing that
lock before actually using the gathered information.  Therefore, should
it turns out that locking is to slow does not suffice for David's use
case, and if the issues I identified above disqualify RCU (which would
be unsurprising, especially given that the above is unlikely to be an
exhaustive list), then another option is to maintain a separate data
structure that keeps the needed information.  When the mounts change,
the next lookup would grab the needed locks and regenerate this separate
data structure.  Otherwise, just do lookup in the separate data structure.

But again, if just locking and traversing the main data structure
suffices, agreed, just do that, be happy, and get on with life.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-03-08 14:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-04 17:45 How to abuse RCU to scan the children of a mount? David Howells
2020-03-04 19:28 ` Paul E. McKenney
2020-03-04 19:44   ` Al Viro
2020-03-08 14:32     ` Paul E. McKenney
2020-03-05 12:47 ` David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.