Re: How to abuse RCU to scan the children of a mount?

From: David Howells <dhowells@redhat.com>
To: paulmck@kernel.org
Cc: dhowells@redhat.com, Al Viro <viro@zeniv.linux.org.uk>,
	mszeredi@redhat.com, linux-fsdevel@vger.kernel.org
Subject: Re: How to abuse RCU to scan the children of a mount?
Date: Thu, 05 Mar 2020 12:47:57 +0000	[thread overview]
Message-ID: <3512175.1583412477@warthog.procyon.org.uk> (raw)
In-Reply-To: <20200304192816.GI2935@paulmck-ThinkPad-P72>

Paul E. McKenney <paulmck@kernel.org> wrote:

> o	The __attach_mnt() function adds a struct mount to its parent
> 	list, but in a non-RCU manner.	Unless there is some other
> 	safeguard, the list_add_tail() in this function needs to be
> 	list_add_tail_rcu().

Yeah.  I think the first time it appears on mnt_mounts so that we can see it
would have to be protected with an smp_store_release() barrier
(list_add_tail_rcu()), but thereafter it should be fine.  Freeing is staved
off by holding the RCU read lock.

The fields that I'm interested in retrieving are all immutable IDs or atomic
event counters.  That said, I do need to follow the mnt_sb pointer to the
superblock to retrieve more event counters, but mnt_sb is immutable also.

It might be possible to mirror the sb event counters into every mountpoint
that points to it, but that would mean that the sb event generator would have
to traverse that list - requiring appropriate locking there.

> 	So, do you need to add a check for child->mnt_child being in this
> 	self-referential state within fsinfo_generic_mount_children()?

I think what I would need is a counter on the specified mount that gets
incremented every time its child list gets rearranged in some way.  The
->mnt_topology_changes counter that I added would suffice for this, though it
also notes if the specified mount itself got moved.

I would then need to note the counter value before the loop and abort the RCU
traversal if it changes during any iteration and switch to a locked traveral.

> 	Plus list_del_init() doesn't mark its stores, though
> 	some would argue that unmarked stores are OK in this situation.

That should be okay.  The next pointer still points *somewhere* valid.

> > 	m = real_mount(path->mnt);
> 
> What keeps the thing referenced by "m" from going away?  Presumably the
> caller has nailed it down somehow, but I have to ask...

That's just a container_of() wrapper.  path->mnt has to be pinned by the
caller.

> > 	rcu_read_lock();
> > 	do {
> > 		ctx->usage = 0;
> > 		read_seqbegin_or_lock(&mount_lock, &seq);
> 
> Aside: If there was an updater holding the lock along with a flood of
> readers, everyone would hereafter acquire the lock.  Or am I missing
> a trick here?  (I would have expected it to try a few times, and only
> then acquire the lock.  Yeah, I know, more state would be required.)

I think it would probably be more work than it's worth to do the extra
repetitions.

So how about the attached?  I've also made sure that all the calls to
notify_mount() (which updated the topology counter) are after the events
they're reporting on and I've made __attach_mnt() use list_add_tail_rcu() -
though there's a barrier in the preceding line that affects mnt_mountpoint.

Anyway, this is mostly theoretical.  I think Al would rather I just used a
lock.

David
---

/*
 * Store a mount record into the fsinfo buffer.
 */
static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p,
			       unsigned int mnt_topology_changes)
{
	struct fsinfo_mount_child record = {};
	unsigned int usage = ctx->usage;

	if (ctx->usage >= INT_MAX)
		return;
	ctx->usage = usage + sizeof(record);

	if (ctx->buffer && ctx->usage <= ctx->buf_size) {
		record.mnt_unique_id	= p->mnt_unique_id;
		record.mnt_id		= p->mnt_id;
		record.notify_sum	= 0;
#ifdef CONFIG_SB_NOTIFICATIONS
		record.notify_sum +=
			atomic_read(&p->mnt.mnt_sb->s_change_counter) +
			atomic_read(&p->mnt.mnt_sb->s_notify_counter);
#endif
#ifdef CONFIG_MOUNT_NOTIFICATIONS
		record.notify_sum +=
			atomic_read(&p->mnt_attr_changes) +
			mnt_topology_changes +
			atomic_read(&p->mnt_subtree_notifications);
#endif
		memcpy(ctx->buffer + usage, &record, sizeof(record));
	}
}

/*
 * Return information about the submounts relative to path.
 */
int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
{
	struct mount *m, *child;
	bool rcu_mode = true, changed;
	int topology_check;

	m = real_mount(path->mnt);

	rcu_read_lock();

retry:
	topology_check = atomic_read(&m->mnt_topology_changes);

	ctx->usage = 0;
	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
		if (atomic_read(&m->mnt_topology_changes) != topology_check)
			break;
		if (child->mnt_parent != m)
			continue;
		fsinfo_store_mount(ctx, child,
				   atomic_read(&child->mnt_topology_changes));
	}

	changed = (atomic_read(&m->mnt_topology_changes) != topology_check);
	if (rcu_mode) {
		rcu_read_unlock();
		if (changed) {
			rcu_mode = false;
			read_seqlock_excl(&mount_lock);
			goto retry;
		}
	} else {
		read_sequnlock_excl(&mount_lock);
		if (changed) {
			read_sequnlock_excl(&mount_lock);
			pr_err("Topology changed whilst lock held!\n");
			return -EAGAIN;
		}
	}

	/* End the list with a copy of the parameter mount's details so that
	 * userspace can quickly check for changes.  Note that we include the
	 * topology counter we got at the start of the function rather than the
	 * current value.
	 */
	fsinfo_store_mount(ctx, m, topology_check);
	return ctx->usage;
}