* Re: How to abuse RCU to scan the children of a mount?
2020-03-04 17:45 How to abuse RCU to scan the children of a mount? David Howells
@ 2020-03-04 19:28 ` Paul E. McKenney
2020-03-04 19:44 ` Al Viro
2020-03-05 12:47 ` David Howells
1 sibling, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2020-03-04 19:28 UTC (permalink / raw)
To: David Howells; +Cc: Al Viro, mszeredi, linux-fsdevel
On Wed, Mar 04, 2020 at 05:45:16PM +0000, David Howells wrote:
> Hi Paul,
>
> As part of trying to implement a filesystem information querying system call,
> I need to be able to return a list of child mounts of a mount. Children,
> however, may be moved with "mount --move", which means that the list can't be
> accessed with normal RCU practices.
>
> For reference, I'm dealing with struct mount and mnt_mounts and mnt_child and
> struct mount is released via call_rcu().
>
> What does rcu_dereference() guarantee, exactly? It's just that the next hop
> is set up when you follow the pointer, right?
>
> Can I do something like the attached? The mount 'm' is pinned, but I need to
> trawl m->mnt_mounts. mount_lock is a seqlock that ticks when mount topology
> is rearranged. I *think* it covers ->mnt_mounts.
I took a quick but non-exhaustive look, and didn't find any exceptions.
However, a stress test with lockdep enabled and with the addition of
appropriate lockdep assertions might be helpful here. I trust lockdep
quite a bit more than I trust my quick looks. ;-)
> Whilst trawling in
> non-locked mode, I keep an eye on the seq counter and if it changes, the list
> may have been altered and I need to get the real lock and restart.
>
> The objects shouldn't disappear or be destroyed, so I think I'm safe.
Famous last words! ;-)
Huh. The mount structure isn't suffering from a shortage of list_head
structures, is it?
So the following can happen, then?
o The __attach_mnt() function adds a struct mount to its parent
list, but in a non-RCU manner. Unless there is some other
safeguard, the list_add_tail() in this function needs to be
list_add_tail_rcu().
o I am assuming that the various non-RCU traversals that I see,
for example, next_mnt(), are protected by lock_mount_hash().
Especially skip_mnt_tree(), which uses mnt_mounts.prev. (I didn't
find any exceptions, but I don't claim an exhaustive search.)
o The umount_tree() function's use of list_del_init() looks like
it could trap an RCU reader in the newly singular list formed
by the removal. It appears that there are other functions that
use list_del_init() on this list, though I cannot claim any sort
of familiarity with this code.
So, do you need to add a check for child->mnt_child being in this
self-referential state within fsinfo_generic_mount_children()?
Plus list_del_init() doesn't mark its stores, though
some would argue that unmarked stores are OK in this situation.
o There might be other operations in need of RCU-ification.
Maybe the list_add_tail() in umount_tree(), but it is not
immediately clear that this is adding a new element instead of
re-inserting an element already exposed to readers.
> Thanks,
> David
> ---
>
> int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
> {
> struct fsinfo_mount_child record = {};
> struct mount *m, *child;
> int seq = 0;
>
> m = real_mount(path->mnt);
What keeps the thing referenced by "m" from going away? Presumably the
caller has nailed it down somehow, but I have to ask...
> rcu_read_lock();
> do {
> ctx->usage = 0;
> read_seqbegin_or_lock(&mount_lock, &seq);
Aside: If there was an updater holding the lock along with a flood of
readers, everyone would hereafter acquire the lock. Or am I missing
a trick here? (I would have expected it to try a few times, and only
then acquire the lock. Yeah, I know, more state would be required.)
> list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> if (need_seqretry(&mount_lock, seq))
> break;
> if (child->mnt_parent != m)
> continue;
> record.mnt_unique_id = child->mnt_unique_id;
> record.mnt_id = child->mnt_id;
> record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> record.notify_sum +=
> atomic_read(&child->mnt.mnt_sb->s_change_counter) +
> atomic_read(&child->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> record.notify_sum +=
> atomic_read(&child->mnt_attr_changes) +
> atomic_read(&child->mnt_topology_changes) +
> atomic_read(&child->mnt_subtree_notifications);
> #endif
> store_mount_fsinfo(ctx, &record);
> }
> } while (need_seqretry(&mount_lock, seq));
> done_seqretry(&mount_lock, seq);
>
> rcu_read_unlock();
>
> /* End the list with a copy of the parameter mount's details so that
> * userspace can quickly check for changes.
> */
> record.mnt_unique_id = m->mnt_unique_id;
> record.mnt_id = m->mnt_id;
> record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> record.notify_sum +=
> atomic_read(&m->mnt.mnt_sb->s_change_counter) +
> atomic_read(&m->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> record.notify_sum +=
> atomic_read(&m->mnt_attr_changes) +
> atomic_read(&m->mnt_topology_changes) +
> atomic_read(&m->mnt_subtree_notifications);
> #endif
> store_mount_fsinfo(ctx, &record);
> return ctx->usage;
> }
Other than that, looks legit! ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: How to abuse RCU to scan the children of a mount?
2020-03-04 17:45 How to abuse RCU to scan the children of a mount? David Howells
2020-03-04 19:28 ` Paul E. McKenney
@ 2020-03-05 12:47 ` David Howells
1 sibling, 0 replies; 5+ messages in thread
From: David Howells @ 2020-03-05 12:47 UTC (permalink / raw)
To: paulmck; +Cc: dhowells, Al Viro, mszeredi, linux-fsdevel
Paul E. McKenney <paulmck@kernel.org> wrote:
> o The __attach_mnt() function adds a struct mount to its parent
> list, but in a non-RCU manner. Unless there is some other
> safeguard, the list_add_tail() in this function needs to be
> list_add_tail_rcu().
Yeah. I think the first time it appears on mnt_mounts so that we can see it
would have to be protected with an smp_store_release() barrier
(list_add_tail_rcu()), but thereafter it should be fine. Freeing is staved
off by holding the RCU read lock.
The fields that I'm interested in retrieving are all immutable IDs or atomic
event counters. That said, I do need to follow the mnt_sb pointer to the
superblock to retrieve more event counters, but mnt_sb is immutable also.
It might be possible to mirror the sb event counters into every mountpoint
that points to it, but that would mean that the sb event generator would have
to traverse that list - requiring appropriate locking there.
> So, do you need to add a check for child->mnt_child being in this
> self-referential state within fsinfo_generic_mount_children()?
I think what I would need is a counter on the specified mount that gets
incremented every time its child list gets rearranged in some way. The
->mnt_topology_changes counter that I added would suffice for this, though it
also notes if the specified mount itself got moved.
I would then need to note the counter value before the loop and abort the RCU
traversal if it changes during any iteration and switch to a locked traveral.
> Plus list_del_init() doesn't mark its stores, though
> some would argue that unmarked stores are OK in this situation.
That should be okay. The next pointer still points *somewhere* valid.
> > m = real_mount(path->mnt);
>
> What keeps the thing referenced by "m" from going away? Presumably the
> caller has nailed it down somehow, but I have to ask...
That's just a container_of() wrapper. path->mnt has to be pinned by the
caller.
> > rcu_read_lock();
> > do {
> > ctx->usage = 0;
> > read_seqbegin_or_lock(&mount_lock, &seq);
>
> Aside: If there was an updater holding the lock along with a flood of
> readers, everyone would hereafter acquire the lock. Or am I missing
> a trick here? (I would have expected it to try a few times, and only
> then acquire the lock. Yeah, I know, more state would be required.)
I think it would probably be more work than it's worth to do the extra
repetitions.
So how about the attached? I've also made sure that all the calls to
notify_mount() (which updated the topology counter) are after the events
they're reporting on and I've made __attach_mnt() use list_add_tail_rcu() -
though there's a barrier in the preceding line that affects mnt_mountpoint.
Anyway, this is mostly theoretical. I think Al would rather I just used a
lock.
David
---
/*
* Store a mount record into the fsinfo buffer.
*/
static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p,
unsigned int mnt_topology_changes)
{
struct fsinfo_mount_child record = {};
unsigned int usage = ctx->usage;
if (ctx->usage >= INT_MAX)
return;
ctx->usage = usage + sizeof(record);
if (ctx->buffer && ctx->usage <= ctx->buf_size) {
record.mnt_unique_id = p->mnt_unique_id;
record.mnt_id = p->mnt_id;
record.notify_sum = 0;
#ifdef CONFIG_SB_NOTIFICATIONS
record.notify_sum +=
atomic_read(&p->mnt.mnt_sb->s_change_counter) +
atomic_read(&p->mnt.mnt_sb->s_notify_counter);
#endif
#ifdef CONFIG_MOUNT_NOTIFICATIONS
record.notify_sum +=
atomic_read(&p->mnt_attr_changes) +
mnt_topology_changes +
atomic_read(&p->mnt_subtree_notifications);
#endif
memcpy(ctx->buffer + usage, &record, sizeof(record));
}
}
/*
* Return information about the submounts relative to path.
*/
int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
{
struct mount *m, *child;
bool rcu_mode = true, changed;
int topology_check;
m = real_mount(path->mnt);
rcu_read_lock();
retry:
topology_check = atomic_read(&m->mnt_topology_changes);
ctx->usage = 0;
list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
if (atomic_read(&m->mnt_topology_changes) != topology_check)
break;
if (child->mnt_parent != m)
continue;
fsinfo_store_mount(ctx, child,
atomic_read(&child->mnt_topology_changes));
}
changed = (atomic_read(&m->mnt_topology_changes) != topology_check);
if (rcu_mode) {
rcu_read_unlock();
if (changed) {
rcu_mode = false;
read_seqlock_excl(&mount_lock);
goto retry;
}
} else {
read_sequnlock_excl(&mount_lock);
if (changed) {
read_sequnlock_excl(&mount_lock);
pr_err("Topology changed whilst lock held!\n");
return -EAGAIN;
}
}
/* End the list with a copy of the parameter mount's details so that
* userspace can quickly check for changes. Note that we include the
* topology counter we got at the start of the function rather than the
* current value.
*/
fsinfo_store_mount(ctx, m, topology_check);
return ctx->usage;
}
^ permalink raw reply [flat|nested] 5+ messages in thread