From: Al Viro <viro@ZenIV.linux.org.uk>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
Nathan Zimmer <nzimmer@sgi.com>
Subject: Re: linux-next: manual merge of the akpm tree with the vfs tree
Date: Thu, 4 Apr 2013 09:02:48 +0100 [thread overview]
Message-ID: <20130404080248.GN21522@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20130403235634.6bc72c39.akpm@linux-foundation.org>
On Wed, Apr 03, 2013 at 11:56:34PM -0700, Andrew Morton wrote:
> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> > Hi Andrew,
> >
> > Today's linux-next merge of the akpm tree got a conflict in
> > fs/proc/generic.c between several commits from the vfs tree and commit
> > "procfs: improve scaling in proc" from the akpm tree.
> >
> > I just dropped the akpm tree patch (and the following
> > "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
>
> Well perhaps the vfs tree should start paying some attention to the
> rest of the world, particularly after -rc5.
I'm sorry, but... not in this case. There are seriously nasty races around
remove_proc_entry()/proc_reg_release() and the whole area needs a rewrite.
Tentative fix is in vfs.git#experimental; I hadn't pushed it into #for-next
yet, but Nathan's patches are definitely going to buggered by any realistic
solution. For now I'm going for the dumbest variant possible; ->pde_users
will eventually become atomic_t, but it'll be modified by
atomic_inc_unless_negative(), not atomic_inc() and ->proc_fops won't be
zeroed at all. But before we do that, I want to get that sucker tested and
stabilized - the whole thing is sufficiently convoluted for those races to
stay unnoticed for a long time in the first place.
I am aware of those patches; it's just that they'll need to be redone - the
code being optimized is broken and needs to be fixed. Longer term, I want
to lift the whole thing into VFS proper; it *can* be done with minimal
overhead and it'll get us a large part of revoke(2) if done right. Basically,
what I want is a pair of new types - struct revoke and struct revokable.
* struct file gets a pointer to struct revoke; set in ->open() by
file_revokable(file, revokable) and never changed afterwards. No locking
is needed to check it.
* struct revoke contains a pointer back to struct file (never changed)
+ pointer to struct revokable (RCU-protected, zeroed on revoke) + mutex/count
(serializes __fput() vs. revoke() deciding who's going to call ->release() and
who'll be waiting; see pdeo->mutex and pdep->count in #experimental) + cyclic
list anchored in struct revokable (list_del_init() after ->release() had been
done, under revoke->mutex and revokable->lock).
* struct revokable contains an anchor for aforementioned cyclic
list + spinlock + atomic_t in_use (a-la pde->pde_users) + pointer to
completion + one method (->kick(); see below). Freed via RCU.
* start_using(file) is an inlined helper, returning true if
file->f_revoke is NULL; if it's not NULL, we do rcu_read_lock() and look
at file->f_revoke->revokable. If it's NULL - rcu_read_unlock() and return
false (file had been revoked). If it's not, atomic_inc_unless_negative()
of revokable->in_use. Then rcu_read_unlock() and return - true if
->in_use used to be non-negative (file not revoked, revoke will wait) and
false if it was negative (file in process of being revoked).
* stop_using(file) - inlined helper, does nothing if file->f_revoke
is NULL, otherwise decrements revokable->in_use and if it's reached
BIAS (large negative), complete(revokable->completion).
* all normal method calls (everything except ->release()) are
turned into
if (likely(start_using(file)) {
res = method call
stop_using(file);
} else {
res = <appropriate for method>;
}
Overhead for non-revokable files is trivial - we just check one field in
struct file and if it's in the same cacheline as ->f_op, we are not going
to see any real delays.
* new helper - release_revoke(); similar to close_pdeo() in
vfs.git#experimenatal. __fput() checks ->f_revoke and, if that sucker's
non-NULL, does rcu_read_lock(), checks ->revokable, grabs ->lock on it
and does release_revoke(). If ->f_revoke is NULL, call ->release() as
we do now. Note that unlike struct pde_opener, these guys would be
created with ->count equal to 1 - IOW, file->f_revoke would contribute to it.
* do_revoke(revokable) starts with setting ->completion and adding
BIAS to ->in_use; if it non-zero prior to that, call ->kick(revokable) and
wait for completion. ->kick() should essentially wake up those who are
sleeping in ->read() or ->write() and make them return, be it with EAGAIN
or short read/write. Empty for procfs, for something like TTY it should
imitate hangup. That's where driver-specific logics in revoke(2) would
live. Once do_revoke() has finished waiting, we know that nobody is in
method calls (except possibly ->release()) and nobody will manage to enter
them from now on. We grab ->lock, pick the first struct revoke from the
list, do release_revoke() to it and keep doing that to these guys until
none is left.
procfs would have struct revokable embedded into proc_dir_entry, with
freeing of those guys RCUd. It would have file_revokable() done in
proc_reg_open() (that would do allocation of struct revoke, adding it
to the cyclic list, etc.) and set ->f_op to ->proc_fops; no wrappers
needed anymore. All file_operations instances fed to proc_create()
et.al. would lose ->owner - it's already not needed for those, actually.
remove_proc_entry()/remove_proc_subtree() would call do_revoke() on
everything we are removing.
Getting from there to working revoke(2) is probably not too hard; the
interesting part is what should we associate struct revokable with and
what should revoke(2) in progress do to new attempts to open the damn
device. Hell knows - not sure what's the right semantics here.
next prev parent reply other threads:[~2013-04-04 8:02 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-04-04 6:26 linux-next: manual merge of the akpm tree with the vfs tree Stephen Rothwell
2013-04-04 6:56 ` Andrew Morton
2013-04-04 7:02 ` Andrew Morton
2013-04-04 8:10 ` Al Viro
2013-04-04 23:18 ` Stephen Rothwell
2013-04-04 8:02 ` Al Viro [this message]
2013-04-04 15:43 ` Nathan Zimmer
-- strict thread matches above, loose matches on Subject: below --
2020-05-15 11:33 Stephen Rothwell
2019-04-11 6:21 Stephen Rothwell
2018-05-17 6:36 Stephen Rothwell
2018-01-02 6:46 Stephen Rothwell
2016-12-12 5:52 Stephen Rothwell
2016-12-12 8:14 ` Ian Kent
2014-08-08 6:20 Stephen Rothwell
2013-09-10 4:41 Stephen Rothwell
2013-09-05 8:56 Stephen Rothwell
2013-04-30 5:54 Stephen Rothwell
2013-04-29 8:34 Stephen Rothwell
2013-04-29 8:25 Stephen Rothwell
2013-04-04 6:17 Stephen Rothwell
2013-04-04 6:12 Stephen Rothwell
2013-04-04 12:33 ` Jan Kara
2013-04-04 6:04 Stephen Rothwell
2013-02-25 3:40 Stephen Rothwell
2012-09-24 13:40 Stephen Rothwell
2012-09-24 13:12 Stephen Rothwell
2012-09-24 13:06 Stephen Rothwell
2012-07-22 5:44 Stephen Rothwell
2011-07-18 8:55 Stephen Rothwell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130404080248.GN21522@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-next@vger.kernel.org \
--cc=nzimmer@sgi.com \
--cc=sfr@canb.auug.org.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).