linux-next.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nathan Zimmer <nzimmer@sgi.com>
Subject: Re: linux-next: manual merge of the akpm tree with the vfs tree
Date: Thu, 4 Apr 2013 09:02:48 +0100	[thread overview]
Message-ID: <20130404080248.GN21522@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20130403235634.6bc72c39.akpm@linux-foundation.org>

On Wed, Apr 03, 2013 at 11:56:34PM -0700, Andrew Morton wrote:
> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > Hi Andrew,
> > 
> > Today's linux-next merge of the akpm tree got a conflict in
> > fs/proc/generic.c between several commits from the vfs tree and commit
> > "procfs: improve scaling in proc" from the akpm tree.
> > 
> > I just dropped the akpm tree patch (and the following
> > "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
> 
> Well perhaps the vfs tree should start paying some attention to the
> rest of the world, particularly after -rc5.

I'm sorry, but... not in this case.  There are seriously nasty races around
remove_proc_entry()/proc_reg_release() and the whole area needs a rewrite.
Tentative fix is in vfs.git#experimental; I hadn't pushed it into #for-next
yet, but Nathan's patches are definitely going to buggered by any realistic
solution.  For now I'm going for the dumbest variant possible; ->pde_users
will eventually become atomic_t, but it'll be modified by
atomic_inc_unless_negative(), not atomic_inc() and ->proc_fops won't be
zeroed at all.  But before we do that, I want to get that sucker tested and
stabilized - the whole thing is sufficiently convoluted for those races to
stay unnoticed for a long time in the first place.

I am aware of those patches; it's just that they'll need to be redone - the
code being optimized is broken and needs to be fixed.  Longer term, I want
to lift the whole thing into VFS proper; it *can* be done with minimal
overhead and it'll get us a large part of revoke(2) if done right.  Basically,
what I want is a pair of new types - struct revoke and struct revokable.
	* struct file gets a pointer to struct revoke; set in ->open() by
file_revokable(file, revokable) and never changed afterwards.  No locking
is needed to check it.
	* struct revoke contains a pointer back to struct file (never changed)
+ pointer to struct revokable (RCU-protected, zeroed on revoke) + mutex/count
(serializes __fput() vs. revoke() deciding who's going to call ->release() and
who'll be waiting; see pdeo->mutex and pdep->count in #experimental) + cyclic
list anchored in struct revokable (list_del_init() after ->release() had been
done, under revoke->mutex and revokable->lock).
	* struct revokable contains an anchor for aforementioned cyclic
list + spinlock + atomic_t in_use (a-la pde->pde_users) + pointer to
completion + one method (->kick(); see below).  Freed via RCU.
	* start_using(file) is an inlined helper, returning true if
file->f_revoke is NULL; if it's not NULL, we do rcu_read_lock() and look
at file->f_revoke->revokable.  If it's NULL - rcu_read_unlock() and return
false (file had been revoked).  If it's not, atomic_inc_unless_negative()
of revokable->in_use.  Then rcu_read_unlock() and return - true if
->in_use used to be non-negative (file not revoked, revoke will wait) and
false if it was negative (file in process of being revoked).
	* stop_using(file) - inlined helper, does nothing if file->f_revoke
is NULL, otherwise decrements revokable->in_use and if it's reached
BIAS (large negative), complete(revokable->completion).
	* all normal method calls (everything except ->release()) are
turned into
	if (likely(start_using(file)) {
		res = method call
		stop_using(file);
	} else {
		res = <appropriate for method>;
	}
Overhead for non-revokable files is trivial - we just check one field in
struct file and if it's in the same cacheline as ->f_op, we are not going
to see any real delays.
	* new helper - release_revoke(); similar to close_pdeo() in
vfs.git#experimenatal.  __fput() checks ->f_revoke and, if that sucker's
non-NULL, does rcu_read_lock(), checks ->revokable, grabs ->lock on it
and does release_revoke().  If ->f_revoke is NULL, call ->release() as
we do now.  Note that unlike struct pde_opener, these guys would be
created with ->count equal to 1 - IOW, file->f_revoke would contribute to it.
	* do_revoke(revokable) starts with setting ->completion and adding
BIAS to ->in_use; if it non-zero prior to that, call ->kick(revokable) and
wait for completion.  ->kick() should essentially wake up those who are
sleeping in ->read() or ->write() and make them return, be it with EAGAIN
or short read/write.  Empty for procfs, for something like TTY it should
imitate hangup.  That's where driver-specific logics in revoke(2) would
live.  Once do_revoke() has finished waiting, we know that nobody is in
method calls (except possibly ->release()) and nobody will manage to enter
them from now on.  We grab ->lock, pick the first struct revoke from the
list, do release_revoke() to it and keep doing that to these guys until
none is left.

procfs would have struct revokable embedded into proc_dir_entry, with
freeing of those guys RCUd.  It would have file_revokable() done in
proc_reg_open() (that would do allocation of struct revoke, adding it
to the cyclic list, etc.) and set ->f_op to ->proc_fops; no wrappers
needed anymore.  All file_operations instances fed to proc_create()
et.al. would lose ->owner - it's already not needed for those, actually.
remove_proc_entry()/remove_proc_subtree() would call do_revoke() on
everything we are removing.

Getting from there to working revoke(2) is probably not too hard; the
interesting part is what should we associate struct revokable with and
what should revoke(2) in progress do to new attempts to open the damn
device.  Hell knows - not sure what's the right semantics here.

  parent reply	other threads:[~2013-04-04  8:02 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-04  6:26 Stephen Rothwell
2013-04-04  6:56 ` Andrew Morton
2013-04-04  7:02   ` Andrew Morton
2013-04-04  8:10     ` Al Viro
2013-04-04 23:18       ` Stephen Rothwell
2013-04-04  8:02   ` Al Viro [this message]
2013-04-04 15:43     ` Nathan Zimmer
  -- strict thread matches above, loose matches on Subject: below --
2020-05-15 11:33 Stephen Rothwell
2019-04-11  6:21 Stephen Rothwell
2018-05-17  6:36 Stephen Rothwell
2018-01-02  6:46 Stephen Rothwell
2016-12-12  5:52 Stephen Rothwell
2016-12-12  8:14 ` Ian Kent
2014-08-08  6:20 Stephen Rothwell
2013-09-10  4:41 Stephen Rothwell
2013-09-05  8:56 Stephen Rothwell
2013-04-30  5:54 Stephen Rothwell
2013-04-29  8:34 Stephen Rothwell
2013-04-29  8:25 Stephen Rothwell
2013-04-04  6:17 Stephen Rothwell
2013-04-04  6:12 Stephen Rothwell
2013-04-04 12:33 ` Jan Kara
2013-04-04  6:04 Stephen Rothwell
2013-02-25  3:40 Stephen Rothwell
2012-09-24 13:40 Stephen Rothwell
2012-09-24 13:12 Stephen Rothwell
2012-09-24 13:06 Stephen Rothwell
2012-07-22  5:44 Stephen Rothwell
2011-07-18  8:55 Stephen Rothwell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130404080248.GN21522@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-next@vger.kernel.org \
    --cc=nzimmer@sgi.com \
    --cc=sfr@canb.auug.org.au \
    --subject='Re: linux-next: manual merge of the akpm tree with the vfs tree' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).