linux-kernel.vger.kernel.org archive mirror
From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/6 RFC] Mapping range lock
Date: Mon, 4 Feb 2013 13:38:31 +0100	[thread overview]
Message-ID: <20130204123831.GE7523@quack.suse.cz> (raw)
In-Reply-To: <20130131160757.06d7f1c2.akpm@linux-foundation.org>

On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:48 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > There are several different motivations for implementing mapping range
> > locking:
> > 
> > a) Punch hole is currently racy wrt mmap (page can be faulted in in the
> >    punched range after page cache has been invalidated) leading to nasty
> >    results as fs corruption (we can end up writing to already freed block),
> >    user exposure of uninitialized data, etc. To fix this we need some new
> >    mechanism of serializing hole punching and page faults.
> 
> This one doesn't seem very exciting - perhaps there are local fixes
> which can be made?
  I agree this probably won't be triggered by accident, since uses of punch
hole are limited. But a malicious user is a different matter...

Regarding a local fix - local in what sense? We could fix it inside each
filesystem separately, but the number of filesystems supporting punch hole
is growing, so I don't think it's a good decision for each of them to devise
its own synchronization mechanism. Fixing 'locally' in the sense that we
fix just the mmap vs punch hole race is possible, but we still need some
synchronization of page faults and punch hole - likely in the form of an
rwsem where the page fault path takes the reader side and punch hole the
writer side. So this "minimal" fix requires an additional rwsem in struct
address_space and also incurs some cost in the page fault path. That cost
is likely lower than the cost of range locking, but it is not zero.
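To make the shape of that "minimal" fix concrete, here is a toy userspace
sketch in Python (purely illustrative, not the kernel code; RwSem, mapping_sem,
page_fault and punch_hole are all hypothetical names standing in for the rwsem
that would live in struct address_space and its users):

```python
import threading

class RwSem:
    """Tiny readers-writer semaphore: many readers, or one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def down_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

# Stand-in for the rwsem the fix would add to struct address_space.
mapping_sem = RwSem()

def page_fault(page_cache, index):
    mapping_sem.down_read()          # faults may proceed in parallel
    try:
        page_cache[index] = "uptodate"
    finally:
        mapping_sem.up_read()

def punch_hole(page_cache, start, end):
    mapping_sem.down_write()         # excludes all faults for the duration
    try:
        for i in range(start, end):
            page_cache.pop(i, None)  # no fault can re-instantiate a page here
    finally:
        mapping_sem.up_write()
```

The point of the sketch is only the lock ordering: because punch_hole holds
the writer side across both the invalidation and the block freeing, a fault
cannot slip a page back into the punched range in between.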

> > b) There is an uncomfortable number of mechanisms serializing various paths
> >    manipulating pagecache and data underlying it. We have i_mutex, page lock,
> >    checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
> >    Different pairs of operations are serialized by different mechanisms and
> >    not all the cases are covered. Case (a) above is likely the worst but DIO
> >    vs buffered IO isn't ideal either (we provide only limited consistency).
> >    The range locking should somewhat simplify serialization of pagecache
> >    operations. So i_dio_count can be removed completely, and i_mutex to a
> >    certain extent (we still need something for things like timestamp
> >    updates, and possibly for i_size changes, although those can be dealt
> >    with I think).
> 
> Those would be nice cleanups and simplifications, to make kernel
> developers' lives easier.  And there is value in this, but doing this
> means our users incur real costs.
> 
> I'm rather uncomfortable with changes which make our lives easier at the
> expense of our users.  If we had an infinite amount of labor, we
> wouldn't do this.  In reality we have finite labor, but a small cost
> dispersed amongst millions or billions of users becomes a very large
> cost.
  I agree there's a cost (as with everything), and personally I feel the
cost is larger than I'd like, so we mostly agree on that. OTOH I don't quite
buy the "multiplied by millions or billions of users" argument - the more
machines running the code, the more wealth these machines hopefully
generate ;-). The additional cost starts to matter when it makes the code
not worth using for some purposes. But this is really philosophy :)

> > c) i_mutex doesn't allow any parallelism of operations using it and some
> >    filesystems work around this for specific cases (e.g. DIO reads). Using
> >    range locking allows for concurrent operations (e.g. writes, DIO) on
> >    different parts of the file. Of course, range locking itself isn't
> >    enough to make the parallelism possible. Filesystems still have to
> >    somehow deal with the concurrency when manipulating inode allocation
> >    data. But the range locking at least provides a common VFS mechanism for
> >    the serialization VFS itself needs, and it's up to each filesystem to
> >    serialize more if it needs to.
> 
> That would be useful to end-users, but I'm having trouble predicting
> *how* useful.
  As Zheng said, there are people interested in this for DIO. Currently
each filesystem invents its own tweaks to avoid the serialization, at
least for the easiest cases.
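The behaviour described in (c), where operations on disjoint ranges proceed
in parallel while overlapping ones serialize, can be illustrated with a toy
userspace sketch (hypothetical Python; the actual series implements range
locks in the kernel with a more scalable structure, this flat list is only
for illustration):

```python
import threading

class RangeLock:
    """Toy byte-range lock: a request blocks only while it overlaps an
    already-granted range, so disjoint ranges can be held concurrently."""
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # list of granted (start, end) half-open ranges

    def lock(self, start, end):
        with self._cond:
            # Wait until no granted range overlaps [start, end).
            while any(s < end and start < e for s, e in self._held):
                self._cond.wait()
            self._held.append((start, end))

    def unlock(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()  # wake waiters; their range may be free now
```

With this, two writers (or a writer and a DIO read) touching different parts
of the file take non-overlapping ranges and never block each other, which is
exactly the parallelism i_mutex cannot provide.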

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


Thread overview: 23+ messages
2013-01-31 21:49 [PATCH 0/6 RFC] Mapping range lock Jan Kara
2013-01-31 21:49 ` [PATCH 1/6] lib: Implement range locks Jan Kara
2013-01-31 23:57   ` Andrew Morton
2013-02-04 16:41     ` Jan Kara
2013-02-11  5:42   ` Michel Lespinasse
2013-02-11 10:27     ` Jan Kara
2013-02-11 11:03       ` Michel Lespinasse
2013-02-11 12:58         ` Jan Kara
2013-01-31 21:49 ` [PATCH 2/6] fs: Take mapping lock in generic read paths Jan Kara
2013-01-31 23:59   ` Andrew Morton
2013-02-04 12:47     ` Jan Kara
2013-02-08 14:59       ` Jan Kara
2013-01-31 21:49 ` [PATCH 3/6] fs: Provide function to take mapping lock in buffered write path Jan Kara
2013-01-31 21:49 ` [PATCH 4/6] fs: Don't call dio_cleanup() before submitting all bios Jan Kara
2013-01-31 21:49 ` [PATCH 5/6] fs: Take mapping lock during direct IO Jan Kara
2013-01-31 21:49 ` [PATCH 6/6] ext3: Convert ext3 to use mapping lock Jan Kara
2013-02-01  0:07 ` [PATCH 0/6 RFC] Mapping range lock Andrew Morton
2013-02-04  9:29   ` Zheng Liu
2013-02-04 12:38   ` Jan Kara [this message]
2013-02-05 23:25     ` Dave Chinner
2013-02-06 19:25       ` Jan Kara
2013-02-07  2:43         ` Dave Chinner
2013-02-07 11:06           ` Jan Kara
