From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, LKML <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/6 RFC] Mapping range lock
Date: Mon, 4 Feb 2013 13:38:31 +0100
Message-ID: <20130204123831.GE7523@quack.suse.cz>
In-Reply-To: <20130131160757.06d7f1c2.akpm@linux-foundation.org>
On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:48 +0100
> Jan Kara <jack@suse.cz> wrote:
>
> > There are several different motivations for implementing mapping range
> > locking:
> >
> > a) Punch hole is currently racy wrt mmap (page can be faulted in in the
> > punched range after page cache has been invalidated) leading to nasty
> > results such as fs corruption (we can end up writing to an already freed block),
> > user exposure of uninitialized data, etc. To fix this we need some new
> > mechanism of serializing hole punching and page faults.
>
> This one doesn't seem very exciting - perhaps there are local fixes
> which can be made?
I agree this probably won't be triggered by accident since punch hole
uses are limited. But a malicious user is a different thing...
Regarding a local fix: local in what sense? We could fix it inside each
filesystem separately, but the number of filesystems supporting punch hole
is growing, so I don't think it's a good idea for each of them to devise
its own synchronization mechanism. Fixing it 'locally' in the sense that we
fix just the mmap vs punch hole race is possible, but we still need some
synchronization between page faults and punch hole, likely in the form of
an rwsem where page faults take the reader side and punch hole takes the
writer side. So this "minimal" fix requires an additional rwsem in struct
address_space and also adds some cost to the page fault path. That cost is
likely lower than the cost of range locking, but it is not zero.
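To illustrate the "minimal" fix for case (a), here is a rough userspace model, assuming a POSIX pthread_rwlock_t stands in for the kernel rwsem. struct mapping, page_fault(), punch_hole() and mapping_consistent() are hypothetical names for this sketch, not kernel APIs. Punch hole first invalidates the page cache and then frees the blocks; holding the writer side across both steps means no fault (reader side) can repopulate the range in between, so a cached page never ends up over a freed block.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NPAGES 64

/* Hypothetical, simplified model of one file's mapping (not kernel code). */
struct mapping {
	pthread_rwlock_t fault_lock;	/* the extra rwsem: faults read, punch writes */
	atomic_bool cached[NPAGES];	/* page is present in the page cache */
	atomic_bool allocated[NPAGES];	/* backing block is allocated */
};

static void mapping_init(struct mapping *m)
{
	pthread_rwlock_init(&m->fault_lock, NULL);
	for (int i = 0; i < NPAGES; i++) {
		atomic_init(&m->cached[i], false);
		atomic_init(&m->allocated[i], true);
	}
}

/* Page fault (reader side): faults may run concurrently with each other. */
static void page_fault(struct mapping *m, int idx)
{
	pthread_rwlock_rdlock(&m->fault_lock);
	/* A write fault into a hole allocates a fresh block first. */
	atomic_store(&m->allocated[idx], true);
	atomic_store(&m->cached[idx], true);
	pthread_rwlock_unlock(&m->fault_lock);
}

/* Punch hole (writer side): invalidate, then free, with no fault in between. */
static void punch_hole(struct mapping *m, int start, int end)
{
	assert(0 <= start && start <= end && end <= NPAGES);
	pthread_rwlock_wrlock(&m->fault_lock);
	for (int i = start; i < end; i++)
		atomic_store(&m->cached[i], false);	/* drop page cache */
	for (int i = start; i < end; i++)
		atomic_store(&m->allocated[i], false);	/* free the blocks */
	pthread_rwlock_unlock(&m->fault_lock);
}

/* The property the race breaks: no cached page may sit over a freed block. */
static bool mapping_consistent(struct mapping *m)
{
	for (int i = 0; i < NPAGES; i++)
		if (atomic_load(&m->cached[i]) && !atomic_load(&m->allocated[i]))
			return false;
	return true;
}

static void *fault_thread(void *arg)
{
	struct mapping *m = arg;
	for (int n = 0; n < 10000; n++)
		page_fault(m, n % NPAGES);
	return NULL;
}

static void *punch_thread(void *arg)
{
	struct mapping *m = arg;
	for (int n = 0; n < 1000; n++)
		punch_hole(m, 0, NPAGES / 2);
	return NULL;
}

/* Run faults and hole punches concurrently, then check consistency. */
static bool run_demo(void)
{
	struct mapping m;
	pthread_t f1, f2, p;

	mapping_init(&m);
	pthread_create(&f1, NULL, fault_thread, &m);
	pthread_create(&f2, NULL, fault_thread, &m);
	pthread_create(&p, NULL, punch_thread, &m);
	pthread_join(f1, NULL);
	pthread_join(f2, NULL);
	pthread_join(p, NULL);
	bool ok = mapping_consistent(&m);
	pthread_rwlock_destroy(&m.fault_lock);
	return ok;
}
```

Without the rwlock, a fault landing between the two loops in punch_hole() could mark a page cached just before its block is freed, which is the "writing to an already freed block" case. The price is one extra lock round trip on every fault, which is the page fault cost mentioned above.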
> > b) There is an uncomfortable number of mechanisms serializing various paths
> > manipulating pagecache and data underlying it. We have i_mutex, page lock,
> > checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
> > Different pairs of operations are serialized by different mechanisms and
> > not all the cases are covered. Case (a) above is likely the worst but DIO
> > vs buffered IO isn't ideal either (we provide only limited consistency).
> > The range locking should somewhat simplify serialization of pagecache
> > operations. So i_dio_count can be removed completely, and i_mutex to a certain
> > extent (we still need something for things like timestamp updates,
> > possibly for i_size changes although those can be dealt with I think).
>
> Those would be nice cleanups and simplifications, to make kernel
> developers' lives easier. And there is value in this, but doing this
> means our users incur real costs.
>
> I'm rather uncomfortable with changes which make our lives easier at the
> expense of our users. If we had an infinite amount of labor, we
> wouldn't do this. In reality we have finite labor, but a small cost
> dispersed amongst millions or billions of users becomes a very large
> cost.
I agree there's a cost (as with everything) and personally I feel the
cost is larger than I'd like so we mostly agree on that. OTOH I don't quite
buy the argument "multiplied by millions or billions of users" - the more
machines running the code, the more wealth these machines hopefully
generate ;-). So the additional cost starts to matter when it makes the
code not worth using for some purposes. But this is really philosophy :)
> > c) i_mutex doesn't allow any parallelism of operations using it and some
> > filesystems work around this for specific cases (e.g. DIO reads). Using
> > range locking allows for concurrent operations (e.g. writes, DIO) on
> > different parts of the file. Of course, range locking itself isn't
> > enough to make the parallelism possible. Filesystems still have to
> > somehow deal with the concurrency when manipulating inode allocation
> > data. But the range locking at least provides a common VFS mechanism for
> > the serialization VFS itself needs and it's up to each filesystem to
> > serialize more if it needs to.
>
> That would be useful to end-users, but I'm having trouble predicting
> *how* useful.
As Zheng said, there are people interested in this for DIO. Currently
filesystems each invent their own tweaks to avoid the serialization at
least for the easiest cases.
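For point (c), a naive userspace sketch of an exclusive range lock may help show why operations on disjoint parts of a file need not serialize. This is an assumption-laden illustration, not the implementation from patch 1/6: struct range_lock_tree, range_lock(), range_trylock() and range_unlock() are hypothetical names, held ranges live in a linear list guarded by one mutex, there is no shared (reader) mode, and waiters get no fairness guarantee. A production version would likely want a faster overlap lookup such as an interval tree.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range lock: held ranges are [start, end] in page indices. */
struct range {
	unsigned long start, end;	/* inclusive bounds */
	struct range *next;
};

struct range_lock_tree {
	pthread_mutex_t lock;		/* protects the held list */
	pthread_cond_t wait;		/* signalled whenever a range is released */
	struct range *held;		/* singly linked list of held ranges */
};

static void range_tree_init(struct range_lock_tree *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->wait, NULL);
	t->held = NULL;
}

/* Caller holds t->lock. True if r intersects any currently held range. */
static bool overlaps_held(struct range_lock_tree *t, const struct range *r)
{
	for (struct range *h = t->held; h; h = h->next)
		if (h->start <= r->end && r->start <= h->end)
			return true;
	return false;
}

/* Block until no held range overlaps [start, end], then record it as held. */
static void range_lock(struct range_lock_tree *t, struct range *r,
		       unsigned long start, unsigned long end)
{
	assert(start <= end);
	r->start = start;
	r->end = end;
	pthread_mutex_lock(&t->lock);
	while (overlaps_held(t, r))
		pthread_cond_wait(&t->wait, &t->lock);
	r->next = t->held;
	t->held = r;
	pthread_mutex_unlock(&t->lock);
}

/* Non-blocking variant: take the range only if nothing overlaps it. */
static bool range_trylock(struct range_lock_tree *t, struct range *r,
			  unsigned long start, unsigned long end)
{
	bool ok;

	assert(start <= end);
	r->start = start;
	r->end = end;
	pthread_mutex_lock(&t->lock);
	ok = !overlaps_held(t, r);
	if (ok) {
		r->next = t->held;
		t->held = r;
	}
	pthread_mutex_unlock(&t->lock);
	return ok;
}

/* Drop a held range and wake waiters so they can recheck for overlap. */
static void range_unlock(struct range_lock_tree *t, struct range *r)
{
	pthread_mutex_lock(&t->lock);
	for (struct range **pp = &t->held; *pp; pp = &(*pp)->next) {
		if (*pp == r) {
			*pp = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&t->wait);
	pthread_mutex_unlock(&t->lock);
}
```

With this model, two writers (or DIO requests) covering pages 0-9 and 10-19 both get their ranges at once, while a request for pages 5-12 waits until both are released. As noted in (c), this only covers the VFS-level serialization; a filesystem would still need its own locking for shared allocation metadata.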
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
Thread overview: 23+ messages
2013-01-31 21:49 [PATCH 0/6 RFC] Mapping range lock Jan Kara
2013-01-31 21:49 ` [PATCH 1/6] lib: Implement range locks Jan Kara
2013-01-31 23:57 ` Andrew Morton
2013-02-04 16:41 ` Jan Kara
2013-02-11 5:42 ` Michel Lespinasse
2013-02-11 10:27 ` Jan Kara
2013-02-11 11:03 ` Michel Lespinasse
2013-02-11 12:58 ` Jan Kara
2013-01-31 21:49 ` [PATCH 2/6] fs: Take mapping lock in generic read paths Jan Kara
2013-01-31 23:59 ` Andrew Morton
2013-02-04 12:47 ` Jan Kara
2013-02-08 14:59 ` Jan Kara
2013-01-31 21:49 ` [PATCH 3/6] fs: Provide function to take mapping lock in buffered write path Jan Kara
2013-01-31 21:49 ` [PATCH 4/6] fs: Don't call dio_cleanup() before submitting all bios Jan Kara
2013-01-31 21:49 ` [PATCH 5/6] fs: Take mapping lock during direct IO Jan Kara
2013-01-31 21:49 ` [PATCH 6/6] ext3: Convert ext3 to use mapping lock Jan Kara
2013-02-01 0:07 ` [PATCH 0/6 RFC] Mapping range lock Andrew Morton
2013-02-04 9:29 ` Zheng Liu
2013-02-04 12:38 ` Jan Kara [this message]
2013-02-05 23:25 ` Dave Chinner
2013-02-06 19:25 ` Jan Kara
2013-02-07 2:43 ` Dave Chinner
2013-02-07 11:06 ` Jan Kara