linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Alexey Kuznetsov <kuznet@virtuozzo.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Christoph Hellwig <hch@lst.de>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Van Hensbergen <ericvh@gmail.com>,
	Jens Axboe <axboe@kernel.dk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Latchesar Ionkov <lucho@ionkov.net>,
	linux-cifs@vger.kernel.org,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-nfs@vger.kernel.org,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	Matthew Wilcox <mawilcox@microsoft.com>,
	Ron Minnich <rminnich@sandia.gov>,
	samba-technical@lists.samba.org, Steve French <sfrench@samba.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	v9fs-developer@lists.sourceforge.net
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Date: Mon, 1 May 2017 15:59:10 -0700	[thread overview]
Message-ID: <CAPcyv4jWWW6FUC_4f-FunPBva4TY2bdn7FQb+9nhVvA3zx6DiQ@mail.gmail.com> (raw)
In-Reply-To: <20170427072659.GA29789@quack2.suse.cz>

On Thu, Apr 27, 2017 at 12:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
>> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
>> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
>> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
>> > > <>
>> > > > Hum, but now thinking more about it I have hard time figuring out why write
>> > > > vs fault cannot actually still race:
>> > > >
>> > > > CPU1 - write(2)                         CPU2 - read fault
>> > > >
>> > > >                                         dax_iomap_pte_fault()
>> > > >                                           ->iomap_begin() - sees hole
>> > > > dax_iomap_rw()
>> > > >   iomap_apply()
>> > > >     ->iomap_begin - allocates blocks
>> > > >     dax_iomap_actor()
>> > > >       invalidate_inode_pages2_range()
>> > > >         - there's nothing to invalidate
>> > > >                                           grab_mapping_entry()
>> > > >                                           - we add zero page in the radix
>> > > >                                             tree & map it to page tables
>> > > >
>> > > > Similarly read vs write fault may end up racing in a wrong way and try to
>> > > > replace already existing exceptional entry with a hole page?
>> > >
>> > > Yep, this race seems real to me, too.  This seems very much like the issues
>> > > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
>> > > intermediate buffer (page cache for direct I/O case, zero page for us), and
>> > > the other is going around it directly to media, and they can get out of sync.
>> > >
>> > > IIRC the direct I/O code looked something like:
>> > >
>> > > 1/ invalidate existing mappings
>> > > 2/ do direct I/O to media
>> > > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
>> > >    any conflicting faults.  This makes sure any new allocations we made are
>> > >    faulted in.
>> >
>> > Yeah, the problem is people generally expect weird behavior when they mix
>> > direct and buffered IO (or let alone mmap) however everyone expects
>> > standard read(2) and write(2) to be completely coherent with mmap(2).
>>
>> Yep, fair enough.
>>
>> > > I guess one option would be to replicate that logic in the DAX I/O path, or we
>> > > could try and enhance our locking so page faults can't race with I/O since
>> > > both can allocate blocks.
>> >
>> > In the abstract way, the problem is that we have radix tree (and page
>> > tables) cache block mapping information and the operation: "read block
>> > mapping information, store it in the radix tree" is not serialized in any
>> > way against other block allocations so the information we store can be out
>> > of date by the time we store it.
>> >
>> > One way to solve this would be to move ->iomap_begin call in the fault
>> > paths under entry lock although that would mean I have to redo how ext4
>> > handles DAX faults because with current code it would create lock inversion
>> > wrt transaction start.
>>
>> I don't think this alone is enough to save us.  The I/O path doesn't currently
>> take any DAX radix tree entry locks, so our race would just become:
>>
>> CPU1 - write(2)                               CPU2 - read fault
>>
>>                                       dax_iomap_pte_fault()
>>                                         grab_mapping_entry() // newly moved
>>                                         ->iomap_begin() - sees hole
>> dax_iomap_rw()
>>   iomap_apply()
>>     ->iomap_begin - allocates blocks
>>     dax_iomap_actor()
>>       invalidate_inode_pages2_range()
>>         - there's nothing to invalidate
>>                                         - we add zero page in the radix
>>                                           tree & map it to page tables
>>
>> In their current form I don't think we want to take DAX radix tree entry locks
>> in the I/O path because that would effectively serialize I/O over a given
>> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
>> would be serialized.
>
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock and this is
> exactly what stops the race. The invalidate_inode_pages2_range()
> effectively makes sure there isn't any page fault in progress for given
> range...
>
> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
>
>> > Another solution would be to grab i_mmap_sem for write when doing write
>> > fault of a page and similarly have it grabbed for writing when doing
>> > write(2). This would scale rather poorly but if we later replaced it with a
>> > range lock (Davidlohr has already posted a nice implementation of it) it
>> > won't be as bad. But I guess option 1) is better...
>>
>> The best idea I had for handling this sounds similar, which would be to
>> convert the radix tree locks to essentially be reader/writer locks.  I/O and
>> faults that don't modify the block mapping could just take read-level locks,
>> and could all run concurrently.  I/O or faults that modify a block mapping
>> would take a write lock, and serialize with other writers and readers.
>
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, read-write range lock Davidlohr has
> implemented is what you describe - well we could also have a radix tree
> with rwsems but I suspect the overhead of maintaining that would be too
> large. It would require larger rewrite than reusing entry locks as I
> suggest above though and it isn't an obvious performance win for realistic
> workloads either so I'd like to see some performance numbers before going
> that way. It likely improves a situation where processes race to fault the
> same page for which we already know the block mapping but I'm not sure if
> that translates to any measurable performance wins for workloads on DAX
> filesystem.

I'm also concerned about inventing new / fancy radix infrastructure
when we're already in the space of needing struct page for any
non-trivial usage of dax. As Kirill's transparent-huge-page page cache
implementation matures I'd be interested in looking at a transition
path away from radix locking towards something that it shared with the
common case page cache locking.

  parent reply	other threads:[~2017-05-01 22:59 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-18 19:38   ` Ross Zwisler
2017-04-19 15:11     ` Andrey Ryabinin
2017-04-19 19:28       ` Ross Zwisler
2017-04-20 14:35         ` Jan Kara
2017-04-20 14:44           ` Jan Kara
2017-04-20 19:14             ` Ross Zwisler
2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
2017-04-25 11:10                   ` Jan Kara
2017-04-25 22:59                     ` Ross Zwisler
2017-04-26  8:52                       ` Jan Kara
2017-04-26 22:52                         ` Ross Zwisler
2017-04-27  7:26                           ` Jan Kara
2017-05-01 22:38                             ` Ross Zwisler
2017-05-04  9:12                               ` Jan Kara
2017-05-01 22:59                             ` Dan Williams [this message]
2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-01 16:54                   ` Ross Zwisler
2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
2017-04-19 15:15     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-18 18:51   ` Nikolay Borisov
2017-04-19 13:22     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-18 15:24 ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-25  8:25     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-25  8:34     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-25  8:37     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-25  8:41     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4jWWW6FUC_4f-FunPBva4TY2bdn7FQb+9nhVvA3zx6DiQ@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=anna.schumaker@netapp.com \
    --cc=aryabinin@virtuozzo.com \
    --cc=axboe@kernel.dk \
    --cc=darrick.wong@oracle.com \
    --cc=ericvh@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=konrad.wilk@oracle.com \
    --cc=kuznet@virtuozzo.com \
    --cc=linux-cifs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=lucho@ionkov.net \
    --cc=mawilcox@microsoft.com \
    --cc=rminnich@sandia.gov \
    --cc=ross.zwisler@linux.intel.com \
    --cc=samba-technical@lists.samba.org \
    --cc=sfrench@samba.org \
    --cc=trond.myklebust@primarydata.com \
    --cc=v9fs-developer@lists.sourceforge.net \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).