From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-btrfs@vger.kernel.org, kilobyte@angband.pl,
	linux-fsdevel@vger.kernel.org, willy@infradead.org, hch@lst.de,
	dsterba@suse.cz, nborisov@suse.com, linux-nvdimm@lists.01.org
Subject: Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
Date: Fri, 31 May 2019 08:59:25 +1000	[thread overview]
Message-ID: <20190530225925.GG16786@dread.disaster.area> (raw)
In-Reply-To: <20190530111605.GC29237@quack2.suse.cz>

On Thu, May 30, 2019 at 01:16:05PM +0200, Jan Kara wrote:
> On Thu 30-05-19 08:14:45, Dave Chinner wrote:
> > On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> > > On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > > >  iomap_apply()
> > > > 
> > > >  	->iomap_begin()
> > > > 		map old data extent that we copy from
> > > > 
> > > > 		allocate new data extent we copy to in data fork,
> > > > 		immediately replacing old data extent
> > > > 
> > > > 		return transaction handle as private data
> > 
> > This holds the inode block map locked exclusively across the IO,
> > so....
> 
> Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write.

I forgot about that; I keep thinking that we use shared locking for
DAX like we do for direct IO. There's another reason for range
locks - allowing concurrent DAX read/write IO - but that's
orthogonal to the issue here.

> But
> xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't
> see anything that would prevent page fault from mapping blocks into page
> tables just after xfs_file_iomap_begin() returns.

Right, holding the IOLOCK doesn't stop concurrent page faults from
mapping the page we are trying to write, and that leaves a window
where stale data can be exposed if we don't initialise the newly
allocated range whilst in the allocation transaction holding the
ILOCK. That's what the XFS_BMAPI_ZERO flag does in the DAX block
allocation path.
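
i.e. the current DAX block allocation path does something roughly
like this (paraphrasing from memory, not quoting the actual code, so
don't hold me to the exact flags or call site):

	/*
	 * Paraphrased, not the literal code: when allocating blocks for
	 * a DAX write, ask the allocator to zero the new extent while we
	 * are still inside the allocation transaction holding the ILOCK,
	 * so a racing read fault can never see stale pmem contents.
	 * Surrounding declarations (tp, ip, imap, etc.) omitted.
	 */
	int	bmapi_flags = 0;

	if (IS_DAX(inode))
		bmapi_flags |= XFS_BMAPI_ZERO;	/* zero new blocks in-tx */

	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
				bmapi_flags, resblks, &imap, &nimaps);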

So the idea of holding the allocation transaction across the data
copy is that the ILOCK is then held until the data blocks are fully
initialised with valid data, which means we can greatly reduce the
scope of the XFS_BMAPI_ZERO flag and possibly get rid of it
altogether.
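
To make that concrete, the shape I have in mind is something like the
sketch below. None of this exists, the function names are invented
and the extent manipulation is hand-waved away - it's only here to
show where the open transaction handle travels:

	static int
	xfs_dax_cow_iomap_begin(struct inode *inode, loff_t pos,
			loff_t length, unsigned flags, struct iomap *iomap)
	{
		struct xfs_inode	*ip = XFS_I(inode);
		struct xfs_trans	*tp;
		int			error;

		error = xfs_trans_alloc(ip->i_mount,
				&M_RES(ip->i_mount)->tr_write, 0, 0, 0, &tp);
		if (error)
			return error;

		xfs_ilock(ip, XFS_ILOCK_EXCL);
		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

		/*
		 * Map the old data extent we copy from, allocate the new
		 * data extent we copy to and swap it into the data fork,
		 * filling in *iomap as we go. All hand-waved here.
		 */

		/* Keep the transaction - and hence the ILOCK - open
		 * across the data copy. */
		iomap->private = tp;
		return 0;
	}

	static int
	xfs_dax_cow_iomap_end(struct inode *inode, loff_t pos,
			loff_t length, ssize_t written, unsigned flags,
			struct iomap *iomap)
	{
		struct xfs_trans	*tp = iomap->private;

		/*
		 * The copy is done and the new blocks now contain valid
		 * data, so commit the allocation and drop the ILOCK.
		 * The short write/error path (cancel the transaction and
		 * free the new extent) is omitted.
		 */
		return xfs_trans_commit(tp);
	}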

> > > This race was actually the strongest
> > > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> > > dax_iomap_actor() needs to be able to fault pages to copy from (and these
> > > pages may be from the same file you're writing to) so you cannot just block
> > > faulting for the file through I_MMAP_LOCK.
> > 
> > Right, it doesn't take the I_MMAP_LOCK, but it would block further
> > in. And, really, I don't care all that much about this corner
> > case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
> > within a file is unbelievably naive - the data still gets copied by
> > the CPU in the write() call. It's far simpler and more efficient to
> > just mmap() both ranges of the file(s) and memcpy() in userspace....
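
(Something like this untested userspace fragment is what I mean -
offsets page aligned, ranges non-overlapping, persistence flushing
ignored:)

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/types.h>

	/* Copy a region of a DAX file to another offset in the same
	 * file purely through mmap() - one CPU copy, no write() call. */
	int dax_file_copy(int fd, off_t src_off, off_t dst_off, size_t len)
	{
		char	*src, *dst;
		int	ret = -1;

		src = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, src_off);
		if (src == MAP_FAILED)
			return -1;
		dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			   fd, dst_off);
		if (dst == MAP_FAILED)
			goto out_src;

		memcpy(dst, src, len);	/* data goes straight to pmem */
		ret = 0;

		munmap(dst, len);
	out_src:
		munmap(src, len);
		return ret;
	}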
> > 
> > FWIW, it's to avoid problems with stupid userspace stuff that nobody
> > really should be doing that I want range locks for the XFS inode
> > locks.  If userspace overlaps the ranges and deadlocks in that case,
> > they get to keep all the broken bits because, IMO, they are
> > doing something monumentally stupid. I'd probably be making it
> > return EDEADLOCK back out to userspace in that case rather than
> > deadlocking but, fundamentally, I think it's broken behaviour that
> > we should be rejecting with an error rather than adding complexity
> > trying to handle it.
> 
> I agree with this. We must just prevent the user from taking the kernel
> down with maliciously created IOs...

Noted. :)

I'm still working to scale the range locks effectively for direct
IO; I've got to work out why sometimes they give identical
performance to rwsems out to 16 threads, and other times they run
20% slower or worse at 8+ threads. I'm way ahead of the original
mutex-protected tree implementation that I have, but I've still got
work to do to get consistently close to rwsem performance for pure
shared locking workloads like direct IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
