linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Lock ordering in iomap code
@ 2016-10-07 11:13 Jan Kara
  2016-10-17  9:31 ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2016-10-07 11:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

Hi Christoph,

when converting ext4 DAX path to iomap code I've come across one locking
issue: Buffered writes for iomap code work through iomap_write_actor
function. What that ends up doing is that it calls fs to map extent of
blocks (->iomap_begin in iomap_apply()) and then proceeds to lock pages in
iomap_write_actor() and copy data to them. Then ->iomap_end is called to
finish work for that extent.

OTOH page faults for iomap code work through iomap_page_mkwrite() which
first grabs page lock and then calls iomap_apply() which ends up calling
->iomap_begin(). So this effectively nests all locks acquired in
->iomap_begin() under page lock. This makes it impossible for any lock to
be held between ->iomap_begin() and ->iomap_end() as you immediately get
lock inversion between this lock and page lock. Also any lock acquired in
->iomap_begin() gets nested under page lock and that is already no-go for
ext4 as we need to start a transaction there and that needs to happen
before grabbing page lock. I believe this is a bug in iomap_page_mkwrite()
but wanted to check with you... The slight trouble is that when we change
iomap_page_mkwrite() to work similarly to buffered write path, we have to
watch out for races with truncate - both xfs and ext4 have these "mmap
locks" which protect against that but the iomap fault code will be relying
on fs properly serializing it against truncate which was not the case with
the old fault path. So we'll probably need some comments about that in the
code.

Somewhat related issue is that the old buffered write handled by
generic_perform_write() made block mapping for a page happen under a page
lock while the new iomap code does that before grabbing page lock. This
opens a new set of races possible between write(2) and page fault (page
fault can now see a state of page and block allocation information after
->iomap_begin() was called but before the data got copied into the page). I
have yet to think through all the possible implications but this will
definitely need a close checking...

And DAX iomap code has similar issues, only instead of the page lock the
radix tree entry lock is in the game...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lock ordering in iomap code
  2016-10-07 11:13 Lock ordering in iomap code Jan Kara
@ 2016-10-17  9:31 ` Jan Kara
  2016-10-17 10:26   ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2016-10-17  9:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

On Fri 07-10-16 13:13:14, Jan Kara wrote:
> Hi Christoph,
> 
> when converting ext4 DAX path to iomap code I've come across one locking
> issue: Buffered writes for iomap code work through iomap_write_actor
> function. What that ends up doing is that it calls fs to map extent of
> blocks (->iomap_begin in iomap_apply()) and then proceeds to lock pages in
> iomap_write_actor() and copy data to them. Then ->iomap_end is called to
> finish work for that extent.
> 
> OTOH page faults for iomap code work through iomap_page_mkwrite() which
> first grabs page lock and then calls iomap_apply() which ends up calling
> ->iomap_begin(). So this effectively nests all locks acquired in
> ->iomap_begin() under page lock. This makes it impossible for any lock to
> be held between ->iomap_begin() and ->iomap_end() as you immediately get
> lock inversion between this lock and page lock. Also any lock acquired in
> ->iomap_begin() gets nested under page lock and that is already no-go for
> ext4 as we need to start a transaction there and that needs to happen
> before grabbing page lock. I believe this is a bug in iomap_page_mkwrite()
> but wanted to check with you... The slight trouble is that when we change
> iomap_page_mkwrite() to work similarly to buffered write path, we have to
> watch out for races with truncate - both xfs and ext4 have these "mmap
> locks" which protect against that but the iomap fault code will be relying
> on fs properly serializing it against truncate which was not the case with
> the old fault path. So we'll probably need some comments about that in the
> code.
> 
> Somewhat related issue is that the old buffered write handled by
> generic_perform_write() made block mapping for a page happen under a page
> lock while the new iomap code does that before grabbing page lock. This
> opens a new set of races possible between write(2) and page fault (page
> fault can now see a state of page and block allocation information after
> ->iomap_begin() was called but before the data got copied into the page). I
> have yet to think through all the possible implications but this will
> definitely need a close checking...
> 
> And DAX iomap code has similar issues, only instead of the page lock the
> radix tree entry lock is in the game...

Ping? I have ext4 DAX read & write path working with the iomap code but to
convert the fault path, I need this resolved. Are you OK with moving
iomap_begin() / iomap_end() calls outside of page lock / entry lock in the
fault path?

I was also thinking about the implications of iomap_begin() (and thus block
allocation for buffered writes in nodelalloc case) being no longer protected
by page lock and at least for ext2 / ext3 compatibility modes this will
lead to uninitialized data exposure when page fault races in the right way
with buffered write. So current locking scheme in iomap code is not easily
usable for ext4 for buffered writes.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lock ordering in iomap code
  2016-10-17  9:31 ` Jan Kara
@ 2016-10-17 10:26   ` Christoph Hellwig
  2016-10-20 11:55     ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-17 10:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs

Hi Jan,

sorry for the delay, I've been overloaded with various projects
recently..

> Ping? I have ext4 DAX read & write path working with the iomap code but to
> convert the fault path, I need this resolved. Are you OK with moving
> iomap_begin() / iomap_end() calls outside of page lock / entry lock in the
> fault path?

Yes, that sounds fine.
> 
> I was also thinking about the implications of iomap_begin() (and thus block
> allocation for buffered writes in nodelalloc case) being no longer protected
> by page lock and at least for ext2 / ext3 compatibility modes this will
> lead to uninitialized data exposure when page fault races in the right way
> with buffered write. So current locking scheme in iomap code is not easily
> usable for ext4 for buffered writes.

Right, the iomap I/O code assumes that you either use delalloc, or that
your have unwritten extents that your convert to written once I/O has
finished.  XFS uses the latter feature for direct I/O, or if an extent
size hints is set.

I can't really think of a good way to handle file systems without
either feature - the problem is that we'd have to introduce a file-wide
(or range based) lock to protect allocated but not written ranges.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lock ordering in iomap code
  2016-10-17 10:26   ` Christoph Hellwig
@ 2016-10-20 11:55     ` Jan Kara
  2016-10-20 20:22       ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2016-10-20 11:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, linux-xfs

Hi Christoph,

On Mon 17-10-16 12:26:57, Christoph Hellwig wrote:
> > Ping? I have ext4 DAX read & write path working with the iomap code but to
> > convert the fault path, I need this resolved. Are you OK with moving
> > iomap_begin() / iomap_end() calls outside of page lock / entry lock in the
> > fault path?
> 
> Yes, that sounds fine.

I've been looking into this some more and realized it's not as easy as I've
originally though. ->page_mkwrite callback is expected to return with the
page locked so locking it inside the iomap actor is really awkward (think
of situation when blocksize < pagesize). Since nobody currently has issues
with ->iomap_begin being sometimes called with page lock and sometimes
without, I don't think changing that would be worth the hassle. Just one
more question: Doesn't XFS have some lock ordering issues when
xfs_file_iomap_begin() gets called with page lock held from
iomap_page_mkwrite() for a file with extent size hints and thus we end up
in xfs_iomap_write_direct() with page lock held? We do a lot of stuff there
including transaction setup and such...

For DAX I have switched the locking order as there we fully handle the
fault and unlock radix tree entry inside the ->fault / ->page_mkwrite
handler so the locking is sane and also ext4 needs this to be able to use
the iomap infrastructure. The locking difference between DAX and !DAX path
is unfortunate but looks as the least evil to me.

> > I was also thinking about the implications of iomap_begin() (and thus block
> > allocation for buffered writes in nodelalloc case) being no longer protected
> > by page lock and at least for ext2 / ext3 compatibility modes this will
> > lead to uninitialized data exposure when page fault races in the right way
> > with buffered write. So current locking scheme in iomap code is not easily
> > usable for ext4 for buffered writes.
> 
> Right, the iomap I/O code assumes that you either use delalloc, or that
> your have unwritten extents that your convert to written once I/O has
> finished.  XFS uses the latter feature for direct I/O, or if an extent
> size hints is set.
> 
> I can't really think of a good way to handle file systems without
> either feature - the problem is that we'd have to introduce a file-wide
> (or range based) lock to protect allocated but not written ranges.

Yup, I had something similar in mind but definitely don't want to go into
that rathole now ;).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lock ordering in iomap code
  2016-10-20 11:55     ` Jan Kara
@ 2016-10-20 20:22       ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2016-10-20 20:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs

On Thu, Oct 20, 2016 at 01:55:00PM +0200, Jan Kara wrote:
> Hi Christoph,
> 
> On Mon 17-10-16 12:26:57, Christoph Hellwig wrote:
> > > Ping? I have ext4 DAX read & write path working with the iomap code but to
> > > convert the fault path, I need this resolved. Are you OK with moving
> > > iomap_begin() / iomap_end() calls outside of page lock / entry lock in the
> > > fault path?
> > 
> > Yes, that sounds fine.
> 
> I've been looking into this some more and realized it's not as easy as I've
> originally though. ->page_mkwrite callback is expected to return with the
> page locked so locking it inside the iomap actor is really awkward (think
> of situation when blocksize < pagesize). Since nobody currently has issues
> with ->iomap_begin being sometimes called with page lock and sometimes
> without, I don't think changing that would be worth the hassle. Just one
> more question: Doesn't XFS have some lock ordering issues when
> xfs_file_iomap_begin() gets called with page lock held from
> iomap_page_mkwrite() for a file with extent size hints and thus we end up
> in xfs_iomap_write_direct() with page lock held? We do a lot of stuff there
> including transaction setup and such...

Lock order in xfs is iolock->page lock->transaction->ilock, so what
is being done in xfs_file_iomap_begin (transactions, ilock) is fine
regardless of whether we hold the page locked or noti when called.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-10-20 20:22 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-07 11:13 Lock ordering in iomap code Jan Kara
2016-10-17  9:31 ` Jan Kara
2016-10-17 10:26   ` Christoph Hellwig
2016-10-20 11:55     ` Jan Kara
2016-10-20 20:22       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).