From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Cc: linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	willy@linux.intel.com, dan.j.williams@intel.com,
	kirill.shutemov@linux.intel.com, linux-nvdimm@lists.01.org,
	jack@suse.cz, linux-kernel@vger.kernel.org
Subject: [PATCH 0/7] xfs, dax: fix the page fault/allocation mess
Date: Thu,  1 Oct 2015 17:46:32 +1000
Message-ID: <1443685599-4843-1-git-send-email-david@fromorbit.com>

Hi folks,

As discussed in the recent thread about problems with DAX locking:

http://www.gossamer-threads.com/lists/linux/kernel/2264090?do=post_view_threaded

I said that I'd post the patch set that fixed the problems for XFS
as soon as I had something sane and workable. That's what this
series is.

To start with, it passes the xfstests "auto" group, with the only
failures being expected failures or failures due to unexpected
allocation patterns or trying to use unsupported block sizes. That
makes it better than any previous version of the XFS/DAX code.

The patchset starts by reverting the two patches that were
introduced in 4.3-rc1 to try to fix the fault vs fault and fault vs
truncate races that caused deadlocks. This fixes the hangs in
generic/075 that these patches introduced.

Patch 3 enables XFS to handle the behaviour of DAX and DIO when
asking to allocate the block at (2^63 - 1 FSB), where the offset +
count is technically illegal (larger than sb->s_maxbytes) and
overflows an s64 variable. This is currently hidden by the fact that
all DAX and DIO allocation is unwritten, but patch 5 exposes it for
DAX.
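
To illustrate the overflow outside the kernel: a minimal userspace
sketch, assuming a hypothetical helper clamped_end() that is not a
function from this series. It computes the end of [offset, offset +
count) without ever forming offset + count in signed arithmetic, and
clamps the result to a limit standing in for sb->s_maxbytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only. Returns the end offset of
 * the range [offset, offset + count), clamped to maxbytes. The caller
 * must pass 0 <= offset <= maxbytes.
 */
static int64_t clamped_end(int64_t offset, uint64_t count, int64_t maxbytes)
{
    assert(offset >= 0 && offset <= maxbytes);

    /* Bytes left before the limit; cannot overflow given the precondition. */
    uint64_t room = (uint64_t)(maxbytes - offset);

    /* A naive "offset + count" would overflow an s64 in this case. */
    if (count > room)
        return maxbytes;

    /* count <= room <= INT64_MAX, so the cast and the sum are both safe. */
    return offset + (int64_t)count;
}
```

For example, a 4096-byte request that starts 4095 bytes below
INT64_MAX is clamped to INT64_MAX instead of wrapping negative.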

Patch 4 introduces the ability for XFS to allocate physically zeroed
data blocks. This is done for each physical extent that is
allocated, deep inside the allocator itself and guaranteed to be
atomic with the allocation transaction and hence has no
crash+recovery exposure issues.

This is necessary because the BMAPI layer merges allocated extents
in the BMBT before it returns the mapped extent back to the high
level get_blocks() code. Hence the high level code can have a single
extent presented that is made of merged new and existing extents,
and so zeroing can't be done at this layer.
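
The merging problem can be shown with a toy model. This is a sketch
only: struct extent and extent_merge() are hypothetical names, and the
record is far simpler than anything in the XFS BMBT code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record (not an XFS structure). */
struct extent {
    uint64_t start; /* first block of the extent */
    uint64_t len;   /* length in blocks */
};

/*
 * Merge b into a when b begins exactly where a ends.
 * Returns true if the extents were merged, false otherwise.
 */
static bool extent_merge(struct extent *a, const struct extent *b)
{
    assert(a != NULL && b != NULL);

    if (a->start + a->len != b->start)
        return false;

    /* After this, nothing records which blocks came from b. */
    a->len += b->len;
    return true;
}
```

Merging an existing extent {100, 8} with a newly allocated {108, 4}
yields {100, 12}. A caller that only sees the merged extent cannot
tell the 4 new blocks from the 8 that already hold data, so zeroing at
that level would risk destroying existing contents.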

The advantage of driving the zeroing deep into the allocator is the
functionality is now available to all XFS code. Hence we can
allocate pre-zeroed blocks on any type of storage, and we can
utilise storage-based hardware acceleration (e.g. discard to zero,
WRITE_SAME, etc) to do the zeroing. From this POV, DAX is just
another hardware accelerated physical zeroing mechanism for XFS. :)
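
As a rough picture of "zeroing inside the allocator", here is a toy
userspace allocator. Everything in it is an assumption made for
illustration: struct toy_dev, toy_alloc() and ALLOC_ZERO are
hypothetical, the "device" is an in-memory array, and memset() stands
in for whatever zeroing mechanism the storage offers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16u   /* toy block size in bytes */
#define NBLOCKS    8u    /* toy device size in blocks */
#define ALLOC_ZERO 0x1u  /* hypothetical "return zeroed blocks" flag */

struct toy_dev {
    uint8_t  data[NBLOCKS * BLOCK_SIZE];
    uint32_t next_free; /* first unallocated block */
};

/*
 * Allocate len contiguous blocks. Returns the first block number, or -1
 * if the request cannot be satisfied. With ALLOC_ZERO set, exactly the
 * newly allocated blocks are zeroed before the caller ever sees them.
 */
static int toy_alloc(struct toy_dev *dev, uint32_t len, unsigned int flags)
{
    assert(dev != NULL);

    if (len == 0 || len > NBLOCKS - dev->next_free)
        return -1;

    uint32_t start = dev->next_free;

    if (flags & ALLOC_ZERO)
        memset(&dev->data[start * BLOCK_SIZE], 0, len * BLOCK_SIZE);

    dev->next_free += len;
    return (int)start;
}
```

Because the zeroing happens where the physical extent is known, the
caller needs no knowledge of which part of a mapping is new.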

[ This is an example of the mantra I repeat a lot: solve the problem
  properly the first time and it will make everything simpler! Sure,
  it took me three attempts to work out how to solve it in a sane
  manner, but that's pretty much par for the course with anything
  non-trivial. ]

Patch 5 makes __xfs_get_blocks() aware that it is being called from
the DAX fault path and makes sure it returns zeroed blocks rather
than unwritten extents via XFS_BMAPI_ZERO. It also now sets
XFS_BMAPI_CONVERT, which tells it to convert unwritten extents to
written, zeroed blocks. This is the major change of behaviour.
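
The flag selection described here can be sketched as below. This is
not the code from patch 5: sketch_alloc_flags() is a hypothetical
function, and the SKETCH_* bit values are placeholders rather than the
real XFS_BMAPI_ZERO and XFS_BMAPI_CONVERT definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch only. */
#define SKETCH_BMAPI_ZERO    (1u << 0) /* allocate zeroed blocks */
#define SKETCH_BMAPI_CONVERT (1u << 1) /* convert unwritten to written */

static_assert((SKETCH_BMAPI_ZERO & SKETCH_BMAPI_CONVERT) == 0,
              "sketch flags must not overlap");

/*
 * Hypothetical helper: pick allocation flags for a mapping request.
 * Only the DAX fault case is modelled; other callers get no extra flags.
 */
static unsigned int sketch_alloc_flags(bool dax_fault)
{
    unsigned int flags = 0;

    if (dax_fault)
        flags |= SKETCH_BMAPI_ZERO | SKETCH_BMAPI_CONVERT;

    return flags;
}
```

The two flags are set together so that both fresh allocations and
existing unwritten extents come back to the fault path as written,
zeroed blocks.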

Patch 6 removes the IO completion callbacks from the XFS DAX code as
they are no longer necessary after patch 5.

Patch 7 adds pfn_mkwrite support to XFS. This is needed to fix
generic/080, which detects a failure to update the inode timestamp
on a pfn fault. It also adds the same locking as the XFS
implementation of ->fault and ->page_mkwrite and hence provides
correct serialisation against truncate, hole punching, etc. that
doesn't currently exist.

The next steps that are needed are to do the same "block zeroing
during allocation" to ext4, and then the block zeroing and
complete_unwritten callbacks can be removed from the DAX API and
code. I've had a brief look at the ext4 code - the block zeroing
should be able to be done by overloading the existing zeroout code
that ext4 has in the unwritten extent allocation code. I'd much
prefer that an ext4 expert does this work, and then we can clean up
the DAX code...

Cheers,

Dave.


Thread overview:
2015-10-01  7:46 Dave Chinner [this message]
2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
2015-10-01  8:35   ` kbuild test robot
2015-10-01 20:27   ` Ross Zwisler
2015-10-01 22:14     ` Williams, Dan J
2015-10-01 22:45       ` Ross Zwisler
2015-10-01 22:32     ` Dave Chinner
2015-10-01 22:47       ` Ross Zwisler
2015-10-01  7:46 ` [PATCH 2/7] Revert "dax: fix race between simultaneous faults" Dave Chinner
2015-10-01  7:46 ` [PATCH 3/7] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-10-01  7:46 ` [PATCH 4/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-10-01  7:46 ` [PATCH 5/7] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-10-01  7:46 ` [PATCH 6/7] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-10-01  7:46 ` [PATCH 7/7] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-10-01 20:31 ` [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Ross Zwisler
2015-10-01 22:54   ` Dave Chinner
