[RFC v2 PATCH 00/10] dm-thin/xfs: prototype a block reservation allocation model

* [RFC v2 PATCH 00/10] dm-thin/xfs: prototype a block reservation allocation model
@ 2016-04-12 16:42 ` Brian Foster
  0 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

Hi all,

This is v2 of the XFS and block device reservation experiment. The
significant changes in v2 are that the bdev interface has been condensed
to a single callback function, the XFS transaction reservation
management has been reworked to make transactions responsible for
tracking and releasing excess reservation (for non-delalloc cases), and
a workaround for the fallocate over-reservation issue is included.
Beyond that, this version adds miscellaneous cleanups and fixes some of
the nastier locking/leak issues present in the first RFC.

Patches 1-2 refactor some XFS reserve pool and block accounting code in
preparation for subsequent patches. Patches 3-5 add block/device-mapper
reservation support. Patches 6-10 add the core reservation
infrastructure and management bits to XFS. See the link to the original
RFC below for instructions and further details on the purpose of this
series.

Finally, note that this is still highly experimental/theoretical and
should not be used on production systems. Thoughts, reviews, flames
appreciated.

Brian

rfcv2:
- Rebased to 4.6.0-rc3.
- Fix compile warnings reported by kbuild.
- Fix reservation leakage on fs ENOSPC.
- Fix mod_fdblocks locking to avoid BUG() (still racy).
- Fix XFS reserve block ENOSPC handling.
- Kill block wrappers, condense get/set/provision to a single callback.
- Update XFS transaction to track reservation, don't release excess on
  provision.
- Add transaction noblkres mode, use for fallocate reservation
  optimization.
rfc: http://oss.sgi.com/pipermail/xfs/2016-March/047673.html

Brian Foster (7):
  xfs: refactor xfs_reserve_blocks() to handle ENOSPC correctly
  xfs: replace xfs_mod_fdblocks() bool param with flags
  xfs: thin block device reservation mechanism
  xfs: adopt a reserved allocation model on dm-thin devices
  xfs: handle bdev reservation ENOSPC correctly from XFS reserved pool
  xfs: support no block reservation transaction mode
  xfs: use contiguous bdev reservation for file preallocation

Joe Thornber (1):
  dm thin: add methods to set and get reserved space

Mike Snitzer (2):
  block: add block_device_operations methods to set and get reserved
    space
  dm: add methods to set and get reserved space

 drivers/md/dm-thin.c          | 181 +++++++++++++++++++++++++++--
 drivers/md/dm.c               |  41 +++++++
 fs/xfs/Makefile               |   1 +
 fs/xfs/libxfs/xfs_alloc.c     |  25 ++++
 fs/xfs/libxfs/xfs_bmap.c      |  17 ++-
 fs/xfs/libxfs/xfs_shared.h    |   2 +
 fs/xfs/xfs_bmap_util.c        |  29 ++++-
 fs/xfs/xfs_fsops.c            | 128 ++++++++++++++-------
 fs/xfs/xfs_mount.c            | 106 +++++++++++++++--
 fs/xfs/xfs_mount.h            |  12 +-
 fs/xfs/xfs_thin.c             | 260 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_thin.h             |  31 +++++
 fs/xfs/xfs_trace.h            |  27 +++++
 fs/xfs/xfs_trans.c            |  94 +++++++++++++--
 fs/xfs/xfs_trans.h            |   1 +
 include/linux/blkdev.h        |   6 +
 include/linux/device-mapper.h |   5 +
 17 files changed, 880 insertions(+), 86 deletions(-)
 create mode 100644 fs/xfs/xfs_thin.c
 create mode 100644 fs/xfs/xfs_thin.h

-- 
2.4.11

* [RFC v2 PATCH 01/10] xfs: refactor xfs_reserve_blocks() to handle ENOSPC correctly
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

xfs_reserve_blocks() is responsible for updating the XFS reserved block
pool count at mount time or on user request. When the caller asks to
increase the reserve pool, blocks must be allocated from the global
counters so that they are no longer available for general purpose use.
If the requested reserve pool size is too large, XFS reserves whatever
blocks are available. The implementation looks at the percpu counters
and makes an educated guess as to how many blocks to try to allocate
via xfs_mod_fdblocks(), which can return -ENOSPC if the guess was
inaccurate because the counters were modified in parallel.

xfs_reserve_blocks() retries the guess in this scenario until the
allocation succeeds or it determines that no space is available in the
fs. While not easily reproducible in its current form, the retry code
does not work correctly if xfs_mod_fdblocks() actually fails. The
problem is that the percpu calculations use the m_resblks counter to
determine how many blocks to allocate, but m_resblks is updated
unconditionally before the block allocation has succeeded. Therefore,
if xfs_mod_fdblocks() fails, the code jumps to the retry label and uses
the already updated m_resblks value to determine how many blocks to try
to allocate. If the percpu counters previously suggested that the
entire request was available, fdblks_delta could end up set to 0. In
that case, m_resblks is updated to the requested value, yet no blocks
have been reserved at all.
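
To make the failure mode concrete, here is a minimal userspace sketch
of the grow path as it behaves before this patch. It is a toy model,
not kernel code: struct toy_mount, toy_mod_fdblocks() and
toy_reserve_old() are hypothetical names, and the first_sample
parameter is an illustration device standing in for a percpu reading
that went stale before the allocation was attempted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for the few struct xfs_mount fields involved (hypothetical). */
struct toy_mount {
	int64_t	fdblocks;	/* blocks really free right now */
	int64_t	resblks;	/* reserve pool size */
	int64_t	resblks_avail;	/* reserve pool blocks still unused */
};

/* Toy model of xfs_mod_fdblocks(): refuse to take the count below zero. */
static int toy_mod_fdblocks(struct toy_mount *mp, int64_t delta)
{
	if (mp->fdblocks + delta < 0)
		return -ENOSPC;
	mp->fdblocks += delta;
	return 0;
}

/*
 * Grow path of the pre-patch logic. first_sample plays the role of a
 * stale percpu reading; every retry reads the true free count.
 */
static void toy_reserve_old(struct toy_mount *mp, int64_t request,
			    int64_t first_sample)
{
	int64_t	free = first_sample;
	int64_t	delta, fdblks_delta;

	assert(request >= mp->resblks);	/* only the grow path is modelled */

	for (;;) {
		fdblks_delta = 0;
		if (free) {
			delta = request - mp->resblks;
			if (free - delta < 0) {
				/* partial: take what the sample says is free */
				mp->resblks += free;
				mp->resblks_avail += free;
				fdblks_delta = -free;
			} else {
				fdblks_delta = -delta;
				mp->resblks = request;	/* updated too early */
				mp->resblks_avail += delta;
			}
		}
		/* nothing to allocate, or the allocation succeeded: done */
		if (!fdblks_delta || !toy_mod_fdblocks(mp, fdblks_delta))
			return;
		free = mp->fdblocks;	/* "goto retry" with a fresh sample */
	}
}
```

With 50 blocks really free, a request of 100 and a stale first sample
of 150, the first allocation fails, the retry computes a delta of 0
from the already updated counter, and the function returns with the
reserve pool claiming 100 blocks while the free count is untouched.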

Refactor xfs_reserve_blocks() to use an explicit loop and make the code
easier to follow. Since we have to drop the spinlock across the
xfs_mod_fdblocks() call, use a delta value for m_resblks as well and
only apply the delta once allocation succeeds.
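
The reworked flow can be sketched in userspace roughly as follows.
This is a simplified model under stated assumptions, not the patch
itself: struct sketch_mount, sketch_mod_fdblocks() and
sketch_reserve() are hypothetical names, locking is omitted, and
first_sample is an illustration device that models a stale percpu
reading on the first pass.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for the few struct xfs_mount fields involved (hypothetical). */
struct sketch_mount {
	int64_t	fdblocks;	/* blocks really free right now */
	int64_t	resblks;	/* reserve pool size */
	int64_t	resblks_avail;	/* reserve pool blocks still unused */
};

/* Toy model of xfs_mod_fdblocks(): refuse to take the count below zero. */
static int sketch_mod_fdblocks(struct sketch_mount *mp, int64_t delta)
{
	if (mp->fdblocks + delta < 0)
		return -ENOSPC;
	mp->fdblocks += delta;
	return 0;
}

/*
 * Grow path with an explicit loop: compute a delta, try to allocate it,
 * and only touch the reserve counters once the allocation has succeeded.
 */
static int sketch_reserve(struct sketch_mount *mp, int64_t request,
			  int64_t first_sample)
{
	int64_t	free = first_sample;
	int64_t	delta, fdblks_delta = 0;
	int	error = -ENOSPC;

	assert(request >= mp->resblks);	/* only the grow path is modelled */

	while (error == -ENOSPC) {
		if (!free)
			break;		/* nothing left to reserve */

		delta = request - mp->resblks;
		/* partial reservation if the request exceeds free space */
		fdblks_delta = (free - delta < 0) ? free : delta;

		error = sketch_mod_fdblocks(mp, -fdblks_delta);
		free = mp->fdblocks;	/* fresh sample for a possible retry */
	}

	/* apply the delta only after the blocks were really allocated */
	if (!error && fdblks_delta) {
		mp->resblks += fdblks_delta;
		mp->resblks_avail += fdblks_delta;
	}
	return error;
}
```

Because m_resblks is not modified until after a successful allocation,
a failed attempt leaves the inputs of the next delta calculation
unchanged, so a retry after -ENOSPC recomputes the shortfall from
accurate state.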

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fsops.c | 105 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 60 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index ee3aaa0a..87d4b1b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -671,8 +671,11 @@ xfs_reserve_blocks(
 	__uint64_t              *inval,
 	xfs_fsop_resblks_t      *outval)
 {
-	__int64_t		lcounter, delta, fdblks_delta;
+	__int64_t		lcounter, delta;
+	__int64_t		fdblks_delta = 0;
 	__uint64_t		request;
+	__int64_t		free;
+	int			error = 0;
 
 	/* If inval is null, report current values and return */
 	if (inval == (__uint64_t *)NULL) {
@@ -686,24 +689,23 @@ xfs_reserve_blocks(
 	request = *inval;
 
 	/*
-	 * With per-cpu counters, this becomes an interesting
-	 * problem. we needto work out if we are freeing or allocation
-	 * blocks first, then we can do the modification as necessary.
+	 * With per-cpu counters, this becomes an interesting problem. we need
+	 * to work out if we are freeing or allocation blocks first, then we can
+	 * do the modification as necessary.
 	 *
-	 * We do this under the m_sb_lock so that if we are near
-	 * ENOSPC, we will hold out any changes while we work out
-	 * what to do. This means that the amount of free space can
-	 * change while we do this, so we need to retry if we end up
-	 * trying to reserve more space than is available.
+	 * We do this under the m_sb_lock so that if we are near ENOSPC, we will
+	 * hold out any changes while we work out what to do. This means that
+	 * the amount of free space can change while we do this, so we need to
+	 * retry if we end up trying to reserve more space than is available.
 	 */
-retry:
 	spin_lock(&mp->m_sb_lock);
 
 	/*
 	 * If our previous reservation was larger than the current value,
-	 * then move any unused blocks back to the free pool.
+	 * then move any unused blocks back to the free pool. Modify the resblks
+	 * counters directly since we shouldn't have any problems unreserving
+	 * space.
 	 */
-	fdblks_delta = 0;
 	if (mp->m_resblks > request) {
 		lcounter = mp->m_resblks_avail - request;
 		if (lcounter  > 0) {		/* release unused blocks */
@@ -711,54 +713,67 @@ retry:
 			mp->m_resblks_avail -= lcounter;
 		}
 		mp->m_resblks = request;
-	} else {
-		__int64_t	free;
+		if (fdblks_delta) {
+			spin_unlock(&mp->m_sb_lock);
+			error = xfs_mod_fdblocks(mp, fdblks_delta, 0);
+			spin_lock(&mp->m_sb_lock);
+		}
+
+		goto out;
+	}
 
+	/*
+	 * If the request is larger than the current reservation, reserve the
+	 * blocks before we update the reserve counters. Sample m_fdblocks and
+	 * perform a partial reservation if the request exceeds free space.
+	 */
+	error = -ENOSPC;
+	while (error == -ENOSPC) {
 		free = percpu_counter_sum(&mp->m_fdblocks) -
 							XFS_ALLOC_SET_ASIDE(mp);
 		if (!free)
-			goto out; /* ENOSPC and fdblks_delta = 0 */
+			break;
 
 		delta = request - mp->m_resblks;
 		lcounter = free - delta;
-		if (lcounter < 0) {
+		if (lcounter < 0)
 			/* We can't satisfy the request, just get what we can */
-			mp->m_resblks += free;
-			mp->m_resblks_avail += free;
-			fdblks_delta = -free;
-		} else {
-			fdblks_delta = -delta;
-			mp->m_resblks = request;
-			mp->m_resblks_avail += delta;
-		}
-	}
-out:
-	if (outval) {
-		outval->resblks = mp->m_resblks;
-		outval->resblks_avail = mp->m_resblks_avail;
-	}
-	spin_unlock(&mp->m_sb_lock);
+			fdblks_delta = free;
+		else
+			fdblks_delta = delta;
 
-	if (fdblks_delta) {
 		/*
-		 * If we are putting blocks back here, m_resblks_avail is
-		 * already at its max so this will put it in the free pool.
-		 *
-		 * If we need space, we'll either succeed in getting it
-		 * from the free block count or we'll get an enospc. If
-		 * we get a ENOSPC, it means things changed while we were
-		 * calculating fdblks_delta and so we should try again to
-		 * see if there is anything left to reserve.
+		 * We'll either succeed in getting space from the free block
+		 * count or we'll get an ENOSPC. If we get a ENOSPC, it means
+		 * things changed while we were calculating fdblks_delta and so
+		 * we should try again to see if there is anything left to
+		 * reserve.
 		 *
 		 * Don't set the reserved flag here - we don't want to reserve
 		 * the extra reserve blocks from the reserve.....
 		 */
-		int error;
-		error = xfs_mod_fdblocks(mp, fdblks_delta, 0);
-		if (error == -ENOSPC)
-			goto retry;
+		spin_unlock(&mp->m_sb_lock);
+		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+		spin_lock(&mp->m_sb_lock);
 	}
-	return 0;
+
+	/*
+	 * Update the reserve counters if blocks have been successfully
+	 * allocated.
+	 */
+	if (!error && fdblks_delta) {
+		mp->m_resblks += fdblks_delta;
+		mp->m_resblks_avail += fdblks_delta;
+	}
+
+out:
+	if (outval) {
+		outval->resblks = mp->m_resblks;
+		outval->resblks_avail = mp->m_resblks_avail;
+	}
+
+	spin_unlock(&mp->m_sb_lock);
+	return error;
 }
 
 int
-- 
2.4.11


* [RFC v2 PATCH 02/10] xfs: replace xfs_mod_fdblocks() bool param with flags
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

xfs_mod_fdblocks() takes a boolean parameter to indicate whether the
requested allocation can dip into the XFS reserve block pool, if
necessary, to satisfy the allocation.

This function will also require caller control over block device
reservation. In preparation, convert the bool parameter to a flags
parameter and update all callers to pass the appropriate reserved pool
flag.
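
As a rough illustration of the conversion, the userspace sketch below
shows a bool parameter being replaced by a flags word. Only the
XFS_FDBLOCKS_RSVD name and its (1 << 0) value come from this patch.
DEMO_FDBLOCKS_OTHER, DEMO_TRANS_RESERVE (with an arbitrary value) and
the demo_* helpers are hypothetical and exist only to show why a flags
word leaves room for further per-call controls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XFS_FDBLOCKS_RSVD	(1 << 0)	/* defined by this patch */
#define DEMO_FDBLOCKS_OTHER	(1 << 1)	/* hypothetical later flag */

/* Stand-in for the transaction flag; the value here is arbitrary. */
#define DEMO_TRANS_RESERVE	(1 << 5)

/* Callee side: recover the old bool from the flags word. */
static bool demo_may_use_reserve(uint32_t flags)
{
	return (flags & XFS_FDBLOCKS_RSVD) != 0;
}

/* Caller side: translate transaction flags into fdblocks flags. */
static uint32_t demo_fdblocks_flags(unsigned int t_flags)
{
	uint32_t flags = 0;

	if (t_flags & DEMO_TRANS_RESERVE)
		flags |= XFS_FDBLOCKS_RSVD;
	return flags;
}

/* Usage: callers that used to pass false now pass 0. */
static void demo_usage(void)
{
	assert(!demo_may_use_reserve(0));
	assert(demo_may_use_reserve(demo_fdblocks_flags(DEMO_TRANS_RESERVE)));
	/* an unrelated flag does not affect the reserve pool decision */
	assert(!demo_may_use_reserve(DEMO_FDBLOCKS_OTHER));
}
```

Callers that previously passed false map to a flags value of 0, which
is why most call sites in the diff change only their last argument.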

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++---------
 fs/xfs/xfs_mount.c       |  3 ++-
 fs/xfs/xfs_mount.h       |  4 +++-
 fs/xfs/xfs_trans.c       | 21 ++++++++++++++-------
 4 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ce41d7f..1a805b0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2186,7 +2186,7 @@ xfs_bmap_add_extent_delay_real(
 			(bma->cur ? bma->cur->bc_private.b.allocated : 0));
 		if (diff > 0) {
 			error = xfs_mod_fdblocks(bma->ip->i_mount,
-						 -((int64_t)diff), false);
+						 -((int64_t)diff), 0);
 			ASSERT(!error);
 			if (error)
 				goto done;
@@ -2238,7 +2238,7 @@ xfs_bmap_add_extent_delay_real(
 		ASSERT(temp <= da_old);
 		if (temp < da_old)
 			xfs_mod_fdblocks(bma->ip->i_mount,
-					(int64_t)(da_old - temp), false);
+					(int64_t)(da_old - temp), 0);
 	}
 
 	/* clear out the allocated field, done with it now in any case. */
@@ -2916,8 +2916,7 @@ xfs_bmap_add_extent_hole_delay(
 	}
 	if (oldlen != newlen) {
 		ASSERT(oldlen > newlen);
-		xfs_mod_fdblocks(ip->i_mount, (int64_t)(oldlen - newlen),
-				 false);
+		xfs_mod_fdblocks(ip->i_mount, (int64_t)(oldlen - newlen), 0);
 		/*
 		 * Nothing to do for disk quota accounting here.
 		 */
@@ -4149,13 +4148,13 @@ xfs_bmapi_reserve_delalloc(
 	if (rt) {
 		error = xfs_mod_frextents(mp, -((int64_t)extsz));
 	} else {
-		error = xfs_mod_fdblocks(mp, -((int64_t)alen), false);
+		error = xfs_mod_fdblocks(mp, -((int64_t)alen), 0);
 	}
 
 	if (error)
 		goto out_unreserve_quota;
 
-	error = xfs_mod_fdblocks(mp, -((int64_t)indlen), false);
+	error = xfs_mod_fdblocks(mp, -((int64_t)indlen), 0);
 	if (error)
 		goto out_unreserve_blocks;
 
@@ -4184,7 +4183,7 @@ out_unreserve_blocks:
 	if (rt)
 		xfs_mod_frextents(mp, extsz);
 	else
-		xfs_mod_fdblocks(mp, alen, false);
+		xfs_mod_fdblocks(mp, alen, 0);
 out_unreserve_quota:
 	if (XFS_IS_QUOTA_ON(mp))
 		xfs_trans_unreserve_quota_nblks(NULL, ip, (long)alen, 0, rt ?
@@ -5093,7 +5092,7 @@ xfs_bmap_del_extent(
 	 */
 	ASSERT(da_old >= da_new);
 	if (da_old > da_new)
-		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
+		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), 0);
 done:
 	*logflagsp = flags;
 	return error;
@@ -5413,7 +5412,7 @@ xfs_bunmapi(
 			goto error0;
 
 		if (!isrt && wasdel)
-			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
+			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, 0);
 
 		bno = del.br_startoff - 1;
 nodelete:
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index cfd4210..50a6ccc 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1150,11 +1150,12 @@ int
 xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
 	int64_t			delta,
-	bool			rsvd)
+	uint32_t		flags)
 {
 	int64_t			lcounter;
 	long long		res_used;
 	s32			batch;
+	bool			rsvd = (flags & XFS_FDBLOCKS_RSVD);
 
 	if (delta > 0) {
 		/*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index eafe257..bd1043f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -347,8 +347,10 @@ extern void	xfs_unmountfs(xfs_mount_t *);
 
 extern int	xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
 extern int	xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
+
+#define	XFS_FDBLOCKS_RSVD	(1 << 0)
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
-				 bool reserved);
+				 uint32_t flags);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
 
 extern struct xfs_buf *xfs_getsb(xfs_mount_t *, int);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 20c5366..8aa9d9a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -172,8 +172,11 @@ xfs_trans_reserve(
 	uint			blocks,
 	uint			rtextents)
 {
-	int		error = 0;
-	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+	int			flags = 0;
+
+	if (tp->t_flags & XFS_TRANS_RESERVE)
+		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* Mark this thread as being in a transaction */
 	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
@@ -184,7 +187,8 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
+		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks),
+					 flags);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 			return -ENOSPC;
@@ -259,7 +263,7 @@ undo_log:
 
 undo_blocks:
 	if (blocks > 0) {
-		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
+		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
 		tp->t_blk_res = 0;
 	}
 
@@ -543,12 +547,15 @@ xfs_trans_unreserve_and_mod_sb(
 	struct xfs_trans	*tp)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 	int64_t			blkdelta = 0;
 	int64_t			rtxdelta = 0;
 	int64_t			idelta = 0;
 	int64_t			ifreedelta = 0;
 	int			error;
+	int			flags = 0;
+
+	if (tp->t_flags & XFS_TRANS_RESERVE)
+		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* calculate deltas */
 	if (tp->t_blk_res > 0)
@@ -572,7 +579,7 @@ xfs_trans_unreserve_and_mod_sb(
 
 	/* apply the per-cpu counters */
 	if (blkdelta) {
-		error = xfs_mod_fdblocks(mp, blkdelta, rsvd);
+		error = xfs_mod_fdblocks(mp, blkdelta, flags);
 		if (error)
 			goto out;
 	}
@@ -680,7 +687,7 @@ out_undo_icount:
 		xfs_mod_icount(mp, -idelta);
 out_undo_fdblocks:
 	if (blkdelta)
-		xfs_mod_fdblocks(mp, -blkdelta, rsvd);
+		xfs_mod_fdblocks(mp, -blkdelta, flags);
 out:
 	ASSERT(error == 0);
 	return;
-- 
2.4.11


* [RFC v2 PATCH 02/10] xfs: replace xfs_mod_fdblocks() bool param with flags
@ 2016-04-12 16:42   ` Brian Foster
  0 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

xfs_mod_fdblocks() takes a boolean parameter to indicate whether the
requested allocation can dip into the XFS reserve block pool, if
necessary, to satisfy the allocation.

This function will also require caller control over block device
reservation. In preparation, convert the bool parameter to a flags
parameter and update all callers to pass the appropriate reserved pool
flag.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++---------
 fs/xfs/xfs_mount.c       |  3 ++-
 fs/xfs/xfs_mount.h       |  4 +++-
 fs/xfs/xfs_trans.c       | 21 ++++++++++++++-------
 4 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ce41d7f..1a805b0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2186,7 +2186,7 @@ xfs_bmap_add_extent_delay_real(
 			(bma->cur ? bma->cur->bc_private.b.allocated : 0));
 		if (diff > 0) {
 			error = xfs_mod_fdblocks(bma->ip->i_mount,
-						 -((int64_t)diff), false);
+						 -((int64_t)diff), 0);
 			ASSERT(!error);
 			if (error)
 				goto done;
@@ -2238,7 +2238,7 @@ xfs_bmap_add_extent_delay_real(
 		ASSERT(temp <= da_old);
 		if (temp < da_old)
 			xfs_mod_fdblocks(bma->ip->i_mount,
-					(int64_t)(da_old - temp), false);
+					(int64_t)(da_old - temp), 0);
 	}
 
 	/* clear out the allocated field, done with it now in any case. */
@@ -2916,8 +2916,7 @@ xfs_bmap_add_extent_hole_delay(
 	}
 	if (oldlen != newlen) {
 		ASSERT(oldlen > newlen);
-		xfs_mod_fdblocks(ip->i_mount, (int64_t)(oldlen - newlen),
-				 false);
+		xfs_mod_fdblocks(ip->i_mount, (int64_t)(oldlen - newlen), 0);
 		/*
 		 * Nothing to do for disk quota accounting here.
 		 */
@@ -4149,13 +4148,13 @@ xfs_bmapi_reserve_delalloc(
 	if (rt) {
 		error = xfs_mod_frextents(mp, -((int64_t)extsz));
 	} else {
-		error = xfs_mod_fdblocks(mp, -((int64_t)alen), false);
+		error = xfs_mod_fdblocks(mp, -((int64_t)alen), 0);
 	}
 
 	if (error)
 		goto out_unreserve_quota;
 
-	error = xfs_mod_fdblocks(mp, -((int64_t)indlen), false);
+	error = xfs_mod_fdblocks(mp, -((int64_t)indlen), 0);
 	if (error)
 		goto out_unreserve_blocks;
 
@@ -4184,7 +4183,7 @@ out_unreserve_blocks:
 	if (rt)
 		xfs_mod_frextents(mp, extsz);
 	else
-		xfs_mod_fdblocks(mp, alen, false);
+		xfs_mod_fdblocks(mp, alen, 0);
 out_unreserve_quota:
 	if (XFS_IS_QUOTA_ON(mp))
 		xfs_trans_unreserve_quota_nblks(NULL, ip, (long)alen, 0, rt ?
@@ -5093,7 +5092,7 @@ xfs_bmap_del_extent(
 	 */
 	ASSERT(da_old >= da_new);
 	if (da_old > da_new)
-		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
+		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), 0);
 done:
 	*logflagsp = flags;
 	return error;
@@ -5413,7 +5412,7 @@ xfs_bunmapi(
 			goto error0;
 
 		if (!isrt && wasdel)
-			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
+			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, 0);
 
 		bno = del.br_startoff - 1;
 nodelete:
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index cfd4210..50a6ccc 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1150,11 +1150,12 @@ int
 xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
 	int64_t			delta,
-	bool			rsvd)
+	uint32_t		flags)
 {
 	int64_t			lcounter;
 	long long		res_used;
 	s32			batch;
+	bool			rsvd = (flags & XFS_FDBLOCKS_RSVD);
 
 	if (delta > 0) {
 		/*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index eafe257..bd1043f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -347,8 +347,10 @@ extern void	xfs_unmountfs(xfs_mount_t *);
 
 extern int	xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
 extern int	xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
+
+#define	XFS_FDBLOCKS_RSVD	(1 << 0)
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
-				 bool reserved);
+				 uint32_t flags);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
 
 extern struct xfs_buf *xfs_getsb(xfs_mount_t *, int);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 20c5366..8aa9d9a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -172,8 +172,11 @@ xfs_trans_reserve(
 	uint			blocks,
 	uint			rtextents)
 {
-	int		error = 0;
-	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+	int			flags = 0;
+
+	if (tp->t_flags & XFS_TRANS_RESERVE)
+		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* Mark this thread as being in a transaction */
 	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
@@ -184,7 +187,8 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
+		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks),
+					 flags);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 			return -ENOSPC;
@@ -259,7 +263,7 @@ undo_log:
 
 undo_blocks:
 	if (blocks > 0) {
-		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
+		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
 		tp->t_blk_res = 0;
 	}
 
@@ -543,12 +547,15 @@ xfs_trans_unreserve_and_mod_sb(
 	struct xfs_trans	*tp)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 	int64_t			blkdelta = 0;
 	int64_t			rtxdelta = 0;
 	int64_t			idelta = 0;
 	int64_t			ifreedelta = 0;
 	int			error;
+	int			flags = 0;
+
+	if (tp->t_flags & XFS_TRANS_RESERVE)
+		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* calculate deltas */
 	if (tp->t_blk_res > 0)
@@ -572,7 +579,7 @@ xfs_trans_unreserve_and_mod_sb(
 
 	/* apply the per-cpu counters */
 	if (blkdelta) {
-		error = xfs_mod_fdblocks(mp, blkdelta, rsvd);
+		error = xfs_mod_fdblocks(mp, blkdelta, flags);
 		if (error)
 			goto out;
 	}
@@ -680,7 +687,7 @@ out_undo_icount:
 		xfs_mod_icount(mp, -idelta);
 out_undo_fdblocks:
 	if (blkdelta)
-		xfs_mod_fdblocks(mp, -blkdelta, rsvd);
+		xfs_mod_fdblocks(mp, -blkdelta, flags);
 out:
 	ASSERT(error == 0);
 	return;
-- 
2.4.11

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 03/10] block: add block_device_operations methods to set and get reserved space
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel, Mike Snitzer

From: Mike Snitzer <snitzer@redhat.com>

[BF:
 - Killed wrapper functions.
 - Condensed to single bdev op.]
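
For illustration only, and not part of the patch: a minimal userspace
sketch of how one callback can serve all three modes. The mode semantics
are inferred from the dm-thin implementation later in this series.
`struct toy_bdev`, `toy_reserve_space()` and the `sector_t` typedef are
hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sector counts */

/* Mode values as defined by this patch. */
#define BDEV_RES_GET		0
#define BDEV_RES_MOD		(1 << 0)
#define BDEV_RES_PROVISION	(1 << 1)

/* Hypothetical stand-in for a thin device; not a kernel structure. */
struct toy_bdev {
	sector_t free;		/* unallocated sectors, reserved or not */
	sector_t reserved;	/* sectors promised to the caller */
};

static int toy_reserve_space(struct toy_bdev *bdev, int mode, sector_t offset,
			     sector_t len, sector_t *res)
{
	sector_t from_res;

	(void)offset;	/* the toy device has no block mapping to consult */
	assert(bdev->reserved <= bdev->free);

	switch (mode) {
	case BDEV_RES_GET:
		if (!res)
			return -EINVAL;
		*res = bdev->reserved;
		return 0;
	case BDEV_RES_MOD:
		/* *res is the new absolute reservation, not a delta */
		if (!res)
			return -EINVAL;
		if (*res > bdev->free)
			return -ENOSPC;
		bdev->reserved = *res;
		return 0;
	case BDEV_RES_PROVISION:
		if (!len)
			return -EINVAL;
		/* consume the reservation first, then unreserved space */
		from_res = len < bdev->reserved ? len : bdev->reserved;
		if (len - from_res > bdev->free - bdev->reserved)
			return -ENOSPC;
		bdev->reserved -= from_res;
		bdev->free -= len;
		if (res)
			*res = *res > len ? *res - len : 0;
		return 0;
	default:
		return -EINVAL;
	}
}
```

In this sketch a caller queries with BDEV_RES_GET, passes the desired
absolute reservation in *res for BDEV_RES_MOD, and passes a length for
BDEV_RES_PROVISION. Folding the three into one op keeps
block_device_operations small, at the cost of a mode argument.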

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 669e419..6c6ea96 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1650,6 +1650,10 @@ struct blk_dax_ctl {
 	pfn_t pfn;
 };
 
+#define BDEV_RES_GET		0
+#define BDEV_RES_MOD		(1 << 0)
+#define BDEV_RES_PROVISION	(1 << 1)
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1667,6 +1671,8 @@ struct block_device_operations {
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
 	/* this callback is with swap_lock and sometimes page table lock held */
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	int (*reserve_space) (struct block_device *, int, sector_t, sector_t,
+			      sector_t *);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 04/10] dm: add methods to set and get reserved space
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel, Mike Snitzer

From: Mike Snitzer <snitzer@redhat.com>

[BF: Condensed to single function.]

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c               | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/device-mapper.h |  5 +++++
 2 files changed, 46 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index be49057..8f95e78 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -664,6 +664,46 @@ out:
 	return r;
 }
 
+/*
+ * FIXME: factor out common helper that can be used by
+ * multiple block_device_operations -> target methods
+ * (including dm_blk_ioctl above)
+ */
+
+static int dm_blk_reserve_space(struct block_device *bdev, int mode,
+				sector_t offset, sector_t len, sector_t *res)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	int srcu_idx;
+	struct dm_table *map;
+	struct dm_target *tgt;
+	int r = -EINVAL;
+
+	map = dm_get_live_table(md, &srcu_idx);
+
+	if (!map || !dm_table_get_size(map))
+		goto out;
+
+	/* We only support devices that have a single target */
+	if (dm_table_get_num_targets(map) != 1)
+		goto out;
+
+	tgt = dm_table_get_target(map, 0);
+	if (!tgt->type->reserve_space)
+		goto out;
+
+	if (dm_suspended_md(md)) {
+		r = -EAGAIN;
+		goto out;
+	}
+
+	r = tgt->type->reserve_space(tgt, mode, offset, len, res);
+out:
+	dm_put_live_table(md, srcu_idx);
+
+	return r;
+}
+
 static struct dm_io *alloc_io(struct mapped_device *md)
 {
 	return mempool_alloc(md->io_pool, GFP_NOIO);
@@ -3723,6 +3763,7 @@ static const struct block_device_operations dm_blk_dops = {
 	.ioctl = dm_blk_ioctl,
 	.getgeo = dm_blk_getgeo,
 	.pr_ops = &dm_pr_ops,
+	.reserve_space = dm_blk_reserve_space,
 	.owner = THIS_MODULE
 };
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 0830c9e..b4825db 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -116,6 +116,10 @@ typedef void (*dm_io_hints_fn) (struct dm_target *ti,
  */
 typedef int (*dm_busy_fn) (struct dm_target *ti);
 
+typedef int (*dm_reserve_space_fn) (struct dm_target *ti, int mode,
+				    sector_t offset, sector_t len,
+				    sector_t *res);
+
 void dm_error(const char *message);
 
 struct dm_dev {
@@ -162,6 +166,7 @@ struct target_type {
 	dm_busy_fn busy;
 	dm_iterate_devices_fn iterate_devices;
 	dm_io_hints_fn io_hints;
+	dm_reserve_space_fn reserve_space;
 
 	/* For internal device-mapper use. */
 	struct list_head list;
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel, Joe Thornber

From: Joe Thornber <ejt@redhat.com>

Experimental reserve interface for XFS guys to play with.

I have big reservations (no pun intended) about this patch.

[BF:
 - Support for reservation reduction.
 - Support for space provisioning.
 - Condensed to a single function.]
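
For illustration only, and not part of the patch: a userspace sketch of
the two-level accounting the diff below introduces, where each thin
device holds its own reserve_count and the pool tracks the sum. The
`toy_*` names are hypothetical, locking is omitted, and the ENOSPC check
is written in terms of the other devices' reservations rather than a
signed delta.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t dm_block_t;	/* assumption: 64-bit block counts */

/* Hypothetical stand-ins for struct pool and struct thin_c. */
struct toy_pool {
	dm_block_t free_blocks;
	dm_block_t reserve_count;	/* sum over all thin devices */
};

struct toy_thin {
	struct toy_pool *pool;
	dm_block_t reserve_count;	/* this device's share */
};

/* Models set_reserve_count(): @count is absolute for this thin device. */
static int toy_set_reserve(struct toy_thin *tc, dm_block_t count)
{
	struct toy_pool *pool = tc->pool;
	dm_block_t others;

	assert(tc->reserve_count <= pool->reserve_count);
	others = pool->reserve_count - tc->reserve_count;
	if (count > pool->free_blocks || others > pool->free_blocks - count)
		return -ENOSPC;
	tc->reserve_count = count;
	pool->reserve_count = others + count;
	return 0;
}

/* Models dec_reserve_count() together with the allocation it gates. */
static int toy_alloc_block(struct toy_thin *tc)
{
	struct toy_pool *pool = tc->pool;

	if (!pool->free_blocks)
		return -ENOSPC;
	if (tc->reserve_count > 0) {
		/* spend one block of this device's own reservation */
		tc->reserve_count--;
		pool->reserve_count--;
	} else if (pool->free_blocks <= pool->reserve_count) {
		/* every free block is already promised to some device */
		return -ENOSPC;
	}
	pool->free_blocks--;
	return 0;
}
```

The sketch shows the intended effect: a device with no reservation can
only allocate from space that no other device has reserved, while a
device with a reservation can always draw it down.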

Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 171 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 92237b6..32bc5bd 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -271,6 +271,8 @@ struct pool {
 	process_mapping_fn process_prepared_discard;
 
 	struct dm_bio_prison_cell **cell_sort_array;
+
+	dm_block_t reserve_count;
 };
 
 static enum pool_mode get_pool_mode(struct pool *pool);
@@ -318,6 +320,8 @@ struct thin_c {
 	 */
 	atomic_t refcount;
 	struct completion can_destroy;
+
+	dm_block_t reserve_count;
 };
 
 /*----------------------------------------------------------------*/
@@ -1359,24 +1363,19 @@ static void check_low_water_mark(struct pool *pool, dm_block_t free_blocks)
 	}
 }
 
-static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
+static int get_free_blocks(struct pool *pool, dm_block_t *free_blocks)
 {
 	int r;
-	dm_block_t free_blocks;
-	struct pool *pool = tc->pool;
-
-	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
-		return -EINVAL;
 
-	r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
+	r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
 	if (r) {
 		metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
 		return r;
 	}
 
-	check_low_water_mark(pool, free_blocks);
+	check_low_water_mark(pool, *free_blocks);
 
-	if (!free_blocks) {
+	if (!*free_blocks) {
 		/*
 		 * Try to commit to see if that will free up some
 		 * more space.
@@ -1385,7 +1384,7 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
 		if (r)
 			return r;
 
-		r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
+		r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
 		if (r) {
 			metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
 			return r;
@@ -1397,6 +1396,76 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
 		}
 	}
 
+	return r;
+}
+
+/*
+ * Returns true iff either:
+ * i) decrement succeeded (ie. there was reserve left)
+ * ii) there is extra space in the pool
+ */
+static bool dec_reserve_count(struct thin_c *tc, dm_block_t free_blocks)
+{
+	bool r = false;
+	unsigned long flags;
+
+	if (!free_blocks)
+		return false;
+
+	spin_lock_irqsave(&tc->pool->lock, flags);
+	if (tc->reserve_count > 0) {
+		tc->reserve_count--;
+		tc->pool->reserve_count--;
+		r = true;
+	} else {
+		if (free_blocks > tc->pool->reserve_count)
+			r = true;
+	}
+	spin_unlock_irqrestore(&tc->pool->lock, flags);
+
+	return r;
+}
+
+static int set_reserve_count(struct thin_c *tc, dm_block_t count)
+{
+	int r;
+	dm_block_t free_blocks;
+	int64_t delta;
+	unsigned long flags;
+
+	r = get_free_blocks(tc->pool, &free_blocks);
+	if (r)
+		return r;
+
+	spin_lock_irqsave(&tc->pool->lock, flags);
+	delta = count - tc->reserve_count;
+	if (tc->pool->reserve_count + delta > free_blocks)
+		r = -ENOSPC;
+	else {
+		tc->reserve_count = count;
+		tc->pool->reserve_count += delta;
+	}
+	spin_unlock_irqrestore(&tc->pool->lock, flags);
+
+	return r;
+}
+
+static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
+{
+	int r;
+	dm_block_t free_blocks;
+	struct pool *pool = tc->pool;
+
+	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
+		return -EINVAL;
+
+	r = get_free_blocks(tc->pool, &free_blocks);
+	if (r)
+		return r;
+
+	if (!dec_reserve_count(tc, free_blocks))
+		return -ENOSPC;
+
 	r = dm_pool_alloc_data_block(pool->pmd, result);
 	if (r) {
 		metadata_operation_failed(pool, "dm_pool_alloc_data_block", r);
@@ -2880,6 +2949,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
 	pool->last_commit_jiffies = jiffies;
 	pool->pool_md = pool_md;
 	pool->md_dev = metadata_dev;
+	pool->reserve_count = 0;
 	__pool_table_insert(pool);
 
 	return pool;
@@ -3936,6 +4006,7 @@ static void thin_dtr(struct dm_target *ti)
 
 	spin_lock_irqsave(&tc->pool->lock, flags);
 	list_del_rcu(&tc->list);
+	tc->pool->reserve_count -= tc->reserve_count;
 	spin_unlock_irqrestore(&tc->pool->lock, flags);
 	synchronize_rcu();
 
@@ -4074,6 +4145,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	init_completion(&tc->can_destroy);
 	list_add_tail_rcu(&tc->list, &tc->pool->active_thins);
 	spin_unlock_irqrestore(&tc->pool->lock, flags);
+	tc->reserve_count = 0;
 	/*
 	 * This synchronize_rcu() call is needed here otherwise we risk a
 	 * wake_worker() call finding no bios to process (because the newly
@@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
 	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
+static int thin_provision_space(struct dm_target *ti, sector_t offset,
+				sector_t len, sector_t *res)
+{
+	struct thin_c *tc = ti->private;
+	struct pool *pool = tc->pool;
+	sector_t end;
+	dm_block_t pblock;
+	dm_block_t vblock;
+	int error;
+	struct dm_thin_lookup_result lookup;
+
+	if (!is_factor(offset, pool->sectors_per_block))
+		return -EINVAL;
+
+	if (!len || !is_factor(len, pool->sectors_per_block))
+		return -EINVAL;
+
+	if (res && !is_factor(*res, pool->sectors_per_block))
+		return -EINVAL;
+
+	end = offset + len;
+
+	while (offset < end) {
+		vblock = offset;
+		do_div(vblock, pool->sectors_per_block);
+
+		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
+		if (error == 0)
+			goto next;
+		if (error != -ENODATA)
+			return error;
+
+		error = alloc_data_block(tc, &pblock);
+		if (error)
+			return error;
+
+		error = dm_thin_insert_block(tc->td, vblock, pblock);
+		if (error)
+			return error;
+
+		if (res && *res)
+			*res -= pool->sectors_per_block;
+next:
+		offset += pool->sectors_per_block;
+	}
+
+	return 0;
+}
+
+static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
+			      sector_t len, sector_t *res)
+{
+	struct thin_c *tc = ti->private;
+	struct pool *pool = tc->pool;
+	sector_t blocks;
+	unsigned long flags;
+	int error;
+
+	if (mode == BDEV_RES_PROVISION)
+		return thin_provision_space(ti, offset, len, res);
+
+	/* res required for get/set */
+	error = -EINVAL;
+	if (!res)
+		return error;
+
+	if (mode == BDEV_RES_GET) {
+		spin_lock_irqsave(&tc->pool->lock, flags);
+		*res = tc->reserve_count * pool->sectors_per_block;
+		spin_unlock_irqrestore(&tc->pool->lock, flags);
+		error = 0;
+	} else if (mode == BDEV_RES_MOD) {
+		/*
+		 * @res must always be a multiple of the pool's block size;
+		 * upper layers can rely on the bdev's minimum_io_size for this.
+		 */
+		if (!is_factor(*res, pool->sectors_per_block))
+			return error;
+
+		blocks = *res;
+		(void) sector_div(blocks, pool->sectors_per_block);
+
+		error = set_reserve_count(tc, blocks);
+	}
+
+	return error;
+}
+
 static struct target_type thin_target = {
 	.name = "thin",
 	.version = {1, 18, 0},
@@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
 	.status = thin_status,
 	.iterate_devices = thin_iterate_devices,
 	.io_hints = thin_io_hints,
+	.reserve_space = thin_reserve_space,
 };
 
 /*----------------------------------------------------------------*/
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 06/10] xfs: thin block device reservation mechanism
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

Add block device reservation infrastructure to XFS. This primarily
consists of wrappers around the associated block device functions. This
mechanism provides the ability to reserve, release and provision a set
of blocks in the underlying block device.

Block device reservation enables the filesystem to adopt an allocation
model that guarantees physical blocks are available for operations on
block devices where this is currently not the case (i.e., thin devices).
Without such guarantees, overprovisioning of a thin block device results
in a read-only state transition and possible shutdown of the fs.
Reservation allows the fs to detect when space is not available in the
underlying device, avoid the conditions that lead to read-only state
transitions and handle the situation gracefully by returning -ENOSPC to
userspace.
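
For illustration only, and not part of the patch: a userspace sketch of
the absolute-value reservation scheme used by xfs_thin_reserve() below.
The filesystem keeps a local count, asks the device to set the
reservation to "local count + request", and only updates the local count
on success. `struct toy_dev`, `struct toy_mount` and the `toy_*`
functions are hypothetical stand-ins, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sector counts */

/* Hypothetical device: grants any absolute reservation up to @avail. */
struct toy_dev {
	sector_t avail;
	sector_t reserved;
};

static int toy_dev_set_res(struct toy_dev *dev, sector_t res)
{
	if (res > dev->avail)
		return -ENOSPC;
	dev->reserved = res;
	return 0;
}

/* Hypothetical stand-in for the mount-side state (m_thin_res). */
struct toy_mount {
	struct toy_dev *dev;
	sector_t thin_res;
};

static int toy_thin_reserve(struct toy_mount *mp, sector_t bb)
{
	sector_t res = mp->thin_res + bb;
	int error;

	error = toy_dev_set_res(mp->dev, res);
	if (error)
		return error;	/* local count is untouched on failure */

	mp->thin_res += bb;
	/* an absolute set leaves both sides in agreement */
	assert(mp->thin_res == mp->dev->reserved);
	return 0;
}
```

Because each call sets an absolute value, any drift in the device-side
count is overwritten by the next successful call. This is the property
the Notes/Issues comment in xfs_thin.c relies on to cover unreserved
allocations.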

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/Makefile    |   1 +
 fs/xfs/xfs_mount.h |   5 ++
 fs/xfs/xfs_thin.c  | 260 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_thin.h  |  31 +++++++
 fs/xfs/xfs_trace.h |  27 ++++++
 5 files changed, 324 insertions(+)
 create mode 100644 fs/xfs/xfs_thin.c
 create mode 100644 fs/xfs/xfs_thin.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 3542d94..b394db7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -88,6 +88,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
 				   xfs_sysfs.o \
+				   xfs_thin.o \
 				   xfs_trans.o \
 				   xfs_xattr.o \
 				   kmem.o \
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index bd1043f..8d54c56 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -148,6 +148,11 @@ typedef struct xfs_mount {
 	 */
 	__uint32_t		m_generation;
 
+	bool			m_thin_reserve;
+	struct mutex		m_thin_res_lock;
+	uint32_t		m_thin_sectpb;
+	sector_t		m_thin_res;
+
 #ifdef DEBUG
 	/*
 	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
diff --git a/fs/xfs/xfs_thin.c b/fs/xfs/xfs_thin.c
new file mode 100644
index 0000000..16e9a03
--- /dev/null
+++ b/fs/xfs/xfs_thin.c
@@ -0,0 +1,260 @@
+/*
+ * Copyright (c) 2016 Red Hat, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_dir2.h"
+#include "xfs_ialloc.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bmap.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_log.h"
+#include "xfs_error.h"
+#include "xfs_quota.h"
+#include "xfs_fsops.h"
+#include "xfs_trace.h"
+#include "xfs_icache.h"
+#include "xfs_sysfs.h"
+/* XXX: above copied from xfs_mount.c */
+#include "xfs_thin.h"
+
+/*
+ * Notes/Issues:
+ *
+ * - Reservation support depends on the '-o discard' mount option so freed
+ *   extents are returned to the pool. Note that online discard has not been
+ *   totally reliable in terms of returning freed space to the thin pool. Use
+ *   fstrim as a workaround.
+ * - The bdev reservation API receives an absolute value reservation from the
+ *   caller as opposed to a delta value. The latter is probably more ideal, but
+ *   the former helps us use the XFS reserve pool as a broad protection layer
+ *   for any potential leaks. For example, free list blocks used for btree
+ *   growth are currently not reserved. With a delta API, _any_ unreserved
+ *   allocations from the fs will slowly and permanently leak the reservation as
+ *   tracked by the bdev. The abs value mechanism covers this kind of slop based
+ *   on the locally maintained reservation.
+ *   	- What might be ideal to support a delta reservation API is a model (or
+ *   	test mode) that requires a reservation to be attached or somehow
+ *   	associated with every bdev allocation when the reserve feature is
+ *   	enabled (or one that disables allocation via writes altogether in favor
+ *   	of provision calls). Otherwise, any unreserved allocation returns an I/O
+ *   	error. Such deterministic behavior helps ensure general testing detects
+ *   	problems more reliably.
+ * - Worst case reservation means each XFS filesystem block is considered a new
+ *   dm block allocation. This translates to a significant amount of space given
+ *   larger dm block sizes. For example, 4k XFS blocks to 64k dm blocks means
+ *   we'll hit ENOSPC sooner and more frequently than typically expected.
+ * - The xfs_mod_fdblocks() implementation means the XFS reserve pool blocks are
+ *   also reserved from the thin pool. XFS defaults to 8192 reserve pool blocks
+ *   in most cases, which is 512MB of reserved space with 64k dm blocks. Tune
+ *   with: 'xfs_io -xc "resblks <blks>" <mnt>'. Note that insufficient reserves
+ *   will result in errors in unexpected areas of code (e.g., page discards on
+ *   writeback, inode unlinked list removal failures, etc.).
+ */
+
+static inline int
+bdev_reserve_space(
+	struct xfs_mount			*mp,
+	int					mode,
+	sector_t				offset,
+	sector_t				len,
+	sector_t				*res)
+{
+	struct block_device			*bdev;
+	const struct block_device_operations	*ops;
+
+	bdev = mp->m_ddev_targp->bt_bdev;
+	ops = bdev->bd_disk->fops;
+
+	return ops->reserve_space(bdev, mode, offset, len, res);
+}
+
+/*
+ * Reserve blocks from the underlying block device.
+ */
+int
+xfs_thin_reserve(
+	struct xfs_mount	*mp,
+	sector_t		bb)
+{
+	int			error;
+	sector_t		res;
+
+	mutex_lock(&mp->m_thin_res_lock);
+
+	res = mp->m_thin_res + bb;
+	error = bdev_reserve_space(mp, BDEV_RES_MOD, 0, 0, &res);
+	if (error) {
+		if (error == -ENOSPC)
+			trace_xfs_thin_reserve_enospc(mp, mp->m_thin_res, bb);
+		goto out;
+	}
+
+	trace_xfs_thin_reserve(mp, mp->m_thin_res, bb);
+	mp->m_thin_res += bb;
+
+out:
+	mutex_unlock(&mp->m_thin_res_lock);
+	return error;
+}
+
+static int
+__xfs_thin_unreserve(
+	struct xfs_mount	*mp,
+	sector_t		bb)
+{
+	int			error;
+	sector_t		res;
+
+	if (bb > mp->m_thin_res) {
+		WARN(1, "unres (%llu) exceeds current res (%llu)",
+		     (uint64_t) bb, (uint64_t) mp->m_thin_res);
+		bb = mp->m_thin_res;
+	}
+
+	res = mp->m_thin_res - bb;
+	error = bdev_reserve_space(mp, BDEV_RES_MOD, 0, 0, &res);
+	if (error)
+		return error;
+
+	trace_xfs_thin_unreserve(mp, mp->m_thin_res, bb);
+	mp->m_thin_res -= bb;
+
+	return error;
+}
+
+/*
+ * Release a reservation back to the block device.
+ */
+int
+xfs_thin_unreserve(
+	struct xfs_mount	*mp,
+	sector_t		res)
+{
+	int			error;
+
+	mutex_lock(&mp->m_thin_res_lock);
+	error = __xfs_thin_unreserve(mp, res);
+	mutex_unlock(&mp->m_thin_res_lock);
+
+	return error;
+}
+
+/*
+ * Given a recently allocated extent, ask the block device to provision the
+ * underlying space.
+ */
+int
+xfs_thin_provision(
+	struct xfs_mount	*mp,
+	xfs_fsblock_t		offset,
+	xfs_fsblock_t		len,
+	sector_t		*res)
+{
+	sector_t		ores = *res;
+	sector_t		bbstart, bblen;
+	int			error;
+
+	bbstart = XFS_FSB_TO_DADDR(mp, offset);
+	bbstart = round_down(bbstart, mp->m_thin_sectpb);
+	bblen = XFS_FSB_TO_BB(mp, len);
+	bblen = round_up(bblen, mp->m_thin_sectpb);
+
+	mutex_lock(&mp->m_thin_res_lock);
+
+	WARN_ON(bblen > mp->m_thin_res);
+
+	error = bdev_reserve_space(mp, BDEV_RES_PROVISION, bbstart, bblen,
+				   res);
+	if (error)
+		goto out;
+	ASSERT(ores >= *res);
+
+	trace_xfs_thin_provision(mp, mp->m_thin_res, ores - *res);
+
+	/*
+	 * Update the local reservation based on the blocks that were actually
+	 * allocated.
+	 */
+	mp->m_thin_res -= (ores - *res);
+out:
+	mutex_unlock(&mp->m_thin_res_lock);
+	return error;
+}
+
+int
+xfs_thin_init(
+	struct xfs_mount			*mp)
+{
+	struct block_device			*bdev;
+	const struct block_device_operations	*ops;
+	sector_t				res;
+	int					error;
+	unsigned int				io_opt;
+
+	bdev = mp->m_ddev_targp->bt_bdev;
+	ops = bdev->bd_disk->fops;
+
+	mp->m_thin_reserve = false;
+	mutex_init(&mp->m_thin_res_lock);
+
+	if (!ops->reserve_space)
+		goto out;
+	if (!(mp->m_flags & XFS_MOUNT_DISCARD))
+		goto out;
+
+	/* use optimal I/O size as dm-thin block size */
+	io_opt = bdev_io_opt(mp->m_super->s_bdev);
+	if ((io_opt % BBSIZE) || (io_opt < mp->m_sb.sb_blocksize))
+		goto out;
+	mp->m_thin_sectpb = io_opt / BBSIZE;
+
+	/* warn about any preexisting reservation */
+	error = bdev_reserve_space(mp, BDEV_RES_GET, 0, 0, &res);
+	if (error)
+		goto out;
+	if (res) {
+		/* force res count to 0 */
+		xfs_warn(mp, "Reset non-zero (%llu sectors) block reservation.",
+			 (uint64_t) res);
+		res = 0;
+		error = bdev_reserve_space(mp, BDEV_RES_MOD, 0, 0, &res);
+		if (error)
+			goto out;
+	}
+
+	mp->m_thin_reserve = true;
+out:
+	xfs_notice(mp, "Thin pool reservation %s", mp->m_thin_reserve ?
+							"enabled" : "disabled");
+	if (mp->m_thin_reserve)
+		xfs_notice(mp, "Thin reserve blocksize: %u sectors",
+			   mp->m_thin_sectpb);
+	return 0;
+}
diff --git a/fs/xfs/xfs_thin.h b/fs/xfs/xfs_thin.h
new file mode 100644
index 0000000..6d995a0
--- /dev/null
+++ b/fs/xfs/xfs_thin.h
@@ -0,0 +1,31 @@
+#ifndef __XFS_THIN_H__
+#define __XFS_THIN_H__
+
+/*
+ * Convert an fsb count to a sector reservation.
+ */
+static inline sector_t
+xfs_fsb_res(
+	struct xfs_mount	*mp,
+	xfs_fsblock_t		fsb,
+	bool			contig)
+{
+	sector_t		bb;
+
+	if (contig) {
+		bb = XFS_FSB_TO_BB(mp, fsb);
+		bb += (2 * mp->m_thin_sectpb);
+		bb = round_up(bb, mp->m_thin_sectpb);
+	} else
+		bb = fsb * mp->m_thin_sectpb;
+
+	return bb;
+}
+
+int xfs_thin_init(struct xfs_mount *);
+int xfs_thin_reserve(struct xfs_mount *, sector_t);
+int xfs_thin_unreserve(struct xfs_mount *, sector_t);
+int xfs_thin_provision(struct xfs_mount *, xfs_fsblock_t, xfs_fsblock_t,
+		       sector_t *);
+
+#endif	/* __XFS_THIN_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c8d5842..a7733a1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2184,6 +2184,33 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
 DEFINE_DISCARD_EVENT(xfs_discard_exclude);
 DEFINE_DISCARD_EVENT(xfs_discard_busy);
 
+DECLARE_EVENT_CLASS(xfs_thin_class,
+	TP_PROTO(struct xfs_mount *mp, sector_t total, sector_t res),
+	TP_ARGS(mp, total, res),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint64_t, total)
+		__field(uint64_t, res)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->total = total;
+		__entry->res = res;
+	),
+	TP_printk("dev %d:%d total %llu res %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->total,
+		  __entry->res)
+)
+
+#define DEFINE_THIN_EVENT(name) \
+DEFINE_EVENT(xfs_thin_class, name, \
+	TP_PROTO(struct xfs_mount *mp, sector_t total, sector_t res), \
+	TP_ARGS(mp, total, res))
+DEFINE_THIN_EVENT(xfs_thin_reserve);
+DEFINE_THIN_EVENT(xfs_thin_reserve_enospc);
+DEFINE_THIN_EVENT(xfs_thin_unreserve);
+DEFINE_THIN_EVENT(xfs_thin_provision);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 07/10] xfs: adopt a reserved allocation model on dm-thin devices
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

Adopt a reservation-based block allocation model when XFS runs on top of
a dm-thin device with accompanying support. As of today, the filesystem
has no indication of available space in the underlying device. If the
thin pool is depleted, the filesystem has no recourse but to handle the
read-only state change of the device. This results in unexpected
higher-level behavior and error returns, and can result in data loss if
the filesystem is ultimately shut down before more space is provisioned
to the pool.

The reservation model enables XFS to manage thin pool space similarly to
how delayed allocation blocks are managed today. Delalloc blocks are
reserved up front (e.g., at write time) to guarantee physical space is
available at writeback time and thus prevent data loss due to
overprovisioning. Similarly, block device reservation allows XFS to
reserve space for various operations in advance and thus guarantee an
operation will not fail for lack of space, or otherwise return an error
to the user.

To accomplish this, tie in the device block reservation calls to the
existing filesystem reservation mechanism. Each transaction now reserves
physical space in the underlying thin pool along with other such
reserved resources (e.g., filesystem blocks, log space). Delayed
allocation blocks are similarly reserved in the thin device when the
associated filesystem blocks are reserved. If a reservation cannot be
satisfied, the associated operation returns -ENOSPC as if the filesystem
itself were out of space.

Note that this is proof-of-concept and highly experimental. The purpose
is to explore the potential effectiveness of such a scheme between the
filesystem and a thinly provisioned device. As such, the implementation
is hacky, broken and geared towards proof-of-concept over correctness or
completeness.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |  25 +++++++++++
 fs/xfs/xfs_mount.c        | 103 +++++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_mount.h        |   3 ++
 fs/xfs/xfs_trans.c        |  77 +++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trans.h        |   1 +
 5 files changed, 193 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index a708e38..af21c93 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -35,6 +35,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
+#include "xfs_thin.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -652,6 +653,30 @@ xfs_alloc_ag_vextent(
 				 XFS_TRANS_SB_RES_FDBLOCKS :
 				 XFS_TRANS_SB_FDBLOCKS,
 				 -((long)(args->len)));
+
+		if (args->mp->m_thin_reserve) {
+			sector_t	res;
+			xfs_fsblock_t	fsbno = XFS_AGB_TO_FSB(args->mp,
+							       args->agno,
+							       args->agbno);
+			if (args->wasdel)
+				res = xfs_fsb_res(args->mp, args->len, false);
+			else
+				res = args->tp->t_blk_thin_res;
+			error = xfs_thin_provision(args->mp, fsbno, args->len,
+						   &res);
+			WARN_ON(error);
+
+			if (args->wasdel) {
+				if (res)
+					error = xfs_thin_unreserve(args->mp, res);
+				WARN_ON(error);
+			} else if (args->tp) {
+				args->tp->t_blk_thin_res = res;
+			}
+
+			error = 0;
+		}
 	}
 
 	XFS_STATS_INC(args->mp, xs_allocx);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 50a6ccc..d2d9c85 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -41,6 +41,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_sysfs.h"
+#include "xfs_thin.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -929,6 +930,8 @@ xfs_mountfs(
 		xfs_qm_mount_quotas(mp);
 	}
 
+	xfs_thin_init(mp);
+
 	/*
 	 * Now we are mounted, reserve a small amount of unused space for
 	 * privileged transactions. This is needed so that transaction
@@ -1147,7 +1150,7 @@ xfs_mod_ifree(
  */
 #define XFS_FDBLOCKS_BATCH	1024
 int
-xfs_mod_fdblocks(
+__xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
 	int64_t			delta,
 	uint32_t		flags)
@@ -1156,13 +1159,27 @@ xfs_mod_fdblocks(
 	long long		res_used;
 	s32			batch;
 	bool			rsvd = (flags & XFS_FDBLOCKS_RSVD);
+	bool			blkres = (flags & XFS_BLK_RES);
+	int			error;
+	int64_t			res_delta = 0;
+
+	ASSERT(!(rsvd && !blkres && delta < 0));
 
 	if (delta > 0) {
 		/*
-		 * If the reserve pool is depleted, put blocks back into it
-		 * first. Most of the time the pool is full.
+		 * If the reserve pool is full (the typical case), return the
+		 * blocks to the general fs pool. Otherwise, return what we can
+		 * to the reserve pool first.
 		 */
 		if (likely(mp->m_resblks == mp->m_resblks_avail)) {
+main_pool:
+			if (mp->m_thin_reserve && blkres) {
+				error = xfs_thin_unreserve(mp,
+						xfs_fsb_res(mp, delta, false));
+				if (error)
+					return error;
+			}
+
 			percpu_counter_add(&mp->m_fdblocks, delta);
 			return 0;
 		}
@@ -1170,17 +1187,69 @@ xfs_mod_fdblocks(
 		spin_lock(&mp->m_sb_lock);
 		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
 
-		if (res_used > delta) {
-			mp->m_resblks_avail += delta;
+		/*
+		 * The reserve pool is not full. Blocks in the reserve pool must
+		 * hold a bdev reservation which means we may need to re-reserve
+		 * blocks depending on what the caller is giving us.
+		 *
+		 * If the blocks are already reserved (i.e., via a transaction
+		 * reservation), simply update the reserve pool counter. If not,
+		 * reserve as many blocks as we can, return those to the reserve
+		 * pool, and then jump back above to return whatever is left
+		 * back to the general filesystem pool.
+		 */
+		if (!blkres) {
+			while (delta) {
+				if (res_delta >= res_used)
+					break;
+
+				spin_unlock(&mp->m_sb_lock);
+
+				/*
+				 * XXX: This is racy/leaky. Somebody else could
+				 * replenish m_resblks_avail once we've dropped
+				 * the lock.
+				 */
+				error = xfs_thin_reserve(mp,
+						xfs_fsb_res(mp, 1, false));
+				if (error) {
+					spin_lock(&mp->m_sb_lock);
+					break;
+				}
+
+				spin_lock(&mp->m_sb_lock);
+
+				res_delta++;
+				delta--;
+				res_used = (long long)(mp->m_resblks -
+							mp->m_resblks_avail);
+			}
 		} else {
-			delta -= res_used;
-			mp->m_resblks_avail = mp->m_resblks;
-			percpu_counter_add(&mp->m_fdblocks, delta);
+			res_delta = min(delta, res_used);
+			delta -= res_delta;
 		}
+
+		if (res_used > res_delta)
+			mp->m_resblks_avail += res_delta;
+		else
+			mp->m_resblks_avail = mp->m_resblks;
 		spin_unlock(&mp->m_sb_lock);
+		if (delta)
+			goto main_pool;
 		return 0;
 	}
 
+	/* res calls take positive value */
+	if (mp->m_thin_reserve && blkres) {
+		error = xfs_thin_reserve(mp, xfs_fsb_res(mp, -delta, false));
+		if (error == -ENOSPC && rsvd) {
+			spin_lock(&mp->m_sb_lock);
+			goto fdblocks_rsvd;
+		}
+		if (error)
+			return error;
+	}
+
 	/*
 	 * Taking blocks away, need to be more accurate the closer we
 	 * are to zero.
@@ -1203,14 +1272,17 @@ xfs_mod_fdblocks(
 	}
 
 	/*
-	 * lock up the sb for dipping into reserves before releasing the space
-	 * that took us to ENOSPC.
+	 * Release bdev reservation then lock up the sb for dipping into local
+	 * reserves before releasing the space that took us to ENOSPC.
 	 */
+	if (mp->m_thin_reserve && blkres)
+		error = xfs_thin_unreserve(mp, xfs_fsb_res(mp, -delta, false));
 	spin_lock(&mp->m_sb_lock);
 	percpu_counter_add(&mp->m_fdblocks, -delta);
 	if (!rsvd)
 		goto fdblocks_enospc;
 
+fdblocks_rsvd:
 	lcounter = (long long)mp->m_resblks_avail + delta;
 	if (lcounter >= 0) {
 		mp->m_resblks_avail = lcounter;
@@ -1227,6 +1299,17 @@ fdblocks_enospc:
 }
 
 int
+xfs_mod_fdblocks(
+	struct xfs_mount	*mp,
+	int64_t			delta,
+	uint32_t		flags)
+{
+	/* unres is the common case */
+	flags |= XFS_BLK_RES;
+	return __xfs_mod_fdblocks(mp, delta, flags);
+}
+
+int
 xfs_mod_frextents(
 	struct xfs_mount	*mp,
 	int64_t			delta)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8d54c56..958f815 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -354,6 +354,9 @@ extern int	xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
 extern int	xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
 
 #define	XFS_FDBLOCKS_RSVD	(1 << 0)
+#define XFS_BLK_RES		(1 << 1)
+extern int	__xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
+				   uint32_t flags);
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 uint32_t flags);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 8aa9d9a..26e6288 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -31,6 +31,7 @@
 #include "xfs_log.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
+#include "xfs_thin.h"
 
 kmem_zone_t	*xfs_trans_zone;
 kmem_zone_t	*xfs_log_item_desc_zone;
@@ -174,6 +175,7 @@ xfs_trans_reserve(
 {
 	int			error = 0;
 	int			flags = 0;
+	struct xfs_mount	*mp = tp->t_mountp;
 
 	if (tp->t_flags & XFS_TRANS_RESERVE)
 		flags |= XFS_FDBLOCKS_RSVD;
@@ -187,13 +189,14 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks),
-					 flags);
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), flags);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
+		if (mp->m_thin_res)
+			tp->t_blk_thin_res += xfs_fsb_res(mp, blocks, false);
 	}
 
 	/*
@@ -265,6 +268,8 @@ undo_blocks:
 	if (blocks > 0) {
 		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
 		tp->t_blk_res = 0;
+		if (tp->t_blk_thin_res)
+			tp->t_blk_thin_res = 0;
 	}
 
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
@@ -551,6 +556,7 @@ xfs_trans_unreserve_and_mod_sb(
 	int64_t			rtxdelta = 0;
 	int64_t			idelta = 0;
 	int64_t			ifreedelta = 0;
+	int64_t			resdelta = 0;
 	int			error;
 	int			flags = 0;
 
@@ -558,8 +564,41 @@ xfs_trans_unreserve_and_mod_sb(
 		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* calculate deltas */
-	if (tp->t_blk_res > 0)
-		blkdelta = tp->t_blk_res;
+	if (tp->t_blk_res > 0) {
+		/*
+		 * The transaction may have some number of unused fs blocks and
+		 * unused bdev reservation. It might also have non-reserved free
+		 * blocks (i.e., freed extents) that need to make it back into
+		 * the fs general pool. We need to distinguish between these
+		 * cases when unwinding the unused resources.
+		 *
+		 * We do this as follows:
+		 *
+		 * - resdelta - For every unused fs block and bdev reservation
+		 *   combination, account one fs+bdev reserved block that can be
+		 *   returned to the fs. These are blocks that can go directly
+		 *   back into the XFS reserve pool, if necessary, because they
+		 *   are already reserved. If the reserve pool is full, they are
+		 *   unreserved and returned to the general pool.
+		 * - blkdelta - Freed filesystem blocks without any bdev
+		 *   reservation. These can get into the XFS reserve pool as
+		 *   well, but they are reserved from the bdev first. If
+		 *   reservation fails, they are returned to the general pool.
+		 * - t_blk_thin_res - Unused bdev reservation from the
+		 *   transaction. Extra bdev reservation remains when newly
+		 *   allocated fs blocks might have already been provisioned in
+		 *   the bdev (due to larger bdev blocks). This reservation is
+		 *   returned directly to the bdev.
+		 */
+		blkdelta = tp->t_blk_res - tp->t_blk_res_used;
+		while (blkdelta && tp->t_blk_thin_res) {
+			tp->t_blk_thin_res -= xfs_fsb_res(mp, 1, false);
+			blkdelta--;
+			resdelta++;
+		}
+		blkdelta = tp->t_blk_res - resdelta;
+	}
+
 	if ((tp->t_fdblocks_delta != 0) &&
 	    (xfs_sb_version_haslazysbcount(&mp->m_sb) ||
 	     (tp->t_flags & XFS_TRANS_SB_DIRTY)))
@@ -578,11 +617,34 @@ xfs_trans_unreserve_and_mod_sb(
 	}
 
 	/* apply the per-cpu counters */
-	if (blkdelta) {
-		error = xfs_mod_fdblocks(mp, blkdelta, flags);
+	if (resdelta) {
+		error = __xfs_mod_fdblocks(mp, resdelta, flags | XFS_BLK_RES);
 		if (error)
 			goto out;
 	}
+	/*
+	 * Return any bdev reservation that hasn't been returned in the form of
+	 * reserved blocks above. Do this before returning unreserved blocks to
+	 * improve the chance that bdev reservation is available if the XFS
+	 * reserve pool must be replenished.
+	 *
+	 * XXX: This logic is kind of wonky now that the bdev res. is tracked
+	 * separately. If we have a bunch of freed blocks, can't we just return
+	 * however many we have reservation for as 'reserved blocks?' Also need
+	 * to fix up the code above to kill the while loop.
+	 */
+	if (tp->t_blk_thin_res) {
+		error = xfs_thin_unreserve(mp, tp->t_blk_thin_res);
+		if (error)
+			goto out_undo_resblocks;
+		tp->t_blk_thin_res = 0;
+	}
+
+	if (blkdelta) {
+		error = __xfs_mod_fdblocks(mp, blkdelta, flags);
+		if (error)
+			goto out_undo_resblocks;
+	}
 
 	if (idelta) {
 		error = xfs_mod_icount(mp, idelta);
@@ -688,6 +750,9 @@ out_undo_icount:
 out_undo_fdblocks:
 	if (blkdelta)
 		xfs_mod_fdblocks(mp, -blkdelta, flags);
+out_undo_resblocks:
+	if (resdelta)
+		xfs_mod_fdblocks(mp, -resdelta, flags);
 out:
 	ASSERT(error == 0);
 	return;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index e7c49cf..18685d9 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -95,6 +95,7 @@ typedef struct xfs_trans {
 	unsigned int		t_log_count;	/* count for perm log res */
 	unsigned int		t_blk_res;	/* # of blocks resvd */
 	unsigned int		t_blk_res_used;	/* # of resvd blocks used */
+	unsigned int		t_blk_thin_res;
 	unsigned int		t_rtx_res;	/* # of rt extents resvd */
 	unsigned int		t_rtx_res_used;	/* # of resvd rt extents used */
 	struct xlog_ticket	*t_ticket;	/* log mgr ticket */
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 07/10] xfs: adopt a reserved allocation model on dm-thin devices
@ 2016-04-12 16:42   ` Brian Foster
  0 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

Adopt a reservation-based block allocation model when XFS runs on top of
a dm-thin device with accompanying support. As of today, the filesystem
has no indication of available space in the underlying device. If the
thin pool is depleted, the filesystem has no recourse but to handle the
read-only state change of the device. This results in unexpected
higher-level behavior and error returns, and can result in data loss if
the filesystem is ultimately shut down before more space is provisioned
to the pool.

The reservation model enables XFS to manage thin pool space much as
delayed allocation blocks are managed today. Delalloc blocks are
reserved up front (e.g., at write time) to guarantee physical space is
available at writeback time and thus prevent data loss due to
overprovisioning. Similarly, block device reservation allows XFS to
reserve space for various operations in advance and thus guarantee an
operation will not fail for lack of space, or otherwise return an error
to the user.

To accomplish this, tie in the device block reservation calls to the
existing filesystem reservation mechanism. Each transaction now reserves
physical space in the underlying thin pool along with other such
reserved resources (e.g., filesystem blocks, log space). Delayed
allocation blocks are similarly reserved in the thin device when the
associated filesystem blocks are reserved. If a reservation cannot be
satisfied, the associated operation returns -ENOSPC as if the filesystem
itself were out of space.
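
The two-level reservation described above can be sketched in userspace
as follows. This is only a model under stated assumptions: struct
model_pool and model_reserve() are hypothetical names, and bdev space is
counted in fs-block units (the worst-case 1-1 mapping). The ordering
(bdev first, then the fs counter, undoing the bdev reservation if the fs
is short) follows __xfs_mod_fdblocks() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fs free block counter and the thin pool. */
struct model_pool {
	uint64_t fs_free;	/* free filesystem blocks */
	uint64_t bdev_free;	/* unreserved pool space, in fs-block units */
};

/*
 * Reserve 'blocks' from the bdev first and then from the fs. A failure at
 * either level returns -ENOSPC and leaves both counters unchanged.
 */
static int model_reserve(struct model_pool *p, uint64_t blocks)
{
	assert(p);

	if (p->bdev_free < blocks)
		return -ENOSPC;
	p->bdev_free -= blocks;

	if (p->fs_free < blocks) {
		/* fs is out of space: give the bdev reservation back */
		p->bdev_free += blocks;
		return -ENOSPC;
	}
	p->fs_free -= blocks;
	return 0;
}
```

With fs_free = 100 and bdev_free = 10, a request for 8 blocks succeeds
and a following request for 4 fails with -ENOSPC even though the fs
itself still has plenty of free blocks.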

Note that this is proof-of-concept and highly experimental. The purpose
is to explore the potential effectiveness of such a scheme between the
filesystem and a thinly provisioned device. As such, the implementation
is hacky, broken and geared towards proof-of-concept over correctness or
completeness.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |  25 +++++++++++
 fs/xfs/xfs_mount.c        | 103 +++++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_mount.h        |   3 ++
 fs/xfs/xfs_trans.c        |  77 +++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trans.h        |   1 +
 5 files changed, 193 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index a708e38..af21c93 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -35,6 +35,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
+#include "xfs_thin.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -652,6 +653,30 @@ xfs_alloc_ag_vextent(
 				 XFS_TRANS_SB_RES_FDBLOCKS :
 				 XFS_TRANS_SB_FDBLOCKS,
 				 -((long)(args->len)));
+
+		if (args->mp->m_thin_reserve) {
+			sector_t	res;
+			xfs_fsblock_t	fsbno = XFS_AGB_TO_FSB(args->mp,
+							       args->agno,
+							       args->agbno);
+			if (args->wasdel)
+				res = xfs_fsb_res(args->mp, args->len, false);
+			else
+				res = args->tp->t_blk_thin_res;
+			error = xfs_thin_provision(args->mp, fsbno, args->len,
+						   &res);
+			WARN_ON(error);
+
+			if (args->wasdel) {
+				if (res)
+					error = xfs_thin_unreserve(args->mp, res);
+				WARN_ON(error);
+			} else if (args->tp) {
+				args->tp->t_blk_thin_res = res;
+			}
+
+			error = 0;
+		}
 	}
 
 	XFS_STATS_INC(args->mp, xs_allocx);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 50a6ccc..d2d9c85 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -41,6 +41,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_sysfs.h"
+#include "xfs_thin.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -929,6 +930,8 @@ xfs_mountfs(
 		xfs_qm_mount_quotas(mp);
 	}
 
+	xfs_thin_init(mp);
+
 	/*
 	 * Now we are mounted, reserve a small amount of unused space for
 	 * privileged transactions. This is needed so that transaction
@@ -1147,7 +1150,7 @@ xfs_mod_ifree(
  */
 #define XFS_FDBLOCKS_BATCH	1024
 int
-xfs_mod_fdblocks(
+__xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
 	int64_t			delta,
 	uint32_t		flags)
@@ -1156,13 +1159,27 @@ xfs_mod_fdblocks(
 	long long		res_used;
 	s32			batch;
 	bool			rsvd = (flags & XFS_FDBLOCKS_RSVD);
+	bool			blkres = (flags & XFS_BLK_RES);
+	int			error;
+	int64_t			res_delta = 0;
+
+	ASSERT(!(rsvd && !blkres && delta < 0));
 
 	if (delta > 0) {
 		/*
-		 * If the reserve pool is depleted, put blocks back into it
-		 * first. Most of the time the pool is full.
+		 * If the reserve pool is full (the typical case), return the
+		 * blocks to the general fs pool. Otherwise, return what we can
+		 * to the reserve pool first.
 		 */
 		if (likely(mp->m_resblks == mp->m_resblks_avail)) {
+main_pool:
+			if (mp->m_thin_reserve && blkres) {
+				error = xfs_thin_unreserve(mp,
+						xfs_fsb_res(mp, delta, false));
+				if (error)
+					return error;
+			}
+
 			percpu_counter_add(&mp->m_fdblocks, delta);
 			return 0;
 		}
@@ -1170,17 +1187,69 @@ xfs_mod_fdblocks(
 		spin_lock(&mp->m_sb_lock);
 		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
 
-		if (res_used > delta) {
-			mp->m_resblks_avail += delta;
+		/*
+		 * The reserve pool is not full. Blocks in the reserve pool must
+		 * hold a bdev reservation which means we may need to re-reserve
+		 * blocks depending on what the caller is giving us.
+		 *
+		 * If the blocks are already reserved (i.e., via a transaction
+		 * reservation), simply update the reserve pool counter. If not,
+		 * reserve as many blocks as we can, return those to the reserve
+		 * pool, and then jump back above to return whatever is left
+		 * back to the general filesystem pool.
+		 */
+		if (!blkres) {
+			while (delta) {
+				if (res_delta >= res_used)
+					break;
+
+				spin_unlock(&mp->m_sb_lock);
+
+				/*
+				 * XXX: This is racy/leaky. Somebody else could
+				 * replenish m_resblks_avail once we've dropped
+				 * the lock.
+				 */
+				error = xfs_thin_reserve(mp,
+						xfs_fsb_res(mp, 1, false));
+				if (error) {
+					spin_lock(&mp->m_sb_lock);
+					break;
+				}
+
+				spin_lock(&mp->m_sb_lock);
+
+				res_delta++;
+				delta--;
+				res_used = (long long)(mp->m_resblks -
+							mp->m_resblks_avail);
+			}
 		} else {
-			delta -= res_used;
-			mp->m_resblks_avail = mp->m_resblks;
-			percpu_counter_add(&mp->m_fdblocks, delta);
+			res_delta = min(delta, res_used);
+			delta -= res_delta;
 		}
+
+		if (res_used > res_delta)
+			mp->m_resblks_avail += res_delta;
+		else
+			mp->m_resblks_avail = mp->m_resblks;
 		spin_unlock(&mp->m_sb_lock);
+		if (delta)
+			goto main_pool;
 		return 0;
 	}
 
+	/* res calls take positive value */
+	if (mp->m_thin_reserve && blkres) {
+		error = xfs_thin_reserve(mp, xfs_fsb_res(mp, -delta, false));
+		if (error == -ENOSPC && rsvd) {
+			spin_lock(&mp->m_sb_lock);
+			goto fdblocks_rsvd;
+		}
+		if (error)
+			return error;
+	}
+
 	/*
 	 * Taking blocks away, need to be more accurate the closer we
 	 * are to zero.
@@ -1203,14 +1272,17 @@ xfs_mod_fdblocks(
 	}
 
 	/*
-	 * lock up the sb for dipping into reserves before releasing the space
-	 * that took us to ENOSPC.
+	 * Release bdev reservation then lock up the sb for dipping into local
+	 * reserves before releasing the space that took us to ENOSPC.
 	 */
+	if (mp->m_thin_reserve && blkres)
+		error = xfs_thin_unreserve(mp, xfs_fsb_res(mp, -delta, false));
 	spin_lock(&mp->m_sb_lock);
 	percpu_counter_add(&mp->m_fdblocks, -delta);
 	if (!rsvd)
 		goto fdblocks_enospc;
 
+fdblocks_rsvd:
 	lcounter = (long long)mp->m_resblks_avail + delta;
 	if (lcounter >= 0) {
 		mp->m_resblks_avail = lcounter;
@@ -1227,6 +1299,17 @@ fdblocks_enospc:
 }
 
 int
+xfs_mod_fdblocks(
+	struct xfs_mount	*mp,
+	int64_t			delta,
+	uint32_t		flags)
+{
+	/* unres is the common case */
+	flags |= XFS_BLK_RES;
+	return __xfs_mod_fdblocks(mp, delta, flags);
+}
+
+int
 xfs_mod_frextents(
 	struct xfs_mount	*mp,
 	int64_t			delta)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8d54c56..958f815 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -354,6 +354,9 @@ extern int	xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
 extern int	xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
 
 #define	XFS_FDBLOCKS_RSVD	(1 << 0)
+#define XFS_BLK_RES		(1 << 1)
+extern int	__xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
+				   uint32_t flags);
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 uint32_t flags);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 8aa9d9a..26e6288 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -31,6 +31,7 @@
 #include "xfs_log.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
+#include "xfs_thin.h"
 
 kmem_zone_t	*xfs_trans_zone;
 kmem_zone_t	*xfs_log_item_desc_zone;
@@ -174,6 +175,7 @@ xfs_trans_reserve(
 {
 	int			error = 0;
 	int			flags = 0;
+	struct xfs_mount	*mp = tp->t_mountp;
 
 	if (tp->t_flags & XFS_TRANS_RESERVE)
 		flags |= XFS_FDBLOCKS_RSVD;
@@ -187,13 +189,14 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks),
-					 flags);
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), flags);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
+		if (mp->m_thin_res)
+			tp->t_blk_thin_res += xfs_fsb_res(mp, blocks, false);
 	}
 
 	/*
@@ -265,6 +268,8 @@ undo_blocks:
 	if (blocks > 0) {
 		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
 		tp->t_blk_res = 0;
+		if (tp->t_blk_thin_res)
+			tp->t_blk_thin_res = 0;
 	}
 
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
@@ -551,6 +556,7 @@ xfs_trans_unreserve_and_mod_sb(
 	int64_t			rtxdelta = 0;
 	int64_t			idelta = 0;
 	int64_t			ifreedelta = 0;
+	int64_t			resdelta = 0;
 	int			error;
 	int			flags = 0;
 
@@ -558,8 +564,41 @@ xfs_trans_unreserve_and_mod_sb(
 		flags |= XFS_FDBLOCKS_RSVD;
 
 	/* calculate deltas */
-	if (tp->t_blk_res > 0)
-		blkdelta = tp->t_blk_res;
+	if (tp->t_blk_res > 0) {
+		/*
+		 * The transaction may have some number of unused fs blocks and
+		 * unused bdev reservation. It might also have non-reserved free
+		 * blocks (i.e., freed extents) that need to make it back into
+		 * the fs general pool. We need to distinguish between these
+		 * cases when unwinding the unused resources.
+		 *
+		 * We do this as follows:
+		 *
+		 * - resdelta - For every unused fs block and bdev reservation
+		 *   combination, account one fs+bdev reserved block that can be
+		 *   returned to the fs. These are blocks that can go directly
+		 *   back into the XFS reserve pool, if necessary, because they
+		 *   are already reserved. If the reserve pool is full, they are
+		 *   unreserved and returned to the general pool.
+		 * - blkdelta - Freed filesystem blocks without any bdev
+		 *   reservation. These can get into the XFS reserve pool as
+		 *   well, but they are reserved from the bdev first. If
+		 *   reservation fails, they are returned to the general pool.
+		 * - t_blk_thin_res - Unused bdev reservation from the
+		 *   transaction. Extra bdev reservation remains when newly
+		 *   allocated fs blocks might have already been provisioned in
+		 *   the bdev (due to larger bdev blocks). This reservation is
+		 *   returned directly to the bdev.
+		 */
+		blkdelta = tp->t_blk_res - tp->t_blk_res_used;
+		while (blkdelta && tp->t_blk_thin_res) {
+			tp->t_blk_thin_res -= xfs_fsb_res(mp, 1, false);
+			blkdelta--;
+			resdelta++;
+		}
+		blkdelta = tp->t_blk_res - resdelta;
+	}
+
 	if ((tp->t_fdblocks_delta != 0) &&
 	    (xfs_sb_version_haslazysbcount(&mp->m_sb) ||
 	     (tp->t_flags & XFS_TRANS_SB_DIRTY)))
@@ -578,11 +617,34 @@ xfs_trans_unreserve_and_mod_sb(
 	}
 
 	/* apply the per-cpu counters */
-	if (blkdelta) {
-		error = xfs_mod_fdblocks(mp, blkdelta, flags);
+	if (resdelta) {
+		error = __xfs_mod_fdblocks(mp, resdelta, flags | XFS_BLK_RES);
 		if (error)
 			goto out;
 	}
+	/*
+	 * Return any bdev reservation that hasn't been returned in the form of
+	 * reserved blocks above. Do this before returning unreserved blocks to
+	 * improve the chance that bdev reservation is available if the XFS
+	 * reserve pool must be replenished.
+	 *
+	 * XXX: This logic is kind of wonky now that the bdev res. is tracked
+	 * separately. If we have a bunch of freed blocks, can't we just return
+	 * however many we have reservation for as 'reserved blocks?' Also need
+	 * to fix up the code above to kill the while loop.
+	 */
+	if (tp->t_blk_thin_res) {
+		error = xfs_thin_unreserve(mp, tp->t_blk_thin_res);
+		if (error)
+			goto out_undo_resblocks;
+		tp->t_blk_thin_res = 0;
+	}
+
+	if (blkdelta) {
+		error = __xfs_mod_fdblocks(mp, blkdelta, flags);
+		if (error)
+			goto out_undo_resblocks;
+	}
 
 	if (idelta) {
 		error = xfs_mod_icount(mp, idelta);
@@ -688,6 +750,9 @@ out_undo_icount:
 out_undo_fdblocks:
 	if (blkdelta)
 		xfs_mod_fdblocks(mp, -blkdelta, flags);
+out_undo_resblocks:
+	if (resdelta)
+		xfs_mod_fdblocks(mp, -resdelta, flags);
 out:
 	ASSERT(error == 0);
 	return;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index e7c49cf..18685d9 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -95,6 +95,7 @@ typedef struct xfs_trans {
 	unsigned int		t_log_count;	/* count for perm log res */
 	unsigned int		t_blk_res;	/* # of blocks resvd */
 	unsigned int		t_blk_res_used;	/* # of resvd blocks used */
+	unsigned int		t_blk_thin_res;
 	unsigned int		t_rtx_res;	/* # of rt extents resvd */
 	unsigned int		t_rtx_res_used;	/* # of resvd rt extents used */
 	struct xlog_ticket	*t_ticket;	/* log mgr ticket */
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 08/10] xfs: handle bdev reservation ENOSPC correctly from XFS reserved pool
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

The XFS reserved block pool holds blocks from general allocation for
internal purposes. When enabled, these blocks shall also carry a
reservation from the block device to guarantee they are usable.

The reserved pool allocation code currently uses a retry algorithm based
on the available space estimation. It assumes that an inability to
allocate blocks based on the estimation is a transient problem. Now that
block allocation attempts bdev reservation, however, an ENOSPC could
originate from the block device and might not be transient.

Because the retry algorithm cannot distinguish between fs block
allocation and bdev reservation, separate the two operations in this
particular case. If the bdev reservation fails, back off the reservation
delta until something can be reserved or return ENOSPC to the caller.
Once a bdev reservation is made, attempt to allocate blocks from the fs
and return to the original retry algorithm based on the free space
estimation. This prevents infinite retries in the event of a reserved
pool allocation request that cannot be satisfied from a bdev that
supports reservation.
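
The back-off step can be illustrated with a small userspace model. The
names model_bdev, model_bdev_reserve() and model_reserve_backoff() are
hypothetical and only mimic the halving loop added to
xfs_reserve_blocks() below; the real code reserves via
xfs_thin_reserve() and works in sectors.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bdev with a fixed amount of reservable space. */
struct model_bdev {
	uint64_t avail;		/* reservable space, in fs blocks */
};

static int model_bdev_reserve(struct model_bdev *b, uint64_t blocks)
{
	if (blocks > b->avail)
		return -ENOSPC;
	b->avail -= blocks;
	return 0;
}

/*
 * Try to reserve *delta blocks. On -ENOSPC, halve the request and retry.
 * Returns 0 with *delta set to the amount actually reserved, or -ENOSPC
 * with *delta == 0 once the request has been backed off to nothing.
 */
static int model_reserve_backoff(struct model_bdev *b, uint64_t *delta)
{
	int error = 0;

	assert(b && delta);
	while (*delta) {
		error = model_bdev_reserve(b, *delta);
		if (error != -ENOSPC)
			break;
		*delta >>= 1;
	}
	return error;
}
```

Because the request shrinks on every failure, the loop terminates after
at most 64 attempts, which is what keeps a permanently full bdev from
causing endless retries.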

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fsops.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 87d4b1b..79ae408 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -40,6 +40,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_filestream.h"
+#include "xfs_thin.h"

 /*
  * File system operations
@@ -676,6 +677,7 @@ xfs_reserve_blocks(
 	__uint64_t		request;
 	__int64_t		free;
 	int			error = 0;
+	sector_t		res = 0;

 	/* If inval is null, report current values and return */
 	if (inval == (__uint64_t *)NULL) {
@@ -743,6 +745,28 @@ xfs_reserve_blocks(
 			fdblks_delta = delta;

 		/*
+		 * Reserve pool blocks must carry a block device reservation (if
+		 * enabled). The block device could be much closer to ENOSPC
+		 * than the fs (i.e., a thin or snap device), so try to reserve
+		 * the bdev space first.
+		 */
+		spin_unlock(&mp->m_sb_lock);
+		if (mp->m_thin_reserve) {
+			while (fdblks_delta) {
+				res = xfs_fsb_res(mp, fdblks_delta, false);
+				error = xfs_thin_reserve(mp, res);
+				if (error != -ENOSPC)
+					break;
+
+				fdblks_delta >>= 1;
+			}
+			if (!fdblks_delta || error) {
+				spin_lock(&mp->m_sb_lock);
+				break;
+			}
+		}
+
+		/*
 		 * We'll either succeed in getting space from the free block
 		 * count or we'll get an ENOSPC. If we get a ENOSPC, it means
 		 * things changed while we were calculating fdblks_delta and so
@@ -752,8 +776,9 @@ xfs_reserve_blocks(
 		 * Don't set the reserved flag here - we don't want to reserve
 		 * the extra reserve blocks from the reserve.....
 		 */
-		spin_unlock(&mp->m_sb_lock);
-		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+		error = __xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+		if (error && mp->m_thin_reserve)
+			xfs_thin_unreserve(mp, res);
 		spin_lock(&mp->m_sb_lock);
 	}

-- 
2.4.11

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 09/10] xfs: support no block reservation transaction mode
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

The block device reservation mechanism is tied into the transaction
reservation mechanism and assumes the worst case scenario of a 1-1
mapping between filesystem blocks and dm blocks. This might be overkill
for certain codepaths that have enough context to not require a
worst-case reservation.

Define an optional transaction flag to disable block reservation on a
per-transaction basis. This allows any particular operation to open code
a block device reservation and potentially use a more optimal
reservation value.
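
As a rough illustration, the flag translation this patch adds to
xfs_trans_reserve() can be modeled in isolation. The constant values are
copied from the patches in this series; model_res_flags() is a
hypothetical helper name used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Values as they appear in the patches of this series. */
#define XFS_TRANS_RESERVE	0x20	/* OK to use reserved data blocks */
#define XFS_TRANS_NOBLKRES	0x100	/* do not attempt blkdev reservation */
#define XFS_FDBLOCKS_RSVD	(1 << 0)
#define XFS_BLK_RES		(1 << 1)

/*
 * Map transaction flags to the flags handed to __xfs_mod_fdblocks().
 * bdev reservation is on by default and is masked off only when the
 * transaction opts out with XFS_TRANS_NOBLKRES.
 */
static uint32_t model_res_flags(uint32_t t_flags)
{
	uint32_t flags = XFS_BLK_RES;

	if (t_flags & XFS_TRANS_RESERVE)
		flags |= XFS_FDBLOCKS_RSVD;
	if (t_flags & XFS_TRANS_NOBLKRES)
		flags &= ~XFS_BLK_RES;

	assert(!(flags & ~(uint32_t)(XFS_BLK_RES | XFS_FDBLOCKS_RSVD)));
	return flags;
}
```

A caller that sets XFS_TRANS_NOBLKRES is then expected to acquire its
own bdev reservation and attach it to the transaction.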

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |  2 ++
 fs/xfs/xfs_trans.c         | 10 ++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 81ac870..ba79373 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -183,6 +183,8 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
 #define XFS_TRANS_FREEZE_PROT	0x40	/* Transaction has elevated writer
 					   count in superblock */
+#define XFS_TRANS_NOBLKRES	0x100	/* do not attempt blkdev reservation */
+
 /*
  * Field values for xfs_trans_mod_sb.
  */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 26e6288..343e435 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -174,11 +174,13 @@ xfs_trans_reserve(
 	uint			rtextents)
 {
 	int			error = 0;
-	int			flags = 0;
+	int			flags = XFS_BLK_RES;
 	struct xfs_mount	*mp = tp->t_mountp;
 
 	if (tp->t_flags & XFS_TRANS_RESERVE)
 		flags |= XFS_FDBLOCKS_RSVD;
+	if (tp->t_flags & XFS_TRANS_NOBLKRES)
+		flags &= ~XFS_BLK_RES;
 
 	/* Mark this thread as being in a transaction */
 	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
@@ -189,13 +191,13 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), flags);
+		error = __xfs_mod_fdblocks(mp, -((int64_t)blocks), flags);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
-		if (mp->m_thin_res)
+		if (mp->m_thin_res && (flags & XFS_BLK_RES))
 			tp->t_blk_thin_res += xfs_fsb_res(mp, blocks, false);
 	}
 
@@ -266,7 +268,7 @@ undo_log:
 
 undo_blocks:
 	if (blocks > 0) {
-		xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
+		__xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), flags);
 		tp->t_blk_res = 0;
 		if (tp->t_blk_thin_res)
 			tp->t_blk_thin_res = 0;
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC v2 PATCH 10/10] xfs: use contiguous bdev reservation for file preallocation
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 16:42   ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-12 16:42 UTC (permalink / raw)
  To: xfs; +Cc: linux-block, linux-fsdevel, dm-devel

The block device reservation that occurs as part of transaction
reservation uses a worst case algorithm to determine the amount of
reservation required to satisfy the transaction. This means that one
bdev (i.e., device-mapper) block is reserved per required filesystem
block, even though the former block size is likely much larger than the
latter.

Worst case reservation is required in most cases because, from the
perspective of the transaction, block allocation can occur throughout
the block address space. This is unnecessary for some operations where
more context is available, however. xfs_alloc_file_space() is one such
case. It calls xfs_bmapi_write() in a loop, once per transaction.
Since it also passes nmap == 1, each call maps a single extent and thus
allocates contiguous blocks. Based on that, the bdev reservation can be
reduced from the worst case 1-1 mapping to a more optimal 1-N mapping of
dm blocks to fs blocks (e.g., one dm block can cover many fs blocks).
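
To put numbers on the difference, here is a sketch of the arithmetic. It
is not the series' xfs_fsb_res() helper, whose implementation is not
shown in this patch. model_dm_blocks() is a hypothetical function that
counts dm blocks directly and assumes the dm block size is at least the
fs block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Upper bound on the dm blocks needed to back 'fs_blocks' fs blocks.
 *
 * Worst case (contig == false): every fs block may land in a different
 * dm block, so reserve one dm block per fs block.
 *
 * Contiguous (contig == true): the blocks form one extent, so reserve
 * the extent size rounded up to dm blocks, plus one in case the extent
 * does not start on a dm block boundary. Never exceeds the worst case.
 */
static uint64_t model_dm_blocks(uint64_t fs_blocks, uint32_t fsb_bytes,
				uint32_t dm_bytes, bool contig)
{
	uint64_t bytes, span;

	assert(fsb_bytes && dm_bytes >= fsb_bytes);

	if (!fs_blocks || !contig)
		return fs_blocks;

	bytes = fs_blocks * fsb_bytes;
	span = (bytes + dm_bytes - 1) / dm_bytes + 1;
	return span < fs_blocks ? span : fs_blocks;
}
```

For example, with 4k fs blocks and 64k dm blocks, a 1MB preallocation
(256 fs blocks) needs 256 dm blocks of reservation in the worst case but
at most 17 when the allocation is known to be contiguous.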

Update xfs_alloc_file_space() to bypass transaction based bdev
reservation. Instead, open-code the bdev reservation using the more
optimal contiguous reservation value. This allows fallocate requests to
consume just about all of the available space in a thin volume without
premature ENOSPC errors.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3b63098..c2e1215 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -40,6 +40,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_log.h"
+#include "xfs_thin.h"

 /* Kernel only BMAP related definitions and functions */

@@ -1035,9 +1036,11 @@ xfs_alloc_file_space(
 		}

 		/*
-		 * Allocate and setup the transaction.
+		 * Allocate and setup the transaction. The noblkres flag tells
+		 * the reservation infrastructure to skip bdev reservation.
 		 */
 		tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
+		tp->t_flags |= XFS_TRANS_NOBLKRES;
 		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
 					  resblks, resrtextents);
 		/*
@@ -1051,6 +1054,30 @@ xfs_alloc_file_space(
 			xfs_trans_cancel(tp);
 			break;
 		}
+
+		/*
+		 * We disabled the transaction bdev reservation because the
+		 * trans infrastructure uses a worst case reservation. Since we
+		 * call xfs_bmapi_write() one mapping at a time, we can assume
+		 * the allocated blocks will be contiguous and thus can use a
+		 * more optimal reservation value. Acquire the reservation here
+		 * and attach it to the transaction.
+		 *
+		 * XXX: Need to take apart data and metadata block parts of res
+		 * (see XFS_DIOSTRAT_SPACE_RES()). The latter still needs
+		 * worst-case.
+		 */
+		if (mp->m_thin_res) {
+			sector_t	res = xfs_fsb_res(mp, resblks, true);
+
+			error = xfs_thin_reserve(mp, res);
+			if (error) {
+				xfs_trans_cancel(tp);
+				break;
+			}
+			tp->t_blk_thin_res = res;
+		}
+
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		error = xfs_trans_reserve_quota_nblks(tp, ip, qblocks,
 						      0, quota_flag);
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 16:42 ` Brian Foster
@ 2016-04-12 20:04   ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-12 20:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-block, linux-fsdevel, dm-devel, Darrick J. Wong

On Tue, Apr 12 2016 at 12:42P -0400,
Brian Foster <bfoster@redhat.com> wrote:

> Hi all,
> 
> This is v2 of the XFS and block device reservation experiment. The
> significant changes in v2 are that the bdev interface has been condensed
> to a single callback function, the XFS transaction reservation
> management has been reworked to make transactions responsible for
> tracking and releasing excess reservation (for non-delalloc cases) and a
> workaround for the fallocate over-reservation issue is included. Beyond
> that, this version adds a bunch of miscellaneous cleanups and fixes some
> of the nastier locking/leak issues present in the first rfc.
> 
> Patches 1-2 refactor some XFS reserve pool and block accounting code in
> preparation for subsequent patches. Patches 3-5 add block/device-mapper
> reservation support. Patches 6-10 add the core reservation
> infrastructure and management bits to XFS. See the link to the original
> rfc below for instructions and further details around the purpose of
> this series.
> 
> Finally, note that this is still highly experimental/theoretical and
> should not be used on production systems. Thoughts, reviews, flames
> appreciated.

Thanks for carrying on with this work Brian.

I've started to review your patchset and Darrick's fallocate patchset.
I've pushed a branch to linux-dm.git that combines the 2, see:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-fallocate

and then added this RFC patch, at the end, which relies on both of your
patchsets -- you'll see blkdev_ensure_space_exists() has a FIXME which
implies it isn't much more than simply stubbed out at this point
(completely untested):

From: Mike Snitzer <snitzer@redhat.com>
Date: Tue, 12 Apr 2016 15:54:31 -0400
Subject: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space

This effectively exposes the primitive for "ensure space exists".  It
relies on block_device_operations' reserve_space method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-lib.c        | 26 ++++++++++++++++++++++++++
 fs/block_dev.c         | 20 +++++++++++---------
 include/linux/blkdev.h |  2 ++
 3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9dca6bb..5042a84 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -314,3 +314,29 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_ensure_space_exists - preallocate a block range
+ * @bdev:	blockdev to preallocate space for
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to preallocate
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @flags:	FALLOC_FL_* to control behaviour
+ *
+ * Description:
+ *    Ensure space exists, or is preallocated, for the sectors in question.
+ */
+int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, unsigned long flags)
+{
+	sector_t res;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+	if (!ops->reserve_space)
+		return -EOPNOTSUPP;
+
+	// FIXME: check with Brian Foster on whether it makes sense to
+	// use BDEV_RES_GET/BDEV_RES_MOD instead of BDEV_RES_PROVISION?
+	return ops->reserve_space(bdev, BDEV_RES_PROVISION, sector, nr_sects, &res);
+}
+EXPORT_SYMBOL(blkdev_ensure_space_exists);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5a2c3ab..b34c07b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct address_space *mapping;
 	loff_t end = start + len - 1;
-	loff_t bs_mask, isize;
+	loff_t isize;
 	int error;
 
 	/* We only support zero range and punch hole. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
-	/* We haven't a primitive for "ensure space exists" right now. */
-	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
-		return -EOPNOTSUPP;
-
 	/* Only punch if the device can do zeroing discard. */
 	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
 	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
@@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
 			return -EINVAL;
 	}
 
-	/* Don't allow IO that isn't aligned to logical block size */
-	bs_mask = bdev_logical_block_size(bdev) - 1;
-	if ((start | len) & bs_mask)
+	/*
+	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
+	 * - for normal device's io_min is usually logical block size
+	 * - but for more exotic devices (e.g. DM thinp) it may be larger
+	 */
+	if ((start | len) % bdev_io_min(bdev))
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages. */
@@ -1839,7 +1838,10 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
 	truncate_inode_pages_range(mapping, start, end);
 
 	error = -EINVAL;
-	if (mode & FALLOC_FL_ZERO_RANGE)
+	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
+		error = blkdev_ensure_space_exists(bdev, start >> 9, len >> 9,
+						   mode);
+	else if (mode & FALLOC_FL_ZERO_RANGE)
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					    GFP_KERNEL, false);
 	else if (mode & FALLOC_FL_PUNCH_HOLE)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c6ea96..4147af2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1132,6 +1132,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, bool discard);
+extern int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, unsigned long flags);
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
 {
-- 
2.6.4 (Apple Git-63)
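
One property of the new alignment test in blkdev_fallocate() is worth keeping
in mind: ORing start and len before taking the remainder is equivalent to
testing each value separately only when the divisor is a power of two. The
sketch below is a userspace illustration, not kernel code. The function names
are hypothetical, and whether bdev_io_min() can return a non-power-of-two
value for a given device is an assumption here rather than something
established in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Combined test in the style of the patch: OR first, then remainder. */
static bool sketch_aligned_combined(uint64_t start, uint64_t len,
				    uint64_t io_min)
{
	assert(io_min != 0);
	return ((start | len) % io_min) == 0;
}

/* Separate test: start and len are each checked against io_min. */
static bool sketch_aligned_separate(uint64_t start, uint64_t len,
				    uint64_t io_min)
{
	assert(io_min != 0);
	return (start % io_min) == 0 && (len % io_min) == 0;
}
```

For a power-of-two io_min the two functions agree. For an io_min of 196608
(3 * 64k), start 196608 and len 393216 are both multiples of io_min, yet the
combined form rejects them because 0x30000 | 0x60000 is 0x70000.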


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 20:04   ` Mike Snitzer
@ 2016-04-12 20:39     ` Darrick J. Wong
  -1 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-12 20:39 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Brian Foster, xfs, linux-block, linux-fsdevel, dm-devel

On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2016 at 12:42P -0400,
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > Hi all,
> > 
> > This is v2 of the XFS and block device reservation experiment. The
> > significant changes in v2 are that the bdev interface has been condensed
> > to a single callback function, the XFS transaction reservation
> > management has been reworked to make transactions responsible for
> > tracking and releasing excess reservation (for non-delalloc cases) and a
> > workaround for the fallocate over-reservation issue is included. Beyond
> > that, this version adds a bunch of miscellaneous cleanups and fixes some
> > of the nastier locking/leak issues present in the first rfc.
> > 
> > Patches 1-2 refactor some XFS reserve pool and block accounting code in
> > preparation for subsequent patches. Patches 3-5 add block/device-mapper
> > reservation support. Patches 6-10 add the core reservation
> > infrastructure and management bits to XFS. See the link to the original
> > rfc below for instructions and further details around the purpose of
> > this series.
> > 
> > Finally, note that this is still highly experimental/theoretical and
> > should not be used on production systems. Thoughts, reviews, flames
> > appreciated.
> 
> Thanks for carrying on with this work Brian.
> 
> I've started to review your patchset and Darrick's fallocate patchset.
> I've pushed a branch to linux-dm.git that combines the 2, see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-fallocate
> 
> and then added this RFC patch, at the end, which relies on both of your
> patchsets -- you'll see blkdev_ensure_space_exists() has a FIXME which
> implies it isn't much more than simply stubbed out at this point
> (completely untested):

Hmm, ok, but -rc3 broke a bunch of stuff.  Guess I should repost with all
the PAGE_CACHE_ -> PAGE_ stuff fixed. :)

> From: Mike Snitzer <snitzer@redhat.com>
> Date: Tue, 12 Apr 2016 15:54:31 -0400
> Subject: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
> 
> This effectively exposes the primitive for "ensure space exists".  It
> relies on block_device_operations' reserve_space method.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
>  fs/block_dev.c         | 20 +++++++++++---------
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 39 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 9dca6bb..5042a84 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -314,3 +314,29 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
>  }
>  EXPORT_SYMBOL(blkdev_issue_zeroout);
> +
> +/**
> + * blkdev_ensure_space_exists - preallocate a block range
> + * @bdev:	blockdev to preallocate space for
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to preallocate
> + * @gfp_mask:	memory allocation flags (for bio_alloc)
> + * @flags:	FALLOC_FL_* to control behaviour
> + *
> + * Description:
> + *    Ensure space exists, or is preallocated, for the sectors in question.
> + */
> +int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, unsigned long flags)
> +{
> +	sector_t res;
> +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> +
> +	if (!ops->reserve_space)
> +		return -EOPNOTSUPP;
> +
> +	// FIXME: check with Brian Foster on whether it makes sense to
> +	// use BDEV_RES_GET/BDEV_RES_MOD instead of BDEV_RES_PROVISION?
> +	return ops->reserve_space(bdev, BDEV_RES_PROVISION, sector, nr_sects, &res);

/me thinks BDEV_RES_PROVISION is correct here, because regular-mode file
fallocate (for ext4/xfs anyway) allocates blocks and maps them to specific file
offsets as unwritten extents.  afaict RES_PROVISION -> thin_provision_space()
and thin_provision_space() seems to allocate blocks and map them to the
device's LBAs.

If I'm reading the patches correctly, RES_GET/RES_MOD seem to reserve N blocks
but don't map them to any specific LBA.

> +}
> +EXPORT_SYMBOL(blkdev_ensure_space_exists);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 5a2c3ab..b34c07b 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	struct address_space *mapping;
>  	loff_t end = start + len - 1;
> -	loff_t bs_mask, isize;
> +	loff_t isize;
>  	int error;
>  
>  	/* We only support zero range and punch hole. */
>  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
>  		return -EOPNOTSUPP;
>  
> -	/* We haven't a primitive for "ensure space exists" right now. */
> -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> -		return -EOPNOTSUPP;
> -
>  	/* Only punch if the device can do zeroing discard. */
>  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
>  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
>  			return -EINVAL;
>  	}
>  
> -	/* Don't allow IO that isn't aligned to logical block size */
> -	bs_mask = bdev_logical_block_size(bdev) - 1;
> -	if ((start | len) & bs_mask)
> +	/*
> +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> +	 * - for normal device's io_min is usually logical block size
> +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> +	 */
> +	if ((start | len) % bdev_io_min(bdev))
>  		return -EINVAL;

Noted.  Will update the original patch.

>  	/* Invalidate the page cache, including dirty pages. */
> @@ -1839,7 +1838,10 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
>  	truncate_inode_pages_range(mapping, start, end);
>  
>  	error = -EINVAL;
> -	if (mode & FALLOC_FL_ZERO_RANGE)
> +	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> +		error = blkdev_ensure_space_exists(bdev, start >> 9, len >> 9,
> +						   mode);
> +	else if (mode & FALLOC_FL_ZERO_RANGE)

This whole thing got converted to a switch statement due to some feedback
from hch.
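
A switch-based dispatch of the kind mentioned here might look roughly like
the sketch below. This is not the code from the reposted patchset, which does
not appear in this thread. It only mirrors the if/else chain in the quoted
patch for the flag combinations that chain handles. The sk_ names are
hypothetical, and the flag values are copied from <linux/falloc.h> so the
example builds in userspace. Punch hole is matched only together with keep
size, on the assumption that the VFS rejects it otherwise.

```c
#include <assert.h>

/* Flag values as defined in the Linux UAPI header <linux/falloc.h>. */
#define SK_FALLOC_FL_KEEP_SIZE	0x01
#define SK_FALLOC_FL_PUNCH_HOLE	0x02
#define SK_FALLOC_FL_ZERO_RANGE	0x10

enum sk_falloc_op {
	SK_OP_ALLOCATE,		/* "ensure space exists" */
	SK_OP_ZERO_RANGE,
	SK_OP_PUNCH_HOLE,
	SK_OP_UNSUPPORTED,
};

/* Map a fallocate mode to the operation the block device would run. */
static enum sk_falloc_op sk_classify(int mode)
{
	switch (mode) {
	case 0:
	case SK_FALLOC_FL_KEEP_SIZE:
		return SK_OP_ALLOCATE;
	case SK_FALLOC_FL_ZERO_RANGE:
	case SK_FALLOC_FL_ZERO_RANGE | SK_FALLOC_FL_KEEP_SIZE:
		return SK_OP_ZERO_RANGE;
	case SK_FALLOC_FL_PUNCH_HOLE | SK_FALLOC_FL_KEEP_SIZE:
		return SK_OP_PUNCH_HOLE;
	default:
		return SK_OP_UNSUPPORTED;
	}
}

/* Usage example: plain and keep-size modes both select allocation. */
void sk_self_check(void)
{
	assert(sk_classify(0) == SK_OP_ALLOCATE);
	assert(sk_classify(SK_FALLOC_FL_KEEP_SIZE) == SK_OP_ALLOCATE);
}
```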

Anyway, will try to have a new blockdev fallocate patchset done by the end
of the day.

(Is there a test case for this?)

--D

>  		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
>  					    GFP_KERNEL, false);
>  	else if (mode & FALLOC_FL_PUNCH_HOLE)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 6c6ea96..4147af2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1132,6 +1132,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, bool discard);
> +extern int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, unsigned long flags);
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>  		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
>  {
> -- 
> 2.6.4 (Apple Git-63)
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 20:39     ` Darrick J. Wong
@ 2016-04-12 20:46       ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-12 20:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs, linux-block, linux-fsdevel, dm-devel

On Tue, Apr 12 2016 at  4:39pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2016 at 12:42pm -0400,
> > Brian Foster <bfoster@redhat.com> wrote:
> > 
> > > Hi all,
> > > 
> > > This is v2 of the XFS and block device reservation experiment. The
> > > significant changes in v2 are that the bdev interface has been condensed
> > > to a single callback function, the XFS transaction reservation
> > > management has been reworked to make transactions responsible for
> > > tracking and releasing excess reservation (for non-delalloc cases) and a
> > > workaround for the fallocate over-reservation issue is included. Beyond
> > > that, this version adds a bunch of miscellaneous cleanups and fixes some
> > > of the nastier locking/leak issues present in the first rfc.
> > > 
> > > Patches 1-2 refactor some XFS reserve pool and block accounting code in
> > > preparation for subsequent patches. Patches 3-5 add block/device-mapper
> > > reservation support. Patches 6-10 add the core reservation
> > > infrastructure and management bits to XFS. See the link to the original
> > > rfc below for instructions and further details around the purpose of
> > > this series.
> > > 
> > > Finally, note that this is still highly experimental/theoretical and
> > > should not be used on production systems. Thoughts, reviews, flames
> > > appreciated.
> > 
> > Thanks for carrying on with this work Brian.
> > 
> > I've started to review your patchset and Darrick's fallocate patchset.
> > I've pushed a branch to linux-dm.git that combines the 2, see:
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-fallocate
> > 
> > and then added this RFC patch, at the end, which relies on both of your
> > patchsets -- you'll see blkdev_ensure_space_exists() has a FIXME which
> > implies it isn't much more than simply stubbed out at this point
> > (completely untested):
> 
> Hmm, ok, but -rc3 broke a bunch of stuff.  Guess I should repost with all
> the PAGE_CACHE_ -> PAGE_ stuff fixed. :)

Yeah, the kernel.org kbuild robots just spammed us about that same exact
breakage.

> > From: Mike Snitzer <snitzer@redhat.com>
> > Date: Tue, 12 Apr 2016 15:54:31 -0400
> > Subject: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
> > 
> > This effectively exposes the primitive for "ensure space exists".  It
> > relies on block_device_operations' reserve_space method.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
> >  fs/block_dev.c         | 20 +++++++++++---------
> >  include/linux/blkdev.h |  2 ++
> >  3 files changed, 39 insertions(+), 9 deletions(-)
> > 
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 9dca6bb..5042a84 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -314,3 +314,29 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> >  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> >  }
> >  EXPORT_SYMBOL(blkdev_issue_zeroout);
> > +
> > +/**
> > + * blkdev_ensure_space_exists - preallocate a block range
> > + * @bdev:	blockdev to preallocate space for
> > + * @sector:	start sector
> > + * @nr_sects:	number of sectors to preallocate
> > + * @gfp_mask:	memory allocation flags (for bio_alloc)
> > + * @flags:	FALLOC_FL_* to control behaviour
> > + *
> > + * Description:
> > + *    Ensure space exists, or is preallocated, for the sectors in question.
> > + */
> > +int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
> > +		sector_t nr_sects, unsigned long flags)
> > +{
> > +	sector_t res;
> > +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> > +
> > +	if (!ops->reserve_space)
> > +		return -EOPNOTSUPP;
> > +
> > +	// FIXME: check with Brian Foster on whether it makes sense to
> > +	// use BDEV_RES_GET/BDEV_RES_MOD instead of BDEV_RES_PROVISION?
> > +	return ops->reserve_space(bdev, BDEV_RES_PROVISION, sector, nr_sects, &res);
> 
> /me thinks BDEV_RES_PROVISION is correct here, because regular-mode file
> fallocate (for ext4/xfs anyway) allocates blocks and maps them to specific file
> offsets as unwritten extents.  afaict RES_PROVISION -> thin_provision_space()
> and thin_provision_space() seems to allocate blocks and map them to the
> device's LBAs.
> 
> If I'm reading the patches correctly, RES_GET/RES_MOD seem to reserve N blocks
> but doesn't map them to any specific LBA.

Right, that is how I read it too.  I just put that FIXME in to cover my
ass in case I was being an idiot ;)
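
The distinction discussed above (reserving a count of blocks versus
provisioning specific LBAs) can be sketched with a userspace toy model.
Everything here is hypothetical: the toy_* names, the fixed-size pool,
and the choice to have provisioning draw straight from the free pool
are assumptions made for illustration, not the behaviour of the
dm-thin patches.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BLOCKS 16

/* Hypothetical modes, only loosely modeled on BDEV_RES_* in the thread. */
enum toy_res_mode { TOY_RES_MOD, TOY_RES_PROVISION };

struct toy_pool {
	size_t free_blocks;		/* blocks neither reserved nor mapped */
	size_t reserved;		/* blocks promised, tied to no LBA */
	bool mapped[TOY_NR_BLOCKS];	/* true if this LBA has a backing block */
};

int toy_reserve_space(struct toy_pool *p, enum toy_res_mode mode,
		      size_t lba, size_t nr)
{
	size_t i, need = 0;

	switch (mode) {
	case TOY_RES_MOD:
		/* Counter-only reservation: no LBA is touched. */
		if (nr > p->free_blocks)
			return -ENOSPC;
		p->free_blocks -= nr;
		p->reserved += nr;
		return 0;
	case TOY_RES_PROVISION:
		/* Allocate and map every unmapped LBA in [lba, lba + nr). */
		if (lba > TOY_NR_BLOCKS || nr > TOY_NR_BLOCKS - lba)
			return -EINVAL;
		for (i = lba; i < lba + nr; i++)
			if (!p->mapped[i])
				need++;
		if (need > p->free_blocks)
			return -ENOSPC;
		for (i = lba; i < lba + nr; i++)
			p->mapped[i] = true;
		p->free_blocks -= need;
		return 0;
	}
	return -EINVAL;
}
```

In this model, only TOY_RES_PROVISION guarantees that a later write to
a given LBA finds a backing block, which is the property a regular-mode
fallocate on the block device would want.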

> > +}
> > +EXPORT_SYMBOL(blkdev_ensure_space_exists);
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 5a2c3ab..b34c07b 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> >  	struct request_queue *q = bdev_get_queue(bdev);
> >  	struct address_space *mapping;
> >  	loff_t end = start + len - 1;
> > -	loff_t bs_mask, isize;
> > +	loff_t isize;
> >  	int error;
> >  
> >  	/* We only support zero range and punch hole. */
> >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> >  		return -EOPNOTSUPP;
> >  
> > -	/* We haven't a primitive for "ensure space exists" right now. */
> > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > -		return -EOPNOTSUPP;
> > -
> >  	/* Only punch if the device can do zeroing discard. */
> >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> >  			return -EINVAL;
> >  	}
> >  
> > -	/* Don't allow IO that isn't aligned to logical block size */
> > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > -	if ((start | len) & bs_mask)
> > +	/*
> > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > +	 * - for normal device's io_min is usually logical block size
> > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > +	 */
> > +	if ((start | len) % bdev_io_min(bdev))
> >  		return -EINVAL;
> 
> Noted.  Will update the original patch.

OK, thanks.

Once your new patchset is available I'll rebase my 'dm-fallocate' test
branch accordingly.
 
> >  	/* Invalidate the page cache, including dirty pages. */
> > @@ -1839,7 +1838,10 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> >  	truncate_inode_pages_range(mapping, start, end);
> >  
> >  	error = -EINVAL;
> > -	if (mode & FALLOC_FL_ZERO_RANGE)
> > +	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > +		error = blkdev_ensure_space_exists(bdev, start >> 9, len >> 9,
> > +						   mode);
> > +	else if (mode & FALLOC_FL_ZERO_RANGE)
> 
> This whole thing got converted to a switch statement due to some feedback
> from hch.
> 
> Anyway, will try to have a new blockdev fallocate patchset done by the end
> of the day.
> 
> (Is there a test case for this?)

No, but once my patch is in place to join your patchset with Brian's
then any basic fallocate tests against a DM thinp volume _should_ work.

/me assumes xfstests has such tests?  The only missing bit would be to
layer the filesystem on top of DM thinp?  Or extend the tests you added
to test DM thinp devices directly.  I think Eric Sandeen (now cc'd) made
xfstests capable of creating DM thinp volumes for certain tests.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 20:39     ` Darrick J. Wong
@ 2016-04-12 21:04       ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-12 21:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs, linux-block, linux-fsdevel, dm-devel

On Tue, Apr 12 2016 at  4:39pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 5a2c3ab..b34c07b 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> >  	struct request_queue *q = bdev_get_queue(bdev);
> >  	struct address_space *mapping;
> >  	loff_t end = start + len - 1;
> > -	loff_t bs_mask, isize;
> > +	loff_t isize;
> >  	int error;
> >  
> >  	/* We only support zero range and punch hole. */
> >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> >  		return -EOPNOTSUPP;
> >  
> > -	/* We haven't a primitive for "ensure space exists" right now. */
> > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > -		return -EOPNOTSUPP;
> > -
> >  	/* Only punch if the device can do zeroing discard. */
> >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> >  			return -EINVAL;
> >  	}
> >  
> > -	/* Don't allow IO that isn't aligned to logical block size */
> > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > -	if ((start | len) & bs_mask)
> > +	/*
> > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > +	 * - for normal device's io_min is usually logical block size
> > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > +	 */
> > +	if ((start | len) % bdev_io_min(bdev))
> >  		return -EINVAL;
> 
> Noted.  Will update the original patch.

BTW, I just noticed your "block: require write_same and discard requests
align to logical block size" patch -- it doesn't look right to me.

But maybe I'm just too hyper-focused on DM thinp's needs (which would
much prefer these checks be done in terms of minimum_io_size, rather
than logical_block_size, and _not_ assuming power-of-2 math will work).

At least for discard, though, your lbs-based check is fine, since we
have discard_granularity to cover thinp's more specific requirements.
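
To make the power-of-2 concern concrete, the sketch below compares
three ways of checking alignment. The function names are hypothetical,
and the 192 KiB granularity used in the note afterwards is only an
illustrative non-power-of-2 value. As far as I can tell, ORing start
and len before taking the remainder is only reliable when the
granularity is a power of two; checking each operand separately works
for any non-zero granularity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask form: only meaningful when gran is a power of two. */
bool aligned_mask(uint64_t start, uint64_t len, uint64_t gran)
{
	return ((start | len) & (gran - 1)) == 0;
}

/* OR-then-modulo: the shape of the check in the quoted hunk. */
bool aligned_or_mod(uint64_t start, uint64_t len, uint64_t gran)
{
	return ((start | len) % gran) == 0;
}

/* One modulo per operand: correct for any non-zero granularity. */
bool aligned_each(uint64_t start, uint64_t len, uint64_t gran)
{
	return (start % gran) == 0 && (len % gran) == 0;
}
```

With gran = 196608 (192 KiB), start = 196608 and len = 393216 are both
aligned, yet aligned_or_mod() rejects them because the OR of the two
values is 458752. Conversely, start = 65536 and len = 131072 are both
unaligned, yet aligned_or_mod() accepts them because the OR happens to
equal 196608. aligned_each() gives the expected answer in both cases.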

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 20:46       ` Mike Snitzer
@ 2016-04-12 22:25         ` Darrick J. Wong
  -1 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-12 22:25 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, linux-fsdevel, Brian Foster, dm-devel, xfs

On Tue, Apr 12, 2016 at 04:46:58PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2016 at  4:39pm -0400,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > > > On Tue, Apr 12 2016 at 12:42pm -0400,
> > > Brian Foster <bfoster@redhat.com> wrote:
> > > 
> > > > Hi all,
> > > > 
> > > > This is v2 of the XFS and block device reservation experiment. The
> > > > significant changes in v2 are that the bdev interface has been condensed
> > > > to a single callback function, the XFS transaction reservation
> > > > management has been reworked to make transactions responsible for
> > > > tracking and releasing excess reservation (for non-delalloc cases) and a
> > > > workaround for the fallocate over-reservation issue is included. Beyond
> > > > that, this version adds a bunch of miscellaneous cleanups and fixes some
> > > > of the nastier locking/leak issues present in the first rfc.
> > > > 
> > > > Patches 1-2 refactor some XFS reserve pool and block accounting code in
> > > > preparation for subsequent patches. Patches 3-5 add block/device-mapper
> > > > reservation support. Patches 6-10 add the core reservation
> > > > infrastructure and management bits to XFS. See the link to the original
> > > > rfc below for instructions and further details around the purpose of
> > > > this series.
> > > > 
> > > > Finally, note that this is still highly experimental/theoretical and
> > > > should not be used on production systems. Thoughts, reviews, flames
> > > > appreciated.
> > > 
> > > Thanks for carrying on with this work Brian.
> > > 
> > > I've started to review your patchset and Darrick's fallocate patchset.
> > > I've pushed a branch to linux-dm.git that combines the 2, see:
> > > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-fallocate
> > > 
> > > and then added this RFC patch, at the end, which relies on both of your
> > > patchsets -- you'll see blkdev_ensure_space_exists() has a FIXME which
> > > implies it isn't much more than simply stubbed out at this point
> > > (completely untested):
> > 
> > Hmm, ok, but -rc3 broke a bunch of stuff.  Guess I should repost with all
> > the PAGE_CACHE_ -> PAGE_ stuff fixed. :)
> 
> Yeah, the kernel.org kbuild robots just spammed us about that same exact
> breakage.
> 
> > > From: Mike Snitzer <snitzer@redhat.com>
> > > Date: Tue, 12 Apr 2016 15:54:31 -0400
> > > Subject: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
> > > 
> > > This effectively exposes the primitive for "ensure space exists".  It
> > > relies on block_device_operations' reserve_space method.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > ---
> > >  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
> > >  fs/block_dev.c         | 20 +++++++++++---------
> > >  include/linux/blkdev.h |  2 ++
> > >  3 files changed, 39 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > index 9dca6bb..5042a84 100644
> > > --- a/block/blk-lib.c
> > > +++ b/block/blk-lib.c
> > > @@ -314,3 +314,29 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> > >  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> > >  }
> > >  EXPORT_SYMBOL(blkdev_issue_zeroout);
> > > +
> > > +/**
> > > + * blkdev_ensure_space_exists - preallocate a block range
> > > + * @bdev:	blockdev to preallocate space for
> > > + * @sector:	start sector
> > > + * @nr_sects:	number of sectors to preallocate
> > > + * @gfp_mask:	memory allocation flags (for bio_alloc)
> > > + * @flags:	FALLOC_FL_* to control behaviour
> > > + *
> > > + * Description:
> > > + *    Ensure space exists, or is preallocated, for the sectors in question.
> > > + */
> > > +int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
> > > +		sector_t nr_sects, unsigned long flags)
> > > +{
> > > +	sector_t res;
> > > +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> > > +
> > > +	if (!ops->reserve_space)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	// FIXME: check with Brian Foster on whether it makes sense to
> > > +	// use BDEV_RES_GET/BDEV_RES_MOD instead of BDEV_RES_PROVISION?
> > > +	return ops->reserve_space(bdev, BDEV_RES_PROVISION, sector, nr_sects, &res);
> > 
> > /me thinks BDEV_RES_PROVISION is correct here, because regular-mode file
> > fallocate (for ext4/xfs anyway) allocates blocks and maps them to specific file
> > offsets as unwritten extents.  afaict RES_PROVISION -> thin_provision_space()
> > and thin_provision_space() seems to allocate blocks and map them to the
> > device's LBAs.
> > 
> > If I'm reading the patches correctly, RES_GET/RES_MOD seem to reserve N blocks
> > but doesn't map them to any specific LBA.
> 
> Right, that is how I read it too.  I just put that FIXME in to cover my
> ass in case I was being an idiot ;)

<nod>

> > > +}
> > > +EXPORT_SYMBOL(blkdev_ensure_space_exists);
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 5a2c3ab..b34c07b 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  	struct request_queue *q = bdev_get_queue(bdev);
> > >  	struct address_space *mapping;
> > >  	loff_t end = start + len - 1;
> > > -	loff_t bs_mask, isize;
> > > +	loff_t isize;
> > >  	int error;
> > >  
> > >  	/* We only support zero range and punch hole. */
> > >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> > >  		return -EOPNOTSUPP;
> > >  
> > > -	/* We haven't a primitive for "ensure space exists" right now. */
> > > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > -		return -EOPNOTSUPP;
> > > -
> > >  	/* Only punch if the device can do zeroing discard. */
> > >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> > >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  			return -EINVAL;
> > >  	}
> > >  
> > > -	/* Don't allow IO that isn't aligned to logical block size */
> > > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > > -	if ((start | len) & bs_mask)
> > > +	/*
> > > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > > +	 * - for normal device's io_min is usually logical block size
> > > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > > +	 */
> > > +	if ((start | len) % bdev_io_min(bdev))
> > >  		return -EINVAL;
> > 
> > Noted.  Will update the original patch.
> 
> OK, thanks.
> 
> Once your new patchset is available I'll rebase my 'dm-fallocate' test
> branch accordingly.
>  
> > >  	/* Invalidate the page cache, including dirty pages. */
> > > @@ -1839,7 +1838,10 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  	truncate_inode_pages_range(mapping, start, end);
> > >  
> > >  	error = -EINVAL;
> > > -	if (mode & FALLOC_FL_ZERO_RANGE)
> > > +	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > +		error = blkdev_ensure_space_exists(bdev, start >> 9, len >> 9,
> > > +						   mode);
> > > +	else if (mode & FALLOC_FL_ZERO_RANGE)
> > 
> > This whole thing got converted to a switch statement due to some feedback
> > from hch.
> > 
> > Anyway, will try to have a new blockdev fallocate patchset done by the end
> > of the day.
> > 
> > (Is there a test case for this?)
> 
> No, but once my patch is in place to join your patchset with Brian's
> then any basic fallocate tests against a DM thinp volume _should_ work.
> 
> /me assumes xfstests has such tests?  The only missing bit would be to
> layer the filesystem on top of DM thinp?  Or extend the tests you added
> to test DM thinp devices directly.  I think Eric Sandeen (now cc'd) made
> xfstests capable of creating DM thinp volumes for certain tests.

The patches got reviewed but aren't upstream yet.  Once they land, it
shouldn't be difficult to write a test case that exercises fallocate
directly on a thinp device.

--D

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
@ 2016-04-12 22:25         ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-12 22:25 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, linux-fsdevel, Brian Foster, dm-devel, xfs

On Tue, Apr 12, 2016 at 04:46:58PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2016 at  4:39pm -0400,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2016 at 12:42P -0400,
> > > Brian Foster <bfoster@redhat.com> wrote:
> > > 
> > > > Hi all,
> > > > 
> > > > This is v2 of the XFS and block device reservation experiment. The
> > > > significant changes in v2 are that the bdev interface has been condensed
> > > > to a single callback function, the XFS transaction reservation
> > > > management has been reworked to make transactions responsible for
> > > > tracking and releasing excess reservation (for non-delalloc cases) and a
> > > > workaround for the fallocate over-reservation issue is included. Beyond
> > > > that, this version adds a bunch of miscellaneous cleanups and fixes some
> > > > of the nastier locking/leak issues present in the first rfc.
> > > > 
> > > > Patches 1-2 refactor some XFS reserve pool and block accounting code in
> > > > preparation for subsequent patches. Patches 3-5 add block/device-mapper
> > > > reservation support. Patches 6-10 add the core reservation
> > > > infrastructure and management bits to XFS. See the link to the original
> > > > rfc below for instructions and further details around the purpose of
> > > > this series.
> > > > 
> > > > Finally, note that this is still highly experimental/theoretical and
> > > > should not be used on production systems. Thoughts, reviews, flames
> > > > appreciated.
> > > 
> > > Thanks for carrying on with this work Brian.
> > > 
> > > I've started to review your patchset and Darrick's fallocate patchset.
> > > I've pushed a branch to linux-dm.git that combines the 2, see:
> > > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-fallocate
> > > 
> > > and then added this RFC patch, at the end, which relies on both of your
> > > patchsets -- you'll see blkdev_ensure_space_exists() has a FIXME which
> > > implies it isn't much more than simply stubbed out at this point
> > > (completely untested):
> > 
> > Hmm, ok, but -rc3 broke a bunch of stuff.  Guess I should repost with all
> > the PAGE_CACHE_ -> PAGE_ stuff fixed. :)
> 
> Yeah, the kernel.org kbuild robots just spammed us about that same exact
> breakage.
> 
> > > From: Mike Snitzer <snitzer@redhat.com>
> > > Date: Tue, 12 Apr 2016 15:54:31 -0400
> > > Subject: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
> > > 
> > > This effectively exposes the primitive for "ensure space exists".  It
> > > relies on block_device_operations' reserve_space method.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > ---
> > >  block/blk-lib.c        | 26 ++++++++++++++++++++++++++
> > >  fs/block_dev.c         | 20 +++++++++++---------
> > >  include/linux/blkdev.h |  2 ++
> > >  3 files changed, 39 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > index 9dca6bb..5042a84 100644
> > > --- a/block/blk-lib.c
> > > +++ b/block/blk-lib.c
> > > @@ -314,3 +314,29 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> > >  	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> > >  }
> > >  EXPORT_SYMBOL(blkdev_issue_zeroout);
> > > +
> > > +/**
> > > + * blkdev_ensure_space_exists - preallocate a block range
> > > + * @bdev:	blockdev to preallocate space for
> > > + * @sector:	start sector
> > > + * @nr_sects:	number of sectors to preallocate
> > > + * @gfp_mask:	memory allocation flags (for bio_alloc)
> > > + * @flags:	FALLOC_FL_* to control behaviour
> > > + *
> > > + * Description:
> > > + *    Ensure space exists, or is preallocated, for the sectors in question.
> > > + */
> > > +int blkdev_ensure_space_exists(struct block_device *bdev, sector_t sector,
> > > +		sector_t nr_sects, unsigned long flags)
> > > +{
> > > +	sector_t res;
> > > +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> > > +
> > > +	if (!ops->reserve_space)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	// FIXME: check with Brian Foster on whether it makes sense to
> > > +	// use BDEV_RES_GET/BDEV_RES_MOD instead of BDEV_RES_PROVISION?
> > > +	return ops->reserve_space(bdev, BDEV_RES_PROVISION, sector, nr_sects, &res);
> > 
> > /me thinks BDEV_RES_PROVISION is correct here, because regular-mode file
> > fallocate (for ext4/xfs anyway) allocates blocks and maps them to specific file
> > offsets as unwritten extents.  afaict RES_PROVISION -> thin_provision_space()
> > and thin_provision_space() seems to allocate blocks and map them to the
> > device's LBAs.
> > 
> > If I'm reading the patches correctly, RES_GET/RES_MOD seem to reserve N blocks
> > but don't map them to any specific LBA.
> 
> Right that is how I read it too.  I just put that FIXME in to cover my
> ass in case I was being an idiot ;)

<nod>

> > > +}
> > > +EXPORT_SYMBOL(blkdev_ensure_space_exists);
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 5a2c3ab..b34c07b 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  	struct request_queue *q = bdev_get_queue(bdev);
> > >  	struct address_space *mapping;
> > >  	loff_t end = start + len - 1;
> > > -	loff_t bs_mask, isize;
> > > +	loff_t isize;
> > >  	int error;
> > >  
> > >  	/* We only support zero range and punch hole. */
> > >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> > >  		return -EOPNOTSUPP;
> > >  
> > > -	/* We haven't a primitive for "ensure space exists" right now. */
> > > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > -		return -EOPNOTSUPP;
> > > -
> > >  	/* Only punch if the device can do zeroing discard. */
> > >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> > >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  			return -EINVAL;
> > >  	}
> > >  
> > > -	/* Don't allow IO that isn't aligned to logical block size */
> > > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > > -	if ((start | len) & bs_mask)
> > > +	/*
> > > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > > +	 * - for normal device's io_min is usually logical block size
> > > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > > +	 */
> > > +	if ((start | len) % bdev_io_min(bdev))
> > >  		return -EINVAL;
> > 
> > Noted.  Will update the original patch.
> 
> OK, thanks.
> 
> Once your new patchset is available I'll rebase my 'dm-fallocate' test
> branch accordingly.
>  
> > >  	/* Invalidate the page cache, including dirty pages. */
> > > @@ -1839,7 +1838,10 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  	truncate_inode_pages_range(mapping, start, end);
> > >  
> > >  	error = -EINVAL;
> > > -	if (mode & FALLOC_FL_ZERO_RANGE)
> > > +	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > +		error = blkdev_ensure_space_exists(bdev, start >> 9, len >> 9,
> > > +						   mode);
> > > +	else if (mode & FALLOC_FL_ZERO_RANGE)
> > 
> > This whole thing got converted to a switch statement due to some feedback
> > from hch.
> > 
> > Anyway, will try to have a new blockdev fallocate patchset done by the end
> > of the day.
> > 
> > (Is there a test case for this?)
> 
> No, but once my patch is in place to join your patchset with Brian's
> then any basic fallocate tests against a DM thinp volume _should_ work.
> 
> /me assumes xfstests has such tests?  Only missing bit would be to layer
> the filesystem on top of DM thinp?  Or extend the tests you added to
> test DM thinp devices directly.  I think Eric Sandeen (now cc'd) made
> xfstests capable of creating DM thinp volumes for certain tests.

The patches got reviewed but aren't upstream.  It looks like it wouldn't
be difficult, once they land, to write a test case that exercises fallocate
directly on a thinp device.

--D

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-12 21:04       ` Mike Snitzer
@ 2016-04-13  0:12         ` Darrick J. Wong
  -1 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-13  0:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Brian Foster, xfs, linux-block, linux-fsdevel, dm-devel

On Tue, Apr 12, 2016 at 05:04:27PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2016 at  4:39pm -0400,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 5a2c3ab..b34c07b 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  	struct request_queue *q = bdev_get_queue(bdev);
> > >  	struct address_space *mapping;
> > >  	loff_t end = start + len - 1;
> > > -	loff_t bs_mask, isize;
> > > +	loff_t isize;
> > >  	int error;
> > >  
> > >  	/* We only support zero range and punch hole. */
> > >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> > >  		return -EOPNOTSUPP;
> > >  
> > > -	/* We haven't a primitive for "ensure space exists" right now. */
> > > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > -		return -EOPNOTSUPP;
> > > -
> > >  	/* Only punch if the device can do zeroing discard. */
> > >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> > >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > >  			return -EINVAL;
> > >  	}
> > >  
> > > -	/* Don't allow IO that isn't aligned to logical block size */
> > > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > > -	if ((start | len) & bs_mask)
> > > +	/*
> > > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > > +	 * - for normal device's io_min is usually logical block size
> > > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > > +	 */
> > > +	if ((start | len) % bdev_io_min(bdev))

I started by noticing the 64-bit division.  However, in researching alignment
requirements for fallocate, I noticed that nothing says that we can return
-EINVAL for unaligned offset/len for allocate or punch.  For file allocations
ext4 and xfs simply enlarge the range so that the ends are aligned to the
logical block size; for punch they both shrink the range to deallocate until
the ends are aligned, and write zeroes to the partial blocks.
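
For illustration only, the two rounding behaviours described above can be
sketched in userspace C.  This is a toy model, not the ext4 or xfs code:
struct byte_range, round_out() and round_in() are names made up for this
example, and lbs stands in for the logical block size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for this sketch: a byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Allocate case: enlarge the range so both ends land on a block boundary.
 * Uses '%' rather than a mask so that lbs need not be a power of two.
 */
static struct byte_range round_out(uint64_t start, uint64_t len, uint64_t lbs)
{
	struct byte_range r;
	uint64_t end = start + len;

	assert(lbs != 0);
	r.start = start - (start % lbs);
	r.end = (end % lbs) ? end + (lbs - end % lbs) : end;
	return r;
}

/*
 * Punch case: shrink the range to the whole blocks it fully covers.
 * The partial blocks at either end would be zeroed by writing instead.
 * If no whole block is covered, the result is empty (start == end).
 */
static struct byte_range round_in(uint64_t start, uint64_t len, uint64_t lbs)
{
	struct byte_range r;
	uint64_t end = start + len;

	assert(lbs != 0);
	r.start = (start % lbs) ? start + (lbs - start % lbs) : start;
	r.end = end - (end % lbs);
	if (r.end < r.start)
		r.end = r.start;
	return r;
}
```

With lbs = 512, a request for bytes 100..1100 rounds out to 0..1536 for
allocate and rounds in to 512..1024 for punch; neither case needs to fail
with -EINVAL.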

At least for user-visible fallocate we should do likewise, but for the internal
blkdev_ helpers I think it makes more sense to check lbs alignment and let the
lower level driver reject the IO if min_io alignment is a hard requirement.
Documentation/block/queue-sysfs.txt says that the min_io is the smallest
/preferred/ size.
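
A rough sketch of that split, assuming nothing about the real block layer
helpers: the generic check only enforces logical block size alignment (in
practice a power of two, so a mask is enough), while a driver with a stricter
granularity does its own check.  Both function names and the value 384 below
are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Generic check: logical block size is assumed to be a power of two here,
 * so a mask test works and no 64-bit division is needed.
 */
static bool lbs_aligned(uint64_t start, uint64_t len, uint32_t lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);
	return ((start | len) & (uint64_t)(lbs - 1)) == 0;
}

/*
 * Driver-side check: a granularity such as a thin-pool block size may not
 * be a power of two, so start and len are each tested with a remainder.
 * Testing (start | len) % gran instead is not equivalent for such sizes.
 */
static bool gran_aligned(uint64_t start, uint64_t len, uint32_t gran)
{
	assert(gran != 0);
	return (start % gran) == 0 && (len % gran) == 0;
}
```

For example, with gran = 384, start = 384 and len = 768 are both aligned, yet
(384 | 768) % 384 is 128, so OR-ing the two values before taking the remainder
would wrongly reject the request.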

But, before that, I'll push out some new fallocate patches for -rc3.

> > >  		return -EINVAL;
> > 
> > Noted.  Will update the original patch.
> 
> BTW, I just noticed your "block: require write_same and discard requests
> align to logical block size" -- doesn't look right.

What happens if we pass a request to thinp that isn't aligned to
minimum_io_size?  Does it reject the command?

> But maybe I'm just too hyper-focused on DM thinp's needs (which would
> much prefer these checks be done in terms of minimum_io_size, rather
> than logical_block_size, and _not_ assuming power-of-2 math will work).
> 
> But at least for discard: your lbs-based check is fine; since we have
> discard_granularity to cover thinp's more specific requirements.

--D

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-12 16:42   ` Brian Foster
@ 2016-04-13 17:44     ` Darrick J. Wong
  -1 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-13 17:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: xfs, linux-block, linux-fsdevel, dm-devel, Joe Thornber, snitzer

On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> From: Joe Thornber <ejt@redhat.com>
> 
> Experimental reserve interface for XFS guys to play with.
> 
> I have big reservations (no pun intended) about this patch.
> 
> [BF:
>  - Support for reservation reduction.
>  - Support for space provisioning.
>  - Condensed to a single function.]
> 
> Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 171 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 92237b6..32bc5bd 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -271,6 +271,8 @@ struct pool {
>  	process_mapping_fn process_prepared_discard;
>  
>  	struct dm_bio_prison_cell **cell_sort_array;
> +
> +	dm_block_t reserve_count;
>  };
>  
>  static enum pool_mode get_pool_mode(struct pool *pool);
> @@ -318,6 +320,8 @@ struct thin_c {
>  	 */
>  	atomic_t refcount;
>  	struct completion can_destroy;
> +
> +	dm_block_t reserve_count;
>  };
>  
>  /*----------------------------------------------------------------*/
> @@ -1359,24 +1363,19 @@ static void check_low_water_mark(struct pool *pool, dm_block_t free_blocks)
>  	}
>  }
>  
> -static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
> +static int get_free_blocks(struct pool *pool, dm_block_t *free_blocks)
>  {
>  	int r;
> -	dm_block_t free_blocks;
> -	struct pool *pool = tc->pool;
> -
> -	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
> -		return -EINVAL;
>  
> -	r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
> +	r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
>  	if (r) {
>  		metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
>  		return r;
>  	}
>  
> -	check_low_water_mark(pool, free_blocks);
> +	check_low_water_mark(pool, *free_blocks);
>  
> -	if (!free_blocks) {
> +	if (!*free_blocks) {
>  		/*
>  		 * Try to commit to see if that will free up some
>  		 * more space.
> @@ -1385,7 +1384,7 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
>  		if (r)
>  			return r;
>  
> -		r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
> +		r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
>  		if (r) {
>  			metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
>  			return r;
> @@ -1397,6 +1396,76 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
>  		}
>  	}
>  
> +	return r;
> +}
> +
> +/*
> + * Returns true iff either:
> + * i) decrement succeeded (ie. there was reserve left)
> + * ii) there is extra space in the pool
> + */
> +static bool dec_reserve_count(struct thin_c *tc, dm_block_t free_blocks)
> +{
> +	bool r = false;
> +	unsigned long flags;
> +
> +	if (!free_blocks)
> +		return false;
> +
> +	spin_lock_irqsave(&tc->pool->lock, flags);
> +	if (tc->reserve_count > 0) {
> +		tc->reserve_count--;
> +		tc->pool->reserve_count--;
> +		r = true;
> +	} else {
> +		if (free_blocks > tc->pool->reserve_count)
> +			r = true;
> +	}
> +	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +
> +	return r;
> +}
> +
> +static int set_reserve_count(struct thin_c *tc, dm_block_t count)
> +{
> +	int r;
> +	dm_block_t free_blocks;
> +	int64_t delta;
> +	unsigned long flags;
> +
> +	r = get_free_blocks(tc->pool, &free_blocks);
> +	if (r)
> +		return r;
> +
> +	spin_lock_irqsave(&tc->pool->lock, flags);
> +	delta = count - tc->reserve_count;
> +	if (tc->pool->reserve_count + delta > free_blocks)
> +		r = -ENOSPC;
> +	else {
> +		tc->reserve_count = count;
> +		tc->pool->reserve_count += delta;
> +	}
> +	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +
> +	return r;
> +}
> +
> +static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
> +{
> +	int r;
> +	dm_block_t free_blocks;
> +	struct pool *pool = tc->pool;
> +
> +	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
> +		return -EINVAL;
> +
> +	r = get_free_blocks(tc->pool, &free_blocks);
> +	if (r)
> +		return r;
> +
> +	if (!dec_reserve_count(tc, free_blocks))
> +		return -ENOSPC;
> +
>  	r = dm_pool_alloc_data_block(pool->pmd, result);
>  	if (r) {
>  		metadata_operation_failed(pool, "dm_pool_alloc_data_block", r);
> @@ -2880,6 +2949,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
>  	pool->last_commit_jiffies = jiffies;
>  	pool->pool_md = pool_md;
>  	pool->md_dev = metadata_dev;
> +	pool->reserve_count = 0;
>  	__pool_table_insert(pool);
>  
>  	return pool;
> @@ -3936,6 +4006,7 @@ static void thin_dtr(struct dm_target *ti)
>  
>  	spin_lock_irqsave(&tc->pool->lock, flags);
>  	list_del_rcu(&tc->list);
> +	tc->pool->reserve_count -= tc->reserve_count;
>  	spin_unlock_irqrestore(&tc->pool->lock, flags);
>  	synchronize_rcu();
>  
> @@ -4074,6 +4145,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  	init_completion(&tc->can_destroy);
>  	list_add_tail_rcu(&tc->list, &tc->pool->active_thins);
>  	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +	tc->reserve_count = 0;
>  	/*
>  	 * This synchronize_rcu() call is needed here otherwise we risk a
>  	 * wake_worker() call finding no bios to process (because the newly
> @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
>  }
>  
> +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> +				sector_t len, sector_t *res)
> +{
> +	struct thin_c *tc = ti->private;
> +	struct pool *pool = tc->pool;
> +	sector_t end;
> +	dm_block_t pblock;
> +	dm_block_t vblock;
> +	int error;
> +	struct dm_thin_lookup_result lookup;
> +
> +	if (!is_factor(offset, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	if (!len || !is_factor(len, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	if (res && !is_factor(*res, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	end = offset + len;
> +
> +	while (offset < end) {
> +		vblock = offset;
> +		do_div(vblock, pool->sectors_per_block);
> +
> +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> +		if (error == 0)
> +			goto next;
> +		if (error != -ENODATA)
> +			return error;
> +
> +		error = alloc_data_block(tc, &pblock);

So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
space that was previously reserved by some other caller.  I think?
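
To make that concrete, here is a toy userspace model of the accounting as I
read it from the quoted set_reserve_count(), dec_reserve_count() and
alloc_data_block().  It is only a sketch under that reading: the toy_* names
are invented, locking and metadata commits are left out, and a plain counter
stands in for the pool's free block count.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy stand-ins for the pool and thin device counters in the patch. */
struct toy_pool {
	uint64_t free_blocks;   /* unallocated data blocks in the pool */
	uint64_t reserve_count; /* sum of all thin devices' reservations */
};

struct toy_thin {
	struct toy_pool *pool;
	uint64_t reserve_count; /* blocks reserved by this thin device */
};

/* Modelled on set_reserve_count(): set this device's reservation. */
static int toy_set_reserve(struct toy_thin *tc, uint64_t count)
{
	int64_t delta;
	int64_t wanted;

	assert(tc && tc->pool);
	delta = (int64_t)count - (int64_t)tc->reserve_count;
	wanted = (int64_t)tc->pool->reserve_count + delta;
	if (wanted > (int64_t)tc->pool->free_blocks)
		return -ENOSPC;
	tc->reserve_count = count;
	tc->pool->reserve_count = (uint64_t)wanted;
	return 0;
}

/*
 * Modelled on alloc_data_block() + dec_reserve_count(): consume one block,
 * from this device's reservation if it has one, otherwise only if the pool
 * has free space beyond what is reserved in total.
 */
static int toy_alloc_block(struct toy_thin *tc)
{
	struct toy_pool *pool;

	assert(tc && tc->pool);
	pool = tc->pool;
	if (!pool->free_blocks)
		return -ENOSPC;
	if (tc->reserve_count > 0) {
		tc->reserve_count--;
		pool->reserve_count--;
	} else if (pool->free_blocks <= pool->reserve_count) {
		return -ENOSPC;
	}
	pool->free_blocks--;
	return 0;
}
```

In this model a device holding a reservation of 4 that provisions 2 blocks
ends up with a reservation of 2, i.e. the provision ate into space reserved
earlier on the same device.  Raising the reservation to 6 first and then
provisioning 2 leaves the original 4 intact, which is the ordering suggested
above.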

> +		if (error)
> +			return error;
> +
> +		error = dm_thin_insert_block(tc->td, vblock, pblock);

Having reserved and mapped blocks, what happens when we try to read them?
Do we actually get zeroes, or does the read go straight through to whatever
happens to be in the disk blocks?  I don't think it's correct that we could
BDEV_RES_PROVISION and end up with stale credit card numbers from some other
thin device.

(PS: I don't know enough about thinp to know if this has already been taken
care of.  I didn't see anything, but who knows what I missed. :))

--D

> +		if (error)
> +			return error;
> +
> +		if (res && *res)
> +			*res -= pool->sectors_per_block;
> +next:
> +		offset += pool->sectors_per_block;
> +	}
> +
> +	return 0;
> +}
> +
> +static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
> +			      sector_t len, sector_t *res)
> +{
> +	struct thin_c *tc = ti->private;
> +	struct pool *pool = tc->pool;
> +	sector_t blocks;
> +	unsigned long flags;
> +	int error;
> +
> +	if (mode == BDEV_RES_PROVISION)
> +		return thin_provision_space(ti, offset, len, res);
> +
> +	/* res required for get/set */
> +	error = -EINVAL;
> +	if (!res)
> +		return error;
> +
> +	if (mode == BDEV_RES_GET) {
> +		spin_lock_irqsave(&tc->pool->lock, flags);
> +		*res = tc->reserve_count * pool->sectors_per_block;
> +		spin_unlock_irqrestore(&tc->pool->lock, flags);
> +		error = 0;
> +	} else if (mode == BDEV_RES_MOD) {
> +		/*
> +		* @res must always be a factor of the pool's blocksize; upper
> +		* layers can rely on the bdev's minimum_io_size for this.
> +		*/
> +		if (!is_factor(*res, pool->sectors_per_block))
> +			return error;
> +
> +		blocks = *res;
> +		(void) sector_div(blocks, pool->sectors_per_block);
> +
> +		error = set_reserve_count(tc, blocks);
> +	}
> +
> +	return error;
> +}
> +
>  static struct target_type thin_target = {
>  	.name = "thin",
>  	.version = {1, 18, 0},
> @@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
>  	.status = thin_status,
>  	.iterate_devices = thin_iterate_devices,
>  	.io_hints = thin_io_hints,
> +	.reserve_space = thin_reserve_space,
>  };
>  
>  /*----------------------------------------------------------------*/
> -- 
> 2.4.11

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
@ 2016-04-13 17:44     ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-13 17:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: snitzer, Joe Thornber, xfs, linux-block, dm-devel, linux-fsdevel

On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> From: Joe Thornber <ejt@redhat.com>
> 
> Experimental reserve interface for XFS guys to play with.
> 
> I have big reservations (no pun intended) about this patch.
> 
> [BF:
>  - Support for reservation reduction.
>  - Support for space provisioning.
>  - Condensed to a single function.]
> 
> Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 171 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 92237b6..32bc5bd 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -271,6 +271,8 @@ struct pool {
>  	process_mapping_fn process_prepared_discard;
>  
>  	struct dm_bio_prison_cell **cell_sort_array;
> +
> +	dm_block_t reserve_count;
>  };
>  
>  static enum pool_mode get_pool_mode(struct pool *pool);
> @@ -318,6 +320,8 @@ struct thin_c {
>  	 */
>  	atomic_t refcount;
>  	struct completion can_destroy;
> +
> +	dm_block_t reserve_count;
>  };
>  
>  /*----------------------------------------------------------------*/
> @@ -1359,24 +1363,19 @@ static void check_low_water_mark(struct pool *pool, dm_block_t free_blocks)
>  	}
>  }
>  
> -static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
> +static int get_free_blocks(struct pool *pool, dm_block_t *free_blocks)
>  {
>  	int r;
> -	dm_block_t free_blocks;
> -	struct pool *pool = tc->pool;
> -
> -	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
> -		return -EINVAL;
>  
> -	r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
> +	r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
>  	if (r) {
>  		metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
>  		return r;
>  	}
>  
> -	check_low_water_mark(pool, free_blocks);
> +	check_low_water_mark(pool, *free_blocks);
>  
> -	if (!free_blocks) {
> +	if (!*free_blocks) {
>  		/*
>  		 * Try to commit to see if that will free up some
>  		 * more space.
> @@ -1385,7 +1384,7 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
>  		if (r)
>  			return r;
>  
> -		r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
> +		r = dm_pool_get_free_block_count(pool->pmd, free_blocks);
>  		if (r) {
>  			metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
>  			return r;
> @@ -1397,6 +1396,76 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
>  		}
>  	}
>  
> +	return r;
> +}
> +
> +/*
> + * Returns true iff either:
> + * i) decrement succeeded (ie. there was reserve left)
> + * ii) there is extra space in the pool
> + */
> +static bool dec_reserve_count(struct thin_c *tc, dm_block_t free_blocks)
> +{
> +	bool r = false;
> +	unsigned long flags;
> +
> +	if (!free_blocks)
> +		return false;
> +
> +	spin_lock_irqsave(&tc->pool->lock, flags);
> +	if (tc->reserve_count > 0) {
> +		tc->reserve_count--;
> +		tc->pool->reserve_count--;
> +		r = true;
> +	} else {
> +		if (free_blocks > tc->pool->reserve_count)
> +			r = true;
> +	}
> +	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +
> +	return r;
> +}
> +
> +static int set_reserve_count(struct thin_c *tc, dm_block_t count)
> +{
> +	int r;
> +	dm_block_t free_blocks;
> +	int64_t delta;
> +	unsigned long flags;
> +
> +	r = get_free_blocks(tc->pool, &free_blocks);
> +	if (r)
> +		return r;
> +
> +	spin_lock_irqsave(&tc->pool->lock, flags);
> +	delta = count - tc->reserve_count;
> +	if (tc->pool->reserve_count + delta > free_blocks)
> +		r = -ENOSPC;
> +	else {
> +		tc->reserve_count = count;
> +		tc->pool->reserve_count += delta;
> +	}
> +	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +
> +	return r;
> +}
> +
> +static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
> +{
> +	int r;
> +	dm_block_t free_blocks;
> +	struct pool *pool = tc->pool;
> +
> +	if (WARN_ON(get_pool_mode(pool) != PM_WRITE))
> +		return -EINVAL;
> +
> +	r = get_free_blocks(tc->pool, &free_blocks);
> +	if (r)
> +		return r;
> +
> +	if (!dec_reserve_count(tc, free_blocks))
> +		return -ENOSPC;
> +
>  	r = dm_pool_alloc_data_block(pool->pmd, result);
>  	if (r) {
>  		metadata_operation_failed(pool, "dm_pool_alloc_data_block", r);
> @@ -2880,6 +2949,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
>  	pool->last_commit_jiffies = jiffies;
>  	pool->pool_md = pool_md;
>  	pool->md_dev = metadata_dev;
> +	pool->reserve_count = 0;
>  	__pool_table_insert(pool);
>  
>  	return pool;
> @@ -3936,6 +4006,7 @@ static void thin_dtr(struct dm_target *ti)
>  
>  	spin_lock_irqsave(&tc->pool->lock, flags);
>  	list_del_rcu(&tc->list);
> +	tc->pool->reserve_count -= tc->reserve_count;
>  	spin_unlock_irqrestore(&tc->pool->lock, flags);
>  	synchronize_rcu();
>  
> @@ -4074,6 +4145,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  	init_completion(&tc->can_destroy);
>  	list_add_tail_rcu(&tc->list, &tc->pool->active_thins);
>  	spin_unlock_irqrestore(&tc->pool->lock, flags);
> +	tc->reserve_count = 0;
>  	/*
>  	 * This synchronize_rcu() call is needed here otherwise we risk a
>  	 * wake_worker() call finding no bios to process (because the newly
> @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
>  }
>  
> +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> +				sector_t len, sector_t *res)
> +{
> +	struct thin_c *tc = ti->private;
> +	struct pool *pool = tc->pool;
> +	sector_t end;
> +	dm_block_t pblock;
> +	dm_block_t vblock;
> +	int error;
> +	struct dm_thin_lookup_result lookup;
> +
> +	if (!is_factor(offset, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	if (!len || !is_factor(len, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	if (res && !is_factor(*res, pool->sectors_per_block))
> +		return -EINVAL;
> +
> +	end = offset + len;
> +
> +	while (offset < end) {
> +		vblock = offset;
> +		do_div(vblock, pool->sectors_per_block);
> +
> +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> +		if (error == 0)
> +			goto next;
> +		if (error != -ENODATA)
> +			return error;
> +
> +		error = alloc_data_block(tc, &pblock);

So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
space that was previously reserved by some other caller.  I think?

> +		if (error)
> +			return error;
> +
> +		error = dm_thin_insert_block(tc->td, vblock, pblock);

Having reserved and mapped blocks, what happens when we try to read them?
Do we actually get zeroes, or does the read go straight through to whatever
happens to be in the disk blocks?  I don't think it's correct that we could
BDEV_RES_PROVISION and end up with stale credit card numbers from some other
thin device.

(PS: I don't know enough about thinp to know if this has already been taken
care of.  I didn't see anything, but who knows what I missed. :))

--D

> +		if (error)
> +			return error;
> +
> +		if (res && *res)
> +			*res -= pool->sectors_per_block;
> +next:
> +		offset += pool->sectors_per_block;
> +	}
> +
> +	return 0;
> +}
> +
> +static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
> +			      sector_t len, sector_t *res)
> +{
> +	struct thin_c *tc = ti->private;
> +	struct pool *pool = tc->pool;
> +	sector_t blocks;
> +	unsigned long flags;
> +	int error;
> +
> +	if (mode == BDEV_RES_PROVISION)
> +		return thin_provision_space(ti, offset, len, res);
> +
> +	/* res required for get/set */
> +	error = -EINVAL;
> +	if (!res)
> +		return error;
> +
> +	if (mode == BDEV_RES_GET) {
> +		spin_lock_irqsave(&tc->pool->lock, flags);
> +		*res = tc->reserve_count * pool->sectors_per_block;
> +		spin_unlock_irqrestore(&tc->pool->lock, flags);
> +		error = 0;
> +	} else if (mode == BDEV_RES_MOD) {
> +		/*
> +		* @res must always be a factor of the pool's blocksize; upper
> +		* layers can rely on the bdev's minimum_io_size for this.
> +		*/
> +		if (!is_factor(*res, pool->sectors_per_block))
> +			return error;
> +
> +		blocks = *res;
> +		(void) sector_div(blocks, pool->sectors_per_block);
> +
> +		error = set_reserve_count(tc, blocks);
> +	}
> +
> +	return error;
> +}
> +
>  static struct target_type thin_target = {
>  	.name = "thin",
>  	.version = {1, 18, 0},
> @@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
>  	.status = thin_status,
>  	.iterate_devices = thin_iterate_devices,
>  	.io_hints = thin_io_hints,
> +	.reserve_space = thin_reserve_space,
>  };
>  
>  /*----------------------------------------------------------------*/
> -- 
> 2.4.11
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-13 17:44     ` Darrick J. Wong
@ 2016-04-13 18:33       ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-13 18:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: xfs, linux-block, linux-fsdevel, dm-devel, Joe Thornber, snitzer

On Wed, Apr 13, 2016 at 10:44:42AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> > From: Joe Thornber <ejt@redhat.com>
> > 
> > Experimental reserve interface for XFS guys to play with.
> > 
> > I have big reservations (no pun intended) about this patch.
> > 
> > [BF:
> >  - Support for reservation reduction.
> >  - Support for space provisioning.
> >  - Condensed to a single function.]
> > 
> > Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> > Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 171 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 92237b6..32bc5bd 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
...
> > @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> >  }
> >  
> > +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> > +				sector_t len, sector_t *res)
> > +{
> > +	struct thin_c *tc = ti->private;
> > +	struct pool *pool = tc->pool;
> > +	sector_t end;
> > +	dm_block_t pblock;
> > +	dm_block_t vblock;
> > +	int error;
> > +	struct dm_thin_lookup_result lookup;
> > +
> > +	if (!is_factor(offset, pool->sectors_per_block))
> > +		return -EINVAL;
> > +
> > +	if (!len || !is_factor(len, pool->sectors_per_block))
> > +		return -EINVAL;
> > +
> > +	if (res && !is_factor(*res, pool->sectors_per_block))
> > +		return -EINVAL;
> > +
> > +	end = offset + len;
> > +
> > +	while (offset < end) {
> > +		vblock = offset;
> > +		do_div(vblock, pool->sectors_per_block);
> > +
> > +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> > +		if (error == 0)
> > +			goto next;
> > +		if (error != -ENODATA)
> > +			return error;
> > +
> > +		error = alloc_data_block(tc, &pblock);
> 
> So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
> first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
> space that was previously reserved by some other caller.  I think?
> 

Yes, assuming this is being called from a filesystem using the
reservation mechanism.
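
The reserve-then-provision ordering confirmed here can be sketched with a small userspace model. None of this is kernel code: `struct model_pool`, `model_res_mod()`, `model_provision()` and `model_unbacked()` are hypothetical stand-ins for the pool, BDEV_RES_MOD and BDEV_RES_PROVISION, and the accounting rule (a reservation must be backed by free blocks, and a provision draws down the caller's reservation) is an assumption about the intent rather than what the patch implements.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model; all counts are pool blocks, not sectors. */
struct model_pool {
	uint64_t free_blocks;		/* unallocated blocks in the pool */
	uint64_t reserved_blocks;	/* free blocks promised to reservers */
};

/* Models BDEV_RES_MOD: adjust the outstanding reservation by delta blocks. */
int model_res_mod(struct model_pool *p, int64_t delta)
{
	assert(p);
	if (delta < 0) {
		/* unsigned negation avoids overflow for INT64_MIN */
		uint64_t dec = 0 - (uint64_t)delta;

		if (dec > p->reserved_blocks)
			return -EINVAL;
		p->reserved_blocks -= dec;
		return 0;
	}
	/* already over-committed, or not enough unreserved free space */
	if (p->reserved_blocks > p->free_blocks ||
	    (uint64_t)delta > p->free_blocks - p->reserved_blocks)
		return -ENOSPC;
	p->reserved_blocks += (uint64_t)delta;
	return 0;
}

/*
 * Models BDEV_RES_PROVISION for n unmapped blocks. *res is the caller's
 * share of the reservation; one block of it is consumed per allocation.
 * With res == NULL (or *res == 0) the allocation is unreserved.
 */
int model_provision(struct model_pool *p, uint64_t n, uint64_t *res)
{
	assert(p);
	while (n--) {
		if (!p->free_blocks)
			return -ENOSPC;
		p->free_blocks--;
		if (res && *res && p->reserved_blocks) {
			(*res)--;
			p->reserved_blocks--;
		}
	}
	return 0;
}

/* Reserved blocks that no longer have a free block behind them. */
uint64_t model_unbacked(const struct model_pool *p)
{
	return p->reserved_blocks > p->free_blocks ?
	       p->reserved_blocks - p->free_blocks : 0;
}
```

Under these assumptions, a caller that provisions N blocks without first raising the reservation by N leaves `model_unbacked()` non-zero: it has consumed space that was promised to someone else.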

> > +		if (error)
> > +			return error;
> > +
> > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> 
> Having reserved and mapped blocks, what happens when we try to read them?
> Do we actually get zeroes, or does the read go straight through to whatever
> happens to be in the disk blocks?  I don't think it's correct that we could
> BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> thin device.
> 

Agreed, but to be honest I'm not sure how this works in thinp. fallocate
wasn't on my mind when doing this; I was simply trying to cobble
together what I could to make progress on the fs parts (i.e., I just
needed a call that allocates blocks and consumes reservation in the
process).

Skimming through the dm-thin code, it looks like a (configurable) block
zeroing mechanism can be triggered from somewhere around
provision_block()->schedule_zero(), depending on whether the incoming
write overwrites the newly allocated block. If that's the case, zeroing
is driven by the write path, which this provision call never goes
through, so I suspect reads would just fall through to the block and
return whatever was on disk. This code would probably need to tie into
that zeroing mechanism one way or another to deal with that issue.
(Though somebody who actually knows something about dm-thin should
verify that. :)
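
As a rough illustration of the decision described above, the sketch below models a zeroing policy that depends on a pool-wide setting and on whether the triggering write covers the whole new block. It is a guess at the shape of the logic, not a copy of dm-thin: `model_zero_policy()`, `model_io_overwrites_block()` and the enum values are hypothetical names, and treating a provision with no data as a zero-length write is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_sector_t;	/* stand-in for the kernel's sector_t */

enum model_zero_action {
	MODEL_ZERO_NONE,	/* zeroing disabled: old contents stay visible */
	MODEL_ZERO_SKIP,	/* write covers the whole block, nothing to zero */
	MODEL_ZERO_BLOCK,	/* zero the block before exposing the mapping */
};

/* True when [offset, offset + len) is exactly one aligned pool block. */
bool model_io_overwrites_block(model_sector_t offset, model_sector_t len,
			       model_sector_t sectors_per_block)
{
	assert(sectors_per_block != 0);
	return (offset % sectors_per_block) == 0 && len == sectors_per_block;
}

/*
 * Decide what to do with a freshly allocated block. len == 0 models a
 * provision request that carries no data (the fallocate-like case).
 */
enum model_zero_action model_zero_policy(bool zero_new_blocks,
					 model_sector_t offset,
					 model_sector_t len,
					 model_sector_t sectors_per_block)
{
	if (!zero_new_blocks)
		return MODEL_ZERO_NONE;
	if (model_io_overwrites_block(offset, len, sectors_per_block))
		return MODEL_ZERO_SKIP;
	return MODEL_ZERO_BLOCK;
}
```

In this model a data-less provision always lands on `MODEL_ZERO_BLOCK` when zeroing is enabled, which is the case the provision path would have to handle explicitly.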

Brian

> (PS: I don't know enough about thinp to know if this has already been taken
> care of.  I didn't see anything, but who knows what I missed. :))
> 
> --D
> 
> > +		if (error)
> > +			return error;
> > +
> > +		if (res && *res)
> > +			*res -= pool->sectors_per_block;
> > +next:
> > +		offset += pool->sectors_per_block;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
> > +			      sector_t len, sector_t *res)
> > +{
> > +	struct thin_c *tc = ti->private;
> > +	struct pool *pool = tc->pool;
> > +	sector_t blocks;
> > +	unsigned long flags;
> > +	int error;
> > +
> > +	if (mode == BDEV_RES_PROVISION)
> > +		return thin_provision_space(ti, offset, len, res);
> > +
> > +	/* res required for get/set */
> > +	error = -EINVAL;
> > +	if (!res)
> > +		return error;
> > +
> > +	if (mode == BDEV_RES_GET) {
> > +		spin_lock_irqsave(&tc->pool->lock, flags);
> > +		*res = tc->reserve_count * pool->sectors_per_block;
> > +		spin_unlock_irqrestore(&tc->pool->lock, flags);
> > +		error = 0;
> > +	} else if (mode == BDEV_RES_MOD) {
> > +		/*
> > +		* @res must always be a factor of the pool's blocksize; upper
> > +		* layers can rely on the bdev's minimum_io_size for this.
> > +		*/
> > +		if (!is_factor(*res, pool->sectors_per_block))
> > +			return error;
> > +
> > +		blocks = *res;
> > +		(void) sector_div(blocks, pool->sectors_per_block);
> > +
> > +		error = set_reserve_count(tc, blocks);
> > +	}
> > +
> > +	return error;
> > +}
> > +
> >  static struct target_type thin_target = {
> >  	.name = "thin",
> >  	.version = {1, 18, 0},
> > @@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
> >  	.status = thin_status,
> >  	.iterate_devices = thin_iterate_devices,
> >  	.io_hints = thin_io_hints,
> > +	.reserve_space = thin_reserve_space,
> >  };
> >  
> >  /*----------------------------------------------------------------*/
> > -- 
> > 2.4.11
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-13 18:33       ` Brian Foster
@ 2016-04-13 20:41         ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-13 20:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: xfs, linux-block, linux-fsdevel, dm-devel, Joe Thornber, snitzer

On Wed, Apr 13, 2016 at 02:33:52PM -0400, Brian Foster wrote:
> On Wed, Apr 13, 2016 at 10:44:42AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> > > From: Joe Thornber <ejt@redhat.com>
> > > 
> > > Experimental reserve interface for XFS guys to play with.
> > > 
> > > I have big reservations (no pun intended) about this patch.
> > > 
> > > [BF:
> > >  - Support for reservation reduction.
> > >  - Support for space provisioning.
> > >  - Condensed to a single function.]
> > > 
> > > Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> > > Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > ---
> > >  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 171 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > index 92237b6..32bc5bd 100644
> > > --- a/drivers/md/dm-thin.c
> > > +++ b/drivers/md/dm-thin.c
> ...
> > > @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > >  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > >  }
> > >  
> > > +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> > > +				sector_t len, sector_t *res)
> > > +{
> > > +	struct thin_c *tc = ti->private;
> > > +	struct pool *pool = tc->pool;
> > > +	sector_t end;
> > > +	dm_block_t pblock;
> > > +	dm_block_t vblock;
> > > +	int error;
> > > +	struct dm_thin_lookup_result lookup;
> > > +
> > > +	if (!is_factor(offset, pool->sectors_per_block))
> > > +		return -EINVAL;
> > > +
> > > +	if (!len || !is_factor(len, pool->sectors_per_block))
> > > +		return -EINVAL;
> > > +
> > > +	if (res && !is_factor(*res, pool->sectors_per_block))
> > > +		return -EINVAL;
> > > +
> > > +	end = offset + len;
> > > +
> > > +	while (offset < end) {
> > > +		vblock = offset;
> > > +		do_div(vblock, pool->sectors_per_block);
> > > +
> > > +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> > > +		if (error == 0)
> > > +			goto next;
> > > +		if (error != -ENODATA)
> > > +			return error;
> > > +
> > > +		error = alloc_data_block(tc, &pblock);
> > 
> > So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
> > first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
> > space that was previously reserved by some other caller.  I think?
> > 
> 
> Yes, assuming this is being called from a filesystem using the
> reservation mechanism.
> 
> > > +		if (error)
> > > +			return error;
> > > +
> > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > 
> > Having reserved and mapped blocks, what happens when we try to read them?
> > Do we actually get zeroes, or does the read go straight through to whatever
> > happens to be in the disk blocks?  I don't think it's correct that we could
> > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > thin device.
> > 
> 
> Agree, but I'm not really sure how this works in thinp tbh. fallocate
> wasn't really on my mind when doing this. I was simply trying to cobble
> together what I could to facilitate making progress on the fs parts
> (e.g., I just needed a call that allocated blocks and consumed
> reservation in the process).
> 
> Skimming through the dm-thin code, it looks like a (configurable) block
> zeroing mechanism can be triggered from somewhere around
> provision_block()->schedule_zero(), depending on whether the incoming
> write overwrites the newly allocated block. If that's the case, then I
> suspect that means reads would just fall through to the block and return
> whatever was on disk. This code would probably need to tie into that
> zeroing mechanism one way or another to deal with that issue. (Though
> somebody who actually knows something about dm-thin should verify that.
> :)
> 

BTW, if that mechanism is in fact doing I/O, that might not be the
appropriate solution for fallocate. Perhaps we'd have to consider an
unwritten flag or some such in dm-thin, if possible.
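
One way to picture the unwritten-flag idea is the userspace sketch below. It is purely hypothetical: dm-thin has no such state in this series, and `struct model_block`, the three states and the `ios` counter are invented for illustration. The point is that provisioning becomes a metadata-only step, reads of an unwritten block return zeroes without touching the backing store, and the first write is what clears out stale contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 8	/* bytes per block in this toy model */

enum model_state { MODEL_UNMAPPED, MODEL_UNWRITTEN, MODEL_WRITTEN };

struct model_block {
	enum model_state state;
	unsigned char data[MODEL_BLOCK_SIZE];	/* may hold stale bytes */
	unsigned int ios;			/* backing-store accesses */
};

/* Provision: flips metadata only, performs no I/O on the backing block. */
void model_provision(struct model_block *b)
{
	assert(b);
	if (b->state == MODEL_UNMAPPED)
		b->state = MODEL_UNWRITTEN;
}

/* Read: unmapped and unwritten blocks yield zeroes without any I/O. */
void model_read(struct model_block *b, unsigned char *out)
{
	if (b->state != MODEL_WRITTEN) {
		memset(out, 0, MODEL_BLOCK_SIZE);
		return;
	}
	memcpy(out, b->data, MODEL_BLOCK_SIZE);
	b->ios++;
}

/*
 * Write len bytes at byte offset off. The first write to an unwritten
 * block zeroes the whole block before copying, so stale bytes outside
 * the written range never become readable. Returns 0 or -1.
 */
int model_write(struct model_block *b, size_t off, const void *src,
		size_t len)
{
	if (b->state == MODEL_UNMAPPED || off > MODEL_BLOCK_SIZE ||
	    len > MODEL_BLOCK_SIZE - off)
		return -1;
	if (b->state == MODEL_UNWRITTEN) {
		memset(b->data, 0, MODEL_BLOCK_SIZE);
		b->state = MODEL_WRITTEN;
	}
	memcpy(b->data + off, src, len);
	b->ios++;
	return 0;
}
```

The trade-off in this sketch is that the cost of zeroing moves from provision time to the first partial write, which is roughly how filesystems treat unwritten extents.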

Brian

> Brian
> 
> > (PS: I don't know enough about thinp to know if this has already been taken
> > care of.  I didn't see anything, but who knows what I missed. :))
> > 
> > --D
> > 
> > > +		if (error)
> > > +			return error;
> > > +
> > > +		if (res && *res)
> > > +			*res -= pool->sectors_per_block;
> > > +next:
> > > +		offset += pool->sectors_per_block;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
> > > +			      sector_t len, sector_t *res)
> > > +{
> > > +	struct thin_c *tc = ti->private;
> > > +	struct pool *pool = tc->pool;
> > > +	sector_t blocks;
> > > +	unsigned long flags;
> > > +	int error;
> > > +
> > > +	if (mode == BDEV_RES_PROVISION)
> > > +		return thin_provision_space(ti, offset, len, res);
> > > +
> > > +	/* res required for get/set */
> > > +	error = -EINVAL;
> > > +	if (!res)
> > > +		return error;
> > > +
> > > +	if (mode == BDEV_RES_GET) {
> > > +		spin_lock_irqsave(&tc->pool->lock, flags);
> > > +		*res = tc->reserve_count * pool->sectors_per_block;
> > > +		spin_unlock_irqrestore(&tc->pool->lock, flags);
> > > +		error = 0;
> > > +	} else if (mode == BDEV_RES_MOD) {
> > > +		/*
> > > +		* @res must always be a factor of the pool's blocksize; upper
> > > +		* layers can rely on the bdev's minimum_io_size for this.
> > > +		*/
> > > +		if (!is_factor(*res, pool->sectors_per_block))
> > > +			return error;
> > > +
> > > +		blocks = *res;
> > > +		(void) sector_div(blocks, pool->sectors_per_block);
> > > +
> > > +		error = set_reserve_count(tc, blocks);
> > > +	}
> > > +
> > > +	return error;
> > > +}
> > > +
> > >  static struct target_type thin_target = {
> > >  	.name = "thin",
> > >  	.version = {1, 18, 0},
> > > @@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
> > >  	.status = thin_status,
> > >  	.iterate_devices = thin_iterate_devices,
> > >  	.io_hints = thin_io_hints,
> > > +	.reserve_space = thin_reserve_space,
> > >  };
> > >  
> > >  /*----------------------------------------------------------------*/
> > > -- 
> > > 2.4.11
> > > 
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-block" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-13 20:41         ` Brian Foster
@ 2016-04-13 21:01           ` Darrick J. Wong
  -1 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2016-04-13 21:01 UTC (permalink / raw)
  To: Brian Foster
  Cc: xfs, linux-block, linux-fsdevel, dm-devel, Joe Thornber, snitzer

On Wed, Apr 13, 2016 at 04:41:18PM -0400, Brian Foster wrote:
> On Wed, Apr 13, 2016 at 02:33:52PM -0400, Brian Foster wrote:
> > On Wed, Apr 13, 2016 at 10:44:42AM -0700, Darrick J. Wong wrote:
> > > On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> > > > From: Joe Thornber <ejt@redhat.com>
> > > > 
> > > > Experimental reserve interface for XFS guys to play with.
> > > > 
> > > > I have big reservations (no pun intended) about this patch.
> > > > 
> > > > [BF:
> > > >  - Support for reservation reduction.
> > > >  - Support for space provisioning.
> > > >  - Condensed to a single function.]
> > > > 
> > > > Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> > > > Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > > ---
> > > >  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 171 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > index 92237b6..32bc5bd 100644
> > > > --- a/drivers/md/dm-thin.c
> > > > +++ b/drivers/md/dm-thin.c
> > ...
> > > > @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > >  }
> > > >  
> > > > +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> > > > +				sector_t len, sector_t *res)
> > > > +{
> > > > +	struct thin_c *tc = ti->private;
> > > > +	struct pool *pool = tc->pool;
> > > > +	sector_t end;
> > > > +	dm_block_t pblock;
> > > > +	dm_block_t vblock;
> > > > +	int error;
> > > > +	struct dm_thin_lookup_result lookup;
> > > > +
> > > > +	if (!is_factor(offset, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!len || !is_factor(len, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (res && !is_factor(*res, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	end = offset + len;
> > > > +
> > > > +	while (offset < end) {
> > > > +		vblock = offset;
> > > > +		do_div(vblock, pool->sectors_per_block);
> > > > +
> > > > +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> > > > +		if (error == 0)
> > > > +			goto next;
> > > > +		if (error != -ENODATA)
> > > > +			return error;
> > > > +
> > > > +		error = alloc_data_block(tc, &pblock);
> > > 
> > > So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
> > > first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
> > > space that was previously reserved by some other caller.  I think?
> > > 
> > 
> > Yes, assuming this is being called from a filesystem using the
> > reservation mechanism.
> > 
> > > > +		if (error)
> > > > +			return error;
> > > > +
> > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > 
> > > Having reserved and mapped blocks, what happens when we try to read them?
> > > Do we actually get zeroes, or does the read go straight through to whatever
> > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > thin device.
> > > 
> > 
> > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > wasn't really on my mind when doing this. I was simply trying to cobble
> > together what I could to facilitate making progress on the fs parts
> > (e.g., I just needed a call that allocated blocks and consumed
> > reservation in the process).
> > 
> > Skimming through the dm-thin code, it looks like a (configurable) block
> > zeroing mechanism can be triggered from somewhere around
> > provision_block()->schedule_zero(), depending on whether the incoming
> > write overwrites the newly allocated block. If that's the case, then I
> > suspect that means reads would just fall through to the block and return
> > whatever was on disk. This code would probably need to tie into that
> > zeroing mechanism one way or another to deal with that issue. (Though
> > somebody who actually knows something about dm-thin should verify that.
> > :)
> > 
> 
> BTW, if that mechanism is in fact doing I/O, that might not be the
> appropriate solution for fallocate. Perhaps we'd have to consider an
> unwritten flag or some such in dm-thin, if possible.

The hard part is that we don't know if the caller actually has a way to
prevent userspace from seeing the stale contents (filesystems) or if we'd
be leaking data straight to userspace (user program calling fallocate).

(And yeah, here we go with NO_HIDE_STALE again...)

--D

> 
> Brian
> 
> > Brian
> > 
> > > (PS: I don't know enough about thinp to know if this has already been taken
> > > care of.  I didn't see anything, but who knows what I missed. :))
> > > 
> > > --D
> > > 
> > > > +		if (error)
> > > > +			return error;
> > > > +
> > > > +		if (res && *res)
> > > > +			*res -= pool->sectors_per_block;
> > > > +next:
> > > > +		offset += pool->sectors_per_block;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int thin_reserve_space(struct dm_target *ti, int mode, sector_t offset,
> > > > +			      sector_t len, sector_t *res)
> > > > +{
> > > > +	struct thin_c *tc = ti->private;
> > > > +	struct pool *pool = tc->pool;
> > > > +	sector_t blocks;
> > > > +	unsigned long flags;
> > > > +	int error;
> > > > +
> > > > +	if (mode == BDEV_RES_PROVISION)
> > > > +		return thin_provision_space(ti, offset, len, res);
> > > > +
> > > > +	/* res required for get/set */
> > > > +	error = -EINVAL;
> > > > +	if (!res)
> > > > +		return error;
> > > > +
> > > > +	if (mode == BDEV_RES_GET) {
> > > > +		spin_lock_irqsave(&tc->pool->lock, flags);
> > > > +		*res = tc->reserve_count * pool->sectors_per_block;
> > > > +		spin_unlock_irqrestore(&tc->pool->lock, flags);
> > > > +		error = 0;
> > > > +	} else if (mode == BDEV_RES_MOD) {
> > > > +		/*
> > > > +		* @res must always be a factor of the pool's blocksize; upper
> > > > +		* layers can rely on the bdev's minimum_io_size for this.
> > > > +		*/
> > > > +		if (!is_factor(*res, pool->sectors_per_block))
> > > > +			return error;
> > > > +
> > > > +		blocks = *res;
> > > > +		(void) sector_div(blocks, pool->sectors_per_block);
> > > > +
> > > > +		error = set_reserve_count(tc, blocks);
> > > > +	}
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > >  static struct target_type thin_target = {
> > > >  	.name = "thin",
> > > >  	.version = {1, 18, 0},
> > > > @@ -4285,6 +4445,7 @@ static struct target_type thin_target = {
> > > >  	.status = thin_status,
> > > >  	.iterate_devices = thin_iterate_devices,
> > > >  	.io_hints = thin_io_hints,
> > > > +	.reserve_space = thin_reserve_space,
> > > >  };
> > > >  
> > > >  /*----------------------------------------------------------------*/
> > > > -- 
> > > > 2.4.11
> > > > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 03/10] block: add block_device_operations methods to set and get reserved space
  2016-04-12 16:42   ` Brian Foster
@ 2016-04-14  0:32     ` Dave Chinner
  -1 siblings, 0 replies; 54+ messages in thread
From: Dave Chinner @ 2016-04-14  0:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-block, linux-fsdevel, dm-devel, Mike Snitzer

On Tue, Apr 12, 2016 at 12:42:46PM -0400, Brian Foster wrote:
> From: Mike Snitzer <snitzer@redhat.com>
> 
> [BF:
>  - Killed wrapper functions.
>  - Condensed to single bdev op.]
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  include/linux/blkdev.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 669e419..6c6ea96 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1650,6 +1650,10 @@ struct blk_dax_ctl {
>  	pfn_t pfn;
>  };
>  
> +#define BDEV_RES_GET		0
> +#define BDEV_RES_MOD		(1 << 0)
> +#define BDEV_RES_PROVISION	(1 << 1)
> +
>  struct block_device_operations {
>  	int (*open) (struct block_device *, fmode_t);
>  	void (*release) (struct gendisk *, fmode_t);
> @@ -1667,6 +1671,8 @@ struct block_device_operations {
>  	int (*getgeo)(struct block_device *, struct hd_geometry *);
>  	/* this callback is with swap_lock and sometimes page table lock held */
>  	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> +	int (*reserve_space) (struct block_device *, int, sector_t, sector_t,
> +			      sector_t *);
>  	struct module *owner;
>  	const struct pr_ops *pr_ops;

You know, I'm now wondering how much of this has overlap with the
iomap interface we are adding for the VFS IO paths to map and
allocate extents....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-13 20:41         ` Brian Foster
@ 2016-04-14 15:10           ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-14 15:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Joe Thornber, xfs, linux-block, dm-devel, linux-fsdevel

On Wed, Apr 13 2016 at  4:41pm -0400,
Brian Foster <bfoster@redhat.com> wrote:

> On Wed, Apr 13, 2016 at 02:33:52PM -0400, Brian Foster wrote:
> > On Wed, Apr 13, 2016 at 10:44:42AM -0700, Darrick J. Wong wrote:
> > > On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> > > > From: Joe Thornber <ejt@redhat.com>
> > > > 
> > > > Experimental reserve interface for XFS guys to play with.
> > > > 
> > > > I have big reservations (no pun intended) about this patch.
> > > > 
> > > > [BF:
> > > >  - Support for reservation reduction.
> > > >  - Support for space provisioning.
> > > >  - Condensed to a single function.]
> > > > 
> > > > Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> > > > Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > > ---
> > > >  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 171 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > index 92237b6..32bc5bd 100644
> > > > --- a/drivers/md/dm-thin.c
> > > > +++ b/drivers/md/dm-thin.c
> > ...
> > > > @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > >  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > >  }
> > > >  
> > > > +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> > > > +				sector_t len, sector_t *res)
> > > > +{
> > > > +	struct thin_c *tc = ti->private;
> > > > +	struct pool *pool = tc->pool;
> > > > +	sector_t end;
> > > > +	dm_block_t pblock;
> > > > +	dm_block_t vblock;
> > > > +	int error;
> > > > +	struct dm_thin_lookup_result lookup;
> > > > +
> > > > +	if (!is_factor(offset, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!len || !is_factor(len, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (res && !is_factor(*res, pool->sectors_per_block))
> > > > +		return -EINVAL;
> > > > +
> > > > +	end = offset + len;
> > > > +
> > > > +	while (offset < end) {
> > > > +		vblock = offset;
> > > > +		do_div(vblock, pool->sectors_per_block);
> > > > +
> > > > +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> > > > +		if (error == 0)
> > > > +			goto next;
> > > > +		if (error != -ENODATA)
> > > > +			return error;
> > > > +
> > > > +		error = alloc_data_block(tc, &pblock);
> > > 
> > > So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
> > > first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
> > > space that was previously reserved by some other caller.  I think?
> > > 
> > 
> > Yes, assuming this is being called from a filesystem using the
> > reservation mechanism.

Brian, I need to circle back with you to understand why XFS even needs
reservation as opposed to just using something like fallocate (which
would provision the space before you actually initiate the IO that would
use it).  But we can discuss that in person and then report back to the
list if it makes it easier...
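
For reference, the reserve-then-provision sequence agreed on above (grow the
reservation by N blocks, then provision N blocks against it) can be modelled
in userspace roughly as follows.  This is only a sketch under assumptions:
toy_pool, toy_reserve_grow() and toy_provision() are hypothetical names, not
dm-thin functions, and the accounting is reduced to two block counters.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in dm-thin. */
struct toy_pool {
	uint64_t free_blocks;     /* unallocated blocks left in the pool */
	uint64_t reserved_blocks; /* subset of free_blocks promised to a holder */
};

/* Grow the reservation by n blocks; fails if unreserved space is short. */
int toy_reserve_grow(struct toy_pool *p, uint64_t n)
{
	if (p->free_blocks - p->reserved_blocks < n)
		return -ENOSPC;
	p->reserved_blocks += n;
	return 0;
}

/*
 * Allocate n blocks.  With from_reservation set, the blocks are charged to
 * the existing reservation; otherwise only unreserved space may be used, so
 * another caller's reservation is never consumed.
 */
int toy_provision(struct toy_pool *p, uint64_t n, int from_reservation)
{
	if (from_reservation) {
		if (p->reserved_blocks < n)
			return -ENOSPC;
		p->reserved_blocks -= n;
	} else if (p->free_blocks - p->reserved_blocks < n) {
		return -ENOSPC;
	}
	p->free_blocks -= n;
	return 0;
}
```

With 10 free blocks and 4 reserved, an unreserved request for 7 blocks fails
in this model, while provisioning 4 blocks from the reservation succeeds and
leaves 6 free and 0 reserved.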
 
> > > > +		if (error)
> > > > +			return error;
> > > > +
> > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > 
> > > Having reserved and mapped blocks, what happens when we try to read them?
> > > Do we actually get zeroes, or does the read go straight through to whatever
> > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > thin device.
> > > 
> > 
> > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > wasn't really on my mind when doing this. I was simply trying to cobble
> > together what I could to facilitate making progress on the fs parts
> > (e.g., I just needed a call that allocated blocks and consumed
> > reservation in the process).
> > 
> > Skimming through the dm-thin code, it looks like a (configurable) block
> > zeroing mechanism can be triggered from somewhere around
> > provision_block()->schedule_zero(), depending on whether the incoming
> > write overwrites the newly allocated block. If that's the case, then I
> > suspect that means reads would just fall through to the block and return
> > whatever was on disk. This code would probably need to tie into that
> > zeroing mechanism one way or another to deal with that issue. (Though
> > somebody who actually knows something about dm-thin should verify that.
> > :)
> > 
> 
> BTW, if that mechanism is in fact doing I/O, that might not be the
> appropriate solution for fallocate. Perhaps we'd have to consider an
> unwritten flag or some such in dm-thin, if possible.

DM thinp defaults to enabling 'zero_new_blocks' (can be disabled using
the 'skip_block_zeroing' feature when loading the DM table for the
thin-pool).  With block-zeroing any blocks that are provisioned _will_
be overwritten with zeroes (using dm-kcopyd which is trained to use
WRITE_SAME if supported).
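
As a rough illustration of the zeroing decision described here and in Brian's
skim of provision_block()->schedule_zero(), a minimal sketch might look like
the following.  toy_needs_zeroing() is a hypothetical helper, not the dm-thin
code, and it assumes that a write covering the whole block is the only case
where zeroing can be skipped while zero_new_blocks is enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a dm-thin function.
 *
 * zero_new_blocks:   pool feature, on by default (off with skip_block_zeroing)
 * io_sectors:        size of the write that triggered provisioning; 0 models
 *                    a provision request that carries no data at all
 * sectors_per_block: thin-pool block size in sectors
 */
bool toy_needs_zeroing(bool zero_new_blocks, uint64_t io_sectors,
		       uint64_t sectors_per_block)
{
	if (!zero_new_blocks)
		return false;	/* zeroing disabled for this pool */

	/* A write that covers the whole block overwrites any stale data. */
	return io_sectors < sectors_per_block;
}
```

A data-less provision request (io_sectors == 0) always needs zeroing in this
model when the feature is enabled, which is the cost being discussed for
fallocate.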

But yeah, for fallocate.. certainly not something we want as it defeats
the point of fallocate being cheap.

So we probably would need a flag comparable to the
ext4-stale-flag-that-shall-not-be-named ;)

Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space
  2016-04-13  0:12         ` Darrick J. Wong
@ 2016-04-14 15:18           ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-14 15:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-block, linux-fsdevel, Brian Foster, dm-devel, xfs

On Tue, Apr 12 2016 at  8:12pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Tue, Apr 12, 2016 at 05:04:27PM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2016 at  4:39pm -0400,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > On Tue, Apr 12, 2016 at 04:04:59PM -0400, Mike Snitzer wrote:
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 5a2c3ab..b34c07b 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -1801,17 +1801,13 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > > >  	struct request_queue *q = bdev_get_queue(bdev);
> > > >  	struct address_space *mapping;
> > > >  	loff_t end = start + len - 1;
> > > > -	loff_t bs_mask, isize;
> > > > +	loff_t isize;
> > > >  	int error;
> > > >  
> > > >  	/* We only support zero range and punch hole. */
> > > >  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> > > >  		return -EOPNOTSUPP;
> > > >  
> > > > -	/* We haven't a primitive for "ensure space exists" right now. */
> > > > -	if (!(mode & ~FALLOC_FL_KEEP_SIZE))
> > > > -		return -EOPNOTSUPP;
> > > > -
> > > >  	/* Only punch if the device can do zeroing discard. */
> > > >  	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> > > >  	    (!blk_queue_discard(q) || !q->limits.discard_zeroes_data))
> > > > @@ -1829,9 +1825,12 @@ long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len)
> > > >  			return -EINVAL;
> > > >  	}
> > > >  
> > > > -	/* Don't allow IO that isn't aligned to logical block size */
> > > > -	bs_mask = bdev_logical_block_size(bdev) - 1;
> > > > -	if ((start | len) & bs_mask)
> > > > +	/*
> > > > +	 * Don't allow IO that isn't aligned to minimum IO size (io_min)
> > > > +	 * - for normal device's io_min is usually logical block size
> > > > +	 * - but for more exotic devices (e.g. DM thinp) it may be larger
> > > > +	 */
> > > > +	if ((start | len) % bdev_io_min(bdev))
> 
> I started by noticing the 64-bit division.

Oops, yeah, good point.  I did say my patch was untested (I didn't mention
that it wasn't even compile-tested... RFC and all) ;)
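
A sketch of an alignment check that avoids a bare 64-bit '%' in the common
case is below.  toy_range_aligned() is a hypothetical userspace helper; in the
kernel the non-power-of-two branch would presumably go through a helper such
as div_u64_rem() instead.  Note also that the (start | len) trick is only
valid when the unit is a power of two, so the two values are tested separately
otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if both start and len are multiples of unit. */
bool toy_range_aligned(uint64_t start, uint64_t len, uint32_t unit)
{
	if (unit == 0)
		return false;

	/* Power-of-two unit: OR the values and mask, no division needed. */
	if ((unit & (unit - 1)) == 0)
		return ((start | len) & (uint64_t)(unit - 1)) == 0;

	/*
	 * Any other unit: OR-ing first gives wrong answers (3 | 6 == 7, which
	 * is not a multiple of 3), so check each value on its own.
	 */
	return (start % unit) == 0 && (len % unit) == 0;
}
```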

> However, in researching alignment
> requirements for fallocate, I noticed that nothing says that we can return
> -EINVAL for unaligned offset/len for allocate or punch.  For file allocations
> ext4 and xfs simply enlarge the range so that the ends are aligned to the
> logical block size; for punch they both shrink the range to deallocate until
> the ends are aligned, and write zeroes to the partial blocks.
> 
> At least for user-visible fallocate we should do likewise, but for the internal
> blkdev_ helpers I think it makes more sense to check lbs alignment and let the
> lower level driver reject the IO if min_io alignment is a hard requirement.
> Documentation/block/queue-sysfs.txt says that the min_io is the smallest
> /preferred/ size.

Thinking about this all further.  Alignment on allocation isn't a big
deal for thinp.  If the extent requested isn't properly aligned we'll
still do the right thing (which is to round-up and allocate a block at
the beginning and/or end to fulfill the request).
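
The round-up behaviour for allocation amounts to rounding the sector range
outwards to whole blocks.  A minimal sketch, assuming a hypothetical
toy_round_out() helper and block sizes expressed in sectors (this is not the
dm-thin implementation):

```c
#include <stdint.h>

/* Hypothetical result type: a run of whole thin-pool blocks. */
struct toy_block_range {
	uint64_t first;	/* index of the first block touched */
	uint64_t count;	/* number of blocks touched, 0 for an empty range */
};

/*
 * Blocks touched by the sector range [start, start + len), rounding the
 * start down and the end up to block boundaries.
 */
struct toy_block_range toy_round_out(uint64_t start, uint64_t len,
				     uint64_t sectors_per_block)
{
	struct toy_block_range r = { 0, 0 };

	if (len == 0 || sectors_per_block == 0)
		return r;

	r.first = start / sectors_per_block;
	r.count = (start + len - 1) / sectors_per_block - r.first + 1;
	return r;
}
```

For example, with 128-sector blocks an unaligned request for sectors 100-199
touches blocks 0 and 1, so both would be allocated.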

As for discard, DM thinp silently drops the part of a discard at the
beginning and/or end that isn't aligned on a thinp blocksize boundary.  That
is, DM thinp doesn't unmap the corresponding thinp block, because the
partially discarded block is still considered used.  But thinp still passes
down the appropriate discard for that subset of the still-mapped thinp block
to the underlying storage (if discard passdown is enabled on the DM
thin-pool).
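
The discard case is the opposite rounding: only blocks that lie entirely
inside the range can be unmapped, and the unaligned head and tail remain
mapped.  A rough sketch, assuming a hypothetical toy_round_in() helper (not
the dm-thin code) and sizes in sectors:

```c
#include <stdint.h>

/* Hypothetical result type describing how a discard is split up. */
struct toy_discard_split {
	uint64_t head_sectors;	/* unaligned piece before the first whole block */
	uint64_t first_block;	/* first block that can be unmapped */
	uint64_t block_count;	/* number of whole blocks that can be unmapped */
	uint64_t tail_sectors;	/* unaligned piece after the last whole block */
};

struct toy_discard_split toy_round_in(uint64_t start, uint64_t len,
				      uint64_t sectors_per_block)
{
	struct toy_discard_split s = { 0, 0, 0, 0 };
	uint64_t end = start + len;
	uint64_t first, last;

	if (len == 0 || sectors_per_block == 0)
		return s;

	/* First block starting at or after start, rounded up. */
	first = (start + sectors_per_block - 1) / sectors_per_block;
	/* One past the last block ending at or before end, rounded down. */
	last = end / sectors_per_block;

	if (first >= last) {
		/* No whole block is covered; report it all as one partial piece. */
		s.head_sectors = len;
		return s;
	}

	s.head_sectors = first * sectors_per_block - start;
	s.first_block = first;
	s.block_count = last - first;
	s.tail_sectors = end - last * sectors_per_block;
	return s;
}
```

With 128-sector blocks, a discard of sectors 100-399 would unmap blocks 1 and
2 in this model, leaving a 28-sector head and a 16-sector tail that stay
mapped and are only passed down when passdown is enabled.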
 
> But, before that, I'll push out some new fallocate patches for -rc3.
> 
> > > >  		return -EINVAL;
> > > 
> > > Noted.  Will update the original patch.
> > 
> > BTW, I just noticed your "block: require write_same and discard requests
> > align to logical block size" -- doesn't look right.
> 
> What happens if we pass a request to thinp that isn't aligned to
> minimum_io_size?  Does it reject the command?

I hope I answered that above just now.

SO.. it seems we can avoid the mess of worrying about minimum_io_size
alignment (both for this fallocate interface and discard).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-14 15:10           ` Mike Snitzer
@ 2016-04-14 16:23             ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-14 16:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, dm-devel, xfs, linux-block, Joe Thornber, linux-fsdevel

On Thu, Apr 14, 2016 at 11:10:14AM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2016 at  4:41pm -0400,
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2016 at 02:33:52PM -0400, Brian Foster wrote:
> > > On Wed, Apr 13, 2016 at 10:44:42AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Apr 12, 2016 at 12:42:48PM -0400, Brian Foster wrote:
> > > > > From: Joe Thornber <ejt@redhat.com>
> > > > > 
> > > > > Experimental reserve interface for XFS guys to play with.
> > > > > 
> > > > > I have big reservations (no pun intended) about this patch.
> > > > > 
> > > > > [BF:
> > > > >  - Support for reservation reduction.
> > > > >  - Support for space provisioning.
> > > > >  - Condensed to a single function.]
> > > > > 
> > > > > Not-Signed-off-by: Joe Thornber <ejt@redhat.com>
> > > > > Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > > > ---
> > > > >  drivers/md/dm-thin.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > > > >  1 file changed, 171 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > > index 92237b6..32bc5bd 100644
> > > > > --- a/drivers/md/dm-thin.c
> > > > > +++ b/drivers/md/dm-thin.c
> > > ...
> > > > > @@ -4271,6 +4343,94 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > > > >  	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
> > > > >  }
> > > > >  
> > > > > +static int thin_provision_space(struct dm_target *ti, sector_t offset,
> > > > > +				sector_t len, sector_t *res)
> > > > > +{
> > > > > +	struct thin_c *tc = ti->private;
> > > > > +	struct pool *pool = tc->pool;
> > > > > +	sector_t end;
> > > > > +	dm_block_t pblock;
> > > > > +	dm_block_t vblock;
> > > > > +	int error;
> > > > > +	struct dm_thin_lookup_result lookup;
> > > > > +
> > > > > +	if (!is_factor(offset, pool->sectors_per_block))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (!len || !is_factor(len, pool->sectors_per_block))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (res && !is_factor(*res, pool->sectors_per_block))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	end = offset + len;
> > > > > +
> > > > > +	while (offset < end) {
> > > > > +		vblock = offset;
> > > > > +		do_div(vblock, pool->sectors_per_block);
> > > > > +
> > > > > +		error = dm_thin_find_block(tc->td, vblock, true, &lookup);
> > > > > +		if (error == 0)
> > > > > +			goto next;
> > > > > +		if (error != -ENODATA)
> > > > > +			return error;
> > > > > +
> > > > > +		error = alloc_data_block(tc, &pblock);
> > > > 
> > > > So this means that if fallocate wants to BDEV_RES_PROVISION N blocks, it must
> > > > first increase the reservation (BDEV_RES_MOD) by N blocks to avoid using up
> > > > space that was previously reserved by some other caller.  I think?
> > > > 
> > > 
> > > Yes, assuming this is being called from a filesystem using the
> > > reservation mechanism.
> 
> Brian, I need to circle back with you to understand why XFS even needs
> reservation as opposed to just using something like fallocate (which
> would provision the space before you actually initiate the IO that would
> use it).  But we can discuss that in person and then report back to the
> list if it makes it easier...
>  

The primary reason is delayed allocation. Buffered writes to the fs copy
data into the pagecache before the physical space has been allocated.
E.g., we only modify the free blocks counters at write() time in order
to guarantee that we have space somewhere in the fs. The physical
extents aren't allocated until later at writeback time.

So reservation from dm-thin basically extends the mechanism to also
guarantee that the underlying thin volume has space for writes that
we've received but haven't written back yet.
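
A toy model of that two-phase accounting, as a sketch only. `struct
toy_pool` and the `toy_*()` helpers are hypothetical names and do not
correspond to the bdev callback or the BDEV_RES_* modes in this series.
The point is that -ENOSPC can only surface at write() time, while
writeback draws from space that was already set aside.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a thin pool's space accounting, in blocks. */
struct toy_pool {
	uint64_t free;		/* neither reserved nor allocated */
	uint64_t reserved;	/* promised to a caller, not yet allocated */
	uint64_t allocated;	/* backed by real pool blocks */
};

/* write() time: set space aside, or fail early with -ENOSPC. */
static int toy_reserve(struct toy_pool *p, uint64_t nblocks)
{
	if (nblocks > p->free)
		return -ENOSPC;
	p->free -= nblocks;
	p->reserved += nblocks;
	return 0;
}

/* Writeback time: convert reservation into allocation; cannot hit ENOSPC. */
static void toy_writeback(struct toy_pool *p, uint64_t nblocks)
{
	assert(nblocks <= p->reserved);
	p->reserved -= nblocks;
	p->allocated += nblocks;
}

/* Hand back reservation that a worst-case estimate did not need. */
static void toy_release(struct toy_pool *p, uint64_t nblocks)
{
	assert(nblocks <= p->reserved);
	p->reserved -= nblocks;
	p->free += nblocks;
}
```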

> > > > > +		if (error)
> > > > > +			return error;
> > > > > +
> > > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > > 
> > > > Having reserved and mapped blocks, what happens when we try to read them?
> > > > Do we actually get zeroes, or does the read go straight through to whatever
> > > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > > thin device.
> > > > 
> > > 
> > > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > > wasn't really on my mind when doing this. I was simply trying to cobble
> > > together what I could to facilitate making progress on the fs parts
> > > (e.g., I just needed a call that allocated blocks and consumed
> > > reservation in the process).
> > > 
> > > Skimming through the dm-thin code, it looks like a (configurable) block
> > > zeroing mechanism can be triggered from somewhere around
> > > provision_block()->schedule_zero(), depending on whether the incoming
> > > write overwrites the newly allocated block. If that's the case, then I
> > > suspect that means reads would just fall through to the block and return
> > > whatever was on disk. This code would probably need to tie into that
> > > zeroing mechanism one way or another to deal with that issue. (Though
> > > somebody who actually knows something about dm-thin should verify that.
> > > :)
> > > 
> > 
> > BTW, if that mechanism is in fact doing I/O, that might not be the
> > appropriate solution for fallocate. Perhaps we'd have to consider an
> > unwritten flag or some such in dm-thin, if possible.
> 
> DM thinp defaults to enabling 'zero_new_blocks' (can be disabled using
> the 'skip_block_zeroing' feature when loading the DM table for the
> thin-pool).  With block-zeroing any blocks that are provisioned _will_
> be overwritten with zeroes (using dm-kcopyd which is trained to use
> WRITE_SAME if supported).
> 

Ok, thanks.

> But yeah, for fallocate.. certainly not something we want as it defeats
> the point of fallocate being cheap.
> 

Indeed.

> So we probably would need a flag comparable to the
> ext4-stale-flag-that-shall-not-be-named ;)
> 

Any chance to support an unwritten flag for all blocks that are
allocated via fallocate? E.g., subsequent reads detect the flag and
return zeroes as if the block wasn't there and a subsequent write clears
the flag (doing any partial block zeroing that might be necessary as
well).
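
A rough sketch of the semantics being asked for, assuming a per-block
flag. `struct toy_block`, `toy_read()` and `toy_write()` are
hypothetical and say nothing about how dm-thin metadata would actually
store such a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16	/* tiny block size, for illustration only */

/* Hypothetical per-block state; not dm-thin's actual metadata. */
struct toy_block {
	bool unwritten;				/* provisioned, never written */
	unsigned char data[TOY_BLOCK_SIZE];	/* may hold stale contents */
};

/* Reads of an unwritten block return zeroes, never the stale data. */
static void toy_read(const struct toy_block *b, unsigned char *out)
{
	if (b->unwritten)
		memset(out, 0, TOY_BLOCK_SIZE);
	else
		memcpy(out, b->data, TOY_BLOCK_SIZE);
}

/* The first write zeroes the untouched parts and clears the flag. */
static void toy_write(struct toy_block *b, size_t off,
		      const unsigned char *src, size_t len)
{
	assert(off <= TOY_BLOCK_SIZE && len <= TOY_BLOCK_SIZE - off);
	if (b->unwritten) {
		memset(b->data, 0, off);
		memset(b->data + off + len, 0, TOY_BLOCK_SIZE - off - len);
		b->unwritten = false;
	}
	memcpy(b->data + off, src, len);
}
```

Provisioning stays cheap in this model because no zeroing I/O happens
until the first write, and even then only for a partial-block write.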

Brian

> Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-14 16:23             ` Brian Foster
@ 2016-04-14 20:18               ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2016-04-14 20:18 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Joe Thornber, xfs, linux-block, dm-devel, linux-fsdevel

On Thu, Apr 14 2016 at 12:23pm -0400,
Brian Foster <bfoster@redhat.com> wrote:

> On Thu, Apr 14, 2016 at 11:10:14AM -0400, Mike Snitzer wrote:
> > 
> > Brian, I need to circle back with you to understand why XFS even needs
> > reservation as opposed to just using something like fallocate (which
> > would provision the space before you actually initiate the IO that would
> > use it).  But we can discuss that in person and then report back to the
> > list if it makes it easier...
> >  
> 
> The primary reason is delayed allocation. Buffered writes to the fs copy
> data into the pagecache before the physical space has been allocated.
> E.g., we only modify the free blocks counters at write() time in order
> to guarantee that we have space somewhere in the fs. The physical
> extents aren't allocated until later at writeback time.
> 
> So reservation from dm-thin basically extends the mechanism to also
> guarantee that the underlying thin volume has space for writes that
> we've received but haven't written back yet.

OK, so even if/when we have bdev_fallocate support that would be more
rigid than XFS would like.

As you've said, the XFS-established reservation is larger than is really
needed.  Whereas regularly provisioning more than is actually needed is
a recipe for disaster.

> > > > > > +		if (error)
> > > > > > +			return error;
> > > > > > +
> > > > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > > > 
> > > > > Having reserved and mapped blocks, what happens when we try to read them?
> > > > > Do we actually get zeroes, or does the read go straight through to whatever
> > > > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > > > thin device.
> > > > > 
> > > > 
> > > > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > > > wasn't really on my mind when doing this. I was simply trying to cobble
> > > > together what I could to facilitate making progress on the fs parts
> > > > (e.g., I just needed a call that allocated blocks and consumed
> > > > reservation in the process).
> > > > 
> > > > Skimming through the dm-thin code, it looks like a (configurable) block
> > > > zeroing mechanism can be triggered from somewhere around
> > > > provision_block()->schedule_zero(), depending on whether the incoming
> > > > write overwrites the newly allocated block. If that's the case, then I
> > > > suspect that means reads would just fall through to the block and return
> > > > whatever was on disk. This code would probably need to tie into that
> > > > zeroing mechanism one way or another to deal with that issue. (Though
> > > > somebody who actually knows something about dm-thin should verify that.
> > > > :)
> > > > 
> > > 
> > > BTW, if that mechanism is in fact doing I/O, that might not be the
> > > appropriate solution for fallocate. Perhaps we'd have to consider an
> > > unwritten flag or some such in dm-thin, if possible.
> > 
> > DM thinp defaults to enabling 'zero_new_blocks' (can be disabled using
> > the 'skip_block_zeroing' feature when loading the DM table for the
> > thin-pool).  With block-zeroing any blocks that are provisioned _will_
> > be overwritten with zeroes (using dm-kcopyd which is trained to use
> > WRITE_SAME if supported).
> > 
> 
> Ok, thanks.
> 
> > But yeah, for fallocate.. certainly not something we want as it defeats
> > the point of fallocate being cheap.
> > 
> 
> Indeed.
> 
> > So we probably would need a flag comparable to the
> > ext4-stale-flag-that-shall-not-be-named ;)
> > 
> 
> Any chance to support an unwritten flag for all blocks that are
> allocated via fallocate? E.g., subsequent reads detect the flag and
> return zeroes as if the block wasn't there and a subsequent write clears
> the flag (doing any partial block zeroing that might be necessary as
> well).

Yeah, I've already started talking to Joe about doing exactly that.
Without it we cannot securely provide fallocate support in DM thinp.

I'll keep discussing with Joe... he doesn't like this requirement but
we'll work through it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
  2016-04-14 20:18               ` Mike Snitzer
@ 2016-04-15 11:48                 ` Brian Foster
  -1 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-15 11:48 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, dm-devel, xfs, linux-block, Joe Thornber, linux-fsdevel

On Thu, Apr 14, 2016 at 04:18:12PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2016 at 12:23pm -0400,
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Thu, Apr 14, 2016 at 11:10:14AM -0400, Mike Snitzer wrote:
> > > 
> > > Brian, I need to circle back with you to understand why XFS even needs
> > > reservation as opposed to just using something like fallocate (which
> > > would provision the space before you actually initiate the IO that would
> > > use it).  But we can discuss that in person and then report back to the
> > > list if it makes it easier...
> > >  
> > 
> > The primary reason is delayed allocation. Buffered writes to the fs copy
> > data into the pagecache before the physical space has been allocated.
> > E.g., we only modify the free blocks counters at write() time in order
> > to guarantee that we have space somewhere in the fs. The physical
> > extents aren't allocated until later at writeback time.
> > 
> > So reservation from dm-thin basically extends the mechanism to also
> > guarantee that the underlying thin volume has space for writes that
> > we've received but haven't written back yet.
> 
> OK, so even if/when we have bdev_fallocate support that would be more
> rigid than XFS would like.
> 

Yeah, fallocate is still useful on its own. For example, we could still
invoke bdev_fallocate() in response to userspace fallocate to ensure the
space is physically allocated (i.e., provide the no -ENOSPC guarantee).

That just doesn't help us avoid the overprovisioned situation where we
have data in pagecache and nowhere to write it back to (w/o setting the
volume read-only). The only way I'm aware of to handle that is to
account for the space at write time.

> As you've said, the XFS established reservation is larger than is really
> needed.  Whereas regularly provisioning more than is actually needed is
> a recipe for disaster.
> 

Indeed, this prototype ties right into XFS' existing transaction
reservation mechanism. It basically adds bdev reservation to the blocks
that we already locally reserve during creation of a transaction or a
delalloc write. This already does worst case reservation. What's
interesting is that the worst case 1-1 fs-dm block reservation doesn't
appear to be as much of a functional impediment as anticipated. I think
that's because XFS is already designed for such worst case reservations
and has mechanisms in place to handle it in appropriate situations.

For example, if an incoming write fails to reserve blocks due to too
much outstanding reservation, xfs_file_buffered_aio_write() will do
things like flush inodes and run our post-eof (for speculatively
preallocated space) scanner to reclaim some of that space and retry
before it gives up.
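
The reserve/reclaim/retry shape of that path can be sketched as
follows. This is a simplified model with hypothetical names
(`struct toy_fs`, `toy_reclaim()`, etc.), not the actual logic in
xfs_file_buffered_aio_write(), and it collapses the separate flush and
post-eof scan steps into a single reclaim pass.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical accounting state, in blocks. */
struct toy_fs {
	uint64_t free;		/* available to new reservations */
	uint64_t speculative;	/* reclaimable post-EOF preallocation */
};

static int toy_try_reserve(struct toy_fs *fs, uint64_t want)
{
	assert(fs);
	if (want > fs->free)
		return -ENOSPC;
	fs->free -= want;
	return 0;
}

/* Stand-in for flushing inodes and trimming post-EOF preallocation. */
static void toy_reclaim(struct toy_fs *fs)
{
	fs->free += fs->speculative;
	fs->speculative = 0;
}

/* Reserve; on -ENOSPC reclaim once and retry before giving up. */
static int toy_reserve_with_retry(struct toy_fs *fs, uint64_t want)
{
	int error = toy_try_reserve(fs, want);

	if (error != -ENOSPC)
		return error;
	toy_reclaim(fs);
	return toy_try_reserve(fs, want);
}
```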

I'm sure there's a performance issue in there somewhere when that whole
sequence occurs more frequently than normal due to the amplified
reservation (a 4k fs block reserving 64k or more in the dm vol), but I
don't think that's necessarily a disaster scenario.
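
To put numbers on the amplification, here is a small sketch assuming a
4k fs block and a 64k thin-pool block, with hypothetical helper names.
The worst case charges one full pool block per fs block; the best case
assumes the fs blocks are contiguous and aligned to the pool block
size.

```c
#include <assert.h>
#include <stdint.h>

/* Worst case: every fs block lands in a different thin-pool block. */
static uint64_t worst_case_bytes(uint64_t fs_blocks, uint64_t dm_block_size)
{
	return fs_blocks * dm_block_size;
}

/* Best case: the fs blocks are contiguous and pool-block aligned. */
static uint64_t best_case_bytes(uint64_t fs_blocks, uint64_t fs_block_size,
				uint64_t dm_block_size)
{
	uint64_t bytes = fs_blocks * fs_block_size;

	assert(dm_block_size != 0);
	return ((bytes + dm_block_size - 1) / dm_block_size) * dm_block_size;
}
```

For 100 fs blocks (400k of data) the worst case reserves 6553600 bytes
while the best case needs 458752, so under these assumptions the
reservation can be up to 16x the data actually written.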

Brian


> > > > > > > +		if (error)
> > > > > > > +			return error;
> > > > > > > +
> > > > > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > > > > 
> > > > > > Having reserved and mapped blocks, what happens when we try to read them?
> > > > > > Do we actually get zeroes, or does the read go straight through to whatever
> > > > > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > > > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > > > > thin device.
> > > > > > 
> > > > > 
> > > > > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > > > > wasn't really on my mind when doing this. I was simply trying to cobble
> > > > > together what I could to facilitate making progress on the fs parts
> > > > > (e.g., I just needed a call that allocated blocks and consumed
> > > > > reservation in the process).
> > > > > 
> > > > > Skimming through the dm-thin code, it looks like a (configurable) block
> > > > > zeroing mechanism can be triggered from somewhere around
> > > > > provision_block()->schedule_zero(), depending on whether the incoming
> > > > > write overwrites the newly allocated block. If that's the case, then I
> > > > > suspect that means reads would just fall through to the block and return
> > > > > whatever was on disk. This code would probably need to tie into that
> > > > > zeroing mechanism one way or another to deal with that issue. (Though
> > > > > somebody who actually knows something about dm-thin should verify that.
> > > > > :)
> > > > > 
> > > > 
> > > > BTW, if that mechanism is in fact doing I/O, that might not be the
> > > > appropriate solution for fallocate. Perhaps we'd have to consider an
> > > > unwritten flag or some such in dm-thin, if possible.
> > > 
> > > DM thinp defaults to enabling 'zero_new_blocks' (can be disabled using
> > > the 'skip_block_zeroing' feature when loading the DM table for the
> > > thin-pool).  With block-zeroing any blocks that are provisioned _will_
> > > be overwritten with zeroes (using dm-kcopyd which is trained to use
> > > WRITE_SAME if supported).
> > > 
> > 
> > Ok, thanks.
> > 
> > > But yeah, for fallocate.. certainly not something we want as it defeats
> > > the point of fallocate being cheap.
> > > 
> > 
> > Indeed.
> > 
> > > So we probably would need a flag comparable to the
> > > ext4-stale-flag-that-shall-not-be-named ;)
> > > 
> > 
> > Any chance to support an unwritten flag for all blocks that are
> > allocated via fallocate? E.g., subsequent reads detect the flag and
> > return zeroes as if the block wasn't there and a subsequent write clears
> > the flag (doing any partial block zeroing that might be necessary as
> > well).
> 
> Yeah, I've already started talking to Joe about doing exactly that.
> Without it we cannot securely provide fallocate support in DM thinp.
> 
> I'll keep discussing with Joe... he doesn't like this requirement but
> we'll work through it.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [RFC v2 PATCH 05/10] dm thin: add methods to set and get reserved space
@ 2016-04-15 11:48                 ` Brian Foster
  0 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2016-04-15 11:48 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, Joe Thornber, xfs, linux-block, dm-devel, linux-fsdevel

On Thu, Apr 14, 2016 at 04:18:12PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2016 at 12:23pm -0400,
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Thu, Apr 14, 2016 at 11:10:14AM -0400, Mike Snitzer wrote:
> > > 
> > > Brian, I need to circle back with you to understand why XFS even needs
> > > reservation as opposed to just using something like fallocate (which
> > > would provision the space before you actually initiate the IO that would
> > > use it).  But we can discuss that in person and then report back to the
> > > list if it makes it easier...
> > >  
> > 
> > The primary reason is delayed allocation. Buffered writes to the fs copy
> > data into the pagecache before the physical space has been allocated.
> > E.g., we only modify the free blocks counters at write() time in order
> > to guarantee that we have space somewhere in the fs. The physical
> > extents aren't allocated until later at writeback time.
> > 
> > So reservation from dm-thin basically extends the mechanism to also
> > guarantee that the underlying thin volume has space for writes that
> > we've received but haven't written back yet.
> 
> OK, so even if/when we have bdev_fallocate support that would be more
> rigid than XFS would like.
> 

Yeah, fallocate is still useful on its own. For example, we could still
invoke bdev_fallocate() in response to userspace fallocate to ensure the
space is physically allocated (i.e., provide the no -ENOSPC guarantee).

That just doesn't help us avoid the overprovisioned situation where we
have data in pagecache and nowhere to write it back to (w/o setting the
volume read-only). The only way I'm aware of to handle that is to
account for the space at write time.
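
To make the write-time accounting concrete, here is a minimal userspace
sketch of the idea, not kernel code. The struct and function names
(toy_pool, toy_reserve, toy_writeback) are hypothetical and do not come
from the patches. It assumes a single pool of blocks where write() must
reserve space before dirtying pagecache, and writeback later converts
that reservation into allocated blocks.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical. */
struct toy_pool {
	uint64_t free;      /* blocks neither reserved nor allocated */
	uint64_t reserved;  /* blocks promised to dirty pagecache data */
	uint64_t allocated; /* blocks physically provisioned */
};

/* write() time: account for the space up front, or fail with -ENOSPC. */
int toy_reserve(struct toy_pool *p, uint64_t nblocks)
{
	if (nblocks > p->free)
		return -ENOSPC;
	p->free -= nblocks;
	p->reserved += nblocks;
	return 0;
}

/*
 * Writeback time: convert reservation into real blocks. This step has
 * no failure path because the space was already set aside at write().
 */
void toy_writeback(struct toy_pool *p, uint64_t nblocks)
{
	assert(nblocks <= p->reserved);
	p->reserved -= nblocks;
	p->allocated += nblocks;
}
```

The point of the split is that the only place -ENOSPC can surface is
toy_reserve(), i.e. while the caller can still be told about it, rather
than at writeback when the data is already sitting in pagecache.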

> As you've said, the XFS established reservation is larger than is really
> needed.  Whereas regularly provisioning more than is actually needed is
> a recipe for disaster.
> 

Indeed, this prototype ties right into XFS' existing transaction
reservation mechanism. It basically adds bdev reservation to the blocks
that we already locally reserve during creation of a transaction or a
delalloc write. This already does worst case reservation. What's
interesting is that the worst case 1-1 fs-dm block reservation doesn't
appear to be as much of a functional impediment as anticipated. I think
that's because XFS is already designed for such worst case reservations
and has mechanisms in place to handle it in appropriate situations.

For example, if an incoming write fails to reserve blocks due to too
much outstanding reservation, xfs_file_buffered_aio_write() will do
things like flush inodes and run our post-eof (for speculatively
preallocated space) scanner to reclaim some of that space and retry
before it gives up.
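
The reserve, reclaim, retry pattern described above can be sketched
roughly as follows. This is a simplified userspace model under stated
assumptions, not the actual xfs_file_buffered_aio_write() logic: the
names (toy_space, toy_try_reserve, toy_reclaim, toy_reserve_with_retry)
are hypothetical, and the reclaim step simply stands in for flushing
inodes and scanning for post-eof blocks.

```c
#include <errno.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical. */
struct toy_space {
	uint64_t free;        /* blocks available to reserve */
	uint64_t reclaimable; /* e.g. speculative post-eof preallocation */
};

int toy_try_reserve(struct toy_space *s, uint64_t nblocks)
{
	if (nblocks > s->free)
		return -ENOSPC;
	s->free -= nblocks;
	return 0;
}

/* Stand-in for flushing inodes and running the post-eof scanner. */
void toy_reclaim(struct toy_space *s)
{
	s->free += s->reclaimable;
	s->reclaimable = 0;
}

/* Try once; on -ENOSPC reclaim what we can and retry before giving up. */
int toy_reserve_with_retry(struct toy_space *s, uint64_t nblocks)
{
	int error = toy_try_reserve(s, nblocks);

	if (error != -ENOSPC)
		return error;
	toy_reclaim(s);
	return toy_try_reserve(s, nblocks);
}
```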

I'm sure there's a performance issue in there somewhere when that whole
sequence occurs more frequently than normal due to the amplified
reservation (a 4k fs block reserving 64k or more in the dm vol), but I
don't think that's necessarily a disaster scenario.
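
As a rough illustration of the amplification, assuming the worst case
where every fs block lands in a different dm-thin block, the arithmetic
looks like the sketch below. The function names are hypothetical and the
block sizes are just the 4k/64k figures mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes reserved in the thin pool for nr_fs_blocks fs blocks, assuming
 * a worst case 1-1 fs-dm block reservation (one dm block per fs block).
 */
uint64_t toy_reserved_bytes(uint64_t nr_fs_blocks, uint64_t dm_block_size)
{
	return nr_fs_blocks * dm_block_size;
}

/* Ratio of bytes reserved in the pool to bytes the fs asked for. */
uint64_t toy_amplification(uint64_t fs_block_size, uint64_t dm_block_size)
{
	assert(fs_block_size > 0 && dm_block_size % fs_block_size == 0);
	return dm_block_size / fs_block_size;
}
```

With 4096 byte fs blocks and 65536 byte dm blocks the ratio is 16, so a
10 block delalloc reservation (40960 bytes of data) would hold 655360
bytes of pool space until the extra is released.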

Brian


> > > > > > > +		if (error)
> > > > > > > +			return error;
> > > > > > > +
> > > > > > > +		error = dm_thin_insert_block(tc->td, vblock, pblock);
> > > > > > 
> > > > > > Having reserved and mapped blocks, what happens when we try to read them?
> > > > > > Do we actually get zeroes, or does the read go straight through to whatever
> > > > > > happens to be in the disk blocks?  I don't think it's correct that we could
> > > > > > BDEV_RES_PROVISION and end up with stale credit card numbers from some other
> > > > > > thin device.
> > > > > > 
> > > > > 
> > > > > Agree, but I'm not really sure how this works in thinp tbh. fallocate
> > > > > wasn't really on my mind when doing this. I was simply trying to cobble
> > > > > together what I could to facilitate making progress on the fs parts
> > > > > (e.g., I just needed a call that allocated blocks and consumed
> > > > > reservation in the process).
> > > > > 
> > > > > Skimming through the dm-thin code, it looks like a (configurable) block
> > > > > zeroing mechanism can be triggered from somewhere around
> > > > > provision_block()->schedule_zero(), depending on whether the incoming
> > > > > write overwrites the newly allocated block. If that's the case, then I
> > > > > suspect that means reads would just fall through to the block and return
> > > > > whatever was on disk. This code would probably need to tie into that
> > > > > zeroing mechanism one way or another to deal with that issue. (Though
> > > > > somebody who actually knows something about dm-thin should verify that.
> > > > > :)
> > > > > 
> > > > 
> > > > BTW, if that mechanism is in fact doing I/O, that might not be the
> > > > appropriate solution for fallocate. Perhaps we'd have to consider an
> > > > unwritten flag or some such in dm-thin, if possible.
> > > 
> > > DM thinp defaults to enabling 'zero_new_blocks' (can be disabled using
> > > the 'skip_block_zeroing' feature when loading the DM table for the
> > > thin-pool).  With block-zeroing any blocks that are provisioned _will_
> > > be overwritten with zeroes (using dm-kcopyd which is trained to use
> > > WRITE_SAME if supported).
> > > 
> > 
> > Ok, thanks.
> > 
> > > But yeah, for fallocate.. certainly not something we want as it defeats
> > > the point of fallocate being cheap.
> > > 
> > 
> > Indeed.
> > 
> > > So we probably would need a flag comparable to the
> > > ext4-stale-flag-that-shall-not-be-named ;)
> > > 
> > 
> > Any chance to support an unwritten flag for all blocks that are
> > allocated via fallocate? E.g., subsequent reads detect the flag and
> > return zeroes as if the block wasn't there and a subsequent write clears
> > the flag (doing any partial block zeroing that might be necessary as
> > well).
> 
> Yeah, I've already started talking to Joe about doing exactly that.
> Without it we cannot securely provide fallocate support in DM thinp.
> 
> I'll keep discussing with Joe... he doesn't like this requirement but
> we'll work through it.
> 



end of thread, other threads:[~2016-04-15 11:48 UTC | newest]

Thread overview: 54+ messages
2016-04-12 16:42 [RFC v2 PATCH 00/10] dm-thin/xfs: prototype a block reservation allocation model Brian Foster
2016-04-12 16:42 ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 01/10] xfs: refactor xfs_reserve_blocks() to handle ENOSPC correctly Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 02/10] xfs: replace xfs_mod_fdblocks() bool param with flags Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 03/10] block: add block_device_operations methods to set and get reserved space Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-14  0:32   ` Dave Chinner
2016-04-14  0:32     ` Dave Chinner
2016-04-12 16:42 ` [RFC v2 PATCH 04/10] dm: add " Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 05/10] dm thin: " Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-13 17:44   ` Darrick J. Wong
2016-04-13 17:44     ` Darrick J. Wong
2016-04-13 18:33     ` Brian Foster
2016-04-13 18:33       ` Brian Foster
2016-04-13 20:41       ` Brian Foster
2016-04-13 20:41         ` Brian Foster
2016-04-13 21:01         ` Darrick J. Wong
2016-04-13 21:01           ` Darrick J. Wong
2016-04-14 15:10         ` Mike Snitzer
2016-04-14 15:10           ` Mike Snitzer
2016-04-14 16:23           ` Brian Foster
2016-04-14 16:23             ` Brian Foster
2016-04-14 20:18             ` Mike Snitzer
2016-04-14 20:18               ` Mike Snitzer
2016-04-15 11:48               ` Brian Foster
2016-04-15 11:48                 ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 06/10] xfs: thin block device reservation mechanism Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 07/10] xfs: adopt a reserved allocation model on dm-thin devices Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 08/10] xfs: handle bdev reservation ENOSPC correctly from XFS reserved pool Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 09/10] xfs: support no block reservation transaction mode Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 16:42 ` [RFC v2 PATCH 10/10] xfs: use contiguous bdev reservation for file preallocation Brian Foster
2016-04-12 16:42   ` Brian Foster
2016-04-12 20:04 ` [RFC PATCH] block: wire blkdev_fallocate() to block_device_operations' reserve_space Mike Snitzer
2016-04-12 20:04   ` Mike Snitzer
2016-04-12 20:39   ` Darrick J. Wong
2016-04-12 20:39     ` Darrick J. Wong
2016-04-12 20:46     ` Mike Snitzer
2016-04-12 20:46       ` Mike Snitzer
2016-04-12 22:25       ` Darrick J. Wong
2016-04-12 22:25         ` Darrick J. Wong
2016-04-12 21:04     ` Mike Snitzer
2016-04-12 21:04       ` Mike Snitzer
2016-04-13  0:12       ` Darrick J. Wong
2016-04-13  0:12         ` Darrick J. Wong
2016-04-14 15:18         ` Mike Snitzer
2016-04-14 15:18           ` Mike Snitzer
