All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4] btrfs: fix automatic blockgroup remove + discard
@ 2015-06-11 15:20 jeffm
  2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: jeffm @ 2015-06-11 15:20 UTC (permalink / raw)
  To: linux-btrfs

The automatic block group removal patch introduced some regressions
in how discards are handled.

1/ FITRIM only iterates over block groups on disk - removed block groups
   won't be trimmed.
2/ Clearing the dirty bit from extents in removed block groups means that
   those extents won't be discarded when the block group is removed.
3/ More of a UI wart: We don't wait on block groups to be removed during
   read-only remount or fs umount. This results in block groups that
   /should/ have been discarded on thin provisioned storage hanging around
   until the file system is mounted read-write again.

The following patches address these problems by:
1/ Iterating over block groups on disk and then iterating over free space.
   This is consistent with how other file systems handle FITRIM.
2/ Putting removed block groups on a list so that they are automatically
   discarded during btrfs_finish_extent_commit after transaction commit.
   Note: This may still leave undiscarded space on disk if the system
   crashes after transaction commit but before discard. The file system
   itself will be compeltely consistent, but the user will need to trim
   manually.
3/ Simple: We call btrfs_delete_unused_bgs explicitly during ro-remount
   and umount.
4/ Skipping over blocks that contain superblocks during discard.

Changelog:
v1->v2
- -odiscard
 - Fix ordering to ensure that we dont' discard extents freed in an
    uncommitted transaction.
- FITRIM
  - Don't start a transaction so the entire run is transactionless
  - The loop can be interrupted while waiting on the chunk mutex and
    after the discard has completed.
  - The only lock held for the duration is the device_list_mutex.  The
    chunk mutex is take per loop iteration so normal operations should
    continue while we're running, even on large file systems.

v2->v3
- -odiscard
 - Factor out get/put block_group->trimming to ensure that cleanup always
   happens at the last reference drop.
 - Cleanup the free space cache on the last reference drop.
 - Use list_move instead of list_add in case of multiple adds.  We still
   issue a warning but we shouldn't fall over.
 - Explicitly delete unused block groups in close_ctree and ro-remount.
- FITRIM
 - Cleaned up pointer tricks that abused &NULL->member.
 - Take the commit_root_sem across loop iteration to protect against
   transaction commit moving the commit root.

v3->v4
- skip superblocks during discard

Please apply.

Thanks,

-Jeff


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 15:20 [PATCH v4] btrfs: fix automatic blockgroup remove + discard jeffm
@ 2015-06-11 15:20 ` jeffm
  2015-06-11 15:25   ` Jeff Mahoney
  2015-06-11 16:47   ` Filipe David Manana
  2015-06-11 15:21 ` [PATCH 2/4] btrfs: iterate over unused chunk space in FITRIM jeffm
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: jeffm @ 2015-06-11 15:20 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use.  We track
the superblocks in memory to ensure they don't get used by removing them
from the free space cache when we load a block group from disk.  Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe.  Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard.  The block groups were
still being released, but the blocks remained on disk.

In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range.  Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ec3acd..75d0226 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1884,10 +1884,47 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static int btrfs_issue_discard(struct block_device *bdev,
-				u64 start, u64 len)
+#define in_range(b, first, len)	((b) >= (first) && (b) < (first) + (len))
+static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
+			       u64 *discarded_bytes)
 {
-	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
+	u64 skipped = 0;
+	u64 bytes_left = len;
+	int ret = 0 , j;
+
+	*discarded_bytes = 0;
+
+	/* Skip any superblocks on this device. */
+	for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
+		u64 sb_offset = btrfs_sb_offset(j);
+		u64 size = sb_offset - start;
+
+		if (!in_range(sb_offset, start, bytes_left))
+			continue;
+
+		if (size) {
+			ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
+						   GFP_NOFS, 0);
+			if (!ret)
+				*discarded_bytes += size;
+			else if (ret != -EOPNOTSUPP)
+				return ret;
+		}
+
+		start = sb_offset + BTRFS_SUPER_INFO_SIZE;
+		if (start > len)
+			start = len;
+		bytes_left = len - start;
+		skipped += BTRFS_SUPER_INFO_SIZE;
+	}
+
+	if (bytes_left) {
+		ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
+					   GFP_NOFS, 0);
+		if (!ret)
+			*discarded_bytes += bytes_left;
+	}
+	return ret;
 }
 
 int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
@@ -1906,16 +1943,17 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 		struct btrfs_bio_stripe *stripe = bbio->stripes;
 		int i;
 
-
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			u64 bytes;
+
 			if (!stripe->dev->can_discard)
 				continue;
 
 			ret = btrfs_issue_discard(stripe->dev->bdev,
 						  stripe->physical,
-						  stripe->length);
+						  stripe->length, &bytes);
 			if (!ret)
-				discarded_bytes += stripe->length;
+				discarded_bytes += bytes;
 			else if (ret != -EOPNOTSUPP)
 				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] btrfs: iterate over unused chunk space in FITRIM
  2015-06-11 15:20 [PATCH v4] btrfs: fix automatic blockgroup remove + discard jeffm
  2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
@ 2015-06-11 15:21 ` jeffm
  2015-06-11 15:21 ` [PATCH 3/4] btrfs: explictly delete unused block groups in close_ctree and ro-remount jeffm
  2015-06-11 15:21 ` [PATCH 4/4] btrfs: add missing discards when unpinning extents with -o discard jeffm
  3 siblings, 0 replies; 14+ messages in thread
From: jeffm @ 2015-06-11 15:21 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used.  This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock.  That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls.  We release it
between searches to allow the file system to perform more-or-less
normally.  Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search.  This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.

v1->v2:
- Doesn't take a transaction, so the entire run is transactionless
- The loop can be interrupted while waiting on the chunk mutex
  and after the discard has completed.
- The only lock held for the duration is the device_list_mutex. The chunk
  mutex is taken per loop iteration, so normal operation should continue
  while we're running even on large file systems.

v2->v3:
- Cleaned up pointer tricks that abused &NULL->member.
- Take commit_root_sem across loop iteration to protect against transaction
  commit.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c     |  60 ++++++++++++++++++-----------
 fs/btrfs/volumes.h     |   3 ++
 3 files changed, 141 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75d0226..2c437ed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10112,10 +10112,99 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
 	return unpin_extent_range(root, start, end, false);
 }
 
+/*
+ * It used to be that old block groups would be left around forever.
+ * Iterating over them would be enough to trim unused space.  Since we
+ * now automatically remove them, we also need to iterate over unallocated
+ * space.
+ *
+ * We don't want a transaction for this since the discard may take a
+ * substantial amount of time.  We don't require that a transaction be
+ * running, but we do need to take a running transaction into account
+ * to ensure that we're not discarding chunks that were released in
+ * the current transaction.
+ *
+ * Holding the chunks lock will prevent other threads from allocating
+ * or releasing chunks, but it won't prevent a running transaction
+ * from committing and releasing the memory that the pending chunks
+ * list head uses.  For that, we need to take a reference to the
+ * transaction.
+ */
+static int btrfs_trim_free_extents(struct btrfs_device *device,
+				   u64 minlen, u64 *trimmed)
+{
+	u64 start = 0, len = 0;
+	int ret;
+
+	*trimmed = 0;
+
+	/* Not writeable = nothing to do. */
+	if (!device->writeable)
+		return 0;
+
+	/* No free space = nothing to do. */
+	if (device->total_bytes <= device->bytes_used)
+		return 0;
+
+	ret = 0;
+
+	while (1) {
+		struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
+		struct btrfs_transaction *trans;
+		u64 bytes;
+
+		ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
+		if (ret)
+			return ret;
+
+		down_read(&fs_info->commit_root_sem);
+
+		spin_lock(&fs_info->trans_lock);
+		trans = fs_info->running_transaction;
+		if (trans)
+			atomic_inc(&trans->use_count);
+		spin_unlock(&fs_info->trans_lock);
+
+		ret = find_free_dev_extent_start(trans, device, minlen, start,
+						 &start, &len);
+		if (trans)
+			btrfs_put_transaction(trans);
+
+		if (ret) {
+			up_read(&fs_info->commit_root_sem);
+			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret == -ENOSPC)
+				ret = 0;
+			break;
+		}
+
+		ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
+		up_read(&fs_info->commit_root_sem);
+		mutex_unlock(&fs_info->chunk_mutex);
+
+		if (ret)
+			break;
+
+		start += len;
+		*trimmed += bytes;
+
+		if (fatal_signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	return ret;
+}
+
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_group_cache *cache = NULL;
+	struct btrfs_device *device;
+	struct list_head *devices;
 	u64 group_trimmed;
 	u64 start;
 	u64 end;
@@ -10170,6 +10259,18 @@ int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 		cache = next_block_group(fs_info->tree_root, cache);
 	}
 
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	devices = &root->fs_info->fs_devices->alloc_list;
+	list_for_each_entry(device, devices, dev_alloc_list) {
+		ret = btrfs_trim_free_extents(device, range->minlen,
+					      &group_trimmed);
+		if (ret)
+			break;
+
+		trimmed += group_trimmed;
+	}
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
 	range->len = trimmed;
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 174f5e1..7fdde31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1051,15 +1051,18 @@ out:
 	return ret;
 }
 
-static int contains_pending_extent(struct btrfs_trans_handle *trans,
+static int contains_pending_extent(struct btrfs_transaction *transaction,
 				   struct btrfs_device *device,
 				   u64 *start, u64 len)
 {
+	struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
 	struct extent_map *em;
-	struct list_head *search_list = &trans->transaction->pending_chunks;
+	struct list_head *search_list = &fs_info->pinned_chunks;
 	int ret = 0;
 	u64 physical_start = *start;
 
+	if (transaction)
+		search_list = &transaction->pending_chunks;
 again:
 	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
@@ -1078,8 +1081,8 @@ again:
 			ret = 1;
 		}
 	}
-	if (search_list == &trans->transaction->pending_chunks) {
-		search_list = &trans->root->fs_info->pinned_chunks;
+	if (search_list != &fs_info->pinned_chunks) {
+		search_list = &fs_info->pinned_chunks;
 		goto again;
 	}
 
@@ -1088,12 +1091,13 @@ again:
 
 
 /*
- * find_free_dev_extent - find free space in the specified device
- * @device:	the device which we search the free space in
- * @num_bytes:	the size of the free space that we need
- * @start:	store the start of the free space.
- * @len:	the size of the free space. that we find, or the size of the max
- * 		free space if we don't find suitable free space
+ * find_free_dev_extent_start - find free space in the specified device
+ * @device:	  the device which we search the free space in
+ * @num_bytes:	  the size of the free space that we need
+ * @search_start: the position from which to begin the search
+ * @start:	  store the start of the free space.
+ * @len:	  the size of the free space. that we find, or the size
+ *		  of the max free space if we don't find suitable free space
  *
  * this uses a pretty simple search, the expectation is that it is
  * called very infrequently and that a given device has a small number
@@ -1107,9 +1111,9 @@ again:
  * But if we don't find suitable free space, it is used to store the size of
  * the max free space.
  */
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
-			 struct btrfs_device *device, u64 num_bytes,
-			 u64 *start, u64 *len)
+int find_free_dev_extent_start(struct btrfs_transaction *transaction,
+			       struct btrfs_device *device, u64 num_bytes,
+			       u64 search_start, u64 *start, u64 *len)
 {
 	struct btrfs_key key;
 	struct btrfs_root *root = device->dev_root;
@@ -1119,19 +1123,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
 	u64 max_hole_start;
 	u64 max_hole_size;
 	u64 extent_end;
-	u64 search_start;
 	u64 search_end = device->total_bytes;
 	int ret;
 	int slot;
 	struct extent_buffer *l;
 
-	/* FIXME use last free of some kind */
-
-	/* we don't want to overwrite the superblock on the drive,
-	 * so we make sure to start at an offset of at least 1MB
-	 */
-	search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
-
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -1192,7 +1188,7 @@ again:
 			 * Have to check before we set max_hole_start, otherwise
 			 * we could end up sending back this offset anyway.
 			 */
-			if (contains_pending_extent(trans, device,
+			if (contains_pending_extent(transaction, device,
 						    &search_start,
 						    hole_size)) {
 				if (key.offset >= search_start) {
@@ -1241,7 +1237,7 @@ next:
 	if (search_end > search_start) {
 		hole_size = search_end - search_start;
 
-		if (contains_pending_extent(trans, device, &search_start,
+		if (contains_pending_extent(transaction, device, &search_start,
 					    hole_size)) {
 			btrfs_release_path(path);
 			goto again;
@@ -1267,6 +1263,24 @@ out:
 	return ret;
 }
 
+int find_free_dev_extent(struct btrfs_trans_handle *trans,
+			 struct btrfs_device *device, u64 num_bytes,
+			 u64 *start, u64 *len)
+{
+	struct btrfs_root *root = device->dev_root;
+	u64 search_start;
+
+	/* FIXME use last free of some kind */
+
+	/*
+	 * we don't want to overwrite the superblock on the drive,
+	 * so we make sure to start at an offset of at least 1MB
+	 */
+	search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
+	return find_free_dev_extent_start(trans->transaction, device,
+					  num_bytes, search_start, start, len);
+}
+
 static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
 			  struct btrfs_device *device,
 			  u64 start, u64 *dev_extent_len)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..30918a8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
 int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
+int find_free_dev_extent_start(struct btrfs_transaction *transaction,
+			 struct btrfs_device *device, u64 num_bytes,
+			 u64 search_start, u64 *start, u64 *max_avail);
 int find_free_dev_extent(struct btrfs_trans_handle *trans,
 			 struct btrfs_device *device, u64 num_bytes,
 			 u64 *start, u64 *max_avail);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-11 15:20 [PATCH v4] btrfs: fix automatic blockgroup remove + discard jeffm
  2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
  2015-06-11 15:21 ` [PATCH 2/4] btrfs: iterate over unused chunk space in FITRIM jeffm
@ 2015-06-11 15:21 ` jeffm
  2015-06-11 15:21 ` [PATCH 4/4] btrfs: add missing discards when unpinning extents with -o discard jeffm
  3 siblings, 0 replies; 14+ messages in thread
From: jeffm @ 2015-06-11 15:21 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

The cleaner thread may already be sleeping by the time we enter
close_ctree.  If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard.  We also
explicitly remove unused block groups in the ro-remount path
for the same reason.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/disk-io.c |  9 +++++++++
 fs/btrfs/super.c   | 11 +++++++++++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..2e47fef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+		/*
+		 * If the cleaner thread is stopped and there are
+		 * block groups queued for removal, the deletion will be
+		 * skipped when we quit the cleaner thread.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(root->fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9e66f5e..2ccd8d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 
 		sb->s_flags |= MS_RDONLY;
 
+		/*
+		 * Setting MS_RDONLY will put the cleaner thread to
+		 * sleep at the next loop if it's already active.
+		 * If it's already asleep, we'll leave unused block
+		 * groups on disk until we're mounted read-write again
+		 * unless we clean them up here.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		btrfs_dev_replace_suspend_for_unmount(fs_info);
 		btrfs_scrub_cancel(fs_info);
 		btrfs_pause_balance(fs_info);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] btrfs: add missing discards when unpinning extents with -o discard
  2015-06-11 15:20 [PATCH v4] btrfs: fix automatic blockgroup remove + discard jeffm
                   ` (2 preceding siblings ...)
  2015-06-11 15:21 ` [PATCH 3/4] btrfs: explictly delete unused block groups in close_ctree and ro-remount jeffm
@ 2015-06-11 15:21 ` jeffm
  3 siblings, 0 replies; 14+ messages in thread
From: jeffm @ 2015-06-11 15:21 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit.  As a result, any extents that
would be freed when the block group is removed aren't discarded.  In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

v1->v2:
- Fix ordering to ensure that we don't discard extents freed in an
  uncommitted transaction.

v2->v3:
- Factor out get/put block_group->trimming to ensure that cleanup always
  happens at the last reference drop.
- Cleanup the free space cache on the last reference drop.
- Use list_move instead of list_add in case of multiple adds.  We still
  issue a warning, but we shouldn't fall over.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/ctree.h            |  3 ++
 fs/btrfs/extent-tree.c      | 69 ++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c | 57 +++++++++++++++++++++----------------
 fs/btrfs/super.c            |  2 +-
 fs/btrfs/transaction.c      |  2 ++
 fs/btrfs/transaction.h      |  2 ++
 6 files changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f364e1..780acf1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start,
 			     struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
@@ -4068,6 +4070,7 @@ __printf(5, 6)
 void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
 		     unsigned int line, int errno, const char *fmt, ...);
 
+const char *btrfs_decode_error(int errno);
 
 void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root, const char *function,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2c437ed..40c6ffe 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6072,20 +6072,19 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_group_cache *block_group, *tmp;
+	struct list_head *deleted_bgs;
 	struct extent_io_tree *unpin;
 	u64 start;
 	u64 end;
 	int ret;
 
-	if (trans->aborted)
-		return 0;
-
 	if (fs_info->pinned_extents == &fs_info->freed_extents[0])
 		unpin = &fs_info->freed_extents[1];
 	else
 		unpin = &fs_info->freed_extents[0];
 
-	while (1) {
+	while (!trans->aborted) {
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
 					    EXTENT_DIRTY, NULL);
@@ -6104,6 +6103,34 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 		cond_resched();
 	}
 
+	/*
+	 * Transaction is finished.  We don't need the lock anymore.  We
+	 * do need to clean up the block groups in case of a transaction
+	 * abort.
+	 */
+	deleted_bgs = &trans->transaction->deleted_bgs;
+	list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
+		u64 trimmed = 0;
+
+		ret = -EROFS;
+		if (!trans->aborted)
+			ret = btrfs_discard_extent(root,
+						   block_group->key.objectid,
+						   block_group->key.offset,
+						   &trimmed);
+
+		list_del_init(&block_group->bg_list);
+		btrfs_put_block_group_trimming(block_group);
+		btrfs_put_block_group(block_group);
+
+		if (ret) {
+			const char *errstr = btrfs_decode_error(ret);
+			btrfs_warn(fs_info,
+				   "Discard failed while removing blockgroup: errno=%d %s\n",
+				   ret, errstr);
+		}
+	}
+
 	return 0;
 }
 
@@ -9883,6 +9910,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * currently running transaction might finish and a new one start,
 	 * allowing for new block groups to be created that can reuse the same
 	 * physical device locations unless we take this special care.
+	 *
+	 * There may also be an implicit trim operation if the file system
+	 * is mounted with -odiscard. The same protections must remain
+	 * in place until the extents have been discarded completely when
+	 * the transaction commit has completed.
 	 */
 	remove_em = (atomic_read(&block_group->trimming) == 0);
 	/*
@@ -9955,8 +9987,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		return;
 
 	spin_lock(&fs_info->unused_bgs_lock);
+
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 start, end;
+		int trimming;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group_cache,
@@ -10054,12 +10088,39 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
+		/* DISCARD can flip during remount */
+		trimming = btrfs_test_opt(root, DISCARD);
+
+		/* Implicit trim during transaction commit. */
+		if (trimming)
+			btrfs_get_block_group_trimming(block_group);
+
 		/*
 		 * Btrfs_remove_chunk will abort the transaction if things go
 		 * horribly wrong.
 		 */
 		ret = btrfs_remove_chunk(trans, root,
 					 block_group->key.objectid);
+
+		if (ret) {
+			if (trimming)
+				btrfs_put_block_group_trimming(block_group);
+			goto end_trans;
+		}
+
+		/*
+		 * If we're not mounted with -odiscard, we can just forget
+		 * about this block group. Otherwise we'll need to wait
+		 * until transaction commit to do the actual discard.
+		 */
+		if (trimming) {
+			WARN_ON(!list_empty(&block_group->bg_list));
+			spin_lock(&trans->transaction->deleted_bgs_lock);
+			list_move(&block_group->bg_list,
+				  &trans->transaction->deleted_bgs);
+			spin_unlock(&trans->transaction->deleted_bgs_lock);
+			btrfs_get_block_group(block_group);
+		}
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 9dbe5b5..c79253e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3274,35 +3274,23 @@ next:
 	return ret;
 }
 
-int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
 {
-	int ret;
+	atomic_inc(&cache->trimming);
+}
 
-	*trimmed = 0;
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
+{
+	struct extent_map_tree *em_tree;
+	struct extent_map *em;
+	bool cleanup;
 
 	spin_lock(&block_group->lock);
-	if (block_group->removed) {
-		spin_unlock(&block_group->lock);
-		return 0;
-	}
-	atomic_inc(&block_group->trimming);
+	cleanup = (atomic_dec_and_test(&block_group->trimming) &&
+		   block_group->removed);
 	spin_unlock(&block_group->lock);
 
-	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
-	if (ret)
-		goto out;
-
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
-out:
-	spin_lock(&block_group->lock);
-	if (atomic_dec_and_test(&block_group->trimming) &&
-	    block_group->removed) {
-		struct extent_map_tree *em_tree;
-		struct extent_map *em;
-
-		spin_unlock(&block_group->lock);
-
+	if (cleanup) {
 		lock_chunks(block_group->fs_info->chunk_root);
 		em_tree = &block_group->fs_info->mapping_tree.map_tree;
 		write_lock(&em_tree->lock);
@@ -3326,10 +3314,31 @@ out:
 		 * this block group have left 1 entry each one. Free them.
 		 */
 		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
-	} else {
+	}
+}
+
+int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
+			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+{
+	int ret;
+
+	*trimmed = 0;
+
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
 		spin_unlock(&block_group->lock);
+		return 0;
 	}
+	btrfs_get_block_group_trimming(block_group);
+	spin_unlock(&block_group->lock);
+
+	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
+	if (ret)
+		goto out;
 
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	btrfs_put_block_group_trimming(block_group);
 	return ret;
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2ccd8d4..a80da03 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
-static const char *btrfs_decode_error(int errno)
+const char *btrfs_decode_error(int errno)
 {
 	char *errstr = "unknown";
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5628e25..2005262 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -256,6 +256,8 @@ loop:
 	mutex_init(&cur_trans->cache_write_mutex);
 	cur_trans->num_dirty_bgs = 0;
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
+	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
+	spin_lock_init(&cur_trans->deleted_bgs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(&cur_trans->dirty_pages,
 			     fs_info->btree_inode->i_mapping);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0b24755..14325f2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -74,6 +74,8 @@ struct btrfs_transaction {
 	 */
 	struct mutex cache_write_mutex;
 	spinlock_t dirty_bgs_lock;
+	struct list_head deleted_bgs;
+	spinlock_t deleted_bgs_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	int aborted;
 	int dirty_bg_run;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
@ 2015-06-11 15:25   ` Jeff Mahoney
  2015-06-11 16:47   ` Filipe David Manana
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff Mahoney @ 2015-06-11 15:25 UTC (permalink / raw)
  To: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/11/15 11:20 AM, jeffm@suse.com wrote:
> From: Jeff Mahoney <jeffm@suse.com>
> 
> Btrfs doesn't track superblocks with extent records so there is
> nothing persistent on-disk to indicate that those blocks are in
> use.  We track the superblocks in memory to ensure they don't get
> used by removing them from the free space cache when we load a
> block group from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
> block groups automatically), that was fine since the block group
> would never be reclaimed so the superblock was always safe.  Once
> we started removing the empty block groups, we were protected by
> the fact that discards weren't being properly issued for unused
> space either via FITRIM or -odiscard.  The block groups were still
> being released, but the blocks remained on disk.
> 
> In order to properly discard unused block groups, we need to filter
> out the superblocks from the discard range.  Superblocks are
> located at fixed locations on each device, so it makes sense to
> filter them out in btrfs_issue_discard, which is used by both
> -odiscard and FITRIM.
> 
> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
> fs/btrfs/extent-tree.c | 50
> ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed,
> 44 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
> 0ec3acd..75d0226 100644 --- a/fs/btrfs/extent-tree.c +++
> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static int
> remove_extent_backref(struct btrfs_trans_handle *trans, return
> ret; }
> 
> -static int btrfs_issue_discard(struct block_device *bdev, -				u64
> start, u64 len) +#define in_range(b, first, len)	((b) >= (first) &&
> (b) < (first) + (len)) +static int btrfs_issue_discard(struct
> block_device *bdev, u64 start, u64 len, +			       u64
> *discarded_bytes) { -	return blkdev_issue_discard(bdev, start >> 9,
> len >> 9, GFP_NOFS, 0); +	u64 skipped = 0; +	u64 bytes_left = len; 
> +	int ret = 0 , j; + +	*discarded_bytes = 0; + +	/* Skip any
> superblocks on this device. */ +	for (j = 0; j <
> BTRFS_SUPER_MIRROR_MAX; j++) { +		u64 sb_offset =
> btrfs_sb_offset(j); +		u64 size = sb_offset - start; + +		if
> (!in_range(sb_offset, start, bytes_left)) +			continue; + +		if
> (size) { +			ret = blkdev_issue_discard(bdev, start >> 9, size >>
> 9, +						   GFP_NOFS, 0); +			if (!ret) +				*discarded_bytes +=
> size; +			else if (ret != -EOPNOTSUPP) +				return ret; +		} + +
> start = sb_offset + BTRFS_SUPER_INFO_SIZE; +		if (start > len) +
> start = len; +		bytes_left = len - start; +		skipped +=
> BTRFS_SUPER_INFO_SIZE;

Whoops. This is an unused artifact from an earlier version.

- -Jeff

> +	} + +	if (bytes_left) { +		ret = blkdev_issue_discard(bdev, start
> >> 9, bytes_left >> 9, +					   GFP_NOFS, 0); +		if (!ret) +
> *discarded_bytes += bytes_left; +	} +	return ret; }
> 
> int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr, @@
> -1906,16 +1943,17 @@ int btrfs_discard_extent(struct btrfs_root
> *root, u64 bytenr, struct btrfs_bio_stripe *stripe =
> bbio->stripes; int i;
> 
> - for (i = 0; i < bbio->num_stripes; i++, stripe++) { +			u64
> bytes; + if (!stripe->dev->can_discard) continue;
> 
> ret = btrfs_issue_discard(stripe->dev->bdev, stripe->physical, -
> stripe->length); +						  stripe->length, &bytes); if (!ret) -
> discarded_bytes += stripe->length; +				discarded_bytes += bytes; 
> else if (ret != -EOPNOTSUPP) break; /* Logic errors or -ENOMEM, or
> -EIO but I don't know how that could happen JDM */
> 
> 


- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVeahbAAoJEB57S2MheeWyA3YP/00sK18TqJi6ohu0l18otsd2
vG2lAtPUDT6f4dFM1GvO4PYLssuwm0Vy98qFYO30lvcHbdbajhsGd1qk+IzEq3Dr
+Qxze9lIrX8267nRE3aqu1I8/y2Gn5Wcy5Xr6fwhN0g+m5J+fWZHy1mfTvqbNt+M
U7Cg4hgvTFDcuV5adrg3JoQ3W/w6IvgIid27oNZ0MJMaK4f2jcLfmhGVCTcCnydp
7fEUsMow2firGJew7161ORYkYjTn76JW2HXB8ETOgWUcMpZon1KWduVXOTD0r9sk
UrfMJ8dLw6334T265gmQrBXvb3yOBLbyRYuksUFhXFUo1/xDSQMLGMBrkEf6T/Vo
mwnXVw5hnPM41vZiJ4Pc56vg1Yy4JGMXs28XEyHrj7gQWo3b1jRw0YhosZE/TR5+
BsJb4KqBTPor/upPLmDwIOYU4Ia+GZ5V2k1jYsE9qs6khHXlX9is8WHeB/fMzfCe
zkZEl8YI3ek50wXcqBk+QHFj8LT89Eyq59O3iaCFceIhvjO87ldRDBtTtBR34ogf
mZphx9UzVTcY31wftDYKMMi2Xs+8BZgXq7H7IXc2muugFlK9AGsMv4iXPbVkmUmE
T4pS2vKqCwfkW9bGf7ufQaOpVgwzQjDkC/Ngtybl8sULroOjI3lm6p+2/jMuqyYd
twV2FaYAJAI/tiQkTken
=1g/b
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
  2015-06-11 15:25   ` Jeff Mahoney
@ 2015-06-11 16:47   ` Filipe David Manana
  2015-06-11 18:17     ` Jeff Mahoney
  1 sibling, 1 reply; 14+ messages in thread
From: Filipe David Manana @ 2015-06-11 16:47 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> Btrfs doesn't track superblocks with extent records so there is nothing
> persistent on-disk to indicate that those blocks are in use.  We track
> the superblocks in memory to ensure they don't get used by removing them
> from the free space cache when we load a block group from disk.  Prior
> to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
> was fine since the block group would never be reclaimed so the superblock
> was always safe.  Once we started removing the empty block groups, we
> were protected by the fact that discards weren't being properly issued
> for unused space either via FITRIM or -odiscard.  The block groups were
> still being released, but the blocks remained on disk.
>
> In order to properly discard unused block groups, we need to filter out
> the superblocks from the discard range.  Superblocks are located at fixed
> locations on each device, so it makes sense to filter them out in
> btrfs_issue_discard, which is used by both -odiscard and FITRIM.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 44 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ec3acd..75d0226 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,47 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
>         return ret;
>  }
>
> -static int btrfs_issue_discard(struct block_device *bdev,
> -                               u64 start, u64 len)
> +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))

Hi Jeff,

So this will work if every caller behaves well and passes a region
whose start and end offsets are a multiple of the sector size (4096)
which currently matches the superblock size.

However, I think it would be safer to check for the case where the
start offset of a superblock mirror is < (first) and (sb_offset +
sb_len) > (first).  Just to deal with cases where for example the 2nd
half of the sb starts at offset (first).

I guess this sectorsize becoming less than 4096 will happen sooner or
later with the subpage sectorsize patch set, so it wouldn't hurt to
make it more bullet proof already.

Otherwise it looks good to me.
I'll give a test on this patchset soon.

thanks

> +static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> +                              u64 *discarded_bytes)
>  {
> -       return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +       u64 skipped = 0;
> +       u64 bytes_left = len;
> +       int ret = 0 , j;
> +
> +       *discarded_bytes = 0;
> +
> +       /* Skip any superblocks on this device. */
> +       for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
> +               u64 sb_offset = btrfs_sb_offset(j);
> +               u64 size = sb_offset - start;
> +
> +               if (!in_range(sb_offset, start, bytes_left))
> +                       continue;
> +
> +               if (size) {
> +                       ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
> +                                                  GFP_NOFS, 0);
> +                       if (!ret)
> +                               *discarded_bytes += size;
> +                       else if (ret != -EOPNOTSUPP)
> +                               return ret;
> +               }
> +
> +               start = sb_offset + BTRFS_SUPER_INFO_SIZE;
> +               if (start > len)
> +                       start = len;
> +               bytes_left = len - start;
> +               skipped += BTRFS_SUPER_INFO_SIZE;
> +       }
> +
> +       if (bytes_left) {
> +               ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
> +                                          GFP_NOFS, 0);
> +               if (!ret)
> +                       *discarded_bytes += bytes_left;
> +       }
> +       return ret;
>  }
>
>  int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
> @@ -1906,16 +1943,17 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
>                 struct btrfs_bio_stripe *stripe = bbio->stripes;
>                 int i;
>
> -
>                 for (i = 0; i < bbio->num_stripes; i++, stripe++) {
> +                       u64 bytes;
> +
>                         if (!stripe->dev->can_discard)
>                                 continue;
>
>                         ret = btrfs_issue_discard(stripe->dev->bdev,
>                                                   stripe->physical,
> -                                                 stripe->length);
> +                                                 stripe->length, &bytes);
>                         if (!ret)
> -                               discarded_bytes += stripe->length;
> +                               discarded_bytes += bytes;
>                         else if (ret != -EOPNOTSUPP)
>                                 break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
>
> --
> 1.8.4.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 16:47   ` Filipe David Manana
@ 2015-06-11 18:17     ` Jeff Mahoney
  2015-06-11 18:44       ` Filipe David Manana
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Mahoney @ 2015-06-11 18:17 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/11/15 12:47 PM, Filipe David Manana wrote:
> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> wrote:
>> From: Jeff Mahoney <jeffm@suse.com>
>> 
>> Btrfs doesn't track superblocks with extent records so there is
>> nothing persistent on-disk to indicate that those blocks are in
>> use.  We track the superblocks in memory to ensure they don't get
>> used by removing them from the free space cache when we load a
>> block group from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
>> block groups automatically), that was fine since the block group
>> would never be reclaimed so the superblock was always safe.  Once
>> we started removing the empty block groups, we were protected by
>> the fact that discards weren't being properly issued for unused
>> space either via FITRIM or -odiscard.  The block groups were 
>> still being released, but the blocks remained on disk.
>> 
>> In order to properly discard unused block groups, we need to
>> filter out the superblocks from the discard range.  Superblocks
>> are located at fixed locations on each device, so it makes sense
>> to filter them out in btrfs_issue_discard, which is used by both
>> -odiscard and FITRIM.
>> 
>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>> fs/btrfs/extent-tree.c | 50
>> ++++++++++++++++++++++++++++++++++++++++++++------ 1 file
>> changed, 44 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c 
>> index 0ec3acd..75d0226 100644 --- a/fs/btrfs/extent-tree.c +++
>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static int
>> remove_extent_backref(struct btrfs_trans_handle *trans, return
>> ret; }
>> 
>> -static int btrfs_issue_discard(struct block_device *bdev, -
>> u64 start, u64 len) +#define in_range(b, first, len)        ((b)
>> >= (first) && (b) < (first) + (len))
> 
> Hi Jeff,
> 
> So this will work if every caller behaves well and passes a region 
> whose start and end offsets are a multiple of the sector size
> (4096) which currently matches the superblock size.
> 
> However, I think it would be safer to check for the case where the 
> start offset of a superblock mirror is < (first) and (sb_offset + 
> sb_len) > (first).  Just to deal with cases where for example the
> 2nd half of the sb starts at offset (first).
> 
> I guess this sectorsize becoming less than 4096 will happen sooner
> or later with the subpage sectorsize patch set, so it wouldn't hurt
> to make it more bullet proof already.

Is that something anyone intends to support?  While I suppose the
subpage sector patch /could/ be used to allow file systems with a node
size under 4k, the intention is the other way around -- systems that
have higher order page sizes currently don't work with btrfs file
system created on systems with smaller order page sizes like x86.
Btrfs already has high enough metadata overhead.  Pretty much all new
hardware has, at least, a native 4k sector size even if it's
abstracted behind a RMW layer.  The sectors are only going to get
larger.  With the metadata overhead that btrfs already incurs, I can't
imagine any production use case with smaller sector sizes.

Are we looking to support <4k nodes to test the subpage sector code on
x86?  If so, then I'll change this to handle the possibility of
superblocks crossing sector boundaries.  Otherwise, it's protecting
against a use case that just shouldn't happen.

> Otherwise it looks good to me. I'll give a test on this patchset
> soon.

Thanks,

- -Jeff


- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVedCiAAoJEB57S2MheeWymA4P/2Y+YGwAkcbYxrlMVH/wIGpN
j+qO/y1YHNJQNiRHt/1ahM8wy2DEEbqrD36xUuX5lDYRREo9jeqIGGajOOHg/KFw
7q6zIWVEEwom/RKd9CX48TC2pHly5Fw8ESyi+A857KJ1ZHcpKdyNcIwle/Jsoe0q
a+SX6hJPlHFVai/QZhBRO00ZgXlTwjXAGyfgmfuHERY6lS5PBmoA8tYxnigpnBOa
PUrgw+Cdn4glrZUTpt3WN4H5oE+pD6cGMQ+GnFXQYaytssyQNuPpCWdQ1Aferg7u
Af3E6iBj776bQIRTWZSwjXTMLWHVjnBmdU8D+wXE7Ar3oU1POL7NLvnGm4Yr9TWZ
n1NXcBhJ4QQQXBprK3bI+WNSzMzMLdvJHIb5t2Z9BO8wd5Ws0QlmT8Gl+u/ZofZU
8eouhLgu1hZzPeJgXPuDu0S8QaFtpI0zlupOwJByHp9QSzpUw98xqlFXIRtnLc48
n/ZMFi98EMroaQx0hzFLGMwp/57tUFWUUroDkfP1NngoE482ohPHAdREtr7+RZe6
h+pfF+//5CycYxk8BciBmCyLWWTW1WdDL/MzQsXdKE+797cNk79+W5EKYLmQiAbW
IGyRXMj+XPpWNC1VM12JJXdoOujSGocmJa28jKanZRzqw0HKJOlTUBqNoBV83Bph
ChcTZyhAi3PCsib+mNnW
=wb8l
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 18:17     ` Jeff Mahoney
@ 2015-06-11 18:44       ` Filipe David Manana
  2015-06-11 19:15         ` Jeff Mahoney
  0 siblings, 1 reply; 14+ messages in thread
From: Filipe David Manana @ 2015-06-11 18:44 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney <jeffm@suse.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 6/11/15 12:47 PM, Filipe David Manana wrote:
>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> wrote:
>>> From: Jeff Mahoney <jeffm@suse.com>
>>>
>>> Btrfs doesn't track superblocks with extent records so there is
>>> nothing persistent on-disk to indicate that those blocks are in
>>> use.  We track the superblocks in memory to ensure they don't get
>>> used by removing them from the free space cache when we load a
>>> block group from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
>>> block groups automatically), that was fine since the block group
>>> would never be reclaimed so the superblock was always safe.  Once
>>> we started removing the empty block groups, we were protected by
>>> the fact that discards weren't being properly issued for unused
>>> space either via FITRIM or -odiscard.  The block groups were
>>> still being released, but the blocks remained on disk.
>>>
>>> In order to properly discard unused block groups, we need to
>>> filter out the superblocks from the discard range.  Superblocks
>>> are located at fixed locations on each device, so it makes sense
>>> to filter them out in btrfs_issue_discard, which is used by both
>>> -odiscard and FITRIM.
>>>
>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> ---
>>> fs/btrfs/extent-tree.c | 50
>>> ++++++++++++++++++++++++++++++++++++++++++++------ 1 file
>>> changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 0ec3acd..75d0226 100644 --- a/fs/btrfs/extent-tree.c +++
>>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static int
>>> remove_extent_backref(struct btrfs_trans_handle *trans, return
>>> ret; }
>>>
>>> -static int btrfs_issue_discard(struct block_device *bdev, -
>>> u64 start, u64 len) +#define in_range(b, first, len)        ((b)
>>> >= (first) && (b) < (first) + (len))
>>
>> Hi Jeff,
>>
>> So this will work if every caller behaves well and passes a region
>> whose start and end offsets are a multiple of the sector size
>> (4096) which currently matches the superblock size.
>>
>> However, I think it would be safer to check for the case where the
>> start offset of a superblock mirror is < (first) and (sb_offset +
>> sb_len) > (first).  Just to deal with cases where for example the
>> 2nd half of the sb starts at offset (first).
>>
>> I guess this sectorsize becoming less than 4096 will happen sooner
>> or later with the subpage sectorsize patch set, so it wouldn't hurt
>> to make it more bullet proof already.
>
> Is that something anyone intends to support?  While I suppose the
> subpage sector patch /could/ be used to allow file systems with a node
> size under 4k, the intention is the other way around -- systems that
> have higher order page sizes currently don't work with btrfs file
> system created on systems with smaller order page sizes like x86.
> Btrfs already has high enough metadata overhead.  Pretty much all new
> hardware has, at least, a native 4k sector size even if it's
> abstracted behind a RMW layer.  The sectors are only going to get
> larger.  With the metadata overhead that btrfs already incurs, I can't
> imagine any production use case with smaller sector sizes.
>
> Are we looking to support <4k nodes to test the subpage sector code on
> x86?  If so, then I'll change this to handle the possibility of
> superblocks crossing sector boundaries.  Otherwise, it's protecting
> against a use case that just shouldn't happen.

I understand your point.
I'm probably being too paranoid. But it's exactly because it's not
supposed to happen that at least an assertion or something should be
added imho. A lot of "not supposed not happen things" happen often,
and that's often how people lose data, and get into other bad issues.

And I think I've heard once of supporting <4k nodes (sectorsizes) for
testing at least on x86 for e.g, but I might have not understood it
correctly. Having such a check would help detect bugs during
development where some caller passes a wrong range to discard - better
to find it during development/RCs rather than in production.

But anyway, just a personal preference.

thanks

>
>> Otherwise it looks good to me. I'll give a test on this patchset
>> soon.
>
> Thanks,
>
> - -Jeff
>
>
> - --
> Jeff Mahoney
> SUSE Labs
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>
> iQIcBAEBAgAGBQJVedCiAAoJEB57S2MheeWymA4P/2Y+YGwAkcbYxrlMVH/wIGpN
> j+qO/y1YHNJQNiRHt/1ahM8wy2DEEbqrD36xUuX5lDYRREo9jeqIGGajOOHg/KFw
> 7q6zIWVEEwom/RKd9CX48TC2pHly5Fw8ESyi+A857KJ1ZHcpKdyNcIwle/Jsoe0q
> a+SX6hJPlHFVai/QZhBRO00ZgXlTwjXAGyfgmfuHERY6lS5PBmoA8tYxnigpnBOa
> PUrgw+Cdn4glrZUTpt3WN4H5oE+pD6cGMQ+GnFXQYaytssyQNuPpCWdQ1Aferg7u
> Af3E6iBj776bQIRTWZSwjXTMLWHVjnBmdU8D+wXE7Ar3oU1POL7NLvnGm4Yr9TWZ
> n1NXcBhJ4QQQXBprK3bI+WNSzMzMLdvJHIb5t2Z9BO8wd5Ws0QlmT8Gl+u/ZofZU
> 8eouhLgu1hZzPeJgXPuDu0S8QaFtpI0zlupOwJByHp9QSzpUw98xqlFXIRtnLc48
> n/ZMFi98EMroaQx0hzFLGMwp/57tUFWUUroDkfP1NngoE482ohPHAdREtr7+RZe6
> h+pfF+//5CycYxk8BciBmCyLWWTW1WdDL/MzQsXdKE+797cNk79+W5EKYLmQiAbW
> IGyRXMj+XPpWNC1VM12JJXdoOujSGocmJa28jKanZRzqw0HKJOlTUBqNoBV83Bph
> ChcTZyhAi3PCsib+mNnW
> =wb8l
> -----END PGP SIGNATURE-----



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 18:44       ` Filipe David Manana
@ 2015-06-11 19:15         ` Jeff Mahoney
  2015-06-11 19:24           ` Chris Mason
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Mahoney @ 2015-06-11 19:15 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/11/15 2:44 PM, Filipe David Manana wrote:
> On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney <jeffm@suse.com>
> wrote: On 6/11/15 12:47 PM, Filipe David Manana wrote:
>>>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> wrote:
>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>> 
>>>>> Btrfs doesn't track superblocks with extent records so
>>>>> there is nothing persistent on-disk to indicate that those
>>>>> blocks are in use.  We track the superblocks in memory to
>>>>> ensure they don't get used by removing them from the free
>>>>> space cache when we load a block group from disk.  Prior to
>>>>> 47ab2a6c6a (Btrfs: remove empty block groups
>>>>> automatically), that was fine since the block group would
>>>>> never be reclaimed so the superblock was always safe.
>>>>> Once we started removing the empty block groups, we were
>>>>> protected by the fact that discards weren't being properly
>>>>> issued for unused space either via FITRIM or -odiscard.
>>>>> The block groups were still being released, but the blocks
>>>>> remained on disk.
>>>>> 
>>>>> In order to properly discard unused block groups, we need
>>>>> to filter out the superblocks from the discard range.
>>>>> Superblocks are located at fixed locations on each device,
>>>>> so it makes sense to filter them out in
>>>>> btrfs_issue_discard, which is used by both -odiscard and
>>>>> FITRIM.
>>>>> 
>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>>>>> fs/btrfs/extent-tree.c | 50 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++------ 1 file 
>>>>> changed, 44 insertions(+), 6 deletions(-)
>>>>> 
>>>>> diff --git a/fs/btrfs/extent-tree.c
>>>>> b/fs/btrfs/extent-tree.c index 0ec3acd..75d0226 100644 ---
>>>>> a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@
>>>>> -1884,10 +1884,47 @@ static int 
>>>>> remove_extent_backref(struct btrfs_trans_handle *trans,
>>>>> return ret; }
>>>>> 
>>>>> -static int btrfs_issue_discard(struct block_device *bdev,
>>>>> - u64 start, u64 len) +#define in_range(b, first, len)
>>>>> ((b)
>>>>>> = (first) && (b) < (first) + (len))
>>>> 
>>>> Hi Jeff,
>>>> 
>>>> So this will work if every caller behaves well and passes a
>>>> region whose start and end offsets are a multiple of the
>>>> sector size (4096) which currently matches the superblock
>>>> size.
>>>> 
>>>> However, I think it would be safer to check for the case
>>>> where the start offset of a superblock mirror is < (first)
>>>> and (sb_offset + sb_len) > (first).  Just to deal with cases
>>>> where for example the 2nd half of the sb starts at offset
>>>> (first).
>>>> 
>>>> I guess this sectorsize becoming less than 4096 will happen
>>>> sooner or later with the subpage sectorsize patch set, so it
>>>> wouldn't hurt to make it more bullet proof already.
> 
> Is that something anyone intends to support?  While I suppose the 
> subpage sector patch /could/ be used to allow file systems with a
> node size under 4k, the intention is the other way around --
> systems that have higher order page sizes currently don't work with
> btrfs file system created on systems with smaller order page sizes
> like x86. Btrfs already has high enough metadata overhead.  Pretty
> much all new hardware has, at least, a native 4k sector size even
> if it's abstracted behind a RMW layer.  The sectors are only going
> to get larger.  With the metadata overhead that btrfs already
> incurs, I can't imagine any production use case with smaller sector
> sizes.
> 
> Are we looking to support <4k nodes to test the subpage sector code
> on x86?  If so, then I'll change this to handle the possibility of 
> superblocks crossing sector boundaries.  Otherwise, it's
> protecting against a use case that just shouldn't happen.
> 
>> I understand your point. I'm probably being too paranoid. But
>> it's exactly because it's not supposed to happen that at least an
>> assertion or something should be added imho. A lot of "not
>> supposed not happen things" happen often, and that's often how
>> people lose data, and get into other bad issues.
> 
>> And I think I've heard once of supporting <4k nodes (sectorsizes)
>> for testing at least on x86 for e.g, but I might have not
>> understood it correctly. Having such a check would help detect
>> bugs during development where some caller passes a wrong range to
>> discard - better to find it during development/RCs rather than in
>> production.

Yeah, you're right. I'll make it more bulletproof. I'm just looking to
be done with this particular mess. :)

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVed5SAAoJEB57S2MheeWypa4QAJIrhdZNg3d1I972dkkBBFjk
723lHLXWdHkJifQKYa7b91hwtWplKxZlTuEoc7/AwacoemR7m6V3JMApykHDgHrV
mmHrUvcJVUFjhbn1vXkeMhbz8COLbOUIUF6/hrwzkg00SSDwtjmv5IJQEIXikKEo
j/iI2IyyNsA2JgCa5R/2JvWRssWUF8VhupLw/T6WOfjTwhEEDT/DqZIuD/avGrwk
cxo8NqUkUNrFA7JxhO9TS/Mz9IIxiLu75XQiFzHF3IkLGUGAT4R1JyRMSZhFNjgc
MPS9d9V1ZUZ5swKSRWR07LDtBntiRDYS7atChLsWrOuJs3lsyS3nC4LFkbcqkFdX
pLFOsftfe/FX3OGz5Jry9LyKjr0m7hjveZQ2hnEHB1D/wtpEhRw8WQc8cpL6/Fyr
11g67vsdMbb+KeWsOSLyNk47oRuqULSoqRWRBqeyjfXVRPUBAHME31mZ3QN8+wJz
7Ua6xYoKV5RT8A3w2ceVEhpwqY/DuSGKJ/frteFyLiy12myZz4ZutKLIxwvww4uT
sP3w3NnXFtf3UdA1D0hgm5PQuDk5vBtrFJ1f78nkMXHoKgbIfoSF/LE3Zxb7SFw7
1GscISYtTHumXQdiv6bPw0rvNt1NDj0vKwJG6aRf9IDukT+05a37e33SLBGHGRbv
tXxOaVLT67hJK7jQIWAM
=yNKP
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 19:15         ` Jeff Mahoney
@ 2015-06-11 19:24           ` Chris Mason
  2015-06-11 19:27             ` Jeff Mahoney
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Mason @ 2015-06-11 19:24 UTC (permalink / raw)
  To: Jeff Mahoney, fdmanana; +Cc: linux-btrfs

On 06/11/2015 03:15 PM, Jeff Mahoney wrote:
> On 6/11/15 2:44 PM, Filipe David Manana wrote:
>> On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney <jeffm@suse.com>
>> wrote: On 6/11/15 12:47 PM, Filipe David Manana wrote:
>>>>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> wrote:
>>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>>>
>>>>>> Btrfs doesn't track superblocks with extent records so
>>>>>> there is nothing persistent on-disk to indicate that those
>>>>>> blocks are in use.  We track the superblocks in memory to
>>>>>> ensure they don't get used by removing them from the free
>>>>>> space cache when we load a block group from disk.  Prior to
>>>>>> 47ab2a6c6a (Btrfs: remove empty block groups
>>>>>> automatically), that was fine since the block group would
>>>>>> never be reclaimed so the superblock was always safe.
>>>>>> Once we started removing the empty block groups, we were
>>>>>> protected by the fact that discards weren't being properly
>>>>>> issued for unused space either via FITRIM or -odiscard.
>>>>>> The block groups were still being released, but the blocks
>>>>>> remained on disk.
>>>>>>
>>>>>> In order to properly discard unused block groups, we need
>>>>>> to filter out the superblocks from the discard range.
>>>>>> Superblocks are located at fixed locations on each device,
>>>>>> so it makes sense to filter them out in
>>>>>> btrfs_issue_discard, which is used by both -odiscard and
>>>>>> FITRIM.
>>>>>>
>>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>>>>>> fs/btrfs/extent-tree.c | 50 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++------ 1 file 
>>>>>> changed, 44 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/extent-tree.c
>>>>>> b/fs/btrfs/extent-tree.c index 0ec3acd..75d0226 100644 ---
>>>>>> a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@
>>>>>> -1884,10 +1884,47 @@ static int 
>>>>>> remove_extent_backref(struct btrfs_trans_handle *trans,
>>>>>> return ret; }
>>>>>>
>>>>>> -static int btrfs_issue_discard(struct block_device *bdev,
>>>>>> - u64 start, u64 len) +#define in_range(b, first, len)
>>>>>> ((b)
>>>>>>> = (first) && (b) < (first) + (len))
>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> So this will work if every caller behaves well and passes a
>>>>> region whose start and end offsets are a multiple of the
>>>>> sector size (4096) which currently matches the superblock
>>>>> size.
>>>>>
>>>>> However, I think it would be safer to check for the case
>>>>> where the start offset of a superblock mirror is < (first)
>>>>> and (sb_offset + sb_len) > (first).  Just to deal with cases
>>>>> where for example the 2nd half of the sb starts at offset
>>>>> (first).
>>>>>
>>>>> I guess this sectorsize becoming less than 4096 will happen
>>>>> sooner or later with the subpage sectorsize patch set, so it
>>>>> wouldn't hurt to make it more bullet proof already.
> 
>> Is that something anyone intends to support?  While I suppose the 
>> subpage sector patch /could/ be used to allow file systems with a
>> node size under 4k, the intention is the other way around --
>> systems that have higher order page sizes currently don't work with
>> btrfs file system created on systems with smaller order page sizes
>> like x86.

The best use of smaller node sizes is just to test the subpagesize
patches on more common hardware.  I wouldn't expect anyone to use a 1K
node size in production.

Thanks for doing these Jeff.

-chris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 19:24           ` Chris Mason
@ 2015-06-11 19:27             ` Jeff Mahoney
  2015-06-11 19:35               ` Chris Mason
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Mahoney @ 2015-06-11 19:27 UTC (permalink / raw)
  To: Chris Mason, fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/11/15 3:24 PM, Chris Mason wrote:
> On 06/11/2015 03:15 PM, Jeff Mahoney wrote:
>> On 6/11/15 2:44 PM, Filipe David Manana wrote:
>>> On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney <jeffm@suse.com> 
>>> wrote: On 6/11/15 12:47 PM, Filipe David Manana wrote:
>>>>>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com>
>>>>>> wrote:
>>>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>>>> 
>>>>>>> Btrfs doesn't track superblocks with extent records so 
>>>>>>> there is nothing persistent on-disk to indicate that
>>>>>>> those blocks are in use.  We track the superblocks in
>>>>>>> memory to ensure they don't get used by removing them
>>>>>>> from the free space cache when we load a block group
>>>>>>> from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
>>>>>>> block groups automatically), that was fine since the
>>>>>>> block group would never be reclaimed so the superblock
>>>>>>> was always safe. Once we started removing the empty
>>>>>>> block groups, we were protected by the fact that
>>>>>>> discards weren't being properly issued for unused space
>>>>>>> either via FITRIM or -odiscard. The block groups were
>>>>>>> still being released, but the blocks remained on disk.
>>>>>>> 
>>>>>>> In order to properly discard unused block groups, we
>>>>>>> need to filter out the superblocks from the discard
>>>>>>> range. Superblocks are located at fixed locations on
>>>>>>> each device, so it makes sense to filter them out in 
>>>>>>> btrfs_issue_discard, which is used by both -odiscard
>>>>>>> and FITRIM.
>>>>>>> 
>>>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>>>>>>> fs/btrfs/extent-tree.c | 50 
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++------ 1
>>>>>>> file changed, 44 insertions(+), 6 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/fs/btrfs/extent-tree.c 
>>>>>>> b/fs/btrfs/extent-tree.c index 0ec3acd..75d0226 100644
>>>>>>> --- a/fs/btrfs/extent-tree.c +++
>>>>>>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static
>>>>>>> int remove_extent_backref(struct btrfs_trans_handle
>>>>>>> *trans, return ret; }
>>>>>>> 
>>>>>>> -static int btrfs_issue_discard(struct block_device
>>>>>>> *bdev, - u64 start, u64 len) +#define in_range(b,
>>>>>>> first, len) ((b)
>>>>>>>> = (first) && (b) < (first) + (len))
>>>>>> 
>>>>>> Hi Jeff,
>>>>>> 
>>>>>> So this will work if every caller behaves well and passes
>>>>>> a region whose start and end offsets are a multiple of
>>>>>> the sector size (4096) which currently matches the
>>>>>> superblock size.
>>>>>> 
>>>>>> However, I think it would be safer to check for the case 
>>>>>> where the start offset of a superblock mirror is <
>>>>>> (first) and (sb_offset + sb_len) > (first).  Just to deal
>>>>>> with cases where for example the 2nd half of the sb
>>>>>> starts at offset (first).
>>>>>> 
>>>>>> I guess this sectorsize becoming less than 4096 will
>>>>>> happen sooner or later with the subpage sectorsize patch
>>>>>> set, so it wouldn't hurt to make it more bullet proof
>>>>>> already.
>> 
>>> Is that something anyone intends to support?  While I suppose
>>> the subpage sector patch /could/ be used to allow file systems
>>> with a node size under 4k, the intention is the other way
>>> around -- systems that have higher order page sizes currently
>>> don't work with btrfs file system created on systems with
>>> smaller order page sizes like x86.
> 
> The best use of smaller node sizes is just to test the subpagesize 
> patches on more common hardware.  I wouldn't expect anyone to use a
> 1K node size in production.

Any chance we can enforce that?  Like with a compile-time option? :)

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVeeE5AAoJEB57S2MheeWyGh8P/RvSEQxL0M7Od6YltKz3yg0H
b/pi1ZukSTNFzQQiK/tt2rgPd0XH8p139SmJ9bWiHMPD14KDN0OtsLlAIjurLUVd
Pf9jBBnaV+D8i8GVMfVBrvMqqK5xZ/6LGEFMgyqzET1WkhYZ1D7XQ5mSPKDNTqnk
YwHrbeJ7+wN0zaaqL7I2ed06Yt5e5GczpIVRXxWKsNvYsQFie4rSdnG/QUDjAReI
W4cynK6NRxQcjc0avqHRGdxFn7woyGe47tPdzr+eERE7yXr+hqrrXyzCXqWu8Nm0
f1mYX/RNRggPg6MmmjIxKsC/ySfs8p6CR6t6DejevtdQRw2uEFciqzx63y8/H3kX
L+wN/ITEOGfeCs5ndwlfL204FhZ73brY7dNIXd97yqHJhGBxadJlsoc7eI9J/fux
jNJy/wHF2/Epdfk9nBNJiRFUuL0eioYBkoE0pEdBkdoGtPAodQ+TIn+rEkxKC//n
ZovStjTJYUVuDfAi9Wpraxd6oaGmqWQU+eYxxLQBXc+ADk+lh15wmWFD9tYAw5uQ
kOwoz2PIxKCOVFBR3ixoPCy9nJsHRQX69L5xV76VqTDnGrBEDeGKHSy4Dhi8aTCN
pSJkF2STAgxxh/CHUT0FmRsIcUiLUp26pdewIjEgXQ8S6cok2/a/+HpOJRSfD0M3
bEPZLzdZREM2FhyrBcCY
=qoEa
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 19:27             ` Jeff Mahoney
@ 2015-06-11 19:35               ` Chris Mason
  2015-06-11 19:46                 ` Jeff Mahoney
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Mason @ 2015-06-11 19:35 UTC (permalink / raw)
  To: Jeff Mahoney, fdmanana; +Cc: linux-btrfs

On 06/11/2015 03:27 PM, Jeff Mahoney wrote:
> On 6/11/15 3:24 PM, Chris Mason wrote:
>> On 06/11/2015 03:15 PM, Jeff Mahoney wrote:
>>> On 6/11/15 2:44 PM, Filipe David Manana wrote:
>>>> On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney <jeffm@suse.com> 
>>>> wrote: On 6/11/15 12:47 PM, Filipe David Manana wrote:
>>>>>>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com>
>>>>>>> wrote:
>>>>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>>>>>
>>>>>>>> Btrfs doesn't track superblocks with extent records so 
>>>>>>>> there is nothing persistent on-disk to indicate that
>>>>>>>> those blocks are in use.  We track the superblocks in
>>>>>>>> memory to ensure they don't get used by removing them
>>>>>>>> from the free space cache when we load a block group
>>>>>>>> from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
>>>>>>>> block groups automatically), that was fine since the
>>>>>>>> block group would never be reclaimed so the superblock
>>>>>>>> was always safe. Once we started removing the empty
>>>>>>>> block groups, we were protected by the fact that
>>>>>>>> discards weren't being properly issued for unused space
>>>>>>>> either via FITRIM or -odiscard. The block groups were
>>>>>>>> still being released, but the blocks remained on disk.
>>>>>>>>
>>>>>>>> In order to properly discard unused block groups, we
>>>>>>>> need to filter out the superblocks from the discard
>>>>>>>> range. Superblocks are located at fixed locations on
>>>>>>>> each device, so it makes sense to filter them out in 
>>>>>>>> btrfs_issue_discard, which is used by both -odiscard
>>>>>>>> and FITRIM.
>>>>>>>>
>>>>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>>>>>>>> fs/btrfs/extent-tree.c | 50 
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++------ 1
>>>>>>>> file changed, 44 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/extent-tree.c 
>>>>>>>> b/fs/btrfs/extent-tree.c index 0ec3acd..75d0226 100644
>>>>>>>> --- a/fs/btrfs/extent-tree.c +++
>>>>>>>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static
>>>>>>>> int remove_extent_backref(struct btrfs_trans_handle
>>>>>>>> *trans, return ret; }
>>>>>>>>
>>>>>>>> -static int btrfs_issue_discard(struct block_device
>>>>>>>> *bdev, - u64 start, u64 len) +#define in_range(b,
>>>>>>>> first, len) ((b)
>>>>>>>>> = (first) && (b) < (first) + (len))
>>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> So this will work if every caller behaves well and passes
>>>>>>> a region whose start and end offsets are a multiple of
>>>>>>> the sector size (4096) which currently matches the
>>>>>>> superblock size.
>>>>>>>
>>>>>>> However, I think it would be safer to check for the case 
>>>>>>> where the start offset of a superblock mirror is <
>>>>>>> (first) and (sb_offset + sb_len) > (first).  Just to deal
>>>>>>> with cases where for example the 2nd half of the sb
>>>>>>> starts at offset (first).
>>>>>>>
>>>>>>> I guess this sectorsize becoming less than 4096 will
>>>>>>> happen sooner or later with the subpage sectorsize patch
>>>>>>> set, so it wouldn't hurt to make it more bullet proof
>>>>>>> already.
>>>
>>>> Is that something anyone intends to support?  While I suppose
>>>> the subpage sector patch /could/ be used to allow file systems
>>>> with a node size under 4k, the intention is the other way
>>>> around -- systems that have higher order page sizes currently
>>>> don't work with btrfs file system created on systems with
>>>> smaller order page sizes like x86.
> 
>> The best use of smaller node sizes is just to test the subpagesize 
>> patches on more common hardware.  I wouldn't expect anyone to use a
>> 1K node size in production.
> 
> Any chance we can enforce that?  Like with a compile-time option? :)

We can make mkfs.btrfs advise strongly against it ;)

But, since I wasn't horribly clear, I'd love one extra if statement in
the discard function.  Silently eating bytes is horribly hard to track down.

-chris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: skip superblocks during discard
  2015-06-11 19:35               ` Chris Mason
@ 2015-06-11 19:46                 ` Jeff Mahoney
  0 siblings, 0 replies; 14+ messages in thread
From: Jeff Mahoney @ 2015-06-11 19:46 UTC (permalink / raw)
  To: Chris Mason, fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/11/15 3:35 PM, Chris Mason wrote:
> On 06/11/2015 03:27 PM, Jeff Mahoney wrote:
>> On 6/11/15 3:24 PM, Chris Mason wrote:
>>> On 06/11/2015 03:15 PM, Jeff Mahoney wrote:
>>>> On 6/11/15 2:44 PM, Filipe David Manana wrote:
>>>>> On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney 
>>>>> <jeffm@suse.com> wrote: On 6/11/15 12:47 PM, Filipe David 
>>>>> Manana wrote:
>>>>>>>> On Thu, Jun 11, 2015 at 4:20 PM,  <jeffm@suse.com> 
>>>>>>>> wrote:
>>>>>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>>>>>> 
>>>>>>>>> Btrfs doesn't track superblocks with extent
>>>>>>>>> records so there is nothing persistent on-disk to
>>>>>>>>> indicate that those blocks are in use.  We track
>>>>>>>>> the superblocks in memory to ensure they don't get
>>>>>>>>> used by removing them from the free space cache
>>>>>>>>> when we load a block group from disk.  Prior to
>>>>>>>>> 47ab2a6c6a (Btrfs: remove empty block groups
>>>>>>>>> automatically), that was fine since the block group
>>>>>>>>> would never be reclaimed so the superblock was
>>>>>>>>> always safe. Once we started removing the empty
>>>>>>>>> block groups, we were protected by the fact that
>>>>>>>>> discards weren't being properly issued for unused
>>>>>>>>> space either via FITRIM or -odiscard. The block
>>>>>>>>> groups were still being released, but the blocks
>>>>>>>>> remained on disk.
>>>>>>>>> 
>>>>>>>>> In order to properly discard unused block groups, 
>>>>>>>>> we need to filter out the superblocks from the 
>>>>>>>>> discard range. Superblocks are located at fixed 
>>>>>>>>> locations on each device, so it makes sense to 
>>>>>>>>> filter them out in btrfs_issue_discard, which is 
>>>>>>>>> used by both -odiscard and FITRIM.
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com> --- 
>>>>>>>>> fs/btrfs/extent-tree.c | 50 
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++------ 
>>>>>>>>> 1 file changed, 44 insertions(+), 6 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/fs/btrfs/extent-tree.c 
>>>>>>>>> b/fs/btrfs/extent-tree.c index 0ec3acd..75d0226 
>>>>>>>>> 100644 --- a/fs/btrfs/extent-tree.c +++ 
>>>>>>>>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ 
>>>>>>>>> static int remove_extent_backref(struct 
>>>>>>>>> btrfs_trans_handle *trans, return ret; }
>>>>>>>>> 
>>>>>>>>> -static int btrfs_issue_discard(struct block_device
>>>>>>>>> *bdev, - u64 start, u64 len) +#define in_range(b,
>>>>>>>>> first, len) ((b)
>>>>>>>>>> = (first) && (b) < (first) + (len))
>>>>>>>> 
>>>>>>>> Hi Jeff,
>>>>>>>> 
>>>>>>>> So this will work if every caller behaves well and 
>>>>>>>> passes a region whose start and end offsets are a 
>>>>>>>> multiple of the sector size (4096) which currently 
>>>>>>>> matches the superblock size.
>>>>>>>> 
>>>>>>>> However, I think it would be safer to check for the 
>>>>>>>> case where the start offset of a superblock mirror
>>>>>>>> is < (first) and (sb_offset + sb_len) > (first).
>>>>>>>> Just to deal with cases where for example the 2nd
>>>>>>>> half of the sb starts at offset (first).
>>>>>>>> 
>>>>>>>> I guess this sectorsize becoming less than 4096 will
>>>>>>>>  happen sooner or later with the subpage sectorsize 
>>>>>>>> patch set, so it wouldn't hurt to make it more
>>>>>>>> bullet proof already.
>>>> 
>>>>> Is that something anyone intends to support?  While I 
>>>>> suppose the subpage sector patch /could/ be used to allow 
>>>>> file systems with a node size under 4k, the intention is 
>>>>> the other way around -- systems that have higher order
>>>>> page sizes currently don't work with btrfs file system
>>>>> created on systems with smaller order page sizes like x86.
>> 
>>> The best use of smaller node sizes is just to test the 
>>> subpagesize patches on more common hardware.  I wouldn't
>>> expect anyone to use a 1K node size in production.
>> 
>> Any chance we can enforce that?  Like with a compile-time
>> option? :)
> 
> We can make mkfs.btrfs advise strongly against it ;)
> 
> But, since I wasn't horribly clear, I'd love one extra if
> statement in the discard function.  Silently eating bytes is
> horribly hard to track down.

Heh, yeah.  I'm making it bulletproof now.  If the goal is to also
catch potential misbehavior, I'm catching some other cases as well.  A
few extra conditionals will still take a small percentage of the time
a discard takes.

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVeeWDAAoJEB57S2MheeWygMEQAMPCNf8ZIMfYRDkzbpW0mezB
6Vbu7PM5WNAqOU2XdJXq47Z+jvLzsbBG0Z1hDLdavkQiOfjOQeBDvwVQQwFPizJ9
lRA4HB6P0nMKVl4x4PcXzgRinrIIy46nFY7VFZBe/cO0aEq7bsB3/vjlRj4LKvsp
eeMg212Sc4V6yuVbSfLSgYTtMGcAsmE9rUWl+2+kV6aTGqZr72YG1033YVu9Y+0F
vnelEIKFSmYF1y7FqO8Ejpk7G6fOoKYXGIxjcyC5v6kAKygZkxuUFYt9wPgpxl4X
eTYnPwjRwE3qRHlZtCGmb0SKvIkFMeKaI5Dy8KXUSHu6Q4NZ8q+kftgzNTGHcEzD
EgGrsbMa3N6necDYsmKYrIWVq21Nj2vSZc7YmLDKYtVQJRH2ScPOvHQlosEX8JsA
h4DfSp8fLVWu8hAORrUvByrGfw7DkFOlv1bF4B76MokP7sb4ITnpBUJtW+0Uiw3x
n1OJ94RiFOXpxWvEYquZUnK/9k1cg/eCwDpaFTCSDrTOVfW78lnoso1VKhQ1CJLg
Ed3I77RA0jPE004hpwtLdGE2AMiOZfAMKTAPkErnnWMfcrBh9O8DUBWVXds3IBSg
mv6lKPz24P28ymOINkqFC22D1vyXBH4Xiel0ZuPHHjnrxPUwovrF//XRbwcc7lCf
jzsGyTnEnAf00/R8s7sP
=v4r5
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-06-11 19:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 15:20 [PATCH v4] btrfs: fix automatic blockgroup remove + discard jeffm
2015-06-11 15:20 ` [PATCH 1/4] btrfs: skip superblocks during discard jeffm
2015-06-11 15:25   ` Jeff Mahoney
2015-06-11 16:47   ` Filipe David Manana
2015-06-11 18:17     ` Jeff Mahoney
2015-06-11 18:44       ` Filipe David Manana
2015-06-11 19:15         ` Jeff Mahoney
2015-06-11 19:24           ` Chris Mason
2015-06-11 19:27             ` Jeff Mahoney
2015-06-11 19:35               ` Chris Mason
2015-06-11 19:46                 ` Jeff Mahoney
2015-06-11 15:21 ` [PATCH 2/4] btrfs: iterate over unused chunk space in FITRIM jeffm
2015-06-11 15:21 ` [PATCH 3/4] btrfs: explictly delete unused block groups in close_ctree and ro-remount jeffm
2015-06-11 15:21 ` [PATCH 4/4] btrfs: add missing discards when unpinning extents with -o discard jeffm

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.