* [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
@ 2015-06-15 13:41 ` jeffm
  2015-06-17  9:53   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries jeffm
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Initially this will just be the length argument passed to it,
but the following patches will adjust that to reflect re-alignment
and skipped blocks.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ec3acd..da1145d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1884,10 +1884,17 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static int btrfs_issue_discard(struct block_device *bdev,
-				u64 start, u64 len)
+static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
+			       u64 *discarded_bytes)
 {
-	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
+	int ret = 0;
+
+	*discarded_bytes = 0;
+	ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
+	if (!ret)
+		*discarded_bytes = len;
+
+	return ret;
 }
 
 int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
@@ -1908,14 +1915,16 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 
 
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			u64 bytes;
 			if (!stripe->dev->can_discard)
 				continue;
 
 			ret = btrfs_issue_discard(stripe->dev->bdev,
 						  stripe->physical,
-						  stripe->length);
+						  stripe->length,
+						  &bytes);
 			if (!ret)
-				discarded_bytes += stripe->length;
+				discarded_bytes += bytes;
 			else if (ret != -EOPNOTSUPP)
 				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
 
-- 
2.4.3


* [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
  2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17  9:55   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 3/7] btrfs: skip superblocks during discard jeffm
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

It's possible, though unexpected, to pass unaligned offsets and lengths
to btrfs_issue_discard.  We then shift the offset/length values into
sector units.  If an unaligned offset is passed, the shift rounds it
down, so the discard covers the whole sector containing that offset and
may destroy data that precedes it.  An unaligned length is safe, but we
end up reporting an inaccurate number of discarded bytes.

This patch rounds the offset up to the next 512B boundary, adjusts the
length to match (rounding it down to a sector multiple), and warns,
since we shouldn't be discarding at an offset that isn't aligned with
our sector size.
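
For illustration, here is a standalone userspace sketch of the alignment
arithmetic (not the kernel code; ALIGN_UP and ROUND_DOWN stand in for the
kernel's ALIGN and round_down macros, and the sample values are made up):

/* Sketch only: 512B re-alignment of an unaligned discard request. */
#include <stdint.h>
#include <stdio.h>

#define SECTOR           (1ULL << 9)
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define ROUND_DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
        uint64_t start = 1000, len = 4096;        /* unaligned start */
        uint64_t aligned_start = ALIGN_UP(start, SECTOR);

        if (start != aligned_start) {
                len -= aligned_start - start;     /* drop leading partial sector */
                len = ROUND_DOWN(len, SECTOR);    /* drop trailing partial sector */
                start = aligned_start;
        }
        /* Prints "discard 3584 bytes at 1024". */
        printf("discard %llu bytes at %llu\n",
               (unsigned long long)len, (unsigned long long)start);
        return 0;
}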

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da1145d..cf9cefd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1888,12 +1888,21 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 			       u64 *discarded_bytes)
 {
 	int ret = 0;
+	u64 aligned_start = ALIGN(start, 1 << 9);
 
-	*discarded_bytes = 0;
-	ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
-	if (!ret)
-		*discarded_bytes = len;
+	if (WARN_ON(start != aligned_start)) {
+		len -= aligned_start - start;
+		len = round_down(len, 1 << 9);
+		start = aligned_start;
+	}
 
+	*discarded_bytes = 0;
+	if (len) {
+		ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+					   GFP_NOFS, 0);
+		if (!ret)
+			*discarded_bytes = len;
+	}
 	return ret;
 }
 
-- 
2.4.3


* [PATCH 3/7] btrfs: skip superblocks during discard
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
  2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
  2015-06-15 13:41 ` [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17  9:57   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM jeffm
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use.  We track
the superblocks in memory and ensure they don't get used by removing them
from the free space cache when we load a block group from disk.  Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe.  Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard.  The block groups were
still being released, but the blocks remained on disk.

In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range.  Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.
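
For illustration, here is a standalone userspace sketch of the range
splitting (the superblock offsets and size below are illustrative
stand-ins for btrfs_sb_offset() and BTRFS_SUPER_INFO_SIZE):

/* Sketch only: carve fixed superblock locations out of a discard range. */
#include <stdint.h>
#include <stdio.h>

#define SB_SIZE 4096ULL
static const uint64_t sb_offsets[] = { 64ULL << 10, 64ULL << 20, 256ULL << 30 };

static void do_discard(uint64_t start, uint64_t len)
{
        printf("discard [%llu, %llu)\n", (unsigned long long)start,
               (unsigned long long)(start + len));
}

static void discard_skipping_sbs(uint64_t start, uint64_t len)
{
        uint64_t end = start + len;
        int i;

        for (i = 0; i < 3 && start < end; i++) {
                uint64_t sb_start = sb_offsets[i];
                uint64_t sb_end = sb_start + SB_SIZE;

                if (sb_end <= start || sb_start >= end)
                        continue;                 /* superblock outside range */
                if (sb_start > start)             /* chunk before the superblock */
                        do_discard(start, sb_start - start);
                start = sb_end;                   /* resume after the superblock */
        }
        if (start < end)                          /* remaining tail */
                do_discard(start, end - start);
}

int main(void)
{
        discard_skipping_sbs(0, 128ULL << 10);    /* range spans the first superblock */
        return 0;
}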

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cf9cefd..1e44b93 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1884,10 +1884,12 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
 static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 			       u64 *discarded_bytes)
 {
-	int ret = 0;
+	int j, ret = 0;
+	u64 bytes_left, end;
 	u64 aligned_start = ALIGN(start, 1 << 9);
 
 	if (WARN_ON(start != aligned_start)) {
@@ -1897,11 +1899,60 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 	}
 
 	*discarded_bytes = 0;
-	if (len) {
-		ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+
+	if (!len)
+		return 0;
+
+	end = start + len;
+	bytes_left = len;
+
+	/* Skip any superblocks on this device. */
+	for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
+		u64 sb_start = btrfs_sb_offset(j);
+		u64 sb_end = sb_start + BTRFS_SUPER_INFO_SIZE;
+		u64 size = sb_start - start;
+
+		if (!in_range(sb_start, start, bytes_left) &&
+		    !in_range(sb_end, start, bytes_left) &&
+		    !in_range(start, sb_start, BTRFS_SUPER_INFO_SIZE))
+			continue;
+
+		/*
+		 * Superblock spans beginning of range.  Adjust start and
+		 * try again.
+		 */
+		if (sb_start <= start) {
+			start += sb_end - start;
+			if (start > end) {
+				bytes_left = 0;
+				break;
+			}
+			bytes_left = end - start;
+			continue;
+		}
+
+		if (size) {
+			ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
+						   GFP_NOFS, 0);
+			if (!ret)
+				*discarded_bytes += size;
+			else if (ret != -EOPNOTSUPP)
+				return ret;
+		}
+
+		start = sb_end;
+		if (start > end) {
+			bytes_left = 0;
+			break;
+		}
+		bytes_left = end - start;
+	}
+
+	if (bytes_left) {
+		ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
 					   GFP_NOFS, 0);
 		if (!ret)
-			*discarded_bytes = len;
+			*discarded_bytes += bytes_left;
 	}
 	return ret;
 }
-- 
2.4.3


* [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
                   ` (2 preceding siblings ...)
  2015-06-15 13:41 ` [PATCH 3/7] btrfs: skip superblocks during discard jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17 10:03   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount jeffm
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used.  This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock.  That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls.  We release it
between searches to allow the file system to perform more-or-less
normally.  Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search.  This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.
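
For illustration, a single-threaded userspace sketch of the per-iteration
lock/search/discard ordering described above (the lock names and stub
functions are illustrative, not the real btrfs API):

/* Sketch only: take the locks per iteration, drop them between searches. */
#include <stdbool.h>
#include <stdio.h>

static void lock(const char *name)   { printf("lock   %s\n", name); }
static void unlock(const char *name) { printf("unlock %s\n", name); }

/* Pretend device with two free regions, then "no more space" (-ENOSPC). */
static bool find_free_extent(unsigned long long *start, unsigned long long *len)
{
        static const unsigned long long holes[][2] = {
                { 1 << 20, 4 << 20 }, { 16 << 20, 8 << 20 },
        };
        static int calls;

        if (calls >= 2)
                return false;
        *start = holes[calls][0];
        *len = holes[calls][1];
        calls++;
        return true;
}

int main(void)
{
        unsigned long long start = 0, len = 0;

        while (1) {
                lock("chunk_mutex");              /* blocks chunk alloc/free */
                lock("commit_root_sem");          /* blocks a commit switching roots */
                /* the real code also grabs a reference to the running transaction here */
                if (!find_free_extent(&start, &len)) {
                        unlock("commit_root_sem");
                        unlock("chunk_mutex");
                        break;
                }
                printf("discard [%llu, +%llu)\n", start, len);
                unlock("commit_root_sem");        /* released between iterations so */
                unlock("chunk_mutex");            /* the fs can keep working normally */
                start += len;
        }
        return 0;
}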

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c     |  60 ++++++++++++++++++-----------
 fs/btrfs/volumes.h     |   3 ++
 3 files changed, 141 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1e44b93..24b48df 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10143,10 +10143,99 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
 	return unpin_extent_range(root, start, end, false);
 }
 
+/*
+ * It used to be that old block groups would be left around forever.
+ * Iterating over them would be enough to trim unused space.  Since we
+ * now automatically remove them, we also need to iterate over unallocated
+ * space.
+ *
+ * We don't want a transaction for this since the discard may take a
+ * substantial amount of time.  We don't require that a transaction be
+ * running, but we do need to take a running transaction into account
+ * to ensure that we're not discarding chunks that were released in
+ * the current transaction.
+ *
+ * Holding the chunks lock will prevent other threads from allocating
+ * or releasing chunks, but it won't prevent a running transaction
+ * from committing and releasing the memory that the pending chunks
+ * list head uses.  For that, we need to take a reference to the
+ * transaction.
+ */
+static int btrfs_trim_free_extents(struct btrfs_device *device,
+				   u64 minlen, u64 *trimmed)
+{
+	u64 start = 0, len = 0;
+	int ret;
+
+	*trimmed = 0;
+
+	/* Not writeable = nothing to do. */
+	if (!device->writeable)
+		return 0;
+
+	/* No free space = nothing to do. */
+	if (device->total_bytes <= device->bytes_used)
+		return 0;
+
+	ret = 0;
+
+	while (1) {
+		struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
+		struct btrfs_transaction *trans;
+		u64 bytes;
+
+		ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
+		if (ret)
+			return ret;
+
+		down_read(&fs_info->commit_root_sem);
+
+		spin_lock(&fs_info->trans_lock);
+		trans = fs_info->running_transaction;
+		if (trans)
+			atomic_inc(&trans->use_count);
+		spin_unlock(&fs_info->trans_lock);
+
+		ret = find_free_dev_extent_start(trans, device, minlen, start,
+						 &start, &len);
+		if (trans)
+			btrfs_put_transaction(trans);
+
+		if (ret) {
+			up_read(&fs_info->commit_root_sem);
+			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret == -ENOSPC)
+				ret = 0;
+			break;
+		}
+
+		ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
+		up_read(&fs_info->commit_root_sem);
+		mutex_unlock(&fs_info->chunk_mutex);
+
+		if (ret)
+			break;
+
+		start += len;
+		*trimmed += bytes;
+
+		if (fatal_signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	return ret;
+}
+
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_group_cache *cache = NULL;
+	struct btrfs_device *device;
+	struct list_head *devices;
 	u64 group_trimmed;
 	u64 start;
 	u64 end;
@@ -10201,6 +10290,18 @@ int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 		cache = next_block_group(fs_info->tree_root, cache);
 	}
 
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	devices = &root->fs_info->fs_devices->alloc_list;
+	list_for_each_entry(device, devices, dev_alloc_list) {
+		ret = btrfs_trim_free_extents(device, range->minlen,
+					      &group_trimmed);
+		if (ret)
+			break;
+
+		trimmed += group_trimmed;
+	}
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
 	range->len = trimmed;
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 174f5e1..7fdde31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1051,15 +1051,18 @@ out:
 	return ret;
 }
 
-static int contains_pending_extent(struct btrfs_trans_handle *trans,
+static int contains_pending_extent(struct btrfs_transaction *transaction,
 				   struct btrfs_device *device,
 				   u64 *start, u64 len)
 {
+	struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
 	struct extent_map *em;
-	struct list_head *search_list = &trans->transaction->pending_chunks;
+	struct list_head *search_list = &fs_info->pinned_chunks;
 	int ret = 0;
 	u64 physical_start = *start;
 
+	if (transaction)
+		search_list = &transaction->pending_chunks;
 again:
 	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
@@ -1078,8 +1081,8 @@ again:
 			ret = 1;
 		}
 	}
-	if (search_list == &trans->transaction->pending_chunks) {
-		search_list = &trans->root->fs_info->pinned_chunks;
+	if (search_list != &fs_info->pinned_chunks) {
+		search_list = &fs_info->pinned_chunks;
 		goto again;
 	}
 
@@ -1088,12 +1091,13 @@ again:
 
 
 /*
- * find_free_dev_extent - find free space in the specified device
- * @device:	the device which we search the free space in
- * @num_bytes:	the size of the free space that we need
- * @start:	store the start of the free space.
- * @len:	the size of the free space. that we find, or the size of the max
- * 		free space if we don't find suitable free space
+ * find_free_dev_extent_start - find free space in the specified device
+ * @device:	  the device which we search the free space in
+ * @num_bytes:	  the size of the free space that we need
+ * @search_start: the position from which to begin the search
+ * @start:	  store the start of the free space.
+ * @len:	  the size of the free space. that we find, or the size
+ *		  of the max free space if we don't find suitable free space
  *
  * this uses a pretty simple search, the expectation is that it is
  * called very infrequently and that a given device has a small number
@@ -1107,9 +1111,9 @@ again:
  * But if we don't find suitable free space, it is used to store the size of
  * the max free space.
  */
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
-			 struct btrfs_device *device, u64 num_bytes,
-			 u64 *start, u64 *len)
+int find_free_dev_extent_start(struct btrfs_transaction *transaction,
+			       struct btrfs_device *device, u64 num_bytes,
+			       u64 search_start, u64 *start, u64 *len)
 {
 	struct btrfs_key key;
 	struct btrfs_root *root = device->dev_root;
@@ -1119,19 +1123,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
 	u64 max_hole_start;
 	u64 max_hole_size;
 	u64 extent_end;
-	u64 search_start;
 	u64 search_end = device->total_bytes;
 	int ret;
 	int slot;
 	struct extent_buffer *l;
 
-	/* FIXME use last free of some kind */
-
-	/* we don't want to overwrite the superblock on the drive,
-	 * so we make sure to start at an offset of at least 1MB
-	 */
-	search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
-
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -1192,7 +1188,7 @@ again:
 			 * Have to check before we set max_hole_start, otherwise
 			 * we could end up sending back this offset anyway.
 			 */
-			if (contains_pending_extent(trans, device,
+			if (contains_pending_extent(transaction, device,
 						    &search_start,
 						    hole_size)) {
 				if (key.offset >= search_start) {
@@ -1241,7 +1237,7 @@ next:
 	if (search_end > search_start) {
 		hole_size = search_end - search_start;
 
-		if (contains_pending_extent(trans, device, &search_start,
+		if (contains_pending_extent(transaction, device, &search_start,
 					    hole_size)) {
 			btrfs_release_path(path);
 			goto again;
@@ -1267,6 +1263,24 @@ out:
 	return ret;
 }
 
+int find_free_dev_extent(struct btrfs_trans_handle *trans,
+			 struct btrfs_device *device, u64 num_bytes,
+			 u64 *start, u64 *len)
+{
+	struct btrfs_root *root = device->dev_root;
+	u64 search_start;
+
+	/* FIXME use last free of some kind */
+
+	/*
+	 * we don't want to overwrite the superblock on the drive,
+	 * so we make sure to start at an offset of at least 1MB
+	 */
+	search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
+	return find_free_dev_extent_start(trans->transaction, device,
+					  num_bytes, search_start, start, len);
+}
+
 static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
 			  struct btrfs_device *device,
 			  u64 start, u64 *dev_extent_len)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..30918a8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
 int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
+int find_free_dev_extent_start(struct btrfs_transaction *transaction,
+			 struct btrfs_device *device, u64 num_bytes,
+			 u64 search_start, u64 *start, u64 *max_avail);
 int find_free_dev_extent(struct btrfs_trans_handle *trans,
 			 struct btrfs_device *device, u64 num_bytes,
 			 u64 *start, u64 *max_avail);
-- 
2.4.3


* [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
                   ` (3 preceding siblings ...)
  2015-06-15 13:41 ` [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17 10:04   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard jeffm
  2015-06-15 13:41 ` [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere jeffm
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

The cleaner thread may already be sleeping by the time we enter
close_ctree.  If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard.  We also
explicitly remove unused block groups in the ro-remount path
for the same reason.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/disk-io.c |  9 +++++++++
 fs/btrfs/super.c   | 11 +++++++++++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..2e47fef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+		/*
+		 * If the cleaner thread is stopped and there are
+		 * block groups queued for removal, the deletion will be
+		 * skipped when we quit the cleaner thread.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(root->fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9e66f5e..2ccd8d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 
 		sb->s_flags |= MS_RDONLY;
 
+		/*
+		 * Setting MS_RDONLY will put the cleaner thread to
+		 * sleep at the next loop if it's already active.
+		 * If it's already asleep, we'll leave unused block
+		 * groups on disk until we're mounted read-write again
+		 * unless we clean them up here.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		btrfs_dev_replace_suspend_for_unmount(fs_info);
 		btrfs_scrub_cancel(fs_info);
 		btrfs_pause_balance(fs_info);
-- 
2.4.3


* [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
                   ` (4 preceding siblings ...)
  2015-06-15 13:41 ` [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17 10:07   ` Filipe David Manana
  2015-06-15 13:41 ` [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere jeffm
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

When btrfs_delete_unused_bgs clears the dirty bits for extents in an
empty block group, btrfs_finish_extent_commit can no longer discard
the freed extents.

The block group removal patch added an alternate path for forgetting
extents, bypassing btrfs_finish_extent_commit.  As a result, any extents
that would be freed when the block group is removed aren't discarded.  In
my test run, a large copy of mixed-size files followed by their removal
left nearly 2/3 of the extents undiscarded.

To clean this up, we add each removed block group to a per-transaction
list and discard its extents once the transaction commit has completed.
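
For illustration, a minimal userspace sketch of the deferred-discard idea
(a toy singly-linked list stands in for the per-transaction deleted_bgs
list; all names and values are illustrative):

/* Sketch only: queue removed block groups, discard them after "commit". */
#include <stdio.h>
#include <stdlib.h>

struct block_group {
        unsigned long long start, len;
        struct block_group *next;
};

struct transaction {
        struct block_group *deleted_bgs;          /* drained after commit */
};

static void queue_deleted_bg(struct transaction *t,
                             unsigned long long start, unsigned long long len)
{
        struct block_group *bg = malloc(sizeof(*bg));

        bg->start = start;
        bg->len = len;
        bg->next = t->deleted_bgs;                /* defer; don't discard yet */
        t->deleted_bgs = bg;
}

static void finish_commit(struct transaction *t)
{
        while (t->deleted_bgs) {
                struct block_group *bg = t->deleted_bgs;

                t->deleted_bgs = bg->next;
                printf("discard bg [%llu, %llu)\n", bg->start, bg->start + bg->len);
                free(bg);
        }
}

int main(void)
{
        struct transaction t = { NULL };

        queue_deleted_bg(&t, 1ULL << 30, 1ULL << 30);
        finish_commit(&t);                        /* discards happen only here */
        return 0;
}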

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/ctree.h            |  3 ++
 fs/btrfs/extent-tree.c      | 68 ++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c | 57 +++++++++++++++++++++----------------
 fs/btrfs/super.c            |  2 +-
 fs/btrfs/transaction.c      |  2 ++
 fs/btrfs/transaction.h      |  2 ++
 6 files changed, 105 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f364e1..780acf1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start,
 			     struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
@@ -4068,6 +4070,7 @@ __printf(5, 6)
 void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
 		     unsigned int line, int errno, const char *fmt, ...);
 
+const char *btrfs_decode_error(int errno);
 
 void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root, const char *function,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 24b48df..3598440 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6103,20 +6103,19 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_group_cache *block_group, *tmp;
+	struct list_head *deleted_bgs;
 	struct extent_io_tree *unpin;
 	u64 start;
 	u64 end;
 	int ret;
 
-	if (trans->aborted)
-		return 0;
-
 	if (fs_info->pinned_extents == &fs_info->freed_extents[0])
 		unpin = &fs_info->freed_extents[1];
 	else
 		unpin = &fs_info->freed_extents[0];
 
-	while (1) {
+	while (!trans->aborted) {
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
 					    EXTENT_DIRTY, NULL);
@@ -6135,6 +6134,34 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 		cond_resched();
 	}
 
+	/*
+	 * Transaction is finished.  We don't need the lock anymore.  We
+	 * do need to clean up the block groups in case of a transaction
+	 * abort.
+	 */
+	deleted_bgs = &trans->transaction->deleted_bgs;
+	list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
+		u64 trimmed = 0;
+
+		ret = -EROFS;
+		if (!trans->aborted)
+			ret = btrfs_discard_extent(root,
+						   block_group->key.objectid,
+						   block_group->key.offset,
+						   &trimmed);
+
+		list_del_init(&block_group->bg_list);
+		btrfs_put_block_group_trimming(block_group);
+		btrfs_put_block_group(block_group);
+
+		if (ret) {
+			const char *errstr = btrfs_decode_error(ret);
+			btrfs_warn(fs_info,
+				   "Discard failed while removing blockgroup: errno=%d %s\n",
+				   ret, errstr);
+		}
+	}
+
 	return 0;
 }
 
@@ -9914,6 +9941,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * currently running transaction might finish and a new one start,
 	 * allowing for new block groups to be created that can reuse the same
 	 * physical device locations unless we take this special care.
+	 *
+	 * There may also be an implicit trim operation if the file system
+	 * is mounted with -odiscard. The same protections must remain
+	 * in place until the extents have been discarded completely when
+	 * the transaction commit has completed.
 	 */
 	remove_em = (atomic_read(&block_group->trimming) == 0);
 	/*
@@ -9988,6 +10020,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	spin_lock(&fs_info->unused_bgs_lock);
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 start, end;
+		int trimming;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group_cache,
@@ -10085,12 +10118,39 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
+		/* DISCARD can flip during remount */
+		trimming = btrfs_test_opt(root, DISCARD);
+
+		/* Implicit trim during transaction commit. */
+		if (trimming)
+			btrfs_get_block_group_trimming(block_group);
+
 		/*
 		 * Btrfs_remove_chunk will abort the transaction if things go
 		 * horribly wrong.
 		 */
 		ret = btrfs_remove_chunk(trans, root,
 					 block_group->key.objectid);
+
+		if (ret) {
+			if (trimming)
+				btrfs_put_block_group_trimming(block_group);
+			goto end_trans;
+		}
+
+		/*
+		 * If we're not mounted with -odiscard, we can just forget
+		 * about this block group. Otherwise we'll need to wait
+		 * until transaction commit to do the actual discard.
+		 */
+		if (trimming) {
+			WARN_ON(!list_empty(&block_group->bg_list));
+			spin_lock(&trans->transaction->deleted_bgs_lock);
+			list_move(&block_group->bg_list,
+				  &trans->transaction->deleted_bgs);
+			spin_unlock(&trans->transaction->deleted_bgs_lock);
+			btrfs_get_block_group(block_group);
+		}
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 9dbe5b5..c79253e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3274,35 +3274,23 @@ next:
 	return ret;
 }
 
-int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
 {
-	int ret;
+	atomic_inc(&cache->trimming);
+}
 
-	*trimmed = 0;
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
+{
+	struct extent_map_tree *em_tree;
+	struct extent_map *em;
+	bool cleanup;
 
 	spin_lock(&block_group->lock);
-	if (block_group->removed) {
-		spin_unlock(&block_group->lock);
-		return 0;
-	}
-	atomic_inc(&block_group->trimming);
+	cleanup = (atomic_dec_and_test(&block_group->trimming) &&
+		   block_group->removed);
 	spin_unlock(&block_group->lock);
 
-	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
-	if (ret)
-		goto out;
-
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
-out:
-	spin_lock(&block_group->lock);
-	if (atomic_dec_and_test(&block_group->trimming) &&
-	    block_group->removed) {
-		struct extent_map_tree *em_tree;
-		struct extent_map *em;
-
-		spin_unlock(&block_group->lock);
-
+	if (cleanup) {
 		lock_chunks(block_group->fs_info->chunk_root);
 		em_tree = &block_group->fs_info->mapping_tree.map_tree;
 		write_lock(&em_tree->lock);
@@ -3326,10 +3314,31 @@ out:
 		 * this block group have left 1 entry each one. Free them.
 		 */
 		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
-	} else {
+	}
+}
+
+int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
+			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+{
+	int ret;
+
+	*trimmed = 0;
+
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
 		spin_unlock(&block_group->lock);
+		return 0;
 	}
+	btrfs_get_block_group_trimming(block_group);
+	spin_unlock(&block_group->lock);
+
+	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
+	if (ret)
+		goto out;
 
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	btrfs_put_block_group_trimming(block_group);
 	return ret;
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2ccd8d4..a80da03 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
-static const char *btrfs_decode_error(int errno)
+const char *btrfs_decode_error(int errno)
 {
 	char *errstr = "unknown";
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5628e25..2005262 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -256,6 +256,8 @@ loop:
 	mutex_init(&cur_trans->cache_write_mutex);
 	cur_trans->num_dirty_bgs = 0;
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
+	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
+	spin_lock_init(&cur_trans->deleted_bgs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(&cur_trans->dirty_pages,
 			     fs_info->btree_inode->i_mapping);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0b24755..14325f2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -74,6 +74,8 @@ struct btrfs_transaction {
 	 */
 	struct mutex cache_write_mutex;
 	spinlock_t dirty_bgs_lock;
+	struct list_head deleted_bgs;
+	spinlock_t deleted_bgs_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	int aborted;
 	int dirty_bg_run;
-- 
2.4.3


* [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere
  2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
                   ` (5 preceding siblings ...)
  2015-06-15 13:41 ` [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard jeffm
@ 2015-06-15 13:41 ` jeffm
  2015-06-17 10:08   ` Filipe David Manana
  6 siblings, 1 reply; 19+ messages in thread
From: jeffm @ 2015-06-15 13:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
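
For illustration, a minimal sketch of the flag-discriminated union pattern
(the types here are simplified stand-ins for the real btrfs structures):

/* Sketch only: replace "(struct map_lookup *)em->bdev" casts with a union. */
#include <stdio.h>

#define FLAG_FS_MAPPING 1UL

struct map_lookup { int num_stripes; };

struct extent_map {
        unsigned long flags;
        union {
                void *bdev;                       /* regular extents */
                struct map_lookup *map_lookup;    /* chunk mappings; FLAG_FS_MAPPING set */
        };
};

int main(void)
{
        struct map_lookup map = { .num_stripes = 2 };
        struct extent_map em = { .flags = FLAG_FS_MAPPING, .map_lookup = &map };

        /* No cast needed at the lookup site anymore. */
        printf("stripes: %d\n", em.map_lookup->num_stripes);
        return 0;
}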

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/extent_map.c  |  2 +-
 fs/btrfs/extent_map.h  | 10 +++++++++-
 fs/btrfs/scrub.c       |  2 +-
 fs/btrfs/volumes.c     | 24 ++++++++++++------------
 5 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 0573848..2ad3289 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -613,7 +613,7 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 		em = lookup_extent_mapping(em_tree, start, (u64)-1);
 		if (!em)
 			break;
-		map = (struct map_lookup *)em->bdev;
+		map = em->map_lookup;
 		for (i = 0; i < map->num_stripes; i++)
 			if (srcdev == map->stripes[i].dev)
 				map->stripes[i].dev = tgtdev;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 6a98bdd..84fb56d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -76,7 +76,7 @@ void free_extent_map(struct extent_map *em)
 		WARN_ON(extent_map_in_tree(em));
 		WARN_ON(!list_empty(&em->list));
 		if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
-			kfree(em->bdev);
+			kfree(em->map_lookup);
 		kmem_cache_free(extent_map_cache, em);
 	}
 }
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index b2991fd..eb8b8fa 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -32,7 +32,15 @@ struct extent_map {
 	u64 block_len;
 	u64 generation;
 	unsigned long flags;
-	struct block_device *bdev;
+	union {
+		struct block_device *bdev;
+
+		/*
+		 * used for chunk mappings
+		 * flags & EXTENT_FLAG_FS_MAPPING must be set
+		 */
+		struct map_lookup *map_lookup;
+	};
 	atomic_t refs;
 	unsigned int compress_type;
 	struct list_head list;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab58115..19f7241d 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3339,7 +3339,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	if (!em)
 		return -EINVAL;
 
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	if (em->start != chunk_offset)
 		goto out;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7fdde31..9f48ae5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1068,7 +1068,7 @@ again:
 		struct map_lookup *map;
 		int i;
 
-		map = (struct map_lookup *)em->bdev;
+		map = em->map_lookup;
 		for (i = 0; i < map->num_stripes; i++) {
 			if (map->stripes[i].dev != device)
 				continue;
@@ -2622,7 +2622,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 			free_extent_map(em);
 		return -EINVAL;
 	}
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 
 	for (i = 0; i < map->num_stripes; i++) {
 		struct btrfs_device *device = map->stripes[i].dev;
@@ -4465,7 +4465,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		goto error;
 	}
 	set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
-	em->bdev = (struct block_device *)map;
+	em->map_lookup = map;
 	em->start = start;
 	em->len = num_bytes;
 	em->block_start = 0;
@@ -4560,7 +4560,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 		return -EINVAL;
 	}
 
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	item_size = btrfs_chunk_item_size(map->num_stripes);
 	stripe_size = em->orig_block_len;
 
@@ -4702,7 +4702,7 @@ int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset)
 	if (!em)
 		return 1;
 
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	for (i = 0; i < map->num_stripes; i++) {
 		if (map->stripes[i].dev->missing) {
 			miss_ndevs++;
@@ -4782,7 +4782,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		return 1;
 	}
 
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
@@ -4818,7 +4818,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
 	BUG_ON(!em);
 
 	BUG_ON(em->start > logical || em->start + em->len < logical);
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
 		len = map->stripe_len * nr_data_stripes(map);
 	free_extent_map(em);
@@ -4839,7 +4839,7 @@ int btrfs_is_parity_mirror(struct btrfs_mapping_tree *map_tree,
 	BUG_ON(!em);
 
 	BUG_ON(em->start > logical || em->start + em->len < logical);
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
 		ret = 1;
 	free_extent_map(em);
@@ -5000,7 +5000,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		return -EINVAL;
 	}
 
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 	offset = logical - em->start;
 
 	stripe_len = map->stripe_len;
@@ -5542,7 +5542,7 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		free_extent_map(em);
 		return -EIO;
 	}
-	map = (struct map_lookup *)em->bdev;
+	map = em->map_lookup;
 
 	length = em->len;
 	rmap_len = map->stripe_len;
@@ -6057,7 +6057,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 	}
 
 	set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
-	em->bdev = (struct block_device *)map;
+	em->map_lookup = map;
 	em->start = logical;
 	em->len = length;
 	em->orig_start = 0;
@@ -6733,7 +6733,7 @@ void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
 	/* In order to kick the device replace finish process */
 	lock_chunks(root);
 	list_for_each_entry(em, &transaction->pending_chunks, list) {
-		map = (struct map_lookup *)em->bdev;
+		map = em->map_lookup;
 
 		for (i = 0; i < map->num_stripes; i++) {
 			dev = map->stripes[i].dev;
-- 
2.4.3


* [PATCH v5] btrfs: fix automatic blockgroup remove + discard
  2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
@ 2015-06-15 13:45 Jeff Mahoney
  2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Jeff Mahoney @ 2015-06-15 13:45 UTC (permalink / raw)
  To: linux-btrfs

Apologies for the out-of-order post. git send-email caught a ^C
at just the right time and sent the patches without my confirmation. That's
also why the map->lookup casting patch got mixed in here.

The automatic block group removal patch introduced some regressions
in how discards are handled.

1/ FITRIM only iterates over block groups on disk - removed block groups
   won't be trimmed.
2/ Clearing the dirty bit from extents in removed block groups means that
   those extents won't be discarded when the block group is removed.
3/ More of a UI wart: We don't wait on block groups to be removed during
   read-only remount or fs umount. This results in block groups that
   /should/ have been discarded on thin provisioned storage hanging around
   until the file system is mounted read-write again.

The following patches address these problems by:
1/ Iterating over block groups on disk and then iterating over free space.
   This is consistent with how other file systems handle FITRIM.
2/ Putting removed block groups on a list so that they are automatically
   discarded during btrfs_finish_extent_commit after transaction commit.
   Note: This may still leave undiscarded space on disk if the system
   crashes after transaction commit but before discard. The file system
   itself will be completely consistent, but the user will need to trim
   manually.
3/ Simple: We call btrfs_delete_unused_bgs explicitly during ro-remount
   and umount.

Changelog:

v1->v2
- -odiscard
  - Fix ordering to ensure that we don't discard extents freed in an
    uncommitted transaction.
- FITRIM
  - Don't start a transaction so the entire run is transactionless.
  - The loop can be interrupted while waiting on the chunk mutex and
    after the discard has completed.
  - The only lock held for the duration is the device_list_mutex.  The
    chunk mutex is taken per loop iteration so normal operations should
    continue while we're running, even on large file systems.

v2->v3
- -odiscard
  - Factor out get/put block_group->trimming to ensure that cleanup always
    happens at the last reference drop.
  - Clean up the free space cache on the last reference drop.
  - Use list_move instead of list_add in case of multiple adds.  We still
    issue a warning but we shouldn't fall over.
  - Explicitly delete unused block groups in close_ctree and ro-remount.
- FITRIM
  - Cleaned up pointer tricks that abused &NULL->member.
  - Take the commit_root_sem across loop iterations to protect against
    a transaction commit moving the commit root.

v3->v4
- new: skip superblocks during discard
- new: btrfs_issue_discard passes back discarded_bytes on success

v4->v5
- new: check/align/warn on unaligned starting offset in btrfs_issue_discard
- new: handle case where superblock begins before the range on <4k filesystems
- no changes to other patches

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs

* Re: [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded
  2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
@ 2015-06-17  9:53   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17  9:53 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> Initially this will just be the length argument passed to it,
> but the following patches will adjust that to reflect re-alignment
> and skipped blocks.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ec3acd..da1145d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,17 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
>         return ret;
>  }
>
> -static int btrfs_issue_discard(struct block_device *bdev,
> -                               u64 start, u64 len)
> +static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> +                              u64 *discarded_bytes)
>  {
> -       return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +       int ret = 0;
> +
> +       *discarded_bytes = 0;
> +       ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +       if (!ret)
> +               *discarded_bytes = len;
> +
> +       return ret;
>  }
>
>  int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
> @@ -1908,14 +1915,16 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
>
>
>                 for (i = 0; i < bbio->num_stripes; i++, stripe++) {
> +                       u64 bytes;
>                         if (!stripe->dev->can_discard)
>                                 continue;
>
>                         ret = btrfs_issue_discard(stripe->dev->bdev,
>                                                   stripe->physical,
> -                                                 stripe->length);
> +                                                 stripe->length,
> +                                                 &bytes);
>                         if (!ret)
> -                               discarded_bytes += stripe->length;
> +                               discarded_bytes += bytes;
>                         else if (ret != -EOPNOTSUPP)
>                                 break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
>
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

* Re: [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
  2015-06-15 13:41 ` [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries jeffm
@ 2015-06-17  9:55   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17  9:55 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> It's possible, though unexpected, to pass unaligned offsets and lengths
> to btrfs_issue_discard.  We then shift the offset/length values to sector
> units.  If an unaligned offset has been passed, it will result in the
> entire sector being discarded, possibly losing data.  An unaligned
> length is safe but we'll end up returning an inaccurate number of
> discarded bytes.
>
> This patch aligns the offset to the 512B boundary, adjusts the length,
> and warns, since we shouldn't be discarding on an offset that isn't
> aligned with our sector size.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da1145d..cf9cefd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1888,12 +1888,21 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>                                u64 *discarded_bytes)
>  {
>         int ret = 0;
> +       u64 aligned_start = ALIGN(start, 1 << 9);
>
> -       *discarded_bytes = 0;
> -       ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> -       if (!ret)
> -               *discarded_bytes = len;
> +       if (WARN_ON(start != aligned_start)) {
> +               len -= aligned_start - start;
> +               len = round_down(len, 1 << 9);
> +               start = aligned_start;
> +       }
>
> +       *discarded_bytes = 0;
> +       if (len) {
> +               ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +                                          GFP_NOFS, 0);
> +               if (!ret)
> +                       *discarded_bytes = len;
> +       }
>         return ret;
>  }
>
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

* Re: [PATCH 3/7] btrfs: skip superblocks during discard
  2015-06-15 13:41 ` [PATCH 3/7] btrfs: skip superblocks during discard jeffm
@ 2015-06-17  9:57   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17  9:57 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> Btrfs doesn't track superblocks with extent records so there is nothing
> persistent on-disk to indicate that those blocks are in use.  We track
> the superblocks in memory to ensure they don't get used by removing them
> from the free space cache when we load a block group from disk.  Prior
> to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
> was fine since the block group would never be reclaimed so the superblock
> was always safe.  Once we started removing the empty block groups, we
> were protected by the fact that discards weren't being properly issued
> for unused space either via FITRIM or -odiscard.  The block groups were
> still being released, but the blocks remained on disk.
>
> In order to properly discard unused block groups, we need to filter out
> the superblocks from the discard range.  Superblocks are located at fixed
> locations on each device, so it makes sense to filter them out in
> btrfs_issue_discard, which is used by both -odiscard and FITRIM.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cf9cefd..1e44b93 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,12 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
>         return ret;
>  }
>
> +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
>  static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>                                u64 *discarded_bytes)
>  {
> -       int ret = 0;
> +       int j, ret = 0;
> +       u64 bytes_left, end;
>         u64 aligned_start = ALIGN(start, 1 << 9);
>
>         if (WARN_ON(start != aligned_start)) {
> @@ -1897,11 +1899,60 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>         }
>
>         *discarded_bytes = 0;
> -       if (len) {
> -               ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +
> +       if (!len)
> +               return 0;
> +
> +       end = start + len;
> +       bytes_left = len;
> +
> +       /* Skip any superblocks on this device. */
> +       for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
> +               u64 sb_start = btrfs_sb_offset(j);
> +               u64 sb_end = sb_start + BTRFS_SUPER_INFO_SIZE;
> +               u64 size = sb_start - start;
> +
> +               if (!in_range(sb_start, start, bytes_left) &&
> +                   !in_range(sb_end, start, bytes_left) &&
> +                   !in_range(start, sb_start, BTRFS_SUPER_INFO_SIZE))
> +                       continue;
> +
> +               /*
> +                * Superblock spans beginning of range.  Adjust start and
> +                * try again.
> +                */
> +               if (sb_start <= start) {
> +                       start += sb_end - start;
> +                       if (start > end) {
> +                               bytes_left = 0;
> +                               break;
> +                       }
> +                       bytes_left = end - start;
> +                       continue;
> +               }
> +
> +               if (size) {
> +                       ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
> +                                                  GFP_NOFS, 0);
> +                       if (!ret)
> +                               *discarded_bytes += size;
> +                       else if (ret != -EOPNOTSUPP)
> +                               return ret;
> +               }
> +
> +               start = sb_end;
> +               if (start > end) {
> +                       bytes_left = 0;
> +                       break;
> +               }
> +               bytes_left = end - start;
> +       }
> +
> +       if (bytes_left) {
> +               ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
>                                            GFP_NOFS, 0);
>                 if (!ret)
> -                       *discarded_bytes = len;
> +                       *discarded_bytes += bytes_left;
>         }
>         return ret;
>  }
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

* Re: [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM
  2015-06-15 13:41 ` [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM jeffm
@ 2015-06-17 10:03   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 10:03 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> Since we now clean up block groups automatically as they become
> empty, iterating over block groups is no longer sufficient to discard
> unused space.
>
> This patch iterates over the unused chunk space and discards any regions
> that are unallocated, regardless of whether they were ever used.  This is
> a change for btrfs but is consistent with other file systems.
>
> We do this in a transactionless manner since the discard process can take
> a substantial amount of time and a transaction would need to be started
> before the acquisition of the device list lock.  That would mean a
> transaction would be held open across /all/ of the discards collectively.
> In order to prevent other threads from allocating or freeing chunks, we
> hold the chunks lock across the search and discard calls.  We release it
> between searches to allow the file system to perform more-or-less
> normally.  Since the running transaction can commit and disappear while
> we're using the transaction pointer, we take a reference to it and
> release it after the search.  This is safe since it would happen normally
> at the end of the transaction commit after any locks are released anyway.
> We also take the commit_root_sem to protect against a transaction starting
> and committing while we're running.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

Side note: this still doesn't apply cleanly on the latest integration
branch (integration-4.2); it results in warnings about casting a pointer
from a different type (btrfs_trans_handle to btrfs_transaction) at
btrfs_shrink_device().

> ---
>  fs/btrfs/extent-tree.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c     |  60 ++++++++++++++++++-----------
>  fs/btrfs/volumes.h     |   3 ++
>  3 files changed, 141 insertions(+), 23 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 1e44b93..24b48df 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10143,10 +10143,99 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>         return unpin_extent_range(root, start, end, false);
>  }
>
> +/*
> + * It used to be that old block groups would be left around forever.
> + * Iterating over them would be enough to trim unused space.  Since we
> + * now automatically remove them, we also need to iterate over unallocated
> + * space.
> + *
> + * We don't want a transaction for this since the discard may take a
> + * substantial amount of time.  We don't require that a transaction be
> + * running, but we do need to take a running transaction into account
> + * to ensure that we're not discarding chunks that were released in
> + * the current transaction.
> + *
> + * Holding the chunks lock will prevent other threads from allocating
> + * or releasing chunks, but it won't prevent a running transaction
> + * from committing and releasing the memory that the pending chunks
> + * list head uses.  For that, we need to take a reference to the
> + * transaction.
> + */
> +static int btrfs_trim_free_extents(struct btrfs_device *device,
> +                                  u64 minlen, u64 *trimmed)
> +{
> +       u64 start = 0, len = 0;
> +       int ret;
> +
> +       *trimmed = 0;
> +
> +       /* Not writeable = nothing to do. */
> +       if (!device->writeable)
> +               return 0;
> +
> +       /* No free space = nothing to do. */
> +       if (device->total_bytes <= device->bytes_used)
> +               return 0;
> +
> +       ret = 0;
> +
> +       while (1) {
> +               struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
> +               struct btrfs_transaction *trans;
> +               u64 bytes;
> +
> +               ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
> +               if (ret)
> +                       return ret;
> +
> +               down_read(&fs_info->commit_root_sem);
> +
> +               spin_lock(&fs_info->trans_lock);
> +               trans = fs_info->running_transaction;
> +               if (trans)
> +                       atomic_inc(&trans->use_count);
> +               spin_unlock(&fs_info->trans_lock);
> +
> +               ret = find_free_dev_extent_start(trans, device, minlen, start,
> +                                                &start, &len);
> +               if (trans)
> +                       btrfs_put_transaction(trans);
> +
> +               if (ret) {
> +                       up_read(&fs_info->commit_root_sem);
> +                       mutex_unlock(&fs_info->chunk_mutex);
> +                       if (ret == -ENOSPC)
> +                               ret = 0;
> +                       break;
> +               }
> +
> +               ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
> +               up_read(&fs_info->commit_root_sem);
> +               mutex_unlock(&fs_info->chunk_mutex);
> +
> +               if (ret)
> +                       break;
> +
> +               start += len;
> +               *trimmed += bytes;
> +
> +               if (fatal_signal_pending(current)) {
> +                       ret = -ERESTARTSYS;
> +                       break;
> +               }
> +
> +               cond_resched();
> +       }
> +
> +       return ret;
> +}
> +
>  int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
>  {
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_block_group_cache *cache = NULL;
> +       struct btrfs_device *device;
> +       struct list_head *devices;
>         u64 group_trimmed;
>         u64 start;
>         u64 end;
> @@ -10201,6 +10290,18 @@ int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
>                 cache = next_block_group(fs_info->tree_root, cache);
>         }
>
> +       mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> +       devices = &root->fs_info->fs_devices->alloc_list;
> +       list_for_each_entry(device, devices, dev_alloc_list) {
> +               ret = btrfs_trim_free_extents(device, range->minlen,
> +                                             &group_trimmed);
> +               if (ret)
> +                       break;
> +
> +               trimmed += group_trimmed;
> +       }
> +       mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +
>         range->len = trimmed;
>         return ret;
>  }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 174f5e1..7fdde31 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1051,15 +1051,18 @@ out:
>         return ret;
>  }
>
> -static int contains_pending_extent(struct btrfs_trans_handle *trans,
> +static int contains_pending_extent(struct btrfs_transaction *transaction,
>                                    struct btrfs_device *device,
>                                    u64 *start, u64 len)
>  {
> +       struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
>         struct extent_map *em;
> -       struct list_head *search_list = &trans->transaction->pending_chunks;
> +       struct list_head *search_list = &fs_info->pinned_chunks;
>         int ret = 0;
>         u64 physical_start = *start;
>
> +       if (transaction)
> +               search_list = &transaction->pending_chunks;
>  again:
>         list_for_each_entry(em, search_list, list) {
>                 struct map_lookup *map;
> @@ -1078,8 +1081,8 @@ again:
>                         ret = 1;
>                 }
>         }
> -       if (search_list == &trans->transaction->pending_chunks) {
> -               search_list = &trans->root->fs_info->pinned_chunks;
> +       if (search_list != &fs_info->pinned_chunks) {
> +               search_list = &fs_info->pinned_chunks;
>                 goto again;
>         }
>
> @@ -1088,12 +1091,13 @@ again:
>
>
>  /*
> - * find_free_dev_extent - find free space in the specified device
> - * @device:    the device which we search the free space in
> - * @num_bytes: the size of the free space that we need
> - * @start:     store the start of the free space.
> - * @len:       the size of the free space. that we find, or the size of the max
> - *             free space if we don't find suitable free space
> + * find_free_dev_extent_start - find free space in the specified device
> + * @device:      the device which we search the free space in
> + * @num_bytes:   the size of the free space that we need
> + * @search_start: the position from which to begin the search
> + * @start:       store the start of the free space.
> + * @len:         the size of the free space. that we find, or the size
> + *               of the max free space if we don't find suitable free space
>   *
>   * this uses a pretty simple search, the expectation is that it is
>   * called very infrequently and that a given device has a small number
> @@ -1107,9 +1111,9 @@ again:
>   * But if we don't find suitable free space, it is used to store the size of
>   * the max free space.
>   */
> -int find_free_dev_extent(struct btrfs_trans_handle *trans,
> -                        struct btrfs_device *device, u64 num_bytes,
> -                        u64 *start, u64 *len)
> +int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> +                              struct btrfs_device *device, u64 num_bytes,
> +                              u64 search_start, u64 *start, u64 *len)
>  {
>         struct btrfs_key key;
>         struct btrfs_root *root = device->dev_root;
> @@ -1119,19 +1123,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
>         u64 max_hole_start;
>         u64 max_hole_size;
>         u64 extent_end;
> -       u64 search_start;
>         u64 search_end = device->total_bytes;
>         int ret;
>         int slot;
>         struct extent_buffer *l;
>
> -       /* FIXME use last free of some kind */
> -
> -       /* we don't want to overwrite the superblock on the drive,
> -        * so we make sure to start at an offset of at least 1MB
> -        */
> -       search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> -
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> @@ -1192,7 +1188,7 @@ again:
>                          * Have to check before we set max_hole_start, otherwise
>                          * we could end up sending back this offset anyway.
>                          */
> -                       if (contains_pending_extent(trans, device,
> +                       if (contains_pending_extent(transaction, device,
>                                                     &search_start,
>                                                     hole_size)) {
>                                 if (key.offset >= search_start) {
> @@ -1241,7 +1237,7 @@ next:
>         if (search_end > search_start) {
>                 hole_size = search_end - search_start;
>
> -               if (contains_pending_extent(trans, device, &search_start,
> +               if (contains_pending_extent(transaction, device, &search_start,
>                                             hole_size)) {
>                         btrfs_release_path(path);
>                         goto again;
> @@ -1267,6 +1263,24 @@ out:
>         return ret;
>  }
>
> +int find_free_dev_extent(struct btrfs_trans_handle *trans,
> +                        struct btrfs_device *device, u64 num_bytes,
> +                        u64 *start, u64 *len)
> +{
> +       struct btrfs_root *root = device->dev_root;
> +       u64 search_start;
> +
> +       /* FIXME use last free of some kind */
> +
> +       /*
> +        * we don't want to overwrite the superblock on the drive,
> +        * so we make sure to start at an offset of at least 1MB
> +        */
> +       search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> +       return find_free_dev_extent_start(trans->transaction, device,
> +                                         num_bytes, search_start, start, len);
> +}
> +
>  static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
>                           struct btrfs_device *device,
>                           u64 start, u64 *dev_extent_len)
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc3133..30918a8 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
>  int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
> +int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> +                        struct btrfs_device *device, u64 num_bytes,
> +                        u64 search_start, u64 *start, u64 *max_avail);
>  int find_free_dev_extent(struct btrfs_trans_handle *trans,
>                          struct btrfs_device *device, u64 num_bytes,
>                          u64 *start, u64 *max_avail);
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-15 13:41 ` [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount jeffm
@ 2015-06-17 10:04   ` Filipe David Manana
  2015-06-17 13:24     ` Filipe David Manana
  0 siblings, 1 reply; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 10:04 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> The cleaner thread may already be sleeping by the time we enter
> close_ctree.  If that's the case, we'll skip removing any unused
> block groups queued for removal, even during a normal umount.
> They'll be cleaned up automatically at next mount, but users
> expect a umount to be a clean synchronization point, especially
> when used on thin-provisioned storage with -odiscard.  We also
> explicitly remove unused block groups in the ro-remount path
> for the same reason.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/disk-io.c |  9 +++++++++
>  fs/btrfs/super.c   | 11 +++++++++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..2e47fef 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
>         cancel_work_sync(&fs_info->async_reclaim_work);
>
>         if (!(fs_info->sb->s_flags & MS_RDONLY)) {
> +               /*
> +                * If the cleaner thread is stopped and there are
> +                * block groups queued for removal, the deletion will be
> +                * skipped when we quit the cleaner thread.
> +                */
> +               mutex_lock(&root->fs_info->cleaner_mutex);
> +               btrfs_delete_unused_bgs(root->fs_info);
> +               mutex_unlock(&root->fs_info->cleaner_mutex);
> +
>                 ret = btrfs_commit_super(root);
>                 if (ret)
>                         btrfs_err(fs_info, "commit super ret %d", ret);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9e66f5e..2ccd8d4 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>
>                 sb->s_flags |= MS_RDONLY;
>
> +               /*
> +                * Setting MS_RDONLY will put the cleaner thread to
> +                * sleep at the next loop if it's already active.
> +                * If it's already asleep, we'll leave unused block
> +                * groups on disk until we're mounted read-write again
> +                * unless we clean them up here.
> +                */
> +               mutex_lock(&root->fs_info->cleaner_mutex);
> +               btrfs_delete_unused_bgs(fs_info);
> +               mutex_unlock(&root->fs_info->cleaner_mutex);
> +
>                 btrfs_dev_replace_suspend_for_unmount(fs_info);
>                 btrfs_scrub_cancel(fs_info);
>                 btrfs_pause_balance(fs_info);
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard
  2015-06-15 13:41 ` [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard jeffm
@ 2015-06-17 10:07   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 10:07 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> When we clear the dirty bits in btrfs_delete_unused_bgs for extents
> in the empty block group, it results in btrfs_finish_extent_commit being
> unable to discard the freed extents.
>
> The block group removal patch added an alternate path to forget extents
> other than btrfs_finish_extent_commit.  As a result, any extents that
> would be freed when the block group is removed aren't discarded.  In my
> test run, with a large copy of mixed sized files followed by removal, it
> left nearly 2/3 of extents undiscarded.
>
> To clean up the block groups, we add the removed block group onto a list
> that will be discarded after transaction commit.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/ctree.h            |  3 ++
>  fs/btrfs/extent-tree.c      | 68 ++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.c | 57 +++++++++++++++++++++----------------
>  fs/btrfs/super.c            |  2 +-
>  fs/btrfs/transaction.c      |  2 ++
>  fs/btrfs/transaction.h      |  2 ++
>  6 files changed, 105 insertions(+), 29 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6f364e1..780acf1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>                              struct btrfs_root *root, u64 group_start,
>                              struct extent_map *em);
>  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
> +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
> +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>                                        struct btrfs_root *root);
>  u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
> @@ -4068,6 +4070,7 @@ __printf(5, 6)
>  void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
>                      unsigned int line, int errno, const char *fmt, ...);
>
> +const char *btrfs_decode_error(int errno);
>
>  void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
>                                struct btrfs_root *root, const char *function,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 24b48df..3598440 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6103,20 +6103,19 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>                                struct btrfs_root *root)
>  {
>         struct btrfs_fs_info *fs_info = root->fs_info;
> +       struct btrfs_block_group_cache *block_group, *tmp;
> +       struct list_head *deleted_bgs;
>         struct extent_io_tree *unpin;
>         u64 start;
>         u64 end;
>         int ret;
>
> -       if (trans->aborted)
> -               return 0;
> -
>         if (fs_info->pinned_extents == &fs_info->freed_extents[0])
>                 unpin = &fs_info->freed_extents[1];
>         else
>                 unpin = &fs_info->freed_extents[0];
>
> -       while (1) {
> +       while (!trans->aborted) {
>                 mutex_lock(&fs_info->unused_bg_unpin_mutex);
>                 ret = find_first_extent_bit(unpin, 0, &start, &end,
>                                             EXTENT_DIRTY, NULL);
> @@ -6135,6 +6134,34 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>                 cond_resched();
>         }
>
> +       /*
> +        * Transaction is finished.  We don't need the lock anymore.  We
> +        * do need to clean up the block groups in case of a transaction
> +        * abort.
> +        */
> +       deleted_bgs = &trans->transaction->deleted_bgs;
> +       list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
> +               u64 trimmed = 0;
> +
> +               ret = -EROFS;
> +               if (!trans->aborted)
> +                       ret = btrfs_discard_extent(root,
> +                                                  block_group->key.objectid,
> +                                                  block_group->key.offset,
> +                                                  &trimmed);
> +
> +               list_del_init(&block_group->bg_list);
> +               btrfs_put_block_group_trimming(block_group);
> +               btrfs_put_block_group(block_group);
> +
> +               if (ret) {
> +                       const char *errstr = btrfs_decode_error(ret);
> +                       btrfs_warn(fs_info,
> +                                  "Discard failed while removing blockgroup: errno=%d %s\n",
> +                                  ret, errstr);
> +               }
> +       }
> +
>         return 0;
>  }
>
> @@ -9914,6 +9941,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>          * currently running transaction might finish and a new one start,
>          * allowing for new block groups to be created that can reuse the same
>          * physical device locations unless we take this special care.
> +        *
> +        * There may also be an implicit trim operation if the file system
> +        * is mounted with -odiscard. The same protections must remain
> +        * in place until the extents have been discarded completely when
> +        * the transaction commit has completed.
>          */
>         remove_em = (atomic_read(&block_group->trimming) == 0);
>         /*
> @@ -9988,6 +10020,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>         spin_lock(&fs_info->unused_bgs_lock);
>         while (!list_empty(&fs_info->unused_bgs)) {
>                 u64 start, end;
> +               int trimming;
>
>                 block_group = list_first_entry(&fs_info->unused_bgs,
>                                                struct btrfs_block_group_cache,
> @@ -10085,12 +10118,39 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                 spin_unlock(&block_group->lock);
>                 spin_unlock(&space_info->lock);
>
> +               /* DISCARD can flip during remount */
> +               trimming = btrfs_test_opt(root, DISCARD);
> +
> +               /* Implicit trim during transaction commit. */
> +               if (trimming)
> +                       btrfs_get_block_group_trimming(block_group);
> +
>                 /*
>                  * Btrfs_remove_chunk will abort the transaction if things go
>                  * horribly wrong.
>                  */
>                 ret = btrfs_remove_chunk(trans, root,
>                                          block_group->key.objectid);
> +
> +               if (ret) {
> +                       if (trimming)
> +                               btrfs_put_block_group_trimming(block_group);
> +                       goto end_trans;
> +               }
> +
> +               /*
> +                * If we're not mounted with -odiscard, we can just forget
> +                * about this block group. Otherwise we'll need to wait
> +                * until transaction commit to do the actual discard.
> +                */
> +               if (trimming) {
> +                       WARN_ON(!list_empty(&block_group->bg_list));
> +                       spin_lock(&trans->transaction->deleted_bgs_lock);
> +                       list_move(&block_group->bg_list,
> +                                 &trans->transaction->deleted_bgs);
> +                       spin_unlock(&trans->transaction->deleted_bgs_lock);
> +                       btrfs_get_block_group(block_group);
> +               }
>  end_trans:
>                 btrfs_end_transaction(trans, root);
>  next:
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 9dbe5b5..c79253e 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3274,35 +3274,23 @@ next:
>         return ret;
>  }
>
> -int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> -                          u64 *trimmed, u64 start, u64 end, u64 minlen)
> +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
>  {
> -       int ret;
> +       atomic_inc(&cache->trimming);
> +}
>
> -       *trimmed = 0;
> +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
> +{
> +       struct extent_map_tree *em_tree;
> +       struct extent_map *em;
> +       bool cleanup;
>
>         spin_lock(&block_group->lock);
> -       if (block_group->removed) {
> -               spin_unlock(&block_group->lock);
> -               return 0;
> -       }
> -       atomic_inc(&block_group->trimming);
> +       cleanup = (atomic_dec_and_test(&block_group->trimming) &&
> +                  block_group->removed);
>         spin_unlock(&block_group->lock);
>
> -       ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> -       if (ret)
> -               goto out;
> -
> -       ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> -out:
> -       spin_lock(&block_group->lock);
> -       if (atomic_dec_and_test(&block_group->trimming) &&
> -           block_group->removed) {
> -               struct extent_map_tree *em_tree;
> -               struct extent_map *em;
> -
> -               spin_unlock(&block_group->lock);
> -
> +       if (cleanup) {
>                 lock_chunks(block_group->fs_info->chunk_root);
>                 em_tree = &block_group->fs_info->mapping_tree.map_tree;
>                 write_lock(&em_tree->lock);
> @@ -3326,10 +3314,31 @@ out:
>                  * this block group have left 1 entry each one. Free them.
>                  */
>                 __btrfs_remove_free_space_cache(block_group->free_space_ctl);
> -       } else {
> +       }
> +}
> +
> +int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> +                          u64 *trimmed, u64 start, u64 end, u64 minlen)
> +{
> +       int ret;
> +
> +       *trimmed = 0;
> +
> +       spin_lock(&block_group->lock);
> +       if (block_group->removed) {
>                 spin_unlock(&block_group->lock);
> +               return 0;
>         }
> +       btrfs_get_block_group_trimming(block_group);
> +       spin_unlock(&block_group->lock);
> +
> +       ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> +       if (ret)
> +               goto out;
>
> +       ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> +out:
> +       btrfs_put_block_group_trimming(block_group);
>         return ret;
>  }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2ccd8d4..a80da03 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>
> -static const char *btrfs_decode_error(int errno)
> +const char *btrfs_decode_error(int errno)
>  {
>         char *errstr = "unknown";
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..2005262 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -256,6 +256,8 @@ loop:
>         mutex_init(&cur_trans->cache_write_mutex);
>         cur_trans->num_dirty_bgs = 0;
>         spin_lock_init(&cur_trans->dirty_bgs_lock);
> +       INIT_LIST_HEAD(&cur_trans->deleted_bgs);
> +       spin_lock_init(&cur_trans->deleted_bgs_lock);
>         list_add_tail(&cur_trans->list, &fs_info->trans_list);
>         extent_io_tree_init(&cur_trans->dirty_pages,
>                              fs_info->btree_inode->i_mapping);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0b24755..14325f2 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -74,6 +74,8 @@ struct btrfs_transaction {
>          */
>         struct mutex cache_write_mutex;
>         spinlock_t dirty_bgs_lock;
> +       struct list_head deleted_bgs;
> +       spinlock_t deleted_bgs_lock;
>         struct btrfs_delayed_ref_root delayed_refs;
>         int aborted;
>         int dirty_bg_run;
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere
  2015-06-15 13:41 ` [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere jeffm
@ 2015-06-17 10:08   ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 10:08 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> Overloading extent_map->bdev to struct map_lookup * might have started out
> as a means to an end, but it's a pattern that's used all over the place
> now. Let's get rid of the casting and just add a union instead.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/dev-replace.c |  2 +-
>  fs/btrfs/extent_map.c  |  2 +-
>  fs/btrfs/extent_map.h  | 10 +++++++++-
>  fs/btrfs/scrub.c       |  2 +-
>  fs/btrfs/volumes.c     | 24 ++++++++++++------------
>  5 files changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 0573848..2ad3289 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -613,7 +613,7 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
>                 em = lookup_extent_mapping(em_tree, start, (u64)-1);
>                 if (!em)
>                         break;
> -               map = (struct map_lookup *)em->bdev;
> +               map = em->map_lookup;
>                 for (i = 0; i < map->num_stripes; i++)
>                         if (srcdev == map->stripes[i].dev)
>                                 map->stripes[i].dev = tgtdev;
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 6a98bdd..84fb56d 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -76,7 +76,7 @@ void free_extent_map(struct extent_map *em)
>                 WARN_ON(extent_map_in_tree(em));
>                 WARN_ON(!list_empty(&em->list));
>                 if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> -                       kfree(em->bdev);
> +                       kfree(em->map_lookup);
>                 kmem_cache_free(extent_map_cache, em);
>         }
>  }
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index b2991fd..eb8b8fa 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -32,7 +32,15 @@ struct extent_map {
>         u64 block_len;
>         u64 generation;
>         unsigned long flags;
> -       struct block_device *bdev;
> +       union {
> +               struct block_device *bdev;
> +
> +               /*
> +                * used for chunk mappings
> +                * flags & EXTENT_FLAG_FS_MAPPING must be set
> +                */
> +               struct map_lookup *map_lookup;
> +       };
>         atomic_t refs;
>         unsigned int compress_type;
>         struct list_head list;
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ab58115..19f7241d 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3339,7 +3339,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>         if (!em)
>                 return -EINVAL;
>
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         if (em->start != chunk_offset)
>                 goto out;
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7fdde31..9f48ae5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1068,7 +1068,7 @@ again:
>                 struct map_lookup *map;
>                 int i;
>
> -               map = (struct map_lookup *)em->bdev;
> +               map = em->map_lookup;
>                 for (i = 0; i < map->num_stripes; i++) {
>                         if (map->stripes[i].dev != device)
>                                 continue;
> @@ -2622,7 +2622,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>                         free_extent_map(em);
>                 return -EINVAL;
>         }
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>
>         for (i = 0; i < map->num_stripes; i++) {
>                 struct btrfs_device *device = map->stripes[i].dev;
> @@ -4465,7 +4465,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>                 goto error;
>         }
>         set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
> -       em->bdev = (struct block_device *)map;
> +       em->map_lookup = map;
>         em->start = start;
>         em->len = num_bytes;
>         em->block_start = 0;
> @@ -4560,7 +4560,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>                 return -EINVAL;
>         }
>
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         item_size = btrfs_chunk_item_size(map->num_stripes);
>         stripe_size = em->orig_block_len;
>
> @@ -4702,7 +4702,7 @@ int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset)
>         if (!em)
>                 return 1;
>
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         for (i = 0; i < map->num_stripes; i++) {
>                 if (map->stripes[i].dev->missing) {
>                         miss_ndevs++;
> @@ -4782,7 +4782,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>                 return 1;
>         }
>
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
>                 ret = map->num_stripes;
>         else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> @@ -4818,7 +4818,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
>         BUG_ON(!em);
>
>         BUG_ON(em->start > logical || em->start + em->len < logical);
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
>                 len = map->stripe_len * nr_data_stripes(map);
>         free_extent_map(em);
> @@ -4839,7 +4839,7 @@ int btrfs_is_parity_mirror(struct btrfs_mapping_tree *map_tree,
>         BUG_ON(!em);
>
>         BUG_ON(em->start > logical || em->start + em->len < logical);
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
>                 ret = 1;
>         free_extent_map(em);
> @@ -5000,7 +5000,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
>                 return -EINVAL;
>         }
>
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>         offset = logical - em->start;
>
>         stripe_len = map->stripe_len;
> @@ -5542,7 +5542,7 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
>                 free_extent_map(em);
>                 return -EIO;
>         }
> -       map = (struct map_lookup *)em->bdev;
> +       map = em->map_lookup;
>
>         length = em->len;
>         rmap_len = map->stripe_len;
> @@ -6057,7 +6057,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
>         }
>
>         set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
> -       em->bdev = (struct block_device *)map;
> +       em->map_lookup = map;
>         em->start = logical;
>         em->len = length;
>         em->orig_start = 0;
> @@ -6733,7 +6733,7 @@ void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
>         /* In order to kick the device replace finish process */
>         lock_chunks(root);
>         list_for_each_entry(em, &transaction->pending_chunks, list) {
> -               map = (struct map_lookup *)em->bdev;
> +               map = em->map_lookup;
>
>                 for (i = 0; i < map->num_stripes; i++) {
>                         dev = map->stripes[i].dev;
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-17 10:04   ` Filipe David Manana
@ 2015-06-17 13:24     ` Filipe David Manana
  2015-06-17 14:32       ` Jeff Mahoney
  0 siblings, 1 reply; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 13:24 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
<fdmanana@gmail.com> wrote:
> On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
>> From: Jeff Mahoney <jeffm@suse.com>
>>
>> The cleaner thread may already be sleeping by the time we enter
>> close_ctree.  If that's the case, we'll skip removing any unused
>> block groups queued for removal, even during a normal umount.
>> They'll be cleaned up automatically at next mount, but users
>> expect a umount to be a clean synchronization point, especially
>> when used on thin-provisioned storage with -odiscard.  We also
>> explicitly remove unused block groups in the ro-remount path
>> for the same reason.
>>
>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Tested-by: Filipe Manana <fdmanana@suse.com>
>
>> ---
>>  fs/btrfs/disk-io.c |  9 +++++++++
>>  fs/btrfs/super.c   | 11 +++++++++++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 2ef9a4b..2e47fef 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
>>         cancel_work_sync(&fs_info->async_reclaim_work);
>>
>>         if (!(fs_info->sb->s_flags & MS_RDONLY)) {
>> +               /*
>> +                * If the cleaner thread is stopped and there are
>> +                * block groups queued for removal, the deletion will be
>> +                * skipped when we quit the cleaner thread.
>> +                */
>> +               mutex_lock(&root->fs_info->cleaner_mutex);
>> +               btrfs_delete_unused_bgs(root->fs_info);
>> +               mutex_unlock(&root->fs_info->cleaner_mutex);
>> +
>>                 ret = btrfs_commit_super(root);
>>                 if (ret)
>>                         btrfs_err(fs_info, "commit super ret %d", ret);
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 9e66f5e..2ccd8d4 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>>
>>                 sb->s_flags |= MS_RDONLY;
>>
>> +               /*
>> +                * Setting MS_RDONLY will put the cleaner thread to
>> +                * sleep at the next loop if it's already active.
>> +                * If it's already asleep, we'll leave unused block
>> +                * groups on disk until we're mounted read-write again
>> +                * unless we clean them up here.
>> +                */
>> +               mutex_lock(&root->fs_info->cleaner_mutex);
>> +               btrfs_delete_unused_bgs(fs_info);
>> +               mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the
cleaner_mutex, and making it so will cause a deadlock with
relocation.

Even without that patch, I don't think you need to use this mutex
anyway - no 2 tasks running this function can get the same bg from the
fs_info->unused_bgs list.
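
For reference, a simplified, self-contained sketch of the consumption
pattern being referred to, modeled on btrfs_delete_unused_bgs() (the
struct bodies are stand-ins and all error handling is omitted):

    #include <linux/list.h>
    #include <linux/spinlock.h>

    /* Stand-ins; the real structures are declared in fs/btrfs/ctree.h. */
    struct btrfs_block_group_cache {
            struct list_head bg_list;
    };

    struct btrfs_fs_info {
            spinlock_t unused_bgs_lock;
            struct list_head unused_bgs;
    };

    static void drain_unused_bgs(struct btrfs_fs_info *fs_info)
    {
            struct btrfs_block_group_cache *block_group;

            spin_lock(&fs_info->unused_bgs_lock);
            while (!list_empty(&fs_info->unused_bgs)) {
                    block_group = list_first_entry(&fs_info->unused_bgs,
                                            struct btrfs_block_group_cache,
                                            bg_list);
                    /*
                     * The entry is unlinked while unused_bgs_lock is still
                     * held, so a second task running this same loop can
                     * never pick up the same block group.
                     */
                    list_del_init(&block_group->bg_list);
                    spin_unlock(&fs_info->unused_bgs_lock);

                    /* ... work on block_group without the lock held ... */

                    spin_lock(&fs_info->unused_bgs_lock);
            }
            spin_unlock(&fs_info->unused_bgs_lock);
    }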

thanks


>> +
>>                 btrfs_dev_replace_suspend_for_unmount(fs_info);
>>                 btrfs_scrub_cancel(fs_info);
>>                 btrfs_pause_balance(fs_info);
>> --
>> 2.4.3
>>
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-17 13:24     ` Filipe David Manana
@ 2015-06-17 14:32       ` Jeff Mahoney
  2015-06-17 14:36         ` Jeff Mahoney
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Mahoney @ 2015-06-17 14:32 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/17/15 9:24 AM, Filipe David Manana wrote:
> On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana 
> <fdmanana@gmail.com> wrote:
>> On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
>>> From: Jeff Mahoney <jeffm@suse.com>
>>> 
>>> The cleaner thread may already be sleeping by the time we
>>> enter close_ctree.  If that's the case, we'll skip removing any
>>> unused block groups queued for removal, even during a normal
>>> umount. They'll be cleaned up automatically at next mount, but
>>> users expect a umount to be a clean synchronization point,
>>> especially when used on thin-provisioned storage with
>>> -odiscard.  We also explicitly remove unused block groups in
>>> the ro-remount path for the same reason.
>>> 
>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe
>> Manana <fdmanana@suse.com>
>> 
>>> --- fs/btrfs/disk-io.c |  9 +++++++++ fs/btrfs/super.c   | 11
>>> +++++++++++ 2 files changed, 20 insertions(+)
>>> 
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
>>> 2ef9a4b..2e47fef 100644 --- a/fs/btrfs/disk-io.c +++
>>> b/fs/btrfs/disk-io.c @@ -3710,6 +3710,15 @@ void
>>> close_ctree(struct btrfs_root *root) 
>>> cancel_work_sync(&fs_info->async_reclaim_work);
>>> 
>>> if (!(fs_info->sb->s_flags & MS_RDONLY)) { +               /* +
>>> * If the cleaner thread is stopped and there are +
>>> * block groups queued for removal, the deletion will be +
>>> * skipped when we quit the cleaner thread. +                */ 
>>> +               mutex_lock(&root->fs_info->cleaner_mutex); +
>>> btrfs_delete_unused_bgs(root->fs_info); +
>>> mutex_unlock(&root->fs_info->cleaner_mutex); + ret =
>>> btrfs_commit_super(root); if (ret) btrfs_err(fs_info, "commit
>>> super ret %d", ret); diff --git a/fs/btrfs/super.c
>>> b/fs/btrfs/super.c index 9e66f5e..2ccd8d4 100644 ---
>>> a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1539,6 +1539,17
>>> @@ static int btrfs_remount(struct super_block *sb, int *flags,
>>> char *data)
>>> 
>>> sb->s_flags |= MS_RDONLY;
>>> 
>>> +               /* +                * Setting MS_RDONLY will
>>> put the cleaner thread to +                * sleep at the next
>>> loop if it's already active. +                * If it's already
>>> asleep, we'll leave unused block +                * groups on
>>> disk until we're mounted read-write again +                *
>>> unless we clean them up here. +                */ +
>>> mutex_lock(&root->fs_info->cleaner_mutex); +
>>> btrfs_delete_unused_bgs(fs_info); +
>>> mutex_unlock(&root->fs_info->cleaner_mutex);
> 
> So actually, this allows for a deadlock after the patch I sent out
> last week:
> 
> https://patchwork.kernel.org/patch/6586811/
> 
> In that patch delete_unused_bgs is no longer called under the 
> cleaner_mutex, and making it so, will cause a deadlock with
> relocation.
> 
> Even without that patch, I don't think you need to use this mutex
> anyway - no 2 tasks running this function can get the same bg from
> the fs_info->unused_bgs list.

I was hitting crashes during umount when xfstests would do remount-ro
and umount in quick succession.  I can go back and confirm this, but I
believe I was encountering a race between the cleaner thread and
umount after being set read-only.  It didn't trigger all the time.  My
hypothesis is that if the cleaner thread was running and had a lot of
work to do, it could start before we set MS_RDONLY and still be
performing work through the remount and into the umount.  The
ro-remount would have set MS_RDONLY, so we skip btrfs_commit_super in
close_ctree and then blow up afterwards.

Taking the cleaner mutex means we either wait until the cleaner thread
has finished or we put it to sleep on the next loop before it does
anything.  In either case, it's safe.  It could just as easily have been:

               mutex_lock(&root->fs_info->cleaner_mutex);
               mutex_unlock(&root->fs_info->cleaner_mutex);

               btrfs_delete_unused_bgs(fs_info);

I think it actually was in a previous version I was testing.  It
probably should go back to that version so that we don't end up
confusing it with the new mutex you introduced in your patch.
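
For what it's worth, a minimal, self-contained sketch of why the empty
lock/unlock pair above is enough (the structure is a stand-in, not the
real fs_info, and MS_RDONLY is reduced to a plain flag argument):

    #include <linux/mutex.h>

    /* Stand-in for the two pieces of fs_info state involved. */
    struct cleaner_sync_demo {
            unsigned long s_flags;          /* where MS_RDONLY would live */
            struct mutex cleaner_mutex;
    };

    static void quiesce_cleaner(struct cleaner_sync_demo *fs,
                                unsigned long rdonly_flag)
    {
            fs->s_flags |= rdonly_flag;     /* cleaner idles on its next loop */

            /*
             * Empty critical section: this returns only after any cleaner
             * pass that already held cleaner_mutex has dropped it, and a
             * pass that starts afterwards will see the read-only flag and
             * go back to sleep.
             */
            mutex_lock(&fs->cleaner_mutex);
            mutex_unlock(&fs->cleaner_mutex);

            /* Now it is safe to drain the unused block group queue here. */
    }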

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVgYThAAoJEB57S2MheeWymvMP/jPnCFslZfEphccGlqsDUQeb
Ua9SVQJ5XjS0BbnVfuMGmzxew30BkUBdpnlWsufdVIKIeR9DNcvuDHJtcXMUI+Uw
FU/Asik//xiDPJ1hldPc4d0CJjsFBKHVLKjirkeE7kuvwa+XmfUYfKrhfzt6ZGvt
sWrCwMJRWFAS88ayR+NAelwaMzIy+Rbs5gZYg6dd2OCvIa4GuTh/szx8RaPOjNWQ
QcQHy2FlCcV/AtCA+ZaXh8NLmATIA8613biP7ATGIYHEdaZf7Oivov/u154QVwkt
c4omauofHKbBmlz2d//PS/T/n9nT7F7p1YvFaDnLLyQ0Ew3VBq+M9gyuWF8IGxti
iHdGkgQxnPSY0gGLA5bIt0D+su1RcTqa/71LOsBqbmk7KioNF4bp9FmaykHx2LAL
NpKGPD6BEcTTZAXfGdV6/IxTuii1temxcyawJrijakFTseA/GOmODI3K1kg+nLZA
OBjzFmzzLFir8SuiIWLO5ncbbsoM6rHhbl08DeKZ6tOH4JQm2ROciVgTn67SVxB5
bmjzl/zhhePfPgmbf5WoLsT4cbuGK00r+M3U79vzIjfEPmKAfGFbu9jEGPvQahKT
tOBRw7IaL8vCrBLFGUhhQzECwOK6Zms4r2ZTino30MwSNHegPjUbt8xDmFHw+gp3
Td6o4o23By9ygZgac0KI
=iHjc
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-17 14:32       ` Jeff Mahoney
@ 2015-06-17 14:36         ` Jeff Mahoney
  2015-06-17 14:37           ` Filipe David Manana
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Mahoney @ 2015-06-17 14:36 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/17/15 10:32 AM, Jeff Mahoney wrote:
> On 6/17/15 9:24 AM, Filipe David Manana wrote:
>> On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana 
>> <fdmanana@gmail.com> wrote:
>>> On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>> 
>>>> The cleaner thread may already be sleeping by the time we 
>>>> enter close_ctree.  If that's the case, we'll skip removing
>>>> any unused block groups queued for removal, even during a
>>>> normal umount. They'll be cleaned up automatically at next
>>>> mount, but users expect a umount to be a clean
>>>> synchronization point, especially when used on
>>>> thin-provisioned storage with -odiscard.  We also explicitly
>>>> remove unused block groups in the ro-remount path for the
>>>> same reason.
>>>> 
>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
>>> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by:
>>> Filipe Manana <fdmanana@suse.com>
>>> 
>>>> --- fs/btrfs/disk-io.c |  9 +++++++++ fs/btrfs/super.c   |
>>>> 11 +++++++++++ 2 files changed, 20 insertions(+)
>>>> 
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 
>>>> 2ef9a4b..2e47fef 100644 --- a/fs/btrfs/disk-io.c +++ 
>>>> b/fs/btrfs/disk-io.c @@ -3710,6 +3710,15 @@ void 
>>>> close_ctree(struct btrfs_root *root) 
>>>> cancel_work_sync(&fs_info->async_reclaim_work);
>>>> 
>>>> if (!(fs_info->sb->s_flags & MS_RDONLY)) { +               /*
>>>> + * If the cleaner thread is stopped and there are + * block
>>>> groups queued for removal, the deletion will be + * skipped
>>>> when we quit the cleaner thread. +                */ +
>>>> mutex_lock(&root->fs_info->cleaner_mutex); + 
>>>> btrfs_delete_unused_bgs(root->fs_info); + 
>>>> mutex_unlock(&root->fs_info->cleaner_mutex); + ret = 
>>>> btrfs_commit_super(root); if (ret) btrfs_err(fs_info,
>>>> "commit super ret %d", ret); diff --git a/fs/btrfs/super.c 
>>>> b/fs/btrfs/super.c index 9e66f5e..2ccd8d4 100644 --- 
>>>> a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1539,6
>>>> +1539,17 @@ static int btrfs_remount(struct super_block *sb,
>>>> int *flags, char *data)
>>>> 
>>>> sb->s_flags |= MS_RDONLY;
>>>> 
>>>> +               /* +                * Setting MS_RDONLY will 
>>>> put the cleaner thread to +                * sleep at the
>>>> next loop if it's already active. +                * If it's
>>>> already asleep, we'll leave unused block +                *
>>>> groups on disk until we're mounted read-write again +
>>>> * unless we clean them up here. +                */ + 
>>>> mutex_lock(&root->fs_info->cleaner_mutex); + 
>>>> btrfs_delete_unused_bgs(fs_info); + 
>>>> mutex_unlock(&root->fs_info->cleaner_mutex);
> 
>> So actually, this allows for a deadlock after the patch I sent
>> out last week:
> 
>> https://patchwork.kernel.org/patch/6586811/
> 
>> In that patch delete_unused_bgs is no longer called under the 
>> cleaner_mutex, and making it so, will cause a deadlock with/ru 
>> relocation.
> 
>> Even without that patch, I don't think you need to use this mutex
>>  anyway - no 2 tasks running this function can get the same bg
>> from the fs_info->unused_bgs list.
> 
> I was hitting crashes during umount when xfstests would do
> remount-ro and umount in quick succession.  I can go back and
> confirm this, but I believe I was encountering a race between the
> cleaner thread and umount after being set read-only.  It didn't
> trigger all the time.  My hypothesis is that if the cleaner thread
> was running and had a lot of work to do, it could start before we set
> MS_RDONLY and still be performing work through the remount and into
> the umount.  The ro-remount would have set MS_RDONLY, so we skip
> btrfs_commit_super in close_ctree and then blow up afterwards.
> 
> Taking the cleaner mutex means we either wait until the cleaner
> thread has finished or we put it to sleep on the next loop before
> it does anything.  In either case, it's safe.  It could just as
> easily have been:
> 
> mutex_lock(&root->fs_info->cleaner_mutex); 
> mutex_unlock(&root->fs_info->cleaner_mutex);
> 
> btrfs_delete_unused_bgs(fs_info);
> 
> I think it actually was in a previous version I was testing.  It 
> probably should go back to that version so that we don't end up 
> confusing it with the new mutex you introduced in your patch.

It looks like your:
[PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new
transaction

would also fix this in a more general case.  We can drop taking the
cleaner mutex here.

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVgYYDAAoJEB57S2MheeWyxIcQAIGwFvP1bL4C8Oa3WyFL/tjE
QITNDQZGYXEKfFqRWdHEAeFJ8kv234xo/tx7Ml0Txd8DFrqzDwXSxv6deLzDiiTT
gymMdBKO3x7TLKZTxnyDXYEUDHM72IMOUS2el3wOOsc61rL1KajFEWySGtAA80pk
bIUH6uosRTXhpXBRe080mc9XPhtfIQyCC8nroJHYazNwT3VWrvbhDaZPM3npNttj
5glsCz7ieseiWKqFCIlYC5yCgpst79U7D8M75Jo0yslvtZNpZOMR3YhvyQakj5hG
p/CFRfbdFGnl3wKv+ACyu7XlewqoA9LwkB5Sbjzd4XbS3n7J4gch043b+BbIl2SA
VghNTTI+tm7KKvMa3fghtedooVYu6DjdhU58VEWOBtHaDiWntSmd0FqzUCqAotxC
fwEmMWCWCWR1E0etRUrnbO1DGltkR38ost7cvXOPXUUdvv3Hy22mTfWW73YwsWXW
kwmG2V+IdgOWHDMxQCnj55/NbYep+/TiVjDPJnOuCn8tD5Tw+zHxtRbXhVcyKpGj
jJXKb9uxDhKpsisz8HQJHf1uMLFJ3qzCqgYxysbc2PqlzylFfY2aefYWSPmrE6y4
6OJW7gTr75PzrGGm7gM1sPiPQLuFNEFBEi0Ak7ad6Q6SAAV339r+h00sg4Q1adVu
2JedYHUeFDUjAGAgft0G
=SQ/a
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
  2015-06-17 14:36         ` Jeff Mahoney
@ 2015-06-17 14:37           ` Filipe David Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe David Manana @ 2015-06-17 14:37 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

On Wed, Jun 17, 2015 at 3:36 PM, Jeff Mahoney <jeffm@suse.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 6/17/15 10:32 AM, Jeff Mahoney wrote:
>> On 6/17/15 9:24 AM, Filipe David Manana wrote:
>>> On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
>>> <fdmanana@gmail.com> wrote:
>>>> On Mon, Jun 15, 2015 at 2:41 PM,  <jeffm@suse.com> wrote:
>>>>> From: Jeff Mahoney <jeffm@suse.com>
>>>>>
>>>>> The cleaner thread may already be sleeping by the time we
>>>>> enter close_ctree.  If that's the case, we'll skip removing
>>>>> any unused block groups queued for removal, even during a
>>>>> normal umount. They'll be cleaned up automatically at next
>>>>> mount, but users expect a umount to be a clean
>>>>> synchronization point, especially when used on
>>>>> thin-provisioned storage with -odiscard.  We also explicitly
>>>>> remove unused block groups in the ro-remount path for the
>>>>> same reason.
>>>>>
>>>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
>>>> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by:
>>>> Filipe Manana <fdmanana@suse.com>
>>>>
>>>>> --- fs/btrfs/disk-io.c |  9 +++++++++ fs/btrfs/super.c   |
>>>>> 11 +++++++++++ 2 files changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
>>>>> 2ef9a4b..2e47fef 100644 --- a/fs/btrfs/disk-io.c +++
>>>>> b/fs/btrfs/disk-io.c @@ -3710,6 +3710,15 @@ void
>>>>> close_ctree(struct btrfs_root *root)
>>>>> cancel_work_sync(&fs_info->async_reclaim_work);
>>>>>
>>>>> if (!(fs_info->sb->s_flags & MS_RDONLY)) { +               /*
>>>>> + * If the cleaner thread is stopped and there are + * block
>>>>> groups queued for removal, the deletion will be + * skipped
>>>>> when we quit the cleaner thread. +                */ +
>>>>> mutex_lock(&root->fs_info->cleaner_mutex); +
>>>>> btrfs_delete_unused_bgs(root->fs_info); +
>>>>> mutex_unlock(&root->fs_info->cleaner_mutex); + ret =
>>>>> btrfs_commit_super(root); if (ret) btrfs_err(fs_info,
>>>>> "commit super ret %d", ret); diff --git a/fs/btrfs/super.c
>>>>> b/fs/btrfs/super.c index 9e66f5e..2ccd8d4 100644 ---
>>>>> a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1539,6
>>>>> +1539,17 @@ static int btrfs_remount(struct super_block *sb,
>>>>> int *flags, char *data)
>>>>>
>>>>> sb->s_flags |= MS_RDONLY;
>>>>>
>>>>> +               /* +                * Setting MS_RDONLY will
>>>>> put the cleaner thread to +                * sleep at the
>>>>> next loop if it's already active. +                * If it's
>>>>> already asleep, we'll leave unused block +                *
>>>>> groups on disk until we're mounted read-write again +
>>>>> * unless we clean them up here. +                */ +
>>>>> mutex_lock(&root->fs_info->cleaner_mutex); +
>>>>> btrfs_delete_unused_bgs(fs_info); +
>>>>> mutex_unlock(&root->fs_info->cleaner_mutex);
>>
>>> So actually, this allows for a deadlock after the patch I sent
>>> out last week:
>>
>>> https://patchwork.kernel.org/patch/6586811/
>>
>>> In that patch delete_unused_bgs is no longer called under the
>>> cleaner_mutex, and making it so, will cause a deadlock with
>>> relocation.
>>
>>> Even without that patch, I don't think you need to use this mutex
>>>  anyway - no 2 tasks running this function can get the same bg
>>> from the fs_info->unused_bgs list.
>>
>> I was hitting crashes during umount when xfstests would do
>> remount-ro and umount in quick succession.  I can go back and
>> confirm this, but I believe I was encountering a race between the
>> cleaner thread and umount after being set read-only.  It didn't
>> trigger all the time.  My hypothesis is that if the cleaner thread
>> was running and had a lot of work to do, it could start before we set
>> MS_RDONLY and still be performing work through the remount and into
>> the umount.  The ro-remount would have set MS_RDONLY, so we skip
>> btrfs_commit_super in close_ctree and then blow up afterwards.
>>
>> Taking the cleaner mutex means we either wait until the cleaner
>> thread has finished or we put it to sleep on the next loop before
>> it does anything.  In either case, it's safe.  It could just as
>> easily have been:
>>
>> mutex_lock(&root->fs_info->cleaner_mutex);
>> mutex_unlock(&root->fs_info->cleaner_mutex);
>>
>> btrfs_delete_unused_bgs(fs_info);
>>
>> I think it actually was in a previous version I was testing.  It
>> probably should go back to that version so that we don't end up
>> confusing it with the new mutex you introduced in your patch.
>
> It looks like your:
> [PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new
> transaction
>
> would also fix this in a more general case.  We can drop taking the
> cleaner mutex here.

Cool, thanks Jeff.

>
> - -Jeff
>
> - --
> Jeff Mahoney
> SUSE Labs
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>
> iQIcBAEBAgAGBQJVgYYDAAoJEB57S2MheeWyxIcQAIGwFvP1bL4C8Oa3WyFL/tjE
> QITNDQZGYXEKfFqRWdHEAeFJ8kv234xo/tx7Ml0Txd8DFrqzDwXSxv6deLzDiiTT
> gymMdBKO3x7TLKZTxnyDXYEUDHM72IMOUS2el3wOOsc61rL1KajFEWySGtAA80pk
> bIUH6uosRTXhpXBRe080mc9XPhtfIQyCC8nroJHYazNwT3VWrvbhDaZPM3npNttj
> 5glsCz7ieseiWKqFCIlYC5yCgpst79U7D8M75Jo0yslvtZNpZOMR3YhvyQakj5hG
> p/CFRfbdFGnl3wKv+ACyu7XlewqoA9LwkB5Sbjzd4XbS3n7J4gch043b+BbIl2SA
> VghNTTI+tm7KKvMa3fghtedooVYu6DjdhU58VEWOBtHaDiWntSmd0FqzUCqAotxC
> fwEmMWCWCWR1E0etRUrnbO1DGltkR38ost7cvXOPXUUdvv3Hy22mTfWW73YwsWXW
> kwmG2V+IdgOWHDMxQCnj55/NbYep+/TiVjDPJnOuCn8tD5Tw+zHxtRbXhVcyKpGj
> jJXKb9uxDhKpsisz8HQJHf1uMLFJ3qzCqgYxysbc2PqlzylFfY2aefYWSPmrE6y4
> 6OJW7gTr75PzrGGm7gM1sPiPQLuFNEFBEi0Ak7ad6Q6SAAV339r+h00sg4Q1adVu
> 2JedYHUeFDUjAGAgft0G
> =SQ/a
> -----END PGP SIGNATURE-----



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-06-17 14:37 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-15 13:45 [PATCH v5] btrfs: fix automatic blockgroup remove + discard Jeff Mahoney
2015-06-15 13:41 ` [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded jeffm
2015-06-17  9:53   ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries jeffm
2015-06-17  9:55   ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 3/7] btrfs: skip superblocks during discard jeffm
2015-06-17  9:57   ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM jeffm
2015-06-17 10:03   ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount jeffm
2015-06-17 10:04   ` Filipe David Manana
2015-06-17 13:24     ` Filipe David Manana
2015-06-17 14:32       ` Jeff Mahoney
2015-06-17 14:36         ` Jeff Mahoney
2015-06-17 14:37           ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard jeffm
2015-06-17 10:07   ` Filipe David Manana
2015-06-15 13:41 ` [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere jeffm
2015-06-17 10:08   ` Filipe David Manana
