Linux-BTRFS Archive on lore.kernel.org
* [RFC PATCH 00/19] btrfs: async discard support
@ 2019-10-07 20:17 Dennis Zhou
  2019-10-07 20:17 ` [PATCH 01/19] bitmap: genericize percpu bitmap region iterators Dennis Zhou
                   ` (20 more replies)
  0 siblings, 21 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

Discard is an operation that allows the filesystem to tell the
underlying SSD that an LBA region is no longer needed. This gives the
drive more information as it manages its available free space to
minimize write amplification. However, discard hasn't been given the
most attention. Discard is a problematic command because a drive's
physical block management is more or less a black box to us, and the
effects of any particular discard aren't necessarily limited to the
lifetime of the command.

Currently, btrfs handles discarding synchronously during transaction
commit. This can delay transaction commit, depending on the amount of
space that needs to be trimmed and the efficacy of the discard
operation on a particular drive.

This series introduces async discarding, which removes discard from the
transaction commit path. While every SSD is free to implement trim
support differently, we strive here to do the right thing. The idea
hinges on recognizing that write amplification only really kicks in once
we're very low on free space. As long as we trim enough to keep a large
enough pool of free space, in theory this should minimize the cost of
issuing discards on a workload and have limited overhead in write
amplification.

With async discard, we try to emphasize discarding larger regions
and reusing the LBA (implicit discard). The first is done by using the
free space cache to maintain discard state, which gives us coalescing
fairly cheaply. A background workqueue scans over an LRU-ordered list
of the block groups. It then uses filters to determine what to discard
next, giving priority to larger discards. While reusing an LBA isn't
explicitly attempted, it happens implicitly via find_free_extent():
if it happens to find a dirty extent, we get reuse of that LBA.
Additionally, async discarding skips metadata block groups, as these
should see fairly high turnover: btrfs is a self-packing filesystem
and is stingy about allocating new block groups until necessary.
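
As a rough illustration of the filter idea above, here is a minimal
userspace sketch. It is not the code from this series: struct extent,
size_filters[] and pick_next_discard() are hypothetical names, and the
filter thresholds are made-up values chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-extent record; not the btrfs structure. */
struct extent {
	uint64_t offset;
	uint64_t bytes;
	int trimmed;	/* nonzero once the range has been discarded */
};

/*
 * Hypothetical size filters, largest first.  An extent is eligible in a
 * pass only if it is at least as large as that pass's filter.
 */
static const uint64_t size_filters[] = { 1 << 20, 64 << 10, 0 };
#define NR_FILTERS (sizeof(size_filters) / sizeof(size_filters[0]))

/*
 * Return the index of the next extent to discard, or -1 if every extent
 * is already trimmed.  Larger extents win because they match an earlier
 * (larger) filter.
 */
static int pick_next_discard(const struct extent *ext, size_t nr)
{
	for (size_t f = 0; f < NR_FILTERS; f++) {
		for (size_t i = 0; i < nr; i++) {
			if (!ext[i].trimmed && ext[i].bytes >= size_filters[f])
				return (int)i;
		}
	}
	return -1;
}
```

Calling pick_next_discard() repeatedly and marking the returned extent
as trimmed walks the extents from the largest size class down to the
smallest, which is the priority order described above.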

Preliminary results seem promising: when a lot of freeing is going on,
the discarding is delayed, allowing for reuse, which translates to less
discarding (in addition to the slower discarding). This has shown a
reduction in p90 and p99 read latencies in a test on our webservers.

I am currently working on tuning the rate at which it discards in the
background. I am doing this by evaluating other workloads and drives.
The iops and bps rate limits are fairly aggressive right now, as my
basic survey of a few drives showed that the trim command itself is a
significant part of the overhead. So optimizing for larger trims is the
right thing to do.

Persistence of discard state isn't supported, so when we mount a
filesystem, the block groups are read in as dirty and background trim
begins. This makes async discard more useful for longer running mount
points.

This series contains the following 19 patches and is on top of
btrfs-devel#master 0f1a7b3fac05:
  0001-bitmap-genericize-percpu-bitmap-region-iterators.patch
  0002-btrfs-rename-DISCARD-opt-to-DISCARD_SYNC.patch
  0003-btrfs-keep-track-of-which-extents-have-been-discarde.patch
  0004-btrfs-keep-track-of-cleanliness-of-the-bitmap.patch
  0005-btrfs-add-the-beginning-of-async-discard-discard-wor.patch
  0006-btrfs-handle-empty-block_group-removal.patch
  0007-btrfs-discard-one-region-at-a-time-in-async-discard.patch
  0008-btrfs-track-discardable-extents-for-asnyc-discard.patch
  0009-btrfs-keep-track-of-discardable_bytes.patch
  0010-btrfs-calculate-discard-delay-based-on-number-of-ext.patch
  0011-btrfs-add-bps-discard-rate-limit.patch
  0012-btrfs-limit-max-discard-size-for-async-discard.patch
  0013-btrfs-have-multiple-discard-lists.patch
  0014-btrfs-only-keep-track-of-data-extents-for-async-disc.patch
  0015-btrfs-load-block_groups-into-discard_list-on-mount.patch
  0016-btrfs-keep-track-of-discard-reuse-stats.patch
  0017-btrfs-add-async-discard-header.patch
  0018-btrfs-increase-the-metadata-allowance-for-the-free_s.patch
  0019-btrfs-make-smaller-extents-more-likely-to-go-into-bi.patch

0001 exports percpu's bitmap iterators for eventual use in 0015. 0002
renames DISCARD to DISCARD_SYNC. 0003 and 0004 add discard tracking to
the free space cache. 0005-0007 add the core of async discard support.
0008-0013 fiddle with stats and operation limits. 0014 makes async
discarding only track data block groups. 0015 loads block groups into
the discard list on mount. 0016 adds reuse stats. 0017 adds an
explanation header. 0018 and 0019 modify the free space cache metadata
allowance, add a bitmap -> extent path, and make us more likely to put
smaller extents into the bitmaps.

diffstats below:

Dennis Zhou (19):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: track discardable extents for asnyc discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: load block_groups into discard_list on mount
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  47 +++-
 fs/btrfs/block-group.h      |  18 ++
 fs/btrfs/ctree.h            |  29 ++-
 fs/btrfs/discard.c          | 458 +++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          | 112 ++++++++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |  26 +-
 fs/btrfs/free-space-cache.c | 494 ++++++++++++++++++++++++++++++------
 fs/btrfs/free-space-cache.h |  27 +-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 ++-
 fs/btrfs/sysfs.c            | 141 ++++++++++
 include/linux/bitmap.h      |  35 +++
 mm/percpu.c                 |  61 ++---
 16 files changed, 1369 insertions(+), 155 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis


* [PATCH 01/19] bitmap: genericize percpu bitmap region iterators
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-07 20:26   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Bitmaps are fairly popular for their space efficiency, but we don't have
generic iterators available. Make percpu's bitmap region iterators
available to everyone.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 include/linux/bitmap.h | 35 ++++++++++++++++++++++++
 mm/percpu.c            | 61 +++++++++++-------------------------------
 2 files changed, 51 insertions(+), 45 deletions(-)
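
For readers who want to try the iteration pattern outside the kernel,
here is a minimal userspace sketch of the set-region iterator. It
follows the shape of bitmap_for_each_set_region() in this patch, but
next_bit_equal(), next_set_region(), for_each_set_region() and
count_set_regions() are hypothetical stand-ins: the search is a naive
bit-by-bit loop rather than the kernel's find_next_bit() and
find_next_zero_bit().

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Return 1 if bit @nr is set in @bitmap, else 0. */
static int test_bit(const unsigned long *bitmap, unsigned int nr)
{
	return (bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/*
 * Naive stand-in for find_next_bit()/find_next_zero_bit(): first bit in
 * [start, end) whose value equals @want, or @end if there is none.
 */
static unsigned int next_bit_equal(const unsigned long *bitmap,
				   unsigned int end, unsigned int start,
				   int want)
{
	for (unsigned int i = start; i < end; i++)
		if (test_bit(bitmap, i) == want)
			return i;
	return end;
}

/* Find the next run of set bits at or after *rs; the run is [*rs, *re). */
static void next_set_region(const unsigned long *bitmap, unsigned int *rs,
			    unsigned int *re, unsigned int end)
{
	*rs = next_bit_equal(bitmap, end, *rs, 1);
	*re = next_bit_equal(bitmap, end, *rs + 1, 0);
}

#define for_each_set_region(bitmap, rs, re, start, end)			\
	for ((rs) = (start),						\
	     next_set_region((bitmap), &(rs), &(re), (end));		\
	     (rs) < (re);						\
	     (rs) = (re) + 1,						\
	     next_set_region((bitmap), &(rs), &(re), (end)))

/* Count the set regions in the first @nbits bits and sum their lengths. */
static unsigned int count_set_regions(const unsigned long *bitmap,
				      unsigned int nbits,
				      unsigned int *total_bits)
{
	unsigned int rs, re, n = 0;

	*total_bits = 0;
	for_each_set_region(bitmap, rs, re, 0, nbits) {
		n++;
		*total_bits += re - rs;
	}
	return n;
}
```

The loop resumes at (re) + 1 rather than (re) because bit (re) is
already known to be clear (or to be the end of the range), so it
cannot start the next set region.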

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 90528f12bdfa..9b0664f36808 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -437,6 +437,41 @@ static inline int bitmap_parse(const char *buf, unsigned int buflen,
 	return __bitmap_parse(buf, buflen, 0, maskp, nmaskbits);
 }
 
+static inline void bitmap_next_clear_region(unsigned long *bitmap,
+					    unsigned int *rs, unsigned int *re,
+					    unsigned int end)
+{
+	*rs = find_next_zero_bit(bitmap, end, *rs);
+	*re = find_next_bit(bitmap, end, *rs + 1);
+}
+
+static inline void bitmap_next_set_region(unsigned long *bitmap,
+					  unsigned int *rs, unsigned int *re,
+					  unsigned int end)
+{
+	*rs = find_next_bit(bitmap, end, *rs);
+	*re = find_next_zero_bit(bitmap, end, *rs + 1);
+}
+
+/*
+ * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).
+ * @rs and @re should be integer variables and will be set to start and end
+ * index of the current clear or set region.
+ */
+#define bitmap_for_each_clear_region(bitmap, rs, re, start, end)	     \
+	for ((rs) = (start),						     \
+	     bitmap_next_clear_region((bitmap), &(rs), &(re), (end));	     \
+	     (rs) < (re);						     \
+	     (rs) = (re) + 1,						     \
+	     bitmap_next_clear_region((bitmap), &(rs), &(re), (end)))
+
+#define bitmap_for_each_set_region(bitmap, rs, re, start, end)		     \
+	for ((rs) = (start),						     \
+	     bitmap_next_set_region((bitmap), &(rs), &(re), (end));	     \
+	     (rs) < (re);						     \
+	     (rs) = (re) + 1,						     \
+	     bitmap_next_set_region((bitmap), &(rs), &(re), (end)))
+
 /**
  * BITMAP_FROM_U64() - Represent u64 value in the format suitable for bitmap.
  * @n: u64 value
diff --git a/mm/percpu.c b/mm/percpu.c
index 7e06a1e58720..e9844086b236 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -270,33 +270,6 @@ static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
 	       pcpu_unit_page_offset(cpu, page_idx);
 }
 
-static void pcpu_next_unpop(unsigned long *bitmap, int *rs, int *re, int end)
-{
-	*rs = find_next_zero_bit(bitmap, end, *rs);
-	*re = find_next_bit(bitmap, end, *rs + 1);
-}
-
-static void pcpu_next_pop(unsigned long *bitmap, int *rs, int *re, int end)
-{
-	*rs = find_next_bit(bitmap, end, *rs);
-	*re = find_next_zero_bit(bitmap, end, *rs + 1);
-}
-
-/*
- * Bitmap region iterators.  Iterates over the bitmap between
- * [@start, @end) in @chunk.  @rs and @re should be integer variables
- * and will be set to start and end index of the current free region.
- */
-#define pcpu_for_each_unpop_region(bitmap, rs, re, start, end)		     \
-	for ((rs) = (start), pcpu_next_unpop((bitmap), &(rs), &(re), (end)); \
-	     (rs) < (re);						     \
-	     (rs) = (re) + 1, pcpu_next_unpop((bitmap), &(rs), &(re), (end)))
-
-#define pcpu_for_each_pop_region(bitmap, rs, re, start, end)		     \
-	for ((rs) = (start), pcpu_next_pop((bitmap), &(rs), &(re), (end));   \
-	     (rs) < (re);						     \
-	     (rs) = (re) + 1, pcpu_next_pop((bitmap), &(rs), &(re), (end)))
-
 /*
  * The following are helper functions to help access bitmaps and convert
  * between bitmap offsets to address offsets.
@@ -732,9 +705,8 @@ static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan)
 	}
 
 	bits = 0;
-	pcpu_for_each_md_free_region(chunk, bit_off, bits) {
+	pcpu_for_each_md_free_region(chunk, bit_off, bits)
 		pcpu_block_update(chunk_md, bit_off, bit_off + bits);
-	}
 }
 
 /**
@@ -749,7 +721,7 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
 {
 	struct pcpu_block_md *block = chunk->md_blocks + index;
 	unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
-	int rs, re, start;	/* region start, region end */
+	unsigned int rs, re, start;	/* region start, region end */
 
 	/* promote scan_hint to contig_hint */
 	if (block->scan_hint) {
@@ -765,10 +737,9 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
 	block->right_free = 0;
 
 	/* iterate over free areas and update the contig hints */
-	pcpu_for_each_unpop_region(alloc_map, rs, re, start,
-				   PCPU_BITMAP_BLOCK_BITS) {
+	bitmap_for_each_clear_region(alloc_map, rs, re, start,
+				     PCPU_BITMAP_BLOCK_BITS)
 		pcpu_block_update(block, rs, re);
-	}
 }
 
 /**
@@ -1041,13 +1012,13 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
 static bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits,
 			      int *next_off)
 {
-	int page_start, page_end, rs, re;
+	unsigned int page_start, page_end, rs, re;
 
 	page_start = PFN_DOWN(bit_off * PCPU_MIN_ALLOC_SIZE);
 	page_end = PFN_UP((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
 
 	rs = page_start;
-	pcpu_next_unpop(chunk->populated, &rs, &re, page_end);
+	bitmap_next_clear_region(chunk->populated, &rs, &re, page_end);
 	if (rs >= page_end)
 		return true;
 
@@ -1702,13 +1673,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 	/* populate if not all pages are already there */
 	if (!is_atomic) {
-		int page_start, page_end, rs, re;
+		unsigned int page_start, page_end, rs, re;
 
 		page_start = PFN_DOWN(off);
 		page_end = PFN_UP(off + size);
 
-		pcpu_for_each_unpop_region(chunk->populated, rs, re,
-					   page_start, page_end) {
+		bitmap_for_each_clear_region(chunk->populated, rs, re,
+					     page_start, page_end) {
 			WARN_ON(chunk->immutable);
 
 			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
@@ -1858,10 +1829,10 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	spin_unlock_irq(&pcpu_lock);
 
 	list_for_each_entry_safe(chunk, next, &to_free, list) {
-		int rs, re;
+		unsigned int rs, re;
 
-		pcpu_for_each_pop_region(chunk->populated, rs, re, 0,
-					 chunk->nr_pages) {
+		bitmap_for_each_set_region(chunk->populated, rs, re, 0,
+					   chunk->nr_pages) {
 			pcpu_depopulate_chunk(chunk, rs, re);
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_depopulated(chunk, rs, re);
@@ -1893,7 +1864,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	}
 
 	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
-		int nr_unpop = 0, rs, re;
+		unsigned int nr_unpop = 0, rs, re;
 
 		if (!nr_to_pop)
 			break;
@@ -1910,9 +1881,9 @@ static void pcpu_balance_workfn(struct work_struct *work)
 			continue;
 
 		/* @chunk can't go away while pcpu_alloc_mutex is held */
-		pcpu_for_each_unpop_region(chunk->populated, rs, re, 0,
-					   chunk->nr_pages) {
-			int nr = min(re - rs, nr_to_pop);
+		bitmap_for_each_clear_region(chunk->populated, rs, re, 0,
+					     chunk->nr_pages) {
+			int nr = min_t(int, re - rs, nr_to_pop);
 
 			ret = pcpu_populate_chunk(chunk, rs, rs + nr, gfp);
 			if (!ret) {
-- 
2.17.1



* [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
  2019-10-07 20:17 ` [PATCH 01/19] bitmap: genericize percpu bitmap region iterators Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-07 20:27   ` Josef Bacik
                     ` (2 more replies)
  2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
                   ` (18 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/block-group.c | 2 +-
 fs/btrfs/ctree.h       | 2 +-
 fs/btrfs/extent-tree.c | 4 ++--
 fs/btrfs/super.c       | 8 ++++----
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bf7e3f23bba7..afe86028246a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1365,7 +1365,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&space_info->lock);
 
 		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD);
+		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..1877586576aa 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1171,7 +1171,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FLUSHONCOMMIT       (1 << 7)
 #define BTRFS_MOUNT_SSD_SPREAD		(1 << 8)
 #define BTRFS_MOUNT_NOSSD		(1 << 9)
-#define BTRFS_MOUNT_DISCARD		(1 << 10)
+#define BTRFS_MOUNT_DISCARD_SYNC	(1 << 10)
 #define BTRFS_MOUNT_FORCE_COMPRESS      (1 << 11)
 #define BTRFS_MOUNT_SPACE_CACHE		(1 << 12)
 #define BTRFS_MOUNT_CLEAR_CACHE		(1 << 13)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 49cb26fa7c63..77a5904756c5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2903,7 +2903,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 			break;
 		}
 
-		if (btrfs_test_opt(fs_info, DISCARD))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
 			ret = btrfs_discard_extent(fs_info, start,
 						   end + 1 - start, NULL);
 
@@ -4146,7 +4146,7 @@ static int __btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
 	if (pin)
 		pin_down_extent(cache, start, len, 1);
 	else {
-		if (btrfs_test_opt(fs_info, DISCARD))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
 			ret = btrfs_discard_extent(fs_info, start, len, NULL);
 		btrfs_add_free_space(cache, start, len);
 		btrfs_free_reserved_bytes(cache, len, delalloc);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 1b151af25772..a02fece949cb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -695,11 +695,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD,
-					   "turning on discard");
+			btrfs_set_and_info(info, DISCARD_SYNC,
+					   "turning on sync discard");
 			break;
 		case Opt_nodiscard:
-			btrfs_clear_and_info(info, DISCARD,
+			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
 			break;
 		case Opt_space_cache:
@@ -1322,7 +1322,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",nologreplay");
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
-	if (btrfs_test_opt(info, DISCARD))
+	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
-- 
2.17.1



* [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
  2019-10-07 20:17 ` [PATCH 01/19] bitmap: genericize percpu bitmap region iterators Dennis Zhou
  2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-07 20:37   ` Josef Bacik
                     ` (2 more replies)
  2019-10-07 20:17 ` [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
                   ` (17 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().

An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff for greater coalescing, which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.

When reading in the free space cache from disk, if sync discard is set,
mark all extents as trimmed. The current code ensures at transaction
commit that all free space is trimmed when sync discard is set, so this
reflects that.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/extent-tree.c      | 15 ++++++++++-----
 fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
 fs/btrfs/free-space-cache.h | 10 +++++++++-
 fs/btrfs/inode-map.c        | 13 +++++++------
 4 files changed, 57 insertions(+), 19 deletions(-)
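
The merge rule described in the commit message can be sketched in
userspace as follows. This is an illustration only: struct free_extent,
can_merge() and merge_right() are hypothetical names, FSC_TRIMMED simply
mirrors BTRFS_FSC_TRIMMED, and adjacency is checked directly instead of
through the free space rbtree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FSC_TRIMMED (1U << 0)	/* mirrors BTRFS_FSC_TRIMMED */

/* Hypothetical, simplified free space entry. */
struct free_extent {
	uint64_t offset;
	uint64_t bytes;
	uint32_t flags;
};

static bool is_trimmed(const struct free_extent *e)
{
	return e->flags & FSC_TRIMMED;
}

/*
 * An untrimmed incoming extent may absorb any neighbor.  A trimmed
 * incoming extent may only absorb a neighbor that is also trimmed, so
 * a merged extent never claims to be trimmed when part of it is not.
 */
static bool can_merge(const struct free_extent *incoming,
		      const struct free_extent *neighbor)
{
	return !is_trimmed(incoming) || is_trimmed(neighbor);
}

/*
 * Try to absorb the right-hand neighbor into @incoming.  The merged
 * extent keeps @incoming's flags.  Returns true if the merge happened.
 */
static bool merge_right(struct free_extent *incoming,
			const struct free_extent *right)
{
	if (incoming->offset + incoming->bytes != right->offset)
		return false;
	if (!can_merge(incoming, right))
		return false;
	incoming->bytes += right->bytes;
	return true;
}
```

When an untrimmed extent absorbs a trimmed neighbor, the result stays
untrimmed, so the already-trimmed part may be discarded again later.
That is the overtrim cost accepted in exchange for greater coalescing.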

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 77a5904756c5..b9e3bedad878 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
 }
 
 static int unpin_extent_range(struct btrfs_fs_info *fs_info,
-			      u64 start, u64 end,
+			      u64 start, u64 end, u32 fsc_flags,
 			      const bool return_free_space)
 {
 	struct btrfs_block_group_cache *cache = NULL;
@@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		if (start < cache->last_byte_to_unpin) {
 			len = min(len, cache->last_byte_to_unpin - start);
 			if (return_free_space)
-				btrfs_add_free_space(cache, start, len);
+				__btrfs_add_free_space(fs_info,
+						       cache->free_space_ctl,
+						       start, len, fsc_flags);
 		}
 
 		start += len;
@@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 
 	while (!trans->aborted) {
 		struct extent_state *cached_state = NULL;
+		u32 fsc_flags = 0;
 
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
@@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 			break;
 		}
 
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
 			ret = btrfs_discard_extent(fs_info, start,
 						   end + 1 - start, NULL);
+			fsc_flags |= BTRFS_FSC_TRIMMED;
+		}
 
 		clear_extent_dirty(unpin, start, end, &cached_state);
-		unpin_extent_range(fs_info, start, end, true);
+		unpin_extent_range(fs_info, start, end, fsc_flags, true);
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 		free_extent_state(cached_state);
 		cond_resched();
@@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 				   u64 start, u64 end)
 {
-	return unpin_extent_range(fs_info, start, end, false);
+	return unpin_extent_range(fs_info, start, end, 0, false);
 }
 
 /*
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d54dcd0ab230..f119895292b8 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 			goto free_cache;
 		}
 
+		/*
+		 * Sync discard ensures that the free space cache is always
+		 * trimmed.  So when reading this in, the state should reflect
+		 * that.
+		 */
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+			e->flags |= BTRFS_FSC_TRIMMED;
+
 		if (!e->bytes) {
 			kmem_cache_free(btrfs_free_space_cachep, e);
 			goto free_cache;
@@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
 	bool merged = false;
 	u64 offset = info->offset;
 	u64 bytes = info->bytes;
+	bool is_trimmed = btrfs_free_space_trimmed(info);
 
 	/*
 	 * first we want to see if there is free space adjacent to the range we
@@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
 	else
 		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
 
-	if (right_info && !right_info->bitmap) {
+	if (right_info && !right_info->bitmap &&
+	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
 		if (update_stat)
 			unlink_free_space(ctl, right_info);
 		else
@@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
 	}
 
 	if (left_info && !left_info->bitmap &&
-	    left_info->offset + left_info->bytes == offset) {
+	    left_info->offset + left_info->bytes == offset &&
+	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
 		if (update_stat)
 			unlink_free_space(ctl, left_info);
 		else
@@ -2225,6 +2236,9 @@ static bool steal_from_bitmap_to_end(struct btrfs_free_space_ctl *ctl,
 	bytes = (j - i) * ctl->unit;
 	info->bytes += bytes;
 
+	if (!btrfs_free_space_trimmed(bitmap))
+		info->flags &= ~BTRFS_FSC_TRIMMED;
+
 	if (update_stat)
 		bitmap_clear_bits(ctl, bitmap, end, bytes);
 	else
@@ -2278,6 +2292,9 @@ static bool steal_from_bitmap_to_front(struct btrfs_free_space_ctl *ctl,
 	info->offset -= bytes;
 	info->bytes += bytes;
 
+	if (!btrfs_free_space_trimmed(bitmap))
+		info->flags &= ~BTRFS_FSC_TRIMMED;
+
 	if (update_stat)
 		bitmap_clear_bits(ctl, bitmap, info->offset, bytes);
 	else
@@ -2327,7 +2344,7 @@ static void steal_from_bitmap(struct btrfs_free_space_ctl *ctl,
 
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
-			   u64 offset, u64 bytes)
+			   u64 offset, u64 bytes, u32 flags)
 {
 	struct btrfs_free_space *info;
 	int ret = 0;
@@ -2338,6 +2355,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 
 	info->offset = offset;
 	info->bytes = bytes;
+	info->flags = flags;
 	RB_CLEAR_NODE(&info->offset_index);
 
 	spin_lock(&ctl->tree_lock);
@@ -2385,7 +2403,7 @@ int btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
 {
 	return __btrfs_add_free_space(block_group->fs_info,
 				      block_group->free_space_ctl,
-				      bytenr, size);
+				      bytenr, size, 0);
 }
 
 int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
@@ -2460,8 +2478,11 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 			}
 			spin_unlock(&ctl->tree_lock);
 
-			ret = btrfs_add_free_space(block_group, offset + bytes,
-						   old_end - (offset + bytes));
+			ret = __btrfs_add_free_space(block_group->fs_info,
+						     ctl,
+						     offset + bytes,
+						     old_end - (offset + bytes),
+						     info->flags);
 			WARN_ON(ret);
 			goto out;
 		}
@@ -2630,6 +2651,7 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 	u64 ret = 0;
 	u64 align_gap = 0;
 	u64 align_gap_len = 0;
+	u64 align_gap_flags = 0;
 
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
@@ -2646,6 +2668,7 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 		unlink_free_space(ctl, entry);
 		align_gap_len = offset - entry->offset;
 		align_gap = entry->offset;
+		align_gap_flags = entry->flags;
 
 		entry->offset = offset + bytes;
 		WARN_ON(entry->bytes < bytes + align_gap_len);
@@ -2661,7 +2684,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 
 	if (align_gap_len)
 		__btrfs_add_free_space(block_group->fs_info, ctl,
-				       align_gap, align_gap_len);
+				       align_gap, align_gap_len,
+				       align_gap_flags);
 	return ret;
 }
 
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 39c32c8fc24f..ab3dfc00abb5 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -6,6 +6,8 @@
 #ifndef BTRFS_FREE_SPACE_CACHE_H
 #define BTRFS_FREE_SPACE_CACHE_H
 
+#define BTRFS_FSC_TRIMMED		(1UL << 0)
+
 struct btrfs_free_space {
 	struct rb_node offset_index;
 	u64 offset;
@@ -13,8 +15,14 @@ struct btrfs_free_space {
 	u64 max_extent_size;
 	unsigned long *bitmap;
 	struct list_head list;
+	u32 flags;
 };
 
+static inline bool btrfs_free_space_trimmed(struct btrfs_free_space *info)
+{
+	return (info->flags & BTRFS_FSC_TRIMMED);
+}
+
 struct btrfs_free_space_ctl {
 	spinlock_t tree_lock;
 	struct rb_root free_space_offset;
@@ -84,7 +92,7 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group);
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
-			   u64 bytenr, u64 size);
+			   u64 bytenr, u64 size, u32 flags);
 int btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
 			 u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 63cad7865d75..00e225de4fe6 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -107,7 +107,7 @@ static int caching_kthread(void *data)
 
 		if (last != (u64)-1 && last + 1 != key.objectid) {
 			__btrfs_add_free_space(fs_info, ctl, last + 1,
-					       key.objectid - last - 1);
+					       key.objectid - last - 1, 0);
 			wake_up(&root->ino_cache_wait);
 		}
 
@@ -118,7 +118,7 @@ static int caching_kthread(void *data)
 
 	if (last < root->highest_objectid - 1) {
 		__btrfs_add_free_space(fs_info, ctl, last + 1,
-				       root->highest_objectid - last - 1);
+				       root->highest_objectid - last - 1, 0);
 	}
 
 	spin_lock(&root->ino_cache_lock);
@@ -175,7 +175,8 @@ static void start_caching(struct btrfs_root *root)
 	ret = btrfs_find_free_objectid(root, &objectid);
 	if (!ret && objectid <= BTRFS_LAST_FREE_OBJECTID) {
 		__btrfs_add_free_space(fs_info, ctl, objectid,
-				       BTRFS_LAST_FREE_OBJECTID - objectid + 1);
+				       BTRFS_LAST_FREE_OBJECTID - objectid + 1,
+				       0);
 		wake_up(&root->ino_cache_wait);
 	}
 
@@ -221,7 +222,7 @@ void btrfs_return_ino(struct btrfs_root *root, u64 objectid)
 		return;
 again:
 	if (root->ino_cache_state == BTRFS_CACHE_FINISHED) {
-		__btrfs_add_free_space(fs_info, pinned, objectid, 1);
+		__btrfs_add_free_space(fs_info, pinned, objectid, 1, 0);
 	} else {
 		down_write(&fs_info->commit_root_sem);
 		spin_lock(&root->ino_cache_lock);
@@ -234,7 +235,7 @@ void btrfs_return_ino(struct btrfs_root *root, u64 objectid)
 
 		start_caching(root);
 
-		__btrfs_add_free_space(fs_info, pinned, objectid, 1);
+		__btrfs_add_free_space(fs_info, pinned, objectid, 1, 0);
 
 		up_write(&fs_info->commit_root_sem);
 	}
@@ -281,7 +282,7 @@ void btrfs_unpin_free_ino(struct btrfs_root *root)
 		spin_unlock(rbroot_lock);
 		if (count)
 			__btrfs_add_free_space(root->fs_info, ctl,
-					       info->offset, count);
+					       info->offset, count, 0);
 		kmem_cache_free(btrfs_free_space_cachep, info);
 	}
 }
-- 
2.17.1



* [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (2 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 14:16   ` Josef Bacik
  2019-10-15 12:23   ` David Sterba
  2019-10-07 20:17 ` [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
                   ` (16 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

There is a cap in btrfs on the number of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of whether a certain bit is
trimmed or not in a second bitmap, keep track of the relative state of
the bitmap.

With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off for simplicity, we keep track of whether discarding a
bitmap is in progress. If we fully scan a bitmap and trim as necessary,
the bitmap is marked clean. This has some caveats as the min block size
may skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.

BTRFS_FSC_TRIMMING_BITMAP is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked BTRFS_FSC_TRIMMED if
the trimming code was able to reach the end of it and the former is
still set.
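
As a rough illustration of the flag handling described above, here is a
small user-space model. It is only a sketch: struct bitmap_model and the
model_*() helpers are hypothetical names that do not exist in btrfs, and
only the two bit positions are taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the patch; the rest of this model is hypothetical. */
#define FSC_TRIMMED          (1U << 0)
#define FSC_TRIMMING_BITMAP  (1U << 1)

/* Stand-in for a bitmap entry; only the flags matter for this sketch. */
struct bitmap_model {
	uint32_t flags;
};

/* A scan starting at the beginning of the bitmap marks it as in progress. */
static void model_begin_full_scan(struct bitmap_model *b)
{
	b->flags |= FSC_TRIMMING_BITMAP;
}

/* Adding untrimmed space dirties the bitmap and invalidates a running scan. */
static void model_add_untrimmed(struct bitmap_model *b)
{
	b->flags &= ~(FSC_TRIMMED | FSC_TRIMMING_BITMAP);
}

/*
 * The scan reached the end of the bitmap.  The bitmap only becomes clean if
 * the in-progress bit survived, i.e. nothing untrimmed was added meanwhile.
 */
static bool model_end_full_scan(struct bitmap_model *b)
{
	if (!(b->flags & FSC_TRIMMING_BITMAP))
		return false;
	b->flags |= FSC_TRIMMED;
	b->flags &= ~FSC_TRIMMING_BITMAP;
	return true;
}
```

The point of the second bit is that a scan which was interrupted by new
free space cannot mark the bitmap clean by accident.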

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/free-space-cache.c | 83 +++++++++++++++++++++++++++++++++----
 fs/btrfs/free-space-cache.h |  7 ++++
 2 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index f119895292b8..129b9a164b35 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1975,11 +1975,14 @@ static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
 
 static u64 add_bytes_to_bitmap(struct btrfs_free_space_ctl *ctl,
 			       struct btrfs_free_space *info, u64 offset,
-			       u64 bytes)
+			       u64 bytes, u32 flags)
 {
 	u64 bytes_to_set = 0;
 	u64 end;
 
+	if (!(flags & BTRFS_FSC_TRIMMED))
+		info->flags &= ~(BTRFS_FSC_TRIMMED | BTRFS_FSC_TRIMMING_BITMAP);
+
 	end = info->offset + (u64)(BITS_PER_BITMAP * ctl->unit);
 
 	bytes_to_set = min(end - offset, bytes);
@@ -2054,10 +2057,12 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 	struct btrfs_block_group_cache *block_group = NULL;
 	int added = 0;
 	u64 bytes, offset, bytes_added;
+	u32 flags;
 	int ret;
 
 	bytes = info->bytes;
 	offset = info->offset;
+	flags = info->flags;
 
 	if (!ctl->op->use_bitmap(ctl, info))
 		return 0;
@@ -2093,7 +2098,7 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 
 		if (entry->offset == offset_to_bitmap(ctl, offset)) {
 			bytes_added = add_bytes_to_bitmap(ctl, entry,
-							  offset, bytes);
+							  offset, bytes, flags);
 			bytes -= bytes_added;
 			offset += bytes_added;
 		}
@@ -2112,7 +2117,8 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 		goto new_bitmap;
 	}
 
-	bytes_added = add_bytes_to_bitmap(ctl, bitmap_info, offset, bytes);
+	bytes_added = add_bytes_to_bitmap(ctl, bitmap_info, offset, bytes,
+					  flags);
 	bytes -= bytes_added;
 	offset += bytes_added;
 	added = 0;
@@ -2146,6 +2152,7 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 		/* allocate the bitmap */
 		info->bitmap = kmem_cache_zalloc(btrfs_free_space_bitmap_cachep,
 						 GFP_NOFS);
+		info->flags |= BTRFS_FSC_TRIMMED;
 		spin_lock(&ctl->tree_lock);
 		if (!info->bitmap) {
 			ret = -ENOMEM;
@@ -3295,6 +3302,41 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 	return ret;
 }
 
+/*
+ * If we break out of trimming a bitmap prematurely, we should reset the
+ * trimming bit.  In a rather contrived case, it's possible to race here so
+ * clear BTRFS_FSC_TRIMMED as well.
+ *
+ * start = start of bitmap
+ * end = near end of bitmap
+ *
+ * Thread 1:			Thread 2:
+ * trim_bitmaps(start)
+ *				trim_bitmaps(end)
+ *				end_trimming_bitmap()
+ * reset_trimming_bitmap()
+ */
+static void reset_trimming_bitmap(struct btrfs_free_space_ctl *ctl, u64 offset)
+{
+	struct btrfs_free_space *info;
+
+	spin_lock(&ctl->tree_lock);
+
+	info = tree_search_offset(ctl, offset, 1, 0);
+	if (info)
+		info->flags &= ~(BTRFS_FSC_TRIMMED | BTRFS_FSC_TRIMMING_BITMAP);
+
+	spin_unlock(&ctl->tree_lock);
+}
+
+static void end_trimming_bitmap(struct btrfs_free_space *entry)
+{
+	if (btrfs_free_space_trimming_bitmap(entry)) {
+		entry->flags |= BTRFS_FSC_TRIMMED;
+		entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
+	}
+}
+
 static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			u64 *total_trimmed, u64 start, u64 end, u64 minlen)
 {
@@ -3326,9 +3368,26 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			goto next;
 		}
 
+		/*
+		 * Async discard bitmap trimming begins by setting the start
+		 * to be key.objectid and the offset_to_bitmap() aligns to the
+		 * start of the bitmap.  This lets us know we are fully
+		 * scanning the bitmap rather than only some portion of it.
+		 */
+		if (start == offset)
+			entry->flags |= BTRFS_FSC_TRIMMING_BITMAP;
+
 		bytes = minlen;
 		ret2 = search_bitmap(ctl, entry, &start, &bytes, false);
 		if (ret2 || start >= end) {
+			/*
+			 * This keeps the invariant that all bytes are trimmed
+			 * if BTRFS_FSC_TRIMMED is set on a bitmap.
+			 */
+			if (ret2 && !minlen)
+				end_trimming_bitmap(entry);
+			else
+				entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
@@ -3337,6 +3396,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 
 		bytes = min(bytes, end - start);
 		if (bytes < minlen) {
+			entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
@@ -3354,18 +3414,21 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
 				  start, bytes, &trim_entry);
-		if (ret)
+		if (ret) {
+			reset_trimming_bitmap(ctl, offset);
 			break;
+		}
 next:
 		if (next_bitmap) {
 			offset += BITS_PER_BITMAP * ctl->unit;
+			start = offset;
 		} else {
 			start += bytes;
-			if (start >= offset + BITS_PER_BITMAP * ctl->unit)
-				offset += BITS_PER_BITMAP * ctl->unit;
 		}
 
 		if (fatal_signal_pending(current)) {
+			if (start != offset)
+				reset_trimming_bitmap(ctl, offset);
 			ret = -ERESTARTSYS;
 			break;
 		}
@@ -3419,6 +3482,7 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 			   u64 *trimmed, u64 start, u64 end, u64 minlen)
 {
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 
 	*trimmed = 0;
@@ -3436,6 +3500,9 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 		goto out;
 
 	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+	/* if we ended in the middle of a bitmap, reset the trimming flag */
+	if (end % (BITS_PER_BITMAP * ctl->unit))
+		reset_trimming_bitmap(ctl, offset_to_bitmap(ctl, end));
 out:
 	btrfs_put_block_group_trimming(block_group);
 	return ret;
@@ -3620,6 +3687,7 @@ int test_add_free_space_entry(struct btrfs_block_group_cache *cache,
 	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
 	struct btrfs_free_space *info = NULL, *bitmap_info;
 	void *map = NULL;
+	u32 flags = 0;
 	u64 bytes_added;
 	int ret;
 
@@ -3661,7 +3729,8 @@ int test_add_free_space_entry(struct btrfs_block_group_cache *cache,
 		info = NULL;
 	}
 
-	bytes_added = add_bytes_to_bitmap(ctl, bitmap_info, offset, bytes);
+	bytes_added = add_bytes_to_bitmap(ctl, bitmap_info, offset, bytes,
+					  flags);
 
 	bytes -= bytes_added;
 	offset += bytes_added;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index ab3dfc00abb5..dc73ec8d34bb 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -7,6 +7,7 @@
 #define BTRFS_FREE_SPACE_CACHE_H
 
 #define BTRFS_FSC_TRIMMED		(1UL << 0)
+#define BTRFS_FSC_TRIMMING_BITMAP	(1UL << 1)
 
 struct btrfs_free_space {
 	struct rb_node offset_index;
@@ -23,6 +24,12 @@ static inline bool btrfs_free_space_trimmed(struct btrfs_free_space *info)
 	return (info->flags & BTRFS_FSC_TRIMMED);
 }
 
+static inline
+bool btrfs_free_space_trimming_bitmap(struct btrfs_free_space *info)
+{
+	return (info->flags & BTRFS_FSC_TRIMMING_BITMAP);
+}
+
 struct btrfs_free_space_ctl {
 	spinlock_t tree_lock;
 	struct rb_root free_space_offset;
-- 
2.17.1



* [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (3 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 14:38   ` Josef Bacik
  2019-10-15 12:49   ` David Sterba
  2019-10-07 20:17 ` [PATCH 06/19] btrfs: handle empty block_group removal Dennis Zhou
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, every time a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning an extent to the free space cache, the extent is just
marked as untrimmed. The block_group is then added to a LRU which then
feeds into a workqueue to issue discards at a much slower rate. Full
discarding of unused block groups is still done and will be addressed in
a future patch in this series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one and the
same.
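
To make the delay handling concrete, the following user-space sketch
models how a block group's discard deadline could be computed. It is an
assumption-laden illustration: struct bg_model, model_queue() and
model_wait_ns() are hypothetical and not part of btrfs. Only the 300
second delay and the rule that the deadline is set when the block group
is first queued are taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC      1000000000ULL
/* Same value as BTRFS_DISCARD_DELAY in the patch. */
#define DISCARD_DELAY_NS  (300ULL * NSEC_PER_SEC)

/* Hypothetical stand-in for the discard fields of a block group. */
struct bg_model {
	uint64_t discard_delay;	/* absolute time in ns before which we wait */
	int queued;		/* nonzero once the group is on the list */
};

/*
 * Queue a block group.  The deadline is only set on the first queueing, so
 * freeing more space later does not keep pushing the discard further out.
 */
static void model_queue(struct bg_model *bg, uint64_t now)
{
	if (!bg->queued) {
		bg->discard_delay = now + DISCARD_DELAY_NS;
		bg->queued = 1;
	}
}

/* How long the worker still has to wait, or 0 if the deadline has passed. */
static uint64_t model_wait_ns(const struct bg_model *bg, uint64_t now)
{
	return now < bg->discard_delay ? bg->discard_delay - now : 0;
}
```

In the patch itself the remaining wait is converted with
nsecs_to_jiffies() before being handed to mod_delayed_work().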

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |   4 +
 fs/btrfs/block-group.h      |  10 ++
 fs/btrfs/ctree.h            |  17 +++
 fs/btrfs/discard.c          | 200 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  49 +++++++++
 fs/btrfs/disk-io.c          |  15 ++-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  29 +++++-
 fs/btrfs/super.c            |  35 ++++++-
 10 files changed, 356 insertions(+), 9 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index afe86028246a..8bbbe7488328 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1273,6 +1274,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1622,6 +1625,7 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index c391800388dd..0f9a1c91753f 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -115,7 +115,11 @@ struct btrfs_block_group_cache {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_delay;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -157,6 +161,12 @@ struct btrfs_block_group_cache {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline
+u64 btrfs_block_group_end(struct btrfs_block_group_cache *cache)
+{
+	return (cache->key.objectid + cache->key.offset);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group_cache *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1877586576aa..419445868909 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -438,6 +438,17 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* discard control */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group_cache *cache;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -524,6 +535,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -817,6 +831,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1190,6 +1206,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..6df124639e55
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Facebook.  All rights reserved.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/ktime.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+#define BTRFS_DISCARD_DELAY		(300ULL * NSEC_PER_SEC)
+
+static struct list_head *
+btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
+		       struct btrfs_block_group_cache *cache)
+{
+	return &discard_ctl->discard_list[cache->discard_index];
+}
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache)
+{
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (list_empty(&cache->discard_list))
+		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
+
+	list_move_tail(&cache->discard_list,
+		       btrfs_get_discard_list(discard_ctl, cache));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group_cache *cache)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (cache == discard_ctl->cache) {
+		running = true;
+		discard_ctl->cache = NULL;
+	}
+
+	cache->discard_delay = 0;
+	list_del_init(&cache->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+static struct btrfs_block_group_cache *
+find_next_cache(struct btrfs_discard_ctl *discard_ctl, u64 now)
+{
+	struct btrfs_block_group_cache *ret_cache = NULL, *cache;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			cache = list_first_entry(discard_list,
+						 struct btrfs_block_group_cache,
+						 discard_list);
+
+			if (!ret_cache)
+				ret_cache = cache;
+
+			if (ret_cache->discard_delay < now)
+				break;
+
+			if (ret_cache->discard_delay > cache->discard_delay)
+				ret_cache = cache;
+		}
+	}
+
+	return ret_cache;
+}
+
+static struct btrfs_block_group_cache *
+peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group_cache *cache;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	cache = find_next_cache(discard_ctl, now);
+
+	if (cache && now < cache->discard_delay)
+		cache = NULL;
+
+	discard_ctl->cache = cache;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return cache;
+}
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache)
+{
+	if (remove_from_discard_list(discard_ctl, cache)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group_cache *cache;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	cache = find_next_cache(discard_ctl, now);
+	if (cache) {
+		u64 delay = 0;
+
+		if (now < cache->discard_delay)
+			delay = nsecs_to_jiffies(cache->discard_delay - now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group_cache *cache;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	cache = peek_discard_list(discard_ctl);
+	if (!cache || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(cache, &trimmed, cache->key.objectid,
+			       btrfs_block_group_end(cache), 0);
+
+	remove_from_discard_list(discard_ctl, cache);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..6d7805bb0eb7
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Facebook.  All rights reserved.
+ */
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+#include <linux/kernel.h>
+#include <linux/workqueue.h>
+
+#include "ctree.h"
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache);
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+static inline
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+static inline
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group_cache *cache)
+{
+	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
+		return;
+
+	btrfs_add_to_discard_list(discard_ctl, cache);
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 044981cf6df9..a304ec972f67 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2009,6 +2010,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
 	btrfs_destroy_workqueue(fs_info->extent_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2218,6 +2221,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 		btrfs_alloc_workqueue(fs_info, "extent-refs", flags,
 				      min_t(u64, fs_devices->num_devices,
 					    max_active), 8);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->submit_workers && fs_info->flush_workers &&
@@ -2229,7 +2234,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
 	      fs_info->extent_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2772,6 +2778,8 @@ int open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3284,6 +3292,8 @@ int open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3993,6 +4003,9 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	 */
 	kthread_park(fs_info->cleaner_kthread);
 
+	/* cancel or finish ongoing work */
+	btrfs_discard_cleanup(fs_info);
+
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b9e3bedad878..d69ee5f51b38 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2919,6 +2920,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 129b9a164b35..54ff1bc97777 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -2353,6 +2354,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
 			   u64 offset, u64 bytes, u32 flags)
 {
+	struct btrfs_block_group_cache *cache = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2402,6 +2404,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (!(flags & BTRFS_FSC_TRIMMED))
+		btrfs_discard_queue_work(&fs_info->discard_ctl, cache);
+
 	return ret;
 }
 
@@ -3175,14 +3180,17 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group_cache *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
-		       struct btrfs_trim_range *trim_entry)
+		       u32 reserved_flags, struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
 	u64 trimmed = 0;
+	u32 flags = 0;
 
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
@@ -3195,11 +3203,19 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		flags |= BTRFS_FSC_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start, reserved_flags);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end,
+				       reserved_end - end, reserved_flags);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, flags);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3226,6 +3242,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	u32 extent_flags;
 	u64 bytes;
 
 	while (start < end) {
@@ -3267,6 +3284,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_flags = entry->flags;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3285,7 +3303,8 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_flags,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3413,7 +3432,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a02fece949cb..3da60d7be535 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_version,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_version, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_version:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1714,6 +1735,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* if we toggled discard async */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1761,6 +1790,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
-- 
2.17.1



* [PATCH 06/19] btrfs: handle empty block_group removal
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (4 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:00   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 07/19] btrfs: discard one region at a time in async discard Dennis Zhou
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.

Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order,
as in the future we will slowly trim even fully free block_groups. The
ordering helps us keep making progress on the same block_group, rather
than jumping to, say, the most recently freed block_group or having to
search past the fully freed block_groups at the beginning of a list to
find the insertion point.

The new order of events is as follows: a fully freed block_group is
first placed on the discard queue. Once it has been processed, it is
placed on the unused_bgs list and the original sequence of events then
happens, just without the final whole block_group discard.
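
As an illustration only (this is not part of the patch), the new order
of events can be modelled in userspace roughly as below. struct
bg_model, enum bg_home and both helpers are hypothetical names made up
for this sketch, and the real code makes the same decisions under
discard_ctl->lock and block_group->lock:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel structures. */
enum bg_home {
	BG_NOWHERE,		/* on no list */
	BG_DISCARD_FREE,	/* discard_list[0]: fully free block groups */
	BG_DISCARD_PARTIAL,	/* discard_list[1]: everything else */
	BG_UNUSED_BGS,		/* fs_info->unused_bgs, handled by the cleaner */
};

struct bg_model {
	uint64_t size;		/* plays the role of cache->key.offset */
	uint64_t free_space;	/* plays the role of free_space_ctl->free_space */
	uint64_t untrimmed;	/* free bytes not yet discarded */
	enum bg_home home;
};

/* Pick a discard list, like btrfs_discard_queue_work() does. */
static void model_queue(struct bg_model *bg)
{
	assert(bg->untrimmed <= bg->free_space && bg->free_space <= bg->size);
	bg->home = (bg->free_space == bg->size) ? BG_DISCARD_FREE
						 : BG_DISCARD_PARTIAL;
}

/* The workqueue trimmed `trimmed` bytes; decide where the group goes next. */
static void model_discard_done(struct bg_model *bg, uint64_t trimmed)
{
	assert(trimmed <= bg->untrimmed);
	bg->untrimmed -= trimmed;

	if (bg->free_space != bg->size)
		bg->home = BG_NOWHERE;		/* still in use, nothing to remove */
	else if (bg->untrimmed == 0)
		bg->home = BG_UNUSED_BGS;	/* safe to hand to the cleaner */
	else
		bg->home = BG_DISCARD_FREE;	/* more was freed meanwhile, trim again */
}
```

A fully free group passed through model_queue() and then
model_discard_done() with all of its bytes trimmed ends up on
BG_UNUSED_BGS, which is the point where the original cleaner path takes
over. The sketch is deliberately stricter than the patch in one place:
it only hands fully free groups to the cleaner.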

The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
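
For readers who want the flag-flip handling spelled out, here is a
minimal userspace sketch of the two punting directions. It is an
assumption-laden model, not the kernel code: struct bg_entry and enum
bg_list are hypothetical, the lists are reduced to a tag on each entry,
and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list tags for illustration only. */
enum bg_list {
	LIST_NONE,
	LIST_UNUSED_BGS,	/* stands in for fs_info->unused_bgs */
	LIST_DISCARD_FREE,	/* stands in for discard_list[0] */
	LIST_DISCARD_PARTIAL,	/* stands in for discard_list[1] */
};

struct bg_entry {
	bool fully_free;	/* free_space == size of the block group */
	enum bg_list list;
};

/*
 * Async discard was switched on: loosely follows
 * btrfs_discard_punt_unused_bgs_list(). Returns how many groups moved.
 */
static size_t punt_unused_to_discard(struct bg_entry *bgs, size_t n)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		if (bgs[i].list != LIST_UNUSED_BGS)
			continue;
		bgs[i].list = LIST_DISCARD_FREE;
		moved++;
	}
	return moved;
}

/*
 * Async discard was switched off: loosely follows
 * btrfs_discard_purge_list(). Every group leaves the discard lists and
 * only fully free ones are handed to unused_bgs.
 */
static size_t purge_discard_lists(struct bg_entry *bgs, size_t n)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		if (bgs[i].list != LIST_DISCARD_FREE &&
		    bgs[i].list != LIST_DISCARD_PARTIAL)
			continue;
		bgs[i].list = LIST_NONE;
		if (bgs[i].fully_free) {
			bgs[i].list = LIST_UNUSED_BGS;
			moved++;
		}
	}
	return moved;
}
```

Calling punt_unused_to_discard() followed by purge_discard_lists() on
the same array returns each fully free group to LIST_UNUSED_BGS, which
matches the round trip described above.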

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/block-group.c      | 39 ++++++++++++++++++---
 fs/btrfs/ctree.h            |  2 +-
 fs/btrfs/discard.c          | 68 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/discard.h          | 11 +++++-
 fs/btrfs/free-space-cache.c | 33 ++++++++++++++++++
 fs/btrfs/free-space-cache.h |  1 +
 fs/btrfs/scrub.c            |  7 +++-
 7 files changed, 153 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 8bbbe7488328..73e5a9384491 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1251,6 +1251,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_space_info *space_info;
 	struct btrfs_trans_handle *trans;
+	bool async_trim_enabled = btrfs_test_opt(fs_info, DISCARD_ASYNC);
 	int ret = 0;
 
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
@@ -1260,6 +1261,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 start, end;
 		int trimming;
+		bool async_trimmed;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group_cache,
@@ -1281,10 +1283,20 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
 		spin_lock(&block_group->lock);
+
+		/* async discard requires block groups to be fully trimmed */
+		async_trimmed = (!btrfs_test_opt(fs_info, DISCARD_ASYNC) ||
+				 btrfs_is_free_space_trimmed(block_group));
+
 		if (block_group->reserved || block_group->pinned ||
 		    btrfs_block_group_used(&block_group->item) ||
 		    block_group->ro ||
-		    list_is_singular(&block_group->list)) {
+		    list_is_singular(&block_group->list) ||
+		    !async_trimmed) {
+			/* requeue if we failed because of async discard */
+			if (!async_trimmed)
+				btrfs_discard_queue_work(&fs_info->discard_ctl,
+							 block_group);
 			/*
 			 * We want to bail if we made new allocations or have
 			 * outstanding allocations in this block group.  We do
@@ -1367,6 +1379,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
+		if (!async_trim_enabled &&
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+			goto flip_async;
+
 		/* DISCARD can flip during remount */
 		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
 
@@ -1411,6 +1427,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
+	return;
+
+flip_async:
+	btrfs_end_transaction(trans);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	btrfs_put_block_group(block_group);
+	btrfs_discard_punt_unused_bgs_list(fs_info);
 }
 
 void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
@@ -1618,6 +1641,8 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache(
 	cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
 	set_free_space_tree_thresholds(cache);
 
+	cache->discard_index = 1;
+
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
 	init_rwsem(&cache->data_rwsem);
@@ -1829,7 +1854,11 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			inc_block_group_ro(cache, 1);
 		} else if (btrfs_block_group_used(&cache->item) == 0) {
 			ASSERT(list_empty(&cache->bg_list));
-			btrfs_mark_bg_unused(cache);
+			if (btrfs_test_opt(info, DISCARD_ASYNC))
+				btrfs_add_to_discard_free_list(
+						&info->discard_ctl, cache);
+			else
+				btrfs_mark_bg_unused(cache);
 		}
 	}
 
@@ -2724,8 +2753,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		 * dirty list to avoid races between cleaner kthread and space
 		 * cache writeout.
 		 */
-		if (!alloc && old_val == 0)
-			btrfs_mark_bg_unused(cache);
+		if (!alloc && old_val == 0) {
+			if (!btrfs_test_opt(info, DISCARD_ASYNC))
+				btrfs_mark_bg_unused(cache);
+		}
 
 		btrfs_put_block_group(cache);
 		total -= num_bytes;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 419445868909..c328d2e85e4d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
 };
 
 /* discard control */
-#define BTRFS_NR_DISCARD_LISTS		1
+#define BTRFS_NR_DISCARD_LISTS		2
 
 struct btrfs_discard_ctl {
 	struct workqueue_struct *discard_workers;
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 6df124639e55..fb92b888774d 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -29,8 +29,11 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 
 	spin_lock(&discard_ctl->lock);
 
-	if (list_empty(&cache->discard_list))
+	if (list_empty(&cache->discard_list) || !cache->discard_index) {
+		if (!cache->discard_index)
+			cache->discard_index = 1;
 		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
+	}
 
 	list_move_tail(&cache->discard_list,
 		       btrfs_get_discard_list(discard_ctl, cache));
@@ -38,6 +41,23 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 	spin_unlock(&discard_ctl->lock);
 }
 
+void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
+				    struct btrfs_block_group_cache *cache)
+{
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!list_empty(&cache->discard_list))
+		list_del_init(&cache->discard_list);
+
+	cache->discard_index = 0;
+	cache->discard_delay = now;
+	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
+
+	spin_unlock(&discard_ctl->lock);
+}
+
 static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
 				     struct btrfs_block_group_cache *cache)
 {
@@ -161,10 +181,52 @@ static void btrfs_discard_workfn(struct work_struct *work)
 			       btrfs_block_group_end(cache), 0);
 
 	remove_from_discard_list(discard_ctl, cache);
+	if (btrfs_is_free_space_trimmed(cache))
+		btrfs_mark_bg_unused(cache);
+	else if (cache->free_space_ctl->free_space == cache->key.offset)
+		btrfs_add_to_discard_free_list(discard_ctl, cache);
 
 	btrfs_discard_schedule_work(discard_ctl, false);
 }
 
+void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_block_group_cache *cache, *next;
+
+	/* we enabled async discard, so punt all to the queue */
+	spin_lock(&fs_info->unused_bgs_lock);
+
+	list_for_each_entry_safe(cache, next, &fs_info->unused_bgs, bg_list) {
+		list_del_init(&cache->bg_list);
+		btrfs_add_to_discard_free_list(&fs_info->discard_ctl, cache);
+	}
+
+	spin_unlock(&fs_info->unused_bgs_lock);
+}
+
+static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group_cache *cache, *next;
+	int i;
+
+	spin_lock(&discard_ctl->lock);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		list_for_each_entry_safe(cache, next,
+					 &discard_ctl->discard_list[i],
+					 discard_list) {
+			list_del_init(&cache->discard_list);
+			spin_unlock(&discard_ctl->lock);
+			if (cache->free_space_ctl->free_space ==
+			    cache->key.offset)
+				btrfs_mark_bg_unused(cache);
+			spin_lock(&discard_ctl->lock);
+		}
+	}
+
+	spin_unlock(&discard_ctl->lock);
+}
+
 void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
 {
 	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
@@ -172,6 +234,8 @@ void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
 		return;
 	}
 
+	btrfs_discard_punt_unused_bgs_list(fs_info);
+
 	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
 }
 
@@ -197,4 +261,6 @@ void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
 {
 	btrfs_discard_stop(fs_info);
 	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+
+	btrfs_discard_purge_list(&fs_info->discard_ctl);
 }
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 6d7805bb0eb7..55f79b624943 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -10,9 +10,14 @@
 #include <linux/workqueue.h>
 
 #include "ctree.h"
+#include "block-group.h"
+#include "free-space-cache.h"
 
 void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache);
+void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
+				    struct btrfs_block_group_cache *cache);
+void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
 
 void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache);
@@ -41,7 +46,11 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
 	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
 		return;
 
-	btrfs_add_to_discard_list(discard_ctl, cache);
+	if (cache->free_space_ctl->free_space == cache->key.offset)
+		btrfs_add_to_discard_free_list(discard_ctl, cache);
+	else
+		btrfs_add_to_discard_list(discard_ctl, cache);
+
 	if (!delayed_work_pending(&discard_ctl->work))
 		btrfs_discard_schedule_work(discard_ctl, false);
 }
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 54ff1bc97777..ed0e7ee4c78d 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2653,6 +2653,31 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
 
 }
 
+bool btrfs_is_free_space_trimmed(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
+	struct btrfs_free_space *info;
+	struct rb_node *node;
+	bool ret = true;
+
+	spin_lock(&ctl->tree_lock);
+	node = rb_first(&ctl->free_space_offset);
+
+	while (node) {
+		info = rb_entry(node, struct btrfs_free_space, offset_index);
+
+		if (!btrfs_free_space_trimmed(info)) {
+			ret = false;
+			break;
+		}
+
+		node = rb_next(node);
+	}
+
+	spin_unlock(&ctl->tree_lock);
+	return ret;
+}
+
 u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 			       u64 offset, u64 bytes, u64 empty_size,
 			       u64 *max_extent_size)
@@ -2739,6 +2764,9 @@ int btrfs_return_cluster_to_free_space(
 	ret = __btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&ctl->tree_lock);
 
+	btrfs_discard_queue_work(&block_group->fs_info->discard_ctl,
+				 block_group);
+
 	/* finally drop our ref */
 	btrfs_put_block_group(block_group);
 	return ret;
@@ -3097,6 +3125,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
 	u64 min_bytes;
 	u64 cont1_bytes;
 	int ret;
+	bool found_cluster = false;
 
 	/*
 	 * Choose the minimum extent size we'll require for this
@@ -3149,6 +3178,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
 		list_del_init(&entry->list);
 
 	if (!ret) {
+		found_cluster = true;
 		atomic_inc(&block_group->count);
 		list_add_tail(&cluster->block_group_list,
 			      &block_group->cluster_list);
@@ -3160,6 +3190,9 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
 	spin_unlock(&cluster->lock);
 	spin_unlock(&ctl->tree_lock);
 
+	if (found_cluster)
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 	return ret;
 }
 
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index dc73ec8d34bb..b688e70a7512 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -107,6 +107,7 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl);
 void btrfs_remove_free_space_cache(struct btrfs_block_group_cache
 				     *block_group);
+bool btrfs_is_free_space_trimmed(struct btrfs_block_group_cache *cache);
 u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 			       u64 offset, u64 bytes, u64 empty_size,
 			       u64 *max_extent_size);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f7d4e03f4c5d..49927a642b5a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -8,6 +8,7 @@
 #include <linux/sched/mm.h>
 #include <crypto/hash.h>
 #include "ctree.h"
+#include "discard.h"
 #include "volumes.h"
 #include "disk-io.h"
 #include "ordered-data.h"
@@ -3683,7 +3684,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache->removed && !cache->ro && cache->reserved == 0 &&
 		    btrfs_block_group_used(&cache->item) == 0) {
 			spin_unlock(&cache->lock);
-			btrfs_mark_bg_unused(cache);
+			if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+				btrfs_add_to_discard_free_list(
+						&fs_info->discard_ctl, cache);
+			else
+				btrfs_mark_bg_unused(cache);
 		} else {
 			spin_unlock(&cache->lock);
 		}
-- 
2.17.1



* [PATCH 07/19] btrfs: discard one region at a time in async discard
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (5 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 06/19] btrfs: handle empty block_group removal Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:22   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 08/19] btrfs: track discardable extents for async discard Dennis Zhou
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

The prior two patches added discarding via a background workqueue. That
just piggybacked off of the fstrim code to trim the whole block_group at
once. Inevitably, this performs worse and aggressively overtrims, but it
was nice to plumb in the other infrastructure first to keep the patches
easier to review.

This adds the real goal of this series, which is discarding slowly
(i.e. a slow, long-running fstrim). The discarding is split into two
phases, extents and then bitmaps. The reason for this is twofold. First,
the bitmap regions overlap the extent regions. Second, discarding the
extents first gives the newly trimmed bitmaps the highest chance of
coalescing when being re-added to the free space cache.
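
The two-phase walk can be sketched in userspace as follows. This is an
illustration under stated assumptions rather than the patch's code:
struct region, struct bg_model, trim_one() and discard_step() are
hypothetical, regions are assumed sorted by start offset, and a region
is "trimmed" by setting a flag instead of issuing a discard. The MODEL_*
flags play the roles of BTRFS_DISCARD_RESET_CURSOR and
BTRFS_DISCARD_BITMAPS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RESET_CURSOR	(1u << 0)
#define MODEL_BITMAPS		(1u << 1)

/* Hypothetical free space region; sorted by start within each array. */
struct region {
	uint64_t start;
	uint64_t len;
	bool trimmed;
};

struct bg_model {
	uint64_t start, end;	/* block group range */
	uint64_t cursor;	/* plays the role of discard_cursor */
	uint32_t flags;		/* plays the role of discard_flags */
	struct region *extents;
	size_t nr_extents;
	struct region *bitmaps;
	size_t nr_bitmaps;
};

/* Trim at most one untrimmed region at or after the cursor. */
static uint64_t trim_one(struct bg_model *bg, struct region *r, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].trimmed || r[i].start < bg->cursor)
			continue;
		r[i].trimmed = true;
		bg->cursor = r[i].start + r[i].len;
		return r[i].len;
	}
	bg->cursor = bg->end;	/* nothing left in this phase */
	return 0;
}

/* One pass of the worker; returns true once both phases are finished. */
static bool discard_step(struct bg_model *bg, uint64_t *trimmed)
{
	if (bg->flags & MODEL_RESET_CURSOR) {
		bg->cursor = bg->start;
		bg->flags &= ~(MODEL_RESET_CURSOR | MODEL_BITMAPS);
	}
	assert(bg->cursor <= bg->end);

	if (bg->flags & MODEL_BITMAPS)
		*trimmed = trim_one(bg, bg->bitmaps, bg->nr_bitmaps);
	else
		*trimmed = trim_one(bg, bg->extents, bg->nr_extents);

	if (bg->cursor >= bg->end) {
		if (bg->flags & MODEL_BITMAPS)
			return true;		/* bitmap phase done too */
		bg->cursor = bg->start;		/* restart for the bitmap phase */
		bg->flags |= MODEL_BITMAPS;
	}
	return false;
}
```

Each call to discard_step() handles at most one region, which is the
"one region at a time" behaviour; a caller would reschedule itself
between calls instead of looping.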

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/block-group.h      |   2 +
 fs/btrfs/discard.c          |  73 ++++++++++++++++++++-----
 fs/btrfs/discard.h          |  16 ++++++
 fs/btrfs/extent-tree.c      |   3 +-
 fs/btrfs/free-space-cache.c | 106 ++++++++++++++++++++++++++----------
 fs/btrfs/free-space-cache.h |   6 +-
 6 files changed, 159 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 0f9a1c91753f..b59e6a8ed73d 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -120,6 +120,8 @@ struct btrfs_block_group_cache {
 	struct list_head discard_list;
 	int discard_index;
 	u64 discard_delay;
+	u64 discard_cursor;
+	u32 discard_flags;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index fb92b888774d..26a1e44b4bfa 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -22,21 +22,28 @@ btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
 	return &discard_ctl->discard_list[cache->discard_index];
 }
 
-void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
-			       struct btrfs_block_group_cache *cache)
+static void __btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+					struct btrfs_block_group_cache *cache)
 {
 	u64 now = ktime_get_ns();
 
-	spin_lock(&discard_ctl->lock);
-
 	if (list_empty(&cache->discard_list) || !cache->discard_index) {
 		if (!cache->discard_index)
 			cache->discard_index = 1;
 		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
+		cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
 	}
 
 	list_move_tail(&cache->discard_list,
 		       btrfs_get_discard_list(discard_ctl, cache));
+}
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache)
+{
+	spin_lock(&discard_ctl->lock);
+
+	__btrfs_add_to_discard_list(discard_ctl, cache);
 
 	spin_unlock(&discard_ctl->lock);
 }
@@ -53,6 +60,7 @@ void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
 
 	cache->discard_index = 0;
 	cache->discard_delay = now;
+	cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
 	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
 
 	spin_unlock(&discard_ctl->lock);
@@ -114,13 +122,24 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
 
 	spin_lock(&discard_ctl->lock);
 
+again:
 	cache = find_next_cache(discard_ctl, now);
 
-	if (cache && now < cache->discard_delay)
+	if (cache && now > cache->discard_delay) {
+		discard_ctl->cache = cache;
+		if (cache->discard_index == 0 &&
+		    cache->free_space_ctl->free_space != cache->key.offset) {
+			__btrfs_add_to_discard_list(discard_ctl, cache);
+			goto again;
+		}
+		if (btrfs_discard_reset_cursor(cache)) {
+			cache->discard_cursor = cache->key.objectid;
+			cache->discard_flags &= ~(BTRFS_DISCARD_RESET_CURSOR |
+						  BTRFS_DISCARD_BITMAPS);
+		}
+	} else {
 		cache = NULL;
-
-	discard_ctl->cache = cache;
-
+	}
 	spin_unlock(&discard_ctl->lock);
 
 	return cache;
@@ -173,18 +192,42 @@ static void btrfs_discard_workfn(struct work_struct *work)
 
 	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
 
+again:
 	cache = peek_discard_list(discard_ctl);
 	if (!cache || !btrfs_run_discard_work(discard_ctl))
 		return;
 
-	btrfs_trim_block_group(cache, &trimmed, cache->key.objectid,
-			       btrfs_block_group_end(cache), 0);
+	if (btrfs_discard_bitmaps(cache))
+		btrfs_trim_block_group_bitmaps(cache, &trimmed,
+					       cache->discard_cursor,
+					       btrfs_block_group_end(cache),
+					       0, true);
+	else
+		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
+				       btrfs_block_group_end(cache), 0, true);
+
+	if (cache->discard_cursor >= btrfs_block_group_end(cache)) {
+		if (btrfs_discard_bitmaps(cache)) {
+			remove_from_discard_list(discard_ctl, cache);
+			if (btrfs_is_free_space_trimmed(cache))
+				btrfs_mark_bg_unused(cache);
+			else if (cache->free_space_ctl->free_space ==
+				 cache->key.offset)
+				btrfs_add_to_discard_free_list(discard_ctl,
+							       cache);
+		} else {
+			cache->discard_cursor = cache->key.objectid;
+			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
+		}
+	}
+
+	spin_lock(&discard_ctl->lock);
+	discard_ctl->cache = NULL;
+	spin_unlock(&discard_ctl->lock);
 
-	remove_from_discard_list(discard_ctl, cache);
-	if (btrfs_is_free_space_trimmed(cache))
-		btrfs_mark_bg_unused(cache);
-	else if (cache->free_space_ctl->free_space == cache->key.offset)
-		btrfs_add_to_discard_free_list(discard_ctl, cache);
+	/* we didn't trim anything but we really ought to so try again */
+	if (trimmed == 0)
+		goto again;
 
 	btrfs_discard_schedule_work(discard_ctl, false);
 }
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 55f79b624943..22cfa7e401bb 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -13,6 +13,22 @@
 #include "block-group.h"
 #include "free-space-cache.h"
 
+/* discard flags */
+#define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
+#define BTRFS_DISCARD_BITMAPS           (1UL << 1)
+
+static inline
+bool btrfs_discard_reset_cursor(struct btrfs_block_group_cache *cache)
+{
+	return (cache->discard_flags & BTRFS_DISCARD_RESET_CURSOR);
+}
+
+static inline
+bool btrfs_discard_bitmaps(struct btrfs_block_group_cache *cache)
+{
+	return (cache->discard_flags & BTRFS_DISCARD_BITMAPS);
+}
+
 void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache);
 void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d69ee5f51b38..ff42e4abb01d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5683,7 +5683,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 						     &group_trimmed,
 						     start,
 						     end,
-						     range->minlen);
+						     range->minlen,
+						     false);
 
 			trimmed += group_trimmed;
 			if (ret) {
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ed0e7ee4c78d..97b3074e83c0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3267,7 +3267,8 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 }
 
 static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
-			  u64 *total_trimmed, u64 start, u64 end, u64 minlen)
+			  u64 *total_trimmed, u64 start, u64 end, u64 minlen,
+			  bool async)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	struct btrfs_free_space *entry;
@@ -3284,36 +3285,25 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		mutex_lock(&ctl->cache_writeout_mutex);
 		spin_lock(&ctl->tree_lock);
 
-		if (ctl->free_space < minlen) {
-			spin_unlock(&ctl->tree_lock);
-			mutex_unlock(&ctl->cache_writeout_mutex);
-			break;
-		}
+		if (ctl->free_space < minlen)
+			goto out_unlock;
 
 		entry = tree_search_offset(ctl, start, 0, 1);
-		if (!entry) {
-			spin_unlock(&ctl->tree_lock);
-			mutex_unlock(&ctl->cache_writeout_mutex);
-			break;
-		}
+		if (!entry)
+			goto out_unlock;
 
 		/* skip bitmaps */
-		while (entry->bitmap) {
+		while (entry->bitmap || (async &&
+					 btrfs_free_space_trimmed(entry))) {
 			node = rb_next(&entry->offset_index);
-			if (!node) {
-				spin_unlock(&ctl->tree_lock);
-				mutex_unlock(&ctl->cache_writeout_mutex);
-				goto out;
-			}
+			if (!node)
+				goto out_unlock;
 			entry = rb_entry(node, struct btrfs_free_space,
 					 offset_index);
 		}
 
-		if (entry->offset >= end) {
-			spin_unlock(&ctl->tree_lock);
-			mutex_unlock(&ctl->cache_writeout_mutex);
-			break;
-		}
+		if (entry->offset >= end)
+			goto out_unlock;
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
@@ -3338,10 +3328,15 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
 				  extent_start, extent_bytes, extent_flags,
 				  &trim_entry);
-		if (ret)
+		if (ret) {
+			block_group->discard_cursor = start + bytes;
 			break;
+		}
 next:
 		start += bytes;
+		block_group->discard_cursor = start;
+		if (async && *total_trimmed)
+			break;
 
 		if (fatal_signal_pending(current)) {
 			ret = -ERESTARTSYS;
@@ -3350,7 +3345,14 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		cond_resched();
 	}
-out:
+
+	return ret;
+
+out_unlock:
+	block_group->discard_cursor = btrfs_block_group_end(block_group);
+	spin_unlock(&ctl->tree_lock);
+	mutex_unlock(&ctl->cache_writeout_mutex);
+
 	return ret;
 }
 
@@ -3390,7 +3392,8 @@ static void end_trimming_bitmap(struct btrfs_free_space *entry)
 }
 
 static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
-			u64 *total_trimmed, u64 start, u64 end, u64 minlen)
+			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
+			bool async)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	struct btrfs_free_space *entry;
@@ -3407,13 +3410,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		spin_lock(&ctl->tree_lock);
 
 		if (ctl->free_space < minlen) {
+			block_group->discard_cursor =
+				btrfs_block_group_end(block_group);
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
 		entry = tree_search_offset(ctl, offset, 1, 0);
-		if (!entry) {
+		if (!entry || (async && start == offset &&
+			       btrfs_free_space_trimmed(entry))) {
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
@@ -3446,6 +3452,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			goto next;
 		}
 
+		/*
+		 * We already trimmed a region, but are using the locking above
+		 * to reset the BTRFS_FSC_TRIMMING_BITMAP flag.
+		 */
+		if (async && *total_trimmed) {
+			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
+			return ret;
+		}
+
 		bytes = min(bytes, end - start);
 		if (bytes < minlen) {
 			entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
@@ -3468,6 +3484,8 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
+			block_group->discard_cursor =
+				btrfs_block_group_end(block_group);
 			break;
 		}
 next:
@@ -3477,6 +3495,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		} else {
 			start += bytes;
 		}
+		block_group->discard_cursor = start;
 
 		if (fatal_signal_pending(current)) {
 			if (start != offset)
@@ -3488,6 +3507,9 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		cond_resched();
 	}
 
+	if (offset >= end)
+		block_group->discard_cursor = end;
+
 	return ret;
 }
 
@@ -3532,7 +3554,8 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
 }
 
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+			   u64 *trimmed, u64 start, u64 end, u64 minlen,
+			   bool async)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
@@ -3547,11 +3570,11 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 	btrfs_get_block_group_trimming(block_group);
 	spin_unlock(&block_group->lock);
 
-	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
-	if (ret)
+	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen, async);
+	if (ret || async)
 		goto out;
 
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, false);
 	/* if we ended in the middle of a bitmap, reset the trimming flag */
 	if (end % (BITS_PER_BITMAP * ctl->unit))
 		reset_trimming_bitmap(ctl, offset_to_bitmap(ctl, end));
@@ -3560,6 +3583,29 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 	return ret;
 }
 
+int btrfs_trim_block_group_bitmaps(struct btrfs_block_group_cache *block_group,
+				   u64 *trimmed, u64 start, u64 end, u64 minlen,
+				   bool async)
+{
+	int ret;
+
+	*trimmed = 0;
+
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
+		spin_unlock(&block_group->lock);
+		return 0;
+	}
+	btrfs_get_block_group_trimming(block_group);
+	spin_unlock(&block_group->lock);
+
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, async);
+
+	btrfs_put_block_group_trimming(block_group);
+	return ret;
+
+}
+
 /*
  * Find the left-most item in the cache tree, and then return the
  * smallest inode number in the item.
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index b688e70a7512..450ea01ea0c7 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -125,7 +125,11 @@ int btrfs_return_cluster_to_free_space(
 			       struct btrfs_block_group_cache *block_group,
 			       struct btrfs_free_cluster *cluster);
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen);
+			   u64 *trimmed, u64 start, u64 end, u64 minlen,
+			   bool async);
+int btrfs_trim_block_group_bitmaps(struct btrfs_block_group_cache *block_group,
+				   u64 *trimmed, u64 start, u64 end, u64 minlen,
+				   bool async);
 
 /* Support functions for running our sanity tests */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-- 
2.17.1



* [PATCH 08/19] btrfs: track discardable extents for async discard
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (6 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 07/19] btrfs: discard one region at a time in async discard Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:36   ` Josef Bacik
  2019-10-15 13:12   ` David Sterba
  2019-10-07 20:17 ` [PATCH 09/19] btrfs: keep track of discardable_bytes Dennis Zhou
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.

This also sets up a discard directory in btrfs sysfs and exports the
total discard_extents count.
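
The delta bookkeeping can be sketched in userspace like this. It is a
simplified model, not the kernel code: struct ctl_model and struct
global_model are hypothetical, the int32_t element type is an
assumption, and a plain integer stands in for the atomic_t that the
patch updates with atomic_add().

```c
#include <assert.h>
#include <stdint.h>

/* Per block group counters: [0] is current, [1] is what was last propagated. */
struct ctl_model {
	int32_t discard_extents[2];
};

/* Filesystem-wide total; an atomic_t in the patch, a plain int here. */
struct global_model {
	int32_t discard_extents;
};

/* Push the change since the last call into the global count. */
static void update_discardable(struct global_model *g, struct ctl_model *ctl)
{
	int32_t delta = ctl->discard_extents[0] - ctl->discard_extents[1];

	if (delta) {
		g->discard_extents += delta;
		ctl->discard_extents[1] = ctl->discard_extents[0];
	}
	/* after propagation the two slots always agree */
	assert(ctl->discard_extents[0] == ctl->discard_extents[1]);
}
```

Keeping the last propagated value next to the current one means the
shared counter is only touched when something changed, and repeated
calls with no change are no-ops.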

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h            |  2 +
 fs/btrfs/discard.c          |  2 +
 fs/btrfs/discard.h          | 19 ++++++++
 fs/btrfs/free-space-cache.c | 93 ++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.h |  2 +
 fs/btrfs/sysfs.c            | 33 +++++++++++++
 6 files changed, 144 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c328d2e85e4d..43e515939b9c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -447,6 +447,7 @@ struct btrfs_discard_ctl {
 	spinlock_t lock;
 	struct btrfs_block_group_cache *cache;
 	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+	atomic_t discard_extents;
 };
 
 /* delayed seq elem */
@@ -831,6 +832,7 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct kobject *discard_kobj;
 	struct btrfs_discard_ctl discard_ctl;
 
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 26a1e44b4bfa..0544eb6717d4 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -298,6 +298,8 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
 
 	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
 		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+
+	atomic_set(&discard_ctl->discard_extents, 0);
 }
 
 void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 22cfa7e401bb..85939d62521e 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -71,4 +71,23 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
 		btrfs_discard_schedule_work(discard_ctl, false);
 }
 
+static inline
+void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
+				      struct btrfs_free_space_ctl *ctl)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	s32 extents_delta;
+
+	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
+		return;
+
+	discard_ctl = &cache->fs_info->discard_ctl;
+
+	extents_delta = ctl->discard_extents[0] - ctl->discard_extents[1];
+	if (extents_delta) {
+		atomic_add(extents_delta, &discard_ctl->discard_extents);
+		ctl->discard_extents[1] = ctl->discard_extents[0];
+	}
+}
+
 #endif
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 97b3074e83c0..6c2bebfd206f 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -32,6 +32,9 @@ struct btrfs_trim_range {
 	struct list_head list;
 };
 
+static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
+				struct btrfs_free_space *bitmap_info);
+
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -809,12 +812,15 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		ret = io_ctl_read_bitmap(&io_ctl, e);
 		if (ret)
 			goto free_cache;
+		e->bitmap_extents = count_bitmap_extents(ctl, e);
+		ctl->discard_extents[0] += e->bitmap_extents;
 	}
 
 	io_ctl_drop_pages(&io_ctl);
 	merge_space_tree(ctl);
 	ret = 1;
 out:
+	btrfs_discard_update_discardable(ctl->private, ctl);
 	io_ctl_free(&io_ctl);
 	return ret;
 free_cache:
@@ -1629,6 +1635,9 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
 {
 	rb_erase(&info->offset_index, &ctl->free_space_offset);
 	ctl->free_extents--;
+
+	if (!info->bitmap && !btrfs_free_space_trimmed(info))
+		ctl->discard_extents[0]--;
 }
 
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -1649,6 +1658,9 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
 	if (ret)
 		return ret;
 
+	if (!info->bitmap && !btrfs_free_space_trimmed(info))
+		ctl->discard_extents[0]++;
+
 	ctl->free_space += info->bytes;
 	ctl->free_extents++;
 	return ret;
@@ -1705,17 +1717,29 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
 				       struct btrfs_free_space *info,
 				       u64 offset, u64 bytes)
 {
-	unsigned long start, count;
+	unsigned long start, count, end;
+	int extent_delta = -1;
 
 	start = offset_to_bit(info->offset, ctl->unit, offset);
 	count = bytes_to_bits(bytes, ctl->unit);
-	ASSERT(start + count <= BITS_PER_BITMAP);
+	end = start + count;
+	ASSERT(end <= BITS_PER_BITMAP);
 
 	bitmap_clear(info->bitmap, start, count);
 
 	info->bytes -= bytes;
 	if (info->max_extent_size > ctl->unit)
 		info->max_extent_size = 0;
+
+	if (start && test_bit(start - 1, info->bitmap))
+		extent_delta++;
+
+	if (end < BITS_PER_BITMAP && test_bit(end, info->bitmap))
+		extent_delta++;
+
+	info->bitmap_extents += extent_delta;
+	if (!btrfs_free_space_trimmed(info))
+		ctl->discard_extents[0] += extent_delta;
 }
 
 static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
@@ -1730,16 +1754,28 @@ static void bitmap_set_bits(struct btrfs_free_space_ctl *ctl,
 			    struct btrfs_free_space *info, u64 offset,
 			    u64 bytes)
 {
-	unsigned long start, count;
+	unsigned long start, count, end;
+	int extent_delta = 1;
 
 	start = offset_to_bit(info->offset, ctl->unit, offset);
 	count = bytes_to_bits(bytes, ctl->unit);
-	ASSERT(start + count <= BITS_PER_BITMAP);
+	end = start + count;
+	ASSERT(end <= BITS_PER_BITMAP);
 
 	bitmap_set(info->bitmap, start, count);
 
 	info->bytes += bytes;
 	ctl->free_space += bytes;
+
+	if (start && test_bit(start - 1, info->bitmap))
+		extent_delta--;
+
+	if (end < BITS_PER_BITMAP && test_bit(end, info->bitmap))
+		extent_delta--;
+
+	info->bitmap_extents += extent_delta;
+	if (!btrfs_free_space_trimmed(info))
+		ctl->discard_extents[0] += extent_delta;
 }
 
 /*
@@ -1875,11 +1911,35 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 	return NULL;
 }
 
+static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
+				struct btrfs_free_space *bitmap_info)
+{
+	struct btrfs_block_group_cache *cache = ctl->private;
+	u64 bytes = bitmap_info->bytes;
+	unsigned int rs, re;
+	int count = 0;
+
+	if (!cache || !bytes)
+		return count;
+
+	bitmap_for_each_set_region(bitmap_info->bitmap, rs, re, 0,
+				   BITS_PER_BITMAP) {
+		bytes -= (re - rs) * ctl->unit;
+		count++;
+
+		if (!bytes)
+			break;
+	}
+
+	return count;
+}
+
 static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info, u64 offset)
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	info->bitmap_extents = 0;
 	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
@@ -1981,8 +2041,11 @@ static u64 add_bytes_to_bitmap(struct btrfs_free_space_ctl *ctl,
 	u64 bytes_to_set = 0;
 	u64 end;
 
-	if (!(flags & BTRFS_FSC_TRIMMED))
+	if (!(flags & BTRFS_FSC_TRIMMED)) {
+		if (btrfs_free_space_trimmed(info))
+			ctl->discard_extents[0] += info->bitmap_extents;
 		info->flags &= ~(BTRFS_FSC_TRIMMED | BTRFS_FSC_TRIMMING_BITMAP);
+	}
 
 	end = info->offset + (u64)(BITS_PER_BITMAP * ctl->unit);
 
@@ -2397,6 +2460,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	if (ret)
 		kmem_cache_free(btrfs_free_space_cachep, info);
 out:
+	btrfs_discard_update_discardable(cache, ctl);
 	spin_unlock(&ctl->tree_lock);
 
 	if (ret) {
@@ -2506,6 +2570,7 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 		goto again;
 	}
 out_lock:
+	btrfs_discard_update_discardable(block_group, ctl);
 	spin_unlock(&ctl->tree_lock);
 out:
 	return ret;
@@ -2591,8 +2656,16 @@ __btrfs_return_cluster_to_free_space(
 
 		bitmap = (entry->bitmap != NULL);
 		if (!bitmap) {
+			/* merging treats extents as if they were new */
+			if (!btrfs_free_space_trimmed(entry))
+				ctl->discard_extents[0]--;
+
 			try_merge_free_space(ctl, entry, false);
 			steal_from_bitmap(ctl, entry, false);
+
+			/* as we insert directly, update these statistics */
+			if (!btrfs_free_space_trimmed(entry))
+				ctl->discard_extents[0]++;
 		}
 		tree_insert_offset(&ctl->free_space_offset,
 				   entry->offset, &entry->offset_index, bitmap);
@@ -2649,6 +2722,7 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
 		cond_resched_lock(&ctl->tree_lock);
 	}
 	__btrfs_remove_free_space_cache_locked(ctl);
+	btrfs_discard_update_discardable(block_group, ctl);
 	spin_unlock(&ctl->tree_lock);
 
 }
@@ -2717,6 +2791,7 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 			link_free_space(ctl, entry);
 	}
 out:
+	btrfs_discard_update_discardable(block_group, ctl);
 	spin_unlock(&ctl->tree_lock);
 
 	if (align_gap_len)
@@ -2882,6 +2957,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 					entry->bitmap);
 			ctl->total_bitmaps--;
 			ctl->op->recalc_thresholds(ctl);
+		} else if (!btrfs_free_space_trimmed(entry)) {
+			ctl->discard_extents[0]--;
 		}
 		kmem_cache_free(btrfs_free_space_cachep, entry);
 	}
@@ -3383,11 +3460,13 @@ static void reset_trimming_bitmap(struct btrfs_free_space_ctl *ctl, u64 offset)
 	spin_unlock(&ctl->tree_lock);
 }
 
-static void end_trimming_bitmap(struct btrfs_free_space *entry)
+static void end_trimming_bitmap(struct btrfs_free_space_ctl *ctl,
+				struct btrfs_free_space *entry)
 {
 	if (btrfs_free_space_trimming_bitmap(entry)) {
 		entry->flags |= BTRFS_FSC_TRIMMED;
 		entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
+		ctl->discard_extents[0] -= entry->bitmap_extents;
 	}
 }
 
@@ -3443,7 +3522,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			 * if BTRFS_FSC_TRIMMED is set on a bitmap.
 			 */
 			if (ret2 && !minlen)
-				end_trimming_bitmap(entry);
+				end_trimming_bitmap(ctl, entry);
 			else
 				entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
 			spin_unlock(&ctl->tree_lock);
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 450ea01ea0c7..855f42dc15cd 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -16,6 +16,7 @@ struct btrfs_free_space {
 	u64 max_extent_size;
 	unsigned long *bitmap;
 	struct list_head list;
+	s32 bitmap_extents;
 	u32 flags;
 };
 
@@ -39,6 +40,7 @@ struct btrfs_free_space_ctl {
 	int total_bitmaps;
 	int unit;
 	u64 start;
+	s32 discard_extents[2];
 	const struct btrfs_free_space_op *op;
 	void *private;
 	struct mutex cache_writeout_mutex;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index f6d3c80f2e28..14c6910128f1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/bug.h>
 
 #include "ctree.h"
+#include "discard.h"
 #include "disk-io.h"
 #include "transaction.h"
 #include "sysfs.h"
@@ -470,6 +471,22 @@ static const struct attribute *allocation_attrs[] = {
 	NULL,
 };
 
+static ssize_t btrfs_discard_extents_show(struct kobject *kobj,
+					struct kobj_attribute *a,
+					char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n",
+			atomic_read(&fs_info->discard_ctl.discard_extents));
+}
+BTRFS_ATTR(discard, discard_extents, btrfs_discard_extents_show);
+
+static const struct attribute *discard_attrs[] = {
+	BTRFS_ATTR_PTR(discard, discard_extents),
+	NULL,
+};
+
 static ssize_t btrfs_label_show(struct kobject *kobj,
 				struct kobj_attribute *a, char *buf)
 {
@@ -727,6 +744,12 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info)
 {
 	btrfs_reset_fs_info_ptr(fs_info);
 
+	if (fs_info->discard_kobj) {
+		sysfs_remove_files(fs_info->discard_kobj, discard_attrs);
+		kobject_del(fs_info->discard_kobj);
+		kobject_put(fs_info->discard_kobj);
+	}
+
 	if (fs_info->space_info_kobj) {
 		sysfs_remove_files(fs_info->space_info_kobj, allocation_attrs);
 		kobject_del(fs_info->space_info_kobj);
@@ -1093,6 +1116,16 @@ int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info)
 	if (error)
 		goto failure;
 
+	fs_info->discard_kobj = kobject_create_and_add("discard", fsid_kobj);
+	if (!fs_info->discard_kobj) {
+		error = -ENOMEM;
+		goto failure;
+	}
+
+	error = sysfs_create_files(fs_info->discard_kobj, discard_attrs);
+	if (error)
+		goto failure;
+
 	return 0;
 failure:
 	btrfs_sysfs_remove_mounted(fs_info);
-- 
2.17.1
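
Not part of the patch: a small userspace sketch of the extent_delta
bookkeeping used in bitmap_set_bits() and __bitmap_clear_bits(),
assuming a plain bool array instead of a kernel bitmap. The sketch_*
names and the 64-bit size are hypothetical. A newly set run adds one
extent and loses one for each already-set neighbour it merges with;
clearing is the mirror image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BITS 64	/* hypothetical bitmap size, for illustration */

/* Count maximal runs of set bits, i.e. the number of free extents. */
static int sketch_count_extents(const bool *bits)
{
	int count = 0;

	for (size_t i = 0; i < SKETCH_BITS; i++)
		if (bits[i] && (i == 0 || !bits[i - 1]))
			count++;
	return count;
}

/*
 * Set [start, end), which must currently be clear, and return the
 * change in extent count: +1 for the new run, -1 per set neighbour.
 */
static int sketch_set_bits(bool *bits, size_t start, size_t end)
{
	int delta = 1;

	assert(start < end && end <= SKETCH_BITS);
	for (size_t i = start; i < end; i++)
		bits[i] = true;

	if (start && bits[start - 1])
		delta--;
	if (end < SKETCH_BITS && bits[end])
		delta--;
	return delta;
}

/*
 * Clear [start, end), which must currently be fully set, and return
 * the change in extent count: -1 for the run, +1 per set neighbour
 * that survives as its own extent.
 */
static int sketch_clear_bits(bool *bits, size_t start, size_t end)
{
	int delta = -1;

	assert(start < end && end <= SKETCH_BITS);
	for (size_t i = start; i < end; i++)
		bits[i] = false;

	if (start && bits[start - 1])
		delta++;
	if (end < SKETCH_BITS && bits[end])
		delta++;
	return delta;
}
```

Summing the returned deltas should always agree with a full recount
via sketch_count_extents(), which is what lets the patch avoid
rescanning the bitmap on every update.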



* [PATCH 09/19] btrfs: keep track of discardable_bytes
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (7 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 08/19] btrfs: track discardable extents for asnyc discard Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:38   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 10/19] btrfs: calculate discard delay based on number of extents Dennis Zhou
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Keep track of discardable_bytes so that we can tell how far ahead or
behind we are relative to the discard rate.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h            |  1 +
 fs/btrfs/discard.c          |  1 +
 fs/btrfs/discard.h          |  7 +++++++
 fs/btrfs/free-space-cache.c | 32 +++++++++++++++++++++++++-------
 fs/btrfs/free-space-cache.h |  1 +
 fs/btrfs/sysfs.c            | 12 ++++++++++++
 6 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 43e515939b9c..8479ab037812 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -448,6 +448,7 @@ struct btrfs_discard_ctl {
 	struct btrfs_block_group_cache *cache;
 	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
 	atomic_t discard_extents;
+	atomic64_t discardable_bytes;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 0544eb6717d4..75a2ff14b3c0 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -300,6 +300,7 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
 		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
 
 	atomic_set(&discard_ctl->discard_extents, 0);
+	atomic64_set(&discard_ctl->discardable_bytes, 0);
 }
 
 void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 85939d62521e..d55a9a9f8ad8 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -77,6 +77,7 @@ void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
 {
 	struct btrfs_discard_ctl *discard_ctl;
 	s32 extents_delta;
+	s64 bytes_delta;
 
 	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
 		return;
@@ -88,6 +89,12 @@ void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
 		atomic_add(extents_delta, &discard_ctl->discard_extents);
 		ctl->discard_extents[1] = ctl->discard_extents[0];
 	}
+
+	bytes_delta = ctl->discardable_bytes[0] - ctl->discardable_bytes[1];
+	if (bytes_delta) {
+		atomic64_add(bytes_delta, &discard_ctl->discardable_bytes);
+		ctl->discardable_bytes[1] = ctl->discardable_bytes[0];
+	}
 }
 
 #endif
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6c2bebfd206f..54f3c8325858 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -814,6 +814,7 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 			goto free_cache;
 		e->bitmap_extents = count_bitmap_extents(ctl, e);
 		ctl->discard_extents[0] += e->bitmap_extents;
+		ctl->discardable_bytes[0] += e->bytes;
 	}
 
 	io_ctl_drop_pages(&io_ctl);
@@ -1636,8 +1637,10 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
 	rb_erase(&info->offset_index, &ctl->free_space_offset);
 	ctl->free_extents--;
 
-	if (!info->bitmap && !btrfs_free_space_trimmed(info))
+	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
 		ctl->discard_extents[0]--;
+		ctl->discardable_bytes[0] -= info->bytes;
+	}
 }
 
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -1658,8 +1661,10 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
 	if (ret)
 		return ret;
 
-	if (!info->bitmap && !btrfs_free_space_trimmed(info))
+	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
 		ctl->discard_extents[0]++;
+		ctl->discardable_bytes[0] += info->bytes;
+	}
 
 	ctl->free_space += info->bytes;
 	ctl->free_extents++;
@@ -1738,8 +1743,10 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
 		extent_delta++;
 
 	info->bitmap_extents += extent_delta;
-	if (!btrfs_free_space_trimmed(info))
+	if (!btrfs_free_space_trimmed(info)) {
 		ctl->discard_extents[0] += extent_delta;
+		ctl->discardable_bytes[0] -= bytes;
+	}
 }
 
 static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
@@ -1774,8 +1781,10 @@ static void bitmap_set_bits(struct btrfs_free_space_ctl *ctl,
 		extent_delta--;
 
 	info->bitmap_extents += extent_delta;
-	if (!btrfs_free_space_trimmed(info))
+	if (!btrfs_free_space_trimmed(info)) {
 		ctl->discard_extents[0] += extent_delta;
+		ctl->discardable_bytes[0] += bytes;
+	}
 }
 
 /*
@@ -2042,8 +2051,10 @@ static u64 add_bytes_to_bitmap(struct btrfs_free_space_ctl *ctl,
 	u64 end;
 
 	if (!(flags & BTRFS_FSC_TRIMMED)) {
-		if (btrfs_free_space_trimmed(info))
+		if (btrfs_free_space_trimmed(info)) {
 			ctl->discard_extents[0] += info->bitmap_extents;
+			ctl->discardable_bytes[0] += info->bytes;
+		}
 		info->flags &= ~(BTRFS_FSC_TRIMMED | BTRFS_FSC_TRIMMING_BITMAP);
 	}
 
@@ -2657,15 +2668,19 @@ __btrfs_return_cluster_to_free_space(
 		bitmap = (entry->bitmap != NULL);
 		if (!bitmap) {
 			/* merging treats extents as if they were new */
-			if (!btrfs_free_space_trimmed(entry))
+			if (!btrfs_free_space_trimmed(entry)) {
 				ctl->discard_extents[0]--;
+				ctl->discardable_bytes[0] -= entry->bytes;
+			}
 
 			try_merge_free_space(ctl, entry, false);
 			steal_from_bitmap(ctl, entry, false);
 
 			/* as we insert directly, update these statistics */
-			if (!btrfs_free_space_trimmed(entry))
+			if (!btrfs_free_space_trimmed(entry)) {
 				ctl->discard_extents[0]++;
+				ctl->discardable_bytes[0] += entry->bytes;
+			}
 		}
 		tree_insert_offset(&ctl->free_space_offset,
 				   entry->offset, &entry->offset_index, bitmap);
@@ -2950,6 +2965,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 	spin_lock(&ctl->tree_lock);
 
 	ctl->free_space -= bytes;
+	if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
+		ctl->discardable_bytes[0] -= bytes;
 	if (entry->bytes == 0) {
 		ctl->free_extents--;
 		if (entry->bitmap) {
@@ -3467,6 +3484,7 @@ static void end_trimming_bitmap(struct btrfs_free_space_ctl *ctl,
 		entry->flags |= BTRFS_FSC_TRIMMED;
 		entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
 		ctl->discard_extents[0] -= entry->bitmap_extents;
+		ctl->discardable_bytes[0] -= entry->bytes;
 	}
 }
 
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 855f42dc15cd..c5cce44b03af 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -41,6 +41,7 @@ struct btrfs_free_space_ctl {
 	int unit;
 	u64 start;
 	s32 discard_extents[2];
+	s64 discardable_bytes[2];
 	const struct btrfs_free_space_op *op;
 	void *private;
 	struct mutex cache_writeout_mutex;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 14c6910128f1..a2852706ec6c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -482,8 +482,20 @@ static ssize_t btrfs_discard_extents_show(struct kobject *kobj,
 }
 BTRFS_ATTR(discard, discard_extents, btrfs_discard_extents_show);
 
+static ssize_t btrfs_discardable_bytes_show(struct kobject *kobj,
+					struct kobj_attribute *a,
+					char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%lld\n",
+			atomic64_read(&fs_info->discard_ctl.discardable_bytes));
+}
+BTRFS_ATTR(discard, discardable_bytes, btrfs_discardable_bytes_show);
+
 static const struct attribute *discard_attrs[] = {
 	BTRFS_ATTR_PTR(discard, discard_extents),
+	BTRFS_ATTR_PTR(discard, discardable_bytes),
 	NULL,
 };
 
-- 
2.17.1



* [PATCH 10/19] btrfs: calculate discard delay based on number of extents
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (8 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 09/19] btrfs: keep track of discardable_bytes Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:41   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 11/19] btrfs: add bps discard rate limit Dennis Zhou
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Use the number of discardable extents to help guide our discard delay
interval. This value is reevaluated on every transaction commit.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/discard.c     | 31 +++++++++++++++++++++++++++++--
 fs/btrfs/discard.h     |  3 +++
 fs/btrfs/extent-tree.c |  4 +++-
 fs/btrfs/sysfs.c       | 30 ++++++++++++++++++++++++++++++
 5 files changed, 67 insertions(+), 3 deletions(-)
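
Not part of the patch: a userspace sketch of the delay calculation,
assuming the same constants as the patch (6 hour target, 10 second
maximum delay). It stays in milliseconds, whereas the patch converts
the result to jiffies. The sketch_* names are hypothetical, and the
-1 return for "no extents" is an assumption of this sketch (the patch
simply leaves the previous delay in place).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MSEC_PER_SEC	1000
#define SKETCH_TARGET_MSEC	(6ULL * 60 * 60 * SKETCH_MSEC_PER_SEC)
#define SKETCH_MAX_DELAY_MSEC	10000

static int64_t sketch_clamp(int64_t v, int64_t lo, int64_t hi)
{
	assert(lo <= hi);
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Spread the discardable extents over the target window, but never
 * go faster than iops_limit allows or slower than the maximum delay.
 * An iops_limit of 0 means "no lower bound on the delay".
 */
static int64_t sketch_calc_delay_msec(int32_t discard_extents,
				      int32_t iops_limit)
{
	int64_t min_delay = 0;
	int64_t delay;

	if (discard_extents <= 0)
		return -1;	/* sketch-only sentinel: nothing to do */

	if (iops_limit > 0)
		min_delay = SKETCH_MSEC_PER_SEC / iops_limit;

	delay = (int64_t)(SKETCH_TARGET_MSEC / (uint64_t)discard_extents);
	return sketch_clamp(delay, min_delay, SKETCH_MAX_DELAY_MSEC);
}
```

With the default limit of 10 iops, a filesystem with few extents hits
the 10 second ceiling, while one with millions of extents is held at
the 100 ms floor.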

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8479ab037812..b0823961d049 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -449,6 +449,8 @@ struct btrfs_discard_ctl {
 	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
 	atomic_t discard_extents;
 	atomic64_t discardable_bytes;
+	atomic_t delay;
+	atomic_t iops_limit;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 75a2ff14b3c0..c7afb5f8240d 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -15,6 +15,11 @@
 
 #define BTRFS_DISCARD_DELAY		(300ULL * NSEC_PER_SEC)
 
+/* target discard delay in milliseconds */
+#define BTRFS_DISCARD_TARGET_MSEC	(6 * 60 * 60ULL * MSEC_PER_SEC)
+#define BTRFS_DISCARD_MAX_DELAY		(10000UL)
+#define BTRFS_DISCARD_MAX_IOPS		(10UL)
+
 static struct list_head *
 btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
 		       struct btrfs_block_group_cache *cache)
@@ -170,10 +175,12 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
 
 	cache = find_next_cache(discard_ctl, now);
 	if (cache) {
-		u64 delay = 0;
+		u64 delay = atomic_read(&discard_ctl->delay);
 
 		if (now < cache->discard_delay)
-			delay = nsecs_to_jiffies(cache->discard_delay - now);
+			delay = max_t(u64, delay,
+				      nsecs_to_jiffies(cache->discard_delay -
+						       now));
 
 		mod_delayed_work(discard_ctl->discard_workers,
 				 &discard_ctl->work,
@@ -232,6 +239,24 @@ static void btrfs_discard_workfn(struct work_struct *work)
 	btrfs_discard_schedule_work(discard_ctl, false);
 }
 
+void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl)
+{
+	s32 discard_extents = atomic_read(&discard_ctl->discard_extents);
+	s32 iops_limit;
+	unsigned long delay;
+
+	if (!discard_extents)
+		return;
+
+	iops_limit = atomic_read(&discard_ctl->iops_limit);
+	if (iops_limit)
+		iops_limit = MSEC_PER_SEC / iops_limit;
+
+	delay = BTRFS_DISCARD_TARGET_MSEC / discard_extents;
+	delay = clamp_t(s32, delay, iops_limit, BTRFS_DISCARD_MAX_DELAY);
+	atomic_set(&discard_ctl->delay, msecs_to_jiffies(delay));
+}
+
 void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_block_group_cache *cache, *next;
@@ -301,6 +326,8 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
 
 	atomic_set(&discard_ctl->discard_extents, 0);
 	atomic64_set(&discard_ctl->discardable_bytes, 0);
+	atomic_set(&discard_ctl->delay, BTRFS_DISCARD_MAX_DELAY);
+	atomic_set(&discard_ctl->iops_limit, BTRFS_DISCARD_MAX_IOPS);
 }
 
 void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index d55a9a9f8ad8..acaf56f63b1c 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -7,6 +7,8 @@
 #define BTRFS_DISCARD_H
 
 #include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/time.h>
 #include <linux/workqueue.h>
 
 #include "ctree.h"
@@ -39,6 +41,7 @@ void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache);
 void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
 				 bool override);
+void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl);
 void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
 void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
 void btrfs_discard_init(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ff42e4abb01d..ab0d46da3771 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2920,8 +2920,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
-	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_calc_delay(&fs_info->discard_ctl);
 		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+	}
 
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a2852706ec6c..b9a62e470316 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -493,9 +493,39 @@ static ssize_t btrfs_discardable_bytes_show(struct kobject *kobj,
 }
 BTRFS_ATTR(discard, discardable_bytes, btrfs_discardable_bytes_show);
 
+static ssize_t btrfs_discard_iops_limit_show(struct kobject *kobj,
+					     struct kobj_attribute *a,
+					     char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n",
+			atomic_read(&fs_info->discard_ctl.iops_limit));
+}
+
+static ssize_t btrfs_discard_iops_limit_store(struct kobject *kobj,
+					      struct kobj_attribute *a,
+					      const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+	s32 iops_limit;
+	int ret;
+
+	ret = kstrtos32(buf, 10, &iops_limit);
+	if (ret || iops_limit < 0)
+		return -EINVAL;
+
+	atomic_set(&fs_info->discard_ctl.iops_limit, iops_limit);
+
+	return len;
+}
+BTRFS_ATTR_RW(discard, iops_limit, btrfs_discard_iops_limit_show,
+	      btrfs_discard_iops_limit_store);
+
 static const struct attribute *discard_attrs[] = {
 	BTRFS_ATTR_PTR(discard, discard_extents),
 	BTRFS_ATTR_PTR(discard, discardable_bytes),
+	BTRFS_ATTR_PTR(discard, iops_limit),
 	NULL,
 };
 
-- 
2.17.1



* [PATCH 11/19] btrfs: add bps discard rate limit
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (9 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 10/19] btrfs: calculate discard delay based on number of extents Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 15:47   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 12/19] btrfs: limit max discard size for async discard Dennis Zhou
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Provide the ability to rate limit by bytes per second (bps) in addition
to the iops delay calculated from the number of discardable extents.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/discard.c | 11 +++++++++++
 fs/btrfs/sysfs.c   | 30 ++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)
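
Not part of the patch: a userspace sketch of how the bps limit feeds
into the next delay, assuming millisecond units throughout (the patch
works in jiffies). The sketch_* names are hypothetical. The size of
the previous discard divided by the limit gives the time that discard
"costs"; the larger of that and the iops-derived delay wins.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MSEC_PER_SEC	1000ULL

/*
 * Time in ms that a discard of prev_discard bytes accounts for at
 * bps_limit bytes per second. A limit of 0 disables the bps throttle.
 * Overflow for very large prev_discard is ignored in this sketch.
 */
static uint64_t sketch_bps_delay_msec(uint64_t prev_discard,
				      int64_t bps_limit)
{
	if (bps_limit <= 0)
		return 0;
	return SKETCH_MSEC_PER_SEC * prev_discard / (uint64_t)bps_limit;
}

/* The next delay is the stricter of the iops and bps constraints. */
static uint64_t sketch_next_delay_msec(uint64_t iops_delay_msec,
				       uint64_t prev_discard,
				       int64_t bps_limit)
{
	uint64_t bps_delay = sketch_bps_delay_msec(prev_discard, bps_limit);

	return bps_delay > iops_delay_msec ? bps_delay : iops_delay_msec;
}
```

For example, a 64 MiB discard under a 16 MiB/s limit accounts for
4000 ms, which overrides a 100 ms iops delay.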

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b0823961d049..e81f699347e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -447,10 +447,12 @@ struct btrfs_discard_ctl {
 	spinlock_t lock;
 	struct btrfs_block_group_cache *cache;
 	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+	u64 prev_discard;
 	atomic_t discard_extents;
 	atomic64_t discardable_bytes;
 	atomic_t delay;
 	atomic_t iops_limit;
+	atomic64_t bps_limit;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index c7afb5f8240d..072c73f48297 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -176,6 +176,13 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
 	cache = find_next_cache(discard_ctl, now);
 	if (cache) {
 		u64 delay = atomic_read(&discard_ctl->delay);
+		s64 bps_limit = atomic64_read(&discard_ctl->bps_limit);
+
+		if (bps_limit)
+			delay = max_t(u64, delay,
+				      msecs_to_jiffies(MSEC_PER_SEC *
+						discard_ctl->prev_discard /
+						bps_limit));
 
 		if (now < cache->discard_delay)
 			delay = max_t(u64, delay,
@@ -213,6 +220,8 @@ static void btrfs_discard_workfn(struct work_struct *work)
 		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
 				       btrfs_block_group_end(cache), 0, true);
 
+	discard_ctl->prev_discard = trimmed;
+
 	if (cache->discard_cursor >= btrfs_block_group_end(cache)) {
 		if (btrfs_discard_bitmaps(cache)) {
 			remove_from_discard_list(discard_ctl, cache);
@@ -324,10 +333,12 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
 	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
 		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
 
+	discard_ctl->prev_discard = 0;
 	atomic_set(&discard_ctl->discard_extents, 0);
 	atomic64_set(&discard_ctl->discardable_bytes, 0);
 	atomic_set(&discard_ctl->delay, BTRFS_DISCARD_MAX_DELAY);
 	atomic_set(&discard_ctl->iops_limit, BTRFS_DISCARD_MAX_IOPS);
+	atomic64_set(&discard_ctl->bps_limit, 0);
 }
 
 void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b9a62e470316..6fc4d644401b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -522,10 +522,40 @@ static ssize_t btrfs_discard_iops_limit_store(struct kobject *kobj,
 BTRFS_ATTR_RW(discard, iops_limit, btrfs_discard_iops_limit_show,
 	      btrfs_discard_iops_limit_store);
 
+static ssize_t btrfs_discard_bps_limit_show(struct kobject *kobj,
+					     struct kobj_attribute *a,
+					     char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%lld\n",
+			atomic64_read(&fs_info->discard_ctl.bps_limit));
+}
+
+static ssize_t btrfs_discard_bps_limit_store(struct kobject *kobj,
+					      struct kobj_attribute *a,
+					      const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+	s64 bps_limit;
+	int ret;
+
+	ret = kstrtos64(buf, 10, &bps_limit);
+	if (ret || bps_limit < 0)
+		return -EINVAL;
+
+	atomic64_set(&fs_info->discard_ctl.bps_limit, bps_limit);
+
+	return len;
+}
+BTRFS_ATTR_RW(discard, bps_limit, btrfs_discard_bps_limit_show,
+	      btrfs_discard_bps_limit_store);
+
 static const struct attribute *discard_attrs[] = {
 	BTRFS_ATTR_PTR(discard, discard_extents),
 	BTRFS_ATTR_PTR(discard, discardable_bytes),
 	BTRFS_ATTR_PTR(discard, iops_limit),
+	BTRFS_ATTR_PTR(discard, bps_limit),
 	NULL,
 };
 
-- 
2.17.1



* [PATCH 12/19] btrfs: limit max discard size for async discard
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (10 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 11/19] btrfs: add bps discard rate limit Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 16:16   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 13/19] btrfs: have multiple discard lists Dennis Zhou
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into appropriately sized requests, we want to account
more accurately for the rate at which we consume ncq slots, as well as
bound the amount of work done by a single discard.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/discard.h          |  4 ++++
 fs/btrfs/free-space-cache.c | 47 +++++++++++++++++++++++++++----------
 2 files changed, 39 insertions(+), 12 deletions(-)
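
Not part of the patch: a userspace sketch of the size cap applied in
the async path of trim_no_bitmap(), assuming the same 64 MiB limit.
The sketch_* names are hypothetical. Each call carves at most one
maximum-sized chunk off the front of the extent and leaves the
remainder in place for a later pass, which is roughly what the patch
does by re-linking the shrunken entry.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DISCARD_MAX_SIZE	(64ULL << 20)	/* 64 MiB, as in the patch */

/* Hypothetical stand-in for a free space extent entry. */
struct sketch_extent {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * Return the next chunk to discard, at most SKETCH_DISCARD_MAX_SIZE
 * bytes, and shrink *extent by that amount. When extent->bytes drops
 * to zero the caller would free the entry instead of re-linking it.
 */
static struct sketch_extent sketch_next_chunk(struct sketch_extent *extent)
{
	struct sketch_extent chunk;

	assert(extent && extent->bytes > 0);
	chunk.offset = extent->offset;
	chunk.bytes = extent->bytes;
	if (chunk.bytes > SKETCH_DISCARD_MAX_SIZE)
		chunk.bytes = SKETCH_DISCARD_MAX_SIZE;

	extent->offset += chunk.bytes;
	extent->bytes -= chunk.bytes;
	return chunk;
}
```

A 150 MiB extent is therefore discarded as 64 MiB, 64 MiB and 22 MiB
chunks, each of which can be rate limited individually.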

diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index acaf56f63b1c..898dd92dbf8f 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -8,6 +8,7 @@
 
 #include <linux/kernel.h>
 #include <linux/jiffies.h>
+#include <linux/sizes.h>
 #include <linux/time.h>
 #include <linux/workqueue.h>
 
@@ -15,6 +16,9 @@
 #include "block-group.h"
 #include "free-space-cache.h"
 
+/* discard size limits */
+#define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
+
 /* discard flags */
 #define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
 #define BTRFS_DISCARD_BITMAPS           (1UL << 1)
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 54f3c8325858..ce33803a45b2 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3399,19 +3399,39 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		if (entry->offset >= end)
 			goto out_unlock;
 
-		extent_start = entry->offset;
-		extent_bytes = entry->bytes;
-		extent_flags = entry->flags;
-		start = max(start, extent_start);
-		bytes = min(extent_start + extent_bytes, end) - start;
-		if (bytes < minlen) {
-			spin_unlock(&ctl->tree_lock);
-			mutex_unlock(&ctl->cache_writeout_mutex);
-			goto next;
-		}
+		if (async) {
+			start = extent_start = entry->offset;
+			bytes = extent_bytes = entry->bytes;
+			extent_flags = entry->flags;
+			if (bytes < minlen) {
+				spin_unlock(&ctl->tree_lock);
+				mutex_unlock(&ctl->cache_writeout_mutex);
+				goto next;
+			}
+			unlink_free_space(ctl, entry);
+			if (bytes > BTRFS_DISCARD_MAX_SIZE) {
+				bytes = extent_bytes = BTRFS_DISCARD_MAX_SIZE;
+				entry->offset += BTRFS_DISCARD_MAX_SIZE;
+				entry->bytes -= BTRFS_DISCARD_MAX_SIZE;
+				link_free_space(ctl, entry);
+			} else {
+				kmem_cache_free(btrfs_free_space_cachep, entry);
+			}
+		} else {
+			extent_start = entry->offset;
+			extent_bytes = entry->bytes;
+			extent_flags = entry->flags;
+			start = max(start, extent_start);
+			bytes = min(extent_start + extent_bytes, end) - start;
+			if (bytes < minlen) {
+				spin_unlock(&ctl->tree_lock);
+				mutex_unlock(&ctl->cache_writeout_mutex);
+				goto next;
+			}
 
-		unlink_free_space(ctl, entry);
-		kmem_cache_free(btrfs_free_space_cachep, entry);
+			unlink_free_space(ctl, entry);
+			kmem_cache_free(btrfs_free_space_cachep, entry);
+		}
 
 		spin_unlock(&ctl->tree_lock);
 		trim_entry.start = extent_start;
@@ -3567,6 +3587,9 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			goto next;
 		}
 
+		if (async && bytes > BTRFS_DISCARD_MAX_SIZE)
+			bytes = BTRFS_DISCARD_MAX_SIZE;
+
 		bitmap_clear_bits(ctl, entry, start, bytes);
 		if (entry->bytes == 0)
 			free_bitmap(ctl, entry);
-- 
2.17.1



* [PATCH 13/19] btrfs: have multiple discard lists
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (11 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 12/19] btrfs: limit max discard size for async discard Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 16:51   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 14/19] btrfs: only keep track of data extents for async discard Dennis Zhou
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Discarding outside of block group destruction currently uses a single
list with no minimum discard length. This can lead to more meaningful
discards caravanning behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote a block group back up when free
space of at least BTRFS_DISCARD_MAX_FILTER is added to it. Currently we
support only 2 such lists, with filters of 1MB and 32KB respectively.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h            |  2 +-
 fs/btrfs/discard.c          | 60 +++++++++++++++++++++++++++++++++----
 fs/btrfs/discard.h          |  4 +++
 fs/btrfs/free-space-cache.c | 37 +++++++++++++++--------
 fs/btrfs/free-space-cache.h |  2 +-
 5 files changed, 85 insertions(+), 20 deletions(-)
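
The filter scheme described in the commit message can be sketched in
userspace C as below. This is a minimal sketch, not the kernel code: the
helper names region_passes_filter(), index_after_free() and
discard_chunk_len() are hypothetical, the constants drop the BTRFS_ prefix,
and the maximum-length check models only the bitmap pass as it appears in
the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_32K (32ULL * 1024)
#define SZ_1M  (1024ULL * 1024)
#define SZ_64M (64ULL * 1024 * 1024)

#define NR_DISCARD_LISTS   3
#define DISCARD_MAX_SIZE   SZ_64M
#define DISCARD_MAX_FILTER SZ_1M
#define DISCARD_MIN_FILTER SZ_32K

/* Index 0 holds fully free block groups and applies no filter. */
static const uint64_t discard_minlen[NR_DISCARD_LISTS] = {
	0, DISCARD_MAX_FILTER, DISCARD_MIN_FILTER
};

/* Hypothetical helper: is a bitmap region of @bytes trimmed on list @index? */
static bool region_passes_filter(int index, uint64_t bytes)
{
	uint64_t minlen = discard_minlen[index];
	/* The previous list's minimum doubles as this list's maximum. */
	uint64_t maxlen = index ? discard_minlen[index - 1] : 0;

	if (bytes < minlen)
		return false;
	if (maxlen && bytes > maxlen)
		return false;
	return true;
}

/* Hypothetical helper: list index after @bytes of free space is added. */
static int index_after_free(int index, uint64_t bytes)
{
	/* Only block groups demoted past list 1 are promoted back to it. */
	if (index > 1 && bytes >= DISCARD_MAX_FILTER)
		return 1;
	return index;
}

/* Hypothetical helper: size of the next discard cut from a free region. */
static uint64_t discard_chunk_len(uint64_t bytes)
{
	/* A tail no larger than the minimum filter is trimmed with the chunk. */
	if (bytes > DISCARD_MAX_SIZE + DISCARD_MIN_FILTER)
		return DISCARD_MAX_SIZE;
	return bytes;
}
```

With these definitions, a 512KB region is skipped on list 1 and picked up on
list 2, and a block group on list 2 moves back to list 1 once a region of
1MB or more is freed in it.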

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e81f699347e0..b5608f8dc41a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
 };
 
 /* discard control */
-#define BTRFS_NR_DISCARD_LISTS		2
+#define BTRFS_NR_DISCARD_LISTS		3
 
 struct btrfs_discard_ctl {
 	struct workqueue_struct *discard_workers;
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 072c73f48297..296cbffc5957 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -20,6 +20,10 @@
 #define BTRFS_DISCARD_MAX_DELAY		(10000UL)
 #define BTRFS_DISCARD_MAX_IOPS		(10UL)
 
+/* monotonically decreasing filters after index 0 */
+static int discard_minlen[BTRFS_NR_DISCARD_LISTS] = {0,
+	BTRFS_DISCARD_MAX_FILTER, BTRFS_DISCARD_MIN_FILTER};
+
 static struct list_head *
 btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
 		       struct btrfs_block_group_cache *cache)
@@ -120,7 +124,7 @@ find_next_cache(struct btrfs_discard_ctl *discard_ctl, u64 now)
 }
 
 static struct btrfs_block_group_cache *
-peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
+peek_discard_list(struct btrfs_discard_ctl *discard_ctl, int *discard_index)
 {
 	struct btrfs_block_group_cache *cache;
 	u64 now = ktime_get_ns();
@@ -132,6 +136,7 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
 
 	if (cache && now > cache->discard_delay) {
 		discard_ctl->cache = cache;
+		*discard_index = cache->discard_index;
 		if (cache->discard_index == 0 &&
 		    cache->free_space_ctl->free_space != cache->key.offset) {
 			__btrfs_add_to_discard_list(discard_ctl, cache);
@@ -150,6 +155,36 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
 	return cache;
 }
 
+void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
+				u64 bytes)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+
+	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
+		return;
+
+	discard_ctl = &cache->fs_info->discard_ctl;
+
+	if (cache && cache->discard_index > 1 &&
+	    bytes >= BTRFS_DISCARD_MAX_FILTER) {
+		remove_from_discard_list(discard_ctl, cache);
+		cache->discard_index = 1;
+		btrfs_add_to_discard_list(discard_ctl, cache);
+	}
+}
+
+static void btrfs_update_discard_index(struct btrfs_discard_ctl *discard_ctl,
+				       struct btrfs_block_group_cache *cache)
+{
+	cache->discard_index++;
+	if (cache->discard_index == BTRFS_NR_DISCARD_LISTS) {
+		cache->discard_index = 1;
+		return;
+	}
+
+	btrfs_add_to_discard_list(discard_ctl, cache);
+}
+
 void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache)
 {
@@ -202,23 +237,34 @@ static void btrfs_discard_workfn(struct work_struct *work)
 {
 	struct btrfs_discard_ctl *discard_ctl;
 	struct btrfs_block_group_cache *cache;
+	int discard_index = 0;
 	u64 trimmed = 0;
+	u64 minlen = 0;
 
 	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
 
 again:
-	cache = peek_discard_list(discard_ctl);
+	cache = peek_discard_list(discard_ctl, &discard_index);
 	if (!cache || !btrfs_run_discard_work(discard_ctl))
 		return;
 
-	if (btrfs_discard_bitmaps(cache))
+	minlen = discard_minlen[discard_index];
+
+	if (btrfs_discard_bitmaps(cache)) {
+		u64 maxlen = 0;
+
+		if (discard_index)
+			maxlen = discard_minlen[discard_index - 1];
+
 		btrfs_trim_block_group_bitmaps(cache, &trimmed,
 					       cache->discard_cursor,
 					       btrfs_block_group_end(cache),
-					       0, true);
-	else
+					       minlen, maxlen, true);
+	} else {
 		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
-				       btrfs_block_group_end(cache), 0, true);
+				       btrfs_block_group_end(cache),
+				       minlen, true);
+	}
 
 	discard_ctl->prev_discard = trimmed;
 
@@ -231,6 +277,8 @@ static void btrfs_discard_workfn(struct work_struct *work)
 				 cache->key.offset)
 				btrfs_add_to_discard_free_list(discard_ctl,
 							       cache);
+			else
+				btrfs_update_discard_index(discard_ctl, cache);
 		} else {
 			cache->discard_cursor = cache->key.objectid;
 			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 898dd92dbf8f..1daa8da4a1b5 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -18,6 +18,8 @@
 
 /* discard size limits */
 #define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
+#define BTRFS_DISCARD_MAX_FILTER	(SZ_1M)
+#define BTRFS_DISCARD_MIN_FILTER	(SZ_32K)
 
 /* discard flags */
 #define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
@@ -39,6 +41,8 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache);
 void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
 				    struct btrfs_block_group_cache *cache);
+void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
+				u64 bytes);
 void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
 
 void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ce33803a45b2..ed35dc090df6 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2471,6 +2471,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	if (ret)
 		kmem_cache_free(btrfs_free_space_cachep, info);
 out:
+	btrfs_discard_check_filter(cache, bytes);
 	btrfs_discard_update_discardable(cache, ctl);
 	spin_unlock(&ctl->tree_lock);
 
@@ -3409,7 +3410,13 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 				goto next;
 			}
 			unlink_free_space(ctl, entry);
-			if (bytes > BTRFS_DISCARD_MAX_SIZE) {
+			/*
+			 * Let bytes = BTRFS_DISCARD_MAX_SIZE + X.
+			 * If X < BTRFS_DISCARD_MIN_FILTER, we won't trim X when
+			 * we come back around.  So trim it now.
+			 */
+			if (bytes > (BTRFS_DISCARD_MAX_SIZE +
+				     BTRFS_DISCARD_MIN_FILTER)) {
 				bytes = extent_bytes = BTRFS_DISCARD_MAX_SIZE;
 				entry->offset += BTRFS_DISCARD_MAX_SIZE;
 				entry->bytes -= BTRFS_DISCARD_MAX_SIZE;
@@ -3510,7 +3517,7 @@ static void end_trimming_bitmap(struct btrfs_free_space_ctl *ctl,
 
 static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
-			bool async)
+			u64 maxlen, bool async)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	struct btrfs_free_space *entry;
@@ -3535,7 +3542,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		}
 
 		entry = tree_search_offset(ctl, offset, 1, 0);
-		if (!entry || (async && start == offset &&
+		if (!entry || (async && minlen && start == offset &&
 			       btrfs_free_space_trimmed(entry))) {
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
@@ -3556,10 +3563,10 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		ret2 = search_bitmap(ctl, entry, &start, &bytes, false);
 		if (ret2 || start >= end) {
 			/*
-			 * This keeps the invariant that all bytes are trimmed
-			 * if BTRFS_FSC_TRIMMED is set on a bitmap.
+			 * We lossily consider a bitmap trimmed if we only skip
+			 * over regions <= BTRFS_DISCARD_MIN_FILTER.
 			 */
-			if (ret2 && !minlen)
+			if (ret2 && minlen <= BTRFS_DISCARD_MIN_FILTER)
 				end_trimming_bitmap(ctl, entry);
 			else
 				entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
@@ -3580,14 +3587,19 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		}
 
 		bytes = min(bytes, end - start);
-		if (bytes < minlen) {
-			entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
+		if (bytes < minlen || (async && maxlen && bytes > maxlen)) {
 			spin_unlock(&ctl->tree_lock);
 			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
 		}
 
-		if (async && bytes > BTRFS_DISCARD_MAX_SIZE)
+		/*
+		 * Let bytes = BTRFS_DISCARD_MAX_SIZE + X.
+		 * If X < BTRFS_DISCARD_MIN_FILTER, we won't trim X when we come
+		 * back around.  So trim it now.
+		 */
+		if (async && bytes > (BTRFS_DISCARD_MAX_SIZE +
+				      BTRFS_DISCARD_MIN_FILTER))
 			bytes = BTRFS_DISCARD_MAX_SIZE;
 
 		bitmap_clear_bits(ctl, entry, start, bytes);
@@ -3694,7 +3706,7 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 	if (ret || async)
 		goto out;
 
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, false);
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, 0, false);
 	/* if we ended in the middle of a bitmap, reset the trimming flag */
 	if (end % (BITS_PER_BITMAP * ctl->unit))
 		reset_trimming_bitmap(ctl, offset_to_bitmap(ctl, end));
@@ -3705,7 +3717,7 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 
 int btrfs_trim_block_group_bitmaps(struct btrfs_block_group_cache *block_group,
 				   u64 *trimmed, u64 start, u64 end, u64 minlen,
-				   bool async)
+				   u64 maxlen, bool async)
 {
 	int ret;
 
@@ -3719,7 +3731,8 @@ int btrfs_trim_block_group_bitmaps(struct btrfs_block_group_cache *block_group,
 	btrfs_get_block_group_trimming(block_group);
 	spin_unlock(&block_group->lock);
 
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, async);
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen, maxlen,
+			   async);
 
 	btrfs_put_block_group_trimming(block_group);
 	return ret;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index c5cce44b03af..90abf922f0ba 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -132,7 +132,7 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 			   bool async);
 int btrfs_trim_block_group_bitmaps(struct btrfs_block_group_cache *block_group,
 				   u64 *trimmed, u64 start, u64 end, u64 minlen,
-				   bool async);
+				   u64 maxlen, bool async);
 
 /* Support functions for running our sanity tests */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-- 
2.17.1



* [PATCH 14/19] btrfs: only keep track of data extents for async discard
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (12 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 13/19] btrfs: have multiple discard lists Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 16:53   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 15/19] btrfs: load block_groups into discard_list on mount Dennis Zhou
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

As mentioned earlier, discarding data can be done either by issuing an
explicit discard or implicitly by reusing the LBA. Metadata chunks see
much more frequent reuse simply by virtue of being metadata. So instead
of explicitly discarding metadata blocks, leave them be and let reuse
discard them implicitly.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/block-group.h | 6 ++++++
 fs/btrfs/discard.c     | 8 +++++++-
 fs/btrfs/discard.h     | 3 ++-
 3 files changed, 15 insertions(+), 2 deletions(-)
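
The data-only check from the commit message can be sketched in userspace C
as below. This is a minimal sketch: is_block_group_data() is a hypothetical
stand-in for the helper the patch adds, and the flag values are assumed to
match the block group type bits in the btrfs on-disk format headers. Note
that a mixed block group carries both the DATA and METADATA bits, so a
plain test of the DATA bit still treats it as discardable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the btrfs block group type bits. */
#define BTRFS_BLOCK_GROUP_DATA     (1ULL << 0)
#define BTRFS_BLOCK_GROUP_SYSTEM   (1ULL << 1)
#define BTRFS_BLOCK_GROUP_METADATA (1ULL << 2)

/* Hypothetical helper: should async discard track this block group? */
static bool is_block_group_data(uint64_t flags)
{
	return (flags & BTRFS_BLOCK_GROUP_DATA) != 0;
}
```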

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index b59e6a8ed73d..7739099e974a 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -169,6 +169,12 @@ u64 btrfs_block_group_end(struct btrfs_block_group_cache *cache)
 	return (cache->key.objectid + cache->key.offset);
 }
 
+static inline
+bool btrfs_is_block_group_data(struct btrfs_block_group_cache *cache)
+{
+	return (cache->flags & BTRFS_BLOCK_GROUP_DATA);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group_cache *block_group)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 296cbffc5957..0e4d5a22c661 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -50,6 +50,9 @@ static void __btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 			       struct btrfs_block_group_cache *cache)
 {
+	if (!btrfs_is_block_group_data(cache))
+		return;
+
 	spin_lock(&discard_ctl->lock);
 
 	__btrfs_add_to_discard_list(discard_ctl, cache);
@@ -139,7 +142,10 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl, int *discard_index)
 		*discard_index = cache->discard_index;
 		if (cache->discard_index == 0 &&
 		    cache->free_space_ctl->free_space != cache->key.offset) {
-			__btrfs_add_to_discard_list(discard_ctl, cache);
+			if (btrfs_is_block_group_data(cache))
+				__btrfs_add_to_discard_list(discard_ctl, cache);
+			else
+				list_del_init(&cache->discard_list);
 			goto again;
 		}
 		if (btrfs_discard_reset_cursor(cache)) {
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 1daa8da4a1b5..552daa7251df 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -90,7 +90,8 @@ void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
 	s32 extents_delta;
 	s64 bytes_delta;
 
-	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
+	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC) ||
+	    !btrfs_is_block_group_data(cache))
 		return;
 
 	discard_ctl = &cache->fs_info->discard_ctl;
-- 
2.17.1



* [PATCH 15/19] btrfs: load block_groups into discard_list on mount
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (13 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 14/19] btrfs: only keep track of data extents for async discard Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 17:11   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 16/19] btrfs: keep track of discard reuse stats Dennis Zhou
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Async discard doesn't remember the discard state of a block_group when
unmounting or when we crash. So, any block_group that is not fully used
may have undiscarded regions. However, free space caches are read in on
demand. Let the discard worker read in the free space cache so we can
proceed with discarding rather than wait for the block_group to be used.
This prevents us from indefinitely deferring discards until that
particular block_group is reused.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/block-group.c |  2 ++
 fs/btrfs/discard.c     | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 73e5a9384491..684959c96c3f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1859,6 +1859,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 						&info->discard_ctl, cache);
 			else
 				btrfs_mark_bg_unused(cache);
+		} else if (btrfs_test_opt(info, DISCARD_ASYNC)) {
+			btrfs_add_to_discard_list(&info->discard_ctl, cache);
 		}
 	}
 
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 0e4d5a22c661..d99ba31e6f3b 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -246,6 +246,7 @@ static void btrfs_discard_workfn(struct work_struct *work)
 	int discard_index = 0;
 	u64 trimmed = 0;
 	u64 minlen = 0;
+	int ret;
 
 	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
 
@@ -254,6 +255,19 @@ static void btrfs_discard_workfn(struct work_struct *work)
 	if (!cache || !btrfs_run_discard_work(discard_ctl))
 		return;
 
+	if (!btrfs_block_group_cache_done(cache)) {
+		ret = btrfs_cache_block_group(cache, 0);
+		if (ret) {
+			remove_from_discard_list(discard_ctl, cache);
+			goto out;
+		}
+		ret = btrfs_wait_block_group_cache_done(cache);
+		if (ret) {
+			remove_from_discard_list(discard_ctl, cache);
+			goto out;
+		}
+	}
+
 	minlen = discard_minlen[discard_index];
 
 	if (btrfs_discard_bitmaps(cache)) {
@@ -291,6 +305,7 @@ static void btrfs_discard_workfn(struct work_struct *work)
 		}
 	}
 
+out:
 	spin_lock(&discard_ctl->lock);
 	discard_ctl->cache = NULL;
 	spin_unlock(&discard_ctl->lock);
-- 
2.17.1



* [PATCH 16/19] btrfs: keep track of discard reuse stats
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (14 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 15/19] btrfs: load block_groups into discard_list on mount Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 17:13   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 17/19] btrfs: add async discard header Dennis Zhou
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Keep track of how much we are discarding and how often we are reusing
with async discard.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/ctree.h            |  3 +++
 fs/btrfs/discard.c          |  5 +++++
 fs/btrfs/free-space-cache.c | 10 ++++++++++
 fs/btrfs/sysfs.c            | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 54 insertions(+)
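
The accounting this patch adds can be sketched with C11 atomics as below.
This is a minimal userspace sketch, not the kernel code: struct
discard_stats and the helpers discard_stats_init(), account_trim() and
account_alloc() are hypothetical names, and <stdatomic.h> stands in for the
kernel's atomic64_t API. Bytes trimmed are split by pass (extent or
bitmap), and an allocation that lands on a still-untrimmed entry counts as
bytes saved, since reusing the LBA makes the explicit discard unnecessary.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct discard_stats {
	atomic_uint_fast64_t extent_bytes; /* trimmed during the extent pass */
	atomic_uint_fast64_t bitmap_bytes; /* trimmed during the bitmap pass */
	atomic_uint_fast64_t bytes_saved;  /* reused before being trimmed */
};

static void discard_stats_init(struct discard_stats *s)
{
	atomic_init(&s->extent_bytes, 0);
	atomic_init(&s->bitmap_bytes, 0);
	atomic_init(&s->bytes_saved, 0);
}

/* Called after a trim pass with the number of bytes it discarded. */
static void account_trim(struct discard_stats *s, bool bitmap_pass,
			 uint64_t trimmed)
{
	if (bitmap_pass)
		atomic_fetch_add(&s->bitmap_bytes, trimmed);
	else
		atomic_fetch_add(&s->extent_bytes, trimmed);
}

/* Called when @bytes are allocated out of a free space entry. */
static void account_alloc(struct discard_stats *s, bool entry_trimmed,
			  uint64_t bytes)
{
	if (!entry_trimmed)
		atomic_fetch_add(&s->bytes_saved, bytes);
}
```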

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b5608f8dc41a..2f52b29ff74c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -453,6 +453,9 @@ struct btrfs_discard_ctl {
 	atomic_t delay;
 	atomic_t iops_limit;
 	atomic64_t bps_limit;
+	atomic64_t discard_extent_bytes;
+	atomic64_t discard_bitmap_bytes;
+	atomic64_t discard_bytes_saved;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index d99ba31e6f3b..f0088ca19d28 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -280,10 +280,12 @@ static void btrfs_discard_workfn(struct work_struct *work)
 					       cache->discard_cursor,
 					       btrfs_block_group_end(cache),
 					       minlen, maxlen, true);
+		atomic64_add(trimmed, &discard_ctl->discard_bitmap_bytes);
 	} else {
 		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
 				       btrfs_block_group_end(cache),
 				       minlen, true);
+		atomic64_add(trimmed, &discard_ctl->discard_extent_bytes);
 	}
 
 	discard_ctl->prev_discard = trimmed;
@@ -408,6 +410,9 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
 	atomic_set(&discard_ctl->delay, BTRFS_DISCARD_MAX_DELAY);
 	atomic_set(&discard_ctl->iops_limit, BTRFS_DISCARD_MAX_IOPS);
 	atomic64_set(&discard_ctl->bps_limit, 0);
+	atomic64_set(&discard_ctl->discard_extent_bytes, 0);
+	atomic64_set(&discard_ctl->discard_bitmap_bytes, 0);
+	atomic64_set(&discard_ctl->discard_bytes_saved, 0);
 }
 
 void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ed35dc090df6..480119016c0d 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2773,6 +2773,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 			       u64 *max_extent_size)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	struct btrfs_discard_ctl *discard_ctl =
+					&block_group->fs_info->discard_ctl;
 	struct btrfs_free_space *entry = NULL;
 	u64 bytes_search = bytes + empty_size;
 	u64 ret = 0;
@@ -2797,6 +2799,9 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 		align_gap = entry->offset;
 		align_gap_flags = entry->flags;
 
+		if (!btrfs_free_space_trimmed(entry))
+			atomic64_add(bytes, &discard_ctl->discard_bytes_saved);
+
 		entry->offset = offset + bytes;
 		WARN_ON(entry->bytes < bytes + align_gap_len);
 
@@ -2901,6 +2906,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 			     u64 min_start, u64 *max_extent_size)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	struct btrfs_discard_ctl *discard_ctl =
+					&block_group->fs_info->discard_ctl;
 	struct btrfs_free_space *entry = NULL;
 	struct rb_node *node;
 	u64 ret = 0;
@@ -2965,6 +2972,9 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 
 	spin_lock(&ctl->tree_lock);
 
+	if (!btrfs_free_space_trimmed(entry))
+		atomic64_add(bytes, &discard_ctl->discard_bytes_saved);
+
 	ctl->free_space -= bytes;
 	if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
 		ctl->discardable_bytes[0] -= bytes;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 6fc4d644401b..29a290d75492 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -551,11 +551,47 @@ static ssize_t btrfs_discard_bps_limit_store(struct kobject *kobj,
 BTRFS_ATTR_RW(discard, bps_limit, btrfs_discard_bps_limit_show,
 	      btrfs_discard_bps_limit_store);
 
+static ssize_t btrfs_discard_extent_bytes_show(struct kobject *kobj,
+					struct kobj_attribute *a,
+					char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%lld\n",
+		atomic64_read(&fs_info->discard_ctl.discard_extent_bytes));
+}
+BTRFS_ATTR(discard, discard_extent_bytes, btrfs_discard_extent_bytes_show);
+
+static ssize_t btrfs_discard_bitmap_bytes_show(struct kobject *kobj,
+					struct kobj_attribute *a,
+					char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%lld\n",
+		atomic64_read(&fs_info->discard_ctl.discard_bitmap_bytes));
+}
+BTRFS_ATTR(discard, discard_bitmap_bytes, btrfs_discard_bitmap_bytes_show);
+
+static ssize_t btrfs_discard_bytes_saved_show(struct kobject *kobj,
+					      struct kobj_attribute *a,
+					      char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj->parent);
+
+	return snprintf(buf, PAGE_SIZE, "%lld\n",
+		atomic64_read(&fs_info->discard_ctl.discard_bytes_saved));
+}
+BTRFS_ATTR(discard, discard_bytes_saved, btrfs_discard_bytes_saved_show);
+
 static const struct attribute *discard_attrs[] = {
 	BTRFS_ATTR_PTR(discard, discard_extents),
 	BTRFS_ATTR_PTR(discard, discardable_bytes),
 	BTRFS_ATTR_PTR(discard, iops_limit),
 	BTRFS_ATTR_PTR(discard, bps_limit),
+	BTRFS_ATTR_PTR(discard, discard_extent_bytes),
+	BTRFS_ATTR_PTR(discard, discard_bitmap_bytes),
+	BTRFS_ATTR_PTR(discard, discard_bytes_saved),
 	NULL,
 };
 
-- 
2.17.1



* [PATCH 17/19] btrfs: add async discard header
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (15 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 16/19] btrfs: keep track of discard reuse stats Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 17:13   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Give a brief overview for how async discard is implemented.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/discard.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index f0088ca19d28..61e341685acd 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -1,6 +1,40 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2019 Facebook.  All rights reserved.
+ *
+ * This contains the logic to handle async discard.
+ *
+ * Async discard manages trimming of free space outside of transaction commit.
+ * Discarding is done by managing the block_groups on an LRU list based on free
+ * space recency.  Two passes are used to first prioritize discarding extents
+ * and then allow for trimming in the bitmap the best opportunity to coalesce.
+ * The block_groups are maintained on multiple lists to allow for multiple
+ * passes with different discard filter requirements.  A delayed work item is
+ * used to manage discarding with timeout determined by a max of the delay
+ * incurred by the iops rate limit, byte rate limit, and the timeout of max
+ * delay of BTRFS_DISCARD_MAX_DELAY.
+ *
+ * The first list is special to manage discarding of fully free block groups.
+ * This is necessary because we issue a final trim for a fully free block group
+ * after forgetting it.  When a block group becomes unused, instead of directly
+ * being added to the unused_bgs list, we add it to this first list.  Then
+ * from there, if it becomes fully discarded, we place it onto the unused_bgs
+ * list.
+ *
+ * The in-memory free space cache serves as the backing state for discard.
+ * Consequently this means there is no persistence.  We opt to load all the
+ * block groups in as not discarded, so the mount case degenerates to the
+ * crashing case.
+ *
+ * As the free space cache uses bitmaps, there exists a tradeoff between
+ * ease/efficiency for find_free_extent() and the accuracy of discard state.
+ * Here we opt to let untrimmed regions merge with everything while only letting
+ * trimmed regions merge with other trimmed regions.  This can cause
+ * overtrimming, but the coalescing benefit seems to be worth it.  Additionally,
+ * bitmap state is tracked as a whole.  If we're able to fully trim a bitmap,
+ * the trimmed flag is set on the bitmap.  Otherwise, if an allocation comes in,
+ * this resets the state and we will retry trimming the whole bitmap.  This is a
+ * tradeoff between discard state accuracy and the cost of accounting.
  */
 
 #include <linux/jiffies.h>
-- 
2.17.1



* [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (16 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 17/19] btrfs: add async discard header Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 17:16   ` Josef Bacik
  2019-10-07 20:17 ` [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Currently, there is no way for the free space cache to recover from
being serviced purely by bitmaps because the extent threshold is set to
0 in recalculate_thresholds() when we surpass the metadata allowance.

This adds a recovery mechanism by keeping large extents out of the
bitmaps and increases the metadata upper bound to 64KB per 1GB of space.
The recovery mechanism bypasses this upper bound, thus making it a soft
upper bound. However, since only extents of 1MB or larger bypass it, the
extra overhead stays bounded.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/free-space-cache.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)
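
The budget arithmetic described in the commit message can be sketched in
userspace C as below. This is a minimal sketch under stated assumptions:
cache_max_bytes() and forced_to_extent() are hypothetical helpers that
mirror only the max_bytes computation and the new size check from the diff,
and plain 64-bit division stands in for div_u64().

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_64K (64ULL * 1024)
#define SZ_1M  (1024ULL * 1024)
#define SZ_1G  (1024ULL * 1024 * 1024)

#define MAX_CACHE_BYTES_PER_GIG SZ_64K
#define FORCE_EXTENT_THRESHOLD  SZ_1M

/* Hypothetical helper: soft memory budget for a block group of @size bytes. */
static uint64_t cache_max_bytes(uint64_t size)
{
	if (size < SZ_1G)
		return MAX_CACHE_BYTES_PER_GIG;
	/* Whole gigabytes only; the remainder is dropped by the division. */
	return MAX_CACHE_BYTES_PER_GIG * (size / SZ_1G);
}

/* Hypothetical helper: does a free region of @bytes bypass the bitmaps? */
static bool forced_to_extent(uint64_t bytes)
{
	return bytes >= FORCE_EXTENT_THRESHOLD;
}
```

For a 10GB block group this gives a 640KB budget. The bypass stays bounded
because 1GB of space can hold at most 1024 separate free regions of 1MB or
more.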

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 480119016c0d..a0941d281a63 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -24,7 +24,8 @@
 #include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
-#define MAX_CACHE_BYTES_PER_GIG	SZ_32K
+#define MAX_CACHE_BYTES_PER_GIG	SZ_64K
+#define FORCE_EXTENT_THRESHOLD	SZ_1M
 
 struct btrfs_trim_range {
 	u64 start;
@@ -1686,26 +1687,17 @@ static void recalculate_thresholds(struct btrfs_free_space_ctl *ctl)
 	ASSERT(ctl->total_bitmaps <= max_bitmaps);
 
 	/*
-	 * The goal is to keep the total amount of memory used per 1gb of space
-	 * at or below 32k, so we need to adjust how much memory we allow to be
-	 * used by extent based free space tracking
+	 * We are trying to keep the total amount of memory used per 1gb of
+	 * space to be MAX_CACHE_BYTES_PER_GIG.  However, with a reclamation
+	 * mechanism of pulling extents >= FORCE_EXTENT_THRESHOLD out of
+	 * bitmaps, we may end up using more memory than this.
 	 */
 	if (size < SZ_1G)
 		max_bytes = MAX_CACHE_BYTES_PER_GIG;
 	else
 		max_bytes = MAX_CACHE_BYTES_PER_GIG * div_u64(size, SZ_1G);
 
-	/*
-	 * we want to account for 1 more bitmap than what we have so we can make
-	 * sure we don't go over our overall goal of MAX_CACHE_BYTES_PER_GIG as
-	 * we add more bitmaps.
-	 */
-	bitmap_bytes = (ctl->total_bitmaps + 1) * ctl->unit;
-
-	if (bitmap_bytes >= max_bytes) {
-		ctl->extents_thresh = 0;
-		return;
-	}
+	bitmap_bytes = ctl->total_bitmaps * ctl->unit;
 
 	/*
 	 * we want the extent entry threshold to always be at most 1/2 the max
@@ -2086,6 +2078,10 @@ static bool use_bitmap(struct btrfs_free_space_ctl *ctl,
 		forced = true;
 #endif
 
+	/* this is a way to reclaim large regions from the bitmaps */
+	if (!forced && info->bytes >= FORCE_EXTENT_THRESHOLD)
+		return false;
+
 	/*
 	 * If we are below the extents threshold then we can add this as an
 	 * extent, and don't have to deal with the bitmap
-- 
2.17.1



* [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (17 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou
@ 2019-10-07 20:17 ` Dennis Zhou
  2019-10-10 17:17   ` Josef Bacik
  2019-10-11  7:49 ` [RFC PATCH 00/19] btrfs: async discard support Nikolay Borisov
  2019-10-15 12:08 ` David Sterba
  20 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 20:17 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

It's less than ideal for small extents to eat into our extent budget, so
force extents <= 32KB into the bitmaps save for the first handful.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 fs/btrfs/free-space-cache.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index a0941d281a63..505091940580 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2094,8 +2094,8 @@ static bool use_bitmap(struct btrfs_free_space_ctl *ctl,
 		 * of cache left then go ahead an dadd them, no sense in adding
 		 * the overhead of a bitmap if we don't have to.
 		 */
-		if (info->bytes <= fs_info->sectorsize * 4) {
-			if (ctl->free_extents * 2 <= ctl->extents_thresh)
+		if (info->bytes <= fs_info->sectorsize * 8) {
+			if (ctl->free_extents * 3 <= ctl->extents_thresh)
 				return false;
 		} else {
 			return false;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 71+ messages in thread
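
The combined effect of patches 18 and 19 on the extent-versus-bitmap
decision can be sketched as follows. This is a simplified userspace
model, not the kernel function: struct fs_ctl_sketch and
use_bitmap_sketch() are hypothetical names, the outer
"free_extents < extents_thresh" test is assumed from the context comment
in the hunk, and the later checks in use_bitmap() are collapsed into a
plain "return true". With a 4KiB sector size, sectorsize * 8 is the 32KB
mentioned in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1M (1024ULL * 1024)
#define FORCE_EXTENT_THRESHOLD SZ_1M

/* Hypothetical stand-in for the fields the decision reads. */
struct fs_ctl_sketch {
	uint32_t sectorsize;
	int free_extents;
	int extents_thresh;
};

/* Returns true if the free region should be tracked in a bitmap. */
static bool use_bitmap_sketch(const struct fs_ctl_sketch *ctl,
			      uint64_t bytes, bool forced)
{
	/* Patch 18: large regions are always kept as extents. */
	if (!forced && bytes >= FORCE_EXTENT_THRESHOLD)
		return false;

	/* Assumed outer check: still below the extent budget. */
	if (!forced && ctl->free_extents < ctl->extents_thresh) {
		if (bytes <= (uint64_t)ctl->sectorsize * 8) {
			/* Patch 19: small extents only get the first third. */
			if (ctl->free_extents * 3 <= ctl->extents_thresh)
				return false;
		} else {
			return false;
		}
	}

	/* Simplification: everything else goes to a bitmap. */
	return true;
}
```

In this model a 32KiB region is stored as an extent only while at most a
third of the extent budget is in use, whereas a 64KiB region stays an
extent until the budget is exhausted.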

* Re: [PATCH 01/19] bitmap: genericize percpu bitmap region iterators
  2019-10-07 20:17 ` [PATCH 01/19] bitmap: genericize percpu bitmap region iterators Dennis Zhou
@ 2019-10-07 20:26   ` Josef Bacik
  2019-10-07 22:24     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-07 20:26 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:32PM -0400, Dennis Zhou wrote:
> Bitmaps are fairly popular for their space efficiency, but we don't have
> generic iterators available. Make percpu's bitmap region iterators
> available to everyone.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  include/linux/bitmap.h | 35 ++++++++++++++++++++++++
>  mm/percpu.c            | 61 +++++++++++-------------------------------
>  2 files changed, 51 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 90528f12bdfa..9b0664f36808 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -437,6 +437,41 @@ static inline int bitmap_parse(const char *buf, unsigned int buflen,
>  	return __bitmap_parse(buf, buflen, 0, maskp, nmaskbits);
>  }
>  
> +static inline void bitmap_next_clear_region(unsigned long *bitmap,
> +					    unsigned int *rs, unsigned int *re,
> +					    unsigned int end)
> +{
> +	*rs = find_next_zero_bit(bitmap, end, *rs);
> +	*re = find_next_bit(bitmap, end, *rs + 1);
> +}
> +
> +static inline void bitmap_next_set_region(unsigned long *bitmap,
> +					  unsigned int *rs, unsigned int *re,
> +					  unsigned int end)
> +{
> +	*rs = find_next_bit(bitmap, end, *rs);
> +	*re = find_next_zero_bit(bitmap, end, *rs + 1);
> +}
> +
> +/*
> + * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).

Gonna be that guy here, should be '[@start, @end]'

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC
  2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
@ 2019-10-07 20:27   ` Josef Bacik
  2019-10-08 11:12   ` Johannes Thumshirn
  2019-10-11  9:19   ` Nikolay Borisov
  2 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-07 20:27 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:33PM -0400, Dennis Zhou wrote:
> This series introduces async discard which will use the flag
> DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
> synchronously done in transaction commit.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
@ 2019-10-07 20:37   ` Josef Bacik
  2019-10-07 22:38     ` Dennis Zhou
  2019-10-08 12:46   ` Nikolay Borisov
  2019-10-15 12:17   ` David Sterba
  2 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-07 20:37 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> Async discard will use the free space cache as backing knowledge for
> which extents to discard. This patch plumbs knowledge about which
> extents need to be discarded into the free space cache from
> unpin_extent_range().
> 
> An untrimmed extent can merge with everything as this is a new region.
> Absorbing trimmed extents is a tradeoff for greater coalescing which
> makes life better for find_free_extent(). Additionally, it seems the
> size of a trim isn't as problematic as the trim io itself.
> 
> When reading in the free space cache from disk, if sync is set, mark all
> extents as trimmed. The current code ensures at transaction commit that
> all free space is trimmed when sync is set, so this reflects that.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/extent-tree.c      | 15 ++++++++++-----
>  fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
>  fs/btrfs/free-space-cache.h | 10 +++++++++-
>  fs/btrfs/inode-map.c        | 13 +++++++------
>  4 files changed, 57 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 77a5904756c5..b9e3bedad878 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
>  }
>  
>  static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> -			      u64 start, u64 end,
> +			      u64 start, u64 end, u32 fsc_flags,
>  			      const bool return_free_space)
>  {
>  	struct btrfs_block_group_cache *cache = NULL;
> @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  		if (start < cache->last_byte_to_unpin) {
>  			len = min(len, cache->last_byte_to_unpin - start);
>  			if (return_free_space)
> -				btrfs_add_free_space(cache, start, len);
> +				__btrfs_add_free_space(fs_info,
> +						       cache->free_space_ctl,
> +						       start, len, fsc_flags);
>  		}
>  
>  		start += len;
> @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>  
>  	while (!trans->aborted) {
>  		struct extent_state *cached_state = NULL;
> +		u32 fsc_flags = 0;
>  
>  		mutex_lock(&fs_info->unused_bg_unpin_mutex);
>  		ret = find_first_extent_bit(unpin, 0, &start, &end,
> @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>  			break;
>  		}
>  
> -		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> +		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
>  			ret = btrfs_discard_extent(fs_info, start,
>  						   end + 1 - start, NULL);
> +			fsc_flags |= BTRFS_FSC_TRIMMED;
> +		}
>  
>  		clear_extent_dirty(unpin, start, end, &cached_state);
> -		unpin_extent_range(fs_info, start, end, true);
> +		unpin_extent_range(fs_info, start, end, fsc_flags, true);
>  		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
>  		free_extent_state(cached_state);
>  		cond_resched();
> @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
>  int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
>  				   u64 start, u64 end)
>  {
> -	return unpin_extent_range(fs_info, start, end, false);
> +	return unpin_extent_range(fs_info, start, end, 0, false);
>  }
>  
>  /*
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index d54dcd0ab230..f119895292b8 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>  			goto free_cache;
>  		}
>  
> +		/*
> +		 * Sync discard ensures that the free space cache is always
> +		 * trimmed.  So when reading this in, the state should reflect
> +		 * that.
> +		 */
> +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> +			e->flags |= BTRFS_FSC_TRIMMED;
> +
>  		if (!e->bytes) {
>  			kmem_cache_free(btrfs_free_space_cachep, e);
>  			goto free_cache;
> @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
>  	bool merged = false;
>  	u64 offset = info->offset;
>  	u64 bytes = info->bytes;
> +	bool is_trimmed = btrfs_free_space_trimmed(info);
>  
>  	/*
>  	 * first we want to see if there is free space adjacent to the range we
> @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
>  	else
>  		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
>  
> -	if (right_info && !right_info->bitmap) {
> +	if (right_info && !right_info->bitmap &&
> +	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
>  		if (update_stat)
>  			unlink_free_space(ctl, right_info);
>  		else
> @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
>  	}
>  
>  	if (left_info && !left_info->bitmap &&
> -	    left_info->offset + left_info->bytes == offset) {
> +	    left_info->offset + left_info->bytes == offset &&
> +	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {

So we allow merging if we haven't trimmed this entry, or if the adjacent entry
is already trimmed?  This means we'll merge if we trimmed the new entry
regardless of the adjacent entry's status, or if the new entry is dirty.  Why is
that?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/19] bitmap: genericize percpu bitmap region iterators
  2019-10-07 20:26   ` Josef Bacik
@ 2019-10-07 22:24     ` Dennis Zhou
  2019-10-15 12:11       ` David Sterba
  0 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 22:24 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:26:13PM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:32PM -0400, Dennis Zhou wrote:
> > Bitmaps are fairly popular for their space efficiency, but we don't have
> > generic iterators available. Make percpu's bitmap region iterators
> > available to everyone.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  include/linux/bitmap.h | 35 ++++++++++++++++++++++++
> >  mm/percpu.c            | 61 +++++++++++-------------------------------
> >  2 files changed, 51 insertions(+), 45 deletions(-)
> > 
> > diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> > index 90528f12bdfa..9b0664f36808 100644
> > --- a/include/linux/bitmap.h
> > +++ b/include/linux/bitmap.h
> > @@ -437,6 +437,41 @@ static inline int bitmap_parse(const char *buf, unsigned int buflen,
> >  	return __bitmap_parse(buf, buflen, 0, maskp, nmaskbits);
> >  }
> >  
> > +static inline void bitmap_next_clear_region(unsigned long *bitmap,
> > +					    unsigned int *rs, unsigned int *re,
> > +					    unsigned int end)
> > +{
> > +	*rs = find_next_zero_bit(bitmap, end, *rs);
> > +	*re = find_next_bit(bitmap, end, *rs + 1);
> > +}
> > +
> > +static inline void bitmap_next_set_region(unsigned long *bitmap,
> > +					  unsigned int *rs, unsigned int *re,
> > +					  unsigned int end)
> > +{
> > +	*rs = find_next_bit(bitmap, end, *rs);
> > +	*re = find_next_zero_bit(bitmap, end, *rs + 1);
> > +}
> > +
> > +/*
> > + * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).
> 
> Gonna be that guy here, should be '[@start, @end]'
> 

I disagree here. I'm pretty happy with [@start, @end). If btrfs wants to
carry their own iterators I'm happy to copy and paste them, but as far
as percpu goes I like [@start, @end).

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread
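
The interval question above is easier to see with a small userspace
model of the region helpers. The sketch below is an assumption-laden
illustration: find_next_sketch() is a naive bit-by-bit replacement for
the kernel's find_next_bit()/find_next_zero_bit(), all *_sketch names are
hypothetical, and the loop in count_set_in_range() is written by hand
rather than taken from the patch. It shows that each region comes back as
[rs, re) and that the bit at @end itself is never examined.

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

static int test_bit_sketch(const unsigned long *bitmap, unsigned int nr)
{
	return (bitmap[nr / BITS_PER_LONG_SKETCH] >>
		(nr % BITS_PER_LONG_SKETCH)) & 1UL;
}

/* First bit in [offset, size) whose value equals want, else size. */
static unsigned int find_next_sketch(const unsigned long *bitmap,
				     unsigned int size, unsigned int offset,
				     int want)
{
	for (; offset < size; offset++)
		if (test_bit_sketch(bitmap, offset) == want)
			return offset;
	return size;
}

/* Same shape as bitmap_next_set_region() in the quoted patch. */
static void next_set_region_sketch(const unsigned long *bitmap,
				   unsigned int *rs, unsigned int *re,
				   unsigned int end)
{
	*rs = find_next_sketch(bitmap, end, *rs, 1);
	*re = find_next_sketch(bitmap, end, *rs + 1, 0);
}

/* Sum the lengths of all set regions inside [start, end). */
static unsigned int count_set_in_range(const unsigned long *bitmap,
				       unsigned int start, unsigned int end)
{
	unsigned int rs = start, re, total = 0;

	next_set_region_sketch(bitmap, &rs, &re, end);
	while (rs < end) {
		total += re - rs;	/* region is [rs, re) */
		rs = re + 1;		/* bit re is clear (or is end) */
		next_set_region_sketch(bitmap, &rs, &re, end);
	}
	return total;
}
```

With the bitmap 0x76 (bits 1, 2, 4, 5 and 6 set), iterating over [2, 6)
visits bits 2, 4 and 5 only, so bit 6 is excluded as a half-open range
implies.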

* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 20:37   ` Josef Bacik
@ 2019-10-07 22:38     ` Dennis Zhou
  2019-10-10 13:40       ` Josef Bacik
  0 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-07 22:38 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:37:28PM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > Async discard will use the free space cache as backing knowledge for
> > which extents to discard. This patch plumbs knowledge about which
> > extents need to be discarded into the free space cache from
> > unpin_extent_range().
> > 
> > An untrimmed extent can merge with everything as this is a new region.
> > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > makes life better for find_free_extent(). Additionally, it seems the
> > size of a trim isn't as problematic as the trim io itself.
> > 
> > When reading in the free space cache from disk, if sync is set, mark all
> > extents as trimmed. The current code ensures at transaction commit that
> > all free space is trimmed when sync is set, so this reflects that.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/extent-tree.c      | 15 ++++++++++-----
> >  fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
> >  fs/btrfs/free-space-cache.h | 10 +++++++++-
> >  fs/btrfs/inode-map.c        | 13 +++++++------
> >  4 files changed, 57 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 77a5904756c5..b9e3bedad878 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
> >  }
> >  
> >  static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > -			      u64 start, u64 end,
> > +			      u64 start, u64 end, u32 fsc_flags,
> >  			      const bool return_free_space)
> >  {
> >  	struct btrfs_block_group_cache *cache = NULL;
> > @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> >  		if (start < cache->last_byte_to_unpin) {
> >  			len = min(len, cache->last_byte_to_unpin - start);
> >  			if (return_free_space)
> > -				btrfs_add_free_space(cache, start, len);
> > +				__btrfs_add_free_space(fs_info,
> > +						       cache->free_space_ctl,
> > +						       start, len, fsc_flags);
> >  		}
> >  
> >  		start += len;
> > @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> >  
> >  	while (!trans->aborted) {
> >  		struct extent_state *cached_state = NULL;
> > +		u32 fsc_flags = 0;
> >  
> >  		mutex_lock(&fs_info->unused_bg_unpin_mutex);
> >  		ret = find_first_extent_bit(unpin, 0, &start, &end,
> > @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> >  			break;
> >  		}
> >  
> > -		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
> >  			ret = btrfs_discard_extent(fs_info, start,
> >  						   end + 1 - start, NULL);
> > +			fsc_flags |= BTRFS_FSC_TRIMMED;
> > +		}
> >  
> >  		clear_extent_dirty(unpin, start, end, &cached_state);
> > -		unpin_extent_range(fs_info, start, end, true);
> > +		unpin_extent_range(fs_info, start, end, fsc_flags, true);
> >  		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> >  		free_extent_state(cached_state);
> >  		cond_resched();
> > @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
> >  int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
> >  				   u64 start, u64 end)
> >  {
> > -	return unpin_extent_range(fs_info, start, end, false);
> > +	return unpin_extent_range(fs_info, start, end, 0, false);
> >  }
> >  
> >  /*
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index d54dcd0ab230..f119895292b8 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
> >  			goto free_cache;
> >  		}
> >  
> > +		/*
> > +		 * Sync discard ensures that the free space cache is always
> > +		 * trimmed.  So when reading this in, the state should reflect
> > +		 * that.
> > +		 */
> > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > +			e->flags |= BTRFS_FSC_TRIMMED;
> > +
> >  		if (!e->bytes) {
> >  			kmem_cache_free(btrfs_free_space_cachep, e);
> >  			goto free_cache;
> > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> >  	bool merged = false;
> >  	u64 offset = info->offset;
> >  	u64 bytes = info->bytes;
> > +	bool is_trimmed = btrfs_free_space_trimmed(info);
> >  
> >  	/*
> >  	 * first we want to see if there is free space adjacent to the range we
> > @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> >  	else
> >  		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
> >  
> > -	if (right_info && !right_info->bitmap) {
> > +	if (right_info && !right_info->bitmap &&
> > +	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
> >  		if (update_stat)
> >  			unlink_free_space(ctl, right_info);
> >  		else
> > @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> >  	}
> >  
> >  	if (left_info && !left_info->bitmap &&
> > -	    left_info->offset + left_info->bytes == offset) {
> > +	    left_info->offset + left_info->bytes == offset &&
> > +	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
> 
> So we allow merging if we haven't trimmed this entry, or if the adjacent entry
> is already trimmed?  This means we'll merge if we trimmed the new entry
> regardless of the adjacent entry's status, or if the new entry is dirty.  Why is
> that?  Thanks,
> 

This is the tradeoff I called out above here:

> > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > makes life better for find_free_extent(). Additionally, it seems the
> > size of a trim isn't as problematic as the trim io itself.

A problematic example case:

|----trimmed----|/////X/////|-----trimmed-----|

If region X gets freed and returned to the free space cache, we end up
with the following:

|----trimmed----|-untrimmed-|-----trimmed-----|

This isn't great because now we need to teach find_free_extent() to span
multiple btrfs_free_space entries, something I didn't want to do. So the
other option is to overtrim in exchange for a simpler find_free_extent().
Then the above becomes:

|-------------------trimmed-------------------|

It makes the assumption that if we're inserting, it's generally free
space being returned rather than us needing to slice out from the middle
of a block. It does still have degenerate cases, but it's better than
the above. The merging also allows for stuff to come out of bitmaps more
proactively too.

Also, from what I've seen, the cost of a discard operation is quite high
relative to the amount you're discarding (1 larger discard is better than
several smaller discards) as it will clog up the device too.


Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread
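
The merge condition debated above reduces to a small predicate. The
sketch below is a hedged restatement of the check added to
try_merge_free_space() in the quoted hunk; can_absorb_neighbor() is a
hypothetical name, and the flags carried by the merged entry afterwards
are deliberately not modelled because that part of the patch is not
shown here.

```c
#include <stdbool.h>

/*
 * An untrimmed new entry may absorb any extent neighbour (the combined
 * range is trimmed again later, i.e. overtrimmed). A trimmed new entry
 * only merges with a neighbour that is also trimmed, so trimmed state is
 * never claimed for space that has not been discarded.
 */
static bool can_absorb_neighbor(bool new_is_trimmed, bool neighbor_is_trimmed)
{
	return !new_is_trimmed || neighbor_is_trimmed;
}
```

Of the four combinations, only "new entry trimmed, neighbour untrimmed"
is refused.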

* Re: [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC
  2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
  2019-10-07 20:27   ` Josef Bacik
@ 2019-10-08 11:12   ` Johannes Thumshirn
  2019-10-11  9:19   ` Nikolay Borisov
  2 siblings, 0 replies; 71+ messages in thread
From: Johannes Thumshirn @ 2019-10-08 11:12 UTC (permalink / raw)
  To: Dennis Zhou, David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
  2019-10-07 20:37   ` Josef Bacik
@ 2019-10-08 12:46   ` Nikolay Borisov
  2019-10-11 16:08     ` Dennis Zhou
  2019-10-15 12:17   ` David Sterba
  2 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2019-10-08 12:46 UTC (permalink / raw)
  To: Dennis Zhou, Chris Mason, Omar Sandoval, David Sterba, Josef Bacik
  Cc: kernel-team, linux-btrfs



On 7.10.19 at 23:17, Dennis Zhou wrote:
> Async discard will use the free space cache as backing knowledge for
> which extents to discard. This patch plumbs knowledge about which
> extents need to be discarded into the free space cache from
> unpin_extent_range().
> 
> An untrimmed extent can merge with everything as this is a new region.
> Absorbing trimmed extents is a tradeoff for greater coalescing which
> makes life better for find_free_extent(). Additionally, it seems the
> size of a trim isn't as problematic as the trim io itself.
> 
> When reading in the free space cache from disk, if sync is set, mark all
> extents as trimmed. The current code ensures at transaction commit that
> all free space is trimmed when sync is set, so this reflects that.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

I haven't looked closely into this commit but I already implemented
something similar in order to speed up trimming by not discarding an
already discarded region twice. The code was introduced by the following
series:
https://lore.kernel.org/linux-btrfs/20190327122418.24027-1-nborisov@suse.com/
in particular patches 13 to 15.

Can you leverage it? If not then your code should, at some point,
subsume the old one.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 22:38     ` Dennis Zhou
@ 2019-10-10 13:40       ` Josef Bacik
  2019-10-11 16:15         ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 13:40 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Josef Bacik, David Sterba, Chris Mason, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 06:38:10PM -0400, Dennis Zhou wrote:
> On Mon, Oct 07, 2019 at 04:37:28PM -0400, Josef Bacik wrote:
> > On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > > Async discard will use the free space cache as backing knowledge for
> > > which extents to discard. This patch plumbs knowledge about which
> > > extents need to be discarded into the free space cache from
> > > unpin_extent_range().
> > > 
> > > An untrimmed extent can merge with everything as this is a new region.
> > > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > > makes life better for find_free_extent(). Additionally, it seems the
> > > size of a trim isn't as problematic as the trim io itself.
> > > 
> > > When reading in the free space cache from disk, if sync is set, mark all
> > > extents as trimmed. The current code ensures at transaction commit that
> > > all free space is trimmed when sync is set, so this reflects that.
> > > 
> > > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > > ---
> > >  fs/btrfs/extent-tree.c      | 15 ++++++++++-----
> > >  fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
> > >  fs/btrfs/free-space-cache.h | 10 +++++++++-
> > >  fs/btrfs/inode-map.c        | 13 +++++++------
> > >  4 files changed, 57 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 77a5904756c5..b9e3bedad878 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
> > >  }
> > >  
> > >  static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > -			      u64 start, u64 end,
> > > +			      u64 start, u64 end, u32 fsc_flags,
> > >  			      const bool return_free_space)
> > >  {
> > >  	struct btrfs_block_group_cache *cache = NULL;
> > > @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > >  		if (start < cache->last_byte_to_unpin) {
> > >  			len = min(len, cache->last_byte_to_unpin - start);
> > >  			if (return_free_space)
> > > -				btrfs_add_free_space(cache, start, len);
> > > +				__btrfs_add_free_space(fs_info,
> > > +						       cache->free_space_ctl,
> > > +						       start, len, fsc_flags);
> > >  		}
> > >  
> > >  		start += len;
> > > @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > >  
> > >  	while (!trans->aborted) {
> > >  		struct extent_state *cached_state = NULL;
> > > +		u32 fsc_flags = 0;
> > >  
> > >  		mutex_lock(&fs_info->unused_bg_unpin_mutex);
> > >  		ret = find_first_extent_bit(unpin, 0, &start, &end,
> > > @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > >  			break;
> > >  		}
> > >  
> > > -		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
> > >  			ret = btrfs_discard_extent(fs_info, start,
> > >  						   end + 1 - start, NULL);
> > > +			fsc_flags |= BTRFS_FSC_TRIMMED;
> > > +		}
> > >  
> > >  		clear_extent_dirty(unpin, start, end, &cached_state);
> > > -		unpin_extent_range(fs_info, start, end, true);
> > > +		unpin_extent_range(fs_info, start, end, fsc_flags, true);
> > >  		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> > >  		free_extent_state(cached_state);
> > >  		cond_resched();
> > > @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
> > >  int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
> > >  				   u64 start, u64 end)
> > >  {
> > > -	return unpin_extent_range(fs_info, start, end, false);
> > > +	return unpin_extent_range(fs_info, start, end, 0, false);
> > >  }
> > >  
> > >  /*
> > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > > index d54dcd0ab230..f119895292b8 100644
> > > --- a/fs/btrfs/free-space-cache.c
> > > +++ b/fs/btrfs/free-space-cache.c
> > > @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
> > >  			goto free_cache;
> > >  		}
> > >  
> > > +		/*
> > > +		 * Sync discard ensures that the free space cache is always
> > > +		 * trimmed.  So when reading this in, the state should reflect
> > > +		 * that.
> > > +		 */
> > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > +			e->flags |= BTRFS_FSC_TRIMMED;
> > > +
> > >  		if (!e->bytes) {
> > >  			kmem_cache_free(btrfs_free_space_cachep, e);
> > >  			goto free_cache;
> > > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > >  	bool merged = false;
> > >  	u64 offset = info->offset;
> > >  	u64 bytes = info->bytes;
> > > +	bool is_trimmed = btrfs_free_space_trimmed(info);
> > >  
> > >  	/*
> > >  	 * first we want to see if there is free space adjacent to the range we
> > > @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > >  	else
> > >  		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
> > >  
> > > -	if (right_info && !right_info->bitmap) {
> > > +	if (right_info && !right_info->bitmap &&
> > > +	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
> > >  		if (update_stat)
> > >  			unlink_free_space(ctl, right_info);
> > >  		else
> > > @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > >  	}
> > >  
> > >  	if (left_info && !left_info->bitmap &&
> > > -	    left_info->offset + left_info->bytes == offset) {
> > > +	    left_info->offset + left_info->bytes == offset &&
> > > +	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
> > 
> > So we allow merging if we haven't trimmed this entry, or if the adjacent entry
> > is already trimmed?  This means we'll merge if we trimmed the new entry
> > regardless of the adjacent entry's status, or if the new entry is dirty.  Why is
> > that?  Thanks,
> > 
> 
> This is the tradeoff I called out above here:
> 
> > > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > > makes life better for find_free_extent(). Additionally, it seems the
> > > size of a trim isn't as problematic as the trim io itself.
> 
> A problematic example case:
> 
> |----trimmed----|/////X/////|-----trimmed-----|
> 
> If region X gets freed and returned to the free space cache, we end up
> with the following:
> 
> |----trimmed----|-untrimmed-|-----trimmed-----|
> 
> This isn't great because now we need to teach find_free_extent() to span
> multiple btrfs_free_space entries, something I didn't want to do. So the
> other option is to overtrim in exchange for a simpler find_free_extent().
> Then the above becomes:
> 
> |-------------------trimmed-------------------|
> 
> It makes the assumption that if we're inserting, it's generally free
> space being returned rather than us needing to slice out from the middle
> of a block. It does still have degenerate cases, but it's better than
> the above. The merging also allows for stuff to come out of bitmaps more
> proactively too.
> 
> Also, from what I've seen, the cost of a discard operation is quite high
> relative to the amount you're discarding (1 larger discard is better than
> several smaller discards) as it will clog up the device too.


OOOOOh I fucking get it now.  That's going to need a comment, because it's not
obvious at all.

However I still wonder if this is right.  Your above examples are legitimate,
but say you have

| 512mib adding back that isn't trimmed |------- 512mib trimmed ------|

we'll merge these two, but really we should probably trim that 512mib chunk
we're adding right?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap
  2019-10-07 20:17 ` [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
@ 2019-10-10 14:16   ` Josef Bacik
  2019-10-11 16:17     ` Dennis Zhou
  2019-10-15 12:23   ` David Sterba
  1 sibling, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 14:16 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:35PM -0400, Dennis Zhou wrote:
> There is a cap in btrfs on the number of free extents that a block group
> can have. When it surpasses that threshold, future extents are placed
> into bitmaps. Instead of keeping track of whether a certain bit is trimmed
> or not in a second bitmap, keep track of the relative state of the bitmap.
> 
> With async discard, trimming bitmaps becomes a more frequent operation.
> As a trade off with simplicity, we keep track of if discarding a bitmap
> is in progress. If we fully scan a bitmap and trim as necessary, the
> bitmap is marked clean. This has some caveats as the min block size may
> skip over regions deemed too small. But this should be a reasonable
> trade off rather than keeping a second bitmap and making allocation
> paths more complex. The downside is we may overtrim, but ideally the min
> block size should prevent us from doing that too often and getting stuck
> trimming pathological cases.
> 
> BTRFS_FSC_TRIMMING_BITMAP is added to indicate a bitmap is in the
> process of being trimmed. If additional free space is added to that
> bitmap, the bit is cleared. A bitmap will be marked BTRFS_FSC_TRIMMED if
> the trimming code was able to reach the end of it and the former is
> still set.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

I went through and looked at the end result and it appears to me that we never
have TRIMMED and TRIMMING set at the same time.  Since these are the only two
flags, and TRIMMING is only set on bitmaps, it makes more sense for this to be
more like

enum btrfs_trim_state {
	BTRFS_TRIM_STATE_TRIMMED,
	BTRFS_TRIM_STATE_TRIMMING,
	BTRFS_TRIM_STATE_UNTRIMMED,
};

and then just have enum btrfs_trim_state trim_state in the free space entry.
This makes things a bit cleaner since it's really just a state indicator rather
than actual flags.  Thanks,

Josef
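
A minimal sketch of how the enum-based state proposed above could behave,
assuming the transitions described in the quoted commit message (new free space
clears the in-progress state; reaching the end of the scan marks the bitmap
clean only if the in-progress state is still set). The struct and the sketch_*
helpers are hypothetical stand-ins, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Mutually exclusive trim states, as proposed in the review. */
enum btrfs_trim_state {
	BTRFS_TRIM_STATE_TRIMMED,
	BTRFS_TRIM_STATE_TRIMMING,
	BTRFS_TRIM_STATE_UNTRIMMED,
};

/* Hypothetical, cut-down stand-in for the free space entry. */
struct sketch_free_space {
	bool is_bitmap;
	enum btrfs_trim_state trim_state;
};

/* A full bitmap scan is starting; only bitmaps may enter TRIMMING. */
static void sketch_begin_trim(struct sketch_free_space *e)
{
	assert(e->is_bitmap);
	e->trim_state = BTRFS_TRIM_STATE_TRIMMING;
}

/* New free space landed in the entry, so any scan in progress is stale. */
static void sketch_add_free_space(struct sketch_free_space *e)
{
	e->trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
}

/* The scan reached the end; mark clean only if nothing was added meanwhile. */
static void sketch_end_trim(struct sketch_free_space *e)
{
	if (e->trim_state == BTRFS_TRIM_STATE_TRIMMING)
		e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
}
```

With this ordering of enumerators a zero-initialized entry reads as TRIMMED, so
a real entry would presumably need its state set explicitly when it is created.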

* Re: [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue
  2019-10-07 20:17 ` [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
@ 2019-10-10 14:38   ` Josef Bacik
  2019-10-15 12:49   ` David Sterba
  1 sibling, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 14:38 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:36PM -0400, Dennis Zhou wrote:
> When discard is enabled, every time a pinned extent is released back to
> the block_group's free space cache, a discard is issued for the extent.
> This is an overeager approach when it comes to discarding and helping
> the SSD maintain enough free space to prevent severe garbage collection
> situations.
> 
> This adds the beginning of async discard. Instead of issuing a discard
> prior to returning it to the free space, it is just marked as untrimmed.
> The block_group is then added to a LRU which then feeds into a workqueue
> to issue discards at a much slower rate. Full discarding of unused block
> groups is still done and will be addressed in a future patch in this
> series.
> 
> For now, we don't persist the discard state of extents and bitmaps.
> Therefore, our failure recovery mode will be to consider extents
> untrimmed. This lets us handle failure and unmounting as one and the
> same.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

* Re: [PATCH 06/19] btrfs: handle empty block_group removal
  2019-10-07 20:17 ` [PATCH 06/19] btrfs: handle empty block_group removal Dennis Zhou
@ 2019-10-10 15:00   ` Josef Bacik
  2019-10-11 16:52     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:00 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:37PM -0400, Dennis Zhou wrote:
> block_group removal is a little tricky. It can race with the extent
> allocator, the cleaner thread, and balancing. The current path is for a
> block_group to be added to the unused_bgs list. Then, when the cleaner
> thread comes around, it starts a transaction and then proceeds with
> removing the block_group. Extents that are pinned are subsequently
> removed from the pinned trees and then eventually a discard is issued
> for the entire block_group.
> 
> Async discard introduces another player into the game, the discard
> workqueue. While it has none of the racing issues, the new problem is
> ensuring we don't leave free space untrimmed prior to forgetting the
> block_group.  This is handled by placing fully free block_groups on a
> separate discard queue. This is necessary to maintain discarding order
> as in the future we will slowly trim even fully free block_groups. The
> ordering helps us make progress on the same block_group rather than say
> the last fully freed block_group or needing to search through the fully
> freed block groups at the beginning of a list and insert after.
> 
> The new order of events is a fully freed block group gets placed on the
> discard queue first. Once it's processed, it will be placed on the
> unused_bgs list and then the original sequence of events will happen,
> just without the final whole block_group discard.
> 
> The mount flags can change when processing unused_bgs, so when flipping
> from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
> discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
> free block groups on the discard_list to the unused_bg queue which will
> do the final discard for us.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/block-group.c      | 39 ++++++++++++++++++---
>  fs/btrfs/ctree.h            |  2 +-
>  fs/btrfs/discard.c          | 68 ++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/discard.h          | 11 +++++-
>  fs/btrfs/free-space-cache.c | 33 ++++++++++++++++++
>  fs/btrfs/free-space-cache.h |  1 +
>  fs/btrfs/scrub.c            |  7 +++-
>  7 files changed, 153 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 8bbbe7488328..73e5a9384491 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1251,6 +1251,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  	struct btrfs_block_group_cache *block_group;
>  	struct btrfs_space_info *space_info;
>  	struct btrfs_trans_handle *trans;
> +	bool async_trim_enabled = btrfs_test_opt(fs_info, DISCARD_ASYNC);
>  	int ret = 0;
>  
>  	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
> @@ -1260,6 +1261,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  	while (!list_empty(&fs_info->unused_bgs)) {
>  		u64 start, end;
>  		int trimming;
> +		bool async_trimmed;
>  
>  		block_group = list_first_entry(&fs_info->unused_bgs,
>  					       struct btrfs_block_group_cache,
> @@ -1281,10 +1283,20 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		/* Don't want to race with allocators so take the groups_sem */
>  		down_write(&space_info->groups_sem);
>  		spin_lock(&block_group->lock);
> +
> +		/* async discard requires block groups to be fully trimmed */
> +		async_trimmed = (!btrfs_test_opt(fs_info, DISCARD_ASYNC) ||
> +				 btrfs_is_free_space_trimmed(block_group));
> +
>  		if (block_group->reserved || block_group->pinned ||
>  		    btrfs_block_group_used(&block_group->item) ||
>  		    block_group->ro ||
> -		    list_is_singular(&block_group->list)) {
> +		    list_is_singular(&block_group->list) ||
> +		    !async_trimmed) {
> +			/* requeue if we failed because of async discard */
> +			if (!async_trimmed)
> +				btrfs_discard_queue_work(&fs_info->discard_ctl,
> +							 block_group);
>  			/*
>  			 * We want to bail if we made new allocations or have
>  			 * outstanding allocations in this block group.  We do
> @@ -1367,6 +1379,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		spin_unlock(&block_group->lock);
>  		spin_unlock(&space_info->lock);
>  
> +		if (!async_trim_enabled &&
> +		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
> +			goto flip_async;
> +

This took me a minute to grok, please add a comment indicating that this is
meant to catch the case that we flipped from no async to async and thus need to
kick off the async trim work now before removing the unused bg.

>  		/* DISCARD can flip during remount */
>  		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
>  
> @@ -1411,6 +1427,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		spin_lock(&fs_info->unused_bgs_lock);
>  	}
>  	spin_unlock(&fs_info->unused_bgs_lock);
> +	return;
> +
> +flip_async:
> +	btrfs_end_transaction(trans);
> +	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +	btrfs_put_block_group(block_group);
> +	btrfs_discard_punt_unused_bgs_list(fs_info);
>  }
>  
>  void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
> @@ -1618,6 +1641,8 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache(
>  	cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
>  	set_free_space_tree_thresholds(cache);
>  
> +	cache->discard_index = 1;
> +
>  	atomic_set(&cache->count, 1);
>  	spin_lock_init(&cache->lock);
>  	init_rwsem(&cache->data_rwsem);
> @@ -1829,7 +1854,11 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  			inc_block_group_ro(cache, 1);
>  		} else if (btrfs_block_group_used(&cache->item) == 0) {
>  			ASSERT(list_empty(&cache->bg_list));
> -			btrfs_mark_bg_unused(cache);
> +			if (btrfs_test_opt(info, DISCARD_ASYNC))
> +				btrfs_add_to_discard_free_list(
> +						&info->discard_ctl, cache);
> +			else
> +				btrfs_mark_bg_unused(cache);
>  		}
>  	}
>  
> @@ -2724,8 +2753,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>  		 * dirty list to avoid races between cleaner kthread and space
>  		 * cache writeout.
>  		 */
> -		if (!alloc && old_val == 0)
> -			btrfs_mark_bg_unused(cache);
> +		if (!alloc && old_val == 0) {
> +			if (!btrfs_test_opt(info, DISCARD_ASYNC))
> +				btrfs_mark_bg_unused(cache);
> +		}
>  
>  		btrfs_put_block_group(cache);
>  		total -= num_bytes;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 419445868909..c328d2e85e4d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
>  };
>  
>  /* discard control */
> -#define BTRFS_NR_DISCARD_LISTS		1
> +#define BTRFS_NR_DISCARD_LISTS		2
>  
>  struct btrfs_discard_ctl {
>  	struct workqueue_struct *discard_workers;
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 6df124639e55..fb92b888774d 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -29,8 +29,11 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  
>  	spin_lock(&discard_ctl->lock);
>  
> -	if (list_empty(&cache->discard_list))
> +	if (list_empty(&cache->discard_list) || !cache->discard_index) {
> +		if (!cache->discard_index)
> +			cache->discard_index = 1;

Need a #define for this so it's clear what our intention is, I hate magic
numbers.
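
A minimal sketch of what named list indices could look like, assuming list 0
holds fully free block groups and list 1 holds everything else, as in the hunk
above. All TOY_* names and the toy_* helpers are hypothetical; the patch itself
uses the bare values 0 and 1.

```c
#include <assert.h>

/* Hypothetical names for the two discard lists. */
enum toy_discard_index {
	TOY_DISCARD_INDEX_UNUSED = 0,	/* fully free block groups */
	TOY_DISCARD_INDEX_START = 1,	/* block groups that still hold data */
	TOY_NR_DISCARD_LISTS = 2,
};

/* Cut-down stand-in for the block group cache. */
struct toy_bg {
	int discard_index;
};

/* Regular queueing: a block group leaving the unused list moves to list 1. */
static void toy_add_to_discard_list(struct toy_bg *bg)
{
	assert(bg->discard_index >= 0 &&
	       bg->discard_index < TOY_NR_DISCARD_LISTS);
	if (bg->discard_index == TOY_DISCARD_INDEX_UNUSED)
		bg->discard_index = TOY_DISCARD_INDEX_START;
}

/* Fully free block groups always go to the unused list. */
static void toy_add_to_discard_free_list(struct toy_bg *bg)
{
	bg->discard_index = TOY_DISCARD_INDEX_UNUSED;
}
```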

>  		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
> +	}
>  
>  	list_move_tail(&cache->discard_list,
>  		       btrfs_get_discard_list(discard_ctl, cache));
> @@ -38,6 +41,23 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  	spin_unlock(&discard_ctl->lock);
>  }
>  
> +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> +				    struct btrfs_block_group_cache *cache)
> +{
> +	u64 now = ktime_get_ns();
> +
> +	spin_lock(&discard_ctl->lock);
> +
> +	if (!list_empty(&cache->discard_list))
> +		list_del_init(&cache->discard_list);
> +
> +	cache->discard_index = 0;
> +	cache->discard_delay = now;
> +	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
> +
> +	spin_unlock(&discard_ctl->lock);
> +}
> +
>  static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  				     struct btrfs_block_group_cache *cache)
>  {
> @@ -161,10 +181,52 @@ static void btrfs_discard_workfn(struct work_struct *work)
>  			       btrfs_block_group_end(cache), 0);
>  
>  	remove_from_discard_list(discard_ctl, cache);
> +	if (btrfs_is_free_space_trimmed(cache))
> +		btrfs_mark_bg_unused(cache);
> +	else if (cache->free_space_ctl->free_space == cache->key.offset)
> +		btrfs_add_to_discard_free_list(discard_ctl, cache);

This needs to be

else if (btrfs_block_group_used(&cache->item) == 0)

because we exclude the super mirrors from the free space cache, so for a
completely empty block group free_space != cache->key.offset.
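
To illustrate the point, here is a toy model, assuming only that bytes reserved
for a superblock mirror are never added to the free space total. The struct,
its fields and the helper names are hypothetical; the sizes in the usage note
are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cut-down stand-in for a block group. */
struct toy_bg {
	uint64_t length;	/* plays the role of cache->key.offset */
	uint64_t used;		/* bytes allocated to extents */
	uint64_t excluded;	/* bytes reserved for super mirrors */
};

/* Free space as the cache would see it: excluded bytes never show up. */
static uint64_t toy_free_space(const struct toy_bg *bg)
{
	assert(bg->used + bg->excluded <= bg->length);
	return bg->length - bg->used - bg->excluded;
}

/* The check used in the patch: misses empty groups that hold a mirror. */
static bool toy_empty_by_free_space(const struct toy_bg *bg)
{
	return toy_free_space(bg) == bg->length;
}

/* The check asked for in the review: only looks at allocated bytes. */
static bool toy_empty_by_used(const struct toy_bg *bg)
{
	return bg->used == 0;
}
```

For a 1 GiB block group with nothing allocated and 64 KiB excluded, the first
check reports "not empty" while the second reports "empty"; the two agree only
when nothing is excluded.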

>  
>  	btrfs_discard_schedule_work(discard_ctl, false);
>  }
>  
> +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_block_group_cache *cache, *next;
> +
> +	/* we enabled async discard, so punt all to the queue */
> +	spin_lock(&fs_info->unused_bgs_lock);
> +
> +	list_for_each_entry_safe(cache, next, &fs_info->unused_bgs, bg_list) {
> +		list_del_init(&cache->bg_list);
> +		btrfs_add_to_discard_free_list(&fs_info->discard_ctl, cache);
> +	}
> +
> +	spin_unlock(&fs_info->unused_bgs_lock);
> +}
> +
> +static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl)
> +{
> +	struct btrfs_block_group_cache *cache, *next;
> +	int i;
> +
> +	spin_lock(&discard_ctl->lock);
> +
> +	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
> +		list_for_each_entry_safe(cache, next,
> +					 &discard_ctl->discard_list[i],
> +					 discard_list) {
> +			list_del_init(&cache->discard_list);
> +			spin_unlock(&discard_ctl->lock);
> +			if (cache->free_space_ctl->free_space ==
> +			    cache->key.offset)
> +				btrfs_mark_bg_unused(cache);
> +			spin_lock(&discard_ctl->lock);
> +		}
> +	}
> +
> +	spin_unlock(&discard_ctl->lock);
> +}
> +
>  void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
>  {
>  	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
> @@ -172,6 +234,8 @@ void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
>  		return;
>  	}
>  
> +	btrfs_discard_punt_unused_bgs_list(fs_info);
> +
>  	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
>  }
>  
> @@ -197,4 +261,6 @@ void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
>  {
>  	btrfs_discard_stop(fs_info);
>  	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
> +
> +	btrfs_discard_purge_list(&fs_info->discard_ctl);
>  }
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index 6d7805bb0eb7..55f79b624943 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -10,9 +10,14 @@
>  #include <linux/workqueue.h>
>  
>  #include "ctree.h"
> +#include "block-group.h"
> +#include "free-space-cache.h"
>  
>  void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  			       struct btrfs_block_group_cache *cache);
> +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> +				    struct btrfs_block_group_cache *cache);
> +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
>  
>  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
>  			       struct btrfs_block_group_cache *cache);
> @@ -41,7 +46,11 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
>  	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
>  		return;
>  
> -	btrfs_add_to_discard_list(discard_ctl, cache);
> +	if (cache->free_space_ctl->free_space == cache->key.offset)
> +		btrfs_add_to_discard_free_list(discard_ctl, cache);

Same here.

> +	else
> +		btrfs_add_to_discard_list(discard_ctl, cache);
> +
>  	if (!delayed_work_pending(&discard_ctl->work))
>  		btrfs_discard_schedule_work(discard_ctl, false);
>  }
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 54ff1bc97777..ed0e7ee4c78d 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -2653,6 +2653,31 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
>  
>  }
>  
> +bool btrfs_is_free_space_trimmed(struct btrfs_block_group_cache *cache)
> +{
> +	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
> +	struct btrfs_free_space *info;
> +	struct rb_node *node;
> +	bool ret = true;
> +
> +	spin_lock(&ctl->tree_lock);
> +	node = rb_first(&ctl->free_space_offset);
> +
> +	while (node) {
> +		info = rb_entry(node, struct btrfs_free_space, offset_index);
> +
> +		if (!btrfs_free_space_trimmed(info)) {
> +			ret = false;
> +			break;
> +		}
> +
> +		node = rb_next(node);
> +	}
> +
> +	spin_unlock(&ctl->tree_lock);
> +	return ret;
> +}
> +
>  u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
>  			       u64 offset, u64 bytes, u64 empty_size,
>  			       u64 *max_extent_size)
> @@ -2739,6 +2764,9 @@ int btrfs_return_cluster_to_free_space(
>  	ret = __btrfs_return_cluster_to_free_space(block_group, cluster);
>  	spin_unlock(&ctl->tree_lock);
>  
> +	btrfs_discard_queue_work(&block_group->fs_info->discard_ctl,
> +				 block_group);
> +
>  	/* finally drop our ref */
>  	btrfs_put_block_group(block_group);
>  	return ret;
> @@ -3097,6 +3125,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
>  	u64 min_bytes;
>  	u64 cont1_bytes;
>  	int ret;
> +	bool found_cluster = false;
>  
>  	/*
>  	 * Choose the minimum extent size we'll require for this
> @@ -3149,6 +3178,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
>  		list_del_init(&entry->list);
>  
>  	if (!ret) {
> +		found_cluster = true;
>  		atomic_inc(&block_group->count);
>  		list_add_tail(&cluster->block_group_list,
>  			      &block_group->cluster_list);
> @@ -3160,6 +3190,9 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
>  	spin_unlock(&cluster->lock);
>  	spin_unlock(&ctl->tree_lock);
>  
> +	if (found_cluster)
> +		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> +
>  	return ret;
>  }

This bit here seems unrelated to the rest of the change.  Why do we want to
cancel the discard work if we find a cluster?  And what does it have to do with
unused bgs?  If it's the allocation part that makes you want to cancel then that
sort of makes sense in the unused bg context, but this is happening no matter
what.  It should probably be in its own patch with an explanation, or at least
in a different patch.  Thanks,

Josef

* Re: [PATCH 07/19] btrfs: discard one region at a time in async discard
  2019-10-07 20:17 ` [PATCH 07/19] btrfs: discard one region at a time in async discard Dennis Zhou
@ 2019-10-10 15:22   ` Josef Bacik
  2019-10-14 19:42     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:22 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:38PM -0400, Dennis Zhou wrote:
> The prior two patches added discarding via a background workqueue. This
> just piggybacked off of the fstrim code to trim the whole block group at
> once. Inevitably, this is worse performance-wise and will aggressively
> overtrim. But it was nice to plumb the other infrastructure to keep the
> patches easier to review.
> 
> This adds the real goal of this series which is discarding slowly (ie a
> slow long running fstrim). The discarding is split into two phases,
> extents and then bitmaps. The reason for this is twofold. First, the
> bitmap regions overlap the extent regions. Second, discarding the
> extents first will let the newly trimmed bitmaps have the highest chance
> of coalescing when being readded to the free space cache.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/block-group.h      |   2 +
>  fs/btrfs/discard.c          |  73 ++++++++++++++++++++-----
>  fs/btrfs/discard.h          |  16 ++++++
>  fs/btrfs/extent-tree.c      |   3 +-
>  fs/btrfs/free-space-cache.c | 106 ++++++++++++++++++++++++++----------
>  fs/btrfs/free-space-cache.h |   6 +-
>  6 files changed, 159 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 0f9a1c91753f..b59e6a8ed73d 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -120,6 +120,8 @@ struct btrfs_block_group_cache {
>  	struct list_head discard_list;
>  	int discard_index;
>  	u64 discard_delay;
> +	u64 discard_cursor;
> +	u32 discard_flags;
>  

Same comment as the free space flags, this is just a state holder and never has
more than one bit set, so switch it to an enum and treat it like a state.

>  	/* For dirty block groups */
>  	struct list_head dirty_list;
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index fb92b888774d..26a1e44b4bfa 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -22,21 +22,28 @@ btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  	return &discard_ctl->discard_list[cache->discard_index];
>  }
>  
> -void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> -			       struct btrfs_block_group_cache *cache)
> +static void __btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> +					struct btrfs_block_group_cache *cache)
>  {
>  	u64 now = ktime_get_ns();
>  
> -	spin_lock(&discard_ctl->lock);
> -
>  	if (list_empty(&cache->discard_list) || !cache->discard_index) {
>  		if (!cache->discard_index)
>  			cache->discard_index = 1;
>  		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
> +		cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
>  	}
>  
>  	list_move_tail(&cache->discard_list,
>  		       btrfs_get_discard_list(discard_ctl, cache));
> +}
> +
> +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> +			       struct btrfs_block_group_cache *cache)
> +{
> +	spin_lock(&discard_ctl->lock);
> +
> +	__btrfs_add_to_discard_list(discard_ctl, cache);
>  
>  	spin_unlock(&discard_ctl->lock);
>  }
> @@ -53,6 +60,7 @@ void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
>  
>  	cache->discard_index = 0;
>  	cache->discard_delay = now;
> +	cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
>  	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
>  
>  	spin_unlock(&discard_ctl->lock);
> @@ -114,13 +122,24 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
>  
>  	spin_lock(&discard_ctl->lock);
>  
> +again:
>  	cache = find_next_cache(discard_ctl, now);
>  
> -	if (cache && now < cache->discard_delay)
> +	if (cache && now > cache->discard_delay) {
> +		discard_ctl->cache = cache;
> +		if (cache->discard_index == 0 &&
> +		    cache->free_space_ctl->free_space != cache->key.offset) {
> +			__btrfs_add_to_discard_list(discard_ctl, cache);
> +			goto again;

The magic number thing again, it needs to be discard_index == UNUSED_DISCARD or
some such descriptive thing.  Also, the emptiness check needs to use
btrfs_block_group_used(&cache->item) (!= 0 here) instead of free_space.

> +		}
> +		if (btrfs_discard_reset_cursor(cache)) {
> +			cache->discard_cursor = cache->key.objectid;
> +			cache->discard_flags &= ~(BTRFS_DISCARD_RESET_CURSOR |
> +						  BTRFS_DISCARD_BITMAPS);
> +		}
> +	} else {
>  		cache = NULL;
> -
> -	discard_ctl->cache = cache;
> -
> +	}
>  	spin_unlock(&discard_ctl->lock);
>  
>  	return cache;
> @@ -173,18 +192,42 @@ static void btrfs_discard_workfn(struct work_struct *work)
>  
>  	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
>  
> +again:
>  	cache = peek_discard_list(discard_ctl);
>  	if (!cache || !btrfs_run_discard_work(discard_ctl))
>  		return;
>  
> -	btrfs_trim_block_group(cache, &trimmed, cache->key.objectid,
> -			       btrfs_block_group_end(cache), 0);
> +	if (btrfs_discard_bitmaps(cache))
> +		btrfs_trim_block_group_bitmaps(cache, &trimmed,
> +					       cache->discard_cursor,
> +					       btrfs_block_group_end(cache),
> +					       0, true);
> +	else
> +		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
> +				       btrfs_block_group_end(cache), 0, true);
> +
> +	if (cache->discard_cursor >= btrfs_block_group_end(cache)) {
> +		if (btrfs_discard_bitmaps(cache)) {
> +			remove_from_discard_list(discard_ctl, cache);
> +			if (btrfs_is_free_space_trimmed(cache))
> +				btrfs_mark_bg_unused(cache);
> +			else if (cache->free_space_ctl->free_space ==
> +				 cache->key.offset)

btrfs_block_group_used(&cache->item) == 0;

> +				btrfs_add_to_discard_free_list(discard_ctl,
> +							       cache);
> +		} else {
> +			cache->discard_cursor = cache->key.objectid;
> +			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
> +		}
> +	}
> +
> +	spin_lock(&discard_ctl->lock);
> +	discard_ctl->cache = NULL;
> +	spin_unlock(&discard_ctl->lock);
>  
> -	remove_from_discard_list(discard_ctl, cache);
> -	if (btrfs_is_free_space_trimmed(cache))
> -		btrfs_mark_bg_unused(cache);
> -	else if (cache->free_space_ctl->free_space == cache->key.offset)
> -		btrfs_add_to_discard_free_list(discard_ctl, cache);
> +	/* we didn't trim anything but we really ought to so try again */
> +	if (trimmed == 0)
> +		goto again;

Why?  We'll reschedule if we need to.  We unconditionally do this, and I feel
like there's going to be some corner case where we end up seeing this workfn
using 100% CPU on a bunch of sandcastle boxes and we'll all be super sad.
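
One way to picture the concern is a toy work function in which every pass
trims nothing. The sketch below bounds the retries so the function returns and
leaves it to the caller to reschedule. It is an illustration of that
alternative only, not the fix adopted in the series; TOY_MAX_EMPTY_PASSES, the
toy queue and the helper names are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EMPTY_PASSES 4	/* hypothetical cap, not from the patch */

/* Toy queue: trimmable[i] is how many bytes pass i manages to trim. */
struct toy_queue {
	const uint64_t *trimmable;
	size_t nr;
	size_t next;
};

/* Run one pass; report whether there was an entry and how much it trimmed. */
static uint64_t toy_trim_one(struct toy_queue *q, int *had_entry)
{
	*had_entry = q->next < q->nr;
	return *had_entry ? q->trimmable[q->next++] : 0;
}

/*
 * One work invocation: retry passes that trimmed nothing, but give up after
 * TOY_MAX_EMPTY_PASSES so the caller can reschedule instead of spinning.
 * Returns the number of passes that were run.
 */
static int toy_workfn(struct toy_queue *q, uint64_t *trimmed)
{
	int passes = 0;
	int had_entry;

	assert(q && trimmed);
	*trimmed = 0;
	while (passes < TOY_MAX_EMPTY_PASSES) {
		*trimmed = toy_trim_one(q, &had_entry);
		passes++;
		if (!had_entry || *trimmed)
			break;
	}
	return passes;
}
```

With an unbounded retry, a queue whose entries never yield trimmed bytes would
keep the worker busy for as long as entries are available; with the cap, the
worst case per invocation is a fixed number of passes.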

>  
>  	btrfs_discard_schedule_work(discard_ctl, false);
>  }
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index 55f79b624943..22cfa7e401bb 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -13,6 +13,22 @@
>  #include "block-group.h"
>  #include "free-space-cache.h"
>  
> +/* discard flags */
> +#define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
> +#define BTRFS_DISCARD_BITMAPS           (1UL << 1)
> +
> +static inline
> +bool btrfs_discard_reset_cursor(struct btrfs_block_group_cache *cache)
> +{
> +	return (cache->discard_flags & BTRFS_DISCARD_RESET_CURSOR);
> +}
> +
> +static inline
> +bool btrfs_discard_bitmaps(struct btrfs_block_group_cache *cache)
> +{
> +	return (cache->discard_flags & BTRFS_DISCARD_BITMAPS);
> +}
> +
>  void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  			       struct btrfs_block_group_cache *cache);
>  void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d69ee5f51b38..ff42e4abb01d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5683,7 +5683,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>  						     &group_trimmed,
>  						     start,
>  						     end,
> -						     range->minlen);
> +						     range->minlen,
> +						     false);
>  
>  			trimmed += group_trimmed;
>  			if (ret) {
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index ed0e7ee4c78d..97b3074e83c0 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3267,7 +3267,8 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
>  }
>  
>  static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> -			  u64 *total_trimmed, u64 start, u64 end, u64 minlen)
> +			  u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> +			  bool async)
>  {
>  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>  	struct btrfs_free_space *entry;
> @@ -3284,36 +3285,25 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
>  		mutex_lock(&ctl->cache_writeout_mutex);
>  		spin_lock(&ctl->tree_lock);
>  
> -		if (ctl->free_space < minlen) {
> -			spin_unlock(&ctl->tree_lock);
> -			mutex_unlock(&ctl->cache_writeout_mutex);
> -			break;
> -		}
> +		if (ctl->free_space < minlen)
> +			goto out_unlock;
>  
>  		entry = tree_search_offset(ctl, start, 0, 1);
> -		if (!entry) {
> -			spin_unlock(&ctl->tree_lock);
> -			mutex_unlock(&ctl->cache_writeout_mutex);
> -			break;
> -		}
> +		if (!entry)
> +			goto out_unlock;
>  
>  		/* skip bitmaps */
> -		while (entry->bitmap) {
> +		while (entry->bitmap || (async &&
> +					 btrfs_free_space_trimmed(entry))) {

Update the comment to say we're skipping already trimmed entries as well please.

>  			node = rb_next(&entry->offset_index);
> -			if (!node) {
> -				spin_unlock(&ctl->tree_lock);
> -				mutex_unlock(&ctl->cache_writeout_mutex);
> -				goto out;
> -			}
> +			if (!node)
> +				goto out_unlock;
>  			entry = rb_entry(node, struct btrfs_free_space,
>  					 offset_index);
>  		}
>  
> -		if (entry->offset >= end) {
> -			spin_unlock(&ctl->tree_lock);
> -			mutex_unlock(&ctl->cache_writeout_mutex);
> -			break;
> -		}
> +		if (entry->offset >= end)
> +			goto out_unlock;
>  
>  		extent_start = entry->offset;
>  		extent_bytes = entry->bytes;
> @@ -3338,10 +3328,15 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
>  		ret = do_trimming(block_group, total_trimmed, start, bytes,
>  				  extent_start, extent_bytes, extent_flags,
>  				  &trim_entry);
> -		if (ret)
> +		if (ret) {
> +			block_group->discard_cursor = start + bytes;
>  			break;
> +		}
>  next:
>  		start += bytes;
> +		block_group->discard_cursor = start;
> +		if (async && *total_trimmed)
> +			break;

Alright so this means we'll only trim one entry and then return if we're async?
It seems to be the same below for bitmaps.  This deserves a comment for the
functions, it fundamentally changes the behavior of the function if we're async.
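
The behavioural difference can be sketched with a toy version of the extent
walk, assuming a sorted array of free extents and a cursor that records where
the next call should resume. The struct, the function and the simplifications
(no locking, no clamping of an extent that straddles the end) are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for a free extent entry. */
struct toy_extent {
	uint64_t start;
	uint64_t bytes;
	bool trimmed;
};

/*
 * Trim free extents starting at or after *cursor.
 *
 * !async: trim every untrimmed extent in range, like fstrim.
 * async:  stop after the first extent actually trimmed and leave *cursor
 *         just past it, so the next call resumes from there.
 *
 * Extents must be sorted by start. Returns the bytes trimmed; *cursor is set
 * to end once the walk finds nothing left to trim.
 */
static uint64_t toy_trim_extents(struct toy_extent *ext, size_t nr,
				 uint64_t *cursor, uint64_t end, bool async)
{
	uint64_t total = 0;

	assert(*cursor <= end);
	for (size_t i = 0; i < nr; i++) {
		if (ext[i].start < *cursor || ext[i].trimmed)
			continue;	/* behind the cursor or already clean */
		if (ext[i].start >= end)
			break;
		ext[i].trimmed = true;	/* stands in for issuing the discard */
		total += ext[i].bytes;
		*cursor = ext[i].start + ext[i].bytes;
		if (async)
			return total;
	}
	*cursor = end;
	return total;
}
```

In async mode each call does at most one discard, so walking a block group
takes as many calls as there are untrimmed extents, plus a final call that
finds nothing and moves the cursor to the end.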

>  
>  		if (fatal_signal_pending(current)) {
>  			ret = -ERESTARTSYS;
> @@ -3350,7 +3345,14 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
>  
>  		cond_resched();
>  	}
> -out:
> +
> +	return ret;
> +
> +out_unlock:
> +	block_group->discard_cursor = btrfs_block_group_end(block_group);
> +	spin_unlock(&ctl->tree_lock);
> +	mutex_unlock(&ctl->cache_writeout_mutex);
> +
>  	return ret;
>  }
>  
> @@ -3390,7 +3392,8 @@ static void end_trimming_bitmap(struct btrfs_free_space *entry)
>  }
>  
>  static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> -			u64 *total_trimmed, u64 start, u64 end, u64 minlen)
> +			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> +			bool async)
>  {
>  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>  	struct btrfs_free_space *entry;
> @@ -3407,13 +3410,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  		spin_lock(&ctl->tree_lock);
>  
>  		if (ctl->free_space < minlen) {
> +			block_group->discard_cursor =
> +				btrfs_block_group_end(block_group);
>  			spin_unlock(&ctl->tree_lock);
>  			mutex_unlock(&ctl->cache_writeout_mutex);
>  			break;
>  		}
>  
>  		entry = tree_search_offset(ctl, offset, 1, 0);
> -		if (!entry) {
> +		if (!entry || (async && start == offset &&
> +			       btrfs_free_space_trimmed(entry))) {
>  			spin_unlock(&ctl->tree_lock);
>  			mutex_unlock(&ctl->cache_writeout_mutex);
>  			next_bitmap = true;
> @@ -3446,6 +3452,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  			goto next;
>  		}
>  
> +		/*
> +		 * We already trimmed a region, but are using the locking above
> +		 * to reset the BTRFS_FSC_TRIMMING_BITMAP flag.
> +		 */
> +		if (async && *total_trimmed) {
> +			spin_unlock(&ctl->tree_lock);
> +			mutex_unlock(&ctl->cache_writeout_mutex);
> +			return ret;
> +		}
> +
>  		bytes = min(bytes, end - start);
>  		if (bytes < minlen) {
>  			entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
> @@ -3468,6 +3484,8 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  				  start, bytes, 0, &trim_entry);
>  		if (ret) {
>  			reset_trimming_bitmap(ctl, offset);
> +			block_group->discard_cursor =
> +				btrfs_block_group_end(block_group);
>  			break;
>  		}
>  next:
> @@ -3477,6 +3495,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  		} else {
>  			start += bytes;
>  		}
> +		block_group->discard_cursor = start;
>  
>  		if (fatal_signal_pending(current)) {
>  			if (start != offset)
> @@ -3488,6 +3507,9 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  		cond_resched();
>  	}
>  
> +	if (offset >= end)
> +		block_group->discard_cursor = end;
> +
>  	return ret;
>  }
>  
> @@ -3532,7 +3554,8 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
>  }
>  
>  int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> -			   u64 *trimmed, u64 start, u64 end, u64 minlen)
> +			   u64 *trimmed, u64 start, u64 end, u64 minlen,
> +			   bool async)
>  {
>  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>  	int ret;
> @@ -3547,11 +3570,11 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>  	btrfs_get_block_group_trimming(block_group);
>  	spin_unlock(&block_group->lock);
>  
> -	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> -	if (ret)
> +	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen, async);
> +	if (ret || async)
>  		goto out;
>  

You already separate out btrfs_trim_block_group_bitmaps, so this function really
only trims the whole block group if !async.  Make a separate helper for trimming
only the extents so it's clear what the async stuff is doing, and we're not
relying on the async flag to change the behavior of btrfs_trim_block_group().
Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/19] btrfs: track discardable extents for asnyc discard
  2019-10-07 20:17 ` [PATCH 08/19] btrfs: track discardable extents for asnyc discard Dennis Zhou
@ 2019-10-10 15:36   ` Josef Bacik
  2019-10-14 19:50     ` Dennis Zhou
  2019-10-15 13:12   ` David Sterba
  1 sibling, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:36 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:39PM -0400, Dennis Zhou wrote:
> The number of discardable extents will serve as the rate limiting metric
> for how often we should discard. This keeps track of discardable extents
> in the free space caches by maintaining deltas and propagating them to
> the global count.
> 
> This also setups up a discard directory in btrfs sysfs and exports the
> total discard_extents count.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/ctree.h            |  2 +
>  fs/btrfs/discard.c          |  2 +
>  fs/btrfs/discard.h          | 19 ++++++++
>  fs/btrfs/free-space-cache.c | 93 ++++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.h |  2 +
>  fs/btrfs/sysfs.c            | 33 +++++++++++++
>  6 files changed, 144 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c328d2e85e4d..43e515939b9c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -447,6 +447,7 @@ struct btrfs_discard_ctl {
>  	spinlock_t lock;
>  	struct btrfs_block_group_cache *cache;
>  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> +	atomic_t discard_extents;
>  };
>  
>  /* delayed seq elem */
> @@ -831,6 +832,7 @@ struct btrfs_fs_info {
>  	struct btrfs_workqueue *scrub_wr_completion_workers;
>  	struct btrfs_workqueue *scrub_parity_workers;
>  
> +	struct kobject *discard_kobj;
>  	struct btrfs_discard_ctl discard_ctl;
>  
>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 26a1e44b4bfa..0544eb6717d4 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -298,6 +298,8 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
>  
>  	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
>  		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
> +
> +	atomic_set(&discard_ctl->discard_extents, 0);
>  }
>  
>  void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index 22cfa7e401bb..85939d62521e 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -71,4 +71,23 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
>  		btrfs_discard_schedule_work(discard_ctl, false);
>  }
>  
> +static inline
> +void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
> +				      struct btrfs_free_space_ctl *ctl)
> +{
> +	struct btrfs_discard_ctl *discard_ctl;
> +	s32 extents_delta;
> +
> +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> +		return;
> +
> +	discard_ctl = &cache->fs_info->discard_ctl;
> +
> +	extents_delta = ctl->discard_extents[0] - ctl->discard_extents[1];
> +	if (extents_delta) {
> +		atomic_add(extents_delta, &discard_ctl->discard_extents);
> +		ctl->discard_extents[1] = ctl->discard_extents[0];
> +	}

What the actual fuck?  I assume you did this to avoid checking DISCARD_ASYNC on
every update, but man this complexity is not worth it.  We might as well update
the counter every time to avoid doing stuff like this.

If there's a better reason for doing it this way then I'm all ears, but even so
this is not the way to do it.  Just do

atomic_add(ctl->discard_extents, &discard_ctl->discard_extents);
ctl->discard_extents = 0;

and avoid the two step thing.  And a comment, because it was like 5 minutes
between me seeing this and getting to your reasoning, and in between there was a
lot of swearing.  Thanks,

Josef
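
The single-step accounting suggested above could look roughly like the
following. This is only a userspace sketch using C11 atomics, not the kernel
code: struct discard_ctl_model, struct fsc_model and flush_discardable() are
hypothetical names, and the per-ctl discard_extents field is assumed to be a
signed delta that the caller protects with its own lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for btrfs_discard_ctl: one global counter. */
struct discard_ctl_model {
	atomic_int discard_extents;
};

/* Hypothetical stand-in for btrfs_free_space_ctl: a pending signed delta. */
struct fsc_model {
	int32_t discard_extents;	/* change since the last flush */
};

/* Fold the pending delta into the global count, then reset the delta. */
static void flush_discardable(struct discard_ctl_model *dctl,
			      struct fsc_model *ctl)
{
	assert(dctl && ctl);

	if (ctl->discard_extents) {
		atomic_fetch_add(&dctl->discard_extents, ctl->discard_extents);
		ctl->discard_extents = 0;
	}
}
```

With this shape the free space code only ever adjusts the local delta, and the
global counter is touched once per flush rather than compared against a saved
copy.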

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/19] btrfs: keep track of discardable_bytes
  2019-10-07 20:17 ` [PATCH 09/19] btrfs: keep track of discardable_bytes Dennis Zhou
@ 2019-10-10 15:38   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:38 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:40PM -0400, Dennis Zhou wrote:
> Keep track of this metric so that we can understand how ahead or behind
> we are in discarding rate.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Same comment as the discard_extents patch.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 10/19] btrfs: calculate discard delay based on number of extents
  2019-10-07 20:17 ` [PATCH 10/19] btrfs: calculate discard delay based on number of extents Dennis Zhou
@ 2019-10-10 15:41   ` Josef Bacik
  2019-10-11 18:07     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:41 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:41PM -0400, Dennis Zhou wrote:
> Use the number of discardable extents to help guide our discard delay
> interval. This value is reevaluated every transaction commit.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/ctree.h       |  2 ++
>  fs/btrfs/discard.c     | 31 +++++++++++++++++++++++++++++--
>  fs/btrfs/discard.h     |  3 +++
>  fs/btrfs/extent-tree.c |  4 +++-
>  fs/btrfs/sysfs.c       | 30 ++++++++++++++++++++++++++++++
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8479ab037812..b0823961d049 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -449,6 +449,8 @@ struct btrfs_discard_ctl {
>  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
>  	atomic_t discard_extents;
>  	atomic64_t discardable_bytes;
> +	atomic_t delay;
> +	atomic_t iops_limit;
>  };
>  
>  /* delayed seq elem */
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 75a2ff14b3c0..c7afb5f8240d 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -15,6 +15,11 @@
>  
>  #define BTRFS_DISCARD_DELAY		(300ULL * NSEC_PER_SEC)
>  
> +/* target discard delay in milliseconds */
> +#define BTRFS_DISCARD_TARGET_MSEC	(6 * 60 * 60ULL * MSEC_PER_SEC)
> +#define BTRFS_DISCARD_MAX_DELAY		(10000UL)
> +#define BTRFS_DISCARD_MAX_IOPS		(10UL)
> +
>  static struct list_head *
>  btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  		       struct btrfs_block_group_cache *cache)
> @@ -170,10 +175,12 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
>  
>  	cache = find_next_cache(discard_ctl, now);
>  	if (cache) {
> -		u64 delay = 0;
> +		u64 delay = atomic_read(&discard_ctl->delay);
>  
>  		if (now < cache->discard_delay)
> -			delay = nsecs_to_jiffies(cache->discard_delay - now);
> +			delay = max_t(u64, delay,
> +				      nsecs_to_jiffies(cache->discard_delay -
> +						       now));

Small nit, instead

			delay = nsecs_to_jiffies(cache->discard_delay - now);
			delay = max_t(u64, delay,
				      atomic_read(&discard_ctl->delay));

Looks a little cleaner.  Otherwise

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
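
As a rough userspace model of the ordering suggested above: convert the
remaining per-block-group wait first, then take the max against the global
delay. MODEL_HZ, model_nsecs_to_jiffies() and pick_delay() are hypothetical
stand-ins; the real nsecs_to_jiffies() depends on the kernel's configured HZ,
which is simply assumed to be 250 here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HZ		250ULL		/* assumed tick rate */
#define MODEL_NSEC_PER_SEC	1000000000ULL

static_assert(MODEL_NSEC_PER_SEC % MODEL_HZ == 0,
	      "model assumes a whole number of nanoseconds per jiffy");

/* Simplified stand-in for the kernel's nsecs_to_jiffies(). */
static uint64_t model_nsecs_to_jiffies(uint64_t ns)
{
	return ns / (MODEL_NSEC_PER_SEC / MODEL_HZ);
}

/*
 * base_delay models atomic_read(&discard_ctl->delay), already in jiffies.
 * discard_delay_ns models cache->discard_delay, an absolute time in ns.
 */
static uint64_t pick_delay(uint64_t now, uint64_t discard_delay_ns,
			   uint64_t base_delay)
{
	uint64_t delay = base_delay;

	if (now < discard_delay_ns) {
		uint64_t wait = model_nsecs_to_jiffies(discard_delay_ns - now);

		if (wait > delay)
			delay = wait;
	}
	return delay;
}
```

Either ordering gives the same result; the difference is only in how the
expression reads.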

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 11/19] btrfs: add bps discard rate limit
  2019-10-07 20:17 ` [PATCH 11/19] btrfs: add bps discard rate limit Dennis Zhou
@ 2019-10-10 15:47   ` Josef Bacik
  2019-10-14 19:56     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 15:47 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:42PM -0400, Dennis Zhou wrote:
> Provide an ability to rate limit based on mbps in addition to the iops
> delay calculated from number of discardable extents.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/discard.c | 11 +++++++++++
>  fs/btrfs/sysfs.c   | 30 ++++++++++++++++++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b0823961d049..e81f699347e0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -447,10 +447,12 @@ struct btrfs_discard_ctl {
>  	spinlock_t lock;
>  	struct btrfs_block_group_cache *cache;
>  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> +	u64 prev_discard;
>  	atomic_t discard_extents;
>  	atomic64_t discardable_bytes;
>  	atomic_t delay;
>  	atomic_t iops_limit;
> +	atomic64_t bps_limit;
>  };
>  
>  /* delayed seq elem */
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index c7afb5f8240d..072c73f48297 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -176,6 +176,13 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
>  	cache = find_next_cache(discard_ctl, now);
>  	if (cache) {
>  		u64 delay = atomic_read(&discard_ctl->delay);
> +		s64 bps_limit = atomic64_read(&discard_ctl->bps_limit);
> +
> +		if (bps_limit)
> +			delay = max_t(u64, delay,
> +				      msecs_to_jiffies(MSEC_PER_SEC *
> +						discard_ctl->prev_discard /
> +						bps_limit));

I forget, are we allowed to do 0 / some value?  I feel like I did this at some
point with io.latency and it panic'ed and I was very confused.  Maybe I'm just
misremembering.

And a similar nit, maybe we just do

u64 delay = atomic_read(&discard_ctl->delay);
u64 bps_delay = atomic64_read(&discard_ctl->bps_limit);
if (bps_delay)
	bps_delay = msecs_to_jiffies(MSEC_PER_SEC * blah);

delay = max(delay, bps_delay);

Or something else.  Thanks,

Josef
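
On the division question: in C a zero numerator with a nonzero divisor is well
defined and yields 0; only a zero divisor is a problem. A minimal userspace
sketch of the restructured calculation, assuming prev_discard is in bytes and
bps_limit is in bytes per second (bps_delay_msecs() and pick_discard_delay()
are hypothetical names, and the result is left in milliseconds rather than
converted to jiffies):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MSEC_PER_SEC	1000ULL

/* Milliseconds needed to "pay for" prev_discard bytes at bps_limit. */
static uint64_t bps_delay_msecs(uint64_t prev_discard, uint64_t bps_limit)
{
	/* A limit of 0 means no bps throttling, and avoids dividing by 0. */
	if (!bps_limit)
		return 0;

	/* Guard the multiplication against 64-bit overflow. */
	assert(prev_discard <= UINT64_MAX / MODEL_MSEC_PER_SEC);

	return MODEL_MSEC_PER_SEC * prev_discard / bps_limit;
}

/* The larger of the iops-derived delay and the bps-derived delay wins. */
static uint64_t pick_discard_delay(uint64_t delay, uint64_t bps_delay)
{
	return delay > bps_delay ? delay : bps_delay;
}
```

In the kernel proper, a plain 64-bit `/` may not link on 32-bit builds, so a
helper such as div64_u64() would likely be needed there.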

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 12/19] btrfs: limit max discard size for async discard
  2019-10-07 20:17 ` [PATCH 12/19] btrfs: limit max discard size for async discard Dennis Zhou
@ 2019-10-10 16:16   ` Josef Bacik
  2019-10-14 19:57     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 16:16 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:43PM -0400, Dennis Zhou wrote:
> Throttle the maximum size of a discard so that we can provide an upper
> bound for the rate of async discard. While the block layer is able to
> split discards into the appropriate sized discards, we want to be able
> to account more accurately the rate at which we are consuming ncq slots
> as well as limit the upper bound of work for a discard.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/discard.h          |  4 ++++
>  fs/btrfs/free-space-cache.c | 47 +++++++++++++++++++++++++++----------
>  2 files changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index acaf56f63b1c..898dd92dbf8f 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -8,6 +8,7 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/jiffies.h>
> +#include <linux/sizes.h>
>  #include <linux/time.h>
>  #include <linux/workqueue.h>
>  
> @@ -15,6 +16,9 @@
>  #include "block-group.h"
>  #include "free-space-cache.h"
>  
> +/* discard size limits */
> +#define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
> +

Let's make this configurable via sysfs as well.  I assume at some point in the
far, far future SSDs will stop being shitty and it would be nice to be able to
easily adjust and test.  Also this only applies to async, so name it
BTRFS_ASYNC_DISCARD_MAX_SIZE.  You can add a follow up patch for the sysfs
stuff, just adjust the name and you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 13/19] btrfs: have multiple discard lists
  2019-10-07 20:17 ` [PATCH 13/19] btrfs: have multiple discard lists Dennis Zhou
@ 2019-10-10 16:51   ` Josef Bacik
  2019-10-14 20:04     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 16:51 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:44PM -0400, Dennis Zhou wrote:
> Non-block group destruction discarding currently only had a single list
> with no minimum discard length. This can lead to caravaning more
> meaningful discards behind a heavily fragmented block group.
> 
> This adds support for multiple lists with minimum discard lengths to
> prevent the caravan effect. We promote block groups back up when we
> exceed the BTRFS_DISCARD_MAX_FILTER size, currently we support only 2
> lists with filters of 1MB and 32KB respectively.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/ctree.h            |  2 +-
>  fs/btrfs/discard.c          | 60 +++++++++++++++++++++++++++++++++----
>  fs/btrfs/discard.h          |  4 +++
>  fs/btrfs/free-space-cache.c | 37 +++++++++++++++--------
>  fs/btrfs/free-space-cache.h |  2 +-
>  5 files changed, 85 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e81f699347e0..b5608f8dc41a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
>  };
>  
>  /* discard control */
> -#define BTRFS_NR_DISCARD_LISTS		2
> +#define BTRFS_NR_DISCARD_LISTS		3
>  
>  struct btrfs_discard_ctl {
>  	struct workqueue_struct *discard_workers;
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 072c73f48297..296cbffc5957 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -20,6 +20,10 @@
>  #define BTRFS_DISCARD_MAX_DELAY		(10000UL)
>  #define BTRFS_DISCARD_MAX_IOPS		(10UL)
>  
> +/* montonically decreasing filters after 0 */
> +static int discard_minlen[BTRFS_NR_DISCARD_LISTS] = {0,
> +	BTRFS_DISCARD_MAX_FILTER, BTRFS_DISCARD_MIN_FILTER};
> +
>  static struct list_head *
>  btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  		       struct btrfs_block_group_cache *cache)
> @@ -120,7 +124,7 @@ find_next_cache(struct btrfs_discard_ctl *discard_ctl, u64 now)
>  }
>  
>  static struct btrfs_block_group_cache *
> -peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
> +peek_discard_list(struct btrfs_discard_ctl *discard_ctl, int *discard_index)
>  {
>  	struct btrfs_block_group_cache *cache;
>  	u64 now = ktime_get_ns();
> @@ -132,6 +136,7 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
>  
>  	if (cache && now > cache->discard_delay) {
>  		discard_ctl->cache = cache;
> +		*discard_index = cache->discard_index;
>  		if (cache->discard_index == 0 &&
>  		    cache->free_space_ctl->free_space != cache->key.offset) {
>  			__btrfs_add_to_discard_list(discard_ctl, cache);
> @@ -150,6 +155,36 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
>  	return cache;
>  }
>  
> +void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
> +				u64 bytes)
> +{
> +	struct btrfs_discard_ctl *discard_ctl;
> +
> +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> +		return;
> +
> +	discard_ctl = &cache->fs_info->discard_ctl;
> +
> +	if (cache && cache->discard_index > 1 &&
> +	    bytes >= BTRFS_DISCARD_MAX_FILTER) {
> +		remove_from_discard_list(discard_ctl, cache);
> +		cache->discard_index = 1;

Really need names here, I have no idea what 1 is.

> +		btrfs_add_to_discard_list(discard_ctl, cache);
> +	}
> +}
> +
> +static void btrfs_update_discard_index(struct btrfs_discard_ctl *discard_ctl,
> +				       struct btrfs_block_group_cache *cache)
> +{
> +	cache->discard_index++;
> +	if (cache->discard_index == BTRFS_NR_DISCARD_LISTS) {
> +		cache->discard_index = 1;
> +		return;
> +	}
> +
> +	btrfs_add_to_discard_list(discard_ctl, cache);
> +}
> +
>  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
>  			       struct btrfs_block_group_cache *cache)
>  {
> @@ -202,23 +237,34 @@ static void btrfs_discard_workfn(struct work_struct *work)
>  {
>  	struct btrfs_discard_ctl *discard_ctl;
>  	struct btrfs_block_group_cache *cache;
> +	int discard_index = 0;
>  	u64 trimmed = 0;
> +	u64 minlen = 0;
>  
>  	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
>  
>  again:
> -	cache = peek_discard_list(discard_ctl);
> +	cache = peek_discard_list(discard_ctl, &discard_index);
>  	if (!cache || !btrfs_run_discard_work(discard_ctl))
>  		return;
>  
> -	if (btrfs_discard_bitmaps(cache))
> +	minlen = discard_minlen[discard_index];
> +
> +	if (btrfs_discard_bitmaps(cache)) {
> +		u64 maxlen = 0;
> +
> +		if (discard_index)
> +			maxlen = discard_minlen[discard_index - 1];
> +
>  		btrfs_trim_block_group_bitmaps(cache, &trimmed,
>  					       cache->discard_cursor,
>  					       btrfs_block_group_end(cache),
> -					       0, true);
> -	else
> +					       minlen, maxlen, true);
> +	} else {
>  		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
> -				       btrfs_block_group_end(cache), 0, true);
> +				       btrfs_block_group_end(cache),
> +				       minlen, true);
> +	}
>  
>  	discard_ctl->prev_discard = trimmed;
>  
> @@ -231,6 +277,8 @@ static void btrfs_discard_workfn(struct work_struct *work)
>  				 cache->key.offset)
>  				btrfs_add_to_discard_free_list(discard_ctl,
>  							       cache);
> +			else
> +				btrfs_update_discard_index(discard_ctl, cache);
>  		} else {
>  			cache->discard_cursor = cache->key.objectid;
>  			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index 898dd92dbf8f..1daa8da4a1b5 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -18,6 +18,8 @@
>  
>  /* discard size limits */
>  #define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
> +#define BTRFS_DISCARD_MAX_FILTER	(SZ_1M)
> +#define BTRFS_DISCARD_MIN_FILTER	(SZ_32K)
>  
>  /* discard flags */
>  #define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
> @@ -39,6 +41,8 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  			       struct btrfs_block_group_cache *cache);
>  void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
>  				    struct btrfs_block_group_cache *cache);
> +void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
> +				u64 bytes);
>  void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
>  
>  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index ce33803a45b2..ed35dc090df6 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -2471,6 +2471,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  	if (ret)
>  		kmem_cache_free(btrfs_free_space_cachep, info);
>  out:
> +	btrfs_discard_check_filter(cache, bytes);

So we're only accounting the new space?  What if we merge with a larger area
here?  We should probably make our decision based on the actual trimmable area.

>  	btrfs_discard_update_discardable(cache, ctl);
>  	spin_unlock(&ctl->tree_lock);
>  
> @@ -3409,7 +3410,13 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
>  				goto next;
>  			}
>  			unlink_free_space(ctl, entry);
> -			if (bytes > BTRFS_DISCARD_MAX_SIZE) {
> +			/*
> +			 * Let bytes = BTRFS_MAX_DISCARD_SIZE + X.
> +			 * If X < BTRFS_DISCARD_MIN_FILTER, we won't trim X when
> +			 * we come back around.  So trim it now.
> +			 */
> +			if (bytes > (BTRFS_DISCARD_MAX_SIZE +
> +				     BTRFS_DISCARD_MIN_FILTER)) {
>  				bytes = extent_bytes = BTRFS_DISCARD_MAX_SIZE;
>  				entry->offset += BTRFS_DISCARD_MAX_SIZE;
>  				entry->bytes -= BTRFS_DISCARD_MAX_SIZE;
> @@ -3510,7 +3517,7 @@ static void end_trimming_bitmap(struct btrfs_free_space_ctl *ctl,
>  
>  static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> -			bool async)
> +			u64 maxlen, bool async)
>  {
>  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>  	struct btrfs_free_space *entry;
> @@ -3535,7 +3542,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
>  		}
>  
>  		entry = tree_search_offset(ctl, offset, 1, 0);
> -		if (!entry || (async && start == offset &&
> +		if (!entry || (async && minlen && start == offset &&
>  			       btrfs_free_space_trimmed(entry))) {

Huh?  Why do we care if minlen is set if our entry is already trimmed?  If we're
already trimmed we should just skip it even with minlen set, right?  Thanks,

Josef
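
To illustrate the two points above (named list indices, and deciding on the
merged extent size rather than only the newly freed bytes), here is a small
userspace sketch. The enum names, should_promote() and advance_discard_index()
are hypothetical, and the meaning given to index 0 is an assumption based on
how the patch treats it; only the 1MB filter value comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DISCARD_MAX_FILTER	(1024ULL * 1024)  /* SZ_1M in the patch */

/* Hypothetical names for the three list indices used in the patch. */
enum model_discard_index {
	MODEL_DISCARD_INDEX_UNUSED = 0,	/* assumed: fully free block groups */
	MODEL_DISCARD_INDEX_START = 1,	/* first filtered list (>= 1MB) */
	MODEL_DISCARD_INDEX_SMALL = 2,	/* second filtered list (>= 32KB) */
	MODEL_NR_DISCARD_LISTS
};

/*
 * Promote back to the first filtered list when the extent that results
 * from merging, not just the newly added bytes, passes the large filter.
 */
static bool should_promote(int index, uint64_t merged_bytes)
{
	return index > MODEL_DISCARD_INDEX_START &&
	       merged_bytes >= MODEL_DISCARD_MAX_FILTER;
}

/*
 * Step to the next list once a pass finishes.  After the last list the
 * index wraps to the first filtered list and the block group is not
 * requeued, mirroring btrfs_update_discard_index() in the patch.
 */
static int advance_discard_index(int index, bool *requeue)
{
	assert(requeue);

	index++;
	if (index == MODEL_NR_DISCARD_LISTS) {
		*requeue = false;
		return MODEL_DISCARD_INDEX_START;
	}
	*requeue = true;
	return index;
}
```

The caller would pass the size of the free space entry after merging with its
neighbours, which is the quantity a later trim would actually see.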

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 14/19] btrfs: only keep track of data extents for async discard
  2019-10-07 20:17 ` [PATCH 14/19] btrfs: only keep track of data extents for async discard Dennis Zhou
@ 2019-10-10 16:53   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 16:53 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:45PM -0400, Dennis Zhou wrote:
> As mentioned earlier, discarding data can be done either by issuing an
> explicit discard or implicitly by reusing the LBA. Metadata chunks see
> much more frequent reuse due to well it being metadata. So instead of
> explicitly discarding metadata blocks, just leave them be and let the
> latter implicit discarding be done for them.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 15/19] btrfs: load block_groups into discard_list on mount
  2019-10-07 20:17 ` [PATCH 15/19] btrfs: load block_groups into discard_list on mount Dennis Zhou
@ 2019-10-10 17:11   ` Josef Bacik
  2019-10-14 20:17     ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 17:11 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> Async discard doesn't remember the discard state of a block_group when
> unmounting or when we crash. So, any block_group that is not fully used
> may have undiscarded regions. However, free space caches are read in on
> demand. Let the discard worker read in the free space cache so we can
> proceed with discarding rather than wait for the block_group to be used.
> This prevents us from indefinitely deferring discards until that
> particular block_group is reused.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

What if we did completely discard the last time, now we're going back and
discarding again?  I think by default we just assume we discarded everything.
If we didn't then the user can always initiate a fitrim later.  Drop this one.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 16/19] btrfs: keep track of discard reuse stats
  2019-10-07 20:17 ` [PATCH 16/19] btrfs: keep track of discard reuse stats Dennis Zhou
@ 2019-10-10 17:13   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 17:13 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:47PM -0400, Dennis Zhou wrote:
> Keep track of how much we are discarding and how often we are reusing
> with async discard.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 17/19] btrfs: add async discard header
  2019-10-07 20:17 ` [PATCH 17/19] btrfs: add async discard header Dennis Zhou
@ 2019-10-10 17:13   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 17:13 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:48PM -0400, Dennis Zhou wrote:
> Give a brief overview for how async discard is implemented.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache
  2019-10-07 20:17 ` [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou
@ 2019-10-10 17:16   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 17:16 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:49PM -0400, Dennis Zhou wrote:
> Currently, there is no way for the free space cache to recover from
> being serviced by purely bitmaps because the extent threshold is set to
> 0 in recalculate_thresholds() when we surpass the metadata allowance.
> 
> This adds a recovery mechanism by keeping large extents out of the
> bitmaps and increases the metadata upper bound to 64KB. The recovery
> mechanism bypasses this upper bound, thus making it a soft upper bound.
> But, with the bypass being 1MB or greater, it shouldn't add unbounded
> overhead.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps
  2019-10-07 20:17 ` [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
@ 2019-10-10 17:17   ` Josef Bacik
  0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2019-10-10 17:17 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:50PM -0400, Dennis Zhou wrote:
> It's less than ideal for small extents to eat into our extent budget, so
> force extents <= 32KB into the bitmaps save for the first handful.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 00/19] btrfs: async discard support
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (18 preceding siblings ...)
  2019-10-07 20:17 ` [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
@ 2019-10-11  7:49 ` Nikolay Borisov
  2019-10-14 21:05   ` Dennis Zhou
  2019-10-15 12:08 ` David Sterba
  20 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2019-10-11  7:49 UTC (permalink / raw)
  To: Dennis Zhou, Chris Mason, Omar Sandoval, David Sterba, Josef Bacik
  Cc: kernel-team, linux-btrfs



On 7.10.19 at 23:17, Dennis Zhou wrote:
> Hello,
> 

<snip>

> 
> With async discard, we try to emphasize discarding larger regions
> and reusing the lba (implicit discard). The first is done by using the
> free space cache to maintain discard state and thus allows us to get
> coalescing for fairly cheap. A background workqueue is used to scan over
> an LRU kept list of the block groups. It then uses filters to determine
> what to discard next hence giving priority to larger discards. While
> reusing an lba isn't explicitly attempted, it happens implicitly via
> find_free_extent() which if it happens to find a dirty extent, will
> grant us reuse of the lba. Additionally, async discarding skips metadata

By 'dirty' I assume you mean a not-discarded-yet-but-free extent?

> block groups as these should see a fairly high turnover as btrfs is a
> self-packing filesystem being stingy with allocating new block groups
> until necessary.
> 
> Preliminary results seem promising as when a lot of freeing is going on,
> the discarding is delayed allowing for reuse which translates to less
> discarding (in addition to the slower discarding). This has shown a
> reduction in p90 and p99 read latencies on a test on our webservers.
> 
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

Do you intend on sharing performance results alongside the workloads
used to obtain them? Since this is at its core a performance improvement
patch, that is of prime importance!

> 

<snip>
> 
> Thanks,
> Dennis
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC
  2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
  2019-10-07 20:27   ` Josef Bacik
  2019-10-08 11:12   ` Johannes Thumshirn
@ 2019-10-11  9:19   ` Nikolay Borisov
  2 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2019-10-11  9:19 UTC (permalink / raw)
  To: Dennis Zhou, Chris Mason, Omar Sandoval, David Sterba, Josef Bacik
  Cc: kernel-team, linux-btrfs



On 7.10.19 at 23:17, Dennis Zhou wrote:
> This series introduces async discard which will use the flag
> DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
> synchronously done in transaction commit.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/block-group.c | 2 +-
>  fs/btrfs/ctree.h       | 2 +-
>  fs/btrfs/extent-tree.c | 4 ++--
>  fs/btrfs/super.c       | 8 ++++----
>  4 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index bf7e3f23bba7..afe86028246a 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1365,7 +1365,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		spin_unlock(&space_info->lock);
>  
>  		/* DISCARD can flip during remount */
> -		trimming = btrfs_test_opt(fs_info, DISCARD);
> +		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
>  
>  		/* Implicit trim during transaction commit. */
>  		if (trimming)
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 19d669d12ca1..1877586576aa 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1171,7 +1171,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FLUSHONCOMMIT       (1 << 7)
>  #define BTRFS_MOUNT_SSD_SPREAD		(1 << 8)
>  #define BTRFS_MOUNT_NOSSD		(1 << 9)
> -#define BTRFS_MOUNT_DISCARD		(1 << 10)
> +#define BTRFS_MOUNT_DISCARD_SYNC	(1 << 10)
>  #define BTRFS_MOUNT_FORCE_COMPRESS      (1 << 11)
>  #define BTRFS_MOUNT_SPACE_CACHE		(1 << 12)
>  #define BTRFS_MOUNT_CLEAR_CACHE		(1 << 13)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 49cb26fa7c63..77a5904756c5 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2903,7 +2903,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>  			break;
>  		}
>  
> -		if (btrfs_test_opt(fs_info, DISCARD))
> +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
>  			ret = btrfs_discard_extent(fs_info, start,
>  						   end + 1 - start, NULL);
>  
> @@ -4146,7 +4146,7 @@ static int __btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
>  	if (pin)
>  		pin_down_extent(cache, start, len, 1);
>  	else {
> -		if (btrfs_test_opt(fs_info, DISCARD))
> +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
>  			ret = btrfs_discard_extent(fs_info, start, len, NULL);

Is discard even needed in that function? All but one of the calls to
btrfs_free_reserved_extent() (which calls __btrfs_free_reserved_extent()
with pin 0) happen in cleanup code, when an extent has just been
allocated, not written to, and has potentially already been discarded.

In cow_file_range that function is called only if create_io_em or
btrfs_add_ordered_extent fails, both of which happen _before_ any io is
submitted to the newly reserved range, hence I think this can be removed.

In submit_compressed_extents the code flow is similar - out_free_reserve
can be called only before btrfs_submit_compressed_write

btrfs_new_extent_direct - again, called in case extent_map creation
fails, before any io happens.

__btrfs_prealloc_file_range - called as a cleanup for a prealloc extent.

btrfs_alloc_tree_block - the metadata extent is allocated but not
written to yet

btrfs_finish_ordered_io - here it seems it can be called for an extent
which could have had some data written to it on disk, so discard seems
necessary. On the other hand, the code contradicts the comment:

"We only free the extent in the truncated case if we didn't write out
the extent at all. "

Yet the 'if' does:

 if ((ret || !logical_len) &&
     clear_reserved_extent &&
     !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
     !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))

So even if ret is non-zero (meaning error) we could still free the
extent, as long as insert_reserved_file_extent has not yet returned
success (the clear_reserved_extent check). This logic is messy. Josef, do
you have any idea what the correct behavior should be?
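
To make the mismatch between the comment and the condition concrete,
here is a minimal standalone sketch of that 'if' as a predicate. The
function name and the plain bool parameters are my own illustration,
standing in for the test_bit() checks on ordered_extent->flags; it is
not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the condition in btrfs_finish_ordered_io().
 * nocow/prealloc stand in for the BTRFS_ORDERED_NOCOW and
 * BTRFS_ORDERED_PREALLOC bits; the helper name does not exist in
 * the kernel.
 */
static bool would_free_reserved_extent(int ret, uint64_t logical_len,
				       bool clear_reserved_extent,
				       bool nocow, bool prealloc)
{
	return (ret || !logical_len) &&
	       clear_reserved_extent &&
	       !nocow && !prealloc;
}
```

With a non-zero ret and a non-zero logical_len the predicate is still
true, so in this model the extent is freed on error even when it is not
the "didn't write out the extent at all" truncate case the comment
describes.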

My point is that if btrfs_free_reserved_extent should only be called in
finish_ordered_io for a truncated extent which hasn't been written at
all, then the btrfs_discard_extent call in btrfs_free_reserved_extent is
redundant and can be removed, provided my analysis is correct.

What do you think?



<snip>


* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-08 12:46   ` Nikolay Borisov
@ 2019-10-11 16:08     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-11 16:08 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Dennis Zhou, Chris Mason, Omar Sandoval, David Sterba,
	Josef Bacik, kernel-team, linux-btrfs

On Tue, Oct 08, 2019 at 03:46:18PM +0300, Nikolay Borisov wrote:
> 
> 
> On 7.10.19 at 23:17, Dennis Zhou wrote:
> > Async discard will use the free space cache as backing knowledge for
> > which extents to discard. This patch plumbs knowledge about which
> > extents need to be discarded into the free space cache from
> > unpin_extent_range().
> > 
> > An untrimmed extent can merge with everything as this is a new region.
> > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > makes life better for find_free_extent(). Additionally, it seems the
> > size of a trim isn't as problematic as the trim io itself.
> > 
> > When reading in the free space cache from disk, if sync is set, mark all
> > extents as trimmed. The current code ensures at transaction commit that
> > all free space is trimmed when sync is set, so this reflects that.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> 
> I haven't looked closely into this commit but I already implemented
> something similar in order to speed up trimming by not discarding an
> already discarded region twice. The code was introduced by the following
> series:
> https://lore.kernel.org/linux-btrfs/20190327122418.24027-1-nborisov@suse.com/
> in particular patches 13 to 15 .
> 
> Can you leverage it ? If not then your code should, at some point,
> subsume the old one.
> 

I spent some time reading through that. I believe we're tackling two
separate problems. Correct me if I'm wrong, but your patches make
subsequent fitrims faster by skipping over free regions that
were never allocated by the chunk allocator.

This series is aiming to solve intra-block group trim latency as trim is
handled during transaction commit and consequently also help prevent
retrimming of the free space that is already trimmed.

Thanks,
Dennis


* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-10 13:40       ` Josef Bacik
@ 2019-10-11 16:15         ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-11 16:15 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 09:40:37AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 06:38:10PM -0400, Dennis Zhou wrote:
> > On Mon, Oct 07, 2019 at 04:37:28PM -0400, Josef Bacik wrote:
> > > On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > > > Async discard will use the free space cache as backing knowledge for
> > > > which extents to discard. This patch plumbs knowledge about which
> > > > extents need to be discarded into the free space cache from
> > > > unpin_extent_range().
> > > > 
> > > > An untrimmed extent can merge with everything as this is a new region.
> > > > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > > > makes life better for find_free_extent(). Additionally, it seems the
> > > > size of a trim isn't as problematic as the trim io itself.
> > > > 
> > > > When reading in the free space cache from disk, if sync is set, mark all
> > > > extents as trimmed. The current code ensures at transaction commit that
> > > > all free space is trimmed when sync is set, so this reflects that.
> > > > 
> > > > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > > > ---
> > > >  fs/btrfs/extent-tree.c      | 15 ++++++++++-----
> > > >  fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
> > > >  fs/btrfs/free-space-cache.h | 10 +++++++++-
> > > >  fs/btrfs/inode-map.c        | 13 +++++++------
> > > >  4 files changed, 57 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > > index 77a5904756c5..b9e3bedad878 100644
> > > > --- a/fs/btrfs/extent-tree.c
> > > > +++ b/fs/btrfs/extent-tree.c
> > > > @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
> > > >  }
> > > >  
> > > >  static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > > -			      u64 start, u64 end,
> > > > +			      u64 start, u64 end, u32 fsc_flags,
> > > >  			      const bool return_free_space)
> > > >  {
> > > >  	struct btrfs_block_group_cache *cache = NULL;
> > > > @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > >  		if (start < cache->last_byte_to_unpin) {
> > > >  			len = min(len, cache->last_byte_to_unpin - start);
> > > >  			if (return_free_space)
> > > > -				btrfs_add_free_space(cache, start, len);
> > > > +				__btrfs_add_free_space(fs_info,
> > > > +						       cache->free_space_ctl,
> > > > +						       start, len, fsc_flags);
> > > >  		}
> > > >  
> > > >  		start += len;
> > > > @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > > >  
> > > >  	while (!trans->aborted) {
> > > >  		struct extent_state *cached_state = NULL;
> > > > +		u32 fsc_flags = 0;
> > > >  
> > > >  		mutex_lock(&fs_info->unused_bg_unpin_mutex);
> > > >  		ret = find_first_extent_bit(unpin, 0, &start, &end,
> > > > @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
> > > >  			ret = btrfs_discard_extent(fs_info, start,
> > > >  						   end + 1 - start, NULL);
> > > > +			fsc_flags |= BTRFS_FSC_TRIMMED;
> > > > +		}
> > > >  
> > > >  		clear_extent_dirty(unpin, start, end, &cached_state);
> > > > -		unpin_extent_range(fs_info, start, end, true);
> > > > +		unpin_extent_range(fs_info, start, end, fsc_flags, true);
> > > >  		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> > > >  		free_extent_state(cached_state);
> > > >  		cond_resched();
> > > > @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
> > > >  int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > >  				   u64 start, u64 end)
> > > >  {
> > > > -	return unpin_extent_range(fs_info, start, end, false);
> > > > +	return unpin_extent_range(fs_info, start, end, 0, false);
> > > >  }
> > > >  
> > > >  /*
> > > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > > > index d54dcd0ab230..f119895292b8 100644
> > > > --- a/fs/btrfs/free-space-cache.c
> > > > +++ b/fs/btrfs/free-space-cache.c
> > > > @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
> > > >  			goto free_cache;
> > > >  		}
> > > >  
> > > > +		/*
> > > > +		 * Sync discard ensures that the free space cache is always
> > > > +		 * trimmed.  So when reading this in, the state should reflect
> > > > +		 * that.
> > > > +		 */
> > > > +		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > > +			e->flags |= BTRFS_FSC_TRIMMED;
> > > > +
> > > >  		if (!e->bytes) {
> > > >  			kmem_cache_free(btrfs_free_space_cachep, e);
> > > >  			goto free_cache;
> > > > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	bool merged = false;
> > > >  	u64 offset = info->offset;
> > > >  	u64 bytes = info->bytes;
> > > > +	bool is_trimmed = btrfs_free_space_trimmed(info);
> > > >  
> > > >  	/*
> > > >  	 * first we want to see if there is free space adjacent to the range we
> > > > @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	else
> > > >  		left_info = tree_search_offset(ctl, offset - 1, 0, 0);
> > > >  
> > > > -	if (right_info && !right_info->bitmap) {
> > > > +	if (right_info && !right_info->bitmap &&
> > > > +	    (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
> > > >  		if (update_stat)
> > > >  			unlink_free_space(ctl, right_info);
> > > >  		else
> > > > @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > >  	}
> > > >  
> > > >  	if (left_info && !left_info->bitmap &&
> > > > -	    left_info->offset + left_info->bytes == offset) {
> > > > +	    left_info->offset + left_info->bytes == offset &&
> > > > +	    (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
> > > 
> > > So we allow merging if we haven't trimmed this entry, or if the adjacent entry
> > > is already trimmed?  This means we'll merge if we trimmed the new entry
> > > regardless of the adjacent entry's status, or if the new entry is dirty.  Why is
> > > that?  Thanks,
> > > 
> > 
> > This is the tradeoff I called out above here:
> > 
> > > > Absorbing trimmed extents is a tradeoff for greater coalescing which
> > > > makes life better for find_free_extent(). Additionally, it seems the
> > > > size of a trim isn't as problematic as the trim io itself.
> > 
> > A problematic example case:
> > 
> > |----trimmed----|/////X/////|-----trimmed-----|
> > 
> > If region X gets freed and returned to the free space cache, we end up
> > with the following:
> > 
> > |----trimmed----|-untrimmed-|-----trimmed-----|
> > 
> > This isn't great because now we need to teach find_free_extent() to span
> > multiple btrfs_free_space entries, something I didn't want to do. So the
> > other option is to overtrim trading for a simpler find_free_extent().
> > Then the above becomes:
> > 
> > |-------------------trimmed-------------------|
> > 
> > It makes the assumption that if we're inserting, it's generally free
> > space being returned rather than we needed to slice out from the middle
> > of a block. It does still have degenerative cases, but it's better than
> > the above. The merging also allows for stuff to come out of bitmaps more
> > proactively too.
> > 
> > Also, from what it seems, a discard operation is quite costly
> > relative to the amount you're discarding (1 larger discard is better than
> > several smaller discards) as it will clog up the device too.
> 
> 
> OOOOOh I fucking get it now.  That's going to need a comment, because it's not
> obvious at all.
> 
> However I still wonder if this is right.  Your above examples are legitimate,
> but say you have
> 
> | 512mib adding back that isn't trimmed |------- 512mib trimmed ------|
> 
> we'll merge these two, but really we should probably trim that 512mib chunk
> we're adding right?  Thanks,
> 

So that's the crux of the problem. I'm not sure if it's right to make
heuristics around this and have merging thresholds because it makes the
code trickier + not necessarily correct. A contrived case would be
something where we go through a few iterations of merging because we
pulled stuff out of the bitmaps and that then was able to merge more
free space. How do you know what the right balance is for merging extents?

I kind of favor the overeager approach for now because it is always
correct to rediscard regions, but forgetting about regions means they may
go undiscarded for some unbounded time in the future.  This also
makes life the easiest for find_free_extent().
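
As a rough illustration of the overeager policy, the check added to
try_merge_free_space() can be modelled as below. struct fs_entry and
both helper names are made up for this sketch (a cut-down stand-in for
struct btrfs_free_space, extent entries only, no bitmaps), so treat it
as a model of the rule rather than the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_free_space. */
struct fs_entry {
	uint64_t offset;
	uint64_t bytes;
	bool trimmed;
};

/*
 * Mirrors the added condition: a new entry may absorb a neighbour
 * unless the new entry is trimmed and the neighbour is not, which
 * would wrongly mark untrimmed space as trimmed.
 */
static bool can_absorb(const struct fs_entry *info,
		       const struct fs_entry *neighbour)
{
	return !info->trimmed || neighbour->trimmed;
}

/*
 * Absorb an adjacent neighbour on the right, if allowed. The merged
 * entry keeps info's trim state, so an untrimmed info makes the whole
 * range untrimmed and it gets discarded again later (the overtrim).
 */
static bool absorb_right(struct fs_entry *info,
			 const struct fs_entry *right)
{
	if (info->offset + info->bytes != right->offset)
		return false;
	if (!can_absorb(info, right))
		return false;
	info->bytes += right->bytes;
	return true;
}
```

In this model an untrimmed region followed by a trimmed one merges into
a single untrimmed entry covering both, while a trimmed entry never
absorbs an untrimmed neighbour. That is exactly the 512mib + 512mib
case above: the merge happens and both halves end up being discarded.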

As I said, I'm not sure what the right thing to do is, so I favored
being accurate.  This is something I'm happy to change depending on
discussion and on further data I collect.

I added a comment. I might need to make it more in-depth, but it's a
start (I'll revisit before v2).

Thanks,
Dennis


* Re: [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap
  2019-10-10 14:16   ` Josef Bacik
@ 2019-10-11 16:17     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-11 16:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 10:16:30AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:35PM -0400, Dennis Zhou wrote:
> > There is a cap in btrfs in the amount of free extents that a block group
> > can have. When it surpasses that threshold, future extents are placed
> > into bitmaps. Instead of keeping track of if a certain bit is trimmed or
> > not in a second bitmap, keep track of the relative state of the bitmap.
> > 
> > With async discard, trimming bitmaps becomes a more frequent operation.
> > As a trade off with simplicity, we keep track of if discarding a bitmap
> > is in progress. If we fully scan a bitmap and trim as necessary, the
> > bitmap is marked clean. This has some caveats as the min block size may
> > skip over regions deemed too small. But this should be a reasonable
> > trade off rather than keeping a second bitmap and making allocation
> > paths more complex. The downside is we may overtrim, but ideally the min
> > block size should prevent us from doing that too often and getting stuck
> > trimming pathological cases.
> > 
> > BTRFS_FSC_TRIMMING_BITMAP is added to indicate a bitmap is in the
> > process of being trimmed. If additional free space is added to that
> > bitmap, the bit is cleared. A bitmap will be marked BTRFS_FSC_TRIMMED if
> > the trimming code was able to reach the end of it and the former is
> > still set.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> 
> I went through and looked at the end result and it appears to me that we never
> have TRIMMED and TRIMMING set at the same time.  Since these are the only two
> flags, and TRIMMING is only set on bitmaps, it makes more sense for this to be
> more like
> 
> enum btrfs_trim_state {
> 	BTRFS_TRIM_STATE_TRIMMED,
> 	BTRFS_TRIM_STATE_TRIMMING,
> 	BTRFS_TRIM_STATE_UNTRIMMED,
> };
> 
> and then just have enum btrfs_trim_state trim_state in the free space entry.
> This makes things a bit cleaner since it's really just a state indicator rather
> than actual flags.  Thanks,
> 

That makes sense. I've gone ahead and done this for both this state and
the block group discard state. FWIW, at the time I didn't know what I
would need and flags were just easier to iterate on. I agree this is
nicer.
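
Roughly, the bitmap rules from the commit message map onto the enum
like this. The enum is the one suggested above; struct fs_entry and the
helper names are placeholders I made up for this sketch, not what the
next version will necessarily look like.

```c
#include <stdbool.h>

enum btrfs_trim_state {
	BTRFS_TRIM_STATE_TRIMMED,
	BTRFS_TRIM_STATE_TRIMMING,
	BTRFS_TRIM_STATE_UNTRIMMED,
};

/* Hypothetical, reduced free space entry: only the trim state. */
struct fs_entry {
	enum btrfs_trim_state trim_state;
};

static bool entry_trimmed(const struct fs_entry *e)
{
	return e->trim_state == BTRFS_TRIM_STATE_TRIMMED;
}

/* New free space landed in the bitmap: any scan in progress is stale. */
static void entry_add_free_space(struct fs_entry *e)
{
	e->trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
}

/* The trimming code starts scanning a bitmap. */
static void entry_start_trim(struct fs_entry *e)
{
	e->trim_state = BTRFS_TRIM_STATE_TRIMMING;
}

/* Scan reached the end: clean only if nothing was added meanwhile. */
static void entry_finish_trim(struct fs_entry *e)
{
	if (e->trim_state == BTRFS_TRIM_STATE_TRIMMING)
		e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
}
```

One thing to keep in mind with the ordering as written: TRIMMED is the
zero value, so a zero-initialised entry in this sketch would read as
trimmed unless the state is set explicitly.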

Thanks,
Dennis


* Re: [PATCH 06/19] btrfs: handle empty block_group removal
  2019-10-10 15:00   ` Josef Bacik
@ 2019-10-11 16:52     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-11 16:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 11:00:42AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:37PM -0400, Dennis Zhou wrote:
> > block_group removal is a little tricky. It can race with the extent
> > allocator, the cleaner thread, and balancing. The current path is for a
> > block_group to be added to the unused_bgs list. Then, when the cleaner
> > thread comes around, it starts a transaction and then proceeds with
> > removing the block_group. Extents that are pinned are subsequently
> > removed from the pinned trees and then eventually a discard is issued
> > for the entire block_group.
> > 
> > Async discard introduces another player into the game, the discard
> > workqueue. While it has none of the racing issues, the new problem is
> > ensuring we don't leave free space untrimmed prior to forgetting the
> > block_group.  This is handled by placing fully free block_groups on a
> > separate discard queue. This is necessary to maintain discarding order
> > as in the future we will slowly trim even fully free block_groups. The
> > ordering helps us make progress on the same block_group rather than say
> > the last fully freed block_group or needing to search through the fully
> > freed block groups at the beginning of a list and insert after.
> > 
> > The new order of events is a fully freed block group gets placed on the
> > discard queue first. Once it's processed, it will be placed on the
> > unused_bgs list and then the original sequence of events will happen,
> > just without the final whole block_group discard.
> > 
> > The mount flags can change when processing unused_bgs, so when flipping
> > from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
> > discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
> > free block groups on the discard_list to the unused_bg queue which will
> > do the final discard for us.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/block-group.c      | 39 ++++++++++++++++++---
> >  fs/btrfs/ctree.h            |  2 +-
> >  fs/btrfs/discard.c          | 68 ++++++++++++++++++++++++++++++++++++-
> >  fs/btrfs/discard.h          | 11 +++++-
> >  fs/btrfs/free-space-cache.c | 33 ++++++++++++++++++
> >  fs/btrfs/free-space-cache.h |  1 +
> >  fs/btrfs/scrub.c            |  7 +++-
> >  7 files changed, 153 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 8bbbe7488328..73e5a9384491 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1251,6 +1251,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >  	struct btrfs_block_group_cache *block_group;
> >  	struct btrfs_space_info *space_info;
> >  	struct btrfs_trans_handle *trans;
> > +	bool async_trim_enabled = btrfs_test_opt(fs_info, DISCARD_ASYNC);
> >  	int ret = 0;
> >  
> >  	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
> > @@ -1260,6 +1261,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >  	while (!list_empty(&fs_info->unused_bgs)) {
> >  		u64 start, end;
> >  		int trimming;
> > +		bool async_trimmed;
> >  
> >  		block_group = list_first_entry(&fs_info->unused_bgs,
> >  					       struct btrfs_block_group_cache,
> > @@ -1281,10 +1283,20 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >  		/* Don't want to race with allocators so take the groups_sem */
> >  		down_write(&space_info->groups_sem);
> >  		spin_lock(&block_group->lock);
> > +
> > +		/* async discard requires block groups to be fully trimmed */
> > +		async_trimmed = (!btrfs_test_opt(fs_info, DISCARD_ASYNC) ||
> > +				 btrfs_is_free_space_trimmed(block_group));
> > +
> >  		if (block_group->reserved || block_group->pinned ||
> >  		    btrfs_block_group_used(&block_group->item) ||
> >  		    block_group->ro ||
> > -		    list_is_singular(&block_group->list)) {
> > +		    list_is_singular(&block_group->list) ||
> > +		    !async_trimmed) {
> > +			/* requeue if we failed because of async discard */
> > +			if (!async_trimmed)
> > +				btrfs_discard_queue_work(&fs_info->discard_ctl,
> > +							 block_group);
> >  			/*
> >  			 * We want to bail if we made new allocations or have
> >  			 * outstanding allocations in this block group.  We do
> > @@ -1367,6 +1379,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >  		spin_unlock(&block_group->lock);
> >  		spin_unlock(&space_info->lock);
> >  
> > +		if (!async_trim_enabled &&
> > +		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
> > +			goto flip_async;
> > +
> 
> This took me a minute to grok, please add a comment indicating that this is
> meant to catch the case that we flipped from no async to async and thus need to
> kick off the async trim work now before removing the unused bg.
> 

Added a comment explaining what's going on here.
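
For reference, the condition being commented boils down to the
following. The helper name is hypothetical; in the patch the two values
come from btrfs_test_opt(fs_info, DISCARD_ASYNC), sampled once at
function entry and again inside the loop.

```c
#include <stdbool.h>

/*
 * Hypothetical model: async_at_entry is DISCARD_ASYNC as sampled when
 * btrfs_delete_unused_bgs() started, async_now is the value re-read
 * inside the loop. Only the off -> on transition (a remount enabling
 * async discard) requires punting the unused block groups to the
 * discard list before they are removed.
 */
static bool must_punt_to_discard_list(bool async_at_entry, bool async_now)
{
	return !async_at_entry && async_now;
}
```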


> >  		/* DISCARD can flip during remount */
> >  		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
> >  
> > @@ -1411,6 +1427,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >  		spin_lock(&fs_info->unused_bgs_lock);
> >  	}
> >  	spin_unlock(&fs_info->unused_bgs_lock);
> > +	return;
> > +
> > +flip_async:
> > +	btrfs_end_transaction(trans);
> > +	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> > +	btrfs_put_block_group(block_group);
> > +	btrfs_discard_punt_unused_bgs_list(fs_info);
> >  }
> >  
> >  void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
> > @@ -1618,6 +1641,8 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache(
> >  	cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
> >  	set_free_space_tree_thresholds(cache);
> >  
> > +	cache->discard_index = 1;
> > +
> >  	atomic_set(&cache->count, 1);
> >  	spin_lock_init(&cache->lock);
> >  	init_rwsem(&cache->data_rwsem);
> > @@ -1829,7 +1854,11 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
> >  			inc_block_group_ro(cache, 1);
> >  		} else if (btrfs_block_group_used(&cache->item) == 0) {
> >  			ASSERT(list_empty(&cache->bg_list));
> > -			btrfs_mark_bg_unused(cache);
> > +			if (btrfs_test_opt(info, DISCARD_ASYNC))
> > +				btrfs_add_to_discard_free_list(
> > +						&info->discard_ctl, cache);
> > +			else
> > +				btrfs_mark_bg_unused(cache);
> >  		}
> >  	}
> >  
> > @@ -2724,8 +2753,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >  		 * dirty list to avoid races between cleaner kthread and space
> >  		 * cache writeout.
> >  		 */
> > -		if (!alloc && old_val == 0)
> > -			btrfs_mark_bg_unused(cache);
> > +		if (!alloc && old_val == 0) {
> > +			if (!btrfs_test_opt(info, DISCARD_ASYNC))
> > +				btrfs_mark_bg_unused(cache);
> > +		}
> >  
> >  		btrfs_put_block_group(cache);
> >  		total -= num_bytes;
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 419445868909..c328d2e85e4d 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
> >  };
> >  
> >  /* discard control */
> > -#define BTRFS_NR_DISCARD_LISTS		1
> > +#define BTRFS_NR_DISCARD_LISTS		2
> >  
> >  struct btrfs_discard_ctl {
> >  	struct workqueue_struct *discard_workers;
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index 6df124639e55..fb92b888774d 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -29,8 +29,11 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  
> >  	spin_lock(&discard_ctl->lock);
> >  
> > -	if (list_empty(&cache->discard_list))
> > +	if (list_empty(&cache->discard_list) || !cache->discard_index) {
> > +		if (!cache->discard_index)
> > +			cache->discard_index = 1;
> 
> Need a #define for this so it's clear what our intention is, I hate magic
> numbers.
> 

Sounds good, done.
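
Something along these lines, where the two index names are placeholders
for this sketch (the RFC only uses the literals 0 and 1) and the helper
is likewise made up for illustration:

```c
#include <stdbool.h>

/* From the patch: two discard lists after this change. */
#define BTRFS_NR_DISCARD_LISTS		2

/* Hypothetical names for the magic numbers. */
#define BTRFS_DISCARD_INDEX_UNUSED	0	/* fully free block groups */
#define BTRFS_DISCARD_INDEX_START	1	/* everything else */

/* Pick the discard list for a block group (illustrative helper). */
static int discard_list_index(bool fully_free)
{
	return fully_free ? BTRFS_DISCARD_INDEX_UNUSED
			  : BTRFS_DISCARD_INDEX_START;
}
```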

> >  		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
> > +	}
> >  
> >  	list_move_tail(&cache->discard_list,
> >  		       btrfs_get_discard_list(discard_ctl, cache));
> > @@ -38,6 +41,23 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  	spin_unlock(&discard_ctl->lock);
> >  }
> >  
> > +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> > +				    struct btrfs_block_group_cache *cache)
> > +{
> > +	u64 now = ktime_get_ns();
> > +
> > +	spin_lock(&discard_ctl->lock);
> > +
> > +	if (!list_empty(&cache->discard_list))
> > +		list_del_init(&cache->discard_list);
> > +
> > +	cache->discard_index = 0;
> > +	cache->discard_delay = now;
> > +	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
> > +
> > +	spin_unlock(&discard_ctl->lock);
> > +}
> > +
> >  static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  				     struct btrfs_block_group_cache *cache)
> >  {
> > @@ -161,10 +181,52 @@ static void btrfs_discard_workfn(struct work_struct *work)
> >  			       btrfs_block_group_end(cache), 0);
> >  
> >  	remove_from_discard_list(discard_ctl, cache);
> > +	if (btrfs_is_free_space_trimmed(cache))
> > +		btrfs_mark_bg_unused(cache);
> > +	else if (cache->free_space_ctl->free_space == cache->key.offset)
> > +		btrfs_add_to_discard_free_list(discard_ctl, cache);
> 
> This needs to be
> 
> else if (btrfs_block_group_used(cache) == 0)
> 
> because we just exclude super mirrors from the free space cache, so completely
> empty free_space != cache->key.offset.
> 

Ah cool I didn't know that. This is fixed now.
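
To spell out why the two checks differ, here is a small model. struct
bg_model and the helpers are invented for this sketch, and it ignores
pinned and reserved space; the only point is that space set aside for
super mirrors never shows up as free space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced block group: all sizes in bytes. */
struct bg_model {
	uint64_t length;	/* cache->key.offset in the patch */
	uint64_t bytes_super;	/* space excluded for super mirrors */
	uint64_t used;		/* allocated bytes */
};

/* Free space as the cache sees it: super mirror space is never added. */
static uint64_t bg_free_space(const struct bg_model *bg)
{
	return bg->length - bg->bytes_super - bg->used;
}

/* Check from the RFC patch: false for an empty bg holding a mirror. */
static bool bg_empty_by_free_space(const struct bg_model *bg)
{
	return bg_free_space(bg) == bg->length;
}

/* Suggested check: nothing allocated, regardless of mirrors. */
static bool bg_empty_by_used(const struct bg_model *bg)
{
	return bg->used == 0;
}
```

For a block group with nothing allocated but a super mirror inside it,
the first check says "not empty" and the second says "empty", so with
the original comparison such a block group would never be moved to the
free discard list.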

> >  
> >  	btrfs_discard_schedule_work(discard_ctl, false);
> >  }
> >  
> > +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
> > +{
> > +	struct btrfs_block_group_cache *cache, *next;
> > +
> > +	/* we enabled async discard, so punt all to the queue */
> > +	spin_lock(&fs_info->unused_bgs_lock);
> > +
> > +	list_for_each_entry_safe(cache, next, &fs_info->unused_bgs, bg_list) {
> > +		list_del_init(&cache->bg_list);
> > +		btrfs_add_to_discard_free_list(&fs_info->discard_ctl, cache);
> > +	}
> > +
> > +	spin_unlock(&fs_info->unused_bgs_lock);
> > +}
> > +
> > +static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl)
> > +{
> > +	struct btrfs_block_group_cache *cache, *next;
> > +	int i;
> > +
> > +	spin_lock(&discard_ctl->lock);
> > +
> > +	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
> > +		list_for_each_entry_safe(cache, next,
> > +					 &discard_ctl->discard_list[i],
> > +					 discard_list) {
> > +			list_del_init(&cache->discard_list);
> > +			spin_unlock(&discard_ctl->lock);
> > +			if (cache->free_space_ctl->free_space ==
> > +			    cache->key.offset)
> > +				btrfs_mark_bg_unused(cache);
> > +			spin_lock(&discard_ctl->lock);
> > +		}
> > +	}
> > +
> > +	spin_unlock(&discard_ctl->lock);
> > +}
> > +
> >  void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
> >  {
> >  	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
> > @@ -172,6 +234,8 @@ void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
> >  		return;
> >  	}
> >  
> > +	btrfs_discard_punt_unused_bgs_list(fs_info);
> > +
> >  	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
> >  }
> >  
> > @@ -197,4 +261,6 @@ void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
> >  {
> >  	btrfs_discard_stop(fs_info);
> >  	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
> > +
> > +	btrfs_discard_purge_list(&fs_info->discard_ctl);
> >  }
> > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> > index 6d7805bb0eb7..55f79b624943 100644
> > --- a/fs/btrfs/discard.h
> > +++ b/fs/btrfs/discard.h
> > @@ -10,9 +10,14 @@
> >  #include <linux/workqueue.h>
> >  
> >  #include "ctree.h"
> > +#include "block-group.h"
> > +#include "free-space-cache.h"
> >  
> >  void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  			       struct btrfs_block_group_cache *cache);
> > +void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> > +				    struct btrfs_block_group_cache *cache);
> > +void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
> >  
> >  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> >  			       struct btrfs_block_group_cache *cache);
> > @@ -41,7 +46,11 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> >  	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> >  		return;
> >  
> > -	btrfs_add_to_discard_list(discard_ctl, cache);
> > +	if (cache->free_space_ctl->free_space == cache->key.offset)
> > +		btrfs_add_to_discard_free_list(discard_ctl, cache);
> 
> Same here.
> 
> > +	else
> > +		btrfs_add_to_discard_list(discard_ctl, cache);
> > +
> >  	if (!delayed_work_pending(&discard_ctl->work))
> >  		btrfs_discard_schedule_work(discard_ctl, false);
> >  }
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index 54ff1bc97777..ed0e7ee4c78d 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -2653,6 +2653,31 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
> >  
> >  }
> >  
> > +bool btrfs_is_free_space_trimmed(struct btrfs_block_group_cache *cache)
> > +{
> > +	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
> > +	struct btrfs_free_space *info;
> > +	struct rb_node *node;
> > +	bool ret = true;
> > +
> > +	spin_lock(&ctl->tree_lock);
> > +	node = rb_first(&ctl->free_space_offset);
> > +
> > +	while (node) {
> > +		info = rb_entry(node, struct btrfs_free_space, offset_index);
> > +
> > +		if (!btrfs_free_space_trimmed(info)) {
> > +			ret = false;
> > +			break;
> > +		}
> > +
> > +		node = rb_next(node);
> > +	}
> > +
> > +	spin_unlock(&ctl->tree_lock);
> > +	return ret;
> > +}
> > +
> >  u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
> >  			       u64 offset, u64 bytes, u64 empty_size,
> >  			       u64 *max_extent_size)
> > @@ -2739,6 +2764,9 @@ int btrfs_return_cluster_to_free_space(
> >  	ret = __btrfs_return_cluster_to_free_space(block_group, cluster);
> >  	spin_unlock(&ctl->tree_lock);
> >  
> > +	btrfs_discard_queue_work(&block_group->fs_info->discard_ctl,
> > +				 block_group);
> > +
> >  	/* finally drop our ref */
> >  	btrfs_put_block_group(block_group);
> >  	return ret;
> > @@ -3097,6 +3125,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
> >  	u64 min_bytes;
> >  	u64 cont1_bytes;
> >  	int ret;
> > +	bool found_cluster = false;
> >  
> >  	/*
> >  	 * Choose the minimum extent size we'll require for this
> > @@ -3149,6 +3178,7 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
> >  		list_del_init(&entry->list);
> >  
> >  	if (!ret) {
> > +		found_cluster = true;
> >  		atomic_inc(&block_group->count);
> >  		list_add_tail(&cluster->block_group_list,
> >  			      &block_group->cluster_list);
> > @@ -3160,6 +3190,9 @@ int btrfs_find_space_cluster(struct btrfs_block_group_cache *block_group,
> >  	spin_unlock(&cluster->lock);
> >  	spin_unlock(&ctl->tree_lock);
> >  
> > +	if (found_cluster)
> > +		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> > +
> >  	return ret;
> >  }
> 
> This bit here seems unrelated to the rest of the change.  Why do we want to
> cancel the discard work if we find a cluster?  And what does it have to do with
> unused bgs?  If it's the allocation part that makes you want to cancel then that
> sort of makes sense in the unused bg context, but this is happening no matter
> what.  It should probably be in its own patch with an explanation, or at least
> in a different patch.  Thanks,
> 
> Josef

I think this got lost in my rebasing over time. I'll either move it to
its own patch or axe it, I'm not sure yet. The idea is that later on we
don't care about metadata block groups, and they are the only ones that
use clustering. In that case we don't need to worry about fully trimming
them, as the only way one starts trimming is if it's a fully free
metadata block group.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 10/19] btrfs: calculate discard delay based on number of extents
  2019-10-10 15:41   ` Josef Bacik
@ 2019-10-11 18:07     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-11 18:07 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 11:41:33AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:41PM -0400, Dennis Zhou wrote:
> > Use the number of discardable extents to help guide our discard delay
> > interval. This value is reevaluated every transaction commit.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/ctree.h       |  2 ++
> >  fs/btrfs/discard.c     | 31 +++++++++++++++++++++++++++++--
> >  fs/btrfs/discard.h     |  3 +++
> >  fs/btrfs/extent-tree.c |  4 +++-
> >  fs/btrfs/sysfs.c       | 30 ++++++++++++++++++++++++++++++
> >  5 files changed, 67 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 8479ab037812..b0823961d049 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -449,6 +449,8 @@ struct btrfs_discard_ctl {
> >  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> >  	atomic_t discard_extents;
> >  	atomic64_t discardable_bytes;
> > +	atomic_t delay;
> > +	atomic_t iops_limit;
> >  };
> >  
> >  /* delayed seq elem */
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index 75a2ff14b3c0..c7afb5f8240d 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -15,6 +15,11 @@
> >  
> >  #define BTRFS_DISCARD_DELAY		(300ULL * NSEC_PER_SEC)
> >  
> > +/* target discard delay in milliseconds */
> > +#define BTRFS_DISCARD_TARGET_MSEC	(6 * 60 * 60ULL * MSEC_PER_SEC)
> > +#define BTRFS_DISCARD_MAX_DELAY		(10000UL)
> > +#define BTRFS_DISCARD_MAX_IOPS		(10UL)
> > +
> >  static struct list_head *
> >  btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  		       struct btrfs_block_group_cache *cache)
> > @@ -170,10 +175,12 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
> >  
> >  	cache = find_next_cache(discard_ctl, now);
> >  	if (cache) {
> > -		u64 delay = 0;
> > +		u64 delay = atomic_read(&discard_ctl->delay);
> >  
> >  		if (now < cache->discard_delay)
> > -			delay = nsecs_to_jiffies(cache->discard_delay - now);
> > +			delay = max_t(u64, delay,
> > +				      nsecs_to_jiffies(cache->discard_delay -
> > +						       now));
> 
> Small nit, instead
> 
> 			delay = nsecs_to_jiffies(cache->discard_delay - now);
> 			delay = max_t(u64, delay,
> 				      atomic_read(&discard_ctl->delay));
> 
> Looks a little cleaner.  Otherwise

Hmmm. Does that work if now > cache->discard_delay? The subtraction is
unsigned, so it would wrap, and I'm worried that max_t with type u64
would then pick the wrapped value.
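
To make the concern concrete, here is a minimal userspace sketch, not
the patch code. sketch_nsecs_to_jiffies() is a stand-in that assumes
HZ == 1000, and both delay helpers are hypothetical names. It only
shows what happens to an unsigned subtraction when the deadline has
already passed, and how keeping the "now < deadline" check avoids it.

```c
#include <stdint.h>

/* Stand-in for the kernel's nsecs_to_jiffies(); assumes HZ == 1000. */
static uint64_t sketch_nsecs_to_jiffies(uint64_t ns)
{
	return ns / 1000000ULL;
}

/* Unguarded form: "deadline - now" wraps around when now > deadline. */
static uint64_t delay_unguarded(uint64_t base, uint64_t deadline, uint64_t now)
{
	uint64_t d = sketch_nsecs_to_jiffies(deadline - now);

	return d > base ? d : base;
}

/* Guarded form: only subtract when the deadline is still in the future. */
static uint64_t delay_guarded(uint64_t base, uint64_t deadline, uint64_t now)
{
	uint64_t d = base;

	if (now < deadline) {
		uint64_t remaining = sketch_nsecs_to_jiffies(deadline - now);

		if (remaining > d)
			d = remaining;
	}
	return d;
}
```

With a deadline in the past, the guarded helper falls back to the base
delay, while the unguarded one returns an enormous value because the
wrapped difference wins the max.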

> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Thanks,
> 
> Josef

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 07/19] btrfs: discard one region at a time in async discard
  2019-10-10 15:22   ` Josef Bacik
@ 2019-10-14 19:42     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 19:42 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Dennis Zhou, David Sterba, Chris Mason, Omar Sandoval,
	kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 11:22:44AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:38PM -0400, Dennis Zhou wrote:
> > The prior two patches added discarding via a background workqueue. This
> > just piggybacked off of the fstrim code to trim the whole block at once.
> > Well inevitably this is worse performance wise and will aggressively
> > overtrim. But it was nice to plumb the other infrastructure to keep the
> > patches easier to review.
> > 
> > This adds the real goal of this series which is discarding slowly (ie a
> > slow long running fstrim). The discarding is split into two phases,
> > extents and then bitmaps. The reason for this is twofold. First, the
> > bitmap regions overlap the extent regions. Second, discarding the
> > extents first will let the newly trimmed bitmaps have the highest chance
> > of coalescing when being readded to the free space cache.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/block-group.h      |   2 +
> >  fs/btrfs/discard.c          |  73 ++++++++++++++++++++-----
> >  fs/btrfs/discard.h          |  16 ++++++
> >  fs/btrfs/extent-tree.c      |   3 +-
> >  fs/btrfs/free-space-cache.c | 106 ++++++++++++++++++++++++++----------
> >  fs/btrfs/free-space-cache.h |   6 +-
> >  6 files changed, 159 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> > index 0f9a1c91753f..b59e6a8ed73d 100644
> > --- a/fs/btrfs/block-group.h
> > +++ b/fs/btrfs/block-group.h
> > @@ -120,6 +120,8 @@ struct btrfs_block_group_cache {
> >  	struct list_head discard_list;
> >  	int discard_index;
> >  	u64 discard_delay;
> > +	u64 discard_cursor;
> > +	u32 discard_flags;
> >  
> 
> Same comment as the free space flags, this is just a state holder and never has
> more than one bit set, so switch it to an enum and treat it like a state.
> 

Yeah, I've switched it to an enum.


> >  	/* For dirty block groups */
> >  	struct list_head dirty_list;
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index fb92b888774d..26a1e44b4bfa 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -22,21 +22,28 @@ btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  	return &discard_ctl->discard_list[cache->discard_index];
> >  }
> >  
> > -void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> > -			       struct btrfs_block_group_cache *cache)
> > +static void __btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> > +					struct btrfs_block_group_cache *cache)
> >  {
> >  	u64 now = ktime_get_ns();
> >  
> > -	spin_lock(&discard_ctl->lock);
> > -
> >  	if (list_empty(&cache->discard_list) || !cache->discard_index) {
> >  		if (!cache->discard_index)
> >  			cache->discard_index = 1;
> >  		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
> > +		cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
> >  	}
> >  
> >  	list_move_tail(&cache->discard_list,
> >  		       btrfs_get_discard_list(discard_ctl, cache));
> > +}
> > +
> > +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> > +			       struct btrfs_block_group_cache *cache)
> > +{
> > +	spin_lock(&discard_ctl->lock);
> > +
> > +	__btrfs_add_to_discard_list(discard_ctl, cache);
> >  
> >  	spin_unlock(&discard_ctl->lock);
> >  }
> > @@ -53,6 +60,7 @@ void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> >  
> >  	cache->discard_index = 0;
> >  	cache->discard_delay = now;
> > +	cache->discard_flags |= BTRFS_DISCARD_RESET_CURSOR;
> >  	list_add_tail(&cache->discard_list, &discard_ctl->discard_list[0]);
> >  
> >  	spin_unlock(&discard_ctl->lock);
> > @@ -114,13 +122,24 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
> >  
> >  	spin_lock(&discard_ctl->lock);
> >  
> > +again:
> >  	cache = find_next_cache(discard_ctl, now);
> >  
> > -	if (cache && now < cache->discard_delay)
> > +	if (cache && now > cache->discard_delay) {
> > +		discard_ctl->cache = cache;
> > +		if (cache->discard_index == 0 &&
> > +		    cache->free_space_ctl->free_space != cache->key.offset) {
> > +			__btrfs_add_to_discard_list(discard_ctl, cache);
> > +			goto again;
> 
> The magic number thing again, it needs to be discard_index == UNUSED_DISCARD or
> some such descriptive thing.  Also needs to be btrfs_block_group_used(cache) ==
> 0;
> 

I added #defines with an explanation of the names. The gist is that
index 0 is reserved for fully free block groups that are being trimmed
on the unused_bg free path. Indices >= 1 are lists with monotonically
decreasing size filters, used to prioritize trimming of regions.
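
As a rough illustration of that layout, here is a userspace sketch. All
SKETCH_* names are hypothetical, and the 1MB and 32KB filter sizes are
the ones mentioned for the multiple-list patch later in the series, so
treat them as assumptions at this point in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; index 0 is the fully free (unused bg) list. */
enum {
	SKETCH_DISCARD_INDEX_UNUSED = 0,
	SKETCH_DISCARD_INDEX_START = 1,
	SKETCH_NR_DISCARD_LISTS = 3,
};

/* Minimum discard length per list, decreasing after index 0. */
static const uint64_t sketch_discard_minlen[SKETCH_NR_DISCARD_LISTS] = {
	0,		/* unused block groups: trim everything */
	1024 * 1024,	/* assumed max filter: 1MB */
	32 * 1024,	/* assumed min filter: 32KB */
};

/* Would a free region of this size be trimmed from the given list? */
static bool sketch_passes_filter(int index, uint64_t bytes)
{
	return bytes >= sketch_discard_minlen[index];
}
```

A 64KB region is skipped while the block group sits on list 1 and is
picked up once it moves to list 2, while the unused list has no filter
at all.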

> > +		}
> > +		if (btrfs_discard_reset_cursor(cache)) {
> > +			cache->discard_cursor = cache->key.objectid;
> > +			cache->discard_flags &= ~(BTRFS_DISCARD_RESET_CURSOR |
> > +						  BTRFS_DISCARD_BITMAPS);
> > +		}
> > +	} else {
> >  		cache = NULL;
> > -
> > -	discard_ctl->cache = cache;
> > -
> > +	}
> >  	spin_unlock(&discard_ctl->lock);
> >  
> >  	return cache;
> > @@ -173,18 +192,42 @@ static void btrfs_discard_workfn(struct work_struct *work)
> >  
> >  	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
> >  
> > +again:
> >  	cache = peek_discard_list(discard_ctl);
> >  	if (!cache || !btrfs_run_discard_work(discard_ctl))
> >  		return;
> >  
> > -	btrfs_trim_block_group(cache, &trimmed, cache->key.objectid,
> > -			       btrfs_block_group_end(cache), 0);
> > +	if (btrfs_discard_bitmaps(cache))
> > +		btrfs_trim_block_group_bitmaps(cache, &trimmed,
> > +					       cache->discard_cursor,
> > +					       btrfs_block_group_end(cache),
> > +					       0, true);
> > +	else
> > +		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
> > +				       btrfs_block_group_end(cache), 0, true);
> > +
> > +	if (cache->discard_cursor >= btrfs_block_group_end(cache)) {
> > +		if (btrfs_discard_bitmaps(cache)) {
> > +			remove_from_discard_list(discard_ctl, cache);
> > +			if (btrfs_is_free_space_trimmed(cache))
> > +				btrfs_mark_bg_unused(cache);
> > +			else if (cache->free_space_ctl->free_space ==
> > +				 cache->key.offset)
> 
> btrfs_block_group_used(cache) == 0;
> 

Fixed.

> > +				btrfs_add_to_discard_free_list(discard_ctl,
> > +							       cache);
> > +		} else {
> > +			cache->discard_cursor = cache->key.objectid;
> > +			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
> > +		}
> > +	}
> > +
> > +	spin_lock(&discard_ctl->lock);
> > +	discard_ctl->cache = NULL;
> > +	spin_unlock(&discard_ctl->lock);
> >  
> > -	remove_from_discard_list(discard_ctl, cache);
> > -	if (btrfs_is_free_space_trimmed(cache))
> > -		btrfs_mark_bg_unused(cache);
> > -	else if (cache->free_space_ctl->free_space == cache->key.offset)
> > -		btrfs_add_to_discard_free_list(discard_ctl, cache);
> > +	/* we didn't trim anything but we really ought to so try again */
> > +	if (trimmed == 0)
> > +		goto again;
> 
> Why?  We'll reschedule if we need to.  We unconditionally do this, I feel like
> there's going to be some corner case where we end up seeing this workfn using
> 100% on a bunch of sandcastle boxes and we'll all be super sad.
> 

I only added this as an "eh, let's make as much forward progress as
possible" kind of deal, since the rate limits are pretty aggressive.
Sandcastle seemed fine with it, but I can take it out.

> >  
> >  	btrfs_discard_schedule_work(discard_ctl, false);
> >  }
> > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> > index 55f79b624943..22cfa7e401bb 100644
> > --- a/fs/btrfs/discard.h
> > +++ b/fs/btrfs/discard.h
> > @@ -13,6 +13,22 @@
> >  #include "block-group.h"
> >  #include "free-space-cache.h"
> >  
> > +/* discard flags */
> > +#define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
> > +#define BTRFS_DISCARD_BITMAPS           (1UL << 1)
> > +
> > +static inline
> > +bool btrfs_discard_reset_cursor(struct btrfs_block_group_cache *cache)
> > +{
> > +	return (cache->discard_flags & BTRFS_DISCARD_RESET_CURSOR);
> > +}
> > +
> > +static inline
> > +bool btrfs_discard_bitmaps(struct btrfs_block_group_cache *cache)
> > +{
> > +	return (cache->discard_flags & BTRFS_DISCARD_BITMAPS);
> > +}
> > +
> >  void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  			       struct btrfs_block_group_cache *cache);
> >  void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index d69ee5f51b38..ff42e4abb01d 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -5683,7 +5683,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
> >  						     &group_trimmed,
> >  						     start,
> >  						     end,
> > -						     range->minlen);
> > +						     range->minlen,
> > +						     false);
> >  
> >  			trimmed += group_trimmed;
> >  			if (ret) {
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index ed0e7ee4c78d..97b3074e83c0 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -3267,7 +3267,8 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
> >  }
> >  
> >  static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> > -			  u64 *total_trimmed, u64 start, u64 end, u64 minlen)
> > +			  u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> > +			  bool async)
> >  {
> >  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >  	struct btrfs_free_space *entry;
> > @@ -3284,36 +3285,25 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> >  		mutex_lock(&ctl->cache_writeout_mutex);
> >  		spin_lock(&ctl->tree_lock);
> >  
> > -		if (ctl->free_space < minlen) {
> > -			spin_unlock(&ctl->tree_lock);
> > -			mutex_unlock(&ctl->cache_writeout_mutex);
> > -			break;
> > -		}
> > +		if (ctl->free_space < minlen)
> > +			goto out_unlock;
> >  
> >  		entry = tree_search_offset(ctl, start, 0, 1);
> > -		if (!entry) {
> > -			spin_unlock(&ctl->tree_lock);
> > -			mutex_unlock(&ctl->cache_writeout_mutex);
> > -			break;
> > -		}
> > +		if (!entry)
> > +			goto out_unlock;
> >  
> >  		/* skip bitmaps */
> > -		while (entry->bitmap) {
> > +		while (entry->bitmap || (async &&
> > +					 btrfs_free_space_trimmed(entry))) {
> 
> Update the comment to say we're skipping already trimmed entries as well please.
> 

Done.

> >  			node = rb_next(&entry->offset_index);
> > -			if (!node) {
> > -				spin_unlock(&ctl->tree_lock);
> > -				mutex_unlock(&ctl->cache_writeout_mutex);
> > -				goto out;
> > -			}
> > +			if (!node)
> > +				goto out_unlock;
> >  			entry = rb_entry(node, struct btrfs_free_space,
> >  					 offset_index);
> >  		}
> >  
> > -		if (entry->offset >= end) {
> > -			spin_unlock(&ctl->tree_lock);
> > -			mutex_unlock(&ctl->cache_writeout_mutex);
> > -			break;
> > -		}
> > +		if (entry->offset >= end)
> > +			goto out_unlock;
> >  
> >  		extent_start = entry->offset;
> >  		extent_bytes = entry->bytes;
> > @@ -3338,10 +3328,15 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> >  		ret = do_trimming(block_group, total_trimmed, start, bytes,
> >  				  extent_start, extent_bytes, extent_flags,
> >  				  &trim_entry);
> > -		if (ret)
> > +		if (ret) {
> > +			block_group->discard_cursor = start + bytes;
> >  			break;
> > +		}
> >  next:
> >  		start += bytes;
> > +		block_group->discard_cursor = start;
> > +		if (async && *total_trimmed)
> > +			break;
> 
> Alright so this means we'll only trim one entry and then return if we're async?
> It seems to be the same below for bitmaps.  This deserves a comment for the
> functions, it fundamentally changes the behavior of the function if we're async.
> 

Yeah, I added a short comment in the beginning of each function.
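
For reference, the async behavior boils down to something like the
following userspace sketch. The sketch_* types and function are
hypothetical and ignore locking, bitmaps and minlen. The point is only
that each call trims at most one untrimmed region, advances the cursor
past it, and parks the cursor at the end of the block group when there
is nothing left.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint64_t offset;
	uint64_t bytes;
	bool trimmed;
};

struct sketch_group {
	struct sketch_extent *extents;	/* sorted by offset */
	size_t nr;
	uint64_t end;			/* end of the block group */
	uint64_t cursor;		/* where the next pass resumes */
};

/* Trim at most one region; returns the number of bytes trimmed. */
static uint64_t sketch_trim_one(struct sketch_group *bg)
{
	for (size_t i = 0; i < bg->nr; i++) {
		struct sketch_extent *e = &bg->extents[i];

		/* Skip regions behind the cursor and already trimmed ones. */
		if (e->offset + e->bytes <= bg->cursor || e->trimmed)
			continue;

		e->trimmed = true;
		bg->cursor = e->offset + e->bytes;
		return e->bytes;
	}

	/* Nothing left to trim: signal completion via the cursor. */
	bg->cursor = bg->end;
	return 0;
}
```

The caller can then compare the cursor against the end of the block
group to decide whether to switch phases or drop the block group from
the list.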

> >  
> >  		if (fatal_signal_pending(current)) {
> >  			ret = -ERESTARTSYS;
> > @@ -3350,7 +3345,14 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> >  
> >  		cond_resched();
> >  	}
> > -out:
> > +
> > +	return ret;
> > +
> > +out_unlock:
> > +	block_group->discard_cursor = btrfs_block_group_end(block_group);
> > +	spin_unlock(&ctl->tree_lock);
> > +	mutex_unlock(&ctl->cache_writeout_mutex);
> > +
> >  	return ret;
> >  }
> >  
> > @@ -3390,7 +3392,8 @@ static void end_trimming_bitmap(struct btrfs_free_space *entry)
> >  }
> >  
> >  static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> > -			u64 *total_trimmed, u64 start, u64 end, u64 minlen)
> > +			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> > +			bool async)
> >  {
> >  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >  	struct btrfs_free_space *entry;
> > @@ -3407,13 +3410,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  		spin_lock(&ctl->tree_lock);
> >  
> >  		if (ctl->free_space < minlen) {
> > +			block_group->discard_cursor =
> > +				btrfs_block_group_end(block_group);
> >  			spin_unlock(&ctl->tree_lock);
> >  			mutex_unlock(&ctl->cache_writeout_mutex);
> >  			break;
> >  		}
> >  
> >  		entry = tree_search_offset(ctl, offset, 1, 0);
> > -		if (!entry) {
> > +		if (!entry || (async && start == offset &&
> > +			       btrfs_free_space_trimmed(entry))) {
> >  			spin_unlock(&ctl->tree_lock);
> >  			mutex_unlock(&ctl->cache_writeout_mutex);
> >  			next_bitmap = true;
> > @@ -3446,6 +3452,16 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  			goto next;
> >  		}
> >  
> > +		/*
> > +		 * We already trimmed a region, but are using the locking above
> > +		 * to reset the BTRFS_FSC_TRIMMING_BITMAP flag.
> > +		 */
> > +		if (async && *total_trimmed) {
> > +			spin_unlock(&ctl->tree_lock);
> > +			mutex_unlock(&ctl->cache_writeout_mutex);
> > +			return ret;
> > +		}
> > +
> >  		bytes = min(bytes, end - start);
> >  		if (bytes < minlen) {
> >  			entry->flags &= ~BTRFS_FSC_TRIMMING_BITMAP;
> > @@ -3468,6 +3484,8 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  				  start, bytes, 0, &trim_entry);
> >  		if (ret) {
> >  			reset_trimming_bitmap(ctl, offset);
> > +			block_group->discard_cursor =
> > +				btrfs_block_group_end(block_group);
> >  			break;
> >  		}
> >  next:
> > @@ -3477,6 +3495,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  		} else {
> >  			start += bytes;
> >  		}
> > +		block_group->discard_cursor = start;
> >  
> >  		if (fatal_signal_pending(current)) {
> >  			if (start != offset)
> > @@ -3488,6 +3507,9 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  		cond_resched();
> >  	}
> >  
> > +	if (offset >= end)
> > +		block_group->discard_cursor = end;
> > +
> >  	return ret;
> >  }
> >  
> > @@ -3532,7 +3554,8 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
> >  }
> >  
> >  int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> > -			   u64 *trimmed, u64 start, u64 end, u64 minlen)
> > +			   u64 *trimmed, u64 start, u64 end, u64 minlen,
> > +			   bool async)
> >  {
> >  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >  	int ret;
> > @@ -3547,11 +3570,11 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> >  	btrfs_get_block_group_trimming(block_group);
> >  	spin_unlock(&block_group->lock);
> >  
> > -	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> > -	if (ret)
> > +	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen, async);
> > +	if (ret || async)
> >  		goto out;
> >  
> 
> You already separate out btrfs_trim_block_group_bitmaps, so this function really
> only trims the whole block group if !async.  Make a separate helper for trimming
> only the extents so it's clear what the async stuff is doing, and we're not
> relying on the async to change the behavior of btrfs_trim_block_group().
> Thanks,

Done.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/19] btrfs: track discardable extents for async discard
  2019-10-10 15:36   ` Josef Bacik
@ 2019-10-14 19:50     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 19:50 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 11:36:54AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:39PM -0400, Dennis Zhou wrote:
> > The number of discardable extents will serve as the rate limiting metric
> > for how often we should discard. This keeps track of discardable extents
> > in the free space caches by maintaining deltas and propagating them to
> > the global count.
> > 
> > This also sets up a discard directory in btrfs sysfs and exports the
> > total discard_extents count.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/ctree.h            |  2 +
> >  fs/btrfs/discard.c          |  2 +
> >  fs/btrfs/discard.h          | 19 ++++++++
> >  fs/btrfs/free-space-cache.c | 93 ++++++++++++++++++++++++++++++++++---
> >  fs/btrfs/free-space-cache.h |  2 +
> >  fs/btrfs/sysfs.c            | 33 +++++++++++++
> >  6 files changed, 144 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index c328d2e85e4d..43e515939b9c 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -447,6 +447,7 @@ struct btrfs_discard_ctl {
> >  	spinlock_t lock;
> >  	struct btrfs_block_group_cache *cache;
> >  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> > +	atomic_t discard_extents;
> >  };
> >  
> >  /* delayed seq elem */
> > @@ -831,6 +832,7 @@ struct btrfs_fs_info {
> >  	struct btrfs_workqueue *scrub_wr_completion_workers;
> >  	struct btrfs_workqueue *scrub_parity_workers;
> >  
> > +	struct kobject *discard_kobj;
> >  	struct btrfs_discard_ctl discard_ctl;
> >  
> >  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index 26a1e44b4bfa..0544eb6717d4 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -298,6 +298,8 @@ void btrfs_discard_init(struct btrfs_fs_info *fs_info)
> >  
> >  	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
> >  		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
> > +
> > +	atomic_set(&discard_ctl->discard_extents, 0);
> >  }
> >  
> >  void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
> > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> > index 22cfa7e401bb..85939d62521e 100644
> > --- a/fs/btrfs/discard.h
> > +++ b/fs/btrfs/discard.h
> > @@ -71,4 +71,23 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> >  		btrfs_discard_schedule_work(discard_ctl, false);
> >  }
> >  
> > +static inline
> > +void btrfs_discard_update_discardable(struct btrfs_block_group_cache *cache,
> > +				      struct btrfs_free_space_ctl *ctl)
> > +{
> > +	struct btrfs_discard_ctl *discard_ctl;
> > +	s32 extents_delta;
> > +
> > +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> > +		return;
> > +
> > +	discard_ctl = &cache->fs_info->discard_ctl;
> > +
> > +	extents_delta = ctl->discard_extents[0] - ctl->discard_extents[1];
> > +	if (extents_delta) {
> > +		atomic_add(extents_delta, &discard_ctl->discard_extents);
> > +		ctl->discard_extents[1] = ctl->discard_extents[0];
> > +	}
> 
> What the actual fuck?  I assume you did this to avoid checking DISCARD_ASYNC on
> every update, but man this complexity is not worth it.  We might as well update
> the counter every time to avoid doing stuff like this.
> 
> If there's a better reason for doing it this way then I'm all ears, but even so
> this is not the way to do it.  Just do
> 
> atomic_add(ctl->discard_extents, &discard_ctl->discard_extents);
> ctl->discard_extents = 0;
> 
> and avoid the two step thing.  And a comment, because it was like 5 minutes
> between me seeing this and getting to your reasoning, and in between there was a
> lot of swearing.  Thanks,

The nice thing about doing it this way is that the update is
self-contained and each block_group maintains its own count, which I can
get at with drgn. A global count on its own was very easy to get wrong,
as the total can look pretty reasonable but ultimately be very wrong.
I'd much rather keep it this way than switch to purely delta counters,
so that this information is available from drgn should we want to better
understand any issues with this code.

I added comments and created BTRFS_STAT_CURR and BTRFS_STAT_PREV macros
for this use.
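
A minimal userspace sketch of the current/previous scheme, assuming
index 0 holds the current count and index 1 the last propagated one
(matching the [0] - [1] in the patch). The SKETCH_* names and structs
are hypothetical, and the plain addition stands in for atomic_add().

```c
#include <stdint.h>

enum {
	SKETCH_STAT_CURR = 0,	/* live per-block-group count */
	SKETCH_STAT_PREV = 1,	/* value last folded into the global */
	SKETCH_STAT_NR = 2,
};

struct sketch_free_space_ctl {
	int32_t discard_extents[SKETCH_STAT_NR];
};

struct sketch_discard_ctl {
	int64_t discard_extents;	/* global total */
};

/* Fold the per-block-group change into the global count. */
static void sketch_update_discardable(struct sketch_discard_ctl *global,
				      struct sketch_free_space_ctl *ctl)
{
	int32_t delta = ctl->discard_extents[SKETCH_STAT_CURR] -
			ctl->discard_extents[SKETCH_STAT_PREV];

	if (delta) {
		global->discard_extents += delta; /* atomic_add() in-kernel */
		ctl->discard_extents[SKETCH_STAT_PREV] =
			ctl->discard_extents[SKETCH_STAT_CURR];
	}
}
```

Each block group keeps an absolute count that can be inspected on its
own, and the global total stays equal to the sum of the propagated
values.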

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 11/19] btrfs: add bps discard rate limit
  2019-10-10 15:47   ` Josef Bacik
@ 2019-10-14 19:56     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 19:56 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 11:47:19AM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:42PM -0400, Dennis Zhou wrote:
> > Provide an ability to rate limit based on mbps in addition to the iops
> > delay calculated from number of discardable extents.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/ctree.h   |  2 ++
> >  fs/btrfs/discard.c | 11 +++++++++++
> >  fs/btrfs/sysfs.c   | 30 ++++++++++++++++++++++++++++++
> >  3 files changed, 43 insertions(+)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index b0823961d049..e81f699347e0 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -447,10 +447,12 @@ struct btrfs_discard_ctl {
> >  	spinlock_t lock;
> >  	struct btrfs_block_group_cache *cache;
> >  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> > +	u64 prev_discard;
> >  	atomic_t discard_extents;
> >  	atomic64_t discardable_bytes;
> >  	atomic_t delay;
> >  	atomic_t iops_limit;
> > +	atomic64_t bps_limit;
> >  };
> >  
> >  /* delayed seq elem */
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index c7afb5f8240d..072c73f48297 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -176,6 +176,13 @@ void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
> >  	cache = find_next_cache(discard_ctl, now);
> >  	if (cache) {
> >  		u64 delay = atomic_read(&discard_ctl->delay);
> > +		s64 bps_limit = atomic64_read(&discard_ctl->bps_limit);
> > +
> > +		if (bps_limit)
> > +			delay = max_t(u64, delay,
> > +				      msecs_to_jiffies(MSEC_PER_SEC *
> > +						discard_ctl->prev_discard /
> > +						bps_limit));
> 
> I forget, are we allowed to do 0 / some value?  I feel like I did this at some
> point with io.latency and it panic'ed and was very confused.  Maybe I'm just
> misremembering.
> 

I don't remember there being an issue there, but I'd rather not find
out. I added discard_ctl->prev_discard to the if statement.

> And a similar nit, maybe we just do
> 
> u64 delay = atomic_read(&discard_ctl->delay);
> u64 bps_delay = atomic64_read(&discard_ctl->bps_limit);
> if (bps_delay)
> 	bps_delay = msecs_to_jiffies(MSEC_PER_SEC * blah)
> 
> delay = max(delay, bps_delay);
> 

/*
 * A single delayed workqueue item is responsible for
 * discarding, so we can manage the bytes rate limit by keeping
 * track of the previous discard.
 */
if (bps_limit && discard_ctl->prev_discard) {
	u64 bps_delay = MSEC_PER_SEC * discard_ctl->prev_discard /
			bps_limit;

	delay = max_t(u64, delay, msecs_to_jiffies(bps_delay));
}

This is what I changed it to. I'm not sure I quite grasped what you're
getting at from above.
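
For a sense of scale, the same arithmetic as a userspace sketch,
assuming bps_limit is in bytes per second and skipping the conversion
to jiffies. sketch_bps_delay_ms() and SKETCH_MSEC_PER_SEC are
hypothetical names.

```c
#include <stdint.h>

#define SKETCH_MSEC_PER_SEC 1000ULL

/*
 * Milliseconds to wait so that the previous discard, averaged over the
 * wait, stays at or under bps_limit. Returns 0 when there is no limit
 * or nothing was discarded.
 */
static uint64_t sketch_bps_delay_ms(uint64_t prev_discard, uint64_t bps_limit)
{
	if (!bps_limit || !prev_discard)
		return 0;

	return SKETCH_MSEC_PER_SEC * prev_discard / bps_limit;
}
```

With a previous discard of 64MiB and a limit of 100MiB/s this gives
640ms, and with no limit set the bps term contributes no delay.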

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 12/19] btrfs: limit max discard size for async discard
  2019-10-10 16:16   ` Josef Bacik
@ 2019-10-14 19:57     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 19:57 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 12:16:38PM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:43PM -0400, Dennis Zhou wrote:
> > Throttle the maximum size of a discard so that we can provide an upper
> > bound for the rate of async discard. While the block layer is able to
> > split discards into the appropriate sized discards, we want to be able
> > to account more accurately the rate at which we are consuming ncq slots
> > as well as limit the upper bound of work for a discard.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/discard.h          |  4 ++++
> >  fs/btrfs/free-space-cache.c | 47 +++++++++++++++++++++++++++----------
> >  2 files changed, 39 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> > index acaf56f63b1c..898dd92dbf8f 100644
> > --- a/fs/btrfs/discard.h
> > +++ b/fs/btrfs/discard.h
> > @@ -8,6 +8,7 @@
> >  
> >  #include <linux/kernel.h>
> >  #include <linux/jiffies.h>
> > +#include <linux/sizes.h>
> >  #include <linux/time.h>
> >  #include <linux/workqueue.h>
> >  
> > @@ -15,6 +16,9 @@
> >  #include "block-group.h"
> >  #include "free-space-cache.h"
> >  
> > +/* discard size limits */
> > +#define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
> > +
> 
> Let's make this configurable via sysfs as well.  I assume at some point in the
> far, far future SSD's will stop being shitty and it would be nice to be able to
> easily adjust and test.  Also this only applies to async, so
> BTRFS_ASYNC_DISCARD_MAX_SIZE.  You can add a follow up patch for the sysfs
> stuff, just adjust the name and you can add

Yeah that sounds good. I exposed it as max_discard_bytes in another
patch and renamed all the variables to BTRFS_ASYNC_DISCARD_* around
discard size limits.

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 13/19] btrfs: have multiple discard lists
  2019-10-10 16:51   ` Josef Bacik
@ 2019-10-14 20:04     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 20:04 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 12:51:01PM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:44PM -0400, Dennis Zhou wrote:
> > Non-block group destruction discarding currently only had a single list
> > with no minimum discard length. This can lead to caravaning more
> > meaningful discards behind a heavily fragmented block group.
> > 
> > This adds support for multiple lists with minimum discard lengths to
> > prevent the caravan effect. We promote block groups back up when we
> > exceed the BTRFS_DISCARD_MAX_FILTER size, currently we support only 2
> > lists with filters of 1MB and 32KB respectively.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/ctree.h            |  2 +-
> >  fs/btrfs/discard.c          | 60 +++++++++++++++++++++++++++++++++----
> >  fs/btrfs/discard.h          |  4 +++
> >  fs/btrfs/free-space-cache.c | 37 +++++++++++++++--------
> >  fs/btrfs/free-space-cache.h |  2 +-
> >  5 files changed, 85 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index e81f699347e0..b5608f8dc41a 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -439,7 +439,7 @@ struct btrfs_full_stripe_locks_tree {
> >  };
> >  
> >  /* discard control */
> > -#define BTRFS_NR_DISCARD_LISTS		2
> > +#define BTRFS_NR_DISCARD_LISTS		3
> >  
> >  struct btrfs_discard_ctl {
> >  	struct workqueue_struct *discard_workers;
> > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > index 072c73f48297..296cbffc5957 100644
> > --- a/fs/btrfs/discard.c
> > +++ b/fs/btrfs/discard.c
> > @@ -20,6 +20,10 @@
> >  #define BTRFS_DISCARD_MAX_DELAY		(10000UL)
> >  #define BTRFS_DISCARD_MAX_IOPS		(10UL)
> >  
> > +/* monotonically decreasing filters after 0 */
> > +static int discard_minlen[BTRFS_NR_DISCARD_LISTS] = {0,
> > +	BTRFS_DISCARD_MAX_FILTER, BTRFS_DISCARD_MIN_FILTER};
> > +
> >  static struct list_head *
> >  btrfs_get_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  		       struct btrfs_block_group_cache *cache)
> > @@ -120,7 +124,7 @@ find_next_cache(struct btrfs_discard_ctl *discard_ctl, u64 now)
> >  }
> >  
> >  static struct btrfs_block_group_cache *
> > -peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
> > +peek_discard_list(struct btrfs_discard_ctl *discard_ctl, int *discard_index)
> >  {
> >  	struct btrfs_block_group_cache *cache;
> >  	u64 now = ktime_get_ns();
> > @@ -132,6 +136,7 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
> >  
> >  	if (cache && now > cache->discard_delay) {
> >  		discard_ctl->cache = cache;
> > +		*discard_index = cache->discard_index;
> >  		if (cache->discard_index == 0 &&
> >  		    cache->free_space_ctl->free_space != cache->key.offset) {
> >  			__btrfs_add_to_discard_list(discard_ctl, cache);
> > @@ -150,6 +155,36 @@ peek_discard_list(struct btrfs_discard_ctl *discard_ctl)
> >  	return cache;
> >  }
> >  
> > +void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
> > +				u64 bytes)
> > +{
> > +	struct btrfs_discard_ctl *discard_ctl;
> > +
> > +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> > +		return;
> > +
> > +	discard_ctl = &cache->fs_info->discard_ctl;
> > +
> > +	if (cache && cache->discard_index > 1 &&
> > +	    bytes >= BTRFS_DISCARD_MAX_FILTER) {
> > +		remove_from_discard_list(discard_ctl, cache);
> > +		cache->discard_index = 1;
> 
> Really need names here, I have no idea what 1 is.
> 

Yep, done.

> > +		btrfs_add_to_discard_list(discard_ctl, cache);
> > +	}
> > +}
> > +
> > +static void btrfs_update_discard_index(struct btrfs_discard_ctl *discard_ctl,
> > +				       struct btrfs_block_group_cache *cache)
> > +{
> > +	cache->discard_index++;
> > +	if (cache->discard_index == BTRFS_NR_DISCARD_LISTS) {
> > +		cache->discard_index = 1;
> > +		return;
> > +	}
> > +
> > +	btrfs_add_to_discard_list(discard_ctl, cache);
> > +}
> > +
> >  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> >  			       struct btrfs_block_group_cache *cache)
> >  {
> > @@ -202,23 +237,34 @@ static void btrfs_discard_workfn(struct work_struct *work)
> >  {
> >  	struct btrfs_discard_ctl *discard_ctl;
> >  	struct btrfs_block_group_cache *cache;
> > +	int discard_index = 0;
> >  	u64 trimmed = 0;
> > +	u64 minlen = 0;
> >  
> >  	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
> >  
> >  again:
> > -	cache = peek_discard_list(discard_ctl);
> > +	cache = peek_discard_list(discard_ctl, &discard_index);
> >  	if (!cache || !btrfs_run_discard_work(discard_ctl))
> >  		return;
> >  
> > -	if (btrfs_discard_bitmaps(cache))
> > +	minlen = discard_minlen[discard_index];
> > +
> > +	if (btrfs_discard_bitmaps(cache)) {
> > +		u64 maxlen = 0;
> > +
> > +		if (discard_index)
> > +			maxlen = discard_minlen[discard_index - 1];
> > +
> >  		btrfs_trim_block_group_bitmaps(cache, &trimmed,
> >  					       cache->discard_cursor,
> >  					       btrfs_block_group_end(cache),
> > -					       0, true);
> > -	else
> > +					       minlen, maxlen, true);
> > +	} else {
> >  		btrfs_trim_block_group(cache, &trimmed, cache->discard_cursor,
> > -				       btrfs_block_group_end(cache), 0, true);
> > +				       btrfs_block_group_end(cache),
> > +				       minlen, true);
> > +	}
> >  
> >  	discard_ctl->prev_discard = trimmed;
> >  
> > @@ -231,6 +277,8 @@ static void btrfs_discard_workfn(struct work_struct *work)
> >  				 cache->key.offset)
> >  				btrfs_add_to_discard_free_list(discard_ctl,
> >  							       cache);
> > +			else
> > +				btrfs_update_discard_index(discard_ctl, cache);
> >  		} else {
> >  			cache->discard_cursor = cache->key.objectid;
> >  			cache->discard_flags |= BTRFS_DISCARD_BITMAPS;
> > diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> > index 898dd92dbf8f..1daa8da4a1b5 100644
> > --- a/fs/btrfs/discard.h
> > +++ b/fs/btrfs/discard.h
> > @@ -18,6 +18,8 @@
> >  
> >  /* discard size limits */
> >  #define BTRFS_DISCARD_MAX_SIZE		(SZ_64M)
> > +#define BTRFS_DISCARD_MAX_FILTER	(SZ_1M)
> > +#define BTRFS_DISCARD_MIN_FILTER	(SZ_32K)
> >  
> >  /* discard flags */
> >  #define BTRFS_DISCARD_RESET_CURSOR	(1UL << 0)
> > @@ -39,6 +41,8 @@ void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> >  			       struct btrfs_block_group_cache *cache);
> >  void btrfs_add_to_discard_free_list(struct btrfs_discard_ctl *discard_ctl,
> >  				    struct btrfs_block_group_cache *cache);
> > +void btrfs_discard_check_filter(struct btrfs_block_group_cache *cache,
> > +				u64 bytes);
> >  void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
> >  
> >  void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index ce33803a45b2..ed35dc090df6 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -2471,6 +2471,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
> >  	if (ret)
> >  		kmem_cache_free(btrfs_free_space_cachep, info);
> >  out:
> > +	btrfs_discard_check_filter(cache, bytes);
> 
> So we're only accounting the new space?  What if we merge with a larger area
> here?  We should probably make our decision based on the actual trimable area.
> 

Thanks, that's a good catch. I've updated it to use info->bytes where
possible.
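
As a rough illustration of the concern raised above, the sketch below bases
the promotion decision on the size of the merged free region rather than on
the newly freed bytes alone. It is a userspace model, not the btrfs code:
struct free_region, merge_adjacent() and should_promote() are hypothetical
names, and only the 1MB threshold (BTRFS_DISCARD_MAX_FILTER) and the
discard_index > 1 check are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of BTRFS_DISCARD_MAX_FILTER in the patch (SZ_1M). */
#define DISCARD_MAX_FILTER (1024ULL * 1024ULL)

/* Hypothetical stand-in for a free space entry: [offset, offset + bytes). */
struct free_region {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * Merge a newly freed region with an existing neighbour when the two are
 * directly adjacent; otherwise return the new region unchanged.
 */
static struct free_region merge_adjacent(struct free_region neighbour,
					 struct free_region freed)
{
	if (neighbour.offset + neighbour.bytes == freed.offset) {
		/* Neighbour sits immediately before the freed region. */
		freed.offset = neighbour.offset;
		freed.bytes += neighbour.bytes;
	} else if (freed.offset + freed.bytes == neighbour.offset) {
		/* Neighbour sits immediately after the freed region. */
		freed.bytes += neighbour.bytes;
	}
	return freed;
}

/* Promote only block groups sitting on a list below the 1MB filter list. */
static bool should_promote(int discard_index, uint64_t region_bytes)
{
	return discard_index > 1 && region_bytes >= DISCARD_MAX_FILTER;
}
```

With a 768K free neighbour and a newly freed 512K region, the new bytes alone
stay under the filter, while the merged 1280K region crosses it.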

> >  	btrfs_discard_update_discardable(cache, ctl);
> >  	spin_unlock(&ctl->tree_lock);
> >  
> > @@ -3409,7 +3410,13 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
> >  				goto next;
> >  			}
> >  			unlink_free_space(ctl, entry);
> > -			if (bytes > BTRFS_DISCARD_MAX_SIZE) {
> > +			/*
> > +			 * Let bytes = BTRFS_MAX_DISCARD_SIZE + X.
> > +			 * If X < BTRFS_DISCARD_MIN_FILTER, we won't trim X when
> > +			 * we come back around.  So trim it now.
> > +			 */
> > +			if (bytes > (BTRFS_DISCARD_MAX_SIZE +
> > +				     BTRFS_DISCARD_MIN_FILTER)) {
> >  				bytes = extent_bytes = BTRFS_DISCARD_MAX_SIZE;
> >  				entry->offset += BTRFS_DISCARD_MAX_SIZE;
> >  				entry->bytes -= BTRFS_DISCARD_MAX_SIZE;
> > @@ -3510,7 +3517,7 @@ static void end_trimming_bitmap(struct btrfs_free_space_ctl *ctl,
> >  
> >  static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  			u64 *total_trimmed, u64 start, u64 end, u64 minlen,
> > -			bool async)
> > +			u64 maxlen, bool async)
> >  {
> >  	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >  	struct btrfs_free_space *entry;
> > @@ -3535,7 +3542,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
> >  		}
> >  
> >  		entry = tree_search_offset(ctl, offset, 1, 0);
> > -		if (!entry || (async && start == offset &&
> > +		if (!entry || (async && minlen && start == offset &&
> >  			       btrfs_free_space_trimmed(entry))) {
> 
> Huh?  Why do we care if minlen is set if our entry is already trimmed?  If we're
> already trimmed we should just skip it even with minlen set, right?  Thanks,
> 

Yeah, this definitely needs a comment. minlen is used here to check
whether we're in discard_index 0, which is the free path discarding
(minlen is 0 only there). Because bitmaps are lossily marked trimmed (we
skip lone pages smaller than BTRFS_ASYNC_DISCARD_MIN_FILTER), in that
case we need to go back and double check. The goal is to maintain the
invariant that everything is discarded when we forget about a block
group.

I think in practice, with the FORCE_EXTENT_THRESHOLD patch later on,
this shouldn't really be the common case, as it should have been picked
out by some coalescing.
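
A minimal userspace sketch of the filter selection and the skip rule
discussed above, assuming the three-list layout from the quoted patch
(filters of 0, 1MB and 32KB). filter_for_index() and skip_trimmed_bitmap()
are hypothetical helper names introduced only for illustration; the logic
mirrors the minlen/maxlen computation in btrfs_discard_workfn() and the
condition in trim_bitmaps() as quoted.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_DISCARD_LISTS   3
#define DISCARD_MAX_FILTER (1024ULL * 1024ULL)	/* SZ_1M in the patch */
#define DISCARD_MIN_FILTER (32ULL * 1024ULL)	/* SZ_32K in the patch */

/* Index 0 is the free path; the filters decrease after it. */
static const uint64_t discard_minlen[NR_DISCARD_LISTS] = {
	0, DISCARD_MAX_FILTER, DISCARD_MIN_FILTER
};

struct discard_filter {
	uint64_t minlen;
	uint64_t maxlen;
};

/* Pick the size window used when trimming a block group on a given list. */
static struct discard_filter filter_for_index(int discard_index)
{
	struct discard_filter f = { discard_minlen[discard_index], 0 };

	if (discard_index)
		f.maxlen = discard_minlen[discard_index - 1];
	return f;
}

/*
 * An already-trimmed bitmap is skipped only when a filter is active.
 * With minlen == 0 (discard_index 0) it is re-checked, because bitmaps
 * are marked trimmed lossily.
 */
static bool skip_trimmed_bitmap(bool async, uint64_t minlen,
				bool at_bitmap_start, bool trimmed)
{
	return async && minlen && at_bitmap_start && trimmed;
}
```

The point of the minlen test is visible in the last helper: on the free
path nothing is skipped, so the "everything is discarded" invariant does
not depend on the lossy trimmed flag.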

Thanks,
Dennis


* Re: [PATCH 15/19] btrfs: load block_groups into discard_list on mount
  2019-10-10 17:11   ` Josef Bacik
@ 2019-10-14 20:17     ` Dennis Zhou
  2019-10-14 23:38       ` David Sterba
  0 siblings, 1 reply; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 20:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Chris Mason, Omar Sandoval, kernel-team, linux-btrfs

On Thu, Oct 10, 2019 at 01:11:38PM -0400, Josef Bacik wrote:
> On Mon, Oct 07, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> > Async discard doesn't remember the discard state of a block_group when
> > unmounting or when we crash. So, any block_group that is not fully used
> > may have undiscarded regions. However, free space caches are read in on
> > demand. Let the discard worker read in the free space cache so we can
> > proceed with discarding rather than wait for the block_group to be used.
> > This prevents us from indefinitely deferring discards until that
> > particular block_group is reused.
> > 
> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> 
> What if we did completely discard the last time, now we're going back and
> discarding again?  I think by default we just assume we discarded everything.
> If we didn't then the user can always initiate a fitrim later.  Drop this one.
> Thanks,
> 

Yeah this is something I wasn't sure about.

It makes me a little uncomfortable to make the lack of persistence a
user problem. Take the extreme case where someone frees a large amount
of space and then unmounts. We can either make them wait on unmount to
discard everything or retrim the whole drive which in an ideal world
should just be a noop on already free lba space. If others are in favor
of just going the fitrim route for users, I'm happy to drop this patch,
but I do like the fact that this makes the whole system consistent
without user intervention. Does anyone else have an opinion?

On a side note, the find_free_extent() allocator tries pretty hard
before allocating subsequent block groups. So maybe it's right to just
deprioritize these block groups instead of just not loading them.

Thanks, 
Dennis


* Re: [RFC PATCH 00/19] btrfs: async discard support
  2019-10-11  7:49 ` [RFC PATCH 00/19] btrfs: async discard support Nikolay Borisov
@ 2019-10-14 21:05   ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-14 21:05 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Chris Mason, Omar Sandoval, David Sterba, Josef Bacik,
	kernel-team, linux-btrfs

On Fri, Oct 11, 2019 at 10:49:20AM +0300, Nikolay Borisov wrote:
> 
> 
> On 7.10.19 at 23:17, Dennis Zhou wrote:
> > Hello,
> > 
> 
> <snip>
> 
> > 
> > With async discard, we try to emphasize discarding larger regions
> > and reusing the lba (implicit discard). The first is done by using the
> > free space cache to maintain discard state and thus allows us to get
> > coalescing for fairly cheap. A background workqueue is used to scan over
> > an LRU kept list of the block groups. It then uses filters to determine
> > what to discard next hence giving priority to larger discards. While
> > reusing an lba isn't explicitly attempted, it happens implicitly via
> > find_free_extent() which if it happens to find a dirty extent, will
> > grant us reuse of the lba. Additionally, async discarding skips metadata
> 
> By 'dirty' I assume you mean not-discarded-yet-but-free extent?
> 

Yes.

> > block groups as these should see a fairly high turnover as btrfs is a
> > self-packing filesystem being stingy with allocating new block groups
> > until necessary.
> > 
> > Preliminary results seem promising as when a lot of freeing is going on,
> > the discarding is delayed allowing for reuse which translates to less
> > discarding (in addition to the slower discarding). This has shown a
> > reduction in p90 and p99 read latencies on a test on our webservers.
> > 
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
> 
> Do you intend on sharing performance results alongside the workloads
> used to obtain them? Since this is a performance improvement patch in
> its core that is of prime importance!
> 

I'll try and find some stuff to share for v2. As I'm just running this
on production machines, I don't intend to share any workloads. However,
there is an iocost workload that demonstrates the problem nicely that
might already be shared.

The win really is moving the work from transaction commit to completely
background work, effectively making discard a 2nd class citizen. On more
loaded machines, it's not great that discards are blocking transaction
commit. The other thing is it's very drive dependent. Some drives just
have really bad discard implementations and there will be a bigger win
than say on some high end nvme drive.

> > 
> 
> <snip>
> > 
> > Thanks,
> > Dennis
> > 


* Re: [PATCH 15/19] btrfs: load block_groups into discard_list on mount
  2019-10-14 20:17     ` Dennis Zhou
@ 2019-10-14 23:38       ` David Sterba
  2019-10-15 15:42         ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-14 23:38 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Josef Bacik, David Sterba, Chris Mason, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 14, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> On Thu, Oct 10, 2019 at 01:11:38PM -0400, Josef Bacik wrote:
> > On Mon, Oct 07, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> > > Async discard doesn't remember the discard state of a block_group when
> > > unmounting or when we crash. So, any block_group that is not fully used
> > > may have undiscarded regions. However, free space caches are read in on
> > > demand. Let the discard worker read in the free space cache so we can
> > > proceed with discarding rather than wait for the block_group to be used.
> > > This prevents us from indefinitely deferring discards until that
> > > particular block_group is reused.
> > > 
> > > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > 
> > What if we did completely discard the last time, now we're going back and
> > discarding again?  I think by default we just assume we discarded everything.
> > If we didn't then the user can always initiate a fitrim later.  Drop this one.
> > Thanks,
> > 
> 
> Yeah this is something I wasn't sure about.
> 
> It makes me a little uncomfortable to make the lack of persistence a
> user problem. Take the extreme case where someone frees a large amount
> of space and then unmounts.

Based on past experience, umount should not be slowed down unless really
necessary.

> We can either make them wait on unmount to
> discard everything or retrim the whole drive which in an ideal world
> should just be a noop on already free lba space.

Without persistence of the state, we can't make it perfect and I think,
without any hard evidence, that trimming already trimmed blocks is a
no-op on the device. We all know that we don't know what SSDs actually
do, so it's best effort, and making it a "device problem" is a good
solution from the filesystem POV.


* Re: [RFC PATCH 00/19] btrfs: async discard support
  2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
                   ` (19 preceding siblings ...)
  2019-10-11  7:49 ` [RFC PATCH 00/19] btrfs: async discard support Nikolay Borisov
@ 2019-10-15 12:08 ` David Sterba
  2019-10-15 15:41   ` Dennis Zhou
  20 siblings, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-15 12:08 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

Hi,

thanks for working on this. The plain -odiscard hasn't been recommended
to users for a long time, even with the SATA 3.1 drives that allow
queueing the requests.

The overall approach to async discard sounds good, the hard part is not
to shoot down the filesystem by trimming live data, and we had a bug
like that in the past. For correctness reasons I understand the size of
the patchset.

On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

We need a sane default behaviour, without the need for knobs and
configuration, so it's great you can have a wide range of samples to
tune it.

As trim is only a hint, a short delay in processing the requests or
slight ineffectiveness should be acceptable, to avoid complications in the
code or interfering with other IO.

> Persistence isn't supported, so when we mount a filesystem, the block
> groups are read in as dirty and background trim begins. This makes async
> discard more useful for longer running mount points.

I think this is acceptable.

Regarding the code, I leave comments to the block group and trim
structures to Josef and will focus more on the low-level and coding
style or changelogs.


* Re: [PATCH 01/19] bitmap: genericize percpu bitmap region iterators
  2019-10-07 22:24     ` Dennis Zhou
@ 2019-10-15 12:11       ` David Sterba
  2019-10-15 18:35         ` Dennis Zhou
  0 siblings, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-15 12:11 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Josef Bacik, David Sterba, Chris Mason, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 06:24:19PM -0400, Dennis Zhou wrote:
> > > + * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).
> > 
> > Gonna be that guy here, should be '[@start, @end]'
> 
> I disagree here. I'm pretty happy with [@start, @end). If btrfs wants to
> carry their own iterators I'm happy to copy and paste them, but as far
> as percpu goes I like [@start, @end).

It's not clear what the comment was about: the notation of the
half-closed interval, or a request to support a closed interval in the
lookup. The original code has [,) and that shouldn't be changed when
copying. Or am I missing something?
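
For reference, the half-closed [start, end) convention under discussion can
be sketched in plain C as below. This is a userspace illustration, not the
percpu or btrfs iterator: next_set_region() and bit_is_set() are
hypothetical helpers over a byte array. Each reported run is itself
half-closed, so its length is simply re - rs and the next search can resume
at re without any off-by-one adjustment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Test one bit in a byte-array bitmap (bit 0 is the LSB of map[0]). */
static bool bit_is_set(const unsigned char *map, size_t bit)
{
	return ((map[bit / 8] >> (bit % 8)) & 1) != 0;
}

/*
 * Find the next run of set bits inside [*rs, end).
 * On success the run is reported as the half-closed range [*rs, *re).
 * Returns false when no set bit remains before end.
 */
static bool next_set_region(const unsigned char *map,
			    size_t *rs, size_t *re, size_t end)
{
	size_t s = *rs;
	size_t e;

	while (s < end && !bit_is_set(map, s))
		s++;
	if (s >= end)
		return false;

	e = s;
	while (e < end && bit_is_set(map, e))
		e++;

	*rs = s;
	*re = e;	/* one past the last set bit: exclusive bound */
	return true;
}
```

With a closed [start, end] interval the same loop would need +1/-1
corrections both for the run length and for resuming the search.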


* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
  2019-10-07 20:37   ` Josef Bacik
  2019-10-08 12:46   ` Nikolay Borisov
@ 2019-10-15 12:17   ` David Sterba
  2019-10-15 19:58     ` Dennis Zhou
  2 siblings, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-15 12:17 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
>  	bool merged = false;
>  	u64 offset = info->offset;
>  	u64 bytes = info->bytes;
> +	bool is_trimmed = btrfs_free_space_trimmed(info);

Please add a const in such cases. I've been doing that in other patches
but as more iterations are expected, let's have it there from the
beginning.

> --- a/fs/btrfs/free-space-cache.h
> +++ b/fs/btrfs/free-space-cache.h
> @@ -6,6 +6,8 @@
>  #ifndef BTRFS_FREE_SPACE_CACHE_H
>  #define BTRFS_FREE_SPACE_CACHE_H
>  
> +#define BTRFS_FSC_TRIMMED		(1UL << 0)

Please add a comment


* Re: [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap
  2019-10-07 20:17 ` [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
  2019-10-10 14:16   ` Josef Bacik
@ 2019-10-15 12:23   ` David Sterba
  1 sibling, 0 replies; 71+ messages in thread
From: David Sterba @ 2019-10-15 12:23 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:35PM -0400, Dennis Zhou wrote:
> --- a/fs/btrfs/free-space-cache.h
> +++ b/fs/btrfs/free-space-cache.h
> @@ -7,6 +7,7 @@
>  #define BTRFS_FREE_SPACE_CACHE_H
>  
>  #define BTRFS_FSC_TRIMMED		(1UL << 0)
> +#define BTRFS_FSC_TRIMMING_BITMAP	(1UL << 1)
>  
>  struct btrfs_free_space {
>  	struct rb_node offset_index;
> @@ -23,6 +24,12 @@ static inline bool btrfs_free_space_trimmed(struct btrfs_free_space *info)
>  	return (info->flags & BTRFS_FSC_TRIMMED);
>  }
>  
> +static inline
> +bool btrfs_free_space_trimming_bitmap(struct btrfs_free_space *info)

Please keep the specifiers and type on the same line as the function,

static inline bool btrfs_free_space_trimming_bitmap(struct btrfs_free_space *info)

Short 80 column overflow of ( or { is ok (though checkpatch would report
that). I've seen split type/name in several other patches and will not
point out every occurrence, so please fix it up as you find them.


* Re: [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue
  2019-10-07 20:17 ` [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  2019-10-10 14:38   ` Josef Bacik
@ 2019-10-15 12:49   ` David Sterba
  2019-10-15 19:57     ` Dennis Zhou
  1 sibling, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-15 12:49 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:36PM -0400, Dennis Zhou wrote:
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -115,7 +115,11 @@ struct btrfs_block_group_cache {
>  	/* For read-only block groups */
>  	struct list_head ro_list;
>  
> +	/* For discard operations */
>  	atomic_t trimming;
> +	struct list_head discard_list;
> +	int discard_index;
> +	u64 discard_delay;
>  
>  	/* For dirty block groups */
>  	struct list_head dirty_list;
> @@ -157,6 +161,12 @@ struct btrfs_block_group_cache {
>  	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>  };
>  
> +static inline
> +u64 btrfs_block_group_end(struct btrfs_block_group_cache *cache)
> +{
> +	return (cache->key.objectid + cache->key.offset);
> +}
> +
>  #ifdef CONFIG_BTRFS_DEBUG
>  static inline int btrfs_should_fragment_free_space(
>  		struct btrfs_block_group_cache *block_group)
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1877586576aa..419445868909 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -438,6 +438,17 @@ struct btrfs_full_stripe_locks_tree {
>  	struct mutex lock;
>  };
>  
> +/* discard control */

This is going to be a 'fix everywhere too' comment: please start
comments with a capital letter.

> +#define BTRFS_NR_DISCARD_LISTS		1

Constants and defines should be documented

> --- /dev/null
> +++ b/fs/btrfs/discard.c
> @@ -0,0 +1,200 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Facebook.  All rights reserved.
> + */

With the SPDX in place and immutable git history, the copyright notices
are not necessary and we don't add them to new files anymore.

> +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> +			       struct btrfs_block_group_cache *cache)
> +{
> +	u64 now = ktime_get_ns();

Variable used only once, can be removed and ktime_get_ns called
directly.

> +	spin_lock(&discard_ctl->lock);
> +
> +	if (list_empty(&cache->discard_list))
> +		cache->discard_delay = now + BTRFS_DISCARD_DELAY;

->discard_delay does not seem to be a delay but an expiration time, so
this is a bit confusing. BTRFS_DISCARD_DELAY is the delay time, that's
clear.
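
A small userspace sketch of the naming point, under stated assumptions: the
field is renamed to the hypothetical discard_eligible_time to show that it
stores an absolute timestamp, and DISCARD_DELAY_NS is a made-up value
standing in for BTRFS_DISCARD_DELAY, whose real value is not shown here.
The "now > timestamp" test matches the check quoted from
peek_discard_list().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical delay; stands in for BTRFS_DISCARD_DELAY. */
#define DISCARD_DELAY_NS (5ULL * 1000000000ULL)

struct block_group {
	uint64_t discard_eligible_time;	/* absolute time, not a duration */
	bool on_discard_list;
};

/* Arm the timestamp only when the block group is first queued. */
static void queue_for_discard(struct block_group *bg, uint64_t now_ns)
{
	if (!bg->on_discard_list) {
		bg->discard_eligible_time = now_ns + DISCARD_DELAY_NS;
		bg->on_discard_list = true;
	}
}

/* A block group may be discarded once its timestamp has passed. */
static bool discard_ready(const struct block_group *bg, uint64_t now_ns)
{
	return bg->on_discard_list && now_ns > bg->discard_eligible_time;
}
```

Re-queueing an already listed block group leaves the timestamp alone, which
is the behaviour the list_empty() check in the quoted code gives.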

> +	list_move_tail(&cache->discard_list,
> +		       btrfs_get_discard_list(discard_ctl, cache));
> +
> +	spin_unlock(&discard_ctl->lock);
> +}

> --- /dev/null
> +++ b/fs/btrfs/discard.h
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Facebook.  All rights reserved.
> + */
> +
> +#ifndef BTRFS_DISCARD_H
> +#define BTRFS_DISCARD_H
> +
> +#include <linux/kernel.h>
> +#include <linux/workqueue.h>
> +
> +#include "ctree.h"

Is it possible to avoid including ctree.h here? Like adding forward
declarations and defining the helpers in .c (that will have to include
ctree.h anyway). The includes have become very cluttered and untangling
the dependencies is ongoing work so it would be good to avoid adding
extra work.

> +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> +			       struct btrfs_block_group_cache *cache);
> +
> +void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> +			       struct btrfs_block_group_cache *cache);
> +void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
> +				 bool override);
> +void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
> +void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
> +void btrfs_discard_init(struct btrfs_fs_info *fs_info);
> +void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
> +
> +static inline
> +bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
> +{
> +	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
> +						     struct btrfs_fs_info,
> +						     discard_ctl);
> +
> +	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
> +		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
> +}
> +
> +static inline
> +void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> +			      struct btrfs_block_group_cache *cache)
> +{
> +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> +		return;
> +
> +	btrfs_add_to_discard_list(discard_ctl, cache);
> +	if (!delayed_work_pending(&discard_ctl->work))
> +		btrfs_discard_schedule_work(discard_ctl, false);
> +}

These two would need full fs_info definition but they don't seem to be
called in performance sensitive code so a full function call is ok here.
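
The include suggestion can be sketched roughly as follows: the header
carries only forward declarations and prototypes, and the helper that needs
the full structure layout moves into the .c file. This is a single-file
userspace model with hypothetical *_sketch types and fields, not the real
btrfs_fs_info or btrfs_discard_ctl.

```c
#include <stdbool.h>
#include <stddef.h>

/* ---- what a slimmed-down discard.h could contain ---- */

struct fs_info_sketch;		/* forward declarations only */
struct discard_ctl_sketch;

bool run_discard_work(const struct discard_ctl_sketch *ctl);

/* ---- what discard.c sees after including the full definitions ---- */

struct discard_ctl_sketch {
	int nr_lists;
};

struct fs_info_sketch {
	bool read_only;
	bool discard_running;
	struct discard_ctl_sketch discard_ctl;	/* embedded, as in the patch */
};

/*
 * Out-of-line version of the former static inline helper. Recovering the
 * enclosing structure needs its full layout, so it lives in the .c file.
 */
bool run_discard_work(const struct discard_ctl_sketch *ctl)
{
	const struct fs_info_sketch *fs_info = (const struct fs_info_sketch *)
		((const char *)ctl -
		 offsetof(struct fs_info_sketch, discard_ctl));

	return !fs_info->read_only && fs_info->discard_running;
}
```

Callers that include only the header can still pass pointers around; they
just cannot dereference them, which is what keeps the include list short.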

> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -313,6 +316,7 @@ enum {
>  	Opt_datasum, Opt_nodatasum,
>  	Opt_defrag, Opt_nodefrag,
>  	Opt_discard, Opt_nodiscard,
> +	Opt_discard_version,

This is probably copied from space_cache options, 'version' does not
fit discard, it could be 'mode'

>  	Opt_nologreplay,
>  	Opt_norecovery,
>  	Opt_ratio,
>  		case Opt_space_cache_version:


* Re: [PATCH 08/19] btrfs: track discardable extents for asnyc discard
  2019-10-07 20:17 ` [PATCH 08/19] btrfs: track discardable extents for asnyc discard Dennis Zhou
  2019-10-10 15:36   ` Josef Bacik
@ 2019-10-15 13:12   ` David Sterba
  2019-10-15 18:41     ` Dennis Zhou
  1 sibling, 1 reply; 71+ messages in thread
From: David Sterba @ 2019-10-15 13:12 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Oct 07, 2019 at 04:17:39PM -0400, Dennis Zhou wrote:
> The number of discardable extents will serve as the rate limiting metric
> for how often we should discard. This keeps track of discardable extents
> in the free space caches by maintaining deltas and propagating them to
> the global count.
> 
> This also sets up a discard directory in btrfs sysfs and exports the
> total discard_extents count.

Please put the discard directory under debug/ for now.

> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>  fs/btrfs/ctree.h            |  2 +
>  fs/btrfs/discard.c          |  2 +
>  fs/btrfs/discard.h          | 19 ++++++++
>  fs/btrfs/free-space-cache.c | 93 ++++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.h |  2 +
>  fs/btrfs/sysfs.c            | 33 +++++++++++++
>  6 files changed, 144 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c328d2e85e4d..43e515939b9c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -447,6 +447,7 @@ struct btrfs_discard_ctl {
>  	spinlock_t lock;
>  	struct btrfs_block_group_cache *cache;
>  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> +	atomic_t discard_extents;

At the end of the series this becomes

452         atomic_t discard_extents;
453         atomic64_t discardable_bytes;
454         atomic_t delay;
455         atomic_t iops_limit;
456         atomic64_t bps_limit;
457         atomic64_t discard_extent_bytes;
458         atomic64_t discard_bitmap_bytes;
459         atomic64_t discard_bytes_saved;

raising many eyebrows. What's the reason to use so many atomics? As this
is purely for accounting and perhaps not contended, add one spinlock
protecting all of them.

None of delay, bps_limit and iops_limit use the atomic_t semantics at
all, it's just _set and _read.

As this seems to cascade to all other patches, I'll postpone my review
until I see V2.
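
A userspace sketch of the "one lock for all the accounting" idea, assuming
C11 atomics are available. The spinlock is modelled with atomic_flag as a
stand-in for spinlock_t, and struct discard_stats with its helpers is a
hypothetical name, not the layout of btrfs_discard_ctl. Only two of the
counters are shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain counters, all protected by a single lock. */
struct discard_stats {
	atomic_flag lock;		/* stand-in for spinlock_t */
	int32_t discardable_extents;
	int64_t discardable_bytes;
};

#define DISCARD_STATS_INIT { .lock = ATOMIC_FLAG_INIT }

static void stats_lock(struct discard_stats *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;	/* spin until the holder releases the flag */
}

static void stats_unlock(struct discard_stats *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Apply both deltas in one critical section so they stay consistent. */
static void stats_update(struct discard_stats *s,
			 int32_t extents_delta, int64_t bytes_delta)
{
	stats_lock(s);
	s->discardable_extents += extents_delta;
	s->discardable_bytes += bytes_delta;
	stats_unlock(s);
}

/* Read a consistent pair of values. */
static void stats_snapshot(struct discard_stats *s,
			   int32_t *extents, int64_t *bytes)
{
	stats_lock(s);
	*extents = s->discardable_extents;
	*bytes = s->discardable_bytes;
	stats_unlock(s);
}
```

One side effect of the single lock is that a reader always sees extents and
bytes from the same update, which independent atomics do not guarantee.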


* Re: [RFC PATCH 00/19] btrfs: async discard support
  2019-10-15 12:08 ` David Sterba
@ 2019-10-15 15:41   ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 15:41 UTC (permalink / raw)
  To: David Sterba
  Cc: Dennis Zhou, David Sterba, Chris Mason, Josef Bacik,
	Omar Sandoval, kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 02:08:31PM +0200, David Sterba wrote:
> Hi,
> 
> thanks for working on this. The plain -odiscard hasn't been recommended
> to users for a long time, even with the SATA 3.1 drives that allow
> queueing the requests.
> 
> The overall approach to async discard sounds good, the hard part is not
> to shoot down the filesystem by trimming live data, and we had a bug
> like that in the past. For correctness reasons I understand the size of
> the patchset.
> 
> On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
> 
> We need a sane default behaviour, without the need for knobs and
> configuration, so it's great you can have a wide range of samples to
> tune it.
> 

Yeah I'm just not quite sure what that is yet. The tricky part is that
we don't really get burned by trims until well after we've issued the
trims. The feedback loop is not really transparent with reads and
writes.

> As trim is only a hint, a short delay in processing the requests or
> slight ineffectiveness should be acceptable, to avoid complications in the
> code or interfering with other IO.
> 
> > Persistence isn't supported, so when we mount a filesystem, the block
> > groups are read in as dirty and background trim begins. This makes async
> > discard more useful for longer running mount points.
> 
> I think this is acceptable.
> 
> Regarding the code, I leave comments to the block group and trim
> structures to Josef and will focus more on the low-level and coding
> style or changelogs.

Sounds good. I'll hopefully post a v2 shortly so that we can get more of
that out of the way. Then hopefully a smaller incremental later to
figure out the configuration part.

Thanks,
Dennis


* Re: [PATCH 15/19] btrfs: load block_groups into discard_list on mount
  2019-10-14 23:38       ` David Sterba
@ 2019-10-15 15:42         ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 15:42 UTC (permalink / raw)
  To: David Sterba
  Cc: Josef Bacik, David Sterba, Chris Mason, Omar Sandoval,
	kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 01:38:25AM +0200, David Sterba wrote:
> On Mon, Oct 14, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> > On Thu, Oct 10, 2019 at 01:11:38PM -0400, Josef Bacik wrote:
> > > On Mon, Oct 07, 2019 at 04:17:46PM -0400, Dennis Zhou wrote:
> > > > Async discard doesn't remember the discard state of a block_group when
> > > > unmounting or when we crash. So, any block_group that is not fully used
> > > > may have undiscarded regions. However, free space caches are read in on
> > > > demand. Let the discard worker read in the free space cache so we can
> > > > proceed with discarding rather than wait for the block_group to be used.
> > > > This prevents us from indefinitely deferring discards until that
> > > > particular block_group is reused.
> > > > 
> > > > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > > 
> > > What if we did completely discard the last time, now we're going back and
> > > discarding again?  I think by default we just assume we discarded everything.
> > > If we didn't then the user can always initiate a fitrim later.  Drop this one.
> > > Thanks,
> > > 
> > 
> > Yeah this is something I wasn't sure about.
> > 
> > It makes me a little uncomfortable to make the lack of persistence a
> > user problem. Consider the extreme case where someone frees a large
> > amount of space and then unmounts.
> 
> Based on past experience, umount should not be slowed down unless really
> necessary.
> 
> > We can either make them wait on unmount to
> > discard everything or retrim the whole drive which in an ideal world
> > should just be a noop on already free lba space.
> 
> Without persistence of the state, we can't make it perfect and I think,
> without any hard evidence, that trimming already trimmed blocks is no-op
> on the device. We all know that we don't know what SSDs actually do, so
> it's best effort and making it "device problem" is a good solution from
> filesystem POV.

That makes sense and sounds good to me. I've dropped this patch.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/19] bitmap: genericize percpu bitmap region iterators
  2019-10-15 12:11       ` David Sterba
@ 2019-10-15 18:35         ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 18:35 UTC (permalink / raw)
  To: David Sterba
  Cc: Dennis Zhou, Josef Bacik, David Sterba, Chris Mason,
	Omar Sandoval, kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 02:11:02PM +0200, David Sterba wrote:
> On Mon, Oct 07, 2019 at 06:24:19PM -0400, Dennis Zhou wrote:
> > > > + * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).
> > > 
> > > Gonna be that guy here, should be '[@start, @end]'
> > 
> > I disagree here. I'm pretty happy with [@start, @end). If btrfs wants to
> > carry their own iterators I'm happy to copy and paste them, but as far
> > as percpu goes I like [@start, @end).
> 
> It's not clear what the comment was about, if it's the notation of
> half-closed interval or request to support closed interval in the
> lookup. The original code has [,) and that shouldn't be changed when
> copying. Or I'm missing something.

I think there was just confusion based on the notation where '[' means
inclusive and ')' means exclusive. That got cleared up.

Thanks,
Dennis
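
For readers following the notation discussion above, here is a minimal
userspace sketch of what iterating over the half-open range [start, end)
means in practice. It is not the iterator from the patch; the helper name
count_set_bits() and the byte-array bitmap layout are assumptions made
only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the set bits in the half-open range
 * [start, end). Bit 'start' is inspected, bit 'end' never is.
 */
static size_t count_set_bits(const unsigned char *bits, size_t start,
			     size_t end)
{
	size_t n = 0;

	assert(start <= end);
	for (size_t i = start; i < end; i++) {
		if (bits[i / 8] & (1u << (i % 8)))
			n++;
	}
	return n;
}
```

One practical property of the half-open form is that adjacent ranges
compose without overlap: the counts for [0, 8) and [8, 16) add up to the
count for [0, 16), and [n, n) is simply empty.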

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/19] btrfs: track discardable extents for asnyc discard
  2019-10-15 13:12   ` David Sterba
@ 2019-10-15 18:41     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 18:41 UTC (permalink / raw)
  To: David Sterba
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 03:12:17PM +0200, David Sterba wrote:
> On Mon, Oct 07, 2019 at 04:17:39PM -0400, Dennis Zhou wrote:
> > The number of discardable extents will serve as the rate limiting metric
> > for how often we should discard. This keeps track of discardable extents
> > in the free space caches by maintaining deltas and propagating them to
> > the global count.
> > 
> > This also sets up a discard directory in btrfs sysfs and exports the
> > total discard_extents count.
> 
> Please put the discard directory under debug/ for now.
> 

Just double checking, but you mean to have it be:
/sys/fs/btrfs/<uuid>/debug/discard/*?

> > Signed-off-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  fs/btrfs/ctree.h            |  2 +
> >  fs/btrfs/discard.c          |  2 +
> >  fs/btrfs/discard.h          | 19 ++++++++
> >  fs/btrfs/free-space-cache.c | 93 ++++++++++++++++++++++++++++++++++---
> >  fs/btrfs/free-space-cache.h |  2 +
> >  fs/btrfs/sysfs.c            | 33 +++++++++++++
> >  6 files changed, 144 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index c328d2e85e4d..43e515939b9c 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -447,6 +447,7 @@ struct btrfs_discard_ctl {
> >  	spinlock_t lock;
> >  	struct btrfs_block_group_cache *cache;
> >  	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
> > +	atomic_t discard_extents;
> 
> At the end of the series this becomes
> 
> 452         atomic_t discard_extents;
> 453         atomic64_t discardable_bytes;
> 454         atomic_t delay;
> 455         atomic_t iops_limit;
> 456         atomic64_t bps_limit;
> 457         atomic64_t discard_extent_bytes;
> 458         atomic64_t discard_bitmap_bytes;
> 459         atomic64_t discard_bytes_saved;
> 
> raising many eyebrows. What's the reason to use so many atomics? As this
> is purely for accounting and perhaps not contended, add one spinlock
> protecting all of them.
> 
> None of delay, bps_limit and iops_limit use the atomic_t semantics at
> all, it's just _set and _read.
> 
> As this seems to cascade to all other patches, I'll postpone my review
> until I see V2.

Yeah... I think the following 3 would be nice to keep as atomics as the
first two are propagated per block group and are protected via the
free_space_ctl's lock. Then multiple allocations can go through
concurrently. discard_bytes_saved is also something that can be
incremented by multiple block groups at once for the same reason, so an
atomic makes life simple.

> 452         atomic_t discard_extents;
> 453         atomic64_t discardable_bytes;
> 459         atomic64_t discard_bytes_saved;

The others I'll flip over to the proper type.

Thanks, 
Dennis
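
The split discussed in this message can be sketched in userspace C11
roughly as follows. This is only an illustration under assumptions: the
kernel uses atomic_t/atomic64_t rather than <stdatomic.h>, the struct and
helper names here are hypothetical, and the plain integer types chosen
for the tunables are a guess at what "the proper type" might be, not what
v2 actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the counters discussed above. */
struct discard_stats {
	/* Updated by many block groups concurrently: kept atomic. */
	atomic_int discard_extents;
	_Atomic int64_t discardable_bytes;
	_Atomic int64_t discard_bytes_saved;

	/* Only ever set and read: plain types (assumed; a lock would
	 * guard them in real code). */
	uint32_t delay;
	uint32_t iops_limit;
	uint64_t bps_limit;
};

static void stats_init(struct discard_stats *s)
{
	atomic_init(&s->discard_extents, 0);
	atomic_init(&s->discardable_bytes, 0);
	atomic_init(&s->discard_bytes_saved, 0);
	s->delay = 0;
	s->iops_limit = 0;
	s->bps_limit = 0;
}

/* Propagate one new discardable extent of 'bytes' to the global counts. */
static void stats_add_extent(struct discard_stats *s, int64_t bytes)
{
	assert(bytes >= 0);
	atomic_fetch_add(&s->discard_extents, 1);
	atomic_fetch_add(&s->discardable_bytes, bytes);
}
```

The point of the sketch is only the division of labour: counters that
several block groups bump at once stay atomic, while values that are
merely written and read back do not need atomic read-modify-write
operations.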

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue
  2019-10-15 12:49   ` David Sterba
@ 2019-10-15 19:57     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 19:57 UTC (permalink / raw)
  To: David Sterba
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 02:49:19PM +0200, David Sterba wrote:
> On Mon, Oct 07, 2019 at 04:17:36PM -0400, Dennis Zhou wrote:
> > --- a/fs/btrfs/block-group.h
> > +++ b/fs/btrfs/block-group.h
> > @@ -115,7 +115,11 @@ struct btrfs_block_group_cache {
> >  	/* For read-only block groups */
> >  	struct list_head ro_list;
> >  
> > +	/* For discard operations */
> >  	atomic_t trimming;
> > +	struct list_head discard_list;
> > +	int discard_index;
> > +	u64 discard_delay;
> >  
> >  	/* For dirty block groups */
> >  	struct list_head dirty_list;
> > @@ -157,6 +161,12 @@ struct btrfs_block_group_cache {
> >  	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> >  };
> >  
> > +static inline
> > +u64 btrfs_block_group_end(struct btrfs_block_group_cache *cache)
> > +{
> > +	return (cache->key.objectid + cache->key.offset);
> > +}
> > +
> >  #ifdef CONFIG_BTRFS_DEBUG
> >  static inline int btrfs_should_fragment_free_space(
> >  		struct btrfs_block_group_cache *block_group)
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 1877586576aa..419445868909 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -438,6 +438,17 @@ struct btrfs_full_stripe_locks_tree {
> >  	struct mutex lock;
> >  };
> >  
> > +/* discard control */
> 
> This is going to be 'fix everywhere too' comment, please start comments
> with capital letter.

I've done a pass at uppercasing everything. I think I've caught most of
them.

> 
> > +#define BTRFS_NR_DISCARD_LISTS		1
> 
> Constants and defines should be documented
> 

Yeah sounds good. I've added comments. A few may not have comments until
later, but I think by the end of the series everything has the right
comments.

> > --- /dev/null
> > +++ b/fs/btrfs/discard.c
> > @@ -0,0 +1,200 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2019 Facebook.  All rights reserved.
> > + */
> 
> With the SPDX in place and immutable git history, the copyright notices
> are not necessary and we don't add them to new files anymore.
> 

Ah okay. I wasn't exactly sure why those got added when they did.

> > +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> > +			       struct btrfs_block_group_cache *cache)
> > +{
> > +	u64 now = ktime_get_ns();
> 
> Variable used only once, can be removed and ktime_get_ns called
> directly.
> 

Done.

> > +	spin_lock(&discard_ctl->lock);
> > +
> > +	if (list_empty(&cache->discard_list))
> > +		cache->discard_delay = now + BTRFS_DISCARD_DELAY;
> 
> ->discard_delay does not seem to be a delay but an expiration time, so
> this is a bit confusing. BTRFS_DISCARD_DELAY is the delay time, that's
> clear.
> 

I added a comment explaining the premise. I just didn't want a block
group to start discarding immediately if we just created it, so give
some chance for the lba to be reused. So, discard_delay holds the
time to begin discarding. I'll try and figure out a better name.
Maybe discard_eligible_time (idk it seems long)?
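
A minimal userspace sketch of the premise described above, assuming the
field is renamed to discard_eligible_time. The struct, the helper names
and the DISCARD_DELAY_NS value are all hypothetical; the real
BTRFS_DISCARD_DELAY value is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed delay, for illustration only. */
#define DISCARD_DELAY_NS	(120ULL * 1000000000ULL)

/* Hypothetical stand-in for the block group fields involved. */
struct block_group {
	bool queued;			/* already on a discard list? */
	uint64_t discard_eligible_time;	/* absolute time, not a delay */
};

/* Stamp the eligible time only on first queueing, like list_empty(). */
static void queue_for_discard(struct block_group *bg, uint64_t now_ns)
{
	assert(bg != NULL);
	if (!bg->queued) {
		bg->discard_eligible_time = now_ns + DISCARD_DELAY_NS;
		bg->queued = true;
	}
}

/* True once the grace period for reusing the lba has passed. */
static bool discard_eligible(const struct block_group *bg, uint64_t now_ns)
{
	assert(bg != NULL);
	return bg->queued && now_ns >= bg->discard_eligible_time;
}
```

Storing an absolute timestamp keeps the check a single comparison, which
is why a name like "eligible time" describes the field better than
"delay".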

> > +	list_move_tail(&cache->discard_list,
> > +		       btrfs_get_discard_list(discard_ctl, cache));
> > +
> > +	spin_unlock(&discard_ctl->lock);
> > +}
> 
> > --- /dev/null
> > +++ b/fs/btrfs/discard.h
> > @@ -0,0 +1,49 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2019 Facebook.  All rights reserved.
> > + */
> > +
> > +#ifndef BTRFS_DISCARD_H
> > +#define BTRFS_DISCARD_H
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/workqueue.h>
> > +
> > +#include "ctree.h"
> 
> Is it possible to avoid including ctree.h here? Like adding forward
> declarations and defining the helpers in .c (that will have to include
> ctree.h anyway). The includes have become very cluttered and untangling
> the dependencies is ongoing work so it would be good to avoid adding
> extra work.
> 

I moved the two inlines to the .c file, and then just struct
declarations were enough.
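
The forward-declaration approach can be sketched in a single file as
below. This is an illustration only: struct fs_info and
run_discard_work() are simplified, hypothetical stand-ins, not the btrfs
definitions, and the two "files" are marked by comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ---- what a slim header could contain ---- */

struct fs_info;		/* forward declaration, no heavy include needed */

bool run_discard_work(const struct fs_info *info);

/* ---- what the .c file would contain ---- */

/* Stands in for the full definition that the .c file gets by including
 * the big header itself. */
struct fs_info {
	bool read_only;
	bool discard_running;
};

bool run_discard_work(const struct fs_info *info)
{
	assert(info != NULL);
	return !info->read_only && info->discard_running;
}
```

A header only needs the complete type when it dereferences the pointer,
so moving the helper bodies out of the header is what makes the forward
declaration sufficient.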

> > +void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
> > +			       struct btrfs_block_group_cache *cache);
> > +
> > +void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
> > +			       struct btrfs_block_group_cache *cache);
> > +void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
> > +				 bool override);
> > +void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
> > +void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
> > +void btrfs_discard_init(struct btrfs_fs_info *fs_info);
> > +void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
> > +
> > +static inline
> > +bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
> > +{
> > +	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
> > +						     struct btrfs_fs_info,
> > +						     discard_ctl);
> > +
> > +	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
> > +		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
> > +}
> > +
> > +static inline
> > +void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> > +			      struct btrfs_block_group_cache *cache)
> > +{
> > +	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
> > +		return;
> > +
> > +	btrfs_add_to_discard_list(discard_ctl, cache);
> > +	if (!delayed_work_pending(&discard_ctl->work))
> > +		btrfs_discard_schedule_work(discard_ctl, false);
> > +}
> 
> These two would need full fs_info definition but they don't seem to be
> called in performance sensitive code so a full function call is ok here.
> 

Done.

> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -313,6 +316,7 @@ enum {
> >  	Opt_datasum, Opt_nodatasum,
> >  	Opt_defrag, Opt_nodefrag,
> >  	Opt_discard, Opt_nodiscard,
> > +	Opt_discard_version,
> 
> This is probably copied from space_cache options, 'version' does not
> fit discard, it could be 'mode'
> 

In the initial version I actually named them discard v1 and v2. Omar
mentioned I should probably give them separate flags, which is why it's
now sync and async. I renamed it to mode.
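
As a rough illustration of a "mode" style option with the two values
named above, a userspace sketch might look like this. It does not use
the kernel's mount option parser, and the enum, the function name and
the accepted strings are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical discard modes. */
enum discard_mode {
	DISCARD_MODE_INVALID = -1,
	DISCARD_MODE_SYNC,
	DISCARD_MODE_ASYNC,
};

/* Map the option argument to a mode; unknown strings are rejected. */
static enum discard_mode parse_discard_mode(const char *arg)
{
	assert(arg != NULL);
	if (strcmp(arg, "sync") == 0)
		return DISCARD_MODE_SYNC;
	if (strcmp(arg, "async") == 0)
		return DISCARD_MODE_ASYNC;
	return DISCARD_MODE_INVALID;
}
```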

> >  	Opt_nologreplay,
> >  	Opt_norecovery,
> >  	Opt_ratio,
> >  		case Opt_space_cache_version:

Thanks, 
Dennis

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/19] btrfs: keep track of which extents have been discarded
  2019-10-15 12:17   ` David Sterba
@ 2019-10-15 19:58     ` Dennis Zhou
  0 siblings, 0 replies; 71+ messages in thread
From: Dennis Zhou @ 2019-10-15 19:58 UTC (permalink / raw)
  To: David Sterba
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Tue, Oct 15, 2019 at 02:17:55PM +0200, David Sterba wrote:
> On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> >  	bool merged = false;
> >  	u64 offset = info->offset;
> >  	u64 bytes = info->bytes;
> > +	bool is_trimmed = btrfs_free_space_trimmed(info);
> 
> Please add a const in such cases. I've been doing that in other patches
> but as more iterations are expected, let's have it there from the
> beginning.
> 

Done.

> > --- a/fs/btrfs/free-space-cache.h
> > +++ b/fs/btrfs/free-space-cache.h
> > @@ -6,6 +6,8 @@
> >  #ifndef BTRFS_FREE_SPACE_CACHE_H
> >  #define BTRFS_FREE_SPACE_CACHE_H
> >  
> > +#define BTRFS_FSC_TRIMMED		(1UL << 0)
> 
> Please add a comment

I've switched this to an enum and added a comment above it.

Thanks,
Dennis
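
The two review points in this message (a const local and a documented
enum instead of a bare #define) could look roughly like the following
userspace sketch. The enum, struct and helper names are hypothetical and
the v2 definitions are not shown in this thread, so treat this as one
possible shape rather than the actual change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bits for a free space entry (hypothetical names). */
enum free_space_flags {
	/* The region has already been discarded. */
	FSC_TRIMMED = 1 << 0,
};

struct free_space {
	unsigned long flags;
};

static bool free_space_trimmed(const struct free_space *info)
{
	assert(info != NULL);
	return (info->flags & FSC_TRIMMED) != 0;
}

/* Illustrative helper showing const locals that are computed once. */
static bool both_trimmed(const struct free_space *a,
			 const struct free_space *b)
{
	const bool a_trimmed = free_space_trimmed(a);
	const bool b_trimmed = free_space_trimmed(b);

	return a_trimmed && b_trimmed;
}
```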

^ permalink raw reply	[flat|nested] 71+ messages in thread


Thread overview: 71+ messages
2019-10-07 20:17 [RFC PATCH 00/19] btrfs: async discard support Dennis Zhou
2019-10-07 20:17 ` [PATCH 01/19] bitmap: genericize percpu bitmap region iterators Dennis Zhou
2019-10-07 20:26   ` Josef Bacik
2019-10-07 22:24     ` Dennis Zhou
2019-10-15 12:11       ` David Sterba
2019-10-15 18:35         ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 02/19] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
2019-10-07 20:27   ` Josef Bacik
2019-10-08 11:12   ` Johannes Thumshirn
2019-10-11  9:19   ` Nikolay Borisov
2019-10-07 20:17 ` [PATCH 03/19] btrfs: keep track of which extents have been discarded Dennis Zhou
2019-10-07 20:37   ` Josef Bacik
2019-10-07 22:38     ` Dennis Zhou
2019-10-10 13:40       ` Josef Bacik
2019-10-11 16:15         ` Dennis Zhou
2019-10-08 12:46   ` Nikolay Borisov
2019-10-11 16:08     ` Dennis Zhou
2019-10-15 12:17   ` David Sterba
2019-10-15 19:58     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 04/19] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
2019-10-10 14:16   ` Josef Bacik
2019-10-11 16:17     ` Dennis Zhou
2019-10-15 12:23   ` David Sterba
2019-10-07 20:17 ` [PATCH 05/19] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-10-10 14:38   ` Josef Bacik
2019-10-15 12:49   ` David Sterba
2019-10-15 19:57     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 06/19] btrfs: handle empty block_group removal Dennis Zhou
2019-10-10 15:00   ` Josef Bacik
2019-10-11 16:52     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 07/19] btrfs: discard one region at a time in async discard Dennis Zhou
2019-10-10 15:22   ` Josef Bacik
2019-10-14 19:42     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 08/19] btrfs: track discardable extents for asnyc discard Dennis Zhou
2019-10-10 15:36   ` Josef Bacik
2019-10-14 19:50     ` Dennis Zhou
2019-10-15 13:12   ` David Sterba
2019-10-15 18:41     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 09/19] btrfs: keep track of discardable_bytes Dennis Zhou
2019-10-10 15:38   ` Josef Bacik
2019-10-07 20:17 ` [PATCH 10/19] btrfs: calculate discard delay based on number of extents Dennis Zhou
2019-10-10 15:41   ` Josef Bacik
2019-10-11 18:07     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 11/19] btrfs: add bps discard rate limit Dennis Zhou
2019-10-10 15:47   ` Josef Bacik
2019-10-14 19:56     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 12/19] btrfs: limit max discard size for async discard Dennis Zhou
2019-10-10 16:16   ` Josef Bacik
2019-10-14 19:57     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 13/19] btrfs: have multiple discard lists Dennis Zhou
2019-10-10 16:51   ` Josef Bacik
2019-10-14 20:04     ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 14/19] btrfs: only keep track of data extents for async discard Dennis Zhou
2019-10-10 16:53   ` Josef Bacik
2019-10-07 20:17 ` [PATCH 15/19] btrfs: load block_groups into discard_list on mount Dennis Zhou
2019-10-10 17:11   ` Josef Bacik
2019-10-14 20:17     ` Dennis Zhou
2019-10-14 23:38       ` David Sterba
2019-10-15 15:42         ` Dennis Zhou
2019-10-07 20:17 ` [PATCH 16/19] btrfs: keep track of discard reuse stats Dennis Zhou
2019-10-10 17:13   ` Josef Bacik
2019-10-07 20:17 ` [PATCH 17/19] btrfs: add async discard header Dennis Zhou
2019-10-10 17:13   ` Josef Bacik
2019-10-07 20:17 ` [PATCH 18/19] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou
2019-10-10 17:16   ` Josef Bacik
2019-10-07 20:17 ` [PATCH 19/19] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
2019-10-10 17:17   ` Josef Bacik
2019-10-11  7:49 ` [RFC PATCH 00/19] btrfs: async discard support Nikolay Borisov
2019-10-14 21:05   ` Dennis Zhou
2019-10-15 12:08 ` David Sterba
2019-10-15 15:41   ` Dennis Zhou
