* [PATCH 0/9] Improve preemptive ENOSPC flushing
@ 2020-09-30 20:01 Josef Bacik
  2020-09-30 20:01 ` [PATCH 1/9] btrfs: add a trace point for reserve tickets Josef Bacik
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Hello,

A while ago Nikolay started digging into a problem where he was seeing around
a 20% regression on random writes, and he bisected it down to

  btrfs: don't end the transaction for delayed refs in throttle

However this wasn't actually the cause of the problem.

This patch removed the code that would preemptively end the transactions if we
were low on space.  Because we had just introduced the ticketing code, this was
no longer necessary and was causing a lot of transaction commits.

In his testing Nikolay validated this: we would see roughly 100x more
transaction commits without that patch than with it, yet the write regression
clearly appeared once the patch was applied.

The root cause is that the transaction commits were happening so frequently
that we rarely ended up needing to wait on space in the ENOSPC ticketing code,
and thus were able to write pretty quickly.  With them gone, we got a sawtooth
sort of behavior where we would write for a while, stall while we flushed
metadata space, write some more, stall again, etc.

When I implemented the ticketing infrastructure, I was trying to keep us from
excessively flushing space, because we would sometimes over-create block
groups, and thus I short-circuited flushing if we no longer had tickets.  This
had the side effect of breaking the preemptive flushing code, where we
attempted to flush space in the background before we were forced to wait for
space.

Enter this patchset.  We still have some of this preemption logic sprinkled
everywhere, so I've separated it out of the normal ticketed flushing code and
made preemptive flushing its own thing.

The preemptive flushing logic is more specialized than the standard flushing
logic.  It attempts to flush whichever pool currently has the highest usage.
This means that if most of our space is tied up in pinned extents, we'll
commit the transaction.  If most of the space is tied up in delalloc, we'll
flush delalloc, etc.
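
In code terms the pool selection boils down to roughly the following sketch
(condensed from patch 2; the names here are shorthand for the real space_info
counters and block reserves, so treat it as illustrative rather than the exact
implementation):

	if (bytes_pinned >= reclaimable_bytes_may_use) {
		/* Mostly pinned extents, just commit the transaction. */
		commit_transaction();
	} else if (delalloc_size > other_block_rsv_size) {
		/* Delalloc reservations dominate, flush delalloc. */
		flush = FLUSH_DELALLOC;
	} else if (delayed_items_rsv > delayed_refs_rsv) {
		/* Delayed inode items hold the most space. */
		flush = FLUSH_DELAYED_ITEMS_NR;
	} else {
		/* Otherwise flush delayed refs. */
		flush = FLUSH_DELAYED_REFS_NR;
	}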

To test this out I used the fio job that Nikolay used; the sizes need to be
adjusted so that the overall IO size is at least 2x the RAM size of the box
you are testing on:

fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
        --group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
        --name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
        --size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
        --filename_format=FioWorkloads.\$jobnum
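
(For example, on a box with 16 GiB of RAM the job above would need something
like --size=8g with --numjobs=4, i.e. 32 GiB of total IO, to stay at or above
2x RAM; the numbers here are only an illustration, not from the original
posting.)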

I got the following results:

misc-next:	bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s), io=4015MiB (4210MB), run=300323-300323msec
pre-throttling:	bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s), io=5068MiB (5314MB), run=300069-300069msec
my patches:	bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s), io=5403MiB (5666MB), run=300001-300001msec

Thanks,

Josef

Josef Bacik (9):
  btrfs: add a trace point for reserve tickets
  btrfs: improve preemptive background space flushing
  btrfs: rename need_do_async_reclaim
  btrfs: check reclaim_size in need_preemptive_reclaim
  btrfs: rework btrfs_calc_reclaim_metadata_size
  btrfs: simplify the logic in need_preemptive_flushing
  btrfs: implement space clamping for preemptive flushing
  btrfs: adjust the flush trace point to include the source
  btrfs: add a trace class for dumping the current ENOSPC state

 fs/btrfs/ctree.h             |   1 +
 fs/btrfs/disk-io.c           |   1 +
 fs/btrfs/space-info.c        | 216 +++++++++++++++++++++++++++++------
 fs/btrfs/space-info.h        |   3 +
 include/trace/events/btrfs.h | 112 +++++++++++++++++-
 5 files changed, 292 insertions(+), 41 deletions(-)

-- 
2.26.2



* [PATCH 1/9] btrfs: add a trace point for reserve tickets
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01  5:54   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 2/9] btrfs: improve preemptive background space flushing Josef Bacik
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While debugging an ENOSPC related performance problem I needed to see the
time difference between the start and the end of a reserve ticket, so add a
tracepoint where we add the ticket and one where we end it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c        |  3 +++
 include/trace/events/btrfs.h | 40 ++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 64099565ab8f..40726c8888f7 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1279,6 +1279,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	 * space wasn't reserved at all).
 	 */
 	ASSERT(!(ticket->bytes == 0 && ticket->error));
+	trace_btrfs_end_reserve_ticket(fs_info, ticket->error);
 	return ret;
 }
 
@@ -1380,6 +1381,8 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 			list_add_tail(&ticket.list,
 				      &space_info->priority_tickets);
 		}
+		trace_btrfs_add_reserve_ticket(fs_info, space_info->flags,
+					       flush, orig_bytes);
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		used += orig_bytes;
 		/*
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ecd24c719de4..68d1622623c7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2025,6 +2025,46 @@ TRACE_EVENT(btrfs_convert_extent_bit,
 		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
 );
 
+TRACE_EVENT(btrfs_add_reserve_ticket,
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, int flush,
+		 u64 bytes),
+
+	TP_ARGS(fs_info, flags, flush, bytes),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	flags	)
+		__field(	int,	flush	)
+		__field(	u64,	bytes	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->flags	= flags;
+		__entry->flush	= flush;
+		__entry->bytes	= bytes;
+	),
+
+	TP_printk_btrfs("flags=%s flush=%s bytes=%llu",
+			__print_symbolic(__entry->flush, FLUSH_ACTIONS),
+			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
+			__entry->bytes)
+);
+
+TRACE_EVENT(btrfs_end_reserve_ticket,
+	TP_PROTO(const struct btrfs_fs_info *fs_info, int error),
+
+	TP_ARGS(fs_info, error),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	int,	error	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->error	= error;
+	),
+
+	TP_printk_btrfs("error=%d", __entry->error)
+);
+
 DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
 	TP_PROTO(const struct extent_buffer *eb, u64 start_ns),
 
-- 
2.26.2



* [PATCH 2/9] btrfs: improve preemptive background space flushing
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
  2020-09-30 20:01 ` [PATCH 1/9] btrfs: add a trace point for reserve tickets Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 13:19   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 3/9] btrfs: rename need_do_async_reclaim Josef Bacik
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently, if we ever have to flush space because we do not have enough, we
allocate a ticket and attach it to the space_info, and then systematically
flush things in the file system that hold space reservations until our space
is reclaimed.

However this has a latency cost: we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

To address that, we used to kick off the async worker to flush space
preemptively, so that hopefully we would already be reclaiming space before
any task needed to stop and wait for it.

When I introduced the ticketed ENOSPC stuff this broke slightly, because we
were using tickets to indicate if we were done flushing.  No tickets, no more
flushing.  However this meant that we essentially never preemptively flushed.
This caused a write performance regression that Nikolay noticed in an
unrelated patch that removed the committing of the transaction during
btrfs_end_transaction.

Before that patch, btrfs_end_transaction() would see that we were low on
space and commit the transaction.  This was bad because in this particular
case you could end up with thousands and thousands of transaction commits
during the 5 minute reproducer.  With the patch removing this behavior you
got much saner transaction commit counts, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketed flushing in that it
doesn't have tickets to base its decisions on.  Instead of bolting this
logic onto our existing flushing work, add another worker to handle this
preemptive flushing.  Here we attempt to be slightly intelligent about what
we flush, trying to balance between whichever pool is taking up the most
space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h      |   1 +
 fs/btrfs/disk-io.c    |   1 +
 fs/btrfs/space-info.c | 122 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac3d6f4e35b..483182eaea19 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -922,6 +922,7 @@ struct btrfs_fs_info {
 	/* Used to reclaim the metadata space in the background. */
 	struct work_struct async_reclaim_work;
 	struct work_struct async_data_reclaim_work;
+	struct work_struct preempt_reclaim_work;
 
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3d39f5d47ad3..2d0edc849da9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3999,6 +3999,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 	cancel_work_sync(&fs_info->async_data_reclaim_work);
+	cancel_work_sync(&fs_info->preempt_reclaim_work);
 
 	/* Cancel or finish ongoing discard work */
 	btrfs_discard_cleanup(fs_info);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 40726c8888f7..024daa843c56 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -987,6 +987,122 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	} while (flush_state <= COMMIT_TRANS);
 }
 
+/*
+ * This handles pre-flushing of metadata space before we get to the point that
+ * we need to start blocking people on tickets.  The logic here is different
+ * from the other flush paths because it doesn't rely on tickets to tell us how
+ * much we need to flush, instead it attempts to keep us below the 80% full
+ * watermark of space by flushing whichever reservation pool is currently the
+ * largest.
+ */
+static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_space_info *space_info;
+	struct btrfs_block_rsv *delayed_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv;
+	struct btrfs_block_rsv *global_rsv;
+	struct btrfs_block_rsv *trans_rsv;
+	u64 used;
+
+	fs_info = container_of(work, struct btrfs_fs_info,
+			       preempt_reclaim_work);
+	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+	delayed_block_rsv = &fs_info->delayed_block_rsv;
+	delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	global_rsv = &fs_info->global_block_rsv;
+	trans_rsv = &fs_info->trans_block_rsv;
+
+	spin_lock(&space_info->lock);
+	used = btrfs_space_info_used(space_info, true);
+	while (need_do_async_reclaim(fs_info, space_info, used)) {
+		enum btrfs_reserve_flush_enum flush;
+		u64 delalloc_size = 0;
+		u64 to_reclaim, block_rsv_size;
+		u64 global_rsv_size = global_rsv->reserved;
+
+		/*
+		 * If we're just full of pinned, commit the transaction.  We
+		 * don't call flush_space(COMMIT_TRANS) here because that has
+		 * logic to decide whether we need to commit the transaction to
+		 * satisfy the ticket to keep us from live locking the box by
+		 * committing over and over again.  Here we don't care about
+		 * that, we know we are using a lot of space and most of it is
+		 * pinned, just commit.
+		 *
+		 * We do not want to include the global rsv size in the
+		 * bytes_may_use size here because it's unreclaimable.
+		 */
+		to_reclaim = (space_info->bytes_may_use < global_rsv_size) ? 0:
+			space_info->bytes_may_use - global_rsv_size;
+		if (space_info->bytes_pinned >= to_reclaim) {
+			struct btrfs_trans_handle *trans;
+
+			spin_unlock(&space_info->lock);
+			trans = btrfs_join_transaction(fs_info->extent_root);
+			if (IS_ERR(trans))
+				return;
+			btrfs_commit_transaction(trans);
+			goto next;
+		}
+
+		/*
+		 * We don't have a precise counter for delalloc, so we'll
+		 * approximate it by subtracting out the block rsv's space from
+		 * the bytes_may_use.  If that amount is higher than the
+		 * individual reserves, then we can assume it's tied up in
+		 * delalloc reservations.
+		 */
+		block_rsv_size = global_rsv_size +
+			delayed_block_rsv->reserved +
+			delayed_refs_rsv->reserved +
+			trans_rsv->reserved;
+		if (block_rsv_size < space_info->bytes_may_use)
+			delalloc_size = space_info->bytes_may_use -
+				block_rsv_size;
+		spin_unlock(&space_info->lock);
+
+		/*
+		 * We don't want to include the global_rsv in our calculation,
+		 * because that's space we can't touch.  Subtract it from the
+		 * block_rsv_size for the next checks.
+		 */
+		block_rsv_size -= global_rsv_size;
+
+		/*
+		 * We really want to avoid flushing delalloc too much, as it
+		 * could result in poor allocation patterns, so only flush it if
+		 * it's larger than the rest of the pools combined.
+		 */
+		if (delalloc_size > block_rsv_size) {
+			to_reclaim = delalloc_size;
+			flush = FLUSH_DELALLOC;
+		} else if (delayed_block_rsv->reserved >
+			   delayed_refs_rsv->reserved) {
+			to_reclaim = delayed_block_rsv->reserved;
+			flush = FLUSH_DELAYED_ITEMS_NR;
+		} else {
+			to_reclaim = delayed_refs_rsv->reserved;
+			flush = FLUSH_DELAYED_REFS_NR;
+		}
+
+		/*
+		 * We don't want to reclaim everything, just a portion, so scale
+		 * down the to_reclaim by 1/4.  If it takes us down to 0,
+		 * reclaim 1 items worth.
+		 */
+		to_reclaim >>= 2;
+		if (!to_reclaim)
+			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
+		flush_space(fs_info, space_info, to_reclaim, flush);
+next:
+		cond_resched();
+		spin_lock(&space_info->lock);
+		used = btrfs_space_info_used(space_info, true);
+	}
+	spin_unlock(&space_info->lock);
+}
+
 /*
  * FLUSH_DELALLOC_WAIT:
  *   Space is freed from flushing delalloc in one of two ways.
@@ -1113,6 +1229,8 @@ void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
 {
 	INIT_WORK(&fs_info->async_reclaim_work, btrfs_async_reclaim_metadata_space);
 	INIT_WORK(&fs_info->async_data_reclaim_work, btrfs_async_reclaim_data_space);
+	INIT_WORK(&fs_info->preempt_reclaim_work,
+		  btrfs_preempt_reclaim_metadata_space);
 }
 
 static const enum btrfs_flush_state priority_flush_states[] = {
@@ -1392,11 +1510,11 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		 */
 		if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
 		    need_do_async_reclaim(fs_info, space_info, used) &&
-		    !work_busy(&fs_info->async_reclaim_work)) {
+		    !work_busy(&fs_info->preempt_reclaim_work)) {
 			trace_btrfs_trigger_flush(fs_info, space_info->flags,
 						  orig_bytes, flush, "preempt");
 			queue_work(system_unbound_wq,
-				   &fs_info->async_reclaim_work);
+				   &fs_info->preempt_reclaim_work);
 		}
 	}
 	spin_unlock(&space_info->lock);
-- 
2.26.2



* [PATCH 3/9] btrfs: rename need_do_async_reclaim
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
  2020-09-30 20:01 ` [PATCH 1/9] btrfs: add a trace point for reserve tickets Josef Bacik
  2020-09-30 20:01 ` [PATCH 2/9] btrfs: improve preemptive background space flushing Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 13:20   ` Nikolay Borisov
  2020-10-01 13:24   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim Josef Bacik
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

All of our normal flushing is asynchronous reclaim, so this helper is
poorly named.  What it really checks is whether we need to preemptively
flush space, so rename it to need_preemptive_reclaim.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 024daa843c56..98207ea57a3d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -795,9 +795,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	return to_reclaim;
 }
 
-static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
-					struct btrfs_space_info *space_info,
-					u64 used)
+static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
+					  struct btrfs_space_info *space_info,
+					  u64 used)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
 
@@ -1015,7 +1015,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 
 	spin_lock(&space_info->lock);
 	used = btrfs_space_info_used(space_info, true);
-	while (need_do_async_reclaim(fs_info, space_info, used)) {
+	while (need_preemptive_reclaim(fs_info, space_info, used)) {
 		enum btrfs_reserve_flush_enum flush;
 		u64 delalloc_size = 0;
 		u64 to_reclaim, block_rsv_size;
@@ -1509,7 +1509,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		 * the async reclaim as we will panic.
 		 */
 		if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
-		    need_do_async_reclaim(fs_info, space_info, used) &&
+		    need_preemptive_reclaim(fs_info, space_info, used) &&
 		    !work_busy(&fs_info->preempt_reclaim_work)) {
 			trace_btrfs_trigger_flush(fs_info, space_info->flags,
 						  orig_bytes, flush, "preempt");
-- 
2.26.2



* [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (2 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 3/9] btrfs: rename need_do_async_reclaim Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 13:23   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size Josef Bacik
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If we're flushing space for tickets then we have
space_info->reclaim_size set and we do not need to do background
reclaim.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 98207ea57a3d..518749093bc5 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -805,6 +805,13 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
 		return 0;
 
+	/*
+	 * We have tickets queued, bail so we don't compete with the async
+	 * flushers.
+	 */
+	if (space_info->reclaim_size)
+		return 0;
+
 	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
 		return 0;
 
-- 
2.26.2



* [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (3 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 13:59   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing Josef Bacik
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently btrfs_calc_reclaim_metadata_size does two things: it returns the
space currently required for flushing by the tickets, and if there are no
tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets.  We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space.  Then
move the preemptive flushing logic into need_preemptive_reclaim().  We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there are
no pending tickets currently, so we do not need to check it; simply do the
fuzzy logic to check whether we're getting low on space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 44 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 518749093bc5..37eb5a11a875 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -752,7 +752,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 {
 	u64 used;
 	u64 avail;
-	u64 expected;
 	u64 to_reclaim = space_info->reclaim_size;
 
 	lockdep_assert_held(&space_info->lock);
@@ -770,28 +769,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	if (space_info->total_bytes + avail < used)
 		to_reclaim += used - (space_info->total_bytes + avail);
 
-	if (to_reclaim)
-		return to_reclaim;
-
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
-
-	used = btrfs_space_info_used(space_info, true);
-
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
-
-	if (used > expected)
-		to_reclaim = used - expected;
-	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
 	return to_reclaim;
 }
 
@@ -800,6 +777,7 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 					  u64 used)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
+	u64 to_reclaim, expected;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -812,7 +790,25 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
+	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
+	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		return 0;
+
+	used = btrfs_space_info_used(space_info, true);
+	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		expected = div_factor_fine(space_info->total_bytes, 95);
+	else
+		expected = div_factor_fine(space_info->total_bytes, 90);
+
+	if (used > expected)
+		to_reclaim = used - expected;
+	else
+		to_reclaim = 0;
+	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
+				     space_info->bytes_reserved);
+	if (!to_reclaim)
 		return 0;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
-- 
2.26.2



* [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (4 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 14:09   ` Nikolay Borisov
  2020-10-02  7:13   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 7/9] btrfs: implement space clamping for preemptive flushing Josef Bacik
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

A lot of this was added all in one go with no explanation, and it is a bit
unwieldy and confusing.  Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.
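
Expressed as a condition, the new check is roughly (a paraphrase of the code
added below, with the intermediate variables folded in):

	space_info->bytes_may_use + space_info->bytes_pinned >=
		(calc_available_free_space(fs_info, space_info,
					   BTRFS_RESERVE_FLUSH_ALL) +
		 space_info->total_bytes - space_info->bytes_used -
		 space_info->bytes_reserved -
		 space_info->bytes_readonly) >> 1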

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 51 ++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 37eb5a11a875..a41cf358f438 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -773,11 +773,10 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 }
 
 static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
-					  struct btrfs_space_info *space_info,
-					  u64 used)
+					  struct btrfs_space_info *space_info)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
-	u64 to_reclaim, expected;
+	u64 used;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -790,26 +789,27 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
-
-	used = btrfs_space_info_used(space_info, true);
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
+	/*
+	 * If we have over half of the free space occupied by reservations or
+	 * pinned then we want to start flushing.
+	 *
+	 * We do not do the traditional thing here, which is to say
+	 *
+	 *   if (used >= ((total_bytes + avail) >> 1))
+	 *     return 1;
+	 *
+	 * because this doesn't quite work how we want.  If we had more than 50%
+	 * of the space_info used by bytes_used and we had 0 available we'd just
+	 * constantly run the background flusher.  Instead we want it to kick in
+	 * if our reclaimable space exceeds 50% of our available free space.
+	 */
+	thresh = calc_available_free_space(fs_info, space_info,
+					   BTRFS_RESERVE_FLUSH_ALL);
+	thresh += (space_info->total_bytes - space_info->bytes_used -
+		   space_info->bytes_reserved - space_info->bytes_readonly);
+	thresh >>= 1;
 
-	if (used > expected)
-		to_reclaim = used - expected;
-	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
-	if (!to_reclaim)
-		return 0;
+	used = space_info->bytes_may_use + space_info->bytes_pinned;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
@@ -1006,7 +1006,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_block_rsv *delayed_refs_rsv;
 	struct btrfs_block_rsv *global_rsv;
 	struct btrfs_block_rsv *trans_rsv;
-	u64 used;
 
 	fs_info = container_of(work, struct btrfs_fs_info,
 			       preempt_reclaim_work);
@@ -1017,8 +1016,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	trans_rsv = &fs_info->trans_block_rsv;
 
 	spin_lock(&space_info->lock);
-	used = btrfs_space_info_used(space_info, true);
-	while (need_preemptive_reclaim(fs_info, space_info, used)) {
+	while (need_preemptive_reclaim(fs_info, space_info)) {
 		enum btrfs_reserve_flush_enum flush;
 		u64 delalloc_size = 0;
 		u64 to_reclaim, block_rsv_size;
@@ -1101,7 +1099,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 next:
 		cond_resched();
 		spin_lock(&space_info->lock);
-		used = btrfs_space_info_used(space_info, true);
 	}
 	spin_unlock(&space_info->lock);
 }
@@ -1512,7 +1509,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		 * the async reclaim as we will panic.
 		 */
 		if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
-		    need_preemptive_reclaim(fs_info, space_info, used) &&
+		    need_preemptive_reclaim(fs_info, space_info) &&
 		    !work_busy(&fs_info->preempt_reclaim_work)) {
 			trace_btrfs_trigger_flush(fs_info, space_info->flags,
 						  orig_bytes, flush, "preempt");
-- 
2.26.2



* [PATCH 7/9] btrfs: implement space clamping for preemptive flushing
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (5 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 14:49   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 8/9] btrfs: adjust the flush trace point to include the source Josef Bacik
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Starting preemptive flushing at 50% of available free space is a good
start, but some workloads are particularly abusive and can quickly
overwhelm the preemptive flushing code and drive us into using tickets.

Handle this by clamping down on our threshold for starting and
continuing to run preemptive flushing.  This is particularly important
for our overcommit case, as we can really drive the file system into
overages and then it's more difficult to pull it back as we start to
actually fill up the file system.
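
As a rough illustration (example numbers, not from the patch itself): with
the default clamp of 1 the preemptive flusher kicks in once reclaimable
reservations exceed half of the available free space, while after a few
forced tickets have pushed the clamp to 3 it already kicks in once they
exceed one eighth of it, and it keeps flushing correspondingly longer.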

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 20 ++++++++++++++++++--
 fs/btrfs/space-info.h |  3 +++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index a41cf358f438..5ecdd6c41a51 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -206,6 +206,7 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 	INIT_LIST_HEAD(&space_info->ro_bgs);
 	INIT_LIST_HEAD(&space_info->tickets);
 	INIT_LIST_HEAD(&space_info->priority_tickets);
+	space_info->clamp = 1;
 
 	ret = btrfs_sysfs_add_space_info_type(info, space_info);
 	if (ret)
@@ -801,13 +802,13 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	 * because this doesn't quite work how we want.  If we had more than 50%
 	 * of the space_info used by bytes_used and we had 0 available we'd just
 	 * constantly run the background flusher.  Instead we want it to kick in
-	 * if our reclaimable space exceeds 50% of our available free space.
+	 * if our reclaimable space exceeds our clamped free space.
 	 */
 	thresh = calc_available_free_space(fs_info, space_info,
 					   BTRFS_RESERVE_FLUSH_ALL);
 	thresh += (space_info->total_bytes - space_info->bytes_used -
 		   space_info->bytes_reserved - space_info->bytes_readonly);
-	thresh >>= 1;
+	thresh >>= space_info->clamp;
 
 	used = space_info->bytes_may_use + space_info->bytes_pinned;
 
@@ -1006,6 +1007,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_block_rsv *delayed_refs_rsv;
 	struct btrfs_block_rsv *global_rsv;
 	struct btrfs_block_rsv *trans_rsv;
+	int loops = 0;
 
 	fs_info = container_of(work, struct btrfs_fs_info,
 			       preempt_reclaim_work);
@@ -1022,6 +1024,8 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 		u64 to_reclaim, block_rsv_size;
 		u64 global_rsv_size = global_rsv->reserved;
 
+		loops++;
+
 		/*
 		 * If we're just full of pinned, commit the transaction.  We
 		 * don't call flush_space(COMMIT_TRANS) here because that has
@@ -1100,6 +1104,10 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 		cond_resched();
 		spin_lock(&space_info->lock);
 	}
+
+	/* We only went through once, back off our clamping. */
+	if (loops == 1 && !space_info->reclaim_size)
+		space_info->clamp = max(1, space_info->clamp - 1);
 	spin_unlock(&space_info->lock);
 }
 
@@ -1499,6 +1507,14 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 			list_add_tail(&ticket.list,
 				      &space_info->priority_tickets);
 		}
+
+		/*
+		 * We were forced to add a reserve ticket, so our preemptive
+		 * flushing is unable to keep up.  Clamp down on the threshold
+		 * for the preemptive flushing in order to keep up with the
+		 * workload.
+		 */
+		space_info->clamp = min(space_info->clamp + 1, 8);
 		trace_btrfs_add_reserve_ticket(fs_info, space_info->flags,
 					       flush, orig_bytes);
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 5646393b928c..bcbbfae131f6 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -22,6 +22,9 @@ struct btrfs_space_info {
 				   the space info if we had an ENOSPC in the
 				   allocator. */
 
+	int clamp;		/* Used to scale our threshold for preemptive
+				   flushing. */
+
 	unsigned int full:1;	/* indicates that we cannot allocate any more
 				   chunks for this space */
 	unsigned int chunk_alloc:1;	/* set if we are allocating a chunk */
-- 
2.26.2



* [PATCH 8/9] btrfs: adjust the flush trace point to include the source
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (6 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 7/9] btrfs: implement space clamping for preemptive flushing Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-01 15:32   ` Nikolay Borisov
  2020-09-30 20:01 ` [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state Josef Bacik
  2020-10-06 12:55 ` [PATCH 0/9] Improve preemptive ENOSPC flushing Nikolay Borisov
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c        | 20 ++++++++++++--------
 include/trace/events/btrfs.h | 10 ++++++----
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 5ecdd6c41a51..b9735e7de480 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -668,7 +668,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
  */
 static void flush_space(struct btrfs_fs_info *fs_info,
 		       struct btrfs_space_info *space_info, u64 num_bytes,
-		       int state)
+		       int state, bool for_preempt)
 {
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_trans_handle *trans;
@@ -743,7 +743,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 	}
 
 	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
-				ret);
+				ret, for_preempt);
 	return;
 }
 
@@ -943,7 +943,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 
 	flush_state = FLUSH_DELAYED_ITEMS_NR;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim, flush_state,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1099,7 +1100,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 		to_reclaim >>= 2;
 		if (!to_reclaim)
 			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
-		flush_space(fs_info, space_info, to_reclaim, flush);
+		flush_space(fs_info, space_info, to_reclaim, flush, true);
 next:
 		cond_resched();
 		spin_lock(&space_info->lock);
@@ -1191,7 +1192,8 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 	spin_unlock(&space_info->lock);
 
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1204,7 +1206,7 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 
 	while (flush_state < ARRAY_SIZE(data_flush_states)) {
 		flush_space(fs_info, space_info, U64_MAX,
-			    data_flush_states[flush_state]);
+			    data_flush_states[flush_state], false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1277,7 +1279,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 
 	flush_state = 0;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
+		flush_space(fs_info, space_info, to_reclaim, states[flush_state],
+			    false);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -1293,7 +1296,8 @@ static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
 					struct reserve_ticket *ticket)
 {
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
 			spin_unlock(&space_info->lock);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 68d1622623c7..c340bff65450 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1111,15 +1111,16 @@ TRACE_EVENT(btrfs_trigger_flush,
 TRACE_EVENT(btrfs_flush_space,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
-		 int state, int ret),
+		 int state, int ret, int for_preempt),
 
-	TP_ARGS(fs_info, flags, num_bytes, state, ret),
+	TP_ARGS(fs_info, flags, num_bytes, state, ret, for_preempt),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,	flags			)
 		__field(	u64,	num_bytes		)
 		__field(	int,	state			)
 		__field(	int,	ret			)
+		__field(	int,	for_preempt		)
 	),
 
 	TP_fast_assign_btrfs(fs_info,
@@ -1127,15 +1128,16 @@ TRACE_EVENT(btrfs_flush_space,
 		__entry->num_bytes	=	num_bytes;
 		__entry->state		=	state;
 		__entry->ret		=	ret;
+		__entry->for_preempt	=	for_preempt;
 	),
 
-	TP_printk_btrfs("state=%d(%s) flags=%llu(%s) num_bytes=%llu ret=%d",
+	TP_printk_btrfs("state=%d(%s) flags=%llu(%s) num_bytes=%llu ret=%d for_preempt=%d",
 		  __entry->state,
 		  __print_symbolic(__entry->state, FLUSH_STATES),
 		  __entry->flags,
 		  __print_flags((unsigned long)__entry->flags, "|",
 				BTRFS_GROUP_FLAGS),
-		  __entry->num_bytes, __entry->ret)
+		  __entry->num_bytes, __entry->ret, __entry->for_preempt)
 );
 
 DECLARE_EVENT_CLASS(btrfs__reserved_extent,
-- 
2.26.2



* [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (7 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 8/9] btrfs: adjust the flush trace point to include the source Josef Bacik
@ 2020-09-30 20:01 ` Josef Bacik
  2020-10-02  8:30   ` Nikolay Borisov
  2020-10-06 12:55 ` [PATCH 0/9] Improve preemptive ENOSPC flushing Nikolay Borisov
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-09-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Often when I'm debugging ENOSPC related issues I have to resort to printing
the entire ENOSPC state with trace_printk() in different spots.  This gets
pretty annoying, so add a trace event class that does this for us.  Then add
a tracepoint at the end of preemptive flushing so you can see the state of
the space_info when we decide to exit preemptive flushing.  This helped me
figure out that we weren't kicking in the preemptive flushing soon enough.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c        |  1 +
 include/trace/events/btrfs.h | 62 ++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index b9735e7de480..6f569a7d2df4 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1109,6 +1109,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	/* We only went through once, back off our clamping. */
 	if (loops == 1 && !space_info->reclaim_size)
 		space_info->clamp = max(1, space_info->clamp - 1);
+	trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
 	spin_unlock(&space_info->lock);
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index c340bff65450..651ac47a6945 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2027,6 +2027,68 @@ TRACE_EVENT(btrfs_convert_extent_bit,
 		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
 );
 
+DECLARE_EVENT_CLASS(btrfs_dump_space_info,
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_space_info *sinfo),
+
+	TP_ARGS(fs_info, sinfo),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	flags			)
+		__field(	u64,	total_bytes		)
+		__field(	u64,	bytes_used		)
+		__field(	u64,	bytes_pinned		)
+		__field(	u64,	bytes_reserved		)
+		__field(	u64,	bytes_may_use		)
+		__field(	u64,	bytes_readonly		)
+		__field(	u64,	reclaim_size		)
+		__field(	int,	clamp			)
+		__field(	u64,	global_reserved		)
+		__field(	u64,	trans_reserved		)
+		__field(	u64,	delayed_refs_reserved	)
+		__field(	u64,	delayed_reserved	)
+		__field(	u64,	free_chunk_space	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->flags			=	sinfo->flags;
+		__entry->total_bytes		=	sinfo->total_bytes;
+		__entry->bytes_used		=	sinfo->bytes_used;
+		__entry->bytes_pinned		=	sinfo->bytes_pinned;
+		__entry->bytes_reserved		=	sinfo->bytes_reserved;
+		__entry->bytes_may_use		=	sinfo->bytes_may_use;
+		__entry->bytes_readonly		=	sinfo->bytes_readonly;
+		__entry->reclaim_size		=	sinfo->reclaim_size;
+		__entry->clamp			=	sinfo->clamp;
+		__entry->global_reserved	=	fs_info->global_block_rsv.reserved;
+		__entry->trans_reserved		=	fs_info->trans_block_rsv.reserved;
+		__entry->delayed_refs_reserved	=	fs_info->delayed_refs_rsv.reserved;
+		__entry->delayed_reserved	=	fs_info->delayed_block_rsv.reserved;
+		__entry->free_chunk_space	=	atomic64_read(&fs_info->free_chunk_space);
+	),
+
+	TP_printk_btrfs("flags=%s total_bytes=%llu bytes_used=%llu "
+			"bytes_pinned=%llu bytes_reserved=%llu "
+			"bytes_may_use=%llu bytes_readonly=%llu "
+			"reclaim_size=%llu clamp=%d global_reserved=%llu "
+			"trans_reserved=%llu delayed_refs_reserved=%llu "
+			"delayed_reserved=%llu chunk_free_space=%llu",
+			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
+			__entry->total_bytes, __entry->bytes_used,
+			__entry->bytes_pinned, __entry->bytes_reserved,
+			__entry->bytes_may_use, __entry->bytes_readonly,
+			__entry->reclaim_size, __entry->clamp,
+			__entry->global_reserved, __entry->trans_reserved,
+			__entry->delayed_refs_reserved,
+			__entry->delayed_reserved, __entry->free_chunk_space)
+);
+
+DEFINE_EVENT(btrfs_dump_space_info, btrfs_done_preemptive_reclaim,
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_space_info *sinfo),
+	TP_ARGS(fs_info, sinfo)
+);
+
 TRACE_EVENT(btrfs_add_reserve_ticket,
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, int flush,
 		 u64 bytes),
-- 
2.26.2



* Re: [PATCH 1/9] btrfs: add a trace point for reserve tickets
  2020-09-30 20:01 ` [PATCH 1/9] btrfs: add a trace point for reserve tickets Josef Bacik
@ 2020-10-01  5:54   ` Nikolay Borisov
  2020-10-01 21:33     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01  5:54 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
> While debugging a ENOSPC related performance problem I needed to see the
> time difference between start and end of a reserve ticket, so add a
> tracepoint where we add the ticket, and where we end the ticket.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

I wonder, can't we get away with a single tracepoint: simply record the
queue time when we create the ticket in __reserve_bytes and then have a
single tracepoint in handle_reserve_ticket which calculates the difference
between the current time and the recorded one, i.e. how long the ticket was
queued?  IMO it will be more user friendly for manual inspection; I assume
you analyzed the duration with a bpf script rather than looking at every
pair of start/end events manually?  Also, wouldn't it be useful to have the
flush type printed in the end event as well, so that we at least know what
type of ticket it was?  Let's not forget that tracepoints are, from Linus'
POV, more or less ABI, so it's preferable to get them right from the
beginning.
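
To make the suggestion concrete, roughly something like this (just a sketch;
it assumes a new start_ns field in struct reserve_ticket and a combined
tracepoint name, neither of which exists in the posted series):

	/* In __reserve_bytes(), when the ticket is queued: */
	ticket.start_ns = ktime_get_ns();

	/* In handle_reserve_ticket(), once the ticket has been handled: */
	trace_btrfs_reserve_ticket(fs_info, space_info->flags, ticket->bytes,
				   ktime_get_ns() - ticket->start_ns,
				   flush, ticket->error);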

> ---
>  fs/btrfs/space-info.c        |  3 +++
>  include/trace/events/btrfs.h | 40 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 64099565ab8f..40726c8888f7 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1279,6 +1279,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>  	 * space wasn't reserved at all).
>  	 */
>  	ASSERT(!(ticket->bytes == 0 && ticket->error));
> +	trace_btrfs_end_reserve_ticket(fs_info, ticket->error);
>  	return ret;
>  }
>  
> @@ -1380,6 +1381,8 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
>  			list_add_tail(&ticket.list,
>  				      &space_info->priority_tickets);
>  		}
> +		trace_btrfs_add_reserve_ticket(fs_info, space_info->flags,
> +					       flush, orig_bytes);
>  	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>  		used += orig_bytes;
>  		/*
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index ecd24c719de4..68d1622623c7 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2025,6 +2025,46 @@ TRACE_EVENT(btrfs_convert_extent_bit,
>  		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
>  );
>  
> +TRACE_EVENT(btrfs_add_reserve_ticket,
> +	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, int flush,
> +		 u64 bytes),
> +
> +	TP_ARGS(fs_info, flags, flush, bytes),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	flags	)
> +		__field(	int,	flush	)
> +		__field(	u64,	bytes	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->flags	= flags;
> +		__entry->flush	= flush;
> +		__entry->bytes	= bytes;
> +	),
> +
> +	TP_printk_btrfs("flags=%s flush=%s bytes=%llu",
> +			__print_symbolic(__entry->flush, FLUSH_ACTIONS),
> +			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
> +			__entry->bytes)
> +);
> +
> +TRACE_EVENT(btrfs_end_reserve_ticket,
> +	TP_PROTO(const struct btrfs_fs_info *fs_info, int error),
> +
> +	TP_ARGS(fs_info, error),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	int,	error	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->error	= error;
> +	),
> +
> +	TP_printk_btrfs("error=%d", __entry->error)
> +);
> +
>  DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
>  	TP_PROTO(const struct extent_buffer *eb, u64 start_ns),
>  
> 


* Re: [PATCH 2/9] btrfs: improve preemptive background space flushing
  2020-09-30 20:01 ` [PATCH 2/9] btrfs: improve preemptive background space flushing Josef Bacik
@ 2020-10-01 13:19   ` Nikolay Borisov
  2020-10-01 21:35     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 13:19 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
<snip>

> When I introduced the ticketed ENOSPC stuff this broke slightly in the
> fact that we were using tickets to indicate if we were done flushing.
> No tickets, no more flushing.  However this meant that we essentially
> never preemptively flushed.  This caused a write performance regression
> that Nikolay noticed in an unrelated patch that removed the committing
> of the transaction during btrfs_end_transaction.

I see, so basically the patch which I bisected this to was really papering
over the initial bug, since the logic in end_transaction sort of simulated
preemptive flushing... how subtle!

<snip>

> +	spin_lock(&space_info->lock);
> +	used = btrfs_space_info_used(space_info, true);
> +	while (need_do_async_reclaim(fs_info, space_info, used)) {
> +		enum btrfs_reserve_flush_enum flush;
> +		u64 delalloc_size = 0;
> +		u64 to_reclaim, block_rsv_size;
> +		u64 global_rsv_size = global_rsv->reserved;
> +
> +		/*
> +		 * If we're just full of pinned, commit the transaction.  We
> +		 * don't call flush_space(COMMIT_TRANS) here because that has
> +		 * logic to decide whether we need to commit the transaction to
> +		 * satisfy the ticket to keep us from live locking the box by
> +		 * committing over and over again.  Here we don't care about
> +		 * that, we know we are using a lot of space and most of it is
> +		 * pinned, just commit.

nit: That comment is a mouthful.  I think what you are describing here is
really this line in may_commit_transaction:

if (!bytes_needed)
	return 0;

which triggers if we don't have a ticket.  If so, simply say:

"We can't call flush_space(COMMIT_TRANS) here because it will only commit
if there is a pending ticket".

<snip>

> +		/*
> +		 * We don't have a precise counter for delalloc, so we'll
> +		 * approximate it by subtracting out the block rsv's space from
> +		 * the bytes_may_use.  If that amount is higher than the
> +		 * individual reserves, then we can assume it's tied up in
> +		 * delalloc reservations.
> +		 */
> +		block_rsv_size = global_rsv_size +
> +			delayed_block_rsv->reserved +
> +			delayed_refs_rsv->reserved +
> +			trans_rsv->reserved;
> +		if (block_rsv_size < space_info->bytes_may_use)
> +			delalloc_size = space_info->bytes_may_use -
> +				block_rsv_size;

What about using the existing per-cpu delalloc counter, which is updated via:

percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
                      fs_info->delalloc_batch);
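
i.e. reading the existing per-cpu counter instead of approximating.  A minimal
sketch (note that delalloc_bytes tracks outstanding data bytes, so whether it
maps cleanly onto the metadata reservations this heuristic wants is part of
the question):

	u64 delalloc_size =
		percpu_counter_sum_positive(&fs_info->delalloc_bytes);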

> +		spin_unlock(&space_info->lock);
> +

<snip>



* Re: [PATCH 3/9] btrfs: rename need_do_async_reclaim
  2020-09-30 20:01 ` [PATCH 3/9] btrfs: rename need_do_async_reclaim Josef Bacik
@ 2020-10-01 13:20   ` Nikolay Borisov
  2020-10-01 13:24   ` Nikolay Borisov
  1 sibling, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 13:20 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
> All of our normal flushing is asynchronous reclaim, so this helper is
> poorly named.  This is more checking if we need to preemptively flush
> space, so rename it to need_preemptive_reclaim.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


* Re: [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim
  2020-09-30 20:01 ` [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim Josef Bacik
@ 2020-10-01 13:23   ` Nikolay Borisov
  2020-10-01 21:36     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 13:23 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
> If we're flushing space for tickets then we have
> space_info->reclaim_size set and we do not need to do background
> reclaim.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

I'm fine with this but check my remark below.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/space-info.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 98207ea57a3d..518749093bc5 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -805,6 +805,13 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>  	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
>  		return 0;
>  
> +	/*
> +	 * We have tickets queued, bail so we don't compete with the async
> +	 * flushers.
> +	 */
> +	if (space_info->reclaim_size)
> +		return 0;
> +

nit: So where do we draw the line on whether this check should be in this
function or in __reserve_bytes, so that we eliminate the case where a
preemptive reclaim is scheduled only to return instantly because of tickets?
Though the 'if' in __reserve_bytes is getting slightly out of hand :)

>  	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
>  		return 0;
>  
> 


* Re: [PATCH 3/9] btrfs: rename need_do_async_reclaim
  2020-09-30 20:01 ` [PATCH 3/9] btrfs: rename need_do_async_reclaim Josef Bacik
  2020-10-01 13:20   ` Nikolay Borisov
@ 2020-10-01 13:24   ` Nikolay Borisov
  2020-10-01 21:37     ` Josef Bacik
  1 sibling, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 13:24 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
> All of our normal flushing is asynchronous reclaim, so this helper is
> poorly named.  This is more checking if we need to preemptively flush
> space, so rename it to need_preemptive_reclaim.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 024daa843c56..98207ea57a3d 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -795,9 +795,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
>  	return to_reclaim;
>  }
>  
> -static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
> -					struct btrfs_space_info *space_info,
> -					u64 used)
> +static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
> +					  struct btrfs_space_info *space_info,
> +					  u64 used)

nit: Oops, I forgot to mention, but while at it, why don't you switch the
function to a bool return type?

<snip>


* Re: [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size
  2020-09-30 20:01 ` [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size Josef Bacik
@ 2020-10-01 13:59   ` Nikolay Borisov
  2020-10-01 21:38     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 13:59 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 at 23:01, Josef Bacik wrote:
<snip>
>  
> @@ -800,6 +777,7 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>  					  u64 used)
>  {
>  	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
> +	u64 to_reclaim, expected;
>  
>  	/* If we're just plain full then async reclaim just slows us down. */
>  	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
> @@ -812,7 +790,25 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>  	if (space_info->reclaim_size)
>  		return 0;
>  
> -	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
> +	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
> +	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
> +				 BTRFS_RESERVE_FLUSH_ALL))
> +		return 0;
> +
> +	used = btrfs_space_info_used(space_info, true);
> +	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
> +				 BTRFS_RESERVE_FLUSH_ALL))
> +		expected = div_factor_fine(space_info->total_bytes, 95);
> +	else
> +		expected = div_factor_fine(space_info->total_bytes, 90);

I think this should be just:

expected = div_factor_fine(space_info->total_bytes, 90);

because before this check we tried to overcommit between 1M and 16M
(depending on the number of online CPUs) and we failed. So there is no
reason to think that:

btrfs_can_overcommit(fs_info, space_info, SZ_1M, BTRFS_RESERVE_FLUSH_ALL)

would succeed, and you can simplify the logic by eliminating the 2nd
call to btrfs_can_overcommit.
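
I.e. the whole hunk would collapse to something like this (untested sketch):

	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
				 BTRFS_RESERVE_FLUSH_ALL))
		return 0;

	used = btrfs_space_info_used(space_info, true);
	expected = div_factor_fine(space_info->total_bytes, 90);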

> +
> +	if (used > expected)
> +		to_reclaim = used - expected;
> +	else
> +		to_reclaim = 0;
> +	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
> +				     space_info->bytes_reserved);
> +	if (!to_reclaim)
>  		return 0;
>  
>  	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
> 

nit: Not directly related to your patch but since you are moving the
code does it make sense to keep the fs_closing and STATE_REMOUNTING
checks around?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing
  2020-09-30 20:01 ` [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing Josef Bacik
@ 2020-10-01 14:09   ` Nikolay Borisov
  2020-10-01 21:40     ` Josef Bacik
  2020-10-02  7:13   ` Nikolay Borisov
  1 sibling, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 14:09 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> A lot of this was added all in one go with no explanation, and is a bit
> unwieldy and confusing.  Simplify the logic to start preemptive flushing
> if we've reserved more than half of our available free space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

<snip>

> +	 * If we have over half of the free space occupied by reservations or
> +	 * pinned then we want to start flushing.
> +	 *
> +	 * We do not do the traditional thing here, which is to say
> +	 *
> +	 *   if (used >= ((total_bytes + avail) >> 1))
> +	 *     return 1;
> +	 *
> +	 * because this doesn't quite work how we want.  If we had more than 50%
> +	 * of the space_info used by bytes_used and we had 0 available we'd just
> +	 * constantly run the background flusher.  Instead we want it to kick in
> +	 * if our reclaimable space exceeds 50% of our available free space.
> +	 */
> +	thresh = calc_available_free_space(fs_info, space_info,
> +					   BTRFS_RESERVE_FLUSH_ALL);
> +	thresh += (space_info->total_bytes - space_info->bytes_used -
> +		   space_info->bytes_reserved - space_info->bytes_readonly);

Isn't the free space in space_info calculated by subtracting every
bytes_* field from total_bytes?  I.e. why aren't you subtracting bytes_may_use
and bytes_pinned?  Shouldn't this be

thresh += space_info->total_bytes - btrfs_space_info_used(space_info, true)



> +	thresh >>= 1;
>  
> -	if (used > expected)
> -		to_reclaim = used - expected;
> -	else
> -		to_reclaim = 0;
> -	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
> -				     space_info->bytes_reserved);
> -	if (!to_reclaim)
> -		return 0;
> +	used = space_info->bytes_may_use + space_info->bytes_pinned;
>  
>  	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
>  		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));

<snip>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] btrfs: implement space clamping for preemptive flushing
  2020-09-30 20:01 ` [PATCH 7/9] btrfs: implement space clamping for preemptive flushing Josef Bacik
@ 2020-10-01 14:49   ` Nikolay Borisov
  2020-10-01 21:41     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 14:49 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> Starting preemptive flushing at 50% of available free space is a good
> start, but some workloads are particularly abusive and can quickly
> overwhelm the preemptive flushing code and drive us into using tickets.
> 
> Handle this by clamping down on our threshold for starting and
> continuing to run preemptive flushing.  This is particularly important
> for our overcommit case, as we can really drive the file system into
> overages and then it's more difficult to pull it back as we start to
> actually fill up the file system.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

nit: IMO it would be worthwhile to very briefly describe the threshold
calculation: essentially it is 2^CLAMP and we start with 1. So in the
best case we'll preemptively flush once we have allocated more than 1/2
(50%) of the free space, and in the worst case 1/256th (~0.4%).
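
In code terms that would be roughly (a sketch of the idea, not the exact
diff):

	thresh = calc_available_free_space(fs_info, space_info,
					   BTRFS_RESERVE_FLUSH_ALL);
	thresh += (space_info->total_bytes - space_info->bytes_used -
		   space_info->bytes_reserved - space_info->bytes_readonly);
	/* clamp starts at 1 and grows under pressure up to 8: 1/2 ... 1/256 */
	thresh >>= space_info->clamp;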


LGTM:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] btrfs: adjust the flush trace point to include the source
  2020-09-30 20:01 ` [PATCH 8/9] btrfs: adjust the flush trace point to include the source Josef Bacik
@ 2020-10-01 15:32   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-01 15:32 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> Since we have normal ticketed flushing and preemptive flushing, adjust
> the tracepoint so that we know the source of the flushing action to make
> it easier to debug problems.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/9] btrfs: add a trace point for reserve tickets
  2020-10-01  5:54   ` Nikolay Borisov
@ 2020-10-01 21:33     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:33 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 1:54 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> While debugging an ENOSPC-related performance problem I needed to see the
>> time difference between the start and end of a reserve ticket, so add
>> tracepoints where we add the ticket and where we end the ticket.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> I wonder, can't we get away with a single tracepoint - simply record the
> queue time when we create the ticket in __reserve_bytes and then have a
> single tracepoint in handle_reserve_ticket which calculates the
> difference between the current time and the one recorded when the ticket
> was queued? IMO it would be more user friendly for manual inspection; I
> assume you analyzed the duration with a bpf script rather than looking
> at every pair of start/end events manually? Also, won't it be useful to
> have the flush type printed in the end event as well, so that at least
> we know the type of the ticket? Let's not forget that tracepoints, from
> Linus' POV, are more or less ABI, so it's preferable to get them right
> from the beginning.

I'm kind of opposed to tracking time in the kernel just for tracepoints.  Yes,
I used bpf, but it's not hard to do; you can just do something like

bpftrace -e 'tracepoint:btrfs:btrfs_add_reserve_ticket { @start[tid] = nsecs; }
	tracepoint:btrfs:btrfs_end_reserve_ticket /@start[tid]/ {
		@times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
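
(nsecs is bpftrace's nanosecond timestamp; the @times histogram prints
automatically when the script exits, and the delete() keeps the per-thread
@start map from growing.)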

Nothing overly complicated, and it leaves all the computation work to the thing
doing the tracing, not to the kernel itself.

I didn't want to bring the flush type into the end one because I'd have to store
it in the ticket, and you can get it from the start trace point.  However, it
probably would be useful if we want to just catch error events, so I'll add
that.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/9] btrfs: improve preemptive background space flushing
  2020-10-01 13:19   ` Nikolay Borisov
@ 2020-10-01 21:35     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:35 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 9:19 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> <snip>
> 
>> When I introduced the ticketed ENOSPC stuff this broke slightly, in that
>> we were using tickets to indicate if we were done flushing.
>> No tickets, no more flushing.  However this meant that we essentially
>> never preemptively flushed.  This caused a write performance regression
>> that Nikolay noticed in an unrelated patch that removed the committing
>> of the transaction during btrfs_end_transaction.
> 
> I see, so basically the patch which I bisected this to was really
> papering over the initial bug, since the logic in end_transaction sort
> of simulated pre-emptive flushing... how subtle!
> 
> <snip>
> 
>> +	spin_lock(&space_info->lock);
>> +	used = btrfs_space_info_used(space_info, true);
>> +	while (need_do_async_reclaim(fs_info, space_info, used)) {
>> +		enum btrfs_reserve_flush_enum flush;
>> +		u64 delalloc_size = 0;
>> +		u64 to_reclaim, block_rsv_size;
>> +		u64 global_rsv_size = global_rsv->reserved;
>> +
>> +		/*
>> +		 * If we're just full of pinned, commit the transaction.  We
>> +		 * don't call flush_space(COMMIT_TRANS) here because that has
>> +		 * logic to decide whether we need to commit the transaction to
>> +		 * satisfy the ticket to keep us from live locking the box by
>> +		 * committing over and over again.  Here we don't care about
>> +		 * that, we know we are using a lot of space and most of it is
>> +		 * pinned, just commit.
> 
> nit: That comment is a mouthful. I think what you are describing here is
> really this line in may_commit_transaction:
> 
> if (!bytes_needed) return 0;
> 
> which triggers if we don't have a ticket. If so, simply say:
> 
> "We can't call flush_commit because it will flush iff there is a pending
> ticket".
> 

Yup I'll reword.

> <snip>
> 
>> +		/*
>> +		 * We don't have a precise counter for delalloc, so we'll
>> +		 * approximate it by subtracting out the block rsv's space from
>> +		 * the bytes_may_use.  If that amount is higher than the
>> +		 * individual reserves, then we can assume it's tied up in
>> +		 * delalloc reservations.
>> +		 */
>> +		block_rsv_size = global_rsv_size +
>> +			delayed_block_rsv->reserved +
>> +			delayed_refs_rsv->reserved +
>> +			trans_rsv->reserved;
>> +		if (block_rsv_size < space_info->bytes_may_use)
>> +			delalloc_size = space_info->bytes_may_use -
>> +				block_rsv_size;
> 
> What about:
> 
> percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
>                        fs_info->delalloc_batch);
> 

Because that's the total number of data bytes, not how much metadata is reserved
for those data bytes, and the two can be grossly different.  For example, one
contiguous MAX_EXTENT_SIZE data extent amounts to only about 1MiB of metadata
reservations, but if we used delalloc_bytes here we'd think the delalloc size
was dominating the metadata reservations.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim
  2020-10-01 13:23   ` Nikolay Borisov
@ 2020-10-01 21:36     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:36 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 9:23 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> If we're flushing space for tickets then we have
>> space_info->reclaim_size set and we do not need to do background
>> reclaim.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> I'm fine with this, but check my remark below.
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>   fs/btrfs/space-info.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 98207ea57a3d..518749093bc5 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -805,6 +805,13 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>>   	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
>>   		return 0;
>>   
>> +	/*
>> +	 * We have tickets queued, bail so we don't compete with the async
>> +	 * flushers.
>> +	 */
>> +	if (space_info->reclaim_size)
>> +		return 0;
>> +
> 
> nit: So where do we draw the line on whether this check should live in
> this function or in __reserve_bytes, so that we eliminate the case where
> a preemptive reclaim is scheduled only to return instantly because of
> tickets?  Though the 'if' in __reserve_bytes is getting slightly out of
> hand :)
> 

Keep in mind this is also used by the background flushing thread to tell when 
it's time to stop flushing, which is why the check is in here.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/9] btrfs: rename need_do_async_reclaim
  2020-10-01 13:24   ` Nikolay Borisov
@ 2020-10-01 21:37     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:37 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 9:24 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> All of our normal flushing is asynchronous reclaim, so this helper is
>> poorly named.  This is more checking if we need to preemptively flush
>> space, so rename it to need_preemptive_reclaim.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/space-info.c | 10 +++++-----
>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 024daa843c56..98207ea57a3d 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -795,9 +795,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
>>   	return to_reclaim;
>>   }
>>   
>> -static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
>> -					struct btrfs_space_info *space_info,
>> -					u64 used)
>> +static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>> +					  struct btrfs_space_info *space_info,
>> +					  u64 used)
> 
> nit: oops I forgot to mention but while at it, why don't you switch the
> function to bool return type.
> 

Can do, thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size
  2020-10-01 13:59   ` Nikolay Borisov
@ 2020-10-01 21:38     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:38 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 9:59 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> <snip>
>>   
>> @@ -800,6 +777,7 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>>   					  u64 used)
>>   {
>>   	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
>> +	u64 to_reclaim, expected;
>>   
>>   	/* If we're just plain full then async reclaim just slows us down. */
>>   	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
>> @@ -812,7 +790,25 @@ static inline int need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
>>   	if (space_info->reclaim_size)
>>   		return 0;
>>   
>> -	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
>> +	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
>> +	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
>> +				 BTRFS_RESERVE_FLUSH_ALL))
>> +		return 0;
>> +
>> +	used = btrfs_space_info_used(space_info, true);
>> +	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
>> +				 BTRFS_RESERVE_FLUSH_ALL))
>> +		expected = div_factor_fine(space_info->total_bytes, 95);
>> +	else
>> +		expected = div_factor_fine(space_info->total_bytes, 90);
> 
> I think this should be just:
> 
> expected = div_factor_fine(space_info->total_bytes, 90);
> 
> because before this check we tried to overcommit between 1M and 16M
> (depending on the number of online CPUs) and we failed. So there is no
> reason to think that:
> 
> btrfs_can_overcommit(fs_info, space_info, SZ_1M, BTRFS_RESERVE_FLUSH_ALL)
> 
> would succeed, and you can simplify the logic by eliminating the 2nd
> call to btrfs_can_overcommit.

I remove all of this code in a later patch; I'm just moving it here to simplify
btrfs_calc_reclaim_metadata_size and then changing the logic later, so that the
changes are discrete.

> 
>> +
>> +	if (used > expected)
>> +		to_reclaim = used - expected;
>> +	else
>> +		to_reclaim = 0;
>> +	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
>> +				     space_info->bytes_reserved);
>> +	if (!to_reclaim)
>>   		return 0;
>>   
>>   	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
>>
> 
> nit: Not directly related to your patch but since you are moving the
> code does it make sense to keep the fs_closing and STATE_REMOUNTING
> checks around?
> 

It does, because we use this as the break condition for the preemptive flusher
thread.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing
  2020-10-01 14:09   ` Nikolay Borisov
@ 2020-10-01 21:40     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:40 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 10:09 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> A lot of this was added all in one go with no explanation, and is a bit
>> unwieldy and confusing.  Simplify the logic to start preemptive flushing
>> if we've reserved more than half of our available free space.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> <snip>
> 
>> +	 * If we have over half of the free space occupied by reservations or
>> +	 * pinned then we want to start flushing.
>> +	 *
>> +	 * We do not do the traditional thing here, which is to say
>> +	 *
>> +	 *   if (used >= ((total_bytes + avail) >> 1))
>> +	 *     return 1;
>> +	 *
>> +	 * because this doesn't quite work how we want.  If we had more than 50%
>> +	 * of the space_info used by bytes_used and we had 0 available we'd just
>> +	 * constantly run the background flusher.  Instead we want it to kick in
>> +	 * if our reclaimable space exceeds 50% of our available free space.
>> +	 */
>> +	thresh = calc_available_free_space(fs_info, space_info,
>> +					   BTRFS_RESERVE_FLUSH_ALL);
>> +	thresh += (space_info->total_bytes - space_info->bytes_used -
>> +		   space_info->bytes_reserved - space_info->bytes_readonly);
> 
> Isn't the free space in space_info calculated by subtracting every
> bytes_* field from total_bytes?  I.e. why aren't you subtracting bytes_may_use
> and bytes_pinned?  Shouldn't this be
> 
> thresh += space_info->total_bytes - btrfs_space_info_used(space_info, true)
> 

Because I'm using the reservations in `used` below.  What I want is to see how
much free space we have for all of our reservations, and then use that as the
threshold for when to start preemptive flushing.  Then below we use
bytes_may_use and bytes_pinned as the `used`.  If I subtracted them here it
would appear that we had less free space when I then went and did

return (used >= thresh);

Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] btrfs: implement space clamping for preemptive flushing
  2020-10-01 14:49   ` Nikolay Borisov
@ 2020-10-01 21:41     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-01 21:41 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/1/20 10:49 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> Starting preemptive flushing at 50% of available free space is a good
>> start, but some workloads are particularly abusive and can quickly
>> overwhelm the preemptive flushing code and drive us into using tickets.
>>
>> Handle this by clamping down on our threshold for starting and
>> continuing to run preemptive flushing.  This is particularly important
>> for our overcommit case, as we can really drive the file system into
>> overages and then it's more difficult to pull it back as we start to
>> actually fill up the file system.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> nit: IMO it would be worthwhile to very briefly describe the threshold
> calculation: essentially it is 2^CLAMP and we start with 1. So in the
> best case we'll preemptively flush once we have allocated more than 1/2
> (50%) of the free space, and in the worst case 1/256th (~0.4%).
> 

Yup I'll reword, thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing
  2020-09-30 20:01 ` [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing Josef Bacik
  2020-10-01 14:09   ` Nikolay Borisov
@ 2020-10-02  7:13   ` Nikolay Borisov
  1 sibling, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-02  7:13 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> A lot of this was added all in one go with no explanation, and is a bit
> unwieldy and confusing.  Simplify the logic to start preemptive flushing
> if we've reserved more than half of our available free space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---

<snip>

> +	/*
> +	 * If we have over half of the free space occupied by reservations or
> +	 * pinned then we want to start flushing.
> +	 *
> +	 * We do not do the traditional thing here, which is to say
> +	 *
> +	 *   if (used >= ((total_bytes + avail) >> 1))
> +	 *     return 1;
> +	 *
> +	 * because this doesn't quite work how we want.  If we had more than 50%
> +	 * of the space_info used by bytes_used and we had 0 available we'd just
> +	 * constantly run the background flusher.  Instead we want it to kick in
> +	 * if our reclaimable space exceeds 50% of our available free space.
> +	 */
> +	thresh = calc_available_free_space(fs_info, space_info,
> +					   BTRFS_RESERVE_FLUSH_ALL);
> +	thresh += (space_info->total_bytes - space_info->bytes_used -
> +		   space_info->bytes_reserved - space_info->bytes_readonly);
> +	thresh >>= 1;

Ok, a fresh read illuminates the logic: thresh contains the free space in the
space_info plus bytes_may_use plus bytes_pinned, so the check below says that
if bytes_may_use + bytes_pinned make up 50% or more of what's potentially
available - flush.
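
To put rough, purely illustrative numbers on it:

	avail (calc_available_free_space)                           = 2GiB
	total_bytes - bytes_used - bytes_reserved - bytes_readonly  = 3GiB
	thresh = (2GiB + 3GiB) >> 1                                 = 2.5GiB

	i.e. preemptive flushing kicks in (and keeps running) once
	bytes_may_use + bytes_pinned >= 2.5GiB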


<snip>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state
  2020-09-30 20:01 ` [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state Josef Bacik
@ 2020-10-02  8:30   ` Nikolay Borisov
  2020-10-02 13:45     ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-02  8:30 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> Often when I'm debugging ENOSPC related issues I have to resort to
> printing the entire ENOSPC state with trace_printk() in different spots.
> This gets pretty annoying, so add a trace class that does this for us.
> Then add a trace point at the end of preemptive flushing so you can see
> the state of the space_info when we decide to exit preemptive flushing.
> This helped me figure out we weren't kicking in the preemptive flushing
> soon enough.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c        |  1 +
>  include/trace/events/btrfs.h | 62 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index b9735e7de480..6f569a7d2df4 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1109,6 +1109,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
>  	/* We only went through once, back off our clamping. */
>  	if (loops == 1 && !space_info->reclaim_size)
>  		space_info->clamp = max(1, space_info->clamp - 1);
> +	trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
>  	spin_unlock(&space_info->lock);
>  }
>  
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index c340bff65450..651ac47a6945 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2027,6 +2027,68 @@ TRACE_EVENT(btrfs_convert_extent_bit,
>  		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
>  );
>  
> +DECLARE_EVENT_CLASS(btrfs_dump_space_info,
> +	TP_PROTO(const struct btrfs_fs_info *fs_info,
> +		 const struct btrfs_space_info *sinfo),
> +
> +	TP_ARGS(fs_info, sinfo),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	flags			)
> +		__field(	u64,	total_bytes		)
> +		__field(	u64,	bytes_used		)
> +		__field(	u64,	bytes_pinned		)
> +		__field(	u64,	bytes_reserved		)
> +		__field(	u64,	bytes_may_use		)
> +		__field(	u64,	bytes_readonly		)
> +		__field(	u64,	reclaim_size		)
> +		__field(	int,	clamp			)
> +		__field(	u64,	global_reserved		)
> +		__field(	u64,	trans_reserved		)
> +		__field(	u64,	delayed_refs_reserved	)
> +		__field(	u64,	delayed_reserved	)
> +		__field(	u64,	free_chunk_space	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->flags			=	sinfo->flags;
> +		__entry->total_bytes		=	sinfo->total_bytes;
> +		__entry->bytes_used		=	sinfo->bytes_used;
> +		__entry->bytes_pinned		=	sinfo->bytes_pinned;
> +		__entry->bytes_reserved		=	sinfo->bytes_reserved;
> +		__entry->bytes_may_use		=	sinfo->bytes_may_use;
> +		__entry->bytes_readonly		=	sinfo->bytes_readonly;
> +		__entry->reclaim_size		=	sinfo->reclaim_size;
> +		__entry->clamp			=	sinfo->clamp;
> +		__entry->global_reserved	=	fs_info->global_block_rsv.reserved;
> +		__entry->trans_reserved		=	fs_info->trans_block_rsv.reserved;
> +		__entry->delayed_refs_reserved	=	fs_info->delayed_refs_rsv.reserved;
> +		__entry->delayed_reserved	=	fs_info->delayed_block_rsv.reserved;
> +		__entry->free_chunk_space	=	atomic64_read(&fs_info->free_chunk_space);
> +	),
> +
> +	TP_printk_btrfs("flags=%s total_bytes=%llu bytes_used=%llu "
> +			"bytes_pinned=%llu bytes_reserved=%llu "
> +			"bytes_may_use=%llu bytes_readonly=%llu "
> +			"reclaim_size=%llu clamp=%d global_reserved=%llu "
> +			"trans_reserved=%llu delayed_refs_reserved=%llu "
> +			"delayed_reserved=%llu chunk_free_space=%llu",
> +			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
> +			__entry->total_bytes, __entry->bytes_used,
> +			__entry->bytes_pinned, __entry->bytes_reserved,
> +			__entry->bytes_may_use, __entry->bytes_readonly,
> +			__entry->reclaim_size, __entry->clamp,
> +			__entry->global_reserved, __entry->trans_reserved,
> +			__entry->delayed_refs_reserved,
> +			__entry->delayed_reserved, __entry->free_chunk_space)
> +);
> +
> +DEFINE_EVENT(btrfs_dump_space_info, btrfs_done_preemptive_reclaim,
> +	TP_PROTO(const struct btrfs_fs_info *fs_info,
> +		 const struct btrfs_space_info *sinfo),
> +	TP_ARGS(fs_info, sinfo)
> +);


I think this could be just a TRACE_EVENT; do you expect to define
further events based on the same class?

> +
>  TRACE_EVENT(btrfs_add_reserve_ticket,
>  	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, int flush,
>  		 u64 bytes),
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state
  2020-10-02  8:30   ` Nikolay Borisov
@ 2020-10-02 13:45     ` Josef Bacik
  0 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-10-02 13:45 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 10/2/20 4:30 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
>> Often when I'm debugging ENOSPC related issues I have to resort to
>> printing the entire ENOSPC state with trace_printk() in different spots.
>> This gets pretty annoying, so add a trace class that does this for us.
>> Then add a trace point at the end of preemptive flushing so you can see
>> the state of the space_info when we decide to exit preemptive flushing.
>> This helped me figure out we weren't kicking in the preemptive flushing
>> soon enough.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/space-info.c        |  1 +
>>   include/trace/events/btrfs.h | 62 ++++++++++++++++++++++++++++++++++++
>>   2 files changed, 63 insertions(+)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index b9735e7de480..6f569a7d2df4 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -1109,6 +1109,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
>>   	/* We only went through once, back off our clamping. */
>>   	if (loops == 1 && !space_info->reclaim_size)
>>   		space_info->clamp = max(1, space_info->clamp - 1);
>> +	trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
>>   	spin_unlock(&space_info->lock);
>>   }
>>   
>> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
>> index c340bff65450..651ac47a6945 100644
>> --- a/include/trace/events/btrfs.h
>> +++ b/include/trace/events/btrfs.h
>> @@ -2027,6 +2027,68 @@ TRACE_EVENT(btrfs_convert_extent_bit,
>>   		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
>>   );
>>   
>> +DECLARE_EVENT_CLASS(btrfs_dump_space_info,
>> +	TP_PROTO(const struct btrfs_fs_info *fs_info,
>> +		 const struct btrfs_space_info *sinfo),
>> +
>> +	TP_ARGS(fs_info, sinfo),
>> +
>> +	TP_STRUCT__entry_btrfs(
>> +		__field(	u64,	flags			)
>> +		__field(	u64,	total_bytes		)
>> +		__field(	u64,	bytes_used		)
>> +		__field(	u64,	bytes_pinned		)
>> +		__field(	u64,	bytes_reserved		)
>> +		__field(	u64,	bytes_may_use		)
>> +		__field(	u64,	bytes_readonly		)
>> +		__field(	u64,	reclaim_size		)
>> +		__field(	int,	clamp			)
>> +		__field(	u64,	global_reserved		)
>> +		__field(	u64,	trans_reserved		)
>> +		__field(	u64,	delayed_refs_reserved	)
>> +		__field(	u64,	delayed_reserved	)
>> +		__field(	u64,	free_chunk_space	)
>> +	),
>> +
>> +	TP_fast_assign_btrfs(fs_info,
>> +		__entry->flags			=	sinfo->flags;
>> +		__entry->total_bytes		=	sinfo->total_bytes;
>> +		__entry->bytes_used		=	sinfo->bytes_used;
>> +		__entry->bytes_pinned		=	sinfo->bytes_pinned;
>> +		__entry->bytes_reserved		=	sinfo->bytes_reserved;
>> +		__entry->bytes_may_use		=	sinfo->bytes_may_use;
>> +		__entry->bytes_readonly		=	sinfo->bytes_readonly;
>> +		__entry->reclaim_size		=	sinfo->reclaim_size;
>> +		__entry->clamp			=	sinfo->clamp;
>> +		__entry->global_reserved	=	fs_info->global_block_rsv.reserved;
>> +		__entry->trans_reserved		=	fs_info->trans_block_rsv.reserved;
>> +		__entry->delayed_refs_reserved	=	fs_info->delayed_refs_rsv.reserved;
>> +		__entry->delayed_reserved	=	fs_info->delayed_block_rsv.reserved;
>> +		__entry->free_chunk_space	=	atomic64_read(&fs_info->free_chunk_space);
>> +	),
>> +
>> +	TP_printk_btrfs("flags=%s total_bytes=%llu bytes_used=%llu "
>> +			"bytes_pinned=%llu bytes_reserved=%llu "
>> +			"bytes_may_use=%llu bytes_readonly=%llu "
>> +			"reclaim_size=%llu clamp=%d global_reserved=%llu "
>> +			"trans_reserved=%llu delayed_refs_reserved=%llu "
>> +			"delayed_reserved=%llu chunk_free_space=%llu",
>> +			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
>> +			__entry->total_bytes, __entry->bytes_used,
>> +			__entry->bytes_pinned, __entry->bytes_reserved,
>> +			__entry->bytes_may_use, __entry->bytes_readonly,
>> +			__entry->reclaim_size, __entry->clamp,
>> +			__entry->global_reserved, __entry->trans_reserved,
>> +			__entry->delayed_refs_reserved,
>> +			__entry->delayed_reserved, __entry->free_chunk_space)
>> +);
>> +
>> +DEFINE_EVENT(btrfs_dump_space_info, btrfs_done_preemptive_reclaim,
>> +	TP_PROTO(const struct btrfs_fs_info *fs_info,
>> +		 const struct btrfs_space_info *sinfo),
>> +	TP_ARGS(fs_info, sinfo)
>> +);
> 
> 
> I think this could be just a TRACE_EVENT; do you expect to define
> further events based on the same class?

Yes, the idea is that the next time I'm working on something I can add a new
event where I need it.  I had thoughts about where I'd add more right now, but
I'd rather wait until I have a real need for another tracepoint before I go
adding them randomly in our code.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] Improve preemptive ENOSPC flushing
  2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
                   ` (8 preceding siblings ...)
  2020-09-30 20:01 ` [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state Josef Bacik
@ 2020-10-06 12:55 ` Nikolay Borisov
  9 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2020-10-06 12:55 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> Hello,
> 
> A while ago Nikolay started digging into a problem where they were seeing an
> around 20% regression on random writes, and he bisected it down to
> 
>   btrfs: don't end the transaction for delayed refs in throttle
> 
> However this wasn't actually the cause of the problem.
> 
> This patch removed the code that would preemptively end the transactions if we
> were low on space.  Because we had just introduced the ticketing code, this was
> no longer necessary and was causing a lot of transaction commits.
> 
> And in Nikolay's testing he validated this, we would see like 100x more
> transaction commits without that patch than with it, but the write regression
> clearly appeared when this patch was applied.
> 
> The root cause of this is that the transaction commits were essentially
> happening so quickly that we didn't end up needing to wait on space in the
> ENOSPC ticketing code as much, and thus were able to write pretty quickly.  With
> this gone, we now were getting a sawtoothy sort of behavior where we'd run up,
> stop while we flushed metadata space, run some more, stop again etc.
> 
> When I implemented the ticketing infrastructure, I was trying to get us out of
> excessively flushing space because we would sometimes over create block groups,
> and thus short circuited flushing if we no longer had tickets.  This had the
> side effect of breaking the preemptive flushing code, where we attempted to
> flush space in the background before we were forced to wait for space.
> 
> Enter this patchset.  We still have some of this preemption logic sprinkled
> everywhere, so I've separated it out of the normal ticketed flushing code, and
> made preemptive flushing it's own thing.
> 
> The preemptive flushing logic is more specialized than the standard flushing
> logic.  It attempts to flush in whichever pool has the highest usage.  This
> means that if most of our space is tied up in pinned extents, we'll commit the
> transaction.  If most of the space is tied up in delalloc, we'll flush delalloc,
> etc.
> 
> To test this out I used the fio job that Nikolay used, this needs to be adjusted
> so the overall IO size is at least 2x the RAM size for the box you are testing
> 
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
>         --group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
>         --name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
>         --size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
>         --filename_format=FioWorkloads.\$jobnum
> 
> I got the following results
> 
> misc-next:	bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s), io=4015MiB (4210MB), run=300323-300323msec
> pre-throttling:	bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s), io=5068MiB (5314MB), run=300069-300069msec
> my patches:	bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s), io=5403MiB (5666MB), run=300001-300001msec
> 
> Thanks,
> 
> Josef
> 
> Josef Bacik (9):
>   btrfs: add a trace point for reserve tickets
>   btrfs: improve preemptive background space flushing
>   btrfs: rename need_do_async_reclaim
>   btrfs: check reclaim_size in need_preemptive_reclaim
>   btrfs: rework btrfs_calc_reclaim_metadata_size
>   btrfs: simplify the logic in need_preemptive_flushing
>   btrfs: implement space clamping for preemptive flushing
>   btrfs: adjust the flush trace point to include the source
>   btrfs: add a trace class for dumping the current ENOSPC state
> 
>  fs/btrfs/ctree.h             |   1 +
>  fs/btrfs/disk-io.c           |   1 +
>  fs/btrfs/space-info.c        | 216 +++++++++++++++++++++++++++++------
>  fs/btrfs/space-info.h        |   3 +
>  include/trace/events/btrfs.h | 112 +++++++++++++++++-
>  5 files changed, 292 insertions(+), 41 deletions(-)
> 



Here are my results from testing misc-next/your patches/4.19 for buffered io: 

With your patches:
 WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=8192MiB (8590MB), run=281678-281678msec
 WRITE: bw=30.0MiB/s (32.5MB/s), 30.0MiB/s-30.0MiB/s (32.5MB/s-32.5MB/s), io=8192MiB (8590MB), run=264337-264337msec
 WRITE: bw=29.6MiB/s (31.1MB/s), 29.6MiB/s-29.6MiB/s (31.1MB/s-31.1MB/s), io=8192MiB (8590MB), run=276312-276312msec
 WRITE: bw=29.8MiB/s (31.2MB/s), 29.8MiB/s-29.8MiB/s (31.2MB/s-31.2MB/s), io=8192MiB (8590MB), run=274916-274916msec
 WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269030-269030msec

Without-misc-next:
  WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=8192MiB (8590MB), run=404831-404831msec
  WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394749-394749msec
  WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=393291-393291msec
  WRITE: bw=20.7MiB/s (21.8MB/s), 20.7MiB/s-20.7MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394918-394918msec
  WRITE: bw=21.1MiB/s (22.1MB/s), 21.1MiB/s-21.1MiB/s (22.1MB/s-22.1MB/s), io=8192MiB (8590MB), run=388499-388499msec

4.19.x:
  WRITE: bw=23.3MiB/s (24.4MB/s), 23.3MiB/s-23.3MiB/s (24.4MB/s-24.4MB/s), io=6387MiB (6697MB), run=274460-274460msec
  WRITE: bw=23.3MiB/s (24.5MB/s), 23.3MiB/s-23.3MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284518-284518msec
  WRITE: bw=23.4MiB/s (24.5MB/s), 23.4MiB/s-23.4MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284372-284372msec
  WRITE: bw=23.6MiB/s (24.7MB/s), 23.6MiB/s-23.6MiB/s (24.7MB/s-24.7MB/s), io=6387MiB (6697MB), run=271200-271200msec
  WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=6387MiB (6697MB), run=272670-272670msec

So there is indeed a net gain in performance; however (expectedly, due to the increased number of transaction commits), there is a regression in DIO: 


DIO-josef:
  WRITE: bw=47.1MiB/s (49.4MB/s), 47.1MiB/s-47.1MiB/s (49.4MB/s-49.4MB/s), io=8192MiB (8590MB), run=174049-174049msec
  WRITE: bw=48.5MiB/s (50.8MB/s), 48.5MiB/s-48.5MiB/s (50.8MB/s-50.8MB/s), io=8192MiB (8590MB), run=169045-169045msec
  WRITE: bw=45.0MiB/s (48.2MB/s), 45.0MiB/s-45.0MiB/s (48.2MB/s-48.2MB/s), io=8192MiB (8590MB), run=178196-178196msec
  WRITE: bw=46.1MiB/s (48.3MB/s), 46.1MiB/s-46.1MiB/s (48.3MB/s-48.3MB/s), io=8192MiB (8590MB), run=177861-177861msec
  WRITE: bw=46.4MiB/s (48.7MB/s), 46.4MiB/s-46.4MiB/s (48.7MB/s-48.7MB/s), io=8192MiB (8590MB), run=176376-176376msec

DIO-misc-next:
  WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163365-163365msec
  WRITE: bw=50.3MiB/s (52.8MB/s), 50.3MiB/s-50.3MiB/s (52.8MB/s-52.8MB/s), io=8192MiB (8590MB), run=162753-162753msec
  WRITE: bw=50.6MiB/s (53.1MB/s), 50.6MiB/s-50.6MiB/s (53.1MB/s-53.1MB/s), io=8192MiB (8590MB), run=161766-161766msec
  WRITE: bw=50.2MiB/s (52.7MB/s), 50.2MiB/s-50.2MiB/s (52.7MB/s-52.7MB/s), io=8192MiB (8590MB), run=163074-163074msec
  WRITE: bw=50.5MiB/s (52.9MB/s), 50.5MiB/s-50.5MiB/s (52.9MB/s-52.9MB/s), io=8192MiB (8590MB), run=162252-162252msec


With that: 

Tested-by: Nikolay Borisov <nborisov@suse.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-10-06 12:55 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-30 20:01 [PATCH 0/9] Improve preemptive ENOSPC flushing Josef Bacik
2020-09-30 20:01 ` [PATCH 1/9] btrfs: add a trace point for reserve tickets Josef Bacik
2020-10-01  5:54   ` Nikolay Borisov
2020-10-01 21:33     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 2/9] btrfs: improve preemptive background space flushing Josef Bacik
2020-10-01 13:19   ` Nikolay Borisov
2020-10-01 21:35     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 3/9] btrfs: rename need_do_async_reclaim Josef Bacik
2020-10-01 13:20   ` Nikolay Borisov
2020-10-01 13:24   ` Nikolay Borisov
2020-10-01 21:37     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 4/9] btrfs: check reclaim_size in need_preemptive_reclaim Josef Bacik
2020-10-01 13:23   ` Nikolay Borisov
2020-10-01 21:36     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 5/9] btrfs: rework btrfs_calc_reclaim_metadata_size Josef Bacik
2020-10-01 13:59   ` Nikolay Borisov
2020-10-01 21:38     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 6/9] btrfs: simplify the logic in need_preemptive_flushing Josef Bacik
2020-10-01 14:09   ` Nikolay Borisov
2020-10-01 21:40     ` Josef Bacik
2020-10-02  7:13   ` Nikolay Borisov
2020-09-30 20:01 ` [PATCH 7/9] btrfs: implement space clamping for preemptive flushing Josef Bacik
2020-10-01 14:49   ` Nikolay Borisov
2020-10-01 21:41     ` Josef Bacik
2020-09-30 20:01 ` [PATCH 8/9] btrfs: adjust the flush trace point to include the source Josef Bacik
2020-10-01 15:32   ` Nikolay Borisov
2020-09-30 20:01 ` [PATCH 9/9] btrfs: add a trace class for dumping the current ENOSPC state Josef Bacik
2020-10-02  8:30   ` Nikolay Borisov
2020-10-02 13:45     ` Josef Bacik
2020-10-06 12:55 ` [PATCH 0/9] Improve preemptive ENOSPC flushing Nikolay Borisov
