* [PATCH 1/7] btrfs: check worker before need_preemptive_reclaim
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-29 13:35 ` Johannes Thumshirn
2021-04-28 17:38 ` [PATCH 2/7] btrfs: only clamp the first time we have to start flushing Josef Bacik
` (7 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
need_preemptive_reclaim() does some calculations which, while not
heavy, are pointless if we're already running preemptive reclaim.
Reorder the checks so that we skip the calculation entirely when
reclaim is already in progress.
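The effect of the swapped operands can be sketched outside the kernel like this; all names here are illustrative stand-ins, not the btrfs functions themselves:

```c
#include <stdbool.h>

/* With C's short-circuiting &&, putting the cheap work_busy()-style
 * check first means the need_preemptive_reclaim()-style calculation
 * never runs while the reclaim worker is already busy. */

static int calc_calls; /* how often the "expensive" check ran */

static bool need_reclaim_calc(void)
{
	calc_calls++; /* stands in for the reclaim math */
	return true;
}

static bool should_queue_reclaim(bool worker_busy)
{
	/* cheap check first: the right operand is skipped entirely
	 * when the worker is already running */
	return !worker_busy && need_reclaim_calc();
}
```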
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 2dc674b7c3b1..c9a5e003bcfa 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1588,8 +1588,8 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
* the async reclaim as we will panic.
*/
if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
- need_preemptive_reclaim(fs_info, space_info) &&
- !work_busy(&fs_info->preempt_reclaim_work)) {
+ !work_busy(&fs_info->preempt_reclaim_work) &&
+ need_preemptive_reclaim(fs_info, space_info)) {
trace_btrfs_trigger_flush(fs_info, space_info->flags,
orig_bytes, flush, "preempt");
queue_work(system_unbound_wq,
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/7] btrfs: only clamp the first time we have to start flushing
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
2021-04-28 17:38 ` [PATCH 1/7] btrfs: check worker before need_preemptive_reclaim Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-28 17:38 ` [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim Josef Bacik
` (6 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which with a lot of threads means we'd essentially
max out the clamp the first time we started to flush. Instead, only
clamp each time we have to start flushing; this makes us ramp up
gradually rather than jumping straight to maximum clamping as soon as
we start needing to flush.
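A toy model (not the btrfs code) of the new behavior: the clamp tightens once per flush cycle rather than once per queued ticket, so many waiting threads no longer max it out immediately. The MAX_CLAMP cap and field names below are assumptions for illustration:

```c
#define MAX_CLAMP 8

struct toy_space_info {
	int clamp;
	int flush; /* nonzero while a flush cycle is running */
};

static void maybe_clamp_preempt(struct toy_space_info *s)
{
	if (s->clamp < MAX_CLAMP)
		s->clamp++;
}

static void add_ticket(struct toy_space_info *s)
{
	if (!s->flush) {
		/* clamp only when flushing actually starts... */
		maybe_clamp_preempt(s);
		s->flush = 1;
	}
	/* ...not for every ticket queued while it runs */
}
```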
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c9a5e003bcfa..33edab17af0d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1561,6 +1561,15 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
flush == BTRFS_RESERVE_FLUSH_DATA) {
list_add_tail(&ticket.list, &space_info->tickets);
if (!space_info->flush) {
+ /*
+ * We were forced to add a reserve ticket, so
+ * our preemptive flushing is unable to keep
+ * up. Clamp down on the threshold for the
+ * preemptive flushing in order to keep up with
+ * the workload.
+ */
+ maybe_clamp_preempt(fs_info, space_info);
+
space_info->flush = 1;
trace_btrfs_trigger_flush(fs_info,
space_info->flags,
@@ -1572,14 +1581,6 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
list_add_tail(&ticket.list,
&space_info->priority_tickets);
}
-
- /*
- * We were forced to add a reserve ticket, so our preemptive
- * flushing is unable to keep up. Clamp down on the threshold
- * for the preemptive flushing in order to keep up with the
- * workload.
- */
- maybe_clamp_preempt(fs_info, space_info);
} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
used += orig_bytes;
/*
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
2021-04-28 17:38 ` [PATCH 1/7] btrfs: check worker before need_preemptive_reclaim Josef Bacik
2021-04-28 17:38 ` [PATCH 2/7] btrfs: only clamp the first time we have to start flushing Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-29 13:50 ` Johannes Thumshirn
2021-04-29 20:03 ` David Sterba
2021-04-28 17:38 ` [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation Josef Bacik
` (5 subsequent siblings)
8 siblings, 2 replies; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The global rsv can't be used for normal allocations, so on very full
file systems we can end up trying to async flush constantly even though
there's really not a lot of space to reclaim. Deal with this by
including the global block rsv size in the "total used" calculation.
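A sketch of the "plain full" check with the global reserve counted as used space; div_factor_fine() is modeled on its btrfs meaning (a percentage of the total), and the wrapper below is illustrative rather than the kernel implementation:

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t div_factor_fine(uint64_t num, int factor)
{
	return num * factor / 100;
}

static bool plain_full(uint64_t total_bytes, uint64_t bytes_used,
		       uint64_t bytes_reserved, uint64_t global_rsv)
{
	uint64_t thresh = div_factor_fine(total_bytes, 98);

	/* counting the unusable global reserve lets a nearly full
	 * fs trip the threshold, so async reclaim is skipped */
	return bytes_used + bytes_reserved + global_rsv >= thresh;
}
```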
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 33edab17af0d..e341f995a7dd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -792,12 +792,14 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info)
{
+ u64 global_rsv_size = fs_info->global_block_rsv.reserved;
u64 ordered, delalloc;
u64 thresh = div_factor_fine(space_info->total_bytes, 98);
u64 used;
/* If we're just plain full then async reclaim just slows us down. */
- if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
+ if ((space_info->bytes_used + space_info->bytes_reserved +
+ global_rsv_size) >= thresh)
return false;
/*
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim
2021-04-28 17:38 ` [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim Josef Bacik
@ 2021-04-29 13:50 ` Johannes Thumshirn
2021-04-29 20:03 ` David Sterba
1 sibling, 0 replies; 21+ messages in thread
From: Johannes Thumshirn @ 2021-04-29 13:50 UTC (permalink / raw)
To: Josef Bacik, linux-btrfs, kernel-team
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim
2021-04-28 17:38 ` [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim Josef Bacik
2021-04-29 13:50 ` Johannes Thumshirn
@ 2021-04-29 20:03 ` David Sterba
2021-04-29 20:05 ` David Sterba
1 sibling, 1 reply; 21+ messages in thread
From: David Sterba @ 2021-04-29 20:03 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Wed, Apr 28, 2021 at 01:38:44PM -0400, Josef Bacik wrote:
> Global rsv can't be used for normal allocations, and for very full file
> systems we can decide to try and async flush constantly even though
> there's really not a lot of space to reclaim. Deal with this by
> including the global block rsv size in the "total used" calculation.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/space-info.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 33edab17af0d..e341f995a7dd 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -792,12 +792,14 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
> static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
> struct btrfs_space_info *space_info)
> {
> + u64 global_rsv_size = fs_info->global_block_rsv.reserved;
> u64 ordered, delalloc;
> u64 thresh = div_factor_fine(space_info->total_bytes, 98);
> u64 used;
>
> /* If we're just plain full then async reclaim just slows us down. */
> - if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
> + if ((space_info->bytes_used + space_info->bytes_reserved +
> + global_rsv_size) >= thresh)
That's an extra variable for one time use. Do you have plans to use it
in further patches? Regarding readability, it has to be put on a
separate line anyway so there's space for the whole identifier. But if
you intend to use it then it's ok to keep it.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim
2021-04-29 20:03 ` David Sterba
@ 2021-04-29 20:05 ` David Sterba
0 siblings, 0 replies; 21+ messages in thread
From: David Sterba @ 2021-04-29 20:05 UTC (permalink / raw)
To: dsterba, Josef Bacik, linux-btrfs, kernel-team
On Thu, Apr 29, 2021 at 10:03:10PM +0200, David Sterba wrote:
> On Wed, Apr 28, 2021 at 01:38:44PM -0400, Josef Bacik wrote:
> > Global rsv can't be used for normal allocations, and for very full file
> > systems we can decide to try and async flush constantly even though
> > there's really not a lot of space to reclaim. Deal with this by
> > including the global block rsv size in the "total used" calculation.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/space-info.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index 33edab17af0d..e341f995a7dd 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -792,12 +792,14 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
> > static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
> > struct btrfs_space_info *space_info)
> > {
> > + u64 global_rsv_size = fs_info->global_block_rsv.reserved;
> > u64 ordered, delalloc;
> > u64 thresh = div_factor_fine(space_info->total_bytes, 98);
> > u64 used;
> >
> > /* If we're just plain full then async reclaim just slows us down. */
> > - if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
> > + if ((space_info->bytes_used + space_info->bytes_reserved +
> > + global_rsv_size) >= thresh)
>
> That's an extra variable for one time use. Do you have plans to use it
> in further patches? Regarding readability, it has to be put on a
> separate line anyway so there's space for the whole identifier bit if
> you intend to use it then it's ok to keep it.
Never mind, I got fooled by the diff -p that shows the name of the
previous function (btrfs_calc_reclaim_metadata_size) and not the actual
function where it is being added (need_preemptive_reclaim) in case it's
in the hunk context.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
` (2 preceding siblings ...)
2021-04-28 17:38 ` [PATCH 3/7] btrfs: take into account global rsv in need_preemptive_reclaim Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-29 14:04 ` Johannes Thumshirn
2021-04-28 17:38 ` [PATCH 5/7] btrfs: don't include the global rsv size in the preemptive used amount Josef Bacik
` (4 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space. However we weren't
taking into account the global block rsv, which is essentially hard used
space. Handle this by subtracting it from the available free space, so
that our threshold more closely mirrors reality.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e341f995a7dd..b0dd9b57d352 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -840,8 +840,10 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
thresh = calc_available_free_space(fs_info, space_info,
BTRFS_RESERVE_FLUSH_ALL);
- thresh += (space_info->total_bytes - space_info->bytes_used -
- space_info->bytes_reserved - space_info->bytes_readonly);
+ used = space_info->bytes_used + space_info->bytes_reserved +
+ space_info->bytes_readonly + global_rsv_size;
+ if (used < space_info->total_bytes)
+ thresh += space_info->total_bytes - used;
thresh >>= space_info->clamp;
used = space_info->bytes_pinned;
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation
2021-04-28 17:38 ` [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation Josef Bacik
@ 2021-04-29 14:04 ` Johannes Thumshirn
2021-04-29 15:19 ` Josef Bacik
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Thumshirn @ 2021-04-29 14:04 UTC (permalink / raw)
To: Josef Bacik, linux-btrfs, kernel-team
On 28/04/2021 19:40, Josef Bacik wrote:
> - thresh += (space_info->total_bytes - space_info->bytes_used -
> - space_info->bytes_reserved - space_info->bytes_readonly);
> + used = space_info->bytes_used + space_info->bytes_reserved +
> + space_info->bytes_readonly + global_rsv_size;
> + if (used < space_info->total_bytes)
> + thresh += space_info->total_bytes - used;
> thresh >>= space_info->clamp;
I don't quite understand why you introduced the 'if' here. Now we're only
adding the new free space to the threshold if we're using less than our
total free space, which kinda makes sense that we're not going over our
total free space.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation
2021-04-29 14:04 ` Johannes Thumshirn
@ 2021-04-29 15:19 ` Josef Bacik
0 siblings, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2021-04-29 15:19 UTC (permalink / raw)
To: Johannes Thumshirn, linux-btrfs, kernel-team
On 4/29/21 10:04 AM, Johannes Thumshirn wrote:
> On 28/04/2021 19:40, Josef Bacik wrote:
>> - thresh += (space_info->total_bytes - space_info->bytes_used -
>> - space_info->bytes_reserved - space_info->bytes_readonly);
>> + used = space_info->bytes_used + space_info->bytes_reserved +
>> + space_info->bytes_readonly + global_rsv_size;
>> + if (used < space_info->total_bytes)
>> + thresh += space_info->total_bytes - used;
>> thresh >>= space_info->clamp;
>
>
> I don't quite understand why you introduced the 'if' here. Now we're only
> adding the new free space to the threshold if we're using less than our
> total free space, which kinda makes sense that we're not going over our
> total free space.
>
Because it's possible that the global_rsv_size + used is > total_bytes, because
sometimes the global reserve can end up being calculated as larger than the
available size (think stupid small fs'es where we only have the original 8mib
chunk of metadata). It doesn't usually happen, but that sort of thinking gets
me into trouble, so this is safer. Thanks,
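The underflow the 'if' guards against can be demonstrated in isolation; this is a sketch, not the kernel code:

```c
#include <stdint.h>

/* With unsigned 64-bit math, total - used wraps around to an
 * enormous value whenever used exceeds total, which can happen on
 * tiny filesystems where the global reserve is larger than the
 * available metadata space. The guard returns 0 instead. */
static uint64_t remaining_free(uint64_t total, uint64_t used)
{
	return used < total ? total - used : 0;
}

/* Unguarded version, for contrast: wraps on underflow. */
static uint64_t remaining_free_unguarded(uint64_t total, uint64_t used)
{
	return total - used;
}
```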
Josef
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 5/7] btrfs: don't include the global rsv size in the preemptive used amount
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
` (3 preceding siblings ...)
2021-04-28 17:38 ` [PATCH 4/7] btrfs: use the global rsv size in the preemptive thresh calculation Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-29 14:19 ` Johannes Thumshirn
2021-04-28 17:38 ` [PATCH 6/7] btrfs: only ignore delalloc if delalloc is much smaller than ordered Josef Bacik
` (3 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs. However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation. If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.
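The adjustment reduces to a single subtraction; a minimal sketch, with the assumption (stated in the comment) that the reserve is itself accounted in bytes_may_use:

```c
#include <stdint.h>

/* bytes_may_use includes the global reserve, but the reserve can
 * never be flushed, so the flushable estimate drops it. Assumes
 * bytes_may_use >= global_rsv, which holds because the reserve is
 * accounted within bytes_may_use. */
static uint64_t flushable_may_use(uint64_t bytes_may_use,
				  uint64_t global_rsv)
{
	return bytes_may_use - global_rsv;
}
```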
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index b0dd9b57d352..4e3857474cfd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -871,7 +871,7 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
used += fs_info->delayed_refs_rsv.reserved +
fs_info->delayed_block_rsv.reserved;
else
- used += space_info->bytes_may_use;
+ used += space_info->bytes_may_use - global_rsv_size;
return (used >= thresh && !btrfs_fs_closing(fs_info) &&
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 6/7] btrfs: only ignore delalloc if delalloc is much smaller than ordered
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
` (4 preceding siblings ...)
2021-04-28 17:38 ` [PATCH 5/7] btrfs: don't include the global rsv size in the preemptive used amount Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
2021-04-28 17:38 ` [PATCH 7/7] btrfs: handle preemptive delalloc flushing slightly differently Josef Bacik
` (2 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing even when we had loads of delalloc
available to flush. This is because we skip preemptive flushing if
delalloc <= ordered. However if we start with, say, 4GiB of delalloc
and flush 2GiB of that, we'll stop flushing there even though we still
have 2GiB of delalloc left to flush.
Instead, adjust the ordered bytes down by half; this way, if 2/3 of
our outstanding delalloc reservations are tied up by ordered extents,
we don't bother preemptively flushing, as we're getting close to the
state where we need to wait on ordered extents.
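The halved comparison can be sketched as a standalone predicate (illustrative names, not the kernel code): delalloc is only ignored once ordered / 2 >= delalloc, i.e. once ordered extents make up roughly 2/3 of the outstanding reservations.

```c
#include <stdint.h>
#include <stdbool.h>

static bool ignore_delalloc(uint64_t ordered, uint64_t delalloc)
{
	/* keep preemptively flushing until ordered is at least
	 * twice the remaining delalloc */
	return (ordered >> 1) >= delalloc;
}
```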
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 4e3857474cfd..cf09b23f3448 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -864,8 +864,14 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
* clearly be heavy enough to warrant preemptive flushing. In the case
* of heavy DIO or ordered reservations, preemptive flushing will just
* waste time and cause us to slow down.
+ *
+ * We want to make sure we truly are maxed out on ordered however, so
+ * cut ordered in half, and if it's still higher than delalloc then we
+ * can keep flushing. This is to avoid the case where we start
+ * flushing, and now delalloc == ordered and we stop preemptively
+ * flushing when we could still have several gigs of delalloc to flush.
*/
- ordered = percpu_counter_read_positive(&fs_info->ordered_bytes);
+ ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);
if (ordered >= delalloc)
used += fs_info->delayed_refs_rsv.reserved +
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 7/7] btrfs: handle preemptive delalloc flushing slightly differently
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
` (5 preceding siblings ...)
2021-04-28 17:38 ` [PATCH 6/7] btrfs: only ignore delalloc if delalloc is much smaller than ordered Josef Bacik
@ 2021-04-28 17:38 ` Josef Bacik
[not found] ` <20210429094852.DAC3.409509F4@e16-tech.com>
2021-04-30 15:47 ` David Sterba
8 siblings, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2021-04-28 17:38 UTC (permalink / raw)
To: linux-btrfs, kernel-team
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing. However
there was logic to go ahead and wait on ordered extents if there were
more ordered bytes than delalloc bytes. We do not want this behavior,
so pass through whether this flushing is for preemption, and do not
wait for ordered extents if that's the case. Also break out of the
shrink loop after the first flush, as we just want a one-shot shrink
of delalloc.
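A toy model of the one-shot behavior described above (a sketch under assumed names, not the kernel loop): flushing initiated by the preemptive flusher does a single pass of the shrink loop, while ticketed flushing keeps looping toward its target.

```c
#include <stdbool.h>

static int shrink_passes(bool for_preempt, int max_loops)
{
	int loops = 0;

	while (loops < max_loops) {
		loops++;
		if (for_preempt)
			break; /* one shot for the preemptive flusher */
	}
	return loops;
}
```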
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index cf09b23f3448..b2d834b92811 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -495,7 +495,8 @@ static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
*/
static void shrink_delalloc(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
- u64 to_reclaim, bool wait_ordered)
+ u64 to_reclaim, bool wait_ordered,
+ bool for_preempt)
{
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
@@ -532,7 +533,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
* ordered extents, otherwise we'll waste time trying to flush delalloc
* that likely won't give us the space back we need.
*/
- if (ordered_bytes > delalloc_bytes)
+ if (ordered_bytes > delalloc_bytes && !for_preempt)
wait_ordered = true;
loops = 0;
@@ -551,6 +552,14 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
break;
}
+ /*
+ * If we are for preemption we just want a one-shot of delalloc
+ * flushing so we can stop flushing if we decide we don't need
+ * to anymore.
+ */
+ if (for_preempt)
+ break;
+
spin_lock(&space_info->lock);
if (list_empty(&space_info->tickets) &&
list_empty(&space_info->priority_tickets)) {
@@ -702,7 +711,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
case FLUSH_DELALLOC:
case FLUSH_DELALLOC_WAIT:
shrink_delalloc(fs_info, space_info, num_bytes,
- state == FLUSH_DELALLOC_WAIT);
+ state == FLUSH_DELALLOC_WAIT, for_preempt);
break;
case FLUSH_DELAYED_REFS_NR:
case FLUSH_DELAYED_REFS:
--
2.26.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <20210429094852.DAC3.409509F4@e16-tech.com>]
* Re: [PATCH 0/7] Preemptive flushing improvements
[not found] ` <20210429094852.DAC3.409509F4@e16-tech.com>
@ 2021-04-29 15:06 ` Josef Bacik
2021-04-29 21:40 ` Wang Yugui
0 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2021-04-29 15:06 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-btrfs, kernel-team
On 4/28/21 9:48 PM, Wang Yugui wrote:
> Hi,
>
> xfstests generic/562 failed with this patch.
>
> This is the file 562.out.bad.
>
It's not failing for me, what is your config like? Both the local.config and
the type/size of your devices. Thanks,
Josef
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] Preemptive flushing improvements
2021-04-29 15:06 ` [PATCH 0/7] Preemptive flushing improvements Josef Bacik
@ 2021-04-29 21:40 ` Wang Yugui
2021-04-29 22:23 ` Wang Yugui
0 siblings, 1 reply; 21+ messages in thread
From: Wang Yugui @ 2021-04-29 21:40 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
Hi,
> On 4/28/21 9:48 PM, Wang Yugui wrote:
> > Hi,
> >
> > xfstests generic/562 failed with this patch.
> >
> > This is the file 562.out.bad.
> >
> It's not failing for me, what is your config like? Both the local.config and the type/size of your devices. Thanks,
>
# cat xfstests/local.config
dev_base_disk=/dev/disk/by-id/nvme-Dell_Express_Flash_NVMe_SM1715_800GB_SFF__S29HNYAH500207
export TEST_DIR=/mnt/test
export TEST_DEV=${dev_base_disk}-part6
export TEST_DEV=$(realpath $TEST_DEV)
export SCRATCH_MNT=/mnt/scratch
export SCRATCH_DEV_POOL="${dev_base_disk}-part1 ${dev_base_disk}-part2
${dev_base_disk}-part3 ${dev_base_disk}-part4 ${dev_base_disk}-part5"
export SCRATCH_DEV_POOL=$(realpath $SCRATCH_DEV_POOL | xargs /bin/echo )
#LOGWRITES_DEV=
# parted /dev/disk/by-id/nvme-Dell_Express_Flash_NVMe_SM1715_800GB_SFF__S29HNYAH500207 unit s print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 1562824368s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 2097152s 54525951s 52428800s btrfs primary
2 54525952s 106954751s 52428800s primary
3 106954752s 159383551s 52428800s primary
4 159383552s 211812351s 52428800s primary
5 211812352s 264241151s 52428800s primary
6 264241152s 316669951s 52428800s btrfs primary
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/30
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] Preemptive flushing improvements
2021-04-29 21:40 ` Wang Yugui
@ 2021-04-29 22:23 ` Wang Yugui
2021-04-30 15:28 ` David Sterba
0 siblings, 1 reply; 21+ messages in thread
From: Wang Yugui @ 2021-04-29 22:23 UTC (permalink / raw)
To: Josef Bacik, linux-btrfs, kernel-team
Hi,
> > On 4/28/21 9:48 PM, Wang Yugui wrote:
> > > Hi,
> > >
> > > xfstests generic/562 failed with this patch.
> > >
> > > This is the file 562.out.bad.
> > >
> > It's not failing for me, what is your config like? Both the local.config and the type/size of your devices. Thanks,
> >
>
> # cat xfstests/local.config
>
> dev_base_disk=/dev/disk/by-id/nvme-Dell_Express_Flash_NVMe_SM1715_800GB_SFF__S29HNYAH500207
>
> export TEST_DIR=/mnt/test
>
> export TEST_DEV=${dev_base_disk}-part6
> export TEST_DEV=$(realpath $TEST_DEV)
>
>
> export SCRATCH_MNT=/mnt/scratch
>
> export SCRATCH_DEV_POOL="${dev_base_disk}-part1 ${dev_base_disk}-part2
> ${dev_base_disk}-part3 ${dev_base_disk}-part4 ${dev_base_disk}-part5"
> export SCRATCH_DEV_POOL=$(realpath $SCRATCH_DEV_POOL | xargs /bin/echo )
>
> #LOGWRITES_DEV=
>
>
> # parted /dev/disk/by-id/nvme-Dell_Express_Flash_NVMe_SM1715_800GB_SFF__S29HNYAH500207 unit s print
> Model: NVMe Device (nvme)
> Disk /dev/nvme0n1: 1562824368s
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> Disk Flags:
>
> Number Start End Size File system Name Flags
> 1 2097152s 54525951s 52428800s btrfs primary
> 2 54525952s 106954751s 52428800s primary
> 3 106954752s 159383551s 52428800s primary
> 4 159383552s 211812351s 52428800s primary
> 5 211812352s 264241151s 52428800s primary
> 6 264241152s 316669951s 52428800s btrfs primary
By the way, We enable '-O no-holes -R free-space-tree' default for
mkfs.btrfs.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/30
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] Preemptive flushing improvements
2021-04-29 22:23 ` Wang Yugui
@ 2021-04-30 15:28 ` David Sterba
2021-04-30 15:43 ` Wang Yugui
0 siblings, 1 reply; 21+ messages in thread
From: David Sterba @ 2021-04-30 15:28 UTC (permalink / raw)
To: Wang Yugui; +Cc: Josef Bacik, linux-btrfs, kernel-team
On Fri, Apr 30, 2021 at 06:23:08AM +0800, Wang Yugui wrote:
> > > > This is the file 562.out.bad.
Can you please post again the error?
> By the way, We enable '-O no-holes -R free-space-tree' default for
> mkfs.btrfs.
The fstests cases are not robust enough to catch differences in the
output when various features are enabled so it depends on the exact
error.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] Preemptive flushing improvements
2021-04-30 15:28 ` David Sterba
@ 2021-04-30 15:43 ` Wang Yugui
0 siblings, 0 replies; 21+ messages in thread
From: Wang Yugui @ 2021-04-30 15:43 UTC (permalink / raw)
To: dsterba, Wang Yugui, Josef Bacik, linux-btrfs, kernel-team
Hi,
> On Fri, Apr 30, 2021 at 06:23:08AM +0800, Wang Yugui wrote:
> > > > > This is the file 562.out.bad.
>
> Can you please post again the error?
This is the head part of the file of 562.out.bad(whole file is 2.55M).
QA output created by 562
XFS_IOC_CLONE: No space left on device
File foo data after cloning and remount:
0000000 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7 c7
*
25509888 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
25513984 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5
*
25518080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
25522176 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5 e5
<repeat of 00 00.../e5 e5...>
> > By the way, We enable '-O no-holes -R free-space-tree' default for
> > mkfs.btrfs.
>
> The fstests cases are not robust enough to catch differences in the
> output when various features are enabled so it depends on the exact
> error.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/30
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] Preemptive flushing improvements
2021-04-28 17:38 [PATCH 0/7] Preemptive flushing improvements Josef Bacik
` (7 preceding siblings ...)
[not found] ` <20210429094852.DAC3.409509F4@e16-tech.com>
@ 2021-04-30 15:47 ` David Sterba
8 siblings, 0 replies; 21+ messages in thread
From: David Sterba @ 2021-04-30 15:47 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Wed, Apr 28, 2021 at 01:38:41PM -0400, Josef Bacik wrote:
> The following patch series is a set of improvements around our preemptive
> flushing infrastructure. A user reported[1] a problem where the preemptive
> flushing was running non-stop on his full file system. I was not able to
> reproduce this behavior, but noticed a few fundamental issues with how we
> decided to do preemptive flushing. The main relevant thing is that we were not
> taking into account the global reserve when deciding if we needed to
> preemptively flush. In cases such as the users file system where the majority
> of the free metadata space is taken up by the global reserve we could
> potentially always want to flush, which is annoying and not what we want.
>
> Furthermore I noticed issues where we would clamp entirely too quickly, and
> where we made some poor decisions around delalloc. None of these are ground
> breaking or huge improvements, but offer some better performance in some of the
> fsperf test cases. These are the results of my recent run with the whole
> patchset. You'll notice a "regression" in the 16g buffered write; this is
> mostly an artifact of the test: if you look at the results of the nightly
> performance tests you'll see this test varies more than others. Re-testing with
> just that test didn't show the same regression after multiple runs, so it's just
> noise. I could have chosen to run all the tests multiple times to get an
> average over several runs, but that takes a fair bit of time. Individual runs
> of the test showed no regression and often showed an improvement, so I feel
> comfortable calling it noise. The full results are as follows
>
> dbench60 results
> metric baseline current diff
> ==================================================
> qpathinfo 11.53 13.25 14.90%
> throughput 446.23 434.52 -2.62%
> flush 2502.92 1682.43 -32.78%
> qfileinfo 0.92 1.17 27.29%
> ntcreatex 1359.76 519.42 -61.80%
> qfsinfo 1.77 3.76 112.64%
> close 1.90 1.64 -13.91%
> sfileinfo 209.76 76.43 -63.56%
> rename 1110.08 518.40 -53.30%
> find 13.84 13.13 -5.15%
> unlink 1192.89 521.53 -56.28%
> writex 1713.39 1321.39 -22.88%
> deltree 280.34 296.33 5.70%
> readx 3.16 2.91 -8.10%
> mkdir 0.03 0.02 -46.67%
> lockx 0.78 0.20 -73.89%
> unlockx 0.16 0.12 -23.81%
>
> emptyfiles500k results
> metric baseline current diff
> =========================================================
> write_io_kbytes 125000 125000 0.00%
> read_clat_ns_p99 0 0 0.00%
> write_bw_bytes 1.79e+08 1.85e+08 3.04%
> read_iops 0 0 0.00%
> write_clat_ns_p50 17728 17280 -2.53%
> read_io_kbytes 0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99 72704 68096 -6.34%
> read_bw_bytes 0 0 0.00%
> elapsed 1 1 0.00%
> write_lat_ns_min 0 0 0.00%
> sys_cpu 91.23 89.16 -2.27%
> write_lat_ns_max 0 0 0.00%
> read_lat_ns_min 0 0 0.00%
> write_iops 43763.97 45093.80 3.04%
> read_lat_ns_max 0 0 0.00%
> read_clat_ns_p50 0 0 0.00%
>
> smallfiles100k results
> metric baseline current diff
> =========================================================
> write_io_kbytes 2.04e+08 2.04e+08 0.00%
> read_clat_ns_p99 0 0 0.00%
> write_bw_bytes 1.40e+08 1.40e+08 0.50%
> read_iops 0 0 0.00%
> write_clat_ns_p50 6712 6944 3.46%
> read_io_kbytes 0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99 16000 16512 3.20%
> read_bw_bytes 0 0 0.00%
> elapsed 1498.88 1491 -0.53%
> write_lat_ns_min 2858.38 2919 2.12%
> sys_cpu 6.22 6.51 4.61%
> write_lat_ns_max 1.31e+08 1.27e+08 -2.77%
> read_lat_ns_min 0 0 0.00%
> write_iops 34081.44 34253.51 0.50%
> read_lat_ns_max 0 0 0.00%
> read_clat_ns_p50 0 0 0.00%
>
> dio4kbs16threads results
> metric baseline current diff
> =============================================================
> write_io_kbytes 4360879 5312908 21.83%
> read_clat_ns_p99 0 0 0.00%
> write_bw_bytes 74302497.38 90667585 22.02%
> read_iops 0 0 0.00%
> write_clat_ns_p50 243968 238592 -2.20%
> read_io_kbytes 0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99 21094400 15007744 -28.85%
> read_bw_bytes 0 0 0.00%
> elapsed 61 61 0.00%
> write_lat_ns_min 38183.25 37949 -0.61%
> sys_cpu 4.03 4.72 17.11%
> write_lat_ns_max 1.68e+09 8.46e+08 -49.55%
> read_lat_ns_min 0 0 0.00%
> write_iops 18140.26 22135.64 22.02%
> read_lat_ns_max 0 0 0.00%
> read_clat_ns_p50 0 0 0.00%
>
> randwrite2xram results
> metric baseline current diff
> =============================================================
> write_io_kbytes 27720434 36563300 31.90%
> read_clat_ns_p99 0 0 0.00%
> write_bw_bytes 93268100.75 1.16e+08 24.83%
> read_iops 0 0 0.00%
> write_clat_ns_p50 13168 16512 25.39%
> read_io_kbytes 0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99 39360 125440 218.70%
> read_bw_bytes 0 0 0.00%
> elapsed 334.25 333 -0.37%
> write_lat_ns_min 5436.75 5682 4.51%
> sys_cpu 8.22 12.57 52.96%
> write_lat_ns_max 4.05e+10 2.73e+10 -32.65%
> read_lat_ns_min 0 0 0.00%
> write_iops 22770.53 28425.17 24.83%
> read_lat_ns_max 0 0 0.00%
> read_clat_ns_p50 0 0 0.00%
>
> untarfirefox results
> metric baseline current diff
> ==============================================
> elapsed 47.23 46.82 -0.87%
>
> I'm still waiting on feedback from the user to make sure the patches fix the
> reported problem, but they're worthy on their own if they do not resolve the
> original reported issue. Thanks,
>
> Josef
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=212185
>
> Josef Bacik (7):
> btrfs: check worker before need_preemptive_reclaim
> btrfs: only clamp the first time we have to start flushing
> btrfs: take into account global rsv in need_preemptive_reclaim
> btrfs: use the global rsv size in the preemptive thresh calculation
> btrfs: don't include the global rsv size in the preemptive used amount
> btrfs: only ignore delalloc if delalloc is much smaller than ordered
> btrfs: handle preemptive delalloc flushing slightly differently
It would be good to summarize the noticeable improvements over all the
tests and put it at least into the last patch assuming that's where it
was produced.
The tests passed for me here so I'll add the patches to misc-next for
more coverage.
^ permalink raw reply [flat|nested] 21+ messages in thread