All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/8] ENOSPC delalloc flushing fixes
@ 2021-06-29 13:59 Josef Bacik
  2021-06-29 13:59 ` [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets Josef Bacik
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

v1->v2:
- Two extra patches to remove the last user of sync_inode() (p9fs) and then
  remove sync_inode() itself, as per hch's request

--- Original email ---

Hello,

I've been debugging and fixing a problem we hit in production related to getting
ENOSPC when we still should have had space to use.  A variety of different
things were going wrong, but one of them was sometimes we wouldn't wait for all
of delalloc to be flushed.  This series of patches fixes a few problems in this
area, starting with

  btrfs: handle shrink_delalloc pages calculation differently

When we switched to writing pages instead of full inodes for flushing we didn't
adjust the counters to give us pages, instead using the "items" amount.  This is
incorrect as we'd just not flush that much delalloc, leaving a lot laying around
when we finally ENOSPC'd.

The next bit are related to compression, as we have compression on everywhere in
production.

  btrfs: wait on async extents when flushing delalloc
  btrfs: wake up async_delalloc_pages waiters after submit

I ripped this code out because I added a second sync_inode() if we had async
extents in order to do the proper waiting.  However sync_inode() could skip
writeout if writeout had begun already on the inode, so we still need this
waiting in order to make sure we don't try to wait on ordered extents until all
ordered extents have been created.

And finally these two patches

  fs: add a filemap_fdatawrite_wbc helper
  btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking

We need a writeback helper that will take a wbc and not try to do anything fancy
other than write out the inode we want.  sync_inode() has the drawback that it
will skip writeout if the inode is currently under writeback, and thus we won't
wait properly.

I ran this series through fsperf, the results ar eposted below.  I'm still
printing the %diff, but I'm also printing the stdev so you can see the variance
in teh values we expect.  Generally there is no change or it's within the normal
range.  These patches really only affect anything when we're very full on space.

I also ran this through the enospc stress test and saw no early enospc.  Thanks,

Josef

dbench60 results
  metric     baseline   current     stdev           diff
==============================================================
qpathinfo       11.26      11.87       0.64     5.35%
throughput     652.90     628.99      65.68    -3.66%
flush        22292.41   35005.49   15197.94    57.03%
qfileinfo        1.04       1.24       0.17    19.02%
ntcreatex     3965.31    6167.38    7790.74    55.53%
qfsinfo          1.79       1.41       0.40   -20.88%
close            1.81       1.95       0.39     7.67%
sfileinfo        4.96       5.50       1.13    10.70%
rename        2640.98    5844.86    5873.86   121.31%
find            12.29      12.83       1.00     4.36%
unlink        3310.42    4809.88    7179.77    45.30%
writex       12521.22   37992.79    7567.72   203.43%
deltree        409.86     363.46     224.63   -11.32%
readx            2.37       4.18       0.66    76.01%
mkdir            0.03       0.11       0.01   271.06%
lockx            0.44       0.17       0.27   -61.83%
unlockx          0.16       0.64       0.11   291.70%

emptyfiles500k results
     metric         baseline   current      stdev           diff
======================================================================
write_io_kbytes       125000     125000            0    0.00%
read_clat_ns_p99           0          0            0    0.00%
write_bw_bytes      1.83e+08   1.77e+08   5516543.33   -3.19%
read_iops                  0          0            0    0.00%
write_clat_ns_p50      17536      17792       273.68    1.46%
read_io_kbytes             0          0            0    0.00%
read_io_bytes              0          0            0    0.00%
write_clat_ns_p99      72576      74240      2677.97    2.29%
read_bw_bytes              0          0            0    0.00%
elapsed                    1          1            0    0.00%
write_lat_ns_min           0          0            0    0.00%
sys_cpu                91.86      91.29         0.66   -0.63%
write_lat_ns_max           0          0            0    0.00%
read_lat_ns_min            0          0            0    0.00%
write_iops          44583.10   43162.98      1346.81   -3.19%
read_lat_ns_max            0          0            0    0.00%
read_clat_ns_p50           0          0            0    0.00%

smallfiles100k results
     metric         baseline   current       stdev            diff
========================================================================
write_io_kbytes     2.04e+08   2.04e+08             0     0.00%
read_clat_ns_p99           0          0             0     0.00%
write_bw_bytes      1.33e+08   1.40e+08   15033768.39     5.20%
read_iops                  0          0             0     0.00%
write_clat_ns_p50       6424       6688         79.77     4.11%
read_io_kbytes             0          0             0     0.00%
read_io_bytes              0          0             0     0.00%
write_clat_ns_p99      15960      16320        251.24     2.26%
read_bw_bytes              0          0             0     0.00%
elapsed              1592.50       1492        234.98    -6.31%
write_lat_ns_min     2730.75       2768         45.73     1.36%
sys_cpu                 5.75       6.25          0.65     8.68%
write_lat_ns_max    5.17e+08   1.52e+08      1.05e+09   -70.54%
read_lat_ns_min            0          0             0     0.00%
write_iops          32545.54   34239.34       3670.35     5.20%
read_lat_ns_max            0          0             0     0.00%
read_clat_ns_p50           0          0             0     0.00%

bufferedrandwrite16g results
     metric          baseline     current       stdev            diff
===========================================================================
write_io_kbytes        16777216   16777216             0     0.00%
read_clat_ns_p99              0          0             0     0.00%
write_bw_bytes      91729211.62   94071836   13859731.84     2.55%
read_iops                     0          0             0     0.00%
write_clat_ns_p50         12704      11840       1227.26    -6.80%
read_io_kbytes                0          0             0     0.00%
read_io_bytes                 0          0             0     0.00%
write_clat_ns_p99         31136      32384       2925.86     4.01%
read_bw_bytes                 0          0             0     0.00%
elapsed                  191.38        183         25.98    -4.38%
write_lat_ns_min        3961.25       4068         95.29     2.69%
sys_cpu                   31.83      29.77          6.85    -6.47%
write_lat_ns_max       3.05e+10   1.70e+10      1.95e+10   -44.34%
read_lat_ns_min               0          0             0     0.00%
write_iops             22394.83   22966.76       3383.72     2.55%
read_lat_ns_max               0          0             0     0.00%
read_clat_ns_p50              0          0             0     0.00%

dio4kbs16threads results
     metric          baseline     current       stdev            diff
===========================================================================
write_io_kbytes         4187092    4949612     599817.04    18.21%
read_clat_ns_p99              0          0             0     0.00%
write_bw_bytes      71448568.62   84467746   10233293.68    18.22%
read_iops                     0          0             0     0.00%
write_clat_ns_p50        245504     240640       4370.25    -1.98%
read_io_kbytes                0          0             0     0.00%
read_io_bytes                 0          0             0     0.00%
write_clat_ns_p99      22249472   20054016    1120427.41    -9.87%
read_bw_bytes                 0          0             0     0.00%
elapsed                      61         61             0     0.00%
write_lat_ns_min          38440      38571        225.10     0.34%
sys_cpu                    3.89       4.57          0.46    17.47%
write_lat_ns_max       1.23e+09   9.48e+08      7.00e+08   -23.22%
read_lat_ns_min               0          0             0     0.00%
write_iops             17443.50   20622.01       2498.36    18.22%
read_lat_ns_max               0          0             0     0.00%
read_clat_ns_p50              0          0             0     0.00%

randwrite2xram results
     metric         baseline   current       stdev            diff
========================================================================
write_io_kbytes     33948247   34359528    4793805.63     1.21%
read_clat_ns_p99           0          0             0     0.00%
write_bw_bytes      1.15e+08   1.17e+08   17455355.14     1.47%
read_iops                  0          0             0     0.00%
write_clat_ns_p50      15232      14400       1043.81    -5.46%
read_io_kbytes             0          0             0     0.00%
read_io_bytes              0          0             0     0.00%
write_clat_ns_p99      95264      67072      25379.05   -29.59%
read_bw_bytes              0          0             0     0.00%
elapsed               313.62        314          5.10     0.12%
write_lat_ns_min     5399.38       5658        126.19     4.79%
sys_cpu                11.61      11.06          2.04    -4.69%
write_lat_ns_max    3.16e+10   2.91e+10      1.46e+10    -8.05%
read_lat_ns_min            0          0             0     0.00%
write_iops          28099.86   28511.67       4261.56     1.47%
read_lat_ns_max            0          0             0     0.00%
read_clat_ns_p50           0          0             0     0.00%

untarfirefox results
metric    baseline   current   stdev        diff
======================================================
elapsed      46.89     46.75    0.18   -0.29%

Josef Bacik (8):
  btrfs: enable a tracepoint when we fail tickets
  btrfs: handle shrink_delalloc pages calculation differently
  btrfs: wait on async extents when flushing delalloc
  btrfs: wake up async_delalloc_pages waiters after submit
  fs: add a filemap_fdatawrite_wbc helper
  btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
  9p: migrate from sync_inode to filemap_fdatawrite_wbc
  fs: kill sync_inode

 fs/9p/vfs_file.c             |  7 +---
 fs/btrfs/ctree.h             |  9 +++--
 fs/btrfs/inode.c             | 16 +++-----
 fs/btrfs/space-info.c        | 77 +++++++++++++++++++++++++++++++-----
 fs/fs-writeback.c            | 19 +--------
 include/linux/fs.h           |  3 +-
 include/trace/events/btrfs.h |  7 ++++
 mm/filemap.c                 | 35 +++++++++++-----
 8 files changed, 116 insertions(+), 57 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 10:51   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently Josef Bacik
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes.  This adds the tracpoint
so you can easily get that information.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c        | 2 ++
 include/trace/events/btrfs.h | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 392997376a1c..af161eb808a2 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -825,6 +825,8 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
 	struct reserve_ticket *ticket;
 	u64 tickets_id = space_info->tickets_id;
 
+	trace_btrfs_fail_all_tickets(fs_info, space_info);
+
 	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
 		btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
 		__btrfs_dump_space_info(fs_info, space_info);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index c7237317a8b9..3d81ba8c37b9 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2098,6 +2098,12 @@ DEFINE_EVENT(btrfs_dump_space_info, btrfs_done_preemptive_reclaim,
 	TP_ARGS(fs_info, sinfo)
 );
 
+DEFINE_EVENT(btrfs_dump_space_info, btrfs_fail_all_tickets,
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_space_info *sinfo),
+	TP_ARGS(fs_info, sinfo)
+);
+
 TRACE_EVENT(btrfs_reserve_ticket,
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
 		 u64 start_ns, int flush, int error),
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
  2021-06-29 13:59 ` [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 10:50   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc Josef Bacik
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel; +Cc: stable

We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be.  With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations.  Now assume we have 128MiB of delalloc
outstanding.  With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways.  First, change the calculations to be a fraction of
the total delalloc bytes on the system.  Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once.  This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again.  This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h             |  9 +++++----
 fs/btrfs/space-info.c        | 35 ++++++++++++++++++++++++++---------
 include/trace/events/btrfs.h |  1 +
 3 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d7ef4d7d2c1a..232ff1a49ca6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2783,10 +2783,11 @@ enum btrfs_flush_state {
 	FLUSH_DELAYED_REFS	=	4,
 	FLUSH_DELALLOC		=	5,
 	FLUSH_DELALLOC_WAIT	=	6,
-	ALLOC_CHUNK		=	7,
-	ALLOC_CHUNK_FORCE	=	8,
-	RUN_DELAYED_IPUTS	=	9,
-	COMMIT_TRANS		=	10,
+	FLUSH_DELALLOC_FULL	=	7,
+	ALLOC_CHUNK		=	8,
+	ALLOC_CHUNK_FORCE	=	9,
+	RUN_DELAYED_IPUTS	=	10,
+	COMMIT_TRANS		=	11,
 };
 
 int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index af161eb808a2..0c539a94c6d9 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -494,6 +494,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 	long time_left;
 	int loops;
 
+	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
+	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
+
 	/* Calc the number of the pages we need flush for space reservation */
 	if (to_reclaim == U64_MAX) {
 		items = U64_MAX;
@@ -501,19 +504,21 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 		/*
 		 * to_reclaim is set to however much metadata we need to
 		 * reclaim, but reclaiming that much data doesn't really track
-		 * exactly, so increase the amount to reclaim by 2x in order to
-		 * make sure we're flushing enough delalloc to hopefully reclaim
-		 * some metadata reservations.
+		 * exactly.  What we really want to do is reclaim full inode's
+		 * worth of reservations, however that's not available to us
+		 * here.  We will take a fraction of the delalloc bytes for our
+		 * flushing loops and hope for the best.  Delalloc will expand
+		 * the amount we write to cover an entire dirty extent, which
+		 * will reclaim the metadata reservation for that range.  If
+		 * it's not enough subsequent flush stages will be more
+		 * aggressive.
 		 */
+		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
 		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
-		to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 	}
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 
-	delalloc_bytes = percpu_counter_sum_positive(
-						&fs_info->delalloc_bytes);
-	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
 	if (delalloc_bytes == 0 && ordered_bytes == 0)
 		return;
 
@@ -596,8 +601,11 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		break;
 	case FLUSH_DELALLOC:
 	case FLUSH_DELALLOC_WAIT:
+	case FLUSH_DELALLOC_FULL:
+		if (state == FLUSH_DELALLOC_FULL)
+			num_bytes = U64_MAX;
 		shrink_delalloc(fs_info, space_info, num_bytes,
-				state == FLUSH_DELALLOC_WAIT, for_preempt);
+				state != FLUSH_DELALLOC, for_preempt);
 		break;
 	case FLUSH_DELAYED_REFS_NR:
 	case FLUSH_DELAYED_REFS:
@@ -907,6 +915,14 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 				commit_cycles--;
 		}
 
+		/*
+		 * We do not want to empty the system of delalloc unless we're
+		 * under heavy pressure, so allow one trip through the flushing
+		 * logic before we start doing a FLUSH_DELALLOC_FULL.
+		 */
+		if (flush_state == FLUSH_DELALLOC_FULL && !commit_cycles)
+			flush_state++;
+
 		/*
 		 * We don't want to force a chunk allocation until we've tried
 		 * pretty hard to reclaim space.  Think of the case where we
@@ -1070,7 +1086,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
  *   so if we now have space to allocate do the force chunk allocation.
  */
 static const enum btrfs_flush_state data_flush_states[] = {
-	FLUSH_DELALLOC_WAIT,
+	FLUSH_DELALLOC_FULL,
 	RUN_DELAYED_IPUTS,
 	COMMIT_TRANS,
 	ALLOC_CHUNK_FORCE,
@@ -1159,6 +1175,7 @@ static const enum btrfs_flush_state evict_flush_states[] = {
 	FLUSH_DELAYED_REFS,
 	FLUSH_DELALLOC,
 	FLUSH_DELALLOC_WAIT,
+	FLUSH_DELALLOC_FULL,
 	ALLOC_CHUNK,
 	COMMIT_TRANS,
 };
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 3d81ba8c37b9..ddf5c250726c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -94,6 +94,7 @@ struct btrfs_space_info;
 	EM( FLUSH_DELAYED_ITEMS,	"FLUSH_DELAYED_ITEMS")		\
 	EM( FLUSH_DELALLOC,		"FLUSH_DELALLOC")		\
 	EM( FLUSH_DELALLOC_WAIT,	"FLUSH_DELALLOC_WAIT")		\
+	EM( FLUSH_DELALLOC_FULL,	"FLUSH_DELALLOC_FULL")		\
 	EM( FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR")	\
 	EM( FLUSH_DELAYED_REFS,		"FLUSH_ELAYED_REFS")		\
 	EM( ALLOC_CHUNK,		"ALLOC_CHUNK")			\
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
  2021-06-29 13:59 ` [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets Josef Bacik
  2021-06-29 13:59 ` [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 11:09   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode in
38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c      |  4 ----
 fs/btrfs/space-info.c | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6eb20987351..b1f02e3fea5d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9714,10 +9714,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 					 &work->work);
 		} else {
 			ret = sync_inode(inode, wbc);
-			if (!ret &&
-			    test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-				     &BTRFS_I(inode)->runtime_flags))
-				ret = sync_inode(inode, wbc);
 			btrfs_add_delayed_iput(inode);
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0c539a94c6d9..f140a89a3cdd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -534,9 +534,49 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
 		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
 		long nr_pages = min_t(u64, temp, LONG_MAX);
+		int async_pages;
 
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
+		/*
+		 * We need to make sure any outstanding async pages are now
+		 * processed before we continue.  This is because things like
+		 * sync_inode() try to be smart and skip writing if the inode is
+		 * marked clean.  We don't use filemap_fwrite for flushing
+		 * because we want to control how many pages we write out at a
+		 * time, thus this is the only safe way to make sure we've
+		 * waited for outstanding compressed workers to have started
+		 * their jobs and thus have ordered extents set up properly.
+		 *
+		 * This exists because we do not want to wait for each
+		 * individual inode to finish its async work, we simply want to
+		 * start the IO on everybody, and then come back here and wait
+		 * for all of the async work to catch up.  Once we're done with
+		 * that we know we'll have ordered extents for everything and we
+		 * can decide if we wait for that or not.
+		 *
+		 * If we choose to replace this in the future, make absolutely
+		 * sure that the proper waiting is being done in the async case,
+		 * as there have been bugs in that area before.
+		 */
+		async_pages = atomic_read(&fs_info->async_delalloc_pages);
+		if (!async_pages)
+			goto skip_async;
+
+		/*
+		 * We don't want to wait forever, if we wrote less pages in this
+		 * loop than we have outstanding, only wait for that number of
+		 * pages, otherwise we can wait for all async pages to finish
+		 * before continuing.
+		 */
+		if (async_pages > nr_pages)
+			async_pages -= nr_pages;
+		else
+			async_pages = 0;
+		wait_event(fs_info->async_submit_wait,
+			   atomic_read(&fs_info->async_delalloc_pages) <=
+			   async_pages);
+skip_async:
 		loops++;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2021-06-29 13:59 ` [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 11:04   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel; +Cc: stable

We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing.  The reason for this is we need to see any ordered extents
that were created by our delalloc flushing.  However we're waking up
before we do the submit work, which is before we create the ordered
extents.  This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely.  Fix this by waking us up after we've done the work to
create ordered extents.

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1f02e3fea5d..e388153c4ae4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1290,11 +1290,6 @@ static noinline void async_cow_submit(struct btrfs_work *work)
 	nr_pages = (async_chunk->end - async_chunk->start + PAGE_SIZE) >>
 		PAGE_SHIFT;
 
-	/* atomic_sub_return implies a barrier */
-	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
-	    5 * SZ_1M)
-		cond_wake_up_nomb(&fs_info->async_submit_wait);
-
 	/*
 	 * ->inode could be NULL if async_chunk_start has failed to compress,
 	 * in which case we don't have anything to submit, yet we need to
@@ -1303,6 +1298,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
 	 */
 	if (async_chunk->inode)
 		submit_compressed_extents(async_chunk);
+
+	/* atomic_sub_return implies a barrier */
+	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
+	    5 * SZ_1M)
+		cond_wake_up_nomb(&fs_info->async_submit_wait);
 }
 
 static noinline void async_cow_free(struct btrfs_work *work)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
                   ` (3 preceding siblings ...)
  2021-06-29 13:59 ` [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 11:03   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
order to reclaim metadata reservations.  Unfortunately most helpers in
this area are too smart for us

1) The normal filemap_fdata* helpers only take range and sync modes, and
   don't give any indication of how much was written, so we can only
   flush full inodes, which isn't what we want in most cases.
2) The normal writeback path requires us to have the s_umount sem held,
   but we can't unconditionally take it in this path because we could
   deadlock.
3) The normal writeback path also skips inodes with I_SYNC set if we
   write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
   ENOSPC pressure, we want to actually make sure the pages are under
   writeback before returning, and if another thread is in the middle of
   writing the file we may return before they're under writeback and
   miss our ordered extents and not properly wait for completion.
4) sync_inode() uses the normal writeback path and has the same problem
   as #3.

What we really want is to call do_writepages() with our wbc.  This way
we can make sure that writeback is actually started on the pages, and we
can control how many pages are written as a whole as we write many
inodes using the same wbc.  Accomplish this with a new helper that does
just that so we can use it for our ENOSPC flushing infrastructure.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 35 ++++++++++++++++++++++++++---------
 2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..aace07f88b73 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2886,6 +2886,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 extern int filemap_check_errors(struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
+extern int filemap_fdatawrite_wbc(struct address_space *mapping,
+				  struct writeback_control *wbc);
 
 static inline int filemap_write_and_wait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 66f7e9fdfbc4..8395eafc178b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,31 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
 		return -ENOSPC;
 	return 0;
 }
+/**
+ * filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
+ * @mapping:	address space structure to write
+ * @wbc:	the writeback_control controlling the writeout
+ *
+ * Call writepages on the mapping using the provided wbc to control the
+ * writeout.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_wbc(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	int ret;
+
+	if (!mapping_can_writeback(mapping) ||
+	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+		return 0;
+
+	wbc_attach_fdatawrite_inode(wbc, mapping->host);
+	ret = do_writepages(mapping, wbc);
+	wbc_detach_inode(wbc);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_fdatawrite_wbc);
 
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
@@ -397,7 +422,6 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
 {
-	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
 		.nr_to_write = LONG_MAX,
@@ -405,14 +429,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 		.range_end = end,
 	};
 
-	if (!mapping_can_writeback(mapping) ||
-	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
-		return 0;
-
-	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
-	ret = do_writepages(mapping, &wbc);
-	wbc_detach_inode(&wbc);
-	return ret;
+	return filemap_fdatawrite_wbc(mapping, &wbc);
 }
 
 static inline int __filemap_fdatawrite(struct address_space *mapping,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
                   ` (4 preceding siblings ...)
  2021-06-29 13:59 ` [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 11:03   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc Josef Bacik
  2021-06-29 13:59 ` [PATCH v2 8/8] fs: kill sync_inode Josef Bacik
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure.  If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether.  What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e388153c4ae4..b25c84aba743 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9713,7 +9713,7 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 			btrfs_queue_work(root->fs_info->flush_workers,
 					 &work->work);
 		} else {
-			ret = sync_inode(inode, wbc);
+			ret = filemap_fdatawrite_wbc(inode->i_mapping, wbc);
 			btrfs_add_delayed_iput(inode);
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
                   ` (5 preceding siblings ...)
  2021-06-29 13:59 ` [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  2021-07-07 11:00   ` Nikolay Borisov
  2021-06-29 13:59 ` [PATCH v2 8/8] fs: kill sync_inode Josef Bacik
  7 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/9p/vfs_file.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 59c32c9b799f..6b64e8391f30 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -625,12 +625,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
 	p9_debug(P9_DEBUG_VFS, "9p VMA close, %p, flushing", vma);
 
 	inode = file_inode(vma->vm_file);
-
-	if (!mapping_can_writeback(inode->i_mapping))
-		wbc.nr_to_write = 0;
-
-	might_sleep();
-	sync_inode(inode, &wbc);
+	filemap_fdatawrite_wbc(inode->i_mapping, &wbc);
 }
 
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 8/8] fs: kill sync_inode
  2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
                   ` (6 preceding siblings ...)
  2021-06-29 13:59 ` [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc Josef Bacik
@ 2021-06-29 13:59 ` Josef Bacik
  7 siblings, 0 replies; 16+ messages in thread
From: Josef Bacik @ 2021-06-29 13:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fsdevel

Now that all users of sync_inode() have been deleted, remove
sync_inode().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/fs-writeback.c  | 19 +------------------
 include/linux/fs.h |  1 -
 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..706dad22f735 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2608,23 +2608,6 @@ int write_inode_now(struct inode *inode, int sync)
 }
 EXPORT_SYMBOL(write_inode_now);
 
-/**
- * sync_inode - write an inode and its pages to disk.
- * @inode: the inode to sync
- * @wbc: controls the writeback mode
- *
- * sync_inode() will write an inode and its pages to disk.  It will also
- * correctly update the inode on its superblock's dirty inode lists and will
- * update inode->i_state.
- *
- * The caller must have a ref on the inode.
- */
-int sync_inode(struct inode *inode, struct writeback_control *wbc)
-{
-	return writeback_single_inode(inode, wbc);
-}
-EXPORT_SYMBOL(sync_inode);
-
 /**
  * sync_inode_metadata - write an inode to disk
  * @inode: the inode to sync
@@ -2641,6 +2624,6 @@ int sync_inode_metadata(struct inode *inode, int wait)
 		.nr_to_write = 0, /* metadata-only */
 	};
 
-	return sync_inode(inode, &wbc);
+	return writeback_single_inode(inode, &wbc);
 }
 EXPORT_SYMBOL(sync_inode_metadata);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aace07f88b73..7c33e5414747 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2458,7 +2458,6 @@ static inline void file_accessed(struct file *file)
 
 extern int file_modified(struct file *file);
 
-int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int wait);
 
 struct file_system_type {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently
  2021-06-29 13:59 ` [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently Josef Bacik
@ 2021-07-07 10:50   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 10:50 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel; +Cc: stable



On 29.06.21 г. 16:59, Josef Bacik wrote:
> We have been hitting some early ENOSPC issues in production with more
> recent kernels, and I tracked it down to us simply not flushing delalloc
> as aggressively as we should be.  With tracing I was seeing us failing
> all tickets with all of the block rsvs at or around 0, with very little
> pinned space, but still around 120MiB of outstanding bytes_may_used.
> Upon further investigation I saw that we were flushing around 14 pages
> per shrink call for delalloc, despite having around 2GiB of delalloc
> outstanding.
> 
> Consider the example of a 8 way machine, all CPUs trying to create a
> file in parallel, which at the time of this commit requires 5 items to
> do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
> size waiting on reservations.  Now assume we have 128MiB of delalloc
> outstanding.  With our current math we would set items to 20, and then
> set to_reclaim to 20 * 256k, or 5MiB.
> 
> Assuming that we went through this loop all 3 times, for both
> FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
> twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
> leave a fair bit of delalloc reservations still hanging around by the
> time we go to ENOSPC out all the remaining tickets.
> 
> Fix this two ways.  First, change the calculations to be a fraction of
> the total delalloc bytes on the system.  Prior to this change we were
> calculating based on dirty inodes so our math made more sense, now it's
> just completely unrelated to what we're actually doing.
> 
> Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
> gone through the flush states at least once.  This will empty the system
> of all delalloc so we're sure to be truly out of space when we start
> failing tickets.
> 
> I'm tagging stable 5.10 and forward, because this is where we started
> using the page stuff heavily again.  This affects earlier kernel
> versions as well, but would be a pain to backport to them as the
> flushing mechanisms aren't the same.
> 
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h             |  9 +++++----
>  fs/btrfs/space-info.c        | 35 ++++++++++++++++++++++++++---------
>  include/trace/events/btrfs.h |  1 +
>  3 files changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index d7ef4d7d2c1a..232ff1a49ca6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2783,10 +2783,11 @@ enum btrfs_flush_state {
>  	FLUSH_DELAYED_REFS	=	4,
>  	FLUSH_DELALLOC		=	5,
>  	FLUSH_DELALLOC_WAIT	=	6,
> -	ALLOC_CHUNK		=	7,
> -	ALLOC_CHUNK_FORCE	=	8,
> -	RUN_DELAYED_IPUTS	=	9,
> -	COMMIT_TRANS		=	10,
> +	FLUSH_DELALLOC_FULL	=	7,
> +	ALLOC_CHUNK		=	8,
> +	ALLOC_CHUNK_FORCE	=	9,
> +	RUN_DELAYED_IPUTS	=	10,
> +	COMMIT_TRANS		=	11,
>  };
>  
>  int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index af161eb808a2..0c539a94c6d9 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -494,6 +494,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
>  	long time_left;
>  	int loops;
>  
> +	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
> +	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> +
>  	/* Calc the number of the pages we need flush for space reservation */
>  	if (to_reclaim == U64_MAX) {
>  		items = U64_MAX;
> @@ -501,19 +504,21 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
>  		/*
>  		 * to_reclaim is set to however much metadata we need to
>  		 * reclaim, but reclaiming that much data doesn't really track
> -		 * exactly, so increase the amount to reclaim by 2x in order to
> -		 * make sure we're flushing enough delalloc to hopefully reclaim
> -		 * some metadata reservations.
> +		 * exactly.  What we really want to do is reclaim full inode's
> +		 * worth of reservations, however that's not available to us
> +		 * here.  We will take a fraction of the delalloc bytes for our
> +		 * flushing loops and hope for the best.  Delalloc will expand
> +		 * the amount we write to cover an entire dirty extent, which
> +		 * will reclaim the metadata reservation for that range.  If
> +		 * it's not enough subsequent flush stages will be more
> +		 * aggressive.
>  		 */
> +		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
>  		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> -		to_reclaim = items * EXTENT_SIZE_PER_ITEM;
>  	}
>  
>  	trans = (struct btrfs_trans_handle *)current->journal_info;
>  
> -	delalloc_bytes = percpu_counter_sum_positive(
> -						&fs_info->delalloc_bytes);
> -	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
>  	if (delalloc_bytes == 0 && ordered_bytes == 0)
>  		return;

nit: That check should also be moved alongside delalloc/ordered _bytes
read out.

<snip>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets
  2021-06-29 13:59 ` [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets Josef Bacik
@ 2021-07-07 10:51   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 10:51 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel



On 29.06.21 г. 16:59, Josef Bacik wrote:
> When debugging early enospc problems it was useful to have a tracepoint
> where we failed all tickets so I could check the state of the enospc
> counters at failure time to validate my fixes.  This adds the tracpoint
> so you can easily get that information.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c        | 2 ++
>  include/trace/events/btrfs.h | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 392997376a1c..af161eb808a2 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -825,6 +825,8 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
>  	struct reserve_ticket *ticket;
>  	u64 tickets_id = space_info->tickets_id;
>  
> +	trace_btrfs_fail_all_tickets(fs_info, space_info);
> +
>  	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
>  		btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
>  		__btrfs_dump_space_info(fs_info, space_info);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index c7237317a8b9..3d81ba8c37b9 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2098,6 +2098,12 @@ DEFINE_EVENT(btrfs_dump_space_info, btrfs_done_preemptive_reclaim,
>  	TP_ARGS(fs_info, sinfo)
>  );
>  
> +DEFINE_EVENT(btrfs_dump_space_info, btrfs_fail_all_tickets,
> +	TP_PROTO(const struct btrfs_fs_info *fs_info,
> +		 const struct btrfs_space_info *sinfo),
> +	TP_ARGS(fs_info, sinfo)
> +);

General suggestion for the dump_space_info - it would be good if
ordered_bytes and delalloc_bytes is also dumped in the tracepoint.

> +
>  TRACE_EVENT(btrfs_reserve_ticket,
>  	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
>  		 u64 start_ns, int flush, int error),
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc
  2021-06-29 13:59 ` [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc Josef Bacik
@ 2021-07-07 11:00   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 11:00 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel



On 29.06.21 г. 16:59, Josef Bacik wrote:
> We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
> instead.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/9p/vfs_file.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 59c32c9b799f..6b64e8391f30 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -625,12 +625,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
>  	p9_debug(P9_DEBUG_VFS, "9p VMA close, %p, flushing", vma);
>  
>  	inode = file_inode(vma->vm_file);
> -
> -	if (!mapping_can_writeback(inode->i_mapping))
> -		wbc.nr_to_write = 0;
> -
> -	might_sleep();

Not a 9p expert but we are losing the might_sleep check and
do_writepages can sleep due to cond_resched or congestion_wait

> -	sync_inode(inode, &wbc);
> +	filemap_fdatawrite_wbc(inode->i_mapping, &wbc);
>  }
>  
>  
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
  2021-06-29 13:59 ` [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
@ 2021-07-07 11:03   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 11:03 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel



On 29.06.21 г. 16:59, Josef Bacik wrote:
> sync_inode() has some holes that can cause problems if we're under heavy
> ENOSPC pressure.  If there's writeback running on a separate thread
> sync_inode() will skip writing the inode altogether.  What we really
> want is to make sure writeback has been started on all the pages to make
> sure we can see the ordered extents and wait on them if appropriate.
> Switch to this new helper which will allow us to accomplish this and
> avoid ENOSPC'ing early.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper
  2021-06-29 13:59 ` [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
@ 2021-07-07 11:03   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 11:03 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel



On 29.06.21 г. 16:59, Josef Bacik wrote:
> Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
> order to reclaim metadata reservations.  Unfortunately most helpers in
> this area are too smart for us
> 
> 1) The normal filemap_fdata* helpers only take range and sync modes, and
>    don't give any indication of how much was written, so we can only
>    flush full inodes, which isn't what we want in most cases.
> 2) The normal writeback path requires us to have the s_umount sem held,
>    but we can't unconditionally take it in this path because we could
>    deadlock.
> 3) The normal writeback path also skips inodes with I_SYNC set if we
>    write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
>    ENOSPC pressure, we want to actually make sure the pages are under
>    writeback before returning, and if another thread is in the middle of
>    writing the file we may return before they're under writeback and
>    miss our ordered extents and not properly wait for completion.
> 4) sync_inode() uses the normal writeback path and has the same problem
>    as #3.
> 
> What we really want is to call do_writepages() with our wbc.  This way
> we can make sure that writeback is actually started on the pages, and we
> can control how many pages are written as a whole as we write many
> inodes using the same wbc.  Accomplish this with a new helper that does
> just that so we can use it for our ENOSPC flushing infrastructure.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit
  2021-06-29 13:59 ` [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
@ 2021-07-07 11:04   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 11:04 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel; +Cc: stable



On 29.06.21 г. 16:59, Josef Bacik wrote:
> We use the async_delalloc_pages mechanism to make sure that we've
> completed our async work before trying to continue our delalloc
> flushing.  The reason for this is we need to see any ordered extents
> that were created by our delalloc flushing.  However we're waking up
> before we do the submit work, which is before we create the ordered
> extents.  This is a pretty wide race window where we could potentially
> think there are no ordered extents and thus exit shrink_delalloc
> prematurely.  Fix this by waking us up after we've done the work to
> create ordered extents.
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

This is an independent change from the rest of the series and it can go
before the rest.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc
  2021-06-29 13:59 ` [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc Josef Bacik
@ 2021-07-07 11:09   ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2021-07-07 11:09 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team, linux-fsdevel



On 29.06.21 г. 16:59, Josef Bacik wrote:
> I've been debugging an early ENOSPC problem in production and finally
> root caused it to this problem.  When we switched to the per-inode in
> 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
> shrink_delalloc") I pulled out the async extent handling, because we
> were doing the correct thing by calling filemap_flush() if we had async
> extents set.  This would properly wait on any async extents by locking
> the page in the second flush, thus making sure our ordered extents were
> properly set up.
> 
> However when I switched us back to page based flushing, I used
> sync_inode(), which allows us to pass in our own wbc.  The problem here
> is that sync_inode() is smarter than the filemap_* helpers, it tries to
> avoid calling writepages at all.  This means that our second call could
> skip calling do_writepages altogether, and thus not wait on the pagelock
> for the async helpers.  This means we could come back before any ordered
> extents were created and then simply continue on in our flushing
> mechanisms and ENOSPC out when we have plenty of space to use.
> 
> Fix this by putting back the async pages logic in shrink_delalloc.  This
> allows us to bulk write out everything that we need to, and then we can
> wait in one place for the async helpers to catch up, and then wait on
> any ordered extents that are created.
> 
> Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

This patch really depend on the next one in order for it to be correct.
Imo this dependency should be explicitly stated in the change log and
the patches re-ordered.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-07 11:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-29 13:59 [PATCH v2 0/8] ENOSPC delalloc flushing fixes Josef Bacik
2021-06-29 13:59 ` [PATCH v2 1/8] btrfs: enable a tracepoint when we fail tickets Josef Bacik
2021-07-07 10:51   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently Josef Bacik
2021-07-07 10:50   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 3/8] btrfs: wait on async extents when flushing delalloc Josef Bacik
2021-07-07 11:09   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 4/8] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
2021-07-07 11:04   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 5/8] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
2021-07-07 11:03   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 6/8] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
2021-07-07 11:03   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 7/8] 9p: migrate from sync_inode to filemap_fdatawrite_wbc Josef Bacik
2021-07-07 11:00   ` Nikolay Borisov
2021-06-29 13:59 ` [PATCH v2 8/8] fs: kill sync_inode Josef Bacik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.