All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support
@ 2023-09-15 12:51 Brian Foster
  2023-09-15 12:51 ` [PATCH v2 1/4] bcachefs: refactor pin put helpers Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-15 12:51 UTC (permalink / raw)
  To: linux-bcachefs

Hi all,

Here's v2 of the bcachefs freeze and related bugfix patches. Patch 1 of
the original series has been merged. Patches 1-3 of this series have
been reworked a bit to address some of Kent's concerns, particularly
around increasing the amount of inline code. The factoring has been
reworked a bit to address that, but otherwise the fundamental fix in
patch 3 is the same: allow journal reservation to hold an indirect
reference on the pin list via the journal buffer.

Patch 4 adds initial support for the vfs freeze mechanism. This is
technically still a bit incomplete because we don't include intwrite
freeze protection, but technically afaict things still function
correctly mainly due to bcachefs' nearly overlapping write reference
mechanism. So all in all while it seems like this could still be
improved, that requires further thought and doesn't seem to warrant
further gating. Further details and caveats are discussed in the commit
log.

This survives all of the usual regression tests with the only observable
failure being generic/459. This is a newly enabled (freeze dependent)
test that intentionally produces I/O errors via dm-thin
overprovisioning. The reason for the occasional test failure is that the
filesystem shuts down due to these I/O errors. This is generally
expected behavior for bcachefs and so not an immediate freeze issue.
Finally, CI regression is ongoing and observable here:

 https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

Thoughts, reviews, flames appreciated.

Brian

v2:
- Reworked approach to final journal buffer processing.
- Appended initial freeze support patch.
v1: https://lore.kernel.org/linux-bcachefs/20230831110734.787212-1-bfoster@redhat.com/

Brian Foster (4):
  bcachefs: refactor pin put helpers
  bcachefs: prepare journal buf put to handle pin put
  bcachefs: fix race between journal entry close and pin set
  bcachefs: initial freeze/unfreeze support

 fs/bcachefs/fs.c              | 31 +++++++++++++++++++++++++++++--
 fs/bcachefs/journal.c         | 20 +++++++++++++-------
 fs/bcachefs/journal.h         | 34 ++++++++++++++++++++++++++++------
 fs/bcachefs/journal_reclaim.c | 11 ++++-------
 fs/bcachefs/journal_reclaim.h |  3 ++-
 5 files changed, 76 insertions(+), 23 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/4] bcachefs: refactor pin put helpers
  2023-09-15 12:51 [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support Brian Foster
@ 2023-09-15 12:51 ` Brian Foster
  2023-09-15 12:51 ` [PATCH v2 2/4] bcachefs: prepare journal buf put to handle pin put Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-15 12:51 UTC (permalink / raw)
  To: linux-bcachefs

We have a couple journal pin put helpers to handle cases where the
journal lock is already held or not. Refactor the helpers to lock
and reclaim from the highest level and open code the reclaim from
the one caller of the internal variant. The latter call will be
moved into the journal buf release helper in a later patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/journal.c         |  3 ++-
 fs/bcachefs/journal_reclaim.c | 11 ++++-------
 fs/bcachefs/journal_reclaim.h |  3 ++-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index ad80618d1740..210a2b90bb50 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -204,7 +204,8 @@ static void __journal_entry_close(struct journal *j, unsigned closed_val)
 	buf->data->last_seq	= cpu_to_le64(buf->last_seq);
 	BUG_ON(buf->last_seq > le64_to_cpu(buf->data->seq));
 
-	__bch2_journal_pin_put(j, le64_to_cpu(buf->data->seq));
+	if (__bch2_journal_pin_put(j, le64_to_cpu(buf->data->seq)))
+		bch2_journal_reclaim_fast(j);
 
 	cancel_delayed_work(&j->write_work);
 
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 1f3d5890ff11..9a584aaaa2eb 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -290,7 +290,7 @@ void bch2_journal_do_discards(struct journal *j)
  * entry, holding it open to ensure it gets replayed during recovery:
  */
 
-static void bch2_journal_reclaim_fast(struct journal *j)
+void bch2_journal_reclaim_fast(struct journal *j)
 {
 	bool popped = false;
 
@@ -310,19 +310,16 @@ static void bch2_journal_reclaim_fast(struct journal *j)
 		bch2_journal_space_available(j);
 }
 
-void __bch2_journal_pin_put(struct journal *j, u64 seq)
+bool __bch2_journal_pin_put(struct journal *j, u64 seq)
 {
 	struct journal_entry_pin_list *pin_list = journal_seq_pin(j, seq);
 
-	if (atomic_dec_and_test(&pin_list->count))
-		bch2_journal_reclaim_fast(j);
+	return atomic_dec_and_test(&pin_list->count);
 }
 
 void bch2_journal_pin_put(struct journal *j, u64 seq)
 {
-	struct journal_entry_pin_list *pin_list = journal_seq_pin(j, seq);
-
-	if (atomic_dec_and_test(&pin_list->count)) {
+	if (__bch2_journal_pin_put(j, seq)) {
 		spin_lock(&j->lock);
 		bch2_journal_reclaim_fast(j);
 		spin_unlock(&j->lock);
diff --git a/fs/bcachefs/journal_reclaim.h b/fs/bcachefs/journal_reclaim.h
index 0fd1af120db5..494d1a6eddb0 100644
--- a/fs/bcachefs/journal_reclaim.h
+++ b/fs/bcachefs/journal_reclaim.h
@@ -31,7 +31,8 @@ journal_seq_pin(struct journal *j, u64 seq)
 	return &j->pin.data[seq & j->pin.mask];
 }
 
-void __bch2_journal_pin_put(struct journal *, u64);
+void bch2_journal_reclaim_fast(struct journal *);
+bool __bch2_journal_pin_put(struct journal *, u64);
 void bch2_journal_pin_put(struct journal *, u64);
 void bch2_journal_pin_drop(struct journal *, struct journal_entry_pin *);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 2/4] bcachefs: prepare journal buf put to handle pin put
  2023-09-15 12:51 [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support Brian Foster
  2023-09-15 12:51 ` [PATCH v2 1/4] bcachefs: refactor pin put helpers Brian Foster
@ 2023-09-15 12:51 ` Brian Foster
  2023-09-15 12:51 ` [PATCH v2 3/4] bcachefs: fix race between journal entry close and pin set Brian Foster
  2023-09-15 12:51 ` [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support Brian Foster
  3 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-15 12:51 UTC (permalink / raw)
  To: linux-bcachefs

bcachefs freeze testing has uncovered some raciness between journal
entry open/close and pin list reference count management. The
details of the problem are described in a separate patch. In
preparation for the associated fix, refactor the journal buffer put
path a bit to allow it to eventually handle dropping the pin list
reference currently held by an open journal entry.

Retain the journal write dispatch helper since the closure code is
inlined and we don't want to increase the amount of inline code in
the transaction commit path, but rename the function to reflect
the purpose of final processing of the journal buffer.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/journal.c |  2 +-
 fs/bcachefs/journal.h | 22 +++++++++++++++++-----
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 210a2b90bb50..be61d43458eb 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -134,7 +134,7 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags)
 
 /* journal entry close/open: */
 
-void __bch2_journal_buf_put(struct journal *j)
+void bch2_journal_buf_put_final(struct journal *j)
 {
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index 008a2e25a4fa..0a53a2142594 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -252,9 +252,10 @@ static inline bool journal_entry_empty(struct jset *j)
 	return true;
 }
 
-void __bch2_journal_buf_put(struct journal *);
-
-static inline void bch2_journal_buf_put(struct journal *j, unsigned idx)
+/*
+ * Drop reference on a buffer index and return true if the count has hit zero.
+ */
+static inline union journal_res_state journal_state_buf_put(struct journal *j, unsigned idx)
 {
 	union journal_res_state s;
 
@@ -264,9 +265,20 @@ static inline void bch2_journal_buf_put(struct journal *j, unsigned idx)
 				    .buf2_count = idx == 2,
 				    .buf3_count = idx == 3,
 				    }).v, &j->reservations.counter);
+	return s;
+}
+
+void bch2_journal_buf_put_final(struct journal *);
 
-	if (!journal_state_count(s, idx) && idx == s.unwritten_idx)
-		__bch2_journal_buf_put(j);
+static inline void bch2_journal_buf_put(struct journal *j, unsigned idx)
+{
+	union journal_res_state s;
+
+	s = journal_state_buf_put(j, idx);
+	if (!journal_state_count(s, idx)) {
+		if (idx == s.unwritten_idx)
+			bch2_journal_buf_put_final(j);
+	}
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/4] bcachefs: fix race between journal entry close and pin set
  2023-09-15 12:51 [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support Brian Foster
  2023-09-15 12:51 ` [PATCH v2 1/4] bcachefs: refactor pin put helpers Brian Foster
  2023-09-15 12:51 ` [PATCH v2 2/4] bcachefs: prepare journal buf put to handle pin put Brian Foster
@ 2023-09-15 12:51 ` Brian Foster
  2023-09-15 12:51 ` [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support Brian Foster
  3 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-15 12:51 UTC (permalink / raw)
  To: linux-bcachefs

bcachefs freeze testing via fstests generic/390 occasionally
reproduces the following BUG from bch2_fs_read_only():

  BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));

This indicates that one or more dirty key cache keys still exist
after the attempt to flush and quiesce the fs. The sequence that
leads to this problem actually occurs on unfreeze (ro->rw), and
looks something like the following:

- Task A begins a transaction commit and acquires journal_res for
  the current seq. This transaction intends to perform key cache
  insertion.
- Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends
  up in journal_entry_want_write(), which closes the current journal
  entry and drops the reference to the pin list created on entry open.
  The pin put pops the front of the journal via fast reclaim since the
  reference count has dropped to 0.
- Task A attempts to set the journal pin for the associated cached
  key, but bch2_journal_pin_set() skips the pin insert because the
  seq of the transaction reservation is behind the front of the pin
  list fifo.

The end result is that the pin associated with the cached key is not
added, which prevents a subsequent reclaim from processing the key
and thus leaves it dangling at freeze time. The fundamental cause of
this problem is that the front of the journal is allowed to pop
before a transaction with outstanding reservation on the associated
journal seq is able to add a pin. The count for the pin list
associated with the seq drops to zero and is prematurely reclaimed
as a result.

The logical fix for this problem lies in how the journal buffer is
managed in similar scenarios where the entry might have been closed
before a transaction with outstanding reservations happens to be
committed.

When a journal entry is opened, the current sequence number is
bumped, the associated pin list is initialized with a reference
count of 1, and the journal buffer reference count is bumped (via
journal_state_inc()). When a journal reservation is acquired, the
reservation also acquires a reference on the associated buffer. If
the journal entry is closed in the meantime, it drops both the pin
and buffer references held by the open entry, but the buffer still
has references held by outstanding reservation. After the associated
transaction commits, the reservation release drops the associated
buffer references and the buffer is written out once the reference
count has dropped to zero.

The fundamental problem here is that the lifecycle of the pin list
reference held by an open journal entry is too short to cover the
processing of transactions with outstanding reservations. The
simplest way to address this is to expand the pin list reference to
the lifecycle of the buffer vs. the shorter lifecycle of the open
journal entry. This ensures the pin list for a seq with outstanding
reservation cannot be popped and reclaimed before all outstanding
reservations have been released, even if the associated journal
entry has been closed for further reservations.

Move the pin put from journal entry close to where final processing
of the journal buffer occurs. Create a duplicate helper to cover the
case where the caller doesn't already hold the journal lock. This
allows generic/390 to pass reliably.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/journal.c | 21 +++++++++++++--------
 fs/bcachefs/journal.h | 20 +++++++++++++++-----
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index be61d43458eb..fc3dd5bef386 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -132,13 +132,21 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags)
 	return stuck;
 }
 
-/* journal entry close/open: */
-
-void bch2_journal_buf_put_final(struct journal *j)
+/*
+ * Final processing when the last reference of a journal buffer has been
+ * dropped. Drop the pin list reference acquired at journal entry open and write
+ * the buffer, if requested.
+ */
+void bch2_journal_buf_put_final(struct journal *j, u64 seq, bool write)
 {
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 
-	closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL);
+	lockdep_assert_held(&j->lock);
+
+	if (__bch2_journal_pin_put(j, seq))
+		bch2_journal_reclaim_fast(j);
+	if (write)
+		closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL);
 }
 
 /*
@@ -204,14 +212,11 @@ static void __journal_entry_close(struct journal *j, unsigned closed_val)
 	buf->data->last_seq	= cpu_to_le64(buf->last_seq);
 	BUG_ON(buf->last_seq > le64_to_cpu(buf->data->seq));
 
-	if (__bch2_journal_pin_put(j, le64_to_cpu(buf->data->seq)))
-		bch2_journal_reclaim_fast(j);
-
 	cancel_delayed_work(&j->write_work);
 
 	bch2_journal_space_available(j);
 
-	bch2_journal_buf_put(j, old.idx);
+	__bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq));
 }
 
 void bch2_journal_halt(struct journal *j)
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index 0a53a2142594..491133cc52f3 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -268,16 +268,26 @@ static inline union journal_res_state journal_state_buf_put(struct journal *j, u
 	return s;
 }
 
-void bch2_journal_buf_put_final(struct journal *);
+void bch2_journal_buf_put_final(struct journal *, u64, bool);
 
-static inline void bch2_journal_buf_put(struct journal *j, unsigned idx)
+static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+{
+	union journal_res_state s;
+
+	s = journal_state_buf_put(j, idx);
+	if (!journal_state_count(s, idx))
+		bch2_journal_buf_put_final(j, seq, idx == s.unwritten_idx);
+}
+
+static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
 {
 	union journal_res_state s;
 
 	s = journal_state_buf_put(j, idx);
 	if (!journal_state_count(s, idx)) {
-		if (idx == s.unwritten_idx)
-			bch2_journal_buf_put_final(j);
+		spin_lock(&j->lock);
+		bch2_journal_buf_put_final(j, seq, idx == s.unwritten_idx);
+		spin_unlock(&j->lock);
 	}
 }
 
@@ -298,7 +308,7 @@ static inline void bch2_journal_res_put(struct journal *j,
 				       BCH_JSET_ENTRY_btree_keys,
 				       0, 0, 0);
 
-	bch2_journal_buf_put(j, res->idx);
+	bch2_journal_buf_put(j, res->idx, res->seq);
 
 	res->ref = 0;
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-15 12:51 [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support Brian Foster
                   ` (2 preceding siblings ...)
  2023-09-15 12:51 ` [PATCH v2 3/4] bcachefs: fix race between journal entry close and pin set Brian Foster
@ 2023-09-15 12:51 ` Brian Foster
  2023-09-19 21:32   ` Kent Overstreet
  2023-09-20  1:21   ` Kent Overstreet
  3 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-15 12:51 UTC (permalink / raw)
  To: linux-bcachefs

Initial support for the vfs superblock freeze and unfreeze
operations. Superblock freeze occurs in stages, where the vfs
attempts to quiesce high level write operations, page faults, fs
internal operations, and then finally calls into the filesystem for
any last stage steps (i.e. log flushing, etc.) before marking the
superblock frozen.

The majority of write paths are covered by freeze protection (i.e.
sb_start_write() and friends) in higher level common code, with the
exception of the fs-internal SB_FREEZE_FS stage (i.e.
sb_start_intwrite()). This typically maps to active filesystem
transactions in a manner that allows the vfs to implement a barrier
of internal fs operations during the freeze sequence. This is not a
viable model for bcachefs, however, because it utilizes transactions
both to populate the journal as well as to perform journal reclaim.
This means that mapping intwrite protection to transaction lifecycle
or transaction commit is likely to deadlock freeze, as quiescing the
journal requires transactional operations blocked by the final stage
of freeze.

The flipside of this is that bcachefs does already maintain its own
internal sets of write references for similar purposes, currently
utilized for transitions from read-write to read-only mode. Since
this largely mirrors the high level sequence involved with freeze,
we can simply invoke this mechanism in the freeze callback to fully
quiesce the filesystem in the final stage. This means that while the
SB_FREEZE_FS stage is essentially a no-op, the ->freeze_fs()
callback that immediately follows begins by performing effectively
the same step by quiescing all internal write references.

One caveat to this approach is that without integration of internal
freeze protection, write operations gated on internal write refs
will fail with an internal -EROFS error rather than block on
acquiring freeze protection. IOW, this is roughly equivalent to only
having support for sb_start_intwrite_trylock(), and not the blocking
variant. Many of these paths already use non-blocking internal write
refs and so would map into an sb_start_intwrite_trylock() anyways.
The only instance of this I've been able to uncover that doesn't
explicitly rely on a higher level non-blocking write ref is the
bch2_rbio_narrow_crcs() path, which updates crcs in certain read
cases, and Kent has pointed out isn't critical if it happens to fail
due to read-only status.

Given that, implement basic freeze support as described above and
leave tighter integration with internal freeze protection as a
possible future enhancement. There are multiple potential ideas
worth exploring here. For example, we could implement a multi-stage
freeze callback that might allow bcachefs to quiesce its internal
write references without deadlocks, we could integrate intwrite
protection with bcachefs' internal write references somehow or
another, or perhaps consider implementing blocking support for
internal write refs to be used specifically for freeze, etc. In the
meantime, this enables functional freeze support and the associated
test coverage that comes with it.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index c1abed92703b..1354af2cb85c 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1716,6 +1716,35 @@ static void bch2_put_super(struct super_block *sb)
 	__bch2_fs_stop(c);
 }
 
+/*
+ * bcachefs doesn't currently integrate intwrite freeze protection but the
+ * internal write references serve the same purpose. Therefore reuse the
+ * read-only transition code to perform the quiesce. The caveat is that we don't
+ * currently have the ability to block tasks that want a write reference while
+ * the superblock is frozen. This is fine for now, but we should either add
+ * blocking support or find a way to integrate sb_start_intwrite() and friends.
+ */
+static int bch2_freeze(struct super_block *sb)
+{
+	struct bch_fs *c = sb->s_fs_info;
+
+	down_write(&c->state_lock);
+	bch2_fs_read_only(c);
+	up_write(&c->state_lock);
+	return 0;
+}
+
+static int bch2_unfreeze(struct super_block *sb)
+{
+	struct bch_fs *c = sb->s_fs_info;
+	int ret;
+
+	down_write(&c->state_lock);
+	ret = bch2_fs_read_write(c);
+	up_write(&c->state_lock);
+	return ret;
+}
+
 static const struct super_operations bch_super_operations = {
 	.alloc_inode	= bch2_alloc_inode,
 	.destroy_inode	= bch2_destroy_inode,
@@ -1727,10 +1756,8 @@ static const struct super_operations bch_super_operations = {
 	.show_options	= bch2_show_options,
 	.remount_fs	= bch2_remount,
 	.put_super	= bch2_put_super,
-#if 0
 	.freeze_fs	= bch2_freeze,
 	.unfreeze_fs	= bch2_unfreeze,
-#endif
 };
 
 static int bch2_set_super(struct super_block *s, void *data)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-15 12:51 ` [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support Brian Foster
@ 2023-09-19 21:32   ` Kent Overstreet
  2023-09-20  1:21   ` Kent Overstreet
  1 sibling, 0 replies; 10+ messages in thread
From: Kent Overstreet @ 2023-09-19 21:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Sep 15, 2023 at 08:51:54AM -0400, Brian Foster wrote:
> Initial support for the vfs superblock freeze and unfreeze
> operations. Superblock freeze occurs in stages, where the vfs
> attempts to quiesce high level write operations, page faults, fs
> internal operations, and then finally calls into the filesystem for
> any last stage steps (i.e. log flushing, etc.) before marking the
> superblock frozen.
> 
> The majority of write paths are covered by freeze protection (i.e.
> sb_start_write() and friends) in higher level common code, with the
> exception of the fs-internal SB_FREEZE_FS stage (i.e.
> sb_start_intwrite()). This typically maps to active filesystem
> transactions in a manner that allows the vfs to implement a barrier
> of internal fs operations during the freeze sequence. This is not a
> viable model for bcachefs, however, because it utilizes transactions
> both to populate the journal as well as to perform journal reclaim.
> This means that mapping intwrite protection to transaction lifecycle
> or transaction commit is likely to deadlock freeze, as quiescing the
> journal requires transactional operations blocked by the final stage
> of freeze.
> 
> The flipside of this is that bcachefs does already maintain its own
> internal sets of write references for similar purposes, currently
> utilized for transitions from read-write to read-only mode. Since
> this largely mirrors the high level sequence involved with freeze,
> we can simply invoke this mechanism in the freeze callback to fully
> quiesce the filesystem in the final stage. This means that while the
> SB_FREEZE_FS stage is essentially a no-op, the ->freeze_fs()
> callback that immediately follows begins by performing effectively
> the same step by quiescing all internal write references.
> 
> One caveat to this approach is that without integration of internal
> freeze protection, write operations gated on internal write refs
> will fail with an internal -EROFS error rather than block on
> acquiring freeze protection. IOW, this is roughly equivalent to only
> having support for sb_start_intwrite_trylock(), and not the blocking
> variant. Many of these paths already use non-blocking internal write
> refs and so would map into an sb_start_intwrite_trylock() anyways.
> The only instance of this I've been able to uncover that doesn't
> explicitly rely on a higher level non-blocking write ref is the
> bch2_rbio_narrow_crcs() path, which updates crcs in certain read
> cases, and Kent has pointed out isn't critical if it happens to fail
> due to read-only status.
> 
> Given that, implement basic freeze support as described above and
> leave tighter integration with internal freeze protection as a
> possible future enhancement. There are multiple potential ideas
> worth exploring here. For example, we could implement a multi-stage
> freeze callback that might allow bcachefs to quiesce its internal
> write references without deadlocks, we could integrate intwrite
> protection with bcachefs' internal write references somehow or
> another, or perhaps consider implementing blocking support for
> internal write refs to be used specifically for freeze, etc. In the
> meantime, this enables functional freeze support and the associated
> test coverage that comes with it.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

This is going to get a whole lot more test coverage and I'm pretty
excited about that - thanks, applied the series

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-15 12:51 ` [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support Brian Foster
  2023-09-19 21:32   ` Kent Overstreet
@ 2023-09-20  1:21   ` Kent Overstreet
  2023-09-20 13:28     ` Brian Foster
  1 sibling, 1 reply; 10+ messages in thread
From: Kent Overstreet @ 2023-09-20  1:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

Pulled this into the testing branch and then got

https://evilpiepirate.org/~testdashboard/c/de4ea1e2a9ceec5d55fffbc1acab89f0dc8f90b6/xfstests.generic.459/log.br

So I'll likely kick this patch back out for now, let me know when you
have a fixed version :)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-20  1:21   ` Kent Overstreet
@ 2023-09-20 13:28     ` Brian Foster
  2023-09-21 14:55       ` Brian Foster
  2023-09-21 15:46       ` Kent Overstreet
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-20 13:28 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Tue, Sep 19, 2023 at 09:21:16PM -0400, Kent Overstreet wrote:
> Pulled this into the testing branch and then got
> 
> https://evilpiepirate.org/~testdashboard/c/de4ea1e2a9ceec5d55fffbc1acab89f0dc8f90b6/xfstests.generic.459/log.br
> 
> So I'll likely kick this patch back out for now, let me know when you
> have a fixed version :)
> 

Ah, sorry.. I should have mentioned this in the cover letter. I'm aware
of this failure but my initial triage has it as an unrelated problem.
That test basically induces I/O errors by explicitly overprovisioning a
dm-thin volume for the fs. The original bug was a livelock issue on XFS
related to metadata writeback failure/retry in this particular scenario.

The test relies on freeze in that it basically consumes all of the
initially provisioned space, issues a freeze in the background (which
will start off hanging because not everything can write back until more
storage is available), and then grows available space so freeze can
proceed to completion. It uses the success/failure of the freeze to
determine pass/failure, and if the freeze fails it looks like it expects
the filesystem to have been remounted ro (which I believe is ext4's way
of dealing with this).

My notes say that freeze failed because the fs shutdown, which I think
is due to the whole overprovision/flush thing leading to I/O errors
(i.e. expected behavior, probably similar to ext4). But TBH I hadn't dug
into it further than initial triage to rule out the core freeze
mechanism itself. I'll dig more into that soon to see whether we need to
change the test or something in the kernel, though I don't think it
necessarily needs to gate freeze support..

Brian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-20 13:28     ` Brian Foster
@ 2023-09-21 14:55       ` Brian Foster
  2023-09-21 15:46       ` Kent Overstreet
  1 sibling, 0 replies; 10+ messages in thread
From: Brian Foster @ 2023-09-21 14:55 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Sep 20, 2023 at 09:28:44AM -0400, Brian Foster wrote:
> On Tue, Sep 19, 2023 at 09:21:16PM -0400, Kent Overstreet wrote:
> > Pulled this into the testing branch and then got
> > 
> > https://evilpiepirate.org/~testdashboard/c/de4ea1e2a9ceec5d55fffbc1acab89f0dc8f90b6/xfstests.generic.459/log.br
> > 
> > So I'll likely kick this patch back out for now, let me know when you
> > have a fixed version :)
> > 
> 
> Ah, sorry.. I should have mentioned this in the cover letter. I'm aware
> of this failure but my initial triage has it as an unrelated problem.
> That test basically induces I/O errors by explicitly overprovisioning a
> dm-thin volume for the fs. The original bug was a livelock issue on XFS
> related to metadata writeback failure/retry in this particular scenario.
> 
> The test relies on freeze in that it basically consumes all of the
> initially provisioned space, issues a freeze in the background (which
> will start off hanging because not everything can write back until more
> storage is available), and then grows available space so freeze can
> proceed to completion. It uses the success/failure of the freeze to
> determine pass/failure, and if the freeze fails it looks like it expects
> the filesystem to have been remounted ro (which I believe is ext4's way
> of dealing with this).
> 
> My notes say that freeze failed because the fs shutdown, which I think
> is due to the whole overprovision/flush thing leading to I/O errors
> (i.e. expected behavior, probably similar to ext4). But TBH I hadn't dug
> into it further than initial triage to rule out the core freeze
> mechanism itself. I'll dig more into that soon to see whether we need to
> change the test or something in the kernel, though I don't think it
> necessarily needs to gate freeze support..
> 

Yeah, so I can confirm that the only real difference in behavior between
ext4 and bcachefs wrt to this test is that the former sets SB_RDONLY in
its internal error remount read-only sequence, which in turn results in
"ro" text in the /proc/mounts output and thus satisfies the test.

bcachefs only does that in the ioctl shutdown paths for whatever reason.
Unfortunately it's not quite as simple as doing the same in the fatal
error path. If I do that, it looks like the async error handling worker
can race with a freeze, which does two different things wrt to internal
locks when sb_rdonly() is true vs. not.

This can possibly all be serialized via the right combination of
s_umount and freeze protection in the error handler to ensure we never
see the wrong combination of being freeze locked && sb_rdonly(), but
that requires a little more thought. Given the test could also easily
change to checking for -EROFS or some such on bcachefs vs. "ro" in the
mount status, this does strike me as more of a minor shutdown
inconsistency than a significant freeze or shutdown bug. What might be
more useful is at some point to audit the various read-only paths for
consistent behaviors regardless of how invoked (i.e. "ro" for remount,
shutdown) and proper serialization.

Brian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support
  2023-09-20 13:28     ` Brian Foster
  2023-09-21 14:55       ` Brian Foster
@ 2023-09-21 15:46       ` Kent Overstreet
  1 sibling, 0 replies; 10+ messages in thread
From: Kent Overstreet @ 2023-09-21 15:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Sep 20, 2023 at 09:28:44AM -0400, Brian Foster wrote:
> On Tue, Sep 19, 2023 at 09:21:16PM -0400, Kent Overstreet wrote:
> > Pulled this into the testing branch and then got
> > 
> > https://evilpiepirate.org/~testdashboard/c/de4ea1e2a9ceec5d55fffbc1acab89f0dc8f90b6/xfstests.generic.459/log.br
> > 
> > So I'll likely kick this patch back out for now, let me know when you
> > have a fixed version :)
> > 
> 
> Ah, sorry.. I should have mentioned this in the cover letter. I'm aware
> of this failure but my initial triage has it as an unrelated problem.
> That test basically induces I/O errors by explicitly overprovisioning a
> dm-thin volume for the fs. The original bug was a livelock issue on XFS
> related to metadata writeback failure/retry in this particular scenario.
> 
> The test relies on freeze in that it basically consumes all of the
> initially provisioned space, issues a freeze in the background (which
> will start off hanging because not everything can write back until more
> storage is available), and then grows available space so freeze can
> proceed to completion. It uses the success/failure of the freeze to
> determine pass/failure, and if the freeze fails it looks like it expects
> the filesystem to have been remounted ro (which I believe is ext4's way
> of dealing with this).
> 
> My notes say that freeze failed because the fs shutdown, which I think
> is due to the whole overprovision/flush thing leading to I/O errors
> (i.e. expected behavior, probably similar to ext4). But TBH I hadn't dug
> into it further than initial triage to rule out the core freeze
> mechanism itself. I'll dig more into that soon to see whether we need to
> change the test or something in the kernel, though I don't think it
> necessarily needs to gate freeze support..

Ah ok - and regardless, this is a new test being enabled. Pulling this
patch back in...

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-09-21 17:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-15 12:51 [PATCH v2 0/4] bcachefs: journal bug fixes / freeze support Brian Foster
2023-09-15 12:51 ` [PATCH v2 1/4] bcachefs: refactor pin put helpers Brian Foster
2023-09-15 12:51 ` [PATCH v2 2/4] bcachefs: prepare journal buf put to handle pin put Brian Foster
2023-09-15 12:51 ` [PATCH v2 3/4] bcachefs: fix race between journal entry close and pin set Brian Foster
2023-09-15 12:51 ` [PATCH v2 4/4] bcachefs: initial freeze/unfreeze support Brian Foster
2023-09-19 21:32   ` Kent Overstreet
2023-09-20  1:21   ` Kent Overstreet
2023-09-20 13:28     ` Brian Foster
2023-09-21 14:55       ` Brian Foster
2023-09-21 15:46       ` Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.