[Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection
@ 2019-11-13 21:29 Bob Peterson
  2019-11-13 21:29 ` [Cluster-devel] [PATCH 01/32] gfs2: Introduce concept of a pending withdraw Bob Peterson
                   ` (32 more replies)
  0 siblings, 33 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This is my latest collection of patches to address the myriad of gfs2
recovery problems I've found. I'm not convinced we need all of these
but I thought I'd send them anyway and get feedback

Some of these are just bugs and may be pushed separately.

Bob Peterson (32):
  gfs2: Introduce concept of a pending withdraw
  gfs2: clear ail1 list when gfs2 withdraws
  gfs2: Rework how rgrp buffer_heads are managed
  gfs2: fix infinite loop in gfs2_ail1_flush on io error
  gfs2: log error reform
  gfs2: Only complain the first time an io error occurs in quota or log
  gfs2: Ignore dlm recovery requests if gfs2 is withdrawn
  gfs2: move check_journal_clean to util.c for future use
  gfs2: Allow some glocks to be used during withdraw
  gfs2: Don't loop forever in gfs2_freeze if withdrawn
  gfs2: Make secondary withdrawers wait for first withdrawer
  gfs2: Don't write log headers after file system withdraw
  gfs2: Force withdraw to replay journals and wait for it to finish
  gfs2: fix infinite loop when checking ail item count before go_inval
  gfs2: Add verbose option to check_journal_clean
  gfs2: Abort gfs2_freeze if io error is seen
  gfs2: Issue revokes more intelligently
  gfs2: Prepare to withdraw as soon as an IO error occurs in log write
  gfs2: Check for log write errors before telling dlm to unlock
  gfs2: new slab for transactions
  gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
  gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
  gfs2: Don't skip log flush if glock still has revokes
  gfs2: initialize tr_ail1_list when creating transactions
  gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error
  gfs2: drain the ail2 list after io errors
  gfs2: make gfs2_log_shutdown static
  gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
  gfs2: if finish_open returns error, clean up iopen glock mess
  gfs2: Don't demote a glock until its revokes are written
  gfs2: Do proper error checking for go_sync family of glops functions
  gfs2: fix glock reference problem in gfs2_trans_add_unrevoke

 fs/gfs2/aops.c       |   4 +-
 fs/gfs2/file.c       |   2 +-
 fs/gfs2/glock.c      | 140 ++++++++++++++++++++++----
 fs/gfs2/glops.c      | 153 ++++++++++++++++++++++------
 fs/gfs2/incore.h     |  21 ++--
 fs/gfs2/inode.c      |   6 ++
 fs/gfs2/lock_dlm.c   |  52 ++++++++++
 fs/gfs2/log.c        | 231 +++++++++++++++++++++++++++++-------------
 fs/gfs2/log.h        |   2 +-
 fs/gfs2/lops.c       |  12 ++-
 fs/gfs2/main.c       |  23 +++++
 fs/gfs2/meta_io.c    |   6 +-
 fs/gfs2/ops_fstype.c |  51 +---------
 fs/gfs2/quota.c      |  10 +-
 fs/gfs2/recovery.c   |   5 +
 fs/gfs2/rgrp.c       |  82 +++++++++------
 fs/gfs2/rgrp.h       |   1 -
 fs/gfs2/super.c      |  97 ++++++++++++------
 fs/gfs2/sys.c        |   2 +-
 fs/gfs2/trans.c      |  38 +++++--
 fs/gfs2/trans.h      |   1 +
 fs/gfs2/util.c       | 235 +++++++++++++++++++++++++++++++++++++++++--
 fs/gfs2/util.h       |  16 +++
 23 files changed, 924 insertions(+), 266 deletions(-)

-- 
2.23.0



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 01/32] gfs2: Introduce concept of a pending withdraw
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
@ 2019-11-13 21:29 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 02/32] gfs2: clear ail1 list when gfs2 withdraws Bob Peterson
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.

This patch introduces a new concept of a pending withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/aops.c       |  4 ++--
 fs/gfs2/file.c       |  2 +-
 fs/gfs2/glock.c      |  7 +++----
 fs/gfs2/glops.c      |  2 +-
 fs/gfs2/incore.h     |  1 +
 fs/gfs2/log.c        | 20 ++++++++------------
 fs/gfs2/meta_io.c    |  6 +++---
 fs/gfs2/ops_fstype.c |  3 +--
 fs/gfs2/quota.c      |  2 +-
 fs/gfs2/super.c      |  6 +++---
 fs/gfs2/sys.c        |  2 +-
 fs/gfs2/util.c       | 14 +++++++-------
 fs/gfs2/util.h       | 12 ++++++++++++
 13 files changed, 44 insertions(+), 37 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 765e40aad985..9c6df721321a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -497,7 +497,7 @@ static int __gfs2_readpage(void *file, struct page *page)
 		error = mpage_readpage(page, gfs2_block_map);
 	}
 
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		return -EIO;
 
 	return error;
@@ -614,7 +614,7 @@ static int gfs2_readpages(struct file *file, struct address_space *mapping,
 	gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		ret = -EIO;
 	return ret;
 }
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 92524a946d03..62cc5bd12d09 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1194,7 +1194,7 @@ static int gfs2_lock(struct file *file, int cmd, struct file_lock *fl)
 		cmd = F_SETLK;
 		fl->fl_type = F_UNLCK;
 	}
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags))) {
+	if (unlikely(gfs2_withdrawn(sdp))) {
 		if (fl->fl_type == F_UNLCK)
 			locks_lock_file_wait(file, fl);
 		return -EIO;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 0290a22ebccf..faa88bd594e2 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -549,7 +549,7 @@ __acquires(&gl->gl_lockref.lock)
 	unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
 	int ret;
 
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)) &&
+	if (unlikely(gfs2_withdrawn(sdp)) &&
 	    target != LM_ST_UNLOCKED)
 		return;
 	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -586,8 +586,7 @@ __acquires(&gl->gl_lockref.lock)
 		}
 		else if (ret) {
 			fs_err(sdp, "lm_lock ret %d\n", ret);
-			GLOCK_BUG_ON(gl, !test_bit(SDF_WITHDRAWN,
-						   &sdp->sd_flags));
+			GLOCK_BUG_ON(gl, !gfs2_withdrawn(sdp));
 		}
 	} else { /* lock_nolock */
 		finish_xmote(gl, target);
@@ -1191,7 +1190,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	int error = 0;
 
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		return -EIO;
 
 	if (test_bit(GLF_LRU, &gl->gl_flags))
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 0e019f5a72d1..4ede1f18de85 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -540,7 +540,7 @@ static int freeze_go_xmote_bh(struct gfs2_glock *gl, struct gfs2_holder *gh)
 			gfs2_consist(sdp);
 
 		/*  Initialize some head of the log stuff  */
-		if (!test_bit(SDF_WITHDRAWN, &sdp->sd_flags)) {
+		if (!gfs2_withdrawn(sdp)) {
 			sdp->sd_log_sequence = head.lh_sequence + 1;
 			gfs2_log_pointers_init(sdp, head.lh_blkno);
 		}
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 5f89c515f5bb..6042babb7324 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -619,6 +619,7 @@ enum {
 	SDF_FORCE_AIL_FLUSH     = 9,
 	SDF_AIL1_IO_ERROR	= 10,
 	SDF_FS_FROZEN           = 11,
+	SDF_WITHDRAWING		= 12, /* Will withdraw eventually */
 };
 
 enum gfs2_freeze_state {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 162246fafc2e..877610e1e3c0 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -89,8 +89,7 @@ static void gfs2_remove_from_ail(struct gfs2_bufdata *bd)
 
 static int gfs2_ail1_start_one(struct gfs2_sbd *sdp,
 			       struct writeback_control *wbc,
-			       struct gfs2_trans *tr,
-			       bool *withdraw)
+			       struct gfs2_trans *tr)
 __releases(&sdp->sd_ail_lock)
 __acquires(&sdp->sd_ail_lock)
 {
@@ -109,7 +108,7 @@ __acquires(&sdp->sd_ail_lock)
 			    !test_and_set_bit(SDF_AIL1_IO_ERROR,
 					      &sdp->sd_flags)) {
 				gfs2_io_error_bh(sdp, bh);
-				*withdraw = true;
+				set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
 			}
 			list_move(&bd->bd_ail_st_list, &tr->tr_ail2_list);
 			continue;
@@ -150,7 +149,6 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	struct list_head *head = &sdp->sd_ail1_list;
 	struct gfs2_trans *tr;
 	struct blk_plug plug;
-	bool withdraw = false;
 
 	trace_gfs2_ail_flush(sdp, wbc, 1);
 	blk_start_plug(&plug);
@@ -159,12 +157,12 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	list_for_each_entry_reverse(tr, head, tr_list) {
 		if (wbc->nr_to_write <= 0)
 			break;
-		if (gfs2_ail1_start_one(sdp, wbc, tr, &withdraw))
+		if (gfs2_ail1_start_one(sdp, wbc, tr))
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
 	blk_finish_plug(&plug);
-	if (withdraw)
+	if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
 		gfs2_lm_withdraw(sdp, NULL);
 	trace_gfs2_ail_flush(sdp, wbc, 0);
 }
@@ -193,8 +191,7 @@ static void gfs2_ail1_start(struct gfs2_sbd *sdp)
  *
  */
 
-static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
-				bool *withdraw)
+static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 {
 	struct gfs2_bufdata *bd, *s;
 	struct buffer_head *bh;
@@ -208,7 +205,7 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
 		if (!buffer_uptodate(bh) &&
 		    !test_and_set_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
 			gfs2_io_error_bh(sdp, bh);
-			*withdraw = true;
+			set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
 		}
 		list_move(&bd->bd_ail_st_list, &tr->tr_ail2_list);
 	}
@@ -226,11 +223,10 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
 	struct gfs2_trans *tr, *s;
 	int oldest_tr = 1;
 	int ret;
-	bool withdraw = false;
 
 	spin_lock(&sdp->sd_ail_lock);
 	list_for_each_entry_safe_reverse(tr, s, &sdp->sd_ail1_list, tr_list) {
-		gfs2_ail1_empty_one(sdp, tr, &withdraw);
+		gfs2_ail1_empty_one(sdp, tr);
 		if (list_empty(&tr->tr_ail1_list) && oldest_tr)
 			list_move(&tr->tr_list, &sdp->sd_ail2_list);
 		else
@@ -239,7 +235,7 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
 	ret = list_empty(&sdp->sd_ail1_list);
 	spin_unlock(&sdp->sd_ail_lock);
 
-	if (withdraw)
+	if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
 		gfs2_lm_withdraw(sdp, "fatal: I/O error(s)\n");
 
 	return ret;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 662ef36c1874..0c3772974030 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -251,7 +251,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 	struct buffer_head *bh, *bhs[2];
 	int num = 0;
 
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags))) {
+	if (unlikely(gfs2_withdrawn(sdp))) {
 		*bhp = NULL;
 		return -EIO;
 	}
@@ -309,7 +309,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 
 int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
 {
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		return -EIO;
 
 	wait_on_buffer(bh);
@@ -320,7 +320,7 @@ int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
 			gfs2_io_error_bh_wd(sdp, bh);
 		return -EIO;
 	}
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		return -EIO;
 
 	return 0;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index de8f156adf7a..e8b7b0ce8404 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1006,8 +1006,7 @@ static int gfs2_lm_mount(struct gfs2_sbd *sdp, int silent)
 void gfs2_lm_unmount(struct gfs2_sbd *sdp)
 {
 	const struct lm_lockops *lm = sdp->sd_lockstruct.ls_ops;
-	if (likely(!test_bit(SDF_WITHDRAWN, &sdp->sd_flags)) &&
-	    lm->lm_unmount)
+	if (likely(!gfs2_withdrawn(sdp)) && lm->lm_unmount)
 		lm->lm_unmount(sdp);
 }
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 8206fa0e8d2c..e9f93045eb01 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1475,7 +1475,7 @@ static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error)
 {
 	if (error == 0 || error == -EROFS)
 		return;
-	if (!test_bit(SDF_WITHDRAWN, &sdp->sd_flags)) {
+	if (!gfs2_withdrawn(sdp)) {
 		fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
 		sdp->sd_log_error = error;
 		wake_up(&sdp->sd_logd_waitq);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 5fa1eec4fb4f..478015bc6890 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -553,7 +553,7 @@ static void gfs2_dirty_inode(struct inode *inode, int flags)
 
 	if (!(flags & I_DIRTY_INODE))
 		return;
-	if (unlikely(test_bit(SDF_WITHDRAWN, &sdp->sd_flags)))
+	if (unlikely(gfs2_withdrawn(sdp)))
 		return;
 	if (!gfs2_glock_is_locked_by_me(ip->i_gl)) {
 		ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
@@ -602,7 +602,7 @@ int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 
 	error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED, GL_NOCACHE,
 				   &freeze_gh);
-	if (error && !test_bit(SDF_WITHDRAWN, &sdp->sd_flags))
+	if (error && !gfs2_withdrawn(sdp))
 		return error;
 
 	flush_workqueue(gfs2_delete_workqueue);
@@ -761,7 +761,7 @@ static int gfs2_freeze(struct super_block *sb)
 	if (atomic_read(&sdp->sd_freeze_state) != SFS_UNFROZEN)
 		goto out;
 
-	if (test_bit(SDF_WITHDRAWN, &sdp->sd_flags)) {
+	if (gfs2_withdrawn(sdp)) {
 		error = -EINVAL;
 		goto out;
 	}
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index dd15b8e4af2c..8ccb68f4ed16 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -118,7 +118,7 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 
 static ssize_t withdraw_show(struct gfs2_sbd *sdp, char *buf)
 {
-	unsigned int b = test_bit(SDF_WITHDRAWN, &sdp->sd_flags);
+	unsigned int b = gfs2_withdrawn(sdp);
 	return snprintf(buf, PAGE_SIZE, "%u\n", b);
 }
 
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index c45159133d8e..7305a7036c3e 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -258,13 +258,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh,
 			const char *function, char *file, unsigned int line,
 			bool withdraw)
 {
-	if (!test_bit(SDF_WITHDRAWN, &sdp->sd_flags))
-		fs_err(sdp,
-		       "fatal: I/O error\n"
-		       "  block = %llu\n"
-		       "  function = %s, file = %s, line = %u\n",
-		       (unsigned long long)bh->b_blocknr,
-		       function, file, line);
+	if (gfs2_withdrawn(sdp))
+		return;
+
+	fs_err(sdp, "fatal: I/O error\n"
+	       "  block = %llu\n"
+	       "  function = %s, file = %s, line = %u\n",
+	       (unsigned long long)bh->b_blocknr, function, file, line);
 	if (withdraw)
 		gfs2_lm_withdraw(sdp, NULL);
 }
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 4b68b2c1fe56..858ab5b15a6c 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -164,6 +164,18 @@ static inline unsigned int gfs2_tune_get_i(struct gfs2_tune *gt,
 	return x;
 }
 
+/**
+ * gfs2_withdrawn - test whether the file system is withdrawing or withdrawn
+ * @sdp: the superblock
+ */
+static inline bool gfs2_withdrawn(struct gfs2_sbd *sdp)
+{
+	if (test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
+	    test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
+		return true;
+	return false;
+}
+
 #define gfs2_tune_get(sdp, field) \
 gfs2_tune_get_i(&(sdp)->sd_tune, &(sdp)->sd_tune.field)
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 02/32] gfs2: clear ail1 list when gfs2 withdraws
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
  2019-11-13 21:29 ` [Cluster-devel] [PATCH 01/32] gfs2: Introduce concept of a pending withdraw Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 03/32] gfs2: Rework how rgrp buffer_heads are managed Bob Peterson
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch fixes a bug in which function gfs2_log_flush can get into
an infinite loop when a gfs2 file system is withdrawn. The problem
is the infinite loop "for (;;)" in gfs2_log_flush which would never
finish because the io error and subsequent withdraw prevented the
items from being taken off the ail list.

This patch tries to clean up the mess by allowing withdraw situations
to move not-in-flight buffer_heads to the ail2 list, where they will
be dealt with later.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 877610e1e3c0..a36889eea302 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -109,11 +109,17 @@ __acquires(&sdp->sd_ail_lock)
 					      &sdp->sd_flags)) {
 				gfs2_io_error_bh(sdp, bh);
 				set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
+			} else {
+				list_move(&bd->bd_ail_st_list,
+					  &tr->tr_ail2_list);
+				continue;
 			}
-			list_move(&bd->bd_ail_st_list, &tr->tr_ail2_list);
-			continue;
 		}
 
+		if (gfs2_withdrawn(sdp)) {
+			gfs2_remove_from_ail(bd);
+			continue;
+		}
 		if (!buffer_dirty(bh))
 			continue;
 		if (gl == bd->bd_gl)
@@ -846,6 +852,8 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 				if (gfs2_ail1_empty(sdp))
 					break;
 			}
+			if (gfs2_withdrawn(sdp))
+				goto out;
 			atomic_dec(&sdp->sd_log_blks_free); /* Adjust for unreserved buffer */
 			trace_gfs2_log_blocks(sdp, -1);
 			log_write_header(sdp, flags);
@@ -858,6 +866,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 			atomic_set(&sdp->sd_freeze_state, SFS_FROZEN);
 	}
 
+out:
 	trace_gfs2_log_flush(sdp, 0, flags);
 	up_write(&sdp->sd_log_flush_lock);
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 03/32] gfs2: Rework how rgrp buffer_heads are managed
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
  2019-11-13 21:29 ` [Cluster-devel] [PATCH 01/32] gfs2: Introduce concept of a pending withdraw Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 02/32] gfs2: clear ail1 list when gfs2 withdraws Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 04/32] gfs2: fix infinite loop in gfs2_ail1_flush on io error Bob Peterson
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.

When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first calls rgrp_go_sync and then
rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
gfs2_rgrp_brelse() that dropped the buffer_head reference count.
In most cases, the reference count went to zero, which is right.
However, there were other places where the buffers are handled
differently.

After rgrp_go_sync, do_xmote called rgrp_go_inval which called
gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
truncate_inode_pages_range would get rid of the pages in memory,
but only if the reference count drops to 0.

Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.

The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they were likely modified by the other node who
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.

For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.

A big part of the problem was that the rgrp buffer were released
in multiple places: The go_unlock function would release them when
the glock was released rather than when the glock is demoted,
which is clearly wrong because our intent was to cache them until
the glock is demoted from SH or EX.

This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:

1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
   We don't want to release the buffers or zero the pointers when
   syncing for the reasons stated above. It only makes sense to
   release them when the glock is actually invalidated (go_inval).
   And when we do, then we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
   eliminated, as we've talked about doing many times before.
   The go_unlock function was called too early in the glock dq
   process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
   That will now happen automatically when the rgrp glocks are
   demoted, and shouldn't happen any sooner or later than that.
   Instead, function gfs2_clear_rgrpd has been modified to demote
   the rgrp glocks, and therefore, free those pages, before the
   remaining glocks are culled by gfs2_gl_hash_clear. This
   prevents the gl_object from hanging around when the glocks are
   culled.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  |  7 ------
 fs/gfs2/glops.c  |  9 +-------
 fs/gfs2/incore.h |  1 -
 fs/gfs2/rgrp.c   | 60 +++++++++++++++++++++++++++++++-----------------
 fs/gfs2/rgrp.h   |  1 -
 5 files changed, 40 insertions(+), 38 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index faa88bd594e2..809f9cf4239d 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1245,13 +1245,6 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 	list_del_init(&gh->gh_list);
 	clear_bit(HIF_HOLDER, &gh->gh_iflags);
 	if (find_first_holder(gl) == NULL) {
-		if (glops->go_unlock) {
-			GLOCK_BUG_ON(gl, test_and_set_bit(GLF_LOCK, &gl->gl_flags));
-			spin_unlock(&gl->gl_lockref.lock);
-			glops->go_unlock(gh);
-			spin_lock(&gl->gl_lockref.lock);
-			clear_bit(GLF_LOCK, &gl->gl_flags);
-		}
 		if (list_empty(&gl->gl_holders) &&
 		    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
 		    !test_bit(GLF_DEMOTE, &gl->gl_flags))
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4ede1f18de85..dec5e245b991 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,15 +144,9 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct address_space *mapping = &sdp->sd_aspace;
-	struct gfs2_rgrpd *rgd;
+	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
 	int error;

-	spin_lock(&gl->gl_lockref.lock);
-	rgd = gl->gl_object;
-	if (rgd)
-		gfs2_rgrp_brelse(rgd);
-	spin_unlock(&gl->gl_lockref.lock);
-
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
 		return;
 	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -600,7 +594,6 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
 	.go_sync = rgrp_go_sync,
 	.go_inval = rgrp_go_inval,
 	.go_lock = gfs2_rgrp_go_lock,
-	.go_unlock = gfs2_rgrp_go_unlock,
 	.go_dump = gfs2_rgrp_dump,
 	.go_type = LM_TYPE_RGRP,
 	.go_flags = GLOF_LVB,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 6042babb7324..9a50235567f4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -239,7 +239,6 @@ struct gfs2_glock_operations {
 	void (*go_inval) (struct gfs2_glock *gl, int flags);
 	int (*go_demote_ok) (const struct gfs2_glock *gl);
 	int (*go_lock) (struct gfs2_holder *gh);
-	void (*go_unlock) (struct gfs2_holder *gh);
 	void (*go_dump)(struct seq_file *seq, struct gfs2_glock *gl,
 			const char *fs_id_buf);
 	void (*go_callback)(struct gfs2_glock *gl, bool remote);
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 2466bb44a23c..10d3397ed3cd 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -730,8 +730,12 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
 		rb_erase(n, &sdp->sd_rindex_tree);

 		if (gl) {
-			glock_clear_object(gl, rgd);
+			if (gl->gl_state != LM_ST_UNLOCKED) {
+				gfs2_glock_cb(gl, LM_ST_UNLOCKED);
+				flush_delayed_work(&gl->gl_work);
+			}
 			gfs2_rgrp_brelse(rgd);
+			glock_clear_object(gl, rgd);
 			gfs2_glock_put(gl);
 		}

@@ -1283,6 +1287,10 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
  * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
  * @rgd: The resource group
  *
+ * Under normal circumstances, b_count should be 1. However, if we withdraw
+ * there might be buffer_heads stranded that were pinned but never unpinned
+ * because an io error may have resulted in withdraw, which avoids the bulk
+ * of function gfs2_log_flush. Therefore, we need to release all refs.
  */

 void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
@@ -1291,28 +1299,38 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)

 	for (x = 0; x < length; x++) {
 		struct gfs2_bitmap *bi = rgd->rd_bits + x;
-		if (bi->bi_bh) {
-			brelse(bi->bi_bh);
-			bi->bi_bh = NULL;
+		if (bi->bi_bh == NULL)
+			continue;
+
+		smp_mb();
+		if (!gfs2_withdrawn(rgd->rd_sbd) &&
+		    (atomic_read(&bi->bi_bh->b_count) != 1)) {
+			    struct gfs2_sbd *sdp = rgd->rd_gl->gl_name.ln_sbd;
+			    struct gfs2_bufdata *bd = bi->bi_bh->b_private;
+
+			    fs_err(sdp, "blk:0x%llx b_count=%d pinned:%d "
+				   "locked:%d dirty:%d up2dt:%d bd?%d\n",
+				   (unsigned long long)bi->bi_bh->b_blocknr,
+				   atomic_read(&bi->bi_bh->b_count),
+				   buffer_pinned(bi->bi_bh),
+				   buffer_locked(bi->bi_bh),
+				   buffer_dirty(bi->bi_bh),
+				   buffer_uptodate(bi->bi_bh), bd ? 1 : 0);
+			    if (bd)
+				    fs_err(sdp, "bd_blkno:0x%llx bd_tr:%p bd_list:%d "
+					   "st_list:%d gl_list:%d\n",
+					   (unsigned long long)bd->bd_blkno,
+					   bd->bd_tr,
+					   list_empty(&bd->bd_list)?0:1,
+					   list_empty(&bd->bd_ail_st_list)?0:1,
+					   list_empty(&bd->bd_ail_gl_list)?0:1);
+			    gfs2_assert_withdraw(rgd->rd_sbd,
+					(atomic_read(&bi->bi_bh->b_count) == 1));
 		}
+		brelse(bi->bi_bh);
+		bi->bi_bh = NULL;
 	}
-
-}
-
-/**
- * gfs2_rgrp_go_unlock - Unlock a rgrp glock
- * @gh: The glock holder for the resource group
- *
- */
-
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
-{
-	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
-	int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
-		test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
-
-	if (rgd && demote_requested)
-		gfs2_rgrp_brelse(rgd);
+	smp_mb();
 }

 int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index c14a673ae36f..a584f3096418 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -33,7 +33,6 @@ extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
 extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
 extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
 extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
-extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);

 extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);

-- 
2.23.0

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 04/32] gfs2: fix infinite loop in gfs2_ail1_flush on io error
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (2 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 03/32] gfs2: Rework how rgrp buffer_heads are managed Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 05/32] gfs2: log error reform Bob Peterson
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, an IO error encountered in function gfs2_ail1_flush
would cause a deadlock: because of the io error (and its resulting
withdrawn state), buffers stopped being written to the journal.
Buffers would remain on the ail1 list, so gfs2_ail1_start_one would
return 1 to indicate dirty buffers were still on the ail1 list.
However, when function gfs2_ail1_flush got a non-zero return code,
it would goto restart to retry the writes, which meant it would never
finish, and thus the infinite loop.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index a36889eea302..6d6013c05e05 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -163,7 +163,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	list_for_each_entry_reverse(tr, head, tr_list) {
 		if (wbc->nr_to_write <= 0)
 			break;
-		if (gfs2_ail1_start_one(sdp, wbc, tr))
+		if (gfs2_ail1_start_one(sdp, wbc, tr) && !gfs2_withdrawn(sdp))
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 05/32] gfs2: log error reform
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (3 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 04/32] gfs2: fix infinite loop in gfs2_ail1_flush on io error Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 06/32] gfs2: Only complain the first time an io error occurs in quota or log Bob Peterson
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, gfs2 kept track of journal io errors in two
places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
This patch consolidates the two into sd_log_error so that it
reflects the first error encountered writing to the journal.
In future patches, we will take advantage of this by checking
this value rather than having to check both when reacting to
io errors.

In addition, this fixes a tight loop in unmount: If buffers
get on the ail1 list and an io error occurs elsewhere, the
ail1 list would never be cleared because they were always busy.
So unmount would hang, waiting for the ail1 list to empty.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h |  7 +++----
 fs/gfs2/log.c    | 16 ++++++++++++----
 fs/gfs2/quota.c  |  2 +-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 9a50235567f4..d37698502d5d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -616,9 +616,8 @@ enum {
 	SDF_RORECOVERY		= 7, /* read only recovery */
 	SDF_SKIP_DLM_UNLOCK	= 8,
 	SDF_FORCE_AIL_FLUSH     = 9,
-	SDF_AIL1_IO_ERROR	= 10,
-	SDF_FS_FROZEN           = 11,
-	SDF_WITHDRAWING		= 12, /* Will withdraw eventually */
+	SDF_FS_FROZEN           = 10,
+	SDF_WITHDRAWING		= 11, /* Will withdraw eventually */
 };
 
 enum gfs2_freeze_state {
@@ -828,7 +827,7 @@ struct gfs2_sbd {
 	atomic_t sd_log_in_flight;
 	struct bio *sd_log_bio;
 	wait_queue_head_t sd_log_flush_wait;
-	int sd_log_error;
+	int sd_log_error; /* First log error */
 
 	atomic_t sd_reserving_log;
 	wait_queue_head_t sd_reserving_log_wait;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 6d6013c05e05..e8b1a7a2341b 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -105,8 +105,7 @@ __acquires(&sdp->sd_ail_lock)
 
 		if (!buffer_busy(bh)) {
 			if (!buffer_uptodate(bh) &&
-			    !test_and_set_bit(SDF_AIL1_IO_ERROR,
-					      &sdp->sd_flags)) {
+			    !cmpxchg(&sdp->sd_log_error, 0, -EIO)) {
 				gfs2_io_error_bh(sdp, bh);
 				set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
 			} else {
@@ -206,10 +205,19 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 					 bd_ail_st_list) {
 		bh = bd->bd_bh;
 		gfs2_assert(sdp, bd->bd_tr == tr);
-		if (buffer_busy(bh))
+		/*
+		 * If another process flagged an io error, e.g. writing to the
+		 * journal, error all other bhs and move them off the ail1 to
+		 * prevent a tight loop when unmount tries to flush ail1,
+		 * regardless of whether they're still busy. If no outside
+		 * errors were found and the buffer is busy, move to the next.
+		 * If the ail buffer is not busy and caught an error, flag it
+		 * for others.
+		 */
+		if (!sdp->sd_log_error && buffer_busy(bh))
 			continue;
 		if (!buffer_uptodate(bh) &&
-		    !test_and_set_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		    !cmpxchg(&sdp->sd_log_error, 0, -EIO)) {
 			gfs2_io_error_bh(sdp, bh);
 			set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
 		}
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index e9f93045eb01..ca2194cfa38e 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1477,7 +1477,7 @@ static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error)
 		return;
 	if (!gfs2_withdrawn(sdp)) {
 		fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
-		sdp->sd_log_error = error;
+		cmpxchg(&sdp->sd_log_error, 0, error);
 		wake_up(&sdp->sd_logd_waitq);
 	}
 }
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 06/32] gfs2: Only complain the first time an io error occurs in quota or log
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (4 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 05/32] gfs2: log error reform Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 07/32] gfs2: Ignore dlm recovery requests if gfs2 is withdrawn Bob Peterson
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, all io errors received by the quota daemon or the
logd daemon would cause a complaint message to be issued, such as:

   gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0

This patch changes it so that the error message is only issued the
first time the error is encountered.

Also, before this patch function gfs2_end_log_write did not set the
sd_log_error value, so log errors would not cause the file system to
be withdrawn. This patch sets the error code so the file system is
properly withdrawn if an io error is encountered writing to the journal.

WARNING: This change in function breaks check xfstests generic/441
and causes it to fail: io errors writing to the log should cause a
file system to be withdrawn, and no further operations are tolerated.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/lops.c  | 5 +++--
 fs/gfs2/quota.c | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 313b83ef6657..d3c73b1771b6 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -203,8 +203,9 @@ static void gfs2_end_log_write(struct bio *bio)
 	struct bvec_iter_all iter_all;
 
 	if (bio->bi_status) {
-		fs_err(sdp, "Error %d writing to journal, jid=%u\n",
-		       bio->bi_status, sdp->sd_jdesc->jd_jid);
+		if (!cmpxchg(&sdp->sd_log_error, 0, (int)bio->bi_status))
+			fs_err(sdp, "Error %d writing to journal, jid=%u\n",
+			       bio->bi_status, sdp->sd_jdesc->jd_jid);
 		wake_up(&sdp->sd_logd_waitq);
 	}
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index ca2194cfa38e..dbe87b2b55af 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1476,8 +1476,8 @@ static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error)
 	if (error == 0 || error == -EROFS)
 		return;
 	if (!gfs2_withdrawn(sdp)) {
-		fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
-		cmpxchg(&sdp->sd_log_error, 0, error);
+		if (!cmpxchg(&sdp->sd_log_error, 0, error))
+			fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
 		wake_up(&sdp->sd_logd_waitq);
 	}
 }
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 07/32] gfs2: Ignore dlm recovery requests if gfs2 is withdrawn
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (5 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 06/32] gfs2: Only complain the first time an io error occurs in quota or log Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 08/32] gfs2: move check_journal_clean to util.c for future use Bob Peterson
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When a node fails, user space informs dlm of the node failure,
and dlm instructs gfs2 on the surviving nodes to perform journal
recovery. It does this by calling various callback functions in
lock_dlm.c. To mark its progress, it keeps generation numbers
and recover bits in a dlm "control" lock lvb, which is seen by
all nodes to determine which journals need to be replayed.

The gfs2 on all nodes get the same recovery requests from dlm,
so they all try to do the recovery, but only one will be
granted the exclusive lock on the journal. The others fail
with a "Busy" message on their "try lock."

However, when a node is withdrawn, it cannot safely do any
recovery or safely replay any journals. To make matters worse,
gfs2 might withdraw as a result of attempting recovery. For
example, this might happen if the device goes offline, or if
an hba fails. But in today's gfs2 code, it doesn't check for
being withdrawn at any step in the recovery process. What's
worse if that these callbacks from dlm have no return code,
so there is no way to indicate failure back to dlm. We can
send a "Recovery failed" uevent eventually, but that tells
user space what happened, not dlm's kernel code.

Before this patch, lock_dlm would perform its recovery steps but
ignore the result, and eventually it would still update its
generation number in the lvb, despite the fact that it may have
withdrawn or encountered an error. The other nodes would then
see the newer generation number in the lvb and conclude that
they don't need to do recovery because the generation number
is newer than the last one they saw. They think a different
node has already recovered the journal.

This patch adds checks to several of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
is withdrawn. That prevents the lvb bits from being updated, and
therefore dlm and user space still see the need for recovery to
take place.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/lock_dlm.c | 18 ++++++++++++++++++
 fs/gfs2/recovery.c |  5 +++++
 2 files changed, 23 insertions(+)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 7c7197343ee2..57fdf53d2246 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1079,6 +1079,10 @@ static void gdlm_recover_prep(void *arg)
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;

+	if (gfs2_withdrawn(sdp)) {
+		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	ls->ls_recover_block = ls->ls_recover_start;
 	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
@@ -1101,6 +1105,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 	int jid = slot->slot - 1;

+	if (gfs2_withdrawn(sdp)) {
+		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	if (ls->ls_recover_size < jid + 1) {
 		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1125,6 +1134,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;

+	if (gfs2_withdrawn(sdp)) {
+		fs_err(sdp, "recover_done ignored due to withdraw.\n");
+		return;
+	}
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);

@@ -1152,6 +1165,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
 {
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;

+	if (gfs2_withdrawn(sdp)) {
+		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
 		return;

diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index 85f830e56945..f1762f2daf50 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -306,6 +306,11 @@ void gfs2_recover_func(struct work_struct *work)
 	int jlocked = 0;

 	t_start = ktime_get();
+	if (gfs2_withdrawn(sdp)) {
+		fs_err(sdp, "jid=%u: Recovery not attempted due to withdraw.\n",
+		       jd->jd_jid);
+		goto fail;
+	}
 	if (sdp->sd_args.ar_spectator)
 		goto fail;
 	if (jd->jd_jid != sdp->sd_lockstruct.ls_jid) {
-- 
2.23.0

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 08/32] gfs2: move check_journal_clean to util.c for future use
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (6 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 07/32] gfs2: Ignore dlm recovery requests if gfs2 is withdrawn Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 09/32] gfs2: Allow some glocks to be used during withdraw Bob Peterson
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch function check_journal_clean was in ops_fstype.c.
This patch moves it to util.c so we can make use of it elsewhere
in a future patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/ops_fstype.c | 42 -----------------------------------------
 fs/gfs2/util.c       | 45 ++++++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/util.h       |  1 +
 3 files changed, 46 insertions(+), 42 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e8b7b0ce8404..fc3a560793c8 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -598,48 +598,6 @@ static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct gfs2_holder *ji_gh)
 	return error;
 }
 
-/**
- * check_journal_clean - Make sure a journal is clean for a spectator mount
- * @sdp: The GFS2 superblock
- * @jd: The journal descriptor
- *
- * Returns: 0 if the journal is clean or locked, else an error
- */
-static int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
-{
-	int error;
-	struct gfs2_holder j_gh;
-	struct gfs2_log_header_host head;
-	struct gfs2_inode *ip;
-
-	ip = GFS2_I(jd->jd_inode);
-	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
-				   GL_EXACT | GL_NOCACHE, &j_gh);
-	if (error) {
-		fs_err(sdp, "Error locking journal for spectator mount.\n");
-		return -EPERM;
-	}
-	error = gfs2_jdesc_check(jd);
-	if (error) {
-		fs_err(sdp, "Error checking journal for spectator mount.\n");
-		goto out_unlock;
-	}
-	error = gfs2_find_jhead(jd, &head, false);
-	if (error) {
-		fs_err(sdp, "Error parsing journal for spectator mount.\n");
-		goto out_unlock;
-	}
-	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
-		error = -EPERM;
-		fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter "
-		       "must not be a spectator.\n", jd->jd_jid);
-	}
-
-out_unlock:
-	gfs2_glock_dq_uninit(&j_gh);
-	return error;
-}
-
 static int init_journal(struct gfs2_sbd *sdp, int undo)
 {
 	struct inode *master = d_inode(sdp->sd_master_dir);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 7305a7036c3e..13068968a958 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -16,7 +16,10 @@
 #include "gfs2.h"
 #include "incore.h"
 #include "glock.h"
+#include "lops.h"
+#include "recovery.h"
 #include "rgrp.h"
+#include "super.h"
 #include "util.h"
 
 struct kmem_cache *gfs2_glock_cachep __read_mostly;
@@ -33,6 +36,48 @@ void gfs2_assert_i(struct gfs2_sbd *sdp)
 	fs_emerg(sdp, "fatal assertion failed\n");
 }
 
+/**
+ * check_journal_clean - Make sure a journal is clean for a spectator mount
+ * @sdp: The GFS2 superblock
+ * @jd: The journal descriptor
+ *
+ * Returns: 0 if the journal is clean or locked, else an error
+ */
+int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
+{
+	int error;
+	struct gfs2_holder j_gh;
+	struct gfs2_log_header_host head;
+	struct gfs2_inode *ip;
+
+	ip = GFS2_I(jd->jd_inode);
+	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
+				   GL_EXACT | GL_NOCACHE, &j_gh);
+	if (error) {
+		fs_err(sdp, "Error locking journal for spectator mount.\n");
+		return -EPERM;
+	}
+	error = gfs2_jdesc_check(jd);
+	if (error) {
+		fs_err(sdp, "Error checking journal for spectator mount.\n");
+		goto out_unlock;
+	}
+	error = gfs2_find_jhead(jd, &head, false);
+	if (error) {
+		fs_err(sdp, "Error parsing journal for spectator mount.\n");
+		goto out_unlock;
+	}
+	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
+		error = -EPERM;
+		fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter "
+		       "must not be a spectator.\n", jd->jd_jid);
+	}
+
+out_unlock:
+	gfs2_glock_dq_uninit(&j_gh);
+	return error;
+}
+
 int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 {
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 858ab5b15a6c..babdae6476ee 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -128,6 +128,7 @@ static inline void gfs2_metatype_set(struct buffer_head *bh, u16 type,
 
 int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function,
 		    char *file, unsigned int line);
+int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd);
 
 #define gfs2_io_error(sdp) \
 gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 09/32] gfs2: Allow some glocks to be used during withdraw
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (7 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 08/32] gfs2: move check_journal_clean to util.c for future use Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 10/32] gfs2: Don't loop forever in gfs2_freeze if withdrawn Bob Peterson
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 29 +++++++++++++++++++++++++++--
 fs/gfs2/glops.c  |  4 ++--
 fs/gfs2/incore.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 809f9cf4239d..15ff968e8192 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -133,6 +133,31 @@ static void gfs2_glock_dealloc(struct rcu_head *rcu)
 	}
 }
 
+/**
+ * allow_while_withdrawn - determine if we can use this glock while withdrawn
+ * @gl: the glock
+ *
+ * We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
+ * when we're withdrawn. For example, to maintain metadata integrity, we should
+ * disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
+ * iopen or the transaction glocks may be safely used because none of their
+ * metadata goes through the journal. So in general, we should disallow all
+ * glocks that are journaled, and allow all the others. One exception is:
+ * we need to allow our active journal to be demoted so others may recover it.
+ */
+static bool allow_while_withdrawn(struct gfs2_glock *gl)
+{
+	struct gfs2_sbd *sdp;
+
+	if (!(gl->gl_ops->go_flags & GLOF_JOURNALED))
+		return true;
+
+	sdp = gl->gl_name.ln_sbd;
+	if (!sdp->sd_jdesc || gl->gl_object == sdp->sd_jdesc->jd_inode)
+		return true;
+	return false;
+}
+
 void gfs2_glock_free(struct gfs2_glock *gl)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -549,7 +574,7 @@ __acquires(&gl->gl_lockref.lock)
 	unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
 	int ret;
 
-	if (unlikely(gfs2_withdrawn(sdp)) &&
+	if (unlikely(gfs2_withdrawn(sdp)) && !allow_while_withdrawn(gl) &&
 	    target != LM_ST_UNLOCKED)
 		return;
 	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -1190,7 +1215,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	int error = 0;
 
-	if (unlikely(gfs2_withdrawn(sdp)))
+	if (unlikely(gfs2_withdrawn(sdp)) && !allow_while_withdrawn(gl))
 		return -EIO;
 
 	if (test_bit(GLF_LRU, &gl->gl_flags))
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index dec5e245b991..04f55e5b8bf1 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -587,7 +587,7 @@ const struct gfs2_glock_operations gfs2_inode_glops = {
 	.go_lock = inode_go_lock,
 	.go_dump = inode_go_dump,
 	.go_type = LM_TYPE_INODE,
-	.go_flags = GLOF_ASPACE | GLOF_LRU,
+	.go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_JOURNALED,
 };
 
 const struct gfs2_glock_operations gfs2_rgrp_glops = {
@@ -596,7 +596,7 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
 	.go_lock = gfs2_rgrp_go_lock,
 	.go_dump = gfs2_rgrp_dump,
 	.go_type = LM_TYPE_RGRP,
-	.go_flags = GLOF_LVB,
+	.go_flags = GLOF_LVB | GLOF_JOURNALED,
 };
 
 const struct gfs2_glock_operations gfs2_freeze_glops = {
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index d37698502d5d..f6ec52776408 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -247,6 +247,7 @@ struct gfs2_glock_operations {
 #define GLOF_ASPACE 1
 #define GLOF_LVB    2
 #define GLOF_LRU    4
+#define GLOF_JOURNALED 8 /* goes through the journal */
 };
 
 enum {
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 10/32] gfs2: Don't loop forever in gfs2_freeze if withdrawn
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (8 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 09/32] gfs2: Allow some glocks to be used during withdraw Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 11/32] gfs2: Make secondary withdrawers wait for first withdrawer Bob Peterson
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_freeze would loop forever if the
file system trying to be frozen is withdrawn. That's because function
gfs2_lock_fs_check_clean tries to enqueue the glock of the journal
and the gfs2_glock returns -EIO because you can't enqueue a journaled
glock after a withdraw.

This patch moves the check for file system withdraw inside the loop
so that the loop can end when withdraw occurs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/super.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 478015bc6890..8154c38e488b 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -761,12 +761,12 @@ static int gfs2_freeze(struct super_block *sb)
 	if (atomic_read(&sdp->sd_freeze_state) != SFS_UNFROZEN)
 		goto out;
 
-	if (gfs2_withdrawn(sdp)) {
-		error = -EINVAL;
-		goto out;
-	}
-
 	for (;;) {
+		if (gfs2_withdrawn(sdp)) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		error = gfs2_lock_fs_check_clean(sdp, &sdp->sd_freeze_gh);
 		if (!error)
 			break;
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 11/32] gfs2: Make secondary withdrawers wait for first withdrawer
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (9 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 10/32] gfs2: Don't loop forever in gfs2_freeze if withdrawn Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 12/32] gfs2: Don't write log headers after file system withdraw Bob Peterson
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, if a process encountered an error and decided to
withdraw, if another process was already in the process of withdrawing,
the secondary withdraw would be silently ignored, which set it free
to proceed with its processing, unlock any locks, etc. That's correct
behavior if the original withdrawer encounters further errors down
the road. However, second withdrawers need to wait for the first
withdrawer to finish its withdraw before proceeding. If we don't wait
we could end up assuming everything is alright, unlock glocks and
telling other nodes they can have the glock, despite the fact that
a withdraw is still ongoing and may require a journal replay before
any locks are released. For example, if an rgrp glock is freed
by a process that didn't wait for the withdraw, a journal replay
could introduce file system corruption by replaying a rgrp block
that has already been granted to another node.

This patch makes secondary withdrawers wait until the primary
withdrawer is finished with its processing before proceeding.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h |  3 +++
 fs/gfs2/util.c   | 21 +++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index f6ec52776408..6e713bf536a1 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -619,6 +619,7 @@ enum {
 	SDF_FORCE_AIL_FLUSH     = 9,
 	SDF_FS_FROZEN           = 10,
 	SDF_WITHDRAWING		= 11, /* Will withdraw eventually */
+	SDF_WITHDRAW_IN_PROG	= 12, /* Withdraw is in progress */
 };
 
 enum gfs2_freeze_state {
@@ -829,6 +830,8 @@ struct gfs2_sbd {
 	struct bio *sd_log_bio;
 	wait_queue_head_t sd_log_flush_wait;
 	int sd_log_error; /* First log error */
+	atomic_t sd_withdrawer;
+	wait_queue_head_t sd_withdraw_wait;
 
 	atomic_t sd_reserving_log;
 	wait_queue_head_t sd_reserving_log_wait;
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 13068968a958..4ef5218303d7 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -86,9 +86,23 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 	struct va_format vaf;
 
 	if (sdp->sd_args.ar_errors == GFS2_ERRORS_WITHDRAW &&
-	    test_and_set_bit(SDF_WITHDRAWN, &sdp->sd_flags))
-		return 0;
+	    test_and_set_bit(SDF_WITHDRAWN, &sdp->sd_flags)) {
+		if (!test_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags))
+			return -1;
+
+		fs_warn(sdp, "Pid %d waiting for process %d to withdraw.\n",
+			pid_nr(task_pid(current)),
+			atomic_read(&sdp->sd_withdrawer));
+		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_IN_PROG,
+			    TASK_UNINTERRUPTIBLE);
+		fs_warn(sdp, "Pid %d done waiting for process %d.\n",
+			pid_nr(task_pid(current)),
+			atomic_read(&sdp->sd_withdrawer));
+		return -1;
+	}
 
+	set_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
+	atomic_set(&sdp->sd_withdrawer, pid_nr(task_pid(current)));
 	if (fmt) {
 		va_start(args, fmt);
 
@@ -116,6 +130,9 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 		set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
 		fs_err(sdp, "withdrawn\n");
 		dump_stack();
+		clear_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
+		smp_mb__after_atomic();
+		wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_IN_PROG);
 	}
 
 	if (sdp->sd_args.ar_errors == GFS2_ERRORS_PANIC)
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 12/32] gfs2: Don't write log headers after file system withdraw
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (10 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 11/32] gfs2: Make secondary withdrawers wait for first withdrawer Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 13/32] gfs2: Force withdraw to replay journals and wait for it to finish Bob Peterson
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, when a node withdrew a gfs2 file system, it
wrote a (clean) unmount log header. That's wrong. You don't want
to write anything to the journal once you're withdrawn because
that's acknowledging that the transaction is complete and the
journal is in good shape, neither of which may be a valid
assumption when the file system is withdrawn. This is especially
true if the withdraw was caused due to io errors writing to the
journal in the first place. The best course of action is to leave
the journal "as is" until it may be safely replayed during
journal recovery, regardless of whether it's done by this node or
another.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index e8b1a7a2341b..735291a93e1a 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -692,12 +692,16 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
 {
 	struct gfs2_log_header *lh;
 	u32 hash, crc;
-	struct page *page = mempool_alloc(gfs2_page_pool, GFP_NOIO);
+	struct page *page;
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
 	struct timespec64 tv;
 	struct super_block *sb = sdp->sd_vfs;
 	u64 dblock;
 
+	if (gfs2_withdrawn(sdp))
+		goto out;
+
+	page = mempool_alloc(gfs2_page_pool, GFP_NOIO);
 	lh = page_address(page);
 	clear_page(lh);
 
@@ -750,6 +754,7 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
 
 	gfs2_log_write(sdp, page, sb->s_blocksize, 0, dblock);
 	gfs2_log_submit_bio(&sdp->sd_log_bio, REQ_OP_WRITE | op_flags);
+out:
 	log_flush_wait(sdp);
 }
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 13/32] gfs2: Force withdraw to replay journals and wait for it to finish
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (11 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 12/32] gfs2: Don't write log headers after file system withdraw Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 14/32] gfs2: fix infinite loop when checking ail item count before go_inval Bob Peterson
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.

This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.

Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.

The "live" glock which is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only just used to indicate a
withdraw has occurred.

Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.

Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.

We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c      |  24 ++++++-
 fs/gfs2/glops.c      |  77 ++++++++++++++++++++++-
 fs/gfs2/incore.h     |   7 +++
 fs/gfs2/lock_dlm.c   |  34 ++++++++++
 fs/gfs2/meta_io.c    |   2 +-
 fs/gfs2/ops_fstype.c |   4 +-
 fs/gfs2/quota.c      |   4 ++
 fs/gfs2/super.c      |  67 ++++++++++++++------
 fs/gfs2/util.c       | 145 ++++++++++++++++++++++++++++++++++++++++++-
 9 files changed, 340 insertions(+), 24 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 15ff968e8192..d8e48b72026e 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -269,7 +269,7 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
 	gfs2_glock_remove_from_lru(gl);
 	spin_unlock(&gl->gl_lockref.lock);
 	GLOCK_BUG_ON(gl, !list_empty(&gl->gl_holders));
-	GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
+	GLOCK_BUG_ON(gl, mapping && mapping->nrpages && !gfs2_withdrawn(sdp));
 	trace_gfs2_glock_put(gl);
 	sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
 }
@@ -575,6 +575,7 @@ __acquires(&gl->gl_lockref.lock)
 	int ret;
 
 	if (unlikely(gfs2_withdrawn(sdp)) && !allow_while_withdrawn(gl) &&
+	    (gh && !(LM_FLAG_NOEXP & gh->gh_flags)) &&
 	    target != LM_ST_UNLOCKED)
 		return;
 	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -1215,7 +1216,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	int error = 0;
 
-	if (unlikely(gfs2_withdrawn(sdp)) && !allow_while_withdrawn(gl))
+	if (unlikely(gfs2_withdrawn(sdp)) && !allow_while_withdrawn(gl) &&
+	    !(LM_FLAG_NOEXP & gh->gh_flags))
 		return -EIO;
 
 	if (test_bit(GLF_LRU, &gl->gl_flags))
@@ -1259,11 +1261,29 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
 void gfs2_glock_dq(struct gfs2_holder *gh)
 {
 	struct gfs2_glock *gl = gh->gh_gl;
+	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	const struct gfs2_glock_operations *glops = gl->gl_ops;
 	unsigned delay = 0;
 	int fast_path = 0;
 
 	spin_lock(&gl->gl_lockref.lock);
+	/*
+	 * If we're in the process of file system withdraw, we cannot just
+	 * dequeue any glocks until our journal is recovered, lest we
+	 * introduce file system corruption. We need two exceptions to this
+	 * rule: We need to allow unlocking of nondisk glocks and the glock
+	 * for our own journal that needs recovery. Note that evict may also
+	 * call this for the journal inode, but without using sdp->sd_jinode_gh
+	 * so we need to check the glock, not just the holder of the glock.
+	 */
+	if (test_bit(SDF_WITHDRAWN, &sdp->sd_flags) &&
+	    test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
+	    !allow_while_withdrawn(gl) && gh->gh_gl != sdp->sd_jinode_gl) {
+		sdp->sd_glock_dqs_held++;
+		might_sleep();
+		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
+			    TASK_UNINTERRUPTIBLE);
+	}
 	if (gh->gh_flags & GL_NOCACHE)
 		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
 
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 04f55e5b8bf1..3904e4892a65 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -29,6 +29,8 @@
 
 struct workqueue_struct *gfs2_freeze_wq;
 
+extern struct workqueue_struct *gfs2_control_wq;
+
 static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
 {
 	fs_err(gl->gl_name.ln_sbd,
@@ -495,13 +497,17 @@ static void freeze_go_sync(struct gfs2_glock *gl)
 	int error = 0;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 
-	if (gl->gl_state == LM_ST_SHARED &&
+	if (gl->gl_state == LM_ST_SHARED && !gfs2_withdrawn(sdp) &&
 	    test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
 		atomic_set(&sdp->sd_freeze_state, SFS_STARTING_FREEZE);
 		error = freeze_super(sdp->sd_vfs);
 		if (error) {
 			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
 				error);
+			if (gfs2_withdrawn(sdp)) {
+				atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
+				return;
+			}
 			gfs2_assert_withdraw(sdp, 0);
 		}
 		queue_work(gfs2_freeze_wq, &sdp->sd_freeze_work);
@@ -576,6 +582,73 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
 	}
 }
 
+/**
+ * nondisk_go_callback - used to signal when a node did a withdraw
+ * @gl: the nondisk glock
+ * @remote: true if this came from a different cluster node
+ *
+ */
+static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
+{
+	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+
+	/* Ignore the callback unless it's from another node, and it's the
+	   live lock. */
+	if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
+		return;
+
+	/* First order of business is to cancel the demote request. We don't
+	 * really want to demote a nondisk glock. At best it's just to inform
+	 * us of a another node's withdraw. We'll keep it in SH mode. */
+	clear_bit(GLF_DEMOTE, &gl->gl_flags);
+	clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
+
+	/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
+	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
+	    test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
+	    test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
+		return;
+
+	/* We only care when a node wants us to unlock, because that means
+	 * they want a journal recovered. */
+	if (gl->gl_demote_state != LM_ST_UNLOCKED)
+		return;
+
+	if (sdp->sd_args.ar_spectator) {
+		fs_warn(sdp, "Spectator node cannot recover journals.\n");
+		return;
+	}
+
+	fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
+	set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
+	/*
+	 * We can't call remote_withdraw directly here or gfs2_recover_journal
+	 * because this is called from the glock unlock function and the
+	 * remote_withdraw needs to enqueue and dequeue the same "live" glock
+	 * we were called from. So we queue it to the control work queue in
+	 * lock_dlm.
+	 */
+	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
+}
+
+/**
+ * inode_go_free - wake up anyone waiting for dlm's unlock ast to free it
+ * @gl: glock being freed
+ *
+ * For now, this is only used for the journal inode glock. In withdraw
+ * situations, we need to wait for the glock to be freed so that we know
+ * other nodes may proceed with recovery / journal replay.
+ */
+static void inode_go_free(struct gfs2_glock *gl)
+{
+	/* Note that we cannot reference gl_object because it's already set
+	 * to NULL by this point in its lifecycle. */
+	if (!test_bit(GLF_FREEING, &gl->gl_flags))
+		return;
+	clear_bit_unlock(GLF_FREEING, &gl->gl_flags);
+	wake_up_bit(&gl->gl_flags, GLF_FREEING);
+}
+
 const struct gfs2_glock_operations gfs2_meta_glops = {
 	.go_type = LM_TYPE_META,
 };
@@ -588,6 +661,7 @@ const struct gfs2_glock_operations gfs2_inode_glops = {
 	.go_dump = inode_go_dump,
 	.go_type = LM_TYPE_INODE,
 	.go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_JOURNALED,
+	.go_free = inode_go_free,
 };
 
 const struct gfs2_glock_operations gfs2_rgrp_glops = {
@@ -619,6 +693,7 @@ const struct gfs2_glock_operations gfs2_flock_glops = {
 
 const struct gfs2_glock_operations gfs2_nondisk_glops = {
 	.go_type = LM_TYPE_NONDISK,
+	.go_callback = nondisk_go_callback,
 };
 
 const struct gfs2_glock_operations gfs2_quota_glops = {
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 6e713bf536a1..a15ddd2f9bf4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -242,6 +242,7 @@ struct gfs2_glock_operations {
 	void (*go_dump)(struct seq_file *seq, struct gfs2_glock *gl,
 			const char *fs_id_buf);
 	void (*go_callback)(struct gfs2_glock *gl, bool remote);
+	void (*go_free)(struct gfs2_glock *gl);
 	const int go_type;
 	const unsigned long go_flags;
 #define GLOF_ASPACE 1
@@ -343,6 +344,7 @@ enum {
 	GLF_OBJECT			= 14, /* Used only for tracing */
 	GLF_BLOCKING			= 15,
 	GLF_INODE_CREATING		= 16, /* Inode creation occurring */
+	GLF_FREEING			= 18, /* Wait for glock to be freed */
 };
 
 struct gfs2_glock {
@@ -620,6 +622,9 @@ enum {
 	SDF_FS_FROZEN           = 10,
 	SDF_WITHDRAWING		= 11, /* Will withdraw eventually */
 	SDF_WITHDRAW_IN_PROG	= 12, /* Withdraw is in progress */
+	SDF_REMOTE_WITHDRAW	= 13, /* Performing remote recovery */
+	SDF_WITHDRAW_RECOVERY	= 14, /* Wait for journal recovery when we are
+					 withdrawing */
 };
 
 enum gfs2_freeze_state {
@@ -769,6 +774,7 @@ struct gfs2_sbd {
 	struct gfs2_jdesc *sd_jdesc;
 	struct gfs2_holder sd_journal_gh;
 	struct gfs2_holder sd_jinode_gh;
+	struct gfs2_glock *sd_jinode_gl;
 
 	struct gfs2_holder sd_sc_gh;
 	struct gfs2_holder sd_qc_gh;
@@ -855,6 +861,7 @@ struct gfs2_sbd {
 
 	unsigned long sd_last_warning;
 	struct dentry *debugfs_dir;    /* debugfs directory */
+	unsigned long sd_glock_dqs_held;
 };
 
 static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 57fdf53d2246..9f2b5609f225 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -16,6 +16,8 @@
 
 #include "incore.h"
 #include "glock.h"
+#include "glops.h"
+#include "recovery.h"
 #include "util.h"
 #include "sys.h"
 #include "trace_gfs2.h"
@@ -124,6 +126,8 @@ static void gdlm_ast(void *arg)
 
 	switch (gl->gl_lksb.sb_status) {
 	case -DLM_EUNLOCK: /* Unlocked, so glock can be freed */
+		if (gl->gl_ops->go_free)
+			gl->gl_ops->go_free(gl);
 		gfs2_glock_free(gl);
 		return;
 	case -DLM_ECANCEL: /* Cancel while getting lock */
@@ -323,6 +327,7 @@ static void gdlm_cancel(struct gfs2_glock *gl)
 /*
  * dlm/gfs2 recovery coordination using dlm_recover callbacks
  *
+ *  0. gfs2 checks for another cluster node withdraw, needing journal replay
  *  1. dlm_controld sees lockspace members change
  *  2. dlm_controld blocks dlm-kernel locking activity
  *  3. dlm_controld within dlm-kernel notifies gfs2 (recover_prep)
@@ -571,6 +576,28 @@ static int control_lock(struct gfs2_sbd *sdp, int mode, uint32_t flags)
 			 &ls->ls_control_lksb, "control_lock");
 }
 
+/**
+ * remote_withdraw - react to a node withdrawing from the file system
+ * @sdp: The superblock
+ */
+static void remote_withdraw(struct gfs2_sbd *sdp)
+{
+	struct gfs2_jdesc *jd;
+	int ret = 0, count = 0;
+
+	list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
+		if (jd->jd_jid == sdp->sd_lockstruct.ls_jid)
+			continue;
+		ret = gfs2_recover_journal(jd, true);
+		if (ret)
+			break;
+		count++;
+	}
+
+	/* Now drop the additional reference we acquired */
+	fs_err(sdp, "Journals checked: %d, ret = %d.\n", count, ret);
+}
+
 static void gfs2_control_func(struct work_struct *work)
 {
 	struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd, sd_control_work.work);
@@ -581,6 +608,13 @@ static void gfs2_control_func(struct work_struct *work)
 	int recover_size;
 	int i, error;
 
+	/* First check for other nodes that may have done a withdraw. */
+	if (test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags)) {
+		remote_withdraw(sdp);
+		clear_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
+		return;
+	}
+
 	spin_lock(&ls->ls_recover_spin);
 	/*
 	 * No MOUNT_DONE means we're still mounting; control_mount()
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 0c3772974030..6bdbdb9b3cd5 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -251,7 +251,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 	struct buffer_head *bh, *bhs[2];
 	int num = 0;
 
-	if (unlikely(gfs2_withdrawn(sdp))) {
+	if (unlikely(gfs2_withdrawn(sdp)) && gl != sdp->sd_jinode_gl) {
 		*bhp = NULL;
 		return -EIO;
 	}
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index fc3a560793c8..d49741bf5a0f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -650,7 +650,8 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 
 		error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
 					  &gfs2_journal_glops,
-					  LM_ST_EXCLUSIVE, LM_FLAG_NOEXP,
+					  LM_ST_EXCLUSIVE,
+					  LM_FLAG_NOEXP | GL_NOCACHE,
 					  &sdp->sd_journal_gh);
 		if (error) {
 			fs_err(sdp, "can't acquire journal glock: %d\n", error);
@@ -658,6 +659,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 		}
 
 		ip = GFS2_I(sdp->sd_jdesc->jd_inode);
+		sdp->sd_jinode_gl = ip->i_gl;
 		error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED,
 					   LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 					   &sdp->sd_jinode_gh);
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index dbe87b2b55af..3fb5bfdb1c64 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1572,6 +1572,10 @@ int gfs2_quotad(void *data)
 		else
 			t = 0;
 		finish_wait(&sdp->sd_quota_wait, &wait);
+
+		if (atomic_read(&sdp->sd_withdrawer) ==
+		    pid_nr(task_pid(current)))
+			break;
 	}
 
 	return 0;
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 8154c38e488b..a696663bf5e5 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -171,9 +171,13 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 		goto fail_threads;
 
 	j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
+	if (gfs2_withdrawn(sdp)) {
+		error = -EIO;
+		goto fail;
+	}
 
 	error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
-	if (error)
+	if (error || gfs2_withdrawn(sdp))
 		goto fail;
 
 	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
@@ -187,7 +191,7 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 	gfs2_log_pointers_init(sdp, head.lh_blkno);
 
 	error = gfs2_quota_init(sdp);
-	if (error)
+	if (error || gfs2_withdrawn(sdp))
 		goto fail;
 
 	set_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
@@ -599,33 +603,58 @@ int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 {
 	struct gfs2_holder freeze_gh;
 	int error;
-
-	error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED, GL_NOCACHE,
-				   &freeze_gh);
-	if (error && !gfs2_withdrawn(sdp))
-		return error;
+	int log_write_allowed = test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
+
+	gfs2_holder_mark_uninitialized(&freeze_gh);
+	if (!log_write_allowed && sdp->sd_freeze_gl &&
+	    !gfs2_glock_is_locked_by_me(sdp->sd_freeze_gl)) {
+		error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED,
+				   GL_NOCACHE | LM_FLAG_TRY, &freeze_gh);
+		if (error == GLR_TRYFAILED)
+			error = 0;
+	} else {
+		error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED,
+					   GL_NOCACHE, &freeze_gh);
+		if (error && !gfs2_withdrawn(sdp))
+			return error;
+	}
 
 	flush_workqueue(gfs2_delete_workqueue);
-	if (sdp->sd_quotad_process)
+	if (!log_write_allowed && current == sdp->sd_quotad_process)
+		fs_warn(sdp, "The quotad daemon is withdrawing.\n");
+	else if (sdp->sd_quotad_process)
 		kthread_stop(sdp->sd_quotad_process);
 	sdp->sd_quotad_process = NULL;
-	if (sdp->sd_logd_process)
+
+	if (!log_write_allowed && current == sdp->sd_logd_process)
+		fs_warn(sdp, "The logd daemon is withdrawing.\n");
+	else if (sdp->sd_logd_process)
 		kthread_stop(sdp->sd_logd_process);
 	sdp->sd_logd_process = NULL;
 
-	gfs2_quota_sync(sdp->sd_vfs, 0);
-	gfs2_statfs_sync(sdp->sd_vfs, 0);
-
-	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SHUTDOWN |
-		       GFS2_LFC_MAKE_FS_RO);
-	wait_event(sdp->sd_reserving_log_wait, atomic_read(&sdp->sd_reserving_log) == 0);
-	gfs2_assert_warn(sdp, atomic_read(&sdp->sd_log_blks_free) == sdp->sd_jdesc->jd_blocks);
+	if (log_write_allowed) {
+		gfs2_quota_sync(sdp->sd_vfs, 0);
+		gfs2_statfs_sync(sdp->sd_vfs, 0);
 
+		gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SHUTDOWN |
+			       GFS2_LFC_MAKE_FS_RO);
+		wait_event(sdp->sd_reserving_log_wait,
+			   atomic_read(&sdp->sd_reserving_log) == 0);
+		gfs2_assert_warn(sdp, atomic_read(&sdp->sd_log_blks_free) ==
+				 sdp->sd_jdesc->jd_blocks);
+	} else {
+		wait_event_timeout(sdp->sd_reserving_log_wait,
+				   atomic_read(&sdp->sd_reserving_log) == 0,
+				   HZ * 5);
+	}
 	if (gfs2_holder_initialized(&freeze_gh))
 		gfs2_glock_dq_uninit(&freeze_gh);
 
 	gfs2_quota_cleanup(sdp);
 
+	if (!log_write_allowed)
+		sdp->sd_vfs->s_flags |= SB_RDONLY;
+
 	return error;
 }
 
@@ -676,8 +705,10 @@ static void gfs2_put_super(struct super_block *sb)
 	gfs2_glock_put(sdp->sd_freeze_gl);
 
 	if (!sdp->sd_args.ar_spectator) {
-		gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
-		gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
+		if (gfs2_holder_initialized(&sdp->sd_journal_gh))
+			gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
+		if (gfs2_holder_initialized(&sdp->sd_jinode_gh))
+			gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
 		gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
 		gfs2_glock_dq_uninit(&sdp->sd_qc_gh);
 		iput(sdp->sd_sc_inode);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 4ef5218303d7..0d6ee69fab40 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -11,11 +11,14 @@
 #include <linux/buffer_head.h>
 #include <linux/crc32.h>
 #include <linux/gfs2_ondisk.h>
+#include <linux/delay.h>
 #include <linux/uaccess.h>
 
 #include "gfs2.h"
 #include "incore.h"
 #include "glock.h"
+#include "glops.h"
+#include "log.h"
 #include "lops.h"
 #include "recovery.h"
 #include "rgrp.h"
@@ -78,6 +81,144 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
 	return error;
 }
 
+static void signal_our_withdraw(struct gfs2_sbd *sdp)
+{
+	struct gfs2_glock *gl = sdp->sd_live_gh.gh_gl;
+	int ret = 0;
+	int tries;
+
+	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags))
+		return;
+
+	/* Prevent any glock dq until withdraw recovery is complete */
+	set_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
+	/*
+	 * Don't tell dlm we're bailing until we have no more buffers in the
+	 * wind. If journal had an IO error, the log code should just purge
+	 * the outstanding buffers rather than submitting new IO. Making the
+	 * file system read-only will flush the journal, etc.
+	 *
+	 * During a normal unmount, gfs2_make_fs_ro calls gfs2_log_shutdown
+	 * which clears SDF_JOURNAL_LIVE. In a withdraw, we must not write
+	 * any UNMOUNT log header, so we can't call gfs2_log_shutdown, and
+	 * therefore we need to clear SDF_JOURNAL_LIVE manually.
+	 */
+	clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
+	if (sdp->sd_vfs && !sb_rdonly(sdp->sd_vfs))
+		ret = gfs2_make_fs_ro(sdp);
+
+	/*
+	 * Drop the glock for our journal so another node can recover it.
+	 */
+	if (gfs2_holder_initialized(&sdp->sd_journal_gh)) {
+		gfs2_glock_dq_wait(&sdp->sd_journal_gh);
+		gfs2_holder_uninit(&sdp->sd_journal_gh);
+	}
+	sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
+	gfs2_glock_dq(&sdp->sd_jinode_gh);
+	if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
+		/* Make sure gfs2_unfreeze works if partially-frozen */
+		flush_workqueue(gfs2_freeze_wq);
+		atomic_set(&sdp->sd_freeze_state, SFS_FROZEN);
+		thaw_super(sdp->sd_vfs);
+	} else {
+		wait_on_bit(&gl->gl_flags, GLF_DEMOTE, TASK_UNINTERRUPTIBLE);
+	}
+
+	/*
+	 * holder_uninit to force glock_put, to force dlm to let go
+	 */
+	gfs2_holder_uninit(&sdp->sd_jinode_gh);
+
+	/*
+	 * Note: We need to be careful here:
+	 * Func gfs2_jindex_free calls iput on jd_inode, which will evict it.
+	 * The evict, in turn, will dequeue its glock, but the glock dq will
+	 * wait for the withdraw unless we have exception code in glock_dq.
+	 */
+	gfs2_jindex_free(sdp);
+	/*
+	 * Wait until the journal inode's glock is freed. This allows try locks
+	 * on other nodes to be successful, otherwise we remain the owner of
+	 * the glock as far as dlm is concerned.
+	 */
+	if (gl->gl_ops->go_free) {
+		set_bit(GLF_FREEING, &gl->gl_flags);
+		wait_on_bit(&gl->gl_flags, GLF_FREEING, TASK_UNINTERRUPTIBLE);
+	}
+
+	if (sdp->sd_lockstruct.ls_ops->lm_lock == NULL) { /* lock_nolock */
+		clear_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
+		goto skip_recovery;
+	}
+	/*
+	 * Dequeue the "live" glock, but keep a reference so it's never freed.
+	 */
+	gfs2_glock_hold(gl);
+	gfs2_glock_dq_wait(&sdp->sd_live_gh);
+	/*
+	 * We enqueue the "live" glock in EX so that all other nodes
+	 * get a demote request and act on it. We don't really want the
+	 * lock in EX, so we send a "try" lock with 1CB to produce a callback.
+	 */
+	fs_warn(sdp, "Requesting recovery of jid %d.\n",
+		sdp->sd_lockstruct.ls_jid);
+	gfs2_holder_reinit(LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
+			   &sdp->sd_live_gh);
+	msleep(GL_GLOCK_MAX_HOLD);
+	/*
+	 * This will likely fail in a cluster, but succeed standalone:
+	 */
+	ret = gfs2_glock_nq(&sdp->sd_live_gh);
+
+	/*
+	 * If we actually got the "live" lock in EX mode, there are no other
+	 * nodes available to replay our journal. So we try to replay it
+	 * ourselves. We hold the "live" glock to prevent other mounters
+	 * during recovery, then just dequeue it and reacquire it in our
+	 * normal SH mode. Just in case the problem that caused us to
+	 * withdraw prevents us from recovering our journal (e.g. io errors
+	 * and such) we still check if the journal is clean before proceeding
+	 * but we may wait forever until another mounter does the recovery.
+	 */
+	if (ret == 0) {
+		fs_warn(sdp, "No other mounters found. Trying to recover our "
+			"own journal jid %d.\n", sdp->sd_lockstruct.ls_jid);
+		if (gfs2_recover_journal(sdp->sd_jdesc, 1))
+			fs_warn(sdp, "Unable to recover our journal jid %d.\n",
+				sdp->sd_lockstruct.ls_jid);
+		gfs2_glock_dq_wait(&sdp->sd_live_gh);
+		gfs2_holder_reinit(LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT,
+				   &sdp->sd_live_gh);
+		gfs2_glock_nq(&sdp->sd_live_gh);
+	}
+
+	gfs2_glock_queue_put(gl); /* drop the extra reference we acquired */
+	clear_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
+
+	/*
+	 * Now wait until recovery complete.
+	 */
+	for (tries = 0; tries < 10; tries++) {
+		ret = check_journal_clean(sdp, sdp->sd_jdesc);
+		if (!ret)
+			break;
+		msleep(HZ);
+		fs_warn(sdp, "Waiting for journal recovery jid %d.\n",
+			sdp->sd_lockstruct.ls_jid);
+	}
+skip_recovery:
+	if (!ret)
+		fs_warn(sdp, "Journal recovery complete for jid %d.\n",
+			sdp->sd_lockstruct.ls_jid);
+	else
+		fs_warn(sdp, "Journal recovery skipped for %d until next "
+			"mount.\n", sdp->sd_lockstruct.ls_jid);
+	fs_warn(sdp, "Glock dequeues delayed: %lu\n", sdp->sd_glock_dqs_held);
+	sdp->sd_glock_dqs_held = 0;
+	wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY);
+}
+
 int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 {
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
@@ -118,6 +259,8 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 		fs_err(sdp, "about to withdraw this file system\n");
 		BUG_ON(sdp->sd_args.ar_debug);
 
+		signal_our_withdraw(sdp);
+
 		kobject_uevent(&sdp->sd_kobj, KOBJ_OFFLINE);
 
 		if (!strcmp(sdp->sd_lockstruct.ls_ops->lm_proto_name, "lock_dlm"))
@@ -128,7 +271,7 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
 			lm->lm_unmount(sdp);
 		}
 		set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
-		fs_err(sdp, "withdrawn\n");
+		fs_err(sdp, "File system withdrawn\n");
 		dump_stack();
 		clear_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
 		smp_mb__after_atomic();
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 14/32] gfs2: fix infinite loop when checking ail item count before go_inval
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (12 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 13/32] gfs2: Force withdraw to replay journals and wait for it to finish Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 15/32] gfs2: Add verbose option to check_journal_clean Bob Peterson
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, the rgrp_go_inval and inode_go_inval functions each
checked if there were any items left on the ail count (by way of a
count), and if so, did a withdraw. But the withdraw code now uses
glocks when changing the file system to read-only status. So we can
not have glock functions withdrawing or a hang will likely result:
The glocks can't be serviced by the work_func if the work_func is
busy doing its own withdraw.

This patch removes the checks from the go_inval functions and adds
a centralized check in do_xmote to warn about the problem and not
withdraw, but flag the error so it's eventually caught when the logd
daemon eventually runs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 17 +++++++++++++++--
 fs/gfs2/glops.c |  3 ---
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index d8e48b72026e..879625471fe0 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -596,9 +596,22 @@ __acquires(&gl->gl_lockref.lock)
 	spin_unlock(&gl->gl_lockref.lock);
 	if (glops->go_sync)
 		glops->go_sync(gl);
-	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
+	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags)) {
+		/*
+		 * The call to go_sync should have cleared out the ail list.
+		 * If there are still items, we have a problem. We ought to
+		 * withdraw, but we can't because the withdraw code also uses
+		 * glocks. Warn about the error, dump the glock, then fall
+		 * through and wait for logd to do the withdraw for us.
+		 */
+		if ((atomic_read(&gl->gl_ail_count) != 0) &&
+		    (!cmpxchg(&sdp->sd_log_error, 0, -EIO))) {
+			gfs2_assert_warn(sdp, !atomic_read(&gl->gl_ail_count));
+			gfs2_dump_glock(NULL, gl, true);
+		}
 		glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
-	clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
+		clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
+	}
 
 	gfs2_glock_hold(gl);
 	if (sdp->sd_lockstruct.ls_ops->lm_lock)	{
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 3904e4892a65..d8f9ee756c0e 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -187,7 +187,6 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 		gfs2_rgrp_brelse(rgd);
 
 	WARN_ON_ONCE(!(flags & DIO_METADATA));
-	gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
 	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
 
 	if (rgd)
@@ -287,8 +286,6 @@ static void inode_go_inval(struct gfs2_glock *gl, int flags)
 {
 	struct gfs2_inode *ip = gfs2_glock2inode(gl);
 
-	gfs2_assert_withdraw(gl->gl_name.ln_sbd, !atomic_read(&gl->gl_ail_count));
-
 	if (flags & DIO_METADATA) {
 		struct address_space *mapping = gfs2_glock2aspace(gl);
 		truncate_inode_pages(mapping, 0);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 15/32] gfs2: Add verbose option to check_journal_clean
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (13 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 14/32] gfs2: fix infinite loop when checking ail item count before go_inval Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 16/32] gfs2: Abort gfs2_freeze if io error is seen Bob Peterson
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/ops_fstype.c |  2 +-
 fs/gfs2/util.c       | 23 ++++++++++++++++-------
 fs/gfs2/util.h       |  4 +++-
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d49741bf5a0f..22df012cbbd6 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -690,7 +690,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 			struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, x);
 
 			if (sdp->sd_args.ar_spectator) {
-				error = check_journal_clean(sdp, jd);
+				error = check_journal_clean(sdp, jd, true);
 				if (error)
 					goto fail_jinode_gh;
 				continue;
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 0d6ee69fab40..690a2f575709 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -46,7 +46,8 @@ void gfs2_assert_i(struct gfs2_sbd *sdp)
  *
  * Returns: 0 if the journal is clean or locked, else an error
  */
-int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
+int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
+			bool verbose)
 {
 	int error;
 	struct gfs2_holder j_gh;
@@ -57,23 +58,31 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
 	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
 				   GL_EXACT | GL_NOCACHE, &j_gh);
 	if (error) {
-		fs_err(sdp, "Error locking journal for spectator mount.\n");
+		if (verbose)
+			fs_err(sdp, "Error %d locking journal for spectator "
+			       "mount.\n", error);
 		return -EPERM;
 	}
 	error = gfs2_jdesc_check(jd);
 	if (error) {
-		fs_err(sdp, "Error checking journal for spectator mount.\n");
+		if (verbose)
+			fs_err(sdp, "Error checking journal for spectator "
+			       "mount.\n");
 		goto out_unlock;
 	}
 	error = gfs2_find_jhead(jd, &head, false);
 	if (error) {
-		fs_err(sdp, "Error parsing journal for spectator mount.\n");
+		if (verbose)
+			fs_err(sdp, "Error parsing journal for spectator "
+			       "mount.\n");
 		goto out_unlock;
 	}
 	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
 		error = -EPERM;
-		fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter "
-		       "must not be a spectator.\n", jd->jd_jid);
+		if (verbose)
+			fs_err(sdp, "jid=%u: Journal is dirty, so the first "
+			       "mounter must not be a spectator.\n",
+			       jd->jd_jid);
 	}
 
 out_unlock:
@@ -200,7 +209,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	 * Now wait until recovery complete.
 	 */
 	for (tries = 0; tries < 10; tries++) {
-		ret = check_journal_clean(sdp, sdp->sd_jdesc);
+		ret = check_journal_clean(sdp, sdp->sd_jdesc, false);
 		if (!ret)
 			break;
 		msleep(HZ);
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index babdae6476ee..297eabc29856 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -128,7 +128,9 @@ static inline void gfs2_metatype_set(struct buffer_head *bh, u16 type,
 
 int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function,
 		    char *file, unsigned int line);
-int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd);
+
+extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
+			       bool verbose);
 
 #define gfs2_io_error(sdp) \
 gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 16/32] gfs2: Abort gfs2_freeze if io error is seen
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (14 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 15/32] gfs2: Add verbose option to check_journal_clean Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 17/32] gfs2: Issue revokes more intelligently Bob Peterson
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, an io error, such as -EIO writing to the journal
would cause function gfs2_freeze to go into an infinite loop,
continuously retrying the freeze operation. But nothing ever clears
the -EIO except unmount after withdraw, which is impossible if the
freeze operation never ends (fails). Instead you get:

[ 6499.767994] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6499.773058] gfs2: fsid=dm-32.0: retrying...
[ 6500.791957] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6500.797015] gfs2: fsid=dm-32.0: retrying...

This patch adds a check for -EIO in gfs2_freeze, and if seen, it
dequeues the freeze glock, aborts the loop and returns the error.
Also, there's no need to pass the freeze holder to function
gfs2_lock_fs_check_clean since it's only called in one place and
it's a well-known superblock pointer, so this simplifies that.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/super.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a696663bf5e5..c7183d550442 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -403,8 +403,7 @@ struct lfcc {
  * Returns: errno
  */
 
-static int gfs2_lock_fs_check_clean(struct gfs2_sbd *sdp,
-				    struct gfs2_holder *freeze_gh)
+static int gfs2_lock_fs_check_clean(struct gfs2_sbd *sdp)
 {
 	struct gfs2_inode *ip;
 	struct gfs2_jdesc *jd;
@@ -429,7 +428,7 @@ static int gfs2_lock_fs_check_clean(struct gfs2_sbd *sdp,
 	}
 
 	error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_EXCLUSIVE,
-				   GL_NOCACHE, freeze_gh);
+				   GL_NOCACHE, &sdp->sd_freeze_gh);
 
 	list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
 		error = gfs2_jdesc_check(jd);
@@ -445,7 +444,7 @@ static int gfs2_lock_fs_check_clean(struct gfs2_sbd *sdp,
 	}
 
 	if (error)
-		gfs2_glock_dq_uninit(freeze_gh);
+		gfs2_glock_dq_uninit(&sdp->sd_freeze_gh);
 
 out:
 	while (!list_empty(&list)) {
@@ -798,15 +797,20 @@ static int gfs2_freeze(struct super_block *sb)
 			goto out;
 		}
 
-		error = gfs2_lock_fs_check_clean(sdp, &sdp->sd_freeze_gh);
+		error = gfs2_lock_fs_check_clean(sdp);
 		if (!error)
 			break;
 
 		if (error == -EBUSY)
 			fs_err(sdp, "waiting for recovery before freeze\n");
-		else
+		else if (error == -EIO) {
+			fs_err(sdp, "Fatal IO error: cannot freeze gfs2 due "
+			       "to recovery error.\n");
+			gfs2_glock_dq_uninit(&sdp->sd_freeze_gh);
+			goto out;
+		} else {
 			fs_err(sdp, "error freezing FS: %d\n", error);
-
+		}
 		fs_err(sdp, "retrying...\n");
 		msleep(1000);
 	}
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 17/32] gfs2: Issue revokes more intelligently
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (15 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 16/32] gfs2: Abort gfs2_freeze if io error is seen Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 18/32] gfs2: Prepare to withdraw as soon as an IO error occurs in log write Bob Peterson
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_write_revokes would call
gfs2_ail1_empty, then traverse the sd_ail1_list looking for
transactions that had bds which were no longer queued to a glock.
And if it found some, it would try to issue revokes for them, up to
a predetermined maximum. There were two problems with how it did
this. First was the fact that gfs2_ail1_empty moves transactions
which have nothing remaining on the ail1 list from the sd_ail1_list
to the sd_ail2_list, thus making its traversal of sd_ail1_list
miss them completely, and therefore, never issue revokes for them.
Second was the fact that there were three traversals (or partial
traversals) of the sd_ail1_list, each of which took and then
released the sd_ail_lock lock: First inside gfs2_ail1_empty,
second to determine if there are any revokes to be issued, and
third to actually issue them. All this taking and releasing of the
sd_ail_lock meant other processes could modify the lists and the
conditions in which we're working.

This patch simplies the whole process by adding a new parameter
to function gfs2_ail1_empty, max_revokes. For normal calls, this
is passed in as 0, meaning we don't want to issue any revokes.
For function gfs2_write_revokes, we pass in the maximum number
of revokes we can, thus allowing gfs2_ail1_empty to add the
revokes where needed. This simplies the code, allows for a single
holding of the sd_ail_lock, and allows gfs2_ail1_empty to add
revokes for all the necessary bd items without missing any.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 61 +++++++++++++++++++--------------------------------
 1 file changed, 22 insertions(+), 39 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 735291a93e1a..bb0ef8ede14c 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -192,11 +192,13 @@ static void gfs2_ail1_start(struct gfs2_sbd *sdp)
 /**
  * gfs2_ail1_empty_one - Check whether or not a trans in the AIL has been synced
  * @sdp: the filesystem
- * @ai: the AIL entry
+ * @tr: the transaction
+ * @max_revokes: If nonzero, issue revokes for the bd items for written buffers
  *
  */
 
-static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
+static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
+				int *max_revokes)
 {
 	struct gfs2_bufdata *bd, *s;
 	struct buffer_head *bh;
@@ -221,18 +223,28 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 			gfs2_io_error_bh(sdp, bh);
 			set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
 		}
-		list_move(&bd->bd_ail_st_list, &tr->tr_ail2_list);
+		/* If we have space for revokes and the bd is no longer on any
+		   buf list, we can just add a revoke for it here and avoid
+		   having to put it on the ail2 list, where it would need to
+		   be revoked later. */
+		if (*max_revokes && list_empty(&bd->bd_list)) {
+			gfs2_add_revoke(sdp, bd);
+			(*max_revokes)--;
+		} else {
+			list_move(&bd->bd_ail_st_list, &tr->tr_ail2_list);
+		}
 	}
 }
 
 /**
  * gfs2_ail1_empty - Try to empty the ail1 lists
  * @sdp: The superblock
+ * @max_revokes: If non-zero, add revokes where appropriate
  *
  * Tries to empty the ail1 lists, starting with the oldest first
  */
 
-static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
+static int gfs2_ail1_empty(struct gfs2_sbd *sdp, int max_revokes)
 {
 	struct gfs2_trans *tr, *s;
 	int oldest_tr = 1;
@@ -240,7 +252,7 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
 
 	spin_lock(&sdp->sd_ail_lock);
 	list_for_each_entry_safe_reverse(tr, s, &sdp->sd_ail1_list, tr_list) {
-		gfs2_ail1_empty_one(sdp, tr);
+		gfs2_ail1_empty_one(sdp, tr, &max_revokes);
 		if (list_empty(&tr->tr_ail1_list) && oldest_tr)
 			list_move(&tr->tr_list, &sdp->sd_ail2_list);
 		else
@@ -621,25 +633,9 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd)
 
 void gfs2_write_revokes(struct gfs2_sbd *sdp)
 {
-	struct gfs2_trans *tr;
-	struct gfs2_bufdata *bd, *tmp;
-	int have_revokes = 0;
 	int max_revokes = (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_log_descriptor)) / sizeof(u64);
 
-	gfs2_ail1_empty(sdp);
-	spin_lock(&sdp->sd_ail_lock);
-	list_for_each_entry_reverse(tr, &sdp->sd_ail1_list, tr_list) {
-		list_for_each_entry(bd, &tr->tr_ail2_list, bd_ail_st_list) {
-			if (list_empty(&bd->bd_list)) {
-				have_revokes = 1;
-				goto done;
-			}
-		}
-	}
-done:
-	spin_unlock(&sdp->sd_ail_lock);
-	if (have_revokes == 0)
-		return;
+	gfs2_log_lock(sdp);
 	while (sdp->sd_log_num_revoke > max_revokes)
 		max_revokes += (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_meta_header)) / sizeof(u64);
 	max_revokes -= sdp->sd_log_num_revoke;
@@ -650,20 +646,7 @@ void gfs2_write_revokes(struct gfs2_sbd *sdp)
 		if (!sdp->sd_log_blks_reserved)
 			atomic_dec(&sdp->sd_log_blks_free);
 	}
-	gfs2_log_lock(sdp);
-	spin_lock(&sdp->sd_ail_lock);
-	list_for_each_entry_reverse(tr, &sdp->sd_ail1_list, tr_list) {
-		list_for_each_entry_safe(bd, tmp, &tr->tr_ail2_list, bd_ail_st_list) {
-			if (max_revokes == 0)
-				goto out_of_blocks;
-			if (!list_empty(&bd->bd_list))
-				continue;
-			gfs2_add_revoke(sdp, bd);
-			max_revokes--;
-		}
-	}
-out_of_blocks:
-	spin_unlock(&sdp->sd_ail_lock);
+	gfs2_ail1_empty(sdp, max_revokes);
 	gfs2_log_unlock(sdp);
 
 	if (!sdp->sd_log_num_revoke) {
@@ -862,7 +845,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 			for (;;) {
 				gfs2_ail1_start(sdp);
 				gfs2_ail1_wait(sdp);
-				if (gfs2_ail1_empty(sdp))
+				if (gfs2_ail1_empty(sdp, 0))
 					break;
 			}
 			if (gfs2_withdrawn(sdp))
@@ -1030,7 +1013,7 @@ int gfs2_logd(void *data)
 
 		did_flush = false;
 		if (gfs2_jrnl_flush_reqd(sdp) || t == 0) {
-			gfs2_ail1_empty(sdp);
+			gfs2_ail1_empty(sdp, 0);
 			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 				       GFS2_LFC_LOGD_JFLUSH_REQD);
 			did_flush = true;
@@ -1039,7 +1022,7 @@ int gfs2_logd(void *data)
 		if (gfs2_ail_flush_reqd(sdp)) {
 			gfs2_ail1_start(sdp);
 			gfs2_ail1_wait(sdp);
-			gfs2_ail1_empty(sdp);
+			gfs2_ail1_empty(sdp, 0);
 			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 				       GFS2_LFC_LOGD_AIL_FLUSH_REQD);
 			did_flush = true;
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 18/32] gfs2: Prepare to withdraw as soon as an IO error occurs in log write
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (16 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 17/32] gfs2: Issue revokes more intelligently Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 19/32] gfs2: Check for log write errors before telling dlm to unlock Bob Peterson
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_end_log_write would detect any IO
errors writing to the journal and put out an appropriate message,
but it never set a withdrawing condition. Eventually, the log daemon
would see the error and determine it was time to withdraw, but in
the meantime, other processes could continue running as if nothing
bad ever happened. The biggest consequence is that __gfs2_glock_put
would BUG() when it saw that there were still unwritten items.

This patch sets the WITHDRAWING status as soon as an IO error is
detected, and that way, the BUG will be avoided so the file system
can be properly withdrawn and unmounted.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/lops.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index d3c73b1771b6..7dda50717c9e 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -206,6 +206,9 @@ static void gfs2_end_log_write(struct bio *bio)
 		if (!cmpxchg(&sdp->sd_log_error, 0, (int)bio->bi_status))
 			fs_err(sdp, "Error %d writing to journal, jid=%u\n",
 			       bio->bi_status, sdp->sd_jdesc->jd_jid);
+		set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
+		/* prevent more writes to the journal */
+		clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
 		wake_up(&sdp->sd_logd_waitq);
 	}

-- 
2.23.0

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 19/32] gfs2: Check for log write errors before telling dlm to unlock
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (17 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 18/32] gfs2: Prepare to withdraw as soon as an IO error occurs in log write Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 20/32] gfs2: new slab for transactions Bob Peterson
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function do_xmote just assumed all the writes
submitted to the journal were finished and successful, and it
called the go_unlock function to release the dlm lock. But if
they're not, and a revoke failed to make its way to the journal,
a journal replay on another node will cause corruption if we
let the go_inval function continue and tell dlm to release the
glock to another node. This patch adds a couple checks for errors
in do_xmote after the calls to go_sync and go_inval. If an error
is found, we cannot withdraw yet, because the withdraw itself
uses glocks to make the file system read-only. Instead, we flag
the error. Later, asserts should cause another node to replay
the journal before continuing, thus protecting rgrp and dinode
glocks and maintaining the integrity of the metadata. Note that
we only need to do this for journaled glocks. System glocks
should be able to progress even under withdrawn conditions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 879625471fe0..f18b8f2b8ce5 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -614,6 +614,30 @@ __acquires(&gl->gl_lockref.lock)
 	}
 
 	gfs2_glock_hold(gl);
+	/*
+	 * Check for an error encountered since we called go_sync and go_inval.
+	 * If so, we can't withdraw from the glock code because the withdraw
+	 * code itself uses glocks (see function signal_our_withdraw) to
+	 * change the mount to read-only. Most importantly, we must not call
+	 * dlm to unlock the glock until the journal is in a known good state
+	 * (after journal replay) otherwise other nodes may use the object
+	 * (rgrp or dinode) and then later, journal replay will corrupt the
+	 * file system. The best we can do here is wait for the logd daemon
+	 * to see sd_log_error and withdraw, and in the meantime, requeue the
+	 * work for later.
+	 *
+	 * However, if we're just unlocking the lock (say, for unmount, when
+	 * gfs2_gl_hash_clear calls clear_glock) and recovery is complete
+	 * then it's okay to tell dlm to unlock it.
+	 */
+	if (unlikely(sdp->sd_log_error && !allow_while_withdrawn(gl))) {
+		if (target != LM_ST_UNLOCKED ||
+		    test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags)) {
+			gfs2_glock_queue_work(gl, GL_GLOCK_DFT_HOLD);
+			goto out;
+		}
+	}
+
 	if (sdp->sd_lockstruct.ls_ops->lm_lock)	{
 		/* lock_dlm */
 		ret = sdp->sd_lockstruct.ls_ops->lm_lock(gl, target, lck_flags);
@@ -622,8 +646,7 @@ __acquires(&gl->gl_lockref.lock)
 		    test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags)) {
 			finish_xmote(gl, target);
 			gfs2_glock_queue_work(gl, 0);
-		}
-		else if (ret) {
+		} else if (ret) {
 			fs_err(sdp, "lm_lock ret %d\n", ret);
 			GLOCK_BUG_ON(gl, !gfs2_withdrawn(sdp));
 		}
@@ -631,7 +654,7 @@ __acquires(&gl->gl_lockref.lock)
 		finish_xmote(gl, target);
 		gfs2_glock_queue_work(gl, 0);
 	}
-
+out:
 	spin_lock(&gl->gl_lockref.lock);
 }
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 20/32] gfs2: new slab for transactions
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (18 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 19/32] gfs2: Check for log write errors before telling dlm to unlock Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 21/32] gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS Bob Peterson
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch adds a new slab for gfs2 transactions. That allows us to
have an initialization function and protect against some errors.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c |  1 +
 fs/gfs2/log.c   | 14 +++++++++++---
 fs/gfs2/lops.c  |  4 ++++
 fs/gfs2/main.c  | 23 +++++++++++++++++++++++
 fs/gfs2/trans.c | 28 ++++++++++++++++++++++------
 fs/gfs2/trans.h |  1 +
 fs/gfs2/util.c  |  1 +
 fs/gfs2/util.h  |  1 +
 8 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index d8f9ee756c0e..e0219b0e2229 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -89,6 +89,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
 	memset(&tr, 0, sizeof(tr));
 	INIT_LIST_HEAD(&tr.tr_buf);
 	INIT_LIST_HEAD(&tr.tr_databuf);
+	set_bit(TR_ATTACHED, &tr.tr_flags); /* prevent gfs2_trans_end free */
 	tr.tr_revokes = atomic_read(&gl->gl_ail_count);
 
 	if (!tr.tr_revokes)
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index bb0ef8ede14c..11aa65dae761 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -30,6 +30,7 @@
 #include "util.h"
 #include "dir.h"
 #include "trace_gfs2.h"
+#include "trans.h"
 
 /**
  * gfs2_struct2blk - compute stuff
@@ -329,7 +330,7 @@ static void ail2_empty(struct gfs2_sbd *sdp, unsigned int new_tail)
 		list_del(&tr->tr_list);
 		gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list));
 		gfs2_assert_warn(sdp, list_empty(&tr->tr_ail2_list));
-		kfree(tr);
+		gfs2_trans_free(sdp, tr);
 	}
 
 	spin_unlock(&sdp->sd_ail_lock);
@@ -866,7 +867,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 	trace_gfs2_log_flush(sdp, 0, flags);
 	up_write(&sdp->sd_log_flush_lock);
 
-	kfree(tr);
+	gfs2_trans_free(sdp, tr);
 }
 
 /**
@@ -885,6 +886,12 @@ static void gfs2_merge_trans(struct gfs2_trans *old, struct gfs2_trans *new)
 	old->tr_num_databuf_rm	+= new->tr_num_databuf_rm;
 	old->tr_num_revoke	+= new->tr_num_revoke;
 
+	new->tr_num_buf_new	= 0;
+	new->tr_num_buf_rm	= 0;
+	new->tr_num_databuf_new	= 0;
+	new->tr_num_databuf_rm	= 0;
+	new->tr_num_revoke	= 0;
+
 	list_splice_tail_init(&new->tr_databuf, &old->tr_databuf);
 	list_splice_tail_init(&new->tr_buf, &old->tr_buf);
 }
@@ -894,6 +901,7 @@ static void log_refund(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 	unsigned int reserved;
 	unsigned int unused;
 	unsigned int maxres;
+	unsigned int new_revokes = tr->tr_num_revoke;
 
 	gfs2_log_lock(sdp);
 
@@ -905,7 +913,7 @@ static void log_refund(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		set_bit(TR_ATTACHED, &tr->tr_flags);
 	}
 
-	sdp->sd_log_commited_revoke += tr->tr_num_revoke;
+	sdp->sd_log_commited_revoke += new_revokes;
 	reserved = calc_reserved(sdp);
 	maxres = sdp->sd_log_blks_reserved + tr->tr_reserved;
 	gfs2_assert_withdraw(sdp, maxres >= reserved);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 7dda50717c9e..c62b9e12aed7 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -718,6 +718,8 @@ static void buf_lo_after_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		list_del_init(&bd->bd_list);
 		gfs2_unpin(sdp, bd->bd_bh, tr);
 	}
+	tr->tr_num_buf_new = 0;
+	tr->tr_num_buf_rm = 0;
 }
 
 static void buf_lo_before_scan(struct gfs2_jdesc *jd,
@@ -1070,6 +1072,8 @@ static void databuf_lo_after_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		list_del_init(&bd->bd_list);
 		gfs2_unpin(sdp, bd->bd_bh, tr);
 	}
+	tr->tr_num_databuf_new = 0;
+	tr->tr_num_databuf_rm = 0;
 }
 
 
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index a1a295b739fb..5ff114bbcede 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -68,6 +68,19 @@ static void gfs2_init_gl_aspace_once(void *foo)
 	address_space_init_once(mapping);
 }
 
+static void gfs2_init_tr_once(void *foo)
+{
+	struct gfs2_trans *tr = foo;
+
+	memset(tr, 0, sizeof(struct gfs2_trans));
+	INIT_LIST_HEAD(&tr->tr_databuf);
+	INIT_LIST_HEAD(&tr->tr_buf);
+	INIT_LIST_HEAD(&tr->tr_ail1_list);
+	INIT_LIST_HEAD(&tr->tr_ail2_list);
+	INIT_LIST_HEAD(&tr->tr_list);
+	tr->tr_ip = 0;
+}
+
 /**
  * init_gfs2_fs - Register GFS2 as a filesystem
  *
@@ -143,6 +156,13 @@ static int __init init_gfs2_fs(void)
 	if (!gfs2_qadata_cachep)
 		goto fail_cachep7;
 
+	gfs2_trans_cachep = kmem_cache_create("gfs2_trans",
+					       sizeof(struct gfs2_trans),
+					       0, SLAB_CONSISTENCY_CHECKS|
+					       SLAB_POISON, gfs2_init_tr_once);
+	if (!gfs2_trans_cachep)
+		goto fail_cachep8;
+
 	error = register_shrinker(&gfs2_qd_shrinker);
 	if (error)
 		goto fail_shrinker;
@@ -194,6 +214,8 @@ static int __init init_gfs2_fs(void)
 fail_fs1:
 	unregister_shrinker(&gfs2_qd_shrinker);
 fail_shrinker:
+	kmem_cache_destroy(gfs2_trans_cachep);
+fail_cachep8:
 	kmem_cache_destroy(gfs2_qadata_cachep);
 fail_cachep7:
 	kmem_cache_destroy(gfs2_quotad_cachep);
@@ -236,6 +258,7 @@ static void __exit exit_gfs2_fs(void)
 	rcu_barrier();
 
 	mempool_destroy(gfs2_page_pool);
+	kmem_cache_destroy(gfs2_trans_cachep);
 	kmem_cache_destroy(gfs2_qadata_cachep);
 	kmem_cache_destroy(gfs2_quotad_cachep);
 	kmem_cache_destroy(gfs2_rgrpd_cachep);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 35e3059255fe..7d16d9aa3153 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -37,7 +37,7 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 	if (!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
 		return -EROFS;
 
-	tr = kzalloc(sizeof(struct gfs2_trans), GFP_NOFS);
+	tr = kmem_cache_alloc(gfs2_trans_cachep, GFP_NOFS);
 	if (!tr)
 		return -ENOMEM;
 
@@ -45,14 +45,19 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 	tr->tr_blocks = blocks;
 	tr->tr_revokes = revokes;
 	tr->tr_reserved = 1;
+	tr->tr_num_revoke = 0;
+
+	clear_bit(TR_TOUCHED, &tr->tr_flags);
+	clear_bit(TR_ATTACHED, &tr->tr_flags);
+
 	set_bit(TR_ALLOCED, &tr->tr_flags);
 	if (blocks)
 		tr->tr_reserved += 6 + blocks;
 	if (revokes)
 		tr->tr_reserved += gfs2_struct2blk(sdp, revokes,
 						   sizeof(u64));
-	INIT_LIST_HEAD(&tr->tr_databuf);
-	INIT_LIST_HEAD(&tr->tr_buf);
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_databuf));
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_buf));
 
 	sb_start_intwrite(sdp->sd_vfs);
 
@@ -66,7 +71,7 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 
 fail:
 	sb_end_intwrite(sdp->sd_vfs);
-	kfree(tr);
+	kmem_cache_free(gfs2_trans_cachep, tr);
 
 	return error;
 }
@@ -94,7 +99,7 @@ void gfs2_trans_end(struct gfs2_sbd *sdp)
 	if (!test_bit(TR_TOUCHED, &tr->tr_flags)) {
 		gfs2_log_release(sdp, tr->tr_reserved);
 		if (alloced) {
-			kfree(tr);
+			gfs2_trans_free(sdp, tr);
 			sb_end_intwrite(sdp->sd_vfs);
 		}
 		return;
@@ -110,7 +115,7 @@ void gfs2_trans_end(struct gfs2_sbd *sdp)
 
 	gfs2_log_commit(sdp, tr);
 	if (alloced && !test_bit(TR_ATTACHED, &tr->tr_flags))
-		kfree(tr);
+		gfs2_trans_free(sdp, tr);
 	up_read(&sdp->sd_log_flush_lock);
 
 	if (sdp->sd_vfs->s_flags & SB_SYNCHRONOUS)
@@ -271,3 +276,14 @@ void gfs2_trans_remove_revoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len)
 	gfs2_log_unlock(sdp);
 }
 
+void gfs2_trans_free(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
+{
+	if (tr == NULL)
+		return;
+
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list));
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_ail2_list));
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_databuf));
+	gfs2_assert_warn(sdp, list_empty(&tr->tr_buf));
+	kmem_cache_free(gfs2_trans_cachep, tr);
+}
diff --git a/fs/gfs2/trans.h b/fs/gfs2/trans.h
index 6071334de035..83199ce5a5c5 100644
--- a/fs/gfs2/trans.h
+++ b/fs/gfs2/trans.h
@@ -42,5 +42,6 @@ extern void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh);
 extern void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh);
 extern void gfs2_trans_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd);
 extern void gfs2_trans_remove_revoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len);
+extern void gfs2_trans_free(struct gfs2_sbd *sdp, struct gfs2_trans *tr);
 
 #endif /* __TRANS_DOT_H__ */
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 690a2f575709..04141d0d8e17 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -32,6 +32,7 @@ struct kmem_cache *gfs2_bufdata_cachep __read_mostly;
 struct kmem_cache *gfs2_rgrpd_cachep __read_mostly;
 struct kmem_cache *gfs2_quotad_cachep __read_mostly;
 struct kmem_cache *gfs2_qadata_cachep __read_mostly;
+struct kmem_cache *gfs2_trans_cachep __read_mostly;
 mempool_t *gfs2_page_pool __read_mostly;
 
 void gfs2_assert_i(struct gfs2_sbd *sdp)
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 297eabc29856..72faee19c38d 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -154,6 +154,7 @@ extern struct kmem_cache *gfs2_bufdata_cachep;
 extern struct kmem_cache *gfs2_rgrpd_cachep;
 extern struct kmem_cache *gfs2_quotad_cachep;
 extern struct kmem_cache *gfs2_qadata_cachep;
+extern struct kmem_cache *gfs2_trans_cachep;
 extern mempool_t *gfs2_page_pool;
 extern struct workqueue_struct *gfs2_control_wq;
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 21/32] gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (19 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 20/32] gfs2: new slab for transactions Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 22/32] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty Bob Peterson
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch closes a timing window in which two processes compete
and overlap in the execution of do_xmote for the same glock:

             Process A                              Process B
   ------------------------------------   -----------------------------
1. Grabs gl_lockref and calls do_xmote
2.                                        Grabs gl_lockref but is blocked
3. Sets GLF_INVALIDATE_IN_PROGRESS
4. Unlocks gl_lockref
5.                                        Calls do_xmote
6. Call glops->go_sync
7. test_and_clear_bit GLF_DIRTY
8. Call gfs2_log_flush                    Call glops->go_sync
9. (slow IO, so it blocks a long time)    test_and_clear_bit GLF_DIRTY
                                          It's not dirty (step 7) returns
10.                                       Tests GLF_INVALIDATE_IN_PROGRESS
11.                                       Calls go_inval (rgrp_go_inval)
12.                                       gfs2_rgrp_relse does brelse
13.                                       truncate_inode_pages_range
14.                                       Calls lm_lock UN

In step 14 we've just told dlm to give the glock to another node
when, in fact, process A has not finished the IO and synced all
buffer_heads to disk and make sure their revokes are done.

This patch fixes the problem by changing the GLF_INVALIDATE_IN_PROGRESS
to use test_and_set_bit, and if the bit is already set, process B just
ignores it and trusts that process A will do the do_xmote in the proper
order.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f18b8f2b8ce5..ab72797e3ba1 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -584,7 +584,19 @@ __acquires(&gl->gl_lockref.lock)
 	GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target);
 	if ((target == LM_ST_UNLOCKED || target == LM_ST_DEFERRED) &&
 	    glops->go_inval) {
-		set_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
+		/*
+		 * If another process is already doing the invalidate we should
+		 * not interfere. If we call go_sync and it finds the glock is
+		 * not dirty, we might call go_inval prematurely before the
+		 * other go_sync has finished with its revokes. If we then call
+		 * lm_lock prematurely, we've really screwed up: we cannot tell
+		 * dlm to give the glock away until we're synced and
+		 * invalidated. Best thing is to return and trust the other
+		 * process will finish do_xmote tasks in their proper order.
+		 */
+		if (test_and_set_bit(GLF_INVALIDATE_IN_PROGRESS,
+				     &gl->gl_flags))
+			return;
 		do_error(gl, 0); /* Fail queued try locks */
 	}
 	gl->gl_req = target;
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 22/32] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (20 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 21/32] gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 23/32] gfs2: Don't skip log flush if glock still has revokes Bob Peterson
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.

This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c | 26 +++++++++++++++++++++++++-
 fs/gfs2/log.c   |  2 +-
 fs/gfs2/log.h   |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index e0219b0e2229..4072f37e4278 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -92,8 +92,31 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
 	set_bit(TR_ATTACHED, &tr.tr_flags); /* prevent gfs2_trans_end free */
 	tr.tr_revokes = atomic_read(&gl->gl_ail_count);
 
-	if (!tr.tr_revokes)
+	if (!tr.tr_revokes) {
+		/*
+		 * We have nothing on the ail, but there could be revokes on
+		 * the sdp revoke queue, in which case, we still want to flush
+		 * the log and wait for it to finish.
+		 *
+		 * If the sdp revoke list is empty too, we might still have an
+		 * io outstanding for writing revokes, so we should wait for
+		 * it before returning.
+		 *
+		 * If none of these conditions are true, our revokes are all
+		 * flushed and we can return.
+		 */
+		gfs2_log_lock(sdp);
+		if (!list_empty(&sdp->sd_log_le_revoke)) {
+			gfs2_log_unlock(sdp);
+			goto flush;
+		} else if (atomic_read(&sdp->sd_log_in_flight)) {
+			gfs2_log_unlock(sdp);
+			log_flush_wait(sdp);
+		} else {
+			gfs2_log_unlock(sdp);
+		}
 		return;
+	}
 
 	/* A shortened, inline version of gfs2_trans_begin()
          * tr->alloced is not set since the transaction structure is
@@ -108,6 +131,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
 	__gfs2_ail_flush(gl, 0, tr.tr_revokes);
 
 	gfs2_trans_end(sdp);
+flush:
 	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 		       GFS2_LFC_AIL_EMPTY_GL);
 }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 11aa65dae761..6b5bac203d6d 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -536,7 +536,7 @@ static void log_pull_tail(struct gfs2_sbd *sdp, unsigned int new_tail)
 }
 
 
-static void log_flush_wait(struct gfs2_sbd *sdp)
+void log_flush_wait(struct gfs2_sbd *sdp)
 {
 	DEFINE_WAIT(wait);
 
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
index 2315fca47a2b..c60f1432bb28 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -73,6 +73,7 @@ extern void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl,
 			   u32 type);
 extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans);
 extern void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc);
+extern void log_flush_wait(struct gfs2_sbd *sdp);
 
 extern void gfs2_log_shutdown(struct gfs2_sbd *sdp);
 extern int gfs2_logd(void *data);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 23/32] gfs2: Don't skip log flush if glock still has revokes
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (21 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 22/32] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 24/32] gfs2: initialize tr_ail1_list when creating transactions Bob Peterson
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch,function gfs2_log_flush only did flush work if
it saw the LFLUSH flag on the given glock. However, that glock still
needs to be flushed if it has revokes attached that are still pending.
If we don't flush them, they can be missed in the journal and a
journal replay can replay metadata that should have been revoked.

This patch adds an additional check to gfs2_log_flush for this condition.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 6b5bac203d6d..80b8220c532f 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -789,7 +789,8 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 	down_write(&sdp->sd_log_flush_lock);
 
 	/* Log might have been flushed while we waited for the flush lock */
-	if (gl && !test_bit(GLF_LFLUSH, &gl->gl_flags)) {
+	if (gl && !test_bit(GLF_LFLUSH, &gl->gl_flags) &&
+	    atomic_read(&gl->gl_revokes) == 0) {
 		up_write(&sdp->sd_log_flush_lock);
 		return;
 	}
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 24/32] gfs2: initialize tr_ail1_list when creating transactions
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (22 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 23/32] gfs2: Don't skip log flush if glock still has revokes Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 25/32] gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error Bob Peterson
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

In function gfs2_trans_begin, new transactions were created but their
ail1 list, tr_ail1_list was never initialized. Therefore it relied
upon other circumstances when the list became empty.
This patch adds proper initialization of the list.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/trans.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 7d16d9aa3153..790fdd4e8c5e 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -58,6 +58,7 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 						   sizeof(u64));
 	gfs2_assert_warn(sdp, list_empty(&tr->tr_databuf));
 	gfs2_assert_warn(sdp, list_empty(&tr->tr_buf));
+	INIT_LIST_HEAD(&tr->tr_ail1_list);
 
 	sb_start_intwrite(sdp->sd_vfs);
 
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 25/32] gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (23 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 24/32] gfs2: initialize tr_ail1_list when creating transactions Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 26/32] gfs2: drain the ail2 list after io errors Bob Peterson
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_ail1_start_one would return any
errors it received from write_cache_pages (except -EBUSY) but it did
not withdraw. Since function gfs2_ail1_flush just checks for the bad
return code and loops, the loop might potentially never end.
This patch adds some logic to allow it to exit the loop and withdraw
properly when errors are received from write_cache_pages.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 80b8220c532f..359b85660da9 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -98,6 +98,7 @@ __acquires(&sdp->sd_ail_lock)
 	struct address_space *mapping;
 	struct gfs2_bufdata *bd, *s;
 	struct buffer_head *bh;
+	int ret = 0;
 
 	list_for_each_entry_safe_reverse(bd, s, &tr->tr_ail1_list, bd_ail_st_list) {
 		bh = bd->bd_bh;
@@ -130,10 +131,16 @@ __acquires(&sdp->sd_ail_lock)
 		if (!mapping)
 			continue;
 		spin_unlock(&sdp->sd_ail_lock);
-		generic_writepages(mapping, wbc);
+		ret = generic_writepages(mapping, wbc);
 		spin_lock(&sdp->sd_ail_lock);
+		if (ret == -EBUSY)
+			return 1; /* need to retry */
+		if (ret < 0) {
+			set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
+			return ret;
+		}
 		if (wbc->nr_to_write <= 0)
-			break;
+			return 0;
 		return 1;
 	}
 
@@ -155,6 +162,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	struct list_head *head = &sdp->sd_ail1_list;
 	struct gfs2_trans *tr;
 	struct blk_plug plug;
+	int ret;
 
 	trace_gfs2_ail_flush(sdp, wbc, 1);
 	blk_start_plug(&plug);
@@ -163,7 +171,10 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	list_for_each_entry_reverse(tr, head, tr_list) {
 		if (wbc->nr_to_write <= 0)
 			break;
-		if (gfs2_ail1_start_one(sdp, wbc, tr) && !gfs2_withdrawn(sdp))
+		ret = gfs2_ail1_start_one(sdp, wbc, tr);
+		if (ret < 0)
+			break;
+		else if (ret)
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 26/32] gfs2: drain the ail2 list after io errors
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (24 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 25/32] gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 27/32] gfs2: make gfs2_log_shutdown static Bob Peterson
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, gfs2_logd continually tried to flush its journal
log, after the file system is withdrawn. We don't want to write anything
to the journal, lest we add corruption. Best course of action is to
drain the ail1 into the ail2 list (via gfs2_ail1_empty) then drain the
ail2 list with a new function, ail2_drain.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c   | 82 +++++++++++++++++++++++++++++++++++++++++++------
 fs/gfs2/trans.c |  4 +++
 2 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 359b85660da9..c4c832f6b6a3 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -302,25 +302,34 @@ static void gfs2_ail1_wait(struct gfs2_sbd *sdp)
 }
 
 /**
- * gfs2_ail2_empty_one - Check whether or not a trans in the AIL has been synced
- * @sdp: the filesystem
- * @ai: the AIL entry
- *
+ * gfs2_ail_empty_tr - empty one of the ail lists for a transaction
  */
 
-static void gfs2_ail2_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
+static void gfs2_ail_empty_tr(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
+			      struct list_head *head)
 {
-	struct list_head *head = &tr->tr_ail2_list;
 	struct gfs2_bufdata *bd;
 
 	while (!list_empty(head)) {
-		bd = list_entry(head->prev, struct gfs2_bufdata,
-				bd_ail_st_list);
+		bd = list_first_entry(head, struct gfs2_bufdata,
+				      bd_ail_st_list);
 		gfs2_assert(sdp, bd->bd_tr == tr);
 		gfs2_remove_from_ail(bd);
 	}
 }
 
+/**
+ * gfs2_ail2_empty_one - Check whether or not a trans in the AIL has been synced
+ * @sdp: the filesystem
+ * @ai: the AIL entry
+ *
+ */
+
+static void gfs2_ail2_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
+{
+	gfs2_ail_empty_tr(sdp, tr, &tr->tr_ail2_list);
+}
+
 static void ail2_empty(struct gfs2_sbd *sdp, unsigned int new_tail)
 {
 	struct gfs2_trans *tr, *safe;
@@ -784,6 +793,41 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags)
 		log_pull_tail(sdp, tail);
 }
 
+/**
+ * ail_drain - drain the ail lists after a withdraw
+ * @sdp: Pointer to GFS2 superblock
+ */
+static void ail_drain(struct gfs2_sbd *sdp)
+{
+	struct gfs2_trans *tr;
+
+	spin_lock(&sdp->sd_ail_lock);
+	/*
+	 * For transactions on the sd_ail1_list we need to drain both the
+	 * ail1 and ail2 lists. That's because function gfs2_ail1_start_one
+	 * (temporarily) moves items from its tr_ail1 list to tr_ail2 list
+	 * before revokes are sent for that block. Items on the sd_ail2_list
+	 * should have already gotten beyond that point, so no need.
+	 */
+	while (!list_empty(&sdp->sd_ail1_list)) {
+		tr = list_first_entry(&sdp->sd_ail1_list, struct gfs2_trans,
+				      tr_list);
+		gfs2_ail_empty_tr(sdp, tr, &tr->tr_ail1_list);
+		gfs2_ail_empty_tr(sdp, tr, &tr->tr_ail2_list);
+		list_del(&tr->tr_list);
+		gfs2_trans_free(sdp, tr);
+	}
+	while (!list_empty(&sdp->sd_ail2_list)) {
+		tr = list_first_entry(&sdp->sd_ail2_list, struct gfs2_trans,
+				      tr_list);
+		gfs2_ail_empty_tr(sdp, tr, &tr->tr_ail2_list);
+		/* Same as gfs2_ail2_empty_one(sdp, tr);*/
+		list_del(&tr->tr_list);
+		gfs2_trans_free(sdp, tr);
+	}
+	spin_unlock(&sdp->sd_ail_lock);
+}
+
 /**
  * gfs2_log_flush - flush incore transaction(s)
  * @sdp: the filesystem
@@ -794,11 +838,18 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags)
 
 void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 {
-	struct gfs2_trans *tr;
+	struct gfs2_trans *tr = NULL;
 	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
 
 	down_write(&sdp->sd_log_flush_lock);
 
+	/*
+	 * Do this check while holding the log_flush_lock to prevent new
+	 * buffers from being added to the ail via gfs2_pin()
+	 */
+	if (gfs2_withdrawn(sdp))
+		goto out;
+
 	/* Log might have been flushed while we waited for the flush lock */
 	if (gl && !test_bit(GLF_LFLUSH, &gl->gl_flags) &&
 	    atomic_read(&gl->gl_revokes) == 0) {
@@ -827,8 +878,14 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 			sdp->sd_log_num_revoke == sdp->sd_log_commited_revoke);
 
 	gfs2_ordered_write(sdp);
+	if (gfs2_withdrawn(sdp))
+		goto out;
 	lops_before_commit(sdp, tr);
+	if (gfs2_withdrawn(sdp))
+		goto out;
 	gfs2_log_submit_bio(&sdp->sd_log_bio, REQ_OP_WRITE);
+	if (gfs2_withdrawn(sdp))
+		goto out;
 
 	if (sdp->sd_log_head != sdp->sd_log_flush_head) {
 		log_flush_wait(sdp);
@@ -838,6 +895,8 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 		trace_gfs2_log_blocks(sdp, -1);
 		log_write_header(sdp, flags);
 	}
+	if (gfs2_withdrawn(sdp))
+		goto out;
 	lops_after_commit(sdp, tr);
 
 	gfs2_log_lock(sdp);
@@ -876,6 +935,11 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 	}
 
 out:
+	if (gfs2_withdrawn(sdp)) {
+		ail_drain(sdp); /* frees all transactions */
+		tr = NULL;
+	}
+
 	trace_gfs2_log_flush(sdp, 0, flags);
 	up_write(&sdp->sd_log_flush_lock);
 
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 790fdd4e8c5e..88209a81c7bb 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -235,6 +235,10 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 		fs_info(sdp, "GFS2:adding buf while frozen\n");
 		gfs2_assert_withdraw(sdp, 0);
 	}
+	if (unlikely(gfs2_withdrawn(sdp))) {
+		fs_info(sdp, "GFS2:adding buf while withdrawn! 0x%llx\n",
+			(unsigned long long)bd->bd_bh->b_blocknr);
+	}
 	gfs2_pin(sdp, bd->bd_bh);
 	mh->__pad0 = cpu_to_be64(0);
 	mh->mh_jid = cpu_to_be32(sdp->sd_jdesc->jd_jid);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 27/32] gfs2: make gfs2_log_shutdown static
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (25 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 26/32] gfs2: drain the ail2 list after io errors Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence Bob Peterson
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Function gfs2_log_shutdown is only called from within log.c. This
patch removes the extern declaration and makes it static.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 4 +++-
 fs/gfs2/log.h | 1 -
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index c4c832f6b6a3..49ed80e4bd0f 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -32,6 +32,8 @@
 #include "trace_gfs2.h"
 #include "trans.h"
 
+static void gfs2_log_shutdown(struct gfs2_sbd *sdp);
+
 /**
  * gfs2_struct2blk - compute stuff
  * @sdp: the filesystem
@@ -1034,7 +1036,7 @@ void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
  *
  */
 
-void gfs2_log_shutdown(struct gfs2_sbd *sdp)
+static void gfs2_log_shutdown(struct gfs2_sbd *sdp)
 {
 	gfs2_assert_withdraw(sdp, !sdp->sd_log_blks_reserved);
 	gfs2_assert_withdraw(sdp, !sdp->sd_log_num_revoke);
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
index c60f1432bb28..b661d0747047 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -75,7 +75,6 @@ extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans);
 extern void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc);
 extern void log_flush_wait(struct gfs2_sbd *sdp);
 
-extern void gfs2_log_shutdown(struct gfs2_sbd *sdp);
 extern int gfs2_logd(void *data);
 extern void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd);
 extern void gfs2_write_revokes(struct gfs2_sbd *sdp);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (26 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 27/32] gfs2: make gfs2_log_shutdown static Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-14 10:42   ` Steven Whitehouse
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 29/32] gfs2: if finish_open returns error, clean up iopen glock mess Bob Peterson
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, the rgrp code used two different methods to check
if the rgrp information was up-to-date: (1) The GFS2_RDF_UPTODATE flag
in the rgrp and (2) the existence (or not) of valid buffer_head
pointers in the first bitmap. When the buffer_heads are read in from
media, the rgrp is, by definition, up to date. When the rgrp glock is
invalidated, the buffer_heads are released, thereby indicating the
rgrp is no longer up to date (another node may have changed it).
So we don't need both of these flags. This patch eliminates the flag
in favor of simply checking if the buffer_head pointers exist.
This simplifies the code. It also makes it more bullet-proof:
if there are two methods, they can possibly get out of sync. With
one method, there's no way to get out of sync, and debugging is
easier.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c  |  3 ---
 fs/gfs2/incore.h |  1 -
 fs/gfs2/rgrp.c   | 22 +++++++++++-----------
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4072f37e4278..183fd7cbdbc1 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -213,9 +213,6 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 
 	WARN_ON_ONCE(!(flags & DIO_METADATA));
 	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
-
-	if (rgd)
-		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
 }
 
 static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a15ddd2f9bf4..61be366a2fa7 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -113,7 +113,6 @@ struct gfs2_rgrpd {
 	u32 rd_flags;
 	u32 rd_extfail_pt;		/* extent failure point */
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
-#define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
 #define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 10d3397ed3cd..e5eba83a1a42 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -939,7 +939,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 		goto fail;
 
 	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
+	rgd->rd_flags &= ~GFS2_RDF_PREFERRED;
 	if (rgd->rd_data > sdp->sd_max_rg_data)
 		sdp->sd_max_rg_data = rgd->rd_data;
 	spin_lock(&sdp->sd_rindex_spin);
@@ -1214,15 +1214,15 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
 		}
 	}
 
-	if (!(rgd->rd_flags & GFS2_RDF_UPTODATE)) {
-		for (x = 0; x < length; x++)
-			clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
-		gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
-		rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
-		rgd->rd_free_clone = rgd->rd_free;
-		/* max out the rgrp allocation failure point */
-		rgd->rd_extfail_pt = rgd->rd_free;
-	}
+	for (x = 0; x < length; x++)
+		clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
+
+	gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
+	rgd->rd_flags |= GFS2_RDF_CHECK;
+	rgd->rd_free_clone = rgd->rd_free;
+	/* max out the rgrp allocation failure point */
+	rgd->rd_extfail_pt = rgd->rd_free;
+
 	if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic) {
 		rgd->rd_rgl->rl_unlinked = cpu_to_be32(count_unlinked(rgd));
 		gfs2_rgrp_ondisk2lvb(rgd->rd_rgl,
@@ -1254,7 +1254,7 @@ static int update_rgrp_lvb(struct gfs2_rgrpd *rgd)
 {
 	u32 rl_flags;
 
-	if (rgd->rd_flags & GFS2_RDF_UPTODATE)
+	if (rgd->rd_bits[0].bi_bh)
 		return 0;
 
 	if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic)
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 29/32] gfs2: if finish_open returns error, clean up iopen glock mess
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (27 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written Bob Peterson
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, if anything went wrong in function gfs2_create_inode
it would goto fail_gunlock3 and clean up the iopen glock it had just
created and locked. However, if function finish_open returns an error
it did not. That meant subsequent attempts to create the file were
seen as glock recursion errors on the iopen glock.

This patch adds additional checking for an error from finish_open and
cleans up the iopen glock appropriately.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index dcb5d363f9b9..acfc57e1cb53 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -772,6 +772,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	gfs2_glock_dq_uninit(ghs);
 	gfs2_glock_dq_uninit(ghs + 1);
 	clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
+	if (error) {
+		glock_clear_object(io_gl, ip);
+		clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
+		gfs2_glock_dq_uninit(&ip->i_iopen_gh);
+		gfs2_glock_put(io_gl);
+	}
 	return error;
 
 fail_gunlock3:
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (28 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 29/32] gfs2: if finish_open returns error, clean up iopen glock mess Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-14 10:45   ` Steven Whitehouse
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 31/32] gfs2: Do proper error checking for go_sync family of glops functions Bob Peterson
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.

This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index ab72797e3ba1..082f70eb96db 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -712,6 +712,9 @@ __acquires(&gl->gl_lockref.lock)
 			goto out_unlock;
 		if (nonblock)
 			goto out_sched;
+		smp_mb();
+		if (atomic_read(&gl->gl_revokes) != 0)
+			goto out_sched;
 		set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
 		GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
 		gl->gl_target = gl->gl_demote_state;
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 31/32] gfs2: Do proper error checking for go_sync family of glops functions
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (29 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 32/32] gfs2: fix glock reference problem in gfs2_trans_add_unrevoke Bob Peterson
  2019-11-14 10:48 ` [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Steven Whitehouse
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function do_xmote would try to sync out the glock
dirty data by calling the appropriate glops function XXX_go_sync()
but it did not check for a good return code. If the sync was not
possible due to an io error or whatever, do_xmote would continue on
and call go_inval and release the glock to other cluster nodes.
When those nodes go to replay the journal, they may already be holding
glocks for the journal records that should have been synced, but were
not due to the ignored error.

This patch introduces proper error code checking to the go_sync
family of glops functions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 16 ++++++++++++++--
 fs/gfs2/glops.c  | 30 +++++++++++++++++++-----------
 fs/gfs2/incore.h |  2 +-
 3 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 082f70eb96db..464c89c89a2b 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -606,8 +606,20 @@ __acquires(&gl->gl_lockref.lock)
 	    (lck_flags & (LM_FLAG_TRY|LM_FLAG_TRY_1CB)))
 		clear_bit(GLF_BLOCKING, &gl->gl_flags);
 	spin_unlock(&gl->gl_lockref.lock);
-	if (glops->go_sync)
-		glops->go_sync(gl);
+	if (glops->go_sync) {
+		ret = glops->go_sync(gl);
+		/* If we had a problem syncing (due to io errors or whatever,
+		 * we should not invalidate the metadata or tell dlm to
+		 * release the glock to other nodes.
+		 */
+		if (ret) {
+			if (cmpxchg(&sdp->sd_log_error, 0, ret)) {
+				fs_err(sdp, "Error %d syncing glock \n", ret);
+				gfs2_dump_glock(NULL, gl, true);
+			}
+			return;
+		}
+	}
 	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags)) {
 		/*
 		 * The call to go_sync should have cleared out the ail list.
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 183fd7cbdbc1..412237469773 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -81,10 +81,11 @@ static void __gfs2_ail_flush(struct gfs2_glock *gl, bool fsync,
 }
 
 
-static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
+static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_trans tr;
+	int ret;
 
 	memset(&tr, 0, sizeof(tr));
 	INIT_LIST_HEAD(&tr.tr_buf);
@@ -115,7 +116,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
 		} else {
 			gfs2_log_unlock(sdp);
 		}
-		return;
+		return 0;
 	}
 
 	/* A shortened, inline version of gfs2_trans_begin()
@@ -123,8 +124,9 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
          * on the stack */
 	tr.tr_reserved = 1 + gfs2_struct2blk(sdp, tr.tr_revokes, sizeof(u64));
 	tr.tr_ip = _RET_IP_;
-	if (gfs2_log_reserve(sdp, tr.tr_reserved) < 0)
-		return;
+	ret = gfs2_log_reserve(sdp, tr.tr_reserved);
+	if (ret < 0)
+		return ret;
 	WARN_ON_ONCE(current->journal_info);
 	current->journal_info = &tr;
 
@@ -134,6 +136,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
 flush:
 	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 		       GFS2_LFC_AIL_EMPTY_GL);
+	return 0;
 }
 
 void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
@@ -167,7 +170,7 @@ void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
  * return to caller to demote/unlock the glock until I/O is complete.
  */
 
-static void rgrp_go_sync(struct gfs2_glock *gl)
+static int rgrp_go_sync(struct gfs2_glock *gl)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct address_space *mapping = &sdp->sd_aspace;
@@ -175,21 +178,24 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
 	int error;
 
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
-		return;
+		return 0;
 	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
 
 	gfs2_log_flush(sdp, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
 		       GFS2_LFC_RGRP_GO_SYNC);
 	filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
 	error = filemap_fdatawait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
+	WARN_ON_ONCE(error);
 	mapping_set_error(mapping, error);
-	gfs2_ail_empty_gl(gl);
+	if (!error)
+		error = gfs2_ail_empty_gl(gl);
 
 	spin_lock(&gl->gl_lockref.lock);
 	rgd = gl->gl_object;
 	if (rgd)
 		gfs2_free_clones(rgd);
 	spin_unlock(&gl->gl_lockref.lock);
+	return error;
 }
 
 /**
@@ -253,12 +259,12 @@ static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
  *
  */
 
-static void inode_go_sync(struct gfs2_glock *gl)
+static int inode_go_sync(struct gfs2_glock *gl)
 {
 	struct gfs2_inode *ip = gfs2_glock2inode(gl);
 	int isreg = ip && S_ISREG(ip->i_inode.i_mode);
 	struct address_space *metamapping = gfs2_glock2aspace(gl);
-	int error;
+	int error = 0;
 
 	if (isreg) {
 		if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
@@ -291,6 +297,7 @@ static void inode_go_sync(struct gfs2_glock *gl)
 
 out:
 	gfs2_clear_glop_pending(ip);
+	return error;
 }
 
 /**
@@ -511,7 +518,7 @@ static void inode_go_dump(struct seq_file *seq, struct gfs2_glock *gl,
  *
  */
 
-static void freeze_go_sync(struct gfs2_glock *gl)
+static int freeze_go_sync(struct gfs2_glock *gl)
 {
 	int error = 0;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -525,7 +532,7 @@ static void freeze_go_sync(struct gfs2_glock *gl)
 				error);
 			if (gfs2_withdrawn(sdp)) {
 				atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
-				return;
+				return 0;
 			}
 			gfs2_assert_withdraw(sdp, 0);
 		}
@@ -533,6 +540,7 @@ static void freeze_go_sync(struct gfs2_glock *gl)
 		gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_FREEZE |
 			       GFS2_LFC_FREEZE_GO_SYNC);
 	}
+	return 0;
 }
 
 /**
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 61be366a2fa7..bebb2c87e0ea 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -233,7 +233,7 @@ struct lm_lockname {
 
 
 struct gfs2_glock_operations {
-	void (*go_sync) (struct gfs2_glock *gl);
+	int (*go_sync) (struct gfs2_glock *gl);
 	int (*go_xmote_bh) (struct gfs2_glock *gl, struct gfs2_holder *gh);
 	void (*go_inval) (struct gfs2_glock *gl, int flags);
 	int (*go_demote_ok) (const struct gfs2_glock *gl);
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 32/32] gfs2: fix glock reference problem in gfs2_trans_add_unrevoke
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (30 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 31/32] gfs2: Do proper error checking for go_sync family of glops functions Bob Peterson
@ 2019-11-13 21:30 ` Bob Peterson
  2019-11-14 10:48 ` [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Steven Whitehouse
  32 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-13 21:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Patch 9287c6452d2 fixed a situation in which gfs2 could use a glock
after it had been freed. To fix that, it temporarily added a new
glock reference by calling gfs2_glock_hold in function gfs2_add_revoke.
However, if the bd element was removed by gfs2_trans_add_unrevoke
it failed to drop the additional reference.

This patch adds logic to gfs2_trans_add_unrevoke so that it can
properly drop the additional glock reference.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/trans.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 88209a81c7bb..f1ad6345d91f 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -272,6 +272,11 @@ void gfs2_trans_remove_revoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len)
 			list_del_init(&bd->bd_list);
 			gfs2_assert_withdraw(sdp, sdp->sd_log_num_revoke);
 			sdp->sd_log_num_revoke--;
+			if (bd->bd_gl &&
+			    atomic_dec_return(&bd->bd_gl->gl_revokes) == 0) {
+				clear_bit(GLF_LFLUSH, &bd->bd_gl->gl_flags);
+				gfs2_glock_queue_put(bd->bd_gl);
+			}
 			kmem_cache_free(gfs2_bufdata_cachep, bd);
 			tr->tr_num_revoke--;
 			if (--n == 0)
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence Bob Peterson
@ 2019-11-14 10:42   ` Steven Whitehouse
  2019-11-14 13:16     ` Bob Peterson
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Whitehouse @ 2019-11-14 10:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 13/11/2019 21:30, Bob Peterson wrote:
> Before this patch, the rgrp code used two different methods to check
> if the rgrp information was up-to-date: (1) The GFS2_RDF_UPTODATE flag
> in the rgrp and (2) the existence (or not) of valid buffer_head
> pointers in the first bitmap. When the buffer_heads are read in from
> media, the rgrp is, by definition, up to date. When the rgrp glock is
> invalidated, the buffer_heads are released, thereby indicating the
> rgrp is no longer up to date (another node may have changed it).
> So we don't need both of these flags. This patch eliminates the flag
> in favor of simply checking if the buffer_head pointers exist.
> This simplifies the code. It also makes it more bullet-proof:
> if there are two methods, they can possibly get out of sync. With
> one method, there's no way to get out of sync, and debugging is
> easier.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>

These are two different things... the buffer_head flags signal whether 
the buffer head is up to date with respect to what is on disk. The 
GFS2_RDF_UPTODATE flag is there to indicate whether the internal copy of 
the various fields in the resource group is up to date.

These might match depending on how the rgrp's internal copy of the 
fields is maintained, but not sure that this is guaranteed. Has this 
been tested with the rgrplvb option? We should make sure that is all 
still working correctly,

Steve.


> ---
>   fs/gfs2/glops.c  |  3 ---
>   fs/gfs2/incore.h |  1 -
>   fs/gfs2/rgrp.c   | 22 +++++++++++-----------
>   3 files changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 4072f37e4278..183fd7cbdbc1 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -213,9 +213,6 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
>   
>   	WARN_ON_ONCE(!(flags & DIO_METADATA));
>   	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
> -
> -	if (rgd)
> -		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
>   }
>   
>   static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a15ddd2f9bf4..61be366a2fa7 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -113,7 +113,6 @@ struct gfs2_rgrpd {
>   	u32 rd_flags;
>   	u32 rd_extfail_pt;		/* extent failure point */
>   #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
> -#define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
>   #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
>   #define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
>   #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 10d3397ed3cd..e5eba83a1a42 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -939,7 +939,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
>   		goto fail;
>   
>   	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
> -	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
> +	rgd->rd_flags &= ~GFS2_RDF_PREFERRED;
>   	if (rgd->rd_data > sdp->sd_max_rg_data)
>   		sdp->sd_max_rg_data = rgd->rd_data;
>   	spin_lock(&sdp->sd_rindex_spin);
> @@ -1214,15 +1214,15 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
>   		}
>   	}
>   
> -	if (!(rgd->rd_flags & GFS2_RDF_UPTODATE)) {
> -		for (x = 0; x < length; x++)
> -			clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
> -		gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
> -		rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
> -		rgd->rd_free_clone = rgd->rd_free;
> -		/* max out the rgrp allocation failure point */
> -		rgd->rd_extfail_pt = rgd->rd_free;
> -	}
> +	for (x = 0; x < length; x++)
> +		clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
> +
> +	gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
> +	rgd->rd_flags |= GFS2_RDF_CHECK;
> +	rgd->rd_free_clone = rgd->rd_free;
> +	/* max out the rgrp allocation failure point */
> +	rgd->rd_extfail_pt = rgd->rd_free;
> +
>   	if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic) {
>   		rgd->rd_rgl->rl_unlinked = cpu_to_be32(count_unlinked(rgd));
>   		gfs2_rgrp_ondisk2lvb(rgd->rd_rgl,
> @@ -1254,7 +1254,7 @@ static int update_rgrp_lvb(struct gfs2_rgrpd *rgd)
>   {
>   	u32 rl_flags;
>   
> -	if (rgd->rd_flags & GFS2_RDF_UPTODATE)
> +	if (rgd->rd_bits[0].bi_bh)
>   		return 0;
>   
>   	if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic)



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written Bob Peterson
@ 2019-11-14 10:45   ` Steven Whitehouse
  0 siblings, 0 replies; 37+ messages in thread
From: Steven Whitehouse @ 2019-11-14 10:45 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 13/11/2019 21:30, Bob Peterson wrote:
> Before this patch, run_queue would demote glocks based on whether
> there are any more holders. But if the glock has pending revokes that
> haven't been written to the media, giving up the glock might end in
> file system corruption if the revokes never get written due to
> io errors, node crashes and fences, etc. In that case, another node
> will replay the metadata blocks associated with the glock, but
> because the revoke was never written, it could replay that block
> even though the glock had since been granted to another node who
> might have made changes.
>
> This patch changes the logic in run_queue so that it never demotes
> a glock until its count of pending revokes reaches zero.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>

I'm not sure this makes sense... if we demote the glock then the revokes 
should be written out during that process. So if that is not happening 
it is a bug. I don't think we should need to change the logic for 
deciding what we are going to demote?

Steve.

> ---
>   fs/gfs2/glock.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index ab72797e3ba1..082f70eb96db 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -712,6 +712,9 @@ __acquires(&gl->gl_lockref.lock)
>   			goto out_unlock;
>   		if (nonblock)
>   			goto out_sched;
> +		smp_mb();
> +		if (atomic_read(&gl->gl_revokes) != 0)
> +			goto out_sched;
>   		set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
>   		GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
>   		gl->gl_target = gl->gl_demote_state;



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection
  2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
                   ` (31 preceding siblings ...)
  2019-11-13 21:30 ` [Cluster-devel] [PATCH 32/32] gfs2: fix glock reference problem in gfs2_trans_add_unrevoke Bob Peterson
@ 2019-11-14 10:48 ` Steven Whitehouse
  32 siblings, 0 replies; 37+ messages in thread
From: Steven Whitehouse @ 2019-11-14 10:48 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

There are a lot of useful fixes in this series. We should consider how 
many of them should go to -stable I think. Also we should start to get 
them integrated upstream. Might be a good plan to sort out the more 
obvious ones and send those right away, and then do anything which needs 
a bit more review in a second pass,

Steve.

On 13/11/2019 21:29, Bob Peterson wrote:
> This is my latest collection of patches to address the myriad of gfs2
> recovery problems I've found. I'm not convinced we need all of these
> but I thought I'd send them anyway and get feedback
>
> Some of these are just bugs and may be pushed separately.
>
> Bob Peterson (32):
>    gfs2: Introduce concept of a pending withdraw
>    gfs2: clear ail1 list when gfs2 withdraws
>    gfs2: Rework how rgrp buffer_heads are managed
>    gfs2: fix infinite loop in gfs2_ail1_flush on io error
>    gfs2: log error reform
>    gfs2: Only complain the first time an io error occurs in quota or log
>    gfs2: Ignore dlm recovery requests if gfs2 is withdrawn
>    gfs2: move check_journal_clean to util.c for future use
>    gfs2: Allow some glocks to be used during withdraw
>    gfs2: Don't loop forever in gfs2_freeze if withdrawn
>    gfs2: Make secondary withdrawers wait for first withdrawer
>    gfs2: Don't write log headers after file system withdraw
>    gfs2: Force withdraw to replay journals and wait for it to finish
>    gfs2: fix infinite loop when checking ail item count before go_inval
>    gfs2: Add verbose option to check_journal_clean
>    gfs2: Abort gfs2_freeze if io error is seen
>    gfs2: Issue revokes more intelligently
>    gfs2: Prepare to withdraw as soon as an IO error occurs in log write
>    gfs2: Check for log write errors before telling dlm to unlock
>    gfs2: new slab for transactions
>    gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
>    gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
>    gfs2: Don't skip log flush if glock still has revokes
>    gfs2: initialize tr_ail1_list when creating transactions
>    gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error
>    gfs2: drain the ail2 list after io errors
>    gfs2: make gfs2_log_shutdown static
>    gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
>    gfs2: if finish_open returns error, clean up iopen glock mess
>    gfs2: Don't demote a glock until its revokes are written
>    gfs2: Do proper error checking for go_sync family of glops functions
>    gfs2: fix glock reference problem in gfs2_trans_add_unrevoke
>
>   fs/gfs2/aops.c       |   4 +-
>   fs/gfs2/file.c       |   2 +-
>   fs/gfs2/glock.c      | 140 ++++++++++++++++++++++----
>   fs/gfs2/glops.c      | 153 ++++++++++++++++++++++------
>   fs/gfs2/incore.h     |  21 ++--
>   fs/gfs2/inode.c      |   6 ++
>   fs/gfs2/lock_dlm.c   |  52 ++++++++++
>   fs/gfs2/log.c        | 231 +++++++++++++++++++++++++++++-------------
>   fs/gfs2/log.h        |   2 +-
>   fs/gfs2/lops.c       |  12 ++-
>   fs/gfs2/main.c       |  23 +++++
>   fs/gfs2/meta_io.c    |   6 +-
>   fs/gfs2/ops_fstype.c |  51 +---------
>   fs/gfs2/quota.c      |  10 +-
>   fs/gfs2/recovery.c   |   5 +
>   fs/gfs2/rgrp.c       |  82 +++++++++------
>   fs/gfs2/rgrp.h       |   1 -
>   fs/gfs2/super.c      |  97 ++++++++++++------
>   fs/gfs2/sys.c        |   2 +-
>   fs/gfs2/trans.c      |  38 +++++--
>   fs/gfs2/trans.h      |   1 +
>   fs/gfs2/util.c       | 235 +++++++++++++++++++++++++++++++++++++++++--
>   fs/gfs2/util.h       |  16 +++
>   23 files changed, 924 insertions(+), 266 deletions(-)
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
  2019-11-14 10:42   ` Steven Whitehouse
@ 2019-11-14 13:16     ` Bob Peterson
  0 siblings, 0 replies; 37+ messages in thread
From: Bob Peterson @ 2019-11-14 13:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> These are two different things... the buffer_head flags signal whether
> the buffer head is up to date with respect to what is on disk. The
> GFS2_RDF_UPTODATE flag is there to indicate whether the internal copy of
> the various fields in the resource group is up to date.
> 
> These might match depending on how the rgrp's internal copy of the
> fields is maintained, but not sure that this is guaranteed. Has this
> been tested with the rgrplvb option? We should make sure that is all
> still working correctly,
> 
> Steve.

You're right, and it seems obvious now.
I obviously wasn't thinking clearly when I did this. I'll eliminate this patch.

Bob



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2019-11-14 13:16 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 21:29 [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Bob Peterson
2019-11-13 21:29 ` [Cluster-devel] [PATCH 01/32] gfs2: Introduce concept of a pending withdraw Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 02/32] gfs2: clear ail1 list when gfs2 withdraws Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 03/32] gfs2: Rework how rgrp buffer_heads are managed Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 04/32] gfs2: fix infinite loop in gfs2_ail1_flush on io error Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 05/32] gfs2: log error reform Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 06/32] gfs2: Only complain the first time an io error occurs in quota or log Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 07/32] gfs2: Ignore dlm recovery requests if gfs2 is withdrawn Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 08/32] gfs2: move check_journal_clean to util.c for future use Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 09/32] gfs2: Allow some glocks to be used during withdraw Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 10/32] gfs2: Don't loop forever in gfs2_freeze if withdrawn Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 11/32] gfs2: Make secondary withdrawers wait for first withdrawer Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 12/32] gfs2: Don't write log headers after file system withdraw Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 13/32] gfs2: Force withdraw to replay journals and wait for it to finish Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 14/32] gfs2: fix infinite loop when checking ail item count before go_inval Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 15/32] gfs2: Add verbose option to check_journal_clean Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 16/32] gfs2: Abort gfs2_freeze if io error is seen Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 17/32] gfs2: Issue revokes more intelligently Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 18/32] gfs2: Prepare to withdraw as soon as an IO error occurs in log write Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 19/32] gfs2: Check for log write errors before telling dlm to unlock Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 20/32] gfs2: new slab for transactions Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 21/32] gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 22/32] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 23/32] gfs2: Don't skip log flush if glock still has revokes Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 24/32] gfs2: initialize tr_ail1_list when creating transactions Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 25/32] gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 26/32] gfs2: drain the ail2 list after io errors Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 27/32] gfs2: make gfs2_log_shutdown static Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence Bob Peterson
2019-11-14 10:42   ` Steven Whitehouse
2019-11-14 13:16     ` Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 29/32] gfs2: if finish_open returns error, clean up iopen glock mess Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written Bob Peterson
2019-11-14 10:45   ` Steven Whitehouse
2019-11-13 21:30 ` [Cluster-devel] [PATCH 31/32] gfs2: Do proper error checking for go_sync family of glops functions Bob Peterson
2019-11-13 21:30 ` [Cluster-devel] [PATCH 32/32] gfs2: fix glock reference problem in gfs2_trans_add_unrevoke Bob Peterson
2019-11-14 10:48 ` [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection Steven Whitehouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.