* [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection
@ 2021-07-13 18:09 Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
                   ` (9 more replies)
  0 siblings, 10 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This is a set of 10 patches from my collection. They can be applied
individually or as a set.

Bob Peterson (10):
  gfs2: Fix glock recursion in freeze_go_xmote_bh
  gfs2: Eliminate go_xmote_bh in favor of go_lock
  gfs2: be more verbose replaying invalid rgrp blocks
  gfs2: trivial clean up of gfs2_ail_error
  gfs2: tiny cleanup in gfs2_log_reserve
  gfs2: init system threads before freeze lock
  gfs2: Don't release and reacquire local statfs bh
  gfs2: New log flush watchdog
  gfs2: fix deadlock in gfs2_ail1_empty withdraw
  gfs2: replace sd_aspace with sd_inode

 fs/gfs2/aops.c       |   9 +---
 fs/gfs2/glock.c      |  12 +----
 fs/gfs2/glops.c      |  43 +++++++++--------
 fs/gfs2/incore.h     |  15 +++++-
 fs/gfs2/log.c        |  71 +++++++++++++++++++++++++---
 fs/gfs2/log.h        |   1 +
 fs/gfs2/lops.c       |  44 ++++++++++++------
 fs/gfs2/main.c       |   8 ++++
 fs/gfs2/meta_io.c    |   2 +-
 fs/gfs2/meta_io.h    |   2 -
 fs/gfs2/ops_fstype.c |  80 +++++++++++++++++++++++++++-----
 fs/gfs2/super.c      | 107 ++++++++-----------------------------------
 fs/gfs2/super.h      |   3 +-
 fs/gfs2/sys.c        |   6 ++-
 14 files changed, 235 insertions(+), 168 deletions(-)

-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock Bob Peterson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

We must not call gfs2_consist (which withdraws the file system) from
the freeze glock's freeze_go_xmote_bh function, because the withdraw
will itself try to use the freeze glock, causing a glock recursion
error.

This patch changes freeze_go_xmote_bh to call
gfs2_assert_withdraw_delayed instead of gfs2_consist, which avoids the
recursion.
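The deferred-withdraw idea generalizes: record the failure and return an
error now, and perform the heavyweight recovery later from a context that
does not hold the offending lock. A minimal userspace sketch of the
pattern (names like assert_withdraw_delayed and run_pending_withdraw are
illustrative stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of the "delayed withdraw" pattern:
 * a failed assertion only sets a flag and reports failure to the
 * caller; the heavyweight withdraw runs later, from a context that
 * does not hold the freeze glock. */

struct fs_state {
	bool withdraw_pending;	/* set now, acted on later */
	int withdraw_count;	/* how many times withdraw actually ran */
};

/* Returns true ("assertion failed") when cond is false, after
 * flagging the withdraw instead of performing it. */
static bool assert_withdraw_delayed(struct fs_state *fs, bool cond)
{
	if (cond)
		return false;
	fs->withdraw_pending = true;	/* defer: no lock recursion here */
	return true;
}

/* Called later from a safe context (e.g. a daemon thread). */
static void run_pending_withdraw(struct fs_state *fs)
{
	if (fs->withdraw_pending) {
		fs->withdraw_pending = false;
		fs->withdraw_count++;
	}
}
```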

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 54d3fbeb3002..4939308d54f3 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -604,24 +604,24 @@ static int freeze_go_xmote_bh(struct gfs2_glock *gl)
 	struct gfs2_inode *ip = GFS2_I(sdp->sd_jdesc->jd_inode);
 	struct gfs2_glock *j_gl = ip->i_gl;
 	struct gfs2_log_header_host head;
-	int error;
+	int error = 0;
 
 	if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
 		j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
 
 		error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
-		if (error)
-			gfs2_consist(sdp);
-		if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT))
-			gfs2_consist(sdp);
-
-		/*  Initialize some head of the log stuff  */
-		if (!gfs2_withdrawn(sdp)) {
-			sdp->sd_log_sequence = head.lh_sequence + 1;
-			gfs2_log_pointers_init(sdp, head.lh_blkno);
+		if (gfs2_assert_withdraw_delayed(sdp, !error))
+			goto out;
+		if (gfs2_assert_withdraw_delayed(sdp, head.lh_flags &
+						 GFS2_LOG_HEAD_UNMOUNT)) {
+			error = -EIO;
+			goto out;
 		}
+		sdp->sd_log_sequence = head.lh_sequence + 1;
+		gfs2_log_pointers_init(sdp, head.lh_blkno);
 	}
-	return 0;
+out:
+	return error;
 }
 
 /**
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks Bob Peterson
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, the freeze glock was the only glock to use the
go_xmote_bh glock op (glop). The go_xmote_bh glop is called when a
glock is locked, but so is go_lock. This patch eliminates the
go_xmote_bh glop altogether in favor of just using go_lock for the
freeze glock, for better consistency and readability. It also fixes a
misleading comment in do_promote.
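The consolidation can be sketched as an ops table with a single
post-lock hook. This is a hypothetical userspace analogue, not the
kernel's glock structures; the point is that when two hooks fire at the
same moment, the work can move into one and the other can be deleted:

```c
#include <assert.h>

struct glock;

/* After the patch, only one post-lock hook remains in the ops table. */
struct glock_ops {
	int (*go_lock)(struct glock *gl);
};

struct glock {
	const struct glock_ops *ops;
	int post_lock_ran;
};

static int freeze_go_lock(struct glock *gl)
{
	gl->post_lock_ran++;	/* work formerly done in go_xmote_bh */
	return 0;
}

static const struct glock_ops freeze_ops = { .go_lock = freeze_go_lock };

/* The locking path now has a single optional call site. */
static int do_promote(struct glock *gl)
{
	if (gl->ops->go_lock)
		return gl->ops->go_lock(gl);
	return 0;
}
```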

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 12 +-----------
 fs/gfs2/glops.c  |  7 ++++---
 fs/gfs2/incore.h |  1 -
 3 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1f3902ecdded..9f94d6b13363 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -384,7 +384,7 @@ static void do_error(struct gfs2_glock *gl, const int ret)
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
  * 
- * Returns: 1 if there is a blocked holder at the head of the list, or 2
+ * Returns: 1 if there is a blocked waiter at the head of the list, or 2
  *          if a type specific operation is underway.
  */
 
@@ -504,7 +504,6 @@ static void gfs2_demote_wake(struct gfs2_glock *gl)
 
 static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
 {
-	const struct gfs2_glock_operations *glops = gl->gl_ops;
 	struct gfs2_holder *gh;
 	unsigned state = ret & LM_OUT_ST_MASK;
 	int rv;
@@ -562,15 +561,6 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
 	if (test_and_clear_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags))
 		gfs2_demote_wake(gl);
 	if (state != LM_ST_UNLOCKED) {
-		if (glops->go_xmote_bh) {
-			spin_unlock(&gl->gl_lockref.lock);
-			rv = glops->go_xmote_bh(gl);
-			spin_lock(&gl->gl_lockref.lock);
-			if (rv) {
-				do_error(gl, rv);
-				goto out;
-			}
-		}
 		rv = do_promote(gl);
 		if (rv == 2)
 			goto out_locked;
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4939308d54f3..9f3b68806376 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -595,11 +595,12 @@ static int freeze_go_sync(struct gfs2_glock *gl)
 }
 
 /**
- * freeze_go_xmote_bh - After promoting/demoting the freeze glock
+ * freeze_go_lock - After promoting/demoting the freeze glock
  * @gl: the glock
  */
-static int freeze_go_xmote_bh(struct gfs2_glock *gl)
+static int freeze_go_lock(struct gfs2_holder *gh)
 {
+	struct gfs2_glock *gl = gh->gh_gl;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_inode *ip = GFS2_I(sdp->sd_jdesc->jd_inode);
 	struct gfs2_glock *j_gl = ip->i_gl;
@@ -759,7 +760,7 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
 
 const struct gfs2_glock_operations gfs2_freeze_glops = {
 	.go_sync = freeze_go_sync,
-	.go_xmote_bh = freeze_go_xmote_bh,
+	.go_lock = freeze_go_lock,
 	.go_demote_ok = freeze_go_demote_ok,
 	.go_type = LM_TYPE_NONDISK,
 	.go_flags = GLOF_NONDISK,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6f820f146cb..d5ea955d6a87 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -217,7 +217,6 @@ struct lm_lockname {
 
 struct gfs2_glock_operations {
 	int (*go_sync) (struct gfs2_glock *gl);
-	int (*go_xmote_bh)(struct gfs2_glock *gl);
 	void (*go_inval) (struct gfs2_glock *gl, int flags);
 	int (*go_demote_ok) (const struct gfs2_glock *gl);
 	int (*go_lock) (struct gfs2_holder *gh);
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error Bob Peterson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch adds some crucial information when journal replay detects a
replay of an obsolete rgrp block. For example, it was not printing the
journal id or the generation number being replayed. This just
supplements what is logged in this unusual case.

The function that actually complains about the replaying of an obsolete
rgrp block has been split off to avoid long lines and sparse warnings.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/lops.c | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 8ee05d25dfa6..ca0bb3a73912 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -761,6 +761,32 @@ static void buf_lo_before_scan(struct gfs2_jdesc *jd,
 	jd->jd_replayed_blocks = 0;
 }
 
+#define obsolete_rgrp_replay \
+"Replaying 0x%llx from jid=%d/0x%llx but we already have a bh!\n"
+#define obsolete_rgrp_replay2 \
+"busy:%d, pinned:%d rg_gen:0x%llx, j_gen:0x%llx\n"
+
+static void obsolete_rgrp(struct gfs2_jdesc *jd, struct buffer_head *bh_log,
+			  u64 blkno)
+{
+	struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
+	struct gfs2_rgrpd *rgd;
+	struct gfs2_rgrp *jrgd = (struct gfs2_rgrp *)bh_log->b_data;
+
+	rgd = gfs2_blk2rgrpd(sdp, blkno, false);
+	if (rgd && rgd->rd_addr == blkno &&
+	    rgd->rd_bits && rgd->rd_bits->bi_bh) {
+		fs_info(sdp, obsolete_rgrp_replay, (unsigned long long)blkno,
+			jd->jd_jid, bh_log->b_blocknr);
+		fs_info(sdp, obsolete_rgrp_replay2,
+			buffer_busy(rgd->rd_bits->bi_bh) ? 1 : 0,
+			buffer_pinned(rgd->rd_bits->bi_bh),
+			rgd->rd_igeneration,
+			be64_to_cpu(jrgd->rg_igeneration));
+		gfs2_dump_glock(NULL, rgd->rd_gl, true);
+	}
+}
+
 static int buf_lo_scan_elements(struct gfs2_jdesc *jd, u32 start,
 				struct gfs2_log_descriptor *ld, __be64 *ptr,
 				int pass)
@@ -799,21 +825,9 @@ static int buf_lo_scan_elements(struct gfs2_jdesc *jd, u32 start,
 			struct gfs2_meta_header *mh =
 				(struct gfs2_meta_header *)bh_ip->b_data;
 
-			if (mh->mh_type == cpu_to_be32(GFS2_METATYPE_RG)) {
-				struct gfs2_rgrpd *rgd;
-
-				rgd = gfs2_blk2rgrpd(sdp, blkno, false);
-				if (rgd && rgd->rd_addr == blkno &&
-				    rgd->rd_bits && rgd->rd_bits->bi_bh) {
-					fs_info(sdp, "Replaying 0x%llx but we "
-						"already have a bh!\n",
-						(unsigned long long)blkno);
-					fs_info(sdp, "busy:%d, pinned:%d\n",
-						buffer_busy(rgd->rd_bits->bi_bh) ? 1 : 0,
-						buffer_pinned(rgd->rd_bits->bi_bh));
-					gfs2_dump_glock(NULL, rgd->rd_gl, true);
-				}
-			}
+			if (mh->mh_type == cpu_to_be32(GFS2_METATYPE_RG))
+				obsolete_rgrp(jd, bh_log, blkno);
+
 			mark_buffer_dirty(bh_ip);
 		}
 		brelse(bh_log);
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (2 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve Bob Peterson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch does not change functionality. It adds a local variable sdp
to clean up function gfs2_ail_error and make it more readable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 9f3b68806376..744cacd27213 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -33,16 +33,18 @@ extern struct workqueue_struct *gfs2_control_wq;
 
 static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
 {
-	fs_err(gl->gl_name.ln_sbd,
+	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+
+	fs_err(sdp,
 	       "AIL buffer %p: blocknr %llu state 0x%08lx mapping %p page "
 	       "state 0x%lx\n",
 	       bh, (unsigned long long)bh->b_blocknr, bh->b_state,
 	       bh->b_page->mapping, bh->b_page->flags);
-	fs_err(gl->gl_name.ln_sbd, "AIL glock %u:%llu mapping %p\n",
+	fs_err(sdp, "AIL glock %u:%llu mapping %p\n",
 	       gl->gl_name.ln_type, gl->gl_name.ln_number,
 	       gfs2_glock2aspace(gl));
-	gfs2_lm(gl->gl_name.ln_sbd, "AIL error\n");
-	gfs2_withdraw(gl->gl_name.ln_sbd);
+	gfs2_lm(sdp, "AIL error\n");
+	gfs2_withdraw(sdp);
 }
 
 /**
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (3 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock Bob Peterson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Function gfs2_log_reserve was setting revoke_blks to 0. There is no
need, because the value is calculated shortly thereafter. This patch
removes the unnecessary initialization.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 42c15cfc0821..f0ee3ff6f9a8 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -594,7 +594,7 @@ void gfs2_log_reserve(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
 {
 	unsigned int blks = tr->tr_reserved;
 	unsigned int revokes = tr->tr_revokes;
-	unsigned int revoke_blks = 0;
+	unsigned int revoke_blks;
 
 	*extra_revokes = 0;
 	if (revokes) {
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (4 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh Bob Peterson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Patch 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro
functions") changed the gfs2 mount sequence so that it holds the freeze
lock before calling gfs2_make_fs_rw. Before this patch, gfs2_make_fs_rw
called init_threads to initialize the quotad and logd threads. That is a
problem if the system needs to withdraw due to IO errors early in the
mount sequence, for example, while initializing the system statfs inode:

1. An IO error causes the statfs glock to not sync properly after
   recovery, and leaves items on the ail list.
2. The leftover items on the ail list cause its do_xmote call to fail,
   which makes the glock code want to withdraw. But since the glock
   code cannot withdraw itself (because the withdraw sequence uses
   glocks) it relies upon the logd daemon to initiate the withdraw.
3. The withdraw can never be performed by the logd daemon because all
   this takes place before the logd daemon is started.

This patch moves function init_threads from super.c to ops_fstype.c
and changes gfs2_fill_super to start the threads before acquiring the
freeze lock and, on error, to stop them after releasing it. This
allows logd to run unblocked by the freeze lock, so the logd daemon
can perform its withdraw sequence properly.
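The start-two-workers-with-rollback shape of init_threads can be
sketched in userspace. Here start_worker stands in for kthread_run and
an int handle stands in for the task pointer (all names are
illustrative); the invariant is that a failed second start rolls back
the first, so the caller never sees a half-initialized state:

```c
#include <assert.h>

struct workers {
	int logd;	/* 0 = not running */
	int quotad;
};

static int start_worker(int should_fail)
{
	return should_fail ? 0 : 1;	/* 1 plays the role of a task pointer */
}

static int init_threads(struct workers *w, int fail_logd, int fail_quotad)
{
	w->logd = start_worker(fail_logd);
	if (!w->logd)
		return -1;		/* nothing started yet, just bail */

	w->quotad = start_worker(fail_quotad);
	if (!w->quotad) {
		w->logd = 0;		/* roll back: stop logd before failing */
		return -1;
	}
	return 0;
}
```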

Fixes: 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/ops_fstype.c | 42 ++++++++++++++++++++++++++++++
 fs/gfs2/super.c      | 61 +++++---------------------------------------
 2 files changed, 48 insertions(+), 55 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 5f4504dd0875..0b8229fa0bb8 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1088,6 +1088,34 @@ void gfs2_online_uevent(struct gfs2_sbd *sdp)
 	kobject_uevent_env(&sdp->sd_kobj, KOBJ_ONLINE, envp);
 }
 
+static int init_threads(struct gfs2_sbd *sdp)
+{
+	struct task_struct *p;
+	int error = 0;
+
+	p = kthread_run(gfs2_logd, sdp, "gfs2_logd");
+	if (IS_ERR(p)) {
+		error = PTR_ERR(p);
+		fs_err(sdp, "can't start logd thread: %d\n", error);
+		return error;
+	}
+	sdp->sd_logd_process = p;
+
+	p = kthread_run(gfs2_quotad, sdp, "gfs2_quotad");
+	if (IS_ERR(p)) {
+		error = PTR_ERR(p);
+		fs_err(sdp, "can't start quotad thread: %d\n", error);
+		goto fail;
+	}
+	sdp->sd_quotad_process = p;
+	return 0;
+
+fail:
+	kthread_stop(sdp->sd_logd_process);
+	sdp->sd_logd_process = NULL;
+	return error;
+}
+
 /**
  * gfs2_fill_super - Read in superblock
  * @sb: The VFS superblock
@@ -1216,6 +1244,14 @@ static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
 		goto fail_per_node;
 	}
 
+	if (!sb_rdonly(sb)) {
+		error = init_threads(sdp);
+		if (error) {
+			gfs2_withdraw_delayed(sdp);
+			goto fail_per_node;
+		}
+	}
+
 	error = gfs2_freeze_lock(sdp, &freeze_gh, 0);
 	if (error)
 		goto fail_per_node;
@@ -1225,6 +1261,12 @@ static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
 
 	gfs2_freeze_unlock(&freeze_gh);
 	if (error) {
+		if (sdp->sd_quotad_process)
+			kthread_stop(sdp->sd_quotad_process);
+		sdp->sd_quotad_process = NULL;
+		if (sdp->sd_logd_process)
+			kthread_stop(sdp->sd_logd_process);
+		sdp->sd_logd_process = NULL;
 		fs_err(sdp, "can't make FS RW: %d\n", error);
 		goto fail_per_node;
 	}
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 4d4ceb0b6903..2bdbba5ea8d7 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -119,34 +119,6 @@ int gfs2_jdesc_check(struct gfs2_jdesc *jd)
 	return 0;
 }
 
-static int init_threads(struct gfs2_sbd *sdp)
-{
-	struct task_struct *p;
-	int error = 0;
-
-	p = kthread_run(gfs2_logd, sdp, "gfs2_logd");
-	if (IS_ERR(p)) {
-		error = PTR_ERR(p);
-		fs_err(sdp, "can't start logd thread: %d\n", error);
-		return error;
-	}
-	sdp->sd_logd_process = p;
-
-	p = kthread_run(gfs2_quotad, sdp, "gfs2_quotad");
-	if (IS_ERR(p)) {
-		error = PTR_ERR(p);
-		fs_err(sdp, "can't start quotad thread: %d\n", error);
-		goto fail;
-	}
-	sdp->sd_quotad_process = p;
-	return 0;
-
-fail:
-	kthread_stop(sdp->sd_logd_process);
-	sdp->sd_logd_process = NULL;
-	return error;
-}
-
 /**
  * gfs2_make_fs_rw - Turn a Read-Only FS into a Read-Write one
  * @sdp: the filesystem
@@ -161,26 +133,17 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 	struct gfs2_log_header_host head;
 	int error;
 
-	error = init_threads(sdp);
-	if (error) {
-		gfs2_withdraw_delayed(sdp);
-		return error;
-	}
-
 	j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
-	if (gfs2_withdrawn(sdp)) {
-		error = -EIO;
-		goto fail;
-	}
+	if (gfs2_withdrawn(sdp))
+		return -EIO;
 
 	error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
 	if (error || gfs2_withdrawn(sdp))
-		goto fail;
+		return error;
 
 	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
 		gfs2_consist(sdp);
-		error = -EIO;
-		goto fail;
+		return -EIO;
 	}
 
 	/*  Initialize some head of the log stuff  */
@@ -188,20 +151,8 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 	gfs2_log_pointers_init(sdp, head.lh_blkno);
 
 	error = gfs2_quota_init(sdp);
-	if (error || gfs2_withdrawn(sdp))
-		goto fail;
-
-	set_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
-
-	return 0;
-
-fail:
-	if (sdp->sd_quotad_process)
-		kthread_stop(sdp->sd_quotad_process);
-	sdp->sd_quotad_process = NULL;
-	if (sdp->sd_logd_process)
-		kthread_stop(sdp->sd_logd_process);
-	sdp->sd_logd_process = NULL;
+	if (!error && !gfs2_withdrawn(sdp))
+		set_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
 	return error;
 }
 
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (5 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, several functions in gfs2 related to the updating
of the statfs file used a newly acquired/read buffer_head for the
local statfs file. This is completely unnecessary, because other nodes
should never update it. Recreating the buffer is a waste of time.

This patch makes gfs2 read in the local statfs buffer_head at mount
time and keep it around until unmount time.
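The read-once caching pattern looks roughly like this in userspace
terms. read_block stands in for gfs2_meta_inode_buffer and the struct
fields are illustrative, not the kernel's; what matters is that the
buffer is read once at mount, reused on every update, and released only
at unmount:

```c
#include <assert.h>
#include <stdlib.h>

struct mount {
	char *sc_buf;	/* cached "buffer_head", valid mount..unmount */
	int reads;	/* how many times we hit the disk */
};

static char *read_block(struct mount *m)
{
	m->reads++;			/* count simulated disk reads */
	return calloc(1, 64);
}

static int mount_fs(struct mount *m)
{
	m->sc_buf = read_block(m);	/* read once, at mount time */
	return m->sc_buf ? 0 : -1;
}

static char *get_statfs_buf(struct mount *m)
{
	return m->sc_buf;		/* no re-read on each update */
}

static void unmount_fs(struct mount *m)
{
	free(m->sc_buf);		/* released once, at unmount */
	m->sc_buf = NULL;
}
```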

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/aops.c       |  9 ++-------
 fs/gfs2/incore.h     |  1 +
 fs/gfs2/ops_fstype.c |  9 +++++++++
 fs/gfs2/super.c      | 44 ++++++++++++--------------------------------
 fs/gfs2/super.h      |  3 +--
 5 files changed, 25 insertions(+), 41 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 81d8f064126e..005e920f5d4a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -574,10 +574,9 @@ void adjust_fs_space(struct inode *inode)
 {
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-	struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
 	struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
-	struct buffer_head *m_bh, *l_bh;
+	struct buffer_head *m_bh;
 	u64 fs_total, new_free;
 
 	if (gfs2_trans_begin(sdp, 2 * RES_STATFS, 0) != 0)
@@ -600,11 +599,7 @@ void adjust_fs_space(struct inode *inode)
 		(unsigned long long)new_free);
 	gfs2_statfs_change(sdp, new_free, new_free, 0);
 
-	if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0)
-		goto out2;
-	update_statfs(sdp, m_bh, l_bh);
-	brelse(l_bh);
-out2:
+	update_statfs(sdp, m_bh);
 	brelse(m_bh);
 out:
 	sdp->sd_rindex_uptodate = 0;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index d5ea955d6a87..6f31a067a5f2 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -767,6 +767,7 @@ struct gfs2_sbd {
 	struct gfs2_glock *sd_jinode_gl;
 
 	struct gfs2_holder sd_sc_gh;
+	struct buffer_head *sd_sc_bh;
 	struct gfs2_holder sd_qc_gh;
 
 	struct completion sd_journal_ready;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 0b8229fa0bb8..6a950c4a61e9 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -695,8 +695,16 @@ static int init_statfs(struct gfs2_sbd *sdp)
 		fs_err(sdp, "can't lock local \"sc\" file: %d\n", error);
 		goto free_local;
 	}
+	/* read in the local statfs buffer - other nodes don't change it. */
+	error = gfs2_meta_inode_buffer(ip, &sdp->sd_sc_bh);
+	if (error) {
+		fs_err(sdp, "Cannot read in local statfs: %d\n", error);
+		goto unlock_sd_gh;
+	}
 	return 0;
 
+unlock_sd_gh:
+	gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
 free_local:
 	free_local_statfs_inodes(sdp);
 	iput(pn);
@@ -710,6 +718,7 @@ static int init_statfs(struct gfs2_sbd *sdp)
 static void uninit_statfs(struct gfs2_sbd *sdp)
 {
 	if (!sdp->sd_args.ar_spectator) {
+		brelse(sdp->sd_sc_bh);
 		gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
 		free_local_statfs_inodes(sdp);
 	}
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 2bdbba5ea8d7..0a5b7dfa7a45 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -178,9 +178,8 @@ int gfs2_statfs_init(struct gfs2_sbd *sdp)
 {
 	struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 	struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
-	struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
-	struct buffer_head *m_bh, *l_bh;
+	struct buffer_head *m_bh;
 	struct gfs2_holder gh;
 	int error;
 
@@ -199,21 +198,15 @@ int gfs2_statfs_init(struct gfs2_sbd *sdp)
 				      sizeof(struct gfs2_dinode));
 		spin_unlock(&sdp->sd_statfs_spin);
 	} else {
-		error = gfs2_meta_inode_buffer(l_ip, &l_bh);
-		if (error)
-			goto out_m_bh;
-
 		spin_lock(&sdp->sd_statfs_spin);
 		gfs2_statfs_change_in(m_sc, m_bh->b_data +
 				      sizeof(struct gfs2_dinode));
-		gfs2_statfs_change_in(l_sc, l_bh->b_data +
+		gfs2_statfs_change_in(l_sc, sdp->sd_sc_bh->b_data +
 				      sizeof(struct gfs2_dinode));
 		spin_unlock(&sdp->sd_statfs_spin);
 
-		brelse(l_bh);
 	}
 
-out_m_bh:
 	brelse(m_bh);
 out:
 	gfs2_glock_dq_uninit(&gh);
@@ -226,22 +219,17 @@ void gfs2_statfs_change(struct gfs2_sbd *sdp, s64 total, s64 free,
 	struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
 	struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
-	struct buffer_head *l_bh;
 	s64 x, y;
 	int need_sync = 0;
-	int error;
-
-	error = gfs2_meta_inode_buffer(l_ip, &l_bh);
-	if (error)
-		return;
 
-	gfs2_trans_add_meta(l_ip->i_gl, l_bh);
+	gfs2_trans_add_meta(l_ip->i_gl, sdp->sd_sc_bh);
 
 	spin_lock(&sdp->sd_statfs_spin);
 	l_sc->sc_total += total;
 	l_sc->sc_free += free;
 	l_sc->sc_dinodes += dinodes;
-	gfs2_statfs_change_out(l_sc, l_bh->b_data + sizeof(struct gfs2_dinode));
+	gfs2_statfs_change_out(l_sc, sdp->sd_sc_bh->b_data +
+			       sizeof(struct gfs2_dinode));
 	if (sdp->sd_args.ar_statfs_percent) {
 		x = 100 * l_sc->sc_free;
 		y = m_sc->sc_free * sdp->sd_args.ar_statfs_percent;
@@ -250,20 +238,18 @@ void gfs2_statfs_change(struct gfs2_sbd *sdp, s64 total, s64 free,
 	}
 	spin_unlock(&sdp->sd_statfs_spin);
 
-	brelse(l_bh);
 	if (need_sync)
 		gfs2_wake_up_statfs(sdp);
 }
 
-void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
-		   struct buffer_head *l_bh)
+void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh)
 {
 	struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 	struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
 	struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
 
-	gfs2_trans_add_meta(l_ip->i_gl, l_bh);
+	gfs2_trans_add_meta(l_ip->i_gl, sdp->sd_sc_bh);
 	gfs2_trans_add_meta(m_ip->i_gl, m_bh);
 
 	spin_lock(&sdp->sd_statfs_spin);
@@ -271,7 +257,7 @@ void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
 	m_sc->sc_free += l_sc->sc_free;
 	m_sc->sc_dinodes += l_sc->sc_dinodes;
 	memset(l_sc, 0, sizeof(struct gfs2_statfs_change));
-	memset(l_bh->b_data + sizeof(struct gfs2_dinode),
+	memset(sdp->sd_sc_bh->b_data + sizeof(struct gfs2_dinode),
 	       0, sizeof(struct gfs2_statfs_change));
 	gfs2_statfs_change_out(m_sc, m_bh->b_data + sizeof(struct gfs2_dinode));
 	spin_unlock(&sdp->sd_statfs_spin);
@@ -281,11 +267,10 @@ int gfs2_statfs_sync(struct super_block *sb, int type)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-	struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
 	struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
 	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
 	struct gfs2_holder gh;
-	struct buffer_head *m_bh, *l_bh;
+	struct buffer_head *m_bh;
 	int error;
 
 	error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, GL_NOCACHE,
@@ -306,21 +291,15 @@ int gfs2_statfs_sync(struct super_block *sb, int type)
 	}
 	spin_unlock(&sdp->sd_statfs_spin);
 
-	error = gfs2_meta_inode_buffer(l_ip, &l_bh);
-	if (error)
-		goto out_bh;
-
 	error = gfs2_trans_begin(sdp, 2 * RES_DINODE, 0);
 	if (error)
-		goto out_bh2;
+		goto out_bh;
 
-	update_statfs(sdp, m_bh, l_bh);
+	update_statfs(sdp, m_bh);
 	sdp->sd_statfs_force_sync = 0;
 
 	gfs2_trans_end(sdp);
 
-out_bh2:
-	brelse(l_bh);
 out_bh:
 	brelse(m_bh);
 out_unlock:
@@ -626,6 +605,7 @@ static void gfs2_put_super(struct super_block *sb)
 			gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
 		if (gfs2_holder_initialized(&sdp->sd_jinode_gh))
 			gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
+		brelse(sdp->sd_sc_bh);
 		gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
 		gfs2_glock_dq_uninit(&sdp->sd_qc_gh);
 		free_local_statfs_inodes(sdp);
diff --git a/fs/gfs2/super.h b/fs/gfs2/super.h
index ec4affb33ed5..58d13fd77aed 100644
--- a/fs/gfs2/super.h
+++ b/fs/gfs2/super.h
@@ -43,8 +43,7 @@ extern void gfs2_statfs_change_in(struct gfs2_statfs_change_host *sc,
 				  const void *buf);
 extern void gfs2_statfs_change_out(const struct gfs2_statfs_change_host *sc,
 				   void *buf);
-extern void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
-			  struct buffer_head *l_bh);
+extern void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh);
 extern int gfs2_statfs_sync(struct super_block *sb, int type);
 extern void gfs2_freeze_func(struct work_struct *work);
 
-- 
2.31.1




* [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (6 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:41   ` Steven Whitehouse
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
  9 siblings, 1 reply; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch adds a new watchdog whose sole purpose is to complain when
gfs2_log_flush operations take too long.
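One plausible reading of the GFS2_LOG_FLUSH_TIMEOUT comment ("1/10
second per page") is that the watchdog's deadline scales with the
number of dirty pages being flushed, so a big flush is allowed
proportionally more time before the complaint fires. A userspace sketch
of that arming arithmetic, under that assumption (the kernel's exact
arming logic is not shown in this hunk, and the values here are
illustrative):

```c
#include <assert.h>

#define HZ 100				/* illustrative tick rate */
#define LOG_FLUSH_TIMEOUT (HZ / 10)	/* 1/10 second per page */

static unsigned long watchdog_deadline(unsigned long start,
				       unsigned long dirty_pages)
{
	/* allow at least one page's budget even for an empty flush */
	unsigned long budget = dirty_pages ? dirty_pages * LOG_FLUSH_TIMEOUT
					   : LOG_FLUSH_TIMEOUT;
	return start + budget;
}

static int watchdog_should_fire(unsigned long now, unsigned long start,
				unsigned long dirty_pages)
{
	return now >= watchdog_deadline(start, dirty_pages);
}
```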

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h     |  6 ++++++
 fs/gfs2/log.c        | 47 ++++++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/log.h        |  1 +
 fs/gfs2/main.c       |  8 ++++++++
 fs/gfs2/ops_fstype.c |  2 ++
 fs/gfs2/sys.c        |  6 ++++--
 6 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 6f31a067a5f2..566c0053b7c5 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -683,6 +683,8 @@ struct local_statfs_inode {
 	unsigned int si_jid; /* journal id this statfs inode corresponds to */
 };
 
+#define GFS2_LOG_FLUSH_TIMEOUT (HZ / 10) /* arbitrary: 1/10 second per page */
+
 struct gfs2_sbd {
 	struct super_block *sd_vfs;
 	struct gfs2_pcpu_lkstats __percpu *sd_lkstats;
@@ -849,6 +851,10 @@ struct gfs2_sbd {
 	unsigned long sd_last_warning;
 	struct dentry *debugfs_dir;    /* debugfs directory */
 	unsigned long sd_glock_dqs_held;
+
+	struct delayed_work sd_log_flush_watchdog;
+	unsigned long sd_dirty_pages;
+	unsigned long sd_log_flush_start;
 };
 
 static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index f0ee3ff6f9a8..bd2ff5ef4b91 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -19,6 +19,7 @@
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
 #include <linux/list_sort.h>
+#include <linux/sched/debug.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -32,8 +33,22 @@
 #include "trace_gfs2.h"
 #include "trans.h"
 
+extern struct workqueue_struct *gfs2_log_flush_wq;
+
 static void gfs2_log_shutdown(struct gfs2_sbd *sdp);
 
+void gfs2_log_flush_watchdog_func(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct gfs2_sbd *sdp = container_of(dwork, struct gfs2_sbd,
+					    sd_log_flush_watchdog);
+
+	fs_err(sdp, "log flush pid %u took > %lu secs to write %lu pages.\n",
+	       sdp->sd_logd_process ? pid_nr(task_pid(sdp->sd_logd_process)) :
+	       0, (jiffies - sdp->sd_log_flush_start) / HZ,
+	       sdp->sd_dirty_pages);
+}
+
 /**
  * gfs2_struct2blk - compute stuff
  * @sdp: the filesystem
@@ -1016,6 +1031,26 @@ static void trans_drain(struct gfs2_trans *tr)
 	}
 }
 
+/**
+ * count_dirty_pages - rough count the dirty ordered writes pages
+ * @sdp: the filesystem
+ *
+ * This is not meant to be exact. It's simply a rough estimate of how many
+ * dirty pages are on the ordered writes list. The actual number of pages
+ * may change because we don't keep the lock held during the log flush.
+ */
+static unsigned long count_dirty_pages(struct gfs2_sbd *sdp)
+{
+	struct gfs2_inode *ip;
+	unsigned long dpages = 0;
+
+	spin_lock(&sdp->sd_ordered_lock);
+	list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
+		dpages += ip->i_inode.i_mapping->nrpages;
+	spin_unlock(&sdp->sd_ordered_lock);
+	return dpages;
+}
+
 /**
  * gfs2_log_flush - flush incore transaction(s)
  * @sdp: The filesystem
@@ -1031,8 +1066,19 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
 	unsigned int first_log_head;
 	unsigned int reserved_revokes = 0;
+	unsigned long dpages;
+
+	dpages = count_dirty_pages(sdp);
 
 	down_write(&sdp->sd_log_flush_lock);
+	if (dpages)
+		if (queue_delayed_work(gfs2_log_flush_wq,
+				       &sdp->sd_log_flush_watchdog,
+				       round_up(dpages *
+						GFS2_LOG_FLUSH_TIMEOUT, HZ))) {
+			sdp->sd_dirty_pages = dpages;
+			sdp->sd_log_flush_start = jiffies;
+		}
 	trace_gfs2_log_flush(sdp, 1, flags);
 
 repeat:
@@ -1144,6 +1190,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
 		gfs2_assert_withdraw_delayed(sdp, used_blocks < reserved_blocks);
 		gfs2_log_release(sdp, reserved_blocks - used_blocks);
 	}
+	cancel_delayed_work(&sdp->sd_log_flush_watchdog);
 	up_write(&sdp->sd_log_flush_lock);
 	gfs2_trans_free(sdp, tr);
 	if (gfs2_withdrawing(sdp))
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
index fc905c2af53c..962044fba53a 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -94,5 +94,6 @@ extern void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd);
 extern void gfs2_glock_remove_revoke(struct gfs2_glock *gl);
 extern void gfs2_flush_revokes(struct gfs2_sbd *sdp);
 extern void gfs2_ail_drain(struct gfs2_sbd *sdp);
+extern void gfs2_log_flush_watchdog_func(struct work_struct *work);
 
 #endif /* __LOG_DOT_H__ */
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 28d0eb23e18e..55a7f29742b3 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -30,6 +30,7 @@
 #include "glops.h"
 
 struct workqueue_struct *gfs2_control_wq;
+struct workqueue_struct *gfs2_log_flush_wq;
 
 static void gfs2_init_inode_once(void *foo)
 {
@@ -178,6 +179,10 @@ static int __init init_gfs2_fs(void)
 	if (!gfs2_freeze_wq)
 		goto fail_wq3;
 
+	gfs2_log_flush_wq = alloc_workqueue("gfs2_log_flush_wq", 0, 0);
+	if (!gfs2_log_flush_wq)
+		goto fail_wq4;
+
 	gfs2_page_pool = mempool_create_page_pool(64, 0);
 	if (!gfs2_page_pool)
 		goto fail_mempool;
@@ -189,6 +194,8 @@ static int __init init_gfs2_fs(void)
 	return 0;
 
 fail_mempool:
+	destroy_workqueue(gfs2_log_flush_wq);
+fail_wq4:
 	destroy_workqueue(gfs2_freeze_wq);
 fail_wq3:
 	destroy_workqueue(gfs2_control_wq);
@@ -240,6 +247,7 @@ static void __exit exit_gfs2_fs(void)
 	destroy_workqueue(gfs_recovery_wq);
 	destroy_workqueue(gfs2_control_wq);
 	destroy_workqueue(gfs2_freeze_wq);
+	destroy_workqueue(gfs2_log_flush_wq);
 	list_lru_destroy(&gfs2_qd_lru);
 
 	rcu_barrier();
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 6a950c4a61e9..b09e61457b23 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -139,6 +139,8 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	init_waitqueue_head(&sdp->sd_log_flush_wait);
 	atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
 	mutex_init(&sdp->sd_freeze_mutex);
+	INIT_DELAYED_WORK(&sdp->sd_log_flush_watchdog,
+			  gfs2_log_flush_watchdog_func);
 
 	return sdp;
 
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index c0a34d9ddee4..c90d9f48571a 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -96,7 +96,8 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
 		     "sd_log_flush_head:        %d\n"
 		     "sd_log_flush_tail:        %d\n"
 		     "sd_log_blks_reserved:     %d\n"
-		     "sd_log_revokes_available: %d\n",
+		     "sd_log_revokes_available: %d\n"
+		     "sd_dirty_pages:           %lu\n",
 		     test_bit(SDF_JOURNAL_CHECKED, &f),
 		     test_bit(SDF_JOURNAL_LIVE, &f),
 		     (sdp->sd_jdesc ? sdp->sd_jdesc->jd_jid : 0),
@@ -124,7 +125,8 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
 		     sdp->sd_log_flush_head,
 		     sdp->sd_log_flush_tail,
 		     sdp->sd_log_blks_reserved,
-		     atomic_read(&sdp->sd_log_revokes_available));
+		     atomic_read(&sdp->sd_log_revokes_available),
+		     sdp->sd_dirty_pages);
 	return s;
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (7 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
  9 siblings, 0 replies; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_ail1_empty could issue a file system
withdraw when I/O errors were discovered. However, it has several
callers, including gfs2_flush_revokes(), which holds the gfs2_log_lock
when calling gfs2_ail1_empty. If gfs2_ail1_empty needed to withdraw,
it would do so with the gfs2_log_lock still held, which resulted in a
deadlock with other processes that needed the log_lock.

This patch moves the withdraw out of function gfs2_ail1_empty and
makes each of its callers check for a withdraw by calling the new
function check_ail1_withdraw. Function gfs2_flush_revokes now performs
this check after releasing the gfs2_log_lock to avoid the deadlock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/log.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index bd2ff5ef4b91..7e0ac87f7d71 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -379,11 +379,6 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp, int max_revokes)
 	ret = list_empty(&sdp->sd_ail1_list);
 	spin_unlock(&sdp->sd_ail_lock);
 
-	if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags)) {
-		gfs2_lm(sdp, "fatal: I/O error(s)\n");
-		gfs2_withdraw(sdp);
-	}
-
 	return ret;
 }
 
@@ -801,6 +796,15 @@ void gfs2_glock_remove_revoke(struct gfs2_glock *gl)
 	}
 }
 
+static void check_ail1_withdraw(struct gfs2_sbd *sdp)
+{
+	if (!test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
+		return;
+
+	gfs2_lm(sdp, "fatal: I/O error(s)\n");
+	gfs2_withdraw(sdp);
+}
+
 /**
  * gfs2_flush_revokes - Add as many revokes to the system transaction as we can
  * @sdp: The GFS2 superblock
@@ -821,6 +825,7 @@ void gfs2_flush_revokes(struct gfs2_sbd *sdp)
 	gfs2_log_lock(sdp);
 	gfs2_ail1_empty(sdp, max_revokes);
 	gfs2_log_unlock(sdp);
+	check_ail1_withdraw(sdp);
 }
 
 /**
@@ -982,6 +987,7 @@ void gfs2_ail_drain(struct gfs2_sbd *sdp)
 static void empty_ail1_list(struct gfs2_sbd *sdp)
 {
 	unsigned long start = jiffies;
+	int empty;
 
 	for (;;) {
 		if (time_after(jiffies, start + (HZ * 600))) {
@@ -992,7 +998,9 @@ static void empty_ail1_list(struct gfs2_sbd *sdp)
 		}
 		gfs2_ail1_start(sdp);
 		gfs2_ail1_wait(sdp);
-		if (gfs2_ail1_empty(sdp, 0))
+		empty = gfs2_ail1_empty(sdp, 0);
+		check_ail1_withdraw(sdp);
+		if (empty)
 			return;
 	}
 }
@@ -1364,6 +1372,7 @@ int gfs2_logd(void *data)
 
 		if (gfs2_jrnl_flush_reqd(sdp) || t == 0) {
 			gfs2_ail1_empty(sdp, 0);
+			check_ail1_withdraw(sdp);
 			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 						  GFS2_LFC_LOGD_JFLUSH_REQD);
 		}
@@ -1372,6 +1381,7 @@ int gfs2_logd(void *data)
 			gfs2_ail1_start(sdp);
 			gfs2_ail1_wait(sdp);
 			gfs2_ail1_empty(sdp, 0);
+			check_ail1_withdraw(sdp);
 			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
 						  GFS2_LFC_LOGD_AIL_FLUSH_REQD);
 		}
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
                   ` (8 preceding siblings ...)
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw Bob Peterson
@ 2021-07-13 18:09 ` Bob Peterson
  2021-07-13 18:26   ` Steven Whitehouse
  9 siblings, 1 reply; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 18:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, gfs2 kept its own address space for rgrps, but this
caused a lockdep problem because the VFS assumes a 1:1 relationship
between address spaces and their inodes. One problematic area is this:

gfs2_unpin
   mark_buffer_dirty(bh);
      mapping = page_mapping(page);
         __set_page_dirty(page, mapping, memcg, 0);
            xa_lock_irqsave(&mapping->i_pages, flags);
                 ^---locks page->mapping->i_pages
            account_page_dirtied(page, mapping)
	       struct inode *inode = mapping->host;
                 ^---assumes the mapping points to an inode
               inode_to_wb(inode)
                  WARN_ON_ONCE !lockdep_is_held(&inode->i_mapping->
                                                i_pages.xa_lock)

The problem manifests as the lockdep warning seen in the last line.

This patch removes sd_aspace in favor of a complete inode, sd_inode.
Functions that need to access the address space may use a new helper
that follows the inode to its address space. This establishes the 1:1
relationship between the inode and its address space, so lockdep
doesn't complain. This is how some other file systems, such as btrfs,
manage their metadata.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c      |  4 ++--
 fs/gfs2/incore.h     |  7 ++++++-
 fs/gfs2/meta_io.c    |  2 +-
 fs/gfs2/meta_io.h    |  2 --
 fs/gfs2/ops_fstype.c | 27 ++++++++++++++++-----------
 fs/gfs2/super.c      |  2 +-
 6 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 744cacd27213..5d755d30d91c 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -162,7 +162,7 @@ void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
 static int gfs2_rgrp_metasync(struct gfs2_glock *gl)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
-	struct address_space *metamapping = &sdp->sd_aspace;
+	struct address_space *metamapping = gfs2_aspace(sdp);
 	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
 	const unsigned bsize = sdp->sd_sb.sb_bsize;
 	loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
@@ -219,7 +219,7 @@ static int rgrp_go_sync(struct gfs2_glock *gl)
 static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
-	struct address_space *mapping = &sdp->sd_aspace;
+	struct address_space *mapping = gfs2_aspace(sdp);
 	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
 	const unsigned bsize = sdp->sd_sb.sb_bsize;
 	loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 566c0053b7c5..075e5db1d654 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -797,7 +797,7 @@ struct gfs2_sbd {
 
 	/* Log stuff */
 
-	struct address_space sd_aspace;
+	struct inode *sd_inode;
 
 	spinlock_t sd_log_lock;
 
@@ -857,6 +857,11 @@ struct gfs2_sbd {
 	unsigned long sd_log_flush_start;
 };
 
+static inline struct address_space *gfs2_aspace(struct gfs2_sbd *sdp)
+{
+	return sdp->sd_inode->i_mapping;
+}
+
 static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)
 {
 	gl->gl_stats.stats[which]++;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 7c9619997355..0123437d9c12 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -120,7 +120,7 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 	unsigned int bufnum;
 
 	if (mapping == NULL)
-		mapping = &sdp->sd_aspace;
+		mapping = gfs2_aspace(sdp);
 
 	shift = PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift;
 	index = blkno >> shift;             /* convert block to page */
diff --git a/fs/gfs2/meta_io.h b/fs/gfs2/meta_io.h
index 21880d72081a..70b9c41ecb46 100644
--- a/fs/gfs2/meta_io.h
+++ b/fs/gfs2/meta_io.h
@@ -42,8 +42,6 @@ static inline struct gfs2_sbd *gfs2_mapping2sbd(struct address_space *mapping)
 	struct inode *inode = mapping->host;
 	if (mapping->a_ops == &gfs2_meta_aops)
 		return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd;
-	else if (mapping->a_ops == &gfs2_rgrp_aops)
-		return container_of(mapping, struct gfs2_sbd, sd_aspace);
 	else
 		return inode->i_sb->s_fs_info;
 }
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index b09e61457b23..3e252cfa7f17 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -72,7 +72,6 @@ void free_sbd(struct gfs2_sbd *sdp)
 static struct gfs2_sbd *init_sbd(struct super_block *sb)
 {
 	struct gfs2_sbd *sdp;
-	struct address_space *mapping;
 
 	sdp = kzalloc(sizeof(struct gfs2_sbd), GFP_KERNEL);
 	if (!sdp)
@@ -112,16 +111,6 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 
 	INIT_LIST_HEAD(&sdp->sd_sc_inodes_list);
 
-	mapping = &sdp->sd_aspace;
-
-	address_space_init_once(mapping);
-	mapping->a_ops = &gfs2_rgrp_aops;
-	mapping->host = sb->s_bdev->bd_inode;
-	mapping->flags = 0;
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
-	mapping->private_data = NULL;
-	mapping->writeback_index = 0;
-
 	spin_lock_init(&sdp->sd_log_lock);
 	atomic_set(&sdp->sd_log_pinned, 0);
 	INIT_LIST_HEAD(&sdp->sd_log_revokes);
@@ -1141,6 +1130,7 @@ static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
 	struct gfs2_sbd *sdp;
 	struct gfs2_holder mount_gh;
 	struct gfs2_holder freeze_gh;
+	struct address_space *mapping;
 	int error;
 
 	sdp = init_sbd(sb);
@@ -1162,6 +1152,21 @@ static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_flags |= SB_NOSEC;
 	sb->s_magic = GFS2_MAGIC;
 	sb->s_op = &gfs2_super_ops;
+
+	/* Set up an address space for metadata writes */
+	sdp->sd_inode = new_inode(sb);
+	if (!sdp->sd_inode)
+		goto fail_free;
+	sdp->sd_inode->i_ino = GFS2_MAGIC;
+	set_nlink(sdp->sd_inode, 1);
+	sdp->sd_inode->i_size = i_size_read(sb->s_bdev->bd_inode);
+	insert_inode_hash(sdp->sd_inode);
+
+	mapping = gfs2_aspace(sdp);
+	mapping->a_ops = &gfs2_rgrp_aops;
+	mapping_set_gfp_mask(mapping, GFP_NOFS);
+	mapping->writeback_index = 0;
+
 	sb->s_d_op = &gfs2_dops;
 	sb->s_export_op = &gfs2_export_ops;
 	sb->s_qcop = &gfs2_quotactl_ops;
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 0a5b7dfa7a45..9857d8725b2d 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -617,13 +617,13 @@ static void gfs2_put_super(struct super_block *sb)
 	gfs2_jindex_free(sdp);
 	/*  Take apart glock structures and buffer lists  */
 	gfs2_gl_hash_clear(sdp);
-	truncate_inode_pages_final(&sdp->sd_aspace);
 	gfs2_delete_debugfs_file(sdp);
 	/*  Unmount the locking protocol  */
 	gfs2_lm_unmount(sdp);
 
 	/*  At this point, we're through participating in the lockspace  */
 	gfs2_sys_fs_del(sdp);
+	iput(sdp->sd_inode);
 	free_sbd(sdp);
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
@ 2021-07-13 18:26   ` Steven Whitehouse
  2021-07-13 19:34     ` Bob Peterson
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Whitehouse @ 2021-07-13 18:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> Before this patch, gfs2 kept its own address space for rgrps, but
> this
> caused a lockdep problem because vfs assumes a 1:1 relationship
> between
> address spaces and their inode. One problematic area is this:
> 
I don't think that is the case. The reason the address space is a
separate structure in the first place is to allow address spaces to
exist without an inode. Maybe that has changed, but in that case we
should see why, rather than just making this change immediately.

I can't see any reason why, if we have to have an inode here, it needs
to be hashed... what would need to look it up via the hashes?

Steve.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
  2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
@ 2021-07-13 18:41   ` Steven Whitehouse
  2021-07-13 20:03     ` Bob Peterson
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Whitehouse @ 2021-07-13 18:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> This patch adds a new watchdog whose sole purpose is to complain when
> gfs2_log_flush operations are taking too long.
> 
This one is a bit confusing. It says that it is to check whether the
log flush is taking too long, but it appears to set a timeout based on
the amount of dirty data that will be written back. So it isn't really
the log flush being timed, but the writeback plus the log flush, I
think?

It also looks like the timeout depends entirely on the number of dirty
pages, and not on the log flush size. I also wonder about the
performance impact of traversing the list of dirty pages. If that can
be avoided, it should make the implementation rather more efficient,
Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>  fs/gfs2/incore.h     |  6 ++++++
>  fs/gfs2/log.c        | 47
> ++++++++++++++++++++++++++++++++++++++++++++
>  fs/gfs2/log.h        |  1 +
>  fs/gfs2/main.c       |  8 ++++++++
>  fs/gfs2/ops_fstype.c |  2 ++
>  fs/gfs2/sys.c        |  6 ++++--
>  6 files changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 6f31a067a5f2..566c0053b7c5 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -683,6 +683,8 @@ struct local_statfs_inode {
>  	unsigned int si_jid; /* journal id this statfs inode
> corresponds to */
>  };
>  
> +#define GFS2_LOG_FLUSH_TIMEOUT (HZ / 10) /* arbitrary: 1/10 second
> per page */
> +
>  struct gfs2_sbd {
>  	struct super_block *sd_vfs;
>  	struct gfs2_pcpu_lkstats __percpu *sd_lkstats;
> @@ -849,6 +851,10 @@ struct gfs2_sbd {
>  	unsigned long sd_last_warning;
>  	struct dentry *debugfs_dir;    /* debugfs directory */
>  	unsigned long sd_glock_dqs_held;
> +
> +	struct delayed_work sd_log_flush_watchdog;
> +	unsigned long sd_dirty_pages;
> +	unsigned long sd_log_flush_start;
>  };
>  
>  static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int
> which)
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index f0ee3ff6f9a8..bd2ff5ef4b91 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -19,6 +19,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/writeback.h>
>  #include <linux/list_sort.h>
> +#include <linux/sched/debug.h>
>  
>  #include "gfs2.h"
>  #include "incore.h"
> @@ -32,8 +33,22 @@
>  #include "trace_gfs2.h"
>  #include "trans.h"
>  
> +extern struct workqueue_struct *gfs2_log_flush_wq;
> +
>  static void gfs2_log_shutdown(struct gfs2_sbd *sdp);
>  
> +void gfs2_log_flush_watchdog_func(struct work_struct *work)
> +{
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct gfs2_sbd *sdp = container_of(dwork, struct gfs2_sbd,
> +					    sd_log_flush_watchdog);
> +
> +	fs_err(sdp, "log flush pid %u took > %lu secs to write %lu
> pages.\n",
> +	       sdp->sd_logd_process ? pid_nr(task_pid(sdp-
> >sd_logd_process)) :
> +	       0, (jiffies - sdp->sd_log_flush_start) / HZ,
> +	       sdp->sd_dirty_pages);
> +}
> +
>  /**
>   * gfs2_struct2blk - compute stuff
>   * @sdp: the filesystem
> @@ -1016,6 +1031,26 @@ static void trans_drain(struct gfs2_trans *tr)
>  	}
>  }
>  
> +/**
> + * count_dirty_pages - rough count the dirty ordered writes pages
> + * @sdp: the filesystem
> + *
> + * This is not meant to be exact. It's simply a rough estimate of
> how many
> + * dirty pages are on the ordered writes list. The actual number of
> pages
> + * may change because we don't keep the lock held during the log
> flush.
> + */
> +static unsigned long count_dirty_pages(struct gfs2_sbd *sdp)
> +{
> +	struct gfs2_inode *ip;
> +	unsigned long dpages = 0;
> +
> +	spin_lock(&sdp->sd_ordered_lock);
> +	list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
> +		dpages += ip->i_inode.i_mapping->nrpages;
> +	spin_unlock(&sdp->sd_ordered_lock);
> +	return dpages;
> +}
> +
>  /**
>   * gfs2_log_flush - flush incore transaction(s)
>   * @sdp: The filesystem
> @@ -1031,8 +1066,19 @@ void gfs2_log_flush(struct gfs2_sbd *sdp,
> struct gfs2_glock *gl, u32 flags)
>  	enum gfs2_freeze_state state = atomic_read(&sdp-
> >sd_freeze_state);
>  	unsigned int first_log_head;
>  	unsigned int reserved_revokes = 0;
> +	unsigned long dpages;
> +
> +	dpages = count_dirty_pages(sdp);
>  
>  	down_write(&sdp->sd_log_flush_lock);
> +	if (dpages)
> +		if (queue_delayed_work(gfs2_log_flush_wq,
> +				       &sdp->sd_log_flush_watchdog,
> +				       round_up(dpages *
> +						GFS2_LOG_FLUSH_TIMEOUT,
> HZ))) {
> +			sdp->sd_dirty_pages = dpages;
> +			sdp->sd_log_flush_start = jiffies;
> +		}
>  	trace_gfs2_log_flush(sdp, 1, flags);
>  
>  repeat:
> @@ -1144,6 +1190,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp,
> struct gfs2_glock *gl, u32 flags)
>  		gfs2_assert_withdraw_delayed(sdp, used_blocks <
> reserved_blocks);
>  		gfs2_log_release(sdp, reserved_blocks - used_blocks);
>  	}
> +	cancel_delayed_work(&sdp->sd_log_flush_watchdog);
>  	up_write(&sdp->sd_log_flush_lock);
>  	gfs2_trans_free(sdp, tr);
>  	if (gfs2_withdrawing(sdp))
> diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
> index fc905c2af53c..962044fba53a 100644
> --- a/fs/gfs2/log.h
> +++ b/fs/gfs2/log.h
> @@ -94,5 +94,6 @@ extern void gfs2_add_revoke(struct gfs2_sbd *sdp,
> struct gfs2_bufdata *bd);
>  extern void gfs2_glock_remove_revoke(struct gfs2_glock *gl);
>  extern void gfs2_flush_revokes(struct gfs2_sbd *sdp);
>  extern void gfs2_ail_drain(struct gfs2_sbd *sdp);
> +extern void gfs2_log_flush_watchdog_func(struct work_struct *work);
>  
>  #endif /* __LOG_DOT_H__ */
> diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
> index 28d0eb23e18e..55a7f29742b3 100644
> --- a/fs/gfs2/main.c
> +++ b/fs/gfs2/main.c
> @@ -30,6 +30,7 @@
>  #include "glops.h"
>  
>  struct workqueue_struct *gfs2_control_wq;
> +struct workqueue_struct *gfs2_log_flush_wq;
>  
>  static void gfs2_init_inode_once(void *foo)
>  {
> @@ -178,6 +179,10 @@ static int __init init_gfs2_fs(void)
>  	if (!gfs2_freeze_wq)
>  		goto fail_wq3;
>  
> +	gfs2_log_flush_wq = alloc_workqueue("gfs2_log_flush_wq", 0, 0);
> +	if (!gfs2_log_flush_wq)
> +		goto fail_wq4;
> +
>  	gfs2_page_pool = mempool_create_page_pool(64, 0);
>  	if (!gfs2_page_pool)
>  		goto fail_mempool;
> @@ -189,6 +194,8 @@ static int __init init_gfs2_fs(void)
>  	return 0;
>  
>  fail_mempool:
> +	destroy_workqueue(gfs2_log_flush_wq);
> +fail_wq4:
>  	destroy_workqueue(gfs2_freeze_wq);
>  fail_wq3:
>  	destroy_workqueue(gfs2_control_wq);
> @@ -240,6 +247,7 @@ static void __exit exit_gfs2_fs(void)
>  	destroy_workqueue(gfs_recovery_wq);
>  	destroy_workqueue(gfs2_control_wq);
>  	destroy_workqueue(gfs2_freeze_wq);
> +	destroy_workqueue(gfs2_log_flush_wq);
>  	list_lru_destroy(&gfs2_qd_lru);
>  
>  	rcu_barrier();
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 6a950c4a61e9..b09e61457b23 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -139,6 +139,8 @@ static struct gfs2_sbd *init_sbd(struct
> super_block *sb)
>  	init_waitqueue_head(&sdp->sd_log_flush_wait);
>  	atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
>  	mutex_init(&sdp->sd_freeze_mutex);
> +	INIT_DELAYED_WORK(&sdp->sd_log_flush_watchdog,
> +			  gfs2_log_flush_watchdog_func);
>  
>  	return sdp;
>  
> diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
> index c0a34d9ddee4..c90d9f48571a 100644
> --- a/fs/gfs2/sys.c
> +++ b/fs/gfs2/sys.c
> @@ -96,7 +96,8 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
>  		     "sd_log_flush_head:        %d\n"
>  		     "sd_log_flush_tail:        %d\n"
>  		     "sd_log_blks_reserved:     %d\n"
> -		     "sd_log_revokes_available: %d\n",
> +		     "sd_log_revokes_available: %d\n"
> +		     "sd_dirty_pages:           %lu\n",
>  		     test_bit(SDF_JOURNAL_CHECKED, &f),
>  		     test_bit(SDF_JOURNAL_LIVE, &f),
>  		     (sdp->sd_jdesc ? sdp->sd_jdesc->jd_jid : 0),
> @@ -124,7 +125,8 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
>  		     sdp->sd_log_flush_head,
>  		     sdp->sd_log_flush_tail,
>  		     sdp->sd_log_blks_reserved,
> -		     atomic_read(&sdp->sd_log_revokes_available));
> +		     atomic_read(&sdp->sd_log_revokes_available),
> +		     sdp->sd_dirty_pages);
>  	return s;
>  }
>  



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-13 18:26   ` Steven Whitehouse
@ 2021-07-13 19:34     ` Bob Peterson
  2021-07-28  6:50       ` Andreas Gruenbacher
  0 siblings, 1 reply; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 19:34 UTC (permalink / raw)
  To: cluster-devel.redhat.com


On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>> Before this patch, gfs2 kept its own address space for rgrps, but
>> this
>> caused a lockdep problem because vfs assumes a 1:1 relationship
>> between
>> address spaces and their inode. One problematic area is this:
>>
> I don't think that is the case. The reason that the address space is a
> separate structure in the first place is to allow them to exist without
> an inode. Maybe that has changed, but we should see why that is, in
> that case rather than just making this change immediately.
>
> I can't see any reason why if we have to have an inode here that it
> needs to be hashed... what would need to look it up via the hashes?
>
> Steve.
>
Hi,

The actual use case, which is easily demonstrated with lockdep, is given
in the patch text shortly after where you placed your comment. This goes
back to this discussion from April 2018:

https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html

in which Jan Kara pointed out that:

"The problem is we really do expect mapping->host->i_mapping == mapping as
we pass mapping and inode interchangeably in the mm code. The address_space
and inodes are separate structures because you can have many inodes
pointing to one address space (block devices). However it is not allowed
for several address_spaces to point to one inode!"

Regards,

Bob Peterson




* [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
  2021-07-13 18:41   ` Steven Whitehouse
@ 2021-07-13 20:03     ` Bob Peterson
  2021-07-14  8:53       ` Steven Whitehouse
  0 siblings, 1 reply; 20+ messages in thread
From: Bob Peterson @ 2021-07-13 20:03 UTC (permalink / raw)
  To: cluster-devel.redhat.com


On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>> This patch adds a new watchdog whose sole purpose is to complain when
>> gfs2_log_flush operations are taking too long.
>>
> This one is a bit confusing. It says that it is to check if the log
> flush is taking too long, but it appears to set a timeout based on the
> amount of dirty data that will be written back, so it isn't really the
> log flush, but the writeback and log flush that is being timed I think?
>
> It also looks like the timeout is entirely dependent upon the number of
> dirty pages too, and not on the log flush size. I wonder about the
> performance impact of traversing the list of dirty pages too. If that
> can be avoided it should make the implementation rather more efficient,
>
> Steve.

Well, perhaps my patch description was misleading. The watchdog is meant
to time how long function gfs2_log_flush() holds the sd_log_flush_lock
rwsem in write mode. That includes writing the ordered writes as well as
the metadata. The metadata portion is almost always outweighed 100:1 (or
more) by the ordered writes. The length of time it will take to do the
ordered writes should be based on the number of dirty pages. I don't
think walking the ordered writes list will impact performance too badly,
and that's one reason I chose to do it before we actually take the
sd_log_flush_lock. It does, however, hold sd_ordered_lock during its
count. Still, that is cheap compared to counting the actual pages or
something, and modern CPUs can walk lists very quickly.

My initial version didn't count at all; it just used an arbitrary number of
seconds any log flush _ought_ to take. However, Barry pointed out that older
hardware can be slow when driven to extremes and I didn't want false
positives.

I also thought about keeping a running count whenever pages are dirtied,
or when inodes are added to the ordered writes list, but that seemed
like overkill. It is a reasonable alternative, though.

The timeout value is also somewhat arbitrary, but I'm open to suggestions.

In my case, faulty hardware caused log flushes to take a very long time,
which left many transactions and glocks blocked long enough to trip the
120-second kernel watchdog, giving the impression that glocks are being
held a very long time (which they are) for some unknown reason.

This can manifest on many (often non-faulty) nodes, since glocks can be
tied up indefinitely waiting for a process that holds one in EX but must
first acquire the transaction lock, which is blocked on the log flush.
My goal was to make hardware problems (like faulty HBAs and fibre
switches) NOT look like cascading gfs2 file system problems or
slowdowns.

These messages will hopefully prompt operations people to investigate
the cause of the slowdown.

I tested this patch with faulty hardware, and it yielded messages like:

[ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 20 seconds to write 98 pages.
[ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 1 seconds to write 1 pages.
[ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 4 seconds to write 16 pages.

Regards,

Bob Peterson





* [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
  2021-07-13 20:03     ` Bob Peterson
@ 2021-07-14  8:53       ` Steven Whitehouse
  0 siblings, 0 replies; 20+ messages in thread
From: Steven Whitehouse @ 2021-07-14  8:53 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Tue, 2021-07-13 at 15:03 -0500, Bob Peterson wrote:
> On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > This patch adds a new watchdog whose sole purpose is to complain
> > > when
> > > gfs2_log_flush operations are taking too long.
> > > 
> > This one is a bit confusing. It says that it is to check if the log
> > flush is taking too long, but it appears to set a timeout based on
> > the
> > amount of dirty data that will be written back, so it isn't really
> > the
> > log flush, but the writeback and log flush that is being timed I
> > think?
> > 
> > It also looks like the timeout is entirely dependent upon the
> > number of
> > dirty pages too, and not on the log flush size. I wonder about the
> > performance impact of traversing the list of dirty pages too. If
> > that
> > can be avoided it should make the implementation rather more
> > efficient,
> > 
> > Steve.
> 
> Well, perhaps my patch description was misleading. The watchdog is
> meant
> to time how long function gfs2_log_flush() holds the
> sd_log_flush_lock rwsem
> in write mode. 

I think it needs looking at a bit more carefully. That lock is really
an implementation detail, and one that we expect will change in the not
too distant future as the log code improves.

As you say the description is confusing, and the log messages even more
so, since they give a page count that refers to the ordered writes and
not to the log writes at all. 

Also, we have tools already that should be able to diagnose this issue
(slow I/O) such as blktrace, although I know that is more tricky to
catch after the fact. So I think we need to look at this again to see
if there is a better solution.


> That includes writing the ordered writes as well as the
> metadata. The metadata portion is almost always outweighed 100:1 (or
> more)
> by the ordered writes. The length of time it will take to do the
> ordered 
> writes
> should be based on the number of dirty pages. I don't think running
> the
> ordered writes list will impact performance too badly, and that's one
> reason
> I chose to do it before we actually take the sd_log_flush_lock. It
> does, 
> however,
> hold the sd_ordered_lock lock during its count. Still, it's small 
> compared to
> counting the actual pages or something, and modern cpus can run lists
> very
> quickly.
> 
What limits do we have on the ordered write list length? I seem to
remember we had addressed that issue at some point in the past.
Generally though iterating over what might be quite a long list is not
a good plan from a performance perspective,

Steve.

> My initial version didn't count at all; it just used an arbitrary
> number of
> seconds any log flush _ought_ to take. However, Barry pointed out
> that older
> hardware can be slow when driven to extremes and I didn't want false
> positives.
> 
> I also thought about keeping an interactive count whenever pages are
> dirtied, or when inodes are added to the ordered writes list, but
> that 
> seemed
> like overkill. But it is a reasonable alternative.
> 
> The timeout value is also somewhat arbitrary, but I'm open to
> suggestions.
> 
> In my case, faulty hardware caused log flushes to take a very long
> time, 
> which
> caused many transactions and glocks to be blocked a long time and
> eventually
> hit the 120-second kernel watchdog, which gives the impression glocks
> are
> being held a very long time (which they are) for some unknown reason.
> 
> This can manifest on many (often non-faulty) nodes, since glocks can
> be 
> tied up
> indefinitely waiting for a process who has it locked EX but now must
> wait until it can acquire the transaction lock, which is blocked on
> the 
> log flush:
> My goal was to make hardware problems (like faulty HBAs and fibre
> switches)
> NOT seem like cascading gfs2 file system problems or slowdowns.
> 
> These messages will hopefully prompt operations people to investigate
> the
> cause of the slowdown.
> 
> I tested this patch with faulty hardware, and it yielded messages
> like:
> 
> [ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 20 seconds to write 98 pages.
> [ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 1 seconds to write 1 pages.
> [ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 4 seconds to write 16 pages.
> 
> Regards,
> 
> Bob Peterson
> 
> 




* [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-13 19:34     ` Bob Peterson
@ 2021-07-28  6:50       ` Andreas Gruenbacher
  2021-07-28  8:57         ` Steven Whitehouse
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Gruenbacher @ 2021-07-28  6:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson <rpeterso@redhat.com> wrote:
> On 7/13/21 1:26 PM, Steven Whitehouse wrote:
>
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>
> Before this patch, gfs2 kept its own address space for rgrps, but
> this
> caused a lockdep problem because vfs assumes a 1:1 relationship
> between
> address spaces and their inode. One problematic area is this:
>
> I don't think that is the case. The reason that the address space is a
> separate structure in the first place is to allow them to exist without
> an inode. Maybe that has changed, but we should see why that is, in
> that case rather than just making this change immediately.
>
> I can't see any reason why if we have to have an inode here that it
> needs to be hashed... what would need to look it up via the hashes?
>
> Steve.
>
> Hi,
>
> The actual use case, which is easily demonstrated with lockdep, is given
> in the patch text shortly after where you placed your comment. This goes
> back to this discussion from April 2018:
>
> https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
>
> in which Jan Kara pointed out that:
>
> "The problem is we really do expect mapping->host->i_mapping == mapping as
> we pass mapping and inode interchangeably in the mm code. The address_space
> and inodes are separate structures because you can have many inodes
> pointing to one address space (block devices). However it is not allowed
> for several address_spaces to point to one inode!"

This is fundamentally at odds with how we manage inodes: we have
inode->i_mapping which is the logical address space of the inode, and
we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
address space of the inode. The most important function of the
metadata address space is to remove the inode's metadata from memory
by truncating the metadata address space (inode_go_inval). We need
that when moving an inode to another node. I don't have the faintest
idea how we could otherwise achieve that in a somewhat efficient way.

Thanks,
Andreas




* [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-28  6:50       ` Andreas Gruenbacher
@ 2021-07-28  8:57         ` Steven Whitehouse
  2021-07-28 13:16             ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Whitehouse @ 2021-07-28  8:57 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Wed, 2021-07-28 at 08:50 +0200, Andreas Gruenbacher wrote:
> On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson <rpeterso@redhat.com>
> wrote:
> > On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> > 
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > 
> > Before this patch, gfs2 kept its own address space for rgrps, but
> > this
> > caused a lockdep problem because vfs assumes a 1:1 relationship
> > between
> > address spaces and their inode. One problematic area is this:
> > 
> > I don't think that is the case. The reason that the address space
> > is a
> > separate structure in the first place is to allow them to exist
> > without
> > an inode. Maybe that has changed, but we should see why that is, in
> > that case rather than just making this change immediately.
> > 
> > I can't see any reason why if we have to have an inode here that it
> > needs to be hashed... what would need to look it up via the hashes?
> > 
> > Steve.
> > 
> > Hi,
> > 
> > The actual use case, which is easily demonstrated with lockdep, is
> > given
> > in the patch text shortly after where you placed your comment. This
> > goes
> > back to this discussion from April 2018:
> > 
> > https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> > 
> > in which Jan Kara pointed out that:
> > 
> > "The problem is we really do expect mapping->host->i_mapping ==
> > mapping as
> > we pass mapping and inode interchangeably in the mm code. The
> > address_space
> > and inodes are separate structures because you can have many inodes
> > pointing to one address space (block devices). However it is not
> > allowed
> > for several address_spaces to point to one inode!"
> 
> This is fundamentally at odds with how we manage inodes: we have
> inode->i_mapping which is the logical address space of the inode, and
> we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
> address space of the inode. The most important function of the
> metadata address space is to remove the inode's metadata from memory
> by truncating the metadata address space (inode_go_inval). We need
> that when moving an inode to another node. I don't have the faintest
> idea how we could otherwise achieve that in a somewhat efficient way.
> 
> Thanks,
> Andreas
> 

In addition, I'm fairly sure that we were told to use this
solution (i.e. a separate address space) back in the day because it was
expected that they didn't have a 1:1 relationship with inodes. I don't
think we'd have used that solution otherwise. I've not had enough time
to go digging back in my email to check, but it might be worth looking
to see when we introduced the use of the second address space (removing
a whole additional inode structure) and any discussions around that
change,

Steve.






* Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode
  2021-07-28  8:57         ` Steven Whitehouse
@ 2021-07-28 13:16             ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2021-07-28 13:16 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Andreas Gruenbacher, Bob Peterson, Jan Kara, Matthew Wilcox,
	cluster-devel, linux-fsdevel, linux-mm

On Wed 28-07-21 09:57:01, Steven Whitehouse wrote:
> On Wed, 2021-07-28 at 08:50 +0200, Andreas Gruenbacher wrote:
> > On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson <rpeterso@redhat.com>
> > wrote:
> > > On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> > > 
> > > Hi,
> > > 
> > > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > 
> > > Before this patch, gfs2 kept its own address space for rgrps, but
> > > this
> > > caused a lockdep problem because vfs assumes a 1:1 relationship
> > > between
> > > address spaces and their inode. One problematic area is this:
> > > 
> > > I don't think that is the case. The reason that the address space
> > > is a
> > > separate structure in the first place is to allow them to exist
> > > without
> > > an inode. Maybe that has changed, but we should see why that is, in
> > > that case rather than just making this change immediately.
> > > 
> > > I can't see any reason why if we have to have an inode here that it
> > > needs to be hashed... what would need to look it up via the hashes?
> > > 
> > > Steve.
> > > 
> > > Hi,
> > > 
> > > The actual use case, which is easily demonstrated with lockdep, is
> > > given
> > > in the patch text shortly after where you placed your comment. This
> > > goes
> > > back to this discussion from April 2018:
> > > 
> > > https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> > > 
> > > in which Jan Kara pointed out that:
> > > 
> > > "The problem is we really do expect mapping->host->i_mapping ==
> > > mapping as
> > > we pass mapping and inode interchangeably in the mm code. The
> > > address_space
> > > and inodes are separate structures because you can have many inodes
> > > pointing to one address space (block devices). However it is not
> > > allowed
> > > for several address_spaces to point to one inode!"
> > 
> > This is fundamentally at odds with how we manage inodes: we have
> > inode->i_mapping which is the logical address space of the inode, and
> > we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
> > address space of the inode. The most important function of the
> > metadata address space is to remove the inode's metadata from memory
> > by truncating the metadata address space (inode_go_inval). We need
> > that when moving an inode to another node. I don't have the faintest
> > idea how we could otherwise achieve that in a somewhat efficient way.
> > 
> > Thanks,
> > Andreas
> > 
> 
> In addition, I'm fairly sure also that we were told to use this
> solution (i.e. a separate address space) back in the day because it was
> expected that they didn't have a 1:1 relationship with inodes. I don't
> think we'd have used that solution otherwise. I've not had enough time
> to go digging back in my email to check, but it might be worth looking
> to see when we introduced the use of the second address space (removing
> a whole additional inode structure) and any discussions around that
> change,

AFAIK in the last 20 years it has never been the case that multiple
address_space structs for an inode would be handled by the VFS/MM code
properly. There can be multiple 'struct inode' for one 'struct
address_space'. That is handled just fine and is being used. The trouble is
that once you allow multiple address_space structs pointing to one struct
inode, you have some hard questions to answer (at least for VFS) - e.g. when
you get a request to write back the inode, how do you do that when you have
multiple address spaces? Write them all back? And how do you iterate them?

And there are similar questions regarding how to determine inode's
dirtiness or whether some inode's page is under writeback. Tied to that are
locking questions where inode->i_mapping->i_pages->xa_lock lock is used to
protect some of the inode state transitions. But when you have multiple
address spaces pointing to the inode it is not clear which of these many
locks would be protecting it (and the warning syzbot is hitting says
exactly that these state transitions are not properly serialized).

Of course all this can be decided and implemented, but it is not a trivial
task; there needs to be good motivation for the added complexity in the code
used by everybody. And AFAIU GFS2 doesn't even strictly need it. It uses a
very limited subset of functions that can operate on address_space for
these special address spaces. Andreas mentions you use metadata
address_space for tracking and evicting cached metadata associated with the
inode. Quickly checking the code, another heavy use seems to be metadata
dirtiness tracking and writeback. I'd note that other filesystems
traditionally use mapping->private_list for exactly these purposes (through
helpers like mark_buffer_dirty_inode, sync_mapping_buffers, etc.). Any
reason why you don't use these?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



end of thread, other threads:[~2021-07-28 13:16 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
2021-07-13 18:41   ` Steven Whitehouse
2021-07-13 20:03     ` Bob Peterson
2021-07-14  8:53       ` Steven Whitehouse
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
2021-07-13 18:26   ` Steven Whitehouse
2021-07-13 19:34     ` Bob Peterson
2021-07-28  6:50       ` Andreas Gruenbacher
2021-07-28  8:57         ` Steven Whitehouse
2021-07-28 13:16           ` Jan Kara
2021-07-28 13:16             ` Jan Kara
