All of lore.kernel.org
 help / color / mirror / Atom feed
* [Cluster-devel] [GFS2 PATCH 0/2 v3] Fix infinite loop in ail1 flush with jdata
@ 2020-01-13 14:04 Bob Peterson
  2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 1/2 v3] Revert "gfs2: eliminate tr_num_revoke_rm" Bob Peterson
  2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 2/2 v3] gfs2: keep a redirty list for jdata pages that are PageChecked in ail1 Bob Peterson
  0 siblings, 2 replies; 3+ messages in thread
From: Bob Peterson @ 2020-01-13 14:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi. This patch set fixes a problem in which gfs2 can become deadlocked
while doing normal IO on jdata files. The problem is best observed by
repeatedly running xfstests generic/269 repeatedly with jdata files.
The specifics of the hang are best described in the second patch.

The first patch reverts e955537e3262de8e56f070b13817f525f472fa00.
The defective patch caused tr->tr_num_revoke to sometimes be a negative
number, since you can remove more revokes than you add. However, since
tr_num_revoke is declared unsigned, it triggered this assert in
gfs2_trans_end:

	if (gfs2_assert_withdraw(sdp, (nbuf <= tr->tr_blocks) &&
			       (tr->tr_num_revoke <= tr->tr_revokes)))

The management of revokes is not very good since we moved them from a
private list to a global list hung off the superblock pointer, sdp.
So we will probably want to revisit this and rework how revokes are
handled. In the meantime, it is safest to just revert the patch until
we can fix it properly.

The second patch fixes an infinite loop deadlock while flushing the
ail1 list for jdata pages. The patch comments describe the problem
and circumstances fairly well.

Bob Peterson (2):
  Revert "gfs2: eliminate tr_num_revoke_rm"
  gfs2: keep a redirty list for jdata pages that are PageChecked in ail1

 fs/gfs2/incore.h |  2 ++
 fs/gfs2/log.c    | 30 +++++++++++++++++++++++++++++-
 fs/gfs2/trans.c  |  7 ++++---
 3 files changed, 35 insertions(+), 4 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Cluster-devel] [GFS2 PATCH 1/2 v3] Revert "gfs2: eliminate tr_num_revoke_rm"
  2020-01-13 14:04 [Cluster-devel] [GFS2 PATCH 0/2 v3] Fix infinite loop in ail1 flush with jdata Bob Peterson
@ 2020-01-13 14:04 ` Bob Peterson
  2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 2/2 v3] gfs2: keep a redirty list for jdata pages that are PageChecked in ail1 Bob Peterson
  1 sibling, 0 replies; 3+ messages in thread
From: Bob Peterson @ 2020-01-13 14:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This reverts commit e955537e3262de8e56f070b13817f525f472fa00.

Before patch e955537e32, tr_num_revoke tracked the number of revokes
added to the transaction, and tr_num_revoke_rm tracked how many
revokes were removed. But since revokes are queued off the sdp
(superblock) pointer, some transactions could remove more revokes
than they added. (e.g. revokes added by a different process).
Commit e955537e32 eliminated transaction variable tr_num_revoke_rm,
but in order to do so, it changed the accounting to always use
tr_num_revoke for its math. Since you can remove more revokes than
you add, tr_num_revoke could now become a negative value.
This negative value broke the assert in function gfs2_trans_end:

	if (gfs2_assert_withdraw(sdp, (nbuf <= tr->tr_blocks) &&
			       (tr->tr_num_revoke <= tr->tr_revokes)))

One way to fix this is to simply remove the tr_num_revoke clause
from the assert and allow the value to become negative. Andreas
didn't like that idea, so instead, we decided to revert e955537e32.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/log.c    | 3 ++-
 fs/gfs2/trans.c  | 6 +++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b5d9c11f4901..a5d81a261f77 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -505,6 +505,7 @@ struct gfs2_trans {
 	unsigned int tr_num_buf_rm;
 	unsigned int tr_num_databuf_rm;
 	unsigned int tr_num_revoke;
+	unsigned int tr_num_revoke_rm;
 
 	struct list_head tr_list;
 	struct list_head tr_databuf;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 66189ae62c1d..6e2728416b6b 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -895,6 +895,7 @@ static void gfs2_merge_trans(struct gfs2_trans *old, struct gfs2_trans *new)
 	old->tr_num_buf_rm	+= new->tr_num_buf_rm;
 	old->tr_num_databuf_rm	+= new->tr_num_databuf_rm;
 	old->tr_num_revoke	+= new->tr_num_revoke;
+	old->tr_num_revoke_rm	+= new->tr_num_revoke_rm;
 
 	list_splice_tail_init(&new->tr_databuf, &old->tr_databuf);
 	list_splice_tail_init(&new->tr_buf, &old->tr_buf);
@@ -916,7 +917,7 @@ static void log_refund(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		set_bit(TR_ATTACHED, &tr->tr_flags);
 	}
 
-	sdp->sd_log_commited_revoke += tr->tr_num_revoke;
+	sdp->sd_log_commited_revoke += tr->tr_num_revoke - tr->tr_num_revoke_rm;
 	reserved = calc_reserved(sdp);
 	maxres = sdp->sd_log_blks_reserved + tr->tr_reserved;
 	gfs2_assert_withdraw(sdp, maxres >= reserved);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 4d01fe19c125..a685637a5b55 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -76,10 +76,10 @@ static void gfs2_print_trans(struct gfs2_sbd *sdp, const struct gfs2_trans *tr)
 	fs_warn(sdp, "blocks=%u revokes=%u reserved=%u touched=%u\n",
 		tr->tr_blocks, tr->tr_revokes, tr->tr_reserved,
 		test_bit(TR_TOUCHED, &tr->tr_flags));
-	fs_warn(sdp, "Buf %u/%u Databuf %u/%u Revoke %u\n",
+	fs_warn(sdp, "Buf %u/%u Databuf %u/%u Revoke %u/%u\n",
 		tr->tr_num_buf_new, tr->tr_num_buf_rm,
 		tr->tr_num_databuf_new, tr->tr_num_databuf_rm,
-		tr->tr_num_revoke);
+		tr->tr_num_revoke, tr->tr_num_revoke_rm);
 }
 
 void gfs2_trans_end(struct gfs2_sbd *sdp)
@@ -264,7 +264,7 @@ void gfs2_trans_remove_revoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len)
 			if (bd->bd_gl)
 				gfs2_glock_remove_revoke(bd->bd_gl);
 			kmem_cache_free(gfs2_bufdata_cachep, bd);
-			tr->tr_num_revoke--;
+			tr->tr_num_revoke_rm++;
 			if (--n == 0)
 				break;
 		}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [Cluster-devel] [GFS2 PATCH 2/2 v3] gfs2: keep a redirty list for jdata pages that are PageChecked in ail1
  2020-01-13 14:04 [Cluster-devel] [GFS2 PATCH 0/2 v3] Fix infinite loop in ail1 flush with jdata Bob Peterson
  2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 1/2 v3] Revert "gfs2: eliminate tr_num_revoke_rm" Bob Peterson
@ 2020-01-13 14:04 ` Bob Peterson
  1 sibling, 0 replies; 3+ messages in thread
From: Bob Peterson @ 2020-01-13 14:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, jdata writes would sometimes get into an infinite
loop because function gfs2_ail1_flush refused to exit until all of
its pages were written out, but if the page was in use and another
process sets PageChecked, it would never be written out: The
PageChecked prevented the ail1 list from being written:

gfs2_logd() calls gfs2_ail1_start() because it decides it needs to
  gfs2_ail1_start() calls gfs2_ail1_flush() (unconditionally)
    gfs2_ail1_flush() calls gfs2_ail1_start_one() for each transaction
                      queued onto sd_ail1_list, to start that transaction
      gfs2_ail1_start_one() calls generic_writepages() for this
                      particular problematic transaction.
        generic_writepages() calls write_cache_pages() passing in
                      __writepage (unconditionally)
          write_cache_pages() calls (*writepage) which maps to __writepage()
            __writepage() calls mapping->a_ops->writepage which maps
                      to gfs2_jdata_writepage
              gfs2_jdata_writepage() sees the page is PageChecked
                      (i.e. dirty is pending) so it skips
                      __gfs2_jdata_writepage and instead redirties
                      the page, returns 0.
            __writepage() returns 0 to write_cache_pages()
          write_cache_pages() returns 0 to generic_writepages()
        generic_writepages() returns 0 to gfs2_ail1_start_one()
       gfs2_ail1_start_one() returns 1 to gfs2_ail1_flush(), which
                      causes it to goto restart; (Infinite loop)

Thus, logd goes into an infinite loop, chewing up 100% of
one CPU, and all other transactions get blocked by the flush.

This patch adds a new queue to the transactions, tr_redirty_list.
This temporarily holds buffer descriptor (bd / bufdata) elements
that need to be redirtied because they were PageChecked while the
ail1 list was being flushed.

The new queue allows all eligible pages to be written properly and
gfs2_ail1_flush to requeue and redirty PageChecked pages. Thus, the
pages are redirtied and go into the next ail1 flush cycle. The current
ail1 flush completes, since those pages are temporarily moved off the
list.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h |  1 +
 fs/gfs2/log.c    | 27 +++++++++++++++++++++++++++
 fs/gfs2/trans.c  |  1 +
 3 files changed, 29 insertions(+)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a5d81a261f77..eef686161953 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -514,6 +514,7 @@ struct gfs2_trans {
 	unsigned int tr_first;
 	struct list_head tr_ail1_list;
 	struct list_head tr_ail2_list;
+	struct list_head tr_redirty_list;
 };
 
 struct gfs2_journal_extent {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 6e2728416b6b..7d7bae63b3e1 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -119,6 +119,10 @@ __acquires(&sdp->sd_ail_lock)
 		if (gl == bd->bd_gl)
 			continue;
 		gl = bd->bd_gl;
+		if (PageChecked(bh->b_page)) {
+			list_move(&bd->bd_ail_st_list, &tr->tr_redirty_list);
+			continue;
+		}
 		list_move(&bd->bd_ail_st_list, &tr->tr_ail1_list);
 		mapping = bh->b_page->mapping;
 		if (!mapping)
@@ -219,11 +223,21 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr,
  * @sdp: The superblock
  *
  * Tries to empty the ail1 lists, starting with the oldest first
+ * Then it requeues any buffers that need to be redirtied, to the ail1 list.
+ * It returns true if the ail1 list was empty BEFORE the redirtied entries
+ * were requeued.
  */
 
 static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
 {
 	struct gfs2_trans *tr, *s;
+	struct gfs2_bufdata *bd;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+	};
 	int oldest_tr = 1;
 	int ret;
 	bool withdraw = false;
@@ -237,6 +251,19 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
 			oldest_tr = 0;
 	}
 	ret = list_empty(&sdp->sd_ail1_list);
+	/*
+	 * Now requeue and redirty any bufdata elements that were not written
+	 * because they were PageChecked.
+	 */
+	list_for_each_entry_reverse(tr, &sdp->sd_ail1_list, tr_list) {
+		while (!list_empty(&tr->tr_redirty_list)) {
+			bd = list_first_entry(&tr->tr_redirty_list,
+					      struct gfs2_bufdata,
+					      bd_ail_st_list);
+			redirty_page_for_writepage(&wbc, bd->bd_bh->b_page);
+			list_move(&bd->bd_ail_st_list, &tr->tr_ail1_list);
+		}
+	}
 	spin_unlock(&sdp->sd_ail_lock);
 
 	if (withdraw)
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index a685637a5b55..0545490cb4e3 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -52,6 +52,7 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 		tr->tr_reserved += gfs2_struct2blk(sdp, revokes);
 	INIT_LIST_HEAD(&tr->tr_databuf);
 	INIT_LIST_HEAD(&tr->tr_buf);
+	INIT_LIST_HEAD(&tr->tr_redirty_list);
 
 	sb_start_intwrite(sdp->sd_vfs);
 
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-01-13 14:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 14:04 [Cluster-devel] [GFS2 PATCH 0/2 v3] Fix infinite loop in ail1 flush with jdata Bob Peterson
2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 1/2 v3] Revert "gfs2: eliminate tr_num_revoke_rm" Bob Peterson
2020-01-13 14:04 ` [Cluster-devel] [GFS2 PATCH 2/2 v3] gfs2: keep a redirty list for jdata pages that are PageChecked in ail1 Bob Peterson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.