linux-xfs.vger.kernel.org archive mirror
* [PATCH 0/8] xfs: various fixes and cleanups
@ 2020-03-25  1:41 Dave Chinner
  2020-03-25  1:41 ` [PATCH 1/8] xfs: Lower CIL flush limit for large logs Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:41 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

These are the fixes and cleanups that are part of the non-blocking
inode reclaim series I've (slowly) been working on. These fixes and
cleanups stand alone, many have already been reviewed, and getting
them out of the non-blocking reclaim patchset makes that a much
smaller and easier to digest set of patches.

The changes in this patchset are for:

- limiting the size of checkpoints that the CIL builds to reduce the
  memory it pins and the latency of commits.
- cleaning up the AIL item removal code so we can reduce the number
  of tail LSN updates to prevent unnecessary thundering herd wakeups
- accounting for reclaimable slab caches in XFS correctly
- accounting for reclaimed pages from buffers correctly
- avoiding log IO priority inversions
- factoring the inode cluster deletion code to make it more readable
  and easier to modify for the non-blocking inode reclaim mods.

Thoughts, comments and improvements welcome.

-Dave.


Dave Chinner (8):
  xfs: Lower CIL flush limit for large logs
  xfs: Throttle commits on delayed background CIL push
  xfs: don't allow log IO to be throttled
  xfs: Improve metadata buffer reclaim accountability
  xfs: correctly account for reclaimable slabs
  xfs: factor common AIL item deletion code
  xfs: tail updates only need to occur when LSN changes
  xfs: factor inode lookup from xfs_ifree_cluster

 fs/xfs/xfs_buf.c        |  11 ++-
 fs/xfs/xfs_inode.c      | 152 ++++++++++++++++++++++------------------
 fs/xfs/xfs_inode_item.c |  28 ++++----
 fs/xfs/xfs_log.c        |  10 ++-
 fs/xfs/xfs_log_cil.c    |  37 ++++++++--
 fs/xfs/xfs_log_priv.h   |  53 ++++++++++++--
 fs/xfs/xfs_super.c      |   3 +-
 fs/xfs/xfs_trace.h      |   1 +
 fs/xfs/xfs_trans_ail.c  |  88 ++++++++++++++---------
 fs/xfs/xfs_trans_priv.h |   6 +-
 10 files changed, 257 insertions(+), 132 deletions(-)

-- 
2.26.0.rc2



* [PATCH 1/8] xfs: Lower CIL flush limit for large logs
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
@ 2020-03-25  1:41 ` Dave Chinner
  2020-03-25  4:42   ` Allison Collins
  2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:41 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 16x this size. This threshold was
chosen as it has little impact on performance (on 16-way fsmark) or log
traffic but pins a lot less memory on large logs especially under
heavy memory pressure.  An aggregation limit of 8x had 5-10%
performance degradation and a 50% increase in log throughput for
the same workload, so clearly that was too small for highly
concurrent workloads on large logs.
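
For illustration (not part of the patch), with 2MB of log IO in
flight (8 iclogs of 256kB each) the new limit works out as:

    2GB log:   min(2048MB >> 3, 2MB << 4) = min(256MB, 32MB) = 32MB
    128MB log: min(128MB >> 3,  2MB << 4) = min(16MB,  32MB) = 16MB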

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566, which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index cfcf3f02e30a3..8c4be91f62d0d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -316,13 +316,30 @@ struct xfs_cil {
  * tries to keep 25% of the log free, so we need to keep below that limit or we
  * risk running out of free log space to start any new transactions.
  *
- * In order to keep background CIL push efficient, we will set a lower
- * threshold at which background pushing is attempted without blocking current
- * transaction commits.  A separate, higher bound defines when CIL pushes are
- * enforced to ensure we stay within our maximum checkpoint size bounds.
- * threshold, yet give us plenty of space for aggregation on large logs.
+ * In order to keep background CIL push efficient, we only need to ensure the
+ * CIL is large enough to maintain sufficient in-memory relogging to avoid
+ * repeated physical writes of frequently modified metadata. If we allow the CIL
+ * to grow to a substantial fraction of the log, then we may be pinning hundreds
+ * of megabytes of metadata in memory until the CIL flushes. This can cause
+ * issues when we are running low on memory - pinned memory cannot be reclaimed,
+ * and the CIL consumes a lot of memory. Hence we need to set an upper physical
+ * size limit for the CIL that limits the maximum amount of memory pinned by the
+ * CIL but does not limit performance by reducing relogging efficiency
+ * significantly.
+ *
+ * As such, the CIL push threshold ends up being the smaller of two thresholds:
+ * - a threshold large enough that it allows CIL to be pushed and progress to be
+ *   made without excessive blocking of incoming transaction commits. This is
+ *   defined to be 12.5% of the log space - half the 25% push threshold of the
+ *   AIL.
+ * - small enough that it doesn't pin excessive amounts of memory but maintains
+ *   close to peak relogging efficiency. This is defined to be 16x the iclog
+ *   buffer window (32MB) as measurements have shown this to be roughly the
+ *   point of diminishing performance increases under highly concurrent
+ *   modification workloads.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
 /*
  * ticket grant locks, queues and accounting have their own cachlines
-- 
2.26.0.rc2



* [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
  2020-03-25  1:41 ` [PATCH 1/8] xfs: Lower CIL flush limit for large logs Dave Chinner
@ 2020-03-25  1:41 ` Dave Chinner
  2020-03-25  4:42   ` Allison Collins
                     ` (2 more replies)
  2020-03-25  1:42 ` [PATCH 3/8] xfs: don't allow log IO to be throttled Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:41 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In certain situations the background CIL push can be indefinitely
delayed. While we have workarounds for the obvious cases now, it
doesn't solve the underlying issue. This issue is that there is no
upper limit on the CIL where we will either force or wait for
a background push to start, hence allowing the CIL to grow without
bound until it consumes all log space.

To fix this, add a new wait queue to the CIL which allows background
pushes to wait for the CIL context to be switched out. This happens
when the push starts, so it will allow us to block incoming
transaction commit completion until the push has started. This will
only affect processes that are running modifications, and only when
the CIL threshold has been significantly overrun.

This has no apparent impact on performance, and did not trigger
until over 45 million inodes had been created in a 16-way fsmark
test on a 2GB log. That was limiting at 64MB of log space used, so
the active CIL size is only about 3% of the total log in that case.
The concurrent removal of those files did not trigger the background
sleep at all.
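
In outline, the throttle pairs a committer-side sleep with a
worker-side wakeup. A minimal sketch distilled from the diff below
(xlog_wait() drops the passed spinlock before sleeping):

    /* committer, over XLOG_CIL_BLOCKING_SPACE_LIMIT(), holding
     * cil->xc_push_lock; xlog_wait() drops the lock and sleeps: */
    xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);

    /* CIL push worker, as it starts pushing this context: */
    wake_up_all(&ctx->push_wait);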

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_trace.h    |  1 +
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 27de462d2ba40..ac43301ae2f43 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -668,6 +668,11 @@ xlog_cil_push_work(
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
 
+	/*
+	 * Wake up any background push waiters now this context is being pushed.
+	 */
+	wake_up_all(&ctx->push_wait);
+
 	/*
 	 * Check if we've anything to push. If there is nothing, then we don't
 	 * move on to a new sequence number and so we have to be able to push
@@ -744,6 +749,7 @@ xlog_cil_push_work(
 	 */
 	INIT_LIST_HEAD(&new_ctx->committing);
 	INIT_LIST_HEAD(&new_ctx->busy_extents);
+	init_waitqueue_head(&new_ctx->push_wait);
 	new_ctx->sequence = ctx->sequence + 1;
 	new_ctx->cil = cil;
 	cil->xc_ctx = new_ctx;
@@ -891,7 +897,7 @@ xlog_cil_push_work(
  */
 static void
 xlog_cil_push_background(
-	struct xlog	*log)
+	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -905,14 +911,36 @@ xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
+	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+		up_read(&cil->xc_ctx_lock);
 		return;
+	}
 
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	}
+
+	/*
+	 * Drop the context lock now, we can't hold that if we need to sleep
+	 * because we are over the blocking threshold. The push_lock is still
+	 * held, so blocking threshold sleep/wakeup is still correctly
+	 * serialised here.
+	 */
+	up_read(&cil->xc_ctx_lock);
+
+	/*
+	 * If we are well over the space limit, throttle the work that is being
+	 * done until the push work on this context has begun.
+	 */
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
+		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
+		return;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 
 }
@@ -1032,9 +1060,9 @@ xfs_log_commit_cil(
 		if (lip->li_ops->iop_committing)
 			lip->li_ops->iop_committing(lip, xc_commit_lsn);
 	}
-	xlog_cil_push_background(log);
 
-	up_read(&cil->xc_ctx_lock);
+	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
+	xlog_cil_push_background(log);
 }
 
 /*
@@ -1193,6 +1221,7 @@ xlog_cil_init(
 
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
+	init_waitqueue_head(&ctx->push_wait);
 	ctx->sequence = 1;
 	ctx->cil = cil;
 	cil->xc_ctx = ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 8c4be91f62d0d..dacab1817a1b0 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -240,6 +240,7 @@ struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
+	wait_queue_head_t	push_wait;	/* background push throttle */
 	struct work_struct	discard_endio_work;
 };
 
@@ -337,10 +338,33 @@ struct xfs_cil {
  *   buffer window (32MB) as measurements have shown this to be roughly the
  *   point of diminishing performance increases under highly concurrent
  *   modification workloads.
+ *
+ * To prevent the CIL from overflowing upper commit size bounds, we introduce a
+ * new threshold at which we block committing transactions until the background
+ * CIL commit commences and switches to a new context. While this is not a hard
+ * limit, it forces the process committing a transaction to the CIL to block and
+ * yield the CPU, giving the CIL push work a chance to be scheduled and start
+ * work. This prevents a process running lots of transactions from overfilling
+ * the CIL because it is not yielding the CPU. We set the blocking limit at
+ * twice the background push space threshold so we keep in line with the AIL
+ * push thresholds.
+ *
+ * Note: this is not a -hard- limit as blocking is applied after the transaction
+ * is inserted into the CIL and the push has been triggered. It is largely a
+ * throttling mechanism that allows the CIL push to be scheduled and run. A hard
+ * limit will be difficult to implement without introducing global serialisation
+ * in the CIL commit fast path, and it's not at all clear that we actually need
+ * such hard limits given the ~7 years we've run without a hard limit before
+ * finding the first situation where a checkpoint size overflow actually
+ * occurred. Hence the simple throttle, and an ASSERT check to tell us that
+ * we've overrun the max size.
  */
 #define XLOG_CIL_SPACE_LIMIT(log)	\
 	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
+#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
+	(XLOG_CIL_SPACE_LIMIT(log) * 2)
+
 /*
  * ticket grant locks, queues and accounting have their own cachlines
  * as these are quite hot and can be operated on concurrently.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fbfdd9cf160df..575ca74532f79 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1015,6 +1015,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_regrant_sub);
 DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done);
 DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_sub);
 DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_exit);
+DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
 
 DECLARE_EVENT_CLASS(xfs_log_item_class,
 	TP_PROTO(struct xfs_log_item *lip),
-- 
2.26.0.rc2



* [PATCH 3/8] xfs: don't allow log IO to be throttled
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
  2020-03-25  1:41 ` [PATCH 1/8] xfs: Lower CIL flush limit for large logs Dave Chinner
  2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25  4:42   ` Allison Collins
  2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Running metadata intensive workloads, I've been seeing the AIL
push getting stuck on pinned buffers and triggering log forces.
The log force is taking a long time to run because the log IO is
getting throttled by wbt_wait() - the block layer writeback
throttle. It's being throttled because there is a huge amount of
metadata writeback going on which is filling the request queue.

IOWs, we have a priority inversion problem here.

Mark the log IO bios with REQ_IDLE so they don't get throttled
by the block layer writeback throttle. When we are forcing the CIL,
we are likely to need to do tens of log IOs, and they are issued as
fast as they can be built and the IO completed. Hence REQ_IDLE is
appropriate - it's an indication that more IO will follow shortly.

And because we also set REQ_SYNC, the writeback throttle will now
treat log IO the same way it treats direct IO writes - it will not
throttle them at all. Hence we solve the priority inversion problem
caused by the writeback throttle being unable to distinguish between
high priority log IO and background metadata writeback.
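
For reference, this relies on the writeback throttle's existing
special casing of sync IO. From memory, the relevant check in
block/blk-wbt.c looks roughly like this (an approximation - verify
against the actual source):

    /* wbt_should_throttle(), REQ_OP_WRITE case: */
    if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
        return false;	/* e.g. O_DIRECT writes are never throttled */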

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f70be8151a59f..e16ad0b1aec84 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1687,7 +1687,15 @@ xlog_write_iclog(
 	iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
 	iclog->ic_bio.bi_end_io = xlog_bio_end_io;
 	iclog->ic_bio.bi_private = iclog;
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_FUA;
+
+	/*
+	 * We use REQ_SYNC | REQ_IDLE here to tell the block layer there are more
+	 * IOs coming immediately after this one. This prevents the block layer
+	 * writeback throttle from throttling log writes behind background
+	 * metadata writeback and causing priority inversions.
+	 */
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
+				REQ_IDLE | REQ_FUA;
 	if (need_flush)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
 
-- 
2.26.0.rc2



* [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (2 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 3/8] xfs: don't allow log IO to be throttled Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25  4:42   ` Allison Collins
                     ` (2 more replies)
  2020-03-25  1:42 ` [PATCH 5/8] xfs: correctly account for reclaimable slabs Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with a
shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, so spread them
around like we do inodes.
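
For context, a rough sketch of how the reclaim side consumes this
accounting (simplified from my reading of mm/vmscan.c - treat the
details as an assumption and check the mm code):

    /* inside the vmscan reclaim loop, roughly: */
    struct reclaim_state reclaim_state = { .reclaimed_slab = 0 };

    current->reclaim_state = &reclaim_state;
    /* shrinkers run here; xfs_buf_free() now adds freed buffer pages
     * to current->reclaim_state->reclaimed_slab */
    sc->nr_reclaimed += reclaim_state.reclaimed_slab;
    current->reclaim_state = NULL;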

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f880141a22681..9ec3eaf1c618f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -327,6 +327,9 @@ xfs_buf_free(
 
 			__free_page(page);
 		}
+		if (current->reclaim_state)
+			current->reclaim_state->reclaimed_slab +=
+							bp->b_page_count;
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -2114,9 +2117,11 @@ xfs_buf_delwri_pushbuf(
 int __init
 xfs_buf_init(void)
 {
-	xfs_buf_zone = kmem_cache_create("xfs_buf",
-					 sizeof(struct xfs_buf), 0,
-					 SLAB_HWCACHE_ALIGN, NULL);
+	xfs_buf_zone = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
+					 SLAB_HWCACHE_ALIGN |
+					 SLAB_RECLAIM_ACCOUNT |
+					 SLAB_MEM_SPREAD,
+					 NULL);
 	if (!xfs_buf_zone)
 		goto out;
 
-- 
2.26.0.rc2



* [PATCH 5/8] xfs: correctly account for reclaimable slabs
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (3 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25  4:43   ` Allison Collins
  2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The XFS inode item slab is actually reclaimed by inode shrinker
callbacks from the memory reclaim subsystem. These should be marked
as reclaimable so the mm subsystem has the full picture of how much
memory it can actually reclaim from the XFS slab caches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2094386af8aca..68fea439d9743 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1861,7 +1861,8 @@ xfs_init_zones(void)
 
 	xfs_ili_zone = kmem_cache_create("xfs_ili",
 					 sizeof(struct xfs_inode_log_item), 0,
-					 SLAB_MEM_SPREAD, NULL);
+					 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+					 NULL);
 	if (!xfs_ili_zone)
 		goto out_destroy_inode_zone;
 
-- 
2.26.0.rc2



* [PATCH 6/8] xfs: factor common AIL item deletion code
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (4 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 5/8] xfs: correctly account for reclaimable slabs Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25  4:54   ` Allison Collins
                     ` (2 more replies)
  2020-03-25  1:42 ` [PATCH 7/8] xfs: tail updates only need to occur when LSN changes Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code that
issues all the wakeups necessary when the LSN of the log tail
changes.
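
With the helper in place, all the removal paths collapse to the same
two-step pattern (distilled from the hunks below; note that
xfs_ail_update_finish() drops ailp->ail_lock):

    need_update = xfs_ail_delete_one(ailp, lip);
    xfs_ail_update_finish(ailp, need_update);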

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_inode_item.c | 12 +----------
 fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
 fs/xfs/xfs_trans_priv.h |  4 +++-
 3 files changed, 30 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 4a3d13d4a0228..bd8c368098707 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -742,17 +742,7 @@ xfs_iflush_done(
 				xfs_clear_li_failed(blip);
 			}
 		}
-
-		if (mlip_changed) {
-			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
-				xlog_assign_tail_lsn_locked(ailp->ail_mount);
-			if (list_empty(&ailp->ail_head))
-				wake_up_all(&ailp->ail_empty);
-		}
-		spin_unlock(&ailp->ail_lock);
-
-		if (mlip_changed)
-			xfs_log_space_wake(ailp->ail_mount);
+		xfs_ail_update_finish(ailp, mlip_changed);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 2ef0dfbfb303d..26d2e7928121c 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -681,6 +681,27 @@ xfs_ail_push_all_sync(
 	finish_wait(&ailp->ail_empty, &wait);
 }
 
+void
+xfs_ail_update_finish(
+	struct xfs_ail		*ailp,
+	bool			do_tail_update) __releases(ailp->ail_lock)
+{
+	struct xfs_mount	*mp = ailp->ail_mount;
+
+	if (!do_tail_update) {
+		spin_unlock(&ailp->ail_lock);
+		return;
+	}
+
+	if (!XFS_FORCED_SHUTDOWN(mp))
+		xlog_assign_tail_lsn_locked(mp);
+
+	if (list_empty(&ailp->ail_head))
+		wake_up_all(&ailp->ail_empty);
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(mp);
+}
+
 /*
  * xfs_trans_ail_update - bulk AIL insertion operation.
  *
@@ -740,15 +761,7 @@ xfs_trans_ail_update_bulk(
 	if (!list_empty(&tmp))
 		xfs_ail_splice(ailp, cur, &tmp, lsn);
 
-	if (mlip_changed) {
-		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
-			xlog_assign_tail_lsn_locked(ailp->ail_mount);
-		spin_unlock(&ailp->ail_lock);
-
-		xfs_log_space_wake(ailp->ail_mount);
-	} else {
-		spin_unlock(&ailp->ail_lock);
-	}
+	xfs_ail_update_finish(ailp, mlip_changed);
 }
 
 bool
@@ -792,10 +805,10 @@ void
 xfs_trans_ail_delete(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip,
-	int			shutdown_type) __releases(ailp->ail_lock)
+	int			shutdown_type)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
-	bool			mlip_changed;
+	bool			need_update;
 
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
@@ -808,17 +821,8 @@ xfs_trans_ail_delete(
 		return;
 	}
 
-	mlip_changed = xfs_ail_delete_one(ailp, lip);
-	if (mlip_changed) {
-		if (!XFS_FORCED_SHUTDOWN(mp))
-			xlog_assign_tail_lsn_locked(mp);
-		if (list_empty(&ailp->ail_head))
-			wake_up_all(&ailp->ail_empty);
-	}
-
-	spin_unlock(&ailp->ail_lock);
-	if (mlip_changed)
-		xfs_log_space_wake(ailp->ail_mount);
+	need_update = xfs_ail_delete_one(ailp, lip);
+	xfs_ail_update_finish(ailp, need_update);
 }
 
 int
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614f..64ffa746730e4 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -92,8 +92,10 @@ xfs_trans_ail_update(
 }
 
 bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
+			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
-		int shutdown_type) __releases(ailp->ail_lock);
+		int shutdown_type);
 
 static inline void
 xfs_trans_ail_remove(
-- 
2.26.0.rc2



* [PATCH 7/8] xfs: tail updates only need to occur when LSN changes
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (5 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25  5:10   ` Allison Collins
  2020-03-26  5:14   ` Darrick J. Wong
  2020-03-25  1:42 ` [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
  2020-03-25  1:51 ` [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
  8 siblings, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently wake anything waiting on the log tail to move whenever
the log item at the tail of the log is removed. Historically this
was fine behaviour because there were very few items at any given
LSN. But with delayed logging, there may be thousands of items at
any given LSN, and we can't move the tail until they are all gone.

Hence if we are removing them in near tail-first order, we might be
waking up processes waiting on the tail LSN to change (e.g. log
space waiters) repeatedly without them being able to make progress.
This also occurs with the new sync push waiters, and can result in
thousands of spurious wakeups every second when under heavy direct
reclaim pressure.

To fix this, check that the tail LSN has actually changed on the
AIL before triggering wakeups. This will reduce the number of
spurious wakeups when doing bulk AIL removal and make this code much
more efficient.
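
The resulting pattern, distilled from the diff below:
xfs_ail_delete_one() now returns the item's LSN only when the item
was at the tail of the AIL, and xfs_ail_update_finish() compares
that against the new minimum LSN before doing any updates or
wakeups:

    tail_lsn = xfs_ail_delete_one(ailp, lip); /* 0 unless lip was the tail */
    xfs_ail_update_finish(ailp, tail_lsn);    /* no wakeups if the tail
                                               * LSN did not change */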

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 18 ++++++++++----
 fs/xfs/xfs_trans_ail.c  | 52 ++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_trans_priv.h |  4 ++--
 3 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index bd8c368098707..a627cb951dc61 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -730,19 +730,27 @@ xfs_iflush_done(
 	 * holding the lock before removing the inode from the AIL.
 	 */
 	if (need_ail) {
-		bool			mlip_changed = false;
+		xfs_lsn_t	tail_lsn = 0;
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
 			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
-				mlip_changed |= xfs_ail_delete_one(ailp, blip);
-			else {
+			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+				/*
+				 * xfs_ail_update_finish() only cares about the
+				 * lsn of the first tail item removed, any
+				 * others will be at the same or higher lsn so
+				 * we just ignore them.
+				 */
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+				if (!tail_lsn && lsn)
+					tail_lsn = lsn;
+			} else {
 				xfs_clear_li_failed(blip);
 			}
 		}
-		xfs_ail_update_finish(ailp, mlip_changed);
+		xfs_ail_update_finish(ailp, tail_lsn);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 26d2e7928121c..564253550b754 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -109,17 +109,25 @@ xfs_ail_next(
  * We need the AIL lock in order to get a coherent read of the lsn of the last
  * item in the AIL.
  */
+static xfs_lsn_t
+__xfs_ail_min_lsn(
+	struct xfs_ail		*ailp)
+{
+	struct xfs_log_item	*lip = xfs_ail_min(ailp);
+
+	if (lip)
+		return lip->li_lsn;
+	return 0;
+}
+
 xfs_lsn_t
 xfs_ail_min_lsn(
 	struct xfs_ail		*ailp)
 {
-	xfs_lsn_t		lsn = 0;
-	struct xfs_log_item	*lip;
+	xfs_lsn_t		lsn;
 
 	spin_lock(&ailp->ail_lock);
-	lip = xfs_ail_min(ailp);
-	if (lip)
-		lsn = lip->li_lsn;
+	lsn = __xfs_ail_min_lsn(ailp);
 	spin_unlock(&ailp->ail_lock);
 
 	return lsn;
@@ -684,11 +692,12 @@ xfs_ail_push_all_sync(
 void
 xfs_ail_update_finish(
 	struct xfs_ail		*ailp,
-	bool			do_tail_update) __releases(ailp->ail_lock)
+	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
 
-	if (!do_tail_update) {
+	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
+	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
 		spin_unlock(&ailp->ail_lock);
 		return;
 	}
@@ -733,7 +742,7 @@ xfs_trans_ail_update_bulk(
 	xfs_lsn_t		lsn) __releases(ailp->ail_lock)
 {
 	struct xfs_log_item	*mlip;
-	int			mlip_changed = 0;
+	xfs_lsn_t		tail_lsn = 0;
 	int			i;
 	LIST_HEAD(tmp);
 
@@ -748,9 +757,10 @@ xfs_trans_ail_update_bulk(
 				continue;
 
 			trace_xfs_ail_move(lip, lip->li_lsn, lsn);
+			if (mlip == lip && !tail_lsn)
+				tail_lsn = lip->li_lsn;
+
 			xfs_ail_delete(ailp, lip);
-			if (mlip == lip)
-				mlip_changed = 1;
 		} else {
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
@@ -761,15 +771,23 @@ xfs_trans_ail_update_bulk(
 	if (!list_empty(&tmp))
 		xfs_ail_splice(ailp, cur, &tmp, lsn);
 
-	xfs_ail_update_finish(ailp, mlip_changed);
+	xfs_ail_update_finish(ailp, tail_lsn);
 }
 
-bool
+/*
+ * Delete one log item from the AIL.
+ *
+ * If this item was at the tail of the AIL, return the LSN of the log item so
+ * that we can use it to check if the LSN of the tail of the log has moved
+ * when finishing up the AIL delete process in xfs_ail_update_finish().
+ */
+xfs_lsn_t
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
+	xfs_lsn_t		lsn = lip->li_lsn;
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
@@ -777,7 +795,9 @@ xfs_ail_delete_one(
 	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
 	lip->li_lsn = 0;
 
-	return mlip == lip;
+	if (mlip == lip)
+		return lsn;
+	return 0;
 }
 
 /**
@@ -808,7 +828,7 @@ xfs_trans_ail_delete(
 	int			shutdown_type)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
-	bool			need_update;
+	xfs_lsn_t		tail_lsn;
 
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
@@ -821,8 +841,8 @@ xfs_trans_ail_delete(
 		return;
 	}
 
-	need_update = xfs_ail_delete_one(ailp, lip);
-	xfs_ail_update_finish(ailp, need_update);
+	tail_lsn = xfs_ail_delete_one(ailp, lip);
+	xfs_ail_update_finish(ailp, tail_lsn);
 }
 
 int
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 64ffa746730e4..35655eac01a65 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -91,8 +91,8 @@ xfs_trans_ail_update(
 	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
 }
 
-bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
-void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
+xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
 			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
 		int shutdown_type);
-- 
2.26.0.rc2



* [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (6 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 7/8] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2020-03-25  1:42 ` Dave Chinner
  2020-03-25 13:30   ` Brian Foster
  2020-03-25  1:51 ` [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
  8 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

There's a lot of indentation in this code, which makes it a bit hard
to follow. We are also going to completely rework the inode lookup code
as part of the inode reclaim rework, so factor out the inode lookup
code from the inode cluster freeing code.

Based on prototype code from Christoph Hellwig.
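
The net effect on the caller is that the lookup portion of the loop
in xfs_ifree_cluster() reduces to this (taken from the hunks below):

    for (i = 0; i < igeo->inodes_per_cluster; i++) {
        ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
        if (!ip)
            continue;
        /* ... attach the inode log item to the cluster buffer ... */
    }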

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 152 +++++++++++++++++++++++++--------------------
 1 file changed, 84 insertions(+), 68 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 14b922f2a6db7..5c930863ed5b8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2503,6 +2503,88 @@ xfs_iunlink_remove(
 	return error;
 }
 
+/*
+ * Look up the inode number specified and mark it stale if it is found. If it is
+ * dirty, return the inode so it can be attached to the cluster buffer and
+ * processed appropriately when the cluster free transaction completes.
+ */
+static struct xfs_inode *
+xfs_ifree_get_one_inode(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*free_ip,
+	int			inum)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode	*ip;
+
+retry:
+	rcu_read_lock();
+	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
+
+	/* Inode not in memory, nothing to do */
+	if (!ip)
+		goto out_rcu_unlock;
+
+	/*
+	 * Because this is an RCU protected lookup, we could find a recently
+	 * freed or even reallocated inode during the lookup. We need to check
+	 * under the i_flags_lock for a valid inode here. Skip it if it is not
+	 * valid, the wrong inode or stale.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
+		spin_unlock(&ip->i_flags_lock);
+		goto out_rcu_unlock;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	/*
+	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
+	 * other inodes that we did not find in the list attached to the buffer
+	 * and are not already marked stale. If we can't lock it, back off and
+	 * retry.
+	 */
+	if (ip != free_ip) {
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
+			rcu_read_unlock();
+			delay(1);
+			goto retry;
+		}
+
+		/*
+		 * Check the inode number again in case we're racing with
+		 * freeing in xfs_reclaim_inode().  See the comments in that
+		 * function for more information as to why the initial check is
+		 * not sufficient.
+		 */
+		if (ip->i_ino != inum) {
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			goto out_rcu_unlock;
+		}
+	}
+	rcu_read_unlock();
+
+	xfs_iflock(ip);
+	xfs_iflags_set(ip, XFS_ISTALE);
+
+	/*
+	 * We don't need to attach clean inodes or those only with unlogged
+	 * changes (which we throw away, anyway).
+	 */
+	if (!ip->i_itemp || xfs_inode_clean(ip)) {
+		ASSERT(ip != free_ip);
+		xfs_ifunlock(ip);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		goto out_no_inode;
+	}
+	return ip;
+
+out_rcu_unlock:
+	rcu_read_unlock();
+out_no_inode:
+	return NULL;
+}
+
 /*
  * A big issue when freeing the inode cluster is that we _cannot_ skip any
  * inodes that are in memory - they all must be marked stale and attached to
@@ -2603,77 +2685,11 @@ xfs_ifree_cluster(
 		 * even trying to lock them.
 		 */
 		for (i = 0; i < igeo->inodes_per_cluster; i++) {
-retry:
-			rcu_read_lock();
-			ip = radix_tree_lookup(&pag->pag_ici_root,
-					XFS_INO_TO_AGINO(mp, (inum + i)));
-
-			/* Inode not in memory, nothing to do */
-			if (!ip) {
-				rcu_read_unlock();
+			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
+			if (!ip)
 				continue;
-			}
-
-			/*
-			 * because this is an RCU protected lookup, we could
-			 * find a recently freed or even reallocated inode
-			 * during the lookup. We need to check under the
-			 * i_flags_lock for a valid inode here. Skip it if it
-			 * is not valid, the wrong inode or stale.
-			 */
-			spin_lock(&ip->i_flags_lock);
-			if (ip->i_ino != inum + i ||
-			    __xfs_iflags_test(ip, XFS_ISTALE)) {
-				spin_unlock(&ip->i_flags_lock);
-				rcu_read_unlock();
-				continue;
-			}
-			spin_unlock(&ip->i_flags_lock);
-
-			/*
-			 * Don't try to lock/unlock the current inode, but we
-			 * _cannot_ skip the other inodes that we did not find
-			 * in the list attached to the buffer and are not
-			 * already marked stale. If we can't lock it, back off
-			 * and retry.
-			 */
-			if (ip != free_ip) {
-				if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-					rcu_read_unlock();
-					delay(1);
-					goto retry;
-				}
-
-				/*
-				 * Check the inode number again in case we're
-				 * racing with freeing in xfs_reclaim_inode().
-				 * See the comments in that function for more
-				 * information as to why the initial check is
-				 * not sufficient.
-				 */
-				if (ip->i_ino != inum + i) {
-					xfs_iunlock(ip, XFS_ILOCK_EXCL);
-					rcu_read_unlock();
-					continue;
-				}
-			}
-			rcu_read_unlock();
 
-			xfs_iflock(ip);
-			xfs_iflags_set(ip, XFS_ISTALE);
-
-			/*
-			 * we don't need to attach clean inodes or those only
-			 * with unlogged changes (which we throw away, anyway).
-			 */
 			iip = ip->i_itemp;
-			if (!iip || xfs_inode_clean(ip)) {
-				ASSERT(ip != free_ip);
-				xfs_ifunlock(ip);
-				xfs_iunlock(ip, XFS_ILOCK_EXCL);
-				continue;
-			}
-
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
-- 
2.26.0.rc2



* Re: [PATCH 0/8] xfs: various fixes and cleanups
  2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
                   ` (7 preceding siblings ...)
  2020-03-25  1:42 ` [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
@ 2020-03-25  1:51 ` Dave Chinner
  8 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  1:51 UTC (permalink / raw)
  To: linux-xfs

On Wed, Mar 25, 2020 at 12:41:57PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> These are the fixes and cleanups that are part of the non-blocking
> inode reclaim series I've (slowly) been working on. These fixes and
> cleanups stand alone, many have already been reviewed, and getting
> them out of the non-blocking reclaim patchset makes that a much
> smaller and easier to digest set of patches.
> 
> The changes in this patchset are for:
> 
> - limiting the size of checkpoints that the CIL builds to reduce the
>   memory it pins and the latency of commits.
> - cleaning up the AIL item removal code so we can reduce the number
>   of tail LSN updates to prevent unnecessary thundering herd wakeups
> - accounting for reclaimable slab caches in XFS correctly
> - accounting for reclaimed pages from buffers correctly
> - avoiding log IO priority inversions
> - factoring the inode cluster deletion code to make it more readable
>   and easier to modify for the non-blocking inode reclaim mods.
> 
> Thoughts, comments and improvements welcome.

Oops, forgot to mention this is based on the for-next tree with
Christoph's xlog-ticket-cleanup.2 branch merged on top. You can pull
it from here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-fixes-cleanups-5.7

or browse:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=xfs-fixes-cleanups-5.7

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/8] xfs: Lower CIL flush limit for large logs
  2020-03-25  1:41 ` [PATCH 1/8] xfs: Lower CIL flush limit for large logs Dave Chinner
@ 2020-03-25  4:42   ` Allison Collins
  0 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 3/24/20 6:41 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The current CIL size aggregation limit is 1/8th the log size. This
> means for large logs we might be aggregating at least 250MB of dirty objects
> in memory before the CIL is flushed to the journal. With CIL shadow
> buffers sitting around, this means the CIL is often consuming >500MB
> of temporary memory that is all allocated under GFP_NOFS conditions.
> 
> Flushing the CIL can take some time to do if there is other IO
> ongoing, and can introduce substantial log force latency by itself.
> It also pins the memory until the objects are in the AIL and can be
> written back and reclaimed by shrinkers. Hence this threshold also
> tends to determine the minimum amount of memory XFS can operate in
> under heavy modification without triggering the OOM killer.
> 
> Modify the CIL space limit to prevent such huge amounts of pinned
> metadata from aggregating. We can have 2MB of log IO in flight at
> once, so limit aggregation to 16x this size. This threshold was
> chosen as it has little impact on performance (on 16-way fsmark) or log
> traffic but pins a lot less memory on large logs especially under
> heavy memory pressure.  An aggregation limit of 8x had 5-10%
> performance degradation and a 50% increase in log throughput for
> the same workload, so clearly that was too small for highly
> concurrent workloads on large logs.
> 
> This was found via trace analysis of AIL behaviour. e.g. insertion
> from a single CIL flush:
> 
> xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> 
> $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
> 1721823
> $
> 
> So there were 1.7 million objects inserted into the AIL from this
> CIL checkpoint, the first at 2323.392108, the last at 2325.667566, which
> was the end of the trace (i.e. it hadn't finished). Clearly a major
> problem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Looks ok to me
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------
>   1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index cfcf3f02e30a3..8c4be91f62d0d 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -316,13 +316,30 @@ struct xfs_cil {
>    * tries to keep 25% of the log free, so we need to keep below that limit or we
>    * risk running out of free log space to start any new transactions.
>    *
> - * In order to keep background CIL push efficient, we will set a lower
> - * threshold at which background pushing is attempted without blocking current
> - * transaction commits.  A separate, higher bound defines when CIL pushes are
> - * enforced to ensure we stay within our maximum checkpoint size bounds.
> - * threshold, yet give us plenty of space for aggregation on large logs.
> + * In order to keep background CIL push efficient, we only need to ensure the
> + * CIL is large enough to maintain sufficient in-memory relogging to avoid
> + * repeated physical writes of frequently modified metadata. If we allow the CIL
> + * to grow to a substantial fraction of the log, then we may be pinning hundreds
> + * of megabytes of metadata in memory until the CIL flushes. This can cause
> + * issues when we are running low on memory - pinned memory cannot be reclaimed,
> + * and the CIL consumes a lot of memory. Hence we need to set an upper physical
> + * size limit for the CIL that limits the maximum amount of memory pinned by the
> + * CIL but does not limit performance by reducing relogging efficiency
> + * significantly.
> + *
> + * As such, the CIL push threshold ends up being the smaller of two thresholds:
> + * - a threshold large enough that it allows CIL to be pushed and progress to be
> + *   made without excessive blocking of incoming transaction commits. This is
> + *   defined to be 12.5% of the log space - half the 25% push threshold of the
> + *   AIL.
> + * - small enough that it doesn't pin excessive amounts of memory but maintains
> + *   close to peak relogging efficiency. This is defined to be 16x the iclog
> + *   buffer window (32MB) as measurements have shown this to be roughly the
> + *   point of diminishing performance increases under highly concurrent
> + *   modification workloads.
>    */
> -#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
> +#define XLOG_CIL_SPACE_LIMIT(log)	\
> +	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>   
>   /*
>    * ticket grant locks, queues and accounting have their own cachlines
> 


* Re: [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
@ 2020-03-25  4:42   ` Allison Collins
  2020-03-25  5:07   ` Dave Chinner
  2020-03-26  5:24   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 3/24/20 6:41 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In certain situations the background CIL push can be indefinitely
> delayed. While we have workarounds for the obvious cases now, it
> doesn't solve the underlying issue. This issue is that there is no
> upper limit on the CIL where we will either force or wait for
> a background push to start, hence allowing the CIL to grow without
> bound until it consumes all log space.
> 
> To fix this, add a new wait queue to the CIL which allows background
> pushes to wait for the CIL context to be switched out. This happens
> when the push starts, so it will allow us to block incoming
> transaction commit completion until the push has started. This will
> only affect processes that are running modifications, and only when
> the CIL threshold has been significantly overrun.
> 
> This has no apparent impact on performance, and did not trigger
> until over 45 million inodes had been created in a 16-way fsmark
> test on a 2GB log. That was limiting at 64MB of log space used, so
> the active CIL size is only about 3% of the total log in that case.
> The concurrent removal of those files did not trigger the background
> sleep at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
Ok, I don't see any obvious errors
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
>   fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
>   fs/xfs/xfs_trace.h    |  1 +
>   3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 27de462d2ba40..ac43301ae2f43 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -668,6 +668,11 @@ xlog_cil_push_work(
>   	push_seq = cil->xc_push_seq;
>   	ASSERT(push_seq <= ctx->sequence);
>   
> +	/*
> +	 * Wake up any background push waiters now this context is being pushed.
> +	 */
> +	wake_up_all(&ctx->push_wait);
> +
>   	/*
>   	 * Check if we've anything to push. If there is nothing, then we don't
>   	 * move on to a new sequence number and so we have to be able to push
> @@ -744,6 +749,7 @@ xlog_cil_push_work(
>   	 */
>   	INIT_LIST_HEAD(&new_ctx->committing);
>   	INIT_LIST_HEAD(&new_ctx->busy_extents);
> +	init_waitqueue_head(&new_ctx->push_wait);
>   	new_ctx->sequence = ctx->sequence + 1;
>   	new_ctx->cil = cil;
>   	cil->xc_ctx = new_ctx;
> @@ -891,7 +897,7 @@ xlog_cil_push_work(
>    */
>   static void
>   xlog_cil_push_background(
> -	struct xlog	*log)
> +	struct xlog	*log) __releases(cil->xc_ctx_lock)
>   {
>   	struct xfs_cil	*cil = log->l_cilp;
>   
> @@ -905,14 +911,36 @@ xlog_cil_push_background(
>   	 * don't do a background push if we haven't used up all the
>   	 * space available yet.
>   	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +		up_read(&cil->xc_ctx_lock);
>   		return;
> +	}
>   
>   	spin_lock(&cil->xc_push_lock);
>   	if (cil->xc_push_seq < cil->xc_current_sequence) {
>   		cil->xc_push_seq = cil->xc_current_sequence;
>   		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>   	}
> +
> +	/*
> +	 * Drop the context lock now, we can't hold that if we need to sleep
> +	 * because we are over the blocking threshold. The push_lock is still
> +	 * held, so blocking threshold sleep/wakeup is still correctly
> +	 * serialised here.
> +	 */
> +	up_read(&cil->xc_ctx_lock);
> +
> +	/*
> +	 * If we are well over the space limit, throttle the work that is being
> +	 * done until the push work on this context has begun.
> +	 */
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> +		return;
> +	}
> +
>   	spin_unlock(&cil->xc_push_lock);
>   
>   }
> @@ -1032,9 +1060,9 @@ xfs_log_commit_cil(
>   		if (lip->li_ops->iop_committing)
>   			lip->li_ops->iop_committing(lip, xc_commit_lsn);
>   	}
> -	xlog_cil_push_background(log);
>   
> -	up_read(&cil->xc_ctx_lock);
> +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> +	xlog_cil_push_background(log);
>   }
>   
>   /*
> @@ -1193,6 +1221,7 @@ xlog_cil_init(
>   
>   	INIT_LIST_HEAD(&ctx->committing);
>   	INIT_LIST_HEAD(&ctx->busy_extents);
> +	init_waitqueue_head(&ctx->push_wait);
>   	ctx->sequence = 1;
>   	ctx->cil = cil;
>   	cil->xc_ctx = ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 8c4be91f62d0d..dacab1817a1b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -240,6 +240,7 @@ struct xfs_cil_ctx {
>   	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>   	struct list_head	iclog_entry;
>   	struct list_head	committing;	/* ctx committing list */
> +	wait_queue_head_t	push_wait;	/* background push throttle */
>   	struct work_struct	discard_endio_work;
>   };
>   
> @@ -337,10 +338,33 @@ struct xfs_cil {
>    *   buffer window (32MB) as measurements have shown this to be roughly the
>    *   point of diminishing performance increases under highly concurrent
>    *   modification workloads.
> + *
> + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> + * new threshold at which we block committing transactions until the background
> + * CIL commit commences and switches to a new context. While this is not a hard
> + * limit, it forces the process committing a transaction to the CIL to block and
> + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> + * work. This prevents a process running lots of transactions from overfilling
> + * the CIL because it is not yielding the CPU. We set the blocking limit at
> + * twice the background push space threshold so we keep in line with the AIL
> + * push thresholds.
> + *
> + * Note: this is not a -hard- limit as blocking is applied after the transaction
> + * is inserted into the CIL and the push has been triggered. It is largely a
> + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> + * limit will be difficult to implement without introducing global serialisation
> + * in the CIL commit fast path, and it's not at all clear that we actually need
> + * such hard limits given the ~7 years we've run without a hard limit before
> + * finding the first situation where a checkpoint size overflow actually
> + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> + * we've overrun the max size.
>    */
>   #define XLOG_CIL_SPACE_LIMIT(log)	\
>   	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>   
> +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> +
>   /*
>    * ticket grant locks, queues and accounting have their own cachlines
>    * as these are quite hot and can be operated on concurrently.
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fbfdd9cf160df..575ca74532f79 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1015,6 +1015,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_regrant_sub);
>   DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done);
>   DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_sub);
>   DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_exit);
> +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
>   
>   DECLARE_EVENT_CLASS(xfs_log_item_class,
>   	TP_PROTO(struct xfs_log_item *lip),
> 


* Re: [PATCH 3/8] xfs: don't allow log IO to be throttled
  2020-03-25  1:42 ` [PATCH 3/8] xfs: don't allow log IO to be throttled Dave Chinner
@ 2020-03-25  4:42   ` Allison Collins
  0 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 3/24/20 6:42 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Running metadata intensive workloads, I've been seeing the AIL
> push getting stuck on pinned buffers and triggering log forces.
> The log force is taking a long time to run because the log IO is
> getting throttled by wbt_wait() - the block layer writeback
> throttle. It's being throttled because there is a huge amount of
> metadata writeback going on which is filling the request queue.
> 
> IOWs, we have a priority inversion problem here.
> 
> Mark the log IO bios with REQ_IDLE so they don't get throttled
> by the block layer writeback throttle. When we are forcing the CIL,
> we are likely to need to do tens of log IOs, and they are issued as
> fast as they can be built and the IO completed. Hence REQ_IDLE is
> appropriate - it's an indication that more IO will follow shortly.
> 
> And because we also set REQ_SYNC, the writeback throttle will now
> treat log IO the same way it treats direct IO writes - it will not
> throttle them at all. Hence we solve the priority inversion problem
> caused by the writeback throttle being unable to distinguish between
> high priority log IO and background metadata writeback.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Looks ok
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_log.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f70be8151a59f..e16ad0b1aec84 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1687,7 +1687,15 @@ xlog_write_iclog(
>   	iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
>   	iclog->ic_bio.bi_end_io = xlog_bio_end_io;
>   	iclog->ic_bio.bi_private = iclog;
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_FUA;
> +
> +	/*
> +	 * We use REQ_SYNC | REQ_IDLE here to tell the block layer there are more
> +	 * IOs coming immediately after this one. This prevents the block layer
> +	 * writeback throttle from throttling log writes behind background
> +	 * metadata writeback and causing priority inversions.
> +	 */
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> +				REQ_IDLE | REQ_FUA;
>   	if (need_flush)
>   		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
>   
> 
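
For reference, the writeback throttle check that these flags bypass looks
roughly like the below (paraphrased from block/blk-wbt.c of this era -
treat it as a sketch rather than the exact code): a REQ_OP_WRITE bio
carrying both REQ_SYNC and REQ_IDLE is never throttled.

	/* Sketch of the wbt classification; not part of this patch. */
	static inline bool wbt_should_throttle(struct rq_wb *rwb,
					       struct bio *bio)
	{
		switch (bio_op(bio)) {
		case REQ_OP_WRITE:
			/* Don't throttle direct-IO style writes. */
			if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
			    (REQ_SYNC | REQ_IDLE))
				return false;
			/* fall through */
		case REQ_OP_DISCARD:
			return true;
		default:
			return false;
		}
	}

With this patch the log bios take the "return false" path above and never
enter the wbt wait queues.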

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability
  2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
@ 2020-03-25  4:42   ` Allison Collins
  2020-03-25 13:30   ` Brian Foster
  2020-03-26  5:05   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 3/24/20 6:42 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, so spread them
> around like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, makes sense
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_buf.c | 11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f880141a22681..9ec3eaf1c618f 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -327,6 +327,9 @@ xfs_buf_free(
>   
>   			__free_page(page);
>   		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>   	} else if (bp->b_flags & _XBF_KMEM)
>   		kmem_free(bp->b_addr);
>   	_xfs_buf_free_pages(bp);
> @@ -2114,9 +2117,11 @@ xfs_buf_delwri_pushbuf(
>   int __init
>   xfs_buf_init(void)
>   {
> -	xfs_buf_zone = kmem_cache_create("xfs_buf",
> -					 sizeof(struct xfs_buf), 0,
> -					 SLAB_HWCACHE_ALIGN, NULL);
> +	xfs_buf_zone = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
> +					 SLAB_HWCACHE_ALIGN |
> +					 SLAB_RECLAIM_ACCOUNT |
> +					 SLAB_MEM_SPREAD,
> +					 NULL);
>   	if (!xfs_buf_zone)
>   		goto out;
>   
> 
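
As a side note on where the reclaim_state hunk feeds in: direct reclaim
folds anything accumulated in current->reclaim_state->reclaimed_slab into
its reclaimed page count, along these lines (a simplified sketch of the
mm/vmscan.c pattern, not code from this patch):

	struct reclaim_state *rs = current->reclaim_state;

	if (rs) {
		/* count pages freed by shrinkers as reclaim progress */
		sc->nr_reclaimed += rs->reclaimed_slab;
		rs->reclaimed_slab = 0;
	}

so the pages freed in xfs_buf_free() now count as reclaim progress
instead of being invisible to the scan logic.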

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/8] xfs: correctly account for reclaimable slabs
  2020-03-25  1:42 ` [PATCH 5/8] xfs: correctly account for reclaimable slabs Dave Chinner
@ 2020-03-25  4:43   ` Allison Collins
  0 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:43 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 3/24/20 6:42 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The XFS inode item slab is actually reclaimed by inode shrinker
> callbacks from the memory reclaim subsystem. These should be marked
> as reclaimable so the mm subsystem has the full picture of how much
> memory it can actually reclaim from the XFS slab caches.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Looks fine
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_super.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2094386af8aca..68fea439d9743 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1861,7 +1861,8 @@ xfs_init_zones(void)
>   
>   	xfs_ili_zone = kmem_cache_create("xfs_ili",
>   					 sizeof(struct xfs_inode_log_item), 0,
> -					 SLAB_MEM_SPREAD, NULL);
> +					 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
> +					 NULL);
>   	if (!xfs_ili_zone)
>   		goto out_destroy_inode_zone;
>   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] xfs: factor common AIL item deletion code
  2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
@ 2020-03-25  4:54   ` Allison Collins
  2020-03-25 13:30   ` Brian Foster
  2020-03-26  5:10   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  4:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 3/24/20 6:42 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Ok, the hoisted code looks equivalent
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_inode_item.c | 12 +----------
>   fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
>   fs/xfs/xfs_trans_priv.h |  4 +++-
>   3 files changed, 30 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..bd8c368098707 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -742,17 +742,7 @@ xfs_iflush_done(
>   				xfs_clear_li_failed(blip);
>   			}
>   		}
> -
> -		if (mlip_changed) {
> -			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -				xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -			if (list_empty(&ailp->ail_head))
> -				wake_up_all(&ailp->ail_empty);
> -		}
> -		spin_unlock(&ailp->ail_lock);
> -
> -		if (mlip_changed)
> -			xfs_log_space_wake(ailp->ail_mount);
> +		xfs_ail_update_finish(ailp, mlip_changed);
>   	}
>   
>   	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 2ef0dfbfb303d..26d2e7928121c 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -681,6 +681,27 @@ xfs_ail_push_all_sync(
>   	finish_wait(&ailp->ail_empty, &wait);
>   }
>   
> +void
> +xfs_ail_update_finish(
> +	struct xfs_ail		*ailp,
> +	bool			do_tail_update) __releases(ailp->ail_lock)
> +{
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +
> +	if (!do_tail_update) {
> +		spin_unlock(&ailp->ail_lock);
> +		return;
> +	}
> +
> +	if (!XFS_FORCED_SHUTDOWN(mp))
> +		xlog_assign_tail_lsn_locked(mp);
> +
> +	if (list_empty(&ailp->ail_head))
> +		wake_up_all(&ailp->ail_empty);
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(mp);
> +}
> +
>   /*
>    * xfs_trans_ail_update - bulk AIL insertion operation.
>    *
> @@ -740,15 +761,7 @@ xfs_trans_ail_update_bulk(
>   	if (!list_empty(&tmp))
>   		xfs_ail_splice(ailp, cur, &tmp, lsn);
>   
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
> -	}
> +	xfs_ail_update_finish(ailp, mlip_changed);
>   }
>   
>   bool
> @@ -792,10 +805,10 @@ void
>   xfs_trans_ail_delete(
>   	struct xfs_ail		*ailp,
>   	struct xfs_log_item	*lip,
> -	int			shutdown_type) __releases(ailp->ail_lock)
> +	int			shutdown_type)
>   {
>   	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			mlip_changed;
> +	bool			need_update;
>   
>   	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>   		spin_unlock(&ailp->ail_lock);
> @@ -808,17 +821,8 @@ xfs_trans_ail_delete(
>   		return;
>   	}
>   
> -	mlip_changed = xfs_ail_delete_one(ailp, lip);
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(mp))
> -			xlog_assign_tail_lsn_locked(mp);
> -		if (list_empty(&ailp->ail_head))
> -			wake_up_all(&ailp->ail_empty);
> -	}
> -
> -	spin_unlock(&ailp->ail_lock);
> -	if (mlip_changed)
> -		xfs_log_space_wake(ailp->ail_mount);
> +	need_update = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, need_update);
>   }
>   
>   int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614f..64ffa746730e4 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -92,8 +92,10 @@ xfs_trans_ail_update(
>   }
>   
>   bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +			__releases(ailp->ail_lock);
>   void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
> -		int shutdown_type) __releases(ailp->ail_lock);
> +		int shutdown_type);
>   
>   static inline void
>   xfs_trans_ail_remove(
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
  2020-03-25  4:42   ` Allison Collins
@ 2020-03-25  5:07   ` Dave Chinner
  2020-03-26  5:24   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-25  5:07 UTC (permalink / raw)
  To: linux-xfs

On Wed, Mar 25, 2020 at 12:41:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In certain situations the background CIL push can be indefinitely
> delayed. While we have workarounds for the obvious cases now, they
> don't address the underlying issue: there is no
> upper limit on the CIL where we will either force or wait for
> a background push to start, hence allowing the CIL to grow without
> bound until it consumes all log space.
> 
> To fix this, add a new wait queue to the CIL which allows background
> pushes to wait for the CIL context to be switched out. This happens
> when the push starts, so it will allow us to block incoming
> transaction commit completion until the push has started. This will
> only affect processes that are running modifications, and only when
> the CIL threshold has been significantly overrun.
> 
> This has no apparent impact on performance, and didn't even trigger
> until over 45 million inodes had been created in a 16-way fsmark
> test on a 2GB log. The throttle kicked in at 64MB of log space used,
> so the active CIL size was only about 3% of the total log in that case.
> The concurrent removal of those files did not trigger the background
> sleep at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Sorry, Brian, I forgot to drop your rvb from this patch as it was
incorrectly added. Maybe next time?

-Dave.

> ---
>  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h    |  1 +
>  3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 27de462d2ba40..ac43301ae2f43 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -668,6 +668,11 @@ xlog_cil_push_work(
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
>  
> +	/*
> +	 * Wake up any background push waiters now this context is being pushed.
> +	 */
> +	wake_up_all(&ctx->push_wait);
> +
>  	/*
>  	 * Check if we've anything to push. If there is nothing, then we don't
>  	 * move on to a new sequence number and so we have to be able to push
> @@ -744,6 +749,7 @@ xlog_cil_push_work(
>  	 */
>  	INIT_LIST_HEAD(&new_ctx->committing);
>  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> +	init_waitqueue_head(&new_ctx->push_wait);
>  	new_ctx->sequence = ctx->sequence + 1;
>  	new_ctx->cil = cil;
>  	cil->xc_ctx = new_ctx;
> @@ -891,7 +897,7 @@ xlog_cil_push_work(
>   */
>  static void
>  xlog_cil_push_background(
> -	struct xlog	*log)
> +	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -905,14 +911,36 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +		up_read(&cil->xc_ctx_lock);
>  		return;
> +	}
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	}
> +
> +	/*
> +	 * Drop the context lock now, we can't hold that if we need to sleep
> +	 * because we are over the blocking threshold. The push_lock is still
> +	 * held, so blocking threshold sleep/wakeup is still correctly
> +	 * serialised here.
> +	 */
> +	up_read(&cil->xc_ctx_lock);
> +
> +	/*
> +	 * If we are well over the space limit, throttle the work that is being
> +	 * done until the push work on this context has begun.
> +	 */
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> +		return;
> +	}
> +
>  	spin_unlock(&cil->xc_push_lock);
>  
>  }
> @@ -1032,9 +1060,9 @@ xfs_log_commit_cil(
>  		if (lip->li_ops->iop_committing)
>  			lip->li_ops->iop_committing(lip, xc_commit_lsn);
>  	}
> -	xlog_cil_push_background(log);
>  
> -	up_read(&cil->xc_ctx_lock);
> +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> +	xlog_cil_push_background(log);
>  }
>  
>  /*
> @@ -1193,6 +1221,7 @@ xlog_cil_init(
>  
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	init_waitqueue_head(&ctx->push_wait);
>  	ctx->sequence = 1;
>  	ctx->cil = cil;
>  	cil->xc_ctx = ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 8c4be91f62d0d..dacab1817a1b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -240,6 +240,7 @@ struct xfs_cil_ctx {
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
> +	wait_queue_head_t	push_wait;	/* background push throttle */
>  	struct work_struct	discard_endio_work;
>  };
>  
> @@ -337,10 +338,33 @@ struct xfs_cil {
>   *   buffer window (32MB) as measurements have shown this to be roughly the
>   *   point of diminishing performance increases under highly concurrent
>   *   modification workloads.
> + *
> + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> + * new threshold at which we block committing transactions until the background
> + * CIL commit commences and switches to a new context. While this is not a hard
> + * limit, it forces the process committing a transaction to the CIL to block and
> + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> + * work. This prevents a process running lots of transactions from overfilling
> + * the CIL because it is not yielding the CPU. We set the blocking limit at
> + * twice the background push space threshold so we keep in line with the AIL
> + * push thresholds.
> + *
> + * Note: this is not a -hard- limit as blocking is applied after the transaction
> + * is inserted into the CIL and the push has been triggered. It is largely a
> + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> + * limit will be difficult to implement without introducing global serialisation
> + * in the CIL commit fast path, and it's not at all clear that we actually need
> + * such hard limits given the ~7 years we've run without a hard limit before
> + * finding the first situation where a checkpoint size overflow actually
> + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> + * we've overrun the max size.
>   */
>  #define XLOG_CIL_SPACE_LIMIT(log)	\
>  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>  
> +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> +
>  /*
>   * ticket grant locks, queues and accounting have their own cachelines
>   * as these are quite hot and can be operated on concurrently.
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fbfdd9cf160df..575ca74532f79 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1015,6 +1015,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_regrant_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_exit);
> +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
>  
>  DECLARE_EVENT_CLASS(xfs_log_item_class,
>  	TP_PROTO(struct xfs_log_item *lip),
> -- 
> 2.26.0.rc2
> 
> 
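
To put concrete numbers on the two thresholds (a worked example, assuming
a v2 log where BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4 evaluates to 32MB;
the numbers are mine, not from the patch):

	/*
	 * Hypothetical 2GB log, matching the fsmark test above:
	 *	l_logsize >> 3			= 256MB
	 *	XLOG_CIL_SPACE_LIMIT		= min(256MB, 32MB) = 32MB
	 *	XLOG_CIL_BLOCKING_SPACE_LIMIT	= 2 * 32MB         = 64MB
	 */

which lines up with the commit message: the blocking throttle kicks in
once the CIL has consumed 64MB of log space.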

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 7/8] xfs: tail updates only need to occur when LSN changes
  2020-03-25  1:42 ` [PATCH 7/8] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2020-03-25  5:10   ` Allison Collins
  2020-03-26  5:14   ` Darrick J. Wong
  1 sibling, 0 replies; 28+ messages in thread
From: Allison Collins @ 2020-03-25  5:10 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 3/24/20 6:42 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently wake anything waiting on the log tail to move whenever
> the log item at the tail of the log is removed. Historically this
> was fine behaviour because there were very few items at any given
> LSN. But with delayed logging, there may be thousands of items at
> any given LSN, and we can't move the tail until they are all gone.
> 
> Hence if we are removing them in near tail-first order, we might be
> waking up processes waiting on the tail LSN to change (e.g. log
> space waiters) repeatedly without them being able to make progress.
> This also occurs with the new sync push waiters, and can result in
> thousands of spurious wakeups every second when under heavy direct
> reclaim pressure.
> 
> To fix this, check that the tail LSN has actually changed on the
> AIL before triggering wakeups. This will reduce the number of
> spurious wakeups when doing bulk AIL removal and make this code much
> more efficient.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
Ok, I don't see any obvious errors
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_inode_item.c | 18 ++++++++++----
>   fs/xfs/xfs_trans_ail.c  | 52 ++++++++++++++++++++++++++++-------------
>   fs/xfs/xfs_trans_priv.h |  4 ++--
>   3 files changed, 51 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index bd8c368098707..a627cb951dc61 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -730,19 +730,27 @@ xfs_iflush_done(
>   	 * holding the lock before removing the inode from the AIL.
>   	 */
>   	if (need_ail) {
> -		bool			mlip_changed = false;
> +		xfs_lsn_t	tail_lsn = 0;
>   
>   		/* this is an opencoded batch version of xfs_trans_ail_delete */
>   		spin_lock(&ailp->ail_lock);
>   		list_for_each_entry(blip, &tmp, li_bio_list) {
>   			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> -				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> -			else {
> +			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +				/*
> +				 * xfs_ail_update_finish() only cares about the
> +				 * lsn of the first tail item removed, any
> +				 * others will be at the same or higher lsn so
> +				 * we just ignore them.
> +				 */
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +				if (!tail_lsn && lsn)
> +					tail_lsn = lsn;
> +			} else {
>   				xfs_clear_li_failed(blip);
>   			}
>   		}
> -		xfs_ail_update_finish(ailp, mlip_changed);
> +		xfs_ail_update_finish(ailp, tail_lsn);
>   	}
>   
>   	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 26d2e7928121c..564253550b754 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -109,17 +109,25 @@ xfs_ail_next(
>    * We need the AIL lock in order to get a coherent read of the lsn of the last
>    * item in the AIL.
>    */
> +static xfs_lsn_t
> +__xfs_ail_min_lsn(
> +	struct xfs_ail		*ailp)
> +{
> +	struct xfs_log_item	*lip = xfs_ail_min(ailp);
> +
> +	if (lip)
> +		return lip->li_lsn;
> +	return 0;
> +}
> +
>   xfs_lsn_t
>   xfs_ail_min_lsn(
>   	struct xfs_ail		*ailp)
>   {
> -	xfs_lsn_t		lsn = 0;
> -	struct xfs_log_item	*lip;
> +	xfs_lsn_t		lsn;
>   
>   	spin_lock(&ailp->ail_lock);
> -	lip = xfs_ail_min(ailp);
> -	if (lip)
> -		lsn = lip->li_lsn;
> +	lsn = __xfs_ail_min_lsn(ailp);
>   	spin_unlock(&ailp->ail_lock);
>   
>   	return lsn;
> @@ -684,11 +692,12 @@ xfs_ail_push_all_sync(
>   void
>   xfs_ail_update_finish(
>   	struct xfs_ail		*ailp,
> -	bool			do_tail_update) __releases(ailp->ail_lock)
> +	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
>   {
>   	struct xfs_mount	*mp = ailp->ail_mount;
>   
> -	if (!do_tail_update) {
> +	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
> +	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
>   		spin_unlock(&ailp->ail_lock);
>   		return;
>   	}
> @@ -733,7 +742,7 @@ xfs_trans_ail_update_bulk(
>   	xfs_lsn_t		lsn) __releases(ailp->ail_lock)
>   {
>   	struct xfs_log_item	*mlip;
> -	int			mlip_changed = 0;
> +	xfs_lsn_t		tail_lsn = 0;
>   	int			i;
>   	LIST_HEAD(tmp);
>   
> @@ -748,9 +757,10 @@ xfs_trans_ail_update_bulk(
>   				continue;
>   
>   			trace_xfs_ail_move(lip, lip->li_lsn, lsn);
> +			if (mlip == lip && !tail_lsn)
> +				tail_lsn = lip->li_lsn;
> +
>   			xfs_ail_delete(ailp, lip);
> -			if (mlip == lip)
> -				mlip_changed = 1;
>   		} else {
>   			trace_xfs_ail_insert(lip, 0, lsn);
>   		}
> @@ -761,15 +771,23 @@ xfs_trans_ail_update_bulk(
>   	if (!list_empty(&tmp))
>   		xfs_ail_splice(ailp, cur, &tmp, lsn);
>   
> -	xfs_ail_update_finish(ailp, mlip_changed);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>   }
>   
> -bool
> +/*
> + * Delete one log item from the AIL.
> + *
> + * If this item was at the tail of the AIL, return the LSN of the log item so
> + * that we can use it to check if the LSN of the tail of the log has moved
> + * when finishing up the AIL delete process in xfs_ail_update_finish().
> + */
> +xfs_lsn_t
>   xfs_ail_delete_one(
>   	struct xfs_ail		*ailp,
>   	struct xfs_log_item	*lip)
>   {
>   	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> +	xfs_lsn_t		lsn = lip->li_lsn;
>   
>   	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>   	xfs_ail_delete(ailp, lip);
> @@ -777,7 +795,9 @@ xfs_ail_delete_one(
>   	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>   	lip->li_lsn = 0;
>   
> -	return mlip == lip;
> +	if (mlip == lip)
> +		return lsn;
> +	return 0;
>   }
>   
>   /**
> @@ -808,7 +828,7 @@ xfs_trans_ail_delete(
>   	int			shutdown_type)
>   {
>   	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			need_update;
> +	xfs_lsn_t		tail_lsn;
>   
>   	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>   		spin_unlock(&ailp->ail_lock);
> @@ -821,8 +841,8 @@ xfs_trans_ail_delete(
>   		return;
>   	}
>   
> -	need_update = xfs_ail_delete_one(ailp, lip);
> -	xfs_ail_update_finish(ailp, need_update);
> +	tail_lsn = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>   }
>   
>   int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 64ffa746730e4..35655eac01a65 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -91,8 +91,8 @@ xfs_trans_ail_update(
>   	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
>   }
>   
> -bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> -void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
>   			__releases(ailp->ail_lock);
>   void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
>   		int shutdown_type);
> 
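
To make the wakeup reduction concrete, consider a hypothetical sequence
(my own illustration, not from the patch):

	/*
	 * AIL contains items A, B and C at lsn 100, and D at lsn 150.
	 * Remove A: xfs_ail_delete_one() returns 100 (A was the min),
	 *	but __xfs_ail_min_lsn() is still 100 (B), so no wakeup.
	 * Remove B: same again - the min is now C at 100, no wakeup.
	 * Remove C: returns 100, the min is now D at 150. The tail has
	 *	finally moved, so waiters are woken exactly once.
	 */

Previously each of those three removals would have fired the full set of
tail-moved wakeups.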

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability
  2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
  2020-03-25  4:42   ` Allison Collins
@ 2020-03-25 13:30   ` Brian Foster
  2020-03-26  5:05   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Brian Foster @ 2020-03-25 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:01PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, so spread them
> around like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

https://lore.kernel.org/linux-xfs/20191101120513.GD59146@bfoster/

>  fs/xfs/xfs_buf.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f880141a22681..9ec3eaf1c618f 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -327,6 +327,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2114,9 +2117,11 @@ xfs_buf_delwri_pushbuf(
>  int __init
>  xfs_buf_init(void)
>  {
> -	xfs_buf_zone = kmem_cache_create("xfs_buf",
> -					 sizeof(struct xfs_buf), 0,
> -					 SLAB_HWCACHE_ALIGN, NULL);
> +	xfs_buf_zone = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
> +					 SLAB_HWCACHE_ALIGN |
> +					 SLAB_RECLAIM_ACCOUNT |
> +					 SLAB_MEM_SPREAD,
> +					 NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.26.0.rc2
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] xfs: factor common AIL item deletion code
  2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
  2020-03-25  4:54   ` Allison Collins
@ 2020-03-25 13:30   ` Brian Foster
  2020-03-26  5:10   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Brian Foster @ 2020-03-25 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_inode_item.c | 12 +----------
>  fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
>  fs/xfs/xfs_trans_priv.h |  4 +++-
>  3 files changed, 30 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..bd8c368098707 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -742,17 +742,7 @@ xfs_iflush_done(
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -
> -		if (mlip_changed) {
> -			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -				xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -			if (list_empty(&ailp->ail_head))
> -				wake_up_all(&ailp->ail_empty);
> -		}
> -		spin_unlock(&ailp->ail_lock);
> -
> -		if (mlip_changed)
> -			xfs_log_space_wake(ailp->ail_mount);
> +		xfs_ail_update_finish(ailp, mlip_changed);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 2ef0dfbfb303d..26d2e7928121c 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -681,6 +681,27 @@ xfs_ail_push_all_sync(
>  	finish_wait(&ailp->ail_empty, &wait);
>  }
>  
> +void
> +xfs_ail_update_finish(
> +	struct xfs_ail		*ailp,
> +	bool			do_tail_update) __releases(ailp->ail_lock)
> +{
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +
> +	if (!do_tail_update) {
> +		spin_unlock(&ailp->ail_lock);
> +		return;
> +	}
> +
> +	if (!XFS_FORCED_SHUTDOWN(mp))
> +		xlog_assign_tail_lsn_locked(mp);
> +
> +	if (list_empty(&ailp->ail_head))
> +		wake_up_all(&ailp->ail_empty);
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(mp);
> +}
> +
>  /*
>   * xfs_trans_ail_update - bulk AIL insertion operation.
>   *
> @@ -740,15 +761,7 @@ xfs_trans_ail_update_bulk(
>  	if (!list_empty(&tmp))
>  		xfs_ail_splice(ailp, cur, &tmp, lsn);
>  
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
> -	}
> +	xfs_ail_update_finish(ailp, mlip_changed);
>  }
>  
>  bool
> @@ -792,10 +805,10 @@ void
>  xfs_trans_ail_delete(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip,
> -	int			shutdown_type) __releases(ailp->ail_lock)
> +	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			mlip_changed;
> +	bool			need_update;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -808,17 +821,8 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	mlip_changed = xfs_ail_delete_one(ailp, lip);
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(mp))
> -			xlog_assign_tail_lsn_locked(mp);
> -		if (list_empty(&ailp->ail_head))
> -			wake_up_all(&ailp->ail_empty);
> -	}
> -
> -	spin_unlock(&ailp->ail_lock);
> -	if (mlip_changed)
> -		xfs_log_space_wake(ailp->ail_mount);
> +	need_update = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, need_update);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614f..64ffa746730e4 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -92,8 +92,10 @@ xfs_trans_ail_update(
>  }
>  
>  bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
> -		int shutdown_type) __releases(ailp->ail_lock);
> +		int shutdown_type);
>  
>  static inline void
>  xfs_trans_ail_remove(
> -- 
> 2.26.0.rc2
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster
  2020-03-25  1:42 ` [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
@ 2020-03-25 13:30   ` Brian Foster
  0 siblings, 0 replies; 28+ messages in thread
From: Brian Foster @ 2020-03-25 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There's lots of indent in this code, which makes it a bit hard to
> follow. We are also going to completely rework the inode lookup code
> as part of the inode reclaim rework, so factor out the inode lookup
> code from the inode cluster freeing code.
> 
> Based on prototype code from Christoph Hellwig.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

https://lore.kernel.org/linux-xfs/20191104232000.GR4153244@magnolia/
https://lore.kernel.org/linux-xfs/20191101120521.GE59146@bfoster/

>  fs/xfs/xfs_inode.c | 152 +++++++++++++++++++++++++--------------------
>  1 file changed, 84 insertions(+), 68 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 14b922f2a6db7..5c930863ed5b8 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2503,6 +2503,88 @@ xfs_iunlink_remove(
>  	return error;
>  }
>  
> +/*
> + * Look up the inode number specified and mark it stale if it is found. If it
> + * is dirty, return the inode so it can be attached to the cluster buffer and
> + * processed appropriately when the cluster free transaction completes.
> + */
> +static struct xfs_inode *
> +xfs_ifree_get_one_inode(
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*free_ip,
> +	int			inum)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
> +
> +retry:
> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip)
> +		goto out_rcu_unlock;
> +
> +	/*
> +	 * Because this is an RCU-protected lookup, we could find a recently
> +	 * freed or even reallocated inode during the lookup. We need to check
> +	 * under the i_flags_lock for a valid inode here. Skip it if it is not
> +	 * valid, the wrong inode or stale.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
> +		spin_unlock(&ip->i_flags_lock);
> +		goto out_rcu_unlock;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	/*
> +	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> +	 * other inodes that we did not find in the list attached to the buffer
> +	 * and are not already marked stale. If we can't lock it, back off and
> +	 * retry.
> +	 */
> +	if (ip != free_ip) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> +			rcu_read_unlock();
> +			delay(1);
> +			goto retry;
> +		}
> +
> +		/*
> +		 * Check the inode number again in case we're racing with
> +		 * freeing in xfs_reclaim_inode().  See the comments in that
> +		 * function for more information as to why the initial check is
> +		 * not sufficient.
> +		 */
> +		if (ip->i_ino != inum) {
> +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +			goto out_rcu_unlock;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	xfs_iflock(ip);
> +	xfs_iflags_set(ip, XFS_ISTALE);
> +
> +	/*
> +	 * We don't need to attach clean inodes or those only with unlogged
> +	 * changes (which we throw away, anyway).
> +	 */
> +	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> +		ASSERT(ip != free_ip);
> +		xfs_ifunlock(ip);
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		goto out_no_inode;
> +	}
> +	return ip;
> +
> +out_rcu_unlock:
> +	rcu_read_unlock();
> +out_no_inode:
> +	return NULL;
> +}
> +
>  /*
>   * A big issue when freeing the inode cluster is that we _cannot_ skip any
>   * inodes that are in memory - they all must be marked stale and attached to
> @@ -2603,77 +2685,11 @@ xfs_ifree_cluster(
>  		 * even trying to lock them.
>  		 */
>  		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> -retry:
> -			rcu_read_lock();
> -			ip = radix_tree_lookup(&pag->pag_ici_root,
> -					XFS_INO_TO_AGINO(mp, (inum + i)));
> -
> -			/* Inode not in memory, nothing to do */
> -			if (!ip) {
> -				rcu_read_unlock();
> +			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> +			if (!ip)
>  				continue;
> -			}
> -
> -			/*
> -			 * because this is an RCU protected lookup, we could
> -			 * find a recently freed or even reallocated inode
> -			 * during the lookup. We need to check under the
> -			 * i_flags_lock for a valid inode here. Skip it if it
> -			 * is not valid, the wrong inode or stale.
> -			 */
> -			spin_lock(&ip->i_flags_lock);
> -			if (ip->i_ino != inum + i ||
> -			    __xfs_iflags_test(ip, XFS_ISTALE)) {
> -				spin_unlock(&ip->i_flags_lock);
> -				rcu_read_unlock();
> -				continue;
> -			}
> -			spin_unlock(&ip->i_flags_lock);
> -
> -			/*
> -			 * Don't try to lock/unlock the current inode, but we
> -			 * _cannot_ skip the other inodes that we did not find
> -			 * in the list attached to the buffer and are not
> -			 * already marked stale. If we can't lock it, back off
> -			 * and retry.
> -			 */
> -			if (ip != free_ip) {
> -				if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -					rcu_read_unlock();
> -					delay(1);
> -					goto retry;
> -				}
> -
> -				/*
> -				 * Check the inode number again in case we're
> -				 * racing with freeing in xfs_reclaim_inode().
> -				 * See the comments in that function for more
> -				 * information as to why the initial check is
> -				 * not sufficient.
> -				 */
> -				if (ip->i_ino != inum + i) {
> -					xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -					rcu_read_unlock();
> -					continue;
> -				}
> -			}
> -			rcu_read_unlock();
>  
> -			xfs_iflock(ip);
> -			xfs_iflags_set(ip, XFS_ISTALE);
> -
> -			/*
> -			 * we don't need to attach clean inodes or those only
> -			 * with unlogged changes (which we throw away, anyway).
> -			 */
>  			iip = ip->i_itemp;
> -			if (!iip || xfs_inode_clean(ip)) {
> -				ASSERT(ip != free_ip);
> -				xfs_ifunlock(ip);
> -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -				continue;
> -			}
> -
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> -- 
> 2.26.0.rc2
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability
  2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
  2020-03-25  4:42   ` Allison Collins
  2020-03-25 13:30   ` Brian Foster
@ 2020-03-26  5:05   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2020-03-26  5:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:01PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, so spread them
> around like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks decent,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f880141a22681..9ec3eaf1c618f 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -327,6 +327,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2114,9 +2117,11 @@ xfs_buf_delwri_pushbuf(
>  int __init
>  xfs_buf_init(void)
>  {
> -	xfs_buf_zone = kmem_cache_create("xfs_buf",
> -					 sizeof(struct xfs_buf), 0,
> -					 SLAB_HWCACHE_ALIGN, NULL);
> +	xfs_buf_zone = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
> +					 SLAB_HWCACHE_ALIGN |
> +					 SLAB_RECLAIM_ACCOUNT |
> +					 SLAB_MEM_SPREAD,
> +					 NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.26.0.rc2
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] xfs: factor common AIL item deletion code
  2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
  2020-03-25  4:54   ` Allison Collins
  2020-03-25 13:30   ` Brian Foster
@ 2020-03-26  5:10   ` Darrick J. Wong
  2020-03-27  0:50     ` Dave Chinner
  2 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2020-03-26  5:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

This call site didn't have a wake_up_all and now it does; is that going
to make a difference?  I /think/ the answer is that this function
usually puts things on the AIL so we won't trigger the ail_empty wakeup;
and if the AIL was previously empty and we didn't match any log items
(such that it's still empty) then it's fine to wake up anyone who was
waiting for the ail to clear out?

If so,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode_item.c | 12 +----------
>  fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
>  fs/xfs/xfs_trans_priv.h |  4 +++-
>  3 files changed, 30 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..bd8c368098707 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -742,17 +742,7 @@ xfs_iflush_done(
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -
> -		if (mlip_changed) {
> -			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -				xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -			if (list_empty(&ailp->ail_head))
> -				wake_up_all(&ailp->ail_empty);
> -		}
> -		spin_unlock(&ailp->ail_lock);
> -
> -		if (mlip_changed)
> -			xfs_log_space_wake(ailp->ail_mount);
> +		xfs_ail_update_finish(ailp, mlip_changed);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 2ef0dfbfb303d..26d2e7928121c 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -681,6 +681,27 @@ xfs_ail_push_all_sync(
>  	finish_wait(&ailp->ail_empty, &wait);
>  }
>  
> +void
> +xfs_ail_update_finish(
> +	struct xfs_ail		*ailp,
> +	bool			do_tail_update) __releases(ailp->ail_lock)
> +{
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +
> +	if (!do_tail_update) {
> +		spin_unlock(&ailp->ail_lock);
> +		return;
> +	}
> +
> +	if (!XFS_FORCED_SHUTDOWN(mp))
> +		xlog_assign_tail_lsn_locked(mp);
> +
> +	if (list_empty(&ailp->ail_head))
> +		wake_up_all(&ailp->ail_empty);
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(mp);
> +}
> +
>  /*
>   * xfs_trans_ail_update - bulk AIL insertion operation.
>   *
> @@ -740,15 +761,7 @@ xfs_trans_ail_update_bulk(
>  	if (!list_empty(&tmp))
>  		xfs_ail_splice(ailp, cur, &tmp, lsn);
>  
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
> -	}
> +	xfs_ail_update_finish(ailp, mlip_changed);
>  }
>  
>  bool
> @@ -792,10 +805,10 @@ void
>  xfs_trans_ail_delete(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip,
> -	int			shutdown_type) __releases(ailp->ail_lock)
> +	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			mlip_changed;
> +	bool			need_update;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -808,17 +821,8 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	mlip_changed = xfs_ail_delete_one(ailp, lip);
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(mp))
> -			xlog_assign_tail_lsn_locked(mp);
> -		if (list_empty(&ailp->ail_head))
> -			wake_up_all(&ailp->ail_empty);
> -	}
> -
> -	spin_unlock(&ailp->ail_lock);
> -	if (mlip_changed)
> -		xfs_log_space_wake(ailp->ail_mount);
> +	need_update = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, need_update);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614f..64ffa746730e4 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -92,8 +92,10 @@ xfs_trans_ail_update(
>  }
>  
>  bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
> -		int shutdown_type) __releases(ailp->ail_lock);
> +		int shutdown_type);
>  
>  static inline void
>  xfs_trans_ail_remove(
> -- 
> 2.26.0.rc2
> 
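
(Working through the bulk-update path to convince myself - a sketch of my
reading, not anything from the patch itself:

	/*
	 * xfs_trans_ail_update_bulk():
	 *	- every repositioned item is collected on 'tmp' and
	 *	  spliced back in before xfs_ail_update_finish(), so if
	 *	  anything moved, list_empty(&ailp->ail_head) is false
	 *	  and the new wake_up_all() is a no-op;
	 *	- if nothing moved, mlip_changed stays false and we take
	 *	  the early spin_unlock() return instead.
	 */

so xfs_ail_push_all_sync() waiters should only ever see the wakeup from
the delete paths, where an empty AIL is exactly what they are waiting
for.)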

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 7/8] xfs: tail updates only need to occur when LSN changes
  2020-03-25  1:42 ` [PATCH 7/8] xfs: tail updates only need to occur when LSN changes Dave Chinner
  2020-03-25  5:10   ` Allison Collins
@ 2020-03-26  5:14   ` Darrick J. Wong
  1 sibling, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2020-03-26  5:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:42:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently wake anything waiting on the log tail to move whenever
> the log item at the tail of the log is removed. Historically this
> was fine behaviour because there were very few items at any given
> LSN. But with delayed logging, there may be thousands of items at
> any given LSN, and we can't move the tail until they are all gone.
> 
> Hence if we are removing them in near tail-first order, we might be
> waking up processes waiting on the tail LSN to change (e.g. log
> space waiters) repeatedly without them being able to make progress.
> This also occurs with the new sync push waiters, and can result in
> thousands of spurious wakeups every second when under heavy direct
> reclaim pressure.
> 
> To fix this, check that the tail LSN has actually changed on the
> AIL before triggering wakeups. This will reduce the number of
> spurious wakeups when doing bulk AIL removal and make this code much
> more efficient.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

I don't think this changed since I RVB'd the last submission, right?

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode_item.c | 18 ++++++++++----
>  fs/xfs/xfs_trans_ail.c  | 52 ++++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_trans_priv.h |  4 ++--
>  3 files changed, 51 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index bd8c368098707..a627cb951dc61 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -730,19 +730,27 @@ xfs_iflush_done(
>  	 * holding the lock before removing the inode from the AIL.
>  	 */
>  	if (need_ail) {
> -		bool			mlip_changed = false;
> +		xfs_lsn_t	tail_lsn = 0;
>  
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(blip, &tmp, li_bio_list) {
>  			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> -				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> -			else {
> +			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +				/*
> +				 * xfs_ail_update_finish() only cares about the
> +				 * lsn of the first tail item removed, any
> +				 * others will be at the same or higher lsn so
> +				 * we just ignore them.
> +				 */
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +				if (!tail_lsn && lsn)
> +					tail_lsn = lsn;
> +			} else {
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -		xfs_ail_update_finish(ailp, mlip_changed);
> +		xfs_ail_update_finish(ailp, tail_lsn);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 26d2e7928121c..564253550b754 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -109,17 +109,25 @@ xfs_ail_next(
>   * We need the AIL lock in order to get a coherent read of the lsn of the last
>   * item in the AIL.
>   */
> +static xfs_lsn_t
> +__xfs_ail_min_lsn(
> +	struct xfs_ail		*ailp)
> +{
> +	struct xfs_log_item	*lip = xfs_ail_min(ailp);
> +
> +	if (lip)
> +		return lip->li_lsn;
> +	return 0;
> +}
> +
>  xfs_lsn_t
>  xfs_ail_min_lsn(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_lsn_t		lsn = 0;
> -	struct xfs_log_item	*lip;
> +	xfs_lsn_t		lsn;
>  
>  	spin_lock(&ailp->ail_lock);
> -	lip = xfs_ail_min(ailp);
> -	if (lip)
> -		lsn = lip->li_lsn;
> +	lsn = __xfs_ail_min_lsn(ailp);
>  	spin_unlock(&ailp->ail_lock);
>  
>  	return lsn;
> @@ -684,11 +692,12 @@ xfs_ail_push_all_sync(
>  void
>  xfs_ail_update_finish(
>  	struct xfs_ail		*ailp,
> -	bool			do_tail_update) __releases(ailp->ail_lock)
> +	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
>  
> -	if (!do_tail_update) {
> +	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
> +	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
>  		spin_unlock(&ailp->ail_lock);
>  		return;
>  	}
> @@ -733,7 +742,7 @@ xfs_trans_ail_update_bulk(
>  	xfs_lsn_t		lsn) __releases(ailp->ail_lock)
>  {
>  	struct xfs_log_item	*mlip;
> -	int			mlip_changed = 0;
> +	xfs_lsn_t		tail_lsn = 0;
>  	int			i;
>  	LIST_HEAD(tmp);
>  
> @@ -748,9 +757,10 @@ xfs_trans_ail_update_bulk(
>  				continue;
>  
>  			trace_xfs_ail_move(lip, lip->li_lsn, lsn);
> +			if (mlip == lip && !tail_lsn)
> +				tail_lsn = lip->li_lsn;
> +
>  			xfs_ail_delete(ailp, lip);
> -			if (mlip == lip)
> -				mlip_changed = 1;
>  		} else {
>  			trace_xfs_ail_insert(lip, 0, lsn);
>  		}
> @@ -761,15 +771,23 @@ xfs_trans_ail_update_bulk(
>  	if (!list_empty(&tmp))
>  		xfs_ail_splice(ailp, cur, &tmp, lsn);
>  
> -	xfs_ail_update_finish(ailp, mlip_changed);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>  }
>  
> -bool
> +/*
> + * Delete one log item from the AIL.
> + *
> + * If this item was at the tail of the AIL, return the LSN of the log item so
> + * that we can use it to check if the LSN of the tail of the log has moved
> + * when finishing up the AIL delete process in xfs_ail_update_finish().
> + */
> +xfs_lsn_t
>  xfs_ail_delete_one(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> +	xfs_lsn_t		lsn = lip->li_lsn;
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> @@ -777,7 +795,9 @@ xfs_ail_delete_one(
>  	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>  	lip->li_lsn = 0;
>  
> -	return mlip == lip;
> +	if (mlip == lip)
> +		return lsn;
> +	return 0;
>  }
>  
>  /**
> @@ -808,7 +828,7 @@ xfs_trans_ail_delete(
>  	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			need_update;
> +	xfs_lsn_t		tail_lsn;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -821,8 +841,8 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	need_update = xfs_ail_delete_one(ailp, lip);
> -	xfs_ail_update_finish(ailp, need_update);
> +	tail_lsn = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 64ffa746730e4..35655eac01a65 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -91,8 +91,8 @@ xfs_trans_ail_update(
>  	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
>  }
>  
> -bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> -void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
>  			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
>  		int shutdown_type);
> -- 
> 2.26.0.rc2
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
  2020-03-25  4:42   ` Allison Collins
  2020-03-25  5:07   ` Dave Chinner
@ 2020-03-26  5:24   ` Darrick J. Wong
  2020-03-26 11:33     ` Brian Foster
  2 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2020-03-26  5:24 UTC (permalink / raw)
  To: Dave Chinner, Brian Foster; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 12:41:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In certain situations the background CIL push can be indefinitely
> delayed. While we have workarounds for the obvious cases now, they
> don't address the underlying issue: there is no
> upper limit on the CIL where we will either force or wait for
> a background push to start, hence allowing the CIL to grow without
> bound until it consumes all log space.
> 
> To fix this, add a new wait queue to the CIL which allows background
> pushes to wait for the CIL context to be switched out. This happens
> when the push starts, so it will allow us to block incoming
> transaction commit completion until the push has started. This will
> only affect processes that are running modifications, and only when
> the CIL threshold has been significantly overrun.
> 
> This has no apparent impact on performance, and didn't even trigger
> until over 45 million inodes had been created in a 16-way fsmark
> test on a 2GB log. The throttle kicked in at 64MB of log space used,
> so the active CIL size was only about 3% of the total log in that case.
> The concurrent removal of those files did not trigger the background
> sleep at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

This looks reasonable to me, though considering the big long thread that
erupted a few versions ago, I'm seriously wondering what Brian thinks of
all this.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h    |  1 +
>  3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 27de462d2ba40..ac43301ae2f43 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -668,6 +668,11 @@ xlog_cil_push_work(
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
>  
> +	/*
> +	 * Wake up any background push waiters now this context is being pushed.
> +	 */
> +	wake_up_all(&ctx->push_wait);
> +
>  	/*
>  	 * Check if we've anything to push. If there is nothing, then we don't
>  	 * move on to a new sequence number and so we have to be able to push
> @@ -744,6 +749,7 @@ xlog_cil_push_work(
>  	 */
>  	INIT_LIST_HEAD(&new_ctx->committing);
>  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> +	init_waitqueue_head(&new_ctx->push_wait);
>  	new_ctx->sequence = ctx->sequence + 1;
>  	new_ctx->cil = cil;
>  	cil->xc_ctx = new_ctx;
> @@ -891,7 +897,7 @@ xlog_cil_push_work(
>   */
>  static void
>  xlog_cil_push_background(
> -	struct xlog	*log)
> +	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -905,14 +911,36 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +		up_read(&cil->xc_ctx_lock);
>  		return;
> +	}
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	}
> +
> +	/*
> +	 * Drop the context lock now, we can't hold that if we need to sleep
> +	 * because we are over the blocking threshold. The push_lock is still
> +	 * held, so blocking threshold sleep/wakeup is still correctly
> +	 * serialised here.
> +	 */
> +	up_read(&cil->xc_ctx_lock);
> +
> +	/*
> +	 * If we are well over the space limit, throttle the work that is being
> +	 * done until the push work on this context has begun.
> +	 */
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> +		return;
> +	}
> +
>  	spin_unlock(&cil->xc_push_lock);
>  
>  }
> @@ -1032,9 +1060,9 @@ xfs_log_commit_cil(
>  		if (lip->li_ops->iop_committing)
>  			lip->li_ops->iop_committing(lip, xc_commit_lsn);
>  	}
> -	xlog_cil_push_background(log);
>  
> -	up_read(&cil->xc_ctx_lock);
> +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> +	xlog_cil_push_background(log);
>  }
>  
>  /*
> @@ -1193,6 +1221,7 @@ xlog_cil_init(
>  
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	init_waitqueue_head(&ctx->push_wait);
>  	ctx->sequence = 1;
>  	ctx->cil = cil;
>  	cil->xc_ctx = ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 8c4be91f62d0d..dacab1817a1b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -240,6 +240,7 @@ struct xfs_cil_ctx {
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
> +	wait_queue_head_t	push_wait;	/* background push throttle */
>  	struct work_struct	discard_endio_work;
>  };
>  
> @@ -337,10 +338,33 @@ struct xfs_cil {
>   *   buffer window (32MB) as measurements have shown this to be roughly the
>   *   point of diminishing performance increases under highly concurrent
>   *   modification workloads.
> + *
> + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> + * new threshold at which we block committing transactions until the background
> + * CIL commit commences and switches to a new context. While this is not a hard
> + * limit, it forces the process committing a transaction to the CIL to block and
> + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> + * work. This prevents a process running lots of transactions from overfilling
> + * the CIL because it is not yielding the CPU. We set the blocking limit at
> + * twice the background push space threshold so we keep in line with the AIL
> + * push thresholds.
> + *
> + * Note: this is not a -hard- limit as blocking is applied after the transaction
> + * is inserted into the CIL and the push has been triggered. It is largely a
> + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> + * limit will be difficult to implement without introducing global serialisation
> + * in the CIL commit fast path, and it's not at all clear that we actually need
> + * such hard limits given the ~7 years we've run without a hard limit before
> + * finding the first situation where a checkpoint size overflow actually
> + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> + * we've overrun the max size.
>   */
>  #define XLOG_CIL_SPACE_LIMIT(log)	\
>  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>  
> +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> +
>  /*
>   * ticket grant locks, queues and accounting have their own cachelines
>   * as these are quite hot and can be operated on concurrently.
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fbfdd9cf160df..575ca74532f79 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1015,6 +1015,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_regrant_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_exit);
> +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
>  
>  DECLARE_EVENT_CLASS(xfs_log_item_class,
>  	TP_PROTO(struct xfs_log_item *lip),
> -- 
> 2.26.0.rc2
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-26  5:24   ` Darrick J. Wong
@ 2020-03-26 11:33     ` Brian Foster
  2020-03-27  0:40       ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Brian Foster @ 2020-03-26 11:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Mar 25, 2020 at 10:24:35PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 25, 2020 at 12:41:59PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > In certain situations the background CIL push can be indefinitely
> > delayed. While we have workarounds for the obvious cases now, they
> > don't solve the underlying issue, which is that there is no
> > upper limit on the CIL where we will either force or wait for
> > a background push to start, hence allowing the CIL to grow without
> > bound until it consumes all log space.
> > 
> > To fix this, add a new wait queue to the CIL which allows background
> > pushes to wait for the CIL context to be switched out. This happens
> > when the push starts, so it will allow us to block incoming
> > transaction commit completion until the push has started. This will
> > only affect processes that are running modifications, and only when
> > the CIL threshold has been significantly overrun.
> > 
> > This has no apparent impact on performance, and doesn't even trigger
> > until over 45 million inodes have been created in a 16-way fsmark
> > test on a 2GB log. That was limiting at 64MB of log space used, so
> > the active CIL size is only about 3% of the total log in that case.
> > The concurrent removal of those files did not trigger the background
> > sleep at all.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> This looks reasonable to me, though considering the big long thread that
> erupted a few versions ago I'm seriously wondering what Brian thinks of
> all this?
> 

Hmmmm... this was my reply to the last post of this one:

https://lore.kernel.org/linux-xfs/20191101120426.GC59146@bfoster/

... so I suspect that would still be my feedback if this patch wasn't
fixed up..? ;)

Brian

> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > ---
> >  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h    |  1 +
> >  3 files changed, 58 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 27de462d2ba40..ac43301ae2f43 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -668,6 +668,11 @@ xlog_cil_push_work(
> >  	push_seq = cil->xc_push_seq;
> >  	ASSERT(push_seq <= ctx->sequence);
> >  
> > +	/*
> > +	 * Wake up any background push waiters now this context is being pushed.
> > +	 */
> > +	wake_up_all(&ctx->push_wait);
> > +
> >  	/*
> >  	 * Check if we've anything to push. If there is nothing, then we don't
> >  	 * move on to a new sequence number and so we have to be able to push
> > @@ -744,6 +749,7 @@ xlog_cil_push_work(
> >  	 */
> >  	INIT_LIST_HEAD(&new_ctx->committing);
> >  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> > +	init_waitqueue_head(&new_ctx->push_wait);
> >  	new_ctx->sequence = ctx->sequence + 1;
> >  	new_ctx->cil = cil;
> >  	cil->xc_ctx = new_ctx;
> > @@ -891,7 +897,7 @@ xlog_cil_push_work(
> >   */
> >  static void
> >  xlog_cil_push_background(
> > -	struct xlog	*log)
> > +	struct xlog	*log) __releases(cil->xc_ctx_lock)
> >  {
> >  	struct xfs_cil	*cil = log->l_cilp;
> >  
> > @@ -905,14 +911,36 @@ xlog_cil_push_background(
> >  	 * don't do a background push if we haven't used up all the
> >  	 * space available yet.
> >  	 */
> > -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> > +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> > +		up_read(&cil->xc_ctx_lock);
> >  		return;
> > +	}
> >  
> >  	spin_lock(&cil->xc_push_lock);
> >  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> >  		cil->xc_push_seq = cil->xc_current_sequence;
> >  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> >  	}
> > +
> > +	/*
> > +	 * Drop the context lock now, we can't hold that if we need to sleep
> > +	 * because we are over the blocking threshold. The push_lock is still
> > +	 * held, so blocking threshold sleep/wakeup is still correctly
> > +	 * serialised here.
> > +	 */
> > +	up_read(&cil->xc_ctx_lock);
> > +
> > +	/*
> > +	 * If we are well over the space limit, throttle the work that is being
> > +	 * done until the push work on this context has begun.
> > +	 */
> > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> > +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> > +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> > +		return;
> > +	}
> > +
> >  	spin_unlock(&cil->xc_push_lock);
> >  
> >  }
> > @@ -1032,9 +1060,9 @@ xfs_log_commit_cil(
> >  		if (lip->li_ops->iop_committing)
> >  			lip->li_ops->iop_committing(lip, xc_commit_lsn);
> >  	}
> > -	xlog_cil_push_background(log);
> >  
> > -	up_read(&cil->xc_ctx_lock);
> > +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> > +	xlog_cil_push_background(log);
> >  }
> >  
> >  /*
> > @@ -1193,6 +1221,7 @@ xlog_cil_init(
> >  
> >  	INIT_LIST_HEAD(&ctx->committing);
> >  	INIT_LIST_HEAD(&ctx->busy_extents);
> > +	init_waitqueue_head(&ctx->push_wait);
> >  	ctx->sequence = 1;
> >  	ctx->cil = cil;
> >  	cil->xc_ctx = ctx;
> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index 8c4be91f62d0d..dacab1817a1b0 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -240,6 +240,7 @@ struct xfs_cil_ctx {
> >  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
> >  	struct list_head	iclog_entry;
> >  	struct list_head	committing;	/* ctx committing list */
> > +	wait_queue_head_t	push_wait;	/* background push throttle */
> >  	struct work_struct	discard_endio_work;
> >  };
> >  
> > @@ -337,10 +338,33 @@ struct xfs_cil {
> >   *   buffer window (32MB) as measurements have shown this to be roughly the
> >   *   point of diminishing performance increases under highly concurrent
> >   *   modification workloads.
> > + *
> > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> > + * new threshold at which we block committing transactions until the background
> > + * CIL commit commences and switches to a new context. While this is not a hard
> > + * limit, it forces the process committing a transaction to the CIL to block and
> > + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> > + * work. This prevents a process running lots of transactions from overfilling
> > + * the CIL because it is not yielding the CPU. We set the blocking limit at
> > + * twice the background push space threshold so we keep in line with the AIL
> > + * push thresholds.
> > + *
> > + * Note: this is not a -hard- limit as blocking is applied after the transaction
> > + * is inserted into the CIL and the push has been triggered. It is largely a
> > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> > + * limit will be difficult to implement without introducing global serialisation
> > + * in the CIL commit fast path, and it's not at all clear that we actually need
> > + * such hard limits given the ~7 years we've run without a hard limit before
> > + * finding the first situation where a checkpoint size overflow actually
> > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> > + * we've overrun the max size.
> >   */
> >  #define XLOG_CIL_SPACE_LIMIT(log)	\
> >  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
> >  
> > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> > +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> > +
> >  /*
> >   * ticket grant locks, queues and accounting have their own cachelines
> >   * as these are quite hot and can be operated on concurrently.
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index fbfdd9cf160df..575ca74532f79 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1015,6 +1015,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_regrant_sub);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_sub);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ticket_done_exit);
> > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
> >  
> >  DECLARE_EVENT_CLASS(xfs_log_item_class,
> >  	TP_PROTO(struct xfs_log_item *lip),
> > -- 
> > 2.26.0.rc2
> > 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/8] xfs: Throttle commits on delayed background CIL push
  2020-03-26 11:33     ` Brian Foster
@ 2020-03-27  0:40       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-27  0:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Mar 26, 2020 at 07:33:58AM -0400, Brian Foster wrote:
> On Wed, Mar 25, 2020 at 10:24:35PM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 25, 2020 at 12:41:59PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > In certain situations the background CIL push can be indefinitely
> > > delayed. While we have workarounds for the obvious cases now, they
> > > don't solve the underlying issue, which is that there is no
> > > upper limit on the CIL where we will either force or wait for
> > > a background push to start, hence allowing the CIL to grow without
> > > bound until it consumes all log space.
> > > 
> > > To fix this, add a new wait queue to the CIL which allows background
> > > pushes to wait for the CIL context to be switched out. This happens
> > > when the push starts, so it will allow us to block incoming
> > > transaction commit completion until the push has started. This will
> > > only affect processes that are running modifications, and only when
> > > the CIL threshold has been significantly overrun.
> > > 
> > > This has no apparent impact on performance, and doesn't even trigger
> > > until over 45 million inodes have been created in a 16-way fsmark
> > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > the active CIL size is only about 3% of the total log in that case.
> > > The concurrent removal of those files did not trigger the background
> > > sleep at all.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > 
> > This looks reasonable to me, though considering the big long thread that
> > erupted a few versions ago I'm seriously wondering what Brian thinks of
> > all this?
> > 
> 
> Hmmmm... this was my reply to the last post of this one:
> 
> https://lore.kernel.org/linux-xfs/20191101120426.GC59146@bfoster/
> 
> ... so I suspect that would still be my feedback if this patch wasn't
> fixed up..? ;)

Which actually took me some time to find because that email just
points at some other email.... :/

I think, in the end, that entire discussion came down to two things
once the behaviour of the throttle had been iterated on and agreed
upon as a decent compromise. I'll summarise them inline below.
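
(For the record on the numbers: with the 2GB log in the fsmark test,
XLOG_CIL_SPACE_LIMIT() clamps to the ~32MB buffer window rather than
log-size/8 (256MB), so XLOG_CIL_BLOCKING_SPACE_LIMIT() at twice that
works out at ~64MB, which lines up with the 64MB of log space used
quoted in the commit message.)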


> > > @@ -905,14 +911,36 @@ xlog_cil_push_background(
> > >  	 * don't do a background push if we haven't used up all the
> > >  	 * space available yet.
> > >  	 */
> > > -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> > > +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> > > +		up_read(&cil->xc_ctx_lock);
> > >  		return;
> > > +	}
> > >  
> > >  	spin_lock(&cil->xc_push_lock);
> > >  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> > >  		cil->xc_push_seq = cil->xc_current_sequence;
> > >  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> > >  	}
> > > +
> > > +	/*
> > > +	 * Drop the context lock now, we can't hold that if we need to sleep
> > > +	 * because we are over the blocking threshold. The push_lock is still
> > > +	 * held, so blocking threshold sleep/wakeup is still correctly
> > > +	 * serialised here.
> > > +	 */
> > > +	up_read(&cil->xc_ctx_lock);
> > > +
> > > +	/*
> > > +	 * If we are well over the space limit, throttle the work that is being
> > > +	 * done until the push work on this context has begun.
> > > +	 */
> > > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > > +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> > > +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> > > +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> > > +		return;
> > > +	}

Brian asked for this to be changed to a "<" check, with the code in
the branch and the function tail switched around. The code as it
stands is not wrong or inconsistent with other code, and the change
wouldn't result in any reduction of code, just a different layout.

OTOH, personal taste on my side is to prefer to keep lock/unlock
pairs at the same indent level so at a glance it is obvious that
they are paired and balanced.

IOWs, this is just a question of personal taste...
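
To make the comparison concrete, the two layouts look roughly like
this, sketched from the hunk quoted above (fragments only, not the
final code):

	/* As merged: the over-limit branch returns early, the tail unlocks. */
	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
		/* xlog_wait() drops cil->xc_push_lock for us. */
		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
		return;
	}
	spin_unlock(&cil->xc_push_lock);

	/* Brian's suggestion: inverted "<" check, wait code as the tail. */
	if (cil->xc_ctx->space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		spin_unlock(&cil->xc_push_lock);
		return;
	}
	trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
	ASSERT(cil->xc_ctx->space_used < log->l_logsize);
	xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);

Either way we reach xlog_wait() with the push_lock held and xlog_wait()
drops it; the only difference is which branch carries the unlock/return
pair, and hence whether the lock/unlock calls sit at the same indent
level.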

> > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > > index 8c4be91f62d0d..dacab1817a1b0 100644
> > > --- a/fs/xfs/xfs_log_priv.h
> > > +++ b/fs/xfs/xfs_log_priv.h
> > > @@ -240,6 +240,7 @@ struct xfs_cil_ctx {
> > >  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
> > >  	struct list_head	iclog_entry;
> > >  	struct list_head	committing;	/* ctx committing list */
> > > +	wait_queue_head_t	push_wait;	/* background push throttle */
> > >  	struct work_struct	discard_endio_work;
> > >  };
> > >  
> > > @@ -337,10 +338,33 @@ struct xfs_cil {
> > >   *   buffer window (32MB) as measurements have shown this to be roughly the
> > >   *   point of diminishing performance increases under highly concurrent
> > >   *   modification workloads.
> > > + *
> > > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> > > + * new threshold at which we block committing transactions until the background
> > > + * CIL commit commences and switches to a new context. While this is not a hard
> > > + * limit, it forces the process committing a transaction to the CIL to block and
> > > + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> > > + * work. This prevents a process running lots of transactions from overfilling
> > > + * the CIL because it is not yielding the CPU. We set the blocking limit at
> > > + * twice the background push space threshold so we keep in line with the AIL
> > > + * push thresholds.
> > > + *
> > > + * Note: this is not a -hard- limit as blocking is applied after the transaction
> > > + * is inserted into the CIL and the push has been triggered. It is largely a
> > > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> > > + * limit will be difficult to implement without introducing global serialisation
> > > + * in the CIL commit fast path, and it's not at all clear that we actually need
> > > + * such hard limits given the ~7 years we've run without a hard limit before
> > > + * finding the first situation where a checkpoint size overflow actually
> > > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> > > + * we've overrun the max size.
> > >   */

Brian also asked for this note to be removed and put into the commit
message, which I had pulled out of the commit message and put into
the code because someone else requested that....

https://lore.kernel.org/linux-xfs/20190916163325.GZ2229799@magnolia/

So, when personal tastes collide, or review comments conflict, the
submitter has to choose between them.

I chose not to modify the patch, and that's the long and short of
it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] xfs: factor common AIL item deletion code
  2020-03-26  5:10   ` Darrick J. Wong
@ 2020-03-27  0:50     ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-27  0:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 25, 2020 at 10:10:01PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 25, 2020 at 12:42:03PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Factor the common AIL deletion code that does all the wakeups into a
> > helper so we only have one copy of this somewhat tricky code to
> > interface with all the wakeups necessary when the LSN of the log
> > tail changes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> This call site didn't have a wake_up_all and now it does; is that going
> to make a difference?

That logic only changed for xfs_trans_ail_update_bulk()...

> I /think/ the answer is that this function
> usually puts things on the AIL so we won't trigger the ail_empty wakeup;

which only inserts into the AIL, so will never trigger the "wake up
if AIL empty" code that is now there because the AIL will never be
empty...

> and if the AIL was previously empty and we didn't match any log items
> (such that it's still empty) then it's fine to wake up anyone who was
> waiting for the ail to clear out?

If we are calling xfs_trans_ail_update_bulk() with zero log items on
an empty AIL, we are probably doing something else wrong. But doing
a wakeup on anything waiting on an empty log will not hurt anything
in this case, because the log was already empty and there should be
nothing waiting for the log to empty...

So, yes, it is safe and won't cause strange issues....
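
To put that in concrete terms, the factored helper ends up with
roughly this shape (a sketch, so treat the names and details as
approximate rather than the final code):

	static void
	xfs_ail_update_finish(
		struct xfs_ail		*ailp,
		bool			do_tail_update) __releases(ailp->ail_lock)
	{
		struct xfs_mount	*mp = ailp->ail_mount;

		if (!do_tail_update) {
			spin_unlock(&ailp->ail_lock);
			return;
		}

		if (!XFS_FORCED_SHUTDOWN(mp))
			xlog_assign_tail_lsn_locked(mp);

		/* The wakeup in question: fires whenever the AIL ends up empty. */
		if (list_empty(&ailp->ail_head))
			wake_up_all(&ailp->ail_empty);

		spin_unlock(&ailp->ail_lock);
		xfs_log_space_wake(mp);
	}

Both the AIL update and delete paths funnel through this, which is why
the bulk update call site picked up the ail_empty wakeup you noticed.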

> If so,
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-03-27  0:50 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2020-03-25  1:41 [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
2020-03-25  1:41 ` [PATCH 1/8] xfs: Lower CIL flush limit for large logs Dave Chinner
2020-03-25  4:42   ` Allison Collins
2020-03-25  1:41 ` [PATCH 2/8] xfs: Throttle commits on delayed background CIL push Dave Chinner
2020-03-25  4:42   ` Allison Collins
2020-03-25  5:07   ` Dave Chinner
2020-03-26  5:24   ` Darrick J. Wong
2020-03-26 11:33     ` Brian Foster
2020-03-27  0:40       ` Dave Chinner
2020-03-25  1:42 ` [PATCH 3/8] xfs: don't allow log IO to be throttled Dave Chinner
2020-03-25  4:42   ` Allison Collins
2020-03-25  1:42 ` [PATCH 4/8] xfs: Improve metadata buffer reclaim accountability Dave Chinner
2020-03-25  4:42   ` Allison Collins
2020-03-25 13:30   ` Brian Foster
2020-03-26  5:05   ` Darrick J. Wong
2020-03-25  1:42 ` [PATCH 5/8] xfs: correctly acount for reclaimable slabs Dave Chinner
2020-03-25  4:43   ` Allison Collins
2020-03-25  1:42 ` [PATCH 6/8] xfs: factor common AIL item deletion code Dave Chinner
2020-03-25  4:54   ` Allison Collins
2020-03-25 13:30   ` Brian Foster
2020-03-26  5:10   ` Darrick J. Wong
2020-03-27  0:50     ` Dave Chinner
2020-03-25  1:42 ` [PATCH 7/8] xfs: tail updates only need to occur when LSN changes Dave Chinner
2020-03-25  5:10   ` Allison Collins
2020-03-26  5:14   ` Darrick J. Wong
2020-03-25  1:42 ` [PATCH 8/8] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
2020-03-25 13:30   ` Brian Foster
2020-03-25  1:51 ` [PATCH 0/8] xfs: various fixes and cleanups Dave Chinner
