* [PATCH v2 0/3] xfs: merged io completions
@ 2019-04-15  2:07 Darrick J. Wong
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15  2:07 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Here's a quick patchset reworking writeback ioend completion processing
into per-inode completion queues so that we don't have a thundering herd
of unwritten/cow completion kworkers contending for the ILOCK, tying up
log reservation space, and causing deadlocks.  The third patch combines
adjacent ioends when possible to reduce the overhead further.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=merged-completions


* [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15  2:07 [PATCH v2 0/3] xfs: merged io completions Darrick J. Wong
@ 2019-04-15  2:07 ` Darrick J. Wong
  2019-04-15 14:49   ` Brian Foster
                     ` (2 more replies)
  2019-04-15  2:07 ` [PATCH 2/3] xfs: remove unused m_data_workqueue Darrick J. Wong
  2019-04-15  2:08 ` [PATCH 3/3] xfs: merge adjacent io completions of the same type Darrick J. Wong
  2 siblings, 3 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15  2:07 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback has a transaction
reserved and wants the ILOCK, and someone else has the ILOCK and wants a
transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_aops.h   |    1 -
 fs/xfs/xfs_icache.c |    3 +++
 fs/xfs/xfs_inode.h  |    7 +++++++
 4 files changed, 47 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3619e9e8d359..f7a9bb661826 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -234,11 +234,9 @@ xfs_setfilesize_ioend(
  * IO write completion.
  */
 STATIC void
-xfs_end_io(
-	struct work_struct *work)
+xfs_end_ioend(
+	struct xfs_ioend	*ioend)
 {
-	struct xfs_ioend	*ioend =
-		container_of(work, struct xfs_ioend, io_work);
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	xfs_off_t		offset = ioend->io_offset;
 	size_t			size = ioend->io_size;
@@ -278,19 +276,48 @@ xfs_end_io(
 	xfs_destroy_ioend(ioend, error);
 }
 
+/* Finish all pending io completions. */
+void
+xfs_end_io(
+	struct work_struct	*work)
+{
+	struct xfs_inode	*ip;
+	struct xfs_ioend	*ioend;
+	struct list_head	completion_list;
+	unsigned long		flags;
+
+	ip = container_of(work, struct xfs_inode, i_iodone_work);
+
+	spin_lock_irqsave(&ip->i_iodone_lock, flags);
+	list_replace_init(&ip->i_iodone_list, &completion_list);
+	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
+
+	while (!list_empty(&completion_list)) {
+		ioend = list_first_entry(&completion_list, struct xfs_ioend,
+				io_list);
+		list_del_init(&ioend->io_list);
+		xfs_end_ioend(ioend);
+	}
+}
+
 STATIC void
 xfs_end_bio(
 	struct bio		*bio)
 {
 	struct xfs_ioend	*ioend = bio->bi_private;
-	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned long		flags;
 
 	if (ioend->io_fork == XFS_COW_FORK ||
-	    ioend->io_state == XFS_EXT_UNWRITTEN)
-		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
-	else if (ioend->io_append_trans)
-		queue_work(mp->m_data_workqueue, &ioend->io_work);
-	else
+	    ioend->io_state == XFS_EXT_UNWRITTEN ||
+	    ioend->io_append_trans != NULL) {
+		spin_lock_irqsave(&ip->i_iodone_lock, flags);
+		if (list_empty(&ip->i_iodone_list))
+			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
+		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
+		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
+	} else
 		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
 }
 
@@ -594,7 +621,6 @@ xfs_alloc_ioend(
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
 	ioend->io_offset = offset;
-	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
 	return ioend;
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 6c2615b83c5d..f62b03186c62 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -18,7 +18,6 @@ struct xfs_ioend {
 	struct inode		*io_inode;	/* file being written to */
 	size_t			io_size;	/* size of the extent */
 	xfs_off_t		io_offset;	/* offset in the file */
-	struct work_struct	io_work;	/* xfsdatad work queue */
 	struct xfs_trans	*io_append_trans;/* xact. for size update */
 	struct bio		*io_bio;	/* bio being built */
 	struct bio		io_inline_bio;	/* MUST BE LAST! */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..e70e7db29026 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -70,6 +70,9 @@ xfs_inode_alloc(
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
 	memset(&ip->i_d, 0, sizeof(ip->i_d));
+	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
+	INIT_LIST_HEAD(&ip->i_iodone_list);
+	spin_lock_init(&ip->i_iodone_lock);
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e62074a5257c..88239c2dd824 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -57,6 +57,11 @@ typedef struct xfs_inode {
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
+
+	/* pending io completions */
+	spinlock_t		i_iodone_lock;
+	struct work_struct	i_iodone_work;
+	struct list_head	i_iodone_list;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
@@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 
+void xfs_end_io(struct work_struct *work);
+
 #endif	/* __XFS_INODE_H__ */


* [PATCH 2/3] xfs: remove unused m_data_workqueue
  2019-04-15  2:07 [PATCH v2 0/3] xfs: merged io completions Darrick J. Wong
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
@ 2019-04-15  2:07 ` Darrick J. Wong
  2019-04-15 14:49   ` Brian Foster
  2019-04-15  2:08 ` [PATCH 3/3] xfs: merge adjacent io completions of the same type Darrick J. Wong
  2 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15  2:07 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we're no longer using m_data_workqueue, remove it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_mount.h |    1 -
 fs/xfs/xfs_super.c |   10 +---------
 2 files changed, 1 insertion(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 110f927cf943..1d3a4b2bdc2f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -175,7 +175,6 @@ typedef struct xfs_mount {
 	struct xstats		m_stats;	/* per-fs stats */
 
 	struct workqueue_struct *m_buf_workqueue;
-	struct workqueue_struct	*m_data_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f093ea244849..df917f41ca46 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -838,15 +838,10 @@ xfs_init_mount_workqueues(
 	if (!mp->m_buf_workqueue)
 		goto out;
 
-	mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
-			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
-	if (!mp->m_data_workqueue)
-		goto out_destroy_buf;
-
 	mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
 			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
 	if (!mp->m_unwritten_workqueue)
-		goto out_destroy_data_iodone_queue;
+		goto out_destroy_buf;
 
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
@@ -886,8 +881,6 @@ xfs_init_mount_workqueues(
 	destroy_workqueue(mp->m_cil_workqueue);
 out_destroy_unwritten:
 	destroy_workqueue(mp->m_unwritten_workqueue);
-out_destroy_data_iodone_queue:
-	destroy_workqueue(mp->m_data_workqueue);
 out_destroy_buf:
 	destroy_workqueue(mp->m_buf_workqueue);
 out:
@@ -903,7 +896,6 @@ xfs_destroy_mount_workqueues(
 	destroy_workqueue(mp->m_log_workqueue);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
-	destroy_workqueue(mp->m_data_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
 	destroy_workqueue(mp->m_buf_workqueue);
 }


* [PATCH 3/3] xfs: merge adjacent io completions of the same type
  2019-04-15  2:07 [PATCH v2 0/3] xfs: merged io completions Darrick J. Wong
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
  2019-04-15  2:07 ` [PATCH 2/3] xfs: remove unused m_data_workqueue Darrick J. Wong
@ 2019-04-15  2:08 ` Darrick J. Wong
  2019-04-15 14:49   ` Brian Foster
  2 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

It's possible for pagecache writeback to split up a large amount of work
into smaller pieces for throttling purposes or to reduce the amount of
time a writeback operation is pending.  Whatever the reason, XFS can end
up with a bunch of IO completions that call for the same operation to be
performed on a contiguous extent mapping.  Since mappings are extent
based in XFS, we'd prefer to run fewer transactions when we can.

When we're processing an ioend on the list of io completions, check to
see if the next items on the list are both adjacent and of the same
type.  If so, we can merge the completions to reduce transaction
overhead.

On fast storage this doesn't seem to make much of a difference in
performance, though the number of transactions for an overnight xfstests
run seems to drop by ~5%.
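
A worked example of the merge condition may help; the offsets below are
hypothetical and only meant to illustrate xfs_ioend_can_merge():

	/*
	 * Two unwritten ioends covering [0, 1M) and [1M, 2M) of the same
	 * file, with the same fork and the same bio status, satisfy
	 * xfs_ioend_can_merge().  The second ioend is moved onto the first
	 * one's io_list, its io_size is added to the first's (io_size
	 * grows to 2M), and a single unwritten extent conversion
	 * transaction then covers the whole range instead of two.
	 */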

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c |   86 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f7a9bb661826..53afa2e6e3e7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -237,6 +237,7 @@ STATIC void
 xfs_end_ioend(
 	struct xfs_ioend	*ioend)
 {
+	struct list_head	ioend_list;
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	xfs_off_t		offset = ioend->io_offset;
 	size_t			size = ioend->io_size;
@@ -273,7 +274,89 @@ xfs_end_ioend(
 done:
 	if (ioend->io_append_trans)
 		error = xfs_setfilesize_ioend(ioend, error);
+	list_replace_init(&ioend->io_list, &ioend_list);
 	xfs_destroy_ioend(ioend, error);
+
+	while (!list_empty(&ioend_list)) {
+		ioend = list_first_entry(&ioend_list, struct xfs_ioend,
+				io_list);
+		list_del_init(&ioend->io_list);
+		xfs_destroy_ioend(ioend, error);
+	}
+}
+
+/*
+ * We can merge two adjacent ioends if they have the same set of work to do.
+ */
+static bool
+xfs_ioend_can_merge(
+	struct xfs_ioend	*ioend,
+	int			ioend_error,
+	struct xfs_ioend	*next)
+{
+	int			next_error;
+
+	next_error = blk_status_to_errno(next->io_bio->bi_status);
+	if (ioend_error != next_error)
+		return false;
+	if ((ioend->io_fork == XFS_COW_FORK) ^ (next->io_fork == XFS_COW_FORK))
+		return false;
+	if ((ioend->io_state == XFS_EXT_UNWRITTEN) ^
+	    (next->io_state == XFS_EXT_UNWRITTEN))
+		return false;
+	if (ioend->io_offset + ioend->io_size != next->io_offset)
+		return false;
+	if (xfs_ioend_is_append(ioend) != xfs_ioend_is_append(next))
+		return false;
+	return true;
+}
+
+/* Try to merge adjacent completions. */
+STATIC void
+xfs_ioend_try_merge(
+	struct xfs_ioend	*ioend,
+	struct list_head	*more_ioends)
+{
+	struct xfs_ioend	*next_ioend;
+	int			ioend_error;
+	int			error;
+
+	if (list_empty(more_ioends))
+		return;
+
+	ioend_error = blk_status_to_errno(ioend->io_bio->bi_status);
+
+	while (!list_empty(more_ioends)) {
+		next_ioend = list_first_entry(more_ioends, struct xfs_ioend,
+				io_list);
+		if (!xfs_ioend_can_merge(ioend, ioend_error, next_ioend))
+			break;
+		list_move_tail(&next_ioend->io_list, &ioend->io_list);
+		ioend->io_size += next_ioend->io_size;
+		if (ioend->io_append_trans) {
+			error = xfs_setfilesize_ioend(next_ioend, 1);
+			ASSERT(error == 1);
+		}
+	}
+}
+
+/* list_sort compare function for ioends */
+static int
+xfs_ioend_compare(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_ioend	*ia;
+	struct xfs_ioend	*ib;
+
+	ia = container_of(a, struct xfs_ioend, io_list);
+	ib = container_of(b, struct xfs_ioend, io_list);
+	if (ia->io_offset < ib->io_offset)
+		return -1;
+	else if (ia->io_offset > ib->io_offset)
+		return 1;
+	return 0;
 }
 
 /* Finish all pending io completions. */
@@ -292,10 +375,13 @@ xfs_end_io(
 	list_replace_init(&ip->i_iodone_list, &completion_list);
 	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
 
+	list_sort(NULL, &completion_list, xfs_ioend_compare);
+
 	while (!list_empty(&completion_list)) {
 		ioend = list_first_entry(&completion_list, struct xfs_ioend,
 				io_list);
 		list_del_init(&ioend->io_list);
+		xfs_ioend_try_merge(ioend, &completion_list);
 		xfs_end_ioend(ioend);
 	}
 }


* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
@ 2019-04-15 14:49   ` Brian Foster
  2019-04-15 15:53     ` Darrick J. Wong
  2019-04-15 16:13   ` [PATCH v2 " Darrick J. Wong
  2019-04-23  6:34   ` [PATCH " Christoph Hellwig
  2 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-04-15 14:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When scheduling writeback of dirty file data in the page cache, XFS uses
> IO completion workqueue items to ensure that filesystem metadata only
> updates after the write completes successfully.  This is essential for
> converting unwritten extents to real extents at the right time and
> performing COW remappings.
> 
> Unfortunately, XFS queues each IO completion work item to an unbounded
> workqueue, which means that the kernel can spawn dozens of threads to
> try to handle the items quickly.  These threads need to take the ILOCK
> to update file metadata, which results in heavy ILOCK contention if a
> large number of the work items target a single file, which is
> inefficient.
> 
> Worse yet, the writeback completion threads get stuck waiting for the
> ILOCK while holding transaction reservations, which can use up all
> available log reservation space.  When that happens, metadata updates to
> other parts of the filesystem grind to a halt, even if the filesystem
> could otherwise have handled it.
> 
> Even worse, if one of the things grinding to a halt happens to be a
> thread in the middle of a defer-ops finish holding the same ILOCK and
> trying to obtain more log reservation having exhausted the permanent
> reservation, we now have an ABBA deadlock - writeback has a transaction
> reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> transaction reservation.
> 
> Therefore, we create a per-inode writeback io completion queue + work
> item.  When writeback finishes, it can add the ioend to the per-inode
> queue and let the single worker item process that queue.  This
> dramatically cuts down on the number of kworkers and ILOCK contention in
> the system, and seems to have eliminated an occasional deadlock I was
> seeing while running generic/476.
> 
> Testing with a program that simulates a heavy random-write workload to a
> single file demonstrates that the number of kworkers drops from
> approximately 120 threads per file to 1, without dramatically changing
> write bandwidth or pagecache access latency.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Thanks for the updated commit log. It looks like a couple nits from the
rfc weren't addressed though. Specifically, to rename the workqueue and
iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
buf (iodone) completion callbacks.

Also, this patch essentially turns the unbound completion work queue
into something that should only expect one task running at a time,
right? If so, it might make sense to fix up the max_active parameter to
the alloc_workqueue() call and ...

>  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_aops.h   |    1 -
>  fs/xfs/xfs_icache.c |    3 +++
>  fs/xfs/xfs_inode.h  |    7 +++++++
>  4 files changed, 47 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3619e9e8d359..f7a9bb661826 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -278,19 +276,48 @@ xfs_end_io(
...
>  STATIC void
>  xfs_end_bio(
>  	struct bio		*bio)
>  {
>  	struct xfs_ioend	*ioend = bio->bi_private;
> -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	unsigned long		flags;
>  
>  	if (ioend->io_fork == XFS_COW_FORK ||
> -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> -	else if (ioend->io_append_trans)
> -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> -	else
> +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> +	    ioend->io_append_trans != NULL) {
> +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> +		if (list_empty(&ip->i_iodone_list))
> +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);

... assert that queue_work() always returns true.

Brian

> +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> +	} else
>  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
>  }
>  
> @@ -594,7 +621,6 @@ xfs_alloc_ioend(
>  	ioend->io_inode = inode;
>  	ioend->io_size = 0;
>  	ioend->io_offset = offset;
> -	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
>  	return ioend;
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 6c2615b83c5d..f62b03186c62 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -18,7 +18,6 @@ struct xfs_ioend {
>  	struct inode		*io_inode;	/* file being written to */
>  	size_t			io_size;	/* size of the extent */
>  	xfs_off_t		io_offset;	/* offset in the file */
> -	struct work_struct	io_work;	/* xfsdatad work queue */
>  	struct xfs_trans	*io_append_trans;/* xact. for size update */
>  	struct bio		*io_bio;	/* bio being built */
>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 245483cc282b..e70e7db29026 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -70,6 +70,9 @@ xfs_inode_alloc(
>  	ip->i_flags = 0;
>  	ip->i_delayed_blks = 0;
>  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> +	INIT_LIST_HEAD(&ip->i_iodone_list);
> +	spin_lock_init(&ip->i_iodone_lock);
>  
>  	return ip;
>  }
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e62074a5257c..88239c2dd824 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -57,6 +57,11 @@ typedef struct xfs_inode {
>  
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
> +
> +	/* pending io completions */
> +	spinlock_t		i_iodone_lock;
> +	struct work_struct	i_iodone_work;
> +	struct list_head	i_iodone_list;
>  } xfs_inode_t;
>  
>  /* Convert from vfs inode to xfs inode */
> @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
>  int xfs_iunlink_init(struct xfs_perag *pag);
>  void xfs_iunlink_destroy(struct xfs_perag *pag);
>  
> +void xfs_end_io(struct work_struct *work);
> +
>  #endif	/* __XFS_INODE_H__ */
> 


* Re: [PATCH 2/3] xfs: remove unused m_data_workqueue
  2019-04-15  2:07 ` [PATCH 2/3] xfs: remove unused m_data_workqueue Darrick J. Wong
@ 2019-04-15 14:49   ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2019-04-15 14:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Apr 14, 2019 at 07:07:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we're no longer using m_data_workqueue, remove it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_mount.h |    1 -
>  fs/xfs/xfs_super.c |   10 +---------
>  2 files changed, 1 insertion(+), 10 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 110f927cf943..1d3a4b2bdc2f 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -175,7 +175,6 @@ typedef struct xfs_mount {
>  	struct xstats		m_stats;	/* per-fs stats */
>  
>  	struct workqueue_struct *m_buf_workqueue;
> -	struct workqueue_struct	*m_data_workqueue;
>  	struct workqueue_struct	*m_unwritten_workqueue;
>  	struct workqueue_struct	*m_cil_workqueue;
>  	struct workqueue_struct	*m_reclaim_workqueue;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f093ea244849..df917f41ca46 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -838,15 +838,10 @@ xfs_init_mount_workqueues(
>  	if (!mp->m_buf_workqueue)
>  		goto out;
>  
> -	mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
> -			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
> -	if (!mp->m_data_workqueue)
> -		goto out_destroy_buf;
> -
>  	mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
>  			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
>  	if (!mp->m_unwritten_workqueue)
> -		goto out_destroy_data_iodone_queue;
> +		goto out_destroy_buf;
>  
>  	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
>  			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
> @@ -886,8 +881,6 @@ xfs_init_mount_workqueues(
>  	destroy_workqueue(mp->m_cil_workqueue);
>  out_destroy_unwritten:
>  	destroy_workqueue(mp->m_unwritten_workqueue);
> -out_destroy_data_iodone_queue:
> -	destroy_workqueue(mp->m_data_workqueue);
>  out_destroy_buf:
>  	destroy_workqueue(mp->m_buf_workqueue);
>  out:
> @@ -903,7 +896,6 @@ xfs_destroy_mount_workqueues(
>  	destroy_workqueue(mp->m_log_workqueue);
>  	destroy_workqueue(mp->m_reclaim_workqueue);
>  	destroy_workqueue(mp->m_cil_workqueue);
> -	destroy_workqueue(mp->m_data_workqueue);
>  	destroy_workqueue(mp->m_unwritten_workqueue);
>  	destroy_workqueue(mp->m_buf_workqueue);
>  }
> 


* Re: [PATCH 3/3] xfs: merge adjacent io completions of the same type
  2019-04-15  2:08 ` [PATCH 3/3] xfs: merge adjacent io completions of the same type Darrick J. Wong
@ 2019-04-15 14:49   ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2019-04-15 14:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Apr 14, 2019 at 07:08:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> It's possible for pagecache writeback to split up a large amount of work
> into smaller pieces for throttling purposes or to reduce the amount of
> time a writeback operation is pending.  Whatever the reason, XFS can end
> up with a bunch of IO completions that call for the same operation to be
> performed on a contiguous extent mapping.  Since mappings are extent
> based in XFS, we'd prefer to run fewer transactions when we can.
> 
> When we're processing an ioend on the list of io completions, check to
> see if the next items on the list are both adjacent and of the same
> type.  If so, we can merge the completions to reduce transaction
> overhead.
> 
> On fast storage this doesn't seem to make much of a difference in
> performance, though the number of transactions for an overnight xfstests
> run seems to drop by ~5%.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_aops.c |   86 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 86 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f7a9bb661826..53afa2e6e3e7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -237,6 +237,7 @@ STATIC void
>  xfs_end_ioend(
>  	struct xfs_ioend	*ioend)
>  {
> +	struct list_head	ioend_list;
>  	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
>  	xfs_off_t		offset = ioend->io_offset;
>  	size_t			size = ioend->io_size;
> @@ -273,7 +274,89 @@ xfs_end_ioend(
>  done:
>  	if (ioend->io_append_trans)
>  		error = xfs_setfilesize_ioend(ioend, error);
> +	list_replace_init(&ioend->io_list, &ioend_list);
>  	xfs_destroy_ioend(ioend, error);
> +
> +	while (!list_empty(&ioend_list)) {
> +		ioend = list_first_entry(&ioend_list, struct xfs_ioend,
> +				io_list);
> +		list_del_init(&ioend->io_list);
> +		xfs_destroy_ioend(ioend, error);
> +	}
> +}
> +
> +/*
> + * We can merge two adjacent ioends if they have the same set of work to do.
> + */
> +static bool
> +xfs_ioend_can_merge(
> +	struct xfs_ioend	*ioend,
> +	int			ioend_error,
> +	struct xfs_ioend	*next)
> +{
> +	int			next_error;
> +
> +	next_error = blk_status_to_errno(next->io_bio->bi_status);
> +	if (ioend_error != next_error)
> +		return false;
> +	if ((ioend->io_fork == XFS_COW_FORK) ^ (next->io_fork == XFS_COW_FORK))
> +		return false;
> +	if ((ioend->io_state == XFS_EXT_UNWRITTEN) ^
> +	    (next->io_state == XFS_EXT_UNWRITTEN))
> +		return false;
> +	if (ioend->io_offset + ioend->io_size != next->io_offset)
> +		return false;
> +	if (xfs_ioend_is_append(ioend) != xfs_ioend_is_append(next))
> +		return false;
> +	return true;
> +}
> +
> +/* Try to merge adjacent completions. */
> +STATIC void
> +xfs_ioend_try_merge(
> +	struct xfs_ioend	*ioend,
> +	struct list_head	*more_ioends)
> +{
> +	struct xfs_ioend	*next_ioend;
> +	int			ioend_error;
> +	int			error;
> +
> +	if (list_empty(more_ioends))
> +		return;
> +
> +	ioend_error = blk_status_to_errno(ioend->io_bio->bi_status);
> +
> +	while (!list_empty(more_ioends)) {
> +		next_ioend = list_first_entry(more_ioends, struct xfs_ioend,
> +				io_list);
> +		if (!xfs_ioend_can_merge(ioend, ioend_error, next_ioend))
> +			break;
> +		list_move_tail(&next_ioend->io_list, &ioend->io_list);
> +		ioend->io_size += next_ioend->io_size;
> +		if (ioend->io_append_trans) {
> +			error = xfs_setfilesize_ioend(next_ioend, 1);
> +			ASSERT(error == 1);
> +		}
> +	}
> +}
> +
> +/* list_sort compare function for ioends */
> +static int
> +xfs_ioend_compare(
> +	void			*priv,
> +	struct list_head	*a,
> +	struct list_head	*b)
> +{
> +	struct xfs_ioend	*ia;
> +	struct xfs_ioend	*ib;
> +
> +	ia = container_of(a, struct xfs_ioend, io_list);
> +	ib = container_of(b, struct xfs_ioend, io_list);
> +	if (ia->io_offset < ib->io_offset)
> +		return -1;
> +	else if (ia->io_offset > ib->io_offset)
> +		return 1;
> +	return 0;
>  }
>  
>  /* Finish all pending io completions. */
> @@ -292,10 +375,13 @@ xfs_end_io(
>  	list_replace_init(&ip->i_iodone_list, &completion_list);
>  	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
>  
> +	list_sort(NULL, &completion_list, xfs_ioend_compare);
> +
>  	while (!list_empty(&completion_list)) {
>  		ioend = list_first_entry(&completion_list, struct xfs_ioend,
>  				io_list);
>  		list_del_init(&ioend->io_list);
> +		xfs_ioend_try_merge(ioend, &completion_list);
>  		xfs_end_ioend(ioend);
>  	}
>  }
> 


* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 14:49   ` Brian Foster
@ 2019-04-15 15:53     ` Darrick J. Wong
  2019-04-15 16:13       ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15 15:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When scheduling writeback of dirty file data in the page cache, XFS uses
> > IO completion workqueue items to ensure that filesystem metadata only
> > updates after the write completes successfully.  This is essential for
> > converting unwritten extents to real extents at the right time and
> > performing COW remappings.
> > 
> > Unfortunately, XFS queues each IO completion work item to an unbounded
> > workqueue, which means that the kernel can spawn dozens of threads to
> > try to handle the items quickly.  These threads need to take the ILOCK
> > to update file metadata, which results in heavy ILOCK contention if a
> > large number of the work items target a single file, which is
> > inefficient.
> > 
> > Worse yet, the writeback completion threads get stuck waiting for the
> > ILOCK while holding transaction reservations, which can use up all
> > available log reservation space.  When that happens, metadata updates to
> > other parts of the filesystem grind to a halt, even if the filesystem
> > could otherwise have handled it.
> > 
> > Even worse, if one of the things grinding to a halt happens to be a
> > thread in the middle of a defer-ops finish holding the same ILOCK and
> > trying to obtain more log reservation having exhausted the permanent
> > reservation, we now have an ABBA deadlock - writeback has a transaction
> > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > transaction reservation.
> > 
> > Therefore, we create a per-inode writeback io completion queue + work
> > item.  When writeback finishes, it can add the ioend to the per-inode
> > queue and let the single worker item process that queue.  This
> > dramatically cuts down on the number of kworkers and ILOCK contention in
> > the system, and seems to have eliminated an occasional deadlock I was
> > seeing while running generic/476.
> > 
> > Testing with a program that simulates a heavy random-write workload to a
> > single file demonstrates that the number of kworkers drops from
> > approximately 120 threads per file to 1, without dramatically changing
> > write bandwidth or pagecache access latency.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Thanks for the updated commit log. It looks like a couple nits from the
> rfc weren't addressed though. Specifically, to rename the workqueue and
> iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> buf (iodone) completion callbacks.

Oops, I did forget to rename the iodone part.  Will fix.

> Also, this patch essentially turns the unbound completion work queue
> into something that should only expect one task running at a time,
> right? If so, it might make sense to fix up the max_active parameter to
> the alloc_workqueue() call and ...

It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
as you decide are necessary to handle the queued work items") here
because while we only want to have one ioend worker per inode, we still
want to have as many inodes undergoing ioend processing simultaneously
as the storage can handle.
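
A sketch of the allocation in question, with the max_active argument
annotated (this reflects existing workqueue behaviour, not a change
proposed by the series):

	mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
			WQ_MEM_RECLAIM|WQ_FREEZABLE,
			0,	/* max_active 0 = workqueue default: many work
				 * items may run concurrently, so many inodes
				 * can process ioends at once; per-inode
				 * serialization comes from queueing
				 * i_ioend_work at most once per batch, not
				 * from this limit. */
			mp->m_fsname);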

> 
> >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> >  fs/xfs/xfs_aops.h   |    1 -
> >  fs/xfs/xfs_icache.c |    3 +++
> >  fs/xfs/xfs_inode.h  |    7 +++++++
> >  4 files changed, 47 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3619e9e8d359..f7a9bb661826 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> ...
> > @@ -278,19 +276,48 @@ xfs_end_io(
> ...
> >  STATIC void
> >  xfs_end_bio(
> >  	struct bio		*bio)
> >  {
> >  	struct xfs_ioend	*ioend = bio->bi_private;
> > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	unsigned long		flags;
> >  
> >  	if (ioend->io_fork == XFS_COW_FORK ||
> > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > -	else if (ioend->io_append_trans)
> > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > -	else
> > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > +	    ioend->io_append_trans != NULL) {
> > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > +		if (list_empty(&ip->i_iodone_list))
> > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> 
> ... assert that queue_work() always returns true.

queue_work can return false here because the worker function only holds
i_ioend_lock long enough to move the i_ioend_list onto a function-local
list head.  Once it drops the lock and begins issuing ioend transactions,
there's nothing preventing another xfs_end_bio from taking the lock,
seeing the empty list, and (re)queuing i_ioend_work while the worker is
running.
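
A sketch of the interleaving being described, using the names from the
patch:

	/*
	 * xfs_end_io() worker                    xfs_end_bio() completion
	 * -------------------                    ------------------------
	 * lock i_ioend_lock
	 * move i_ioend_list onto a local list
	 * unlock i_ioend_lock
	 * run transactions for the local list    lock i_ioend_lock
	 *                                        i_ioend_list is empty again
	 *                                        queue_work(i_ioend_work)
	 *                                        add ioend to i_ioend_list
	 *                                        unlock i_ioend_lock
	 */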

--D

> Brian
> 
> > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > +	} else
> >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> >  }
> >  
> > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> >  	ioend->io_inode = inode;
> >  	ioend->io_size = 0;
> >  	ioend->io_offset = offset;
> > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> >  	ioend->io_append_trans = NULL;
> >  	ioend->io_bio = bio;
> >  	return ioend;
> > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > index 6c2615b83c5d..f62b03186c62 100644
> > --- a/fs/xfs/xfs_aops.h
> > +++ b/fs/xfs/xfs_aops.h
> > @@ -18,7 +18,6 @@ struct xfs_ioend {
> >  	struct inode		*io_inode;	/* file being written to */
> >  	size_t			io_size;	/* size of the extent */
> >  	xfs_off_t		io_offset;	/* offset in the file */
> > -	struct work_struct	io_work;	/* xfsdatad work queue */
> >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> >  	struct bio		*io_bio;	/* bio being built */
> >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 245483cc282b..e70e7db29026 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> >  	ip->i_flags = 0;
> >  	ip->i_delayed_blks = 0;
> >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > +	spin_lock_init(&ip->i_iodone_lock);
> >  
> >  	return ip;
> >  }
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index e62074a5257c..88239c2dd824 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> >  
> >  	/* VFS inode */
> >  	struct inode		i_vnode;	/* embedded VFS inode */
> > +
> > +	/* pending io completions */
> > +	spinlock_t		i_iodone_lock;
> > +	struct work_struct	i_iodone_work;
> > +	struct list_head	i_iodone_list;
> >  } xfs_inode_t;
> >  
> >  /* Convert from vfs inode to xfs inode */
> > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> >  int xfs_iunlink_init(struct xfs_perag *pag);
> >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> >  
> > +void xfs_end_io(struct work_struct *work);
> > +
> >  #endif	/* __XFS_INODE_H__ */
> > 


* [PATCH v2 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
  2019-04-15 14:49   ` Brian Foster
@ 2019-04-15 16:13   ` Darrick J. Wong
  2019-04-15 18:13     ` Brian Foster
  2019-04-23  6:34   ` [PATCH " Christoph Hellwig
  2 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15 16:13 UTC (permalink / raw)
  To: linux-xfs; +Cc: Brian Foster

From: Darrick J. Wong <darrick.wong@oracle.com>

When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback has a transaction
reserved and wants the ILOCK, and someone else has the ILOCK and wants a
transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Note that we leave the xfs-conv workqueue's max_active alone because we
still want to be able to run ioend processing for as many inodes as the
system can handle.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: rename i_iodone_* -> i_ioend_* to avoid naming collision
---
 fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_aops.h   |    1 -
 fs/xfs/xfs_icache.c |    3 +++
 fs/xfs/xfs_inode.h  |    7 +++++++
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3619e9e8d359..0dd249115f3c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -234,11 +234,9 @@ xfs_setfilesize_ioend(
  * IO write completion.
  */
 STATIC void
-xfs_end_io(
-	struct work_struct *work)
+xfs_end_ioend(
+	struct xfs_ioend	*ioend)
 {
-	struct xfs_ioend	*ioend =
-		container_of(work, struct xfs_ioend, io_work);
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	xfs_off_t		offset = ioend->io_offset;
 	size_t			size = ioend->io_size;
@@ -278,19 +276,48 @@ xfs_end_io(
 	xfs_destroy_ioend(ioend, error);
 }
 
+/* Finish all pending io completions. */
+void
+xfs_end_io(
+	struct work_struct	*work)
+{
+	struct xfs_inode	*ip;
+	struct xfs_ioend	*ioend;
+	struct list_head	completion_list;
+	unsigned long		flags;
+
+	ip = container_of(work, struct xfs_inode, i_ioend_work);
+
+	spin_lock_irqsave(&ip->i_ioend_lock, flags);
+	list_replace_init(&ip->i_ioend_list, &completion_list);
+	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
+
+	while (!list_empty(&completion_list)) {
+		ioend = list_first_entry(&completion_list, struct xfs_ioend,
+				io_list);
+		list_del_init(&ioend->io_list);
+		xfs_end_ioend(ioend);
+	}
+}
+
 STATIC void
 xfs_end_bio(
 	struct bio		*bio)
 {
 	struct xfs_ioend	*ioend = bio->bi_private;
-	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned long		flags;
 
 	if (ioend->io_fork == XFS_COW_FORK ||
-	    ioend->io_state == XFS_EXT_UNWRITTEN)
-		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
-	else if (ioend->io_append_trans)
-		queue_work(mp->m_data_workqueue, &ioend->io_work);
-	else
+	    ioend->io_state == XFS_EXT_UNWRITTEN ||
+	    ioend->io_append_trans != NULL) {
+		spin_lock_irqsave(&ip->i_ioend_lock, flags);
+		if (list_empty(&ip->i_ioend_list))
+			queue_work(mp->m_unwritten_workqueue, &ip->i_ioend_work);
+		list_add_tail(&ioend->io_list, &ip->i_ioend_list);
+		spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
+	} else
 		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
 }
 
@@ -594,7 +621,6 @@ xfs_alloc_ioend(
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
 	ioend->io_offset = offset;
-	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
 	return ioend;
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 6c2615b83c5d..f62b03186c62 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -18,7 +18,6 @@ struct xfs_ioend {
 	struct inode		*io_inode;	/* file being written to */
 	size_t			io_size;	/* size of the extent */
 	xfs_off_t		io_offset;	/* offset in the file */
-	struct work_struct	io_work;	/* xfsdatad work queue */
 	struct xfs_trans	*io_append_trans;/* xact. for size update */
 	struct bio		*io_bio;	/* bio being built */
 	struct bio		io_inline_bio;	/* MUST BE LAST! */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..1237b775d7f7 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -70,6 +70,9 @@ xfs_inode_alloc(
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
 	memset(&ip->i_d, 0, sizeof(ip->i_d));
+	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
+	INIT_LIST_HEAD(&ip->i_ioend_list);
+	spin_lock_init(&ip->i_ioend_lock);
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e62074a5257c..b9ee3dfc104a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -57,6 +57,11 @@ typedef struct xfs_inode {
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
+
+	/* pending io completions */
+	spinlock_t		i_ioend_lock;
+	struct work_struct	i_ioend_work;
+	struct list_head	i_ioend_list;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
@@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 
+void xfs_end_io(struct work_struct *work);
+
 #endif	/* __XFS_INODE_H__ */


* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 15:53     ` Darrick J. Wong
@ 2019-04-15 16:13       ` Brian Foster
  2019-04-15 17:09         ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-04-15 16:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > IO completion workqueue items to ensure that filesystem metadata only
> > > updates after the write completes successfully.  This is essential for
> > > converting unwritten extents to real extents at the right time and
> > > performing COW remappings.
> > > 
> > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > workqueue, which means that the kernel can spawn dozens of threads to
> > > try to handle the items quickly.  These threads need to take the ILOCK
> > > to update file metadata, which results in heavy ILOCK contention if a
> > > large number of the work items target a single file, which is
> > > inefficient.
> > > 
> > > Worse yet, the writeback completion threads get stuck waiting for the
> > > ILOCK while holding transaction reservations, which can use up all
> > > available log reservation space.  When that happens, metadata updates to
> > > other parts of the filesystem grind to a halt, even if the filesystem
> > > could otherwise have handled it.
> > > 
> > > Even worse, if one of the things grinding to a halt happens to be a
> > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > trying to obtain more log reservation having exhausted the permanent
> > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > > transaction reservation.
> > > 
> > > Therefore, we create a per-inode writeback io completion queue + work
> > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > queue and let the single worker item process that queue.  This
> > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > the system, and seems to have eliminated an occasional deadlock I was
> > > seeing while running generic/476.
> > > 
> > > Testing with a program that simulates a heavy random-write workload to a
> > > single file demonstrates that the number of kworkers drops from
> > > approximately 120 threads per file to 1, without dramatically changing
> > > write bandwidth or pagecache access latency.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Thanks for the updated commit log. It looks like a couple nits from the
> > rfc weren't addressed though. Specifically, to rename the workqueue and
> > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > buf (iodone) completion callbacks.
> 
> Oops, I did forget to rename the iodone part.  Will fix.
> 
> > Also, this patch essentially turns the unbound completion work queue
> > into something that should only expect one task running at a time,
> > right? If so, it might make sense to fix up the max_active parameter to
> > the alloc_workqueue() call and ...
> 
> It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> as you decide are necessary to handle the queued work items") here
> because while we only want to have one ioend worker per inode, we still
> want to have as many inodes undergoing ioend processing simultaneously
> as the storage can handle.
> 

Ah, right. For some reason I was thinking this would be per-work..
disregard the thinko..

> > 
> > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > >  fs/xfs/xfs_aops.h   |    1 -
> > >  fs/xfs/xfs_icache.c |    3 +++
> > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 3619e9e8d359..f7a9bb661826 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > ...
> > > @@ -278,19 +276,48 @@ xfs_end_io(
> > ...
> > >  STATIC void
> > >  xfs_end_bio(
> > >  	struct bio		*bio)
> > >  {
> > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	unsigned long		flags;
> > >  
> > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > -	else if (ioend->io_append_trans)
> > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > -	else
> > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > +	    ioend->io_append_trans != NULL) {
> > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > +		if (list_empty(&ip->i_iodone_list))
> > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > 
> > ... assert that queue_work() always returns true.
> 
> queue_work can return false here because the worker function only holds
> i_ioend_lock long enough to move the i_ioend_list onto a function-local
> list head.  Once it drops the lock and begins issuing ioend transactions,
> there's nothing preventing another xfs_end_bio from taking the lock,
> seeing the empty list, and (re)queuing i_ioend_work while the worker is
> running.
> 

So I'm not familiar with the details of the wq implementation, but I was
assuming that the return value essentially tells us if we've scheduled a
new execution of this work or not. IOW, that the case above would return
true because the current execution has already dequeued the work and
began running it.

Is that not the case? FWIW, if the queue_work() call doesn't actually
requeue the task if it happens to be running (aside from the semantics
of the return value), ISTM that we're at risk of leaving ioends around
until something else comes along to kick the handler. A quick look at
the wq code suggests queue_work() returns false only when a particular
queued bit is set and thus the call is essentially a no-op, but that
state appears to be cleared (via set_work_pool_and_clear_pending())
before the work callback is invoked. Hm?
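
A condensed sketch of that ordering, based on the generic workqueue code
(simplified; the symbols are from kernel/workqueue.c):

	/*
	 * queue_work() returns false only if WORK_STRUCT_PENDING_BIT is
	 * already set, i.e. the work item is queued but has not started
	 * running.  The worker clears that bit via
	 * set_work_pool_and_clear_pending() just before calling the work
	 * function, so a queue_work() issued while xfs_end_io() is already
	 * executing sets the bit again, returns true, and schedules another
	 * pass that will pick up the newly added ioends.
	 */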

Brian

> --D
> 
> > Brian
> > 
> > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > +	} else
> > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > >  }
> > >  
> > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > >  	ioend->io_inode = inode;
> > >  	ioend->io_size = 0;
> > >  	ioend->io_offset = offset;
> > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > >  	ioend->io_append_trans = NULL;
> > >  	ioend->io_bio = bio;
> > >  	return ioend;
> > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > index 6c2615b83c5d..f62b03186c62 100644
> > > --- a/fs/xfs/xfs_aops.h
> > > +++ b/fs/xfs/xfs_aops.h
> > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > >  	struct inode		*io_inode;	/* file being written to */
> > >  	size_t			io_size;	/* size of the extent */
> > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > >  	struct bio		*io_bio;	/* bio being built */
> > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 245483cc282b..e70e7db29026 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > >  	ip->i_flags = 0;
> > >  	ip->i_delayed_blks = 0;
> > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > +	spin_lock_init(&ip->i_iodone_lock);
> > >  
> > >  	return ip;
> > >  }
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index e62074a5257c..88239c2dd824 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > >  
> > >  	/* VFS inode */
> > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > +
> > > +	/* pending io completions */
> > > +	spinlock_t		i_iodone_lock;
> > > +	struct work_struct	i_iodone_work;
> > > +	struct list_head	i_iodone_list;
> > >  } xfs_inode_t;
> > >  
> > >  /* Convert from vfs inode to xfs inode */
> > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > >  
> > > +void xfs_end_io(struct work_struct *work);
> > > +
> > >  #endif	/* __XFS_INODE_H__ */
> > > 


* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 16:13       ` Brian Foster
@ 2019-04-15 17:09         ` Darrick J. Wong
  2019-04-15 17:31           ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15 17:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 12:13:47PM -0400, Brian Foster wrote:
> On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > > IO completion workqueue items to ensure that filesystem metadata only
> > > > updates after the write completes successfully.  This is essential for
> > > > converting unwritten extents to real extents at the right time and
> > > > performing COW remappings.
> > > > 
> > > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > > workqueue, which means that the kernel can spawn dozens of threads to
> > > > try to handle the items quickly.  These threads need to take the ILOCK
> > > > to update file metadata, which results in heavy ILOCK contention if a
> > > > large number of the work items target a single file, which is
> > > > inefficient.
> > > > 
> > > > Worse yet, the writeback completion threads get stuck waiting for the
> > > > ILOCK while holding transaction reservations, which can use up all
> > > > available log reservation space.  When that happens, metadata updates to
> > > > other parts of the filesystem grind to a halt, even if the filesystem
> > > > could otherwise have handled it.
> > > > 
> > > > Even worse, if one of the things grinding to a halt happens to be a
> > > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > > trying to obtain more log reservation having exhausted the permanent
> > > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > > > transaction reservation.
> > > > 
> > > > Therefore, we create a per-inode writeback io completion queue + work
> > > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > > queue and let the single worker item process that queue.  This
> > > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > > the system, and seems to have eliminated an occasional deadlock I was
> > > > seeing while running generic/476.
> > > > 
> > > > Testing with a program that simulates a heavy random-write workload to a
> > > > single file demonstrates that the number of kworkers drops from
> > > > approximately 120 threads per file to 1, without dramatically changing
> > > > write bandwidth or pagecache access latency.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Thanks for the updated commit log. It looks like a couple nits from the
> > > rfc weren't addressed though. Specifically, to rename the workqueue and
> > > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > > buf (iodone) completion callbacks.
> > 
> > Oops, I did forget to rename the iodone part.  Will fix.
> > 
> > > Also, this patch essentially turns the unbound completion work queue
> > > into something that should only expect one task running at a time,
> > > right? If so, it might make sense to fix up the max_active parameter to
> > > the alloc_workqueue() call and ...
> > 
> > It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> > as you decide are necessary to handle the queued work items") here
> > because while we only want to have one ioend worker per inode, we still
> > want to have as many inodes undergoing ioend processing simultaneously
> > as the storage can handle.
> > 
> 
> Ah, right. For some reason I was thinking this would be per-work..
> disregard the thinko..
> 
> > > 
> > > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > > >  fs/xfs/xfs_aops.h   |    1 -
> > > >  fs/xfs/xfs_icache.c |    3 +++
> > > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 3619e9e8d359..f7a9bb661826 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > ...
> > > > @@ -278,19 +276,48 @@ xfs_end_io(
> > > ...
> > > >  STATIC void
> > > >  xfs_end_bio(
> > > >  	struct bio		*bio)
> > > >  {
> > > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	unsigned long		flags;
> > > >  
> > > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > > -	else if (ioend->io_append_trans)
> > > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > > -	else
> > > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > > +	    ioend->io_append_trans != NULL) {
> > > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > > +		if (list_empty(&ip->i_iodone_list))
> > > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > > 
> > > ... assert that queue_work() always returns true.
> > 
> > queue_work can return false here because the worker function only holds
> > i_ioend_lock long enough to move the i_ioend_list onto a function-local
> > list head.  Once it drops the lock and begins issuing ioend transactions,
> > there's nothing preventing another xfs_end_bio from taking the lock,
> > seeing the empty list, and (re)queuing i_ioend_work while the worker is
> > running.
> > 
> 
> So I'm not familiar with the details of the wq implementation, but I was
> assuming that the return value essentially tells us if we've scheduled a
> new execution of this work or not. IOW, that the case above would return
> true because the current execution has already dequeued the work and
> began running it.

It does return true if we've scheduled a new execution of work.  If the
work has already been dequeued and is running then it'll queue it again,
because process_one_work calls set_work_pool_and_clear_pending to clear
the PENDING bit after it's decided to run the work item but before it
actually invokes the work function.

> Is that not the case? FWIW, if the queue_work() call doesn't actually
> requeue the task if it happens to be running (aside from the semantics

As far as queue_work is concerned there's no difference between a work
item that is currently running and a work item that has never been
queued -- the result for both cases is to put it on the queue.  So I
think the answer to "is that not the case?" is "no, you got it right in
the previous paragraph".

But maybe I'll just work out some examples to satisfy my own sense of
"egad I looked at the workqueue code and OH MY EYES" :)

If queue_work observes that the PENDING bit was set, it'll return false
immediately because we know that the work function hasn't been called
yet, so whatever we just added to the i_ioend_list will get picked up
when the work function gets called.

If queue_work observes that PENDING was not set and the work item isn't
queued or being processed anywhere in the system, it'll add the work
item to the queue and xfs_end_io can pick up the work from the ioend
list whenever it does get called.

If queue_work observes that PENDING was not set and process_one_work has
picked up the work item and it is after the point where it clears the
bit but before the work function picks up the ioend list, then we'll
requeue the work item, but the current work function invocation
will process our ioend.  The requeued item will be called again, but it
won't find any work and will return.

If queue_work observes that PENDING was not set and process_one_work has
picked up the work item and it is after the point where the work
function picks up the ioend list, then we'll requeue the work item, the
current work function invocation won't see our new ioend, and the
requeued item will be called again and it'll process our new ioend.
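
For reference, here's a condensed sketch of the two sides in question --
taken from the patch, with the v1 field names (they become i_ioend_* in
v2) and locals/error handling elided -- so the windows above are easier
to see:

	/* completion side, xfs_end_bio(), runs in bio completion context */
	spin_lock_irqsave(&ip->i_iodone_lock, flags);
	if (list_empty(&ip->i_iodone_list))
		queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
	list_add_tail(&ioend->io_list, &ip->i_iodone_list);
	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);

	/* worker side, xfs_end_io() */
	spin_lock_irqsave(&ip->i_iodone_lock, flags);
	list_replace_init(&ip->i_iodone_list, &completion_list);
	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
	/*
	 * ...then walk completion_list and call xfs_end_ioend() on each
	 * entry with the lock dropped, which is the window the cases
	 * above are about.
	 */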

> of the return value), ISTM that we're at risk of leaving ioends around
> until something else comes along to kick the handler. A quick look at
> the wq code suggests queue_work() returns false only when a particular
> queued bit is set and thus the call is essentially a no-op, but that
> state appears to be cleared (via set_work_pool_and_clear_pending())
> before the work callback is invoked. Hm?

<nod> Does that clear things up?  The wq documentation isn't
particularly clear in its description of how queues work for end users
like us. :)

--D

> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > > +	} else
> > > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > > >  }
> > > >  
> > > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > > >  	ioend->io_inode = inode;
> > > >  	ioend->io_size = 0;
> > > >  	ioend->io_offset = offset;
> > > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > > >  	ioend->io_append_trans = NULL;
> > > >  	ioend->io_bio = bio;
> > > >  	return ioend;
> > > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > > index 6c2615b83c5d..f62b03186c62 100644
> > > > --- a/fs/xfs/xfs_aops.h
> > > > +++ b/fs/xfs/xfs_aops.h
> > > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > > >  	struct inode		*io_inode;	/* file being written to */
> > > >  	size_t			io_size;	/* size of the extent */
> > > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > > >  	struct bio		*io_bio;	/* bio being built */
> > > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 245483cc282b..e70e7db29026 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > > >  	ip->i_flags = 0;
> > > >  	ip->i_delayed_blks = 0;
> > > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > > +	spin_lock_init(&ip->i_iodone_lock);
> > > >  
> > > >  	return ip;
> > > >  }
> > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > index e62074a5257c..88239c2dd824 100644
> > > > --- a/fs/xfs/xfs_inode.h
> > > > +++ b/fs/xfs/xfs_inode.h
> > > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > > >  
> > > >  	/* VFS inode */
> > > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > > +
> > > > +	/* pending io completions */
> > > > +	spinlock_t		i_iodone_lock;
> > > > +	struct work_struct	i_iodone_work;
> > > > +	struct list_head	i_iodone_list;
> > > >  } xfs_inode_t;
> > > >  
> > > >  /* Convert from vfs inode to xfs inode */
> > > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > > >  
> > > > +void xfs_end_io(struct work_struct *work);
> > > > +
> > > >  #endif	/* __XFS_INODE_H__ */
> > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 17:09         ` Darrick J. Wong
@ 2019-04-15 17:31           ` Brian Foster
  2019-04-15 18:00             ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-04-15 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 10:09:35AM -0700, Darrick J. Wong wrote:
> On Mon, Apr 15, 2019 at 12:13:47PM -0400, Brian Foster wrote:
> > On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > > > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > > > IO completion workqueue items to ensure that filesystem metadata only
> > > > > updates after the write completes successfully.  This is essential for
> > > > > converting unwritten extents to real extents at the right time and
> > > > > performing COW remappings.
> > > > > 
> > > > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > > > workqueue, which means that the kernel can spawn dozens of threads to
> > > > > try to handle the items quickly.  These threads need to take the ILOCK
> > > > > to update file metadata, which results in heavy ILOCK contention if a
> > > > > large number of the work items target a single file, which is
> > > > > inefficient.
> > > > > 
> > > > > Worse yet, the writeback completion threads get stuck waiting for the
> > > > > ILOCK while holding transaction reservations, which can use up all
> > > > > available log reservation space.  When that happens, metadata updates to
> > > > > other parts of the filesystem grind to a halt, even if the filesystem
> > > > > could otherwise have handled it.
> > > > > 
> > > > > Even worse, if one of the things grinding to a halt happens to be a
> > > > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > > > trying to obtain more log reservation having exhausted the permanent
> > > > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > > > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > > > > transaction reservation.
> > > > > 
> > > > > Therefore, we create a per-inode writeback io completion queue + work
> > > > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > > > queue and let the single worker item process that queue.  This
> > > > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > > > the system, and seems to have eliminated an occasional deadlock I was
> > > > > seeing while running generic/476.
> > > > > 
> > > > > Testing with a program that simulates a heavy random-write workload to a
> > > > > single file demonstrates that the number of kworkers drops from
> > > > > approximately 120 threads per file to 1, without dramatically changing
> > > > > write bandwidth or pagecache access latency.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > Thanks for the updated commit log. It looks like a couple nits from the
> > > > rfc weren't addressed though. Specifically, to rename the workqueue and
> > > > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > > > buf (iodone) completion callbacks.
> > > 
> > > Oops, I did forget to rename the iodone part.  Will fix.
> > > 
> > > > Also, this patch essentially turns the unbound completion work queue
> > > > into something that should only expect one task running at a time,
> > > > right? If so, it might make sense to fix up the max_active parameter to
> > > > the alloc_workqueue() call and ...
> > > 
> > > It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> > > as you decide are necessary to handle the queued work items") here
> > > because while we only want to have one ioend worker per inode, we still
> > > want to have as many inodes undergoing ioend processing simultaneously
> > > as the storage can handle.
> > > 
> > 
> > Ah, right. For some reason I was thinking this would be per-work..
> > disregard the thinko..
> > 
> > > > 
> > > > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > > > >  fs/xfs/xfs_aops.h   |    1 -
> > > > >  fs/xfs/xfs_icache.c |    3 +++
> > > > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > > > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > index 3619e9e8d359..f7a9bb661826 100644
> > > > > --- a/fs/xfs/xfs_aops.c
> > > > > +++ b/fs/xfs/xfs_aops.c
> > > > ...
> > > > > @@ -278,19 +276,48 @@ xfs_end_io(
> > > > ...
> > > > >  STATIC void
> > > > >  xfs_end_bio(
> > > > >  	struct bio		*bio)
> > > > >  {
> > > > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > +	unsigned long		flags;
> > > > >  
> > > > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > > > -	else if (ioend->io_append_trans)
> > > > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > > > -	else
> > > > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > > > +	    ioend->io_append_trans != NULL) {
> > > > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > > > +		if (list_empty(&ip->i_iodone_list))
> > > > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > > > 
> > > > ... assert that queue_work() always returns true.
> > > 
> > > queue_work can return false here because the worker function only holds
> > > i_ioend_lock long enough to move the i_ioend_list onto a function-local
> > > list head.  Once it drops the lock and begins issuing ioend transactions,
> > > there's nothing preventing another xfs_end_bio from taking the lock,
> > > seeing the empty list, and (re)queuing i_ioend_work while the worker is
> > > running.
> > > 
> > 
> > So I'm not familiar with the details of the wq implementation, but I was
> > assuming that the return value essentially tells us if we've scheduled a
> > new execution of this work or not. IOW, that the case above would return
> > true because the current execution has already dequeued the work and
> > began running it.
> 
> It does return true if we've scheduled a new execution of work.  If the
> work has already been dequeued and is running then it'll queue it again,
> because process_one_work calls set_work_pool_and_clear_pending to clear
> the PENDING bit after it's decided to run the work item but before it
> actually invokes the work function.
> 
> > Is that not the case? FWIW, if the queue_work() call doesn't actually
> > requeue the task if it happens to be running (aside from the semantics
> 
> As far as queue_work is concerned there's no difference between a work
> item that is currently running and a work item that has never been
> queued -- the result for both cases is to put it on the queue.  So I
> think the answer to "is that not the case?" is "no, you got it right in
> the previous paragraph".
> 
> But maybe I'll just work out some examples to satisfy my own sense of
> "egad I looked at the workqueue code and OH MY EYES" :)
> 

;)

> If queue_work observes that the PENDING bit was set, it'll return false
> immediately because we know that the work function hasn't been called
> yet, so whatever we just added to the i_ioend_list will get picked up
> when the work function gets called.
> 
> If queue_work observes that PENDING was not set and the work item isn't
> queued or being processed anywhere in the system, it'll add the work
> item to the queue and xfs_end_io can pick up the work from the ioend
> list whenever it does get called.
> 
> If queue_work observes that PENDING was not set and process_one_work has
> picked up the work item and it is after the point where it clears the
> bit but before the work function picks up the ioend list, then we'll
> requeue the work item, but the current work function invocation
> will process our ioend.  The requeued item will be called again, but it
> won't find any work and will return.
> 

Wouldn't we skip the queue_work() call in this case because the list is
presumably not empty (since the work being dequeued hasn't run yet to
empty it)?

> If queue_work observes that PENDING was not set and process_one_work has
> picked up the work item and it is after the point where the work
> function picks up the ioend list, then we'll requeue the work item, the
> current work function invocation won't see our new ioend, and the
> requeued item will be called again and it'll process our new ioend.
> 

Doesn't this mean that our queue_work() call should always return true?
We're only calling it if the list is empty, which means either nothing
was queued yet or a work item is currently running and is processing
ioends.

Brian

> > of the return value), ISTM that we're at risk of leaving ioends around
> > until something else comes along to kick the handler. A quick look at
> > the wq code suggests queue_work() returns false only when a particular
> > queued bit is set and thus the call is essentially a no-op, but that
> > state appears to be cleared (via set_work_pool_and_clear_pending())
> > before the work callback is invoked. Hm?
> 
> <nod> Does that clear things up?  The wq documentation isn't
> particularly clear in its description of how queues work for end users
> like us. :)
> 
> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > > > +	} else
> > > > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > > > >  }
> > > > >  
> > > > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > > > >  	ioend->io_inode = inode;
> > > > >  	ioend->io_size = 0;
> > > > >  	ioend->io_offset = offset;
> > > > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > > > >  	ioend->io_append_trans = NULL;
> > > > >  	ioend->io_bio = bio;
> > > > >  	return ioend;
> > > > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > > > index 6c2615b83c5d..f62b03186c62 100644
> > > > > --- a/fs/xfs/xfs_aops.h
> > > > > +++ b/fs/xfs/xfs_aops.h
> > > > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > > > >  	struct inode		*io_inode;	/* file being written to */
> > > > >  	size_t			io_size;	/* size of the extent */
> > > > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > > > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > > > >  	struct bio		*io_bio;	/* bio being built */
> > > > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index 245483cc282b..e70e7db29026 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > > > >  	ip->i_flags = 0;
> > > > >  	ip->i_delayed_blks = 0;
> > > > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > > > +	spin_lock_init(&ip->i_iodone_lock);
> > > > >  
> > > > >  	return ip;
> > > > >  }
> > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > index e62074a5257c..88239c2dd824 100644
> > > > > --- a/fs/xfs/xfs_inode.h
> > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > > > >  
> > > > >  	/* VFS inode */
> > > > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > > > +
> > > > > +	/* pending io completions */
> > > > > +	spinlock_t		i_iodone_lock;
> > > > > +	struct work_struct	i_iodone_work;
> > > > > +	struct list_head	i_iodone_list;
> > > > >  } xfs_inode_t;
> > > > >  
> > > > >  /* Convert from vfs inode to xfs inode */
> > > > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > > > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > > > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > > > >  
> > > > > +void xfs_end_io(struct work_struct *work);
> > > > > +
> > > > >  #endif	/* __XFS_INODE_H__ */
> > > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 17:31           ` Brian Foster
@ 2019-04-15 18:00             ` Darrick J. Wong
  2019-04-15 18:13               ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15 18:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 01:31:50PM -0400, Brian Foster wrote:
> On Mon, Apr 15, 2019 at 10:09:35AM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 15, 2019 at 12:13:47PM -0400, Brian Foster wrote:
> > > On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > > > > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > > > > IO completion workqueue items to ensure that filesystem metadata only
> > > > > > updates after the write completes successfully.  This is essential for
> > > > > > converting unwritten extents to real extents at the right time and
> > > > > > performing COW remappings.
> > > > > > 
> > > > > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > > > > workqueue, which means that the kernel can spawn dozens of threads to
> > > > > > try to handle the items quickly.  These threads need to take the ILOCK
> > > > > > to update file metadata, which results in heavy ILOCK contention if a
> > > > > > large number of the work items target a single file, which is
> > > > > > inefficient.
> > > > > > 
> > > > > > Worse yet, the writeback completion threads get stuck waiting for the
> > > > > > ILOCK while holding transaction reservations, which can use up all
> > > > > > available log reservation space.  When that happens, metadata updates to
> > > > > > other parts of the filesystem grind to a halt, even if the filesystem
> > > > > > could otherwise have handled it.
> > > > > > 
> > > > > > Even worse, if one of the things grinding to a halt happens to be a
> > > > > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > > > > trying to obtain more log reservation having exhausted the permanent
> > > > > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > > > > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > > > > > transaction reservation.
> > > > > > 
> > > > > > Therefore, we create a per-inode writeback io completion queue + work
> > > > > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > > > > queue and let the single worker item process that queue.  This
> > > > > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > > > > the system, and seems to have eliminated an occasional deadlock I was
> > > > > > seeing while running generic/476.
> > > > > > 
> > > > > > Testing with a program that simulates a heavy random-write workload to a
> > > > > > single file demonstrates that the number of kworkers drops from
> > > > > > approximately 120 threads per file to 1, without dramatically changing
> > > > > > write bandwidth or pagecache access latency.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > 
> > > > > Thanks for the updated commit log. It looks like a couple nits from the
> > > > > rfc weren't addressed though. Specifically, to rename the workqueue and
> > > > > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > > > > buf (iodone) completion callbacks.
> > > > 
> > > > Oops, I did forget to rename the iodone part.  Will fix.
> > > > 
> > > > > Also, this patch essentially turns the unbound completion work queue
> > > > > into something that should only expect one task running at a time,
> > > > > right? If so, it might make sense to fix up the max_active parameter to
> > > > > the alloc_workqueue() call and ...
> > > > 
> > > > It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> > > > as you decide are necessary to handle the queued work items") here
> > > > because while we only want to have one ioend worker per inode, we still
> > > > want to have as many inodes undergoing ioend processing simultaneously
> > > > as the storage can handle.
> > > > 
> > > 
> > > Ah, right. For some reason I was thinking this would be per-work..
> > > disregard the thinko..
> > > 
> > > > > 
> > > > > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > > > > >  fs/xfs/xfs_aops.h   |    1 -
> > > > > >  fs/xfs/xfs_icache.c |    3 +++
> > > > > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > > > > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > > index 3619e9e8d359..f7a9bb661826 100644
> > > > > > --- a/fs/xfs/xfs_aops.c
> > > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > ...
> > > > > > @@ -278,19 +276,48 @@ xfs_end_io(
> > > > > ...
> > > > > >  STATIC void
> > > > > >  xfs_end_bio(
> > > > > >  	struct bio		*bio)
> > > > > >  {
> > > > > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > > > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > > > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > +	unsigned long		flags;
> > > > > >  
> > > > > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > > > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > > > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > > > > -	else if (ioend->io_append_trans)
> > > > > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > > > > -	else
> > > > > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > > > > +	    ioend->io_append_trans != NULL) {
> > > > > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > > > > +		if (list_empty(&ip->i_iodone_list))
> > > > > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > > > > 
> > > > > ... assert that queue_work() always returns true.
> > > > 
> > > > queue_work can return false here because the worker function only holds
> > > > i_ioend_lock long enough to move the i_ioend_list onto a function-local
> > > > list head.  Once it drops the lock and begins issuing ioend transactions,
> > > > there's nothing preventing another xfs_end_bio from taking the lock,
> > > > seeing the empty list, and (re)queuing i_ioend_work while the worker is
> > > > running.
> > > > 
> > > 
> > > So I'm not familiar with the details of the wq implementation, but I was
> > > assuming that the return value essentially tells us if we've scheduled a
> > > new execution of this work or not. IOW, that the case above would return
> > > true because the current execution has already dequeued the work and
> > > began running it.
> > 
> > It does return true if we've scheduled a new execution of work.  If the
> > work has already been dequeued and is running then it'll queue it again,
> > because process_one_work calls set_work_pool_and_clear_pending to clear
> > the PENDING bit after it's decided to run the work item but before it
> > actually invokes the work function.
> > 
> > > Is that not the case? FWIW, if the queue_work() call doesn't actually
> > > requeue the task if it happens to be running (aside from the semantics
> > 
> > As far as queue_work is concerned there's no difference between a work
> > item that is currently running and a work item that has never been
> > queued -- the result for both cases is to put it on the queue.  So I
> > think the answer to "is that not the case?" is "no, you got it right in
> > the previous paragraph".
> > 
> > But maybe I'll just work out some examples to satisfy my own sense of
> > "egad I looked at the workqueue code and OH MY EYES" :)
> > 
> 
> ;)
> 
> > If queue_work observes that the PENDING bit was set, it'll return false
> > immediately because we know that the work function hasn't been called
> > yet, so whatever we just added to the i_ioend_list will get picked up
> > when the work function gets called.
> > 
> > If queue_work observes that PENDING was not set and the work item isn't
> > queued or being processed anywhere in the system, it'll add the work
> > item to the queue and xfs_end_io can pick up the work from the ioend
> > list whenever it does get called.
> > 
> > If queue_work observes that PENDING was not set and process_one_work has
> > picked up the work item and it is after the point where it clears the
> > bit but before the work function picks up the ioend list, then we'll
> > requeue the work item, but the current work function invocation
> > will process our ioend.  The requeued item will be called again, but it
> > won't find any work and will return.
> > 
> 
> Wouldn't we skip the queue_work() call in this case because the list is
> presumably not empty (since the work being dequeued hasn't run yet to
> empty it)?

Oh, right.  Missed the forest for the weeds, ignore this case. :)

> > If queue_work observes that PENDING was not set and process_one_work has
> > picked up the work item and it is after the point where the work
> > function picks up the ioend list, then we'll requeue the work item, the
> > current work function invocation won't see our new ioend, and the
> > requeued item will be called again and it'll process our new ioend.
> > 
> 
> Doesn't this mean that our queue_work() call should always return true?
> We're only calling it if the list is empty, which means either nothing
> was queued yet or a work item is currently running and is processing
> ioends.

Oh. Yep.  I think you're right that it should always return true.  But,
how would we set up an assert that won't just trigger unused variable
warnings if we're #defining away ASSERT?

--D

> Brian
> 
> > > of the return value), ISTM that we're at risk of leaving ioends around
> > > until something else comes along to kick the handler. A quick look at
> > > the wq code suggests queue_work() returns false only when a particular
> > > queued bit is set and thus the call is essentially a no-op, but that
> > > state appears to be cleared (via set_work_pool_and_clear_pending())
> > > before the work callback is invoked. Hm?
> > 
> > <nod> Does that clear things up?  The wq documentation isn't
> > particularly clear in its description of how queues work for end users
> > like us. :)
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > > > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > > > > +	} else
> > > > > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > > > > >  }
> > > > > >  
> > > > > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > > > > >  	ioend->io_inode = inode;
> > > > > >  	ioend->io_size = 0;
> > > > > >  	ioend->io_offset = offset;
> > > > > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > > > > >  	ioend->io_append_trans = NULL;
> > > > > >  	ioend->io_bio = bio;
> > > > > >  	return ioend;
> > > > > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > > > > index 6c2615b83c5d..f62b03186c62 100644
> > > > > > --- a/fs/xfs/xfs_aops.h
> > > > > > +++ b/fs/xfs/xfs_aops.h
> > > > > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > > > > >  	struct inode		*io_inode;	/* file being written to */
> > > > > >  	size_t			io_size;	/* size of the extent */
> > > > > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > > > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > > > > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > > > > >  	struct bio		*io_bio;	/* bio being built */
> > > > > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > index 245483cc282b..e70e7db29026 100644
> > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > > > > >  	ip->i_flags = 0;
> > > > > >  	ip->i_delayed_blks = 0;
> > > > > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > > > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > > > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > > > > +	spin_lock_init(&ip->i_iodone_lock);
> > > > > >  
> > > > > >  	return ip;
> > > > > >  }
> > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > > index e62074a5257c..88239c2dd824 100644
> > > > > > --- a/fs/xfs/xfs_inode.h
> > > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > > > > >  
> > > > > >  	/* VFS inode */
> > > > > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > > > > +
> > > > > > +	/* pending io completions */
> > > > > > +	spinlock_t		i_iodone_lock;
> > > > > > +	struct work_struct	i_iodone_work;
> > > > > > +	struct list_head	i_iodone_list;
> > > > > >  } xfs_inode_t;
> > > > > >  
> > > > > >  /* Convert from vfs inode to xfs inode */
> > > > > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > > > > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > > > > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > > > > >  
> > > > > > +void xfs_end_io(struct work_struct *work);
> > > > > > +
> > > > > >  #endif	/* __XFS_INODE_H__ */
> > > > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 18:00             ` Darrick J. Wong
@ 2019-04-15 18:13               ` Brian Foster
  2019-04-15 19:27                 ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-04-15 18:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 11:00:43AM -0700, Darrick J. Wong wrote:
> On Mon, Apr 15, 2019 at 01:31:50PM -0400, Brian Foster wrote:
> > On Mon, Apr 15, 2019 at 10:09:35AM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 15, 2019 at 12:13:47PM -0400, Brian Foster wrote:
> > > > On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > > > > > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > 
> > > > > > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > > > > > IO completion workqueue items to ensure that filesystem metadata only
> > > > > > > updates after the write completes successfully.  This is essential for
> > > > > > > converting unwritten extents to real extents at the right time and
> > > > > > > performing COW remappings.
> > > > > > > 
> > > > > > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > > > > > workqueue, which means that the kernel can spawn dozens of threads to
> > > > > > > try to handle the items quickly.  These threads need to take the ILOCK
> > > > > > > to update file metadata, which results in heavy ILOCK contention if a
> > > > > > > large number of the work items target a single file, which is
> > > > > > > inefficient.
> > > > > > > 
> > > > > > > Worse yet, the writeback completion threads get stuck waiting for the
> > > > > > > ILOCK while holding transaction reservations, which can use up all
> > > > > > > available log reservation space.  When that happens, metadata updates to
> > > > > > > other parts of the filesystem grind to a halt, even if the filesystem
> > > > > > > could otherwise have handled it.
> > > > > > > 
> > > > > > > Even worse, if one of the things grinding to a halt happens to be a
> > > > > > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > > > > > trying to obtain more log reservation having exhausted the permanent
> > > > > > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > > > > > reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> > > > > > > transaction reservation.
> > > > > > > 
> > > > > > > Therefore, we create a per-inode writeback io completion queue + work
> > > > > > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > > > > > queue and let the single worker item process that queue.  This
> > > > > > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > > > > > the system, and seems to have eliminated an occasional deadlock I was
> > > > > > > seeing while running generic/476.
> > > > > > > 
> > > > > > > Testing with a program that simulates a heavy random-write workload to a
> > > > > > > single file demonstrates that the number of kworkers drops from
> > > > > > > approximately 120 threads per file to 1, without dramatically changing
> > > > > > > write bandwidth or pagecache access latency.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > 
> > > > > > Thanks for the updated commit log. It looks like a couple nits from the
> > > > > > rfc weren't addressed though. Specifically, to rename the workqueue and
> > > > > > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > > > > > buf (iodone) completion callbacks.
> > > > > 
> > > > > Oops, I did forget to rename the iodone part.  Will fix.
> > > > > 
> > > > > > Also, this patch essentially turns the unbound completion work queue
> > > > > > into something that should only expect one task running at a time,
> > > > > > right? If so, it might make sense to fix up the max_active parameter to
> > > > > > the alloc_workqueue() call and ...
> > > > > 
> > > > > It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> > > > > as you decide are necessary to handle the queued work items") here
> > > > > because while we only want to have one ioend worker per inode, we still
> > > > > want to have as many inodes undergoing ioend processing simultaneously
> > > > > as the storage can handle.
> > > > > 
> > > > 
> > > > Ah, right. For some reason I was thinking this would be per-work..
> > > > disregard the thinko..
> > > > 
> > > > > > 
> > > > > > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > > > > > >  fs/xfs/xfs_aops.h   |    1 -
> > > > > > >  fs/xfs/xfs_icache.c |    3 +++
> > > > > > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > > > > > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > > > index 3619e9e8d359..f7a9bb661826 100644
> > > > > > > --- a/fs/xfs/xfs_aops.c
> > > > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > > ...
> > > > > > > @@ -278,19 +276,48 @@ xfs_end_io(
> > > > > > ...
> > > > > > >  STATIC void
> > > > > > >  xfs_end_bio(
> > > > > > >  	struct bio		*bio)
> > > > > > >  {
> > > > > > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > > > > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > > > > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > > +	unsigned long		flags;
> > > > > > >  
> > > > > > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > > > > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > > > > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > > > > > -	else if (ioend->io_append_trans)
> > > > > > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > > > > > -	else
> > > > > > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > > > > > +	    ioend->io_append_trans != NULL) {
> > > > > > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > > > > > +		if (list_empty(&ip->i_iodone_list))
> > > > > > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > > > > > 
> > > > > > ... assert that queue_work() always returns true.
> > > > > 
> > > > > queue_work can return false here because the worker function only holds
> > > > > i_ioend_lock long enough to move the i_ioend_list onto a function-local
> > > > > list head.  Once it drops the lock and begins issuing ioend transactions,
> > > > > there's nothing preventing another xfs_end_bio from taking the lock,
> > > > > seeing the empty list, and (re)queuing i_ioend_work while the worker is
> > > > > running.
> > > > > 
> > > > 
> > > > So I'm not familiar with the details of the wq implementation, but I was
> > > > assuming that the return value essentially tells us if we've scheduled a
> > > > new execution of this work or not. IOW, that the case above would return
> > > > true because the current execution has already dequeued the work and
> > > > began running it.
> > > 
> > > It does return true if we've scheduled a new execution of work.  If the
> > > work has already been dequeued and is running then it'll queue it again,
> > > because process_one_work calls set_work_pool_and_clear_pending to clear
> > > the PENDING bit after it's decided to run the work item but before it
> > > actually invokes the work function.
> > > 
> > > > Is that not the case? FWIW, if the queue_work() call doesn't actually
> > > > requeue the task if it happens to be running (aside from the semantics
> > > 
> > > As far as queue_work is concerned there's no difference between a work
> > > item that is currently running and a work item that has never been
> > > queued -- the result for both cases is to put it on the queue.  So I
> > > think the answer to "is that not the case?" is "no, you got it right in
> > > the previous paragraph".
> > > 
> > > But maybe I'll just work out some examples to satisfy my own sense of
> > > "egad I looked at the workqueue code and OH MY EYES" :)
> > > 
> > 
> > ;)
> > 
> > > If queue_work observes that the PENDING bit was set, it'll return false
> > > immediately because we know that the work function hasn't been called
> > > yet, so whatever we just added to the i_ioend_list will get picked up
> > > when the work function gets called.
> > > 
> > > If queue_work observes that PENDING was not set and the work item isn't
> > > queued or being processed anywhere in the system, it'll add the work
> > > item to the queue and xfs_end_io can pick up the work from the ioend
> > > list whenever it does get called.
> > > 
> > > If queue_work observes that PENDING was not set and process_one_work has
> > > picked up the work item and it is after the point where it clears the
> > > bit but before the work function picks up the ioend list, then we'll
> > > requeue the work item, but the current work function invocation
> > > will process our ioend.  The requeued item will be called again, but it
> > > won't find any work and will return.
> > > 
> > 
> > Wouldn't we skip the queue_work() call in this case because the list is
> > presumably not empty (since the work being dequeued hasn't run yet to
> > empty it)?
> 
> Oh, right.  Missed the forest for the weeds, ignore this case. :)
> 
> > > If queue_work observes that PENDING was not set and process_one_work has
> > > picked up the work item and it is after the point where the work
> > > function picks up the ioend list, then we'll requeue the work item, the
> > > current work function invocation won't see our new ioend, and the
> > > requeued item will be called again and it'll process our new ioend.
> > > 
> > 
> > Doesn't this mean that our queue_work() call should always return true?
> > We're only calling it if the list is empty, which means either nothing
> > was queued yet or a work item is currently running and is processing
> > ioends.
> 
> Oh. Yep.  I think you're right that it should always return true.  But,
> how would we set up an assert that won't just trigger unused variable
> warnings if we're #defining away ASSERT?
> 

Ok. I guess we'd have to ifdef it (or use a WARN_ON[_ONCE]() or
something), but now that we're on the same page, it's probably not worth
resending again just for this anyways.
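
Purely for illustration (not something in the patch as posted), the
WARN_ON_ONCE() idea could look roughly like this in xfs_end_bio(), which
avoids both the #ifdef and the otherwise-unused variable:

	if (list_empty(&ip->i_ioend_list))
		WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
					 &ip->i_ioend_work));
	list_add_tail(&ioend->io_list, &ip->i_ioend_list);

Since WARN_ON_ONCE() still evaluates its condition even when the warning
machinery is compiled down, the queue_work() return value is always
consumed and there's no unused-variable problem.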

Brian

> --D
> 
> > Brian
> > 
> > > > of the return value), ISTM that we're at risk of leaving ioends around
> > > > until something else comes along to kick the handler. A quick look at
> > > > the wq code suggests queue_work() returns false only when a particular
> > > > queued bit is set and thus the call is essentially a no-op, but that
> > > > state appears to be cleared (via set_work_pool_and_clear_pending())
> > > > before the work callback is invoked. Hm?
> > > 
> > > <nod> Does that clear things up?  The wq documentation isn't
> > > particularly clear in its description of how queues work for end users
> > > like us. :)
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > > > > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > > > > > +	} else
> > > > > > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > > > > > >  }
> > > > > > >  
> > > > > > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > > > > > >  	ioend->io_inode = inode;
> > > > > > >  	ioend->io_size = 0;
> > > > > > >  	ioend->io_offset = offset;
> > > > > > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > > > > > >  	ioend->io_append_trans = NULL;
> > > > > > >  	ioend->io_bio = bio;
> > > > > > >  	return ioend;
> > > > > > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > > > > > index 6c2615b83c5d..f62b03186c62 100644
> > > > > > > --- a/fs/xfs/xfs_aops.h
> > > > > > > +++ b/fs/xfs/xfs_aops.h
> > > > > > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > > > > > >  	struct inode		*io_inode;	/* file being written to */
> > > > > > >  	size_t			io_size;	/* size of the extent */
> > > > > > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > > > > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > > > > > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > > > > > >  	struct bio		*io_bio;	/* bio being built */
> > > > > > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > index 245483cc282b..e70e7db29026 100644
> > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > > > > > >  	ip->i_flags = 0;
> > > > > > >  	ip->i_delayed_blks = 0;
> > > > > > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > > > > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > > > > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > > > > > +	spin_lock_init(&ip->i_iodone_lock);
> > > > > > >  
> > > > > > >  	return ip;
> > > > > > >  }
> > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > > > index e62074a5257c..88239c2dd824 100644
> > > > > > > --- a/fs/xfs/xfs_inode.h
> > > > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > > > > > >  
> > > > > > >  	/* VFS inode */
> > > > > > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > > > > > +
> > > > > > > +	/* pending io completions */
> > > > > > > +	spinlock_t		i_iodone_lock;
> > > > > > > +	struct work_struct	i_iodone_work;
> > > > > > > +	struct list_head	i_iodone_list;
> > > > > > >  } xfs_inode_t;
> > > > > > >  
> > > > > > >  /* Convert from vfs inode to xfs inode */
> > > > > > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > > > > > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > > > > > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > > > > > >  
> > > > > > > +void xfs_end_io(struct work_struct *work);
> > > > > > > +
> > > > > > >  #endif	/* __XFS_INODE_H__ */
> > > > > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 16:13   ` [PATCH v2 " Darrick J. Wong
@ 2019-04-15 18:13     ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2019-04-15 18:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 09:13:36AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When scheduling writeback of dirty file data in the page cache, XFS uses
> IO completion workqueue items to ensure that filesystem metadata only
> updates after the write completes successfully.  This is essential for
> converting unwritten extents to real extents at the right time and
> performing COW remappings.
> 
> Unfortunately, XFS queues each IO completion work item to an unbounded
> workqueue, which means that the kernel can spawn dozens of threads to
> try to handle the items quickly.  These threads need to take the ILOCK
> to update file metadata, which results in heavy ILOCK contention if a
> large number of the work items target a single file, which is
> inefficient.
> 
> Worse yet, the writeback completion threads get stuck waiting for the
> ILOCK while holding transaction reservations, which can use up all
> available log reservation space.  When that happens, metadata updates to
> other parts of the filesystem grind to a halt, even if the filesystem
> could otherwise have handled it.
> 
> Even worse, if one of the things grinding to a halt happens to be a
> thread in the middle of a defer-ops finish holding the same ILOCK and
> trying to obtain more log reservation having exhausted the permanent
> reservation, we now have an ABBA deadlock - writeback has a transaction
> reserved and wants the ILOCK, and someone else has the ILOCK and wants a
> transaction reservation.
> 
> Therefore, we create a per-inode writeback io completion queue + work
> item.  When writeback finishes, it can add the ioend to the per-inode
> queue and let the single worker item process that queue.  This
> dramatically cuts down on the number of kworkers and ILOCK contention in
> the system, and seems to have eliminated an occasional deadlock I was
> seeing while running generic/476.
> 
> Testing with a program that simulates a heavy random-write workload to a
> single file demonstrates that the number of kworkers drops from
> approximately 120 threads per file to 1, without dramatically changing
> write bandwidth or pagecache access latency.
> 
> Note that we leave the xfs-conv workqueue's max_active alone because we
> still want to be able to run ioend processing for as many inodes as the
> system can handle.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: rename i_iodone_* -> i_ioend_* to avoid naming collision
> ---

Looks good, thanks!

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_aops.h   |    1 -
>  fs/xfs/xfs_icache.c |    3 +++
>  fs/xfs/xfs_inode.h  |    7 +++++++
>  4 files changed, 47 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3619e9e8d359..0dd249115f3c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -234,11 +234,9 @@ xfs_setfilesize_ioend(
>   * IO write completion.
>   */
>  STATIC void
> -xfs_end_io(
> -	struct work_struct *work)
> +xfs_end_ioend(
> +	struct xfs_ioend	*ioend)
>  {
> -	struct xfs_ioend	*ioend =
> -		container_of(work, struct xfs_ioend, io_work);
>  	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
>  	xfs_off_t		offset = ioend->io_offset;
>  	size_t			size = ioend->io_size;
> @@ -278,19 +276,48 @@ xfs_end_io(
>  	xfs_destroy_ioend(ioend, error);
>  }
>  
> +/* Finish all pending io completions. */
> +void
> +xfs_end_io(
> +	struct work_struct	*work)
> +{
> +	struct xfs_inode	*ip;
> +	struct xfs_ioend	*ioend;
> +	struct list_head	completion_list;
> +	unsigned long		flags;
> +
> +	ip = container_of(work, struct xfs_inode, i_ioend_work);
> +
> +	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> +	list_replace_init(&ip->i_ioend_list, &completion_list);
> +	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> +
> +	while (!list_empty(&completion_list)) {
> +		ioend = list_first_entry(&completion_list, struct xfs_ioend,
> +				io_list);
> +		list_del_init(&ioend->io_list);
> +		xfs_end_ioend(ioend);
> +	}
> +}
> +
>  STATIC void
>  xfs_end_bio(
>  	struct bio		*bio)
>  {
>  	struct xfs_ioend	*ioend = bio->bi_private;
> -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	unsigned long		flags;
>  
>  	if (ioend->io_fork == XFS_COW_FORK ||
> -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> -	else if (ioend->io_append_trans)
> -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> -	else
> +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> +	    ioend->io_append_trans != NULL) {
> +		spin_lock_irqsave(&ip->i_ioend_lock, flags);
> +		if (list_empty(&ip->i_ioend_list))
> +			queue_work(mp->m_unwritten_workqueue, &ip->i_ioend_work);
> +		list_add_tail(&ioend->io_list, &ip->i_ioend_list);
> +		spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> +	} else
>  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
>  }
>  
> @@ -594,7 +621,6 @@ xfs_alloc_ioend(
>  	ioend->io_inode = inode;
>  	ioend->io_size = 0;
>  	ioend->io_offset = offset;
> -	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
>  	return ioend;
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 6c2615b83c5d..f62b03186c62 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -18,7 +18,6 @@ struct xfs_ioend {
>  	struct inode		*io_inode;	/* file being written to */
>  	size_t			io_size;	/* size of the extent */
>  	xfs_off_t		io_offset;	/* offset in the file */
> -	struct work_struct	io_work;	/* xfsdatad work queue */
>  	struct xfs_trans	*io_append_trans;/* xact. for size update */
>  	struct bio		*io_bio;	/* bio being built */
>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 245483cc282b..1237b775d7f7 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -70,6 +70,9 @@ xfs_inode_alloc(
>  	ip->i_flags = 0;
>  	ip->i_delayed_blks = 0;
>  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> +	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
> +	INIT_LIST_HEAD(&ip->i_ioend_list);
> +	spin_lock_init(&ip->i_ioend_lock);
>  
>  	return ip;
>  }
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e62074a5257c..b9ee3dfc104a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -57,6 +57,11 @@ typedef struct xfs_inode {
>  
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
> +
> +	/* pending io completions */
> +	spinlock_t		i_ioend_lock;
> +	struct work_struct	i_ioend_work;
> +	struct list_head	i_ioend_list;
>  } xfs_inode_t;
>  
>  /* Convert from vfs inode to xfs inode */
> @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
>  int xfs_iunlink_init(struct xfs_perag *pag);
>  void xfs_iunlink_destroy(struct xfs_perag *pag);
>  
> +void xfs_end_io(struct work_struct *work);
> +
>  #endif	/* __XFS_INODE_H__ */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15 18:13               ` Brian Foster
@ 2019-04-15 19:27                 ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-15 19:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Apr 15, 2019 at 02:13:27PM -0400, Brian Foster wrote:
> On Mon, Apr 15, 2019 at 11:00:43AM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 15, 2019 at 01:31:50PM -0400, Brian Foster wrote:
> > > On Mon, Apr 15, 2019 at 10:09:35AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Apr 15, 2019 at 12:13:47PM -0400, Brian Foster wrote:
> > > > > On Mon, Apr 15, 2019 at 08:53:20AM -0700, Darrick J. Wong wrote:
> > > > > > On Mon, Apr 15, 2019 at 10:49:28AM -0400, Brian Foster wrote:
> > > > > > > On Sun, Apr 14, 2019 at 07:07:48PM -0700, Darrick J. Wong wrote:
> > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > 
> > > > > > > > When scheduling writeback of dirty file data in the page cache, XFS uses
> > > > > > > > IO completion workqueue items to ensure that filesystem metadata only
> > > > > > > > updates after the write completes successfully.  This is essential for
> > > > > > > > converting unwritten extents to real extents at the right time and
> > > > > > > > performing COW remappings.
> > > > > > > > 
> > > > > > > > Unfortunately, XFS queues each IO completion work item to an unbounded
> > > > > > > > workqueue, which means that the kernel can spawn dozens of threads to
> > > > > > > > try to handle the items quickly.  These threads need to take the ILOCK
> > > > > > > > to update file metadata, which results in heavy ILOCK contention if a
> > > > > > > > large number of the work items target a single file, which is
> > > > > > > > inefficient.
> > > > > > > > 
> > > > > > > > Worse yet, the writeback completion threads get stuck waiting for the
> > > > > > > > ILOCK while holding transaction reservations, which can use up all
> > > > > > > > available log reservation space.  When that happens, metadata updates to
> > > > > > > > other parts of the filesystem grind to a halt, even if the filesystem
> > > > > > > > could otherwise have handled it.
> > > > > > > > 
> > > > > > > > Even worse, if one of the things grinding to a halt happens to be a
> > > > > > > > thread in the middle of a defer-ops finish holding the same ILOCK and
> > > > > > > > trying to obtain more log reservation having exhausted the permanent
> > > > > > > > reservation, we now have an ABBA deadlock - writeback has a transaction
> > > > > > > > reserved and wants the ILOCK, and someone else has the ILOCk and wants a
> > > > > > > > transaction reservation.
> > > > > > > > 
> > > > > > > > Therefore, we create a per-inode writeback io completion queue + work
> > > > > > > > item.  When writeback finishes, it can add the ioend to the per-inode
> > > > > > > > queue and let the single worker item process that queue.  This
> > > > > > > > dramatically cuts down on the number of kworkers and ILOCK contention in
> > > > > > > > the system, and seems to have eliminated an occasional deadlock I was
> > > > > > > > seeing while running generic/476.
> > > > > > > > 
> > > > > > > > Testing with a program that simulates a heavy random-write workload to a
> > > > > > > > single file demonstrates that the number of kworkers drops from
> > > > > > > > approximately 120 threads per file to 1, without dramatically changing
> > > > > > > > write bandwidth or pagecache access latency.
> > > > > > > > 
> > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > ---
> > > > > > > 
> > > > > > > Thanks for the updated commit log. It looks like a couple nits from the
> > > > > > > rfc weren't addressed though. Specifically, to rename the workqueue and
> > > > > > > iodone bits to ioend_[workqueue|list|lock] so as to not confuse with log
> > > > > > > buf (iodone) completion callbacks.
> > > > > > 
> > > > > > Oops, I did forget to rename the iodone part.  Will fix.
> > > > > > 
> > > > > > > Also, this patch essentially turns the unbound completion work queue
> > > > > > > into something that should only expect one task running at a time,
> > > > > > > right? If so, it might make sense to fix up the max_active parameter to
> > > > > > > the alloc_workqueue() call and ...
> > > > > > 
> > > > > > It's fine to leave max_active == 0 (i.e. "spawn as many kworker threads
> > > > > > as you decide are necessary to handle the queued work items") here
> > > > > > because while we only want to have one ioend worker per inode, we still
> > > > > > want to have as many inodes undergoing ioend processing simultaneously
> > > > > > as the storage can handle.
> > > > > > 
> > > > > 
> > > > > Ah, right. For some reason I was thinking this would be per-work..
> > > > > disregard the thinko..
> > > > > 
> > > > > > > 
> > > > > > > >  fs/xfs/xfs_aops.c   |   48 +++++++++++++++++++++++++++++++++++++-----------
> > > > > > > >  fs/xfs/xfs_aops.h   |    1 -
> > > > > > > >  fs/xfs/xfs_icache.c |    3 +++
> > > > > > > >  fs/xfs/xfs_inode.h  |    7 +++++++
> > > > > > > >  4 files changed, 47 insertions(+), 12 deletions(-)
> > > > > > > > 
> > > > > > > > 
> > > > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > > > > index 3619e9e8d359..f7a9bb661826 100644
> > > > > > > > --- a/fs/xfs/xfs_aops.c
> > > > > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > > > ...
> > > > > > > > @@ -278,19 +276,48 @@ xfs_end_io(
> > > > > > > ...
> > > > > > > >  STATIC void
> > > > > > > >  xfs_end_bio(
> > > > > > > >  	struct bio		*bio)
> > > > > > > >  {
> > > > > > > >  	struct xfs_ioend	*ioend = bio->bi_private;
> > > > > > > > -	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> > > > > > > > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > > > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > > > +	unsigned long		flags;
> > > > > > > >  
> > > > > > > >  	if (ioend->io_fork == XFS_COW_FORK ||
> > > > > > > > -	    ioend->io_state == XFS_EXT_UNWRITTEN)
> > > > > > > > -		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> > > > > > > > -	else if (ioend->io_append_trans)
> > > > > > > > -		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > > > > > > > -	else
> > > > > > > > +	    ioend->io_state == XFS_EXT_UNWRITTEN ||
> > > > > > > > +	    ioend->io_append_trans != NULL) {
> > > > > > > > +		spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > > > > > > > +		if (list_empty(&ip->i_iodone_list))
> > > > > > > > +			queue_work(mp->m_unwritten_workqueue, &ip->i_iodone_work);
> > > > > > > 
> > > > > > > ... assert that queue_work() always returns true.
> > > > > > 
> > > > > > queue_work can return false here because the worker function only holds
> > > > > > i_ioend_lock long enough to move the i_ioend_list onto a function-local
> > > > > > list head.  Once it drops the lock and begins issuing ioend transactions,
> > > > > > there's nothing preventing another xfs_end_bio from taking the lock,
> > > > > > seeing the empty list, and (re)queuing i_ioend_work while the worker is
> > > > > > running.
> > > > > > 
> > > > > 
> > > > > So I'm not familiar with the details of the wq implementation, but I was
> > > > > assuming that the return value essentially tells us if we've scheduled a
> > > > > new execution of this work or not. IOW, that the case above would return
> > > > > true because the current execution has already dequeued the work and
> > > > > begun running it.
> > > > 
> > > > It does return true if we've scheduled a new execution of work.  If the
> > > > work has already been dequeued and is running then it'll queue it again,
> > > > because process_one_work calls set_work_pool_and_clear_pending to clear
> > > > the PENDING bit after it's decided to run the work item but before it
> > > > actually invokes the work function.
> > > > 
> > > > > Is that not the case? FWIW, if the queue_work() call doesn't actually
> > > > > requeue the task if it happens to be running (aside from the semantics
> > > > 
> > > > As far as queue_work is concerned there's no difference between a work
> > > > item that is currently running and a work item that has never been
> > > > queued -- the result for both cases is to put it on the queue.  So I
> > > > think the answer to "is that not the case?" is "no, you got it right in
> > > > the previous paragraph".
> > > > 
> > > > But maybe I'll just work out some examples to satisfy my own sense of
> > > > "egad I looked at the workqueue code and OH MY EYES" :)
> > > > 
> > > 
> > > ;)
> > > 
> > > > If queue_work observes that the PENDING bit was set, it'll return false
> > > > immediately because we know that the work function hasn't been called
> > > > yet, so whatever we just added to the i_ioend_list will get picked up
> > > > when the work function gets called.
> > > > 
> > > > If queue_work observes that PENDING was not set and the work item isn't
> > > > queued or being processed anywhere in the system, it'll add the work
> > > > item to the queue and xfs_end_io can pick up the work from the ioend
> > > > list whenever it does get called.
> > > > 
> > > > If queue_work observes that PENDING was not set and process_one_work has
> > > > picked up the work item and it is after the point where it clears the
> > > > bit but before the work function picks up the ioend list, then we'll
> > > > requeue the work item, but the current work function invocation
> > > > will process our ioend.  The requeued item will be called again, but it
> > > > won't find any work and will return.
> > > > 
> > > 
> > > Wouldn't we skip the queue_work() call in this case because the list is
> > > presumably not empty (since the work being dequeued hasn't run yet to
> > > empty it)?
> > 
> > Oh, right.  Missed the forest for the weeds, ignore this case. :)
> > 
> > > > If queue_work observes that PENDING was not set and process_one_work has
> > > > picked up the work item and it is after the point where the work
> > > > function picks up the ioend list, then we'll requeue the work item, the
> > > > current work function invocation won't see our new ioend, and the
> > > > requeued item will be called again and it'll process our new ioend.
> > > > 
> > > 
> > > Doesn't this mean that our queue_work() call should always return true?
> > > We're only calling it if the list is empty, which means either nothing
> > > was queued yet or a work item is currently running and is processing
> > > ioends.
> > 
> > Oh. Yep.  I think you're right that it should always return true.  But,
> > how would we set up an assert that won't just trigger unused variable
> > warnings if we're #defining away ASSERT?
> > 
> 
> Ok. I guess we'd have to ifdef it (or use a WARN_ON[_ONCE]() or
> something), but now that we're on the same page, it's probably not worth
> resending again just for this anyway.

Yeah, I'll add a WARN_ON_ONCE on the way in.  Thanks for reviewing! :)
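
Roughly like this in xfs_end_bio() -- an untested sketch, just to show
where the warning would sit relative to the existing locking:

	spin_lock_irqsave(&ip->i_ioend_lock, flags);
	if (list_empty(&ip->i_ioend_list)) {
		/*
		 * An empty list means nobody else has the per-inode work
		 * item scheduled, so queue_work() should always report
		 * that it queued something here.
		 */
		WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
				&ip->i_ioend_work));
	}
	list_add_tail(&ioend->io_list, &ip->i_ioend_list);
	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);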

--D

> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > > of the return value), ISTM that we're at risk of leaving ioends around
> > > > > until something else comes along to kick the handler. A quick look at
> > > > > the wq code suggests queue_work() returns false only when a particular
> > > > > queued bit is set and thus the call is essentially a no-op, but that
> > > > > state appears to be cleared (via set_work_pool_and_clear_pending())
> > > > > before the work callback is invoked. Hm?
> > > > 
> > > > <nod> Does that clear things up?  The wq documentation isn't
> > > > particularly clear in its description of how queues work for end users
> > > > like us. :)
> > > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > +		list_add_tail(&ioend->io_list, &ip->i_iodone_list);
> > > > > > > > +		spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> > > > > > > > +	} else
> > > > > > > >  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> > > > > > > >  	ioend->io_inode = inode;
> > > > > > > >  	ioend->io_size = 0;
> > > > > > > >  	ioend->io_offset = offset;
> > > > > > > > -	INIT_WORK(&ioend->io_work, xfs_end_io);
> > > > > > > >  	ioend->io_append_trans = NULL;
> > > > > > > >  	ioend->io_bio = bio;
> > > > > > > >  	return ioend;
> > > > > > > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > > > > > > index 6c2615b83c5d..f62b03186c62 100644
> > > > > > > > --- a/fs/xfs/xfs_aops.h
> > > > > > > > +++ b/fs/xfs/xfs_aops.h
> > > > > > > > @@ -18,7 +18,6 @@ struct xfs_ioend {
> > > > > > > >  	struct inode		*io_inode;	/* file being written to */
> > > > > > > >  	size_t			io_size;	/* size of the extent */
> > > > > > > >  	xfs_off_t		io_offset;	/* offset in the file */
> > > > > > > > -	struct work_struct	io_work;	/* xfsdatad work queue */
> > > > > > > >  	struct xfs_trans	*io_append_trans;/* xact. for size update */
> > > > > > > >  	struct bio		*io_bio;	/* bio being built */
> > > > > > > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > > index 245483cc282b..e70e7db29026 100644
> > > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > > @@ -70,6 +70,9 @@ xfs_inode_alloc(
> > > > > > > >  	ip->i_flags = 0;
> > > > > > > >  	ip->i_delayed_blks = 0;
> > > > > > > >  	memset(&ip->i_d, 0, sizeof(ip->i_d));
> > > > > > > > +	INIT_WORK(&ip->i_iodone_work, xfs_end_io);
> > > > > > > > +	INIT_LIST_HEAD(&ip->i_iodone_list);
> > > > > > > > +	spin_lock_init(&ip->i_iodone_lock);
> > > > > > > >  
> > > > > > > >  	return ip;
> > > > > > > >  }
> > > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > > > > index e62074a5257c..88239c2dd824 100644
> > > > > > > > --- a/fs/xfs/xfs_inode.h
> > > > > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > > > > @@ -57,6 +57,11 @@ typedef struct xfs_inode {
> > > > > > > >  
> > > > > > > >  	/* VFS inode */
> > > > > > > >  	struct inode		i_vnode;	/* embedded VFS inode */
> > > > > > > > +
> > > > > > > > +	/* pending io completions */
> > > > > > > > +	spinlock_t		i_iodone_lock;
> > > > > > > > +	struct work_struct	i_iodone_work;
> > > > > > > > +	struct list_head	i_iodone_list;
> > > > > > > >  } xfs_inode_t;
> > > > > > > >  
> > > > > > > >  /* Convert from vfs inode to xfs inode */
> > > > > > > > @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > > > > > > >  int xfs_iunlink_init(struct xfs_perag *pag);
> > > > > > > >  void xfs_iunlink_destroy(struct xfs_perag *pag);
> > > > > > > >  
> > > > > > > > +void xfs_end_io(struct work_struct *work);
> > > > > > > > +
> > > > > > > >  #endif	/* __XFS_INODE_H__ */
> > > > > > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
  2019-04-15 14:49   ` Brian Foster
  2019-04-15 16:13   ` [PATCH v2 " Darrick J. Wong
@ 2019-04-23  6:34   ` Christoph Hellwig
  2019-04-23 15:29     ` Darrick J. Wong
  2 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2019-04-23  6:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

I'm a little late to the game, sorry..

But:

> +
> +	spin_lock_irqsave(&ip->i_iodone_lock, flags);
> +	list_replace_init(&ip->i_iodone_list, &completion_list);
> +	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);

I really don't like this new per-inode lock, or the irq disabling at
all.  Especially as the pattern perfectly fits the lockless lists
(llists).  Can you check if those work here?
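
For illustration only -- a completely untested sketch, with made-up
field names (an i_ioend_llist llist_head in the inode, an io_llnode
llist_node in the ioend):

	/* submission side: no spinlock, no irq fiddling */
	if (llist_add(&ioend->io_llnode, &ip->i_ioend_llist))
		queue_work(mp->m_unwritten_workqueue, &ip->i_ioend_work);

	/* worker side: take the whole chain in one atomic operation */
	struct llist_node *first = llist_del_all(&ip->i_ioend_llist);

llist_add() returns true only when the list was empty beforehand, which
gives you the "queue the work item exactly once" behaviour for free.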

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] xfs: implement per-inode writeback completion queues
  2019-04-23  6:34   ` [PATCH " Christoph Hellwig
@ 2019-04-23 15:29     ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-04-23 15:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Apr 22, 2019 at 11:34:21PM -0700, Christoph Hellwig wrote:
> I'm a little late to the game, sorry..
> 
> But:
> 
> > +
> > +	spin_lock_irqsave(&ip->i_iodone_lock, flags);
> > +	list_replace_init(&ip->i_iodone_list, &completion_list);
> > +	spin_unlock_irqrestore(&ip->i_iodone_lock, flags);
> 
> I really don't like this new per-inode lock, and the irq disabling at
> all.  Especially as the pattern perfectly fits the lockless lists (llists).
> Can you check if these work here?

They certainly look promising, though the singly-linked list nature
means it can't be a direct replacement for xfs_ioend.io_list because we
still need to be able to sort and merge completions.  However, adding
another pointer to xfs_ioend won't change the objects per slab ratio, so
avoiding the spinlock heartburn might be worth it...
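
Something like this in the worker, perhaps -- untested, and io_llnode /
i_ioend_llist are made-up names for the extra llist_node in the ioend
and the llist_head in the inode:

	struct xfs_ioend	*ioend, *next;
	struct llist_node	*first;
	LIST_HEAD(completion_list);

	/* drain the lockless delivery list in one shot */
	first = llist_del_all(&ip->i_ioend_llist);

	/*
	 * llist hands the entries back newest-first, so move each ioend
	 * onto a regular list_head where it can be sorted and merged
	 * with its neighbours as before.
	 */
	llist_for_each_entry_safe(ioend, next, first, io_llnode)
		list_add_tail(&ioend->io_list, &completion_list);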

--D

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-04-23 15:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-15  2:07 [PATCH v2 0/3] xfs: merged io completions Darrick J. Wong
2019-04-15  2:07 ` [PATCH 1/3] xfs: implement per-inode writeback completion queues Darrick J. Wong
2019-04-15 14:49   ` Brian Foster
2019-04-15 15:53     ` Darrick J. Wong
2019-04-15 16:13       ` Brian Foster
2019-04-15 17:09         ` Darrick J. Wong
2019-04-15 17:31           ` Brian Foster
2019-04-15 18:00             ` Darrick J. Wong
2019-04-15 18:13               ` Brian Foster
2019-04-15 19:27                 ` Darrick J. Wong
2019-04-15 16:13   ` [PATCH v2 " Darrick J. Wong
2019-04-15 18:13     ` Brian Foster
2019-04-23  6:34   ` [PATCH " Christoph Hellwig
2019-04-23 15:29     ` Darrick J. Wong
2019-04-15  2:07 ` [PATCH 2/3] xfs: remove unused m_data_workqueue Darrick J. Wong
2019-04-15 14:49   ` Brian Foster
2019-04-15  2:08 ` [PATCH 3/3] xfs: merge adjacent io completions of the same type Darrick J. Wong
2019-04-15 14:49   ` Brian Foster
