* defer all write I/O completions to process context
@ 2023-03-14 16:59 Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
                   ` (10 more replies)
  0 siblings, 11 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

Hi all,

based on some of my own projects and Johannes' raid-stripe-tree work it
has become a bit painful to drive much of the write I/O completion from
the ->bi_end_io handler, which can run in irq context.  This series
follows the steps that XFS took about 10 years ago and defers all the
end_io handling to process context using a workqueue, which also allows
removing all irq locking in btrfs (save one spot that interacts with
the pagecache).
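
To illustrate the shape of the change, here is a minimal sketch of the
pattern (the my_* names and the my_write_workqueue pointer are made up
for illustration and are not code from this series):

struct my_write_bio {
	struct work_struct end_io_work;
	struct bio bio;
};

/* Runs in process context, so plain (non irq-safe) locking is enough. */
static void my_write_end_io_work(struct work_struct *work)
{
	struct my_write_bio *wbio =
		container_of(work, struct my_write_bio, end_io_work);

	my_finish_write(&wbio->bio);	/* hypothetical completion helper */
}

/*
 * ->bi_end_io handler: may run in hard or soft irq context, so only hand
 * the bio off to a workqueue and do the real completion work above.
 */
static void my_write_end_io(struct bio *bio)
{
	struct my_write_bio *wbio = container_of(bio, struct my_write_bio, bio);

	INIT_WORK(&wbio->end_io_work, my_write_end_io_work);
	queue_work(my_write_workqueue, &wbio->end_io_work);
}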

I've run various data and metadata benchmarks, and the only significant
change is that the high and low outliers in repeated runs of metadata
heavy workloads seem to shrink while the average remains the same.  I'm
of course also interested in runs of additional workloads or systems as
this is a fairly significant change.

Diffstat:
 async-thread.c   |   17 ++++-------
 bio.c            |   82 ++++++++++++++++++++++++++++++++++++-------------------
 compression.c    |   34 ++++++----------------
 compression.h    |    7 +---
 disk-io.c        |   33 ++++++++++------------
 extent-io-tree.c |   12 ++------
 extent_io.c      |   27 +++++++-----------
 fs.h             |    6 ++--
 inode.c          |   19 +++---------
 ordered-data.c   |   68 ++++++++++++++++-----------------------------
 ordered-data.h   |    2 -
 subpage.c        |   65 +++++++++++++++++--------------------------
 super.c          |    2 -
 tree-log.c       |    4 +-
 14 files changed, 165 insertions(+), 213 deletions(-)


* [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-16 17:10   ` Johannes Thumshirn
                     ` (2 more replies)
  2023-03-14 16:59 ` [PATCH 02/10] btrfs: refactor btrfs_end_io_wq Christoph Hellwig
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

The endio_write_workers and endio_freespace_worker workqueues don't use
any of the ordering features of the btrfs_workqueue, so switch them to
plain Linux workqueues.
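
For reference, the plain workqueue lifecycle the converted code relies on
looks roughly like this (a condensed sketch of the calls used below, not a
verbatim excerpt; error handling and the workqueue flags are omitted):

	struct workqueue_struct *wq;

	/* at mount time, in btrfs_init_workqueues() */
	wq = alloc_workqueue("btrfs-endio-write", flags, max_active);

	/* at I/O completion time, from the ordered extent code */
	INIT_WORK(&entry->work, finish_ordered_fn);
	queue_work(wq, &entry->work);

	/* at unmount: drain pending items, then tear down */
	flush_workqueue(wq);
	destroy_workqueue(wq);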

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/disk-io.c      | 16 ++++++++--------
 fs/btrfs/fs.h           |  4 ++--
 fs/btrfs/ordered-data.c |  8 ++++----
 fs/btrfs/ordered-data.h |  2 +-
 fs/btrfs/super.c        |  2 --
 5 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bb864cf2eed60f..0889eb81e71a7d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1993,8 +1993,10 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 		destroy_workqueue(fs_info->rmw_workers);
 	if (fs_info->compressed_write_workers)
 		destroy_workqueue(fs_info->compressed_write_workers);
-	btrfs_destroy_workqueue(fs_info->endio_write_workers);
-	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
+	if (fs_info->endio_write_workers)
+		destroy_workqueue(fs_info->endio_write_workers);
+	if (fs_info->endio_freespace_worker)
+		destroy_workqueue(fs_info->endio_freespace_worker);
 	btrfs_destroy_workqueue(fs_info->delayed_workers);
 	btrfs_destroy_workqueue(fs_info->caching_workers);
 	btrfs_destroy_workqueue(fs_info->flush_workers);
@@ -2204,13 +2206,11 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 		alloc_workqueue("btrfs-endio-meta", flags, max_active);
 	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
 	fs_info->endio_write_workers =
-		btrfs_alloc_workqueue(fs_info, "endio-write", flags,
-				      max_active, 2);
+		alloc_workqueue("btrfs-endio-write", flags, max_active);
 	fs_info->compressed_write_workers =
 		alloc_workqueue("btrfs-compressed-write", flags, max_active);
 	fs_info->endio_freespace_worker =
-		btrfs_alloc_workqueue(fs_info, "freespace-write", flags,
-				      max_active, 0);
+		alloc_workqueue("btrfs-freespace-write", flags, max_active);
 	fs_info->delayed_workers =
 		btrfs_alloc_workqueue(fs_info, "delayed-meta", flags,
 				      max_active, 0);
@@ -4536,9 +4536,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	 * the final btrfs_put_ordered_extent() (which must happen at
 	 * btrfs_finish_ordered_io() when we are unmounting).
 	 */
-	btrfs_flush_workqueue(fs_info->endio_write_workers);
+	flush_workqueue(fs_info->endio_write_workers);
 	/* Ordered extents for free space inodes. */
-	btrfs_flush_workqueue(fs_info->endio_freespace_worker);
+	flush_workqueue(fs_info->endio_freespace_worker);
 	btrfs_run_delayed_iputs(fs_info);
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 20d554a0c2ac0d..276a17780f2b1b 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -542,8 +542,8 @@ struct btrfs_fs_info {
 	struct workqueue_struct *endio_meta_workers;
 	struct workqueue_struct *rmw_workers;
 	struct workqueue_struct *compressed_write_workers;
-	struct btrfs_workqueue *endio_write_workers;
-	struct btrfs_workqueue *endio_freespace_worker;
+	struct workqueue_struct *endio_write_workers;
+	struct workqueue_struct *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
 
 	/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1848d0d1a9c41e..23f496f0d7b776 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -303,7 +303,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 	spin_unlock_irq(&tree->lock);
 }
 
-static void finish_ordered_fn(struct btrfs_work *work)
+static void finish_ordered_fn(struct work_struct *work)
 {
 	struct btrfs_ordered_extent *ordered_extent;
 
@@ -330,7 +330,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 {
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_workqueue *wq;
+	struct workqueue_struct *wq;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
 	unsigned long flags;
@@ -439,8 +439,8 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			refcount_inc(&entry->refs);
 			trace_btrfs_ordered_extent_mark_finished(inode, entry);
 			spin_unlock_irqrestore(&tree->lock, flags);
-			btrfs_init_work(&entry->work, finish_ordered_fn, NULL, NULL);
-			btrfs_queue_work(wq, &entry->work);
+			INIT_WORK(&entry->work, finish_ordered_fn);
+			queue_work(wq, &entry->work);
 			spin_lock_irqsave(&tree->lock, flags);
 		}
 		cur += len;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 18007f9c00add8..b8a92f040458f0 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -146,7 +146,7 @@ struct btrfs_ordered_extent {
 	/* a per root list of all the pending ordered extents */
 	struct list_head root_extent_list;
 
-	struct btrfs_work work;
+	struct work_struct work;
 
 	struct completion completion;
 	struct btrfs_work flush_work;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d8885966e801cd..065b4fab1ee011 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1632,8 +1632,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
 	btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
-	btrfs_workqueue_set_max(fs_info->endio_write_workers, new_pool_size);
-	btrfs_workqueue_set_max(fs_info->endio_freespace_worker, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->delayed_workers, new_pool_size);
 }
 
-- 
2.39.2



* [PATCH 02/10] btrfs: refactor btrfs_end_io_wq
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-16 17:12   ` Johannes Thumshirn
  2023-03-20 11:09   ` Qu Wenruo
  2023-03-14 16:59 ` [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue Christoph Hellwig
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

Pass a btrfs_bio instead of a fs_info and bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cf09c6271edbee..85539964864a34 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -296,10 +296,11 @@ static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
 		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_FLUSH_ERRS);
 }
 
-static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_fs_info *fs_info,
-						struct bio *bio)
+static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
 {
-	if (bio->bi_opf & REQ_META)
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
+
+	if (bbio->bio.bi_opf & REQ_META)
 		return fs_info->endio_meta_workers;
 	return fs_info->endio_workers;
 }
@@ -319,16 +320,15 @@ static void btrfs_simple_end_io(struct bio *bio)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct btrfs_device *dev = bio->bi_private;
-	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
-	btrfs_bio_counter_dec(fs_info);
+	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
 
 	if (bio->bi_status)
 		btrfs_log_dev_io_error(bio, dev);
 
 	if (bio_op(bio) == REQ_OP_READ) {
 		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
-		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
+		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
 	} else {
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 			btrfs_record_physical_zoned(bbio);
-- 
2.39.2



* [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 02/10] btrfs: refactor btrfs_end_io_wq Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-16 17:14   ` Johannes Thumshirn
  2023-03-20 11:29   ` Qu Wenruo
  2023-03-14 16:59 ` [PATCH 04/10] btrfs: remove the compressed_write_workers workqueue Christoph Hellwig
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

Currently all reads are offloaded to workqueues for checksum
verification.  Writes are initially processed in the ->bi_end_io
context, which usually runs in hard or soft interrupt context, and are
only offloaded to a workqueue for ordered extent completion or
compressed extent handling, and in the future for raid stripe tree
handling.

Switch to offloading all writes to workqueues instead of doing that
only for the final ordered extent processing.  This means slightly
more workqueue offloads for large writes, as each submitted btrfs_bio
can only hold 1MiB worth of data on systems with a 4K page size.  On
the other hand this will allow large simplifications by always having
a process context.
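
(The 1MiB figure presumably comes from the bio vector limit: BIO_MAX_VECS
(256) page-sized segments * 4KiB per page = 1MiB per bio.)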

For metadata writes there was no offload at all before, which creates
various problems like not being able to drop the final extent_buffer
reference from the end_io handler.

With the entire series applied this shows no performance differences
for data heavy workloads, while for metadata heavy workloads like
Filipe's fs_mark runs there is also no change in averages, but the
outliers in terms of best and worst performance seem to shrink
somewhat.  This is probably due to the removal of the irq disabling
in the following patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c          | 74 ++++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.c      |  5 +++
 fs/btrfs/fs.h           |  1 +
 fs/btrfs/ordered-data.c | 17 +---------
 fs/btrfs/ordered-data.h |  2 --
 5 files changed, 57 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 85539964864a34..037eded3b33ba2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -300,20 +300,33 @@ static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
 {
 	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
-	if (bbio->bio.bi_opf & REQ_META)
-		return fs_info->endio_meta_workers;
-	return fs_info->endio_workers;
+	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
+		if (btrfs_is_free_space_inode(bbio->inode))
+			return fs_info->endio_freespace_worker;
+		if (bbio->bio.bi_opf & REQ_META)
+			return fs_info->endio_write_meta_workers;
+		return fs_info->endio_write_workers;
+	} else {
+		if (bbio->bio.bi_opf & REQ_META)
+			return fs_info->endio_meta_workers;
+		return fs_info->endio_workers;
+	}
 }
 
-static void btrfs_end_bio_work(struct work_struct *work)
+static void btrfs_simple_end_bio_work(struct work_struct *work)
 {
 	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+	struct bio *bio = &bbio->bio;
+	struct btrfs_device *dev = bio->bi_private;
 
 	/* Metadata reads are checked and repaired by the submitter. */
-	if (bbio->bio.bi_opf & REQ_META)
-		bbio->end_io(bbio);
-	else
-		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
+	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
+		btrfs_check_read_bio(bbio, dev);
+	} else {
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+			btrfs_record_physical_zoned(bbio);
+		btrfs_orig_bbio_end_io(bbio);
+	}
 }
 
 static void btrfs_simple_end_io(struct bio *bio)
@@ -322,18 +335,19 @@ static void btrfs_simple_end_io(struct bio *bio)
 	struct btrfs_device *dev = bio->bi_private;
 
 	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
-
 	if (bio->bi_status)
 		btrfs_log_dev_io_error(bio, dev);
 
-	if (bio_op(bio) == REQ_OP_READ) {
-		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
-		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
-	} else {
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-			btrfs_record_physical_zoned(bbio);
-		btrfs_orig_bbio_end_io(bbio);
-	}
+	INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work);
+	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
+}
+
+static void btrfs_write_end_bio_work(struct work_struct *work)
+{
+	struct btrfs_bio *bbio =
+		container_of(work, struct btrfs_bio, end_io_work);
+
+	btrfs_orig_bbio_end_io(bbio);
 }
 
 static void btrfs_raid56_end_io(struct bio *bio)
@@ -343,12 +357,23 @@ static void btrfs_raid56_end_io(struct bio *bio)
 
 	btrfs_bio_counter_dec(bioc->fs_info);
 	bbio->mirror_num = bioc->mirror_num;
-	if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META))
-		btrfs_check_read_bio(bbio, NULL);
-	else
-		btrfs_orig_bbio_end_io(bbio);
-
 	btrfs_put_bioc(bioc);
+
+	/*
+	 * The RAID5/6 code already completes all bios from workqueues, but for
+	 * writes we need to offload them again to avoid deadlocks in the ordered
+	 * extent processing.
+	 */
+	if (bio_op(bio) == REQ_OP_READ) {
+		/* Metadata reads are checked and repaired by the submitter. */
+		if (!(bio->bi_opf & REQ_META))
+			btrfs_check_read_bio(bbio, NULL);
+		else
+			btrfs_orig_bbio_end_io(bbio);
+	} else {
+		INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work);
+		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
+	}
 }
 
 static void btrfs_orig_write_end_io(struct bio *bio)
@@ -372,9 +397,10 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 		bio->bi_status = BLK_STS_IOERR;
 	else
 		bio->bi_status = BLK_STS_OK;
-
-	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
+
+	INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work);
+	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
 }
 
 static void btrfs_clone_write_end_io(struct bio *bio)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0889eb81e71a7d..5b63a5161cedea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2008,6 +2008,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	 * the queues used for metadata I/O, since tasks from those other work
 	 * queues can do metadata I/O operations.
 	 */
+	if (fs_info->endio_write_meta_workers)
+		destroy_workqueue(fs_info->endio_write_meta_workers);
 	if (fs_info->endio_meta_workers)
 		destroy_workqueue(fs_info->endio_meta_workers);
 }
@@ -2204,6 +2206,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 		alloc_workqueue("btrfs-endio", flags, max_active);
 	fs_info->endio_meta_workers =
 		alloc_workqueue("btrfs-endio-meta", flags, max_active);
+	fs_info->endio_write_meta_workers =
+		alloc_workqueue("btrfs-endio-write-meta", flags, max_active);
 	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
 	fs_info->endio_write_workers =
 		alloc_workqueue("btrfs-endio-write", flags, max_active);
@@ -2224,6 +2228,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->compressed_write_workers &&
 	      fs_info->endio_write_workers &&
+	      fs_info->endio_write_meta_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->fixup_workers &&
 	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 276a17780f2b1b..e2643bc5c039ad 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -543,6 +543,7 @@ struct btrfs_fs_info {
 	struct workqueue_struct *rmw_workers;
 	struct workqueue_struct *compressed_write_workers;
 	struct workqueue_struct *endio_write_workers;
+	struct workqueue_struct *endio_write_meta_workers;
 	struct workqueue_struct *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 23f496f0d7b776..48c7510df80a0d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -303,14 +303,6 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 	spin_unlock_irq(&tree->lock);
 }
 
-static void finish_ordered_fn(struct work_struct *work)
-{
-	struct btrfs_ordered_extent *ordered_extent;
-
-	ordered_extent = container_of(work, struct btrfs_ordered_extent, work);
-	btrfs_finish_ordered_io(ordered_extent);
-}
-
 /*
  * Mark all ordered extents io inside the specified range finished.
  *
@@ -330,17 +322,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 {
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct workqueue_struct *wq;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
 	unsigned long flags;
 	u64 cur = file_offset;
 
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
 	if (page)
 		ASSERT(page->mapping && page_offset(page) <= file_offset &&
 		       file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
@@ -439,8 +425,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			refcount_inc(&entry->refs);
 			trace_btrfs_ordered_extent_mark_finished(inode, entry);
 			spin_unlock_irqrestore(&tree->lock, flags);
-			INIT_WORK(&entry->work, finish_ordered_fn);
-			queue_work(wq, &entry->work);
+			btrfs_finish_ordered_io(entry);
 			spin_lock_irqsave(&tree->lock, flags);
 		}
 		cur += len;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index b8a92f040458f0..64a37d42e7a24a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -146,8 +146,6 @@ struct btrfs_ordered_extent {
 	/* a per root list of all the pending ordered extents */
 	struct list_head root_extent_list;
 
-	struct work_struct work;
-
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
-- 
2.39.2



* [PATCH 04/10] btrfs: remove the compressed_write_workers workqueue
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock Christoph Hellwig
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

Now that all write completions happen in workqueues, there is no need
for an extra offload for compressed write completions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/compression.c | 34 ++++++++++------------------------
 fs/btrfs/compression.h |  7 ++-----
 fs/btrfs/disk-io.c     |  5 -----
 fs/btrfs/fs.h          |  1 -
 4 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c5839d04690d67..1641336a39d215 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -220,27 +220,6 @@ static noinline void end_compressed_writeback(const struct compressed_bio *cb)
 	/* the inode may be gone now */
 }
 
-static void btrfs_finish_compressed_write_work(struct work_struct *work)
-{
-	struct compressed_bio *cb =
-		container_of(work, struct compressed_bio, write_end_work);
-
-	/*
-	 * Ok, we're the last bio for this extent, step one is to call back
-	 * into the FS and do all the end_io operations.
-	 */
-	btrfs_writepage_endio_finish_ordered(cb->bbio.inode, NULL,
-			cb->start, cb->start + cb->len - 1,
-			cb->bbio.bio.bi_status == BLK_STS_OK);
-
-	if (cb->writeback)
-		end_compressed_writeback(cb);
-	/* Note, our inode could be gone now */
-
-	btrfs_free_compressed_pages(cb);
-	bio_put(&cb->bbio.bio);
-}
-
 /*
  * Do the cleanup once all the compressed pages hit the disk.  This will clear
  * writeback on the file pages and free the compressed pages.
@@ -251,9 +230,17 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
 static void end_compressed_bio_write(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = to_compressed_bio(bbio);
-	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
-	queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
+	btrfs_writepage_endio_finish_ordered(cb->bbio.inode, NULL,
+			cb->start, cb->start + cb->len - 1,
+			cb->bbio.bio.bi_status == BLK_STS_OK);
+
+	if (cb->writeback)
+		end_compressed_writeback(cb);
+	/* Note, our inode could be gone now */
+
+	btrfs_free_compressed_pages(cb);
+	bio_put(&bbio->bio);
 }
 
 static void btrfs_add_compressed_bio_pages(struct compressed_bio *cb,
@@ -329,7 +316,6 @@ void btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
 	cb->writeback = writeback;
-	INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
 	cb->nr_pages = nr_pages;
 
 	btrfs_add_compressed_bio_pages(cb, disk_start);
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 5d5146e72a860b..65aaedba35a2ad 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -53,11 +53,8 @@ struct compressed_bio {
 	/* Whether this is a write for writeback. */
 	bool writeback;
 
-	union {
-		/* For reads, this is the bio we are copying the data into */
-		struct btrfs_bio *orig_bbio;
-		struct work_struct write_end_work;
-	};
+	/* For reads, this is the bio we are copying the data into */
+	struct btrfs_bio *orig_bbio;
 
 	/* Must be last. */
 	struct btrfs_bio bbio;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5b63a5161cedea..494081dda5fc66 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1991,8 +1991,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 		destroy_workqueue(fs_info->endio_workers);
 	if (fs_info->rmw_workers)
 		destroy_workqueue(fs_info->rmw_workers);
-	if (fs_info->compressed_write_workers)
-		destroy_workqueue(fs_info->compressed_write_workers);
 	if (fs_info->endio_write_workers)
 		destroy_workqueue(fs_info->endio_write_workers);
 	if (fs_info->endio_freespace_worker)
@@ -2211,8 +2209,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
 	fs_info->endio_write_workers =
 		alloc_workqueue("btrfs-endio-write", flags, max_active);
-	fs_info->compressed_write_workers =
-		alloc_workqueue("btrfs-compressed-write", flags, max_active);
 	fs_info->endio_freespace_worker =
 		alloc_workqueue("btrfs-freespace-write", flags, max_active);
 	fs_info->delayed_workers =
@@ -2226,7 +2222,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	if (!(fs_info->workers && fs_info->hipri_workers &&
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
-	      fs_info->compressed_write_workers &&
 	      fs_info->endio_write_workers &&
 	      fs_info->endio_write_meta_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index e2643bc5c039ad..d926a337d2be20 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -541,7 +541,6 @@ struct btrfs_fs_info {
 	struct workqueue_struct *endio_workers;
 	struct workqueue_struct *endio_meta_workers;
 	struct workqueue_struct *rmw_workers;
-	struct workqueue_struct *compressed_write_workers;
 	struct workqueue_struct *endio_write_workers;
 	struct workqueue_struct *endio_write_meta_workers;
 	struct workqueue_struct *endio_freespace_worker;
-- 
2.39.2



* [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 04/10] btrfs: remove the compressed_write_workers workqueue Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-17 10:34   ` Johannes Thumshirn
  2023-03-14 16:59 ` [PATCH 06/10] btrfs: remove irq disabling for subpage.list_lock Christoph Hellwig
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

btrfs_queue_work with an ordered_func is never called from irq
context, so remove the irq disabling for btrfs_workqueue.list_lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/async-thread.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index aac240430efe13..91ec1e2fea0c69 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -179,11 +179,10 @@ static void run_ordered_work(struct btrfs_workqueue *wq,
 	struct list_head *list = &wq->ordered_list;
 	struct btrfs_work *work;
 	spinlock_t *lock = &wq->list_lock;
-	unsigned long flags;
 	bool free_self = false;
 
 	while (1) {
-		spin_lock_irqsave(lock, flags);
+		spin_lock(lock);
 		if (list_empty(list))
 			break;
 		work = list_entry(list->next, struct btrfs_work,
@@ -207,13 +206,13 @@ static void run_ordered_work(struct btrfs_workqueue *wq,
 		if (test_and_set_bit(WORK_ORDER_DONE_BIT, &work->flags))
 			break;
 		trace_btrfs_ordered_sched(work);
-		spin_unlock_irqrestore(lock, flags);
+		spin_unlock(lock);
 		work->ordered_func(work);
 
 		/* now take the lock again and drop our item from the list */
-		spin_lock_irqsave(lock, flags);
+		spin_lock(lock);
 		list_del(&work->ordered_list);
-		spin_unlock_irqrestore(lock, flags);
+		spin_unlock(lock);
 
 		if (work == self) {
 			/*
@@ -248,7 +247,7 @@ static void run_ordered_work(struct btrfs_workqueue *wq,
 			trace_btrfs_all_work_done(wq->fs_info, work);
 		}
 	}
-	spin_unlock_irqrestore(lock, flags);
+	spin_unlock(lock);
 
 	if (free_self) {
 		self->ordered_free(self);
@@ -307,14 +306,12 @@ void btrfs_init_work(struct btrfs_work *work, btrfs_func_t func,
 
 void btrfs_queue_work(struct btrfs_workqueue *wq, struct btrfs_work *work)
 {
-	unsigned long flags;
-
 	work->wq = wq;
 	thresh_queue_hook(wq);
 	if (work->ordered_func) {
-		spin_lock_irqsave(&wq->list_lock, flags);
+		spin_lock(&wq->list_lock);
 		list_add_tail(&work->ordered_list, &wq->ordered_list);
-		spin_unlock_irqrestore(&wq->list_lock, flags);
+		spin_unlock(&wq->list_lock);
 	}
 	trace_btrfs_work_queued(work);
 	queue_work(wq->normal_wq, &work->normal_work);
-- 
2.39.2



* [PATCH 06/10] btrfs: remove irq disabling for subpage.list_lock
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-14 16:59 ` [PATCH 07/10] btrfs: remove irq disabling for leak_lock Christoph Hellwig
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

Subpage state isn't modified from irq context any more, so remove
the irq disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/disk-io.c   |  7 ++---
 fs/btrfs/extent_io.c | 12 ++++----
 fs/btrfs/inode.c     |  4 +--
 fs/btrfs/subpage.c   | 65 ++++++++++++++++++--------------------------
 4 files changed, 36 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 494081dda5fc66..638eed9023492e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -789,17 +789,16 @@ static bool btree_dirty_folio(struct address_space *mapping,
 
 	ASSERT(subpage->dirty_bitmap);
 	while (cur_bit < BTRFS_SUBPAGE_BITMAP_SIZE) {
-		unsigned long flags;
 		u64 cur;
 		u16 tmp = (1 << cur_bit);
 
-		spin_lock_irqsave(&subpage->lock, flags);
+		spin_lock(&subpage->lock);
 		if (!(tmp & subpage->dirty_bitmap)) {
-			spin_unlock_irqrestore(&subpage->lock, flags);
+			spin_unlock(&subpage->lock);
 			cur_bit++;
 			continue;
 		}
-		spin_unlock_irqrestore(&subpage->lock, flags);
+		spin_unlock(&subpage->lock);
 		cur = page_start + cur_bit * fs_info->sectorsize;
 
 		eb = find_extent_buffer(fs_info, cur);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1221f699ffc596..b0f74c741aa7a9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1385,7 +1385,6 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage_info *spi = fs_info->subpage_info;
 	u64 orig_start = *start;
 	/* Declare as unsigned long so we can use bitmap ops */
-	unsigned long flags;
 	int range_start_bit;
 	int range_end_bit;
 
@@ -1403,10 +1402,10 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
 			  (offset_in_page(orig_start) >> fs_info->sectorsize_bits);
 
 	/* We should have the page locked, but just in case */
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_next_set_region(subpage->bitmaps, &range_start_bit, &range_end_bit,
 			       spi->dirty_offset + spi->bitmap_nr_bits);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 
 	range_start_bit -= spi->dirty_offset;
 	range_end_bit -= spi->dirty_offset;
@@ -2080,7 +2079,6 @@ static int submit_eb_subpage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl)
 	while (bit_start < fs_info->subpage_info->bitmap_nr_bits) {
 		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 		struct extent_buffer *eb;
-		unsigned long flags;
 		u64 start;
 
 		/*
@@ -2092,10 +2090,10 @@ static int submit_eb_subpage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl)
 			spin_unlock(&page->mapping->private_lock);
 			break;
 		}
-		spin_lock_irqsave(&subpage->lock, flags);
+		spin_lock(&subpage->lock);
 		if (!test_bit(bit_start + fs_info->subpage_info->dirty_offset,
 			      subpage->bitmaps)) {
-			spin_unlock_irqrestore(&subpage->lock, flags);
+			spin_unlock(&subpage->lock);
 			spin_unlock(&page->mapping->private_lock);
 			bit_start++;
 			continue;
@@ -2109,7 +2107,7 @@ static int submit_eb_subpage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl)
 		 * spin locks, so call find_extent_buffer_nolock().
 		 */
 		eb = find_extent_buffer_nolock(fs_info, start);
-		spin_unlock_irqrestore(&subpage->lock, flags);
+		spin_unlock(&subpage->lock);
 		spin_unlock(&page->mapping->private_lock);
 
 		/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 33f0ec14703a3d..cccaab4aa9cd91 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7900,8 +7900,8 @@ static void wait_subpage_spinlock(struct page *page)
 	 * Here we just acquire the spinlock so that all existing callers
 	 * should exit and we're safe to release/invalidate the page.
 	 */
-	spin_lock_irq(&subpage->lock);
-	spin_unlock_irq(&subpage->lock);
+	spin_lock(&subpage->lock);
+	spin_unlock(&subpage->lock);
 }
 
 static bool __btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index dd46b978ac2cfc..3a47aeed419503 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -415,13 +415,12 @@ void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							uptodate, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_set(fs_info, subpage, uptodate))
 		SetPageUptodate(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_clear_uptodate(const struct btrfs_fs_info *fs_info,
@@ -430,12 +429,11 @@ void btrfs_subpage_clear_uptodate(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							uptodate, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	ClearPageUptodate(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_set_error(const struct btrfs_fs_info *fs_info,
@@ -444,12 +442,11 @@ void btrfs_subpage_set_error(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							error, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	SetPageError(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_clear_error(const struct btrfs_fs_info *fs_info,
@@ -458,13 +455,12 @@ void btrfs_subpage_clear_error(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							error, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_zero(fs_info, subpage, error))
 		ClearPageError(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
@@ -473,11 +469,10 @@ void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							dirty, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 	set_page_dirty(page);
 }
 
@@ -497,14 +492,13 @@ bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							dirty, start, len);
-	unsigned long flags;
 	bool last = false;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_zero(fs_info, subpage, dirty))
 		last = true;
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 	return last;
 }
 
@@ -524,12 +518,11 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							writeback, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	set_page_writeback(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
@@ -538,15 +531,14 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							writeback, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_zero(fs_info, subpage, writeback)) {
 		ASSERT(PageWriteback(page));
 		end_page_writeback(page);
 	}
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
@@ -555,12 +547,11 @@ void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							ordered, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	SetPageOrdered(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
@@ -569,13 +560,12 @@ void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							ordered, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_zero(fs_info, subpage, ordered))
 		ClearPageOrdered(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info,
@@ -584,13 +574,12 @@ void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							checked, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (subpage_test_bitmap_all_set(fs_info, subpage, checked))
 		SetPageChecked(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info,
@@ -599,12 +588,11 @@ void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info,
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
 							checked, start, len);
-	unsigned long flags;
 
-	spin_lock_irqsave(&subpage->lock, flags);
+	spin_lock(&subpage->lock);
 	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	ClearPageChecked(page);
-	spin_unlock_irqrestore(&subpage->lock, flags);
+	spin_unlock(&subpage->lock);
 }
 
 /*
@@ -618,13 +606,12 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info,	\
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
 	unsigned int start_bit = subpage_calc_start_bit(fs_info, page,	\
 						name, start, len);	\
-	unsigned long flags;						\
 	bool ret;							\
 									\
-	spin_lock_irqsave(&subpage->lock, flags);			\
+	spin_lock(&subpage->lock);					\
 	ret = bitmap_test_range_all_set(subpage->bitmaps, start_bit,	\
 				len >> fs_info->sectorsize_bits);	\
-	spin_unlock_irqrestore(&subpage->lock, flags);			\
+	spin_unlock(&subpage->lock);					\
 	return ret;							\
 }
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
-- 
2.39.2



* [PATCH 07/10] btrfs: remove irq disabling for leak_lock
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 06/10] btrfs: remove irq disabling for subpage.list_lock Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-17 10:35   ` Johannes Thumshirn
  2023-03-14 16:59 ` [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock Christoph Hellwig
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

None of the extent state leak tracking is called from irq context,
so remove the irq disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent-io-tree.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 29a225836e286e..caf11a10da71a0 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -21,20 +21,16 @@ static DEFINE_SPINLOCK(leak_lock);
 
 static inline void btrfs_leak_debug_add_state(struct extent_state *state)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&leak_lock, flags);
+	spin_lock(&leak_lock);
 	list_add(&state->leak_list, &states);
-	spin_unlock_irqrestore(&leak_lock, flags);
+	spin_unlock(&leak_lock);
 }
 
 static inline void btrfs_leak_debug_del_state(struct extent_state *state)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&leak_lock, flags);
+	spin_lock(&leak_lock);
 	list_del(&state->leak_list);
-	spin_unlock_irqrestore(&leak_lock, flags);
+	spin_unlock(&leak_lock);
 }
 
 static inline void btrfs_extent_state_leak_debug_check(void)
-- 
2.39.2



* [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 07/10] btrfs: remove irq disabling for leak_lock Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-17 10:35   ` Johannes Thumshirn
  2023-03-14 16:59 ` [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock Christoph Hellwig
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

None of the extent_buffer leak tracking is called from irq context,
so remove the irq disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b0f74c741aa7a9..8e1ad6d1c7ccca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -44,27 +44,24 @@ static struct kmem_cache *extent_buffer_cache;
 static inline void btrfs_leak_debug_add_eb(struct extent_buffer *eb)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
-	unsigned long flags;
 
-	spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
+	spin_lock(&fs_info->eb_leak_lock);
 	list_add(&eb->leak_list, &fs_info->allocated_ebs);
-	spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
+	spin_unlock(&fs_info->eb_leak_lock);
 }
 
 static inline void btrfs_leak_debug_del_eb(struct extent_buffer *eb)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
-	unsigned long flags;
 
-	spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
+	spin_lock(&fs_info->eb_leak_lock);
 	list_del(&eb->leak_list);
-	spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
+	spin_unlock(&fs_info->eb_leak_lock);
 }
 
 void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 {
 	struct extent_buffer *eb;
-	unsigned long flags;
 
 	/*
 	 * If we didn't get into open_ctree our allocated_ebs will not be
@@ -74,7 +71,7 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 		return;
 
 	WARN_ON(!list_empty(&fs_info->allocated_ebs));
-	spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
+	spin_lock(&fs_info->eb_leak_lock);
 	while (!list_empty(&fs_info->allocated_ebs)) {
 		eb = list_first_entry(&fs_info->allocated_ebs,
 				      struct extent_buffer, leak_list);
@@ -85,7 +82,7 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 		list_del(&eb->leak_list);
 		kmem_cache_free(extent_buffer_cache, eb);
 	}
-	spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
+	spin_unlock(&fs_info->eb_leak_lock);
 }
 #else
 #define btrfs_leak_debug_add_eb(eb)			do {} while (0)
-- 
2.39.2



* [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (7 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-17 10:36   ` Johannes Thumshirn
  2023-03-14 16:59 ` [PATCH 10/10] btrfs: remove confusing comments Christoph Hellwig
  2023-03-17 10:39 ` defer all write I/O completions to process context Johannes Thumshirn
  10 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

The per-inode ordered extent list is now only accessed from process
context, so remove the irq disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        |  4 ++--
 fs/btrfs/ordered-data.c | 51 +++++++++++++++++++----------------------
 fs/btrfs/tree-log.c     |  4 ++--
 3 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cccaab4aa9cd91..afa564f46c6452 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8050,11 +8050,11 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, &cached_state);
 
-		spin_lock_irq(&inode->ordered_tree.lock);
+		spin_lock(&inode->ordered_tree.lock);
 		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
 		ordered->truncated_len = min(ordered->truncated_len,
 					     cur - ordered->file_offset);
-		spin_unlock_irq(&inode->ordered_tree.lock);
+		spin_unlock(&inode->ordered_tree.lock);
 
 		/*
 		 * If the ordered extent has finished, we're safe to delete all
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 48c7510df80a0d..4c913d1cf21b1d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -228,14 +228,14 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 
 	trace_btrfs_ordered_extent_add(inode, entry);
 
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	node = tree_insert(&tree->tree, file_offset,
 			   &entry->rb_node);
 	if (node)
 		btrfs_panic(fs_info, -EEXIST,
 				"inconsistency in ordered tree at offset %llu",
 				file_offset);
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 
 	spin_lock(&root->ordered_extent_lock);
 	list_add_tail(&entry->root_extent_list,
@@ -298,9 +298,9 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 	struct btrfs_ordered_inode_tree *tree;
 
 	tree = &BTRFS_I(entry->inode)->ordered_tree;
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	list_add_tail(&sum->list, &entry->list);
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 }
 
 /*
@@ -324,14 +324,13 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 	u64 cur = file_offset;
 
 	if (page)
 		ASSERT(page->mapping && page_offset(page) <= file_offset &&
 		       file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
 
-	spin_lock_irqsave(&tree->lock, flags);
+	spin_lock(&tree->lock);
 	while (cur < file_offset + num_bytes) {
 		u64 entry_end;
 		u64 end;
@@ -424,13 +423,13 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			cond_wake_up(&entry->wait);
 			refcount_inc(&entry->refs);
 			trace_btrfs_ordered_extent_mark_finished(inode, entry);
-			spin_unlock_irqrestore(&tree->lock, flags);
+			spin_unlock(&tree->lock);
 			btrfs_finish_ordered_io(entry);
-			spin_lock_irqsave(&tree->lock, flags);
+			spin_lock(&tree->lock);
 		}
 		cur += len;
 	}
-	spin_unlock_irqrestore(&tree->lock, flags);
+	spin_unlock(&tree->lock);
 }
 
 /*
@@ -457,10 +456,9 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 	bool finished = false;
 
-	spin_lock_irqsave(&tree->lock, flags);
+	spin_lock(&tree->lock);
 	if (cached && *cached) {
 		entry = *cached;
 		goto have_entry;
@@ -497,7 +495,7 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 		refcount_inc(&entry->refs);
 		trace_btrfs_ordered_extent_dec_test_pending(inode, entry);
 	}
-	spin_unlock_irqrestore(&tree->lock, flags);
+	spin_unlock(&tree->lock);
 	return finished;
 }
 
@@ -567,7 +565,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				 fs_info->delalloc_batch);
 
 	tree = &btrfs_inode->ordered_tree;
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	node = &entry->rb_node;
 	rb_erase(node, &tree->tree);
 	RB_CLEAR_NODE(node);
@@ -575,7 +573,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 		tree->last = NULL;
 	set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
 	pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 
 	/*
 	 * The current running transaction is waiting on us, we need to let it
@@ -837,10 +835,9 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 
 	tree = &inode->ordered_tree;
-	spin_lock_irqsave(&tree->lock, flags);
+	spin_lock(&tree->lock);
 	node = tree_search(tree, file_offset);
 	if (!node)
 		goto out;
@@ -853,7 +850,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 		trace_btrfs_ordered_extent_lookup(inode, entry);
 	}
 out:
-	spin_unlock_irqrestore(&tree->lock, flags);
+	spin_unlock(&tree->lock);
 	return entry;
 }
 
@@ -868,7 +865,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 	struct btrfs_ordered_extent *entry = NULL;
 
 	tree = &inode->ordered_tree;
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	node = tree_search(tree, file_offset);
 	if (!node) {
 		node = tree_search(tree, file_offset + len);
@@ -895,7 +892,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 		refcount_inc(&entry->refs);
 		trace_btrfs_ordered_extent_lookup_range(inode, entry);
 	}
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 	return entry;
 }
 
@@ -911,7 +908,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
 
 	ASSERT(inode_is_locked(&inode->vfs_inode));
 
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	for (n = rb_first(&tree->tree); n; n = rb_next(n)) {
 		struct btrfs_ordered_extent *ordered;
 
@@ -925,7 +922,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
 		refcount_inc(&ordered->refs);
 		trace_btrfs_ordered_extent_lookup_for_logging(inode, ordered);
 	}
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 }
 
 /*
@@ -940,7 +937,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	struct btrfs_ordered_extent *entry = NULL;
 
 	tree = &inode->ordered_tree;
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	node = tree_search(tree, file_offset);
 	if (!node)
 		goto out;
@@ -949,7 +946,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	refcount_inc(&entry->refs);
 	trace_btrfs_ordered_extent_lookup_first(inode, entry);
 out:
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 	return entry;
 }
 
@@ -972,7 +969,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	struct rb_node *next;
 	struct btrfs_ordered_extent *entry = NULL;
 
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	node = tree->tree.rb_node;
 	/*
 	 * Here we don't want to use tree_search() which will use tree->last
@@ -1027,7 +1024,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
 	}
 
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 	return entry;
 }
 
@@ -1134,7 +1131,7 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 
 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
 
-	spin_lock_irq(&tree->lock);
+	spin_lock(&tree->lock);
 	/* Remove from tree once */
 	node = &ordered->rb_node;
 	rb_erase(node, &tree->tree);
@@ -1155,7 +1152,7 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 			"zoned: inconsistency in ordered tree at offset %llu",
 			    ordered->file_offset);
 
-	spin_unlock_irq(&tree->lock);
+	spin_unlock(&tree->lock);
 
 	if (pre)
 		ret = clone_ordered_extent(ordered, 0, pre);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9ab793b638a7b9..8cb0b700bf535b 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4950,12 +4950,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags);
 
 		if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-			spin_lock_irq(&inode->ordered_tree.lock);
+			spin_lock(&inode->ordered_tree.lock);
 			if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
 				set_bit(BTRFS_ORDERED_PENDING, &ordered->flags);
 				atomic_inc(&trans->transaction->pending_ordered);
 			}
-			spin_unlock_irq(&inode->ordered_tree.lock);
+			spin_unlock(&inode->ordered_tree.lock);
 		}
 		btrfs_put_ordered_extent(ordered);
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 10/10] btrfs: remove confusing comments
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (8 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock Christoph Hellwig
@ 2023-03-14 16:59 ` Christoph Hellwig
  2023-03-17 10:37   ` Johannes Thumshirn
  2023-03-17 10:39 ` defer all write I/O completions to process context Johannes Thumshirn
  10 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-14 16:59 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Johannes Thumshirn, linux-btrfs

As far as I can tell there is no such thing as set_bit and test_bit
hooks, and there also isn't any irq disabling near the data structures
used here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index afa564f46c6452..e26ba7104c2b2a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2402,11 +2402,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
 
 	if ((bits & EXTENT_DEFRAG) && !(bits & EXTENT_DELALLOC))
 		WARN_ON(1);
-	/*
-	 * set_bit and clear bit hooks normally require _irqsave/restore
-	 * but in this case, we are only testing for the DELALLOC
-	 * bit, which is only set or cleared with irqs on
-	 */
+
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = inode->root;
 		u64 len = state->end + 1 - state->start;
@@ -2458,11 +2454,6 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 		spin_unlock(&inode->lock);
 	}
 
-	/*
-	 * set_bit and clear bit hooks normally require _irqsave/restore
-	 * but in this case, we are only testing for the DELALLOC
-	 * bit, which is only set or cleared with irqs on
-	 */
 	if ((state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = inode->root;
 		bool do_list = !btrfs_is_free_space_inode(inode);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
@ 2023-03-16 17:10   ` Johannes Thumshirn
  2023-03-16 17:31   ` David Sterba
  2023-03-20 11:08   ` Qu Wenruo
  2 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-16 17:10 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/10] btrfs: refactor btrfs_end_io_wq
  2023-03-14 16:59 ` [PATCH 02/10] btrfs: refactor btrfs_end_io_wq Christoph Hellwig
@ 2023-03-16 17:12   ` Johannes Thumshirn
  2023-03-20 11:09   ` Qu Wenruo
  1 sibling, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-16 17:12 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-14 16:59 ` [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue Christoph Hellwig
@ 2023-03-16 17:14   ` Johannes Thumshirn
  2023-03-20 11:29   ` Qu Wenruo
  1 sibling, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-16 17:14 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

On 14.03.23 17:59, Christoph Hellwig wrote:
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
> +		if (btrfs_is_free_space_inode(bbio->inode))
> +			return fs_info->endio_freespace_worker;
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_write_meta_workers;
> +		return fs_info->endio_write_workers;
> +	} else {
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_meta_workers;
> +		return fs_info->endio_workers;
> +	}


Can't that else be dropped as well?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
  2023-03-16 17:10   ` Johannes Thumshirn
@ 2023-03-16 17:31   ` David Sterba
  2023-03-20  6:12     ` Christoph Hellwig
  2023-03-20 11:08   ` Qu Wenruo
  2 siblings, 1 reply; 48+ messages in thread
From: David Sterba @ 2023-03-16 17:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs

On Tue, Mar 14, 2023 at 05:59:01PM +0100, Christoph Hellwig wrote:
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1632,8 +1632,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
>  	btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
> -	btrfs_workqueue_set_max(fs_info->endio_write_workers, new_pool_size);
> -	btrfs_workqueue_set_max(fs_info->endio_freespace_worker, new_pool_size);

I haven't noticed it in the past workqueue patches, but here we lose the
connection between the thread_pool mount option and max_active per workqueue.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock
  2023-03-14 16:59 ` [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock Christoph Hellwig
@ 2023-03-17 10:34   ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:34 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 07/10] btrfs: remove irq disabling for leak_lock
  2023-03-14 16:59 ` [PATCH 07/10] btrfs: remove irq disabling for leak_lock Christoph Hellwig
@ 2023-03-17 10:35   ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:35 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock
  2023-03-14 16:59 ` [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock Christoph Hellwig
@ 2023-03-17 10:35   ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:35 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock
  2023-03-14 16:59 ` [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock Christoph Hellwig
@ 2023-03-17 10:36   ` Johannes Thumshirn
  2023-03-20  6:12     ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:36 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

On 14.03.23 18:00, Christoph Hellwig wrote:
> The per-inode ordered extent list is not only accessed from process

s/not// ?

Otherwise,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/10] btrfs: remove confusing comments
  2023-03-14 16:59 ` [PATCH 10/10] btrfs: remove confusing comments Christoph Hellwig
@ 2023-03-17 10:37   ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:37 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: defer all write I/O completions to process context
  2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
                   ` (9 preceding siblings ...)
  2023-03-14 16:59 ` [PATCH 10/10] btrfs: remove confusing comments Christoph Hellwig
@ 2023-03-17 10:39 ` Johannes Thumshirn
  2023-03-20  6:14   ` Christoph Hellwig
  10 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-03-17 10:39 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs

On 14.03.23 17:59, Christoph Hellwig wrote:
> Hi all,
> 
> based on some of my own projects and Johannes raid-stripe-tree work it has
> become a bit painful to drive much of the write I/O completion from the
> potential irq context ->bi_end_io handler.  This series follows the steps
> that XFS has taken about 10 years ago and defers all the end_io handling
> to process contexts using a workqueue, which also allows to remove all
> irq locking in btrfs (safe one spot that interacts with the pagecache).
> 
> I've run various data and metadata benchmarks, and the only significant
> change is that the outliers to the high and low in repeated runs of
> metadata heavy workloads seem to reduce while the average remains the
> same.  I'm of course also interested in runs of additional workloads or
> systems as this is a fairly significant change.

Apart from David's comment (which I totally overlooked as well) this looks
good to me.

But I have a general question, which might even sound pretty dumb. As we're
out of IRQ/atomic contexts now, do we even need spinlocks, or would a mutex
suffice as well?

Byte,
	Johannes

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-16 17:31   ` David Sterba
@ 2023-03-20  6:12     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-20  6:12 UTC (permalink / raw)
  To: David Sterba
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Thu, Mar 16, 2023 at 06:31:34PM +0100, David Sterba wrote:
> On Tue, Mar 14, 2023 at 05:59:01PM +0100, Christoph Hellwig wrote:
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1632,8 +1632,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
> >  	btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
> >  	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
> >  	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
> > -	btrfs_workqueue_set_max(fs_info->endio_write_workers, new_pool_size);
> > -	btrfs_workqueue_set_max(fs_info->endio_freespace_worker, new_pool_size);
> 
> I haven't noticed in the past workque patches but here we lose the
> connection of mount option thread_pool and max_active per workqueue.

Looking at the code in detail, we do, but for the freespace-write workqueue
only.

endio-write passes 2 as the thresh argument to btrfs_alloc_workqueue,
which is below DFT_THRESHOLD and thus all the max active tracking is
disabled.
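
For reference, the check I mean looks roughly like this - a paraphrased
sketch from my reading of fs/btrfs/async-thread.c, not a verbatim copy, so
the details may differ:

	/* paraphrased sketch of the btrfs_alloc_workqueue() threshold setup */
	if (thresh == 0)
		thresh = DFT_THRESHOLD;
	/* For a low threshold, disabling the threshold is the better choice */
	if (thresh < DFT_THRESHOLD) {
		ret->current_active = limit_active;
		ret->thresh = NO_THRESHOLD;
	} else {
		ret->current_active = 1;
		ret->thresh = thresh;
	}

So if I read it right, with thresh == 2 the pending-based tracking is off
and later btrfs_workqueue_set_max() calls don't change anything for that
queue.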

Which makes sense to me - the max_active tracking is fairly expensive,
and the cost of more active items in the workqueue code is non-existent
until they are actually used.

The history here is a bit weird, the initial version of
endio_freespace_worker was added by Josef in commit 0cb59c995317
("Btrfs: write out free space cache").  This was back in the day of the
old btrfs workers, that weren't using workqueues at all, and the
new freespace workers passed in a thread_pool_size of 1, instead
of fs_info->thread_pool_size used for the normal endio workers.

Qu then replaced the btrfs_workers with the btrfs_workqueue in commit
fccb5d86d8f5 ("btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.")
which stopped looking at the thread_pool_size for both of them,
passing down the thread_pool_size (as the max_active local variable)
to both, then limits thresh to 2 for end-io-writes but leaves it at the
default (by passing 0) for the free-space write workqueue, which seems like
a big change from the original behavior, without much of an explanation in
the commit log.

So as far as I can tell this change (accidentally) restores behavior
that is closer, but not identical, to the 2014 behavior, and is very
sensible, but it needs to be documented way better or even be split
into a prep patch.

This little exercise also makes me wonder what the point of the
threshold tracking in the btrfs-workqueue is to start with.  I think
the original idea, based on some of the commit logs, is to be able
to set a higher max_active than the workqueue default using
workqueue_set_max_active for HDD workloads via the thread_pool
mount option.  But given that a higher max_active in the workqueue
code has no extra resource cost until the additional workers are
created on demand, I think just wiring up workqueue_set_max_active to
btrfs_workqueue_set_max is a much better option than the separate
tracking that duplicates the core workqueue functionality.  But that
is left for another series.
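
To illustrate what I mean by wiring it up - an untested sketch, not part of
this series, and the normal_wq field name is just from my reading of the
current code:

	/* untested sketch: let btrfs_workqueue_set_max() drive the core limit */
	void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int limit_active)
	{
		workqueue_set_max_active(wq->normal_wq, limit_active);
	}

	/* and for the plain workqueues, in btrfs_resize_thread_pool(): */
	workqueue_set_max_active(fs_info->endio_write_workers, new_pool_size);
	workqueue_set_max_active(fs_info->endio_freespace_worker, new_pool_size);

That would keep thread_pool= meaningful without duplicating the core
accounting.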

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock
  2023-03-17 10:36   ` Johannes Thumshirn
@ 2023-03-20  6:12     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-20  6:12 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Fri, Mar 17, 2023 at 10:36:35AM +0000, Johannes Thumshirn wrote:
> On 14.03.23 18:00, Christoph Hellwig wrote:
> > The per-inode ordered extent list is not only accessed from process
> 
> s/not// ?

Yes.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: defer all write I/O completions to process context
  2023-03-17 10:39 ` defer all write I/O completions to process context Johannes Thumshirn
@ 2023-03-20  6:14   ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-20  6:14 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Fri, Mar 17, 2023 at 10:39:41AM +0000, Johannes Thumshirn wrote:
> But I have a general question, which even might sound pretty dumb. As we're
> out of IRQ/atomic contexts now, do we even need spinlocks or would a mutex
> suffice as well?

Once we're not acquiring a lock from any atomic context, it could
become a mutex.  That being said, spinlocks are the smaller and simpler
data structure, and tend to perform better for short critical sections.
So unless we have a good reason to change any specific lock, I would
not change the locking primitive.
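
For the ordered tree lock the critical sections are on the order of this
sketch, based on the hunks earlier in the series:

	spin_lock(&tree->lock);
	node = tree_search(tree, file_offset);
	if (node) {
		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
		refcount_inc(&entry->refs);
	}
	spin_unlock(&tree->lock);

a couple of rbtree operations and a refcount, which is exactly the kind of
short section where a spinlock tends to win over a sleeping lock.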

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
  2023-03-16 17:10   ` Johannes Thumshirn
  2023-03-16 17:31   ` David Sterba
@ 2023-03-20 11:08   ` Qu Wenruo
  2023-03-20 11:35     ` Qu Wenruo
  2 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 11:08 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs



On 2023/3/15 00:59, Christoph Hellwig wrote:
> The endio_write_workers and endio_freespace_workers workqueues don't use
> any of the ordering features in the btrfs_workqueue, so switch them to
> plain Linux workqueues.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/disk-io.c      | 16 ++++++++--------
>   fs/btrfs/fs.h           |  4 ++--
>   fs/btrfs/ordered-data.c |  8 ++++----
>   fs/btrfs/ordered-data.h |  2 +-
>   fs/btrfs/super.c        |  2 --
>   5 files changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index bb864cf2eed60f..0889eb81e71a7d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1993,8 +1993,10 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>   		destroy_workqueue(fs_info->rmw_workers);
>   	if (fs_info->compressed_write_workers)
>   		destroy_workqueue(fs_info->compressed_write_workers);
> -	btrfs_destroy_workqueue(fs_info->endio_write_workers);
> -	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
> +	if (fs_info->endio_write_workers)
> +		destroy_workqueue(fs_info->endio_write_workers);
> +	if (fs_info->endio_freespace_worker)
> +		destroy_workqueue(fs_info->endio_freespace_worker);
>   	btrfs_destroy_workqueue(fs_info->delayed_workers);
>   	btrfs_destroy_workqueue(fs_info->caching_workers);
>   	btrfs_destroy_workqueue(fs_info->flush_workers);
> @@ -2204,13 +2206,11 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   		alloc_workqueue("btrfs-endio-meta", flags, max_active);
>   	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
>   	fs_info->endio_write_workers =
> -		btrfs_alloc_workqueue(fs_info, "endio-write", flags,
> -				      max_active, 2);
> +		alloc_workqueue("btrfs-endio-write", flags, max_active);
>   	fs_info->compressed_write_workers =
>   		alloc_workqueue("btrfs-compressed-write", flags, max_active);
>   	fs_info->endio_freespace_worker =
> -		btrfs_alloc_workqueue(fs_info, "freespace-write", flags,
> -				      max_active, 0);
> +		alloc_workqueue("btrfs-freespace-write", flags, max_active);
>   	fs_info->delayed_workers =
>   		btrfs_alloc_workqueue(fs_info, "delayed-meta", flags,
>   				      max_active, 0);
> @@ -4536,9 +4536,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
>   	 * the final btrfs_put_ordered_extent() (which must happen at
>   	 * btrfs_finish_ordered_io() when we are unmounting).
>   	 */
> -	btrfs_flush_workqueue(fs_info->endio_write_workers);
> +	flush_workqueue(fs_info->endio_write_workers);
>   	/* Ordered extents for free space inodes. */
> -	btrfs_flush_workqueue(fs_info->endio_freespace_worker);
> +	flush_workqueue(fs_info->endio_freespace_worker);
>   	btrfs_run_delayed_iputs(fs_info);
>   
>   	cancel_work_sync(&fs_info->async_reclaim_work);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 20d554a0c2ac0d..276a17780f2b1b 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -542,8 +542,8 @@ struct btrfs_fs_info {
>   	struct workqueue_struct *endio_meta_workers;
>   	struct workqueue_struct *rmw_workers;
>   	struct workqueue_struct *compressed_write_workers;
> -	struct btrfs_workqueue *endio_write_workers;
> -	struct btrfs_workqueue *endio_freespace_worker;
> +	struct workqueue_struct *endio_write_workers;
> +	struct workqueue_struct *endio_freespace_worker;
>   	struct btrfs_workqueue *caching_workers;
>   
>   	/*
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 1848d0d1a9c41e..23f496f0d7b776 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -303,7 +303,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>   	spin_unlock_irq(&tree->lock);
>   }
>   
> -static void finish_ordered_fn(struct btrfs_work *work)
> +static void finish_ordered_fn(struct work_struct *work)
>   {
>   	struct btrfs_ordered_extent *ordered_extent;
>   
> @@ -330,7 +330,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   {
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct btrfs_workqueue *wq;
> +	struct workqueue_struct *wq;
>   	struct rb_node *node;
>   	struct btrfs_ordered_extent *entry = NULL;
>   	unsigned long flags;
> @@ -439,8 +439,8 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   			refcount_inc(&entry->refs);
>   			trace_btrfs_ordered_extent_mark_finished(inode, entry);
>   			spin_unlock_irqrestore(&tree->lock, flags);
> -			btrfs_init_work(&entry->work, finish_ordered_fn, NULL, NULL);
> -			btrfs_queue_work(wq, &entry->work);
> +			INIT_WORK(&entry->work, finish_ordered_fn);
> +			queue_work(wq, &entry->work);
>   			spin_lock_irqsave(&tree->lock, flags);
>   		}
>   		cur += len;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 18007f9c00add8..b8a92f040458f0 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -146,7 +146,7 @@ struct btrfs_ordered_extent {
>   	/* a per root list of all the pending ordered extents */
>   	struct list_head root_extent_list;
>   
> -	struct btrfs_work work;
> +	struct work_struct work;
>   
>   	struct completion completion;
>   	struct btrfs_work flush_work;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index d8885966e801cd..065b4fab1ee011 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1632,8 +1632,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
>   	btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
>   	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
>   	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
> -	btrfs_workqueue_set_max(fs_info->endio_write_workers, new_pool_size);
> -	btrfs_workqueue_set_max(fs_info->endio_freespace_worker, new_pool_size);

I guess we need workqueue_set_max_active() for all our plain workqueues
too.

Thanks,
Qu

>   	btrfs_workqueue_set_max(fs_info->delayed_workers, new_pool_size);
>   }
>   

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/10] btrfs: refactor btrfs_end_io_wq
  2023-03-14 16:59 ` [PATCH 02/10] btrfs: refactor btrfs_end_io_wq Christoph Hellwig
  2023-03-16 17:12   ` Johannes Thumshirn
@ 2023-03-20 11:09   ` Qu Wenruo
  1 sibling, 0 replies; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 11:09 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs



On 2023/3/15 00:59, Christoph Hellwig wrote:
> Pass a btrfs_bio instead of a fs_info and bio.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/bio.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index cf09c6271edbee..85539964864a34 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -296,10 +296,11 @@ static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
>   		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_FLUSH_ERRS);
>   }
>   
> -static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_fs_info *fs_info,
> -						struct bio *bio)
> +static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
>   {
> -	if (bio->bi_opf & REQ_META)
> +	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
> +
> +	if (bbio->bio.bi_opf & REQ_META)
>   		return fs_info->endio_meta_workers;
>   	return fs_info->endio_workers;
>   }
> @@ -319,16 +320,15 @@ static void btrfs_simple_end_io(struct bio *bio)
>   {
>   	struct btrfs_bio *bbio = btrfs_bio(bio);
>   	struct btrfs_device *dev = bio->bi_private;
> -	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
>   
> -	btrfs_bio_counter_dec(fs_info);
> +	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
>   
>   	if (bio->bi_status)
>   		btrfs_log_dev_io_error(bio, dev);
>   
>   	if (bio_op(bio) == REQ_OP_READ) {
>   		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
> -		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
> +		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
>   	} else {
>   		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
>   			btrfs_record_physical_zoned(bbio);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-14 16:59 ` [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue Christoph Hellwig
  2023-03-16 17:14   ` Johannes Thumshirn
@ 2023-03-20 11:29   ` Qu Wenruo
  2023-03-20 12:30     ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 11:29 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs



On 2023/3/15 00:59, Christoph Hellwig wrote:
> Currently all reads are offloaded to workqueues for checksum
> verification.  Writes are initially processed in the bi_end_io
> context that usually happens in hard or soft interrupts, and are
> only offloaded to a workqueue for ordered extent completion or
> compressed extent handling, and in the future for raid stripe
> tree handling.
> 
> Switch to offload all writes to workqueues instead of doing that
> only for the final ordered extent processing.  This means that
> there are slightly more workqueue offloads for large writes as each
> submitted btrfs_bio can only hold 1MiB worth of data on systems
> with a 4K page size.  On the other hand this will allow large
> simplifications by always having a process context.
> 
> For metadata writes there was no offload at all before, which
> creates various problems like not being able to drop the final
> extent_buffer reference from the end_io handler.
> 
> With the entire series applied this shows no performance differences
> on data heavy workloads, while for metadata heavy workloads like
> Filipe's fs_mark there also is no change in averages, while it
> seems to somewhat reduce the outliers in terms of best and worst
> performance.  This is probably due to the removal of the irq
> disabling in the next patches.

I'm uncertain what the proper or better solution is for handling the
endio functions.

Sure, they are called in a very strict context, so we should keep them
short.
But on the other hand, we already have too many workqueues, and I keep
wondering under what situations they can lead to deadlocks.
(e.g. why do we need to queue endios for free space and regular data inodes
into different workqueues?)

My current approach is to always assume the workqueue has only 1
max_active, but I'm still not sure, in such a case, what would happen if
one work slept.
Would the workqueue be able to pick the next work item? Or is that
workqueue stalled until the only active work gets woken?


Thus I'm uncertain if we're going the correct way.

Personally speaking, I'd like to keep the btrfs bio endio function calls
in the old soft/hard irq context, and let the higher layer queue the
work.

However, we have already loosened the endio context for btrfs bios from the
old soft/hard irq context to the current workqueue context...

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/bio.c          | 74 ++++++++++++++++++++++++++++-------------
>   fs/btrfs/disk-io.c      |  5 +++
>   fs/btrfs/fs.h           |  1 +
>   fs/btrfs/ordered-data.c | 17 +---------
>   fs/btrfs/ordered-data.h |  2 --
>   5 files changed, 57 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 85539964864a34..037eded3b33ba2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -300,20 +300,33 @@ static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
>   {
>   	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
>   
> -	if (bbio->bio.bi_opf & REQ_META)
> -		return fs_info->endio_meta_workers;
> -	return fs_info->endio_workers;
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
> +		if (btrfs_is_free_space_inode(bbio->inode))
> +			return fs_info->endio_freespace_worker;
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_write_meta_workers;
> +		return fs_info->endio_write_workers;
> +	} else {
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_meta_workers;
> +		return fs_info->endio_workers;
> +	}
>   }
>   
> -static void btrfs_end_bio_work(struct work_struct *work)
> +static void btrfs_simple_end_bio_work(struct work_struct *work)
>   {
>   	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
> +	struct bio *bio = &bbio->bio;
> +	struct btrfs_device *dev = bio->bi_private;
>   
>   	/* Metadata reads are checked and repaired by the submitter. */
> -	if (bbio->bio.bi_opf & REQ_META)
> -		bbio->end_io(bbio);
> -	else
> -		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
> +	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
> +		btrfs_check_read_bio(bbio, dev);
> +	} else {
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +			btrfs_record_physical_zoned(bbio);
> +		btrfs_orig_bbio_end_io(bbio);
> +	}
>   }
>   
>   static void btrfs_simple_end_io(struct bio *bio)
> @@ -322,18 +335,19 @@ static void btrfs_simple_end_io(struct bio *bio)
>   	struct btrfs_device *dev = bio->bi_private;
>   
>   	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
> -
>   	if (bio->bi_status)
>   		btrfs_log_dev_io_error(bio, dev);
>   
> -	if (bio_op(bio) == REQ_OP_READ) {
> -		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
> -		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> -	} else {
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> -			btrfs_record_physical_zoned(bbio);
> -		btrfs_orig_bbio_end_io(bbio);
> -	}

I can already see that in the future there will be extra if checks to skip
the workqueue for scrub bios...

> +	INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work);
> +	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> +}
> +
> +static void btrfs_write_end_bio_work(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, end_io_work);
> +
> +	btrfs_orig_bbio_end_io(bbio);
>   }
>   
>   static void btrfs_raid56_end_io(struct bio *bio)
> @@ -343,12 +357,23 @@ static void btrfs_raid56_end_io(struct bio *bio)
>   
>   	btrfs_bio_counter_dec(bioc->fs_info);
>   	bbio->mirror_num = bioc->mirror_num;
> -	if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META))
> -		btrfs_check_read_bio(bbio, NULL);
> -	else
> -		btrfs_orig_bbio_end_io(bbio);
> -
>   	btrfs_put_bioc(bioc);
> +
> +	/*
> +	 * The RAID5/6 code already completes all bios from workqueues, but for
> +	 * writes we need to offload them again to avoid deadlocks in the ordered
> +	 * extent processing.
> +	 */

Mind sharing some details about the possible deadlock here?

To me, it's just a different workqueue doing the finish ordered io workload.

Thanks,
Qu

> +	if (bio_op(bio) == REQ_OP_READ) {
> +		/* Metadata reads are checked and repaired by the submitter. */
> +		if (!(bio->bi_opf & REQ_META))
> +			btrfs_check_read_bio(bbio, NULL);
> +		else
> +			btrfs_orig_bbio_end_io(bbio);
> +	} else {
> +		INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work);
> +		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> +	}
>   }
>   
>   static void btrfs_orig_write_end_io(struct bio *bio)
> @@ -372,9 +397,10 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>   		bio->bi_status = BLK_STS_IOERR;
>   	else
>   		bio->bi_status = BLK_STS_OK;
> -
> -	btrfs_orig_bbio_end_io(bbio);
>   	btrfs_put_bioc(bioc);
> +
> +	INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work);
> +	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
>   }
>   
>   static void btrfs_clone_write_end_io(struct bio *bio)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0889eb81e71a7d..5b63a5161cedea 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2008,6 +2008,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>   	 * the queues used for metadata I/O, since tasks from those other work
>   	 * queues can do metadata I/O operations.
>   	 */
> +	if (fs_info->endio_write_meta_workers)
> +		destroy_workqueue(fs_info->endio_write_meta_workers);
>   	if (fs_info->endio_meta_workers)
>   		destroy_workqueue(fs_info->endio_meta_workers);
>   }
> @@ -2204,6 +2206,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   		alloc_workqueue("btrfs-endio", flags, max_active);
>   	fs_info->endio_meta_workers =
>   		alloc_workqueue("btrfs-endio-meta", flags, max_active);
> +	fs_info->endio_write_meta_workers =
> +		alloc_workqueue("btrfs-endio-write-meta", flags, max_active);
>   	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
>   	fs_info->endio_write_workers =
>   		alloc_workqueue("btrfs-endio-write", flags, max_active);
> @@ -2224,6 +2228,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>   	      fs_info->compressed_write_workers &&
>   	      fs_info->endio_write_workers &&
> +	      fs_info->endio_write_meta_workers &&
>   	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>   	      fs_info->caching_workers && fs_info->fixup_workers &&
>   	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 276a17780f2b1b..e2643bc5c039ad 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -543,6 +543,7 @@ struct btrfs_fs_info {
>   	struct workqueue_struct *rmw_workers;
>   	struct workqueue_struct *compressed_write_workers;
>   	struct workqueue_struct *endio_write_workers;
> +	struct workqueue_struct *endio_write_meta_workers;
>   	struct workqueue_struct *endio_freespace_worker;
>   	struct btrfs_workqueue *caching_workers;
>   
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 23f496f0d7b776..48c7510df80a0d 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -303,14 +303,6 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>   	spin_unlock_irq(&tree->lock);
>   }
>   
> -static void finish_ordered_fn(struct work_struct *work)
> -{
> -	struct btrfs_ordered_extent *ordered_extent;
> -
> -	ordered_extent = container_of(work, struct btrfs_ordered_extent, work);
> -	btrfs_finish_ordered_io(ordered_extent);
> -}
> -
>   /*
>    * Mark all ordered extents io inside the specified range finished.
>    *
> @@ -330,17 +322,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   {
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct workqueue_struct *wq;
>   	struct rb_node *node;
>   	struct btrfs_ordered_extent *entry = NULL;
>   	unsigned long flags;
>   	u64 cur = file_offset;
>   
> -	if (btrfs_is_free_space_inode(inode))
> -		wq = fs_info->endio_freespace_worker;
> -	else
> -		wq = fs_info->endio_write_workers;
> -
>   	if (page)
>   		ASSERT(page->mapping && page_offset(page) <= file_offset &&
>   		       file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
> @@ -439,8 +425,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   			refcount_inc(&entry->refs);
>   			trace_btrfs_ordered_extent_mark_finished(inode, entry);
>   			spin_unlock_irqrestore(&tree->lock, flags);
> -			INIT_WORK(&entry->work, finish_ordered_fn);
> -			queue_work(wq, &entry->work);
> +			btrfs_finish_ordered_io(entry);
>   			spin_lock_irqsave(&tree->lock, flags);
>   		}
>   		cur += len;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b8a92f040458f0..64a37d42e7a24a 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -146,8 +146,6 @@ struct btrfs_ordered_extent {
>   	/* a per root list of all the pending ordered extents */
>   	struct list_head root_extent_list;
>   
> -	struct work_struct work;
> -
>   	struct completion completion;
>   	struct btrfs_work flush_work;
>   	struct list_head work_list;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-20 11:08   ` Qu Wenruo
@ 2023-03-20 11:35     ` Qu Wenruo
  2023-03-20 12:24       ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 11:35 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, linux-btrfs



On 2023/3/20 19:08, Qu Wenruo wrote:
> 
> 
> On 2023/3/15 00:59, Christoph Hellwig wrote:
>> The endio_write_workers and endio_freespace_workers workqueues don't use
>> any of the ordering features in the btrfs_workqueue, so switch them to
>> plain Linux workqueues.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   fs/btrfs/disk-io.c      | 16 ++++++++--------
>>   fs/btrfs/fs.h           |  4 ++--
>>   fs/btrfs/ordered-data.c |  8 ++++----
>>   fs/btrfs/ordered-data.h |  2 +-
>>   fs/btrfs/super.c        |  2 --
>>   5 files changed, 15 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index bb864cf2eed60f..0889eb81e71a7d 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1993,8 +1993,10 @@ static void btrfs_stop_all_workers(struct 
>> btrfs_fs_info *fs_info)
>>           destroy_workqueue(fs_info->rmw_workers);
>>       if (fs_info->compressed_write_workers)
>>           destroy_workqueue(fs_info->compressed_write_workers);
>> -    btrfs_destroy_workqueue(fs_info->endio_write_workers);
>> -    btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
>> +    if (fs_info->endio_write_workers)
>> +        destroy_workqueue(fs_info->endio_write_workers);
>> +    if (fs_info->endio_freespace_worker)
>> +        destroy_workqueue(fs_info->endio_freespace_worker);
>>       btrfs_destroy_workqueue(fs_info->delayed_workers);
>>       btrfs_destroy_workqueue(fs_info->caching_workers);
>>       btrfs_destroy_workqueue(fs_info->flush_workers);
>> @@ -2204,13 +2206,11 @@ static int btrfs_init_workqueues(struct 
>> btrfs_fs_info *fs_info)
>>           alloc_workqueue("btrfs-endio-meta", flags, max_active);
>>       fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, 
>> max_active);
>>       fs_info->endio_write_workers =
>> -        btrfs_alloc_workqueue(fs_info, "endio-write", flags,
>> -                      max_active, 2);
>> +        alloc_workqueue("btrfs-endio-write", flags, max_active);
>>       fs_info->compressed_write_workers =
>>           alloc_workqueue("btrfs-compressed-write", flags, max_active);
>>       fs_info->endio_freespace_worker =
>> -        btrfs_alloc_workqueue(fs_info, "freespace-write", flags,
>> -                      max_active, 0);
>> +        alloc_workqueue("btrfs-freespace-write", flags, max_active);
>>       fs_info->delayed_workers =
>>           btrfs_alloc_workqueue(fs_info, "delayed-meta", flags,
>>                         max_active, 0);
>> @@ -4536,9 +4536,9 @@ void __cold close_ctree(struct btrfs_fs_info 
>> *fs_info)
>>        * the final btrfs_put_ordered_extent() (which must happen at
>>        * btrfs_finish_ordered_io() when we are unmounting).
>>        */
>> -    btrfs_flush_workqueue(fs_info->endio_write_workers);
>> +    flush_workqueue(fs_info->endio_write_workers);
>>       /* Ordered extents for free space inodes. */
>> -    btrfs_flush_workqueue(fs_info->endio_freespace_worker);
>> +    flush_workqueue(fs_info->endio_freespace_worker);
>>       btrfs_run_delayed_iputs(fs_info);
>>       cancel_work_sync(&fs_info->async_reclaim_work);
>> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
>> index 20d554a0c2ac0d..276a17780f2b1b 100644
>> --- a/fs/btrfs/fs.h
>> +++ b/fs/btrfs/fs.h
>> @@ -542,8 +542,8 @@ struct btrfs_fs_info {
>>       struct workqueue_struct *endio_meta_workers;
>>       struct workqueue_struct *rmw_workers;
>>       struct workqueue_struct *compressed_write_workers;
>> -    struct btrfs_workqueue *endio_write_workers;
>> -    struct btrfs_workqueue *endio_freespace_worker;
>> +    struct workqueue_struct *endio_write_workers;
>> +    struct workqueue_struct *endio_freespace_worker;
>>       struct btrfs_workqueue *caching_workers;
>>       /*
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index 1848d0d1a9c41e..23f496f0d7b776 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -303,7 +303,7 @@ void btrfs_add_ordered_sum(struct 
>> btrfs_ordered_extent *entry,
>>       spin_unlock_irq(&tree->lock);
>>   }
>> -static void finish_ordered_fn(struct btrfs_work *work)
>> +static void finish_ordered_fn(struct work_struct *work)
>>   {
>>       struct btrfs_ordered_extent *ordered_extent;
>> @@ -330,7 +330,7 @@ void btrfs_mark_ordered_io_finished(struct 
>> btrfs_inode *inode,
>>   {
>>       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>>       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> -    struct btrfs_workqueue *wq;
>> +    struct workqueue_struct *wq;
>>       struct rb_node *node;
>>       struct btrfs_ordered_extent *entry = NULL;
>>       unsigned long flags;
>> @@ -439,8 +439,8 @@ void btrfs_mark_ordered_io_finished(struct 
>> btrfs_inode *inode,
>>               refcount_inc(&entry->refs);
>>               trace_btrfs_ordered_extent_mark_finished(inode, entry);
>>               spin_unlock_irqrestore(&tree->lock, flags);
>> -            btrfs_init_work(&entry->work, finish_ordered_fn, NULL, 
>> NULL);
>> -            btrfs_queue_work(wq, &entry->work);
>> +            INIT_WORK(&entry->work, finish_ordered_fn);
>> +            queue_work(wq, &entry->work);
>>               spin_lock_irqsave(&tree->lock, flags);
>>           }
>>           cur += len;
>> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
>> index 18007f9c00add8..b8a92f040458f0 100644
>> --- a/fs/btrfs/ordered-data.h
>> +++ b/fs/btrfs/ordered-data.h
>> @@ -146,7 +146,7 @@ struct btrfs_ordered_extent {
>>       /* a per root list of all the pending ordered extents */
>>       struct list_head root_extent_list;
>> -    struct btrfs_work work;
>> +    struct work_struct work;
>>       struct completion completion;
>>       struct btrfs_work flush_work;
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index d8885966e801cd..065b4fab1ee011 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1632,8 +1632,6 @@ static void btrfs_resize_thread_pool(struct 
>> btrfs_fs_info *fs_info,
>>       btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
>>       btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
>>       btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
>> -    btrfs_workqueue_set_max(fs_info->endio_write_workers, 
>> new_pool_size);
>> -    btrfs_workqueue_set_max(fs_info->endio_freespace_worker, 
>> new_pool_size);
> 
> I guess we need workqueue_set_max_active() for all our plain work queues 
> too.

In fact, I believe we not only need to add workqueue_set_max_active() 
for the thread_pool= mount option, but also add a test case for 
thread_pool=1 to shake out all the possible hidden bugs...

Mind if I send a patch adding the max_active setting for all plain
workqueues?

Thanks,
Qu

> 
> Thanks,
> Qu
> 
>>       btrfs_workqueue_set_max(fs_info->delayed_workers, new_pool_size);
>>   }

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-20 11:35     ` Qu Wenruo
@ 2023-03-20 12:24       ` Christoph Hellwig
  2023-03-20 23:19         ` Qu Wenruo
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-20 12:24 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Mon, Mar 20, 2023 at 07:35:45PM +0800, Qu Wenruo wrote:
> In fact, I believe we not only need to add workqueue_set_max_active() for 
> the thread_pool= mount option, but also add a test case for thread_pool=1 
> to shake out all the possible hidden bugs...
>
> Mind me to send a patch adding the max_active setting for all plain 
> workqueues?

I don't think blindly doing that is a good idea.  As explained in my
reply to Dave, going all the way back to 2014, all workqueues that had
a less-than-default threshold never used it to start with.

I'm also really curious (and I might have to do some digging) what
the intended use case is, and if it actually works as-is.  I know
you mentioned higher concurrency for some HDD workloads - do you
still remember what the workloads are?  Because
I'm pretty sure it won't be needed for all workqueues, and the fact
that btrfs is the only caller of workqueue_set_max_active in the
entire kernel makes me very sceptical that we do need it everywhere.

So I'd be much happier to figure out where we needed it first, but even
if not and we want to restore historic behavior from some point in the
past we'd still only need to apply it selectively.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-20 11:29   ` Qu Wenruo
@ 2023-03-20 12:30     ` Christoph Hellwig
  2023-03-20 23:37       ` Qu Wenruo
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-20 12:30 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
> Sure, they are called in very strict context, thus we should keep them 
> short.
> But on the other hand, we're already having too many workqueues, and I'm 
> always wondering under what situation they can lead to deadlock.
> (e.g. why we need to queue endios for free space and regular data inodes 
> into different workqueues?)

In general the reason for separate workqueues is that one workqueue depends
on the execution of another.  It seems like this is Josef's area, but
my impression is that finishing an ordered extent can cause writeback
and a possible wait for data in the freespace inode.  Normally such
workqueue splits should have comments in the code to explain them, but
so far I haven't found one.
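
If that impression is right, the hazard is the classic pattern below - a
minimal, self-contained illustration, not btrfs code, with max_active
assumed to be 1 for clarity:

	#include <linux/workqueue.h>
	#include <linux/completion.h>

	static struct workqueue_struct *demo_wq; /* alloc_workqueue("demo", 0, 1) */
	static DECLARE_COMPLETION(inner_done);

	static void inner_fn(struct work_struct *work)
	{
		complete(&inner_done);
	}
	static DECLARE_WORK(inner_work, inner_fn);

	static void outer_fn(struct work_struct *work)
	{
		/* queued behind us, but can't start while we hold the only slot */
		queue_work(demo_wq, &inner_work);
		wait_for_completion(&inner_done);	/* never completes: deadlock */
	}
	static DECLARE_WORK(outer_work, outer_fn);

Putting the inner work on its own workqueue (which is what the separate
free space worker effectively does) breaks the cycle.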

> My current method is always consider the workqueue has only 1 max_active, 
> but I'm still not sure for such case, what would happen if one work slept?

That's my understanding of the workqueue mechanisms, yes.

> Would the workqueue being able to choose the next work item? Or that 
> workqueue is stalled until the only active got woken?

I think it is stalled.  That's why the workqueue heavily discourages
limiting max_active unless you have a good reason to, and most callers
follow that advice.

> Personally speaking, I'd like to keep the btrfs bio endio function calls in 
> the old soft/hard irq context, and let the higher layer to queue the work.

Can you explain why?

> However we have already loosen the endio context for btrfs bio, from the 
> old soft/hard irq to the current workqueue context already...

Read I/O is executed in a workqueue.  For write completions there also
are various offloads, but none that is consistent and dependable so far.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-20 12:24       ` Christoph Hellwig
@ 2023-03-20 23:19         ` Qu Wenruo
  2023-03-21 12:48           ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 23:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/20 20:24, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 07:35:45PM +0800, Qu Wenruo wrote:
>> In fact, I believe we not only need to add workqueue_set_max_active() for
>> the thread_pool= mount option, but also add a test case for thread_pool=1
>> to shake out all the possible hidden bugs...
>>
>> Mind me to send a patch adding the max_active setting for all plain
>> workqueues?
> 
> I don't think blindly doing that is a good idea.  As explained in my
> reply to Dave, going all the way back to 2014, all workqueues hat had
> a less than default treshold never used it to start with.
> 
> I'm also really curious (and I might have to do some digging) what
> the intended use case is, and if it actually works as-is.  I know
> one of your workloads mentioned a higher concurrency for some HDD
> workloads, do you still remember what the workloads are?

In fact, recently we had a patchset trying to add a new mount option
to pin workqueue loads to certain CPUs.

The idea of that patchset is to limit the CPU usage for compression/csum 
generation.

This can also apply to scrub workloads.


The other thing is, I also want to test whether we could really hit certain
deadlocks related to workqueues.
Currently the thread_pool= mount option only changes the btrfs workqueues,
not the new plain ones.

Thanks,
Qu
>  Because
> I'm pretty sure it won't be needed for all workqueues, and the fact
> that btrfs is the only caller of workqueue_set_max_active in the
> entire kernel makes me very sceptical that we do need it everywhere.
> 
> So I'd be much happier to figure out where we needed it first, but even
> if not and we want to restore historic behavior from some point in the
> past we'd still only need to apply it selectively.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-20 12:30     ` Christoph Hellwig
@ 2023-03-20 23:37       ` Qu Wenruo
  2023-03-21 12:55         ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-20 23:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/20 20:30, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
>> Sure, they are called in very strict context, thus we should keep them
>> short.
>> But on the other hand, we're already having too many workqueues, and I'm
>> always wondering under what situation they can lead to deadlock.
>> (e.g. why we need to queue endios for free space and regular data inodes
>> into different workqueues?)
> 
> In general the reason for separate workqueues is if one workqueue depends
> on the execution of another.  It seems like this is Josef's area, but
> my impression is that finishing and ordered extent can cause writeback
> and a possible wait for data in the freespace inode.  Normally such
> workqueue splits should have comments in the code to explain them, but
> so far I haven't found one.
> 
>> My current method is always consider the workqueue has only 1 max_active,
>> but I'm still not sure for such case, what would happen if one work slept?
> 
> That's my understanding of the workqueue mechanisms, yes.
> 
>> Would the workqueue being able to choose the next work item? Or that
>> workqueue is stalled until the only active got woken?
> 
> I think it is stalled.  That's why the workqueue heavily discourages
> limiting max_active unless you have a good reason to, and most callers
> follow that advise.

To me, this looks more like hiding a bug in the workqueue user.
Shouldn't we expose such bugs instead of hiding them?

Furthermore, although I'm a big fan of plain workqueues, the old btrfs
workqueues allowed us to apply max_active to the underlying plain
workqueues, but now we don't have this ability any more.

Thus at least we should bring back the old behavior, and possibly fix
whatever bugs there are in our workqueue usage.

> 
>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>> the old soft/hard irq context, and let the higher layer to queue the work.
> 
> Can you explain why?

Just to keep the context consistent.

Imagine a situation where we put some sleeping function calls into an endio
function without any doubt, and fully rely on the fact that the endio
function is executed under workqueue context.

Or we have to add extra comments explaining the situation every time we're
doing something like this.

Thanks,
Qu
> 
>> However we have already loosen the endio context for btrfs bio, from the
>> old soft/hard irq to the current workqueue context already...
> 
> read I/O are executed in a workqueue.  For write completions there also
> are various offloads, but none that is consistent and dependable so far.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing
  2023-03-20 23:19         ` Qu Wenruo
@ 2023-03-21 12:48           ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-21 12:48 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Tue, Mar 21, 2023 at 07:19:30AM +0800, Qu Wenruo wrote:
>> I'm also really curious (and I might have to do some digging) what
>> the intended use case is, and if it actually works as-is.  I know
>> one of your workloads mentioned a higher concurrency for some HDD
>> workloads, do you still remember what the workloads are?
>
> In fact, recently we had a patchset trying to adding a new mount option to 
> pin workqueue loads to certain CPUs.
>
> The idea of that patchset is to limit the CPU usage for compression/csum 
> generation.
>
> This can also apply to scrub workload.

... and is totally unrelated to using either a core workqueue max_active
value or reimplementing that in btrfs.

> The other thing is, I also want to test if we could really have certain 
> deadlock related to workqueue.
> Currently thread_pool= mount option only changes the btrfs workqueues, not 
> the new plain ones.

What kind of deadlock testing do you want, and why does it apply
to all workqueues in btrfs and not other workqueues?  Note that you'd
also need to change the btrfs_workqueue code to make it actually apply
literally everywhere.

Maybe we can start by going back to my request and actually come
up with a definition for what thread_pool= is supposed to affect, and
how users should pick values for it?  There is a definition in the
btrfs(5) man page, but that one has been wrong at least since 2014
and the switch to use workqueues.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-20 23:37       ` Qu Wenruo
@ 2023-03-21 12:55         ` Christoph Hellwig
  2023-03-21 23:37           ` Qu Wenruo
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-21 12:55 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>> I think it is stalled.  That's why the workqueue heavily discourages
>> limiting max_active unless you have a good reason to, and most callers
>> follow that advise.
>
> To me, this looks more like hiding a bug in the workqueue user.
> Shouldn't we expose such bugs instead of hiding them?

If you limit max_active to a certain value, you clearly tell the
workqueue code not to use more workers than that.  That is what the
argument is for.

I don't see where there is a bug, or that someone is hiding it.

> Furthermore although I'm a big fan of plain workqueue, the old btrfs 
> workqueues allows us to apply max_active to the plain workqueues, but now 
> we don't have this ability any more.

You can pass max_active to plain Linux workqueues and there is a bunch
of places that do that, although the vast majority passes either 0 to
use the default, or 1 to limit themselves to a single active worker.

The core also allows changing it, but nothing but the btrfs_workqueue
code ever does.  We can wire up btrfs to change the concurrency if we
have a good reason to do so.  And I'd like to figure out if there is one,
and if yes for which workqueue, but for that we'd need an argument for
why a given workqueue would want a larger or smaller than default
value.  The old argument of higher resource usage with a bigger number
of workers does not apply any more with the concurrency managed
workqueue implementation.
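
For reference, here is a rough sketch of that core interface;
alloc_workqueue() and workqueue_set_max_active() are the existing APIs,
while the queue name and the numbers are made-up placeholders:

	struct workqueue_struct *wq;

	/* cap this queue at 8 concurrently executing work items */
	wq = alloc_workqueue("example-wq", WQ_UNBOUND, 8);
	if (!wq)
		return -ENOMEM;

	/* the limit can also be changed later, e.g. from a mount option */
	workqueue_set_max_active(wq, 4);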

>>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>>> the old soft/hard irq context, and let the higher layer to queue the work.
>>
>> Can you explain why?
>
> Just to keep the context consistent.

Which is what this series does.  Before, all read I/O completion handlers
were called in workqueue context, while write ones weren't.  With the
series, write completion handlers are called from workqueue context
as well, and with that all significant btrfs code except for tiny bits
of bi_end_io handlers is called from process context.

> Imagine a situation, where we put some sleep functions into an endio function 
> without any doubt, and fully rely on the fact the endio function is 
> executed under workqueue context.

Being able to do that is the point of this series.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-21 12:55         ` Christoph Hellwig
@ 2023-03-21 23:37           ` Qu Wenruo
  2023-03-22  8:32             ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-21 23:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/21 20:55, Christoph Hellwig wrote:
> On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>>> I think it is stalled.  That's why the workqueue heavily discourages
>>> limiting max_active unless you have a good reason to, and most callers
>>> follow that advice.
>>
>> To me, this looks more like hiding a bug in the workqueue user.
>> Shouldn't we expose such bugs instead of hiding them?
> 
> If you limit max_active to a certain value, you clearly tell the
> workqueue code not to use more workers than that.  That is what the
> argument is for.

And if a workload can only be deadlock free using the default 
max_active, but not any value smaller, then I'd say the workload itself 
is buggy.

> 
> I don't see where there is a bug, and that someone is hiding it.
> 
>> Furthermore although I'm a big fan of plain workqueue, the old btrfs
>> workqueues allows us to apply max_active to the plain workqueues, but now
>> we don't have this ability any more.
> 
> You can pass max_active to plain Linux workqueues and there is a bunch
> of places that do that, although the vast majority passes either 0 to
> use the default, or 1 to limit themselves to a single active worker.
> 
> The core also allows to change it, but nothing but the btrfs_workqueue
> code ever does.  We can wire up btrfs to change the concurrency if we
> have a good reason to do.  And I'd like to figure out if there is one,
> and if yes for which workqueue, but for that we'd need to have an
> argument for why which workqueue would like to use a larger or smaller
> than default value.  The old argument of higher resource usage with
> a bigger number of workers does not apply any more with the concurrency
> managed workqueue implementation.

The use case is still there:
to limit the amount of CPU time spent by btrfs workloads, from csum 
verification to compression.

And if limiting max_active for plain workqueues could help us expose 
possible deadlocks, that would be an extra benefit.

Thanks,
Qu

> 
>>>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>>>> the old soft/hard irq context, and let the higher layer to queue the work.
>>>
>>> Can you explain why?
>>
>> Just to keep the context consistent.
> 
> Which is what this series does.  Before all read I/O completion handlers
> were called in workqueue context, while write ones weren't.  With the
> series write completion handlers are called from workqueue context
> as well, and with that all significant btrfs code except for tiny bits
> of bi_end_io handlers are called from process context.
> 
>> Imagine a situation, where we put some sleep functions into an endio function
>> without any doubt, and fully rely on the fact the endio function is
>> executed under workqueue context.
> 
> Being able to do that is the point of this series.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-21 23:37           ` Qu Wenruo
@ 2023-03-22  8:32             ` Christoph Hellwig
  2023-03-23  8:07               ` Qu Wenruo
  2023-03-23 14:53               ` Chris Mason
  0 siblings, 2 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-22  8:32 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>> If you limit max_active to a certain value, you clearly tell the
>> workqueue code not to use more workers than that.  That is what the
>> argument is for.
>
> And if a work load can only be deadlock free using the default max_active, 
> but not any value smaller, then I'd say the work load itself is buggy.

Anything that has an interaction between two instances of a work_struct
can deadlock.  Only a single execution context is guaranteed (and even
that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
to e.g. only using a global workqueue in file systems or block devices
that can stack.

Fortunately these days lockdep is generally able to catch these
dependencies as well.

> The usecase is still there.
> To limit the amount of CPU time spent by btrfs workloads, from csum 
> verification to compression.

So this is the first time I see an actual explanation, so thanks for
that first of all.  If this is the reason, we should apply the max_active
limit to all workqueues that do csum and compression work, but not to
other random workqueues.

Dave, Josef, Chris: do you agree with this interpretation?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-22  8:32             ` Christoph Hellwig
@ 2023-03-23  8:07               ` Qu Wenruo
  2023-03-23  8:12                 ` Christoph Hellwig
  2023-03-23 14:53               ` Chris Mason
  1 sibling, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-23  8:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/22 16:32, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not to use more workers than that.  That is what the
>>> argument is for.
>>
>> And if a work load can only be deadlock free using the default max_active,
>> but not any value smaller, then I'd say the work load itself is buggy.
> 
> Anything that has an interaction between two instances of a work_struct
> can deadlock.  Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.

Shouldn't we avoid such cross-workqueue workloads at all cost?

IIRC at least we don't have such usage in btrfs,
and I hope we never will.

> 
> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
> 
>> The usecase is still there.
>> To limit the amount of CPU time spent by btrfs workloads, from csum
>> verification to compression.
> 
> So this is the first time I see an actual explanation, thanks for that
> first.  If this is the reason we should apply the max_active to all
>> workqueues that do csum and compression work, but not to other random
> workqueues.

If we're limiting the max_active for certain workqueues, then I'd say 
why not to all workqueues?

If we have some usage relying on the number of workers, at least we 
should be able to expose it and fix it.
(IIRC we should have a better way with less cross-workqueue dependency.)

Thanks,
Qu

> 
> Dave, Josef, Chris: do you agree with this interpretation?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-23  8:07               ` Qu Wenruo
@ 2023-03-23  8:12                 ` Christoph Hellwig
  2023-03-23  8:20                   ` Qu Wenruo
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-23  8:12 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>> And if a work load can only be deadlock free using the default max_active,
>>> but not any value smaller, then I'd say the work load itself is buggy.
>>
>> Anything that has an interaction between two instances of a work_struct
>> can deadlock.  Only a single execution context is guaranteed (and even
>> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
>> to e.g. only using a global workqueue in file systems or block devices
>> that can stack.
>
> Shouldn't we avoid such cross workqueue workload at all cost?

Yes, btrfs uses per-sb workqueues.  As do most other places now,
but there was a bit of a learning curve years ago.

>> So this is the first time I see an actual explanation, thanks for that
>> first.  If this is the reason we should apply the max_active to all
>> workqueues that do csum and compression work, but not to other random
>> workqueues.
>
> If we're limiting the max_active for certain workqueues, then I'd say why 
> not to all workqueues?
>
> If we have some usage relying on the amount of workers, at least we should 
> be able to expose it and fix it.

Again, btrfs is the odd one out in allowing the user to set arbitrary limits.
This code predates using the kernel workqueues, and I'm a little
doubtful it is still useful.  But for that I need to figure out why
it was even kept when converting btrfs to use workqueues.

> (IIRC we should have a better way with less cross-workqueue dependency)

I've been very actively working on reducing the number of different
workqueues.  This series is an important part of that.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-23  8:12                 ` Christoph Hellwig
@ 2023-03-23  8:20                   ` Qu Wenruo
  2023-03-24  1:11                     ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-03-23  8:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/23 16:12, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>>> And if a work load can only be deadlock free using the default max_active,
>>>> but not any value smaller, then I'd say the work load itself is buggy.
>>>
>>> Anything that has an interaction between two instances of a work_struct
>>> can deadlock.  Only a single execution context is guaranteed (and even
>>> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
>>> to e.g. only using a global workqueue in file systems or block devices
>>> that can stack.
>>
>> Shouldn't we avoid such cross workqueue workload at all cost?
> 
> Yes, btrfs uses per-sb workqueues.  As do most other places now,
> but there was a bit of a learning curve years ago.
> 
>>> So this is the first time I see an actual explanation, thanks for that
>>> first.  If this is the reason we should apply the max_active to all
>>> workqueues that do csum and compression work, but not to other random
>>> workqueues.
>>
>> If we're limiting the max_active for certain workqueues, then I'd say why
>> not to all workqueues?
>>
>> If we have some usage relying on the amount of workers, at least we should
>> be able to expose it and fix it.
> 
> Again, btrfs is the odd one out allowing the user to set arbitrary limits.
> This code predates using the kernel workqueues, and I'm a little
> doubtful it still is useful.  But for that I need to figure out why
> it was even kept when converting btrfs to use workqueues.
> 
>> (IIRC we should have a better way with less cross-workqueue dependency)
> 
> I've been very actively working on reducing the amount of different
> workqueues.  This series is an important part of that.
> 
Yeah, your work is very much appreciated, that's without any doubt.

I'm just wondering: if we all know that we should avoid cross-workqueue 
dependencies to avoid deadlocks, then why is no one trying to expose any 
such deadlocks by simply limiting max_active to 1 for all workqueues?

Especially with lockdep, once we get a reproducer, it should not be that 
hard to fix.

That's the only other valid use case besides limiting CPU usage for 
csum/compression.  I know regular users should not touch this, but we 
developers should still be able to create a worst case for our CI systems 
to suffer through.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-22  8:32             ` Christoph Hellwig
  2023-03-23  8:07               ` Qu Wenruo
@ 2023-03-23 14:53               ` Chris Mason
  2023-03-24  1:09                 ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Chris Mason @ 2023-03-23 14:53 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs

On 3/22/23 4:32 AM, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not to use more workers than that.  That is what the
>>> argument is for.
>>
>> And if a work load can only be deadlock free using the default max_active, 
>> but not any value smaller, then I'd say the work load itself is buggy.
> 
> Anything that has an interaction between two instances of a work_struct
> can deadlock.  Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.
> 
> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
> 
>> The usecase is still there.
>> To limit the amount of CPU time spent by btrfs workloads, from csum 
>> verification to compression.
> 
> So this is the first time I see an actual explanation, thanks for that
> first.  If this is the reason we should apply the max_active to all
>> workqueues that do csum and compression work, but not to other random
> workqueues.
> 
> Dave, Josef, Chris: do you agree with this interpretation?

The original reason for limiting the number of workers was that work
like crc calculations was causing tons of context switches.  This isn't
a surprise: there were a ton of work items, and each one was processed
very quickly.

So the original btrfs thread pools had optimizations meant to limit
context switches by preferring to give work to a smaller number of workers.

For compression it's more clear cut.  I wanted the ability to saturate
all the CPUs with compression work, but also wanted a default that
didn't take over the whole machine.

-chris



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-23 14:53               ` Chris Mason
@ 2023-03-24  1:09                 ` Christoph Hellwig
  2023-03-24 13:25                   ` Chris Mason
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-24  1:09 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Qu Wenruo, Chris Mason, Josef Bacik,
	David Sterba, Johannes Thumshirn, linux-btrfs

On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
> The original reason for limiting the number of workers was that work
> like crc calculations was causing tons of context switches.  This isn't
> a surprise, there were a ton of work items, and each one was processed
> very quickly.

Yeah.  Although quite a bit of that went away since, by not doing
the offload for sync I/O, nor for metadata when there is a fast crc32
implementation (we need to do the same for small data I/O and hardware
accelerated sha256, btw), and by doing these in larger chunks, but ..

> 
> So the original btrfs thread pools had optimizations meant to limit
> context switches by preferring to give work to a smaller number of workers.

.. this is something that the workqueue code already does under the hood
anyway.

> For compression it's more clear cut.  I wanted the ability to saturate
> all the CPUs with compression work, but also wanted a default that
> didn't take over the whole machine.

And that's actually a very good use case.

It almost asks for a separate option just for compression, or at least
for compression and checksumming only.

Is there consensus that for now we should limit thread_pool to the
workqueues that do compression and checksumming?  Then I'll send a
series to wire it up again for the read completion workqueue that does
the checksum verification and for the compressed write workqueue, but
drop it for all others except the write submission one?
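
Purely as a sketch of what that wiring could look like (this is not the
actual patch; the queue names are approximate and thread_pool_size is
the value behind the thread_pool= mount option):

	unsigned int limit = fs_info->thread_pool_size;

	/* checksum verification on reads and compressed writes get the knob */
	fs_info->endio_workers =
		alloc_workqueue("btrfs-endio", WQ_UNBOUND, limit);
	fs_info->compressed_write_workers =
		alloc_workqueue("btrfs-compressed-write", WQ_UNBOUND, limit);

	/* everything else keeps the workqueue default (max_active == 0) */
	fs_info->endio_write_workers =
		alloc_workqueue("btrfs-endio-write", WQ_UNBOUND, 0);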

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-23  8:20                   ` Qu Wenruo
@ 2023-03-24  1:11                     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-24  1:11 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On Thu, Mar 23, 2023 at 04:20:45PM +0800, Qu Wenruo wrote:
> I'm just wondering if we all know that we should avoid cross-workqueue 
> dependency to avoid deadlock, then why no one is trying to expose any such 
> deadlocks by simply limiting max_active to 1 for all workqueues?

Cross-workqueue dependencies are not a problem per se, and a lot of
places in the kernel rely on them.  Dependencies on another work_struct
on the same workqueue, on the other hand, are a problem.
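
A minimal made-up sketch of that pattern: with max_active == 1 on the
queue, work_a below occupies the only execution slot, so work_b never
runs and flush_work() never returns:

	static struct workqueue_struct *wq;	/* alloc_workqueue("x", 0, 1) */
	static struct work_struct work_b;

	static void work_b_fn(struct work_struct *work)
	{
		/* does something work_a is waiting for */
	}

	static void work_a_fn(struct work_struct *work)
	{
		queue_work(wq, &work_b);
		/* deadlock: work_b is stuck behind work_a on the same queue */
		flush_work(&work_b);
	}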

> Especially with lockdep, once we got a reproduce, it should not be that 
> hard to fix.

lockdep helps you find them without requiring you to limit max_active.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-24  1:09                 ` Christoph Hellwig
@ 2023-03-24 13:25                   ` Chris Mason
  2023-03-24 19:20                     ` Chris Mason
  2023-03-25  8:15                     ` Christoph Hellwig
  0 siblings, 2 replies; 48+ messages in thread
From: Chris Mason @ 2023-03-24 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs

On 3/23/23 9:09 PM, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
>> The original reason for limiting the number of workers was that work
>> like crc calculations was causing tons of context switches.  This isn't
>> a surprise, there were a ton of work items, and each one was processed
>> very quickly.
> 
> Yeah.  Although quite a bit of that went away since, by not doing
> the offload for sync I/O, not for metadata when there is a fast crc32
> implementation (we need to do the same for small data I/O and hardware
> accelerated sha256, btw), and by doing these in larger chunks, but ..
> 
>>
>> So the original btrfs thread pools had optimizations meant to limit
>> context switches by preferring to give work to a smaller number of workers.
> 
> .. this is something that the workqueue code already does under the hood
> anyway.

Yeah, between hardware changes and kernel changes, these motivations
don't really apply anymore.  Even if they did, we're probably better off
fixing that in the workqueue code directly now.

> 
>> For compression it's more clear cut.  I wanted the ability to saturate
>> all the CPUs with compression work, but also wanted a default that
>> didn't take over the whole machine.
> 
> And that's actually a very good use case.
> 
> It almost asks for a separate option just for compression, or at least
> for compression and checksumming only.
> 
> Is there consensus that for now we should limit thread_pool to 
> the workqueues that do compression and checksumming?  Then
> I'll send a series to wire it up for the read completion workqueue
> again that does the checksum verification, the compressed write
> workqueue, but drop it for all others but the write submission
> one?

As you mentioned above, we're currently doing synchronous crcs for
metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
We've benchmarked this a few times and I think with modern hardware a
better default is just always doing synchronous crcs for data too.

Qu or David, have you looked at synchronous data crcs recently?

Looking ahead, encryption is going to want similar knobs to compression.

My preference would be:

- crcs are always inline if your hardware is fast
- Compression, encryption, slow hardware crcs use the thread_pool_size knob
- We don't try to limit the other workers

The idea behind the knob is just "how much of your CPU should each btrfs
mount consume?"  Obviously we'll silently judge anyone that chooses less
than 100% but I guess it's good to give people options.
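
Something along these lines, purely as an illustration of that policy
(BTRFS_FS_CSUM_IMPL_FAST is the existing "fast checksum implementation"
flag; the two helpers are made-up names):

	static void csum_data_bio(struct btrfs_fs_info *fs_info,
				  struct btrfs_bio *bbio)
	{
		if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags)) {
			/* fast hardware csum: checksum inline in the submitter */
			csum_bio_inline(bbio);
			return;
		}
		/* slow csum: offload to the thread_pool= sized workqueue */
		offload_csum_to_workqueue(fs_info, bbio);
	}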

-chris

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-24 13:25                   ` Chris Mason
@ 2023-03-24 19:20                     ` Chris Mason
  2023-03-25  8:13                       ` Christoph Hellwig
  2023-03-25  8:15                     ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Chris Mason @ 2023-03-24 19:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs, Jens Axboe

On 3/24/23 9:25 AM, Chris Mason wrote:
> On 3/23/23 9:09 PM, Christoph Hellwig wrote:

> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.
> 

I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
did 1M buffered writes w/fio.

Unpatched Linus:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

Jens said there was a lot of variation during the run, going down as low
as 1100MB/s.

Synchronous crcs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

The synchronous run had about the same peak write speed but much lower
dips, and fewer kworkers churning around.

Both tests had fio saturated at 100% CPU.

I think we should flip crcs to synchronous as long as the implementation is HW accelerated.

-chris


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-24 19:20                     ` Chris Mason
@ 2023-03-25  8:13                       ` Christoph Hellwig
  2023-03-25 17:16                         ` Chris Mason
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-25  8:13 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Qu Wenruo, Chris Mason, Josef Bacik,
	David Sterba, Johannes Thumshirn, linux-btrfs, Jens Axboe

On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
> I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
> did 1M buffered writes w/fio.
> 
> Unpatched Linus:
> 
> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
> 
> Jens said there was a lot of variation during the run going down as low
> as 1100MB/s.
> 
> Synchronous crcs:
> 
> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
> 
> The synchronous run had about the same peak write speed but much lower
> dips, and fewer kworkers churning around.

FYI, I did some similar runs on this just with kvm on consumer drives,
and the results were similar.  Interestingly enough I first accidentally
ran with the non-accelerated crc32 code, as kvm by default doesn't
pass through cpu flags.  Even with that the workqueue offload only
looked better for really large writes, and then only marginally.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-24 13:25                   ` Chris Mason
  2023-03-24 19:20                     ` Chris Mason
@ 2023-03-25  8:15                     ` Christoph Hellwig
  2023-03-25  8:42                       ` Qu Wenruo
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-03-25  8:15 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Qu Wenruo, Chris Mason, Josef Bacik,
	David Sterba, Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.
> 
> Qu or David, have you looked at synchronous data crcs recently?

As mentioned in the other mail, I have looked at it a bit.  But only for crc32
so far, and only on x86, so the matrix might be a little more
complicated.

> My preference would be:
> 
> - crcs are always inline if your hardware is fast
> - Compression, encryption, slow hardware crcs use the thread_pool_size knob
> - We don't try to limit the other workers
> 
> The idea behind the knob is just "how much of your CPU should each btrfs
> mount consume?"  Obviously we'll silently judge anyone that chooses less
> than 100% but I guess it's good to give people options.

Ok.  I'll do a series about the knobs ASAP, and then the inline CRCs
next.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-25  8:15                     ` Christoph Hellwig
@ 2023-03-25  8:42                       ` Qu Wenruo
  0 siblings, 0 replies; 48+ messages in thread
From: Qu Wenruo @ 2023-03-25  8:42 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs



On 2023/3/25 16:15, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
>> As you mentioned above, we're currently doing synchronous crcs for
>> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
>> We've benchmarked this a few times and I think with modern hardware a
>> better default is just always doing synchronous crcs for data too.
>>
>> Qu or David, have you looked at synchronous data crcs recently?
> 
> As mentioned in the other mail I have a bit.  But only for crc32
> so far, and only on x86, so the matrix might be a little more
> complicated.

Haven't really looked at the async csum part at all.

In fact I'm not even sure what the point of generating data checksums 
asynchronously is...

I didn't see any extra splitting to get multiple threads to calculate 
the hash on different parts.
Furthermore, I'm not even sure all the supported hash algos 
(CRC32/SHA256/BLAKE2/XXHASH) can support such split-and-merge 
multi-threading.
(IIRC SHA256 cannot?)

Thus I'm not really sure of the benefit of async csum generation anyway.
(Not to mention that asynchronous code itself is not straightforward.)

Considering that we at least used to do data csum verification in the endio 
function synchronously, I see no problem doing the generation path 
synchronously, even without hardware-accelerated checksum support.

Thanks,
Qu

> 
>> My preference would be:
>>
>> - crcs are always inline if your hardware is fast
>> - Compression, encryption, slow hardware crcs use the thread_pool_size knob
>> - We don't try to limit the other workers
>>
>> The idea behind the knob is just "how much of your CPU should each btrfs
>> mount consume?"  Obviously we'll silently judge anyone that chooses less
>> than 100% but I guess it's good to give people options.
> 
> Ok.  I'll do a series about the knobs ASAP, and then the inline CRCs
> next.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue
  2023-03-25  8:13                       ` Christoph Hellwig
@ 2023-03-25 17:16                         ` Chris Mason
  0 siblings, 0 replies; 48+ messages in thread
From: Chris Mason @ 2023-03-25 17:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, linux-btrfs, Jens Axboe

On 3/25/23 4:13 AM, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
>> I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
>> did 1M buffered writes w/fio.
>>
>> Unpatched Linus:
>>
>> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
>>
>> Jens said there was a lot of variation during the run going down as low
>> as 1100MB/s.
>>
>> Synchronous crcs:
>>
>> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
>>
>> The synchronous run had about the same peak write speed but much lower
>> dips, and fewer kworkers churning around.
> 
> FYI, I did some similar runs on this just with kvm on consumer drives,
> and the results were similar.  Interestingly enough I first accidentally
> ran with the non-accelerated crc32 code, as kvm by default doesn't
> pass through cpu flags.  Even with that the workqueue offload only
> looked better for really large writes, and then only marginally.

There is pretty clearly an opportunity to tune the crc workers, but I
wouldn't make that a requirement for your series.  It'll be a bigger win
to just force inline most of the time.

-chris


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-03-25 17:17 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-14 16:59 defer all write I/O completions to process context Christoph Hellwig
2023-03-14 16:59 ` [PATCH 01/10] btrfs: use a plain workqueue for ordered_extent processing Christoph Hellwig
2023-03-16 17:10   ` Johannes Thumshirn
2023-03-16 17:31   ` David Sterba
2023-03-20  6:12     ` Christoph Hellwig
2023-03-20 11:08   ` Qu Wenruo
2023-03-20 11:35     ` Qu Wenruo
2023-03-20 12:24       ` Christoph Hellwig
2023-03-20 23:19         ` Qu Wenruo
2023-03-21 12:48           ` Christoph Hellwig
2023-03-14 16:59 ` [PATCH 02/10] btrfs: refactor btrfs_end_io_wq Christoph Hellwig
2023-03-16 17:12   ` Johannes Thumshirn
2023-03-20 11:09   ` Qu Wenruo
2023-03-14 16:59 ` [PATCH 03/10] btrfs: offload all write I/O completions to a workqueue Christoph Hellwig
2023-03-16 17:14   ` Johannes Thumshirn
2023-03-20 11:29   ` Qu Wenruo
2023-03-20 12:30     ` Christoph Hellwig
2023-03-20 23:37       ` Qu Wenruo
2023-03-21 12:55         ` Christoph Hellwig
2023-03-21 23:37           ` Qu Wenruo
2023-03-22  8:32             ` Christoph Hellwig
2023-03-23  8:07               ` Qu Wenruo
2023-03-23  8:12                 ` Christoph Hellwig
2023-03-23  8:20                   ` Qu Wenruo
2023-03-24  1:11                     ` Christoph Hellwig
2023-03-23 14:53               ` Chris Mason
2023-03-24  1:09                 ` Christoph Hellwig
2023-03-24 13:25                   ` Chris Mason
2023-03-24 19:20                     ` Chris Mason
2023-03-25  8:13                       ` Christoph Hellwig
2023-03-25 17:16                         ` Chris Mason
2023-03-25  8:15                     ` Christoph Hellwig
2023-03-25  8:42                       ` Qu Wenruo
2023-03-14 16:59 ` [PATCH 04/10] btrfs: remove the compressed_write_workers workqueue Christoph Hellwig
2023-03-14 16:59 ` [PATCH 05/10] btrfs: remove irq disabling for btrfs_workqueue.list_lock Christoph Hellwig
2023-03-17 10:34   ` Johannes Thumshirn
2023-03-14 16:59 ` [PATCH 06/10] btrfs: remove irq disabling for subpage.list_lock Christoph Hellwig
2023-03-14 16:59 ` [PATCH 07/10] btrfs: remove irq disabling for leak_lock Christoph Hellwig
2023-03-17 10:35   ` Johannes Thumshirn
2023-03-14 16:59 ` [PATCH 08/10] btrfs: remove irq disabling for fs_info.ebleak_lock Christoph Hellwig
2023-03-17 10:35   ` Johannes Thumshirn
2023-03-14 16:59 ` [PATCH 09/10] btrfs: remove irq_disabling for ordered_tree.lock Christoph Hellwig
2023-03-17 10:36   ` Johannes Thumshirn
2023-03-20  6:12     ` Christoph Hellwig
2023-03-14 16:59 ` [PATCH 10/10] btrfs: remove confusing comments Christoph Hellwig
2023-03-17 10:37   ` Johannes Thumshirn
2023-03-17 10:39 ` defer all write I/O completions to process context Johannes Thumshirn
2023-03-20  6:14   ` Christoph Hellwig
