[PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support

* [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support
@ 2019-07-10 19:28 Tejun Heo
  2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

Hello,

This patchset contains only the btrfs part of the following patchset.

  [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support

The block part has already been applied to

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ for-linus

with some naming changes.  This patchset has been updated accordingly.

When writeback is executed asynchronously (e.g. for compression), bios
are bounced to and issued by a worker pool shared by all cgroups.  This
leads to significant priority inversions when cgroup IO control is in
use - IOs for a low priority cgroup can tie down the workers forcing
higher priority IOs to wait behind them.

This patchset updates btrfs to issue async IOs through the new bio
punt mechanism.  A bio tagged with REQ_CGROUP_PUNT flag is bounced to
the asynchronous issue context of the associated blkcg on
submit_bio().  As the bios are issued from per-blkcg work items,
there's no concern for priority inversions and it doesn't require
invasive changes to the filesystems.
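
As a rough illustration of the punt idea described above, here is a
user-space toy model.  It is not kernel code and not the block layer's
implementation; every toy_* name, the fixed queue size, and the
"fall back to inline issue when full" behaviour are assumptions made
up for this sketch.  It only shows that a tagged bio is queued on its
own cgroup's list instead of a list shared by all cgroups.

```c
#include <stddef.h>

#define TOY_REQ_CGROUP_PUNT	(1u << 0)	/* stand-in for REQ_CGROUP_PUNT */
#define TOY_MAX_QUEUED		8

/* Hypothetical per-cgroup async issue context. */
struct toy_blkcg {
	int queued[TOY_MAX_QUEUED];	/* ids of bios awaiting async issue */
	size_t nr_queued;
};

/* Hypothetical bio: an id, its flags and the cgroup it belongs to. */
struct toy_bio {
	int id;
	unsigned int opf;
	struct toy_blkcg *blkcg;
};

/*
 * Returns 1 if the bio was punted to its own cgroup's queue, 0 if the
 * caller should issue it inline.
 */
int toy_submit_bio(struct toy_bio *bio)
{
	struct toy_blkcg *cg = bio->blkcg;

	if (!(bio->opf & TOY_REQ_CGROUP_PUNT) || !cg)
		return 0;
	if (cg->nr_queued >= TOY_MAX_QUEUED)
		return 0;	/* toy fallback: queue full, issue inline */

	cg->queued[cg->nr_queued++] = bio->id;
	return 1;
}
```

Because each cgroup drains only its own queue in this model, a backlog
in a low priority cgroup never sits in front of a bio that belongs to
a higher priority one.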

This patchset contains the following 5 patches.

 0001-Btrfs-stop-using-btrfs_schedule_bio.patch
 0002-Btrfs-delete-the-entire-async-bio-submission-framewo.patch
 0003-Btrfs-only-associate-the-locked-page-with-one-async_.patch
 0004-Btrfs-use-REQ_CGROUP_PUNT-for-worker-thread-submitte.patch
 0005-Btrfs-extent_write_locked_range-should-attach-inode-.patch

The patches are also available in the following branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-btrfs-cgroup-updates-v3

Thanks, diffstat follows.

 fs/btrfs/compression.c |   16 ++
 fs/btrfs/compression.h |    3 
 fs/btrfs/ctree.h       |    1 
 fs/btrfs/disk-io.c     |   25 +---
 fs/btrfs/extent_io.c   |   15 +-
 fs/btrfs/inode.c       |   62 +++++++++--
 fs/btrfs/super.c       |    1 
 fs/btrfs/volumes.c     |  264 -------------------------------------------------
 fs/btrfs/volumes.h     |   10 -
 9 files changed, 90 insertions(+), 307 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/20190615182453.843275-1-tj@kernel.org

* [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio()
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
@ 2019-07-10 19:28 ` Tejun Heo
  2019-07-11 11:32   ` Nikolay Borisov
  2019-07-10 19:28 ` [PATCH 2/5] Btrfs: delete the entire async bio submission framework Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

From: Chris Mason <clm@fb.com>

btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call.  This has been used to make sure async crc and
compression helpers don't get stuck on IO submission.  To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.

All of this costs at least one extra context switch during IO submission,
and doesn't fit well with the modern blk-mq IO stack.  So, this commit stops
using btrfs_schedule_bio().  We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.
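
To make the difference concrete, here is a small user-space sketch of
the two submission styles.  It is a toy under stated assumptions: the
toy_* types and functions are hypothetical, and the list handling is
only modelled on the head/tail append that this patch removes from
btrfs_schedule_bio().

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct bio and the per-device list. */
struct toy_bio {
	struct toy_bio *next;
	int issued;
};

struct toy_pending {
	struct toy_bio *head;
	struct toy_bio *tail;
};

/* Old style: append to a per-device list; a helper issues it later. */
void toy_schedule_bio(struct toy_pending *p, struct toy_bio *bio)
{
	bio->next = NULL;
	if (p->tail)
		p->tail->next = bio;
	p->tail = bio;
	if (!p->head)
		p->head = bio;
}

/* New style: issue directly from the caller's context. */
void toy_submit_direct(struct toy_bio *bio)
{
	bio->issued = 1;
}
```

In the queued style the bio is not issued until some other thread
walks the list, which is where the extra context switch comes from.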

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/compression.c |  8 +++---
 fs/btrfs/disk-io.c     |  6 ++---
 fs/btrfs/inode.c       |  6 ++---
 fs/btrfs/volumes.c     | 55 +++---------------------------------------
 fs/btrfs/volumes.h     |  2 +-
 5 files changed, 15 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 84dd4a8980c5..dfc4eb9b7717 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -354,7 +354,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
-			ret = btrfs_map_bio(fs_info, bio, 0, 1);
+			ret = btrfs_map_bio(fs_info, bio, 0);
 			if (ret) {
 				bio->bi_status = ret;
 				bio_endio(bio);
@@ -384,7 +384,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, bio, 0, 1);
+	ret = btrfs_map_bio(fs_info, bio, 0);
 	if (ret) {
 		bio->bi_status = ret;
 		bio_endio(bio);
@@ -637,7 +637,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 			sums += DIV_ROUND_UP(comp_bio->bi_iter.bi_size,
 					     fs_info->sectorsize);
 
-			ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+			ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
 			if (ret) {
 				comp_bio->bi_status = ret;
 				bio_endio(comp_bio);
@@ -661,7 +661,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
 	if (ret) {
 		comp_bio->bi_status = ret;
 		bio_endio(comp_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index deb74a8c191a..6b1ecc27913b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -800,7 +800,7 @@ static void run_one_async_done(struct btrfs_work *work)
 	}
 
 	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
-			async->mirror_num, 1);
+			    async->mirror_num);
 	if (ret) {
 		async->bio->bi_status = ret;
 		bio_endio(async->bio);
@@ -901,12 +901,12 @@ static blk_status_t btree_submit_bio_hook(struct inode *inode, struct bio *bio,
 					  BTRFS_WQ_ENDIO_METADATA);
 		if (ret)
 			goto out_w_error;
-		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+		ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	} else if (!async) {
 		ret = btree_csum_one_bio(bio);
 		if (ret)
 			goto out_w_error;
-		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+		ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	} else {
 		/*
 		 * kthread helpers are used to submit writes so that
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a2aabdb85226..6e6df0eab324 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2032,7 +2032,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	}
 
 mapit:
-	ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 
 out:
 	if (ret) {
@@ -7774,7 +7774,7 @@ static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
 	if (ret)
 		return ret;
 
-	ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 
 	return ret;
 }
@@ -8305,7 +8305,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 			goto err;
 	}
 map:
-	ret = btrfs_map_bio(fs_info, bio, 0, 0);
+	ret = btrfs_map_bio(fs_info, bio, 0);
 err:
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..72326cc23985 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6386,52 +6386,8 @@ static void btrfs_end_bio(struct bio *bio)
 	}
 }
 
-/*
- * see run_scheduled_bios for a description of why bios are collected for
- * async submit.
- *
- * This will add one bio to the pending list for a device and make sure
- * the work struct is scheduled.
- */
-static noinline void btrfs_schedule_bio(struct btrfs_device *device,
-					struct bio *bio)
-{
-	struct btrfs_fs_info *fs_info = device->fs_info;
-	int should_queue = 1;
-	struct btrfs_pending_bios *pending_bios;
-
-	/* don't bother with additional async steps for reads, right now */
-	if (bio_op(bio) == REQ_OP_READ) {
-		btrfsic_submit_bio(bio);
-		return;
-	}
-
-	WARN_ON(bio->bi_next);
-	bio->bi_next = NULL;
-
-	spin_lock(&device->io_lock);
-	if (op_is_sync(bio->bi_opf))
-		pending_bios = &device->pending_sync_bios;
-	else
-		pending_bios = &device->pending_bios;
-
-	if (pending_bios->tail)
-		pending_bios->tail->bi_next = bio;
-
-	pending_bios->tail = bio;
-	if (!pending_bios->head)
-		pending_bios->head = bio;
-	if (device->running_pending)
-		should_queue = 0;
-
-	spin_unlock(&device->io_lock);
-
-	if (should_queue)
-		btrfs_queue_work(fs_info->submit_workers, &device->work);
-}
-
 static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
-			      u64 physical, int dev_nr, int async)
+			      u64 physical, int dev_nr)
 {
 	struct btrfs_device *dev = bbio->stripes[dev_nr].dev;
 	struct btrfs_fs_info *fs_info = bbio->fs_info;
@@ -6449,10 +6405,7 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 
 	btrfs_bio_counter_inc_noblocked(fs_info);
 
-	if (async)
-		btrfs_schedule_bio(dev, bio);
-	else
-		btrfsic_submit_bio(bio);
+	btrfsic_submit_bio(bio);
 }
 
 static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
@@ -6473,7 +6426,7 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 }
 
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num, int async_submit)
+			   int mirror_num)
 {
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
@@ -6542,7 +6495,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			bio = first_bio;
 
 		submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
-				  dev_nr, async_submit);
+				  dev_nr);
 	}
 	btrfs_bio_counter_dec(fs_info);
 	return BLK_STS_OK;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 136a3eb64604..e532d095c6a4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -416,7 +416,7 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans, u64 type);
 void btrfs_mapping_init(struct btrfs_mapping_tree *tree);
 void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree);
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num, int async_submit);
+			   int mirror_num);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
-- 
2.17.1


* [PATCH 2/5] Btrfs: delete the entire async bio submission framework
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
  2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
@ 2019-07-10 19:28 ` Tejun Heo
  2019-07-11 14:53   ` Nikolay Borisov
  2019-07-10 19:28 ` [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct Tejun Heo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

From: Chris Mason <clm@fb.com>

Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h   |   1 -
 fs/btrfs/disk-io.c |  13 +--
 fs/btrfs/super.c   |   1 -
 fs/btrfs/volumes.c | 209 ---------------------------------------------
 fs/btrfs/volumes.h |   8 --
 5 files changed, 1 insertion(+), 231 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a61dff27f57..21618b5b18a4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -989,7 +989,6 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_meta_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
 	struct btrfs_workqueue *endio_freespace_worker;
-	struct btrfs_workqueue *submit_workers;
 	struct btrfs_workqueue *caching_workers;
 	struct btrfs_workqueue *readahead_workers;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6b1ecc27913b..323cab06f2a9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2028,7 +2028,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->rmw_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
-	btrfs_destroy_workqueue(fs_info->submit_workers);
 	btrfs_destroy_workqueue(fs_info->delayed_workers);
 	btrfs_destroy_workqueue(fs_info->caching_workers);
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
@@ -2194,16 +2193,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	fs_info->caching_workers =
 		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
 
-	/*
-	 * a higher idle thresh on the submit workers makes it much more
-	 * likely that bios will be send down in a sane order to the
-	 * devices
-	 */
-	fs_info->submit_workers =
-		btrfs_alloc_workqueue(fs_info, "submit", flags,
-				      min_t(u64, fs_devices->num_devices,
-					    max_active), 64);
-
 	fs_info->fixup_workers =
 		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
 
@@ -2246,7 +2235,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 					    max_active), 8);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
-	      fs_info->submit_workers && fs_info->flush_workers &&
+	      fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->endio_meta_write_workers &&
 	      fs_info->endio_repair_workers &&
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0645ec428b4f..b130dc43b5f1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1668,7 +1668,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
 
 	btrfs_workqueue_set_max(fs_info->workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
-	btrfs_workqueue_set_max(fs_info->submit_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->endio_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->endio_meta_workers, new_pool_size);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 72326cc23985..fc3a16d87869 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -509,212 +509,6 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
 	return ret;
 }
 
-static void requeue_list(struct btrfs_pending_bios *pending_bios,
-			struct bio *head, struct bio *tail)
-{
-
-	struct bio *old_head;
-
-	old_head = pending_bios->head;
-	pending_bios->head = head;
-	if (pending_bios->tail)
-		tail->bi_next = old_head;
-	else
-		pending_bios->tail = tail;
-}
-
-/*
- * we try to collect pending bios for a device so we don't get a large
- * number of procs sending bios down to the same device.  This greatly
- * improves the schedulers ability to collect and merge the bios.
- *
- * But, it also turns into a long list of bios to process and that is sure
- * to eventually make the worker thread block.  The solution here is to
- * make some progress and then put this work struct back at the end of
- * the list if the block device is congested.  This way, multiple devices
- * can make progress from a single worker thread.
- */
-static noinline void run_scheduled_bios(struct btrfs_device *device)
-{
-	struct btrfs_fs_info *fs_info = device->fs_info;
-	struct bio *pending;
-	struct backing_dev_info *bdi;
-	struct btrfs_pending_bios *pending_bios;
-	struct bio *tail;
-	struct bio *cur;
-	int again = 0;
-	unsigned long num_run;
-	unsigned long batch_run = 0;
-	unsigned long last_waited = 0;
-	int force_reg = 0;
-	int sync_pending = 0;
-	struct blk_plug plug;
-
-	/*
-	 * this function runs all the bios we've collected for
-	 * a particular device.  We don't want to wander off to
-	 * another device without first sending all of these down.
-	 * So, setup a plug here and finish it off before we return
-	 */
-	blk_start_plug(&plug);
-
-	bdi = device->bdev->bd_bdi;
-
-loop:
-	spin_lock(&device->io_lock);
-
-loop_lock:
-	num_run = 0;
-
-	/* take all the bios off the list at once and process them
-	 * later on (without the lock held).  But, remember the
-	 * tail and other pointers so the bios can be properly reinserted
-	 * into the list if we hit congestion
-	 */
-	if (!force_reg && device->pending_sync_bios.head) {
-		pending_bios = &device->pending_sync_bios;
-		force_reg = 1;
-	} else {
-		pending_bios = &device->pending_bios;
-		force_reg = 0;
-	}
-
-	pending = pending_bios->head;
-	tail = pending_bios->tail;
-	WARN_ON(pending && !tail);
-
-	/*
-	 * if pending was null this time around, no bios need processing
-	 * at all and we can stop.  Otherwise it'll loop back up again
-	 * and do an additional check so no bios are missed.
-	 *
-	 * device->running_pending is used to synchronize with the
-	 * schedule_bio code.
-	 */
-	if (device->pending_sync_bios.head == NULL &&
-	    device->pending_bios.head == NULL) {
-		again = 0;
-		device->running_pending = 0;
-	} else {
-		again = 1;
-		device->running_pending = 1;
-	}
-
-	pending_bios->head = NULL;
-	pending_bios->tail = NULL;
-
-	spin_unlock(&device->io_lock);
-
-	while (pending) {
-
-		rmb();
-		/* we want to work on both lists, but do more bios on the
-		 * sync list than the regular list
-		 */
-		if ((num_run > 32 &&
-		    pending_bios != &device->pending_sync_bios &&
-		    device->pending_sync_bios.head) ||
-		   (num_run > 64 && pending_bios == &device->pending_sync_bios &&
-		    device->pending_bios.head)) {
-			spin_lock(&device->io_lock);
-			requeue_list(pending_bios, pending, tail);
-			goto loop_lock;
-		}
-
-		cur = pending;
-		pending = pending->bi_next;
-		cur->bi_next = NULL;
-
-		BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
-
-		/*
-		 * if we're doing the sync list, record that our
-		 * plug has some sync requests on it
-		 *
-		 * If we're doing the regular list and there are
-		 * sync requests sitting around, unplug before
-		 * we add more
-		 */
-		if (pending_bios == &device->pending_sync_bios) {
-			sync_pending = 1;
-		} else if (sync_pending) {
-			blk_finish_plug(&plug);
-			blk_start_plug(&plug);
-			sync_pending = 0;
-		}
-
-		btrfsic_submit_bio(cur);
-		num_run++;
-		batch_run++;
-
-		cond_resched();
-
-		/*
-		 * we made progress, there is more work to do and the bdi
-		 * is now congested.  Back off and let other work structs
-		 * run instead
-		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 8 &&
-		    fs_info->fs_devices->open_devices > 1) {
-			struct io_context *ioc;
-
-			ioc = current->io_context;
-
-			/*
-			 * the main goal here is that we don't want to
-			 * block if we're going to be able to submit
-			 * more requests without blocking.
-			 *
-			 * This code does two great things, it pokes into
-			 * the elevator code from a filesystem _and_
-			 * it makes assumptions about how batching works.
-			 */
-			if (ioc && ioc->nr_batch_requests > 0 &&
-			    time_before(jiffies, ioc->last_waited + HZ/50UL) &&
-			    (last_waited == 0 ||
-			     ioc->last_waited == last_waited)) {
-				/*
-				 * we want to go through our batch of
-				 * requests and stop.  So, we copy out
-				 * the ioc->last_waited time and test
-				 * against it before looping
-				 */
-				last_waited = ioc->last_waited;
-				cond_resched();
-				continue;
-			}
-			spin_lock(&device->io_lock);
-			requeue_list(pending_bios, pending, tail);
-			device->running_pending = 1;
-
-			spin_unlock(&device->io_lock);
-			btrfs_queue_work(fs_info->submit_workers,
-					 &device->work);
-			goto done;
-		}
-	}
-
-	cond_resched();
-	if (again)
-		goto loop;
-
-	spin_lock(&device->io_lock);
-	if (device->pending_bios.head || device->pending_sync_bios.head)
-		goto loop_lock;
-	spin_unlock(&device->io_lock);
-
-done:
-	blk_finish_plug(&plug);
-}
-
-static void pending_bios_fn(struct btrfs_work *work)
-{
-	struct btrfs_device *device;
-
-	device = container_of(work, struct btrfs_device, work);
-	run_scheduled_bios(device);
-}
-
 static bool device_path_matched(const char *path, struct btrfs_device *device)
 {
 	int found;
@@ -6599,9 +6393,6 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
 	else
 		generate_random_uuid(dev->uuid);
 
-	btrfs_init_work(&dev->work, btrfs_submit_helper,
-			pending_bios_fn, NULL, NULL);
-
 	return dev;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index e532d095c6a4..819047621176 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -18,10 +18,6 @@ extern struct mutex uuid_mutex;
 #define BTRFS_STRIPE_LEN	SZ_64K
 
 struct buffer_head;
-struct btrfs_pending_bios {
-	struct bio *head;
-	struct bio *tail;
-};
 
 /*
  * Use sequence counter to get consistent device stat data on
@@ -55,10 +51,6 @@ struct btrfs_device {
 
 	spinlock_t io_lock ____cacheline_aligned;
 	int running_pending;
-	/* regular prio bios */
-	struct btrfs_pending_bios pending_bios;
-	/* sync bios */
-	struct btrfs_pending_bios pending_sync_bios;
 
 	struct block_device *bdev;
 
-- 
2.17.1


* [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
  2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
  2019-07-10 19:28 ` [PATCH 2/5] Btrfs: delete the entire async bio submission framework Tejun Heo
@ 2019-07-10 19:28 ` Tejun Heo
  2019-07-11 16:00   ` Nikolay Borisov
  2019-07-10 19:28 ` [PATCH 4/5] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

From: Chris Mason <clm@fb.com>

The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing.  When compression is on, we allocate one async_cow
structure for every 512K, and then run those pages through the
compression code for IO submission.

writepages starts all of this off with a single page, locked by
the original call to extent_write_cache_pages(), and it's important to
keep track of this page because it has already been through
clear_page_dirty_for_io().

The btrfs async_cow struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fall back to
uncompressed IO, we use page->index to decide if a given async_cow
struct really owns that page.

But, this is racy.  If a given delalloc range is broken up into two
async_cows (cowA and cowB), we can end up with something like this:

compress_file_range(cowA)
submit_compressed_extents(cowA)
submit compressed bios(cowA)
put_page(locked_page)

				compress_file_range(cowB)
				...

The end result is that cowA is completed and cleaned up before cowB even
starts processing.  This means we can free locked_page and reuse it
elsewhere.  If we get really lucky, it'll have the same page->index in
its new home as it did before.

While we're processing cowB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobuffers() on cowB->locked_page.

Without cgroups in use, this creates a phantom dirty page, which
isn't great but isn't the end of the world.  With cgroups in use, we
might crash in the accounting code because page->mapping->i_wb isn't
set.

[ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
[ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
[ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
[ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
[ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
[ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
[ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
[ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
[ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
[ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
[ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
[ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
[ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8308.594469] Call Trace:
[ 8308.595149]  account_page_cleaned+0x15b/0x1f0
[ 8308.596340]  __cancel_dirty_page+0x146/0x200
[ 8308.599395]  truncate_cleanup_page+0x92/0xb0
[ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
[ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
[ 8308.619108]  evict+0xc1/0x190
[ 8308.620023]  do_unlinkat+0x176/0x280
[ 8308.621202]  do_syscall_64+0x63/0x1a0
[ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

The fix here is to make async_cow->locked_page NULL everywhere but the
one async_cow struct that's allowed to do things to the locked page.
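
The ownership rule can be sketched in user space as follows.  This is
a toy illustration, not the btrfs code: the toy_* names are
hypothetical and the struct carries only the one field that matters
here.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct async_chunk. */
struct toy_page {
	unsigned long index;
};

struct toy_async_chunk {
	struct toy_page *locked_page;
};

/* Hand locked_page to the first chunk only; later chunks get NULL. */
void toy_assign_locked_page(struct toy_async_chunk *chunks, size_t nr,
			    struct toy_page *locked_page)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		chunks[i].locked_page = locked_page;
		locked_page = NULL;	/* ownership was handed off */
	}
}

/* A chunk may redirty or unlock the page only if it holds the pointer. */
int toy_chunk_owns_page(const struct toy_async_chunk *chunk)
{
	return chunk->locked_page != NULL;
}
```

With a single owner, no chunk has to guess from page->index whether a
page that may already have been freed and reused is still its own.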

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5106008f5e28..a31574df06aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (pages[i] == locked_page) {
+			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
 				pages_locked++;
 				continue;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6e6df0eab324..a81e9860ee1f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
 	 * to our extent and set things up for the async work queue to run
 	 * cow_file_range to do the normal delalloc dance.
 	 */
-	if (page_offset(async_chunk->locked_page) >= start &&
-	    page_offset(async_chunk->locked_page) <= end)
+	if (async_chunk->locked_page &&
+	    (page_offset(async_chunk->locked_page) >= start &&
+	     page_offset(async_chunk->locked_page) <= end)) {
 		__set_page_dirty_nobuffers(async_chunk->locked_page);
 		/* unlocked later on in the async handlers */
+	}
 
 	if (redirty)
 		extent_range_redirty_for_io(inode, start, end);
@@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret)
+			else if (ret && async_chunk->locked_page)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_chunk[i].inode = inode;
 		async_chunk[i].start = start;
 		async_chunk[i].end = cur_end;
-		async_chunk[i].locked_page = locked_page;
 		async_chunk[i].write_flags = write_flags;
 		INIT_LIST_HEAD(&async_chunk[i].extents);
 
+		/*
+		 * The locked_page comes all the way from writepage and it's
+		 * the original page we were actually given.  As we spread
+		 * this large delalloc region across multiple async_cow
+		 * structs, only the first struct needs a pointer to locked_page
+		 *
+		 * This way we don't need racy decisions about who is supposed
+		 * to unlock it.
+		 */
+		if (locked_page) {
+			async_chunk[i].locked_page = locked_page;
+			locked_page = NULL;
+		} else {
+			async_chunk[i].locked_page = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,
-- 
2.17.1


* [PATCH 4/5] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
                   ` (2 preceding siblings ...)
  2019-07-10 19:28 ` [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct Tejun Heo
@ 2019-07-10 19:28 ` Tejun Heo
  2019-07-10 19:28 ` [PATCH 5/5] Btrfs: extent_write_locked_range() should attach inode->i_wb Tejun Heo
  2019-07-26 15:13 ` [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support David Sterba
  5 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

From: Chris Mason <clm@fb.com>

Async CRCs and compression submit IO through helper threads, which
means they have IO priority inversions when cgroup IO controllers are
in use.

This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.

For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.

For the async crcs, the bio already has the correct css; we just need to
tell the block layer to use REQ_CGROUP_PUNT.
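
The two cases can be sketched with a user-space toy.  The toy_* names
and flag values below are hypothetical stand-ins, not kernel
definitions, and reference counting on the css is deliberately left
out.

```c
#include <stddef.h>

#define TOY_REQ_OP_WRITE	(1u << 0)	/* stand-in for REQ_OP_WRITE */
#define TOY_REQ_CGROUP_PUNT	(1u << 1)	/* stand-in for REQ_CGROUP_PUNT */

/* Hypothetical stand-ins for the css and the bio. */
struct toy_css {
	int id;
};

struct toy_bio {
	unsigned int opf;
	struct toy_css *css;
};

/* Compression case: the css is passed down and attached here. */
void toy_prepare_compressed_write(struct toy_bio *bio,
				  unsigned int write_flags,
				  struct toy_css *css)
{
	bio->opf = TOY_REQ_OP_WRITE | write_flags;
	if (css) {
		bio->opf |= TOY_REQ_CGROUP_PUNT;
		bio->css = css;
	}
}

/* Async crc case: the bio already carries its css; only tag it. */
void toy_prepare_async_csum_write(struct toy_bio *bio)
{
	bio->opf |= TOY_REQ_CGROUP_PUNT;
}
```

When no css is passed in the compression case, the bio is left
untagged in this sketch and would be issued as before.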

Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/compression.c |  8 +++++++-
 fs/btrfs/compression.h |  3 ++-
 fs/btrfs/disk-io.c     |  6 ++++++
 fs/btrfs/extent_io.c   |  3 +++
 fs/btrfs/inode.c       | 31 ++++++++++++++++++++++++++++---
 5 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index dfc4eb9b7717..5b142d0d0a0b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -288,7 +288,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags,
+				 struct cgroup_subsys_state *blkcg_css)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -322,6 +323,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	bio->bi_opf = REQ_OP_WRITE | write_flags;
 	bio->bi_private = cb;
 	bio->bi_end_io = end_compressed_bio_write;
+
+	if (blkcg_css) {
+		bio->bi_opf |= REQ_CGROUP_PUNT;
+		bio_associate_blkg_from_css(bio, blkcg_css);
+	}
 	refcount_set(&cb->pending_bios, 1);
 
 	/* create and submit bios for the compressed pages */
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 9976fe0f7526..7cbefab96ecf 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -93,7 +93,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags,
+				  struct cgroup_subsys_state *blkcg_css);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 323cab06f2a9..cc0aa77b8128 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -799,6 +799,12 @@ static void run_one_async_done(struct btrfs_work *work)
 		return;
 	}
 
+	/*
+	 * All of the bios that pass through here are from async helpers.
+	 * Use REQ_CGROUP_PUNT to issue them from the owning cgroup's
+	 * context.  This changes nothing when cgroups aren't in use.
+	 */
+	async->bio->bi_opf |= REQ_CGROUP_PUNT;
 	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
 			    async->mirror_num);
 	if (ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a31574df06aa..3f3942618e92 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4173,6 +4173,9 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,
+		/* we're called from an async helper function */
+		.punt_to_cgroup	= 1,
+		.no_cgroup_owner = 1,
 	};
 
 	while (start <= end) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a81e9860ee1f..f5515aea6012 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -357,6 +357,7 @@ struct async_extent {
 };
 
 struct async_chunk {
+	struct cgroup_subsys_state *blkcg_css;
 	struct inode *inode;
 	struct page *locked_page;
 	u64 start;
@@ -846,7 +847,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags,
+				    async_chunk->blkcg_css)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -1170,6 +1172,8 @@ static noinline void async_cow_free(struct btrfs_work *work)
 	async_chunk = container_of(work, struct async_chunk, work);
 	if (async_chunk->inode)
 		btrfs_add_delayed_iput(async_chunk->inode);
+	if (async_chunk->blkcg_css)
+		css_put(async_chunk->blkcg_css);
 	/*
 	 * Since the pointer to 'pending' is at the beginning of the array of
 	 * async_chunk's, freeing it ensures the whole array has been freed.
@@ -1178,12 +1182,15 @@ static noinline void async_cow_free(struct btrfs_work *work)
 		kvfree(async_chunk->pending);
 }
 
-static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+static int cow_file_range_async(struct inode *inode,
+				struct writeback_control *wbc,
+				struct page *locked_page,
 				u64 start, u64 end, int *page_started,
 				unsigned long *nr_written,
 				unsigned int write_flags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct cgroup_subsys_state *blkcg_css = wbc_blkcg_css(wbc);
 	struct async_cow *ctx;
 	struct async_chunk *async_chunk;
 	unsigned long nr_pages;
@@ -1251,12 +1258,30 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		 * to unlock it.
 		 */
 		if (locked_page) {
+			/*
+			 * Depending on the compressibility, the pages
+			 * might or might not go through async.  We want
+			 * all of them to be accounted against @wbc once.
+			 * Let's do it here before the paths diverge.  wbc
+			 * accounting is used only for foreign writeback
+			 * detection and doesn't need full accuracy.  Just
+			 * account the whole thing against the first page.
+			 */
+			wbc_account_cgroup_owner(wbc, locked_page,
+						 cur_end - start);
 			async_chunk[i].locked_page = locked_page;
 			locked_page = NULL;
 		} else {
 			async_chunk[i].locked_page = NULL;
 		}
 
+		if (blkcg_css != blkcg_root_css) {
+			css_get(blkcg_css);
+			async_chunk[i].blkcg_css = blkcg_css;
+		} else {
+			async_chunk[i].blkcg_css = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,
@@ -1653,7 +1678,7 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
-		ret = cow_file_range_async(inode, locked_page, start, end,
+		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
 					   page_started, nr_written,
 					   write_flags);
 	}
-- 
2.17.1
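
The css reference handling in the patch above pairs one css_get() per
async_chunk (skipped for the root css) with one css_put() in
async_cow_free().  A minimal user-space sketch of that pairing follows.
Every name in it (model_css, model_chunk, model_chunk_init, and so on) is
a hypothetical stand-in invented for illustration, not kernel API; the
refcount is a plain int rather than a percpu refcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup_subsys_state: just a refcount. */
struct model_css {
	int refcnt;
};

/* Hypothetical stand-in for struct async_chunk. */
struct model_chunk {
	struct model_css *blkcg_css;
};

void model_css_get(struct model_css *css) { css->refcnt++; }
void model_css_put(struct model_css *css) { css->refcnt--; }

/*
 * Setup side, modeled on the loop in cow_file_range_async(): a chunk
 * pins the css only when it is not the root css, and records NULL
 * otherwise so that the submit path can skip the punt.
 */
void model_chunk_init(struct model_chunk *chunk, struct model_css *css,
		      struct model_css *root)
{
	if (css != root) {
		model_css_get(css);
		chunk->blkcg_css = css;
	} else {
		chunk->blkcg_css = NULL;
	}
}

/* Teardown side, modeled on async_cow_free(): drop the pin if one exists. */
void model_chunk_free(struct model_chunk *chunk)
{
	if (chunk->blkcg_css)
		model_css_put(chunk->blkcg_css);
	chunk->blkcg_css = NULL;
}
```

Initializing N chunks against a non-root css should raise its count by N,
and freeing them should return it to the starting value; chunks set up
against the root css never touch the count.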


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] Btrfs: extent_write_locked_range() should attach inode->i_wb
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
                   ` (3 preceding siblings ...)
  2019-07-10 19:28 ` [PATCH 4/5] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios Tejun Heo
@ 2019-07-10 19:28 ` Tejun Heo
  2019-07-26 15:13 ` [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support David Sterba
  5 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2019-07-10 19:28 UTC (permalink / raw)
  To: josef, clm, dsterba; +Cc: axboe, jack, linux-kernel, linux-btrfs, kernel-team

From: Chris Mason <clm@fb.com>

extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression.  It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3f3942618e92..5606a38b64ff 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4178,6 +4178,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		.no_cgroup_owner = 1,
 	};
 
+	wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
 	while (start <= end) {
 		page = find_get_page(mapping, start >> PAGE_SHIFT);
 		if (clear_page_dirty_for_io(page))
@@ -4192,11 +4193,12 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 	}
 
 	ASSERT(ret <= 0);
-	if (ret < 0) {
+	if (ret == 0)
+		ret = flush_write_bio(&epd);
+	else
 		end_write_bio(&epd, ret);
-		return ret;
-	}
-	ret = flush_write_bio(&epd);
+
+	wbc_detach_inode(&wbc_writepages);
 	return ret;
 }
 
-- 
2.17.1
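
The reshuffled error handling above exists so that the attach and detach
calls bracket every exit path.  A rough user-space sketch of that control
flow, assuming nothing beyond what the diff shows: model_wbc and the
model_* functions are hypothetical names, and the counters only record
which branch ran.

```c
#include <assert.h>

/* Hypothetical stand-in for the on-stack writeback_control. */
struct model_wbc {
	int attached;	/* attach calls minus detach calls */
	int flushed;	/* times the flush branch ran */
	int ended;	/* times the error branch ran */
};

void model_attach(struct model_wbc *w) { w->attached++; }
void model_detach(struct model_wbc *w) { w->attached--; }

/*
 * write_ret models the result of the page loop (<= 0, as the ASSERT in
 * the patch requires); flush_ret models what the flush would return.
 * Both branches fall through to the single detach, so the early return
 * removed by the patch cannot leak the attachment.
 */
int model_write_locked_range(struct model_wbc *w, int write_ret,
			     int flush_ret)
{
	int ret = write_ret;

	model_attach(w);
	if (ret == 0) {
		w->flushed++;
		ret = flush_ret;
	} else {
		w->ended++;
	}
	model_detach(w);
	return ret;
}
```

Whatever the loop result, the attached counter is back at zero when the
function returns, which is the property the single exit is meant to give.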


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio()
  2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
@ 2019-07-11 11:32   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2019-07-11 11:32 UTC (permalink / raw)
  To: Tejun Heo, clm, David Sterba, josef
  Cc: kernel-team, axboe, jack, linux-btrfs, linux-kernel



On 10.07.19 at 22:28, Tejun Heo wrote:
> From: Chris Mason <clm@fb.com>
> 
> btrfs_schedule_bio() hands IO off to a helper thread to do the actual
> submit_bio() call.  This has been used to make sure async crc and
> compression helpers don't get stuck on IO submission.  To maintain good
> performance, over time the IO submission threads duplicated some IO
> scheduler characteristics such as high and low priority IOs and they
> also made some ugly assumptions about request allocation batch sizes.
> 
> All of this cost at least one extra context switch during IO submission,
> and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
> using btrfs_schedule_bio().  We may need to adjust the number of async
> helper threads for crcs and compression, but long term it's a better
> path.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

<snip>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] Btrfs: delete the entire async bio submission framework
  2019-07-10 19:28 ` [PATCH 2/5] Btrfs: delete the entire async bio submission framework Tejun Heo
@ 2019-07-11 14:53   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2019-07-11 14:53 UTC (permalink / raw)
  To: Tejun Heo, clm, David Sterba, josef
  Cc: kernel-team, axboe, jack, linux-btrfs, linux-kernel



On 10.07.19 at 22:28, Tejun Heo wrote:
> From: Chris Mason <clm@fb.com>
> 
> Now that we're not using btrfs_schedule_bio() anymore, delete all the
> code that supported it.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/ctree.h   |   1 -
>  fs/btrfs/disk-io.c |  13 +--
>  fs/btrfs/super.c   |   1 -
>  fs/btrfs/volumes.c | 209 ---------------------------------------------
>  fs/btrfs/volumes.h |   8 --
>  5 files changed, 1 insertion(+), 231 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0a61dff27f57..21618b5b18a4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -989,7 +989,6 @@ struct btrfs_fs_info {
>  	struct btrfs_workqueue *endio_meta_write_workers;
>  	struct btrfs_workqueue *endio_write_workers;
>  	struct btrfs_workqueue *endio_freespace_worker;
> -	struct btrfs_workqueue *submit_workers;
>  	struct btrfs_workqueue *caching_workers;
>  	struct btrfs_workqueue *readahead_workers;
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6b1ecc27913b..323cab06f2a9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2028,7 +2028,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>  	btrfs_destroy_workqueue(fs_info->rmw_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_write_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
> -	btrfs_destroy_workqueue(fs_info->submit_workers);
>  	btrfs_destroy_workqueue(fs_info->delayed_workers);
>  	btrfs_destroy_workqueue(fs_info->caching_workers);
>  	btrfs_destroy_workqueue(fs_info->readahead_workers);
> @@ -2194,16 +2193,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
>  	fs_info->caching_workers =
>  		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
>  
> -	/*
> -	 * a higher idle thresh on the submit workers makes it much more
> -	 * likely that bios will be send down in a sane order to the
> -	 * devices
> -	 */
> -	fs_info->submit_workers =
> -		btrfs_alloc_workqueue(fs_info, "submit", flags,
> -				      min_t(u64, fs_devices->num_devices,
> -					    max_active), 64);
> -
>  	fs_info->fixup_workers =
>  		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
>  
> @@ -2246,7 +2235,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
>  					    max_active), 8);
>  
>  	if (!(fs_info->workers && fs_info->delalloc_workers &&
> -	      fs_info->submit_workers && fs_info->flush_workers &&
> +	      fs_info->flush_workers &&
>  	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>  	      fs_info->endio_meta_write_workers &&
>  	      fs_info->endio_repair_workers &&
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 0645ec428b4f..b130dc43b5f1 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1668,7 +1668,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
>  
>  	btrfs_workqueue_set_max(fs_info->workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
> -	btrfs_workqueue_set_max(fs_info->submit_workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->endio_workers, new_pool_size);
>  	btrfs_workqueue_set_max(fs_info->endio_meta_workers, new_pool_size);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 72326cc23985..fc3a16d87869 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -509,212 +509,6 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
>  	return ret;
>  }
>  
> -static void requeue_list(struct btrfs_pending_bios *pending_bios,
> -			struct bio *head, struct bio *tail)
> -{
> -
> -	struct bio *old_head;
> -
> -	old_head = pending_bios->head;
> -	pending_bios->head = head;
> -	if (pending_bios->tail)
> -		tail->bi_next = old_head;
> -	else
> -		pending_bios->tail = tail;
> -}
> -
> -/*
> - * we try to collect pending bios for a device so we don't get a large
> - * number of procs sending bios down to the same device.  This greatly
> - * improves the schedulers ability to collect and merge the bios.
> - *
> - * But, it also turns into a long list of bios to process and that is sure
> - * to eventually make the worker thread block.  The solution here is to
> - * make some progress and then put this work struct back at the end of
> - * the list if the block device is congested.  This way, multiple devices
> - * can make progress from a single worker thread.
> - */
> -static noinline void run_scheduled_bios(struct btrfs_device *device)
> -{
> -	struct btrfs_fs_info *fs_info = device->fs_info;
> -	struct bio *pending;
> -	struct backing_dev_info *bdi;
> -	struct btrfs_pending_bios *pending_bios;
> -	struct bio *tail;
> -	struct bio *cur;
> -	int again = 0;
> -	unsigned long num_run;
> -	unsigned long batch_run = 0;
> -	unsigned long last_waited = 0;
> -	int force_reg = 0;
> -	int sync_pending = 0;
> -	struct blk_plug plug;
> -
> -	/*
> -	 * this function runs all the bios we've collected for
> -	 * a particular device.  We don't want to wander off to
> -	 * another device without first sending all of these down.
> -	 * So, setup a plug here and finish it off before we return
> -	 */
> -	blk_start_plug(&plug);
> -
> -	bdi = device->bdev->bd_bdi;
> -
> -loop:
> -	spin_lock(&device->io_lock);
> -
> -loop_lock:
> -	num_run = 0;
> -
> -	/* take all the bios off the list at once and process them
> -	 * later on (without the lock held).  But, remember the
> -	 * tail and other pointers so the bios can be properly reinserted
> -	 * into the list if we hit congestion
> -	 */
> -	if (!force_reg && device->pending_sync_bios.head) {
> -		pending_bios = &device->pending_sync_bios;
> -		force_reg = 1;
> -	} else {
> -		pending_bios = &device->pending_bios;
> -		force_reg = 0;
> -	}
> -
> -	pending = pending_bios->head;
> -	tail = pending_bios->tail;
> -	WARN_ON(pending && !tail);
> -
> -	/*
> -	 * if pending was null this time around, no bios need processing
> -	 * at all and we can stop.  Otherwise it'll loop back up again
> -	 * and do an additional check so no bios are missed.
> -	 *
> -	 * device->running_pending is used to synchronize with the
> -	 * schedule_bio code.
> -	 */
> -	if (device->pending_sync_bios.head == NULL &&
> -	    device->pending_bios.head == NULL) {
> -		again = 0;
> -		device->running_pending = 0;
> -	} else {
> -		again = 1;
> -		device->running_pending = 1;
> -	}
> -
> -	pending_bios->head = NULL;
> -	pending_bios->tail = NULL;
> -
> -	spin_unlock(&device->io_lock);
> -
> -	while (pending) {
> -
> -		rmb();
> -		/* we want to work on both lists, but do more bios on the
> -		 * sync list than the regular list
> -		 */
> -		if ((num_run > 32 &&
> -		    pending_bios != &device->pending_sync_bios &&
> -		    device->pending_sync_bios.head) ||
> -		   (num_run > 64 && pending_bios == &device->pending_sync_bios &&
> -		    device->pending_bios.head)) {
> -			spin_lock(&device->io_lock);
> -			requeue_list(pending_bios, pending, tail);
> -			goto loop_lock;
> -		}
> -
> -		cur = pending;
> -		pending = pending->bi_next;
> -		cur->bi_next = NULL;
> -
> -		BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
> -
> -		/*
> -		 * if we're doing the sync list, record that our
> -		 * plug has some sync requests on it
> -		 *
> -		 * If we're doing the regular list and there are
> -		 * sync requests sitting around, unplug before
> -		 * we add more
> -		 */
> -		if (pending_bios == &device->pending_sync_bios) {
> -			sync_pending = 1;
> -		} else if (sync_pending) {
> -			blk_finish_plug(&plug);
> -			blk_start_plug(&plug);
> -			sync_pending = 0;
> -		}
> -
> -		btrfsic_submit_bio(cur);
> -		num_run++;
> -		batch_run++;
> -
> -		cond_resched();
> -
> -		/*
> -		 * we made progress, there is more work to do and the bdi
> -		 * is now congested.  Back off and let other work structs
> -		 * run instead
> -		 */
> -		if (pending && bdi_write_congested(bdi) && batch_run > 8 &&
> -		    fs_info->fs_devices->open_devices > 1) {
> -			struct io_context *ioc;
> -
> -			ioc = current->io_context;
> -
> -			/*
> -			 * the main goal here is that we don't want to
> -			 * block if we're going to be able to submit
> -			 * more requests without blocking.
> -			 *
> -			 * This code does two great things, it pokes into
> -			 * the elevator code from a filesystem _and_
> -			 * it makes assumptions about how batching works.
> -			 */
> -			if (ioc && ioc->nr_batch_requests > 0 &&
> -			    time_before(jiffies, ioc->last_waited + HZ/50UL) &&
> -			    (last_waited == 0 ||
> -			     ioc->last_waited == last_waited)) {
> -				/*
> -				 * we want to go through our batch of
> -				 * requests and stop.  So, we copy out
> -				 * the ioc->last_waited time and test
> -				 * against it before looping
> -				 */
> -				last_waited = ioc->last_waited;
> -				cond_resched();
> -				continue;
> -			}
> -			spin_lock(&device->io_lock);
> -			requeue_list(pending_bios, pending, tail);
> -			device->running_pending = 1;
> -
> -			spin_unlock(&device->io_lock);
> -			btrfs_queue_work(fs_info->submit_workers,
> -					 &device->work);
> -			goto done;
> -		}
> -	}
> -
> -	cond_resched();
> -	if (again)
> -		goto loop;
> -
> -	spin_lock(&device->io_lock);
> -	if (device->pending_bios.head || device->pending_sync_bios.head)
> -		goto loop_lock;
> -	spin_unlock(&device->io_lock);
> -
> -done:
> -	blk_finish_plug(&plug);
> -}
> -
> -static void pending_bios_fn(struct btrfs_work *work)
> -{
> -	struct btrfs_device *device;
> -
> -	device = container_of(work, struct btrfs_device, work);
> -	run_scheduled_bios(device);
> -}
> -
>  static bool device_path_matched(const char *path, struct btrfs_device *device)
>  {
>  	int found;
> @@ -6599,9 +6393,6 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
>  	else
>  		generate_random_uuid(dev->uuid);
>  
> -	btrfs_init_work(&dev->work, btrfs_submit_helper,
> -			pending_bios_fn, NULL, NULL);
> -
>  	return dev;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index e532d095c6a4..819047621176 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -18,10 +18,6 @@ extern struct mutex uuid_mutex;
>  #define BTRFS_STRIPE_LEN	SZ_64K
>  
>  struct buffer_head;
> -struct btrfs_pending_bios {
> -	struct bio *head;
> -	struct bio *tail;
> -};
>  
>  /*
>   * Use sequence counter to get consistent device stat data on
> @@ -55,10 +51,6 @@ struct btrfs_device {
>  
>  	spinlock_t io_lock ____cacheline_aligned;
>  	int running_pending;
> -	/* regular prio bios */
> -	struct btrfs_pending_bios pending_bios;
> -	/* sync bios */
> -	struct btrfs_pending_bios pending_sync_bios;
>  
>  	struct block_device *bdev;
>  
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
  2019-07-10 19:28 ` [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct Tejun Heo
@ 2019-07-11 16:00   ` Nikolay Borisov
  2019-07-11 19:52     ` Chris Mason
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2019-07-11 16:00 UTC (permalink / raw)
  To: Tejun Heo, clm, David Sterba, josef
  Cc: kernel-team, axboe, jack, linux-btrfs, linux-kernel



On 10.07.19 at 22:28, Tejun Heo wrote:
> From: Chris Mason <clm@fb.com>
> 
> The btrfs writepages function collects a large range of pages flagged
> for delayed allocation, and then sends them down through the COW code
> for processing.  When compression is on, we allocate one async_cow

nit: The code no longer uses async_cow to represent in-flight chunks but
the more aptly named async_chunk. Presumably this patchset predates
those changes.

> structure for every 512K, and then run those pages through the
> compression code for IO submission.
> 
> writepages starts all of this off with a single page, locked by
> the original call to extent_write_cache_pages(), and it's important to
> keep track of this page because it has already been through
> clear_page_dirty_for_io().

IMO it would help to state the implications of clear_page_dirty_for_io()
having been called, i.e. what special handling this particular page needs
for the rest of its lifetime.

> 
> The btrfs async_cow struct has a pointer to the locked_page, and when
> we're redirtying the page because compression had to fallback to
> uncompressed IO, we use page->index to decide if a given async_cow
> struct really owns that page.
> 
> But, this is racy.  If a given delalloc range is broken up into two
> async_cows (cow_A and cow_B), we can end up with something like this:
> 
> compress_file_range(cowA)
> submit_compressed_extents(cowA)
> submit compressed bios(cowA)
> put_page(locked_page)
> 
> 				compress_file_range(cowB)
> 				...

This call trace is _really_ hand-wavy; the actual one is more complex and
should look something like:

async_cow_submit
 submit_compressed_extents <--- falls back to buffered writeout
  cow_file_range
   extent_clear_unlock_delalloc
    __process_pages_contig
      put_page(locked_page)

                                           async_cow_submit

> 
> The end result is that cowA is completed and cleaned up before cowB even
> starts processing.  This means we can free locked_page and reuse it
> elsewhere.  If we get really lucky, it'll have the same page->index in
> its new home as it did before.
> 
> While we're processing cowB, we might decide we need to fall back to
> uncompressed IO, and so compress_file_range() will call
> __set_page_dirty_nobuffers() on cowB->locked_page.
> 
> Without cgroups in use, this creates a phantom dirty page, which
> isn't great but isn't the end of the world.  With cgroups in use, we

Without cgroups, a phantom dirty page is not great but not terrible.
Apart from that, does it have any other implications?


> might crash in the accounting code because page->mapping->i_wb isn't
> set.
> 
> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8308.594469] Call Trace:
> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
> [ 8308.619108]  evict+0xc1/0x190
> [ 8308.620023]  do_unlinkat+0x176/0x280
> [ 8308.621202]  do_syscall_64+0x63/0x1a0
> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> The fix here is to make async_cow->locked_page NULL everywhere but the
> one async_cow struct that's allowed to do things to the locked page.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent_io.c |  2 +-
>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5106008f5e28..a31574df06aa 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>  			if (page_ops & PAGE_SET_PRIVATE2)
>  				SetPagePrivate2(pages[i]);
>  
> -			if (pages[i] == locked_page) {
> +			if (locked_page && pages[i] == locked_page) {

Why not reduce the check to just if (locked_page) and then clean it up?
If __process_pages_contig is called from the owner of the page, it's
guaranteed that the page will fall within its range.

>  				put_page(pages[i]);
>  				pages_locked++;
>  				continue;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6e6df0eab324..a81e9860ee1f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
>  	 * to our extent and set things up for the async work queue to run
>  	 * cow_file_range to do the normal delalloc dance.
>  	 */
> -	if (page_offset(async_chunk->locked_page) >= start &&
> -	    page_offset(async_chunk->locked_page) <= end)
> +	if (async_chunk->locked_page &&
> +	    (page_offset(async_chunk->locked_page) >= start &&
> +	     page_offset(async_chunk->locked_page) <= end)) {

Ditto: since locked_page is now only set in the chunk that has the right
to it, there is no need to check the offsets, and dropping the check
would simplify the code.

>  		__set_page_dirty_nobuffers(async_chunk->locked_page);
>  		/* unlocked later on in the async handlers */
> +	}
>  
>  	if (redirty)
>  		extent_range_redirty_for_io(inode, start, end);
> @@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  						  async_extent->start +
>  						  async_extent->ram_size - 1,
>  						  WB_SYNC_ALL);
> -			else if (ret)
> +			else if (ret && async_chunk->locked_page)
>  				unlock_page(async_chunk->locked_page);
>  			kfree(async_extent);
>  			cond_resched();
> @@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
>  		async_chunk[i].inode = inode;
>  		async_chunk[i].start = start;
>  		async_chunk[i].end = cur_end;
> -		async_chunk[i].locked_page = locked_page;
>  		async_chunk[i].write_flags = write_flags;
>  		INIT_LIST_HEAD(&async_chunk[i].extents);
>  
> +		/*
> +		 * The locked_page comes all the way from writepage and its
> +		 * the original page we were actually given.  As we spread
> +		 * this large delalloc region across multiple async_cow
> +		 * structs, only the first struct needs a pointer to locked_page
> +		 *
> +		 * This way we don't need racey decisions about who is supposed
> +		 * to unlock it.
> +		 */
> +		if (locked_page) {
> +			async_chunk[i].locked_page = locked_page;
> +			locked_page = NULL;
> +		} else {
> +			async_chunk[i].locked_page = NULL;
> +		}
> +
>  		btrfs_init_work(&async_chunk[i].work,
>  				btrfs_delalloc_helper,
>  				async_cow_start, async_cow_submit,
> 
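
The ownership rule this patch introduces, and the reviewer's point that a
NULL check alone could then identify the owner, can be sketched in user
space as follows.  This is only an illustration under stated assumptions:
sketch_page, sketch_chunk, sketch_assign_locked_page and
sketch_owns_locked_page are hypothetical names, and no real page locking
or offset handling is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct async_chunk. */
struct sketch_page {
	int id;
};

struct sketch_chunk {
	struct sketch_page *locked_page;
};

/*
 * Mirrors the loop added to cow_file_range_async(): the first chunk
 * takes the pointer and the local copy is cleared, so every later
 * chunk stores NULL.
 */
void sketch_assign_locked_page(struct sketch_chunk *chunks, size_t n,
			       struct sketch_page *locked_page)
{
	for (size_t i = 0; i < n; i++) {
		if (locked_page) {
			chunks[i].locked_page = locked_page;
			locked_page = NULL;
		} else {
			chunks[i].locked_page = NULL;
		}
	}
}

/*
 * With a single owner, "is this my page?" reduces to a NULL check.
 * This is the simplification suggested in the review, not what the
 * patch itself does (it keeps the offset comparison).
 */
int sketch_owns_locked_page(const struct sketch_chunk *chunk)
{
	return chunk->locked_page != NULL;
}
```

Only chunk 0 ever reports ownership, so a later chunk falling back to
uncompressed IO has no stale pointer to redirty.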

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
  2019-07-11 16:00   ` Nikolay Borisov
@ 2019-07-11 19:52     ` Chris Mason
  2019-07-26 15:29       ` Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Mason @ 2019-07-11 19:52 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Tejun Heo, David Sterba, josef, Kernel Team, axboe, jack,
	linux-btrfs, linux-kernel

On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:

> On 10.07.19 at 22:28, Tejun Heo wrote:
>> From: Chris Mason <clm@fb.com>
>>
>> The btrfs writepages function collects a large range of pages flagged
>> for delayed allocation, and then sends them down through the COW code
>> for processing.  When compression is on, we allocate one async_cow
>
> nit: The code no longer uses async_cow to represent in-flight chunks but
> the more aptly named async_chunk. Presumably this patchset predates
> those changes.

Not by much, but yes.

>
>>
>> The end result is that cowA is completed and cleaned up before cowB even
>> starts processing.  This means we can free locked_page and reuse it
>> elsewhere.  If we get really lucky, it'll have the same page->index in
>> its new home as it did before.
>>
>> While we're processing cowB, we might decide we need to fall back to
>> uncompressed IO, and so compress_file_range() will call
>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>
>> Without cgroups in use, this creates a phantom dirty page, which
>> isn't great but isn't the end of the world.  With cgroups in use, we
>
> Without cgroups, a phantom dirty page is not great but not terrible.
> Apart from that, does it have any other implications?

Best case, it'll go through the writepage fixup worker and go through 
the whole cow machinery again.  Worst case we go to this code more than 
once:

                         /*
                          * if page_started, cow_file_range inserted an
                          * inline extent and took care of all the unlocking
                          * and IO for us.  Otherwise, we need to submit
                          * all those pages down to the drive.
                          */
                         if (!page_started && !ret)
                                 extent_write_locked_range(inode,
                                                   async_extent->start,
                                                   async_extent->start +
                                                   async_extent->ram_size - 1,
                                                   WB_SYNC_ALL);
                         else if (ret)
                                 unlock_page(async_chunk->locked_page);


That never happened in production as far as I can tell, but it seems 
possible.

>
>
>> might crash in the accounting code because page->mapping->i_wb isn't
>> set.
>>
>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8308.594469] Call Trace:
>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>> [ 8308.619108]  evict+0xc1/0x190
>> [ 8308.620023]  do_unlinkat+0x176/0x280
>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> The fix here is to make async_cow->locked_page NULL everywhere but the
>> one async_cow struct that's allowed to do things to the locked page.
>>
>> Signed-off-by: Chris Mason <clm@fb.com>
>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/extent_io.c |  2 +-
>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5106008f5e28..a31574df06aa 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>  				SetPagePrivate2(pages[i]);
>>
>> -			if (pages[i] == locked_page) {
>> +			if (locked_page && pages[i] == locked_page) {
>
> Why not make the check just if (locked_page) then clean it up, since if
> __process_pages_contig is called from the owner of the page then it's
> guaranteed that the page will fall within its range.

I'm not convinced that every single caller of __process_pages_contig is 
making sure to only send locked_page for ranges that correspond to the 
locked_page.  I'm not sure exactly what you're asking for though, it 
looks like it would require some larger changes to the flow of that 
loop.

>
>>  				put_page(pages[i]);
>>  				pages_locked++;
>>  				continue;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 6e6df0eab324..a81e9860ee1f 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
>>  	 * to our extent and set things up for the async work queue to run
>>  	 * cow_file_range to do the normal delalloc dance.
>>  	 */
>> -	if (page_offset(async_chunk->locked_page) >= start &&
>> -	    page_offset(async_chunk->locked_page) <= end)
>> +	if (async_chunk->locked_page &&
>> +	    (page_offset(async_chunk->locked_page) >= start &&
>> +	     page_offset(async_chunk->locked_page)) <= end) {
>
> DITTO since locked_page is now only set to the chunk that has the right
> to it then there is no need to check the offsets and this will simplify
> the code.
>

start is adjusted higher up in the loop:

                         if (start + total_in < end) {
                                 start += total_in;
                                 pages = NULL;
                                 cond_resched();
                                 goto again;
                         }

So we might get to the __set_page_dirty_nobuffers() test with a range 
that no longer corresponds to the locked page.
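
To make that concrete, here is a small userspace sketch in C (not kernel
code). struct fake_page, locked_page_in_range() and advance_start() are
made-up names for illustration, and the 4K page size and offsets are
arbitrary. It models the range test with each comparison grouped on its
own, and shows a locked page at offset 0 dropping out of [start, end]
once start has been advanced. As far as I can read the quoted hunk, it
closes a parenthesis before "<= end", so the 0-or-1 result of the && is
what gets compared against end; the sketch uses the grouping from the
pre-patch test instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: only the file offset matters. */
struct fake_page {
	uint64_t offset;
};

/*
 * True only when a locked page exists and its offset lies in
 * [start, end].  A NULL locked_page never matches.
 */
static bool locked_page_in_range(const struct fake_page *locked_page,
				 uint64_t start, uint64_t end)
{
	return locked_page &&
	       locked_page->offset >= start &&
	       locked_page->offset <= end;
}

/*
 * Model of the "goto again" step: move start forward by the bytes
 * consumed in this pass if part of the range is still left.
 */
static uint64_t advance_start(uint64_t start, uint64_t end,
			      uint64_t total_in)
{
	if (start + total_in < end)
		return start + total_in;
	return start;
}

/* Walks one locked page through a first pass and a second pass. */
static void demo_range_drift(void)
{
	struct fake_page locked = { .offset = 0 };
	uint64_t start = 0, end = 3 * 4096 - 1;

	/* First pass: the locked page is inside the range. */
	assert(locked_page_in_range(&locked, start, end));

	/* Second pass: start moved up, the page is now behind it. */
	start = advance_start(start, end, 4096);
	assert(start == 4096);
	assert(!locked_page_in_range(&locked, start, end));

	/* A chunk without a locked page never touches one. */
	assert(!locked_page_in_range(NULL, start, end));
}
```

Calling demo_range_drift() from a test main() exercises all three
cases; the offset test is what keeps the second pass away from a page
it no longer covers.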

-chris


* Re: [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support
  2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
                   ` (4 preceding siblings ...)
  2019-07-10 19:28 ` [PATCH 5/5] Btrfs: extent_write_locked_range() should attach inode->i_wb Tejun Heo
@ 2019-07-26 15:13 ` David Sterba
  2019-09-05 11:59   ` David Sterba
  5 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2019-07-26 15:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: josef, clm, dsterba, axboe, jack, linux-kernel, linux-btrfs, kernel-team

On Wed, Jul 10, 2019 at 12:28:13PM -0700, Tejun Heo wrote:
> Hello,
> 
> This patchset contains only the btrfs part of the following patchset.
> 
>   [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> 
> The block part has already been applied to
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ for-linus
> 
> with some naming changes.  This patchset has been updated accordingly.

I'm going to add this patchset to for-next to get some testing coverage.
There are some comments pending, but those are changelog updates and
refactoring.


* Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
  2019-07-11 19:52     ` Chris Mason
@ 2019-07-26 15:29       ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2019-07-26 15:29 UTC (permalink / raw)
  To: Chris Mason
  Cc: Kernel Team, axboe, Tejun Heo, David Sterba, jack, josef,
	linux-btrfs, linux-kernel



On 11.07.19 at 22:52, Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
> 
>> On 10.07.19 at 22:28, Tejun Heo wrote:
>>> From: Chris Mason <clm@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing.  When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
> 
> Not by much, but yes.
> 
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB even
>>> starts processing.  This means we can free locked_page and reuse it
>>> elsewhere.  If we get really lucky, it'll have the same page->index in
>>> its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates a phantom dirty page, which
>>> isn't great but isn't the end of the world.  With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups but apart from that, does it have any other implications?
> 
> Best case, it'll go through the writepage fixup worker and go through 
> the whole cow machinery again.  Worst case we go to this code more than 
> once:
> 
>                          /*
>                           * if page_started, cow_file_range inserted an
>                           * inline extent and took care of all the unlocking
>                           * and IO for us.  Otherwise, we need to submit
>                           * all those pages down to the drive.
>                           */
>                          if (!page_started && !ret)
>                                  extent_write_locked_range(inode,
>                                                    async_extent->start,
>                                                    async_extent->start +
>                                                    async_extent->ram_size - 1,
>                                                    WB_SYNC_ALL);
>                          else if (ret)
>                                  unlock_page(async_chunk->locked_page);
> 
> 
> That never happened in production as far as I can tell, but it seems 
> possible.
> 
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make async_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>>  				SetPagePrivate2(pages[i]);
>>>
>>> -			if (pages[i] == locked_page) {
>>> +			if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) then clean it up, since if
>> __process_pages_contig is called from the owner of the page then it's
>> guaranteed that the page will fall within its range.
> 
> I'm not convinced that every single caller of __process_pages_contig is 
> making sure to only send locked_page for ranges that correspond to the 
> locked_page.  I'm not sure exactly what you're asking for though, it 
> looks like it would require some larger changes to the flow of that 
> loop.


What I meant is to simply factor out the code dealing with the locked
page outside of the loop and still place it inside
__process_pages_contig. Also, looking at the way locked_page is passed
across different call chains, I arrive at:


compress_file_range  <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

submit_compressed_extents <---- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
 cow_file_range <--- [when called from btrfs_run_delalloc_range we are
all fine and dandy because it will always iterate a range which belongs
to the page. So we can free the page and set it null for subsequent
passes of the loop.]

Looking at run_delalloc_nocow I see the page is used 5
times - 2 of those, at the beginning and end of the function, are only
used during error cases. The other 2 times are when cow_start is different
from -1, which happens if !nocow is true. I've yet to wrap my head
around run_delalloc_nocow but I think it should also be safe to pass
the locked page just once.

cow_file_range_async <--- always called with the correct locked page, in
this case the function is called before any async chunks are going to be
submitted.
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range <--- this one is called with locked_page
belonging to the passed delalloc range.
 run_delalloc_nocow
  extent_clear_unlock_delalloc
   __process_pages_contig


writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
called from compress path and the start range always belongs to the page
 find_lock_delalloc_range <----  if the range is not delalloc it will
retry. But that function is also called with the correct page.
  lock_delalloc_pages <--- ignores range which belongs only to this page
    __unlock_for_delalloc <--- ignores range which belongs only to this page
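
The ownership rule these call chains rely on (one async chunk holds the
locked page, every other chunk sees NULL) can be modelled in userspace
roughly as below. This is only a sketch: struct fake_page, struct
fake_chunk, split_into_chunks() and finish_chunk() are hypothetical
names, the chunk size is arbitrary, and handing the page to the first
chunk is my assumption about which chunk is "the owner", not something
spelled out in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
	uint64_t offset;
};

/* Hypothetical stand-in for an async chunk covering [start, end]. */
struct fake_chunk {
	uint64_t start;
	uint64_t end;
	struct fake_page *locked_page;	/* NULL unless this chunk owns it */
};

/*
 * Split [start, end] into chunks of at most chunk_size bytes.  Only the
 * first chunk receives locked_page (assumption); later chunks get NULL.
 * Returns the number of chunks filled in.
 */
static size_t split_into_chunks(struct fake_chunk *chunks, size_t max_chunks,
				uint64_t start, uint64_t end,
				uint64_t chunk_size,
				struct fake_page *locked_page)
{
	size_t n = 0;

	if (chunk_size == 0)
		return 0;

	while (start <= end && n < max_chunks) {
		uint64_t cur_end = (end - start < chunk_size) ?
				   end : start + chunk_size - 1;

		chunks[n].start = start;
		chunks[n].end = cur_end;
		chunks[n].locked_page = locked_page;
		locked_page = NULL;	/* ownership handed off exactly once */
		n++;

		if (cur_end == end)
			break;
		start = cur_end + 1;
	}
	return n;
}

/* Per-chunk completion: only the owning chunk touches the locked page. */
static void finish_chunk(struct fake_chunk *chunk, int *unlock_count)
{
	if (chunk->locked_page) {
		(*unlock_count)++;
		chunk->locked_page = NULL;
	}
}

/* Three chunks over one range must produce exactly one unlock. */
static void demo_single_owner(void)
{
	struct fake_page page = { .offset = 0 };
	struct fake_chunk chunks[4];
	int unlocks = 0;
	size_t n = split_into_chunks(chunks, 4, 0, 10239, 4096, &page);

	assert(n == 3);
	for (size_t i = 0; i < n; i++)
		finish_chunk(&chunks[i], &unlocks);
	assert(unlocks == 1);
}
```

With a single owner there is no decision left to make in the
completion path about who unlocks the page, whatever order the chunks
finish in.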




<snip>


* Re: [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support
  2019-07-26 15:13 ` [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support David Sterba
@ 2019-09-05 11:59   ` David Sterba
  2019-09-06 17:46     ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2019-09-05 11:59 UTC (permalink / raw)
  To: dsterba, Tejun Heo, josef, clm, dsterba, axboe, jack,
	linux-kernel, linux-btrfs, kernel-team

On Fri, Jul 26, 2019 at 05:13:21PM +0200, David Sterba wrote:
> On Wed, Jul 10, 2019 at 12:28:13PM -0700, Tejun Heo wrote:
> > Hello,
> > 
> > This patchset contains only the btrfs part of the following patchset.
> > 
> >   [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> > 
> > The block part has already been applied to
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ for-linus
> > 
> > with some naming changes.  This patchset has been updated accordingly.
> 
> I'm going to add this patchset to for-next to get some testing coverage.
> There are some comments pending, but those are changelog updates and
> refactoring.

No updates, so patchset stays in for-next, closest merge target is 5.5.


* Re: [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support
  2019-09-05 11:59   ` David Sterba
@ 2019-09-06 17:46     ` Tejun Heo
  2019-10-02 14:15       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2019-09-06 17:46 UTC (permalink / raw)
  To: dsterba, josef, clm, dsterba, axboe, jack, linux-kernel,
	linux-btrfs, kernel-team

Hello, David.

On Thu, Sep 05, 2019 at 01:59:37PM +0200, David Sterba wrote:
> On Fri, Jul 26, 2019 at 05:13:21PM +0200, David Sterba wrote:
> > On Wed, Jul 10, 2019 at 12:28:13PM -0700, Tejun Heo wrote:
> > > Hello,
> > > 
> > > This patchset contains only the btrfs part of the following patchset.
> > > 
> > >   [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> > > 
> > > The block part has already been applied to
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ for-linus
> > > 
> > > with some naming changes.  This patchset has been updated accordingly.
> > 
> > I'm going to add this patchset to for-next to get some testing coverage.
> > There are some comments pending, but those are changelog updates and
> > refactoring.
> 
> No updates, so patchset stays in for-next, closest merge target is 5.5.

Sorry about dropping the ball.  It looked like Chris and Nikolay
weren't agreeing so I wasn't sure what the next step should be and
then forgot about it.  The following is the discussion.

  https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/

What do you think about the exchange?

Thanks.

-- 
tejun


* Re: [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support
  2019-09-06 17:46     ` Tejun Heo
@ 2019-10-02 14:15       ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2019-10-02 14:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, josef, clm, dsterba, axboe, jack, linux-kernel,
	linux-btrfs, kernel-team

On Fri, Sep 06, 2019 at 10:46:56AM -0700, Tejun Heo wrote:
> On Thu, Sep 05, 2019 at 01:59:37PM +0200, David Sterba wrote:
> > On Fri, Jul 26, 2019 at 05:13:21PM +0200, David Sterba wrote:
> > > On Wed, Jul 10, 2019 at 12:28:13PM -0700, Tejun Heo wrote:
> > > > This patchset contains only the btrfs part of the following patchset.
> > > > 
> > > >   [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> > > > 
> > > > The block part has already been applied to
> > > > 
> > > >   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ for-linus
> > > > 
> > > > with some naming changes.  This patchset has been updated accordingly.
> > > 
> > > I'm going to add this patchset to for-next to get some testing coverage.
> > > There are some comments pending, but those are changelog updates and
> > > refactoring.
> > 
> > No updates, so patchset stays in for-next, closest merge target is 5.5.
> 
> Sorry about dropping the ball.  It looked like Chris and Nikolay
> weren't agreeing so I wasn't sure what the next step should be and
> then forgot about it.  The following is the discussion.
> 
>   https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
> 
> What do you think about the exchange?

I've read the thread again and talked to Nikolay. After going through
the questions raised for patch 3/5, I'm more or less ok with merging it
as there are no blockers.  I'll update the changelogs with points from
the discussion.

The patchset has been in for-next for some months now so we have testing
coverage but we'll have more in the main devel patch queue, that I'll
add after I do one more review pass.

Thread overview: 15+ messages
2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
2019-07-11 11:32   ` Nikolay Borisov
2019-07-10 19:28 ` [PATCH 2/5] Btrfs: delete the entire async bio submission framework Tejun Heo
2019-07-11 14:53   ` Nikolay Borisov
2019-07-10 19:28 ` [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct Tejun Heo
2019-07-11 16:00   ` Nikolay Borisov
2019-07-11 19:52     ` Chris Mason
2019-07-26 15:29       ` Nikolay Borisov
2019-07-10 19:28 ` [PATCH 4/5] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios Tejun Heo
2019-07-10 19:28 ` [PATCH 5/5] Btrfs: extent_write_locked_range() should attach inode->i_wb Tejun Heo
2019-07-26 15:13 ` [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support David Sterba
2019-09-05 11:59   ` David Sterba
2019-09-06 17:46     ` Tejun Heo
2019-10-02 14:15       ` David Sterba
