All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 00/11] simplify block layer based on immutable biovecs
@ 2015-08-12  7:07 Ming Lin
  2015-08-12  7:07   ` Ming Lin
                   ` (11 more replies)
  0 siblings, 12 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin

Hi Jens,

Neil/Mike/Martin have acked/reviewed PATCH 1.
Now it's ready. Could you please apply this series?

https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req

Please note that, for discard, we cap the size at 2G.
We'll change it to UINT_MAX after the splitting code in
DM thinp is rewritten.

v6:
  - rebase on top of 4.2-rc6+
  - fix discard/write_same 32bit bi_size overflow issue
  - add ACKs/Review from Mike/Christoph/Martin/Steven

v5:
  - rebase on top of 4.2-rc1
  - reorder patch 6,7
  - add NeilBrown's ACKs
  - fix memory leak: free "bio_split" bioset in blk_release_queue()

v4:
  - rebase on top of 4.1-rc4
  - use BIO_POOL_SIZE instead of number 4 for bioset_create()
  - call blk_queue_split() in blk_mq_make_request()
  - call blk_queue_split() in zram_make_request()
  - add patch "block: remove bio_get_nr_vecs()"
  - remove split code in blkdev_issue_discard()
  - drop patch "md/raid10: make sync_request_write() call bio_copy_data()".
    NeilBrown queued it.
  - drop patch "block: allow __blk_queue_bounce() to handle bios larger than BIO_MAX_PAGES".
    Will send it seperately

v3:
  - rebase on top of 4.1-rc2
  - support for QUEUE_FLAG_SG_GAPS
  - update commit logs of patch 2&4
  - split bio for chunk_aligned_read

v2: https://lkml.org/lkml/2015/4/28/28
v1: https://lkml.org/lkml/2014/12/22/128

This is the 6th attempt of simplifying block layer based on immutable
biovecs. Immutable biovecs, implemented by Kent Overstreet, have been
available in mainline since v3.14. Its original goal was actually making
generic_make_request() accept arbitrarily sized bios, and pushing the
splitting down to the drivers or wherever it's required. See also
discussions in the past, [1] [2] [3].

This will bring not only performance improvements, but also a great amount
of reduction in code complexity all over the block layer. Performance gain
is possible due to the fact that bio_add_page() does not have to check
unnecesary conditions such as queue limits or if biovecs are mergeable.
Those will be delegated to the driver level. Kent already said that he
actually benchmarked the impact of this with fio on a micron p320h, which
showed definitely a positive impact.

Moreover, this patchset also allows a lot of code to be deleted, mainly
because of removal of merge_bvec_fn() callbacks. We have been aware that
it has been always a delicate issue for stacking block drivers (e.g. md
and bcache) to handle merging bio consistently. This simplication will
help every individual block driver avoid having such an issue.

Patches are against 4.2-rc6+. These are also available in my git repo at:

  https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
  git://git.kernel.org/pub/scm/linux/kernel/git/mlin/linux.git block-generic-req

This patchset is a prerequisite of other consecutive patchsets, e.g.
multipage biovecs, rewriting plugging, or rewriting direct-IO, which are
excluded this time. That means, this patchset should not bring any
regression to end-users.

Comments are welcome.
Ming

[1] https://lkml.org/lkml/2014/11/23/263
[2] https://lkml.org/lkml/2013/11/25/732
[3] https://lkml.org/lkml/2014/2/26/618

Dongsu Park (1):
      Documentation: update notes in biovecs about arbitrarily sized bios

Kent Overstreet (8):
      block: make generic_make_request handle arbitrarily sized bios
      block: simplify bio_add_page()
      bcache: remove driver private bio splitting code
      btrfs: remove bio splitting and merge_bvec_fn() calls
      md/raid5: get rid of bio_fits_rdev()
      block: kill merge_bvec_fn() completely
      fs: use helper bio_add_page() instead of open coding on bi_io_vec
      block: remove bio_get_nr_vecs()

Ming Lin (2):
      block: remove split code in blkdev_issue_{discard,write_same}
      md/raid5: split bio for chunk_aligned_read

 Documentation/block/biovecs.txt             |  10 +-
 block/bio.c                                 | 152 ++++++++++------------------
 block/blk-core.c                            |  19 ++--
 block/blk-lib.c                             |  47 ++-------
 block/blk-merge.c                           | 148 +++++++++++++++++++++++++--
 block/blk-mq.c                              |   4 +
 block/blk-settings.c                        |  22 ----
 block/blk-sysfs.c                           |   3 +
 drivers/block/drbd/drbd_int.h               |   1 -
 drivers/block/drbd/drbd_main.c              |   1 -
 drivers/block/drbd/drbd_req.c               |  37 +------
 drivers/block/pktcdvd.c                     |  27 +----
 drivers/block/ps3vram.c                     |   2 +
 drivers/block/rbd.c                         |  47 ---------
 drivers/block/rsxx/dev.c                    |   2 +
 drivers/block/umem.c                        |   2 +
 drivers/block/zram/zram_drv.c               |   2 +
 drivers/md/bcache/bcache.h                  |  18 ----
 drivers/md/bcache/io.c                      | 101 +-----------------
 drivers/md/bcache/journal.c                 |   4 +-
 drivers/md/bcache/request.c                 |  16 +--
 drivers/md/bcache/super.c                   |  32 +-----
 drivers/md/bcache/util.h                    |   5 +-
 drivers/md/bcache/writeback.c               |   4 +-
 drivers/md/dm-cache-target.c                |  21 ----
 drivers/md/dm-crypt.c                       |  16 ---
 drivers/md/dm-era-target.c                  |  15 ---
 drivers/md/dm-flakey.c                      |  16 ---
 drivers/md/dm-io.c                          |   2 +-
 drivers/md/dm-linear.c                      |  16 ---
 drivers/md/dm-log-writes.c                  |  16 ---
 drivers/md/dm-raid.c                        |  19 ----
 drivers/md/dm-snap.c                        |  15 ---
 drivers/md/dm-stripe.c                      |  21 ----
 drivers/md/dm-table.c                       |   8 --
 drivers/md/dm-thin.c                        |  31 ------
 drivers/md/dm-verity.c                      |  16 ---
 drivers/md/dm.c                             | 125 +----------------------
 drivers/md/dm.h                             |   2 -
 drivers/md/linear.c                         |  43 --------
 drivers/md/md.c                             |  28 +----
 drivers/md/md.h                             |  12 ---
 drivers/md/multipath.c                      |  21 ----
 drivers/md/raid0.c                          |  56 ----------
 drivers/md/raid0.h                          |   2 -
 drivers/md/raid1.c                          |  58 +----------
 drivers/md/raid10.c                         | 121 +---------------------
 drivers/md/raid5.c                          |  92 ++++++-----------
 drivers/s390/block/dcssblk.c                |   2 +
 drivers/s390/block/xpram.c                  |   2 +
 drivers/staging/lustre/lustre/llite/lloop.c |   2 +
 fs/btrfs/compression.c                      |   5 +-
 fs/btrfs/extent_io.c                        |   9 +-
 fs/btrfs/inode.c                            |   3 +-
 fs/btrfs/scrub.c                            |  18 +---
 fs/btrfs/volumes.c                          |  72 -------------
 fs/buffer.c                                 |   7 +-
 fs/direct-io.c                              |   2 +-
 fs/ext4/page-io.c                           |   3 +-
 fs/ext4/readpage.c                          |   2 +-
 fs/f2fs/data.c                              |   2 +-
 fs/gfs2/lops.c                              |   9 +-
 fs/jfs/jfs_logmgr.c                         |  14 +--
 fs/logfs/dev_bdev.c                         |   4 +-
 fs/mpage.c                                  |   4 +-
 fs/nilfs2/segbuf.c                          |   2 +-
 fs/xfs/xfs_aops.c                           |   3 +-
 include/linux/bio.h                         |   1 -
 include/linux/blkdev.h                      |  13 +--
 include/linux/device-mapper.h               |   4 -
 mm/page_io.c                                |   8 +-
 71 files changed, 337 insertions(+), 1332 deletions(-)



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 01/11] block: make generic_make_request handle arbitrarily sized bios
@ 2015-08-12  7:07   ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Christoph Hellwig,
	Al Viro, Ming Lei, Neil Brown, Alasdair Kergon, dm-devel,
	Lars Ellenberg, drbd-user, Jiri Kosina, Geoff Levand, Jim Paris,
	Philip Kelleher, Minchan Kim, Nitin Gupta, Oleg Drokin,
	Andreas Dilger, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-core.c                            |  19 ++--
 block/blk-merge.c                           | 159 ++++++++++++++++++++++++++--
 block/blk-mq.c                              |   4 +
 block/blk-sysfs.c                           |   3 +
 drivers/block/drbd/drbd_req.c               |   2 +
 drivers/block/pktcdvd.c                     |   6 +-
 drivers/block/ps3vram.c                     |   2 +
 drivers/block/rsxx/dev.c                    |   2 +
 drivers/block/umem.c                        |   2 +
 drivers/block/zram/zram_drv.c               |   2 +
 drivers/md/dm.c                             |   2 +
 drivers/md/md.c                             |   2 +
 drivers/s390/block/dcssblk.c                |   2 +
 drivers/s390/block/xpram.c                  |   2 +
 drivers/staging/lustre/lustre/llite/lloop.c |   2 +
 include/linux/blkdev.h                      |   3 +
 16 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c..47f84cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -645,6 +645,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!q->bio_split)
+		goto fail_id;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
@@ -653,7 +657,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	err = bdi_init(&q->backing_dev_info);
 	if (err)
-		goto fail_id;
+		goto fail_split;
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
@@ -695,6 +699,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
+fail_split:
+	bioset_free(q->bio_split);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1612,6 +1618,8 @@ static void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	struct request *req;
 	unsigned int request_count = 0;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1832,15 +1840,6 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
-	if (likely(bio_is_rw(bio) &&
-		   nr_sectors > queue_max_hw_sectors(q))) {
-		printk(KERN_ERR "bio too big device %s (%u > %u)\n",
-		       bdevname(bio->bi_bdev, b),
-		       bio_sectors(bio),
-		       queue_max_hw_sectors(q));
-		goto end_io;
-	}
-
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 30a0d9f..3707f30 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,12 +9,158 @@
 
 #include "blk.h"
 
+static struct bio *blk_bio_discard_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	unsigned int max_discard_sectors, granularity;
+	int alignment;
+	sector_t tmp;
+	unsigned split_sectors;
+
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors -= max_discard_sectors % granularity;
+
+	if (unlikely(!max_discard_sectors)) {
+		/* XXX: warn */
+		return NULL;
+	}
+
+	if (bio_sectors(bio) <= max_discard_sectors)
+		return NULL;
+
+	split_sectors = max_discard_sectors;
+
+	/*
+	 * If the next starting sector would be misaligned, stop the discard at
+	 * the previous aligned sector.
+	 */
+	alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+	tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+	tmp = sector_div(tmp, granularity);
+
+	if (split_sectors > tmp)
+		split_sectors -= tmp;
+
+	return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+					    struct bio *bio,
+					    struct bio_set *bs)
+{
+	if (!q->limits.max_write_same_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	struct bio *split;
+	struct bio_vec bv, bvprv;
+	struct bvec_iter iter;
+	unsigned seg_size = 0, nsegs = 0;
+	int prev = 0;
+
+	struct bvec_merge_data bvm = {
+		.bi_bdev	= bio->bi_bdev,
+		.bi_sector	= bio->bi_iter.bi_sector,
+		.bi_size	= 0,
+		.bi_rw		= bio->bi_rw,
+	};
+
+	bio_for_each_segment(bv, bio, iter) {
+		if (q->merge_bvec_fn &&
+		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+			goto split;
+
+		bvm.bi_size += bv.bv_len;
+
+		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+			goto split;
+
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
+		    prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
+			goto split;
+
+		if (prev && blk_queue_cluster(q)) {
+			if (seg_size + bv.bv_len > queue_max_segment_size(q))
+				goto new_segment;
+			if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+				goto new_segment;
+
+			seg_size += bv.bv_len;
+			bvprv = bv;
+			prev = 1;
+			continue;
+		}
+new_segment:
+		if (nsegs == queue_max_segments(q))
+			goto split;
+
+		nsegs++;
+		bvprv = bv;
+		prev = 1;
+		seg_size = bv.bv_len;
+	}
+
+	return NULL;
+split:
+	split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+	split->bi_iter.bi_size -= iter.bi_size;
+	bio->bi_iter = iter;
+
+	if (bio_integrity(bio)) {
+		bio_integrity_advance(bio, split->bi_iter.bi_size);
+		bio_integrity_trim(split, 0, bio_sectors(split));
+	}
+
+	return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+		     struct bio_set *bs)
+{
+	struct bio *split;
+
+	if ((*bio)->bi_rw & REQ_DISCARD)
+		split = blk_bio_discard_split(q, *bio, bs);
+	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+		split = blk_bio_write_same_split(q, *bio, bs);
+	else
+		split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+	if (split) {
+		bio_chain(split, *bio);
+		generic_make_request(*bio);
+		*bio = split;
+	}
+}
+EXPORT_SYMBOL(blk_queue_split);
+
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int cluster, high, highprv = 1;
+	int cluster, prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
@@ -36,7 +182,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
 	nr_phys_segs = 0;
-	high = 0;
 	for_each_bio(bio) {
 		bio_for_each_segment(bv, bio, iter) {
 			/*
@@ -46,13 +191,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			if (no_sg_merge)
 				goto new_segment;
 
-			/*
-			 * the trick here is making sure that a high page is
-			 * never considered part of another segment, since
-			 * that might change with the bounce page.
-			 */
-			high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
-			if (!high && !highprv && cluster) {
+			if (prev && cluster) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
@@ -72,8 +211,8 @@ new_segment:
 
 			nr_phys_segs++;
 			bvprv = bv;
+			prev = 1;
 			seg_size = bv.bv_len;
-			highprv = high;
 		}
 		bbio = bio;
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..a2808d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1287,6 +1287,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
@@ -1372,6 +1374,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6264b38..9a25db3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -523,6 +523,9 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	if (q->bio_split)
+		bioset_free(q->bio_split);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3907202..a6265bc 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1497,6 +1497,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	start_jif = jiffies;
 
 	/*
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 4c20c22..05a81ae 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2447,6 +2447,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
 	char b[BDEVNAME_SIZE];
 	struct bio *split;
 
+	blk_queue_bounce(q, &bio);
+
+	blk_queue_split(q, &bio, q->bio_split);
+
 	pd = q->queuedata;
 	if (!pd) {
 		pr_err("%s incorrect request queue\n",
@@ -2477,8 +2481,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
 		goto end_io;
 	}
 
-	blk_queue_bounce(q, &bio);
-
 	do {
 		sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
 		sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index b1612eb..748c63c 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -605,6 +605,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
 	bio_list_add(&priv->list, bio);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index ac8c62c..50ef199 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	might_sleep();
 
 	if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5..13d577c 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
 	bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fb655e8..44e3b89 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -901,6 +901,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
 	if (unlikely(!zram_meta_get(zram)))
 		goto error;
 
+	blk_queue_split(queue, &bio, queue->bio_split);
+
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
 		atomic64_inc(&zram->stats.invalid_io);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0d7ab20..4db6ca2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1789,6 +1789,8 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)
 
 	map = dm_get_live_table(md, &srcu_idx);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
 
 	/* if we're suspended, we have to queue this io for later */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e25f00f..b8c5a82 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -257,6 +257,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da21281..267ca3a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -826,6 +826,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
 	if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e939..1305ed3 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
 		/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index cc00fd1..1e33d54 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -340,6 +340,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
 	int rw = bio_rw(old_bio);
 	int inactive;
 
+	blk_queue_split(q, &old_bio, q->bio_split);
+
 	if (!lo)
 		goto err;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c1..dc89cc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -462,6 +462,7 @@ struct request_queue {
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+	struct bio_set		*bio_split;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
@@ -782,6 +783,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+			    struct bio_set *);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 01/11] block: make generic_make_request handle arbitrarily sized bios
@ 2015-08-12  7:07   ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Mike Snitzer, Neil Brown, Ming Lei,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Christoph Hellwig,
	Alasdair Kergon, Lars Ellenberg, Philip Kelleher,
	Christoph Hellwig, Kent Overstreet, Nitin Gupta, Ming Lin,
	Oleg Drokin, Al Viro, Ming Lin, Jens Axboe, Andreas Dilger,
	Martin K. Petersen, Geoff Levand, Jiri Kosina, Jim Paris,
	Minchan Kim, Dongsu Park, drbd-user-cunTk1MwBs8qoQakbn7OcQ

From: Kent Overstreet <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
Cc: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Cc: Ming Lei <ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
Cc: Alasdair Kergon <agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mike Snitzer <snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: Lars Ellenberg <drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org>
Cc: drbd-user-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
Cc: Jiri Kosina <jkosina-AlSwsSmVLrQ@public.gmane.org>
Cc: Geoff Levand <geoff-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Jim Paris <jim-XrPbb/hENzg@public.gmane.org>
Cc: Philip Kelleher <pjk1939-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Nitin Gupta <ngupta-KNmc09w0p+Ednm+yROfE0A@public.gmane.org>
Cc: Oleg Drokin <oleg.drokin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Andreas Dilger <andreas.dilger-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Acked-by: NeilBrown <neilb-l3A5Bk7waGM@public.gmane.org> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Martin K. Petersen <martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Kent Overstreet <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark-VwIFZPTo/vqsTnJN9+BGXg@public.gmane.org>
Signed-off-by: Ming Lin <ming.l-Vzezgt5dB6uUEJcrhfAQsw@public.gmane.org>
---
 block/blk-core.c                            |  19 ++--
 block/blk-merge.c                           | 159 ++++++++++++++++++++++++++--
 block/blk-mq.c                              |   4 +
 block/blk-sysfs.c                           |   3 +
 drivers/block/drbd/drbd_req.c               |   2 +
 drivers/block/pktcdvd.c                     |   6 +-
 drivers/block/ps3vram.c                     |   2 +
 drivers/block/rsxx/dev.c                    |   2 +
 drivers/block/umem.c                        |   2 +
 drivers/block/zram/zram_drv.c               |   2 +
 drivers/md/dm.c                             |   2 +
 drivers/md/md.c                             |   2 +
 drivers/s390/block/dcssblk.c                |   2 +
 drivers/s390/block/xpram.c                  |   2 +
 drivers/staging/lustre/lustre/llite/lloop.c |   2 +
 include/linux/blkdev.h                      |   3 +
 16 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c..47f84cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -645,6 +645,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!q->bio_split)
+		goto fail_id;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
@@ -653,7 +657,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	err = bdi_init(&q->backing_dev_info);
 	if (err)
-		goto fail_id;
+		goto fail_split;
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
@@ -695,6 +699,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
+fail_split:
+	bioset_free(q->bio_split);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1612,6 +1618,8 @@ static void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	struct request *req;
 	unsigned int request_count = 0;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1832,15 +1840,6 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
-	if (likely(bio_is_rw(bio) &&
-		   nr_sectors > queue_max_hw_sectors(q))) {
-		printk(KERN_ERR "bio too big device %s (%u > %u)\n",
-		       bdevname(bio->bi_bdev, b),
-		       bio_sectors(bio),
-		       queue_max_hw_sectors(q));
-		goto end_io;
-	}
-
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 30a0d9f..3707f30 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,12 +9,158 @@
 
 #include "blk.h"
 
+static struct bio *blk_bio_discard_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	unsigned int max_discard_sectors, granularity;
+	int alignment;
+	sector_t tmp;
+	unsigned split_sectors;
+
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors -= max_discard_sectors % granularity;
+
+	if (unlikely(!max_discard_sectors)) {
+		/* XXX: warn */
+		return NULL;
+	}
+
+	if (bio_sectors(bio) <= max_discard_sectors)
+		return NULL;
+
+	split_sectors = max_discard_sectors;
+
+	/*
+	 * If the next starting sector would be misaligned, stop the discard at
+	 * the previous aligned sector.
+	 */
+	alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+	tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+	tmp = sector_div(tmp, granularity);
+
+	if (split_sectors > tmp)
+		split_sectors -= tmp;
+
+	return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+					    struct bio *bio,
+					    struct bio_set *bs)
+{
+	if (!q->limits.max_write_same_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	struct bio *split;
+	struct bio_vec bv, bvprv;
+	struct bvec_iter iter;
+	unsigned seg_size = 0, nsegs = 0;
+	int prev = 0;
+
+	struct bvec_merge_data bvm = {
+		.bi_bdev	= bio->bi_bdev,
+		.bi_sector	= bio->bi_iter.bi_sector,
+		.bi_size	= 0,
+		.bi_rw		= bio->bi_rw,
+	};
+
+	bio_for_each_segment(bv, bio, iter) {
+		if (q->merge_bvec_fn &&
+		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+			goto split;
+
+		bvm.bi_size += bv.bv_len;
+
+		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+			goto split;
+
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
+		    prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
+			goto split;
+
+		if (prev && blk_queue_cluster(q)) {
+			if (seg_size + bv.bv_len > queue_max_segment_size(q))
+				goto new_segment;
+			if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+				goto new_segment;
+
+			seg_size += bv.bv_len;
+			bvprv = bv;
+			prev = 1;
+			continue;
+		}
+new_segment:
+		if (nsegs == queue_max_segments(q))
+			goto split;
+
+		nsegs++;
+		bvprv = bv;
+		prev = 1;
+		seg_size = bv.bv_len;
+	}
+
+	return NULL;
+split:
+	split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+	split->bi_iter.bi_size -= iter.bi_size;
+	bio->bi_iter = iter;
+
+	if (bio_integrity(bio)) {
+		bio_integrity_advance(bio, split->bi_iter.bi_size);
+		bio_integrity_trim(split, 0, bio_sectors(split));
+	}
+
+	return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+		     struct bio_set *bs)
+{
+	struct bio *split;
+
+	if ((*bio)->bi_rw & REQ_DISCARD)
+		split = blk_bio_discard_split(q, *bio, bs);
+	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+		split = blk_bio_write_same_split(q, *bio, bs);
+	else
+		split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+	if (split) {
+		bio_chain(split, *bio);
+		generic_make_request(*bio);
+		*bio = split;
+	}
+}
+EXPORT_SYMBOL(blk_queue_split);
+
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int cluster, high, highprv = 1;
+	int cluster, prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
@@ -36,7 +182,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
 	nr_phys_segs = 0;
-	high = 0;
 	for_each_bio(bio) {
 		bio_for_each_segment(bv, bio, iter) {
 			/*
@@ -46,13 +191,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			if (no_sg_merge)
 				goto new_segment;
 
-			/*
-			 * the trick here is making sure that a high page is
-			 * never considered part of another segment, since
-			 * that might change with the bounce page.
-			 */
-			high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
-			if (!high && !highprv && cluster) {
+			if (prev && cluster) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
@@ -72,8 +211,8 @@ new_segment:
 
 			nr_phys_segs++;
 			bvprv = bv;
+			prev = 1;
 			seg_size = bv.bv_len;
-			highprv = high;
 		}
 		bbio = bio;
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..a2808d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1287,6 +1287,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
@@ -1372,6 +1374,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6264b38..9a25db3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -523,6 +523,9 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	if (q->bio_split)
+		bioset_free(q->bio_split);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3907202..a6265bc 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1497,6 +1497,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	start_jif = jiffies;
 
 	/*
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 4c20c22..05a81ae 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2447,6 +2447,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
 	char b[BDEVNAME_SIZE];
 	struct bio *split;
 
+	blk_queue_bounce(q, &bio);
+
+	blk_queue_split(q, &bio, q->bio_split);
+
 	pd = q->queuedata;
 	if (!pd) {
 		pr_err("%s incorrect request queue\n",
@@ -2477,8 +2481,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
 		goto end_io;
 	}
 
-	blk_queue_bounce(q, &bio);
-
 	do {
 		sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
 		sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index b1612eb..748c63c 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -605,6 +605,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
 	bio_list_add(&priv->list, bio);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index ac8c62c..50ef199 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	might_sleep();
 
 	if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5..13d577c 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
 	bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fb655e8..44e3b89 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -901,6 +901,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
 	if (unlikely(!zram_meta_get(zram)))
 		goto error;
 
+	blk_queue_split(queue, &bio, queue->bio_split);
+
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
 		atomic64_inc(&zram->stats.invalid_io);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0d7ab20..4db6ca2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1789,6 +1789,8 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)
 
 	map = dm_get_live_table(md, &srcu_idx);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
 
 	/* if we're suspended, we have to queue this io for later */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e25f00f..b8c5a82 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -257,6 +257,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da21281..267ca3a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -826,6 +826,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
 	if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e939..1305ed3 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
 		/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index cc00fd1..1e33d54 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -340,6 +340,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
 	int rw = bio_rw(old_bio);
 	int inactive;
 
+	blk_queue_split(q, &old_bio, q->bio_split);
+
 	if (!lo)
 		goto err;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c1..dc89cc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -462,6 +462,7 @@ struct request_queue {
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+	struct bio_set		*bio_split;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
@@ -782,6 +783,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+			    struct bio_set *);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 02/11] block: simplify bio_add_page()
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
  2015-08-12  7:07   ` Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 03/11] bcache: remove driver private bio splitting code Ming Lin
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Christoph Hellwig,
	Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() doesn't need to call ->merge_bvec_fn(), where
we can get rid of unnecessary code paths.

Removing the call to ->merge_bvec_fn() is also fine, as no driver that
implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
method.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: rebase and resolve merge conflicts, change a couple of comments,
 make bio_add_page() warn once upon a cloned bio.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/bio.c | 135 +++++++++++++++++++++++++-----------------------------------
 1 file changed, 55 insertions(+), 80 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d6e5ba3..c8bfa61 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -714,9 +714,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
-			  *page, unsigned int len, unsigned int offset,
-			  unsigned int max_sectors)
+/**
+ *	bio_add_pc_page	-	attempt to add page to bio
+ *	@q: the target queue
+ *	@bio: destination bio
+ *	@page: page to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Attempt to add a page to the bio_vec maplist. This can fail for a
+ *	number of reasons, such as the bio being full or target block device
+ *	limitations. The target block device must allow bio's up to PAGE_SIZE,
+ *	so it is always possible to add a single page to an empty bio.
+ *
+ *	This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+		    *page, unsigned int len, unsigned int offset)
 {
 	int retried_segments = 0;
 	struct bio_vec *bvec;
@@ -727,7 +741,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (unlikely(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
 		return 0;
 
 	/*
@@ -740,28 +754,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 
 		if (page == prev->bv_page &&
 		    offset == prev->bv_offset + prev->bv_len) {
-			unsigned int prev_bv_len = prev->bv_len;
 			prev->bv_len += len;
-
-			if (q->merge_bvec_fn) {
-				struct bvec_merge_data bvm = {
-					/* prev_bvec is already charged in
-					   bi_size, discharge it in order to
-					   simulate merging updated prev_bvec
-					   as new bvec. */
-					.bi_bdev = bio->bi_bdev,
-					.bi_sector = bio->bi_iter.bi_sector,
-					.bi_size = bio->bi_iter.bi_size -
-						prev_bv_len,
-					.bi_rw = bio->bi_rw,
-				};
-
-				if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
-					prev->bv_len -= len;
-					return 0;
-				}
-			}
-
 			bio->bi_iter.bi_size += len;
 			goto done;
 		}
@@ -804,27 +797,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 		blk_recount_segments(q, bio);
 	}
 
-	/*
-	 * if queue has other restrictions (eg varying max sector size
-	 * depending on offset), it can specify a merge_bvec_fn in the
-	 * queue to get further control
-	 */
-	if (q->merge_bvec_fn) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev = bio->bi_bdev,
-			.bi_sector = bio->bi_iter.bi_sector,
-			.bi_size = bio->bi_iter.bi_size - len,
-			.bi_rw = bio->bi_rw,
-		};
-
-		/*
-		 * merge_bvec_fn() returns number of bytes it can accept
-		 * at this offset
-		 */
-		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len)
-			goto failed;
-	}
-
 	/* If we may be able to merge these biovecs, force a recount */
 	if (bio->bi_vcnt > 1 && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
 		bio->bi_flags &= ~(1 << BIO_SEG_VALID);
@@ -841,28 +813,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	blk_recount_segments(q, bio);
 	return 0;
 }
-
-/**
- *	bio_add_pc_page	-	attempt to add page to bio
- *	@q: the target queue
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
- *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
- *
- *	This should only be used by REQ_PC bios.
- */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
-		    unsigned int len, unsigned int offset)
-{
-	return __bio_add_page(q, bio, page, len, offset,
-			      queue_max_hw_sectors(q));
-}
 EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
@@ -872,22 +822,47 @@ EXPORT_SYMBOL(bio_add_pc_page);
  *	@len: vec entry length
  *	@offset: vec entry offset
  *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
+ *	Attempt to add a page to the bio_vec maplist. This will only fail
+ *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
  */
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
-		 unsigned int offset)
+int bio_add_page(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset)
 {
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	unsigned int max_sectors;
+	struct bio_vec *bv;
 
-	max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
-	if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
-		max_sectors = len >> 9;
+	/*
+	 * cloned bio must not modify vec list
+	 */
+	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+		return 0;
 
-	return __bio_add_page(q, bio, page, len, offset, max_sectors);
+	/*
+	 * For filesystems with a blocksize smaller than the pagesize
+	 * we will often be called with the same page as last time and
+	 * a consecutive offset.  Optimize this special case.
+	 */
+	if (bio->bi_vcnt > 0) {
+		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (page == bv->bv_page &&
+		    offset == bv->bv_offset + bv->bv_len) {
+			bv->bv_len += len;
+			goto done;
+		}
+	}
+
+	if (bio->bi_vcnt >= bio->bi_max_vecs)
+		return 0;
+
+	bv		= &bio->bi_io_vec[bio->bi_vcnt];
+	bv->bv_page	= page;
+	bv->bv_len	= len;
+	bv->bv_offset	= offset;
+
+	bio->bi_vcnt++;
+done:
+	bio->bi_iter.bi_size += len;
+	return len;
 }
 EXPORT_SYMBOL(bio_add_page);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 03/11] bcache: remove driver private bio splitting code
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
  2015-08-12  7:07   ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 02/11] block: simplify bio_add_page() Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2016-01-08  1:53   ` Eric Wheeler
  2015-08-12  7:07 ` [PATCH v6 04/11] btrfs: remove bio splitting and merge_bvec_fn() calls Ming Lin
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, linux-bcache,
	Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

The bcache driver has always accepted arbitrarily large bios and split
them internally.  Now that every driver must accept arbitrarily large
bios this code isn't nessecary anymore.

Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 drivers/md/bcache/bcache.h    |  18 --------
 drivers/md/bcache/io.c        | 101 +-----------------------------------------
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   |  16 +++----
 drivers/md/bcache/super.c     |  32 +------------
 drivers/md/bcache/util.h      |   5 ++-
 drivers/md/bcache/writeback.c |   4 +-
 7 files changed, 18 insertions(+), 162 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 04f7bc2..6b420a5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@ struct keybuf {
 	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
 };
 
-struct bio_split_pool {
-	struct bio_set		*bio_split;
-	mempool_t		*bio_split_hook;
-};
-
-struct bio_split_hook {
-	struct closure		cl;
-	struct bio_split_pool	*p;
-	struct bio		*bio;
-	bio_end_io_t		*bi_end_io;
-	void			*bi_private;
-};
-
 struct bcache_device {
 	struct closure		cl;
 
@@ -288,8 +275,6 @@ struct bcache_device {
 	int (*cache_miss)(struct btree *, struct search *,
 			  struct bio *, unsigned);
 	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct io {
@@ -454,8 +439,6 @@ struct cache {
 	atomic_long_t		meta_sectors_written;
 	atomic_long_t		btree_sectors_written;
 	atomic_long_t		sectors_written;
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct gc_stat {
@@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
 void bch_bbio_free(struct bio *, struct cache_set *);
 struct bio *bch_bbio_alloc(struct cache_set *);
 
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
 void __bch_submit_bbio(struct bio *, struct cache_set *);
 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
 
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index bf6a9ca..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,105 +11,6 @@
 
 #include <linux/blkdev.h>
 
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	unsigned ret = 0, seg = 0;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
-	bio_for_each_segment(bv, bio, iter) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev	= bio->bi_bdev,
-			.bi_sector	= bio->bi_iter.bi_sector,
-			.bi_size	= ret << 9,
-			.bi_rw		= bio->bi_rw,
-		};
-
-		if (seg == min_t(unsigned, BIO_MAX_PAGES,
-				 queue_max_segments(q)))
-			break;
-
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			break;
-
-		seg++;
-		ret += bv.bv_len >> 9;
-	}
-
-	ret = min(ret, queue_max_sectors(q));
-
-	WARN_ON(!ret);
-	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
-	return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	s->bio->bi_end_io = s->bi_end_io;
-	s->bio->bi_private = s->bi_private;
-	bio_endio(s->bio, 0);
-
-	closure_debug_destroy(&s->cl);
-	mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
-	struct closure *cl = bio->bi_private;
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	if (error)
-		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
-	bio_put(bio);
-	closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
-	struct bio_split_hook *s;
-	struct bio *n;
-
-	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
-		goto submit;
-
-	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
-		goto submit;
-
-	s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
-	closure_init(&s->cl, NULL);
-
-	s->bio		= bio;
-	s->p		= p;
-	s->bi_end_io	= bio->bi_end_io;
-	s->bi_private	= bio->bi_private;
-	bio_get(bio);
-
-	do {
-		n = bio_next_split(bio, bch_bio_max_sectors(bio),
-				   GFP_NOIO, s->p->bio_split);
-
-		n->bi_end_io	= bch_bio_submit_split_endio;
-		n->bi_private	= &s->cl;
-
-		closure_get(&s->cl);
-		generic_make_request(n);
-	} while (n != bio);
-
-	continue_at(&s->cl, bch_bio_submit_split_done, NULL);
-	return;
-submit:
-	generic_make_request(bio);
-}
-
 /* Bios with headers */
 
 void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -139,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
 	bio->bi_bdev		= PTR_CACHE(c, &b->key, 0)->bdev;
 
 	b->submit_time_us = local_clock_us();
-	closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+	closure_bio_submit(bio, bio->bi_private);
 }
 
 void bch_submit_bbio(struct bio *bio, struct cache_set *c,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 418607a..727ca9b 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -61,7 +61,7 @@ reread:		left = ca->sb.bucket_size - offset;
 		bio->bi_private = &cl;
 		bch_bio_map(bio, data);
 
-		closure_bio_submit(bio, &cl, ca);
+		closure_bio_submit(bio, &cl);
 		closure_sync(&cl);
 
 		/* This function could be simpler now since we no longer write
@@ -648,7 +648,7 @@ static void journal_write_unlocked(struct closure *cl)
 	spin_unlock(&c->journal.lock);
 
 	while ((bio = bio_list_pop(&list)))
-		closure_bio_submit(bio, cl, c->cache[0]);
+		closure_bio_submit(bio, cl);
 
 	continue_at(cl, journal_write_done, NULL);
 }
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index f292790..ab093a8 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -718,7 +718,7 @@ static void cached_dev_read_error(struct closure *cl)
 
 		/* XXX: invalidate cache */
 
-		closure_bio_submit(bio, cl, s->d);
+		closure_bio_submit(bio, cl);
 	}
 
 	continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -841,7 +841,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	s->cache_miss	= miss;
 	s->iop.bio	= cache_bio;
 	bio_get(cache_bio);
-	closure_bio_submit(cache_bio, &s->cl, s->d);
+	closure_bio_submit(cache_bio, &s->cl);
 
 	return ret;
 out_put:
@@ -849,7 +849,7 @@ out_put:
 out_submit:
 	miss->bi_end_io		= request_endio;
 	miss->bi_private	= &s->cl;
-	closure_bio_submit(miss, &s->cl, s->d);
+	closure_bio_submit(miss, &s->cl);
 	return ret;
 }
 
@@ -914,7 +914,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 
 		if (!(bio->bi_rw & REQ_DISCARD) ||
 		    blk_queue_discard(bdev_get_queue(dc->bdev)))
-			closure_bio_submit(bio, cl, s->d);
+			closure_bio_submit(bio, cl);
 	} else if (s->iop.writeback) {
 		bch_writeback_add(dc);
 		s->iop.bio = bio;
@@ -929,12 +929,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 			flush->bi_end_io = request_endio;
 			flush->bi_private = cl;
 
-			closure_bio_submit(flush, cl, s->d);
+			closure_bio_submit(flush, cl);
 		}
 	} else {
 		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
 
-		closure_bio_submit(bio, cl, s->d);
+		closure_bio_submit(bio, cl);
 	}
 
 	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -950,7 +950,7 @@ static void cached_dev_nodata(struct closure *cl)
 		bch_journal_meta(s->iop.c, cl);
 
 	/* If it's a flush, we send the flush to the backing device too */
-	closure_bio_submit(bio, cl, s->d);
+	closure_bio_submit(bio, cl);
 
 	continue_at(cl, cached_dev_bio_complete, NULL);
 }
@@ -994,7 +994,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
 		    !blk_queue_discard(bdev_get_queue(dc->bdev)))
 			bio_endio(bio, 0);
 		else
-			bch_generic_make_request(bio, &d->bio_split_hook);
+			generic_make_request(bio);
 	}
 }
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 94980bf..db70c9e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;
 
 #define BTREE_MAX_PAGES		(256 * 1024 / PAGE_SIZE)
 
-static void bio_split_pool_free(struct bio_split_pool *p)
-{
-	if (p->bio_split_hook)
-		mempool_destroy(p->bio_split_hook);
-
-	if (p->bio_split)
-		bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
-	p->bio_split = bioset_create(4, 0);
-	if (!p->bio_split)
-		return -ENOMEM;
-
-	p->bio_split_hook = mempool_create_kmalloc_pool(4,
-				sizeof(struct bio_split_hook));
-	if (!p->bio_split_hook)
-		return -ENOMEM;
-
-	return 0;
-}
-
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
 	bio->bi_private = ca;
 	bch_bio_map(bio, ca->disk_buckets);
 
-	closure_bio_submit(bio, &ca->prio, ca);
+	closure_bio_submit(bio, &ca->prio);
 	closure_sync(cl);
 }
 
@@ -757,7 +734,6 @@ static void bcache_device_free(struct bcache_device *d)
 		put_disk(d->disk);
 	}
 
-	bio_split_pool_free(&d->bio_split_hook);
 	if (d->bio_split)
 		bioset_free(d->bio_split);
 	kvfree(d->full_dirty_stripes);
@@ -804,7 +780,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 		return minor;
 
 	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
-	    bio_split_pool_init(&d->bio_split_hook) ||
 	    !(d->disk = alloc_disk(1))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1793,8 +1768,6 @@ void bch_cache_release(struct kobject *kobj)
 		ca->set->cache[ca->sb.nr_this_dev] = NULL;
 	}
 
-	bio_split_pool_free(&ca->bio_split_hook);
-
 	free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
 	kfree(ca->prio_buckets);
 	vfree(ca->buckets);
@@ -1839,8 +1812,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
 					  ca->sb.nbuckets)) ||
 	    !(ca->prio_buckets	= kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
 					  2, GFP_KERNEL)) ||
-	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)) ||
-	    bio_split_pool_init(&ca->bio_split_hook))
+	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)))
 		return -ENOMEM;
 
 	ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 1d04c48..cf2cbc2 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/errno.h>
+#include <linux/blkdev.h>
 #include <linux/kernel.h>
 #include <linux/llist.h>
 #include <linux/ratelimit.h>
@@ -570,10 +571,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
 	return bdev->bd_inode->i_size >> 9;
 }
 
-#define closure_bio_submit(bio, cl, dev)				\
+#define closure_bio_submit(bio, cl)					\
 do {									\
 	closure_get(cl);						\
-	bch_generic_make_request(bio, &(dev)->bio_split_hook);		\
+	generic_make_request(bio);					\
 } while (0)
 
 uint64_t bch_crc64_update(uint64_t, const void *, size_t);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..ca38362 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
 	io->bio.bi_bdev		= io->dc->bdev;
 	io->bio.bi_end_io	= dirty_endio;
 
-	closure_bio_submit(&io->bio, cl, &io->dc->disk);
+	closure_bio_submit(&io->bio, cl);
 
 	continue_at(cl, write_dirty_finish, system_wq);
 }
@@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
 {
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 
-	closure_bio_submit(&io->bio, cl, &io->dc->disk);
+	closure_bio_submit(&io->bio, cl);
 
 	continue_at(cl, write_dirty, system_wq);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 04/11] btrfs: remove bio splitting and merge_bvec_fn() calls
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (2 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 03/11] bcache: remove driver private bio splitting code Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Chris Mason,
	Josef Bacik, linux-btrfs, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 fs/btrfs/volumes.c | 72 ------------------------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fbe7c10..1b52313 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5871,34 +5871,6 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
 				 &device->work);
 }
 
-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
-		       sector_t sector)
-{
-	struct bio_vec *prev;
-	struct request_queue *q = bdev_get_queue(bdev);
-	unsigned int max_sectors = queue_max_sectors(q);
-	struct bvec_merge_data bvm = {
-		.bi_bdev = bdev,
-		.bi_sector = sector,
-		.bi_rw = bio->bi_rw,
-	};
-
-	if (WARN_ON(bio->bi_vcnt == 0))
-		return 1;
-
-	prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
-	if (bio_sectors(bio) > max_sectors)
-		return 0;
-
-	if (!q->merge_bvec_fn)
-		return 1;
-
-	bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
-	if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
-		return 0;
-	return 1;
-}
-
 static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
 			      struct bio *bio, u64 physical, int dev_nr,
 			      int rw, int async)
@@ -5932,38 +5904,6 @@ static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
 		btrfsic_submit_bio(rw, bio);
 }
 
-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
-			      struct bio *first_bio, struct btrfs_device *dev,
-			      int dev_nr, int rw, int async)
-{
-	struct bio_vec *bvec = first_bio->bi_io_vec;
-	struct bio *bio;
-	int nr_vecs = bio_get_nr_vecs(dev->bdev);
-	u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
-	bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
-	if (!bio)
-		return -ENOMEM;
-
-	while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
-		if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len) {
-			u64 len = bio->bi_iter.bi_size;
-
-			atomic_inc(&bbio->stripes_pending);
-			submit_stripe_bio(root, bbio, bio, physical, dev_nr,
-					  rw, async);
-			physical += len;
-			goto again;
-		}
-		bvec++;
-	}
-
-	submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
-	return 0;
-}
-
 static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 {
 	atomic_inc(&bbio->error);
@@ -6036,18 +5976,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 			continue;
 		}
 
-		/*
-		 * Check and see if we're ok with this bio based on it's size
-		 * and offset with the given device.
-		 */
-		if (!bio_size_ok(dev->bdev, first_bio,
-				 bbio->stripes[dev_nr].physical >> 9)) {
-			ret = breakup_stripe_bio(root, bbio, first_bio, dev,
-						 dev_nr, rw, async_submit);
-			BUG_ON(ret);
-			continue;
-		}
-
 		if (dev_nr < total_devs - 1) {
 			bio = btrfs_bio_clone(first_bio, GFP_NOFS);
 			BUG_ON(!bio); /* -ENOMEM */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (3 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 04/11] btrfs: remove bio splitting and merge_bvec_fn() calls Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-10-13 11:50     ` Christoph Hellwig
  2015-08-12  7:07 ` [PATCH v6 06/11] md/raid5: split bio for chunk_aligned_read Ming Lin
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Ming Lin

From: Ming Lin <ming.l@ssi.samsung.com>

The split code in blkdev_issue_{discard,write_same} can go away
now that any driver that cares does the split. We have to make
sure bio size doesn't overflow.

For discard, we set max discard sectors to (1<<31)>>9 to ensure
it doesn't overflow bi_size and hopefully it is of the proper
granularity as long as the granularity is a power of two.

Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-lib.c | 47 +++++++++++------------------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..948594c 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,13 @@ static void bio_batch_end_io(struct bio *bio, int err)
 	bio_put(bio);
 }
 
+/*
+ * Ensure that max discard sectors doesn't overflow bi_size and hopefully
+ * it is of the proper granularity as long as the granularity is a power
+ * of two.
+ */
+#define MAX_BIO_SECTORS ((1U << 31) >> 9)
+
 /**
  * blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
@@ -43,8 +50,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
-	unsigned int max_discard_sectors, granularity;
-	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -56,21 +61,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
-	/* Zero-sector (unknown) and one-sector granularities are the same.  */
-	granularity = max(q->limits.discard_granularity >> 9, 1U);
-	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-	/*
-	 * Ensure that max_discard_sectors is of the proper
-	 * granularity, so that requests stay aligned after a split.
-	 */
-	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
-	max_discard_sectors -= max_discard_sectors % granularity;
-	if (unlikely(!max_discard_sectors)) {
-		/* Avoid infinite loop below. Being cautious never hurts. */
-		return -EOPNOTSUPP;
-	}
-
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
@@ -84,7 +74,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
-		sector_t end_sect, tmp;
+		sector_t end_sect;
 
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
@@ -92,21 +82,8 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
-
-		/*
-		 * If splitting a request, and the next starting sector would be
-		 * misaligned, stop the discard at the previous aligned sector.
-		 */
+		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
 		end_sect = sector + req_sects;
-		tmp = end_sect;
-		if (req_sects < nr_sects &&
-		    sector_div(tmp, granularity) != alignment) {
-			end_sect = end_sect - alignment;
-			sector_div(end_sect, granularity);
-			end_sect = end_sect * granularity + alignment;
-			req_sects = end_sect - sector;
-		}
 
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;
@@ -166,10 +143,8 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 	if (!q)
 		return -ENXIO;
 
-	max_write_same_sectors = q->limits.max_write_same_sectors;
-
-	if (max_write_same_sectors == 0)
-		return -EOPNOTSUPP;
+	/* Ensure that max_write_same_sectors doesn't overflow bi_size */
+	max_write_same_sectors = UINT_MAX >> 9;
 
 	atomic_set(&bb.done, 1);
 	bb.flags = 1 << BIO_UPTODATE;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 06/11] md/raid5: split bio for chunk_aligned_read
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (4 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 07/11] md/raid5: get rid of bio_fits_rdev() Ming Lin
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Ming Lin, Neil Brown,
	linux-raid

From: Ming Lin <ming.l@ssi.samsung.com>

If a read request fits entirely in a chunk, it will be passed directly to the
underlying device (providing it hasn't failed of course).  If it doesn't fit,
the slightly less efficient path that uses the stripe_cache is used.
Requests that get to the stripe cache are always completely split up as
necessary.

So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work,
but could cause it to take the less efficient path more often.

All that is needed to manage this is for 'chunk_aligned_read' do some bio
splitting, much like the RAID0 code does.

Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 drivers/md/raid5.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f757023..d572639 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4806,7 +4806,7 @@ static int bio_fits_rdev(struct bio *bi)
 	return 1;
 }
 
-static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
+static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 {
 	struct r5conf *conf = mddev->private;
 	int dd_idx;
@@ -4815,7 +4815,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 	sector_t end_sector;
 
 	if (!in_chunk_boundary(mddev, raid_bio)) {
-		pr_debug("chunk_aligned_read : non aligned\n");
+		pr_debug("%s: non aligned\n", __func__);
 		return 0;
 	}
 	/*
@@ -4892,6 +4892,31 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 	}
 }
 
+static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
+{
+	struct bio *split;
+
+	do {
+		sector_t sector = raid_bio->bi_iter.bi_sector;
+		unsigned chunk_sects = mddev->chunk_sectors;
+		unsigned sectors = chunk_sects - (sector & (chunk_sects-1));
+
+		if (sectors < bio_sectors(raid_bio)) {
+			split = bio_split(raid_bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, raid_bio);
+		} else
+			split = raid_bio;
+
+		if (!raid5_read_one_chunk(mddev, split)) {
+			if (split != raid_bio)
+				generic_make_request(raid_bio);
+			return split;
+		}
+	} while (split != raid_bio);
+
+	return NULL;
+}
+
 /* __get_priority_stripe - get the next stripe to process
  *
  * Full stripe writes are allowed to pass preread active stripes up until
@@ -5169,9 +5194,11 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	 * data on failed drives.
 	 */
 	if (rw == READ && mddev->degraded == 0 &&
-	     mddev->reshape_position == MaxSector &&
-	     chunk_aligned_read(mddev,bi))
-		return;
+	    mddev->reshape_position == MaxSector) {
+		bi = chunk_aligned_read(mddev, bi);
+		if (!bi)
+			return;
+	}
 
 	if (unlikely(bi->bi_rw & REQ_DISCARD)) {
 		make_discard_request(mddev, bi);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 07/11] md/raid5: get rid of bio_fits_rdev()
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (5 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 06/11] md/raid5: split bio for chunk_aligned_read Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 08/11] block: kill merge_bvec_fn() completely Ming Lin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Neil Brown,
	linux-raid, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
performed by blk_queue_split() in md_make_request().

Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 drivers/md/raid5.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d572639..7ce3252 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4787,25 +4787,6 @@ static void raid5_align_endio(struct bio *bi, int error)
 	add_bio_to_retry(raid_bi, conf);
 }
 
-static int bio_fits_rdev(struct bio *bi)
-{
-	struct request_queue *q = bdev_get_queue(bi->bi_bdev);
-
-	if (bio_sectors(bi) > queue_max_sectors(q))
-		return 0;
-	blk_recount_segments(q, bi);
-	if (bi->bi_phys_segments > queue_max_segments(q))
-		return 0;
-
-	if (q->merge_bvec_fn)
-		/* it's too hard to apply the merge_bvec_fn at this stage,
-		 * just just give up
-		 */
-		return 0;
-
-	return 1;
-}
-
 static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 {
 	struct r5conf *conf = mddev->private;
@@ -4859,11 +4840,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 		align_bi->bi_bdev =  rdev->bdev;
 		__clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
 
-		if (!bio_fits_rdev(align_bi) ||
-		    is_badblock(rdev, align_bi->bi_iter.bi_sector,
+		if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
 				bio_sectors(align_bi),
 				&first_bad, &bad_sectors)) {
-			/* too big in some way, or has a known bad block */
 			bio_put(align_bi);
 			rdev_dec_pending(rdev, mddev);
 			return 0;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 08/11] block: kill merge_bvec_fn() completely
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (6 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 07/11] md/raid5: get rid of bio_fits_rdev() Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 09/11] fs: use helper bio_add_page() instead of open coding on bi_io_vec Ming Lin
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Lars Ellenberg,
	drbd-user, Jiri Kosina, Yehuda Sadeh, Sage Weil, Alex Elder,
	ceph-devel, Alasdair Kergon, dm-devel, Neil Brown, linux-raid,
	Christoph Hellwig, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-merge.c              |  17 +-----
 block/blk-settings.c           |  22 --------
 drivers/block/drbd/drbd_int.h  |   1 -
 drivers/block/drbd/drbd_main.c |   1 -
 drivers/block/drbd/drbd_req.c  |  35 ------------
 drivers/block/pktcdvd.c        |  21 -------
 drivers/block/rbd.c            |  47 ----------------
 drivers/md/dm-cache-target.c   |  21 -------
 drivers/md/dm-crypt.c          |  16 ------
 drivers/md/dm-era-target.c     |  15 -----
 drivers/md/dm-flakey.c         |  16 ------
 drivers/md/dm-linear.c         |  16 ------
 drivers/md/dm-log-writes.c     |  16 ------
 drivers/md/dm-raid.c           |  19 -------
 drivers/md/dm-snap.c           |  15 -----
 drivers/md/dm-stripe.c         |  21 -------
 drivers/md/dm-table.c          |   8 ---
 drivers/md/dm-thin.c           |  31 -----------
 drivers/md/dm-verity.c         |  16 ------
 drivers/md/dm.c                | 123 +----------------------------------------
 drivers/md/dm.h                |   2 -
 drivers/md/linear.c            |  43 --------------
 drivers/md/md.c                |  26 ---------
 drivers/md/md.h                |  12 ----
 drivers/md/multipath.c         |  21 -------
 drivers/md/raid0.c             |  56 -------------------
 drivers/md/raid0.h             |   2 -
 drivers/md/raid1.c             |  58 +------------------
 drivers/md/raid10.c            | 121 +---------------------------------------
 drivers/md/raid5.c             |  32 -----------
 include/linux/blkdev.h         |  10 ----
 include/linux/device-mapper.h  |   4 --
 32 files changed, 9 insertions(+), 855 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3707f30..1f5dfa0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	struct bio *split;
 	struct bio_vec bv, bvprv;
 	struct bvec_iter iter;
-	unsigned seg_size = 0, nsegs = 0;
+	unsigned seg_size = 0, nsegs = 0, sectors = 0;
 	int prev = 0;
 
-	struct bvec_merge_data bvm = {
-		.bi_bdev	= bio->bi_bdev,
-		.bi_sector	= bio->bi_iter.bi_sector,
-		.bi_size	= 0,
-		.bi_rw		= bio->bi_rw,
-	};
-
 	bio_for_each_segment(bv, bio, iter) {
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			goto split;
-
-		bvm.bi_size += bv.bv_len;
+		sectors += bv.bv_len >> 9;
 
-		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+		if (sectors > queue_max_sectors(q))
 			goto split;
 
 		/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 12600bf..e90d477 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, unprep_rq_fn *ufn)
 }
 EXPORT_SYMBOL(blk_queue_unprep_rq);
 
-/**
- * blk_queue_merge_bvec - set a merge_bvec function for queue
- * @q:		queue
- * @mbfn:	merge_bvec_fn
- *
- * Usually queues have static limitations on the max sectors or segments that
- * we can put in a request. Stacking drivers may have some settings that
- * are dynamic, and thus we have to query the queue whether it is ok to
- * add a new bio_vec to a bio at a given offset or not. If the block device
- * has such limitations, it needs to register a merge_bvec_fn to control
- * the size of bio's sent to it. Note that a block device *must* allow a
- * single page to be added to an empty bio. The block device driver may want
- * to use the bio_split() function to deal with these bio's. By default
- * no merge_bvec_fn is defined for a queue, and only the fixed limits are
- * honored.
- */
-void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
-{
-	q->merge_bvec_fn = mbfn;
-}
-EXPORT_SYMBOL(blk_queue_merge_bvec);
-
 void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
 {
 	q->softirq_done_fn = fn;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index efd19c2..7ac66f3 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1450,7 +1450,6 @@ extern void do_submit(struct work_struct *ws);
 extern void __drbd_make_request(struct drbd_device *, struct bio *, unsigned long);
 extern void drbd_make_request(struct request_queue *q, struct bio *bio);
 extern int drbd_read_remote(struct drbd_device *device, struct drbd_request *req);
-extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
 extern int is_valid_ar_handle(struct drbd_request *, sector_t);
 
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a151853..74d97f4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2774,7 +2774,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 	   This triggers a max_bio_size message upon first attach or connect */
 	blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
 	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
-	blk_queue_merge_bvec(q, drbd_merge_bvec);
 	q->queue_lock = &resource->req_lock;
 
 	device->md_io.page = alloc_page(GFP_KERNEL);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index a6265bc..7523f00 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1510,41 +1510,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 	__drbd_make_request(device, bio, start_jif);
 }
 
-/* This is called by bio_add_page().
- *
- * q->max_hw_sectors and other global limits are already enforced there.
- *
- * We need to call down to our lower level device,
- * in case it has special restrictions.
- *
- * We also may need to enforce configured max-bio-bvecs limits.
- *
- * As long as the BIO is empty we have to allow at least one bvec,
- * regardless of size and offset, so no need to ask lower levels.
- */
-int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
-{
-	struct drbd_device *device = (struct drbd_device *) q->queuedata;
-	unsigned int bio_size = bvm->bi_size;
-	int limit = DRBD_MAX_BIO_SIZE;
-	int backing_limit;
-
-	if (bio_size && get_ldev(device)) {
-		unsigned int max_hw_sectors = queue_max_hw_sectors(q);
-		struct request_queue * const b =
-			device->ldev->backing_bdev->bd_disk->queue;
-		if (b->merge_bvec_fn) {
-			bvm->bi_bdev = device->ldev->backing_bdev;
-			backing_limit = b->merge_bvec_fn(b, bvm, bvec);
-			limit = min(limit, backing_limit);
-		}
-		put_ldev(device);
-		if ((limit >> 9) > max_hw_sectors)
-			limit = max_hw_sectors << 9;
-	}
-	return limit;
-}
-
 void request_timer_fn(unsigned long data)
 {
 	struct drbd_device *device = (struct drbd_device *) data;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 05a81ae..190d7d7 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2506,26 +2506,6 @@ end_io:
 
 
 
-static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
-			  struct bio_vec *bvec)
-{
-	struct pktcdvd_device *pd = q->queuedata;
-	sector_t zone = get_zone(bmd->bi_sector, pd);
-	int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
-	int remaining = (pd->settings.size << 9) - used;
-	int remaining2;
-
-	/*
-	 * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
-	 * boundary, pkt_make_request() will split the bio.
-	 */
-	remaining2 = PAGE_SIZE - bmd->bi_size;
-	remaining = max(remaining, remaining2);
-
-	BUG_ON(remaining < 0);
-	return remaining;
-}
-
 static void pkt_init_queue(struct pktcdvd_device *pd)
 {
 	struct request_queue *q = pd->disk->queue;
@@ -2533,7 +2513,6 @@ static void pkt_init_queue(struct pktcdvd_device *pd)
 	blk_queue_make_request(q, pkt_make_request);
 	blk_queue_logical_block_size(q, CD_FRAMESIZE);
 	blk_queue_max_hw_sectors(q, PACKET_MAX_SECTORS);
-	blk_queue_merge_bvec(q, pkt_merge_bvec);
 	q->queuedata = pd;
 }
 
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bc67a93..2c6803d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3474,52 +3474,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
-/*
- * a queue callback. Makes sure that we don't create a bio that spans across
- * multiple osd objects. One exception would be with a single page bios,
- * which we handle later at bio_chain_clone_range()
- */
-static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
-			  struct bio_vec *bvec)
-{
-	struct rbd_device *rbd_dev = q->queuedata;
-	sector_t sector_offset;
-	sector_t sectors_per_obj;
-	sector_t obj_sector_offset;
-	int ret;
-
-	/*
-	 * Find how far into its rbd object the partition-relative
-	 * bio start sector is to offset relative to the enclosing
-	 * device.
-	 */
-	sector_offset = get_start_sect(bmd->bi_bdev) + bmd->bi_sector;
-	sectors_per_obj = 1 << (rbd_dev->header.obj_order - SECTOR_SHIFT);
-	obj_sector_offset = sector_offset & (sectors_per_obj - 1);
-
-	/*
-	 * Compute the number of bytes from that offset to the end
-	 * of the object.  Account for what's already used by the bio.
-	 */
-	ret = (int) (sectors_per_obj - obj_sector_offset) << SECTOR_SHIFT;
-	if (ret > bmd->bi_size)
-		ret -= bmd->bi_size;
-	else
-		ret = 0;
-
-	/*
-	 * Don't send back more than was asked for.  And if the bio
-	 * was empty, let the whole thing through because:  "Note
-	 * that a block device *must* allow a single page to be
-	 * added to an empty bio."
-	 */
-	rbd_assert(bvec->bv_len <= PAGE_SIZE);
-	if (ret > (int) bvec->bv_len || !bmd->bi_size)
-		ret = (int) bvec->bv_len;
-
-	return ret;
-}
-
 static void rbd_free_disk(struct rbd_device *rbd_dev)
 {
 	struct gendisk *disk = rbd_dev->disk;
@@ -3818,7 +3772,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	q->limits.max_discard_sectors = segment_size / SECTOR_SIZE;
 	q->limits.discard_zeroes_data = 1;
 
-	blk_queue_merge_bvec(q, rbd_merge_bvec);
 	disk->queue = q;
 
 	q->queuedata = rbd_dev;
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 1fe93cf..06e857b 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -3778,26 +3778,6 @@ static int cache_iterate_devices(struct dm_target *ti,
 	return r;
 }
 
-/*
- * We assume I/O is going to the origin (which is the volume
- * more likely to have restrictions e.g. by being striped).
- * (Looking up the exact location of the data would be expensive
- * and could always be out of date by the time the bio is submitted.)
- */
-static int cache_bvec_merge(struct dm_target *ti,
-			    struct bvec_merge_data *bvm,
-			    struct bio_vec *biovec, int max_size)
-{
-	struct cache *cache = ti->private;
-	struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = cache->origin_dev->bdev;
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
 {
 	/*
@@ -3841,7 +3821,6 @@ static struct target_type cache_target = {
 	.status = cache_status,
 	.message = cache_message,
 	.iterate_devices = cache_iterate_devices,
-	.merge = cache_bvec_merge,
 	.io_hints = cache_io_hints,
 };
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 0f48fed..a1f1d09 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2035,21 +2035,6 @@ error:
 	return -EINVAL;
 }
 
-static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-		       struct bio_vec *biovec, int max_size)
-{
-	struct crypt_config *cc = ti->private;
-	struct request_queue *q = bdev_get_queue(cc->dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = cc->dev->bdev;
-	bvm->bi_sector = cc->start + dm_target_offset(ti, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int crypt_iterate_devices(struct dm_target *ti,
 				 iterate_devices_callout_fn fn, void *data)
 {
@@ -2070,7 +2055,6 @@ static struct target_type crypt_target = {
 	.preresume = crypt_preresume,
 	.resume = crypt_resume,
 	.message = crypt_message,
-	.merge  = crypt_merge,
 	.iterate_devices = crypt_iterate_devices,
 };
 
diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index ad913cd..0119ebf 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1673,20 +1673,6 @@ static int era_iterate_devices(struct dm_target *ti,
 	return fn(ti, era->origin_dev, 0, get_dev_size(era->origin_dev), data);
 }
 
-static int era_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-		     struct bio_vec *biovec, int max_size)
-{
-	struct era *era = ti->private;
-	struct request_queue *q = bdev_get_queue(era->origin_dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = era->origin_dev->bdev;
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static void era_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
 	struct era *era = ti->private;
@@ -1717,7 +1703,6 @@ static struct target_type era_target = {
 	.status = era_status,
 	.message = era_message,
 	.iterate_devices = era_iterate_devices,
-	.merge = era_merge,
 	.io_hints = era_io_hints
 };
 
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b257e46..d955b3e 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -387,21 +387,6 @@ static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long ar
 	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
 }
 
-static int flakey_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			struct bio_vec *biovec, int max_size)
-{
-	struct flakey_c *fc = ti->private;
-	struct request_queue *q = bdev_get_queue(fc->dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = fc->dev->bdev;
-	bvm->bi_sector = flakey_map_sector(ti, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
 {
 	struct flakey_c *fc = ti->private;
@@ -419,7 +404,6 @@ static struct target_type flakey_target = {
 	.end_io = flakey_end_io,
 	.status = flakey_status,
 	.ioctl	= flakey_ioctl,
-	.merge	= flakey_merge,
 	.iterate_devices = flakey_iterate_devices,
 };
 
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 53e848c..7dd5fc8 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -130,21 +130,6 @@ static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
 	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
 }
 
-static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			struct bio_vec *biovec, int max_size)
-{
-	struct linear_c *lc = ti->private;
-	struct request_queue *q = bdev_get_queue(lc->dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = lc->dev->bdev;
-	bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int linear_iterate_devices(struct dm_target *ti,
 				  iterate_devices_callout_fn fn, void *data)
 {
@@ -162,7 +147,6 @@ static struct target_type linear_target = {
 	.map    = linear_map,
 	.status = linear_status,
 	.ioctl  = linear_ioctl,
-	.merge  = linear_merge,
 	.iterate_devices = linear_iterate_devices,
 };
 
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index ad1b049..883595c 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -728,21 +728,6 @@ static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd,
 	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
 }
 
-static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			    struct bio_vec *biovec, int max_size)
-{
-	struct log_writes_c *lc = ti->private;
-	struct request_queue *q = bdev_get_queue(lc->dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = lc->dev->bdev;
-	bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int log_writes_iterate_devices(struct dm_target *ti,
 				      iterate_devices_callout_fn fn,
 				      void *data)
@@ -796,7 +781,6 @@ static struct target_type log_writes_target = {
 	.end_io = normal_end_io,
 	.status = log_writes_status,
 	.ioctl	= log_writes_ioctl,
-	.merge	= log_writes_merge,
 	.message = log_writes_message,
 	.iterate_devices = log_writes_iterate_devices,
 	.io_hints = log_writes_io_hints,
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 2daa677..97e1651 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1717,24 +1717,6 @@ static void raid_resume(struct dm_target *ti)
 	mddev_resume(&rs->md);
 }
 
-static int raid_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-		      struct bio_vec *biovec, int max_size)
-{
-	struct raid_set *rs = ti->private;
-	struct md_personality *pers = rs->md.pers;
-
-	if (pers && pers->mergeable_bvec)
-		return min(max_size, pers->mergeable_bvec(&rs->md, bvm, biovec));
-
-	/*
-	 * In case we can't request the personality because
-	 * the raid set is not running yet
-	 *
-	 * -> return safe minimum
-	 */
-	return rs->md.chunk_sectors;
-}
-
 static struct target_type raid_target = {
 	.name = "raid",
 	.version = {1, 7, 0},
@@ -1749,7 +1731,6 @@ static struct target_type raid_target = {
 	.presuspend = raid_presuspend,
 	.postsuspend = raid_postsuspend,
 	.resume = raid_resume,
-	.merge = raid_merge,
 };
 
 static int __init dm_raid_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 7c82d3c..eabc805 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2330,20 +2330,6 @@ static void origin_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			struct bio_vec *biovec, int max_size)
-{
-	struct dm_origin *o = ti->private;
-	struct request_queue *q = bdev_get_queue(o->dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = o->dev->bdev;
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int origin_iterate_devices(struct dm_target *ti,
 				  iterate_devices_callout_fn fn, void *data)
 {
@@ -2362,7 +2348,6 @@ static struct target_type origin_target = {
 	.resume  = origin_resume,
 	.postsuspend = origin_postsuspend,
 	.status  = origin_status,
-	.merge	 = origin_merge,
 	.iterate_devices = origin_iterate_devices,
 };
 
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index a672a15..c7c8ced 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -412,26 +412,6 @@ static void stripe_io_hints(struct dm_target *ti,
 	blk_limits_io_opt(limits, chunk_size * sc->stripes);
 }
 
-static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			struct bio_vec *biovec, int max_size)
-{
-	struct stripe_c *sc = ti->private;
-	sector_t bvm_sector = bvm->bi_sector;
-	uint32_t stripe;
-	struct request_queue *q;
-
-	stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
-
-	q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = sc->stripe[stripe].dev->bdev;
-	bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector;
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static struct target_type stripe_target = {
 	.name   = "striped",
 	.version = {1, 5, 1},
@@ -443,7 +423,6 @@ static struct target_type stripe_target = {
 	.status = stripe_status,
 	.iterate_devices = stripe_iterate_devices,
 	.io_hints = stripe_io_hints,
-	.merge  = stripe_merge,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 16ba55a..afb4ad3 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -440,14 +440,6 @@ static int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
 		       q->limits.alignment_offset,
 		       (unsigned long long) start << SECTOR_SHIFT);
 
-	/*
-	 * Check if merge fn is supported.
-	 * If not we'll force DM to use PAGE_SIZE or
-	 * smaller I/O, just to be safe.
-	 */
-	if (dm_queue_merge_is_compulsory(q) && !ti->type->merge)
-		blk_limits_max_hw_sectors(limits,
-					  (unsigned int) (PAGE_SIZE >> 9));
 	return 0;
 }
 
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index d2bbe8c..d45e9a8 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3875,20 +3875,6 @@ static int pool_iterate_devices(struct dm_target *ti,
 	return fn(ti, pt->data_dev, 0, ti->len, data);
 }
 
-static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-		      struct bio_vec *biovec, int max_size)
-{
-	struct pool_c *pt = ti->private;
-	struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = pt->data_dev->bdev;
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
 	struct pool_c *pt = ti->private;
@@ -3965,7 +3951,6 @@ static struct target_type pool_target = {
 	.resume = pool_resume,
 	.message = pool_message,
 	.status = pool_status,
-	.merge = pool_merge,
 	.iterate_devices = pool_iterate_devices,
 	.io_hints = pool_io_hints,
 };
@@ -4292,21 +4277,6 @@ err:
 	DMEMIT("Error");
 }
 
-static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-		      struct bio_vec *biovec, int max_size)
-{
-	struct thin_c *tc = ti->private;
-	struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = tc->pool_dev->bdev;
-	bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int thin_iterate_devices(struct dm_target *ti,
 				iterate_devices_callout_fn fn, void *data)
 {
@@ -4350,7 +4320,6 @@ static struct target_type thin_target = {
 	.presuspend = thin_presuspend,
 	.postsuspend = thin_postsuspend,
 	.status = thin_status,
-	.merge = thin_merge,
 	.iterate_devices = thin_iterate_devices,
 	.io_hints = thin_io_hints,
 };
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index bb9c6a0..4f2cdd9 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -648,21 +648,6 @@ static int verity_ioctl(struct dm_target *ti, unsigned cmd,
 				     cmd, arg);
 }
 
-static int verity_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
-			struct bio_vec *biovec, int max_size)
-{
-	struct dm_verity *v = ti->private;
-	struct request_queue *q = bdev_get_queue(v->data_dev->bdev);
-
-	if (!q->merge_bvec_fn)
-		return max_size;
-
-	bvm->bi_bdev = v->data_dev->bdev;
-	bvm->bi_sector = verity_map_sector(v, bvm->bi_sector);
-
-	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
 static int verity_iterate_devices(struct dm_target *ti,
 				  iterate_devices_callout_fn fn, void *data)
 {
@@ -995,7 +980,6 @@ static struct target_type verity_target = {
 	.map		= verity_map,
 	.status		= verity_status,
 	.ioctl		= verity_ioctl,
-	.merge		= verity_merge,
 	.iterate_devices = verity_iterate_devices,
 	.io_hints	= verity_io_hints,
 };
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4db6ca2..2cc68d5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -124,9 +124,8 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
 #define DMF_FREEING 3
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_MERGE_IS_OPTIONAL 6
-#define DMF_DEFERRED_REMOVE 7
-#define DMF_SUSPENDED_INTERNALLY 8
+#define DMF_DEFERRED_REMOVE 6
+#define DMF_SUSPENDED_INTERNALLY 7
 
 /*
  * A dummy definition to make RCU happy.
@@ -1487,9 +1486,6 @@ static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
 	bio->bi_iter.bi_size = to_bytes(len);
 }
 
-/*
- * Creates a bio that consists of range of complete bvecs.
- */
 static void clone_bio(struct dm_target_io *tio, struct bio *bio,
 		      sector_t sector, unsigned len)
 {
@@ -1722,60 +1718,6 @@ static void __split_and_process_bio(struct mapped_device *md,
  * CRUD END
  *---------------------------------------------------------------*/
 
-static int dm_merge_bvec(struct request_queue *q,
-			 struct bvec_merge_data *bvm,
-			 struct bio_vec *biovec)
-{
-	struct mapped_device *md = q->queuedata;
-	struct dm_table *map = dm_get_live_table_fast(md);
-	struct dm_target *ti;
-	sector_t max_sectors;
-	int max_size = 0;
-
-	if (unlikely(!map))
-		goto out;
-
-	ti = dm_table_find_target(map, bvm->bi_sector);
-	if (!dm_target_is_valid(ti))
-		goto out;
-
-	/*
-	 * Find maximum amount of I/O that won't need splitting
-	 */
-	max_sectors = min(max_io_len(bvm->bi_sector, ti),
-			  (sector_t) BIO_MAX_SECTORS);
-	max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
-	if (max_size < 0)
-		max_size = 0;
-
-	/*
-	 * merge_bvec_fn() returns number of bytes
-	 * it can accept at this offset
-	 * max is precomputed maximal io size
-	 */
-	if (max_size && ti->type->merge)
-		max_size = ti->type->merge(ti, bvm, biovec, max_size);
-	/*
-	 * If the target doesn't support merge method and some of the devices
-	 * provided their merge_bvec method (we know this by looking at
-	 * queue_max_hw_sectors), then we can't allow bios with multiple vector
-	 * entries.  So always set max_size to 0, and the code below allows
-	 * just one page.
-	 */
-	else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
-		max_size = 0;
-
-out:
-	dm_put_live_table_fast(md);
-	/*
-	 * Always allow an entire first page
-	 */
-	if (max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
-		max_size = biovec->bv_len;
-
-	return max_size;
-}
-
 /*
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
@@ -2498,59 +2440,6 @@ static void __set_size(struct mapped_device *md, sector_t size)
 }
 
 /*
- * Return 1 if the queue has a compulsory merge_bvec_fn function.
- *
- * If this function returns 0, then the device is either a non-dm
- * device without a merge_bvec_fn, or it is a dm device that is
- * able to split any bios it receives that are too big.
- */
-int dm_queue_merge_is_compulsory(struct request_queue *q)
-{
-	struct mapped_device *dev_md;
-
-	if (!q->merge_bvec_fn)
-		return 0;
-
-	if (q->make_request_fn == dm_make_request) {
-		dev_md = q->queuedata;
-		if (test_bit(DMF_MERGE_IS_OPTIONAL, &dev_md->flags))
-			return 0;
-	}
-
-	return 1;
-}
-
-static int dm_device_merge_is_compulsory(struct dm_target *ti,
-					 struct dm_dev *dev, sector_t start,
-					 sector_t len, void *data)
-{
-	struct block_device *bdev = dev->bdev;
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	return dm_queue_merge_is_compulsory(q);
-}
-
-/*
- * Return 1 if it is acceptable to ignore merge_bvec_fn based
- * on the properties of the underlying devices.
- */
-static int dm_table_merge_is_optional(struct dm_table *table)
-{
-	unsigned i = 0;
-	struct dm_target *ti;
-
-	while (i < dm_table_get_num_targets(table)) {
-		ti = dm_table_get_target(table, i++);
-
-		if (ti->type->iterate_devices &&
-		    ti->type->iterate_devices(ti, dm_device_merge_is_compulsory, NULL))
-			return 0;
-	}
-
-	return 1;
-}
-
-/*
  * Returns old map, which caller must destroy.
  */
 static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
@@ -2559,7 +2448,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	struct dm_table *old_map;
 	struct request_queue *q = md->queue;
 	sector_t size;
-	int merge_is_optional;
 
 	size = dm_table_get_size(t);
 
@@ -2585,17 +2473,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 
 	__bind_mempools(md, t);
 
-	merge_is_optional = dm_table_merge_is_optional(t);
-
 	old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
 	rcu_assign_pointer(md->map, t);
 	md->immutable_target_type = dm_table_get_immutable_target_type(t);
 
 	dm_table_set_restrictions(t, q, limits);
-	if (merge_is_optional)
-		set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
-	else
-		clear_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
 	if (old_map)
 		dm_sync_table(md);
 
@@ -2876,7 +2758,6 @@ int dm_setup_md_queue(struct mapped_device *md)
 	case DM_TYPE_BIO_BASED:
 		dm_init_old_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
-		blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 		break;
 	}
 
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 4e98499..7edcf97 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -78,8 +78,6 @@ bool dm_table_mq_request_based(struct dm_table *t);
 void dm_table_free_md_mempools(struct dm_table *t);
 struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
 
-int dm_queue_merge_is_compulsory(struct request_queue *q);
-
 void dm_lock_md_type(struct mapped_device *md);
 void dm_unlock_md_type(struct mapped_device *md);
 void dm_set_md_type(struct mapped_device *md, unsigned type);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index fa7d577..8721ef9 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -52,48 +52,6 @@ static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector)
 	return conf->disks + lo;
 }
 
-/**
- *	linear_mergeable_bvec -- tell bio layer if two requests can be merged
- *	@q: request queue
- *	@bvm: properties of new bio
- *	@biovec: the request that could be merged to it.
- *
- *	Return amount of bytes we can take at this offset
- */
-static int linear_mergeable_bvec(struct mddev *mddev,
-				 struct bvec_merge_data *bvm,
-				 struct bio_vec *biovec)
-{
-	struct dev_info *dev0;
-	unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
-	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
-	int maxbytes = biovec->bv_len;
-	struct request_queue *subq;
-
-	dev0 = which_dev(mddev, sector);
-	maxsectors = dev0->end_sector - sector;
-	subq = bdev_get_queue(dev0->rdev->bdev);
-	if (subq->merge_bvec_fn) {
-		bvm->bi_bdev = dev0->rdev->bdev;
-		bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
-		maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
-							     biovec));
-	}
-
-	if (maxsectors < bio_sectors)
-		maxsectors = 0;
-	else
-		maxsectors -= bio_sectors;
-
-	if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
-		return maxbytes;
-
-	if (maxsectors > (maxbytes >> 9))
-		return maxbytes;
-	else
-		return maxsectors << 9;
-}
-
 static int linear_congested(struct mddev *mddev, int bits)
 {
 	struct linear_conf *conf;
@@ -338,7 +296,6 @@ static struct md_personality linear_personality =
 	.size		= linear_size,
 	.quiesce	= linear_quiesce,
 	.congested	= linear_congested,
-	.mergeable_bvec	= linear_mergeable_bvec,
 };
 
 static int __init linear_init (void)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b8c5a82..ef37415 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -352,29 +352,6 @@ static int md_congested(void *data, int bits)
 	return mddev_congested(mddev, bits);
 }
 
-static int md_mergeable_bvec(struct request_queue *q,
-			     struct bvec_merge_data *bvm,
-			     struct bio_vec *biovec)
-{
-	struct mddev *mddev = q->queuedata;
-	int ret;
-	rcu_read_lock();
-	if (mddev->suspended) {
-		/* Must always allow one vec */
-		if (bvm->bi_size == 0)
-			ret = biovec->bv_len;
-		else
-			ret = 0;
-	} else {
-		struct md_personality *pers = mddev->pers;
-		if (pers && pers->mergeable_bvec)
-			ret = pers->mergeable_bvec(mddev, bvm, biovec);
-		else
-			ret = biovec->bv_len;
-	}
-	rcu_read_unlock();
-	return ret;
-}
 /*
  * Generic flush handling for md
  */
@@ -5188,7 +5165,6 @@ int md_run(struct mddev *mddev)
 	if (mddev->queue) {
 		mddev->queue->backing_dev_info.congested_data = mddev;
 		mddev->queue->backing_dev_info.congested_fn = md_congested;
-		blk_queue_merge_bvec(mddev->queue, md_mergeable_bvec);
 	}
 	if (pers->sync_request) {
 		if (mddev->kobj.sd &&
@@ -5317,7 +5293,6 @@ static void md_clean(struct mddev *mddev)
 	mddev->degraded = 0;
 	mddev->safemode = 0;
 	mddev->private = NULL;
-	mddev->merge_check_needed = 0;
 	mddev->bitmap_info.offset = 0;
 	mddev->bitmap_info.default_offset = 0;
 	mddev->bitmap_info.default_space = 0;
@@ -5516,7 +5491,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
 
 		__md_stop_writes(mddev);
 		__md_stop(mddev);
-		mddev->queue->merge_bvec_fn = NULL;
 		mddev->queue->backing_dev_info.congested_fn = NULL;
 
 		/* tell userspace to handle 'inactive' */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7da6e9c..ab33957 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -134,10 +134,6 @@ enum flag_bits {
 	Bitmap_sync,		/* ..actually, not quite In_sync.  Need a
 				 * bitmap-based recovery to get fully in sync
 				 */
-	Unmerged,		/* device is being added to array and should
-				 * be considerred for bvec_merge_fn but not
-				 * yet for actual IO
-				 */
 	WriteMostly,		/* Avoid reading if at all possible */
 	AutoDetected,		/* added by auto-detect */
 	Blocked,		/* An error occurred but has not yet
@@ -374,10 +370,6 @@ struct mddev {
 	int				degraded;	/* whether md should consider
 							 * adding a spare
 							 */
-	int				merge_check_needed; /* at least one
-							     * member device
-							     * has a
-							     * merge_bvec_fn */
 
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
@@ -532,10 +524,6 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
-	/* mergeable_bvec is use to implement ->merge_bvec_fn */
-	int (*mergeable_bvec)(struct mddev *mddev,
-			      struct bvec_merge_data *bvm,
-			      struct bio_vec *biovec);
 };
 
 struct md_sysfs_entry {
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index ac3ede2..7ee27fb 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -257,18 +257,6 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
 					  rdev->data_offset << 9);
 
-		/* as we don't honour merge_bvec_fn, we must never risk
-		 * violating it, so limit ->max_segments to one, lying
-		 * within a single page.
-		 * (Note: it is very unlikely that a device with
-		 * merge_bvec_fn will be involved in multipath.)
-		 */
-			if (q->merge_bvec_fn) {
-				blk_queue_max_segments(mddev->queue, 1);
-				blk_queue_segment_boundary(mddev->queue,
-							   PAGE_CACHE_SIZE - 1);
-			}
-
 			spin_lock_irq(&conf->device_lock);
 			mddev->degraded--;
 			rdev->raid_disk = path;
@@ -432,15 +420,6 @@ static int multipath_run (struct mddev *mddev)
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
 
-		/* as we don't honour merge_bvec_fn, we must never risk
-		 * violating it, not that we ever expect a device with
-		 * a merge_bvec_fn to be involved in multipath */
-		if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
-			blk_queue_max_segments(mddev->queue, 1);
-			blk_queue_segment_boundary(mddev->queue,
-						   PAGE_CACHE_SIZE - 1);
-		}
-
 		if (!test_bit(Faulty, &rdev->flags))
 			working_disks++;
 	}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index efb654e..c853331 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -192,9 +192,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
 			disk_stack_limits(mddev->gendisk, rdev1->bdev,
 					  rdev1->data_offset << 9);
 
-		if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
-			conf->has_merge_bvec = 1;
-
 		if (!smallest || (rdev1->sectors < smallest->sectors))
 			smallest = rdev1;
 		cnt++;
@@ -351,58 +348,6 @@ static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
 			     + sector_div(sector, zone->nb_dev)];
 }
 
-/**
- *	raid0_mergeable_bvec -- tell bio layer if two requests can be merged
- *	@mddev: the md device
- *	@bvm: properties of new bio
- *	@biovec: the request that could be merged to it.
- *
- *	Return amount of bytes we can accept at this offset
- */
-static int raid0_mergeable_bvec(struct mddev *mddev,
-				struct bvec_merge_data *bvm,
-				struct bio_vec *biovec)
-{
-	struct r0conf *conf = mddev->private;
-	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
-	sector_t sector_offset = sector;
-	int max;
-	unsigned int chunk_sectors = mddev->chunk_sectors;
-	unsigned int bio_sectors = bvm->bi_size >> 9;
-	struct strip_zone *zone;
-	struct md_rdev *rdev;
-	struct request_queue *subq;
-
-	if (is_power_of_2(chunk_sectors))
-		max =  (chunk_sectors - ((sector & (chunk_sectors-1))
-						+ bio_sectors)) << 9;
-	else
-		max =  (chunk_sectors - (sector_div(sector, chunk_sectors)
-						+ bio_sectors)) << 9;
-	if (max < 0)
-		max = 0; /* bio_add cannot handle a negative return */
-	if (max <= biovec->bv_len && bio_sectors == 0)
-		return biovec->bv_len;
-	if (max < biovec->bv_len)
-		/* too small already, no need to check further */
-		return max;
-	if (!conf->has_merge_bvec)
-		return max;
-
-	/* May need to check subordinate device */
-	sector = sector_offset;
-	zone = find_zone(mddev->private, &sector_offset);
-	rdev = map_sector(mddev, zone, sector, &sector_offset);
-	subq = bdev_get_queue(rdev->bdev);
-	if (subq->merge_bvec_fn) {
-		bvm->bi_bdev = rdev->bdev;
-		bvm->bi_sector = sector_offset + zone->dev_start +
-			rdev->data_offset;
-		return min(max, subq->merge_bvec_fn(subq, bvm, biovec));
-	} else
-		return max;
-}
-
 static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 {
 	sector_t array_sectors = 0;
@@ -727,7 +672,6 @@ static struct md_personality raid0_personality=
 	.takeover	= raid0_takeover,
 	.quiesce	= raid0_quiesce,
 	.congested	= raid0_congested,
-	.mergeable_bvec	= raid0_mergeable_bvec,
 };
 
 static int __init raid0_init (void)
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 05539d9..7127a62 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -12,8 +12,6 @@ struct r0conf {
 	struct md_rdev		**devlist; /* lists of rdevs, pointed to
 					    * by strip_zone->dev */
 	int			nr_strip_zones;
-	int			has_merge_bvec;	/* at least one member has
-						 * a merge_bvec_fn */
 };
 
 #endif
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 967a4ed..453dd01 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -557,7 +557,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 		rdev = rcu_dereference(conf->mirrors[disk].rdev);
 		if (r1_bio->bios[disk] == IO_BLOCKED
 		    || rdev == NULL
-		    || test_bit(Unmerged, &rdev->flags)
 		    || test_bit(Faulty, &rdev->flags))
 			continue;
 		if (!test_bit(In_sync, &rdev->flags) &&
@@ -708,38 +707,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	return best_disk;
 }
 
-static int raid1_mergeable_bvec(struct mddev *mddev,
-				struct bvec_merge_data *bvm,
-				struct bio_vec *biovec)
-{
-	struct r1conf *conf = mddev->private;
-	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
-	int max = biovec->bv_len;
-
-	if (mddev->merge_check_needed) {
-		int disk;
-		rcu_read_lock();
-		for (disk = 0; disk < conf->raid_disks * 2; disk++) {
-			struct md_rdev *rdev = rcu_dereference(
-				conf->mirrors[disk].rdev);
-			if (rdev && !test_bit(Faulty, &rdev->flags)) {
-				struct request_queue *q =
-					bdev_get_queue(rdev->bdev);
-				if (q->merge_bvec_fn) {
-					bvm->bi_sector = sector +
-						rdev->data_offset;
-					bvm->bi_bdev = rdev->bdev;
-					max = min(max, q->merge_bvec_fn(
-							  q, bvm, biovec));
-				}
-			}
-		}
-		rcu_read_unlock();
-	}
-	return max;
-
-}
-
 static int raid1_congested(struct mddev *mddev, int bits)
 {
 	struct r1conf *conf = mddev->private;
@@ -1269,8 +1236,7 @@ read_again:
 			break;
 		}
 		r1_bio->bios[i] = NULL;
-		if (!rdev || test_bit(Faulty, &rdev->flags)
-		    || test_bit(Unmerged, &rdev->flags)) {
+		if (!rdev || test_bit(Faulty, &rdev->flags)) {
 			if (i < conf->raid_disks)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			continue;
@@ -1617,7 +1583,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct raid1_info *p;
 	int first = 0;
 	int last = conf->raid_disks - 1;
-	struct request_queue *q = bdev_get_queue(rdev->bdev);
 
 	if (mddev->recovery_disabled == conf->recovery_disabled)
 		return -EBUSY;
@@ -1625,11 +1590,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	if (rdev->raid_disk >= 0)
 		first = last = rdev->raid_disk;
 
-	if (q->merge_bvec_fn) {
-		set_bit(Unmerged, &rdev->flags);
-		mddev->merge_check_needed = 1;
-	}
-
 	for (mirror = first; mirror <= last; mirror++) {
 		p = conf->mirrors+mirror;
 		if (!p->rdev) {
@@ -1661,19 +1621,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 			break;
 		}
 	}
-	if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
-		/* Some requests might not have seen this new
-		 * merge_bvec_fn.  We must wait for them to complete
-		 * before merging the device fully.
-		 * First we make sure any code which has tested
-		 * our function has submitted the request, then
-		 * we wait for all outstanding requests to complete.
-		 */
-		synchronize_sched();
-		freeze_array(conf, 0);
-		unfreeze_array(conf);
-		clear_bit(Unmerged, &rdev->flags);
-	}
 	md_integrity_add_rdev(rdev, mddev);
 	if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2810,8 +2757,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 			goto abort;
 		disk->rdev = rdev;
 		q = bdev_get_queue(rdev->bdev);
-		if (q->merge_bvec_fn)
-			mddev->merge_check_needed = 1;
 
 		disk->head_position = 0;
 		disk->seq_start = MaxSector;
@@ -3176,7 +3121,6 @@ static struct md_personality raid1_personality =
 	.quiesce	= raid1_quiesce,
 	.takeover	= raid1_takeover,
 	.congested	= raid1_congested,
-	.mergeable_bvec	= raid1_mergeable_bvec,
 };
 
 static int __init raid_init(void)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 38c58e1..f6ec82e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -672,93 +672,6 @@ static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
 	return (vchunk << geo->chunk_shift) + offset;
 }
 
-/**
- *	raid10_mergeable_bvec -- tell bio layer if a two requests can be merged
- *	@mddev: the md device
- *	@bvm: properties of new bio
- *	@biovec: the request that could be merged to it.
- *
- *	Return amount of bytes we can accept at this offset
- *	This requires checking for end-of-chunk if near_copies != raid_disks,
- *	and for subordinate merge_bvec_fns if merge_check_needed.
- */
-static int raid10_mergeable_bvec(struct mddev *mddev,
-				 struct bvec_merge_data *bvm,
-				 struct bio_vec *biovec)
-{
-	struct r10conf *conf = mddev->private;
-	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
-	int max;
-	unsigned int chunk_sectors;
-	unsigned int bio_sectors = bvm->bi_size >> 9;
-	struct geom *geo = &conf->geo;
-
-	chunk_sectors = (conf->geo.chunk_mask & conf->prev.chunk_mask) + 1;
-	if (conf->reshape_progress != MaxSector &&
-	    ((sector >= conf->reshape_progress) !=
-	     conf->mddev->reshape_backwards))
-		geo = &conf->prev;
-
-	if (geo->near_copies < geo->raid_disks) {
-		max = (chunk_sectors - ((sector & (chunk_sectors - 1))
-					+ bio_sectors)) << 9;
-		if (max < 0)
-			/* bio_add cannot handle a negative return */
-			max = 0;
-		if (max <= biovec->bv_len && bio_sectors == 0)
-			return biovec->bv_len;
-	} else
-		max = biovec->bv_len;
-
-	if (mddev->merge_check_needed) {
-		struct {
-			struct r10bio r10_bio;
-			struct r10dev devs[conf->copies];
-		} on_stack;
-		struct r10bio *r10_bio = &on_stack.r10_bio;
-		int s;
-		if (conf->reshape_progress != MaxSector) {
-			/* Cannot give any guidance during reshape */
-			if (max <= biovec->bv_len && bio_sectors == 0)
-				return biovec->bv_len;
-			return 0;
-		}
-		r10_bio->sector = sector;
-		raid10_find_phys(conf, r10_bio);
-		rcu_read_lock();
-		for (s = 0; s < conf->copies; s++) {
-			int disk = r10_bio->devs[s].devnum;
-			struct md_rdev *rdev = rcu_dereference(
-				conf->mirrors[disk].rdev);
-			if (rdev && !test_bit(Faulty, &rdev->flags)) {
-				struct request_queue *q =
-					bdev_get_queue(rdev->bdev);
-				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio->devs[s].addr
-						+ rdev->data_offset;
-					bvm->bi_bdev = rdev->bdev;
-					max = min(max, q->merge_bvec_fn(
-							  q, bvm, biovec));
-				}
-			}
-			rdev = rcu_dereference(conf->mirrors[disk].replacement);
-			if (rdev && !test_bit(Faulty, &rdev->flags)) {
-				struct request_queue *q =
-					bdev_get_queue(rdev->bdev);
-				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio->devs[s].addr
-						+ rdev->data_offset;
-					bvm->bi_bdev = rdev->bdev;
-					max = min(max, q->merge_bvec_fn(
-							  q, bvm, biovec));
-				}
-			}
-		}
-		rcu_read_unlock();
-	}
-	return max;
-}
-
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -821,12 +734,10 @@ retry:
 		disk = r10_bio->devs[slot].devnum;
 		rdev = rcu_dereference(conf->mirrors[disk].replacement);
 		if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
-		    test_bit(Unmerged, &rdev->flags) ||
 		    r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
 			rdev = rcu_dereference(conf->mirrors[disk].rdev);
 		if (rdev == NULL ||
-		    test_bit(Faulty, &rdev->flags) ||
-		    test_bit(Unmerged, &rdev->flags))
+		    test_bit(Faulty, &rdev->flags))
 			continue;
 		if (!test_bit(In_sync, &rdev->flags) &&
 		    r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
@@ -1326,11 +1237,9 @@ retry_write:
 			blocked_rdev = rrdev;
 			break;
 		}
-		if (rdev && (test_bit(Faulty, &rdev->flags)
-			     || test_bit(Unmerged, &rdev->flags)))
+		if (rdev && (test_bit(Faulty, &rdev->flags)))
 			rdev = NULL;
-		if (rrdev && (test_bit(Faulty, &rrdev->flags)
-			      || test_bit(Unmerged, &rrdev->flags)))
+		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
 			rrdev = NULL;
 
 		r10_bio->devs[i].bio = NULL;
@@ -1777,7 +1686,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	int mirror;
 	int first = 0;
 	int last = conf->geo.raid_disks - 1;
-	struct request_queue *q = bdev_get_queue(rdev->bdev);
 
 	if (mddev->recovery_cp < MaxSector)
 		/* only hot-add to in-sync arrays, as recovery is
@@ -1790,11 +1698,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	if (rdev->raid_disk >= 0)
 		first = last = rdev->raid_disk;
 
-	if (q->merge_bvec_fn) {
-		set_bit(Unmerged, &rdev->flags);
-		mddev->merge_check_needed = 1;
-	}
-
 	if (rdev->saved_raid_disk >= first &&
 	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
 		mirror = rdev->saved_raid_disk;
@@ -1833,19 +1736,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		rcu_assign_pointer(p->rdev, rdev);
 		break;
 	}
-	if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
-		/* Some requests might not have seen this new
-		 * merge_bvec_fn.  We must wait for them to complete
-		 * before merging the device fully.
-		 * First we make sure any code which has tested
-		 * our function has submitted the request, then
-		 * we wait for all outstanding requests to complete.
-		 */
-		synchronize_sched();
-		freeze_array(conf, 0);
-		unfreeze_array(conf);
-		clear_bit(Unmerged, &rdev->flags);
-	}
 	md_integrity_add_rdev(rdev, mddev);
 	if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2394,7 +2284,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 			d = r10_bio->devs[sl].devnum;
 			rdev = rcu_dereference(conf->mirrors[d].rdev);
 			if (rdev &&
-			    !test_bit(Unmerged, &rdev->flags) &&
 			    test_bit(In_sync, &rdev->flags) &&
 			    is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
 					&first_bad, &bad_sectors) == 0) {
@@ -2448,7 +2337,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 			d = r10_bio->devs[sl].devnum;
 			rdev = rcu_dereference(conf->mirrors[d].rdev);
 			if (!rdev ||
-			    test_bit(Unmerged, &rdev->flags) ||
 			    !test_bit(In_sync, &rdev->flags))
 				continue;
 
@@ -3643,8 +3531,6 @@ static int run(struct mddev *mddev)
 			disk->rdev = rdev;
 		}
 		q = bdev_get_queue(rdev->bdev);
-		if (q->merge_bvec_fn)
-			mddev->merge_check_needed = 1;
 		diff = (rdev->new_data_offset - rdev->data_offset);
 		if (!mddev->reshape_backwards)
 			diff = -diff;
@@ -4700,7 +4586,6 @@ static struct md_personality raid10_personality =
 	.start_reshape	= raid10_start_reshape,
 	.finish_reshape	= raid10_finish_reshape,
 	.congested	= raid10_congested,
-	.mergeable_bvec	= raid10_mergeable_bvec,
 };
 
 static int __init raid_init(void)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7ce3252..f116b77 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4669,35 +4669,6 @@ static int raid5_congested(struct mddev *mddev, int bits)
 	return 0;
 }
 
-/* We want read requests to align with chunks where possible,
- * but write requests don't need to.
- */
-static int raid5_mergeable_bvec(struct mddev *mddev,
-				struct bvec_merge_data *bvm,
-				struct bio_vec *biovec)
-{
-	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
-	int max;
-	unsigned int chunk_sectors = mddev->chunk_sectors;
-	unsigned int bio_sectors = bvm->bi_size >> 9;
-
-	/*
-	 * always allow writes to be mergeable, read as well if array
-	 * is degraded as we'll go through stripe cache anyway.
-	 */
-	if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
-		return biovec->bv_len;
-
-	if (mddev->new_chunk_sectors < mddev->chunk_sectors)
-		chunk_sectors = mddev->new_chunk_sectors;
-	max =  (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
-	if (max < 0) max = 0;
-	if (max <= biovec->bv_len && bio_sectors == 0)
-		return biovec->bv_len;
-	else
-		return max;
-}
-
 static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
 {
 	sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev);
@@ -7785,7 +7756,6 @@ static struct md_personality raid6_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.congested	= raid5_congested,
-	.mergeable_bvec	= raid5_mergeable_bvec,
 };
 static struct md_personality raid5_personality =
 {
@@ -7809,7 +7779,6 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
-	.mergeable_bvec	= raid5_mergeable_bvec,
 };
 
 static struct md_personality raid4_personality =
@@ -7834,7 +7803,6 @@ static struct md_personality raid4_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.congested	= raid5_congested,
-	.mergeable_bvec	= raid5_mergeable_bvec,
 };
 
 static int __init raid5_init(void)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dc89cc8..56cf082 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -213,14 +213,6 @@ typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
 
 struct bio_vec;
-struct bvec_merge_data {
-	struct block_device *bi_bdev;
-	sector_t bi_sector;
-	unsigned bi_size;
-	unsigned long bi_rw;
-};
-typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
-			     struct bio_vec *);
 typedef void (softirq_done_fn)(struct request *);
 typedef int (dma_drain_needed_fn)(struct request *);
 typedef int (lld_busy_fn) (struct request_queue *q);
@@ -305,7 +297,6 @@ struct request_queue {
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
 	unprep_rq_fn		*unprep_rq_fn;
-	merge_bvec_fn		*merge_bvec_fn;
 	softirq_done_fn		*softirq_done_fn;
 	rq_timed_out_fn		*rq_timed_out_fn;
 	dma_drain_needed_fn	*dma_drain_needed;
@@ -991,7 +982,6 @@ extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
 extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
 extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
 extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
-extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
 extern void blk_queue_dma_alignment(struct request_queue *, int);
 extern void blk_queue_update_dma_alignment(struct request_queue *, int);
 extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 51cc1de..76d23fa 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -82,9 +82,6 @@ typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
 typedef int (*dm_ioctl_fn) (struct dm_target *ti, unsigned int cmd,
 			    unsigned long arg);
 
-typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
-			    struct bio_vec *biovec, int max_size);
-
 /*
  * These iteration functions are typically used to check (and combine)
  * properties of underlying devices.
@@ -160,7 +157,6 @@ struct target_type {
 	dm_status_fn status;
 	dm_message_fn message;
 	dm_ioctl_fn ioctl;
-	dm_merge_fn merge;
 	dm_busy_fn busy;
 	dm_iterate_devices_fn iterate_devices;
 	dm_io_hints_fn io_hints;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 09/11] fs: use helper bio_add_page() instead of open coding on bi_io_vec
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (7 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 08/11] block: kill merge_bvec_fn() completely Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 10/11] block: remove bio_get_nr_vecs() Ming Lin
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Christoph Hellwig,
	Al Viro, linux-fsdevel, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

Call pre-defined helper bio_add_page() instead of open coding for
iterating through bi_io_vec[]. Doing that, it's possible to make some
parts in filesystems and mm/page_io.c simpler than before.

Acked-by: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 fs/buffer.c         |  7 ++-----
 fs/jfs/jfs_logmgr.c | 14 ++++----------
 mm/page_io.c        |  8 +++-----
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1cf7a53..95996ba 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3046,12 +3046,9 @@ static int submit_bh_wbc(int rw, struct buffer_head *bh,
 
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
-	bio->bi_io_vec[0].bv_page = bh->b_page;
-	bio->bi_io_vec[0].bv_len = bh->b_size;
-	bio->bi_io_vec[0].bv_offset = bh_offset(bh);
 
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = bh->b_size;
+	bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
+	BUG_ON(bio->bi_iter.bi_size != bh->b_size);
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dc..46fae06 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)
 
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
-	bio->bi_io_vec[0].bv_len = LOGPSIZE;
-	bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = LOGPSIZE;
+	bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+	BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
 	bio->bi_end_io = lbmIODone;
 	bio->bi_private = bp;
@@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
 	bio = bio_alloc(GFP_NOFS, 1);
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
-	bio->bi_io_vec[0].bv_len = LOGPSIZE;
-	bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = LOGPSIZE;
+	bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+	BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
 	bio->bi_end_io = lbmIODone;
 	bio->bi_private = bp;
diff --git a/mm/page_io.c b/mm/page_io.c
index 520baa4..194081b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 	if (bio) {
 		bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
-		bio->bi_io_vec[0].bv_page = page;
-		bio->bi_io_vec[0].bv_len = PAGE_SIZE;
-		bio->bi_io_vec[0].bv_offset = 0;
-		bio->bi_vcnt = 1;
-		bio->bi_iter.bi_size = PAGE_SIZE;
 		bio->bi_end_io = end_io;
+
+		bio_add_page(bio, page, PAGE_SIZE, 0);
+		BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
 	}
 	return bio;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 10/11] block: remove bio_get_nr_vecs()
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (8 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 09/11] fs: use helper bio_add_page() instead of open coding on bi_io_vec Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-12  7:07 ` [PATCH v6 11/11] Documentation: update notes in biovecs about arbitrarily sized bios Ming Lin
  2015-08-13 16:51 ` [PATCH v6 00/11] simplify block layer based on immutable biovecs Jens Axboe
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Ming Lin

From: Kent Overstreet <kent.overstreet@gmail.com>

We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/bio.c            | 23 -----------------------
 drivers/md/dm-io.c     |  2 +-
 fs/btrfs/compression.c |  5 +----
 fs/btrfs/extent_io.c   |  9 ++-------
 fs/btrfs/inode.c       |  3 +--
 fs/btrfs/scrub.c       | 18 ++----------------
 fs/direct-io.c         |  2 +-
 fs/ext4/page-io.c      |  3 +--
 fs/ext4/readpage.c     |  2 +-
 fs/f2fs/data.c         |  2 +-
 fs/gfs2/lops.c         |  9 +--------
 fs/logfs/dev_bdev.c    |  4 ++--
 fs/mpage.c             |  4 ++--
 fs/nilfs2/segbuf.c     |  2 +-
 fs/xfs/xfs_aops.c      |  3 +--
 include/linux/bio.h    |  1 -
 16 files changed, 18 insertions(+), 74 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index c8bfa61..9e1adab 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -692,29 +692,6 @@ integrity_clone:
 EXPORT_SYMBOL(bio_clone_bioset);
 
 /**
- *	bio_get_nr_vecs		- return approx number of vecs
- *	@bdev:  I/O target
- *
- *	Return the approximate number of pages we can send to this target.
- *	There's no guarantee that you will be able to fit this number of pages
- *	into a bio, it does not account for dynamic restrictions that vary
- *	on offset.
- */
-int bio_get_nr_vecs(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-	int nr_pages;
-
-	nr_pages = min_t(unsigned,
-		     queue_max_segments(q),
-		     queue_max_sectors(q) / (PAGE_SIZE >> 9) + 1);
-
-	return min_t(unsigned, nr_pages, BIO_MAX_PAGES);
-
-}
-EXPORT_SYMBOL(bio_get_nr_vecs);
-
-/**
  *	bio_add_pc_page	-	attempt to add page to bio
  *	@q: the target queue
  *	@bio: destination bio
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 74adcd2..7d64272 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -314,7 +314,7 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
 		if ((rw & REQ_DISCARD) || (rw & REQ_WRITE_SAME))
 			num_bvecs = 1;
 		else
-			num_bvecs = min_t(int, bio_get_nr_vecs(where->bdev),
+			num_bvecs = min_t(int, BIO_MAX_PAGES,
 					  dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT)));
 
 		bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ce62324..449c752 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -97,10 +97,7 @@ static inline int compressed_bio_size(struct btrfs_root *root,
 static struct bio *compressed_bio_alloc(struct block_device *bdev,
 					u64 first_byte, gfp_t gfp_flags)
 {
-	int nr_vecs;
-
-	nr_vecs = bio_get_nr_vecs(bdev);
-	return btrfs_bio_alloc(bdev, first_byte >> 9, nr_vecs, gfp_flags);
+	return btrfs_bio_alloc(bdev, first_byte >> 9, BIO_MAX_PAGES, gfp_flags);
 }
 
 static int check_compressed_csum(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 02d0581..ba89efd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2802,9 +2802,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 {
 	int ret = 0;
 	struct bio *bio;
-	int nr;
 	int contig = 0;
-	int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED;
 	int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
 	size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);
 
@@ -2829,12 +2827,9 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 			return 0;
 		}
 	}
-	if (this_compressed)
-		nr = BIO_MAX_PAGES;
-	else
-		nr = bio_get_nr_vecs(bdev);
 
-	bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH);
+	bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
+			GFP_NOFS | __GFP_HIGH);
 	if (!bio)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e33dff3..fee0368 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7957,8 +7957,7 @@ out:
 static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
 				       u64 first_sector, gfp_t gfp_flags)
 {
-	int nr_vecs = bio_get_nr_vecs(bdev);
-	return btrfs_bio_alloc(bdev, first_sector, nr_vecs, gfp_flags);
+	return btrfs_bio_alloc(bdev, first_sector, BIO_MAX_PAGES, gfp_flags);
 }
 
 static inline int btrfs_lookup_and_bind_dio_csum(struct btrfs_root *root,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 94db0fa..b9e834e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -454,27 +454,14 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
 	struct scrub_ctx *sctx;
 	int		i;
 	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
-	int pages_per_rd_bio;
 	int ret;
 
-	/*
-	 * the setting of pages_per_rd_bio is correct for scrub but might
-	 * be wrong for the dev_replace code where we might read from
-	 * different devices in the initial huge bios. However, that
-	 * code is able to correctly handle the case when adding a page
-	 * to a bio fails.
-	 */
-	if (dev->bdev)
-		pages_per_rd_bio = min_t(int, SCRUB_PAGES_PER_RD_BIO,
-					 bio_get_nr_vecs(dev->bdev));
-	else
-		pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
 	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
 	if (!sctx)
 		goto nomem;
 	atomic_set(&sctx->refs, 1);
 	sctx->is_dev_replace = is_dev_replace;
-	sctx->pages_per_rd_bio = pages_per_rd_bio;
+	sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
 	sctx->curr = -1;
 	sctx->dev_root = dev->dev_root;
 	for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
@@ -3896,8 +3883,7 @@ static int scrub_setup_wr_ctx(struct scrub_ctx *sctx,
 		return 0;
 
 	WARN_ON(!dev->bdev);
-	wr_ctx->pages_per_wr_bio = min_t(int, SCRUB_PAGES_PER_WR_BIO,
-					 bio_get_nr_vecs(dev->bdev));
+	wr_ctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
 	wr_ctx->tgtdev = dev;
 	atomic_set(&wr_ctx->flush_all_writes, 0);
 	return 0;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d234..89baebe 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -653,7 +653,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
 	if (ret)
 		goto out;
 	sector = start_sector << (sdio->blkbits - 9);
-	nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
+	nr_pages = min(sdio->pages_in_io, BIO_MAX_PAGES);
 	BUG_ON(nr_pages <= 0);
 	dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
 	sdio->boundary = 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 5602450..e678ad3 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -375,10 +375,9 @@ void ext4_io_submit_init(struct ext4_io_submit *io,
 static int io_submit_init_bio(struct ext4_io_submit *io,
 			      struct buffer_head *bh)
 {
-	int nvecs = bio_get_nr_vecs(bh->b_bdev);
 	struct bio *bio;
 
-	bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
+	bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
 	if (!bio)
 		return -ENOMEM;
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index ec3ef93..37c886c 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -284,7 +284,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 					goto set_error_page;
 			}
 			bio = bio_alloc(GFP_KERNEL,
-				min_t(int, nr_pages, bio_get_nr_vecs(bdev)));
+				min_t(int, nr_pages, BIO_MAX_PAGES));
 			if (!bio) {
 				if (ctx)
 					ext4_release_crypto_ctx(ctx);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index f71e19a..7ead19b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1552,7 +1552,7 @@ submit_and_realloc:
 			}
 
 			bio = bio_alloc(GFP_KERNEL,
-				min_t(int, nr_pages, bio_get_nr_vecs(bdev)));
+				min_t(int, nr_pages, BIO_MAX_PAGES));
 			if (!bio) {
 				if (ctx)
 					f2fs_release_crypto_ctx(ctx);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae86..64d3116 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -261,18 +261,11 @@ void gfs2_log_flush_bio(struct gfs2_sbd *sdp, int rw)
 static struct bio *gfs2_log_alloc_bio(struct gfs2_sbd *sdp, u64 blkno)
 {
 	struct super_block *sb = sdp->sd_vfs;
-	unsigned nrvecs = bio_get_nr_vecs(sb->s_bdev);
 	struct bio *bio;
 
 	BUG_ON(sdp->sd_log_bio);
 
-	while (1) {
-		bio = bio_alloc(GFP_NOIO, nrvecs);
-		if (likely(bio))
-			break;
-		nrvecs = max(nrvecs/2, 1U);
-	}
-
+	bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
 	bio->bi_iter.bi_sector = blkno * (sb->s_blocksize >> 9);
 	bio->bi_bdev = sb->s_bdev;
 	bio->bi_end_io = gfs2_end_log_write;
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 76279e1..fbb5f95 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -83,7 +83,7 @@ static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
 	unsigned int max_pages;
 	int i;
 
-	max_pages = min(nr_pages, (size_t) bio_get_nr_vecs(super->s_bdev));
+	max_pages = min(nr_pages, BIO_MAX_PAGES);
 
 	bio = bio_alloc(GFP_NOFS, max_pages);
 	BUG_ON(!bio);
@@ -175,7 +175,7 @@ static int do_erase(struct super_block *sb, u64 ofs, pgoff_t index,
 	unsigned int max_pages;
 	int i;
 
-	max_pages = min(nr_pages, (size_t) bio_get_nr_vecs(super->s_bdev));
+	max_pages = min(nr_pages, BIO_MAX_PAGES);
 
 	bio = bio_alloc(GFP_NOFS, max_pages);
 	BUG_ON(!bio);
diff --git a/fs/mpage.c b/fs/mpage.c
index ca0244b..4b92133 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -277,7 +277,7 @@ alloc_new:
 				goto out;
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
-			  	min_t(int, nr_pages, bio_get_nr_vecs(bdev)),
+				min_t(int, nr_pages, BIO_MAX_PAGES),
 				GFP_KERNEL);
 		if (bio == NULL)
 			goto confused;
@@ -602,7 +602,7 @@ alloc_new:
 			}
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
-				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
+				BIO_MAX_PAGES, GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
 
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 42468e5..4b59031 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -415,7 +415,7 @@ static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf,
 {
 	wi->bio = NULL;
 	wi->rest_blocks = segbuf->sb_sum.nblocks;
-	wi->max_pages = bio_get_nr_vecs(wi->nilfs->ns_bdev);
+	wi->max_pages = BIO_MAX_PAGES;
 	wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
 	wi->start = wi->end = 0;
 	wi->blocknr = segbuf->sb_pseg_start;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3859f5e..89a8c15 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -382,8 +382,7 @@ STATIC struct bio *
 xfs_alloc_ioend_bio(
 	struct buffer_head	*bh)
 {
-	int			nvecs = bio_get_nr_vecs(bh->b_bdev);
-	struct bio		*bio = bio_alloc(GFP_NOIO, nvecs);
+	struct bio		*bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
 
 	ASSERT(bio->bi_private == NULL);
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5e963a6..1608b89 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -440,7 +440,6 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
-extern int bio_get_nr_vecs(struct block_device *);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
 				    const struct iov_iter *, gfp_t);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 11/11] Documentation: update notes in biovecs about arbitrarily sized bios
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (9 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 10/11] block: remove bio_get_nr_vecs() Ming Lin
@ 2015-08-12  7:07 ` Ming Lin
  2015-08-13 16:51 ` [PATCH v6 00/11] simplify block layer based on immutable biovecs Jens Axboe
  11 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-12  7:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, Christoph Hellwig,
	Jonathan Corbet, linux-doc, Ming Lin

From: Dongsu Park <dpark@posteo.net>

Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 Documentation/block/biovecs.txt | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 74a32ad..2568958 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -24,7 +24,7 @@ particular, presenting the illusion of partially completed biovecs so that
 normal code doesn't have to deal with bi_bvec_done.
 
  * Driver code should no longer refer to biovecs directly; we now have
-   bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+   bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
    constructed from the raw biovecs but taking into account bi_bvec_done and
    bi_size.
 
@@ -109,3 +109,11 @@ Other implications:
    over all the biovecs in the new bio - which is silly as it's not needed.
 
    So, don't use bi_vcnt anymore.
+
+ * The current interface allows the block layer to split bios as needed, so we
+   could eliminate a lot of complexity particularly in stacked drivers. Code
+   that creates bios can then create whatever size bios are convenient, and
+   more importantly stacked drivers don't have to deal with both their own bio
+   size limitations and the limitations of the underlying devices. Thus
+   there's no need to define ->merge_bvec_fn() callbacks for individual block
+   drivers.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 00/11] simplify block layer based on immutable biovecs
  2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
                   ` (10 preceding siblings ...)
  2015-08-12  7:07 ` [PATCH v6 11/11] Documentation: update notes in biovecs about arbitrarily sized bios Ming Lin
@ 2015-08-13 16:51 ` Jens Axboe
  2015-08-13 17:03   ` Ming Lin
  11 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2015-08-13 16:51 UTC (permalink / raw)
  To: Ming Lin, linux-kernel
  Cc: Christoph Hellwig, Kent Overstreet, Dongsu Park, Mike Snitzer,
	Martin K. Petersen

On 08/12/2015 01:07 AM, Ming Lin wrote:
> Hi Jens,
>
> Neil/Mike/Martin have acked/reviewed PATCH 1.
> Now it's ready. Could you please apply this series?
>
> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
>
> Please note that, for discard, we cap the size at 2G.
> We'll change it to UINT_MAX after the splitting code in
> DM thinp is rewritten.
>
> v6:
>    - rebase on top of 4.2-rc6+
>    - fix discard/write_same 32bit bi_size overflow issue
>    - add ACKs/Review from Mike/Christoph/Martin/Steven

Why did you rebase it on top of 4.2-rc6+? If you had kept it at 4.2-rc1, 
it would have applied to for-4.3/core a lot more easily. Care to respin 
on top of that?


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 00/11] simplify block layer based on immutable biovecs
  2015-08-13 16:51 ` [PATCH v6 00/11] simplify block layer based on immutable biovecs Jens Axboe
@ 2015-08-13 17:03   ` Ming Lin
  2015-08-13 17:07     ` Jens Axboe
  0 siblings, 1 reply; 57+ messages in thread
From: Ming Lin @ 2015-08-13 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Christoph Hellwig, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen

On Thu, 2015-08-13 at 10:51 -0600, Jens Axboe wrote:
> On 08/12/2015 01:07 AM, Ming Lin wrote:
> > Hi Jens,
> >
> > Neil/Mike/Martin have acked/reviewed PATCH 1.
> > Now it's ready. Could you please apply this series?
> >
> > https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> >
> > Please note that, for discard, we cap the size at 2G.
> > We'll change it to UINT_MAX after the splitting code in
> > DM thinp is rewritten.
> >
> > v6:
> >    - rebase on top of 4.2-rc6+
> >    - fix discard/write_same 32bit bi_size overflow issue
> >    - add ACKs/Review from Mike/Christoph/Martin/Steven
> 
> Why did you rebase it on top of 4.2-rc6+? If you had kept it at 4.2-rc1, 
> it would have applied to for-4.3/core a lot more easily. Care to respin 
> on top of that?

Because commit bd4aaf8 "dm: fix dm_merge_bvec regression on 32 bit
systems" in 4.2-rc6 conflicted with PATCH 6.

Sure, I can respin on top of 4.2-rc1.

Should I re-post new series to mail list or just update my tree?
 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 00/11] simplify block layer based on immutable biovecs
  2015-08-13 17:03   ` Ming Lin
@ 2015-08-13 17:07     ` Jens Axboe
  2015-08-13 17:36       ` Ming Lin
  0 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2015-08-13 17:07 UTC (permalink / raw)
  To: Ming Lin
  Cc: linux-kernel, Christoph Hellwig, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen

On 08/13/2015 11:03 AM, Ming Lin wrote:
> On Thu, 2015-08-13 at 10:51 -0600, Jens Axboe wrote:
>> On 08/12/2015 01:07 AM, Ming Lin wrote:
>>> Hi Jens,
>>>
>>> Neil/Mike/Martin have acked/reviewed PATCH 1.
>>> Now it's ready. Could you please apply this series?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
>>>
>>> Please note that, for discard, we cap the size at 2G.
>>> We'll change it to UINT_MAX after the splitting code in
>>> DM thinp is rewritten.
>>>
>>> v6:
>>>     - rebase on top of 4.2-rc6+
>>>     - fix discard/write_same 32bit bi_size overflow issue
>>>     - add ACKs/Review from Mike/Christoph/Martin/Steven
>>
>> Why did you rebase it on top of 4.2-rc6+? If you had kept it at 4.2-rc1,
>> it would have applied to for-4.3/core a lot more easily. Care to respin
>> on top of that?
>
> Because commit bd4aaf8 "dm: fix dm_merge_bvec regression on 32 bit
> systems" in 4.2-rc6 conflicted with PATCH 6.
>
> Sure, I can respin on top of 4.2-rc1.
>
> Should I re-post new series to mail list or just update my tree?

Just repost to me, if you don't want to spam the list again.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 00/11] simplify block layer based on immutable biovecs
  2015-08-13 17:07     ` Jens Axboe
@ 2015-08-13 17:36       ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-08-13 17:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: lkml, Christoph Hellwig, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen

On Thu, 2015-08-13 at 11:07 -0600, Jens Axboe wrote:
> On 08/13/2015 11:03 AM, Ming Lin wrote:
> > On Thu, 2015-08-13 at 10:51 -0600, Jens Axboe wrote:
> >> On 08/12/2015 01:07 AM, Ming Lin wrote:
> >>> Hi Jens,
> >>>
> >>> Neil/Mike/Martin have acked/reviewed PATCH 1.
> >>> Now it's ready. Could you please apply this series?
> >>>
> >>> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> >>>
> >>> Please note that, for discard, we cap the size at 2G.
> >>> We'll change it to UINT_MAX after the splitting code in
> >>> DM thinp is rewritten.
> >>>
> >>> v6:
> >>>     - rebase on top of 4.2-rc6+
> >>>     - fix discard/write_same 32bit bi_size overflow issue
> >>>     - add ACKs/Review from Mike/Christoph/Martin/Steven
> >>
> >> Why did you rebase it on top of 4.2-rc6+? If you had kept it at 4.2-rc1,
> >> it would have applied to for-4.3/core a lot more easily. Care to respin
> >> on top of that?
> >
> > Because commit bd4aaf8 "dm: fix dm_merge_bvec regression on 32 bit
> > systems" in 4.2-rc6 conflicted with PATCH 6.
> >
> > Sure, I can respin on top of 4.2-rc1.
> >
> > Should I re-post new series to mail list or just update my tree?
>
> Just repost to me, if you don't want to spam the list again.

Already reposted to you off list.

Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-08-12  7:07 ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
@ 2015-10-13 11:50     ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-13 11:50 UTC (permalink / raw)
  To: Ming Lin
  Cc: linux-kernel, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, Aug 12, 2015 at 12:07:15AM -0700, Ming Lin wrote:
> From: Ming Lin <ming.l@ssi.samsung.com>
> 
> The split code in blkdev_issue_{discard,write_same} can go away
> now that any driver that cares does the split. We have to make
> sure bio size doesn't overflow.
> 
> For discard, we set max discard sectors to (1<<31)>>9 to ensure
> it doesn't overflow bi_size and hopefully it is of the proper
> granularity as long as the granularity is a power of two.

This ends up breaking discard on NVMe devices for a me.  An mkfs.xfs
which does a discard of the whole device now hangs the system.
Something in here makes it send discard command that the device doesn't
like and the aborts don't seem to help either, although that might be
an issue with the abort handling in the driver.

Just a heads up for now, once I get a bit more time I'll try to collect
a blktrace to figure out how the commands sent to the driver look
different before and after the patch.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-13 11:50     ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-13 11:50 UTC (permalink / raw)


On Wed, Aug 12, 2015@12:07:15AM -0700, Ming Lin wrote:
> From: Ming Lin <ming.l at ssi.samsung.com>
> 
> The split code in blkdev_issue_{discard,write_same} can go away
> now that any driver that cares does the split. We have to make
> sure bio size doesn't overflow.
> 
> For discard, we set max discard sectors to (1<<31)>>9 to ensure
> it doesn't overflow bi_size and hopefully it is of the proper
> granularity as long as the granularity is a power of two.

This ends up breaking discard on NVMe devices for a me.  An mkfs.xfs
which does a discard of the whole device now hangs the system.
Something in here makes it send discard command that the device doesn't
like and the aborts don't seem to help either, although that might be
an issue with the abort handling in the driver.

Just a heads up for now, once I get a bit more time I'll try to collect
a blktrace to figure out how the commands sent to the driver look
different before and after the patch.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-13 11:50     ` Christoph Hellwig
@ 2015-10-13 17:44       ` Ming Lin
  -1 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-13 17:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, Jens Axboe, Kent Overstreet, Dongsu Park, Mike Snitzer,
	Martin K. Petersen, Ming Lin, linux-nvme

On Tue, Oct 13, 2015 at 4:50 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Aug 12, 2015 at 12:07:15AM -0700, Ming Lin wrote:
>> From: Ming Lin <ming.l@ssi.samsung.com>
>>
>> The split code in blkdev_issue_{discard,write_same} can go away
>> now that any driver that cares does the split. We have to make
>> sure bio size doesn't overflow.
>>
>> For discard, we set max discard sectors to (1<<31)>>9 to ensure
>> it doesn't overflow bi_size and hopefully it is of the proper
>> granularity as long as the granularity is a power of two.
>
> This ends up breaking discard on NVMe devices for a me.  An mkfs.xfs
> which does a discard of the whole device now hangs the system.
> Something in here makes it send discard command that the device doesn't
> like and the aborts don't seem to help either, although that might be
> an issue with the abort handling in the driver.
>
> Just a heads up for now, once I get a bit more time I'll try to collect
> a blktrace to figure out how the commands sent to the driver look
> different before and after the patch.

I just did a quick test with a Samsung 900G NVMe device.
mkfs.xfs is OK on 4.3-rc5.

What's your device model? I may find a similar one to try.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same}
@ 2015-10-13 17:44       ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-13 17:44 UTC (permalink / raw)


On Tue, Oct 13, 2015@4:50 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Aug 12, 2015@12:07:15AM -0700, Ming Lin wrote:
>> From: Ming Lin <ming.l at ssi.samsung.com>
>>
>> The split code in blkdev_issue_{discard,write_same} can go away
>> now that any driver that cares does the split. We have to make
>> sure bio size doesn't overflow.
>>
>> For discard, we set max discard sectors to (1<<31)>>9 to ensure
>> it doesn't overflow bi_size and hopefully it is of the proper
>> granularity as long as the granularity is a power of two.
>
> This ends up breaking discard on NVMe devices for a me.  An mkfs.xfs
> which does a discard of the whole device now hangs the system.
> Something in here makes it send discard command that the device doesn't
> like and the aborts don't seem to help either, although that might be
> an issue with the abort handling in the driver.
>
> Just a heads up for now, once I get a bit more time I'll try to collect
> a blktrace to figure out how the commands sent to the driver look
> different before and after the patch.

I just did a quick test with a Samsung 900G NVMe device.
mkfs.xfs is OK on 4.3-rc5.

What's your device model? I may find a similar one to try.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-13 17:44       ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Ming Lin
@ 2015-10-14 13:27         ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-14 13:27 UTC (permalink / raw)
  To: Ming Lin
  Cc: lkml, Jens Axboe, Kent Overstreet, Dongsu Park, Mike Snitzer,
	Martin K. Petersen, Ming Lin, linux-nvme

On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> I just did a quick test with a Samsung 900G NVMe device.
> mkfs.xfs is OK on 4.3-rc5.
> 
> What's your device model? I may find a similar one to try.

This is a HGST Ultrastar SN100

Analsys and tentativ fix below:

blktrace for before the commit:

259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]

.. and so on.

blktrace with the commit:

259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]

So discards are smaller, but better aligned.  Now if I tweak a single
line in blk-lib.c to be able to use all of bi_size I get the old I/O
pattern back and everything works fine again:

diff --git a/block/blk-lib.c b/block/blk-lib.c
index bd40292..65b61dc 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
+		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
 		end_sect = sector + req_sects;
 
 		bio->bi_iter.bi_sector = sector;

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-14 13:27         ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-14 13:27 UTC (permalink / raw)


On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> I just did a quick test with a Samsung 900G NVMe device.
> mkfs.xfs is OK on 4.3-rc5.
> 
> What's your device model? I may find a similar one to try.

This is a HGST Ultrastar SN100

Analsys and tentativ fix below:

blktrace for before the commit:

259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]

.. and so on.

blktrace with the commit:

259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]

So discards are smaller, but better aligned.  Now if I tweak a single
line in blk-lib.c to be able to use all of bi_size I get the old I/O
pattern back and everything works fine again:

diff --git a/block/blk-lib.c b/block/blk-lib.c
index bd40292..65b61dc 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
+		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
 		end_sect = sector + req_sects;
 
 		bio->bi_iter.bi_sector = sector;

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B
  2015-10-14 13:27         ` Christoph Hellwig
@ 2015-10-14 16:38           ` Keith Busch
  -1 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2015-10-14 16:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lin, Jens Axboe, Ming Lin, Mike Snitzer, Martin K. Petersen,
	lkml, linux-nvme, Dongsu Park, Kent Overstreet

On Wed, 14 Oct 2015, Christoph Hellwig wrote:
> Analsys and tentativ fix below:
>
> blktrace for before the commit:
>
> 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
>
> .. and so on.
>
> blktrace with the commit:
>
> 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
>
> So discards are smaller, but better aligned.  Now if I tweak a single
> line in blk-lib.c to be able to use all of bi_size I get the old I/O
> pattern back and everything works fine again:

I see why the proposal is an improvement, but I don't understand why the
current situation results in a hang. Are we missing some kind of error
recovery in the driver?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B
@ 2015-10-14 16:38           ` Keith Busch
  0 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2015-10-14 16:38 UTC (permalink / raw)


On Wed, 14 Oct 2015, Christoph Hellwig wrote:
> Analsys and tentativ fix below:
>
> blktrace for before the commit:
>
> 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
>
> .. and so on.
>
> blktrace with the commit:
>
> 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
>
> So discards are smaller, but better aligned.  Now if I tweak a single
> line in blk-lib.c to be able to use all of bi_size I get the old I/O
> pattern back and everything works fine again:

I see why the proposal is an improvement, but I don't understand why the
current situation results in a hang. Are we missing some kind of error
recovery in the driver?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B
  2015-10-14 16:38           ` Keith Busch
@ 2015-10-14 16:50             ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-14 16:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Ming Lin, Jens Axboe, Ming Lin, Mike Snitzer,
	Martin K. Petersen, lkml, linux-nvme, Dongsu Park,
	Kent Overstreet

On Wed, Oct 14, 2015 at 04:38:50PM +0000, Keith Busch wrote:
> I see why the proposal is an improvement, but I don't understand why the
> current situation results in a hang. Are we missing some kind of error
> recovery in the driver?

The driver tries to abort the commands and eventually gets into a death
spiral.  I'm still trying to understand what exactly is going on.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B
@ 2015-10-14 16:50             ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-14 16:50 UTC (permalink / raw)


On Wed, Oct 14, 2015@04:38:50PM +0000, Keith Busch wrote:
> I see why the proposal is an improvement, but I don't understand why the
> current situation results in a hang. Are we missing some kind of error
> recovery in the driver?

The driver tries to abort the commands and eventually gets into a death
spiral.  I'm still trying to understand what exactly is going on.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-13 17:44       ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Ming Lin
@ 2015-10-21  7:21         ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-21  7:21 UTC (permalink / raw)
  To: Ming Lin
  Cc: lkml, Jens Axboe, Kent Overstreet, Dongsu Park, Mike Snitzer,
	Martin K. Petersen, Ming Lin, linux-nvme

Jens, Ming:

are you fine with the one liner change to get back to the old I/O
pattern?  While it looks like the cards fault I'd like to avoid this
annoying regression.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21  7:21         ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-21  7:21 UTC (permalink / raw)


Jens, Ming:

are you fine with the one liner change to get back to the old I/O
pattern?  While it looks like the cards fault I'd like to avoid this
annoying regression.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21  7:21         ` Christoph Hellwig
@ 2015-10-21 13:39           ` Jeff Moyer
  -1 siblings, 0 replies; 57+ messages in thread
From: Jeff Moyer @ 2015-10-21 13:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lin, lkml, Jens Axboe, Kent Overstreet, Dongsu Park,
	Mike Snitzer, Martin K. Petersen, Ming Lin, linux-nvme

Christoph Hellwig <hch@infradead.org> writes:

> Jens, Ming:
>
> are you fine with the one liner change to get back to the old I/O
> pattern?  While it looks like the cards fault I'd like to avoid this
> annoying regression.

I'm not Jens or Ming, but your patch looks fine to me, though you'll
want to remove the MAX_BIO_SECTORS definition since it's now unused.
It's not clear to me why the limit was lowered in the first place.

You can add my Reviewed-by: Jeff Moyer <jmoyer@redhat.com> if you resend
the patch.

-Jeff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same}
@ 2015-10-21 13:39           ` Jeff Moyer
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff Moyer @ 2015-10-21 13:39 UTC (permalink / raw)


Christoph Hellwig <hch at infradead.org> writes:

> Jens, Ming:
>
> are you fine with the one liner change to get back to the old I/O
> pattern?  While it looks like the cards fault I'd like to avoid this
> annoying regression.

I'm not Jens or Ming, but your patch looks fine to me, though you'll
want to remove the MAX_BIO_SECTORS definition since it's now unused.
It's not clear to me why the limit was lowered in the first place.

You can add my Reviewed-by: Jeff Moyer <jmoyer at redhat.com> if you resend
the patch.

-Jeff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 13:39           ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Jeff Moyer
@ 2015-10-21 15:01             ` Ming Lin
  -1 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 15:01 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Mike Snitzer, Martin K. Petersen, Ming Lin,
	linux-nvme

On Wed, 2015-10-21 at 09:39 -0400, Jeff Moyer wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> 
> > Jens, Ming:
> >
> > are you fine with the one liner change to get back to the old I/O
> > pattern?  While it looks like the cards fault I'd like to avoid this
> > annoying regression.
> 
> I'm not Jens or Ming, but your patch looks fine to me, though you'll
> want to remove the MAX_BIO_SECTORS definition since it's now unused.
> It's not clear to me why the limit was lowered in the first place.

UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.

At the lengthy discussion:
[PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
We agreed to cap discard to 2G as an interim solution for 4.3 until the
dm-thinp discard code is rewritten.

Hi Mike,

Will the dm-thinp discard rewritten ready for 4.4?

Thanks,
Ming

> 
> You can add my Reviewed-by: Jeff Moyer <jmoyer@redhat.com> if you resend
> the patch.
> 
> -Jeff



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 15:01             ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 15:01 UTC (permalink / raw)


On Wed, 2015-10-21@09:39 -0400, Jeff Moyer wrote:
> Christoph Hellwig <hch at infradead.org> writes:
> 
> > Jens, Ming:
> >
> > are you fine with the one liner change to get back to the old I/O
> > pattern?  While it looks like the cards fault I'd like to avoid this
> > annoying regression.
> 
> I'm not Jens or Ming, but your patch looks fine to me, though you'll
> want to remove the MAX_BIO_SECTORS definition since it's now unused.
> It's not clear to me why the limit was lowered in the first place.

UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.

At the lengthy discussion:
[PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
We agreed to cap discard to 2G as an interim solution for 4.3 until the
dm-thinp discard code is rewritten.

Hi Mike,

Will the dm-thinp discard rewritten ready for 4.4?

Thanks,
Ming

> 
> You can add my Reviewed-by: Jeff Moyer <jmoyer at redhat.com> if you resend
> the patch.
> 
> -Jeff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 15:01             ` Ming Lin
@ 2015-10-21 15:33               ` Mike Snitzer
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 15:33 UTC (permalink / raw)
  To: Ming Lin
  Cc: Jeff Moyer, Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, Oct 21 2015 at 11:01am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Wed, 2015-10-21 at 09:39 -0400, Jeff Moyer wrote:
> > Christoph Hellwig <hch@infradead.org> writes:
> > 
> > > Jens, Ming:
> > >
> > > are you fine with the one liner change to get back to the old I/O
> > > pattern?  While it looks like the cards fault I'd like to avoid this
> > > annoying regression.
> > 
> > I'm not Jens or Ming, but your patch looks fine to me, though you'll
> > want to remove the MAX_BIO_SECTORS definition since it's now unused.
> > It's not clear to me why the limit was lowered in the first place.
> 
> UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.
> 
> At the lengthy discussion:
> [PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
> We agreed to cap discard to 2G as an interim solution for 4.3 until the
> dm-thinp discard code is rewritten.

But did Jens ever commit that change to cap at 2G?  I don't recall
seeing it.

> Hi Mike,
> 
> Will the dm-thinp discard rewritten ready for 4.4?

No.  I'm not clear what needs changing in dm-thinp.  I'll have to
revisit the thread to refresh my memory.

BTW, DM thinp can easily handle discards that aren't a power-of-2 so
long as the requested discard is a factor of the thinp blocksize.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 15:33               ` Mike Snitzer
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 15:33 UTC (permalink / raw)


On Wed, Oct 21 2015 at 11:01am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Wed, 2015-10-21@09:39 -0400, Jeff Moyer wrote:
> > Christoph Hellwig <hch at infradead.org> writes:
> > 
> > > Jens, Ming:
> > >
> > > are you fine with the one liner change to get back to the old I/O
> > > pattern?  While it looks like the cards fault I'd like to avoid this
> > > annoying regression.
> > 
> > I'm not Jens or Ming, but your patch looks fine to me, though you'll
> > want to remove the MAX_BIO_SECTORS definition since it's now unused.
> > It's not clear to me why the limit was lowered in the first place.
> 
> UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.
> 
> At the lengthy discussion:
> [PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
> We agreed to cap discard to 2G as an interim solution for 4.3 until the
> dm-thinp discard code is rewritten.

But did Jens ever commit that change to cap at 2G?  I don't recall
seeing it.

> Hi Mike,
> 
> Will the dm-thinp discard rewritten ready for 4.4?

No.  I'm not clear what needs changing in dm-thinp.  I'll have to
revisit the thread to refresh my memory.

BTW, DM thinp can easily handle discards that aren't a power-of-2 so
long as the requested discard is a factor of the thinp blocksize.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-14 13:27         ` Christoph Hellwig
@ 2015-10-21 16:02           ` Mike Snitzer
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 16:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lin, lkml, Jens Axboe, Kent Overstreet, Dongsu Park,
	Martin K. Petersen, Ming Lin, linux-nvme

On Wed, Oct 14 2015 at  9:27am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> > I just did a quick test with a Samsung 900G NVMe device.
> > mkfs.xfs is OK on 4.3-rc5.
> > 
> > What's your device model? I may find a similar one to try.
> 
> This is a HGST Ultrastar SN100
> 
> Analsys and tentativ fix below:
> 
> blktrace for before the commit:
> 
> 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> 
> .. and so on.
> 
> blktrace with the commit:
> 
> 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> 
> So discards are smaller, but better aligned.  Now if I tweak a single
> line in blk-lib.c to be able to use all of bi_size I get the old I/O
> pattern back and everything works fine again:
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index bd40292..65b61dc 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  			break;
>  		}
>  
> -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
>  		end_sect = sector + req_sects;
>  
>  		bio->bi_iter.bi_sector = sector;

Can we change UINT_MAX >> 9 to rounddown to the first factor of
minimum_io_size?

That should work for all devices and for dm-thinp (and dm-cache) in
particular will ensure that all discards that are issued will be a
multiple of the underlying device's blocksize.

Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 16:02           ` Mike Snitzer
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 16:02 UTC (permalink / raw)


On Wed, Oct 14 2015 at  9:27am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> > I just did a quick test with a Samsung 900G NVMe device.
> > mkfs.xfs is OK on 4.3-rc5.
> > 
> > What's your device model? I may find a similar one to try.
> 
> This is a HGST Ultrastar SN100
> 
> Analsys and tentativ fix below:
> 
> blktrace for before the commit:
> 
> 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> 
> .. and so on.
> 
> blktrace with the commit:
> 
> 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> 
> So discards are smaller, but better aligned.  Now if I tweak a single
> line in blk-lib.c to be able to use all of bi_size I get the old I/O
> pattern back and everything works fine again:
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index bd40292..65b61dc 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  			break;
>  		}
>  
> -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
>  		end_sect = sector + req_sects;
>  
>  		bio->bi_iter.bi_sector = sector;

Can we change UINT_MAX >> 9 to rounddown to the first factor of
minimum_io_size?

That should work for all devices and for dm-thinp (and dm-cache) in
particular will ensure that all discards that are issued will be a
multiple of the underlying device's blocksize.

Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 16:02           ` Mike Snitzer
@ 2015-10-21 16:19             ` Mike Snitzer
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 16:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lin, lkml, Jens Axboe, Kent Overstreet, Dongsu Park,
	Martin K. Petersen, Ming Lin, linux-nvme

On Wed, Oct 21 2015 at 12:02pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Oct 14 2015 at  9:27am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> > > I just did a quick test with a Samsung 900G NVMe device.
> > > mkfs.xfs is OK on 4.3-rc5.
> > > 
> > > What's your device model? I may find a similar one to try.
> > 
> > This is a HGST Ultrastar SN100
> > 
> > Analsys and tentativ fix below:
> > 
> > blktrace for before the commit:
> > 
> > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > 
> > .. and so on.
> > 
> > blktrace with the commit:
> > 
> > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > 
> > So discards are smaller, but better aligned.  Now if I tweak a single
> > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > pattern back and everything works fine again:
> > 
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index bd40292..65b61dc 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >  			break;
> >  		}
> >  
> > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> >  		end_sect = sector + req_sects;
> >  
> >  		bio->bi_iter.bi_sector = sector;
> 
> Can we change UINT_MAX >> 9 to rounddown to the first factor of
> minimum_io_size?
> 
> That should work for all devices and for dm-thinp (and dm-cache) in
> particular will ensure that all discards that are issued will be a
> multiple of the underlying device's blocksize.

Jeff Moyer pointed out having req_sects be a factor of
discard_granularity makes more sense.  And I agree.  Same difference in
the end (since dm-thinp sets discard_granularity to the thinp
blocksize).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 16:19             ` Mike Snitzer
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 16:19 UTC (permalink / raw)


On Wed, Oct 21 2015 at 12:02pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Oct 14 2015 at  9:27am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> > > I just did a quick test with a Samsung 900G NVMe device.
> > > mkfs.xfs is OK on 4.3-rc5.
> > > 
> > > What's your device model? I may find a similar one to try.
> > 
> > This is a HGST Ultrastar SN100
> > 
> > Analsys and tentativ fix below:
> > 
> > blktrace for before the commit:
> > 
> > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > 
> > .. and so on.
> > 
> > blktrace with the commit:
> > 
> > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > 
> > So discards are smaller, but better aligned.  Now if I tweak a single
> > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > pattern back and everything works fine again:
> > 
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index bd40292..65b61dc 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> >  			break;
> >  		}
> >  
> > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> >  		end_sect = sector + req_sects;
> >  
> >  		bio->bi_iter.bi_sector = sector;
> 
> Can we change UINT_MAX >> 9 to rounddown to the first factor of
> minimum_io_size?
> 
> That should work for all devices and for dm-thinp (and dm-cache) in
> particular will ensure that all discards that are issued will be a
> multiple of the underlying device's blocksize.

Jeff Moyer pointed out having req_sects be a factor of
discard_granularity makes more sense.  And I agree.  Same difference in
the end (since dm-thinp sets discard_granularity to the thinp
blocksize).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 16:19             ` Mike Snitzer
@ 2015-10-21 16:33               ` Martin K. Petersen
  -1 siblings, 0 replies; 57+ messages in thread
From: Martin K. Petersen @ 2015-10-21 16:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Ming Lin, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

>> That should work for all devices and for dm-thinp (and dm-cache) in
>> particular will ensure that all discards that are issued will be a
>> multiple of the underlying device's blocksize.

Mike> Jeff Moyer pointed out having req_sects be a factor of
Mike> discard_granularity makes more sense.

Absolutely!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same}
@ 2015-10-21 16:33               ` Martin K. Petersen
  0 siblings, 0 replies; 57+ messages in thread
From: Martin K. Petersen @ 2015-10-21 16:33 UTC (permalink / raw)


>>>>> "Mike" == Mike Snitzer <snitzer at redhat.com> writes:

>> That should work for all devices and for dm-thinp (and dm-cache) in
>> particular will ensure that all discards that are issued will be a
>> multiple of the underlying device's blocksize.

Mike> Jeff Moyer pointed out having req_sects be a factor of
Mike> discard_granularity makes more sense.

Absolutely!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 15:33               ` Mike Snitzer
@ 2015-10-21 17:18                 ` Ming Lin
  -1 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 17:18 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Moyer, Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, 2015-10-21 at 11:33 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at 11:01am -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Wed, 2015-10-21 at 09:39 -0400, Jeff Moyer wrote:
> > > Christoph Hellwig <hch@infradead.org> writes:
> > > 
> > > > Jens, Ming:
> > > >
> > > > are you fine with the one liner change to get back to the old I/O
> > > > pattern?  While it looks like the cards fault I'd like to avoid this
> > > > annoying regression.
> > > 
> > > I'm not Jens or Ming, but your patch looks fine to me, though you'll
> > > want to remove the MAX_BIO_SECTORS definition since it's now unused.
> > > It's not clear to me why the limit was lowered in the first place.
> > 
> > UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.
> > 
> > At the lengthy discussion:
> > [PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
> > We agreed to cap discard to 2G as an interim solution for 4.3 until the
> > dm-thinp discard code is rewritten.
> 
> But did Jens ever commit that change to cap at 2G?  I don't recall
> seeing it.

Yes, commit b49a0871

> 
> > Hi Mike,
> > 
> > Will the dm-thinp discard rewritten ready for 4.4?
> 
> No.  I'm not clear what needs changing in dm-thinp.  I'll have to
> revisit the thread to refresh my memory.
> 
> BTW, DM thinp can easily handle discards that aren't a power-of-2 so
> long as the requested discard is a factor of the thinp blocksize.

You are right. It's not about power-of-2.

Copy my old post here about why dm-thinp discard may fail with "UINT_MAX
>> 9".

      4G: 8388608 sectors
UINT_MAX: 8388607 sectors

dm-thinp block size = default discard granularity = 128 sectors

blkdev_issue_discard(sector=0, nr_sectors=8388608)

[start_sector, end_sector]
[0, 8388607]
    [0, 8388606], then dm-thinp splits it to 2 bios
        [0, 8388479]
        [8388480, 8388606] ---> this has problem in process_discard_bio(),
                                because the discard size(7 sectors) covers less than a block(128 sectors)
    [8388607, 8388607] ---> same problem 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 17:18                 ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 17:18 UTC (permalink / raw)


On Wed, 2015-10-21@11:33 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at 11:01am -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Wed, 2015-10-21@09:39 -0400, Jeff Moyer wrote:
> > > Christoph Hellwig <hch at infradead.org> writes:
> > > 
> > > > Jens, Ming:
> > > >
> > > > are you fine with the one liner change to get back to the old I/O
> > > > pattern?  While it looks like the cards fault I'd like to avoid this
> > > > annoying regression.
> > > 
> > > I'm not Jens or Ming, but your patch looks fine to me, though you'll
> > > want to remove the MAX_BIO_SECTORS definition since it's now unused.
> > > It's not clear to me why the limit was lowered in the first place.
> > 
> > UINT_MAX >> 9 is not power of 2 and it causes dm-thinp discard fails.
> > 
> > At the lengthy discussion:
> > [PATCH v5 01/11] block: make generic_make_request handle arbitrarily sized bios
> > We agreed to cap discard to 2G as an interim solution for 4.3 until the
> > dm-thinp discard code is rewritten.
> 
> But did Jens ever commit that change to cap at 2G?  I don't recall
> seeing it.

Yes, commit b49a0871

> 
> > Hi Mike,
> > 
> > Will the dm-thinp discard rewritten ready for 4.4?
> 
> No.  I'm not clear what needs changing in dm-thinp.  I'll have to
> revisit the thread to refresh my memory.
> 
> BTW, DM thinp can easily handle discards that aren't a power-of-2 so
> long as the requested discard is a factor of the thinp blocksize.

You are right. It's not about power-of-2.

Copy my old post here about why dm-thinp discard may fail with "UINT_MAX
>> 9".

      4G: 8388608 sectors
UINT_MAX: 8388607 sectors

dm-thinp block size = default discard granularity = 128 sectors

blkdev_issue_discard(sector=0, nr_sectors=8388608)

[start_sector, end_sector]
[0, 8388607]
    [0, 8388606], then dm-thinp splits it to 2 bios
        [0, 8388479]
        [8388480, 8388606] ---> this has problem in process_discard_bio(),
                                because the discard size(7 sectors) covers less than a block(128 sectors)
    [8388607, 8388607] ---> same problem 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 16:19             ` Mike Snitzer
@ 2015-10-21 17:33               ` Ming Lin
  -1 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 17:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, 2015-10-21 at 12:19 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at 12:02pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Wed, Oct 14 2015 at  9:27am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > mkfs.xfs is OK on 4.3-rc5.
> > > > 
> > > > What's your device model? I may find a similar one to try.
> > > 
> > > This is a HGST Ultrastar SN100
> > > 
> > > Analsys and tentativ fix below:
> > > 
> > > blktrace for before the commit:
> > > 
> > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > 
> > > .. and so on.
> > > 
> > > blktrace with the commit:
> > > 
> > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > 
> > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > pattern back and everything works fine again:
> > > 
> > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > index bd40292..65b61dc 100644
> > > --- a/block/blk-lib.c
> > > +++ b/block/blk-lib.c
> > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > >  			break;
> > >  		}
> > >  
> > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > >  		end_sect = sector + req_sects;
> > >  
> > >  		bio->bi_iter.bi_sector = sector;
> > 
> > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > minimum_io_size?
> > 
> > That should work for all devices and for dm-thinp (and dm-cache) in
> > particular will ensure that all discards that are issued will be a
> > multiple of the underlying device's blocksize.
> 
> Jeff Moyer pointed out having req_sects be a factor of
> discard_granularity makes more sense.  And I agree.  Same difference in
> the end (since dm-thinp sets discard_granularity to the thinp
> blocksize).

An old version of this patch did use discard_granularity
https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html

But you didn't agree.
https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html

Maybe we can re-add discard_granularity now?



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 17:33               ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 17:33 UTC (permalink / raw)


On Wed, 2015-10-21@12:19 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at 12:02pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Wed, Oct 14 2015 at  9:27am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > mkfs.xfs is OK on 4.3-rc5.
> > > > 
> > > > What's your device model? I may find a similar one to try.
> > > 
> > > This is a HGST Ultrastar SN100
> > > 
> > > Analsys and tentativ fix below:
> > > 
> > > blktrace for before the commit:
> > > 
> > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > 
> > > .. and so on.
> > > 
> > > blktrace with the commit:
> > > 
> > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > 
> > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > pattern back and everything works fine again:
> > > 
> > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > index bd40292..65b61dc 100644
> > > --- a/block/blk-lib.c
> > > +++ b/block/blk-lib.c
> > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > >  			break;
> > >  		}
> > >  
> > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > >  		end_sect = sector + req_sects;
> > >  
> > >  		bio->bi_iter.bi_sector = sector;
> > 
> > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > minimum_io_size?
> > 
> > That should work for all devices and for dm-thinp (and dm-cache) in
> > particular will ensure that all discards that are issued will be a
> > multiple of the underlying device's blocksize.
> 
> Jeff Moyer pointed out having req_sects be a factor of
> discard_granularity makes more sense.  And I agree.  Same difference in
> the end (since dm-thinp sets discard_granularity to the thinp
> blocksize).

An old version of this patch did use discard_granularity
https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html

But you didn't agree.
https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html

Maybe we can re-add discard_granularity now?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 17:33               ` Ming Lin
@ 2015-10-21 18:18                 ` Mike Snitzer
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 18:18 UTC (permalink / raw)
  To: Ming Lin
  Cc: Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, Oct 21 2015 at  1:33pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Wed, 2015-10-21 at 12:19 -0400, Mike Snitzer wrote:
> > On Wed, Oct 21 2015 at 12:02pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > On Wed, Oct 14 2015 at  9:27am -0400,
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > > On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> > > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > > mkfs.xfs is OK on 4.3-rc5.
> > > > > 
> > > > > What's your device model? I may find a similar one to try.
> > > > 
> > > > This is a HGST Ultrastar SN100
> > > > 
> > > > Analsys and tentativ fix below:
> > > > 
> > > > blktrace for before the commit:
> > > > 
> > > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > > 
> > > > .. and so on.
> > > > 
> > > > blktrace with the commit:
> > > > 
> > > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > > 
> > > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > > pattern back and everything works fine again:
> > > > 
> > > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > > index bd40292..65b61dc 100644
> > > > --- a/block/blk-lib.c
> > > > +++ b/block/blk-lib.c
> > > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > > >  		end_sect = sector + req_sects;
> > > >  
> > > >  		bio->bi_iter.bi_sector = sector;
> > > 
> > > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > > minimum_io_size?
> > > 
> > > That should work for all devices and for dm-thinp (and dm-cache) in
> > > particular will ensure that all discards that are issued will be a
> > > multiple of the underlying device's blocksize.
> > 
> > Jeff Moyer pointed out having req_sects be a factor of
> > discard_granularity makes more sense.  And I agree.  Same difference in
> > the end (since dm-thinp sets discard_granularity to the thinp
> > blocksize).
> 
> An old version of this patch did use discard_granularity
> https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> 
> But you didn't agree.
> https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html
> 
> Maybe we can re-add discard_granularity now?

I disagreed on a more generic level than discard_granularity shaping the
split boundary.

But we are where we are.  If we're going to split (due to 32-bit limits
in bio->bi_iter.bi_size) then we should at least do so in terms of the
support discard_granularity.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 18:18                 ` Mike Snitzer
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Snitzer @ 2015-10-21 18:18 UTC (permalink / raw)


On Wed, Oct 21 2015 at  1:33pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Wed, 2015-10-21@12:19 -0400, Mike Snitzer wrote:
> > On Wed, Oct 21 2015 at 12:02pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > On Wed, Oct 14 2015 at  9:27am -0400,
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > > On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> > > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > > mkfs.xfs is OK on 4.3-rc5.
> > > > > 
> > > > > What's your device model? I may find a similar one to try.
> > > > 
> > > > This is a HGST Ultrastar SN100
> > > > 
> > > > Analsys and tentativ fix below:
> > > > 
> > > > blktrace for before the commit:
> > > > 
> > > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > > 
> > > > .. and so on.
> > > > 
> > > > blktrace with the commit:
> > > > 
> > > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > > 
> > > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > > pattern back and everything works fine again:
> > > > 
> > > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > > index bd40292..65b61dc 100644
> > > > --- a/block/blk-lib.c
> > > > +++ b/block/blk-lib.c
> > > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > > >  		end_sect = sector + req_sects;
> > > >  
> > > >  		bio->bi_iter.bi_sector = sector;
> > > 
> > > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > > minimum_io_size?
> > > 
> > > That should work for all devices and for dm-thinp (and dm-cache) in
> > > particular will ensure that all discards that are issued will be a
> > > multiple of the underlying device's blocksize.
> > 
> > Jeff Moyer pointed out having req_sects be a factor of
> > discard_granularity makes more sense.  And I agree.  Same difference in
> > the end (since dm-thinp sets discard_granularity to the thinp
> > blocksize).
> 
> An old version of this patch did use discard_granularity
> https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> 
> But you didn't agree.
> https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html
> 
> Maybe we can re-add discard_granularity now?

I disagreed on a more generic level than discard_granularity shaping the
split boundary.

But we are where we are.  If we're going to split (due to 32-bit limits
in bio->bi_iter.bi_size) then we should at least do so in terms of the
support discard_granularity.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 18:18                 ` Mike Snitzer
@ 2015-10-21 20:13                   ` Ming Lin
  -1 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 20:13 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, lkml, Jens Axboe, Kent Overstreet,
	Dongsu Park, Martin K. Petersen, Ming Lin, linux-nvme

On Wed, 2015-10-21 at 14:18 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at  1:33pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Wed, 2015-10-21 at 12:19 -0400, Mike Snitzer wrote:
> > > On Wed, Oct 21 2015 at 12:02pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > > On Wed, Oct 14 2015 at  9:27am -0400,
> > > > Christoph Hellwig <hch@infradead.org> wrote:
> > > > 
> > > > > On Tue, Oct 13, 2015 at 10:44:11AM -0700, Ming Lin wrote:
> > > > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > > > mkfs.xfs is OK on 4.3-rc5.
> > > > > > 
> > > > > > What's your device model? I may find a similar one to try.
> > > > > 
> > > > > This is a HGST Ultrastar SN100
> > > > > 
> > > > > Analsys and tentativ fix below:
> > > > > 
> > > > > blktrace for before the commit:
> > > > > 
> > > > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > > > 
> > > > > .. and so on.
> > > > > 
> > > > > blktrace with the commit:
> > > > > 
> > > > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > > > 
> > > > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > > > pattern back and everything works fine again:
> > > > > 
> > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > > > index bd40292..65b61dc 100644
> > > > > --- a/block/blk-lib.c
> > > > > +++ b/block/blk-lib.c
> > > > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > > > >  			break;
> > > > >  		}
> > > > >  
> > > > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > > > >  		end_sect = sector + req_sects;
> > > > >  
> > > > >  		bio->bi_iter.bi_sector = sector;
> > > > 
> > > > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > > > minimum_io_size?
> > > > 
> > > > That should work for all devices and for dm-thinp (and dm-cache) in
> > > > particular will ensure that all discards that are issued will be a
> > > > multiple of the underlying device's blocksize.
> > > 
> > > Jeff Moyer pointed out having req_sects be a factor of
> > > discard_granularity makes more sense.  And I agree.  Same difference in
> > > the end (since dm-thinp sets discard_granularity to the thinp
> > > blocksize).
> > 
> > An old version of this patch did use discard_granularity
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> > 
> > But you didn't agree.
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html
> > 
> > Maybe we can re-add discard_granularity now?
> 
> I disagreed on a more generic level than discard_granularity shaping the
> split boundary.
> 
> But we are where we are.  If we're going to split (due to 32-bit limits
> in bio->bi_iter.bi_size) then we should at least do so in terms of the
> support discard_granularity.

How about below?
It actually reverts commit b49a0871 and adds patch at
https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html

Christoph, could you help to try it?

commit 122bf0a43cb1611ed62aaf945f25b649c27a71ed
Author: Ming Lin <mlin@kernel.org>
Date:   Wed Oct 21 11:24:48 2015 -0700

    block: check discard_granularity and alignment
    
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-lib.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index bd40292..9ebf653 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,13 +26,6 @@ static void bio_batch_end_io(struct bio *bio)
 	bio_put(bio);
 }
 
-/*
- * Ensure that max discard sectors doesn't overflow bi_size and hopefully
- * it is of the proper granularity as long as the granularity is a power
- * of two.
- */
-#define MAX_BIO_SECTORS ((1U << 31) >> 9)
-
 /**
  * blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
@@ -50,6 +43,8 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
+	unsigned int granularity;
+	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -61,6 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
+
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
@@ -74,7 +73,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
-		sector_t end_sect;
+		sector_t end_sect, tmp;
 
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
@@ -82,8 +81,22 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
+		/* Make sure bi_size doesn't overflow */
+		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
+
+		/*
+		 * If splitting a request, and the next starting sector would be
+		 * misaligned, stop the discard at the previous aligned sector.
+		 */
 		end_sect = sector + req_sects;
+		tmp = end_sect;
+		if (req_sects < nr_sects &&
+		    sector_div(tmp, granularity) != alignment) {
+			end_sect = end_sect - alignment;
+			sector_div(end_sect, granularity);
+			end_sect = end_sect * granularity + alignment;
+			req_sects = end_sect - sector;
+		}
 
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-21 20:13                   ` Ming Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Ming Lin @ 2015-10-21 20:13 UTC (permalink / raw)


On Wed, 2015-10-21@14:18 -0400, Mike Snitzer wrote:
> On Wed, Oct 21 2015 at  1:33pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Wed, 2015-10-21@12:19 -0400, Mike Snitzer wrote:
> > > On Wed, Oct 21 2015 at 12:02pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > > On Wed, Oct 14 2015 at  9:27am -0400,
> > > > Christoph Hellwig <hch@infradead.org> wrote:
> > > > 
> > > > > On Tue, Oct 13, 2015@10:44:11AM -0700, Ming Lin wrote:
> > > > > > I just did a quick test with a Samsung 900G NVMe device.
> > > > > > mkfs.xfs is OK on 4.3-rc5.
> > > > > > 
> > > > > > What's your device model? I may find a similar one to try.
> > > > > 
> > > > > This is a HGST Ultrastar SN100
> > > > > 
> > > > > Analsys and tentativ fix below:
> > > > > 
> > > > > blktrace for before the commit:
> > > > > 
> > > > > 259,0    1        2     0.000002543  2394  G   D 0 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        3     0.000008230  2394  I   D 0 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        4     0.000031090   207  D   D 0 + 8388607 [kworker/1:1H]
> > > > > 259,0    1        5     0.000044869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        6     0.000045992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        7     0.000049559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
> > > > > 259,0    1        8     0.000061551   207  D   D 8388607 + 8388607 [kworker/1:1H]
> > > > > 
> > > > > .. and so on.
> > > > > 
> > > > > blktrace with the commit:
> > > > > 
> > > > > 259,0    2        1     0.000000000  1228  Q   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        2     0.000002543  1228  G   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        3     0.000010080  1228  I   D 0 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        4     0.000082187   267  D   D 0 + 4194304 [kworker/2:1H]
> > > > > 259,0    2        5     0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        6     0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        7     0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
> > > > > 259,0    2        8     0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]
> > > > > 
> > > > > So discards are smaller, but better aligned.  Now if I tweak a single
> > > > > line in blk-lib.c to be able to use all of bi_size I get the old I/O
> > > > > pattern back and everything works fine again:
> > > > > 
> > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > > > index bd40292..65b61dc 100644
> > > > > --- a/block/blk-lib.c
> > > > > +++ b/block/blk-lib.c
> > > > > @@ -82,7 +82,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > > > >  			break;
> > > > >  		}
> > > > >  
> > > > > -		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
> > > > > +		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> > > > >  		end_sect = sector + req_sects;
> > > > >  
> > > > >  		bio->bi_iter.bi_sector = sector;
> > > > 
> > > > Can we change UINT_MAX >> 9 to rounddown to the first factor of
> > > > minimum_io_size?
> > > > 
> > > > That should work for all devices and for dm-thinp (and dm-cache) in
> > > > particular will ensure that all discards that are issued will be a
> > > > multiple of the underlying device's blocksize.
> > > 
> > > Jeff Moyer pointed out having req_sects be a factor of
> > > discard_granularity makes more sense.  And I agree.  Same difference in
> > > the end (since dm-thinp sets discard_granularity to the thinp
> > > blocksize).
> > 
> > An old version of this patch did use discard_granularity
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> > 
> > But you didn't agree.
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00001.html
> > 
> > Maybe we can re-add discard_granularity now?
> 
> I disagreed on a more generic level than discard_granularity shaping the
> split boundary.
> 
> But we are where we are.  If we're going to split (due to 32-bit limits
> in bio->bi_iter.bi_size) then we should at least do so in terms of the
> support discard_granularity.

How about below?
It actually reverts commit b49a0871 and adds patch at
https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html

Christoph, could you help to try it?

commit 122bf0a43cb1611ed62aaf945f25b649c27a71ed
Author: Ming Lin <mlin at kernel.org>
Date:   Wed Oct 21 11:24:48 2015 -0700

    block: check discard_granularity and alignment
    
    Signed-off-by: Ming Lin <ming.l at ssi.samsung.com>
---
 block/blk-lib.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index bd40292..9ebf653 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,13 +26,6 @@ static void bio_batch_end_io(struct bio *bio)
 	bio_put(bio);
 }
 
-/*
- * Ensure that max discard sectors doesn't overflow bi_size and hopefully
- * it is of the proper granularity as long as the granularity is a power
- * of two.
- */
-#define MAX_BIO_SECTORS ((1U << 31) >> 9)
-
 /**
  * blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
@@ -50,6 +43,8 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
+	unsigned int granularity;
+	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -61,6 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
+
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
@@ -74,7 +73,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
-		sector_t end_sect;
+		sector_t end_sect, tmp;
 
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
@@ -82,8 +81,22 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, MAX_BIO_SECTORS);
+		/* Make sure bi_size doesn't overflow */
+		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
+
+		/*
+		 * If splitting a request, and the next starting sector would be
+		 * misaligned, stop the discard at the previous aligned sector.
+		 */
 		end_sect = sector + req_sects;
+		tmp = end_sect;
+		if (req_sects < nr_sects &&
+		    sector_div(tmp, granularity) != alignment) {
+			end_sect = end_sect - alignment;
+			sector_div(end_sect, granularity);
+			end_sect = end_sect * granularity + alignment;
+			req_sects = end_sect - sector;
+		}
 
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-21 20:13                   ` Ming Lin
@ 2015-10-22 10:24                     ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-22 10:24 UTC (permalink / raw)
  To: Ming Lin
  Cc: Mike Snitzer, Christoph Hellwig, lkml, Jens Axboe,
	Kent Overstreet, Dongsu Park, Martin K. Petersen, Ming Lin,
	linux-nvme

On Wed, Oct 21, 2015 at 01:13:09PM -0700, Ming Lin wrote:
> How about below?
> It actually reverts commit b49a0871 and adds patch at
> https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> 
> Christoph, could you help to try it?

Still causes hickups with my controller unfortunately.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-22 10:24                     ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-22 10:24 UTC (permalink / raw)


On Wed, Oct 21, 2015@01:13:09PM -0700, Ming Lin wrote:
> How about below?
> It actually reverts commit b49a0871 and adds patch at
> https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> 
> Christoph, could you help to try it?

Still causes hickups with my controller unfortunately.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
  2015-10-22 10:24                     ` Christoph Hellwig
@ 2015-10-22 11:22                       ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-22 11:22 UTC (permalink / raw)
  To: Ming Lin
  Cc: Mike Snitzer, Christoph Hellwig, lkml, Jens Axboe,
	Kent Overstreet, Dongsu Park, Martin K. Petersen, Ming Lin,
	linux-nvme

On Thu, Oct 22, 2015 at 03:24:44AM -0700, Christoph Hellwig wrote:
> > How about below?
> > It actually reverts commit b49a0871 and adds patch at
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> > 
> > Christoph, could you help to try it?
> 
> Still causes hickups with my controller unfortunately.

Turns out I booted into the wrong kernel.  This actually works fine now
thay I've actually tested the right code:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
@ 2015-10-22 11:22                       ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2015-10-22 11:22 UTC (permalink / raw)


On Thu, Oct 22, 2015@03:24:44AM -0700, Christoph Hellwig wrote:
> > How about below?
> > It actually reverts commit b49a0871 and adds patch at
> > https://www.redhat.com/archives/dm-devel/2015-August/msg00000.html
> > 
> > Christoph, could you help to try it?
> 
> Still causes hickups with my controller unfortunately.

Turns out I booted into the wrong kernel.  This actually works fine now
thay I've actually tested the right code:

Reviewed-by: Christoph Hellwig <hch at lst.de>
Tested-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: bcache: remove driver private bio splitting code
  2015-08-12  7:07 ` [PATCH v6 03/11] bcache: remove driver private bio splitting code Ming Lin
@ 2016-01-08  1:53   ` Eric Wheeler
  2016-01-13  2:00     ` Eric Wheeler
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Wheeler @ 2016-01-08  1:53 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Dongsu Park, Jens Axboe, linux-bcache,
	linux-kernel, Mike Snitzer, Ming Lin, Ming Lin,
	Martin K. Petersen

> From: Kent Overstreet Wed, 12 Aug 2015 00:07:13 -0700
> The bcache driver has always accepted arbitrarily large bios and split
> them internally.  Now that every driver must accept arbitrarily large
> bios this code isn't nessecary anymore.

Hi Kent,

Does your patch below make any kernel version assumptions about bio 
splitting?  That doesn't appear so based on the commit message, but 
thought I'd double-check.

If there is a kernel dependency, then, how far back should this be safe?  
I'd like to apply to 3.17-rc1 like the other stable commits and stamp it 
with Cc: stable@ so it will merge forward into 3.18 and onward if it fixes 
my problem below:

This question comes up because I'm getting the following 
(discard-related?) backtrace and would like to try the patch.  If it fixes 
the problem I'll stamp it with stable and get it to Jens.  Note that 
bch_generic_make_request() doesn't even exist after this patch, so this 
backtrace would at least be simplified and possibly fixed:

[  294.059255]  [<ffffffffa049a20d>] bch_generic_make_request+0x15d/0x1f0 [bcache]
[  294.059578]  [<ffffffffa049a3b9>] __bch_submit_bbio+0x79/0x80 [bcache]
[  294.059794]  [<ffffffffa049a3eb>] bch_submit_bbio+0x2b/0x30 [bcache]
[  294.059983]  [<ffffffffa049fabb>] bch_data_insert_start+0xcb/0x5a0 [bcache]
[  294.060169]  [<ffffffffa049ffe1>] bch_data_insert+0x51/0xc0 [bcache]
[  294.060354]  [<ffffffffa04a0cc4>] cached_dev_make_request+0xbe4/0xdf0 [bcache]
[  294.060652]  [<ffffffff81317c20>] generic_make_request+0xe0/0x130
[  294.060830]  [<ffffffff81317ce7>] submit_bio+0x77/0x150
[  294.061012]  [<ffffffff81312906>] ? bio_alloc_bioset+0x1d6/0x330
[  294.061194]  [<ffffffffa04cef90>] ? le64_dec+0x20/0x20 [dm_persistent_data]
[  294.061377]  [<ffffffffa04e49be>] __blkdev_issue_discard_async.constprop.59+0x17e/0x210 [dm_thin_pool]
[  294.061695]  [<ffffffffa04e55dc>] process_prepared_discard_passdown+0x9c/0x230 [dm_thin_pool]
[  294.062007]  [<ffffffff811ac73f>] ? mempool_free+0x2f/0x90
[  294.062207]  [<ffffffffa04e0772>] process_prepared+0x92/0xc0 [dm_thin_pool]
[  294.062404]  [<ffffffffa04e5f78>] do_worker+0xb8/0x880 [dm_thin_pool]
[  294.062606]  [<ffffffff81014693>] ? __switch_to+0x1e3/0x580
[  294.062796]  [<ffffffff810dc71c>] ? put_prev_task_fair+0x2c/0x40
[  294.062989]  [<ffffffff810ba18d>] process_one_work+0x14d/0x420
[  294.063171]  [<ffffffff810ba952>] worker_thread+0x112/0x520
[  294.063355]  [<ffffffff810ba840>] ? rescuer_thread+0x3e0/0x3e0
[  294.063560]  [<ffffffff810c06a8>] kthread+0xd8/0xf0
[  294.063737]  [<ffffffff810c05d0>] ? kthread_create_on_node+0x1b0/0x1b0
[  294.063943]  [<ffffffff816e4aa2>] ret_from_fork+0x42/0x70
[  294.064133]  [<ffffffff810c05d0>] ? kthread_create_on_node+0x1b0/0x1b0


Thanks!

-Eric
 
> Cc: linux-bcache@vger.kernel.org
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park <dpark@posteo.net>
> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
> ---
>  drivers/md/bcache/bcache.h    |  18 --------
>  drivers/md/bcache/io.c        | 101 +-----------------------------------------
>  drivers/md/bcache/journal.c   |   4 +-
>  drivers/md/bcache/request.c   |  16 +++----
>  drivers/md/bcache/super.c     |  32 +------------
>  drivers/md/bcache/util.h      |   5 ++-
>  drivers/md/bcache/writeback.c |   4 +-
>  7 files changed, 18 insertions(+), 162 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 04f7bc2..6b420a5 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -243,19 +243,6 @@ struct keybuf {
>  	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
>  };
>  
> -struct bio_split_pool {
> -	struct bio_set		*bio_split;
> -	mempool_t		*bio_split_hook;
> -};
> -
> -struct bio_split_hook {
> -	struct closure		cl;
> -	struct bio_split_pool	*p;
> -	struct bio		*bio;
> -	bio_end_io_t		*bi_end_io;
> -	void			*bi_private;
> -};
> -
>  struct bcache_device {
>  	struct closure		cl;
>  
> @@ -288,8 +275,6 @@ struct bcache_device {
>  	int (*cache_miss)(struct btree *, struct search *,
>  			  struct bio *, unsigned);
>  	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
> -
> -	struct bio_split_pool	bio_split_hook;
>  };
>  
>  struct io {
> @@ -454,8 +439,6 @@ struct cache {
>  	atomic_long_t		meta_sectors_written;
>  	atomic_long_t		btree_sectors_written;
>  	atomic_long_t		sectors_written;
> -
> -	struct bio_split_pool	bio_split_hook;
>  };
>  
>  struct gc_stat {
> @@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
>  void bch_bbio_free(struct bio *, struct cache_set *);
>  struct bio *bch_bbio_alloc(struct cache_set *);
>  
> -void bch_generic_make_request(struct bio *, struct bio_split_pool *);
>  void __bch_submit_bbio(struct bio *, struct cache_set *);
>  void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
>  
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index bf6a9ca..86a0bb8 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -11,105 +11,6 @@
>  
>  #include <linux/blkdev.h>
>  
> -static unsigned bch_bio_max_sectors(struct bio *bio)
> -{
> -	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> -	struct bio_vec bv;
> -	struct bvec_iter iter;
> -	unsigned ret = 0, seg = 0;
> -
> -	if (bio->bi_rw & REQ_DISCARD)
> -		return min(bio_sectors(bio), q->limits.max_discard_sectors);
> -
> -	bio_for_each_segment(bv, bio, iter) {
> -		struct bvec_merge_data bvm = {
> -			.bi_bdev	= bio->bi_bdev,
> -			.bi_sector	= bio->bi_iter.bi_sector,
> -			.bi_size	= ret << 9,
> -			.bi_rw		= bio->bi_rw,
> -		};
> -
> -		if (seg == min_t(unsigned, BIO_MAX_PAGES,
> -				 queue_max_segments(q)))
> -			break;
> -
> -		if (q->merge_bvec_fn &&
> -		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
> -			break;
> -
> -		seg++;
> -		ret += bv.bv_len >> 9;
> -	}
> -
> -	ret = min(ret, queue_max_sectors(q));
> -
> -	WARN_ON(!ret);
> -	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
> -
> -	return ret;
> -}
> -
> -static void bch_bio_submit_split_done(struct closure *cl)
> -{
> -	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> -
> -	s->bio->bi_end_io = s->bi_end_io;
> -	s->bio->bi_private = s->bi_private;
> -	bio_endio(s->bio, 0);
> -
> -	closure_debug_destroy(&s->cl);
> -	mempool_free(s, s->p->bio_split_hook);
> -}
> -
> -static void bch_bio_submit_split_endio(struct bio *bio, int error)
> -{
> -	struct closure *cl = bio->bi_private;
> -	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> -
> -	if (error)
> -		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
> -
> -	bio_put(bio);
> -	closure_put(cl);
> -}
> -
> -void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
> -{
> -	struct bio_split_hook *s;
> -	struct bio *n;
> -
> -	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
> -		goto submit;
> -
> -	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
> -		goto submit;
> -
> -	s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
> -	closure_init(&s->cl, NULL);
> -
> -	s->bio		= bio;
> -	s->p		= p;
> -	s->bi_end_io	= bio->bi_end_io;
> -	s->bi_private	= bio->bi_private;
> -	bio_get(bio);
> -
> -	do {
> -		n = bio_next_split(bio, bch_bio_max_sectors(bio),
> -				   GFP_NOIO, s->p->bio_split);
> -
> -		n->bi_end_io	= bch_bio_submit_split_endio;
> -		n->bi_private	= &s->cl;
> -
> -		closure_get(&s->cl);
> -		generic_make_request(n);
> -	} while (n != bio);
> -
> -	continue_at(&s->cl, bch_bio_submit_split_done, NULL);
> -	return;
> -submit:
> -	generic_make_request(bio);
> -}
> -
>  /* Bios with headers */
>  
>  void bch_bbio_free(struct bio *bio, struct cache_set *c)
> @@ -139,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
>  	bio->bi_bdev		= PTR_CACHE(c, &b->key, 0)->bdev;
>  
>  	b->submit_time_us = local_clock_us();
> -	closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
> +	closure_bio_submit(bio, bio->bi_private);
>  }
>  
>  void bch_submit_bbio(struct bio *bio, struct cache_set *c,
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 418607a..727ca9b 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -61,7 +61,7 @@ reread:		left = ca->sb.bucket_size - offset;
>  		bio->bi_private = &cl;
>  		bch_bio_map(bio, data);
>  
> -		closure_bio_submit(bio, &cl, ca);
> +		closure_bio_submit(bio, &cl);
>  		closure_sync(&cl);
>  
>  		/* This function could be simpler now since we no longer write
> @@ -648,7 +648,7 @@ static void journal_write_unlocked(struct closure *cl)
>  	spin_unlock(&c->journal.lock);
>  
>  	while ((bio = bio_list_pop(&list)))
> -		closure_bio_submit(bio, cl, c->cache[0]);
> +		closure_bio_submit(bio, cl);
>  
>  	continue_at(cl, journal_write_done, NULL);
>  }
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index f292790..ab093a8 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -718,7 +718,7 @@ static void cached_dev_read_error(struct closure *cl)
>  
>  		/* XXX: invalidate cache */
>  
> -		closure_bio_submit(bio, cl, s->d);
> +		closure_bio_submit(bio, cl);
>  	}
>  
>  	continue_at(cl, cached_dev_cache_miss_done, NULL);
> @@ -841,7 +841,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  	s->cache_miss	= miss;
>  	s->iop.bio	= cache_bio;
>  	bio_get(cache_bio);
> -	closure_bio_submit(cache_bio, &s->cl, s->d);
> +	closure_bio_submit(cache_bio, &s->cl);
>  
>  	return ret;
>  out_put:
> @@ -849,7 +849,7 @@ out_put:
>  out_submit:
>  	miss->bi_end_io		= request_endio;
>  	miss->bi_private	= &s->cl;
> -	closure_bio_submit(miss, &s->cl, s->d);
> +	closure_bio_submit(miss, &s->cl);
>  	return ret;
>  }
>  
> @@ -914,7 +914,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  
>  		if (!(bio->bi_rw & REQ_DISCARD) ||
>  		    blk_queue_discard(bdev_get_queue(dc->bdev)))
> -			closure_bio_submit(bio, cl, s->d);
> +			closure_bio_submit(bio, cl);
>  	} else if (s->iop.writeback) {
>  		bch_writeback_add(dc);
>  		s->iop.bio = bio;
> @@ -929,12 +929,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  			flush->bi_end_io = request_endio;
>  			flush->bi_private = cl;
>  
> -			closure_bio_submit(flush, cl, s->d);
> +			closure_bio_submit(flush, cl);
>  		}
>  	} else {
>  		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
>  
> -		closure_bio_submit(bio, cl, s->d);
> +		closure_bio_submit(bio, cl);
>  	}
>  
>  	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
> @@ -950,7 +950,7 @@ static void cached_dev_nodata(struct closure *cl)
>  		bch_journal_meta(s->iop.c, cl);
>  
>  	/* If it's a flush, we send the flush to the backing device too */
> -	closure_bio_submit(bio, cl, s->d);
> +	closure_bio_submit(bio, cl);
>  
>  	continue_at(cl, cached_dev_bio_complete, NULL);
>  }
> @@ -994,7 +994,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
>  		    !blk_queue_discard(bdev_get_queue(dc->bdev)))
>  			bio_endio(bio, 0);
>  		else
> -			bch_generic_make_request(bio, &d->bio_split_hook);
> +			generic_make_request(bio);
>  	}
>  }
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 94980bf..db70c9e 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;
>  
>  #define BTREE_MAX_PAGES		(256 * 1024 / PAGE_SIZE)
>  
> -static void bio_split_pool_free(struct bio_split_pool *p)
> -{
> -	if (p->bio_split_hook)
> -		mempool_destroy(p->bio_split_hook);
> -
> -	if (p->bio_split)
> -		bioset_free(p->bio_split);
> -}
> -
> -static int bio_split_pool_init(struct bio_split_pool *p)
> -{
> -	p->bio_split = bioset_create(4, 0);
> -	if (!p->bio_split)
> -		return -ENOMEM;
> -
> -	p->bio_split_hook = mempool_create_kmalloc_pool(4,
> -				sizeof(struct bio_split_hook));
> -	if (!p->bio_split_hook)
> -		return -ENOMEM;
> -
> -	return 0;
> -}
> -
>  /* Superblock */
>  
>  static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
> @@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
>  	bio->bi_private = ca;
>  	bch_bio_map(bio, ca->disk_buckets);
>  
> -	closure_bio_submit(bio, &ca->prio, ca);
> +	closure_bio_submit(bio, &ca->prio);
>  	closure_sync(cl);
>  }
>  
> @@ -757,7 +734,6 @@ static void bcache_device_free(struct bcache_device *d)
>  		put_disk(d->disk);
>  	}
>  
> -	bio_split_pool_free(&d->bio_split_hook);
>  	if (d->bio_split)
>  		bioset_free(d->bio_split);
>  	kvfree(d->full_dirty_stripes);
> @@ -804,7 +780,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
>  		return minor;
>  
>  	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
> -	    bio_split_pool_init(&d->bio_split_hook) ||
>  	    !(d->disk = alloc_disk(1))) {
>  		ida_simple_remove(&bcache_minor, minor);
>  		return -ENOMEM;
> @@ -1793,8 +1768,6 @@ void bch_cache_release(struct kobject *kobj)
>  		ca->set->cache[ca->sb.nr_this_dev] = NULL;
>  	}
>  
> -	bio_split_pool_free(&ca->bio_split_hook);
> -
>  	free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
>  	kfree(ca->prio_buckets);
>  	vfree(ca->buckets);
> @@ -1839,8 +1812,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
>  					  ca->sb.nbuckets)) ||
>  	    !(ca->prio_buckets	= kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
>  					  2, GFP_KERNEL)) ||
> -	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)) ||
> -	    bio_split_pool_init(&ca->bio_split_hook))
> +	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)))
>  		return -ENOMEM;
>  
>  	ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index 1d04c48..cf2cbc2 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/blkdev.h>
>  #include <linux/errno.h>
> +#include <linux/blkdev.h>
>  #include <linux/kernel.h>
>  #include <linux/llist.h>
>  #include <linux/ratelimit.h>
> @@ -570,10 +571,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
>  	return bdev->bd_inode->i_size >> 9;
>  }
>  
> -#define closure_bio_submit(bio, cl, dev)				\
> +#define closure_bio_submit(bio, cl)					\
>  do {									\
>  	closure_get(cl);						\
> -	bch_generic_make_request(bio, &(dev)->bio_split_hook);		\
> +	generic_make_request(bio);					\
>  } while (0)
>  
>  uint64_t bch_crc64_update(uint64_t, const void *, size_t);
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index f1986bc..ca38362 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
>  	io->bio.bi_bdev		= io->dc->bdev;
>  	io->bio.bi_end_io	= dirty_endio;
>  
> -	closure_bio_submit(&io->bio, cl, &io->dc->disk);
> +	closure_bio_submit(&io->bio, cl);
>  
>  	continue_at(cl, write_dirty_finish, system_wq);
>  }
> @@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
>  {
>  	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>  
> -	closure_bio_submit(&io->bio, cl, &io->dc->disk);
> +	closure_bio_submit(&io->bio, cl);
>  
>  	continue_at(cl, write_dirty, system_wq);
>  }
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: bcache: remove driver private bio splitting code
  2016-01-08  1:53   ` Eric Wheeler
@ 2016-01-13  2:00     ` Eric Wheeler
  2016-01-13  5:54       ` Vojtech Pavlik
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Wheeler @ 2016-01-13  2:00 UTC (permalink / raw)
  To: vojtech; +Cc: linux-bcache

[-- Attachment #1: Type: TEXT/PLAIN, Size: 17907 bytes --]

Hi Vojtěch,

Have you tested the patch below in SLE12-* when bcache is backed by md
raid5/6?

FYI: I was compared the drivers/md/bcache/io.c functions in the various
branches here:
  https://github.com/openSUSE/kernel
I compared the presence of bch_generic_make_request() (which the patch
below removes).  It looks like the branch SLE12-SP2 has the patch, but
version before SLE12-SP2 and openSUSE-* do not (as they still have
bch_generic_make_request).

Since the patch does exist in SLE12-SP2, I'm guessing that is been tested,
though I am curious if it has been tested specifically when being backed
by md raid5/6 so that queue->limits->partial_stripes_expensive is nonzero.

If you have and it is stable, then I want to get it to Kent and Jens for 
upstream integration.

Thanks!

-Eric



--
Eric Wheeler, President           eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926)        Fax: 503-716-3878           PO Box 25107
www.GlobalLinuxSecurity.pro       Linux since 1996!     Portland, OR 97298

On Thu, 7 Jan 2016, Eric Wheeler wrote:

> > From: Kent Overstreet Wed, 12 Aug 2015 00:07:13 -0700
> > The bcache driver has always accepted arbitrarily large bios and split
> > them internally.  Now that every driver must accept arbitrarily large
> > bios this code isn't nessecary anymore.
> 
> Hi Kent,
> 
> Does your patch below make any kernel version assumptions about bio 
> splitting?  That doesn't appear so based on the commit message, but 
> thought I'd double-check.
> 
> If there is a kernel dependency, then, how far back should this be safe?  
> I'd like to apply to 3.17-rc1 like the other stable commits and stamp it 
> with Cc: stable@ so it will merge forward into 3.18 and onward if it fixes 
> my problem below:
> 
> This question comes up because I'm getting the following 
> (discard-related?) backtrace and would like to try the patch.  If it fixes 
> the problem I'll stamp it with stable and get it to Jens.  Note that 
> bch_generic_make_request() doesn't even exist after this patch, so this 
> backtrace would at least be simplified and possibly fixed:
> 
> [  294.059255]  [<ffffffffa049a20d>] bch_generic_make_request+0x15d/0x1f0 [bcache]
> [  294.059578]  [<ffffffffa049a3b9>] __bch_submit_bbio+0x79/0x80 [bcache]
> [  294.059794]  [<ffffffffa049a3eb>] bch_submit_bbio+0x2b/0x30 [bcache]
> [  294.059983]  [<ffffffffa049fabb>] bch_data_insert_start+0xcb/0x5a0 [bcache]
> [  294.060169]  [<ffffffffa049ffe1>] bch_data_insert+0x51/0xc0 [bcache]
> [  294.060354]  [<ffffffffa04a0cc4>] cached_dev_make_request+0xbe4/0xdf0 [bcache]
> [  294.060652]  [<ffffffff81317c20>] generic_make_request+0xe0/0x130
> [  294.060830]  [<ffffffff81317ce7>] submit_bio+0x77/0x150
> [  294.061012]  [<ffffffff81312906>] ? bio_alloc_bioset+0x1d6/0x330
> [  294.061194]  [<ffffffffa04cef90>] ? le64_dec+0x20/0x20 [dm_persistent_data]
> [  294.061377]  [<ffffffffa04e49be>] __blkdev_issue_discard_async.constprop.59+0x17e/0x210 [dm_thin_pool]
> [  294.061695]  [<ffffffffa04e55dc>] process_prepared_discard_passdown+0x9c/0x230 [dm_thin_pool]
> [  294.062007]  [<ffffffff811ac73f>] ? mempool_free+0x2f/0x90
> [  294.062207]  [<ffffffffa04e0772>] process_prepared+0x92/0xc0 [dm_thin_pool]
> [  294.062404]  [<ffffffffa04e5f78>] do_worker+0xb8/0x880 [dm_thin_pool]
> [  294.062606]  [<ffffffff81014693>] ? __switch_to+0x1e3/0x580
> [  294.062796]  [<ffffffff810dc71c>] ? put_prev_task_fair+0x2c/0x40
> [  294.062989]  [<ffffffff810ba18d>] process_one_work+0x14d/0x420
> [  294.063171]  [<ffffffff810ba952>] worker_thread+0x112/0x520
> [  294.063355]  [<ffffffff810ba840>] ? rescuer_thread+0x3e0/0x3e0
> [  294.063560]  [<ffffffff810c06a8>] kthread+0xd8/0xf0
> [  294.063737]  [<ffffffff810c05d0>] ? kthread_create_on_node+0x1b0/0x1b0
> [  294.063943]  [<ffffffff816e4aa2>] ret_from_fork+0x42/0x70
> [  294.064133]  [<ffffffff810c05d0>] ? kthread_create_on_node+0x1b0/0x1b0
> 
> 
> Thanks!
> 
> -Eric
>  
> > Cc: linux-bcache@vger.kernel.org
> > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > [dpark: add more description in commit message]
> > Signed-off-by: Dongsu Park <dpark@posteo.net>
> > Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
> > ---
> >  drivers/md/bcache/bcache.h    |  18 --------
> >  drivers/md/bcache/io.c        | 101 +-----------------------------------------
> >  drivers/md/bcache/journal.c   |   4 +-
> >  drivers/md/bcache/request.c   |  16 +++----
> >  drivers/md/bcache/super.c     |  32 +------------
> >  drivers/md/bcache/util.h      |   5 ++-
> >  drivers/md/bcache/writeback.c |   4 +-
> >  7 files changed, 18 insertions(+), 162 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 04f7bc2..6b420a5 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -243,19 +243,6 @@ struct keybuf {
> >  	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
> >  };
> >  
> > -struct bio_split_pool {
> > -	struct bio_set		*bio_split;
> > -	mempool_t		*bio_split_hook;
> > -};
> > -
> > -struct bio_split_hook {
> > -	struct closure		cl;
> > -	struct bio_split_pool	*p;
> > -	struct bio		*bio;
> > -	bio_end_io_t		*bi_end_io;
> > -	void			*bi_private;
> > -};
> > -
> >  struct bcache_device {
> >  	struct closure		cl;
> >  
> > @@ -288,8 +275,6 @@ struct bcache_device {
> >  	int (*cache_miss)(struct btree *, struct search *,
> >  			  struct bio *, unsigned);
> >  	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
> > -
> > -	struct bio_split_pool	bio_split_hook;
> >  };
> >  
> >  struct io {
> > @@ -454,8 +439,6 @@ struct cache {
> >  	atomic_long_t		meta_sectors_written;
> >  	atomic_long_t		btree_sectors_written;
> >  	atomic_long_t		sectors_written;
> > -
> > -	struct bio_split_pool	bio_split_hook;
> >  };
> >  
> >  struct gc_stat {
> > @@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
> >  void bch_bbio_free(struct bio *, struct cache_set *);
> >  struct bio *bch_bbio_alloc(struct cache_set *);
> >  
> > -void bch_generic_make_request(struct bio *, struct bio_split_pool *);
> >  void __bch_submit_bbio(struct bio *, struct cache_set *);
> >  void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
> >  
> > diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> > index bf6a9ca..86a0bb8 100644
> > --- a/drivers/md/bcache/io.c
> > +++ b/drivers/md/bcache/io.c
> > @@ -11,105 +11,6 @@
> >  
> >  #include <linux/blkdev.h>
> >  
> > -static unsigned bch_bio_max_sectors(struct bio *bio)
> > -{
> > -	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > -	struct bio_vec bv;
> > -	struct bvec_iter iter;
> > -	unsigned ret = 0, seg = 0;
> > -
> > -	if (bio->bi_rw & REQ_DISCARD)
> > -		return min(bio_sectors(bio), q->limits.max_discard_sectors);
> > -
> > -	bio_for_each_segment(bv, bio, iter) {
> > -		struct bvec_merge_data bvm = {
> > -			.bi_bdev	= bio->bi_bdev,
> > -			.bi_sector	= bio->bi_iter.bi_sector,
> > -			.bi_size	= ret << 9,
> > -			.bi_rw		= bio->bi_rw,
> > -		};
> > -
> > -		if (seg == min_t(unsigned, BIO_MAX_PAGES,
> > -				 queue_max_segments(q)))
> > -			break;
> > -
> > -		if (q->merge_bvec_fn &&
> > -		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
> > -			break;
> > -
> > -		seg++;
> > -		ret += bv.bv_len >> 9;
> > -	}
> > -
> > -	ret = min(ret, queue_max_sectors(q));
> > -
> > -	WARN_ON(!ret);
> > -	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
> > -
> > -	return ret;
> > -}
> > -
> > -static void bch_bio_submit_split_done(struct closure *cl)
> > -{
> > -	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> > -
> > -	s->bio->bi_end_io = s->bi_end_io;
> > -	s->bio->bi_private = s->bi_private;
> > -	bio_endio(s->bio, 0);
> > -
> > -	closure_debug_destroy(&s->cl);
> > -	mempool_free(s, s->p->bio_split_hook);
> > -}
> > -
> > -static void bch_bio_submit_split_endio(struct bio *bio, int error)
> > -{
> > -	struct closure *cl = bio->bi_private;
> > -	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> > -
> > -	if (error)
> > -		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
> > -
> > -	bio_put(bio);
> > -	closure_put(cl);
> > -}
> > -
> > -void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
> > -{
> > -	struct bio_split_hook *s;
> > -	struct bio *n;
> > -
> > -	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
> > -		goto submit;
> > -
> > -	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
> > -		goto submit;
> > -
> > -	s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
> > -	closure_init(&s->cl, NULL);
> > -
> > -	s->bio		= bio;
> > -	s->p		= p;
> > -	s->bi_end_io	= bio->bi_end_io;
> > -	s->bi_private	= bio->bi_private;
> > -	bio_get(bio);
> > -
> > -	do {
> > -		n = bio_next_split(bio, bch_bio_max_sectors(bio),
> > -				   GFP_NOIO, s->p->bio_split);
> > -
> > -		n->bi_end_io	= bch_bio_submit_split_endio;
> > -		n->bi_private	= &s->cl;
> > -
> > -		closure_get(&s->cl);
> > -		generic_make_request(n);
> > -	} while (n != bio);
> > -
> > -	continue_at(&s->cl, bch_bio_submit_split_done, NULL);
> > -	return;
> > -submit:
> > -	generic_make_request(bio);
> > -}
> > -
> >  /* Bios with headers */
> >  
> >  void bch_bbio_free(struct bio *bio, struct cache_set *c)
> > @@ -139,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
> >  	bio->bi_bdev		= PTR_CACHE(c, &b->key, 0)->bdev;
> >  
> >  	b->submit_time_us = local_clock_us();
> > -	closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
> > +	closure_bio_submit(bio, bio->bi_private);
> >  }
> >  
> >  void bch_submit_bbio(struct bio *bio, struct cache_set *c,
> > diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> > index 418607a..727ca9b 100644
> > --- a/drivers/md/bcache/journal.c
> > +++ b/drivers/md/bcache/journal.c
> > @@ -61,7 +61,7 @@ reread:		left = ca->sb.bucket_size - offset;
> >  		bio->bi_private = &cl;
> >  		bch_bio_map(bio, data);
> >  
> > -		closure_bio_submit(bio, &cl, ca);
> > +		closure_bio_submit(bio, &cl);
> >  		closure_sync(&cl);
> >  
> >  		/* This function could be simpler now since we no longer write
> > @@ -648,7 +648,7 @@ static void journal_write_unlocked(struct closure *cl)
> >  	spin_unlock(&c->journal.lock);
> >  
> >  	while ((bio = bio_list_pop(&list)))
> > -		closure_bio_submit(bio, cl, c->cache[0]);
> > +		closure_bio_submit(bio, cl);
> >  
> >  	continue_at(cl, journal_write_done, NULL);
> >  }
> > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> > index f292790..ab093a8 100644
> > --- a/drivers/md/bcache/request.c
> > +++ b/drivers/md/bcache/request.c
> > @@ -718,7 +718,7 @@ static void cached_dev_read_error(struct closure *cl)
> >  
> >  		/* XXX: invalidate cache */
> >  
> > -		closure_bio_submit(bio, cl, s->d);
> > +		closure_bio_submit(bio, cl);
> >  	}
> >  
> >  	continue_at(cl, cached_dev_cache_miss_done, NULL);
> > @@ -841,7 +841,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
> >  	s->cache_miss	= miss;
> >  	s->iop.bio	= cache_bio;
> >  	bio_get(cache_bio);
> > -	closure_bio_submit(cache_bio, &s->cl, s->d);
> > +	closure_bio_submit(cache_bio, &s->cl);
> >  
> >  	return ret;
> >  out_put:
> > @@ -849,7 +849,7 @@ out_put:
> >  out_submit:
> >  	miss->bi_end_io		= request_endio;
> >  	miss->bi_private	= &s->cl;
> > -	closure_bio_submit(miss, &s->cl, s->d);
> > +	closure_bio_submit(miss, &s->cl);
> >  	return ret;
> >  }
> >  
> > @@ -914,7 +914,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
> >  
> >  		if (!(bio->bi_rw & REQ_DISCARD) ||
> >  		    blk_queue_discard(bdev_get_queue(dc->bdev)))
> > -			closure_bio_submit(bio, cl, s->d);
> > +			closure_bio_submit(bio, cl);
> >  	} else if (s->iop.writeback) {
> >  		bch_writeback_add(dc);
> >  		s->iop.bio = bio;
> > @@ -929,12 +929,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
> >  			flush->bi_end_io = request_endio;
> >  			flush->bi_private = cl;
> >  
> > -			closure_bio_submit(flush, cl, s->d);
> > +			closure_bio_submit(flush, cl);
> >  		}
> >  	} else {
> >  		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
> >  
> > -		closure_bio_submit(bio, cl, s->d);
> > +		closure_bio_submit(bio, cl);
> >  	}
> >  
> >  	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
> > @@ -950,7 +950,7 @@ static void cached_dev_nodata(struct closure *cl)
> >  		bch_journal_meta(s->iop.c, cl);
> >  
> >  	/* If it's a flush, we send the flush to the backing device too */
> > -	closure_bio_submit(bio, cl, s->d);
> > +	closure_bio_submit(bio, cl);
> >  
> >  	continue_at(cl, cached_dev_bio_complete, NULL);
> >  }
> > @@ -994,7 +994,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
> >  		    !blk_queue_discard(bdev_get_queue(dc->bdev)))
> >  			bio_endio(bio, 0);
> >  		else
> > -			bch_generic_make_request(bio, &d->bio_split_hook);
> > +			generic_make_request(bio);
> >  	}
> >  }
> >  
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 94980bf..db70c9e 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;
> >  
> >  #define BTREE_MAX_PAGES		(256 * 1024 / PAGE_SIZE)
> >  
> > -static void bio_split_pool_free(struct bio_split_pool *p)
> > -{
> > -	if (p->bio_split_hook)
> > -		mempool_destroy(p->bio_split_hook);
> > -
> > -	if (p->bio_split)
> > -		bioset_free(p->bio_split);
> > -}
> > -
> > -static int bio_split_pool_init(struct bio_split_pool *p)
> > -{
> > -	p->bio_split = bioset_create(4, 0);
> > -	if (!p->bio_split)
> > -		return -ENOMEM;
> > -
> > -	p->bio_split_hook = mempool_create_kmalloc_pool(4,
> > -				sizeof(struct bio_split_hook));
> > -	if (!p->bio_split_hook)
> > -		return -ENOMEM;
> > -
> > -	return 0;
> > -}
> > -
> >  /* Superblock */
> >  
> >  static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
> > @@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
> >  	bio->bi_private = ca;
> >  	bch_bio_map(bio, ca->disk_buckets);
> >  
> > -	closure_bio_submit(bio, &ca->prio, ca);
> > +	closure_bio_submit(bio, &ca->prio);
> >  	closure_sync(cl);
> >  }
> >  
> > @@ -757,7 +734,6 @@ static void bcache_device_free(struct bcache_device *d)
> >  		put_disk(d->disk);
> >  	}
> >  
> > -	bio_split_pool_free(&d->bio_split_hook);
> >  	if (d->bio_split)
> >  		bioset_free(d->bio_split);
> >  	kvfree(d->full_dirty_stripes);
> > @@ -804,7 +780,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
> >  		return minor;
> >  
> >  	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
> > -	    bio_split_pool_init(&d->bio_split_hook) ||
> >  	    !(d->disk = alloc_disk(1))) {
> >  		ida_simple_remove(&bcache_minor, minor);
> >  		return -ENOMEM;
> > @@ -1793,8 +1768,6 @@ void bch_cache_release(struct kobject *kobj)
> >  		ca->set->cache[ca->sb.nr_this_dev] = NULL;
> >  	}
> >  
> > -	bio_split_pool_free(&ca->bio_split_hook);
> > -
> >  	free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
> >  	kfree(ca->prio_buckets);
> >  	vfree(ca->buckets);
> > @@ -1839,8 +1812,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
> >  					  ca->sb.nbuckets)) ||
> >  	    !(ca->prio_buckets	= kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
> >  					  2, GFP_KERNEL)) ||
> > -	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)) ||
> > -	    bio_split_pool_init(&ca->bio_split_hook))
> > +	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)))
> >  		return -ENOMEM;
> >  
> >  	ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
> > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> > index 1d04c48..cf2cbc2 100644
> > --- a/drivers/md/bcache/util.h
> > +++ b/drivers/md/bcache/util.h
> > @@ -4,6 +4,7 @@
> >  
> >  #include <linux/blkdev.h>
> >  #include <linux/errno.h>
> > +#include <linux/blkdev.h>
> >  #include <linux/kernel.h>
> >  #include <linux/llist.h>
> >  #include <linux/ratelimit.h>
> > @@ -570,10 +571,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
> >  	return bdev->bd_inode->i_size >> 9;
> >  }
> >  
> > -#define closure_bio_submit(bio, cl, dev)				\
> > +#define closure_bio_submit(bio, cl)					\
> >  do {									\
> >  	closure_get(cl);						\
> > -	bch_generic_make_request(bio, &(dev)->bio_split_hook);		\
> > +	generic_make_request(bio);					\
> >  } while (0)
> >  
> >  uint64_t bch_crc64_update(uint64_t, const void *, size_t);
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index f1986bc..ca38362 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
> >  	io->bio.bi_bdev		= io->dc->bdev;
> >  	io->bio.bi_end_io	= dirty_endio;
> >  
> > -	closure_bio_submit(&io->bio, cl, &io->dc->disk);
> > +	closure_bio_submit(&io->bio, cl);
> >  
> >  	continue_at(cl, write_dirty_finish, system_wq);
> >  }
> > @@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
> >  {
> >  	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
> >  
> > -	closure_bio_submit(&io->bio, cl, &io->dc->disk);
> > +	closure_bio_submit(&io->bio, cl);
> >  
> >  	continue_at(cl, write_dirty, system_wq);
> >  }
> > -- 
> > 2.1.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: bcache: remove driver private bio splitting code
  2016-01-13  2:00     ` Eric Wheeler
@ 2016-01-13  5:54       ` Vojtech Pavlik
  2016-01-13 23:03         ` Eric Wheeler
  0 siblings, 1 reply; 57+ messages in thread
From: Vojtech Pavlik @ 2016-01-13  5:54 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: linux-bcache

On Tue, Jan 12, 2016 at 06:00:49PM -0800, Eric Wheeler wrote:

Hello Eric,

> Have you tested the patch below in SLE12-* when bcache is backed by md
> raid5/6?
> 
> FYI: I was compared the drivers/md/bcache/io.c functions in the various
> branches here:
>   https://github.com/openSUSE/kernel
> I compared the presence of bch_generic_make_request() (which the patch
> below removes).  It looks like the branch SLE12-SP2 has the patch, but
> version before SLE12-SP2 and openSUSE-* do not (as they still have
> bch_generic_make_request).
> 
> Since the patch does exist in SLE12-SP2, I'm guessing that is been tested,
> though I am curious if it has been tested specifically when being backed
> by md raid5/6 so that queue->limits->partial_stripes_expensive is nonzero.
> 
> If you have and it is stable, then I want to get it to Kent and Jens for 
> upstream integration.

The SLES12-SP2 kernel branch is very fresh, created this week. So while
the patch was tested by Johannes before adding it, and by our per-commit
automated tests, it didn't go through fully qualified QA test cycle yet.
So I won't say it's proven stable just yet.

-- 
Vojtech Pavlik
Director SUSE Labs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: bcache: remove driver private bio splitting code
  2016-01-13  5:54       ` Vojtech Pavlik
@ 2016-01-13 23:03         ` Eric Wheeler
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Wheeler @ 2016-01-13 23:03 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-bcache

> On Tue, Jan 12, 2016 at 06:00:49PM -0800, Eric Wheeler wrote:
> 
> Hello Eric,
> 
> > Have you tested the patch below in SLE12-* when bcache is backed by md
> > raid5/6?
> > 
> > FYI: I was compared the drivers/md/bcache/io.c functions in the various
> > branches here:
> >   https://github.com/openSUSE/kernel
> > I compared the presence of bch_generic_make_request() (which the patch
> > below removes).  It looks like the branch SLE12-SP2 has the patch, but
> > version before SLE12-SP2 and openSUSE-* do not (as they still have
> > bch_generic_make_request).
> > 
> > Since the patch does exist in SLE12-SP2, I'm guessing that is been tested,
> > though I am curious if it has been tested specifically when being backed
> > by md raid5/6 so that queue->limits->partial_stripes_expensive is nonzero.
> > 
> > If you have and it is stable, then I want to get it to Kent and Jens for 
> > upstream integration.
> 
> The SLES12-SP2 kernel branch is very fresh, created this week. So while
> the patch was tested by Johannes before adding it, and by our per-commit
> automated tests, it didn't go through fully qualified QA test cycle yet.
> So I won't say it's proven stable just yet.

Good to know, thank you for that!  

I encourage you to include md-based raid5/6 backed bcache volumes in your 
testing if it is not already.  This particular patch may affect that use 
case.  Please keep us posted, I look forward to learning about the 
bcache-specific test cycle results.

-Eric


> 
> -- 
> Vojtech Pavlik
> Director SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2016-01-13 23:03 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-12  7:07 [PATCH v6 00/11] simplify block layer based on immutable biovecs Ming Lin
2015-08-12  7:07 ` [PATCH v6 01/11] block: make generic_make_request handle arbitrarily sized bios Ming Lin
2015-08-12  7:07   ` Ming Lin
2015-08-12  7:07 ` [PATCH v6 02/11] block: simplify bio_add_page() Ming Lin
2015-08-12  7:07 ` [PATCH v6 03/11] bcache: remove driver private bio splitting code Ming Lin
2016-01-08  1:53   ` Eric Wheeler
2016-01-13  2:00     ` Eric Wheeler
2016-01-13  5:54       ` Vojtech Pavlik
2016-01-13 23:03         ` Eric Wheeler
2015-08-12  7:07 ` [PATCH v6 04/11] btrfs: remove bio splitting and merge_bvec_fn() calls Ming Lin
2015-08-12  7:07 ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
2015-10-13 11:50   ` Christoph Hellwig
2015-10-13 11:50     ` Christoph Hellwig
2015-10-13 17:44     ` Ming Lin
2015-10-13 17:44       ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Ming Lin
2015-10-14 13:27       ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Christoph Hellwig
2015-10-14 13:27         ` Christoph Hellwig
2015-10-14 16:38         ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B Keith Busch
2015-10-14 16:38           ` Keith Busch
2015-10-14 16:50           ` Christoph Hellwig
2015-10-14 16:50             ` Christoph Hellwig
2015-10-21 16:02         ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Mike Snitzer
2015-10-21 16:02           ` Mike Snitzer
2015-10-21 16:19           ` Mike Snitzer
2015-10-21 16:19             ` Mike Snitzer
2015-10-21 16:33             ` Martin K. Petersen
2015-10-21 16:33               ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Martin K. Petersen
2015-10-21 17:33             ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
2015-10-21 17:33               ` Ming Lin
2015-10-21 18:18               ` Mike Snitzer
2015-10-21 18:18                 ` Mike Snitzer
2015-10-21 20:13                 ` Ming Lin
2015-10-21 20:13                   ` Ming Lin
2015-10-22 10:24                   ` Christoph Hellwig
2015-10-22 10:24                     ` Christoph Hellwig
2015-10-22 11:22                     ` Christoph Hellwig
2015-10-22 11:22                       ` Christoph Hellwig
2015-10-21  7:21       ` Christoph Hellwig
2015-10-21  7:21         ` Christoph Hellwig
2015-10-21 13:39         ` Jeff Moyer
2015-10-21 13:39           ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard, write_same} Jeff Moyer
2015-10-21 15:01           ` [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same} Ming Lin
2015-10-21 15:01             ` Ming Lin
2015-10-21 15:33             ` Mike Snitzer
2015-10-21 15:33               ` Mike Snitzer
2015-10-21 17:18               ` Ming Lin
2015-10-21 17:18                 ` Ming Lin
2015-08-12  7:07 ` [PATCH v6 06/11] md/raid5: split bio for chunk_aligned_read Ming Lin
2015-08-12  7:07 ` [PATCH v6 07/11] md/raid5: get rid of bio_fits_rdev() Ming Lin
2015-08-12  7:07 ` [PATCH v6 08/11] block: kill merge_bvec_fn() completely Ming Lin
2015-08-12  7:07 ` [PATCH v6 09/11] fs: use helper bio_add_page() instead of open coding on bi_io_vec Ming Lin
2015-08-12  7:07 ` [PATCH v6 10/11] block: remove bio_get_nr_vecs() Ming Lin
2015-08-12  7:07 ` [PATCH v6 11/11] Documentation: update notes in biovecs about arbitrarily sized bios Ming Lin
2015-08-13 16:51 ` [PATCH v6 00/11] simplify block layer based on immutable biovecs Jens Axboe
2015-08-13 17:03   ` Ming Lin
2015-08-13 17:07     ` Jens Axboe
2015-08-13 17:36       ` Ming Lin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.