linux-btrfs.vger.kernel.org archive mirror
* [PATCH V15 00/18] block: support multi-page bvec
@ 2019-02-15 11:13 Ming Lei
  2019-02-15 11:13 ` [PATCH V15 01/18] btrfs: look at bi_size for repair decisions Ming Lei
                   ` (19 more replies)
  0 siblings, 20 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Hi,

This patchset brings multi-page bvecs into the block layer:

1) what is multi-page bvec?

Multi-page bvec means that one 'struct bio_vec' can hold multiple
physically contiguous pages, instead of the single page per bvec that the
Linux kernel has used for a long time.
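
To make the definition above concrete, here is a minimal userspace sketch. It is not kernel code: 'struct sketch_bvec', SKETCH_PAGE_SIZE and sketch_bvec_pages() are hypothetical names, the page pointer is modelled as a page frame number, and a 4K page size is assumed.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Hypothetical userspace stand-in for struct bio_vec. */
struct sketch_bvec {
	unsigned long bv_pfn;		/* stand-in for bv_page: first page frame */
	unsigned int  bv_len;		/* length in bytes, may exceed one page */
	unsigned int  bv_offset;	/* byte offset of the buffer from the first page */
};

/* Number of physically contiguous pages covered by one bvec. */
static unsigned int sketch_bvec_pages(const struct sketch_bvec *bv)
{
	unsigned int first, last;

	assert(bv->bv_len > 0);
	first = bv->bv_offset / SKETCH_PAGE_SIZE;
	last = (bv->bv_offset + bv->bv_len - 1) / SKETCH_PAGE_SIZE;
	return last - first + 1;
}
```

With a single-page bvec this always yields 1; with a multi-page bvec, a bvec of 8192 bytes at offset 0 covers two pages, and the same length at offset 512 covers three.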

2) why is multi-page bvec introduced?

Kent proposed the idea[1] first.

As system RAM becomes much bigger than before, and huge pages, transparent
huge pages and memory compaction are widely used, it is now fairly common
to see physically contiguous pages from the fs in I/O. On the other hand, from
the block layer's view, it isn't necessary to store intermediate pages in the
bvec; it is enough to store just the physically contiguous 'segment' in each
io vector.

Also, huge pages are being brought to filesystems and swap[2][6], so we can
do IO on a whole hugepage each time[3], which requires that one bio can
transfer at least one huge page at a time. It turns out that simply changing
BIO_MAX_PAGES isn't flexible enough[3][5]. Multi-page bvec fits this case
very well. As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be
configured much bigger, such as 512, which requires at least two 4K pages to
hold the bvec table.
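
The "two 4K pages" figure can be checked with a little arithmetic. The sketch below assumes a 64-bit layout in which one bvec is a page pointer plus two 32-bit fields (16 bytes); 'struct sketch_bvec64' and sketch_bvec_table_pages() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Assumed 64-bit layout of one bvec: 8-byte pointer + two 4-byte fields. */
struct sketch_bvec64 {
	uint64_t bv_page;
	uint32_t bv_len;
	uint32_t bv_offset;
};

/* Number of pages needed to hold a bvec table with nr_vecs entries. */
static unsigned int sketch_bvec_table_pages(unsigned int nr_vecs)
{
	size_t bytes;

	assert(nr_vecs > 0);
	bytes = (size_t)nr_vecs * sizeof(struct sketch_bvec64);
	return (unsigned int)((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}
```

Under these assumptions a table of 512 entries takes 8192 bytes (two pages), while 256 entries take exactly 4096 bytes and fit in one page.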

With multi-page bvec:

- Inside the block layer, both bio splitting and sg mapping can become more
efficient than before by traversing each physically contiguous
'segment' instead of each page.

- Segment handling in the block layer can be improved considerably in future,
since it should be quite easy to convert a multi-page bvec into a segment. For
example, we might store the segment in each bvec directly in future.

- bio size can be increased, which should in theory improve some
high-bandwidth IO cases[4].

- There is an opportunity in future to reduce the memory footprint of bvecs.

3) how is multi-page bvec implemented in this patchset?

Patches 1 ~ 3 prepare for supporting multi-page bvec.

Patches 4 ~ 14 implement multipage bvec in block layer:

	- put all tricks into bvec/bio/rq iterators, and as long as
	drivers and fs use these standard iterators, they are happy
	with multipage bvec

	- introduce bio_for_each_bvec() to iterate over multipage bvec for splitting
	bio and mapping sg

	- keep current bio_for_each_segment*() to iterate over singlepage bvec and
	make sure current users won't be broken; especially, convert to this
	new helper prototype in single patch 21 given it is basically a mechanical
	conversion

	- deal with iomap & xfs's sub-pagesize io vec in patch 13

	- enable multipage bvec in patch 14

Patch 15 redefines BIO_MAX_PAGES as 256.

Patch 16 documents usages of bio iterator helpers.

Patch 17~18 kills NO_SG_MERGE.

These patches can be found in the following git tree:

	git:  https://github.com/ming1/linux.git  v5.0-blk_mp_bvec_v14

Lots of tests (blktests, xfstests, ltp io, ...) have been run with this
patchset, and no regressions have been seen.

Thanks to Christoph for reviewing the early version and providing very good
suggestions, such as: introduce bio_init_with_vec_table(), remove other
unnecessary helpers for cleanup, and so on.

Thanks to Christoph and Omar for reviewing V10/V11/V12 and providing lots of
helpful comments.

V15:
	- rename bio_for_each_mp_bvec/rq_for_each_mp_bvec as
	  bio_for_each_bvec/rq_for_each_bvec, as suggested by Christoph,
	  so the mp_bvec name is only used by bvec helpers

V14:
	- drop patch(patch 4 in V13) for renaming bvec helpers, as suggested by Jens
	- use mp_bvec_* as multi-page bvec helper name
	- fix one build issue, which is caused by missing one conversion of
	bio_for_each_segment_all in fs/gfs2
	- fix one 32bit ARCH specific issue caused by segment boundary mask
	overflow

V13:
	- rebase on v5.0-rc2
	- address Omar's comment on patch 1 of V12 by using V11's approach
	- rename one local variable in patch 15 as suggested by Christoph

V12:
	- deal with non-cluster by max segment size & segment boundary limit
	- rename bvec helper's name
	- revert new change on bvec_iter_advance() in V11
	- introduce rq_for_each_bvec()
	- use simpler check on enabling multi-page bvec
	- fix Document change

V11:
	- address most of the review comments from Omar and Christoph
	- rename mp_bvec_* as segment_* helpers
	- remove 'mp' parameter from bvec_iter_advance() and related helpers
	- cleanup patch on bvec_split_segs() and blk_bio_segment_split(),
	  remove unnecessary checks
	- simplify bvec_last_segment()
	- drop bio_pages_all()
	- introduce dedicated functions/file for handling non-cluster bio for
	avoiding checking queue cluster before adding page to bio
	- introduce bio_try_merge_segment() for simplifying iomap/xfs page
	  accounting code
	- Fix Document change

V10:
	- no code change, just add more people and lists to the patch's CC list,
	as suggested by Christoph and Dave Chinner
V9:
	- fix regression on iomap's sub-pagesize io vec, covered by patch 13
V8:
	- remove prepare patches which all are merged to linus tree
	- rebase on for-4.21/block
	- address comments on V7
	- add patches of killing NO_SG_MERGE

V7:
	- include Christoph and Mike's bio_clone_bioset() patches, which is
	  actually prepare patches for multipage bvec
	- address Christoph's comments

V6:
	- avoid introducing lots of renaming, follow Jens's suggestion of
	using the name of chunk for multipage io vector
	- include Christoph's three prepare patches
	- decrease stack usage for using bio_for_each_chunk_segment_all()
	- address Kent's comment

V5:
	- remove some of prepare patches, which have been merged already
	- add bio_clone_seg_bioset() to fix DM's bio clone, which
	is introduced by 18a25da84354c6b (dm: ensure bio submission follows
	a depth-first tree walk)
	- rebase on the latest block for-v4.18

V4:
	- rename bio_for_each_segment*() as bio_for_each_page*(), rename
	bio_segments() as bio_pages(), rename rq_for_each_segment() as
	rq_for_each_pages(), because these helpers never return real
	segment, and they always return single page bvec
	
	- introducing segment_for_each_page_all()

	- introduce new bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
	for returning real multipage segment

	- rewrite segment_last_page()

	- rename bvec iterator helper as suggested by Christoph

	- replace comment with applying bio helpers as suggested by Christoph

	- document usage of bio iterator helpers

	- redefine BIO_MAX_PAGES as 256 to make the biggest bvec table
	accommodated in 4K page

	- move bio_alloc_pages() into bcache as suggested by Christoph

V3:
	- rebase on v4.13-rc3 with for-next of block tree
	- run more xfstests: xfs/ext4 over NVMe, Sata, DM(linear),
	MD(raid1), and no regressions were triggered
	- add Reviewed-by on some btrfs patches
	- remove two MD patches because both are merged to linus tree
	  already

V2:
	- bvec table direct access in raid has been cleaned, so NO_MP
	flag is dropped
	- rebase on recent Neil Brown's change on bio and bounce code
	- reorganize the patchset

V1:
	- against v4.10-rc1 and some cleanup in V0 are in -linus already
	- handle queue_virt_boundary() in mp bvec change and make NVMe happy
	- further BTRFS cleanup
	- remove QUEUE_FLAG_SPLIT_MP
	- rename for two new helpers of bio_for_each_segment_all()
	- fix bounce conversion
	- address comments in V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=149569484500007&r=1&w=2
[6], http://marc.info/?t=149820215300004&r=1&w=2



Christoph Hellwig (1):
  btrfs: look at bi_size for repair decisions

Ming Lei (17):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  block: use bio_for_each_bvec() to compute multi-page bvec count
  block: use bio_for_each_bvec() to map sg
  block: introduce mp_bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: use mp_bvec_last_segment to get bio's last page
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
    bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
    bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 Documentation/block/biovecs.txt   |  25 +++++
 block/bio.c                       |  49 ++++++---
 block/blk-merge.c                 | 210 +++++++++++++++++++++++++-------------
 block/blk-mq-debugfs.c            |   2 -
 block/blk-mq.c                    |   3 -
 block/bounce.c                    |   6 +-
 drivers/block/loop.c              |  22 ++--
 drivers/block/nbd.c               |   2 +-
 drivers/block/rbd.c               |   2 +-
 drivers/block/skd_main.c          |   1 -
 drivers/block/xen-blkfront.c      |   2 +-
 drivers/md/bcache/btree.c         |   3 +-
 drivers/md/bcache/util.c          |   6 +-
 drivers/md/dm-crypt.c             |   3 +-
 drivers/md/dm-rq.c                |   2 +-
 drivers/md/dm-table.c             |  13 ---
 drivers/md/raid1.c                |   3 +-
 drivers/mmc/core/queue.c          |   3 +-
 drivers/scsi/scsi_lib.c           |   2 +-
 drivers/staging/erofs/data.c      |   3 +-
 drivers/staging/erofs/unzip_vle.c |   3 +-
 fs/block_dev.c                    |   6 +-
 fs/btrfs/compression.c            |   3 +-
 fs/btrfs/disk-io.c                |   3 +-
 fs/btrfs/extent_io.c              |  16 +--
 fs/btrfs/inode.c                  |   6 +-
 fs/btrfs/raid56.c                 |   3 +-
 fs/buffer.c                       |   5 +-
 fs/crypto/bio.c                   |   3 +-
 fs/direct-io.c                    |   4 +-
 fs/exofs/ore.c                    |   3 +-
 fs/exofs/ore_raid.c               |   3 +-
 fs/ext4/page-io.c                 |   3 +-
 fs/ext4/readpage.c                |   3 +-
 fs/f2fs/data.c                    |   9 +-
 fs/gfs2/lops.c                    |   9 +-
 fs/gfs2/meta_io.c                 |   3 +-
 fs/iomap.c                        |  10 +-
 fs/mpage.c                        |   3 +-
 fs/xfs/xfs_aops.c                 |   9 +-
 include/linux/bio.h               |  37 ++++---
 include/linux/blk-mq.h            |   1 -
 include/linux/blkdev.h            |   5 +-
 include/linux/bvec.h              | 106 ++++++++++++++-----
 44 files changed, 404 insertions(+), 214 deletions(-)

-- 
2.9.5



* [PATCH V15 01/18] btrfs: look at bi_size for repair decisions
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 02/18] block: don't use bio->bi_vcnt to figure out segment number Ming Lei
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel

From: Christoph Hellwig <hch@lst.de>

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
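
As a rough illustration of why the vector count is an implementation artifact, the userspace sketch below derives the page count from the byte size alone. SKETCH_PAGE_SHIFT (4K pages), sketch_failed_bio_pages() and sketch_example() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */

/* Page count derived from the remaining byte size, as in the patch below. */
static unsigned int sketch_failed_bio_pages(unsigned int bi_size)
{
	return bi_size >> SKETCH_PAGE_SHIFT;
}

/* Worked example: a bio holding one multi-page bvec that covers 16 KiB. */
void sketch_example(void)
{
	unsigned int bi_vcnt = 1;	/* one vector entry ... */
	unsigned int bi_size = 16384;	/* ... but four pages of data */

	assert(bi_vcnt == 1);
	assert(sketch_failed_bio_pages(bi_size) == 4);
}
```

Once a single vector entry can span several pages, a count of vector entries says nothing about how much data is worth retrying, whereas the size-based count still does.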

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 2 +-
 include/linux/bio.h  | 6 ------
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 52abe4082680..dc8ba3ee515d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2350,7 +2350,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	int read_mode = 0;
 	blk_status_t status;
 	int ret;
-	unsigned failed_bio_pages = bio_pages_all(failed_bio);
+	unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..72b4f7be2106 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -263,12 +263,6 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
 		bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-	return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
 	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5



* [PATCH V15 02/18] block: don't use bio->bi_vcnt to figure out segment number
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
  2019-02-15 11:13 ` [PATCH V15 01/18] btrfs: look at bi_size for repair decisions Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 03/18] block: remove bvec_iter_rewind() Ming Lei
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio, even when the CLONED flag isn't set on this bio,
because this bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(). It shouldn't
cause any performance loss now, because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set at the same time, since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
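
The difference between the vector count and the iterator-based count can be sketched in userspace as follows. 'struct sketch_iter' and sketch_bio_segments() are hypothetical stand-ins that only mimic the idea of walking from the current iterator position; they are not the kernel's bio_segments().

```c
#include <assert.h>

/* Hypothetical stand-in for the fields of struct bvec_iter used here. */
struct sketch_iter {
	unsigned int bi_size;		/* bytes remaining in the bio */
	unsigned int bi_idx;		/* current index into the bvec table */
	unsigned int bi_bvec_done;	/* bytes already consumed in that bvec */
};

/* Count the vector entries still covered by the iterator. */
static unsigned int sketch_bio_segments(const unsigned int *bv_len,
					struct sketch_iter it)
{
	unsigned int segs = 0;

	while (it.bi_size) {
		unsigned int left = bv_len[it.bi_idx] - it.bi_bvec_done;
		unsigned int step = left < it.bi_size ? left : it.bi_size;

		assert(step > 0);
		segs++;
		it.bi_size -= step;
		it.bi_bvec_done += step;
		if (it.bi_bvec_done == bv_len[it.bi_idx]) {
			it.bi_idx++;
			it.bi_bvec_done = 0;
		}
	}
	return segs;
}
```

For a table of four 4096-byte entries, a bio advanced past the first two entries still has a vector count of 4, but the walk reports 2; a bio split so that only 6144 bytes remain from the start also reports 2.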

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-merge.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 71e9ac03f621..f85d878f313d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -367,13 +367,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-	unsigned short seg_cnt;
-
-	/* estimate segment number by bi_vcnt for non-cloned bio */
-	if (bio_flagged(bio, BIO_CLONED))
-		seg_cnt = bio_segments(bio);
-	else
-		seg_cnt = bio->bi_vcnt;
+	unsigned short seg_cnt = bio_segments(bio);
 
 	if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
 			(seg_cnt < queue_max_segments(q)))
-- 
2.9.5



* [PATCH V15 03/18] block: remove bvec_iter_rewind()
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
  2019-02-15 11:13 ` [PATCH V15 01/18] btrfs: look at bi_size for repair decisions Ming Lei
  2019-02-15 11:13 ` [PATCH V15 02/18] block: don't use bio->bi_vcnt to figure out segment number Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 04/18] block: introduce multi-page bvec helpers Ming Lei
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Commit 7759eb23fd980 ("block: remove bio_rewind_iter()") removes
bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
so remove it.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bvec.h | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ba0ae40e77c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -92,30 +92,6 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
-static inline bool bvec_iter_rewind(const struct bio_vec *bv,
-				     struct bvec_iter *iter,
-				     unsigned int bytes)
-{
-	while (bytes) {
-		unsigned len = min(bytes, iter->bi_bvec_done);
-
-		if (iter->bi_bvec_done == 0) {
-			if (WARN_ONCE(iter->bi_idx == 0,
-				      "Attempted to rewind iter beyond "
-				      "bvec's boundaries\n")) {
-				return false;
-			}
-			iter->bi_idx--;
-			iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
-			continue;
-		}
-		bytes -= len;
-		iter->bi_size += len;
-		iter->bi_bvec_done -= len;
-	}
-	return true;
-}
-
 #define for_each_bvec(bvl, bio_vec, iter, start)			\
 	for (iter = (start);						\
 	     (iter).bi_size &&						\
-- 
2.9.5



* [PATCH V15 04/18] block: introduce multi-page bvec helpers
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (2 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 03/18] block: remove bvec_iter_rewind() Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 05/18] block: introduce bio_for_each_bvec() and rq_for_each_bvec() Ming Lei
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

This patch introduces the 'mp_bvec_iter_*' helpers for multi-page bvec
support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers are the interfaces for the current
bvec iterator, which drivers, fs, dm, etc. assume to be single-page.
These helpers now build single-page bvecs on the fly, so current
bio/bvec users are not broken and need no changes.

Some multi-page bvec background follows:

- bvecs stored in bio->bi_io_vec are always multi-page style

- a bvec (struct bio_vec) represents one physically contiguous I/O
  buffer. With multi-page bvec support the buffer may include more
  than one page, and all the pages represented by one bvec are
  physically contiguous. Before multi-page bvec support, at most one
  page was included in one bvec; we call that a single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone assumes that one bvec includes only a single
  page, so we keep the sp interface unchanged; for example,
  bio_for_each_segment() still returns single-page bvecs

- bio_for_each_segment_all() will return single-page bvecs too

- during iteration, the iterator variable (struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  unchanged

- the returned (copied) single-page bvec is built on the fly by the bvec
  helpers from the stored multi-page bvec
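
How a single-page bvec can be derived from a stored multi-page bvec is sketched below in userspace. The arithmetic follows the macros in the diff further down, but 'struct sketch_bvec', sketch_sp_bvec() and sketch_count_sp() are hypothetical names; the page pointer is modelled as a page frame number, nth_page() as simple addition, and the bi_size clamp is left out for brevity.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct sketch_bvec {
	unsigned long bv_pfn;		/* stand-in for bv_page */
	unsigned int  bv_len;
	unsigned int  bv_offset;
};

/* Build the single-page piece starting 'done' bytes into a multi-page bvec. */
static struct sketch_bvec sketch_sp_bvec(const struct sketch_bvec *mp,
					 unsigned int done)
{
	unsigned int mp_off, mp_len, sp_off, room;
	struct sketch_bvec sp;

	assert(done < mp->bv_len);
	mp_off = mp->bv_offset + done;		/* cf. mp_bvec_iter_offset() */
	mp_len = mp->bv_len - done;		/* cf. mp_bvec_iter_len() */
	sp_off = mp_off % SKETCH_PAGE_SIZE;	/* cf. bvec_iter_offset() */
	room = SKETCH_PAGE_SIZE - sp_off;

	sp.bv_pfn = mp->bv_pfn + mp_off / SKETCH_PAGE_SIZE;	/* cf. bvec_iter_page() */
	sp.bv_offset = sp_off;
	sp.bv_len = mp_len < room ? mp_len : room;		/* cf. bvec_iter_len() */
	return sp;
}

/* Count the single-page pieces seen while walking one multi-page bvec. */
static unsigned int sketch_count_sp(const struct sketch_bvec *mp)
{
	unsigned int done = 0, n = 0;

	while (done < mp->bv_len) {
		done += sketch_sp_bvec(mp, done).bv_len;
		n++;
	}
	return n;
}
```

For a stored bvec of 9216 bytes at offset 512, the walk yields three single-page pieces of 3584, 4096 and 1536 bytes on consecutive pages, which is what a single-page iterator user would see.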

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bvec.h | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ba0ae40e77c9..0ae729b1c9fe 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include <linux/kernel.h>
 #include <linux/bug.h>
 #include <linux/errno.h>
+#include <linux/mm.h>
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,39 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter)				\
+/* multi-page (mp_bvec) helpers */
+#define mp_bvec_iter_page(bvec, iter)				\
 	(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)				\
+#define mp_bvec_iter_len(bvec, iter)				\
 	min((iter).bi_size,					\
 	    __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)				\
+#define mp_bvec_iter_offset(bvec, iter)				\
 	(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define mp_bvec_iter_page_idx(bvec, iter)			\
+	(mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+#define mp_bvec_iter_bvec(bvec, iter)				\
+((struct bio_vec) {						\
+	.bv_page	= mp_bvec_iter_page((bvec), (iter)),	\
+	.bv_len		= mp_bvec_iter_len((bvec), (iter)),	\
+	.bv_offset	= mp_bvec_iter_offset((bvec), (iter)),	\
+})
+
+/* For building single-page bvec in flight */
+ #define bvec_iter_offset(bvec, iter)				\
+	(mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)				\
+	min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),		\
+	      PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define bvec_iter_page(bvec, iter)				\
+	nth_page(mp_bvec_iter_page((bvec), (iter)),		\
+		 mp_bvec_iter_page_idx((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
 	.bv_page	= bvec_iter_page((bvec), (iter)),	\
-- 
2.9.5



* [PATCH V15 05/18] block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (3 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 04/18] block: introduce multi-page bvec helpers Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 06/18] block: use bio_for_each_bvec() to compute multi-page bvec count Ming Lei
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

bio_for_each_bvec() is used for iterating over multi-page bvecs in the bio
split & merge code.

rq_for_each_bvec() can be used by drivers which can handle
multi-page bvecs directly; so far loop is one perfect use case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bio.h    | 10 ++++++++++
 include/linux/blkdev.h |  4 ++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 72b4f7be2106..7ef8a7505c0a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -156,6 +156,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)				\
 	__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start)		\
+	for (iter = (start);						\
+	     (iter).bi_size &&						\
+		((bvl = mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
+	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
+
+/* iterate over multi-page bvec */
+#define bio_for_each_bvec(bvl, bio, iter)			\
+	__bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3603270cb82d..b6292d469ea4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,6 +792,10 @@ struct req_iterator {
 	__rq_for_each_bio(_iter.bio, _rq)			\
 		bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_bvec(bvl, _rq, _iter)			\
+	__rq_for_each_bio(_iter.bio, _rq)			\
+		bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)				\
 		(_iter.bio->bi_next == NULL &&			\
 		 bio_iter_last(bvec, _iter.iter))
-- 
2.9.5



* [PATCH V15 06/18] block: use bio_for_each_bvec() to compute multi-page bvec count
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (4 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 05/18] block: introduce bio_for_each_bvec() and rq_for_each_bvec() Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 07/18] block: use bio_for_each_bvec() to map sg Ming Lei
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

First, it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_bvec() is used, a bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered in
this situation too.
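
The splitting described above can be sketched in userspace as follows. This is a simplified model, not the patch's code: 'struct sketch_limits', sketch_max_seg_size() and sketch_split_segs() are hypothetical names, the queue limits are plain struct fields, and the virt boundary check and the front/back segment size bookkeeping are deliberately omitted.

```c
#include <assert.h>

#define SKETCH_DEFAULT_BOUNDARY 0xFFFFFFFFUL	/* assumed "no boundary" mask */

/* Hypothetical stand-in for the queue limits used by the split logic. */
struct sketch_limits {
	unsigned long seg_boundary_mask;
	unsigned int  max_segment_size;
	unsigned int  max_segments;
};

/* Largest segment that may start at 'offset' without crossing the boundary. */
static unsigned int sketch_max_seg_size(const struct sketch_limits *lim,
					unsigned int offset)
{
	unsigned long room;

	assert(lim->max_segment_size > 0);
	/* The default mask means no boundary limit; this also avoids overflow. */
	if (lim->seg_boundary_mask == SKETCH_DEFAULT_BOUNDARY)
		return lim->max_segment_size;

	room = lim->seg_boundary_mask - (lim->seg_boundary_mask & offset) + 1;
	return room < lim->max_segment_size ?
		(unsigned int)room : lim->max_segment_size;
}

/*
 * Split one bvec (offset, len) into segments. Returns the number of new
 * segments; *leftover is non-zero if the segment limit was hit mid-bvec.
 */
static unsigned int sketch_split_segs(const struct sketch_limits *lim,
				      unsigned int offset, unsigned int len,
				      unsigned int nsegs_so_far,
				      unsigned int *leftover)
{
	unsigned int total = 0, new_nsegs = 0;

	while (len && nsegs_so_far + new_nsegs < lim->max_segments) {
		unsigned int seg = sketch_max_seg_size(lim, offset + total);

		if (seg > len)
			seg = len;
		new_nsegs++;
		total += seg;
		len -= seg;
	}
	*leftover = len;
	return new_nsegs;
}
```

With a 64K boundary mask and a 64K max segment size, a 128K bvec starting at offset 4096 becomes three segments (61440, 65536 and 4096 bytes). If only two segments are allowed, 4096 bytes are left over, which corresponds to the case where the bio itself has to be split in the middle of the bvec.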

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-merge.c | 103 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 83 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f85d878f313d..4ef56b2d2aa5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,73 @@ static inline unsigned get_max_io_size(struct request_queue *q,
 	return sectors;
 }
 
+static unsigned get_max_segment_size(struct request_queue *q,
+				     unsigned offset)
+{
+	unsigned long mask = queue_segment_boundary(q);
+
+	/* default segment boundary mask means no boundary limit */
+	if (mask == BLK_SEG_BOUNDARY_MASK)
+		return queue_max_segment_size(q);
+
+	return min_t(unsigned long, mask - (mask & offset) + 1,
+		     queue_max_segment_size(q));
+}
+
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+		unsigned *nsegs, unsigned *last_seg_size,
+		unsigned *front_seg_size, unsigned *sectors)
+{
+	unsigned len = bv->bv_len;
+	unsigned total_len = 0;
+	unsigned new_nsegs = 0, seg_size = 0;
+
+	/*
+	 * Multi-page bvec may be too big to hold in one segment, so the
+	 * current bvec has to be splitted as multiple segments.
+	 */
+	while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+		seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+		seg_size = min(seg_size, len);
+
+		new_nsegs++;
+		total_len += seg_size;
+		len -= seg_size;
+
+		if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+			break;
+	}
+
+	if (!new_nsegs)
+		return !!len;
+
+	/* update front segment size */
+	if (!*nsegs) {
+		unsigned first_seg_size;
+
+		if (new_nsegs == 1)
+			first_seg_size = get_max_segment_size(q, bv->bv_offset);
+		else
+			first_seg_size = queue_max_segment_size(q);
+
+		if (*front_seg_size < first_seg_size)
+			*front_seg_size = first_seg_size;
+	}
+
+	/* update other varibles */
+	*last_seg_size = seg_size;
+	*nsegs += new_nsegs;
+	if (sectors)
+		*sectors += total_len >> 9;
+
+	/* split in the middle of the bvec if len != 0 */
+	return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
@@ -174,7 +241,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
 
-	bio_for_each_segment(bv, bio, iter) {
+	bio_for_each_bvec(bv, bio, iter) {
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
@@ -189,8 +256,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 			 */
 			if (nsegs < queue_max_segments(q) &&
 			    sectors < max_sectors) {
-				nsegs++;
-				sectors = max_sectors;
+				/* split in the middle of bvec */
+				bv.bv_len = (max_sectors - sectors) << 9;
+				bvec_split_segs(q, &bv, &nsegs,
+						&seg_size,
+						&front_seg_size,
+						&sectors);
 			}
 			goto split;
 		}
@@ -212,14 +283,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		if (nsegs == queue_max_segments(q))
 			goto split;
 
-		if (nsegs == 1 && seg_size > front_seg_size)
-			front_seg_size = seg_size;
-
-		nsegs++;
 		bvprv = bv;
 		bvprvp = &bvprv;
-		seg_size = bv.bv_len;
-		sectors += bv.bv_len >> 9;
+
+		if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+				    &front_seg_size, &sectors))
+			goto split;
 
 	}
 
@@ -233,8 +302,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 			bio = new;
 	}
 
-	if (nsegs == 1 && seg_size > front_seg_size)
-		front_seg_size = seg_size;
 	bio->bi_seg_front_size = front_seg_size;
 	if (seg_size > bio->bi_seg_back_size)
 		bio->bi_seg_back_size = seg_size;
@@ -297,6 +364,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	struct bio_vec bv, bvprv = { NULL };
 	int prev = 0;
 	unsigned int seg_size, nr_phys_segs;
+	unsigned front_seg_size = bio->bi_seg_front_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
 
@@ -316,7 +384,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	seg_size = 0;
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
-		bio_for_each_segment(bv, bio, iter) {
+		bio_for_each_bvec(bv, bio, iter) {
 			/*
 			 * If SG merging is disabled, each bio vector is
 			 * a segment
@@ -336,20 +404,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 				continue;
 			}
 new_segment:
-			if (nr_phys_segs == 1 && seg_size >
-			    fbio->bi_seg_front_size)
-				fbio->bi_seg_front_size = seg_size;
-
-			nr_phys_segs++;
 			bvprv = bv;
 			prev = 1;
-			seg_size = bv.bv_len;
+			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
+					&front_seg_size, NULL);
 		}
 		bbio = bio;
 	}
 
-	if (nr_phys_segs == 1 && seg_size > fbio->bi_seg_front_size)
-		fbio->bi_seg_front_size = seg_size;
+	fbio->bi_seg_front_size = front_seg_size;
 	if (seg_size > bbio->bi_seg_back_size)
 		bbio->bi_seg_back_size = seg_size;
 
-- 
2.9.5



* [PATCH V15 07/18] block: use bio_for_each_bvec() to map sg
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (5 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 06/18] block: use bio_for_each_bvec() to compute multi-page bvec count Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 08/18] block: introduce mp_bvec_last_segment() Ming Lei
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

It is more efficient to use bio_for_each_bvec() to map sg. Meanwhile, we
have to consider splitting a multi-page bvec, as is done in
blk_bio_segment_split().
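
As a rough illustration of the splitting described here, below is a
user-space sketch, not the kernel code from this patch. It assumes a fixed
4 KiB page size and models the queue limit as a single max_seg_size
argument; any further limits applied by get_max_segment_size() are ignored.
The names model_sg and model_bvec_map_sg are made up for this example.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for one scatterlist entry. */
struct model_sg {
	unsigned page_idx;	/* page index relative to the bvec's first page */
	unsigned offset;	/* byte offset inside that page */
	unsigned length;	/* bytes covered by this entry */
};

/*
 * Split one multi-page bvec (bv_offset, bv_len) into entries of at most
 * max_seg_size bytes each. Returns the number of entries written to out.
 */
static unsigned model_bvec_map_sg(unsigned bv_offset, unsigned bv_len,
				  unsigned max_seg_size,
				  struct model_sg *out, unsigned out_cap)
{
	unsigned total = 0, nsegs = 0;

	assert(max_seg_size > 0);

	while (total < bv_len) {
		unsigned remaining = bv_len - total;
		unsigned seg = remaining < max_seg_size ? remaining : max_seg_size;
		unsigned pos = bv_offset + total;

		assert(nsegs < out_cap);
		out[nsegs].page_idx = pos / MODEL_PAGE_SIZE;
		out[nsegs].offset = pos % MODEL_PAGE_SIZE;
		out[nsegs].length = seg;

		total += seg;
		nsegs++;
	}

	return nsegs;
}
```

For a bvec with offset 512 and length 12288 and a limit of 8192 bytes, the
sketch yields two entries, the second one starting in the third page.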

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-merge.c | 70 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4ef56b2d2aa5..1912499b08b7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -464,6 +464,54 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+		struct scatterlist *sglist)
+{
+	if (!*sg)
+		return sglist;
+
+	/*
+	 * If the driver previously mapped a shorter list, we could see a
+	 * termination bit prematurely unless it fully inits the sg table
+	 * on each mapping. We KNOW that there must be more entries here
+	 * or the driver would be buggy, so force clear the termination bit
+	 * to avoid doing a full sg_init_table() in drivers for each command.
+	 */
+	sg_unmark_end(*sg);
+	return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+		struct bio_vec *bvec, struct scatterlist *sglist,
+		struct scatterlist **sg)
+{
+	unsigned nbytes = bvec->bv_len;
+	unsigned nsegs = 0, total = 0, offset = 0;
+
+	while (nbytes > 0) {
+		unsigned seg_size;
+		struct page *pg;
+		unsigned idx;
+
+		*sg = blk_next_sg(sg, sglist);
+
+		seg_size = get_max_segment_size(q, bvec->bv_offset + total);
+		seg_size = min(nbytes, seg_size);
+
+		offset = (total + bvec->bv_offset) % PAGE_SIZE;
+		idx = (total + bvec->bv_offset) / PAGE_SIZE;
+		pg = nth_page(bvec->bv_page, idx);
+
+		sg_set_page(*sg, pg, seg_size, offset);
+
+		total += seg_size;
+		nbytes -= seg_size;
+		nsegs++;
+	}
+
+	return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		     struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -481,25 +529,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		(*sg)->length += nbytes;
 	} else {
 new_segment:
-		if (!*sg)
-			*sg = sglist;
-		else {
-			/*
-			 * If the driver previously mapped a shorter
-			 * list, we could see a termination bit
-			 * prematurely unless it fully inits the sg
-			 * table on each mapping. We KNOW that there
-			 * must be more entries here or the driver
-			 * would be buggy, so force clear the
-			 * termination bit to avoid doing a full
-			 * sg_init_table() in drivers for each command.
-			 */
-			sg_unmark_end(*sg);
-			*sg = sg_next(*sg);
-		}
-
-		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-		(*nsegs)++;
+		(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 	}
 	*bvprv = *bvec;
 }
@@ -521,7 +551,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 	int nsegs = 0;
 
 	for_each_bio(bio)
-		bio_for_each_segment(bvec, bio, iter)
+		bio_for_each_bvec(bvec, bio, iter)
 			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
 					     &nsegs);
 
-- 
2.9.5



* [PATCH V15 08/18] block: introduce mp_bvec_last_segment()
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (6 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 07/18] block: use bio_for_each_bvec() to map sg Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 09/18] fs/buffer.c: use bvec iterator to truncate the bio Ming Lei
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

BTRFS and guard_bio_eod() need to get the last single-page segment
of a multi-page bvec, so introduce this helper for them.
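
To make the arithmetic concrete, here is a user-space sketch that follows
the same steps as the helper in the diff below. It is only a model: pages
are represented by an index instead of a struct page pointer, the page size
is assumed to be 4 KiB, and model_bvec and model_last_segment are
hypothetical names.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for struct bio_vec: a page index replaces bv_page. */
struct model_bvec {
	unsigned page_idx;
	unsigned offset;
	unsigned len;
};

/* Return the last single-page segment of a (possibly multi-page) bvec. */
static struct model_bvec model_last_segment(struct model_bvec bv)
{
	struct model_bvec seg;
	unsigned total, last_page;

	assert(bv.len > 0);

	total = bv.offset + bv.len;
	last_page = (total - 1) / MODEL_PAGE_SIZE;
	seg.page_idx = bv.page_idx + last_page;

	if (bv.offset >= last_page * MODEL_PAGE_SIZE) {
		/* the whole bvec sits inside its last page */
		seg.offset = bv.offset % MODEL_PAGE_SIZE;
		seg.len = bv.len;
	} else {
		/* the last page holds only the tail of the bvec */
		seg.offset = 0;
		seg.len = total - last_page * MODEL_PAGE_SIZE;
	}

	return seg;
}
```

With offset 512 and length 8192 the bvec touches three pages, and the last
segment is the first 512 bytes of the third page.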

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bvec.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0ae729b1c9fe..21f76bad7be2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -131,4 +131,26 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	.bi_bvec_done	= 0,						\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
+					struct bio_vec *seg)
+{
+	unsigned total = bvec->bv_offset + bvec->bv_len;
+	unsigned last_page = (total - 1) / PAGE_SIZE;
+
+	seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+	/* the whole segment is inside the last page */
+	if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+		seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+		seg->bv_len = bvec->bv_len;
+	} else {
+		seg->bv_offset = 0;
+		seg->bv_len = total - last_page * PAGE_SIZE;
+	}
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5



* [PATCH V15 09/18] fs/buffer.c: use bvec iterator to truncate the bio
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (7 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 08/18] block: introduce mp_bvec_last_segment() Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 10/18] btrfs: use mp_bvec_last_segment to get bio's last page Ming Lei
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Once multi-page bvec is enabled, the last bvec may include more than one
page, so this patch uses mp_bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/buffer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 52d024bfdbc1..817871274c77 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
 	/* ..and clear the end of the buffer for reads */
 	if (op == REQ_OP_READ) {
-		zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+		struct bio_vec bv;
+
+		mp_bvec_last_segment(bvec, &bv);
+		zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
 				truncated_bytes);
 	}
 }
-- 
2.9.5



* [PATCH V15 10/18] btrfs: use mp_bvec_last_segment to get bio's last page
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (8 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 09/18] fs/buffer.c: use bvec iterator to truncate the bio Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 11/18] block: loop: pass multi-page bvec to iov_iter Ming Lei
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

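Prepare for multi-page bvec support: the last bvec of a bio may cover more
than one page, so use mp_bvec_last_segment() to get the bio's last page.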

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/btrfs/extent_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc8ba3ee515d..986ef49b0269 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2697,11 +2697,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 {
 	blk_status_t ret = 0;
 	struct bio_vec *bvec = bio_last_bvec_all(bio);
-	struct page *page = bvec->bv_page;
+	struct bio_vec bv;
 	struct extent_io_tree *tree = bio->bi_private;
 	u64 start;
 
-	start = page_offset(page) + bvec->bv_offset;
+	mp_bvec_last_segment(bvec, &bv);
+	start = page_offset(bv.bv_page) + bv.bv_offset;
 
 	bio->bi_private = NULL;
 
-- 
2.9.5



* [PATCH V15 11/18] block: loop: pass multi-page bvec to iov_iter
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (9 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 10/18] btrfs: use mp_bvec_last_segment to get bio's last page Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 12/18] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages() Ming Lei
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

iov_iter is implemented on top of the bvec iterator helpers, so it is safe
to pass multi-page bvecs to it, and this is much more efficient than
passing one page in each bvec.
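
A small user-space sketch of where the saving comes from: if every run of
physically consecutive pages is stored as one multi-page bvec, the array
handed to iov_iter has one entry per run instead of one entry per page.
This is only a counting model. It assumes whole pages identified by page
frame numbers, ignores every other limit, and model_count_bvecs is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the bvecs needed when each run of consecutive page frame numbers
 * is stored as a single multi-page bvec.
 */
static size_t model_count_bvecs(const unsigned long *pfns, size_t nr_pages)
{
	size_t i, nr_bvec = 0;

	assert(pfns || nr_pages == 0);

	for (i = 0; i < nr_pages; i++) {
		/* a new bvec starts whenever the page does not follow the previous one */
		if (i == 0 || pfns[i] != pfns[i - 1] + 1)
			nr_bvec++;
	}

	return nr_bvec;
}
```

For the page frame numbers 100, 101, 102, 200, 201, 300 this gives three
entries, where a single-page layout would need six.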

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cf5538942834..8ef583197414 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -511,21 +511,22 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 		     loff_t pos, bool rw)
 {
 	struct iov_iter iter;
+	struct req_iterator rq_iter;
 	struct bio_vec *bvec;
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
 	struct bio *bio = rq->bio;
 	struct file *file = lo->lo_backing_file;
+	struct bio_vec tmp;
 	unsigned int offset;
-	int segments = 0;
+	int nr_bvec = 0;
 	int ret;
 
+	rq_for_each_bvec(tmp, rq, rq_iter)
+		nr_bvec++;
+
 	if (rq->bio != rq->biotail) {
-		struct req_iterator iter;
-		struct bio_vec tmp;
 
-		__rq_for_each_bio(bio, rq)
-			segments += bio_segments(bio);
-		bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+		bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 				     GFP_NOIO);
 		if (!bvec)
 			return -EIO;
@@ -534,10 +535,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 		/*
 		 * The bios of the request may be started from the middle of
 		 * the 'bvec' because of bio splitting, so we can't directly
-		 * copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+		 * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 		 * API will take care of all details for us.
 		 */
-		rq_for_each_segment(tmp, rq, iter) {
+		rq_for_each_bvec(tmp, rq, rq_iter) {
 			*bvec = tmp;
 			bvec++;
 		}
@@ -551,11 +552,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 		 */
 		offset = bio->bi_iter.bi_bvec_done;
 		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-		segments = bio_segments(bio);
 	}
 	atomic_set(&cmd->ref, 2);
 
-	iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
+	iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
 	iter.iov_offset = offset;
 
 	cmd->iocb.ki_pos = pos;
-- 
2.9.5



* [PATCH V15 12/18] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (10 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 11/18] block: loop: pass multi-page bvec to iov_iter Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 13/18] block: allow bio_for_each_segment_all() to iterate over multi-page bvec Ming Lei
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given that this is the only case of
this kind, open-code the bvec table access, since
bio_for_each_segment_all() will be changed to support iterating over
multi-page bvecs.
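
The open-coded pattern is a plain walk over the table with rollback when an
allocation fails. The user-space sketch below models it with malloc(). The
names model_vec, model_alloc_page, model_budget and model_alloc_pages are
hypothetical, and the failure injection via model_budget exists only so the
error path can be exercised. The rollback uses an index rather than a
pointer decrement, which is a choice of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one entry of the bvec table. */
struct model_vec {
	void *page;
};

/* Failure injection for the sketch: -1 means unlimited allocations. */
static int model_budget = -1;

static void *model_alloc_page(void)
{
	if (model_budget == 0)
		return NULL;
	if (model_budget > 0)
		model_budget--;
	return malloc(4096);
}

/*
 * Fill every entry of a freshly set up table with one page. On failure,
 * free what was allocated so far and return -ENOMEM.
 */
static int model_alloc_pages(struct model_vec *table, int vcnt)
{
	int i;

	assert(table && vcnt >= 0);

	for (i = 0; i < vcnt; i++) {
		table[i].page = model_alloc_page();
		if (!table[i].page) {
			/* roll back entries 0 .. i-1 */
			while (i-- > 0) {
				free(table[i].page);
				table[i].page = NULL;
			}
			return -ENOMEM;
		}
	}

	return 0;
}
```

Because the table belongs to a bio that nobody else has seen yet, walking
it by index touches exactly the entries the caller set up.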

Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/bcache/util.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
 	int i;
 	struct bio_vec *bv;
 
-	bio_for_each_segment_all(bv, bio, i) {
+	/*
+	 * This is called on freshly new bio, so it is safe to access the
+	 * bvec table directly.
+	 */
+	for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
 		bv->bv_page = alloc_page(gfp_mask);
 		if (!bv->bv_page) {
 			while (--bv >= bio->bi_io_vec)
-- 
2.9.5



* [PATCH V15 13/18] block: allow bio_for_each_segment_all() to iterate over multi-page bvec
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (11 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 12/18] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages() Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 14/18] block: enable multipage bvecs Ming Lei
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

This patch introduces one extra iterator variable to
bio_for_each_segment_all(), so that bio_for_each_segment_all() can iterate
over multi-page bvecs.

Given that this is just a simple, mechanical change to all
bio_for_each_segment_all() users, this patch makes the tree-wide change in
one single patch, so that we can avoid using a temporary helper for this
conversion.
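
To show why extra iterator state is needed, here is a user-space sketch of
a two-level walk: the index selects a bvec in the table, and the extra
iterator remembers how far into that bvec the walk has advanced, so each
step can hand out one single-page segment. It does not reproduce the macro
from this patch. The page size is assumed to be 4 KiB, pages are modelled
as indices, and model_bvec, model_iter_all and model_next_segment are
hypothetical names.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for struct bio_vec: a page index replaces bv_page. */
struct model_bvec {
	unsigned page_idx;
	unsigned offset;
	unsigned len;
};

/* Extra iterator state: current single-page segment and progress in the bvec. */
struct model_iter_all {
	struct model_bvec seg;
	unsigned done;
};

/*
 * Advance to the next single-page segment. Start with *i == 0 and
 * it->done == 0. Returns 1 and fills it->seg, or 0 when the table is
 * exhausted.
 */
static int model_next_segment(const struct model_bvec *table, int vcnt,
			      int *i, struct model_iter_all *it)
{
	const struct model_bvec *bv;
	unsigned pos, in_page, room, left;

	assert(table && i && it && *i >= 0);

	/* move on to the next bvec once the current one is used up */
	while (*i < vcnt && it->done >= table[*i].len) {
		(*i)++;
		it->done = 0;
	}
	if (*i >= vcnt)
		return 0;

	bv = &table[*i];
	pos = bv->offset + it->done;
	in_page = pos % MODEL_PAGE_SIZE;
	room = MODEL_PAGE_SIZE - in_page;
	left = bv->len - it->done;

	it->seg.page_idx = bv->page_idx + pos / MODEL_PAGE_SIZE;
	it->seg.offset = in_page;
	it->seg.len = left < room ? left : room;
	it->done += it->seg.len;

	return 1;
}
```

A table holding one bvec of 8192 bytes at offset 512 and one full-page bvec
produces four single-page segments in this model, even though the table
has only two entries.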

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c                       | 27 ++++++++++++++++++---------
 block/bounce.c                    |  6 ++++--
 drivers/md/bcache/btree.c         |  3 ++-
 drivers/md/dm-crypt.c             |  3 ++-
 drivers/md/raid1.c                |  3 ++-
 drivers/staging/erofs/data.c      |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c                    |  6 ++++--
 fs/btrfs/compression.c            |  3 ++-
 fs/btrfs/disk-io.c                |  3 ++-
 fs/btrfs/extent_io.c              |  9 ++++++---
 fs/btrfs/inode.c                  |  6 ++++--
 fs/btrfs/raid56.c                 |  3 ++-
 fs/crypto/bio.c                   |  3 ++-
 fs/direct-io.c                    |  4 +++-
 fs/exofs/ore.c                    |  3 ++-
 fs/exofs/ore_raid.c               |  3 ++-
 fs/ext4/page-io.c                 |  3 ++-
 fs/ext4/readpage.c                |  3 ++-
 fs/f2fs/data.c                    |  9 ++++++---
 fs/gfs2/lops.c                    |  9 ++++++---
 fs/gfs2/meta_io.c                 |  3 ++-
 fs/iomap.c                        |  6 ++++--
 fs/mpage.c                        |  3 ++-
 fs/xfs/xfs_aops.c                 |  5 +++--
 include/linux/bio.h               | 11 +++++++++--
 include/linux/bvec.h              | 30 ++++++++++++++++++++++++++++++
 27 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..968b12fea564 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1072,8 +1072,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
 	int i;
 	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		ssize_t ret;
 
 		ret = copy_page_from_iter(bvec->bv_page,
@@ -1103,8 +1104,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
 	int i;
 	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		ssize_t ret;
 
 		ret = copy_page_to_iter(bvec->bv_page,
@@ -1126,8 +1128,9 @@ void bio_free_pages(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i)
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
 		__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 	struct bio *bio;
 	int ret;
 	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
 
 	if (!iov_iter_count(iter))
 		return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 	return bio;
 
  out_unmap:
-	bio_for_each_segment_all(bvec, bio, j) {
+	bio_for_each_segment_all(bvec, bio, j, iter_all) {
 		put_page(bvec->bv_page);
 	}
 	bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	/*
 	 * make sure we dirty pages we wrote to
 	 */
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		if (bio_data_dir(bio) == READ)
 			set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
 	char *p = bio->bi_private;
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
 		p += bvec->bv_len;
 	}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		if (!PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
 	}
@@ -1596,8 +1603,9 @@ static void bio_release_pages(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i)
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
 		put_page(bvec->bv_page);
 }
 
@@ -1644,8 +1652,9 @@ void bio_check_pages_dirty(struct bio *bio)
 	struct bio_vec *bvec;
 	unsigned long flags;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
 			goto defer;
 	}
diff --git a/block/bounce.c b/block/bounce.c
index ffb9e9ecfa7e..add085e28b1d 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -165,11 +165,12 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool)
 	struct bio_vec *bvec, orig_vec;
 	int i;
 	struct bvec_iter orig_iter = bio_orig->bi_iter;
+	struct bvec_iter_all iter_all;
 
 	/*
 	 * free up bounce indirect pages used
 	 */
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		orig_vec = bio_iter_iovec(bio_orig, orig_iter);
 		if (bvec->bv_page != orig_vec.bv_page) {
 			dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
@@ -294,6 +295,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bool bounce = false;
 	int sectors = 0;
 	bool passthrough = bio_is_passthrough(*bio_orig);
+	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment(from, *bio_orig, iter) {
 		if (i++ < BIO_MAX_PAGES)
@@ -313,7 +315,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bio = bounce_clone_bio(*bio_orig, GFP_NOIO, passthrough ? NULL :
 			&bounce_bio_set);
 
-	bio_for_each_segment_all(to, bio, i) {
+	bio_for_each_segment_all(to, bio, i, iter_all) {
 		struct page *page = to->bv_page;
 
 		if (page_to_pfn(page) <= q->limits.bounce_pfn)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 23cb1dc7296b..64def336f053 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -432,8 +432,9 @@ static void do_btree_node_write(struct btree *b)
 		int j;
 		struct bio_vec *bv;
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+		struct bvec_iter_all iter_all;
 
-		bio_for_each_segment_all(bv, b->bio, j)
+		bio_for_each_segment_all(bv, b->bio, j, iter_all)
 			memcpy(page_address(bv->bv_page),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 47d4e0d30bf0..9a29037f5615 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1447,8 +1447,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	unsigned int i;
 	struct bio_vec *bv;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bv, clone, i) {
+	bio_for_each_segment_all(bv, clone, i, iter_all) {
 		BUG_ON(!bv->bv_page);
 		mempool_free(bv->bv_page, &cc->page_pool);
 	}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7e63ccc4ae7b..88c61d3090b0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2112,13 +2112,14 @@ static void process_checks(struct r1bio *r1_bio)
 		struct page **spages = get_resync_pages(sbio)->pages;
 		struct bio_vec *bi;
 		int page_len[RESYNC_PAGES] = { 0 };
+		struct bvec_iter_all iter_all;
 
 		if (sbio->bi_end_io != end_sync_read)
 			continue;
 		/* Now we can 'fixup' the error value */
 		sbio->bi_status = 0;
 
-		bio_for_each_segment_all(bi, sbio, j)
+		bio_for_each_segment_all(bi, sbio, j, iter_all)
 			page_len[j] = bi->bv_len;
 
 		if (!status) {
diff --git a/drivers/staging/erofs/data.c b/drivers/staging/erofs/data.c
index 5a55f0bfdfbb..4871ba7b7d9a 100644
--- a/drivers/staging/erofs/data.c
+++ b/drivers/staging/erofs/data.c
@@ -20,8 +20,9 @@ static inline void read_endio(struct bio *bio)
 	int i;
 	struct bio_vec *bvec;
 	const blk_status_t err = bio->bi_status;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 
 		/* page is already locked */
diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 4ac1099a39c6..c057c5616b1d 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -830,8 +830,9 @@ static inline void z_erofs_vle_read_endio(struct bio *bio)
 #ifdef EROFS_FS_HAS_MANAGED_CACHE
 	struct address_space *mc = NULL;
 #endif
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 		bool cachemngd = false;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 58a4c1217fa8..7758adee6efe 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -211,6 +211,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	ssize_t ret;
 	blk_qc_t qc;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if ((pos | iov_iter_alignment(iter)) &
 	    (bdev_logical_block_size(bdev) - 1))
@@ -260,7 +261,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	}
 	__set_current_state(TASK_RUNNING);
 
-	bio_for_each_segment_all(bvec, &bio, i) {
+	bio_for_each_segment_all(bvec, &bio, i, iter_all) {
 		if (should_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
 		put_page(bvec->bv_page);
@@ -329,8 +330,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 	} else {
 		struct bio_vec *bvec;
 		int i;
+		struct bvec_iter_all iter_all;
 
-		bio_for_each_segment_all(bvec, bio, i)
+		bio_for_each_segment_all(bvec, bio, i, iter_all)
 			put_page(bvec->bv_page);
 		bio_put(bio);
 	}
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 548057630b69..6896ea60c843 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -162,13 +162,14 @@ static void end_compressed_bio_read(struct bio *bio)
 	} else {
 		int i;
 		struct bio_vec *bvec;
+		struct bvec_iter_all iter_all;
 
 		/*
 		 * we have verified the checksum already, set page
 		 * checked so the end_io handlers know about it
 		 */
 		ASSERT(!bio_flagged(bio, BIO_CLONED));
-		bio_for_each_segment_all(bvec, cb->orig_bio, i)
+		bio_for_each_segment_all(bvec, cb->orig_bio, i, iter_all)
 			SetPageChecked(bvec->bv_page);
 
 		bio_endio(cb->orig_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6a2a2a951705..ca1b7da6dd1b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -832,9 +832,10 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
 	struct bio_vec *bvec;
 	struct btrfs_root *root;
 	int i, ret = 0;
+	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		root = BTRFS_I(bvec->bv_page->mapping->host)->root;
 		ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
 		if (ret)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 986ef49b0269..4ed58c9a94a9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2422,9 +2422,10 @@ static void end_bio_extent_writepage(struct bio *bio)
 	u64 start;
 	u64 end;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2493,9 +2494,10 @@ static void end_bio_extent_readpage(struct bio *bio)
 	int mirror;
 	int ret;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -3635,9 +3637,10 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
 	struct bio_vec *bvec;
 	struct extent_buffer *eb;
 	int i, done;
+	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 
 		eb = (struct extent_buffer *)page->private;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5c349667c761..7ade5769f691 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7777,6 +7777,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
 	struct bio_vec *bvec;
 	struct extent_io_tree *io_tree, *failure_tree;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (bio->bi_status)
 		goto end;
@@ -7788,7 +7789,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
 
 	done->uptodate = 1;
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i)
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
 		clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
 				 io_tree, done->start, bvec->bv_page,
 				 btrfs_ino(BTRFS_I(inode)), 0);
@@ -7867,6 +7868,7 @@ static void btrfs_retry_endio(struct bio *bio)
 	int uptodate;
 	int ret;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (bio->bi_status)
 		goto end;
@@ -7880,7 +7882,7 @@ static void btrfs_retry_endio(struct bio *bio)
 	failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     bvec->bv_offset, done->start,
 					     bvec->bv_len);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index e74455eb42f9..1869ba8e5981 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1443,10 +1443,11 @@ static void set_bio_pages_uptodate(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 
-	bio_for_each_segment_all(bvec, bio, i)
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
 		SetPageUptodate(bvec->bv_page);
 }
 
diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 0959044c5cee..5759bcd018cd 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -30,8 +30,9 @@ static void __fscrypt_decrypt_bio(struct bio *bio, bool done)
 {
 	struct bio_vec *bv;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bv, bio, i) {
+	bio_for_each_segment_all(bv, bio, i, iter_all) {
 		struct page *page = bv->bv_page;
 		int ret = fscrypt_decrypt_page(page->mapping->host, page,
 				PAGE_SIZE, 0, page->index);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ec2fb6fe6d37..9bb015bc4a83 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -551,7 +551,9 @@ static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
-		bio_for_each_segment_all(bvec, bio, i) {
+		struct bvec_iter_all iter_all;
+
+		bio_for_each_segment_all(bvec, bio, i, iter_all) {
 			struct page *page = bvec->bv_page;
 
 			if (dio->op == REQ_OP_READ && !PageCompound(page) &&
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 5331a15a61f1..24a8e34882e9 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -420,8 +420,9 @@ static void _clear_bio(struct bio *bio)
 {
 	struct bio_vec *bv;
 	unsigned i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bv, bio, i) {
+	bio_for_each_segment_all(bv, bio, i, iter_all) {
 		unsigned this_count = bv->bv_len;
 
 		if (likely(PAGE_SIZE == this_count))
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 199590f36203..e83bab54b03e 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -468,11 +468,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 	/* loop on all devices all pages */
 	for (d = 0; d < ios->numdevs; d++) {
 		struct bio *bio = ios->per_dev[d].bio;
+		struct bvec_iter_all iter_all;
 
 		if (!bio)
 			continue;
 
-		bio_for_each_segment_all(bv, bio, i) {
+		bio_for_each_segment_all(bv, bio, i, iter_all) {
 			struct page *page = bv->bv_page;
 
 			SetPageUptodate(page);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 2aa62d58d8dd..cff4c4aa7a9c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -63,8 +63,9 @@ static void ext4_finish_bio(struct bio *bio)
 {
 	int i;
 	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 		struct page *data_page = NULL;
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 6aa282ee455a..e53639784892 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -72,6 +72,7 @@ static void mpage_end_io(struct bio *bio)
 {
 	struct bio_vec *bv;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (ext4_bio_encrypted(bio)) {
 		if (bio->bi_status) {
@@ -81,7 +82,7 @@ static void mpage_end_io(struct bio *bio)
 			return;
 		}
 	}
-	bio_for_each_segment_all(bv, bio, i) {
+	bio_for_each_segment_all(bv, bio, i, iter_all) {
 		struct page *page = bv->bv_page;
 
 		if (!bio->bi_status) {
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index f91d8630c9a2..da060b77f64d 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -87,8 +87,9 @@ static void __read_end_io(struct bio *bio)
 	struct page *page;
 	struct bio_vec *bv;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bv, bio, i) {
+	bio_for_each_segment_all(bv, bio, i, iter_all) {
 		page = bv->bv_page;
 
 		/* PG_error was set if any post_read step failed */
@@ -164,13 +165,14 @@ static void f2fs_write_end_io(struct bio *bio)
 	struct f2fs_sb_info *sbi = bio->bi_private;
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (time_to_inject(sbi, FAULT_WRITE_IO)) {
 		f2fs_show_injection_info(FAULT_WRITE_IO);
 		bio->bi_status = BLK_STS_IOERR;
 	}
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 		enum count_type type = WB_DATA_TYPE(page);
 
@@ -347,6 +349,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode,
 	struct bio_vec *bvec;
 	struct page *target;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (!io->bio)
 		return false;
@@ -354,7 +357,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode,
 	if (!inode && !page && !ino)
 		return true;
 
-	bio_for_each_segment_all(bvec, io->bio, i) {
+	bio_for_each_segment_all(bvec, io->bio, i, iter_all) {
 
 		if (bvec->bv_page->mapping)
 			target = bvec->bv_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..15deefeaafd0 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -170,7 +170,8 @@ u64 gfs2_log_bmap(struct gfs2_sbd *sdp)
  * that is pinned in the pagecache.
  */
 
-static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
+static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
+				  struct bio_vec *bvec,
 				  blk_status_t error)
 {
 	struct buffer_head *bh, *next;
@@ -208,6 +209,7 @@ static void gfs2_end_log_write(struct bio *bio)
 	struct bio_vec *bvec;
 	struct page *page;
 	int i;
+	struct bvec_iter_all iter_all;
 
 	if (bio->bi_status) {
 		fs_err(sdp, "Error %d writing to journal, jid=%u\n",
@@ -215,7 +217,7 @@ static void gfs2_end_log_write(struct bio *bio)
 		wake_up(&sdp->sd_logd_waitq);
 	}
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		page = bvec->bv_page;
 		if (page_has_buffers(page))
 			gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
@@ -388,8 +390,9 @@ static void gfs2_end_log_read(struct bio *bio)
 	struct page *page;
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		page = bvec->bv_page;
 		if (bio->bi_status) {
 			int err = blk_status_to_errno(bio->bi_status);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index be9c0bf697fe..3201342404a7 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -190,8 +190,9 @@ static void gfs2_meta_read_endio(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all(bvec, bio, i, iter_all) {
 		struct page *page = bvec->bv_page;
 		struct buffer_head *bh = page_buffers(page);
 		unsigned int len = bvec->bv_len;
diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..af736acd9006 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -267,8 +267,9 @@ iomap_read_end_io(struct bio *bio)
 	int error = blk_status_to_errno(bio->bi_status);
 	struct bio_vec *bvec;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i)
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
 		iomap_read_page_end_io(bvec, error);
 	bio_put(bio);
 }
@@ -1559,8 +1560,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	} else {
 		struct bio_vec *bvec;
 		int i;
+		struct bvec_iter_all iter_all;
 
-		bio_for_each_segment_all(bvec, bio, i)
+		bio_for_each_segment_all(bvec, bio, i, iter_all)
 			put_page(bvec->bv_page);
 		bio_put(bio);
 	}
diff --git a/fs/mpage.c b/fs/mpage.c
index c820dc9bebab..3f19da75178b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,8 +48,9 @@ static void mpage_end_io(struct bio *bio)
 {
 	struct bio_vec *bv;
 	int i;
+	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bv, bio, i) {
+	bio_for_each_segment_all(bv, bio, i, iter_all) {
 		struct page *page = bv->bv_page;
 		page_endio(page, bio_op(bio),
 			   blk_status_to_errno(bio->bi_status));
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..1f1829e506e8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -62,7 +62,7 @@ xfs_find_daxdev_for_inode(
 static void
 xfs_finish_page_writeback(
 	struct inode		*inode,
-	struct bio_vec		*bvec,
+	struct bio_vec	*bvec,
 	int			error)
 {
 	struct iomap_page	*iop = to_iomap_page(bvec->bv_page);
@@ -98,6 +98,7 @@ xfs_destroy_ioend(
 	for (bio = &ioend->io_inline_bio; bio; bio = next) {
 		struct bio_vec	*bvec;
 		int		i;
+		struct bvec_iter_all iter_all;
 
 		/*
 		 * For the last bio, bi_private points to the ioend, so we
@@ -109,7 +110,7 @@ xfs_destroy_ioend(
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bvec, bio, i)
+		bio_for_each_segment_all(bvec, bio, i, iter_all)
 			xfs_finish_page_writeback(inode, bvec, error);
 		bio_put(bio);
 	}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7ef8a7505c0a..089370eb84d9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -128,12 +128,19 @@ static inline bool bio_full(struct bio *bio)
 	return bio->bi_vcnt >= bio->bi_max_vecs;
 }
 
+#define mp_bvec_for_each_segment(bv, bvl, i, iter_all)			\
+	for (bv = bvec_init_iter_all(&iter_all);			\
+		(iter_all.done < (bvl)->bv_len) &&			\
+		(mp_bvec_next_segment((bvl), &iter_all), 1);		\
+		iter_all.done += bv->bv_len, i += 1)
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
-#define bio_for_each_segment_all(bvl, bio, i)				\
-	for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
+#define bio_for_each_segment_all(bvl, bio, i, iter_all)		\
+	for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++)	\
+		mp_bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
 
 static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 				    unsigned bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 21f76bad7be2..30a57b68d017 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -45,6 +45,12 @@ struct bvec_iter {
 						   current bvec */
 };
 
+struct bvec_iter_all {
+	struct bio_vec	bv;
+	int		idx;
+	unsigned	done;
+};
+
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -131,6 +137,30 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	.bi_bvec_done	= 0,						\
 }
 
+static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all)
+{
+	iter_all->bv.bv_page = NULL;
+	iter_all->done = 0;
+
+	return &iter_all->bv;
+}
+
+static inline void mp_bvec_next_segment(const struct bio_vec *bvec,
+					struct bvec_iter_all *iter_all)
+{
+	struct bio_vec *bv = &iter_all->bv;
+
+	if (bv->bv_page) {
+		bv->bv_page = nth_page(bv->bv_page, 1);
+		bv->bv_offset = 0;
+	} else {
+		bv->bv_page = bvec->bv_page;
+		bv->bv_offset = bvec->bv_offset;
+	}
+	bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
+			   bvec->bv_len - iter_all->done);
+}
+
 /*
  * Get the last single-page segment from the multi-page bvec and store it
  * in @seg
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (12 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 13/18] block: allow bio_for_each_segment_all() to iterate over multi-page bvec Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
       [not found]   ` <CGME20190221084301eucas1p11e8841a62b4b1da3cccca661b6f4c29d@eucas1p1.samsung.com>
  2019-02-15 11:13 ` [PATCH V15 15/18] block: always define BIO_MAX_PAGES as 256 Ming Lei
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c         | 22 +++++++++++++++-------
 fs/iomap.c          |  4 ++--
 fs/xfs/xfs_aops.c   |  4 ++--
 include/linux/bio.h |  2 +-
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 968b12fea564..83a2dfa417ca 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ *		page as the last segment of the bio.
  *
  * Try to add the data at @page + @off to the last bvec of @bio.  This is a
  * a useful optimisation for file systems with a block size smaller than the
@@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-		unsigned int len, unsigned int off)
+		unsigned int len, unsigned int off, bool same_page)
 {
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return false;
 
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+		phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+			bv->bv_offset + bv->bv_len - 1;
+		phys_addr_t page_addr = page_to_phys(page);
 
-		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-			bv->bv_len += len;
-			bio->bi_iter.bi_size += len;
-			return true;
-		}
+		if (vec_end_addr + 1 != page_addr + off)
+			return false;
+		if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
+			return false;
+
+		bv->bv_len += len;
+		bio->bi_iter.bi_size += len;
+		return true;
 	}
 	return false;
 }
@@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 		 unsigned int len, unsigned int offset)
 {
-	if (!__bio_try_merge_page(bio, page, len, offset)) {
+	if (!__bio_try_merge_page(bio, page, len, offset, false)) {
 		if (bio_full(bio))
 			return 0;
 		__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index af736acd9006..0c350e658b7f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -318,7 +318,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 */
 	sector = iomap_sector(iomap, pos);
 	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-		if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+		if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
 			goto done;
 		is_contig = true;
 	}
@@ -349,7 +349,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		ctx->bio->bi_end_io = iomap_read_end_io;
 	}
 
-	__bio_add_page(ctx->bio, page, plen, poff);
+	bio_add_page(ctx->bio, page, plen, poff);
 done:
 	/*
 	 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,12 +616,12 @@ xfs_add_to_ioend(
 				bdev, sector);
 	}
 
-	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
 		if (iop)
 			atomic_inc(&iop->write_count);
 		if (bio_full(wpc->ioend->io_bio))
 			xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-		__bio_add_page(wpc->ioend->io_bio, page, len, poff);
+		bio_add_page(wpc->ioend->io_bio, page, len, poff);
 	}
 
 	wpc->ioend->io_size += len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 089370eb84d9..9f77adcfde82 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-		unsigned int len, unsigned int off);
+		unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V15 15/18] block: always define BIO_MAX_PAGES as 256
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (13 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 14/18] block: enable multipage bvecs Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 16/18] block: document usage of bio iterator helpers Ming Lei
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and add
them all to one bio. With multi-page bvecs, a single bvec is enough to
hold them all.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bio.h | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9f77adcfde82..bdd11d4c2f05 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES		HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES		256
-#endif
-#else
-#define BIO_MAX_PAGES		256
-#endif
 
 #define bio_prio(bio)			(bio)->bi_ioprio
 #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V15 16/18] block: document usage of bio iterator helpers
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (14 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 15/18] block: always define BIO_MAX_PAGES as 256 Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 17/18] block: kill QUEUE_FLAG_NO_SG_MERGE Ming Lei
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Now that multi-page bvecs are supported, some helpers return data page
by page while others return it segment by segment. This patch
documents their usage.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 Documentation/block/biovecs.txt | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..ce6eccaf5df7 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,28 @@ Other implications:
    size limitations and the limitations of the underlying devices. Thus
    there's no need to define ->merge_bvec_fn() callbacks for individual block
    drivers.
+
+Usage of helpers:
+=================
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+	bio_for_each_segment_all()
+	bio_first_bvec_all()
+	bio_first_page_all()
+	bio_last_bvec_all()
+
+* The following helpers iterate over single-page segment. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+	bio_for_each_segment()
+	bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvec. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+	bio_for_each_bvec()
+	rq_for_each_bvec()
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V15 17/18] block: kill QUEUE_FLAG_NO_SG_MERGE
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (15 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 16/18] block: document usage of bio iterator helpers Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 11:13 ` [PATCH V15 18/18] block: kill BLK_MQ_F_SG_MERGE Ming Lei
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment count is mainly computed in blk_queue_split() on the
fast path, and the BIO_SEG_VALID flag is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use
QUEUE_FLAG_NO_SG_MERGE.

blk_recount_segments() is basically bypassed on the fast path, given that
BIO_SEG_VALID is set in blk_queue_split().

The other user, blk_recalc_rq_segments(), runs in two places:

- in the partial completion branch of blk_update_request(), which is an
unusual case

- in blk_cloned_rq_check_limits(), where killing the flag is still not a big
problem since dm-rq is the only user.

Now that multi-page bvecs are enabled, not doing S/G merging is rather
pointless with the current setup of the I/O path, as it isn't going to save
a significant number of cycles.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-merge.c      | 31 ++++++-------------------------
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c         |  3 ---
 drivers/md/dm-table.c  | 13 -------------
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1912499b08b7..bed065904677 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,8 +358,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-					     struct bio *bio,
-					     bool no_sg_merge)
+					     struct bio *bio)
 {
 	struct bio_vec bv, bvprv = { NULL };
 	int prev = 0;
@@ -385,13 +384,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			/*
-			 * If SG merging is disabled, each bio vector is
-			 * a segment
-			 */
-			if (no_sg_merge)
-				goto new_segment;
-
 			if (prev) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
@@ -421,27 +413,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-			&rq->q->queue_flags);
-
-	rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-			no_sg_merge);
+	rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-	unsigned short seg_cnt = bio_segments(bio);
-
-	if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-			(seg_cnt < queue_max_segments(q)))
-		bio->bi_phys_segments = seg_cnt;
-	else {
-		struct bio *nxt = bio->bi_next;
+	struct bio *nxt = bio->bi_next;
 
-		bio->bi_next = NULL;
-		bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-		bio->bi_next = nxt;
-	}
+	bio->bi_next = NULL;
+	bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+	bio->bi_next = nxt;
 
 	bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c782e81db627..697d6213c82b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -128,7 +128,6 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(SAME_FORCE),
 	QUEUE_FLAG_NAME(DEAD),
 	QUEUE_FLAG_NAME(INIT_DONE),
-	QUEUE_FLAG_NAME(NO_SG_MERGE),
 	QUEUE_FLAG_NAME(POLL),
 	QUEUE_FLAG_NAME(WC),
 	QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 44d471ff8754..fa508ee31742 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2837,9 +2837,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	    set->map[HCTX_TYPE_POLL].nr_queues)
 		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 
-	if (!(set->flags & BLK_MQ_F_SG_MERGE))
-		blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
 	q->sg_reserved_size = INT_MAX;
 
 	INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b1be754cc41..ba9481f1bf3c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
 	return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev,
-				   sector_t start, sector_t len, void *data)
-{
-	struct request_queue *q = bdev_get_queue(dev->bdev);
-
-	return q && !test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
-}
-
 static bool dm_table_all_devices_attribute(struct dm_table *t,
 					   iterate_devices_callout_fn func)
 {
@@ -1902,11 +1894,6 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_write_zeroes(t))
 		q->limits.max_write_zeroes_sectors = 0;
 
-	if (dm_table_all_devices_attribute(t, queue_supports_sg_merge))
-		blk_queue_flag_clear(QUEUE_FLAG_NO_SG_MERGE, q);
-	else
-		blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
 	dm_table_verify_integrity(t);
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b6292d469ea4..faed9d9eb84c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -588,7 +588,6 @@ struct request_queue {
 #define QUEUE_FLAG_SAME_FORCE	12	/* force complete on same CPU */
 #define QUEUE_FLAG_DEAD		13	/* queue tear-down finished */
 #define QUEUE_FLAG_INIT_DONE	14	/* queue is initialized */
-#define QUEUE_FLAG_NO_SG_MERGE	15	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_POLL		16	/* IO polling enabled if set */
 #define QUEUE_FLAG_WC		17	/* Write back caching */
 #define QUEUE_FLAG_FUA		18	/* device supports FUA writes */
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V15 18/18] block: kill BLK_MQ_F_SG_MERGE
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (16 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 17/18] block: kill QUEUE_FLAG_NO_SG_MERGE Ming Lei
@ 2019-02-15 11:13 ` Ming Lei
  2019-02-15 14:51 ` [PATCH V15 00/18] block: support multi-page bvec Christoph Hellwig
  2019-02-15 15:49 ` Jens Axboe
  19 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-15 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ming Lei

QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c       | 1 -
 drivers/block/loop.c         | 2 +-
 drivers/block/nbd.c          | 2 +-
 drivers/block/rbd.c          | 2 +-
 drivers/block/skd_main.c     | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c           | 2 +-
 drivers/mmc/core/queue.c     | 3 +--
 drivers/scsi/scsi_lib.c      | 2 +-
 include/linux/blk-mq.h       | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 697d6213c82b..c39247c5ddb6 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -249,7 +249,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
 	HCTX_FLAG_NAME(TAG_SHARED),
-	HCTX_FLAG_NAME(SG_MERGE),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8ef583197414..3d63ad036398 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1937,7 +1937,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 7c9a949e876b..32a7ba1674b7 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1571,7 +1571,7 @@ static int nbd_dev_add(int index)
 	nbd->tag_set.numa_node = NUMA_NO_NODE;
 	nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
 	nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-		BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+		BLK_MQ_F_BLOCKING;
 	nbd->tag_set.driver_data = nbd;
 
 	err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e92b61d0bd5..abe9e1c89227 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	rbd_dev->tag_set.ops = &rbd_mq_ops;
 	rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
 	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	rbd_dev->tag_set.nr_hw_queues = 1;
 	rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index ab893a7571a2..7d3ad6c22ee5 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
 		skdev->sgs_per_request * sizeof(struct scatterlist);
 	skdev->tag_set.numa_node = NUMA_NO_NODE;
 	skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-		BLK_MQ_F_SG_MERGE |
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
 	skdev->tag_set.driver_data = skdev;
 	rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 	} else
 		info->tag_set.queue_depth = BLK_RING_SIZE(info);
 	info->tag_set.numa_node = NUMA_NO_NODE;
-	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	info->tag_set.cmd_size = sizeof(struct blkif_req);
 	info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4eb5f8c56535..b2f8eb2365ee 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -527,7 +527,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
 	else
 		mq->tag_set.queue_depth = MMC_QUEUE_DEPTH;
 	mq->tag_set.numa_node = NUMA_NO_NODE;
-	mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE |
-			    BLK_MQ_F_BLOCKING;
+	mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
 	mq->tag_set.nr_hw_queues = 1;
 	mq->tag_set.cmd_size = sizeof(struct mmc_queue_req);
 	mq->tag_set.driver_data = mq;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6d65ac584eba..6cadbe945bdb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1899,7 +1899,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.queue_depth = shost->can_queue;
 	shost->tag_set.cmd_size = cmd_size;
 	shost->tag_set.numa_node = NUMA_NO_NODE;
-	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	shost->tag_set.driver_data = shost;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0e030f5f76b6..b0c814bcc7e3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -218,7 +218,6 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
-	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (17 preceding siblings ...)
  2019-02-15 11:13 ` [PATCH V15 18/18] block: kill BLK_MQ_F_SG_MERGE Ming Lei
@ 2019-02-15 14:51 ` Christoph Hellwig
  2019-02-17 13:10   ` Ming Lei
  2019-02-15 15:49 ` Jens Axboe
  19 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2019-02-15 14:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel

I still don't understand why mp_bvec_last_segment isn't simply
called bvec_last_segment as there is no conflict.  But I don't
want to hold this series up on that as there only are two users
left and we can always just fix it up later.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 11:13 [PATCH V15 00/18] block: support multi-page bvec Ming Lei
                   ` (18 preceding siblings ...)
  2019-02-15 14:51 ` [PATCH V15 00/18] block: support multi-page bvec Christoph Hellwig
@ 2019-02-15 15:49 ` Jens Axboe
  2019-02-15 17:14   ` [dm-devel] " Bart Van Assche
  19 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2019-02-15 15:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel

On 2/15/19 4:13 AM, Ming Lei wrote:
> Hi,
> 
> This patchset brings multi-page bvec into block layer:
> 
> 1) what is multi-page bvec?
> 
> Multipage bvecs means that one 'struct bio_bvec' can hold multiple pages
> which are physically contiguous instead of one single page used in linux
> kernel for long time.
> 
> 2) why is multi-page bvec introduced?
> 
> Kent proposed the idea[1] first. 
> 
> As system's RAM becomes much bigger than before, and huge page, transparent
> huge page and memory compaction are widely used, it is a bit easy now
> to see physically contiguous pages from fs in I/O. On the other hand, from
> block layer's view, it isn't necessary to store intermediate pages into bvec,
> and it is enough to just store the physically contiguous 'segment' in each
> io vector.
> 
> Also huge pages are being brought to filesystem and swap [2][6], we can
> do IO on a hugepage each time[3], which requires that one bio can transfer
> at least one huge page one time. Turns out it isn't flexible to change
> BIO_MAX_PAGES simply[3][5]. Multipage bvec can fit in this case very well.
> As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be configured
> as much bigger, such as 512, which requires at least two 4K pages for holding
> the bvec table.
> 
> With multi-page bvec:
> 
> - Inside block layer, both bio splitting and sg map can become more
> efficient than before by just traversing the physically contiguous
> 'segment' instead of each page.
> 
> - segment handling in block layer can be improved much in future since it
> should be quite easy to convert multipage bvec into segment easily. For
> example, we might just store segment in each bvec directly in future.
> 
> - bio size can be increased and it should improve some high-bandwidth IO
> case in theory[4].
> 
> - there is opportunity in future to improve memory footprint of bvecs. 
> 
> 3) how is multi-page bvec implemented in this patchset?
> 
> Patches 1 ~ 3 prepare for supporting multi-page bvec.
> 
> Patches 4 ~ 14 implement multipage bvec in block layer:
> 
> 	- put all tricks into bvec/bio/rq iterators, and as far as
> 	drivers and fs use these standard iterators, they are happy
> 	with multipage bvec
> 
> 	- introduce bio_for_each_bvec() to iterate over multipage bvec for splitting
> 	bio and mapping sg
> 
> 	- keep current bio_for_each_segment*() to iterate over singlepage bvec and
> 	make sure current users won't be broken; especially, convert to this
> 	new helper prototype in single patch 21 given it is basically a mechanical
> 	conversion
> 
> 	- deal with iomap & xfs's sub-pagesize io vec in patch 13
> 
> 	- enable multipage bvec in patch 14
> 
> Patch 15 redefines BIO_MAX_PAGES as 256.
> 
> Patch 16 documents usages of bio iterator helpers.
> 
> Patch 17~18 kills NO_SG_MERGE.
> 
> These patches can be found in the following git tree:
> 
> 	git:  https://github.com/ming1/linux.git  v5.0-blk_mp_bvec_v14
                                                                   ^^^

v15?

> Lots of tests (blktest, xfstests, ltp io, ...) have been run with this patchset,
> and no regression was seen.
> 
> Thanks Christoph for reviewing the early version and providing very good
> suggestions, such as: introduce bio_init_with_vec_table(), remove another
> unnecessary helpers for cleanup and so on.
> 
> Thanks Christoph and Omar for reviewing V10/V11/V12 and providing lots of
> helpful comments.

Applied, thanks Ming. Let's hope it sticks!

-- 
Jens Axboe
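
For readers new to the series, the definition in the quoted cover letter
can be pictured with a small user-space sketch. This is not kernel code
and not how the patchset is implemented: `toy_bvec`, `coalesce_pages()`
and the use of bare page frame numbers are assumptions made purely for
illustration. The sketch folds a list of pages into runs of physically
contiguous frames, one vector entry per run.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical stand-in for a multi-page vector entry: a run of
 * physically contiguous pages, described by first frame and length. */
struct toy_bvec {
	unsigned long first_pfn;	/* page frame number of the first page */
	unsigned int len;		/* length of the run in bytes */
};

/*
 * Fold an array of page frame numbers (one full page each) into runs of
 * consecutive frames. Returns the number of entries written to out,
 * which must have room for npages entries in the worst case.
 */
size_t coalesce_pages(const unsigned long *pfns, size_t npages,
		      struct toy_bvec *out)
{
	size_t n = 0;

	for (size_t i = 0; i < npages; i++) {
		if (n > 0 &&
		    out[n - 1].first_pfn + out[n - 1].len / TOY_PAGE_SIZE == pfns[i]) {
			/* Next frame follows the current run: extend it. */
			out[n - 1].len += TOY_PAGE_SIZE;
		} else {
			/* Gap in the frame numbers: start a new run. */
			out[n].first_pfn = pfns[i];
			out[n].len = TOY_PAGE_SIZE;
			n++;
		}
	}
	return n;
}
```

With frames {100, 101, 102, 200, 201, 300}, the six pages fold into
three entries, which is the kind of saving the cover letter has in mind
when physically contiguous pages are common.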


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 15:49 ` Jens Axboe
@ 2019-02-15 17:14   ` Bart Van Assche
  2019-02-15 17:59     ` Jens Axboe
  2019-02-17 13:11     ` Ming Lei
  0 siblings, 2 replies; 41+ messages in thread
From: Bart Van Assche @ 2019-02-15 17:14 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> On 2/15/19 4:13 AM, Ming Lei wrote:
> > This patchset brings multi-page bvec into block layer:
> 
> Applied, thanks Ming. Let's hope it sticks!

Hi Jens and Ming,

Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
I have not yet tried to figure out which patch introduced the failure.
Anyway, this is what I see in the kernel log for test nvmeof-mp/002:

[  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[  475.621188] #PF error: [normal kernel read fault]
[  475.623148] PGD 0 P4D 0  
[  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
[  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B             5.0.0-rc6-dbg+ #1
[  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  475.633855] Workqueue: kblockd blk_mq_requeue_work
[  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
[  475.670948] Call Trace:
[  475.693515]  blk_recalc_rq_segments+0x2f/0x50
[  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
[  475.701142]  dm_mq_queue_rq+0x3d1/0x770
[  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
[  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
[  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
[  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
[  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
[  475.733468]  blk_mq_requeue_work+0x2cb/0x300
[  475.736473]  process_one_work+0x4f1/0xa40
[  475.739424]  worker_thread+0x67/0x5b0
[  475.741751]  kthread+0x1cf/0x1f0
[  475.746034]  ret_from_fork+0x24/0x30

(gdb) list *(__blk_recalc_rq_segments+0xbe)
0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
361                                                  struct bio *bio)
362     {
363             struct bio_vec bv, bvprv = { NULL };
364             int prev = 0;
365             unsigned int seg_size, nr_phys_segs;
366             unsigned front_seg_size = bio->bi_seg_front_size;
367             struct bio *fbio, *bbio;
368             struct bvec_iter iter;
369
370             if (!bio)

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 17:14   ` [dm-devel] " Bart Van Assche
@ 2019-02-15 17:59     ` Jens Axboe
  2019-02-17 13:13       ` Ming Lei
  2019-02-17 13:11     ` Ming Lei
  1 sibling, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2019-02-15 17:59 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On 2/15/19 10:14 AM, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
>> On 2/15/19 4:13 AM, Ming Lei wrote:
>>> This patchset brings multi-page bvec into block layer:
>>
>> Applied, thanks Ming. Let's hope it sticks!
> 
> Hi Jens and Ming,
> 
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> 
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [  475.621188] #PF error: [normal kernel read fault]
> [  475.623148] PGD 0 P4D 0  
> [  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
> [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B             5.0.0-rc6-dbg+ #1
> [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [  475.670948] Call Trace:
> [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> [  475.736473]  process_one_work+0x4f1/0xa40
> [  475.739424]  worker_thread+0x67/0x5b0
> [  475.741751]  kthread+0x1cf/0x1f0
> [  475.746034]  ret_from_fork+0x24/0x30
> 
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> 361                                                  struct bio *bio)
> 362     {
> 363             struct bio_vec bv, bvprv = { NULL };
> 364             int prev = 0;
> 365             unsigned int seg_size, nr_phys_segs;
> 366             unsigned front_seg_size = bio->bi_seg_front_size;
> 367             struct bio *fbio, *bbio;
> 368             struct bvec_iter iter;
> 369
> 370             if (!bio)

Just ran a few tests, and it also seems to cause about a 5% regression
in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
rand read IOPS out of a core; now I'm at ~1535K. The cycle stealers seem
to be blk_queue_split() and blk_rq_map_sg().

-- 
Jens Axboe
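
The "about 5%" figure follows directly from the two IOPS numbers above.
A small helper for checking the arithmetic is sketched below;
`percent_drop()` is a hypothetical name and not part of any tool used in
this thread.

```c
/* Relative drop from `before` to `after`, as a percentage of `before`. */
double percent_drop(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

percent_drop(1620.0, 1535.0) comes out at roughly 5.2, which matches the
reported regression.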


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 14:51 ` [PATCH V15 00/18] block: support multi-page bvec Christoph Hellwig
@ 2019-02-17 13:10   ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-17 13:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel

On Fri, Feb 15, 2019 at 03:51:26PM +0100, Christoph Hellwig wrote:
> I still don't understand why mp_bvec_last_segment isn't simply
> called bvec_last_segment as there is no conflict.  But I don't
> want to hold this series up on that as there only are two users
> left and we can always just fix it up later.

mp_bvec_last_segment() is a bvec helper, so it is better to keep its
name consistent with the other bvec helpers.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 17:14   ` [dm-devel] " Bart Van Assche
  2019-02-15 17:59     ` Jens Axboe
@ 2019-02-17 13:11     ` Ming Lei
  2019-02-19 16:28       ` Bart Van Assche
  1 sibling, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-17 13:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On Fri, Feb 15, 2019 at 09:14:15AM -0800, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> > On 2/15/19 4:13 AM, Ming Lei wrote:
> > > This patchset brings multi-page bvec into block layer:
> > 
> > Applied, thanks Ming. Let's hope it sticks!
> 
> Hi Jens and Ming,
> 
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> 
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [  475.621188] #PF error: [normal kernel read fault]
> [  475.623148] PGD 0 P4D 0  
> [  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
> [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B             5.0.0-rc6-dbg+ #1
> [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [  475.670948] Call Trace:
> [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> [  475.736473]  process_one_work+0x4f1/0xa40
> [  475.739424]  worker_thread+0x67/0x5b0
> [  475.741751]  kthread+0x1cf/0x1f0
> [  475.746034]  ret_from_fork+0x24/0x30
> 
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> 361                                                  struct bio *bio)
> 362     {
> 363             struct bio_vec bv, bvprv = { NULL };
> 364             int prev = 0;
> 365             unsigned int seg_size, nr_phys_segs;
> 366             unsigned front_seg_size = bio->bi_seg_front_size;
> 367             struct bio *fbio, *bbio;
> 368             struct bvec_iter iter;
> 369
> 370             if (!bio)
> 
> Bart.

Thanks for your test!

The following patch should fix this issue:


diff --git a/block/blk-merge.c b/block/blk-merge.c
index bed065904677..066b66430523 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	struct bio_vec bv, bvprv = { NULL };
 	int prev = 0;
 	unsigned int seg_size, nr_phys_segs;
-	unsigned front_seg_size = bio->bi_seg_front_size;
+	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
 
 	if (!bio)
 		return 0;
 
+	front_seg_size = bio->bi_seg_front_size;
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:

Thanks,
Ming
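
The ordering problem this patch addresses can be shown in isolation. The
sketch below is a user-space model only: `toy_bio` and
`toy_recalc_segments()` are hypothetical names, and the struct carries
just the one field that matters here. The point is that a declaration
with an initializer such as `x = bio->field` runs before any later
`if (!bio)` test, so the read has to move below the check. The faulting
address 0000000000000020 in the oops is consistent with reading a field
at a small offset from a NULL base pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for the first bio of a request; only the field
 * relevant to this illustration is modelled. */
struct toy_bio {
	unsigned int seg_front_size;
};

/*
 * Fixed ordering: test the pointer first, read the field afterwards.
 * Initialising front_seg_size in its declaration would dereference bio
 * before the NULL check and fault whenever bio is NULL.
 */
unsigned int toy_recalc_segments(const struct toy_bio *bio)
{
	unsigned int front_seg_size;

	if (!bio)
		return 0;

	front_seg_size = bio->seg_front_size;

	/* Returned only so the toy function has an observable result. */
	return front_seg_size;
}
```

Calling toy_recalc_segments(NULL) now returns 0 instead of faulting,
which mirrors the early return in the patched function.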

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-15 17:59     ` Jens Axboe
@ 2019-02-17 13:13       ` Ming Lei
  2019-02-18  7:49         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-17 13:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart Van Assche, Mike Snitzer, linux-mm, dm-devel,
	Christoph Hellwig, Sagi Grimberg, Darrick J . Wong,
	Omar Sandoval, cluster-devel, linux-ext4, Kent Overstreet,
	Boaz Harrosh, Gao Xiang, Coly Li, linux-raid, Bob Peterson,
	linux-bcache, Alexander Viro, Dave Chinner, David Sterba,
	linux-block, Theodore Ts'o, linux-kernel, linux-xfs,
	linux-fsdevel, linux-btrfs

On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> >> On 2/15/19 4:13 AM, Ming Lei wrote:
> >>> This patchset brings multi-page bvec into block layer:
> >>
> >> Applied, thanks Ming. Let's hope it sticks!
> > 
> > Hi Jens and Ming,
> > 
> > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > I have not yet tried to figure out which patch introduced the failure.
> > Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> > 
> > [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [  475.621188] #PF error: [normal kernel read fault]
> > [  475.623148] PGD 0 P4D 0  
> > [  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
> > [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B             5.0.0-rc6-dbg+ #1
> > [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> > [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> > [  475.670948] Call Trace:
> > [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> > [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> > [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> > [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> > [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> > [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> > [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> > [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> > [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> > [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> > [  475.736473]  process_one_work+0x4f1/0xa40
> > [  475.739424]  worker_thread+0x67/0x5b0
> > [  475.741751]  kthread+0x1cf/0x1f0
> > [  475.746034]  ret_from_fork+0x24/0x30
> > 
> > (gdb) list *(__blk_recalc_rq_segments+0xbe)
> > 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> > 361                                                  struct bio *bio)
> > 362     {
> > 363             struct bio_vec bv, bvprv = { NULL };
> > 364             int prev = 0;
> > 365             unsigned int seg_size, nr_phys_segs;
> > 366             unsigned front_seg_size = bio->bi_seg_front_size;
> > 367             struct bio *fbio, *bbio;
> > 368             struct bvec_iter iter;
> > 369
> > 370             if (!bio)
> 
> Just ran a few tests, and it also seems to cause about a 5% regression
> in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> rand read IOPS out of a core; now I'm at ~1535K. The cycle stealers seem
> to be blk_queue_split() and blk_rq_map_sg().

Could you share your test settings with us?

I will run null_blk first and see if it can be reproduced.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-17 13:13       ` Ming Lei
@ 2019-02-18  7:49         ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2019-02-18  7:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart Van Assche, Mike Snitzer, linux-mm, dm-devel,
	Christoph Hellwig, Sagi Grimberg, Darrick J . Wong,
	Omar Sandoval, cluster-devel, linux-ext4, Kent Overstreet,
	Boaz Harrosh, Gao Xiang, Coly Li, linux-raid, Bob Peterson,
	linux-bcache, Alexander Viro, Dave Chinner, David Sterba,
	linux-block, Theodore Ts'o, linux-kernel, linux-xfs,
	linux-fsdevel, linux-btrfs

On Sun, Feb 17, 2019 at 09:13:32PM +0800, Ming Lei wrote:
> On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> > On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> > >> On 2/15/19 4:13 AM, Ming Lei wrote:
> > >>> This patchset brings multi-page bvec into block layer:
> > >>
> > >> Applied, thanks Ming. Let's hope it sticks!
> > > 
> > > Hi Jens and Ming,
> > > 
> > > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > > I have not yet tried to figure out which patch introduced the failure.
> > > Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> > > 
> > > [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > > [  475.621188] #PF error: [normal kernel read fault]
> > > [  475.623148] PGD 0 P4D 0  
> > > [  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
> > > [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B             5.0.0-rc6-dbg+ #1
> > > [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > > [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> > > [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> > > [  475.670948] Call Trace:
> > > [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> > > [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> > > [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> > > [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> > > [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> > > [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> > > [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> > > [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> > > [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> > > [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> > > [  475.736473]  process_one_work+0x4f1/0xa40
> > > [  475.739424]  worker_thread+0x67/0x5b0
> > > [  475.741751]  kthread+0x1cf/0x1f0
> > > [  475.746034]  ret_from_fork+0x24/0x30
> > > 
> > > (gdb) list *(__blk_recalc_rq_segments+0xbe)
> > > 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> > > 361                                                  struct bio *bio)
> > > 362     {
> > > 363             struct bio_vec bv, bvprv = { NULL };
> > > 364             int prev = 0;
> > > 365             unsigned int seg_size, nr_phys_segs;
> > > 366             unsigned front_seg_size = bio->bi_seg_front_size;
> > > 367             struct bio *fbio, *bbio;
> > > 368             struct bvec_iter iter;
> > > 369
> > > 370             if (!bio)
> > 
> > Just ran a few tests, and it also seems to cause about a 5% regression
> > in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> > rand read IOPS out of a core; now I'm at ~1535K. The cycle stealers seem
> > to be blk_queue_split() and blk_rq_map_sg().
> 
> Could you share your test settings with us?
> 
> I will run null_blk first and see if it can be reproduced.

I can't reproduce this performance drop on null_blk with the following
settings:

- modprobe null_blk nr_devices=4 submit_queues=48
- test machine : dual socket, two NUMA nodes, 24cores/socket
- fio script:
fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=48 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 --name=randread --rw=randread

result: 10.7M IOPS (base kernel), 10.6M IOPS (patched kernel)

And if 'bs' is increased to 256k, 512k or 1024k, the IOPS improvement can be
~8% with the multi-page bvec patches in the above test.

BTW, no cost is added to bio_for_each_bvec(), so blk_queue_split() and
blk_rq_map_sg() should be fine. However, bio_for_each_segment_all()
may not be as quick as before.


Thanks,
Ming
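
One plausible reading of the last remark is that a walk over whole
vector entries takes one step per entry, while a page-granular walk has
to split each multi-page entry at page boundaries. The user-space sketch
below only counts steps under that assumption; `toy_mp_bvec`,
`steps_per_bvec()` and `steps_per_page()` are hypothetical names and do
not model the real iterator macros.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical multi-page vector entry: byte offset into the first page
 * and total length in bytes of a physically contiguous run. */
struct toy_mp_bvec {
	unsigned int offset;
	unsigned int len;
};

/* Walking whole entries: one step per vector, whatever its length. */
size_t steps_per_bvec(const struct toy_mp_bvec *v, size_t nvecs)
{
	(void)v;
	return nvecs;
}

/* Walking page-sized pieces: each entry is split at page boundaries,
 * so the step count grows with the number of pages it touches. */
size_t steps_per_page(const struct toy_mp_bvec *v, size_t nvecs)
{
	size_t steps = 0;

	for (size_t i = 0; i < nvecs; i++) {
		unsigned int off = v[i].offset % TOY_PAGE_SIZE;
		unsigned int left = v[i].len;

		while (left > 0) {
			unsigned int chunk = TOY_PAGE_SIZE - off;

			if (chunk > left)
				chunk = left;
			left -= chunk;
			off = 0;	/* later pieces start on a page boundary */
			steps++;
		}
	}
	return steps;
}
```

In this model a single page-aligned 256k entry is one step in the first
walk and 64 steps in the second, so the extra work only shows up in
callers that still need page granularity.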

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-17 13:11     ` Ming Lei
@ 2019-02-19 16:28       ` Bart Van Assche
  2019-02-20  1:17         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2019-02-19 16:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> The following patch should fix this issue:
> 
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index bed065904677..066b66430523 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>  	struct bio_vec bv, bvprv = { NULL };
>  	int prev = 0;
>  	unsigned int seg_size, nr_phys_segs;
> -	unsigned front_seg_size = bio->bi_seg_front_size;
> +	unsigned front_seg_size;
>  	struct bio *fbio, *bbio;
>  	struct bvec_iter iter;
>  
>  	if (!bio)
>  		return 0;
>  
> +	front_seg_size = bio->bi_seg_front_size;
> +
>  	switch (bio_op(bio)) {
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:

Hi Ming,

With this patch applied test nvmeof-mp/002 fails as follows:

[  694.700400] kernel BUG at lib/sg_pool.c:103!
[  694.705932] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: G    B             5.0.0-rc6-dbg+ #2
[  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  694.715113] Workqueue: kblockd blk_mq_run_work_fn
[  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
[  694.758222] Call Trace:
[  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
[  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
[  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
[  694.778417]  dm_mq_queue_rq+0x3d1/0x770
[  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
[  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
[  694.808933]  blk_mq_run_work_fn+0x27/0x30
[  694.810315]  process_one_work+0x4f1/0xa40
[  694.813178]  worker_thread+0x67/0x5b0
[  694.814487]  kthread+0x1cf/0x1f0
[  694.819134]  ret_from_fork+0x24/0x30

The code in sg_pool.c that triggers the BUG() statement is as follows:

int sg_alloc_table_chained(struct sg_table *table, int nents,
		struct scatterlist *first_chunk)
{
	int ret;

	BUG_ON(!nents);
[ ... ]

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-19 16:28       ` Bart Van Assche
@ 2019-02-20  1:17         ` Ming Lei
  2019-02-20  2:37           ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-20  1:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
> On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> > The following patch should fix this issue:
> > 
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bed065904677..066b66430523 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> >  	struct bio_vec bv, bvprv = { NULL };
> >  	int prev = 0;
> >  	unsigned int seg_size, nr_phys_segs;
> > -	unsigned front_seg_size = bio->bi_seg_front_size;
> > +	unsigned front_seg_size;
> >  	struct bio *fbio, *bbio;
> >  	struct bvec_iter iter;
> >  
> >  	if (!bio)
> >  		return 0;
> >  
> > +	front_seg_size = bio->bi_seg_front_size;
> > +
> >  	switch (bio_op(bio)) {
> >  	case REQ_OP_DISCARD:
> >  	case REQ_OP_SECURE_ERASE:
> 
> Hi Ming,
> 
> With this patch applied test nvmeof-mp/002 fails as follows:
> 
> [  694.700400] kernel BUG at lib/sg_pool.c:103!
> [  694.705932] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> [  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: G    B             5.0.0-rc6-dbg+ #2
> [  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [  694.715113] Workqueue: kblockd blk_mq_run_work_fn
> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
> [  694.758222] Call Trace:
> [  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
> [  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
> [  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
> [  694.778417]  dm_mq_queue_rq+0x3d1/0x770
> [  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> [  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
> [  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
> [  694.808933]  blk_mq_run_work_fn+0x27/0x30
> [  694.810315]  process_one_work+0x4f1/0xa40
> [  694.813178]  worker_thread+0x67/0x5b0
> [  694.814487]  kthread+0x1cf/0x1f0
> [  694.819134]  ret_from_fork+0x24/0x30
> 
> The code in sg_pool.c that triggers the BUG() statement is as follows:
> 
> int sg_alloc_table_chained(struct sg_table *table, int nents,
> 		struct scatterlist *first_chunk)
> {
> 	int ret;
> 
> 	BUG_ON(!nents);
> [ ... ]
> 
> Bart.

I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without the
mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks,
Ming

* Re: [dm-devel] [PATCH V15 00/18] block: support multi-page bvec
  2019-02-20  1:17         ` Ming Lei
@ 2019-02-20  2:37           ` Bart Van Assche
  0 siblings, 0 replies; 41+ messages in thread
From: Bart Van Assche @ 2019-02-20  2:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-mm, dm-devel, Christoph Hellwig,
	Sagi Grimberg, Darrick J . Wong, Omar Sandoval, cluster-devel,
	linux-ext4, Kent Overstreet, Boaz Harrosh, Gao Xiang, Coly Li,
	linux-raid, Bob Peterson, linux-bcache, Alexander Viro,
	Dave Chinner, David Sterba, linux-block, Theodore Ts'o,
	linux-kernel, linux-xfs, linux-fsdevel, linux-btrfs

On 2/19/19 5:17 PM, Ming Lei wrote:
> On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
>> With this patch applied test nvmeof-mp/002 fails as follows:
>>
>> [  694.700400] kernel BUG at lib/sg_pool.c:103!
>> [  694.705932] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
>> [  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: G    B             5.0.0-rc6-dbg+ #2
>> [  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> [  694.715113] Workqueue: kblockd blk_mq_run_work_fn
>> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
>> [  694.758222] Call Trace:
>> [  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
>> [  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
>> [  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
>> [  694.778417]  dm_mq_queue_rq+0x3d1/0x770
>> [  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
>> [  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
>> [  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
>> [  694.808933]  blk_mq_run_work_fn+0x27/0x30
>> [  694.810315]  process_one_work+0x4f1/0xa40
>> [  694.813178]  worker_thread+0x67/0x5b0
>> [  694.814487]  kthread+0x1cf/0x1f0
>> [  694.819134]  ret_from_fork+0x24/0x30
>>
>> The code in sg_pool.c that triggers the BUG() statement is as follows:
>>
>> int sg_alloc_table_chained(struct sg_table *table, int nents,
>> 		struct scatterlist *first_chunk)
>> {
>> 	int ret;
>>
>> 	BUG_ON(!nents);
>> [ ... ]
>>
>> Bart.
> 
> I can reproduce this issue("kernel BUG at lib/sg_pool.c:103") without mp-bvec patches,
> so looks it isn't the fault of this patchset.

Thanks Ming for your feedback.

Jens, I don't see that issue with kernel v5.0-rc6. Does that mean that 
the sg_pool BUG() is a regression in your for-next branch that predates 
Ming's multi-page bvec patch series?

Thanks,

Bart.

* Re: [PATCH V15 14/18] block: enable multipage bvecs
       [not found]   ` <CGME20190221084301eucas1p11e8841a62b4b1da3cccca661b6f4c29d@eucas1p1.samsung.com>
@ 2019-02-21  8:42     ` Marek Szyprowski
  2019-02-21  9:57       ` Ming Lei
  2019-02-27 20:47       ` Jon Hunter
  0 siblings, 2 replies; 41+ messages in thread
From: Marek Szyprowski @ 2019-02-21  8:42 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ulf Hansson, linux-mmc, 'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

Dear All,

On 2019-02-15 12:13, Ming Lei wrote:
> This patch pulls the trigger for multi-page bvecs.
>
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Since Linux next-20190218 I've observed problems with the block layer on one
of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
this issue led me to this change. This is also the first linux-next
release with this change merged. The issue is fully reproducible and can
be observed in the following kernel log:

sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
s3c-sdhci 12530000.sdhci: Got CD GPIO
mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
mmc0: new high speed SDHC card at address aaaa
mmcblk0: mmc0:aaaa SL16G 14.8 GiB

...

EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
EXT4-fs (mmcblk0p2): write access will be enabled during recovery
EXT4-fs (mmcblk0p2): recovery complete
EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
devtmpfs: mounted
Freeing unused kernel memory: 1024K
hub 1-3:1.0: USB hub found
Run /sbin/init as init process
hub 1-3:1.0: 3 ports detected
*** stack smashing detected ***: <unknown> terminated
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
[<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
[<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
[<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
[<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
[<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
[<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
[<c010c7a0>] (do_work_pending) from [<c0101034>] (slow_work_pending+0xc/0x20)
Exception stack(0xe88c3fb0 to 0xe88c3ff8)
3fa0:                                     00000000 bea7787c 00000005 b6e8d0b8
3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000 bea77b60
3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
CPU3: stopping

I would like to help debug and fix this issue, but I don't really
have any idea where to start. Here is some more detailed information about
my test system:

1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
source: arch/arm/boot/dts/exynos4412-odroidu3.dts)

2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
(drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
tree)

3. Rootfs: Ext4

4. Kernel config: arch/arm/configs/exynos_defconfig

I can gather more logs if needed, just let me know which kernel options to
enable. Reverting this commit on top of next-20190218 as well as on current
linux-next (tested with next-20190221) fixes this issue and makes the
system bootable again.

> ---
>  block/bio.c         | 22 +++++++++++++++-------
>  fs/iomap.c          |  4 ++--
>  fs/xfs/xfs_aops.c   |  4 ++--
>  include/linux/bio.h |  2 +-
>  4 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 968b12fea564..83a2dfa417ca 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
>   * @page: page to add
>   * @len: length of the data to add
>   * @off: offset of the data in @page
> + * @same_page: if %true only merge if the new data is in the same physical
> + *		page as the last segment of the bio.
>   *
>   * Try to add the data at @page + @off to the last bvec of @bio.  This is a
>   * a useful optimisation for file systems with a block size smaller than the
> @@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
>   * Return %true on success or %false on failure.
>   */
>  bool __bio_try_merge_page(struct bio *bio, struct page *page,
> -		unsigned int len, unsigned int off)
> +		unsigned int len, unsigned int off, bool same_page)
>  {
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
>  	if (bio->bi_vcnt > 0) {
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +		phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
> +			bv->bv_offset + bv->bv_len - 1;
> +		phys_addr_t page_addr = page_to_phys(page);
>  
> -		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> -			bv->bv_len += len;
> -			bio->bi_iter.bi_size += len;
> -			return true;
> -		}
> +		if (vec_end_addr + 1 != page_addr + off)
> +			return false;
> +		if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
> +			return false;
> +
> +		bv->bv_len += len;
> +		bio->bi_iter.bi_size += len;
> +		return true;
>  	}
>  	return false;
>  }
> @@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>  		 unsigned int len, unsigned int offset)
>  {
> -	if (!__bio_try_merge_page(bio, page, len, offset)) {
> +	if (!__bio_try_merge_page(bio, page, len, offset, false)) {
>  		if (bio_full(bio))
>  			return 0;
>  		__bio_add_page(bio, page, len, offset);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index af736acd9006..0c350e658b7f 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -318,7 +318,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	 */
>  	sector = iomap_sector(iomap, pos);
>  	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> -		if (__bio_try_merge_page(ctx->bio, page, plen, poff))
> +		if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
>  			goto done;
>  		is_contig = true;
>  	}
> @@ -349,7 +349,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		ctx->bio->bi_end_io = iomap_read_end_io;
>  	}
>  
> -	__bio_add_page(ctx->bio, page, plen, poff);
> +	bio_add_page(ctx->bio, page, plen, poff);
>  done:
>  	/*
>  	 * Move the caller beyond our range so that it keeps making progress.
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 1f1829e506e8..b9fd44168f61 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -616,12 +616,12 @@ xfs_add_to_ioend(
>  				bdev, sector);
>  	}
>  
> -	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
> +	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
>  		if (iop)
>  			atomic_inc(&iop->write_count);
>  		if (bio_full(wpc->ioend->io_bio))
>  			xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
> -		__bio_add_page(wpc->ioend->io_bio, page, len, poff);
> +		bio_add_page(wpc->ioend->io_bio, page, len, poff);
>  	}
>  
>  	wpc->ioend->io_size += len;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 089370eb84d9..9f77adcfde82 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
>  extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
>  			   unsigned int, unsigned int);
>  bool __bio_try_merge_page(struct bio *bio, struct page *page,
> -		unsigned int len, unsigned int off);
> +		unsigned int len, unsigned int off, bool same_page);
>  void __bio_add_page(struct bio *bio, struct page *page,
>  		unsigned int len, unsigned int off);
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
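
The merge test in the quoted __bio_try_merge_page() hunk can be tried out
in userspace with the sketch below. It is only an approximation under
stated assumptions: pages are represented directly by their physical
address, the page size is fixed at 4 KiB, and struct bvec_sketch,
SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK and try_merge_sketch() are hypothetical
names, not kernel APIs. The two comparisons follow the ones in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (~(uint64_t)(SKETCH_PAGE_SIZE - 1))

/* Hypothetical flattened bvec: physical address of its first page,
 * plus the offset into that page and the total length in bytes. */
struct bvec_sketch {
	uint64_t page_phys;
	unsigned int offset;
	unsigned int len;
};

/*
 * Try to append (page_phys + off, len) to the last bvec.
 * Merging needs the new data to start right after the last byte of the bvec.
 * With same_page set, the bvec's last byte must also lie in the new page.
 */
static bool try_merge_sketch(struct bvec_sketch *bv, uint64_t page_phys,
			     unsigned int len, unsigned int off,
			     bool same_page)
{
	uint64_t vec_end_addr = bv->page_phys + bv->offset + bv->len - 1;

	if (vec_end_addr + 1 != page_phys + off)
		return false;
	if (same_page && (vec_end_addr & SKETCH_PAGE_MASK) != page_phys)
		return false;

	bv->len += len;
	return true;
}
```

With same_page set to false, two physically adjacent pages collapse into a
single bvec, which is the multi-page case this patch enables. With
same_page set to true, only data within the page that holds the bvec's last
byte is merged, which matches what the iomap and XFS callers in the diff
request.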

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21  8:42     ` Marek Szyprowski
@ 2019-02-21  9:57       ` Ming Lei
  2019-02-21 10:08         ` Marek Szyprowski
  2019-02-27 20:47       ` Jon Hunter
  1 sibling, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-21  9:57 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> Dear All,
> 
> On 2019-02-15 12:13, Ming Lei wrote:
> > This patch pulls the trigger for multi-page bvecs.
> >
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Since Linux next-20190218 I've observed problems with block layer on one
> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> this issue led me to this change. This is also the first linux-next
> release with this change merged. The issue is fully reproducible and can
> be observed in the following kernel log:
> 
> sdhci: Secure Digital Host Controller Interface driver
> sdhci: Copyright(c) Pierre Ossman
> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
> s3c-sdhci 12530000.sdhci: Got CD GPIO
> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
> mmc0: new high speed SDHC card at address aaaa
> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
> 
> ...
> 
> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> EXT4-fs (mmcblk0p2): recovery complete
> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> devtmpfs: mounted
> Freeing unused kernel memory: 1024K
> hub 1-3:1.0: USB hub found
> Run /sbin/init as init process
> hub 1-3:1.0: 3 ports detected
> *** stack smashing detected ***: <unknown> terminated
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
> [<c010c7a0>] (do_work_pending) from [<c0101034>]
> (slow_work_pending+0xc/0x20)
> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> 3fa0:                                     00000000 bea7787c 00000005
> b6e8d0b8
> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
> bea77b60
> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
> CPU3: stopping
> 
> I would like to help debugging and fixing this issue, but I don't really
> have idea where to start. Here are some more detailed information about
> my test system:
> 
> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> 
> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> tree)
> 
> 3. Rootfs: Ext4
> 
> 4. Kernel config: arch/arm/configs/exynos_defconfig
> 
> I can gather more logs if needed, just let me which kernel option to
> enable. Reverting this commit on top of next-20190218 as well as current
> linux-next (tested with next-20190221) fixes this issue and makes the
> system bootable again.

Could you test the patch at the following link and see if it makes a difference?

https://marc.info/?l=linux-aio&m=155070355614541&w=2

Thanks,
Ming

* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21  9:57       ` Ming Lei
@ 2019-02-21 10:08         ` Marek Szyprowski
  2019-02-21 10:16           ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Szyprowski @ 2019-02-21 10:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

Hi Ming,

On 2019-02-21 10:57, Ming Lei wrote:
> On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
>> On 2019-02-15 12:13, Ming Lei wrote:
>>> This patch pulls the trigger for multi-page bvecs.
>>>
>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> Since Linux next-20190218 I've observed problems with block layer on one
>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
>> this issue led me to this change. This is also the first linux-next
>> release with this change merged. The issue is fully reproducible and can
>> be observed in the following kernel log:
>>
>> sdhci: Secure Digital Host Controller Interface driver
>> sdhci: Copyright(c) Pierre Ossman
>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
>> s3c-sdhci 12530000.sdhci: Got CD GPIO
>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
>> mmc0: new high speed SDHC card at address aaaa
>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
>>
>> ...
>>
>> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
>> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
>> EXT4-fs (mmcblk0p2): recovery complete
>> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
>> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
>> devtmpfs: mounted
>> Freeing unused kernel memory: 1024K
>> hub 1-3:1.0: USB hub found
>> Run /sbin/init as init process
>> hub 1-3:1.0: 3 ports detected
>> *** stack smashing detected ***: <unknown> terminated
>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
>> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
>> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
>> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
>> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
>> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
>> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
>> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
>> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
>> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
>> [<c010c7a0>] (do_work_pending) from [<c0101034>]
>> (slow_work_pending+0xc/0x20)
>> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
>> 3fa0:                                     00000000 bea7787c 00000005
>> b6e8d0b8
>> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
>> bea77b60
>> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
>> CPU3: stopping
>>
>> I would like to help debugging and fixing this issue, but I don't really
>> have idea where to start. Here are some more detailed information about
>> my test system:
>>
>> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
>> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
>>
>> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
>> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
>> tree)
>>
>> 3. Rootfs: Ext4
>>
>> 4. Kernel config: arch/arm/configs/exynos_defconfig
>>
>> I can gather more logs if needed, just let me which kernel option to
>> enable. Reverting this commit on top of next-20190218 as well as current
>> linux-next (tested with next-20190221) fixes this issue and makes the
>> system bootable again.
> Could you test the patch in following link and see if it can make a difference?
>
> https://marc.info/?l=linux-aio&m=155070355614541&w=2

I've tested that patch, but it doesn't make any difference on the test
system. In the log I see no warning added by it.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21 10:08         ` Marek Szyprowski
@ 2019-02-21 10:16           ` Ming Lei
  2019-02-21 10:22             ` Marek Szyprowski
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-21 10:16 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
> Hi Ming,
> 
> On 2019-02-21 10:57, Ming Lei wrote:
> > On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> >> On 2019-02-15 12:13, Ming Lei wrote:
> >>> This patch pulls the trigger for multi-page bvecs.
> >>>
> >>> Reviewed-by: Omar Sandoval <osandov@fb.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >> Since Linux next-20190218 I've observed problems with block layer on one
> >> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> >> this issue led me to this change. This is also the first linux-next
> >> release with this change merged. The issue is fully reproducible and can
> >> be observed in the following kernel log:
> >>
> >> sdhci: Secure Digital Host Controller Interface driver
> >> sdhci: Copyright(c) Pierre Ossman
> >> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
> >> s3c-sdhci 12530000.sdhci: Got CD GPIO
> >> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
> >> mmc0: new high speed SDHC card at address aaaa
> >> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
> >>
> >> ...
> >>
> >> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> >> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> >> EXT4-fs (mmcblk0p2): recovery complete
> >> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
> >> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> >> devtmpfs: mounted
> >> Freeing unused kernel memory: 1024K
> >> hub 1-3:1.0: USB hub found
> >> Run /sbin/init as init process
> >> hub 1-3:1.0: 3 ports detected
> >> *** stack smashing detected ***: <unknown> terminated
> >> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
> >> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> >> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> >> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
> >> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
> >> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
> >> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
> >> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
> >> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
> >> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
> >> [<c010c7a0>] (do_work_pending) from [<c0101034>]
> >> (slow_work_pending+0xc/0x20)
> >> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> >> 3fa0:                                     00000000 bea7787c 00000005
> >> b6e8d0b8
> >> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
> >> bea77b60
> >> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
> >> CPU3: stopping
> >>
> >> I would like to help debugging and fixing this issue, but I don't really
> >> have idea where to start. Here are some more detailed information about
> >> my test system:
> >>
> >> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> >> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> >>
> >> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> >> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> >> tree)
> >>
> >> 3. Rootfs: Ext4
> >>
> >> 4. Kernel config: arch/arm/configs/exynos_defconfig
> >>
> >> I can gather more logs if needed, just let me which kernel option to
> >> enable. Reverting this commit on top of next-20190218 as well as current
> >> linux-next (tested with next-20190221) fixes this issue and makes the
> >> system bootable again.
> > Could you test the patch in following link and see if it can make a difference?
> >
> > https://marc.info/?l=linux-aio&m=155070355614541&w=2
> 
> I've tested that patch, but it doesn't make any difference on the test
> system. In the log I see no warning added by it.

I guess it might be related to memory corruption. Could you enable the
following debug options and post the dmesg log?

CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_KASAN=y

Thanks,
Ming

* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21 10:16           ` Ming Lei
@ 2019-02-21 10:22             ` Marek Szyprowski
  2019-02-21 10:38               ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Szyprowski @ 2019-02-21 10:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

Hi Ming,

On 2019-02-21 11:16, Ming Lei wrote:
> On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
>> On 2019-02-21 10:57, Ming Lei wrote:
>>> On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
>>>> On 2019-02-15 12:13, Ming Lei wrote:
>>>>> This patch pulls the trigger for multi-page bvecs.
>>>>>
>>>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> Since Linux next-20190218 I've observed problems with block layer on one
>>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
>>>> this issue led me to this change. This is also the first linux-next
>>>> release with this change merged. The issue is fully reproducible and can
>>>> be observed in the following kernel log:
>>>>
>>>> sdhci: Secure Digital Host Controller Interface driver
>>>> sdhci: Copyright(c) Pierre Ossman
>>>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
>>>> s3c-sdhci 12530000.sdhci: Got CD GPIO
>>>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
>>>> mmc0: new high speed SDHC card at address aaaa
>>>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
>>>>
>>>> ...
>>>>
>>>> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
>>>> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
>>>> EXT4-fs (mmcblk0p2): recovery complete
>>>> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
>>>> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
>>>> devtmpfs: mounted
>>>> Freeing unused kernel memory: 1024K
>>>> hub 1-3:1.0: USB hub found
>>>> Run /sbin/init as init process
>>>> hub 1-3:1.0: 3 ports detected
>>>> *** stack smashing detected ***: <unknown> terminated
>>>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
>>>> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
>>>> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
>>>> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
>>>> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
>>>> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
>>>> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
>>>> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
>>>> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
>>>> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
>>>> [<c010c7a0>] (do_work_pending) from [<c0101034>]
>>>> (slow_work_pending+0xc/0x20)
>>>> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
>>>> 3fa0:                                     00000000 bea7787c 00000005
>>>> b6e8d0b8
>>>> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
>>>> bea77b60
>>>> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
>>>> CPU3: stopping
>>>>
>>>> I would like to help debugging and fixing this issue, but I don't really
>>>> have idea where to start. Here are some more detailed information about
>>>> my test system:
>>>>
>>>> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
>>>> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
>>>>
>>>> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
>>>> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
>>>> tree)
>>>>
>>>> 3. Rootfs: Ext4
>>>>
>>>> 4. Kernel config: arch/arm/configs/exynos_defconfig
>>>>
>>>> I can gather more logs if needed, just let me which kernel option to
>>>> enable. Reverting this commit on top of next-20190218 as well as current
>>>> linux-next (tested with next-20190221) fixes this issue and makes the
>>>> system bootable again.
>>> Could you test the patch in following link and see if it can make a difference?
>>>
>>> https://marc.info/?l=linux-aio&m=155070355614541&w=2
>> I've tested that patch, but it doesn't make any difference on the test
>> system. In the log I see no warning added by it.
> I guess it might be related with memory corruption, could you enable the
> following debug options and post the dmesg log?
>
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_KASAN=y

It won't be that easy, as neither of the above options is available on
32-bit ARM. I will try to apply some of the ARM KASAN patches floating
around the net and let you know the result.

Best regards

-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21 10:22             ` Marek Szyprowski
@ 2019-02-21 10:38               ` Ming Lei
  2019-02-21 11:42                 ` Marek Szyprowski
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-21 10:38 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

On Thu, Feb 21, 2019 at 11:22:39AM +0100, Marek Szyprowski wrote:
> Hi Ming,
> 
> On 2019-02-21 11:16, Ming Lei wrote:
> > On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
> >> On 2019-02-21 10:57, Ming Lei wrote:
> >>> On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> >>>> On 2019-02-15 12:13, Ming Lei wrote:
> >>>>> This patch pulls the trigger for multi-page bvecs.
> >>>>>
> >>>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>> Since Linux next-20190218 I've observed problems with block layer on one
> >>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> >>>> this issue led me to this change. This is also the first linux-next
> >>>> release with this change merged. The issue is fully reproducible and can
> >>>> be observed in the following kernel log:
> >>>>
> >>>> sdhci: Secure Digital Host Controller Interface driver
> >>>> sdhci: Copyright(c) Pierre Ossman
> >>>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
> >>>> s3c-sdhci 12530000.sdhci: Got CD GPIO
> >>>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
> >>>> mmc0: new high speed SDHC card at address aaaa
> >>>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
> >>>>
> >>>> ...
> >>>>
> >>>> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> >>>> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> >>>> EXT4-fs (mmcblk0p2): recovery complete
> >>>> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
> >>>> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> >>>> devtmpfs: mounted
> >>>> Freeing unused kernel memory: 1024K
> >>>> hub 1-3:1.0: USB hub found
> >>>> Run /sbin/init as init process
> >>>> hub 1-3:1.0: 3 ports detected
> >>>> *** stack smashing detected ***: <unknown> terminated
> >>>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
> >>>> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> >>>> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> >>>> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
> >>>> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
> >>>> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
> >>>> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
> >>>> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
> >>>> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
> >>>> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
> >>>> [<c010c7a0>] (do_work_pending) from [<c0101034>]
> >>>> (slow_work_pending+0xc/0x20)
> >>>> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> >>>> 3fa0:                                     00000000 bea7787c 00000005
> >>>> b6e8d0b8
> >>>> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
> >>>> bea77b60
> >>>> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
> >>>> CPU3: stopping
> >>>>
> >>>> I would like to help debugging and fixing this issue, but I don't really
> >>>> have idea where to start. Here are some more detailed information about
> >>>> my test system:
> >>>>
> >>>> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> >>>> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> >>>>
> >>>> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> >>>> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> >>>> tree)
> >>>>
> >>>> 3. Rootfs: Ext4
> >>>>
> >>>> 4. Kernel config: arch/arm/configs/exynos_defconfig
> >>>>
> >>>> I can gather more logs if needed, just let me which kernel option to
> >>>> enable. Reverting this commit on top of next-20190218 as well as current
> >>>> linux-next (tested with next-20190221) fixes this issue and makes the
> >>>> system bootable again.
> >>> Could you test the patch in following link and see if it can make a difference?
> >>>
> >>> https://marc.info/?l=linux-aio&m=155070355614541&w=2
> >> I've tested that patch, but it doesn't make any difference on the test
> >> system. In the log I see no warning added by it.
> > I guess it might be related with memory corruption, could you enable the
> > following debug options and post the dmesg log?
> >
> > CONFIG_DEBUG_STACKOVERFLOW=y
> > CONFIG_KASAN=y
> 
> It won't be that easy as none of the above options is available on ARM
> 32bit. I will try to apply some ARM KASAN patches floating on the net
> and let you know the result.

Hi Marek,

Could you test the following patch?

diff --git a/block/bounce.c b/block/bounce.c
index add085e28b1d..0c618c0b3cf8 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -295,7 +295,6 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bool bounce = false;
 	int sectors = 0;
 	bool passthrough = bio_is_passthrough(*bio_orig);
-	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment(from, *bio_orig, iter) {
 		if (i++ < BIO_MAX_PAGES)
@@ -315,7 +314,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bio = bounce_clone_bio(*bio_orig, GFP_NOIO, passthrough ? NULL :
 			&bounce_bio_set);
 
-	bio_for_each_segment_all(to, bio, i, iter_all) {
+	/* bio won't be multi-page bvec, so operate its bvec table directly */
+	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
 		struct page *page = to->bv_page;
 
 		if (page_to_pfn(page) <= q->limits.bounce_pfn)

Thanks,
Ming
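
The comment in the patch above says the bounce clone "won't be multi-page
bvec", so its table can be walked directly instead of being split into
per-page segments. Below is a minimal userspace sketch of that reasoning.
It is not kernel code: struct model_bvec, MODEL_PAGE_SIZE and all function
names are hypothetical stand-ins, and the 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for this model */

/* Hypothetical stand-in for a bvec: one physically contiguous range. */
struct model_bvec {
	size_t first_page; /* index of the first page in the range */
	size_t offset;     /* byte offset into the first page */
	size_t len;        /* length in bytes, may span several pages */
};

/* Number of single-page pieces that one entry covers. */
static size_t pages_spanned(const struct model_bvec *bv)
{
	assert(bv->len > 0 && bv->offset < MODEL_PAGE_SIZE);
	return (bv->offset + bv->len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* What a per-page iterator would visit: every page of every entry. */
static size_t count_page_segments(const struct model_bvec *tbl, size_t n)
{
	size_t i, total = 0;

	for (i = 0; i < n; i++)
		total += pages_spanned(&tbl[i]);
	return total;
}

/* True when no entry crosses a page boundary. */
static int is_single_page_table(const struct model_bvec *tbl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pages_spanned(&tbl[i]) != 1)
			return 0;
	return 1;
}
```

When is_single_page_table() holds, count_page_segments() equals the number
of table entries, so a plain loop over the table visits the same pages as a
per-page iterator. For a table with multi-page entries the two counts
differ, which is the case the direct loop must never be given.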


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21 10:38               ` Ming Lei
@ 2019-02-21 11:42                 ` Marek Szyprowski
  0 siblings, 0 replies; 41+ messages in thread
From: Marek Szyprowski @ 2019-02-21 11:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz

Hi Ming,

On 2019-02-21 11:38, Ming Lei wrote:
> On Thu, Feb 21, 2019 at 11:22:39AM +0100, Marek Szyprowski wrote:
>> On 2019-02-21 11:16, Ming Lei wrote:
>>> On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
>>>> On 2019-02-21 10:57, Ming Lei wrote:
>>>>> On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
>>>>>> On 2019-02-15 12:13, Ming Lei wrote:
>>>>>>> This patch pulls the trigger for multi-page bvecs.
>>>>>>>
>>>>>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>>>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>>> Since Linux next-20190218 I've observed problems with block layer on one
>>>>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
>>>>>> this issue led me to this change. This is also the first linux-next
>>>>>> release with this change merged. The issue is fully reproducible and can
>>>>>> be observed in the following kernel log:
>>>>>>
>>>>>> sdhci: Secure Digital Host Controller Interface driver
>>>>>> sdhci: Copyright(c) Pierre Ossman
>>>>>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
>>>>>> s3c-sdhci 12530000.sdhci: Got CD GPIO
>>>>>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
>>>>>> mmc0: new high speed SDHC card at address aaaa
>>>>>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
>>>>>> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
>>>>>> EXT4-fs (mmcblk0p2): recovery complete
>>>>>> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
>>>>>> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
>>>>>> devtmpfs: mounted
>>>>>> Freeing unused kernel memory: 1024K
>>>>>> hub 1-3:1.0: USB hub found
>>>>>> Run /sbin/init as init process
>>>>>> hub 1-3:1.0: 3 ports detected
>>>>>> *** stack smashing detected ***: <unknown> terminated
>>>>>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
>>>>>> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
>>>>>> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
>>>>>> [<c01118d0>] (unwind_backtrace) from [<c010d794>] (show_stack+0x10/0x14)
>>>>>> [<c010d794>] (show_stack) from [<c09ff8a4>] (dump_stack+0x90/0xc8)
>>>>>> [<c09ff8a4>] (dump_stack) from [<c0125944>] (panic+0xfc/0x304)
>>>>>> [<c0125944>] (panic) from [<c012bc98>] (do_exit+0xabc/0xc6c)
>>>>>> [<c012bc98>] (do_exit) from [<c012c100>] (do_group_exit+0x3c/0xbc)
>>>>>> [<c012c100>] (do_group_exit) from [<c0138908>] (get_signal+0x130/0xbf4)
>>>>>> [<c0138908>] (get_signal) from [<c010c7a0>] (do_work_pending+0x130/0x618)
>>>>>> [<c010c7a0>] (do_work_pending) from [<c0101034>]
>>>>>> (slow_work_pending+0xc/0x20)
>>>>>> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
>>>>>> 3fa0:                                     00000000 bea7787c 00000005
>>>>>> b6e8d0b8
>>>>>> 3fc0: bea77a18 b6f92010 b6e8d0b8 00000001 b6e8d0c8 00000001 b6e8c000
>>>>>> bea77b60
>>>>>> 3fe0: 00000020 bea77998 ffffffff b6d52368 60000050 ffffffff
>>>>>> CPU3: stopping
>>>>>>
>>>>>> I would like to help debugging and fixing this issue, but I don't really
>>>>>> have idea where to start. Here are some more detailed information about
>>>>>> my test system:
>>>>>>
>>>>>> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
>>>>>> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
>>>>>>
>>>>>> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
>>>>>> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
>>>>>> tree)
>>>>>>
>>>>>> 3. Rootfs: Ext4
>>>>>>
>>>>>> 4. Kernel config: arch/arm/configs/exynos_defconfig
>>>>>>
>>>>>> I can gather more logs if needed, just let me which kernel option to
>>>>>> enable. Reverting this commit on top of next-20190218 as well as current
>>>>>> linux-next (tested with next-20190221) fixes this issue and makes the
>>>>>> system bootable again.
>>>>> Could you test the patch in following link and see if it can make a difference?
>>>>>
>>>>> https://marc.info/?l=linux-aio&m=155070355614541&w=2
>>>> I've tested that patch, but it doesn't make any difference on the test
>>>> system. In the log I see no warning added by it.
>>> I guess it might be related with memory corruption, could you enable the
>>> following debug options and post the dmesg log?
>>>
>>> CONFIG_DEBUG_STACKOVERFLOW=y
>>> CONFIG_KASAN=y
>> It won't be that easy as none of the above options is available on ARM
>> 32bit. I will try to apply some ARM KASAN patches floating on the net
>> and let you know the result.
> Hi Marek,
>
> Could you test the following patch?

Yes. Sadly, no change observed.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-21  8:42     ` Marek Szyprowski
  2019-02-21  9:57       ` Ming Lei
@ 2019-02-27 20:47       ` Jon Hunter
  2019-02-27 23:29         ` Ming Lei
  1 sibling, 1 reply; 41+ messages in thread
From: Jon Hunter @ 2019-02-27 20:47 UTC (permalink / raw)
  To: Marek Szyprowski, Ming Lei, Jens Axboe
  Cc: linux-block, linux-kernel, linux-mm, Theodore Ts'o,
	Omar Sandoval, Sagi Grimberg, Dave Chinner, Kent Overstreet,
	Mike Snitzer, dm-devel, Alexander Viro, linux-fsdevel,
	linux-raid, David Sterba, linux-btrfs, Darrick J . Wong,
	linux-xfs, Gao Xiang, Christoph Hellwig, linux-ext4, Coly Li,
	linux-bcache, Boaz Harrosh, Bob Peterson, cluster-devel,
	Ulf Hansson, linux-mmc, 'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz,
	linux-tegra


On 21/02/2019 08:42, Marek Szyprowski wrote:
> Dear All,
> 
> On 2019-02-15 12:13, Ming Lei wrote:
>> This patch pulls the trigger for multi-page bvecs.
>>
>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Since Linux next-20190218 I've observed problems with block layer on one
> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> this issue led me to this change. This is also the first linux-next
> release with this change merged. The issue is fully reproducible and can
> be observed in the following kernel log:
> 
> sdhci: Secure Digital Host Controller Interface driver
> sdhci: Copyright(c) Pierre Ossman
> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
> s3c-sdhci 12530000.sdhci: Got CD GPIO
> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
> mmc0: new high speed SDHC card at address aaaa
> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
I have also noticed some failures when writing to an eMMC device on one
of our Tegra boards. We have a simple eMMC write/read test and it is
currently failing because the data written does not match the source.

I did not see the same crash as reported here; however, in our case the
rootfs is NFS mounted, so we probably would not. The bisect nevertheless
points to this commit, and reverting it on top of -next fixes the issue.

Cheers
Jon

-- 
nvpublic
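
The write/read test mentioned above is not shown in this thread. As a
rough idea of what such a check does, here is a minimal sketch that writes
a known pattern, reads it back and compares. The function name
verify_roundtrip() and the byte pattern are assumptions. A test against a
real eMMC device would also have to bypass or drop the page cache so that
the read really comes from the device, which this sketch does not do.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Write a deterministic pattern of len bytes to fp, read it back and
 * compare. fp must be an empty stream opened for update ("w+b").
 * Returns 0 when the data matches, -1 on I/O error or mismatch.
 */
static int verify_roundtrip(FILE *fp, size_t len)
{
	unsigned char *src, *dst;
	size_t i;
	int ret = -1;

	assert(fp != NULL && len > 0);
	src = malloc(len);
	dst = malloc(len);
	if (!src || !dst)
		goto out;

	/* Position-dependent pattern so misplaced data is detected. */
	for (i = 0; i < len; i++)
		src[i] = (unsigned char)(i * 31u + 7u);

	if (fwrite(src, 1, len, fp) != len || fflush(fp) != 0)
		goto out;
	rewind(fp);
	if (fread(dst, 1, len, fp) != len)
		goto out;

	ret = memcmp(src, dst, len) == 0 ? 0 : -1;
out:
	free(src);
	free(dst);
	return ret;
}
```

Calling verify_roundtrip(tmpfile(), 65536) exercises the function on a
temporary file. A non-zero return corresponds to the "data written does
not match the source" failure described above.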


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-27 20:47       ` Jon Hunter
@ 2019-02-27 23:29         ` Ming Lei
  2019-02-28  7:51           ` Marek Szyprowski
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2019-02-27 23:29 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Marek Szyprowski, Jens Axboe, linux-block, linux-kernel,
	linux-mm, Theodore Ts'o, Omar Sandoval, Sagi Grimberg,
	Dave Chinner, Kent Overstreet, Mike Snitzer, dm-devel,
	Alexander Viro, linux-fsdevel, linux-raid, David Sterba,
	linux-btrfs, Darrick J . Wong, linux-xfs, Gao Xiang,
	Christoph Hellwig, linux-ext4, Coly Li, linux-bcache,
	Boaz Harrosh, Bob Peterson, cluster-devel, Ulf Hansson,
	linux-mmc, 'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz,
	linux-tegra

On Wed, Feb 27, 2019 at 08:47:09PM +0000, Jon Hunter wrote:
> 
> On 21/02/2019 08:42, Marek Szyprowski wrote:
> > Dear All,
> > 
> > On 2019-02-15 12:13, Ming Lei wrote:
> >> This patch pulls the trigger for multi-page bvecs.
> >>
> >> Reviewed-by: Omar Sandoval <osandov@fb.com>
> >> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > 
> > Since Linux next-20190218 I've observed problems with block layer on one
> > of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> > this issue led me to this change. This is also the first linux-next
> > release with this change merged. The issue is fully reproducible and can
> > be observed in the following kernel log:
> > 
> > sdhci: Secure Digital Host Controller Interface driver
> > sdhci: Copyright(c) Pierre Ossman
> > s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
> > s3c-sdhci 12530000.sdhci: Got CD GPIO
> > mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
> > mmc0: new high speed SDHC card at address aaaa
> > mmcblk0: mmc0:aaaa SL16G 14.8 GiB
> I have also noticed some failures when writing to an eMMC device on one
> of our Tegra boards. We have a simple eMMC write/read test and it is
> currently failing because the data written does not match the source.
> 
> I did not seem the same crash as reported here, however, in our case the
> rootfs is NFS mounted and so probably would not. However, the bisect
> points to this commit and reverting on top of -next fixes the issues.

It is sdhci, so this is probably related to the max segment size. Could
you test the following patch:

https://marc.info/?l=linux-mmc&m=155128334122951&w=2

Thanks,
Ming
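
The contents of the linked patch are not reproduced in this thread. To
make the max segment size remark concrete: with multi-page bvecs a single
contiguous range can be longer than the largest segment a controller
accepts, so it has to be cut into several scatter-gather segments. Below
is a minimal userspace sketch of that splitting. The function names are
hypothetical, and the 65536-byte limit used in the note after the code is
only an illustrative value, not taken from the sdhci driver.

```c
#include <assert.h>
#include <stddef.h>

/* Segments needed for one contiguous range of len bytes. */
static size_t segments_needed(size_t len, size_t max_seg_size)
{
	assert(max_seg_size > 0);
	return (len + max_seg_size - 1) / max_seg_size;
}

/*
 * Cut a contiguous range into pieces of at most max_seg_size bytes.
 * Writes the piece lengths to out[] and returns how many were written,
 * never more than out_cap.
 */
static size_t split_range(size_t len, size_t max_seg_size,
			  size_t *out, size_t out_cap)
{
	size_t n = 0;

	assert(max_seg_size > 0);
	while (len > 0 && n < out_cap) {
		size_t chunk = len < max_seg_size ? len : max_seg_size;

		out[n++] = chunk;
		len -= chunk;
	}
	return n;
}
```

With a limit of 65536 bytes, a 4096-byte range fits in one segment, while
a 135168-byte range needs three: two full segments and a 4096-byte tail.
Passing the whole range as one segment would exceed the limit the
controller advertises.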


* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-27 23:29         ` Ming Lei
@ 2019-02-28  7:51           ` Marek Szyprowski
  2019-02-28 12:39             ` Jon Hunter
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Szyprowski @ 2019-02-28  7:51 UTC (permalink / raw)
  To: Ming Lei, Jon Hunter
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz,
	linux-tegra

Hi Ming,

On 2019-02-28 00:29, Ming Lei wrote:
> On Wed, Feb 27, 2019 at 08:47:09PM +0000, Jon Hunter wrote:
>> On 21/02/2019 08:42, Marek Szyprowski wrote:
>>> On 2019-02-15 12:13, Ming Lei wrote:
>>>> This patch pulls the trigger for multi-page bvecs.
>>>>
>>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> Since Linux next-20190218 I've observed problems with block layer on one
>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
>>> this issue led me to this change. This is also the first linux-next
>>> release with this change merged. The issue is fully reproducible and can
>>> be observed in the following kernel log:
>>>
>>> sdhci: Secure Digital Host Controller Interface driver
>>> sdhci: Copyright(c) Pierre Ossman
>>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
>>> s3c-sdhci 12530000.sdhci: Got CD GPIO
>>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
>>> mmc0: new high speed SDHC card at address aaaa
>>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
>> I have also noticed some failures when writing to an eMMC device on one
>> of our Tegra boards. We have a simple eMMC write/read test and it is
>> currently failing because the data written does not match the source.
>>
>> I did not seem the same crash as reported here, however, in our case the
>> rootfs is NFS mounted and so probably would not. However, the bisect
>> points to this commit and reverting on top of -next fixes the issues.
> It is sdhci, probably related with max segment size, could you test the
> following patch:
>
> https://marc.info/?l=linux-mmc&m=155128334122951&w=2

This seems to fix my issue too! Thanks!

It also fixed the boot issue from a USB stick (Exynos EHCI / Mass
Storage), but I suspect that reading the partition table from the SD
card (which holds the bootloader and thus must be present to boot the
device) was enough to trash memory/page cache and break the boot process.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [PATCH V15 14/18] block: enable multipage bvecs
  2019-02-28  7:51           ` Marek Szyprowski
@ 2019-02-28 12:39             ` Jon Hunter
  0 siblings, 0 replies; 41+ messages in thread
From: Jon Hunter @ 2019-02-28 12:39 UTC (permalink / raw)
  To: Marek Szyprowski, Ming Lei
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm,
	Theodore Ts'o, Omar Sandoval, Sagi Grimberg, Dave Chinner,
	Kent Overstreet, Mike Snitzer, dm-devel, Alexander Viro,
	linux-fsdevel, linux-raid, David Sterba, linux-btrfs,
	Darrick J . Wong, linux-xfs, Gao Xiang, Christoph Hellwig,
	linux-ext4, Coly Li, linux-bcache, Boaz Harrosh, Bob Peterson,
	cluster-devel, Ulf Hansson, linux-mmc,
	'Linux Samsung SOC',
	Krzysztof Kozlowski, Adrian Hunter, Bartlomiej Zolnierkiewicz,
	linux-tegra


On 28/02/2019 07:51, Marek Szyprowski wrote:
> Hi Ming,
> 
> On 2019-02-28 00:29, Ming Lei wrote:
>> On Wed, Feb 27, 2019 at 08:47:09PM +0000, Jon Hunter wrote:
>>> On 21/02/2019 08:42, Marek Szyprowski wrote:
>>>> On 2019-02-15 12:13, Ming Lei wrote:
>>>>> This patch pulls the trigger for multi-page bvecs.
>>>>>
>>>>> Reviewed-by: Omar Sandoval <osandov@fb.com>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> Since Linux next-20190218 I've observed problems with block layer on one
>>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
>>>> this issue led me to this change. This is also the first linux-next
>>>> release with this change merged. The issue is fully reproducible and can
>>>> be observed in the following kernel log:
>>>>
>>>> sdhci: Secure Digital Host Controller Interface driver
>>>> sdhci: Copyright(c) Pierre Ossman
>>>> s3c-sdhci 12530000.sdhci: clock source 2: mmc_busclk.2 (100000000 Hz)
>>>> s3c-sdhci 12530000.sdhci: Got CD GPIO
>>>> mmc0: SDHCI controller on samsung-hsmmc [12530000.sdhci] using ADMA
>>>> mmc0: new high speed SDHC card at address aaaa
>>>> mmcblk0: mmc0:aaaa SL16G 14.8 GiB
>>> I have also noticed some failures when writing to an eMMC device on one
>>> of our Tegra boards. We have a simple eMMC write/read test and it is
>>> currently failing because the data written does not match the source.
>>>
>>> I did not seem the same crash as reported here, however, in our case the
>>> rootfs is NFS mounted and so probably would not. However, the bisect
>>> points to this commit and reverting on top of -next fixes the issues.
>> It is sdhci, probably related with max segment size, could you test the
>> following patch:
>>
>> https://marc.info/?l=linux-mmc&m=155128334122951&w=2
> 
> This seems to be fixing my issue too! Thanks!

Thanks, I can confirm this fixes the issue for Tegra. So feel free to
add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Cheers!
Jon

-- 
nvpublic


