* [RFC PATCH 0/7] evacuate struct page from the block layer
@ 2015-03-16 20:25 Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page Dan Williams
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Avoid the impending disaster of requiring struct page coverage for what
is expected to be ever-increasing capacities of persistent memory.  In
conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
recently concluded Linux Storage Summit it became clear that struct page
is not required in many places; it was simply convenient to re-use.

Introduce helpers and infrastructure to remove struct page usage where
it is not necessary.  One use case for these changes is to implement a
write-back cache in persistent memory for software RAID.  Another use
case for the scatterlist changes is RDMA to a pfn-range.

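As an illustration of the bio_vec accessor conversion done in patch 1, a
typical segment walk changes as sketched below.  This is an illustrative
sketch only, not a hunk from the series: the helper names come from
patch 1, while the function name is made up here.

static void example_zero_bio(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;

	bio_for_each_segment(bv, bio, iter) {
		/* was: void *kaddr = kmap_atomic(bv.bv_page); */
		void *kaddr = kmap_atomic(bvec_page(&bv));

		memset(kaddr + bv.bv_offset, 0, bv.bv_len);
		flush_dcache_page(bvec_page(&bv));
		kunmap_atomic(kaddr);
	}
}
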
This compiles and boots, but 0day kbuild robot coverage is needed before
this set exits "RFC".  Obviously, the coccinelle script needs to be
re-run on the block updates for kernel.next.  As is, this only includes
the resulting auto-generated patch against 4.0-rc3.

---

Dan Williams (6):
      block: add helpers for accessing a bio_vec page
      block: convert bio_vec.bv_page to bv_pfn
      dma-mapping: allow archs to optionally specify a ->map_pfn() operation
      scatterlist: use sg_phys()
      x86: support dma_map_pfn()
      block: base support for pfn i/o

Matthew Wilcox (1):
      scatterlist: support "page-less" (__pfn_t only) entries


 arch/Kconfig                                 |    3 +
 arch/arm/mm/dma-mapping.c                    |    2 -
 arch/microblaze/kernel/dma.c                 |    2 -
 arch/powerpc/sysdev/axonram.c                |    2 -
 arch/x86/Kconfig                             |   12 +++
 arch/x86/kernel/amd_gart_64.c                |   22 ++++--
 arch/x86/kernel/pci-nommu.c                  |   22 ++++--
 arch/x86/kernel/pci-swiotlb.c                |    4 +
 arch/x86/pci/sta2x11-fixup.c                 |    4 +
 arch/x86/xen/pci-swiotlb-xen.c               |    4 +
 block/bio-integrity.c                        |    8 +-
 block/bio.c                                  |   83 +++++++++++++++------
 block/blk-core.c                             |    9 ++
 block/blk-integrity.c                        |    7 +-
 block/blk-lib.c                              |    2 -
 block/blk-merge.c                            |   15 ++--
 block/bounce.c                               |   26 +++----
 drivers/block/aoe/aoecmd.c                   |    8 +-
 drivers/block/brd.c                          |    2 -
 drivers/block/drbd/drbd_bitmap.c             |    5 +
 drivers/block/drbd/drbd_main.c               |    4 +
 drivers/block/drbd/drbd_receiver.c           |    4 +
 drivers/block/drbd/drbd_worker.c             |    3 +
 drivers/block/floppy.c                       |    6 +-
 drivers/block/loop.c                         |    8 +-
 drivers/block/nbd.c                          |    8 +-
 drivers/block/nvme-core.c                    |    2 -
 drivers/block/pktcdvd.c                      |   11 ++-
 drivers/block/ps3disk.c                      |    2 -
 drivers/block/ps3vram.c                      |    2 -
 drivers/block/rbd.c                          |    2 -
 drivers/block/rsxx/dma.c                     |    3 +
 drivers/block/umem.c                         |    2 -
 drivers/block/zram/zram_drv.c                |   10 +--
 drivers/dma/ste_dma40.c                      |    5 -
 drivers/iommu/amd_iommu.c                    |   21 ++++-
 drivers/iommu/intel-iommu.c                  |   26 +++++--
 drivers/iommu/iommu.c                        |    2 -
 drivers/md/bcache/btree.c                    |    4 +
 drivers/md/bcache/debug.c                    |    6 +-
 drivers/md/bcache/movinggc.c                 |    2 -
 drivers/md/bcache/request.c                  |    6 +-
 drivers/md/bcache/super.c                    |   10 +--
 drivers/md/bcache/util.c                     |    5 +
 drivers/md/bcache/writeback.c                |    2 -
 drivers/md/dm-crypt.c                        |   12 ++-
 drivers/md/dm-io.c                           |    2 -
 drivers/md/dm-verity.c                       |    2 -
 drivers/md/raid1.c                           |   50 +++++++------
 drivers/md/raid10.c                          |   38 +++++-----
 drivers/md/raid5.c                           |    6 +-
 drivers/mmc/card/queue.c                     |    4 +
 drivers/s390/block/dasd_diag.c               |    2 -
 drivers/s390/block/dasd_eckd.c               |   14 ++--
 drivers/s390/block/dasd_fba.c                |    6 +-
 drivers/s390/block/dcssblk.c                 |    2 -
 drivers/s390/block/scm_blk.c                 |    2 -
 drivers/s390/block/scm_blk_cluster.c         |    2 -
 drivers/s390/block/xpram.c                   |    2 -
 drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
 drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
 drivers/scsi/sd_dif.c                        |    4 +
 drivers/staging/android/ion/ion_chunk_heap.c |    4 +
 drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
 drivers/xen/biomerge.c                       |    4 +
 drivers/xen/swiotlb-xen.c                    |   29 +++++--
 fs/btrfs/check-integrity.c                   |    6 +-
 fs/btrfs/compression.c                       |   12 ++-
 fs/btrfs/disk-io.c                           |    4 +
 fs/btrfs/extent_io.c                         |    8 +-
 fs/btrfs/file-item.c                         |    8 +-
 fs/btrfs/inode.c                             |   18 +++--
 fs/btrfs/raid56.c                            |    4 +
 fs/btrfs/volumes.c                           |    2 -
 fs/buffer.c                                  |    4 +
 fs/direct-io.c                               |    2 -
 fs/exofs/ore.c                               |    4 +
 fs/exofs/ore_raid.c                          |    2 -
 fs/ext4/page-io.c                            |    2 -
 fs/f2fs/data.c                               |    4 +
 fs/f2fs/segment.c                            |    2 -
 fs/gfs2/lops.c                               |    4 +
 fs/jfs/jfs_logmgr.c                          |    4 +
 fs/logfs/dev_bdev.c                          |   10 +--
 fs/mpage.c                                   |    2 -
 fs/splice.c                                  |    2 -
 include/asm-generic/dma-mapping-common.h     |   30 ++++++++
 include/asm-generic/memory_model.h           |    4 +
 include/asm-generic/scatterlist.h            |    6 ++
 include/crypto/scatterwalk.h                 |   10 +++
 include/linux/bio.h                          |   24 +++---
 include/linux/blk_types.h                    |   21 +++++
 include/linux/blkdev.h                       |    2 +
 include/linux/dma-debug.h                    |   23 +++++-
 include/linux/dma-mapping.h                  |    8 ++
 include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
 include/linux/swiotlb.h                      |    5 +
 kernel/power/block_io.c                      |    2 -
 lib/dma-debug.c                              |    4 +
 lib/swiotlb.c                                |   20 ++++-
 mm/iov_iter.c                                |   22 +++---
 mm/page_io.c                                 |    8 +-
 net/ceph/messenger.c                         |    2 -
 103 files changed, 658 insertions(+), 335 deletions(-)


* [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman, linux-fsdevel

In preparation for converting struct bio_vec to carry a pfn instead of
a struct page, introduce bvec_page() and bvec_set_page() helpers.  This
change is prompted by the desire to add in-kernel DMA support for
persistent memory, which lacks struct page coverage.

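For reference, a minimal sketch of the accessors this patch introduces
(the names are those used by the semantic patch below; per the diffstat
they are expected to land in include/linux/blk_types.h, and the in-tree
definitions may differ in detail):

static inline struct page *bvec_page(const struct bio_vec *bvec)
{
	return bvec->bv_page;
}

static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
	bvec->bv_page = page;
}
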
Generated with the following semantic patch:

// bv_page.cocci: convert usage of ->bv_page to use set/get helpers
// usage: make coccicheck COCCI=bv_page.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct bio_vec bvec;
expression E;
type T;
@@

- bvec.bv_page = (T)E
+ bvec_set_page(&bvec, E)

@@
struct bio_vec *bvec;
expression E;
type T;
@@

- bvec->bv_page = (T)E
+ bvec_set_page(bvec, E)

@@
struct bio_vec bvec;
type T;
@@

- (T)bvec.bv_page
+ bvec_page(&bvec)

@@
struct bio_vec *bvec;
type T;
@@

- (T)bvec->bv_page
+ bvec_page(bvec)

@@
struct bio *bio;
expression E;
expression F;
type T;
@@

- bio->bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&bio->bi_io_vec[F], E)

@@
struct bio *bio;
expression E;
type T;
@@

- bio->bi_io_vec->bv_page = (T)E
+ bvec_set_page(bio->bi_io_vec, E)

@@
struct cached_dev *dc;
expression E;
type T;
@@

- dc->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(dc->sb_bio.bi_io_vec, E)

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_io_vec[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_io_vec[F].bv_page
+ bvec_page(&ca->sb_bio.bi_io_vec[F])

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_inline_vecs[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page
+ bvec_page(&ca->sb_bio.bi_inline_vecs[F])


@@
struct cache *ca;
expression E;
type T;
@@

- ca->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(ca->sb_bio.bi_io_vec, E)

@@
struct bio *bio;
expression F;
@@

- bio->bi_io_vec[F].bv_page
+ bvec_page(&bio->bi_io_vec[F])

@@
struct bio bio;
expression F;
@@

- bio.bi_io_vec[F].bv_page
+ bvec_page(&bio.bi_io_vec[F])

@@
struct bio *bio;
@@

- bio->bi_io_vec->bv_page
+ bvec_page(bio->bi_io_vec)

@@
struct cached_dev *dc;
@@

- dc->sb_bio.bi_io_vec->bv_page
+ bvec_page(dc->sb_bio.bi_io_vec)


@@
struct bio bio;
@@

- bio.bi_io_vec->bv_page
+ bvec_page(bio.bi_io_vec)

@@
struct bio_integrity_payload *bip;
expression E;
type T;
@@

- bip->bip_vec->bv_page = (T)E
+ bvec_set_page(bip->bip_vec, E)

@@
struct bio_integrity_payload *bip;
@@

- bip->bip_vec->bv_page
+ bvec_page(bip->bip_vec)

@@
struct bio_integrity_payload bip;
@@

- bip.bip_vec->bv_page
+ bvec_page(bip.bip_vec)

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c               |    2 +
 block/bio-integrity.c                       |    8 ++--
 block/bio.c                                 |   40 +++++++++++-----------
 block/blk-core.c                            |    4 +-
 block/blk-integrity.c                       |    3 +-
 block/blk-lib.c                             |    2 +
 block/blk-merge.c                           |    7 ++--
 block/bounce.c                              |   24 ++++++-------
 drivers/block/aoe/aoecmd.c                  |    8 ++--
 drivers/block/brd.c                         |    2 +
 drivers/block/drbd/drbd_bitmap.c            |    5 ++-
 drivers/block/drbd/drbd_main.c              |    4 +-
 drivers/block/drbd/drbd_receiver.c          |    4 +-
 drivers/block/drbd/drbd_worker.c            |    3 +-
 drivers/block/floppy.c                      |    6 ++-
 drivers/block/loop.c                        |    8 ++--
 drivers/block/nbd.c                         |    8 ++--
 drivers/block/nvme-core.c                   |    2 +
 drivers/block/pktcdvd.c                     |   11 +++---
 drivers/block/ps3disk.c                     |    2 +
 drivers/block/ps3vram.c                     |    2 +
 drivers/block/rbd.c                         |    2 +
 drivers/block/rsxx/dma.c                    |    3 +-
 drivers/block/umem.c                        |    2 +
 drivers/block/zram/zram_drv.c               |   10 +++--
 drivers/md/bcache/btree.c                   |    2 +
 drivers/md/bcache/debug.c                   |    6 ++-
 drivers/md/bcache/movinggc.c                |    2 +
 drivers/md/bcache/request.c                 |    6 ++-
 drivers/md/bcache/super.c                   |   10 +++--
 drivers/md/bcache/util.c                    |    5 +--
 drivers/md/bcache/writeback.c               |    2 +
 drivers/md/dm-crypt.c                       |   12 +++---
 drivers/md/dm-io.c                          |    2 +
 drivers/md/dm-verity.c                      |    2 +
 drivers/md/raid1.c                          |   50 ++++++++++++++-------------
 drivers/md/raid10.c                         |   38 ++++++++++-----------
 drivers/md/raid5.c                          |    6 ++-
 drivers/s390/block/dasd_diag.c              |    2 +
 drivers/s390/block/dasd_eckd.c              |   14 ++++----
 drivers/s390/block/dasd_fba.c               |    6 ++-
 drivers/s390/block/dcssblk.c                |    2 +
 drivers/s390/block/scm_blk.c                |    2 +
 drivers/s390/block/scm_blk_cluster.c        |    2 +
 drivers/s390/block/xpram.c                  |    2 +
 drivers/scsi/mpt2sas/mpt2sas_transport.c    |    6 ++-
 drivers/scsi/mpt3sas/mpt3sas_transport.c    |    6 ++-
 drivers/scsi/sd_dif.c                       |    4 +-
 drivers/staging/lustre/lustre/llite/lloop.c |    2 +
 drivers/xen/biomerge.c                      |    4 +-
 fs/btrfs/check-integrity.c                  |    6 ++-
 fs/btrfs/compression.c                      |   12 +++---
 fs/btrfs/disk-io.c                          |    4 +-
 fs/btrfs/extent_io.c                        |    8 ++--
 fs/btrfs/file-item.c                        |    8 ++--
 fs/btrfs/inode.c                            |   18 ++++++----
 fs/btrfs/raid56.c                           |    4 +-
 fs/btrfs/volumes.c                          |    2 +
 fs/buffer.c                                 |    4 +-
 fs/direct-io.c                              |    2 +
 fs/exofs/ore.c                              |    4 +-
 fs/exofs/ore_raid.c                         |    2 +
 fs/ext4/page-io.c                           |    2 +
 fs/f2fs/data.c                              |    4 +-
 fs/f2fs/segment.c                           |    2 +
 fs/gfs2/lops.c                              |    4 +-
 fs/jfs/jfs_logmgr.c                         |    4 +-
 fs/logfs/dev_bdev.c                         |   10 +++--
 fs/mpage.c                                  |    2 +
 fs/splice.c                                 |    2 +
 include/linux/bio.h                         |    4 +-
 include/linux/blk_types.h                   |   10 +++++
 kernel/power/block_io.c                     |    2 +
 mm/page_io.c                                |    6 ++-
 net/ceph/messenger.c                        |    2 +
 75 files changed, 258 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ee90db17b097..9bb5da7f2c0c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -123,7 +123,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
 			return;
 		}
 
-		user_mem = page_address(vec.bv_page) + vec.bv_offset;
+		user_mem = page_address(bvec_page(&vec)) + vec.bv_offset;
 		if (bio_data_dir(bio) == READ)
 			memcpy(user_mem, (void *) phys_mem, vec.bv_len);
 		else
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9ea61d..3add34cba048 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -101,7 +101,7 @@ void bio_integrity_free(struct bio *bio)
 	struct bio_set *bs = bio->bi_pool;
 
 	if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
-		kfree(page_address(bip->bip_vec->bv_page) +
+		kfree(page_address(bvec_page(bip->bip_vec)) +
 		      bip->bip_vec->bv_offset);
 
 	if (bs) {
@@ -140,7 +140,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 
 	iv = bip->bip_vec + bip->bip_vcnt;
 
-	iv->bv_page = page;
+	bvec_set_page(iv, page);
 	iv->bv_len = len;
 	iv->bv_offset = offset;
 	bip->bip_vcnt++;
@@ -220,7 +220,7 @@ static int bio_integrity_process(struct bio *bio,
 	struct bio_vec bv;
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	unsigned int ret = 0;
-	void *prot_buf = page_address(bip->bip_vec->bv_page) +
+	void *prot_buf = page_address(bvec_page(bip->bip_vec)) +
 		bip->bip_vec->bv_offset;
 
 	iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
@@ -229,7 +229,7 @@ static int bio_integrity_process(struct bio *bio,
 	iter.prot_buf = prot_buf;
 
 	bio_for_each_segment(bv, bio, bviter) {
-		void *kaddr = kmap_atomic(bv.bv_page);
+		void *kaddr = kmap_atomic(bvec_page(&bv));
 
 		iter.data_buf = kaddr + bv.bv_offset;
 		iter.data_size = bv.bv_len;
diff --git a/block/bio.c b/block/bio.c
index f66a4eae16ee..7100fd6d5898 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -508,7 +508,7 @@ void zero_fill_bio(struct bio *bio)
 	bio_for_each_segment(bv, bio, iter) {
 		char *data = bvec_kmap_irq(&bv, &flags);
 		memset(data, 0, bv.bv_len);
-		flush_dcache_page(bv.bv_page);
+		flush_dcache_page(bvec_page(&bv));
 		bvec_kunmap_irq(data, &flags);
 	}
 }
@@ -723,7 +723,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (page == prev->bv_page &&
+		if (page == bvec_page(prev) &&
 		    offset == prev->bv_offset + prev->bv_len) {
 			unsigned int prev_bv_len = prev->bv_len;
 			prev->bv_len += len;
@@ -768,7 +768,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	 * cannot add the page
 	 */
 	bvec = &bio->bi_io_vec[bio->bi_vcnt];
-	bvec->bv_page = page;
+	bvec_set_page(bvec, page);
 	bvec->bv_len = len;
 	bvec->bv_offset = offset;
 	bio->bi_vcnt++;
@@ -818,7 +818,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	return len;
 
  failed:
-	bvec->bv_page = NULL;
+	bvec_set_page(bvec, NULL);
 	bvec->bv_len = 0;
 	bvec->bv_offset = 0;
 	bio->bi_vcnt--;
@@ -948,10 +948,10 @@ int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
 	struct bio_vec *bv;
 
 	bio_for_each_segment_all(bv, bio, i) {
-		bv->bv_page = alloc_page(gfp_mask);
-		if (!bv->bv_page) {
+		bvec_set_page(bv, alloc_page(gfp_mask));
+		if (!bvec_page(bv)) {
 			while (--bv >= bio->bi_io_vec)
-				__free_page(bv->bv_page);
+				__free_page(bvec_page(bv));
 			return -ENOMEM;
 		}
 	}
@@ -1004,8 +1004,8 @@ void bio_copy_data(struct bio *dst, struct bio *src)
 
 		bytes = min(src_bv.bv_len, dst_bv.bv_len);
 
-		src_p = kmap_atomic(src_bv.bv_page);
-		dst_p = kmap_atomic(dst_bv.bv_page);
+		src_p = kmap_atomic(bvec_page(&src_bv));
+		dst_p = kmap_atomic(bvec_page(&dst_bv));
 
 		memcpy(dst_p + dst_bv.bv_offset,
 		       src_p + src_bv.bv_offset,
@@ -1052,7 +1052,7 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter iter)
 	bio_for_each_segment_all(bvec, bio, i) {
 		ssize_t ret;
 
-		ret = copy_page_from_iter(bvec->bv_page,
+		ret = copy_page_from_iter(bvec_page(bvec),
 					  bvec->bv_offset,
 					  bvec->bv_len,
 					  &iter);
@@ -1083,7 +1083,7 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 	bio_for_each_segment_all(bvec, bio, i) {
 		ssize_t ret;
 
-		ret = copy_page_to_iter(bvec->bv_page,
+		ret = copy_page_to_iter(bvec_page(bvec),
 					bvec->bv_offset,
 					bvec->bv_len,
 					&iter);
@@ -1104,7 +1104,7 @@ static void bio_free_pages(struct bio *bio)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i)
-		__free_page(bvec->bv_page);
+		__free_page(bvec_page(bvec));
 }
 
 /**
@@ -1406,9 +1406,9 @@ static void __bio_unmap_user(struct bio *bio)
 	 */
 	bio_for_each_segment_all(bvec, bio, i) {
 		if (bio_data_dir(bio) == READ)
-			set_page_dirty_lock(bvec->bv_page);
+			set_page_dirty_lock(bvec_page(bvec));
 
-		page_cache_release(bvec->bv_page);
+		page_cache_release(bvec_page(bvec));
 	}
 
 	bio_put(bio);
@@ -1499,7 +1499,7 @@ static void bio_copy_kern_endio_read(struct bio *bio, int err)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
+		memcpy(p, page_address(bvec_page(bvec)), bvec->bv_len);
 		p += bvec->bv_len;
 	}
 
@@ -1611,7 +1611,7 @@ void bio_set_pages_dirty(struct bio *bio)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		if (page && !PageCompound(page))
 			set_page_dirty_lock(page);
@@ -1624,7 +1624,7 @@ static void bio_release_pages(struct bio *bio)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		if (page)
 			put_page(page);
@@ -1678,11 +1678,11 @@ void bio_check_pages_dirty(struct bio *bio)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		if (PageDirty(page) || PageCompound(page)) {
 			page_cache_release(page);
-			bvec->bv_page = NULL;
+			bvec_set_page(bvec, NULL);
 		} else {
 			nr_clean_pages++;
 		}
@@ -1736,7 +1736,7 @@ void bio_flush_dcache_pages(struct bio *bi)
 	struct bvec_iter iter;
 
 	bio_for_each_segment(bvec, bi, iter)
-		flush_dcache_page(bvec.bv_page);
+		flush_dcache_page(bvec_page(&bvec));
 }
 EXPORT_SYMBOL(bio_flush_dcache_pages);
 #endif
diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7f01cf..7830ce00cbf5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1429,7 +1429,7 @@ void blk_add_request_payload(struct request *rq, struct page *page,
 {
 	struct bio *bio = rq->bio;
 
-	bio->bi_io_vec->bv_page = page;
+	bvec_set_page(bio->bi_io_vec, page);
 	bio->bi_io_vec->bv_offset = 0;
 	bio->bi_io_vec->bv_len = len;
 
@@ -2855,7 +2855,7 @@ void rq_flush_dcache_pages(struct request *rq)
 	struct bio_vec bvec;
 
 	rq_for_each_segment(bvec, rq, iter)
-		flush_dcache_page(bvec.bv_page);
+		flush_dcache_page(bvec_page(&bvec));
 }
 EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
 #endif
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 79ffb4855af0..6c8b1d63e90b 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -117,7 +117,8 @@ new_segment:
 				sg = sg_next(sg);
 			}
 
-			sg_set_page(sg, iv.bv_page, iv.bv_len, iv.bv_offset);
+			sg_set_page(sg, bvec_page(&iv), iv.bv_len,
+				    iv.bv_offset);
 			segments++;
 		}
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3f5d72..7931a09f86d6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -187,7 +187,7 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		bio->bi_bdev = bdev;
 		bio->bi_private = &bb;
 		bio->bi_vcnt = 1;
-		bio->bi_io_vec->bv_page = page;
+		bvec_set_page(bio->bi_io_vec, page);
 		bio->bi_io_vec->bv_offset = 0;
 		bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fc1ff3b1ea1f..39bd9925c057 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -51,7 +51,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			 * never considered part of another segment, since
 			 * that might change with the bounce page.
 			 */
-			high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
+			high = page_to_pfn(bvec_page(&bv)) > queue_bounce_pfn(q);
 			if (!high && !highprv && cluster) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
@@ -192,7 +192,7 @@ new_segment:
 			*sg = sg_next(*sg);
 		}
 
-		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+		sg_set_page(*sg, bvec_page(bvec), nbytes, bvec->bv_offset);
 		(*nsegs)++;
 	}
 	*bvprv = *bvec;
@@ -228,7 +228,8 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 single_segment:
 		*sg = sglist;
 		bvec = bio_iovec(bio);
-		sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+		sg_set_page(*sg, bvec_page(&bvec), bvec.bv_len,
+			    bvec.bv_offset);
 		return 1;
 	}
 
diff --git a/block/bounce.c b/block/bounce.c
index ab21ba203d5c..0390e44d6e1b 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -55,7 +55,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 	unsigned char *vto;
 
 	local_irq_save(flags);
-	vto = kmap_atomic(to->bv_page);
+	vto = kmap_atomic(bvec_page(to));
 	memcpy(vto + to->bv_offset, vfrom, to->bv_len);
 	kunmap_atomic(vto);
 	local_irq_restore(flags);
@@ -105,17 +105,17 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
 	struct bvec_iter iter;
 
 	bio_for_each_segment(tovec, to, iter) {
-		if (tovec.bv_page != fromvec->bv_page) {
+		if (bvec_page(&tovec) != bvec_page(fromvec)) {
 			/*
 			 * fromvec->bv_offset and fromvec->bv_len might have
 			 * been modified by the block layer, so use the original
 			 * copy, bounce_copy_vec already uses tovec->bv_len
 			 */
-			vfrom = page_address(fromvec->bv_page) +
+			vfrom = page_address(bvec_page(fromvec)) +
 				tovec.bv_offset;
 
 			bounce_copy_vec(&tovec, vfrom);
-			flush_dcache_page(tovec.bv_page);
+			flush_dcache_page(bvec_page(&tovec));
 		}
 
 		fromvec++;
@@ -136,11 +136,11 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
 	 */
 	bio_for_each_segment_all(bvec, bio, i) {
 		org_vec = bio_orig->bi_io_vec + i;
-		if (bvec->bv_page == org_vec->bv_page)
+		if (bvec_page(bvec) == bvec_page(org_vec))
 			continue;
 
-		dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
-		mempool_free(bvec->bv_page, pool);
+		dec_zone_page_state(bvec_page(bvec), NR_BOUNCE);
+		mempool_free(bvec_page(bvec), pool);
 	}
 
 	bio_endio(bio_orig, err);
@@ -208,7 +208,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	if (force)
 		goto bounce;
 	bio_for_each_segment(from, *bio_orig, iter)
-		if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
+		if (page_to_pfn(bvec_page(&from)) > queue_bounce_pfn(q))
 			goto bounce;
 
 	return;
@@ -216,20 +216,20 @@ bounce:
 	bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
 
 	bio_for_each_segment_all(to, bio, i) {
-		struct page *page = to->bv_page;
+		struct page *page = bvec_page(to);
 
 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
 			continue;
 
-		inc_zone_page_state(to->bv_page, NR_BOUNCE);
-		to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+		inc_zone_page_state(bvec_page(to), NR_BOUNCE);
+		bvec_set_page(to, mempool_alloc(pool, q->bounce_gfp));
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
 
 			flush_dcache_page(page);
 
-			vto = page_address(to->bv_page) + to->bv_offset;
+			vto = page_address(bvec_page(to)) + to->bv_offset;
 			vfrom = kmap_atomic(page) + to->bv_offset;
 			memcpy(vto, vfrom, to->bv_len);
 			kunmap_atomic(vfrom);
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 422b7d84f686..f0cbfe8c4bd8 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -300,7 +300,7 @@ skb_fillup(struct sk_buff *skb, struct bio *bio, struct bvec_iter iter)
 	struct bio_vec bv;
 
 	__bio_for_each_segment(bv, bio, iter, iter)
-		skb_fill_page_desc(skb, frag++, bv.bv_page,
+		skb_fill_page_desc(skb, frag++, bvec_page(&bv),
 				   bv.bv_offset, bv.bv_len);
 }
 
@@ -874,7 +874,7 @@ bio_pageinc(struct bio *bio)
 		/* Non-zero page count for non-head members of
 		 * compound pages is no longer allowed by the kernel.
 		 */
-		page = compound_head(bv.bv_page);
+		page = compound_head(bvec_page(&bv));
 		atomic_inc(&page->_count);
 	}
 }
@@ -887,7 +887,7 @@ bio_pagedec(struct bio *bio)
 	struct bvec_iter iter;
 
 	bio_for_each_segment(bv, bio, iter) {
-		page = compound_head(bv.bv_page);
+		page = compound_head(bvec_page(&bv));
 		atomic_dec(&page->_count);
 	}
 }
@@ -1092,7 +1092,7 @@ bvcpy(struct sk_buff *skb, struct bio *bio, struct bvec_iter iter, long cnt)
 	iter.bi_size = cnt;
 
 	__bio_for_each_segment(bv, bio, iter, iter) {
-		char *p = page_address(bv.bv_page) + bv.bv_offset;
+		char *p = page_address(bvec_page(&bv)) + bv.bv_offset;
 		skb_copy_bits(skb, soff, p, bv.bv_len);
 		soff += bv.bv_len;
 	}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 64ab4951e9d6..115c6cf9cb43 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -349,7 +349,7 @@ static void brd_make_request(struct request_queue *q, struct bio *bio)
 
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
-		err = brd_do_bvec(brd, bvec.bv_page, len,
+		err = brd_do_bvec(brd, bvec_page(&bvec), len,
 					bvec.bv_offset, rw, sector);
 		if (err)
 			break;
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 434c77dcc99e..37ba0f533e4b 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -946,7 +946,7 @@ static void drbd_bm_endio(struct bio *bio, int error)
 	struct drbd_bm_aio_ctx *ctx = bio->bi_private;
 	struct drbd_device *device = ctx->device;
 	struct drbd_bitmap *b = device->bitmap;
-	unsigned int idx = bm_page_to_idx(bio->bi_io_vec[0].bv_page);
+	unsigned int idx = bm_page_to_idx(bvec_page(&bio->bi_io_vec[0]));
 	int uptodate = bio_flagged(bio, BIO_UPTODATE);
 
 
@@ -979,7 +979,8 @@ static void drbd_bm_endio(struct bio *bio, int error)
 	bm_page_unlock_io(device, idx);
 
 	if (ctx->flags & BM_AIO_COPY_PAGES)
-		mempool_free(bio->bi_io_vec[0].bv_page, drbd_md_io_page_pool);
+		mempool_free(bvec_page(&bio->bi_io_vec[0]),
+			     drbd_md_io_page_pool);
 
 	bio_put(bio);
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 1fc83427199c..3df534a88572 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1554,7 +1554,7 @@ static int _drbd_send_bio(struct drbd_peer_device *peer_device, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		int err;
 
-		err = _drbd_no_send_page(peer_device, bvec.bv_page,
+		err = _drbd_no_send_page(peer_device, bvec_page(&bvec),
 					 bvec.bv_offset, bvec.bv_len,
 					 bio_iter_last(bvec, iter)
 					 ? 0 : MSG_MORE);
@@ -1573,7 +1573,7 @@ static int _drbd_send_zc_bio(struct drbd_peer_device *peer_device, struct bio *b
 	bio_for_each_segment(bvec, bio, iter) {
 		int err;
 
-		err = _drbd_send_page(peer_device, bvec.bv_page,
+		err = _drbd_send_page(peer_device, bvec_page(&bvec),
 				      bvec.bv_offset, bvec.bv_len,
 				      bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
 		if (err)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index cee20354ac37..b4f16c6a0d73 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1729,10 +1729,10 @@ static int recv_dless_read(struct drbd_peer_device *peer_device, struct drbd_req
 	D_ASSERT(peer_device->device, sector == bio->bi_iter.bi_sector);
 
 	bio_for_each_segment(bvec, bio, iter) {
-		void *mapped = kmap(bvec.bv_page) + bvec.bv_offset;
+		void *mapped = kmap(bvec_page(&bvec)) + bvec.bv_offset;
 		expect = min_t(int, data_size, bvec.bv_len);
 		err = drbd_recv_all_warn(peer_device->connection, mapped, expect);
-		kunmap(bvec.bv_page);
+		kunmap(bvec_page(&bvec));
 		if (err)
 			return err;
 		data_size -= expect;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index d0fae55d871d..d633c13d8ebe 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -332,7 +332,8 @@ void drbd_csum_bio(struct crypto_hash *tfm, struct bio *bio, void *digest)
 	crypto_hash_init(&desc);
 
 	bio_for_each_segment(bvec, bio, iter) {
-		sg_set_page(&sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+		sg_set_page(&sg, bvec_page(&bvec), bvec.bv_len,
+			    bvec.bv_offset);
 		crypto_hash_update(&desc, &sg, sg.length);
 	}
 	crypto_hash_final(&desc, digest);
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index a08cda955285..6eae02e31731 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2374,7 +2374,7 @@ static int buffer_chain_size(void)
 	size = 0;
 
 	rq_for_each_segment(bv, current_req, iter) {
-		if (page_address(bv.bv_page) + bv.bv_offset != base + size)
+		if (page_address(bvec_page(&bv)) + bv.bv_offset != base + size)
 			break;
 
 		size += bv.bv_len;
@@ -2444,7 +2444,7 @@ static void copy_buffer(int ssize, int max_sector, int max_sector_2)
 		size = bv.bv_len;
 		SUPBOUND(size, remaining);
 
-		buffer = page_address(bv.bv_page) + bv.bv_offset;
+		buffer = page_address(bvec_page(&bv)) + bv.bv_offset;
 		if (dma_buffer + size >
 		    floppy_track_buffer + (max_buffer_sectors << 10) ||
 		    dma_buffer < floppy_track_buffer) {
@@ -3805,7 +3805,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
 
 	bio_init(&bio);
 	bio.bi_io_vec = &bio_vec;
-	bio_vec.bv_page = page;
+	bvec_set_page(&bio_vec, page);
 	bio_vec.bv_len = size;
 	bio_vec.bv_offset = 0;
 	bio.bi_vcnt = 1;
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d1f168b73634..2c3dd26bafdb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -256,9 +256,9 @@ static int do_lo_send_direct_write(struct loop_device *lo,
 		struct bio_vec *bvec, loff_t pos, struct page *page)
 {
 	ssize_t bw = __do_lo_send_write(lo->lo_backing_file,
-			kmap(bvec->bv_page) + bvec->bv_offset,
+			kmap(bvec_page(bvec)) + bvec->bv_offset,
 			bvec->bv_len, pos);
-	kunmap(bvec->bv_page);
+	kunmap(bvec_page(bvec));
 	cond_resched();
 	return bw;
 }
@@ -273,7 +273,7 @@ static int do_lo_send_direct_write(struct loop_device *lo,
 static int do_lo_send_write(struct loop_device *lo, struct bio_vec *bvec,
 		loff_t pos, struct page *page)
 {
-	int ret = lo_do_transfer(lo, WRITE, page, 0, bvec->bv_page,
+	int ret = lo_do_transfer(lo, WRITE, page, 0, bvec_page(bvec),
 			bvec->bv_offset, bvec->bv_len, pos >> 9);
 	if (likely(!ret))
 		return __do_lo_send_write(lo->lo_backing_file,
@@ -376,7 +376,7 @@ do_lo_receive(struct loop_device *lo,
 	ssize_t retval;
 
 	cookie.lo = lo;
-	cookie.page = bvec->bv_page;
+	cookie.page = bvec_page(bvec);
 	cookie.offset = bvec->bv_offset;
 	cookie.bsize = bsize;
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4bc2a5cb9935..fb89faa00e48 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -229,10 +229,10 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
 		int flags)
 {
 	int result;
-	void *kaddr = kmap(bvec->bv_page);
+	void *kaddr = kmap(bvec_page(bvec));
 	result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
 			   bvec->bv_len, flags);
-	kunmap(bvec->bv_page);
+	kunmap(bvec_page(bvec));
 	return result;
 }
 
@@ -323,10 +323,10 @@ out:
 static inline int sock_recv_bvec(struct nbd_device *nbd, struct bio_vec *bvec)
 {
 	int result;
-	void *kaddr = kmap(bvec->bv_page);
+	void *kaddr = kmap(bvec_page(bvec));
 	result = sock_xmit(nbd, 0, kaddr + bvec->bv_offset, bvec->bv_len,
 			MSG_WAITALL);
-	kunmap(bvec->bv_page);
+	kunmap(bvec_page(bvec));
 	return result;
 }
 
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ceb32dd52a6c..39068a196cc9 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -521,7 +521,7 @@ static void nvme_dif_remap(struct request *req,
 	if (!bip)
 		return;
 
-	pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_offset;
+	pmap = kmap_atomic(bvec_page(bip->bip_vec)) + bip->bip_vec->bv_offset;
 	if (!pmap)
 		return;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628dafd9d..c873290bd8bb 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -958,12 +958,12 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
 	p = 0;
 	offs = 0;
 	for (f = 0; f < pkt->frames; f++) {
-		if (bvec[f].bv_page != pkt->pages[p]) {
-			void *vfrom = kmap_atomic(bvec[f].bv_page) + bvec[f].bv_offset;
+		if (bvec_page(&bvec[f]) != pkt->pages[p]) {
+			void *vfrom = kmap_atomic(bvec_page(&bvec[f])) + bvec[f].bv_offset;
 			void *vto = page_address(pkt->pages[p]) + offs;
 			memcpy(vto, vfrom, CD_FRAMESIZE);
 			kunmap_atomic(vfrom);
-			bvec[f].bv_page = pkt->pages[p];
+			bvec_set_page(&bvec[f], pkt->pages[p]);
 			bvec[f].bv_offset = offs;
 		} else {
 			BUG_ON(bvec[f].bv_offset != offs);
@@ -1307,9 +1307,10 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 
 	/* XXX: locking? */
 	for (f = 0; f < pkt->frames; f++) {
-		bvec[f].bv_page = pkt->pages[(f * CD_FRAMESIZE) / PAGE_SIZE];
+		bvec_set_page(&bvec[f],
+			      pkt->pages[(f * CD_FRAMESIZE) / PAGE_SIZE]);
 		bvec[f].bv_offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
-		if (!bio_add_page(pkt->w_bio, bvec[f].bv_page, CD_FRAMESIZE, bvec[f].bv_offset))
+		if (!bio_add_page(pkt->w_bio, bvec_page(&bvec[f]), CD_FRAMESIZE, bvec[f].bv_offset))
 			BUG();
 	}
 	pkt_dbg(2, pd, "vcnt=%d\n", pkt->w_bio->bi_vcnt);
diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index c120d70d3fb3..07ad0d9d9480 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -112,7 +112,7 @@ static void ps3disk_scatter_gather(struct ps3_storage_device *dev,
 		else
 			memcpy(buf, dev->bounce_buf+offset, size);
 		offset += size;
-		flush_kernel_dcache_page(bvec.bv_page);
+		flush_kernel_dcache_page(bvec_page(&bvec));
 		bvec_kunmap_irq(buf, &flags);
 		i++;
 	}
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb98fd2..5db3311c2865 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -561,7 +561,7 @@ static struct bio *ps3vram_do_bio(struct ps3_system_bus_device *dev,
 
 	bio_for_each_segment(bvec, bio, iter) {
 		/* PS3 is ppc64, so we don't handle highmem */
-		char *ptr = page_address(bvec.bv_page) + bvec.bv_offset;
+		char *ptr = page_address(bvec_page(&bvec)) + bvec.bv_offset;
 		size_t len = bvec.bv_len, retlen;
 
 		dev_dbg(&dev->core, "    %s %zu bytes at offset %llu\n", op,
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b40af3203089..812c1ffd7742 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1257,7 +1257,7 @@ static void zero_bio_chain(struct bio *chain, int start_ofs)
 				buf = bvec_kmap_irq(&bv, &flags);
 				memset(buf + remainder, 0,
 				       bv.bv_len - remainder);
-				flush_dcache_page(bv.bv_page);
+				flush_dcache_page(bvec_page(&bv));
 				bvec_kunmap_irq(buf, &flags);
 			}
 			pos += bv.bv_len;
diff --git a/drivers/block/rsxx/dma.c b/drivers/block/rsxx/dma.c
index cf8cd293abb5..262ffdd017ae 100644
--- a/drivers/block/rsxx/dma.c
+++ b/drivers/block/rsxx/dma.c
@@ -737,7 +737,8 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
 				st = rsxx_queue_dma(card, &dma_list[tgt],
 							bio_data_dir(bio),
 							dma_off, dma_len,
-							laddr, bvec.bv_page,
+							laddr,
+						    bvec_page(&bvec),
 							bv_off, cb, cb_data);
 				if (st)
 					goto bvec_err;
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5bf0f7..c7f65e4ec874 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -366,7 +366,7 @@ static int add_bio(struct cardinfo *card)
 	vec = bio_iter_iovec(bio, card->current_iter);
 
 	dma_handle = pci_map_page(card->dev,
-				  vec.bv_page,
+				  bvec_page(&vec),
 				  vec.bv_offset,
 				  vec.bv_len,
 				  (rw == READ) ?
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 871bd3550cb0..584fb16d8ca3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -394,7 +394,7 @@ static int page_zero_filled(void *ptr)
 
 static void handle_zero_page(struct bio_vec *bvec)
 {
-	struct page *page = bvec->bv_page;
+	struct page *page = bvec_page(bvec);
 	void *user_mem;
 
 	user_mem = kmap_atomic(page);
@@ -482,7 +482,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
 	struct page *page;
 	unsigned char *user_mem, *uncmem = NULL;
 	struct zram_meta *meta = zram->meta;
-	page = bvec->bv_page;
+	page = bvec_page(bvec);
 
 	bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
 	if (unlikely(!meta->table[index].handle) ||
@@ -553,7 +553,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	bool locked = false;
 	unsigned long alloced_pages;
 
-	page = bvec->bv_page;
+	page = bvec_page(bvec);
 	if (is_partial_io(bvec)) {
 		/*
 		 * This is a partial IO. We need to read the full page
@@ -903,7 +903,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 			 */
 			struct bio_vec bv;
 
-			bv.bv_page = bvec.bv_page;
+			bvec_set_page(&bv, bvec_page(&bvec));
 			bv.bv_len = max_transfer_size;
 			bv.bv_offset = bvec.bv_offset;
 
@@ -990,7 +990,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	index = sector >> SECTORS_PER_PAGE_SHIFT;
 	offset = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT;
 
-	bv.bv_page = page;
+	bvec_set_page(&bv, page);
 	bv.bv_len = PAGE_SIZE;
 	bv.bv_offset = 0;
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 00cde40db572..2e76e8b62902 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -366,7 +366,7 @@ static void btree_node_write_done(struct closure *cl)
 	int n;
 
 	bio_for_each_segment_all(bv, b->bio, n)
-		__free_page(bv->bv_page);
+		__free_page(bvec_page(bv));
 
 	__btree_node_write_done(cl);
 }
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 8b1f1d5c1819..c355a02b94dd 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -120,8 +120,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	submit_bio_wait(READ_SYNC, check);
 
 	bio_for_each_segment(bv, bio, iter) {
-		void *p1 = kmap_atomic(bv.bv_page);
-		void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
+		void *p1 = kmap_atomic(bvec_page(&bv));
+		void *p2 = page_address(bvec_page(&check->bi_io_vec[iter.bi_idx]));
 
 		cache_set_err_on(memcmp(p1 + bv.bv_offset,
 					p2 + bv.bv_offset,
@@ -135,7 +135,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	}
 
 	bio_for_each_segment_all(bv2, check, i)
-		__free_page(bv2->bv_page);
+		__free_page(bvec_page(bv2));
 out_put:
 	bio_put(check);
 }
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index cd7490311e51..744e7af4b160 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -48,7 +48,7 @@ static void write_moving_finish(struct closure *cl)
 	int i;
 
 	bio_for_each_segment_all(bv, bio, i)
-		__free_page(bv->bv_page);
+		__free_page(bvec_page(bv));
 
 	if (io->op.replace_collision)
 		trace_bcache_gc_copy_collision(&io->w->key);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab43faddb447..e6378a998618 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -42,9 +42,9 @@ static void bio_csum(struct bio *bio, struct bkey *k)
 	uint64_t csum = 0;
 
 	bio_for_each_segment(bv, bio, iter) {
-		void *d = kmap(bv.bv_page) + bv.bv_offset;
+		void *d = kmap(bvec_page(&bv)) + bv.bv_offset;
 		csum = bch_crc64_update(csum, d, bv.bv_len);
-		kunmap(bv.bv_page);
+		kunmap(bvec_page(&bv));
 	}
 
 	k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1);
@@ -690,7 +690,7 @@ static void cached_dev_cache_miss_done(struct closure *cl)
 		struct bio_vec *bv;
 
 		bio_for_each_segment_all(bv, s->iop.bio, i)
-			__free_page(bv->bv_page);
+			__free_page(bvec_page(bv));
 	}
 
 	cached_dev_bio_complete(cl);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7167f0..8d7cbba7ff7e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -231,7 +231,7 @@ static void write_bdev_super_endio(struct bio *bio, int error)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+	struct cache_sb *out = page_address(bvec_page(&bio->bi_io_vec[0]));
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
@@ -1172,7 +1172,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	bio_init(&dc->sb_bio);
 	dc->sb_bio.bi_max_vecs	= 1;
 	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
-	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
+	bvec_set_page(dc->sb_bio.bi_io_vec, sb_page);
 	get_page(sb_page);
 
 	if (cached_dev_init(dc, sb->block_size << 9))
@@ -1811,8 +1811,8 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
-	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+	if (bvec_page(&ca->sb_bio.bi_inline_vecs[0]))
+		put_page(bvec_page(&ca->sb_bio.bi_io_vec[0]));
 
 	if (!IS_ERR_OR_NULL(ca->bdev))
 		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1870,7 +1870,7 @@ static void register_cache(struct cache_sb *sb, struct page *sb_page,
 	bio_init(&ca->sb_bio);
 	ca->sb_bio.bi_max_vecs	= 1;
 	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
-	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
+	bvec_set_page(&ca->sb_bio.bi_io_vec[0], sb_page);
 	get_page(sb_page);
 
 	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index db3ae4c2b223..d02f6d626529 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -238,9 +238,8 @@ void bch_bio_map(struct bio *bio, void *base)
 start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
 					size);
 		if (base) {
-			bv->bv_page = is_vmalloc_addr(base)
-				? vmalloc_to_page(base)
-				: virt_to_page(base);
+			bvec_set_page(bv,
+				      is_vmalloc_addr(base) ? vmalloc_to_page(base) : virt_to_page(base));
 
 			base += bv->bv_len;
 		}
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bcd1bf0..6e9901c5dd66 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -133,7 +133,7 @@ static void write_dirty_finish(struct closure *cl)
 	int i;
 
 	bio_for_each_segment_all(bv, &io->bio, i)
-		__free_page(bv->bv_page);
+		__free_page(bvec_page(bv));
 
 	/* This is kind of a dumb way of signalling errors. */
 	if (KEY_DIRTY(&w->key)) {
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 713a96237a80..9990a3c966f3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -849,11 +849,11 @@ static int crypt_convert_block(struct crypt_config *cc,
 	dmreq->iv_sector = ctx->cc_sector;
 	dmreq->ctx = ctx;
 	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
+	sg_set_page(&dmreq->sg_in, bvec_page(&bv_in), 1 << SECTOR_SHIFT,
 		    bv_in.bv_offset);
 
 	sg_init_table(&dmreq->sg_out, 1);
-	sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
+	sg_set_page(&dmreq->sg_out, bvec_page(&bv_out), 1 << SECTOR_SHIFT,
 		    bv_out.bv_offset);
 
 	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
@@ -1003,7 +1003,7 @@ retry:
 		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
 
 		bvec = &clone->bi_io_vec[clone->bi_vcnt++];
-		bvec->bv_page = page;
+		bvec_set_page(bvec, page);
 		bvec->bv_len = len;
 		bvec->bv_offset = 0;
 
@@ -1025,9 +1025,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 	struct bio_vec *bv;
 
 	bio_for_each_segment_all(bv, clone, i) {
-		BUG_ON(!bv->bv_page);
-		mempool_free(bv->bv_page, cc->page_pool);
-		bv->bv_page = NULL;
+		BUG_ON(!bvec_page(bv));
+		mempool_free(bvec_page(bv), cc->page_pool);
+		bvec_set_page(bv, NULL);
 	}
 }
 
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 37de0173b6d2..0bf4bed512fe 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -204,7 +204,7 @@ static void bio_get_page(struct dpages *dp, struct page **p,
 			 unsigned long *len, unsigned *offset)
 {
 	struct bio_vec *bvec = dp->context_ptr;
-	*p = bvec->bv_page;
+	*p = bvec_page(bvec);
 	*len = bvec->bv_len - dp->context_u;
 	*offset = bvec->bv_offset + dp->context_u;
 }
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 7a7bab8947ae..5b5c8684f61f 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -336,7 +336,7 @@ test_block_hash:
 			unsigned len;
 			struct bio_vec bv = bio_iter_iovec(bio, io->iter);
 
-			page = kmap_atomic(bv.bv_page);
+			page = kmap_atomic(bvec_page(&bv));
 			len = bv.bv_len;
 			if (likely(len >= todo))
 				len = todo;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238afa54..79db06bd32ae 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -134,8 +134,8 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 	if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
 		for (i=0; i<RESYNC_PAGES ; i++)
 			for (j=1; j<pi->raid_disks; j++)
-				r1_bio->bios[j]->bi_io_vec[i].bv_page =
-					r1_bio->bios[0]->bi_io_vec[i].bv_page;
+				bvec_set_page(&r1_bio->bios[j]->bi_io_vec[i],
+					      bvec_page(&r1_bio->bios[0]->bi_io_vec[i]));
 	}
 
 	r1_bio->master_bio = NULL;
@@ -147,7 +147,7 @@ out_free_pages:
 		struct bio_vec *bv;
 
 		bio_for_each_segment_all(bv, r1_bio->bios[j], i)
-			__free_page(bv->bv_page);
+			__free_page(bvec_page(bv));
 	}
 
 out_free_bio:
@@ -166,9 +166,9 @@ static void r1buf_pool_free(void *__r1_bio, void *data)
 	for (i = 0; i < RESYNC_PAGES; i++)
 		for (j = pi->raid_disks; j-- ;) {
 			if (j == 0 ||
-			    r1bio->bios[j]->bi_io_vec[i].bv_page !=
-			    r1bio->bios[0]->bi_io_vec[i].bv_page)
-				safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
+			    bvec_page(&r1bio->bios[j]->bi_io_vec[i]) !=
+			    bvec_page(&r1bio->bios[0]->bi_io_vec[i]))
+				safe_put_page(bvec_page(&r1bio->bios[j]->bi_io_vec[i]));
 		}
 	for (i=0 ; i < pi->raid_disks; i++)
 		bio_put(r1bio->bios[i]);
@@ -369,7 +369,7 @@ static void close_write(struct r1bio *r1_bio)
 		/* free extra copy of the data pages */
 		int i = r1_bio->behind_page_count;
 		while (i--)
-			safe_put_page(r1_bio->behind_bvecs[i].bv_page);
+			safe_put_page(bvec_page(&r1_bio->behind_bvecs[i]));
 		kfree(r1_bio->behind_bvecs);
 		r1_bio->behind_bvecs = NULL;
 	}
@@ -1004,13 +1004,13 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		bvecs[i] = *bvec;
-		bvecs[i].bv_page = alloc_page(GFP_NOIO);
-		if (unlikely(!bvecs[i].bv_page))
+		bvec_set_page(&bvecs[i], alloc_page(GFP_NOIO));
+		if (unlikely(!bvec_page(&bvecs[i])))
 			goto do_sync_io;
-		memcpy(kmap(bvecs[i].bv_page) + bvec->bv_offset,
-		       kmap(bvec->bv_page) + bvec->bv_offset, bvec->bv_len);
-		kunmap(bvecs[i].bv_page);
-		kunmap(bvec->bv_page);
+		memcpy(kmap(bvec_page(&bvecs[i])) + bvec->bv_offset,
+		       kmap(bvec_page(bvec)) + bvec->bv_offset, bvec->bv_len);
+		kunmap(bvec_page(&bvecs[i]));
+		kunmap(bvec_page(bvec));
 	}
 	r1_bio->behind_bvecs = bvecs;
 	r1_bio->behind_page_count = bio->bi_vcnt;
@@ -1019,8 +1019,8 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
 
 do_sync_io:
 	for (i = 0; i < bio->bi_vcnt; i++)
-		if (bvecs[i].bv_page)
-			put_page(bvecs[i].bv_page);
+		if (bvec_page(&bvecs[i]))
+			put_page(bvec_page(&bvecs[i]));
 	kfree(bvecs);
 	pr_debug("%dB behind alloc failed, doing sync I/O\n",
 		 bio->bi_iter.bi_size);
@@ -1386,7 +1386,8 @@ read_again:
 			 * We trimmed the bio, so _all is legit
 			 */
 			bio_for_each_segment_all(bvec, mbio, j)
-				bvec->bv_page = r1_bio->behind_bvecs[j].bv_page;
+				bvec_set_page(bvec,
+					      bvec_page(&r1_bio->behind_bvecs[j]));
 			if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
 				atomic_inc(&r1_bio->behind_remaining);
 		}
@@ -1849,7 +1850,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 				 */
 				rdev = conf->mirrors[d].rdev;
 				if (sync_page_io(rdev, sect, s<<9,
-						 bio->bi_io_vec[idx].bv_page,
+						 bvec_page(&bio->bi_io_vec[idx]),
 						 READ, false)) {
 					success = 1;
 					break;
@@ -1905,7 +1906,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 				continue;
 			rdev = conf->mirrors[d].rdev;
 			if (r1_sync_page_io(rdev, sect, s,
-					    bio->bi_io_vec[idx].bv_page,
+					    bvec_page(&bio->bi_io_vec[idx]),
 					    WRITE) == 0) {
 				r1_bio->bios[d]->bi_end_io = NULL;
 				rdev_dec_pending(rdev, mddev);
@@ -1920,7 +1921,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 				continue;
 			rdev = conf->mirrors[d].rdev;
 			if (r1_sync_page_io(rdev, sect, s,
-					    bio->bi_io_vec[idx].bv_page,
+					    bvec_page(&bio->bi_io_vec[idx]),
 					    READ) != 0)
 				atomic_add(s, &rdev->corrected_errors);
 		}
@@ -2004,8 +2005,8 @@ static void process_checks(struct r1bio *r1_bio)
 		if (uptodate) {
 			for (j = vcnt; j-- ; ) {
 				struct page *p, *s;
-				p = pbio->bi_io_vec[j].bv_page;
-				s = sbio->bi_io_vec[j].bv_page;
+				p = bvec_page(&pbio->bi_io_vec[j]);
+				s = bvec_page(&sbio->bi_io_vec[j]);
 				if (memcmp(page_address(p),
 					   page_address(s),
 					   sbio->bi_io_vec[j].bv_len))
@@ -2214,7 +2215,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
 			unsigned vcnt = r1_bio->behind_page_count;
 			struct bio_vec *vec = r1_bio->behind_bvecs;
 
-			while (!vec->bv_page) {
+			while (!bvec_page(vec)) {
 				vec++;
 				vcnt--;
 			}
@@ -2695,10 +2696,11 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
 		for (i = 0 ; i < conf->raid_disks * 2; i++) {
 			bio = r1_bio->bios[i];
 			if (bio->bi_end_io) {
-				page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+				page = bvec_page(&bio->bi_io_vec[bio->bi_vcnt]);
 				if (bio_add_page(bio, page, len, 0) == 0) {
 					/* stop here */
-					bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
+					bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt],
+						      page);
 					while (i > 0) {
 						i--;
 						bio = r1_bio->bios[i];
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c49d15d..da9033e3bc64 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -181,16 +181,16 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 				/* we can share bv_page's during recovery
 				 * and reshape */
 				struct bio *rbio = r10_bio->devs[0].bio;
-				page = rbio->bi_io_vec[i].bv_page;
+				page = bvec_page(&rbio->bi_io_vec[i]);
 				get_page(page);
 			} else
 				page = alloc_page(gfp_flags);
 			if (unlikely(!page))
 				goto out_free_pages;
 
-			bio->bi_io_vec[i].bv_page = page;
+			bvec_set_page(&bio->bi_io_vec[i], page);
 			if (rbio)
-				rbio->bi_io_vec[i].bv_page = page;
+				bvec_set_page(&rbio->bi_io_vec[i], page);
 		}
 	}
 
@@ -198,10 +198,10 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 
 out_free_pages:
 	for ( ; i > 0 ; i--)
-		safe_put_page(bio->bi_io_vec[i-1].bv_page);
+		safe_put_page(bvec_page(&bio->bi_io_vec[i - 1]));
 	while (j--)
 		for (i = 0; i < RESYNC_PAGES ; i++)
-			safe_put_page(r10_bio->devs[j].bio->bi_io_vec[i].bv_page);
+			safe_put_page(bvec_page(&r10_bio->devs[j].bio->bi_io_vec[i]));
 	j = 0;
 out_free_bio:
 	for ( ; j < nalloc; j++) {
@@ -225,8 +225,8 @@ static void r10buf_pool_free(void *__r10_bio, void *data)
 		struct bio *bio = r10bio->devs[j].bio;
 		if (bio) {
 			for (i = 0; i < RESYNC_PAGES; i++) {
-				safe_put_page(bio->bi_io_vec[i].bv_page);
-				bio->bi_io_vec[i].bv_page = NULL;
+				safe_put_page(bvec_page(&bio->bi_io_vec[i]));
+				bvec_set_page(&bio->bi_io_vec[i], NULL);
 			}
 			bio_put(bio);
 		}
@@ -2074,8 +2074,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 				int len = PAGE_SIZE;
 				if (sectors < (len / 512))
 					len = sectors * 512;
-				if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
-					   page_address(tbio->bi_io_vec[j].bv_page),
+				if (memcmp(page_address(bvec_page(&fbio->bi_io_vec[j])),
+					   page_address(bvec_page(&tbio->bi_io_vec[j])),
 					   len))
 					break;
 				sectors -= len/512;
@@ -2104,8 +2104,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			tbio->bi_io_vec[j].bv_offset = 0;
 			tbio->bi_io_vec[j].bv_len = PAGE_SIZE;
 
-			memcpy(page_address(tbio->bi_io_vec[j].bv_page),
-			       page_address(fbio->bi_io_vec[j].bv_page),
+			memcpy(page_address(bvec_page(&tbio->bi_io_vec[j])),
+			       page_address(bvec_page(&fbio->bi_io_vec[j])),
 			       PAGE_SIZE);
 		}
 		tbio->bi_end_io = end_sync_write;
@@ -2132,8 +2132,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 		if (r10_bio->devs[i].bio->bi_end_io != end_sync_write
 		    && r10_bio->devs[i].bio != fbio)
 			for (j = 0; j < vcnt; j++)
-				memcpy(page_address(tbio->bi_io_vec[j].bv_page),
-				       page_address(fbio->bi_io_vec[j].bv_page),
+				memcpy(page_address(bvec_page(&tbio->bi_io_vec[j])),
+				       page_address(bvec_page(&fbio->bi_io_vec[j])),
 				       PAGE_SIZE);
 		d = r10_bio->devs[i].devnum;
 		atomic_inc(&r10_bio->remaining);
@@ -2191,7 +2191,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
 		ok = sync_page_io(rdev,
 				  addr,
 				  s << 9,
-				  bio->bi_io_vec[idx].bv_page,
+				  bvec_page(&bio->bi_io_vec[idx]),
 				  READ, false);
 		if (ok) {
 			rdev = conf->mirrors[dw].rdev;
@@ -2199,7 +2199,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
 			ok = sync_page_io(rdev,
 					  addr,
 					  s << 9,
-					  bio->bi_io_vec[idx].bv_page,
+					  bvec_page(&bio->bi_io_vec[idx]),
 					  WRITE, false);
 			if (!ok) {
 				set_bit(WriteErrorSeen, &rdev->flags);
@@ -3361,12 +3361,12 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 			break;
 		for (bio= biolist ; bio ; bio=bio->bi_next) {
 			struct bio *bio2;
-			page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+			page = bvec_page(&bio->bi_io_vec[bio->bi_vcnt]);
 			if (bio_add_page(bio, page, len, 0))
 				continue;
 
 			/* stop here */
-			bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
+			bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page);
 			for (bio2 = biolist;
 			     bio2 && bio2 != bio;
 			     bio2 = bio2->bi_next) {
@@ -4436,7 +4436,7 @@ read_more:
 
 	nr_sectors = 0;
 	for (s = 0 ; s < max_sectors; s += PAGE_SIZE >> 9) {
-		struct page *page = r10_bio->devs[0].bio->bi_io_vec[s/(PAGE_SIZE>>9)].bv_page;
+		struct page *page = bvec_page(&r10_bio->devs[0].bio->bi_io_vec[s / (PAGE_SIZE >> 9)]);
 		int len = (max_sectors - s) << 9;
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
@@ -4593,7 +4593,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			success = sync_page_io(rdev,
 					       addr,
 					       s << 9,
-					       bvec[idx].bv_page,
+					       bvec_page(&bvec[idx]),
 					       READ, false);
 			if (success)
 				break;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b2c572..0f450a166cd1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -864,7 +864,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
 				WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
-			sh->dev[i].vec.bv_page = sh->dev[i].page;
+			bvec_set_page(&sh->dev[i].vec, sh->dev[i].page);
 			bi->bi_vcnt = 1;
 			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
 			bi->bi_io_vec[0].bv_offset = 0;
@@ -911,7 +911,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 						  + rrdev->data_offset);
 			if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
 				WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
-			sh->dev[i].rvec.bv_page = sh->dev[i].page;
+			bvec_set_page(&sh->dev[i].rvec, sh->dev[i].page);
 			rbi->bi_vcnt = 1;
 			rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
 			rbi->bi_io_vec[0].bv_offset = 0;
@@ -978,7 +978,7 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 
 		if (clen > 0) {
 			b_offset += bvl.bv_offset;
-			bio_page = bvl.bv_page;
+			bio_page = bvec_page(&bvl);
 			if (frombio) {
 				if (sh->raid_conf->skip_copy &&
 				    b_offset == 0 && page_offset == 0 &&
diff --git a/drivers/s390/block/dasd_diag.c b/drivers/s390/block/dasd_diag.c
index c062f1620c58..89f39d00077d 100644
--- a/drivers/s390/block/dasd_diag.c
+++ b/drivers/s390/block/dasd_diag.c
@@ -545,7 +545,7 @@ static struct dasd_ccw_req *dasd_diag_build_cp(struct dasd_device *memdev,
 	dbio = dreq->bio;
 	recid = first_rec;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			memset(dbio, 0, sizeof (struct dasd_diag_bio));
 			dbio->type = rw_cmd;
diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c
index d47f5b99623a..cd78c3483192 100644
--- a/drivers/s390/block/dasd_eckd.c
+++ b/drivers/s390/block/dasd_eckd.c
@@ -2616,7 +2616,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_single(
 			return ERR_PTR(-EINVAL);
 		count += bv.bv_len >> (block->s2b_shift + 9);
 #if defined(CONFIG_64BIT)
-		if (idal_is_needed (page_address(bv.bv_page), bv.bv_len))
+		if (idal_is_needed (page_address(bvec_page(&bv)), bv.bv_len))
 			cidaw += bv.bv_len >> (block->s2b_shift + 9);
 #endif
 	}
@@ -2688,7 +2688,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_single(
 			      last_rec - recid + 1, cmd, basedev, blksize);
 	}
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		if (dasd_page_cache) {
 			char *copy = kmem_cache_alloc(dasd_page_cache,
 						      GFP_DMA | __GFP_NOWARN);
@@ -2851,7 +2851,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_track(
 	idaw_dst = NULL;
 	idaw_len = 0;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		seg_len = bv.bv_len;
 		while (seg_len) {
 			if (new_track) {
@@ -3163,7 +3163,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
 		new_track = 1;
 		recid = first_rec;
 		rq_for_each_segment(bv, req, iter) {
-			dst = page_address(bv.bv_page) + bv.bv_offset;
+			dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 			seg_len = bv.bv_len;
 			while (seg_len) {
 				if (new_track) {
@@ -3196,7 +3196,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
 		}
 	} else {
 		rq_for_each_segment(bv, req, iter) {
-			dst = page_address(bv.bv_page) + bv.bv_offset;
+			dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 			last_tidaw = itcw_add_tidaw(itcw, 0x00,
 						    dst, bv.bv_len);
 			if (IS_ERR(last_tidaw)) {
@@ -3416,7 +3416,7 @@ static struct dasd_ccw_req *dasd_raw_build_cp(struct dasd_device *startdev,
 			idaws = idal_create_words(idaws, rawpadpage, PAGE_SIZE);
 	}
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		seg_len = bv.bv_len;
 		if (cmd == DASD_ECKD_CCW_READ_TRACK)
 			memset(dst, 0, seg_len);
@@ -3480,7 +3480,7 @@ dasd_eckd_free_cp(struct dasd_ccw_req *cqr, struct request *req)
 	if (private->uses_cdl == 0 || recid > 2*blk_per_trk)
 		ccw++;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			/* Skip locate record. */
 			if (private->uses_cdl && recid <= 2*blk_per_trk)
diff --git a/drivers/s390/block/dasd_fba.c b/drivers/s390/block/dasd_fba.c
index 2c8e68bf9a1c..e405051d829b 100644
--- a/drivers/s390/block/dasd_fba.c
+++ b/drivers/s390/block/dasd_fba.c
@@ -288,7 +288,7 @@ static struct dasd_ccw_req *dasd_fba_build_cp(struct dasd_device * memdev,
 			return ERR_PTR(-EINVAL);
 		count += bv.bv_len >> (block->s2b_shift + 9);
 #if defined(CONFIG_64BIT)
-		if (idal_is_needed (page_address(bv.bv_page), bv.bv_len))
+		if (idal_is_needed (page_address(bvec_page(&bv)), bv.bv_len))
 			cidaw += bv.bv_len / blksize;
 #endif
 	}
@@ -326,7 +326,7 @@ static struct dasd_ccw_req *dasd_fba_build_cp(struct dasd_device * memdev,
 	}
 	recid = first_rec;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		if (dasd_page_cache) {
 			char *copy = kmem_cache_alloc(dasd_page_cache,
 						      GFP_DMA | __GFP_NOWARN);
@@ -399,7 +399,7 @@ dasd_fba_free_cp(struct dasd_ccw_req *cqr, struct request *req)
 	if (private->rdc_data.mode.bits.data_chain != 0)
 		ccw++;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = page_address(bvec_page(&bv)) + bv.bv_offset;
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			/* Skip locate record. */
 			if (private->rdc_data.mode.bits.data_chain == 0)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 96128cb009f3..13eee10fedc8 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -857,7 +857,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	index = (bio->bi_iter.bi_sector >> 3);
 	bio_for_each_segment(bvec, bio, iter) {
 		page_addr = (unsigned long)
-			page_address(bvec.bv_page) + bvec.bv_offset;
+			page_address(bvec_page(&bvec)) + bvec.bv_offset;
 		source_addr = dev_info->start + (index<<12) + bytes_done;
 		if (unlikely((page_addr & 4095) != 0) || (bvec.bv_len & 4095) != 0)
 			// More paranoia.
diff --git a/drivers/s390/block/scm_blk.c b/drivers/s390/block/scm_blk.c
index 75d9896deccb..9bf2d42c1946 100644
--- a/drivers/s390/block/scm_blk.c
+++ b/drivers/s390/block/scm_blk.c
@@ -203,7 +203,7 @@ static int scm_request_prepare(struct scm_request *scmrq)
 	rq_for_each_segment(bv, req, iter) {
 		WARN_ON(bv.bv_offset);
 		msb->blk_count += bv.bv_len >> 12;
-		aidaw->data_addr = (u64) page_address(bv.bv_page);
+		aidaw->data_addr = (u64) page_address(bvec_page(&bv));
 		aidaw++;
 	}
 
diff --git a/drivers/s390/block/scm_blk_cluster.c b/drivers/s390/block/scm_blk_cluster.c
index 09db45296eed..a302a79e1cb3 100644
--- a/drivers/s390/block/scm_blk_cluster.c
+++ b/drivers/s390/block/scm_blk_cluster.c
@@ -181,7 +181,7 @@ static int scm_prepare_cluster_request(struct scm_request *scmrq)
 			i++;
 		}
 		rq_for_each_segment(bv, req, iter) {
-			aidaw->data_addr = (u64) page_address(bv.bv_page);
+			aidaw->data_addr = (u64) page_address(bvec_page(&bv));
 			aidaw++;
 			i++;
 		}
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e9397ac31..44e80e13b643 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -202,7 +202,7 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
 	index = (bio->bi_iter.bi_sector >> 3) + xdev->offset;
 	bio_for_each_segment(bvec, bio, iter) {
 		page_addr = (unsigned long)
-			kmap(bvec.bv_page) + bvec.bv_offset;
+			kmap(bvec_page(&bvec)) + bvec.bv_offset;
 		bytes = bvec.bv_len;
 		if ((page_addr & 4095) != 0 || (bytes & 4095) != 0)
 			/* More paranoia. */
diff --git a/drivers/scsi/mpt2sas/mpt2sas_transport.c b/drivers/scsi/mpt2sas/mpt2sas_transport.c
index ff2500ab9ba4..788de1c250a3 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_transport.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_transport.c
@@ -1956,7 +1956,7 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 
 		bio_for_each_segment(bvec, req->bio, iter) {
 			memcpy(pci_addr_out + offset,
-			    page_address(bvec.bv_page) + bvec.bv_offset,
+			    page_address(bvec_page(&bvec)) + bvec.bv_offset,
 			    bvec.bv_len);
 			offset += bvec.bv_len;
 		}
@@ -2107,12 +2107,12 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 			    le16_to_cpu(mpi_reply->ResponseDataLength);
 			bio_for_each_segment(bvec, rsp->bio, iter) {
 				if (bytes_to_copy <= bvec.bv_len) {
-					memcpy(page_address(bvec.bv_page) +
+					memcpy(page_address(bvec_page(&bvec)) +
 					    bvec.bv_offset, pci_addr_in +
 					    offset, bytes_to_copy);
 					break;
 				} else {
-					memcpy(page_address(bvec.bv_page) +
+					memcpy(page_address(bvec_page(&bvec)) +
 					    bvec.bv_offset, pci_addr_in +
 					    offset, bvec.bv_len);
 					bytes_to_copy -= bvec.bv_len;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_transport.c b/drivers/scsi/mpt3sas/mpt3sas_transport.c
index efb98afc46e0..f187a1a05b9b 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_transport.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_transport.c
@@ -1939,7 +1939,7 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 
 		bio_for_each_segment(bvec, req->bio, iter) {
 			memcpy(pci_addr_out + offset,
-			    page_address(bvec.bv_page) + bvec.bv_offset,
+			    page_address(bvec_page(&bvec)) + bvec.bv_offset,
 			    bvec.bv_len);
 			offset += bvec.bv_len;
 		}
@@ -2068,12 +2068,12 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 			    le16_to_cpu(mpi_reply->ResponseDataLength);
 			bio_for_each_segment(bvec, rsp->bio, iter) {
 				if (bytes_to_copy <= bvec.bv_len) {
-					memcpy(page_address(bvec.bv_page) +
+					memcpy(page_address(bvec_page(&bvec)) +
 					    bvec.bv_offset, pci_addr_in +
 					    offset, bytes_to_copy);
 					break;
 				} else {
-					memcpy(page_address(bvec.bv_page) +
+					memcpy(page_address(bvec_page(&bvec)) +
 					    bvec.bv_offset, pci_addr_in +
 					    offset, bvec.bv_len);
 					bytes_to_copy -= bvec.bv_len;
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 14c7d42a11c2..0534050c9cba 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -134,7 +134,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
 		virt = bip_get_seed(bip) & 0xffffffff;
 
 		bip_for_each_vec(iv, bip, iter) {
-			pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+			pi = kmap_atomic(bvec_page(&iv)) + iv.bv_offset;
 
 			for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {
 
@@ -181,7 +181,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int good_bytes)
 		virt = bip_get_seed(bip) & 0xffffffff;
 
 		bip_for_each_vec(iv, bip, iter) {
-			pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+			pi = kmap_atomic(bvec_page(&iv)) + iv.bv_offset;
 
 			for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {
 
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 031248840642..2a68e4ab93ca 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -222,7 +222,7 @@ static int do_bio_lustrebacked(struct lloop_device *lo, struct bio *head)
 			BUG_ON(bvec.bv_offset != 0);
 			BUG_ON(bvec.bv_len != PAGE_CACHE_SIZE);
 
-			pages[page_count] = bvec.bv_page;
+			pages[page_count] = bvec_page(&bvec);
 			offsets[page_count] = offset;
 			page_count++;
 			offset += bvec.bv_len;
diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
index 0edb91c0de6b..7fcdcb2265f1 100644
--- a/drivers/xen/biomerge.c
+++ b/drivers/xen/biomerge.c
@@ -6,8 +6,8 @@
 bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
 			       const struct bio_vec *vec2)
 {
-	unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
-	unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));
+	unsigned long mfn1 = pfn_to_mfn(page_to_pfn(bvec_page(vec1)));
+	unsigned long mfn2 = pfn_to_mfn(page_to_pfn(bvec_page(vec2)));
 
 	return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
 		((mfn1 == mfn2) || ((mfn1+1) == mfn2));
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d897ef803b3b..8297d6792ac0 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2997,11 +2997,11 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
 		cur_bytenr = dev_bytenr;
 		for (i = 0; i < bio->bi_vcnt; i++) {
 			BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_CACHE_SIZE);
-			mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page);
+			mapped_datav[i] = kmap(bvec_page(&bio->bi_io_vec[i]));
 			if (!mapped_datav[i]) {
 				while (i > 0) {
 					i--;
-					kunmap(bio->bi_io_vec[i].bv_page);
+					kunmap(bvec_page(&bio->bi_io_vec[i]));
 				}
 				kfree(mapped_datav);
 				goto leave;
@@ -3020,7 +3020,7 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
 					      NULL, rw);
 		while (i > 0) {
 			i--;
-			kunmap(bio->bi_io_vec[i].bv_page);
+			kunmap(bvec_page(&bio->bi_io_vec[i]));
 		}
 		kfree(mapped_datav);
 	} else if (NULL != dev_state && (rw & REQ_FLUSH)) {
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e9df8862012c..f654bb5554dc 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -208,7 +208,7 @@ csum_failed:
 		 * checked so the end_io handlers know about it
 		 */
 		bio_for_each_segment_all(bvec, cb->orig_bio, i)
-			SetPageChecked(bvec->bv_page);
+			SetPageChecked(bvec_page(bvec));
 
 		bio_endio(cb->orig_bio, 0);
 	}
@@ -459,7 +459,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 	u64 end;
 	int misses = 0;
 
-	page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
+	page = bvec_page(&cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1]);
 	last_offset = (page_offset(page) + PAGE_CACHE_SIZE);
 	em_tree = &BTRFS_I(inode)->extent_tree;
 	tree = &BTRFS_I(inode)->io_tree;
@@ -592,7 +592,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	/* we need the actual starting offset of this extent in the file */
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree,
-				   page_offset(bio->bi_io_vec->bv_page),
+				   page_offset(bvec_page(bio->bi_io_vec)),
 				   PAGE_CACHE_SIZE);
 	read_unlock(&em_tree->lock);
 	if (!em)
@@ -986,7 +986,7 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
 	unsigned long working_bytes = total_out - buf_start;
 	unsigned long bytes;
 	char *kaddr;
-	struct page *page_out = bvec[*pg_index].bv_page;
+	struct page *page_out = bvec_page(&bvec[*pg_index]);
 
 	/*
 	 * start byte is the first byte of the page we're currently
@@ -1031,7 +1031,7 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
 			if (*pg_index >= vcnt)
 				return 0;
 
-			page_out = bvec[*pg_index].bv_page;
+			page_out = bvec_page(&bvec[*pg_index]);
 			*pg_offset = 0;
 			start_byte = page_offset(page_out) - disk_start;
 
@@ -1071,7 +1071,7 @@ void btrfs_clear_biovec_end(struct bio_vec *bvec, int vcnt,
 				   unsigned long pg_offset)
 {
 	while (pg_index < vcnt) {
-		struct page *page = bvec[pg_index].bv_page;
+		struct page *page = bvec_page(&bvec[pg_index]);
 		unsigned long off = bvec[pg_index].bv_offset;
 		unsigned long len = bvec[pg_index].bv_len;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f79f38542a73..6b600278f7da 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -881,8 +881,8 @@ static int btree_csum_one_bio(struct bio *bio)
 	int i, ret = 0;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		root = BTRFS_I(bvec->bv_page->mapping->host)->root;
-		ret = csum_dirty_buffer(root, bvec->bv_page);
+		root = BTRFS_I(bvec_page(bvec)->mapping->host)->root;
+		ret = csum_dirty_buffer(root, bvec_page(bvec));
 		if (ret)
 			break;
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c7233ff1d533..76186ab4c5ae 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2489,7 +2489,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		/* We always issue full-page reads, but if some block
 		 * in a page fails to read, blk_update_request() will
@@ -2563,7 +2563,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		uptodate = 0;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 		struct inode *inode = page->mapping->host;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
@@ -2751,7 +2751,7 @@ static int __must_check submit_one_bio(int rw, struct bio *bio,
 {
 	int ret = 0;
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
-	struct page *page = bvec->bv_page;
+	struct page *page = bvec_page(bvec);
 	struct extent_io_tree *tree = bio->bi_private;
 	u64 start;
 
@@ -3700,7 +3700,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 	int i, done;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 84a2d1868271..f7e2c497b0a2 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -222,7 +222,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 		offset = logical_offset;
 	while (bio_index < bio->bi_vcnt) {
 		if (!dio)
-			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+			offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;
 		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
 					       (u32 *)csum, nblocks);
 		if (count)
@@ -448,7 +448,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 	if (contig)
 		offset = file_start;
 	else
-		offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+		offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;
 
 	ordered = btrfs_lookup_ordered_extent(inode, offset);
 	BUG_ON(!ordered); /* Logic error */
@@ -457,7 +457,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 
 	while (bio_index < bio->bi_vcnt) {
 		if (!contig)
-			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+			offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;
 
 		if (offset >= ordered->file_offset + ordered->len ||
 		    offset < ordered->file_offset) {
@@ -480,7 +480,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 			index = 0;
 		}
 
-		data = kmap_atomic(bvec->bv_page);
+		data = kmap_atomic(bvec_page(bvec));
 		sums->sums[index] = ~(u32)0;
 		sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
 						    sums->sums[index],
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index da828cf5e8f8..6efe9fd5c07d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7504,7 +7504,8 @@ static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
 
 	done->uptodate = 1;
 	bio_for_each_segment_all(bvec, bio, i)
-		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
+		clean_io_failure(done->inode, done->start, bvec_page(bvec),
+				 0);
 end:
 	complete(&done->done);
 	bio_put(bio);
@@ -7528,7 +7529,8 @@ try_again:
 		done.start = start;
 		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+		ret = dio_read_error(inode, &io_bio->bio, bvec_page(bvec),
+				     start,
 				     start + bvec->bv_len - 1,
 				     io_bio->mirror_num,
 				     btrfs_retry_endio_nocsum, &done);
@@ -7563,11 +7565,11 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 	uptodate = 1;
 	bio_for_each_segment_all(bvec, bio, i) {
 		ret = __readpage_endio_check(done->inode, io_bio, i,
-					     bvec->bv_page, 0,
+					     bvec_page(bvec), 0,
 					     done->start, bvec->bv_len);
 		if (!ret)
 			clean_io_failure(done->inode, done->start,
-					 bvec->bv_page, 0);
+					 bvec_page(bvec), 0);
 		else
 			uptodate = 0;
 	}
@@ -7593,7 +7595,8 @@ static int __btrfs_subio_endio_read(struct inode *inode,
 	done.inode = inode;
 
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
+		ret = __readpage_endio_check(inode, io_bio, i,
+					     bvec_page(bvec),
 					     0, start, bvec->bv_len);
 		if (likely(!ret))
 			goto next;
@@ -7602,7 +7605,8 @@ try_again:
 		done.start = start;
 		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+		ret = dio_read_error(inode, &io_bio->bio, bvec_page(bvec),
+				     start,
 				     start + bvec->bv_len - 1,
 				     io_bio->mirror_num,
 				     btrfs_retry_endio, &done);
@@ -7898,7 +7902,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
 		if (map_length < submit_len + bvec->bv_len ||
-		    bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+		    bio_add_page(bio, bvec_page(bvec), bvec->bv_len,
 				 bvec->bv_offset) < bvec->bv_len) {
 			/*
 			 * inc the count before we submit the bio so
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5264858ed768..4c81b2297435 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1154,7 +1154,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 		page_index = stripe_offset >> PAGE_CACHE_SHIFT;
 
 		for (i = 0; i < bio->bi_vcnt; i++) {
-			p = bio->bi_io_vec[i].bv_page;
+			p = bvec_page(&bio->bi_io_vec[i]);
 			rbio->bio_pages[page_index + i] = p;
 		}
 	}
@@ -1435,7 +1435,7 @@ static void set_bio_pages_uptodate(struct bio *bio)
 	struct page *p;
 
 	for (i = 0; i < bio->bi_vcnt; i++) {
-		p = bio->bi_io_vec[i].bv_page;
+		p = bvec_page(&bio->bi_io_vec[i]);
 		SetPageUptodate(p);
 	}
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f74147..a9e8ea439792 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5793,7 +5793,7 @@ again:
 		return -ENOMEM;
 
 	while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
-		if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+		if (bio_add_page(bio, bvec_page(bvec), bvec->bv_len,
 				 bvec->bv_offset) < bvec->bv_len) {
 			u64 len = bio->bi_iter.bi_size;
 
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db2c987..43a785533ac6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2992,7 +2992,7 @@ void guard_bio_eod(int rw, struct bio *bio)
 
 	/* ..and clear the end of the buffer for reads */
 	if ((rw & RW_MASK) == READ) {
-		zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+		zero_user(bvec_page(bvec), bvec->bv_offset + bvec->bv_len,
 				truncated_bytes);
 	}
 }
@@ -3022,7 +3022,7 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
-	bio->bi_io_vec[0].bv_page = bh->b_page;
+	bvec_set_page(&bio->bi_io_vec[0], bh->b_page);
 	bio->bi_io_vec[0].bv_len = bh->b_size;
 	bio->bi_io_vec[0].bv_offset = bh_offset(bh);
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b2e297..5b90390481b0 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -467,7 +467,7 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
 		bio_for_each_segment_all(bvec, bio, i) {
-			struct page *page = bvec->bv_page;
+			struct page *page = bvec_page(bvec);
 
 			if (dio->rw == READ && !PageCompound(page))
 				set_page_dirty_lock(page);
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 7bd8ac8dfb28..4bd44bfed847 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -411,9 +411,9 @@ static void _clear_bio(struct bio *bio)
 		unsigned this_count = bv->bv_len;
 
 		if (likely(PAGE_SIZE == this_count))
-			clear_highpage(bv->bv_page);
+			clear_highpage(bvec_page(bv));
 		else
-			zero_user(bv->bv_page, bv->bv_offset, this_count);
+			zero_user(bvec_page(bv), bv->bv_offset, this_count);
 	}
 }
 
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 27cbdb697649..da76728824e6 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -438,7 +438,7 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 			continue;
 
 		bio_for_each_segment_all(bv, bio, i) {
-			struct page *page = bv->bv_page;
+			struct page *page = bvec_page(bv);
 
 			SetPageUptodate(page);
 			if (PageError(page))
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index b24a2541a9ba..289564c6aeec 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -68,7 +68,7 @@ static void ext4_finish_bio(struct bio *bio)
 	struct bio_vec *bvec;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 		struct buffer_head *bh, *head;
 		unsigned bio_start = bvec->bv_offset;
 		unsigned bio_end = bio_start + bvec->bv_len;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 985ed023a750..b8e2d8a1a0d3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -31,7 +31,7 @@ static void f2fs_read_end_io(struct bio *bio, int err)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		if (!err) {
 			SetPageUptodate(page);
@@ -51,7 +51,7 @@ static void f2fs_write_end_io(struct bio *bio, int err)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec_page(bvec);
 
 		if (unlikely(err)) {
 			set_page_dirty(page);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index daee4ab913da..7fd8f1555e32 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1307,7 +1307,7 @@ static inline bool is_merged_page(struct f2fs_sb_info *sbi,
 		goto out;
 
 	bio_for_each_segment_all(bvec, io->bio, i) {
-		if (page == bvec->bv_page) {
+		if (page == bvec_page(bvec)) {
 			up_read(&io->io_rwsem);
 			return true;
 		}
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae861dc94..2c1e14ca5971 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -173,7 +173,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
 				  int error)
 {
 	struct buffer_head *bh, *next;
-	struct page *page = bvec->bv_page;
+	struct page *page = bvec_page(bvec);
 	unsigned size;
 
 	bh = page_buffers(page);
@@ -215,7 +215,7 @@ static void gfs2_end_log_write(struct bio *bio, int error)
 	}
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		page = bvec->bv_page;
+		page = bvec_page(bvec);
 		if (page_has_buffers(page))
 			gfs2_end_log_write_bh(sdp, bvec, error);
 		else
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dcd7a40..4effe870b5aa 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,7 +1999,7 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)
 
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
+	bvec_set_page(&bio->bi_io_vec[0], bp->l_page);
 	bio->bi_io_vec[0].bv_len = LOGPSIZE;
 	bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
@@ -2145,7 +2145,7 @@ static void lbmStartIO(struct lbuf * bp)
 	bio = bio_alloc(GFP_NOFS, 1);
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
+	bvec_set_page(&bio->bi_io_vec[0], bp->l_page);
 	bio->bi_io_vec[0].bv_len = LOGPSIZE;
 	bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 76279e11982d..7daa0e336fdf 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -22,7 +22,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int rw)
 	bio_init(&bio);
 	bio.bi_max_vecs = 1;
 	bio.bi_io_vec = &bio_vec;
-	bio_vec.bv_page = page;
+	bvec_set_page(&bio_vec, page);
 	bio_vec.bv_len = PAGE_SIZE;
 	bio_vec.bv_offset = 0;
 	bio.bi_vcnt = 1;
@@ -65,8 +65,8 @@ static void writeseg_end_io(struct bio *bio, int err)
 	BUG_ON(err);
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		end_page_writeback(bvec->bv_page);
-		page_cache_release(bvec->bv_page);
+		end_page_writeback(bvec_page(bvec));
+		page_cache_release(bvec_page(bvec));
 	}
 	bio_put(bio);
 	if (atomic_dec_and_test(&super->s_pending_writes))
@@ -110,7 +110,7 @@ static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
 		}
 		page = find_lock_page(mapping, index + i);
 		BUG_ON(!page);
-		bio->bi_io_vec[i].bv_page = page;
+		bvec_set_page(&bio->bi_io_vec[i], page);
 		bio->bi_io_vec[i].bv_len = PAGE_SIZE;
 		bio->bi_io_vec[i].bv_offset = 0;
 
@@ -200,7 +200,7 @@ static int do_erase(struct super_block *sb, u64 ofs, pgoff_t index,
 			bio = bio_alloc(GFP_NOFS, max_pages);
 			BUG_ON(!bio);
 		}
-		bio->bi_io_vec[i].bv_page = super->s_erase_page;
+		bvec_set_page(&bio->bi_io_vec[i], super->s_erase_page);
 		bio->bi_io_vec[i].bv_len = PAGE_SIZE;
 		bio->bi_io_vec[i].bv_offset = 0;
 	}
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220babac..c570a63e0913 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,7 +48,7 @@ static void mpage_end_io(struct bio *bio, int err)
 	int i;
 
 	bio_for_each_segment_all(bv, bio, i) {
-		struct page *page = bv->bv_page;
+		struct page *page = bvec_page(bv);
 		page_endio(page, bio_data_dir(bio), err);
 	}
 
diff --git a/fs/splice.c b/fs/splice.c
index 7968da96bebb..14a4e67c9d26 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -998,7 +998,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 				goto done;
 			}
 
-			array[n].bv_page = buf->page;
+			bvec_set_page(&array[n], buf->page);
 			array[n].bv_len = this_len;
 			array[n].bv_offset = buf->offset;
 			left -= this_len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127c9958..f6a2427980f3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -490,7 +490,7 @@ static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
 	 * balancing is a lot nicer this way
 	 */
 	local_irq_save(*flags);
-	addr = (unsigned long) kmap_atomic(bvec->bv_page);
+	addr = (unsigned long) kmap_atomic(bvec_page(bvec));
 
 	BUG_ON(addr & ~PAGE_MASK);
 
@@ -508,7 +508,7 @@ static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
 #else
 static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
 {
-	return page_address(bvec->bv_page) + bvec->bv_offset;
+	return page_address(bvec_page(bvec)) + bvec->bv_offset;
 }
 
 static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c294e3e25e37..3193a0b7051f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -26,6 +26,16 @@ struct bio_vec {
 	unsigned int	bv_offset;
 };
 
+static inline struct page *bvec_page(const struct bio_vec *bvec)
+{
+	return bvec->bv_page;
+}
+
+static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
+{
+	bvec->bv_page = page;
+}
+
 #ifdef CONFIG_BLOCK
 
 struct bvec_iter {
diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
index 9a58bc258810..f2824bacb84d 100644
--- a/kernel/power/block_io.c
+++ b/kernel/power/block_io.c
@@ -90,7 +90,7 @@ int hib_wait_on_bio_chain(struct bio **bio_chain)
 		struct page *page;
 
 		next_bio = bio->bi_private;
-		page = bio->bi_io_vec[0].bv_page;
+		page = bvec_page(&bio->bi_io_vec[0]);
 		wait_on_page_locked(page);
 		if (!PageUptodate(page) || PageError(page))
 			ret = -EIO;
diff --git a/mm/page_io.c b/mm/page_io.c
index e6045804c8d8..c540dbc6a9e5 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,7 +33,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 	if (bio) {
 		bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
-		bio->bi_io_vec[0].bv_page = page;
+		bvec_set_page(&bio->bi_io_vec[0], page);
 		bio->bi_io_vec[0].bv_len = PAGE_SIZE;
 		bio->bi_io_vec[0].bv_offset = 0;
 		bio->bi_vcnt = 1;
@@ -46,7 +46,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 void end_swap_bio_write(struct bio *bio, int err)
 {
 	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
-	struct page *page = bio->bi_io_vec[0].bv_page;
+	struct page *page = bvec_page(&bio->bi_io_vec[0]);
 
 	if (!uptodate) {
 		SetPageError(page);
@@ -72,7 +72,7 @@ void end_swap_bio_write(struct bio *bio, int err)
 void end_swap_bio_read(struct bio *bio, int err)
 {
 	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
-	struct page *page = bio->bi_io_vec[0].bv_page;
+	struct page *page = bvec_page(&bio->bi_io_vec[0]);
 
 	if (!uptodate) {
 		SetPageError(page);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 6b3f54ed65ba..6babe3b8dd98 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -846,7 +846,7 @@ static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(*length > cursor->resid);
 	BUG_ON(*page_offset + *length > PAGE_SIZE);
 
-	return bio_vec.bv_page;
+	return bvec_page(&bio_vec);
 }
 
 static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,



* [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 23:05   ` Al Viro
  2015-03-16 20:25 ` [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Carry a pfn in a bio_vec rather than a page in support of allowing
bio(s) to reference unmapped (not struct page backed) persistent memory.

As Dave Hansen points out, it would be unfortunate if we ended up with
less type safety after this conversion, so introduce __pfn_t.

Cc: Matthew Wilcox <willy@linux.intel.com>
[willy: use pfn_t]
[kvm: "no, use __pfn_t, we already stole pfn_t"]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
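
For illustration only (not part of this patch), a minimal sketch of what
the typed pfn buys a consumer that never needs the struct page; the
function names below are made up and simply mirror the bvec_to_phys()
macro and bvec_page() helper touched in this patch, assuming the
pfn_to_phys(__pfn_t) helper added to <linux/scatterlist.h> here:

	/* physical address of a segment's data, no struct page required */
	static inline dma_addr_t example_bvec_phys(const struct bio_vec *bv)
	{
		return pfn_to_phys(bv->bv_pfn) + bv->bv_offset;
	}

	/* a struct page is only materialized when a caller really wants one */
	static inline struct page *example_bvec_page(const struct bio_vec *bv)
	{
		return pfn_to_page(bv->bv_pfn.pfn);
	}
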
 block/bio.c                        |    1 +
 block/blk-integrity.c              |    4 ++--
 block/blk-merge.c                  |    6 +++---
 block/bounce.c                     |    2 +-
 drivers/md/bcache/btree.c          |    2 +-
 include/asm-generic/memory_model.h |    4 ++++
 include/linux/bio.h                |   20 +++++++++++---------
 include/linux/blk_types.h          |   14 +++++++++++---
 include/linux/scatterlist.h        |   16 ++++++++++++++++
 include/linux/swiotlb.h            |    1 +
 mm/iov_iter.c                      |   22 +++++++++++-----------
 mm/page_io.c                       |    2 +-
 12 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..3d494e85e16d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -28,6 +28,7 @@
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/scatterlist.h>
 
 #include <trace/events/block.h>
 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 6c8b1d63e90b..34e53951a0d1 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	unsigned int segments = 0;
 	unsigned int seg_size = 0;
 	struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
 			    struct scatterlist *sglist)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	struct scatterlist *sg = NULL;
 	unsigned int segments = 0;
 	struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 39bd9925c057..8420d553b8ef 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
 	int cluster, high, highprv = 1;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 				   struct bio *nxt)
 {
-	struct bio_vec end_bv = { NULL }, nxt_bv;
+	struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
 	struct bvec_iter iter;
 
 	if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
 	struct bvec_iter iter;
 	int nsegs, cluster;
 
diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */
 
 #define bounce_copy_vec(to, vfrom)	\
-	memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+	memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
 
 		bio_for_each_segment_all(bv, b->bio, j)
-			memcpy(page_address(bv->bv_page),
+			memcpy(page_address(bvec_page(bv)),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..e6c2fda25820 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -72,6 +72,10 @@
 #define page_to_pfn __page_to_pfn
 #define pfn_to_page __pfn_to_page
 
+typedef struct {
+	unsigned long pfn;
+} __pfn_t;
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f6a2427980f3..f35c90d5fd4d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)				\
+	(__bvec_iter_bvec((bvec), (iter))->bv_pfn)
 
 #define bvec_iter_len(bvec, iter)				\
 	min((iter).bi_size,					\
@@ -75,7 +75,7 @@
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
-	.bv_page	= bvec_iter_page((bvec), (iter)),	\
+	.bv_pfn		= bvec_iter_pfn((bvec), (iter)),	\
 	.bv_len		= bvec_iter_len((bvec), (iter)),	\
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
@@ -83,14 +83,16 @@
 #define bio_iter_iovec(bio, iter)				\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))
 
-#define bio_iter_page(bio, iter)				\
-	bvec_iter_page((bio)->bi_io_vec, (iter))
+#define bio_iter_pfn(bio, iter)				\
+	bvec_iter_pfn((bio)->bi_io_vec, (iter))
 #define bio_iter_len(bio, iter)					\
 	bvec_iter_len((bio)->bi_io_vec, (iter))
 #define bio_iter_offset(bio, iter)				\
 	bvec_iter_offset((bio)->bi_io_vec, (iter))
 
-#define bio_page(bio)		bio_iter_page((bio), (bio)->bi_iter)
+#define bio_page(bio)	\
+		pfn_to_page((bio_iter_pfn((bio), (bio)->bi_iter)).pfn)
+#define bio_pfn(bio)		bio_iter_pfn((bio), (bio)->bi_iter)
 #define bio_offset(bio)		bio_iter_offset((bio), (bio)->bi_iter)
 #define bio_iovec(bio)		bio_iter_iovec((bio), (bio)->bi_iter)
 
@@ -150,8 +152,8 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)	(page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
-#define bvec_to_phys(bv)	(page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
+#define bio_to_phys(bio)	(pfn_to_phys(bio_pfn((bio))) + (unsigned long) bio_offset((bio)))
+#define bvec_to_phys(bv)	(pfn_to_phys((bv)->bv_pfn) + (unsigned long) (bv)->bv_offset)
 
 /*
  * queues that have highmem support enabled may still need to revert to
@@ -160,7 +162,7 @@ static inline void *bio_data(struct bio *bio)
  * I/O completely on that queue (see ide-dma for example)
  */
 #define __bio_kmap_atomic(bio, iter)				\
-	(kmap_atomic(bio_iter_iovec((bio), (iter)).bv_page) +	\
+	(kmap_atomic(bio_iter_iovec((bio), bvec_page(iter)) +	\
 		bio_iter_iovec((bio), (iter)).bv_offset)
 
 #define __bio_kunmap_atomic(addr)	kunmap_atomic(addr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3193a0b7051f..7f63fa3e4fda 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -5,7 +5,9 @@
 #ifndef __LINUX_BLK_TYPES_H
 #define __LINUX_BLK_TYPES_H
 
+#include <linux/scatterlist.h>
 #include <linux/types.h>
+#include <asm/pgtable.h>
 
 struct bio_set;
 struct bio;
@@ -21,19 +23,25 @@ typedef void (bio_destructor_t) (struct bio *);
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
 struct bio_vec {
-	struct page	*bv_page;
+	__pfn_t		bv_pfn;
 	unsigned int	bv_len;
 	unsigned int	bv_offset;
 };
 
+#define BIO_VEC_INIT(name) { .bv_pfn = { .pfn = 0 }, .bv_len = 0, \
+	.bv_offset = 0 }
+
+#define BIO_VEC(name) \
+	struct bio_vec name = BIO_VEC_INIT(name)
+
 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-	return bvec->bv_page;
+	return pfn_to_page(bvec->bv_pfn.pfn);
 }
 
 static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
 {
-	bvec->bv_page = page;
+	bvec->bv_pfn = page_to_pfn_typed(page);
 }
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..5a15b1ce3c9e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -9,6 +9,22 @@
 #include <asm/scatterlist.h>
 #include <asm/io.h>
 
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)      ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline dma_addr_t pfn_to_phys(__pfn_t pfn)
+{
+	return __pfn_to_phys(pfn.pfn);
+}
+
+static inline __pfn_t page_to_pfn_typed(struct page *page)
+{
+	__pfn_t pfn = { .pfn = page_to_pfn(page) };
+
+	return pfn;
+}
+
 struct sg_table {
 	struct scatterlist *sgl;	/* the list */
 	unsigned int nents;		/* number of mapped entries */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..dc3a94ce3b45 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -1,6 +1,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include <linux/dma-direction.h>
 #include <linux/types.h>
 
 struct device;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 827732047da1..be9a7c5b8703 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -61,7 +61,7 @@
 	__p = i->bvec;					\
 	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
 	if (likely(__v.bv_len)) {			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset + skip; 	\
 		(void)(STEP);				\
 		skip += __v.bv_len;			\
@@ -72,7 +72,7 @@
 		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
 		if (unlikely(!__v.bv_len))		\
 			continue;			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset;		\
 		(void)(STEP);				\
 		skip = __v.bv_len;			\
@@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
-		memcpy_to_page(v.bv_page, v.bv_offset,
+		memcpy_to_page(bvec_page(&v), v.bv_offset,
 			       (from += v.bv_len) - v.bv_len, v.bv_len),
 		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
 	)
@@ -390,7 +390,7 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -411,7 +411,7 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -456,7 +456,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
-		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+		memzero_page(bvec_page(&v), v.bv_offset, v.bv_len),
 		memset(v.iov_base, 0, v.iov_len)
 	)
 
@@ -471,7 +471,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
-		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((p += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -570,7 +570,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	0;}),({
 		/* can't be more than PAGE_SIZE */
 		*start = v.bv_offset;
-		get_page(*pages = v.bv_page);
+		get_page(*pages = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -624,7 +624,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		*pages = p = get_pages_array(1);
 		if (!p)
 			return -ENOMEM;
-		get_page(*p = v.bv_page);
+		get_page(*p = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -658,7 +658,7 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck(p + v.bv_offset,
 						 (to += v.bv_len) - v.bv_len,
 						 v.bv_len, 0);
@@ -702,7 +702,7 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
 						 p + v.bv_offset,
 						 v.bv_len, 0);
diff --git a/mm/page_io.c b/mm/page_io.c
index c540dbc6a9e5..b7c8d2c3f8f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,7 +265,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 		struct bio_vec bv = {
-			.bv_page = page,
+			.bv_pfn = page_to_pfn_typed(page),
 			.bv_len  = PAGE_SIZE,
 			.bv_offset = 0
 		};



* [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-18 11:21   ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-16 20:25 ` [RFC PATCH 4/7] scatterlist: use sg_phys() Dan Williams
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman, linux-fsdevel

This is in support of enabling block device drivers to perform DMA
to/from persistent memory which may not have a backing struct page
entry.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
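
For illustration only (not part of this patch), a minimal sketch of how a
driver holding a __pfn_t for persistent memory would map it for DMA once
its arch selects HAVE_DMA_PFN; the function and variable names are made
up, only dma_map_pfn() and dma_mapping_error() are from the real API:

	static int example_map_pmem(struct device *dev, __pfn_t pfn,
				    size_t offset, size_t len,
				    dma_addr_t *dma)
	{
		dma_addr_t addr;

		addr = dma_map_pfn(dev, pfn, offset, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, addr))
			return -ENOMEM;

		*dma = addr;
		return 0;
	}
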
 arch/Kconfig                             |    3 +++
 include/asm-generic/dma-mapping-common.h |   30 ++++++++++++++++++++++++++++++
 include/linux/dma-debug.h                |   23 +++++++++++++++++++----
 include/linux/dma-mapping.h              |    8 +++++++-
 lib/dma-debug.c                          |    4 ++--
 5 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 05d7a8a458d5..80ea3e124494 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
 config HAVE_DMA_CONTIGUOUS
 	bool
 
+config HAVE_DMA_PFN
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
        bool
 
diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 3378dcf4c31e..58fad817e51a 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 
 	kmemcheck_mark_initialized(ptr, size);
 	BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+	addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+			     (unsigned long)ptr & ~PAGE_MASK, size,
+			     dir, attrs);
+#else
 	addr = ops->map_page(dev, virt_to_page(ptr),
 			     (unsigned long)ptr & ~PAGE_MASK, size,
 			     dir, attrs);
+#endif
 	debug_dma_map_page(dev, virt_to_page(ptr),
 			   (unsigned long)ptr & ~PAGE_MASK, size,
 			   dir, addr, true);
@@ -68,6 +74,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
 		ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+				      size_t offset, size_t size,
+				      enum dma_data_direction dir)
+{
+	struct dma_map_ops *ops = get_dma_ops(dev);
+	dma_addr_t addr;
+
+	BUG_ON(!valid_dma_direction(dir));
+	addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+	debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+	return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+				      size_t offset, size_t size,
+				      enum dma_data_direction dir)
+{
+	kmemcheck_mark_initialized(page_address(page) + offset, size);
+	return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
 static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 				      size_t offset, size_t size,
 				      enum dma_data_direction dir)
@@ -82,6 +111,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 
 	return addr;
 }
+#endif /* CONFIG_HAVE_DMA_PFN */
 
 static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
 				  size_t size, enum dma_data_direction dir)
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..eb3e69c61e5e 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);
 
 extern int dma_debug_resize_entries(u32 num_entries);
 
-extern void debug_dma_map_page(struct device *dev, struct page *page,
-			       size_t offset, size_t size,
-			       int direction, dma_addr_t dma_addr,
-			       bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+			      size_t size, int direction, dma_addr_t dma_addr,
+			      bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+				      size_t offset, size_t size,
+				      int direction, dma_addr_t dma_addr,
+				      bool map_single)
+{
+	return debug_dma_map_pfn(dev, page_to_pfn_typed(page), offset, size,
+			direction, dma_addr, map_single);
+}
 
 extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
 
@@ -109,6 +117,13 @@ static inline void debug_dma_map_page(struct device *dev, struct page *page,
 {
 }
 
+static inline void debug_dma_map_pfn(struct device *dev, __pfn_t pfn,
+				     size_t offset, size_t size,
+				     int direction, dma_addr_t dma_addr,
+				     bool map_single)
+{
+}
+
 static inline void debug_dma_mapping_error(struct device *dev,
 					  dma_addr_t dma_addr)
 {
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index c3007cb4bfa6..6411621e4179 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -26,11 +26,17 @@ struct dma_map_ops {
 
 	int (*get_sgtable)(struct device *dev, struct sg_table *sgt, void *,
 			   dma_addr_t, size_t, struct dma_attrs *attrs);
-
+#ifdef CONFIG_HAVE_DMA_PFN
+	dma_addr_t (*map_pfn)(struct device *dev, __pfn_t pfn,
+			      unsigned long offset, size_t size,
+			      enum dma_data_direction dir,
+			      struct dma_attrs *attrs);
+#else
 	dma_addr_t (*map_page)(struct device *dev, struct page *page,
 			       unsigned long offset, size_t size,
 			       enum dma_data_direction dir,
 			       struct dma_attrs *attrs);
+#endif
 	void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
 			   size_t size, enum dma_data_direction dir,
 			   struct dma_attrs *attrs);
diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 9722bd2dbc9b..a447730fff97 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -1250,7 +1250,7 @@ out:
 	put_hash_bucket(bucket, &flags);
 }
 
-void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
+void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
 			size_t size, int direction, dma_addr_t dma_addr,
 			bool map_single)
 {
@@ -1268,7 +1268,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
 
 	entry->dev       = dev;
 	entry->type      = dma_debug_page;
-	entry->pfn	 = page_to_pfn(page);
+	entry->pfn	 = pfn.pfn;
 	entry->offset	 = offset,
 	entry->dev_addr  = dma_addr;
 	entry->size      = size;



* [RFC PATCH 4/7] scatterlist: use sg_phys()
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (2 preceding siblings ...)
  2015-03-16 20:25 ` [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 5/7] scatterlist: support "page-less" (__pfn_t only) entries Dan Williams
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman, linux-fsdevel

Coccinelle cleanup to replace open-coded scatterlist-to-physical-address
translations.  This is in preparation for introducing scatterlists that
reference pfn(s) without a backing struct page.
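
For reference, the replacement is mechanical: sg_phys() is already
defined in terms of the backing page, roughly

	static inline dma_addr_t sg_phys(struct scatterlist *sg)
	{
		return page_to_phys(sg_page(sg)) + sg->offset;
	}

so the conversions below change the spelling, not the generated code,
while removing the direct sg_page() dependency from the callers.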

// sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/mm/dma-mapping.c                    |    2 +-
 arch/microblaze/kernel/dma.c                 |    2 +-
 drivers/iommu/intel-iommu.c                  |    4 ++--
 drivers/iommu/iommu.c                        |    2 +-
 drivers/staging/android/ion/ion_chunk_heap.c |    4 ++--
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 170a116d1b29..ba7eeacfa8b1 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1472,7 +1472,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
 		return -ENOMEM;
 
 	for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
-		phys_addr_t phys = page_to_phys(sg_page(s));
+		phys_addr_t phys = sg_phys(s) - s->offset;
 		unsigned int len = PAGE_ALIGN(s->offset + s->length);
 
 		if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
 	/* FIXME this part of code is untested */
 	for_each_sg(sgl, sg, nents, i) {
 		sg->dma_address = sg_phys(sg);
-		__dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+		__dma_sync(sg_phys(sg),
 							sg->length, direction);
 	}
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ae4c1a854e57..e10d62f2e61f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2003,7 +2003,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			sg_res = aligned_nrpages(sg->offset, sg->length);
 			sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
 			sg->dma_length = sg->length;
-			pteval = page_to_phys(sg_page(sg)) | prot;
+			pteval = (sg_phys(sg) - sg->offset) | prot;
 			phys_pfn = pteval >> VTD_PAGE_SHIFT;
 		}
 
@@ -3303,7 +3303,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 
 	for_each_sg(sglist, sg, nelems, i) {
 		BUG_ON(!sg_page(sg));
-		sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+		sg->dma_address = sg_phys(sg);
 		sg->dma_length = sg->length;
 	}
 	return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 72e683df0731..d5e746370bbb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1153,7 +1153,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
 
 	for_each_sg(sg, s, nents, i) {
-		phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+		phys_addr_t phys = sg_phys(s);
 
 		/*
 		 * We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
 err:
 	sg = table->sgl;
 	for (i -= 1; i >= 0; i--) {
-		gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+		gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
 			      sg->length);
 		sg = sg_next(sg);
 	}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
 							DMA_BIDIRECTIONAL);
 
 	for_each_sg(table->sgl, sg, table->nents, i) {
-		gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+		gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
 			      sg->length);
 	}
 	chunk_heap->allocated -= allocated_size;


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 5/7] scatterlist: support "page-less" (__pfn_t only) entries
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (3 preceding siblings ...)
  2015-03-16 20:25 ` [RFC PATCH 4/7] scatterlist: use sg_phys() Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 6/7] x86: support dma_map_pfn() Dan Williams
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Given that an offset will never be more than PAGE_SIZE, steal the unused
bits of the offset to implement a flags field.  Move the existing "this
is a sg_chain() entry" flag to the new flags field, and add a new flag
(SG_FLAGS_PAGE) to indicate that there is a struct page backing for the
entry.
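
Under CONFIG_HAVE_DMA_PFN a scatterlist entry then carries an explicit
__pfn_t plus 16-bit offset and flags words, along the lines of

	struct scatterlist {
		__pfn_t		pfn;
		unsigned short	offset;
		unsigned short	sg_flags;	/* SG_FLAGS_{CHAIN,LAST,PAGE} */
		unsigned int	length;
		dma_addr_t	dma_address;
	};

where SG_FLAGS_CHAIN and SG_FLAGS_LAST take over the bits previously
hidden in the low bits of page_link, and SG_FLAGS_PAGE records whether
the pfn is backed by a struct page.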

[djbw: convert to pfn_to_phys() / HAVE_DMA_PFN]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 block/blk-merge.c                 |    2 -
 drivers/dma/ste_dma40.c           |    5 --
 drivers/mmc/card/queue.c          |    4 +-
 include/asm-generic/scatterlist.h |    6 +++
 include/crypto/scatterwalk.h      |   10 ++++
 include/linux/scatterlist.h       |   85 ++++++++++++++++++++++++++++++++++---
 6 files changed, 97 insertions(+), 15 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8420d553b8ef..727a30aa5990 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		if (rq->cmd_flags & REQ_WRITE)
 			memset(q->dma_drain_buffer, 0, q->dma_drain_size);
 
-		sg->page_link &= ~0x02;
+		sg_unmark_end(sg);
 		sg = sg_next(sg);
 		sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
 			    q->dma_drain_size,
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 68aca3334a17..7c28467eb366 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2560,10 +2560,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t dma_addr,
 		dma_addr += period_len;
 	}
 
-	sg[periods].offset = 0;
-	sg_dma_len(&sg[periods]) = 0;
-	sg[periods].page_link =
-		((unsigned long)sg | 0x01) & ~0x02;
+	sg_chain(sg, periods + 1, sg);
 
 	txd = d40_prep_sg(chan, sg, sg, periods, direction,
 			  DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 			sg_set_buf(__sg, buf + offset, len);
 			offset += len;
 			remain -= len;
-			(__sg++)->page_link &= ~0x02;
+			sg_unmark_end(__sg++);
 			sg_len++;
 		} while (remain);
 	}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 	list_for_each_entry(req, &packed->list, queuelist) {
 		sg_len += blk_rq_map_sg(mq->queue, req, __sg);
 		__sg = sg + (sg_len - 1);
-		(__sg++)->page_link &= ~0x02;
+		sg_unmark_end(__sg++);
 	}
 	sg_mark_end(sg + (sg_len - 1));
 	return sg_len;
diff --git a/include/asm-generic/scatterlist.h b/include/asm-generic/scatterlist.h
index 5de07355fad4..51174ad11664 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -7,8 +7,14 @@ struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
 	unsigned long	sg_magic;
 #endif
+#ifdef CONFIG_HAVE_DMA_PFN
+	__pfn_t		pfn;
+	unsigned short	offset;
+	unsigned short	sg_flags;
+#else
 	unsigned long	page_link;
 	unsigned int	offset;
+#endif
 	unsigned int	length;
 	dma_addr_t	dma_address;
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..7296d89a50b2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,6 +25,15 @@
 #include <linux/scatterlist.h>
 #include <linux/sched.h>
 
+#ifdef CONFIG_HAVE_DMA_PFN
+/*
+ * If we're using PFNs, the architecture must also have been converted to
+ * support SG_CHAIN.  So we can use the generic code instead of custom
+ * code.
+ */
+#define scatterwalk_sg_chain(prv, num, sgl)	sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl)		sg_next(sgl)
+#else
 static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 					struct scatterlist *sg2)
 {
@@ -32,6 +41,7 @@ static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 	sg1[num - 1].page_link &= ~0x02;
 	sg1[num - 1].page_link |= 0x01;
 }
+#endif
 
 static inline void scatterwalk_crypto_chain(struct scatterlist *head,
 					    struct scatterlist *sg,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 5a15b1ce3c9e..891542c7dda0 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 
+#include <asm/page.h>
 #include <asm/types.h>
 #include <asm/scatterlist.h>
 #include <asm/io.h>
@@ -34,8 +35,14 @@ struct sg_table {
 /*
  * Notes on SG table design.
  *
- * Architectures must provide an unsigned long page_link field in the
- * scatterlist struct. We use that to place the page pointer AND encode
+ * Architectures may define CONFIG_HAVE_DMA_PFN to indicate that they wish
+ * to support SGLs that point to pages which do not have a struct page to
+ * describe them.  If so, they should provide an sg_flags field in their
+ * scatterlist struct (see asm-generic for an example) as well as a pfn
+ * field.
+ *
+ * Otherwise, architectures must provide an unsigned long page_link field in
+ * the scatterlist struct. We use that to place the page pointer AND encode
  * information about the sg table as well. The two lower bits are reserved
  * for this information.
  *
@@ -49,16 +56,25 @@ struct sg_table {
  */
 
 #define SG_MAGIC	0x87654321
-
+#define SG_FLAGS_CHAIN	0x0001
+#define SG_FLAGS_LAST	0x0002
+#define SG_FLAGS_PAGE	0x0004
+
+#ifdef CONFIG_HAVE_DMA_PFN
+#define sg_is_chain(sg)		((sg)->sg_flags & SG_FLAGS_CHAIN)
+#define sg_is_last(sg)		((sg)->sg_flags & SG_FLAGS_LAST)
+#define sg_chain_ptr(sg)	((struct scatterlist *)((sg)->pfn.pfn))
+#else /* !CONFIG_HAVE_DMA_PFN */
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
+#define sg_is_chain(sg)		((sg)->page_link & SG_FLAGS_CHAIN)
+#define sg_is_last(sg)		((sg)->page_link & SG_FLAGS_LAST)
 #define sg_chain_ptr(sg)	\
 	((struct scatterlist *) ((sg)->page_link & ~0x03))
+#endif /* !CONFIG_HAVE_DMA_PFN */
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -72,6 +88,14 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
+#ifdef CONFIG_HAVE_DMA_PFN
+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+	BUG_ON(sg_is_chain(sg));
+#endif
+	sg->pfn = page_to_pfn_typed(page);
+	sg->sg_flags |= SG_FLAGS_PAGE;
+#else /* !CONFIG_HAVE_DMA_PFN */
 	unsigned long page_link = sg->page_link & 0x3;
 
 	/*
@@ -84,6 +108,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 	BUG_ON(sg_is_chain(sg));
 #endif
 	sg->page_link = page_link | (unsigned long) page;
+#endif /* !CONFIG_HAVE_DMA_PFN */
 }
 
 /**
@@ -104,9 +129,26 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 			       unsigned int len, unsigned int offset)
 {
 	sg_assign_page(sg, page);
+	BUG_ON(offset > 65535);
+	sg->offset = offset;
+	sg->length = len;
+}
+
+#ifdef CONFIG_HAVE_DMA_PFN
+static inline void sg_set_pfn(struct scatterlist *sg, __pfn_t pfn,
+			      unsigned int len, unsigned int offset)
+{
+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+	BUG_ON(sg_is_chain(sg));
+#endif
+	sg->pfn = pfn;
+	BUG_ON(offset > 65535);
 	sg->offset = offset;
+	sg->sg_flags = 0;
 	sg->length = len;
 }
+#endif
 
 static inline struct page *sg_page(struct scatterlist *sg)
 {
@@ -114,7 +156,12 @@ static inline struct page *sg_page(struct scatterlist *sg)
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
+#ifdef CONFIG_HAVE_DMA_PFN
+	BUG_ON(!(sg->sg_flags & SG_FLAGS_PAGE));
+	return pfn_to_page(sg->pfn.pfn);
+#else
 	return (struct page *)((sg)->page_link & ~0x3);
+#endif
 }
 
 /**
@@ -166,7 +213,12 @@ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
 	 * Set lowest bit to indicate a link pointer, and make sure to clear
 	 * the termination bit if it happens to be set.
 	 */
+#ifdef CONFIG_HAVE_DMA_PFN
+	prv[prv_nents - 1].pfn.pfn = (unsigned long) sgl;
+	prv[prv_nents - 1].sg_flags = SG_FLAGS_CHAIN;
+#else
 	prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+#endif
 }
 
 /**
@@ -186,8 +238,13 @@ static inline void sg_mark_end(struct scatterlist *sg)
 	/*
 	 * Set termination bit, clear potential chain bit
 	 */
-	sg->page_link |= 0x02;
-	sg->page_link &= ~0x01;
+#ifdef CONFIG_HAVE_DMA_PFN
+	sg->sg_flags |= SG_FLAGS_LAST;
+	sg->sg_flags &= ~SG_FLAGS_CHAIN;
+#else
+	sg->page_link |= SG_FLAGS_LAST;
+	sg->page_link &= ~SG_FLAGS_CHAIN;
+#endif
 }
 
 /**
@@ -203,7 +260,11 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 #endif
-	sg->page_link &= ~0x02;
+#ifdef CONFIG_HAVE_DMA_PFN
+	sg->sg_flags &= ~SG_FLAGS_LAST;
+#else
+	sg->page_link &= ~SG_FLAGS_LAST;
+#endif
 }
 
 /**
@@ -218,7 +279,11 @@ static inline void sg_unmark_end(struct scatterlist *sg)
  **/
 static inline dma_addr_t sg_phys(struct scatterlist *sg)
 {
+#ifdef CONFIG_HAVE_DMA_PFN
+	return pfn_to_phys(sg->pfn) + sg->offset;
+#else
 	return page_to_phys(sg_page(sg)) + sg->offset;
+#endif
 }
 
 /**
@@ -233,7 +298,11 @@ static inline dma_addr_t sg_phys(struct scatterlist *sg)
  **/
 static inline void *sg_virt(struct scatterlist *sg)
 {
+#ifdef CONFIG_HAVE_DMA_PFN
+	return __va(pfn_to_phys(sg->pfn)) + sg->offset;
+#else
 	return page_address(sg_page(sg)) + sg->offset;
+#endif
 }
 
 int sg_nents(struct scatterlist *sg);


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 6/7] x86: support dma_map_pfn()
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (4 preceding siblings ...)
  2015-03-16 20:25 ` [RFC PATCH 5/7] scatterlist: support "page-less" (__pfn_t only) entries Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:26 ` [RFC PATCH 7/7] block: base support for pfn i/o Dan Williams
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman, linux-fsdevel

Fix up x86 dma_map_ops to allow pfn-only mappings.

As long as a dma_map_sg() implementation uses the generic sg_phys()
helper, it can support scatterlists that carry a __pfn_t instead of a
struct page.
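
A pfn-capable block driver can then map persistent memory for i/o
without ever materializing a struct page.  A minimal sketch, assuming a
__pfn_t already extracted from a bio_vec (pmem_map_for_write() is a
hypothetical helper name, not part of this series):

	static dma_addr_t pmem_map_for_write(struct device *dev, __pfn_t pfn,
					     size_t offset, size_t len)
	{
		dma_addr_t addr;

		addr = dma_map_pfn(dev, pfn, offset, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, addr))
			return 0;	/* treated as failure in this sketch */
		return addr;
	}

The dma_map_pfn() entry point comes from the earlier dma-mapping patch
in this series.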

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig               |   12 ++++++++++++
 arch/x86/kernel/amd_gart_64.c  |   22 +++++++++++++++++-----
 arch/x86/kernel/pci-nommu.c    |   22 +++++++++++++++++-----
 arch/x86/kernel/pci-swiotlb.c  |    4 ++++
 arch/x86/pci/sta2x11-fixup.c   |    4 ++++
 arch/x86/xen/pci-swiotlb-xen.c |    4 ++++
 drivers/iommu/amd_iommu.c      |   21 ++++++++++++++++-----
 drivers/iommu/intel-iommu.c    |   22 +++++++++++++++++-----
 drivers/xen/swiotlb-xen.c      |   29 +++++++++++++++++++----------
 include/linux/swiotlb.h        |    4 ++++
 lib/swiotlb.c                  |   20 +++++++++++++++-----
 11 files changed, 129 insertions(+), 35 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..3be1c0ac0025 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -801,6 +801,7 @@ config CALGARY_IOMMU
 	bool "IBM Calgary IOMMU support"
 	select SWIOTLB
 	depends on X86_64 && PCI
+	depends on !HAVE_DMA_PFN
 	---help---
 	  Support for hardware IOMMUs in IBM's xSeries x366 and x460
 	  systems. Needed to run systems with more than 3GB of memory
@@ -1430,6 +1431,17 @@ config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config PMEM_DMA
+	bool "Support DMA to Persistent Memory"
+	select HAVE_DMA_PFN
+	---help---
+	  Enable drivers that are capable of performing DMA to
+	  Persistent Memory.  Drivers with this capability are prepared
+	  to map memory with either dma_map_pfn() or a dma_map_sg()
+	  implementation that is pfn-capable.  Note, some IOMMUs, like
+	  CONFIG_CALGARY_IOMMU, are incompatible (disabled by this
+	  option).
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..92c9f8139b08 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
 }
 
 /* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
-				unsigned long offset, size_t size,
-				enum dma_data_direction dir,
-				struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+			       unsigned long offset, size_t size,
+			       enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
 {
 	unsigned long bus;
-	phys_addr_t paddr = page_to_phys(page) + offset;
+	phys_addr_t paddr = pfn_to_phys(pfn) + offset;
 
 	if (!dev)
 		dev = &x86_dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
 	return bus;
 }
 
+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+		struct page *page, unsigned long offset, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	return gart_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir,
+			attrs);
+}
+
 /*
  * Free a DMA mapping.
  */
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
 static struct dma_map_ops gart_dma_ops = {
 	.map_sg				= gart_map_sg,
 	.unmap_sg			= gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn			= gart_map_pfn,
+#else
 	.map_page			= gart_map_page,
+#endif
 	.unmap_page			= gart_unmap_page,
 	.alloc				= gart_alloc_coherent,
 	.free				= gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..dfb66c4b8a73 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
 	return 1;
 }
 
-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
-				 unsigned long offset, size_t size,
-				 enum dma_data_direction dir,
-				 struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+				unsigned long offset, size_t size,
+				enum dma_data_direction dir,
+				struct dma_attrs *attrs)
 {
-	dma_addr_t bus = page_to_phys(page) + offset;
+	dma_addr_t bus = pfn_to_phys(pfn) + offset;
 	WARN_ON(size == 0);
 	if (!check_addr("map_single", dev, bus, size))
 		return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
 	return bus;
 }
 
+static __maybe_unused dma_addr_t nommu_map_page(struct device *dev,
+		struct page *page, unsigned long offset, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	return nommu_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir,
+			attrs);
+}
+
 /* Map a set of buffers described by scatterlist in streaming
  * mode for DMA.  This is the scatter-gather version of the
  * above pci_map_single interface.  Here the scatter gather list
@@ -92,7 +100,11 @@ struct dma_map_ops nommu_dma_ops = {
 	.alloc			= dma_generic_alloc_coherent,
 	.free			= dma_generic_free_coherent,
 	.map_sg			= nommu_map_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn		= nommu_map_pfn,
+#else
 	.map_page		= nommu_map_page,
+#endif
 	.sync_single_for_device = nommu_sync_single_for_device,
 	.sync_sg_for_device	= nommu_sync_sg_for_device,
 	.is_phys		= 1,
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 77dd0ad58be4..5351eb8c8f7f 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -48,7 +48,11 @@ static struct dma_map_ops swiotlb_dma_ops = {
 	.sync_sg_for_device = swiotlb_sync_sg_for_device,
 	.map_sg = swiotlb_map_sg_attrs,
 	.unmap_sg = swiotlb_unmap_sg_attrs,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn = swiotlb_map_pfn,
+#else
 	.map_page = swiotlb_map_page,
+#endif
 	.unmap_page = swiotlb_unmap_page,
 	.dma_supported = NULL,
 };
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 5ceda85b8687..d1c6e3808bb5 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -182,7 +182,11 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
 static struct dma_map_ops sta2x11_dma_ops = {
 	.alloc = sta2x11_swiotlb_alloc_coherent,
 	.free = x86_swiotlb_free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn = swiotlb_map_pfn,
+#else
 	.map_page = swiotlb_map_page,
+#endif
 	.unmap_page = swiotlb_unmap_page,
 	.map_sg = swiotlb_map_sg_attrs,
 	.unmap_sg = swiotlb_unmap_sg_attrs,
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 0e98e5d241d0..e65ea48d7aed 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -28,7 +28,11 @@ static struct dma_map_ops xen_swiotlb_dma_ops = {
 	.sync_sg_for_device = xen_swiotlb_sync_sg_for_device,
 	.map_sg = xen_swiotlb_map_sg_attrs,
 	.unmap_sg = xen_swiotlb_unmap_sg_attrs,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn = xen_swiotlb_map_pfn,
+#else
 	.map_page = xen_swiotlb_map_page,
+#endif
 	.unmap_page = xen_swiotlb_unmap_page,
 	.dma_supported = xen_swiotlb_dma_supported,
 };
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 48882c126245..65fc71985c14 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2765,16 +2765,15 @@ static void __unmap_single(struct dma_ops_domain *dma_dom,
 /*
  * The exported map_single function for dma_ops.
  */
-static dma_addr_t map_page(struct device *dev, struct page *page,
-			   unsigned long offset, size_t size,
-			   enum dma_data_direction dir,
-			   struct dma_attrs *attrs)
+static dma_addr_t map_pfn(struct device *dev, __pfn_t pfn, unsigned long offset,
+		size_t size, enum dma_data_direction dir,
+		struct dma_attrs *attrs)
 {
 	unsigned long flags;
 	struct protection_domain *domain;
 	dma_addr_t addr;
 	u64 dma_mask;
-	phys_addr_t paddr = page_to_phys(page) + offset;
+	phys_addr_t paddr = pfn_to_phys(pfn) + offset;
 
 	INC_STATS_COUNTER(cnt_map_single);
 
@@ -2799,6 +2798,14 @@ out:
 	spin_unlock_irqrestore(&domain->lock, flags);
 
 	return addr;
+
+}
+
+static __maybe_unused dma_addr_t map_page(struct device *dev, struct page *page,
+		unsigned long offset, size_t size, enum dma_data_direction dir,
+		struct dma_attrs *attrs)
+{
+	return map_pfn(dev, page_to_pfn_typed(page), offset, size, dir, attrs);
 }
 
 /*
@@ -3063,7 +3070,11 @@ static void __init prealloc_protection_domains(void)
 static struct dma_map_ops amd_iommu_dma_ops = {
 	.alloc = alloc_coherent,
 	.free = free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn = map_pfn,
+#else
 	.map_page = map_page,
+#endif
 	.unmap_page = unmap_page,
 	.map_sg = map_sg,
 	.unmap_sg = unmap_sg,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e10d62f2e61f..e3d304d4c162 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3087,15 +3087,23 @@ error:
 	return 0;
 }
 
-static dma_addr_t intel_map_page(struct device *dev, struct page *page,
-				 unsigned long offset, size_t size,
-				 enum dma_data_direction dir,
-				 struct dma_attrs *attrs)
+static dma_addr_t intel_map_pfn(struct device *dev, __pfn_t pfn,
+				unsigned long offset, size_t size,
+				enum dma_data_direction dir,
+				struct dma_attrs *attrs)
 {
-	return __intel_map_single(dev, page_to_phys(page) + offset, size,
+	return __intel_map_single(dev, pfn_to_phys(pfn) + offset, size,
 				  dir, *dev->dma_mask);
 }
 
+static __maybe_unused dma_addr_t intel_map_page(struct device *dev,
+		struct page *page, unsigned long offset, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	return intel_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir,
+			attrs);
+}
+
 static void flush_unmaps(void)
 {
 	int i, j;
@@ -3381,7 +3389,11 @@ struct dma_map_ops intel_dma_ops = {
 	.free = intel_free_coherent,
 	.map_sg = intel_map_sg,
 	.unmap_sg = intel_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn = intel_map_pfn,
+#else
 	.map_page = intel_map_page,
+#endif
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
 };
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 810ad419e34c..bd29d09bbacc 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -382,10 +382,10 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_free_coherent);
  * Once the device is given the dma address, the device owns this memory until
  * either xen_swiotlb_unmap_page or xen_swiotlb_dma_sync_single is performed.
  */
-dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
-				unsigned long offset, size_t size,
-				enum dma_data_direction dir,
-				struct dma_attrs *attrs)
+dma_addr_t xen_swiotlb_map_pfn(struct device *dev, unsigned long pfn,
+			       unsigned long offset, size_t size,
+			       enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
 {
-	phys_addr_t map, phys = page_to_phys(page) + offset;
+	phys_addr_t map, phys = ((phys_addr_t)pfn << PAGE_SHIFT) + offset;
 	dma_addr_t dev_addr = xen_phys_to_bus(phys);
@@ -429,6 +429,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 	}
 	return dev_addr;
 }
+EXPORT_SYMBOL_GPL(xen_swiotlb_map_pfn);
+
+dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
+				unsigned long offset, size_t size,
+				enum dma_data_direction dir,
+				struct dma_attrs *attrs)
+{
+	return xen_swiotlb_map_pfn(dev, page_to_pfn(page), offset, size, dir,
+			attrs);
+}
 EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
 
 /*
@@ -582,15 +592,14 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
 						attrs);
 			sg->dma_address = xen_phys_to_bus(map);
 		} else {
+			__pfn_t pfn = { .pfn = paddr >> PAGE_SHIFT };
+			unsigned long offset = paddr & ~PAGE_MASK;
+
 			/* we are not interested in the dma_addr returned by
 			 * xen_dma_map_page, only in the potential cache flushes executed
 			 * by the function. */
-			xen_dma_map_page(hwdev, pfn_to_page(paddr >> PAGE_SHIFT),
-						dev_addr,
-						paddr & ~PAGE_MASK,
-						sg->length,
-						dir,
-						attrs);
+			xen_dma_map_pfn(hwdev, pfn, dev_addr, offset,
+					sg->length, dir, attrs);
 			sg->dma_address = dev_addr;
 		}
 		sg_dma_len(sg) = sg->length;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dc3a94ce3b45..78b136b0b139 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -67,6 +67,10 @@ extern dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
 				   unsigned long offset, size_t size,
 				   enum dma_data_direction dir,
 				   struct dma_attrs *attrs);
+extern dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+				  unsigned long offset, size_t size,
+				  enum dma_data_direction dir,
+				  struct dma_attrs *attrs);
 extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 			       size_t size, enum dma_data_direction dir,
 			       struct dma_attrs *attrs);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 4abda074ea45..fc164041ec22 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -727,12 +727,12 @@ swiotlb_full(struct device *dev, size_t size, enum dma_data_direction dir,
  * Once the device is given the dma address, the device owns this memory until
  * either swiotlb_unmap_page or swiotlb_dma_sync_single is performed.
  */
-dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
-			    unsigned long offset, size_t size,
-			    enum dma_data_direction dir,
-			    struct dma_attrs *attrs)
+dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+			   unsigned long offset, size_t size,
+			   enum dma_data_direction dir,
+			   struct dma_attrs *attrs)
 {
-	phys_addr_t map, phys = page_to_phys(page) + offset;
+	phys_addr_t map, phys = pfn_to_phys(pfn) + offset;
 	dma_addr_t dev_addr = phys_to_dma(dev, phys);
 
 	BUG_ON(dir == DMA_NONE);
@@ -763,6 +763,16 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
 
 	return dev_addr;
 }
+EXPORT_SYMBOL_GPL(swiotlb_map_pfn);
+
+dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
+			    unsigned long offset, size_t size,
+			    enum dma_data_direction dir,
+			    struct dma_attrs *attrs)
+{
+	return swiotlb_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir,
+			attrs);
+}
 EXPORT_SYMBOL_GPL(swiotlb_map_page);
 
 /*


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 7/7] block: base support for pfn i/o
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (5 preceding siblings ...)
  2015-03-16 20:25 ` [RFC PATCH 6/7] x86: support dma_map_pfn() Dan Williams
@ 2015-03-16 20:26 ` Dan Williams
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
  2015-03-18 20:26 ` Andrew Morton
  8 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-16 20:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, hch, riel, linux-nvdimm, linux-raid, mgorman, linux-fsdevel

Allow block device drivers to opt in to receiving bio(s) where the
bio_vec(s) point to memory that is not backed by struct page entries.
When a driver opts in, it asserts that it will use the pfn version of
the dma mapping routines and does not use pfn_to_page() in its
submission path.
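
A pfn-capable driver would then opt in when it sets up its
request_queue and feed pfn-based entries through bio_add_pfn().  A
minimal sketch (pmem_attach_queue() and pmem_add_buffer() are
hypothetical names used only for illustration, not part of this series):

	static void pmem_attach_queue(struct request_queue *q)
	{
		/*
		 * Assert that the submission path maps i/o with dma_map_pfn()
		 * (or a pfn-capable dma_map_sg()) and never calls pfn_to_page().
		 */
		queue_flag_set_unlocked(QUEUE_FLAG_PFN, q);
	}

	static int pmem_add_buffer(struct bio *bio, __pfn_t pfn,
				   unsigned int len, unsigned int offset)
	{
		/* returns 0, leaving the bio untouched, on a non-PFN queue */
		return bio_add_pfn(bio, pfn, len, offset);
	}

Bios flagged BIO_PFN are rejected with -EOPNOTSUPP by
generic_make_request_checks() if they reach a queue that has not set
QUEUE_FLAG_PFN, so existing drivers are unaffected by default.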

TODO: add kmap_pfn() and kmap_atomic_pfn() for drivers that want to
touch bio_vec buffers with the CPU prior to submission to a
low-level device driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/bio.c               |   48 ++++++++++++++++++++++++++++++++++++++-------
 block/blk-core.c          |    5 +++++
 include/linux/blk_types.h |    1 +
 include/linux/blkdev.h    |    2 ++
 4 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3d494e85e16d..091c0071e360 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -568,6 +568,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -659,6 +660,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 		goto integrity_clone;
 	}
 
+	bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
+
 	bio_for_each_segment(bv, bio_src, iter)
 		bio->bi_io_vec[bio->bi_vcnt++] = bv;
 
@@ -700,9 +703,9 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
-			  *page, unsigned int len, unsigned int offset,
-			  unsigned int max_sectors)
+static int __bio_add_pfn(struct request_queue *q, struct bio *bio,
+		__pfn_t pfn, unsigned int len, unsigned int offset,
+		unsigned int max_sectors)
 {
 	int retried_segments = 0;
 	struct bio_vec *bvec;
@@ -724,7 +727,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (page == bvec_page(prev) &&
+		if (pfn.pfn == prev->bv_pfn.pfn &&
 		    offset == prev->bv_offset + prev->bv_len) {
 			unsigned int prev_bv_len = prev->bv_len;
 			prev->bv_len += len;
@@ -769,7 +772,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	 * cannot add the page
 	 */
 	bvec = &bio->bi_io_vec[bio->bi_vcnt];
-	bvec_set_page(bvec, page);
+	bvec->bv_pfn = pfn;
 	bvec->bv_len = len;
 	bvec->bv_offset = offset;
 	bio->bi_vcnt++;
@@ -819,7 +822,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	return len;
 
  failed:
-	bvec_set_page(bvec, NULL);
+	bvec->bv_pfn.pfn = 0;
 	bvec->bv_len = 0;
 	bvec->bv_offset = 0;
 	bio->bi_vcnt--;
@@ -846,7 +849,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
 		    unsigned int len, unsigned int offset)
 {
-	return __bio_add_page(q, bio, page, len, offset,
+	return __bio_add_pfn(q, bio, page_to_pfn_typed(page), len, offset,
 			      queue_max_hw_sectors(q));
 }
 EXPORT_SYMBOL(bio_add_pc_page);
@@ -873,10 +876,39 @@ int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
 	if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
 		max_sectors = len >> 9;
 
-	return __bio_add_page(q, bio, page, len, offset, max_sectors);
+	return __bio_add_pfn(q, bio, page_to_pfn_typed(page), len, offset,
+			max_sectors);
 }
 EXPORT_SYMBOL(bio_add_page);
 
+/**
+ *	bio_add_pfn -	attempt to add pfn to bio
+ *	@bio: destination bio
+ *	@pfn: pfn to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Identical to bio_add_page() except this variant flags the bio as
+ *	not having struct page backing.  A given request_queue must assert
+ *	that it is prepared to handle this constraint before bio(s)
+ *	flagged in this manner can be passed.
+ */
+int bio_add_pfn(struct bio *bio, __pfn_t pfn, unsigned int len,
+		unsigned int offset)
+{
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	unsigned int max_sectors;
+
+	if (!blk_queue_pfn(q))
+		return 0;
+	set_bit(BIO_PFN, &bio->bi_flags);
+	max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+	if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
+		max_sectors = len >> 9;
+
+	return __bio_add_pfn(q, bio, pfn, len, offset, max_sectors);
+}
+
 struct submit_bio_ret {
 	struct completion event;
 	int error;
diff --git a/block/blk-core.c b/block/blk-core.c
index 7830ce00cbf5..8bafb4c87c96 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1843,6 +1843,11 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	if (bio_flagged(bio, BIO_PFN) && !blk_queue_pfn(q)) {
+		err = -EOPNOTSUPP;
+		goto end_io;
+	}
+
 	/*
 	 * Various block parts want %current->io_context and lazy ioc
 	 * allocation ends up trading a lot of pain for a small amount of
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7f63fa3e4fda..653bb4fd0706 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -140,6 +140,7 @@ struct bio {
 #define BIO_NULL_MAPPED 8	/* contains invalid user pages */
 #define BIO_QUIET	9	/* Make BIO Quiet */
 #define BIO_SNAP_STABLE	10	/* bio data must be snapshotted during write */
+#define BIO_PFN		11	/* bio_vec references memory without struct page */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516f24de..e17ecefba80a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -513,6 +513,7 @@ struct request_queue {
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_SG_GAPS     22	/* queue doesn't support SG gaps */
+#define QUEUE_FLAG_PFN         23	/* queue supports pfn-only bio_vec(s) */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -594,6 +595,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_noxmerges(q)	\
 	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
 #define blk_queue_nonrot(q)	test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
+#define blk_queue_pfn(q)	test_bit(QUEUE_FLAG_PFN, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
 #define blk_queue_stackable(q)	\


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
@ 2015-03-16 23:05   ` Al Viro
  2015-03-17 13:02     ` Matthew Wilcox
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2015-03-16 23:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 827732047da1..be9a7c5b8703 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -61,7 +61,7 @@
>  	__p = i->bvec;					\
>  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
>  	if (likely(__v.bv_len)) {			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset + skip; 	\
>  		(void)(STEP);				\
>  		skip += __v.bv_len;			\
> @@ -72,7 +72,7 @@
>  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
>  		if (unlikely(!__v.bv_len))		\
>  			continue;			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset;		\
>  		(void)(STEP);				\
>  		skip = __v.bv_len;			\
> @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>  	iterate_and_advance(i, bytes, v,
>  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>  			       v.iov_len),
> -		memcpy_to_page(v.bv_page, v.bv_offset,
> +		memcpy_to_page(bvec_page(&v), v.bv_offset,

How had memcpy_to_page(NULL, ...) worked for you?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 23:05   ` Al Viro
@ 2015-03-17 13:02     ` Matthew Wilcox
  2015-03-17 15:53       ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2015-03-17 13:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel

On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> > index 827732047da1..be9a7c5b8703 100644
> > --- a/mm/iov_iter.c
> > +++ b/mm/iov_iter.c
> > @@ -61,7 +61,7 @@
> >  	__p = i->bvec;					\
> >  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
> >  	if (likely(__v.bv_len)) {			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset + skip; 	\
> >  		(void)(STEP);				\
> >  		skip += __v.bv_len;			\
> > @@ -72,7 +72,7 @@
> >  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
> >  		if (unlikely(!__v.bv_len))		\
> >  			continue;			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset;		\
> >  		(void)(STEP);				\
> >  		skip = __v.bv_len;			\
> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> >  	iterate_and_advance(i, bytes, v,
> >  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
> >  			       v.iov_len),
> > -		memcpy_to_page(v.bv_page, v.bv_offset,
> > +		memcpy_to_page(bvec_page(&v), v.bv_offset,
> 
> How had memcpy_to_page(NULL, ...) worked for you?

 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-       return bvec->bv_page;
+       return pfn_to_page(bvec->bv_pfn.pfn);
 }

(yes, more work to be done here to make copy_to_iter work to a bvec that
is actually targeting a page-less address, but these are RFC patches
showing the direction we're heading in while keeping current code working)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-17 13:02     ` Matthew Wilcox
@ 2015-03-17 15:53       ` Dan Williams
  0 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-17 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, linux-kernel, linux-arch, Jens Axboe, riel,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	Christoph Hellwig, linux-fsdevel

On Tue, Mar 17, 2015 at 6:02 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
>> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
>> > index 827732047da1..be9a7c5b8703 100644
>> > --- a/mm/iov_iter.c
>> > +++ b/mm/iov_iter.c
>> > @@ -61,7 +61,7 @@
>> >     __p = i->bvec;                                  \
>> >     __v.bv_len = min_t(size_t, n, __p->bv_len - skip);      \
>> >     if (likely(__v.bv_len)) {                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset + skip;  \
>> >             (void)(STEP);                           \
>> >             skip += __v.bv_len;                     \
>> > @@ -72,7 +72,7 @@
>> >             __v.bv_len = min_t(size_t, n, __p->bv_len);     \
>> >             if (unlikely(!__v.bv_len))              \
>> >                     continue;                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset;         \
>> >             (void)(STEP);                           \
>> >             skip = __v.bv_len;                      \
>> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>> >     iterate_and_advance(i, bytes, v,
>> >             __copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>> >                            v.iov_len),
>> > -           memcpy_to_page(v.bv_page, v.bv_offset,
>> > +           memcpy_to_page(bvec_page(&v), v.bv_offset,
>>
>> How had memcpy_to_page(NULL, ...) worked for you?
>
>  static inline struct page *bvec_page(const struct bio_vec *bvec)
>  {
> -       return bvec->bv_page;
> +       return pfn_to_page(bvec->bv_pfn.pfn);
>  }
>
> (yes, more work to be done here to make copy_to_iter work to a bvec that
> is actually targetting a page-less address, but these are RFC patches
> showing the direction we're heading in while keeping current code working)
>

Right, the next item to tackle is kmap() and kmap_atomic() before we
can start converting paths to be "native" pfn-only.
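
For ranges that sit in the kernel direct map (as x86-64 pmem would),
such a helper could be little more than the sketch below; kmap_pfn() is
a name assumed here for illustration only, it is not defined anywhere
in this series yet:

	static inline void *kmap_pfn(__pfn_t pfn)
	{
		/* assumes the pfn is covered by the direct mapping */
		return __va(pfn_to_phys(pfn));
	}

A highmem- and device-memory-aware variant would be needed before the
copy paths could go fully "native" pfn-only.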

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (6 preceding siblings ...)
  2015-03-16 20:26 ` [RFC PATCH 7/7] block: base support for pfn i/o Dan Williams
@ 2015-03-18 10:47 ` Boaz Harrosh
  2015-03-18 13:06   ` Matthew Wilcox
  2015-03-18 15:35   ` Dan Williams
  2015-03-18 20:26 ` Andrew Morton
  8 siblings, 2 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-18 10:47 UTC (permalink / raw)
  To: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds
  Cc: linux-arch, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

On 03/16/2015 10:25 PM, Dan Williams wrote:
> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.  

If you are saying "disaster", then we need to believe you. Or is there
a scientific proof for this?

Actually, what you are proposing below is the "real disaster".
(I do hope it is not impending)

> In conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places, it was simply convenient to re-use.
> 
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary.  One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID.  Another use
> case for the scatterlist changes is RDMA to a pfn-range.
> 
> This compiles and boots, but 0day-kbuild-robot coverage is needed before
> this set exits "RFC".  Obviously, the coccinelle script needs to be
> re-run on the block updates for kernel.next.  As is, this only includes
> the resulting auto-generated-patch against 4.0-rc3.
> 
> ---
> 
> Dan Williams (6):
>       block: add helpers for accessing a bio_vec page
>       block: convert bio_vec.bv_page to bv_pfn
>       dma-mapping: allow archs to optionally specify a ->map_pfn() operation
>       scatterlist: use sg_phys()
>       x86: support dma_map_pfn()
>       block: base support for pfn i/o
> 
> Matthew Wilcox (1):
>       scatterlist: support "page-less" (__pfn_t only) entries
> 
> 
>  arch/Kconfig                                 |    3 +
>  arch/arm/mm/dma-mapping.c                    |    2 -
>  arch/microblaze/kernel/dma.c                 |    2 -
>  arch/powerpc/sysdev/axonram.c                |    2 -
>  arch/x86/Kconfig                             |   12 +++
>  arch/x86/kernel/amd_gart_64.c                |   22 ++++--
>  arch/x86/kernel/pci-nommu.c                  |   22 ++++--
>  arch/x86/kernel/pci-swiotlb.c                |    4 +
>  arch/x86/pci/sta2x11-fixup.c                 |    4 +
>  arch/x86/xen/pci-swiotlb-xen.c               |    4 +
>  block/bio-integrity.c                        |    8 +-
>  block/bio.c                                  |   83 +++++++++++++++------
>  block/blk-core.c                             |    9 ++
>  block/blk-integrity.c                        |    7 +-
>  block/blk-lib.c                              |    2 -
>  block/blk-merge.c                            |   15 ++--
>  block/bounce.c                               |   26 +++----
>  drivers/block/aoe/aoecmd.c                   |    8 +-
>  drivers/block/brd.c                          |    2 -
>  drivers/block/drbd/drbd_bitmap.c             |    5 +
>  drivers/block/drbd/drbd_main.c               |    4 +
>  drivers/block/drbd/drbd_receiver.c           |    4 +
>  drivers/block/drbd/drbd_worker.c             |    3 +
>  drivers/block/floppy.c                       |    6 +-
>  drivers/block/loop.c                         |    8 +-
>  drivers/block/nbd.c                          |    8 +-
>  drivers/block/nvme-core.c                    |    2 -
>  drivers/block/pktcdvd.c                      |   11 ++-
>  drivers/block/ps3disk.c                      |    2 -
>  drivers/block/ps3vram.c                      |    2 -
>  drivers/block/rbd.c                          |    2 -
>  drivers/block/rsxx/dma.c                     |    3 +
>  drivers/block/umem.c                         |    2 -
>  drivers/block/zram/zram_drv.c                |   10 +--
>  drivers/dma/ste_dma40.c                      |    5 -
>  drivers/iommu/amd_iommu.c                    |   21 ++++-
>  drivers/iommu/intel-iommu.c                  |   26 +++++--
>  drivers/iommu/iommu.c                        |    2 -
>  drivers/md/bcache/btree.c                    |    4 +
>  drivers/md/bcache/debug.c                    |    6 +-
>  drivers/md/bcache/movinggc.c                 |    2 -
>  drivers/md/bcache/request.c                  |    6 +-
>  drivers/md/bcache/super.c                    |   10 +--
>  drivers/md/bcache/util.c                     |    5 +
>  drivers/md/bcache/writeback.c                |    2 -
>  drivers/md/dm-crypt.c                        |   12 ++-
>  drivers/md/dm-io.c                           |    2 -
>  drivers/md/dm-verity.c                       |    2 -
>  drivers/md/raid1.c                           |   50 +++++++------
>  drivers/md/raid10.c                          |   38 +++++-----
>  drivers/md/raid5.c                           |    6 +-
>  drivers/mmc/card/queue.c                     |    4 +
>  drivers/s390/block/dasd_diag.c               |    2 -
>  drivers/s390/block/dasd_eckd.c               |   14 ++--
>  drivers/s390/block/dasd_fba.c                |    6 +-
>  drivers/s390/block/dcssblk.c                 |    2 -
>  drivers/s390/block/scm_blk.c                 |    2 -
>  drivers/s390/block/scm_blk_cluster.c         |    2 -
>  drivers/s390/block/xpram.c                   |    2 -
>  drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
>  drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
>  drivers/scsi/sd_dif.c                        |    4 +
>  drivers/staging/android/ion/ion_chunk_heap.c |    4 +
>  drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
>  drivers/xen/biomerge.c                       |    4 +
>  drivers/xen/swiotlb-xen.c                    |   29 +++++--
>  fs/btrfs/check-integrity.c                   |    6 +-
>  fs/btrfs/compression.c                       |   12 ++-
>  fs/btrfs/disk-io.c                           |    4 +
>  fs/btrfs/extent_io.c                         |    8 +-
>  fs/btrfs/file-item.c                         |    8 +-
>  fs/btrfs/inode.c                             |   18 +++--
>  fs/btrfs/raid56.c                            |    4 +
>  fs/btrfs/volumes.c                           |    2 -
>  fs/buffer.c                                  |    4 +
>  fs/direct-io.c                               |    2 -
>  fs/exofs/ore.c                               |    4 +
>  fs/exofs/ore_raid.c                          |    2 -
>  fs/ext4/page-io.c                            |    2 -
>  fs/f2fs/data.c                               |    4 +
>  fs/f2fs/segment.c                            |    2 -
>  fs/gfs2/lops.c                               |    4 +
>  fs/jfs/jfs_logmgr.c                          |    4 +
>  fs/logfs/dev_bdev.c                          |   10 +--
>  fs/mpage.c                                   |    2 -
>  fs/splice.c                                  |    2 -
>  include/asm-generic/dma-mapping-common.h     |   30 ++++++++
>  include/asm-generic/memory_model.h           |    4 +
>  include/asm-generic/scatterlist.h            |    6 ++
>  include/crypto/scatterwalk.h                 |   10 +++
>  include/linux/bio.h                          |   24 +++---
>  include/linux/blk_types.h                    |   21 +++++
>  include/linux/blkdev.h                       |    2 +
>  include/linux/dma-debug.h                    |   23 +++++-
>  include/linux/dma-mapping.h                  |    8 ++
>  include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
>  include/linux/swiotlb.h                      |    5 +
>  kernel/power/block_io.c                      |    2 -
>  lib/dma-debug.c                              |    4 +
>  lib/swiotlb.c                                |   20 ++++-
>  mm/iov_iter.c                                |   22 +++---
>  mm/page_io.c                                 |    8 +-
>  net/ceph/messenger.c                         |    2 -

God! Look at this endless list of files and it is only the very beginning.
It does not even work and touches only 10% of what will need to be touched
for this to work, and very very marginally at that. There will always be
"another subsystem" that will not work. For example NUMA how will you do
NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
because our tests show a huge drop in performance if you do not do
NUMA aware allocation)

Al, Jens, Christoph, Andrew: think of the immediate stability nightmare and
the long-term torture of maintaining two code paths. Two sets of tests, and
the combinatorial explosion of tests.

I'm not one to be afraid of hard work, if it were for a good cause, but for what?
Really, for what? The block layer, and RDMA, and networking, and splice, and
whatever the heck anyone wants to imagine doing with pmem, already work
perfectly stably, right now!

We have set up an RDMA pmem target without a single line of extra code,
and the RDMA client was trivial to write. We have been sending down block-layer
BIOs from pmem from day one, and even iSCSI, NFS, and any kind of networking
directly from pmem, for almost a year now.

All it takes is two simple patches to mm that create a page section
for pmem. The kernel docs do say that a page is a construct that keeps track
of the state of a physical page in memory. Memory-mapped pmem is perfectly
that, and it has state that needs tracking just the same. Say that converted
block layer of yours now happens to be iSCSI and goes through the network
stack: it starts to need ref-counting, flags ... It has state.

Matthew, Dan. I don't get it. Don't you guys at Intel have anything better to
do? Why change half the kernel? For what? To achieve what? All your wildest
dreams about pmem are right here already. What is it that you guys want to do
with this code that we cannot already do? And I can show you two tons of things
you cannot do with this code that we can already do. With two simple patches.

If it is stability that you are concerned with - "what if a pmem page gets
to the wrong mm subsystem?" - there are a couple of small hardening patches,
and an extra page-flag allocated, that can make the whole thing foolproof. Though
up until now I have not encountered any problem.

>  103 files changed, 658 insertions(+), 335 deletions(-)

Please look, this is only the beginning. And it does not even work. Let us come
back to our senses. As true hackers, let's do the minimum effort to achieve new
heights. All it really takes to do all this is two little patches.

Cheers
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation
  2015-03-16 20:25 ` [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
@ 2015-03-18 11:21   ` Boaz Harrosh
  0 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-18 11:21 UTC (permalink / raw)
  To: Dan Williams, linux-kernel
  Cc: axboe, linux-raid, riel, linux-nvdimm, hch, mgorman, linux-fsdevel

On 03/16/2015 10:25 PM, Dan Williams wrote:
> This is in support of enabling block device drivers to perform DMA
> to/from persistent memory which may not have a backing struct page
> entry.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/Kconfig                             |    3 +++
>  include/asm-generic/dma-mapping-common.h |   30 ++++++++++++++++++++++++++++++
>  include/linux/dma-debug.h                |   23 +++++++++++++++++++----
>  include/linux/dma-mapping.h              |    8 +++++++-
>  lib/dma-debug.c                          |    4 ++--
>  5 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 05d7a8a458d5..80ea3e124494 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
>  config HAVE_DMA_CONTIGUOUS
>  	bool
>  
> +config HAVE_DMA_PFN
> +	bool
> +
>  config GENERIC_SMP_IDLE_THREAD
>         bool
>  
> diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
> index 3378dcf4c31e..58fad817e51a 100644
> --- a/include/asm-generic/dma-mapping-common.h
> +++ b/include/asm-generic/dma-mapping-common.h
> @@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  
>  	kmemcheck_mark_initialized(ptr, size);
>  	BUG_ON(!valid_dma_direction(dir));
> +#ifdef CONFIG_HAVE_DMA_PFN
> +	addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
> +			     (unsigned long)ptr & ~PAGE_MASK, size,
> +			     dir, attrs);
> +#else

Yes our beloved Kernel is full of #ifdef(s) in the middle of the code

Very beautiful

>  	addr = ops->map_page(dev, virt_to_page(ptr),
>  			     (unsigned long)ptr & ~PAGE_MASK, size,
>  			     dir, attrs);
> +#endif
>  	debug_dma_map_page(dev, virt_to_page(ptr),
>  			   (unsigned long)ptr & ~PAGE_MASK, size,
>  			   dir, addr, true);
> @@ -68,6 +74,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
>  		ops->unmap_sg(dev, sg, nents, dir, attrs);
>  }
>  
> +#ifdef CONFIG_HAVE_DMA_PFN
> +static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
> +				      size_t offset, size_t size,
> +				      enum dma_data_direction dir)
> +{
> +	struct dma_map_ops *ops = get_dma_ops(dev);
> +	dma_addr_t addr;
> +
> +	BUG_ON(!valid_dma_direction(dir));
> +	addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
> +	debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
> +
> +	return addr;
> +}
> +
> +static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
> +				      size_t offset, size_t size,
> +				      enum dma_data_direction dir)
> +{
> +	kmemcheck_mark_initialized(page_address(page) + offset, size);
> +	return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
> +}
> +#else

And in the middle of source code files

>  static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
>  				      size_t offset, size_t size,
>  				      enum dma_data_direction dir)
> @@ -82,6 +111,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
>  
>  	return addr;
>  }
> +#endif /* CONFIG_HAVE_DMA_PFN */
>  
>  static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
>  				  size_t size, enum dma_data_direction dir)
> diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
> index fe8cb610deac..eb3e69c61e5e 100644
> --- a/include/linux/dma-debug.h
> +++ b/include/linux/dma-debug.h
> @@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);
>  
>  extern int dma_debug_resize_entries(u32 num_entries);
>  
> -extern void debug_dma_map_page(struct device *dev, struct page *page,
> -			       size_t offset, size_t size,
> -			       int direction, dma_addr_t dma_addr,
> -			       bool map_single);
> +extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
> +			      size_t size, int direction, dma_addr_t dma_addr,
> +			      bool map_single);
> +
> +static inline void debug_dma_map_page(struct device *dev, struct page *page,
> +				      size_t offset, size_t size,
> +				      int direction, dma_addr_t dma_addr,
> +				      bool map_single)
> +{
> +	return debug_dma_map_pfn(dev, page_to_pfn_typed(page), offset, size,
> +			direction, dma_addr, map_single);
> +}
>  
>  extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
>  
> @@ -109,6 +117,13 @@ static inline void debug_dma_map_page(struct device *dev, struct page *page,
>  {
>  }
>  
> +static inline void debug_dma_map_pfn(struct device *dev, __pfn_t pfn,
> +				     size_t offset, size_t size,
> +				     int direction, dma_addr_t dma_addr,
> +				     bool map_single)
> +{
> +}
> +
>  static inline void debug_dma_mapping_error(struct device *dev,
>  					  dma_addr_t dma_addr)
>  {
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index c3007cb4bfa6..6411621e4179 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -26,11 +26,17 @@ struct dma_map_ops {
>  
>  	int (*get_sgtable)(struct device *dev, struct sg_table *sgt, void *,
>  			   dma_addr_t, size_t, struct dma_attrs *attrs);
> -
> +#ifdef CONFIG_HAVE_DMA_PFN
> +	dma_addr_t (*map_pfn)(struct device *dev, __pfn_t pfn,
> +			      unsigned long offset, size_t size,
> +			      enum dma_data_direction dir,
> +			      struct dma_attrs *attrs);
> +#else

And in the middle of structures

>  	dma_addr_t (*map_page)(struct device *dev, struct page *page,
>  			       unsigned long offset, size_t size,
>  			       enum dma_data_direction dir,
>  			       struct dma_attrs *attrs);
> +#endif
>  	void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
>  			   size_t size, enum dma_data_direction dir,
>  			   struct dma_attrs *attrs);
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 9722bd2dbc9b..a447730fff97 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -1250,7 +1250,7 @@ out:
>  	put_hash_bucket(bucket, &flags);
>  }
>  
> -void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
> +void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
>  			size_t size, int direction, dma_addr_t dma_addr,
>  			bool map_single)
>  {
> @@ -1268,7 +1268,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
>  
>  	entry->dev       = dev;
>  	entry->type      = dma_debug_page;
> -	entry->pfn	 = page_to_pfn(page);
> +	entry->pfn	 = pfn.pfn;
>  	entry->offset	 = offset,
>  	entry->dev_addr  = dma_addr;
>  	entry->size      = size;
> 

This is exactly what I meant. It is not only two different code paths, it is
two different compilation paths. This is a maintenance nightmare, and sure
bit rot.
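
To illustrate the complaint with a hypothetical driver snippet (the foo_*
names are made up, not from the patch set): every dma_map_ops instance
ends up needing the same #ifdef, because only one of the two members
exists in any given configuration:

	static struct dma_map_ops foo_dma_ops = {
	#ifdef CONFIG_HAVE_DMA_PFN
		.map_pfn	= foo_map_pfn,
	#else
		.map_page	= foo_map_page,
	#endif
		.unmap_page	= foo_unmap_page,
	};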

Real nice for nothing
Thanks, but no thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
@ 2015-03-18 13:06   ` Matthew Wilcox
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-18 15:35   ` Dan Williams
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2015-03-18 13:06 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds, linux-arch, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, linux-fsdevel

On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
> God! Look at this endless list of files and it is only the very beginning.
> It does not even work and touches only 10% of what will need to be touched
> for this to work, and very very marginally at that. There will always be
> "another subsystem" that will not work. For example NUMA how will you do
> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
> because our tests show a huge drop in performance if you do not do
> NUMA aware allocation)

You're very entertaining, but please, tone down your emails and stick
to facts.  The BIOS presents the persistent memory as one table entry
per NUMA node, so you get one block device per NUMA node.  There's no
mixing of memory from different NUMA nodes within a single filesystem,
unless you have a filesystem that uses multiple block devices.

> I'm not the one afraid of hard work, if it was for a good cause, but for what?
> really for what? The block layer, and RDMA, and networking, and spline, and what
> ever the heck any one wants to imagine to do with pmem, already works perfectly
> stable. right now!

The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
(the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
That's an unacceptable amount of overhead.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 13:06   ` Matthew Wilcox
@ 2015-03-18 14:38     ` Boaz Harrosh
  2015-03-20 15:56       ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-18 14:38 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, riel, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
>> God! Look at this endless list of files and it is only the very beginning.
>> It does not even work and touches only 10% of what will need to be touched
>> for this to work, and very very marginally at that. There will always be
>> "another subsystem" that will not work. For example NUMA how will you do
>> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
>> because our tests show a huge drop in performance if you do not do
>> NUMA aware allocation)
> 
> You're very entertaining, but please, tone down your emails and stick
> to facts.  The BIOS presents the persistent memory as one table entry
> per NUMA node, so you get one block device per NUMA node.  There's no
> mixing of memory from different NUMA nodes within a single filesystem,
> unless you have a filesystem that uses multiple block devices.
> 

Not with current BIOS: if the ranges are contiguous then they are presented as
one range (DDR3 BIOS). But I agree it is a bug, and in our configuration
we separate them into different pmem devices.

Yes I meant a "filesystem that uses multiple block devices"

>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>> really for what? The block layer, and RDMA, and networking, and spline, and what
>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>> stable. right now!
> 
> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
> That's an unacceptable amount of overhead.
> 

So let's fix the stacks to work nicely with 2M pages. That said, we can
also allocate the struct pages from pmem if we need to. The fact remains
that we need state down the different stacks, and this is the current
design overall.

I hate that you introduce a double design, a pfn-or-page and the
combinations of them. It is too much ugliness for my guts. I would
like a unified design that runs through the whole stack. Already we have
too much duplication for my taste, and I would love to see more
unification and not more splitting.

But the most important thing for me is: do we have to sacrifice the short
term for the long term? Such a massive change as you are proposing
will take years, for a theoretical 400GB DIMM. What about the
4G DIMMs now in people's hands; must they wait?
(Though I still do not agree with your design.)

I love the SPARSEMEM model of the "section" and the page being its
own identity relative to the virtual address & PFN of the section. We could
think of a much smaller page struct that only holds a ref-count
and flags, and have a bigger page type for regular use: separate the
low common part of the page, lay down clear rules about its use,
and a high part that is per user. But let us think of a unified
design throughout. (Most members of page are accessed through
wrappers; it would be relatively easy to split.)

And let us not sacrifice the now for the "far tomorrow"; we should
be able to do this incrementally, wasting more space now and saving
later.

[We can even invent a size-less page. You know how we encode
 the section ID directly into the 64-bit address of the page,
 so we can have a flag at the section that says this is a
 zero-size-page section and the needed info is stored in
 the section object. But I still think you will need state
 per page and that we do need a minimal size.
]

[BTW: The only 400GB DIMM I know of is real flash, not directly
 mapped to the CPU. OK, maybe read-only, but the erase/write cycle makes
 it logical-to-physical managed and not directly accessed.
]

And a personal note: I mean only to entertain. If anyone feels
I "toned up", please forgive me; I meant no such thing. As a rule,
if I come across too strong then please just laugh and don't take me
seriously. I only mean scientific soundness.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
  2015-03-18 13:06   ` Matthew Wilcox
@ 2015-03-18 15:35   ` Dan Williams
  1 sibling, 0 replies; 42+ messages in thread
From: Dan Williams @ 2015-03-18 15:35 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Christoph Hellwig, Al Viro,
	Andrew Morton, Linus Torvalds, linux-arch, riel,
	linux-nvdimm@lists.01.org, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

On Wed, Mar 18, 2015 at 3:47 AM, Boaz Harrosh <openosd@gmail.com> wrote:
> On 03/16/2015 10:25 PM, Dan Williams wrote:
>> Avoid the impending disaster of requiring struct page coverage for what
>> is expected to be ever increasing capacities of persistent memory.
>
> If you are saying "disaster", than we need to believe you. Or is there
> a scientific proof for this.

The same Moore's Law based extrapolation that Dave Chinner did to
determine that major feature development on XFS may cease in 5 - 7
years.  In Dave's words, we're looking ahead to "lots and fast".  Given
the time scale of getting kernel changes out to end users in an
enterprise kernel update, the "dynamic page struct allocation" approach
is already insufficient.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (7 preceding siblings ...)
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
@ 2015-03-18 20:26 ` Andrew Morton
  2015-03-19 13:43   ` Matthew Wilcox
  8 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2015-03-18 20:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:

> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.  In
> conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places, it was simply convenient to re-use.
> 
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary.  One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID.  Another use
> case for the scatterlist changes is RDMA to a pfn-range.

Those use-cases sound very thin.  If that's all we have then I'd say
"find another way of implementing those things without creating
pageframes for persistent memory".

IOW, please tell us much much much more about the value of this change.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 20:26 ` Andrew Morton
@ 2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Matthew Wilcox @ 2015-03-19 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel

On Wed, Mar 18, 2015 at 01:26:50PM -0700, Andrew Morton wrote:
> On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Avoid the impending disaster of requiring struct page coverage for what
> > is expected to be ever increasing capacities of persistent memory.  In
> > conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> > recently concluded Linux Storage Summit it became clear that struct page
> > is not required in many places, it was simply convenient to re-use.
> > 
> > Introduce helpers and infrastructure to remove struct page usage where
> > it is not necessary.  One use case for these changes is to implement a
> > write-back-cache in persistent memory for software-RAID.  Another use
> > case for the scatterlist changes is RDMA to a pfn-range.
> 
> Those use-cases sound very thin.  If that's all we have then I'd say
> "find another way of implementing those things without creating
> pageframes for persistent memory".
> 
> IOW, please tell us much much much more about the value of this change.

Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
want to be able to do any kind of I/O directly to persistent memory,
and I think we do, we need to do one of:

1. Construct struct pages for persistent memory
1a. Permanently
1b. While the pages are under I/O
2. Teach the I/O layers to deal in PFNs instead of struct pages
3. Replace struct page with some other structure that can represent both
   DRAM and PMEM

I'm personally a fan of #3, and I was looking at the scatterlist as
my preferred data structure.  I now believe the scatterlist as it is
currently defined isn't sufficient, so we probably end up needing a new
data structure.  I think Dan's preferred method of replacing struct
pages with PFNs is actually less intrusive, but doesn't give us as
much advantage (an entirely new data structure would let us move to an
extent-based system at the same time, instead of sticking with an array
of pages).  Clearly Boaz prefers 1a, which works well enough for the
8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.

What's your preference?  I guess option 0 is "force all I/O to go
through the page cache and then get copied", but that feels like a nasty
performance hit.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
@ 2015-03-19 15:54     ` Boaz Harrosh
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 18:17     ` Christoph Hellwig
  2015-03-20 16:21     ` Rik van Riel
  2 siblings, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-19 15:54 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen,
	linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
<>
> 
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit as well. I have a very surgical little
"hack" that we can do to allocate pages on demand before I/O.
It involves adding a new MEMORY_MODEL policy that is derived from
SPARSEMEM but lets you allocate individual pages on demand, and a new
type of page, say call it GP_emulated_page.
(Tell me if you find this interesting. It is 1/117th the size of either
 #2 or #3.)

In any case, please reconsider a configurable #1a for people that do
not mind sacrificing 1.2% of their pmem for real pages.

Even at 6G of page structs for 400G of pmem, people would love some of the
stuff this gives them today. Just a few examples: direct_access from within
a VM to a host-defined pmem is trivial, with no extra code, with my two
simple #1a patches; RDMA memory-brick targets; network shared-memory FS; and
so on. The list will always be bigger than with any of #1b, #2 or #3. Yes,
for people that are willing to pay the extra cost.

In the kernel it was always about choice and diversity. And what does it
cost us? Nothing: two small simple patches and a Kconfig option.
Note that I made it in such a way that if pmem is configured without
the use of pages, then the mm code is *not* configured in automatically.
We can even add a runtime option so that even if #1a is enabled, a certain
pmem device may not want pages allocated, and so choose at runtime rather
than compile time.

I think this will only further our cause and let people advance
their research and development with great new ideas about the use of pmem.
Then, once there is great demand for #1a and those large 512G devices
come out, we can go the #1b or #3 route and save them the extra 1.2%
of memory, but only once they have the appetite for it. (And Andrew's
question becomes clear.)

Our two ways need not be "either-or"; they can be "have both". I think
choice is a good thing for us here. Even with #3 available, #1a still has
merit in some configurations, and they can coexist perfectly.

Please think about it?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-19 18:17     ` Christoph Hellwig
  2015-03-19 19:31       ` Matthew Wilcox
  2015-03-22 16:46       ` Boaz Harrosh
  2015-03-20 16:21     ` Rik van Riel
  2 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2015-03-19 18:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel

On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote:
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

In addition to the options there's also a timeline.  At least for the
short term, where we want to get something going, 1a seems like the
absolute best option.  It works perfectly fine for the lots of
small-capacity dram-like nvdimms, and it works functionally fine for the
special huge ones, although the resource use for it is highly annoying.
If it turns out to be too annoying we can also offer a no-I/O-possible
option for them in the short run.

In the long run option 2) sounds like a good plan to me, but not as a
parallel I/O path, rather as the main one.  Doing so will in fact give us
options to experiment with 3).  Given that we're moving towards an
increasingly huge-page-using world, replacing the good old struct page
with something extent-like and/or temporary might be needed for dram
as well in the future.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 18:17     ` Christoph Hellwig
@ 2015-03-19 19:31       ` Matthew Wilcox
  2015-03-22 16:46       ` Boaz Harrosh
  1 sibling, 0 replies; 42+ messages in thread
From: Matthew Wilcox @ 2015-03-19 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel

On Thu, Mar 19, 2015 at 11:17:25AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote:
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
> > 3. Replace struct page with some other structure that can represent both
> >    DRAM and PMEM
> 
> In addition to the options there's also a time line.  At least for the
> short term where we want to get something going 1a seems like the
> absolutely be option.  It works perfectly fine for the lots of small
(assuming "best option")
> capacity dram-like nvdimms, and it works funtionally fine for the
> special huge ones, although the resource use for it is highly annoying.
> If it turns out to be too annoying we can also offer a no I/O possible
> option for them in the short run.
> 
> In the long run option 2) sounds like a good plan to me, but not as a
> parallel I/O path, but as the main one.  Doing so will in fact give us
> options to experiment with 3).  Given that we're moving towards an
> increasinly huge page using world replacing the good old struct page
> with something extent-like and/or temporary might be needed for dram
> as well in the future.

Dan's patches don't actually make it a "parallel I/O path", that was
Boaz's mischaracterisation.  They move all scatterlists and bios over
to using PFNs, at least on architectures which have been converted.
Speaking of architectures not being converted, it is really past time for
architectures to be switched to supporting SG chaining.  It was introduced
in 2007, and not having it generically available causes problems for
the crypto layer, as well as making further enhancements more tricky.

Assuming 'select ARCH_HAS_SG_CHAIN' is sufficient to tell, the following
architectures do support it:

arm arm64 ia64 powerpc s390 sparc x86

which means the following architectures are 8 years delinquent in
adding support:

alpha arc avr32 blackfin c6x cris frv hexagon m32r m68k metag microblaze
mips mn10300 nios2 openrisc parisc score sh tile um unicore32 xtensa

Perhaps we could deliberately make asm-generic/scatterlist.h not build
for architectures that don't select it in order to make them convert ...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Andrew Morton @ 2015-03-19 19:59 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
> <>
> > 
> > Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> > want to be able to do any kind of I/O directly to persistent memory,
> > and I think we do, we need to do one of:
> > 
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
> > 3. Replace struct page with some other structure that can represent both
> >    DRAM and PMEM
> > 
> > I'm personally a fan of #3, and I was looking at the scatterlist as
> > my preferred data structure.  I now believe the scatterlist as it is
> > currently defined isn't sufficient, so we probably end up needing a new
> > data structure.  I think Dan's preferred method of replacing struct
> > pages with PFNs is actually less instrusive, but doesn't give us as
> > much advantage (an entirely new data structure would let us move to an
> > extent based system at the same time, instead of sticking with an array
> > of pages).  Clearly Boaz prefers 1a, which works well enough for the
> > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> > 
> > What's your preference?  I guess option 0 is "force all I/O to go
> > through the page cache and then get copied", but that feels like a nasty
> > performance hit.
> 
> Thanks Matthew, you have summarized it perfectly.
> 
> I think #1b might have merit, as well.

It would be interesting to see what a 1b implementation looks like and
how it performs.  We already allocate a bunch of temporary things to
support in-flight IO (bio, request) and allocating pageframes on the
same basis seems a fairly logical fit.

It is all a bit of a stopgap, designed to shoehorn
direct-io-to-dax-mapped-memory into the existing world.  Longer term
I'd expect us to move to something more powerful, but it's unclear what
that will be at this time, so a stopgap isn't too bad?


This is all contingent upon the prevalence of machines which have vast
amounts of nv memory and relatively small amounts of regular memory. 
How confident are we that this really is the future?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
@ 2015-03-19 20:59         ` Dan Williams
  2015-03-22 17:22           ` Boaz Harrosh
  2015-03-20 17:32         ` Wols Lists
  2015-03-22 10:30         ` Boaz Harrosh
  2 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2015-03-19 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid,
	linux-nvdimm, Dave Hansen, linux-kernel, Christoph Hellwig,
	Mel Gorman, linux-fsdevel

On Thu, Mar 19, 2015 at 12:59 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
>
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>> >
>> > Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>> > want to be able to do any kind of I/O directly to persistent memory,
>> > and I think we do, we need to do one of:
>> >
>> > 1. Construct struct pages for persistent memory
>> > 1a. Permanently
>> > 1b. While the pages are under I/O
>> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
>> > 3. Replace struct page with some other structure that can represent both
>> >    DRAM and PMEM
>> >
>> > I'm personally a fan of #3, and I was looking at the scatterlist as
>> > my preferred data structure.  I now believe the scatterlist as it is
>> > currently defined isn't sufficient, so we probably end up needing a new
>> > data structure.  I think Dan's preferred method of replacing struct
>> > pages with PFNs is actually less instrusive, but doesn't give us as
>> > much advantage (an entirely new data structure would let us move to an
>> > extent based system at the same time, instead of sticking with an array
>> > of pages).  Clearly Boaz prefers 1a, which works well enough for the
>> > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>> >
>> > What's your preference?  I guess option 0 is "force all I/O to go
>> > through the page cache and then get copied", but that feels like a nasty
>> > performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
>
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

At least for block I/O it seems the only place we really need struct
page infrastructure is for kmap().  Given we already need a kmap_pfn()
solution for option 2, a "dynamic allocation" stop along that
development path may just naturally fall out.
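
A minimal sketch of what such a kmap_pfn() could look like, assuming
64-bit with no highmem so the pfn is reachable through the direct map
(this is not from the patch set):

	static inline void *kmap_pfn(__pfn_t pfn)
	{
		return __va((phys_addr_t)pfn.pfn << PAGE_SHIFT);
	}

	static inline void kunmap_pfn(void *addr)
	{
		/* nothing to undo for direct-map addresses */
	}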

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-20 15:56       ` Rik van Riel
  2015-03-22 11:53         ` Boaz Harrosh
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2015-03-20 15:56 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:

>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>> stable. right now!
>>
>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>> That's an unacceptable amount of overhead.
>>
> 
> So lets fix the stacks to work nice with 2M pages. That said we can
> allocate the struct page also from pmem if we need to. The fact remains
> that we need state down the different stacks and this is the current
> design over all.

Fixing the stack to work with 2M pages will be just as invasive,
and just as much work as making it work without a struct page.

What state do you need, exactly?

The struct page in the VM is mostly used for two things:
1) to get a memory address of the data
2) refcounting, to make sure the page does not go away
   during an IO operation, copy, etc...

Persistent memory cannot be paged out so (2) is not a concern, as
long as we ensure the object the page belongs to does not go away.
There are no seek times, so moving it around may not be necessary
either, making (1) not a concern.

The only case where (1) would be a concern is if we wanted to move
data in persistent memory around for better NUMA locality. However,
persistent memory DIMMs are on their way to being too large to move
the memory, anyway - all we can usefully do is detect where programs
are accessing memory, and move the programs there.

What state do you need that is not already represented?

1.5% overhead isn't a whole lot, but it appears to be unnecessary.

If you have a convincing argument as to why we need a struct page,
you might want to articulate it in order to convince us.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-19 18:17     ` Christoph Hellwig
@ 2015-03-20 16:21     ` Rik van Riel
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-22 15:51       ` Boaz Harrosh
  2 siblings, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2015-03-20 16:21 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/19/2015 09:43 AM, Matthew Wilcox wrote:

> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O

Michael Tsirkin and I have been doing some thinking about what
it would take to allocate struct pages per 2MB area permanently,
and allocate additional struct pages for 4kB pages on demand,
when a 2MB area is broken up into 4kB pages.

This should work for both DRAM and persistent memory.
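
A minimal sketch of that two-level idea, with invented names (this is
not Rik's or Michael's actual design): one permanent descriptor per 2MB
area, plus a 4kB struct page array attached only once the area is split:

	struct huge_frame {
		unsigned long	flags;
		atomic_t	refcount;
		struct page	*split;	/* NULL while still one 2MB unit */
	};

	static struct page *frame_to_page(struct huge_frame *hf, unsigned long pfn)
	{
		if (!hf->split)
			return NULL;	/* caller operates on the 2MB unit */
		return &hf->split[pfn & ((PMD_SIZE >> PAGE_SHIFT) - 1)];
	}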

I am still not convinced it is worthwhile to have struct pages
for persistent memory though, but I am willing to change my mind.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
@ 2015-03-20 17:32         ` Wols Lists
  2015-03-22 10:30         ` Boaz Harrosh
  2 siblings, 0 replies; 42+ messages in thread
From: Wols Lists @ 2015-03-20 17:32 UTC (permalink / raw)
  To: Andrew Morton, Boaz Harrosh
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 19/03/15 19:59, Andrew Morton wrote:
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?

Somewhat off-topic, but it's also the past. I can't help thinking of the
early Pick machines, which treated backing store as one giant permanent
virtual memory. Back when 300Mb hard drives were HUGE.

Cheers,
Wol



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 16:21     ` Rik van Riel
@ 2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
                           ` (2 more replies)
  2015-03-22 15:51       ` Boaz Harrosh
  1 sibling, 3 replies; 42+ messages in thread
From: Matthew Wilcox @ 2015-03-20 20:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote:
> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
> 
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> 
> Michael Tsirkin and I have been doing some thinking about what
> it would take to allocate struct pages per 2MB area permanently,
> and allocate additional struct pages for 4kB pages on demand,
> when a 2MB area is broken up into 4kB pages.

Ah!  I've looked at that a couple of times as well.  I asked our database
performance team what impact freeing up the memmap would have on their
performance.  They told me that doubling the amount of memory generally
resulted in approximately a 40% performance improvement.  So freeing up
1.5% additional memory would result in about 0.6% performance improvement,
which I thought was probably too small a return on investment to justify
turning memmap into a two-level data structure.

Persistent memory might change that calculation somewhat ... but I'm
not convinced.  Certainly, if we already had the ability to allocate
'struct superpage', I wouldn't be pushing for page-less I/Os, I'd just
allocate these data structures for PM.  Even if they were 128 bytes in
size, that's only a 25MB overhead per 400GB NV-DIMM, which feels quite
reasonable to me.

> This should work for both DRAM and persistent memory.
> 
> I am still not convinced it is worthwhile to have struct pages
> for persistent memory though, but I am willing to change my mind.

There's a lot of code out there that relies on a struct page describing
PAGE_SIZE bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
[1] in the biovec and auditing all of the code which touches it ... but
that's going to be a lot of code!  I'm not sure it's less code than
going directly to 'just do I/O on PFNs'.

[1] Please, somebody come up with a better name!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
@ 2015-03-20 21:08         ` Rik van Riel
  2015-03-22 17:06           ` Boaz Harrosh
  2015-03-20 21:17         ` Wols Lists
  2015-03-22 16:24         ` Boaz Harrosh
  2 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2015-03-20 21:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
> 
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Agreed, it should not be done for memory savings alone, but only
if it helps improve all kinds of other things.

>> This should work for both DRAM and persistent memory.
>>
>> I am still not convinced it is worthwhile to have struct pages
>> for persistent memory though, but I am willing to change my mind.
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.

Totally agreed here. I see absolutely no advantage to teaching the
IO layer about a "struct superpage" when it could operate on PFNs
just as easily.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
@ 2015-03-20 21:17         ` Wols Lists
  2015-03-22 16:24         ` Boaz Harrosh
  2 siblings, 0 replies; 42+ messages in thread
From: Wols Lists @ 2015-03-20 21:17 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 20/03/15 20:31, Matthew Wilcox wrote:
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Don't get me started on databases! This is very much a relational
problem; other databases don't suffer like this.

(imho relational theory is totally inappropriate for an engineering
problem, like designing a database engine ...)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
  2015-03-20 17:32         ` Wols Lists
@ 2015-03-22 10:30         ` Boaz Harrosh
  2 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 10:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 09:59 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
> 
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>>>
>>> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>>> want to be able to do any kind of I/O directly to persistent memory,
>>> and I think we do, we need to do one of:
>>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>> 2. Teach the I/O layers to deal in PFNs instead of struct pages
>>> 3. Replace struct page with some other structure that can represent both
>>>    DRAM and PMEM
>>>
>>> I'm personally a fan of #3, and I was looking at the scatterlist as
>>> my preferred data structure.  I now believe the scatterlist as it is
>>> currently defined isn't sufficient, so we probably end up needing a new
>>> data structure.  I think Dan's preferred method of replacing struct
>>> pages with PFNs is actually less instrusive, but doesn't give us as
>>> much advantage (an entirely new data structure would let us move to an
>>> extent based system at the same time, instead of sticking with an array
>>> of pages).  Clearly Boaz prefers 1a, which works well enough for the
>>> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>>>
>>> What's your preference?  I guess option 0 is "force all I/O to go
>>> through the page cache and then get copied", but that feels like a nasty
>>> performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
> 
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

There are a couple of ways we can do this; they are all kind of
"hacks" to me, along the lines of how transparent huge pages are a
hack, a very nice one at that, and everyone that knows me knows
I love hacks, be that as it may.

So it is all about designating the page to mean something else
when a flag is set.

And actually transparent huge pages are the core of this,
because there is already a switch on core page operations when
they are present (for example get/put_page).

And because we do not want to allocate pages inline, as part of a
section, we also need a bit of a new memory_model.h define.
(Maybe this can be avoided; I need to stare harder at this.)

> 
> It is all a bit of a stopgap, designed to shoehorn
> direct-io-to-dax-mapped-memory into the existing world.  Longer term
> I'd expect us to move to something more powerful, but it's unclear what
> that will be at this time, so a stopgap isn't too bad?
> 

I'd bet real huge pages are the long term. The one stopgap issue for
huge pages is that no one wants to dirty a full 2M for two changed
bytes. 4k is kind of the I/O performance granularity we all calculate
for. This can be solved in a couple of ways, all very invasive to lots
of kernel areas.

Lots of times the problem is "where do you start?"

> 
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?
> 

One thing you guys are ignoring is that the 1.5% "waste" can come
from nv-memory. If real RAM is scarce and nv-ram is dirt cheap,
just allocate the pages from nvram then.

Do not forget that very soon after the availability of real
nvram (I mean not the backed-up kind, but the real thing like MRAM
or ReRAM), lots of machines will be 100% nv-ram + SRAM caches.
This has nothing to do with storage speed; it is about
power consumption. The machine shuts off and picks up exactly
where it was. (Even at power-on they consume much less, no refreshes.)
In those machines a partition of storage, say the swap partition, will
be the volatile memory section of the machine, zeroed out on boot and
used as RAM.

So the future described above does not exist. Pages can just be allocated
from the cheapest memory you have, and be done with it.

(BTW, all this can already be done now; I have demonstrated it
 in the lab: a reserved NvDIMM memory region is memory-hot-plugged
 and is thereafter used as regular RAM.)

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 15:56       ` Rik van Riel
@ 2015-03-22 11:53         ` Boaz Harrosh
  0 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 11:53 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/20/2015 05:56 PM, Rik van Riel wrote:
> On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
>> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> 
>>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>>> stable. right now!
>>>
>>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>>> That's an unacceptable amount of overhead.
>>>
>>
>> So lets fix the stacks to work nice with 2M pages. That said we can
>> allocate the struct page also from pmem if we need to. The fact remains
>> that we need state down the different stacks and this is the current
>> design over all.
> 
> Fixing the stack to work with 2M pages will be just as invasive,
> and just as much work as making it work without a struct page.
> 
> What state do you need, exactly?
> 

It is not me that needs state, it is the kernel. Let me show you
what I can do now that uses state (and pages).

The block layer sends a bio via iSCSI; in turn it goes around and
sends it via the networking stack. Here the page ref is used, as well
as all kinds of page-based management. (This is half the kernel
converted right here.)
Same thing with iSER & RDMA. Same thing to a null target, via
the target stack, maybe via pass-through.

Another big example:
  In a user-mode application I mmap a portion of pmem, then
use the libvirt API to designate a named shared-memory object.
In the VM I use the same API to retrieve a pointer to that pmem
region and boom, I'm persistent. (The same can be done between
two VMs.)

mmap(pmem), send it to the network, to encryption, direct I/O,
RDMA, anything copyless.

So many subsystems use page_lock, page->lru, the page ref, and are
written to receive and manage pages. I do not like being
excluded from these systems, and I would very much hate
to re-write them. The block layer is one example.

> The struct page in the VM is mostly used for two things:
> 1) to get a memory address of the data
> 2) refcounting, to make sure the page does not go away
>    during an IO operation, copy, etc...
> 
> Persistent memory cannot be paged out so (2) is not a concern, as
> long as we ensure the object the page belongs to does not go away.
> There are no seek times, so moving it around may not be necessary
> either, making (1) not a concern.
> 

You lost me, sorry; I'm not sure what you meant here.
Yes, kmap/kunmap is moot. I do not see any use for highmem or
any 32-bitness with this thing.

Refcounting is used, sure, even with pmem, see above. Actually,
relying on the refcount's existence can solve some problems at
the pmem management level, problems that exist today. (RDMA while truncate.)

> The only case where (1) would be a concern is if we wanted to move
> data in persistent memory around for better NUMA locality. However,
> persistent memory DIMMs are on their way to being too large to move
> the memory, anyway - all we can usefully do is detect where programs
> are accessing memory, and move the programs there.
> 

So actually I have hands-on experience with this very problem.
We have observed that NUMA kills us. Going through the
memory_add_physaddr_to_nid() loop for every 4k operation was a pain, but
caching it via page_to_nid() (as part of the flags on 64-bit) is a very nice
optimization; we do NUMA-aware block allocation and it performs much better.
(Never like a single node, but magnitudes better than without.)
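
A minimal sketch of that caching, assuming a hypothetical struct
pmem_device with nid and phys_addr fields (not the real driver): resolve
the node once per device instead of once per 4k operation:

	static int pmem_nid(struct pmem_device *pmem)
	{
		if (pmem->nid == NUMA_NO_NODE)
			pmem->nid = memory_add_physaddr_to_nid(pmem->phys_addr);
		return pmem->nid;
	}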

> What state do you need that is not already represented?
> 

In most of these subsystems you guys are focused on, it is mostly read-only
state, except the page ref. But nevertheless the page carries added
information describing the pfn, like nid, mapping->ops, flags, etc.

And it is also a hub of translation:
give me a page and I know the pfn and vaddr; give me a pfn and I know the
page; give me a vaddr and I know the page. So I can move between all these domains.

Now I am sure that in hindsight we might have devised better structures
and abstractions that could carry all this information in a more abstract
and convenient way throughout the kernel. But for now this basic object
is the page, and it is passed around as in a relay race, each subsystem with
its own page-based meta-structure. The only real global token is
the page struct.

You are asking what is "not already represented"? I'm saying exactly that,
sir: it is already represented as a page struct. Anything else is in the
far, far future (if at all).

> 1.5% overhead isn't a whole lot, but it appears to be unnecessary.
> 

Unnecessary in a theoretical future with every single kernel
subsystem changed (maybe for the better, I'm not saying otherwise). And it
is not even clear at all what that future is.

But for the current code structure it is very much necessary. For the
long present, it is not 1.5% with or without; it is
need-to-copy versus direct(-1.5%).

[For me it is not even the performance of a memcpy, which exactly halves
 my pmem performance; it is the latency and the extra nightmare of locking
 and management needed to keep two copies of the same thing in sync.]

> If you have a convincing argument as to why we need a struct page,
> you might want to articulate it in order to convince us.
> 

The most simple convincing argument there is: "existing code". Apparently
the page was needed; maybe we can all think of much better constructs, but
for now this is what the kernel is based on. Until such time as we
better it, it is there.

Since when do we refrain from new technologies and new features because
"a major cleanup is needed"? I'm all for all the great
"change-every-file-in-the-kernel" ideas some guys have, but while at it
also change the small patch I added to support pmem.

For me pmem is now, at clients' systems, and I chose direct(-1.5%)
over need-to-copy, because it gives me the performance, and most
importantly the latency, that sells my products. What is your timetable?

Cheers
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 16:21     ` Rik van Riel
  2015-03-20 20:31       ` Matthew Wilcox
@ 2015-03-22 15:51       ` Boaz Harrosh
  2015-03-23 15:19         ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 15:51 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/20/2015 06:21 PM, Rik van Riel wrote:
> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
> 
>> 1. Construct struct pages for persistent memory
>> 1a. Permanently
>> 1b. While the pages are under I/O
> 
> Michael Tsirkin and I have been doing some thinking about what
> it would take to allocate struct pages per 2MB area permanently,
> and allocate additional struct pages for 4kB pages on demand,
> when a 2MB area is broken up into 4kB pages.
> 
> This should work for both DRAM and persistent memory.
> 

My thoughts as well, this need *not* be a huge evasive change. Is however
a careful surgery in very core code. And lots of sleepless scary nights
and testing to make sure all the side effects are wrinkled out.

BTW: basic core block code may very well work with:
	bv_page, bv_len > PAGE_SIZE, bv_offset > PAGE_SIZE.

  Meaning the pfn range behind bv_page is contiguous in physical space (and
  in virtual space, of course). So much so that there are already rumors that
  this is supposed to be supported, and there are already out-of-tree drivers
  that use it today by kmalloc'ing a page order and feeding BIOs with
  bv_len=64K (a sketch follows below).

  But step outside the block layer, say to networking via iscsi, and
  this breaks pretty fast. Let's fix that; then let's introduce a:
	page_size(page)
  A page already knows its size (i.e. whether it belongs to a 2M THP).
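
  Roughly like this (my untested sketch against the 4.0-era block API; the
  function name and the lack of error handling are mine):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/mm.h>
	#include <linux/slab.h>

	/* one physically contiguous 64K buffer carried by a single bvec */
	static int write_64k(struct block_device *bdev, sector_t sector)
	{
		void *buf = kmalloc(64 * 1024, GFP_KERNEL);	/* contiguous, 16 pages */
		struct bio *bio = bio_alloc(GFP_KERNEL, 1);	/* room for one bvec */

		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = sector;
		bio_add_page(bio, virt_to_page(buf), 64 * 1024, 0);	/* bv_len = 64K */

		return submit_bio_wait(WRITE, bio);
	}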

> I am still not convinced it is worthwhile to have struct pages
> for persistent memory though, but I am willing to change my mind.
> 

If we want copy-less I/O, we need a common memory-descriptor carrier. Today this
is page-struct. So for me your statement above means:
	"still not convinced I care about copy-less pmem"

Otherwise you either enhance what you have today or devise a new
system, which means changing the whole Kernel.

Lastly: why does pmem need to wait out-of-tree? Even you say above that
machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
not let pmem waste 4k pages like everyone else and fix it as above
down the line, for both pmem and RAM, and save both ways?
Why do we need to first change the whole Kernel, and only then have pmem? Why not
use the current infrastructure, for better or for worse, and incrementally
do better?

May I call you on the phone to try and work things out? I believe the
huge-page + 4k-on-demand approach is not a very big change, as long as
	struct page *page is left as is, everywhere,

but may *now* carry a physically/virtually contiguous payload
bigger than 4k. Is not PAGE_SIZE the real bug? Let's fix that problem.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
  2015-03-20 21:17         ` Wols Lists
@ 2015-03-22 16:24         ` Boaz Harrosh
  2 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 10:31 PM, Matthew Wilcox wrote:
<>
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  

Not so much, really. Not at the lower end of the stack. You can actually
build
	vp = kmalloc(64 * 1024, GFP_KERNEL);
	bv_page = virt_to_page(vp);
	bv_len = 64 * 1024;

and feed that to a hard drive. It works.

The last stronghold of PAGE_SIZE is the page-cache and page-fault
granularity, where smaller is better. But it should not be hard
to clean up the lower end of the stack, and even introduce a:
	page_size(page)

You will find that every subsystem that can work with a sub-page size,
similar to bv_len above, will also work just as well with a
bigger-than-PAGE_SIZE bv_len equivalent.

Only the BUG_ONs need to be converted to use page_size(page) instead of PAGE_SIZE.

> I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.
> 

struct page already knows how to be a super-page, with the THP mechanics.
All a page_size(page) needs is a call into its section; we do not need any
added storage in page-struct. (And we can cache this as a flag; we actually
already have a flag.)
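
Something like this (my sketch; no such helper exists in 4.0, and a real
version would want the tail-page case spelled out):

	#include <linux/mm.h>

	/* PAGE_SIZE for a plain page, 2M for a THP head page: the order
	 * is already recorded in the compound-page metadata */
	static inline unsigned long page_size(struct page *page)
	{
		return PAGE_SIZE << compound_order(page);
	}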

It looks like you are very trigger-happy about changing the
	"biovec and auditing all of the code which touches it"

I believe that, in the long long term, your #1b is the correct "full audit" path:

	The page is the virtual-to-page-to-physical descriptor + state.
	It is variable size.
 
> [1] Please, somebody come up with a better name!

Sure: struct page *page.

The one to kill is PAGE_SIZE. In most current code it can just become MIN_PAGE_SIZE,
with CACHE_PAGE_SIZE == MIN_PAGE_SIZE. The only novelty is enhancing split_huge_page
for the "page-fault granularity" case.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 18:17     ` Christoph Hellwig
  2015-03-19 19:31       ` Matthew Wilcox
@ 2015-03-22 16:46       ` Boaz Harrosh
  1 sibling, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 16:46 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel

On 03/19/2015 08:17 PM, Christoph Hellwig wrote:
<>
> 
> In addition to the options there's also a timeline.  At least for the
> short term, where we want to get something going, 1a seems like the
> absolutely best option.  It works perfectly fine for the many small-
> capacity dram-like nvdimms, and it works functionally fine for the
> special huge ones, although the resource use for it is highly annoying.
> If it turns out to be too annoying we can also offer a no-I/O-possible
> option for them in the short run.
> 

Finally, a voice in the desert.

> In the long run option 2) sounds like a good plan to me, but not as a
> parallel I/O path, but as the main one.  Doing so will in fact give us
> options to experiment with 3).  Given that we're moving towards an
> increasingly huge-page-using world, replacing the good old struct page
> with something extent-like and/or temporary might be needed for dram
> as well in the future.

Why? Why not just make a page mean page_size(page) bytes, and mostly even that
is not needed.

Any change to bio will only solve bio, and will push the problem to
the next subsystem.

Fix the PAGE_SIZE problem and you have fixed it for all subsystems, not only
bio. And I believe it is the smaller change by far.

Because in most places PAGE_SIZE just means MIN_PAGE_SIZE: when we
calculate array sizes for storage of a given "io-length", that is surely
4k, but when the actual I/O is performed at run time we usually
have a length specifier like bv_len. (And the few places that do not are
easy to fix, I believe.)

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 21:08         ` Rik van Riel
@ 2015-03-22 17:06           ` Boaz Harrosh
  2015-03-22 17:22             ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:06 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 11:08 PM, Rik van Riel wrote:
> On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
<>
>> There's a lot of code out there that relies on struct page being PAGE_SIZE
>> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
>> [1] in the biovec and auditing all of the code which touches it ... but
>> that's going to be a lot of code!  I'm not sure it's less code than
>> going directly to 'just do I/O on PFNs'.
> 
> Totally agreed here. I see absolutely no advantage to teaching the
> IO layer about a "struct superpage" when it could operate on PFNs
> just as easily.
> 

Or teach 'struct page' to be variable length. This is already the case at
the bio and sg level, so you have fixed nothing.

Moving to pfn-only means that all this unnamed code above that
"relies on struct page being PAGE_SIZE" is now not allowed to
interface with the bio and sg list. In current code, and in Dan's patches,
that means tons of BUG_ONs and return -ENOTSUPP for all the
subsystems below the bio and sglist that operate on page structs.

Say the "relies on struct page being PAGE_SIZE" is such an hard
work, which is not at all at the bio and sg-list level, will
it not be worth while fixing this instead of alienating the all
Kernel from the IO subsystem.

And I believe it is the much, much smaller change, especially considering
networking, RDMA, shared memory ...

Cheers
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 20:59         ` Dan Williams
@ 2015-03-22 17:22           ` Boaz Harrosh
  0 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Dan Williams, Andrew Morton
  Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid,
	linux-nvdimm, Dave Hansen, linux-kernel, Christoph Hellwig,
	Mel Gorman, linux-fsdevel

On 03/19/2015 10:59 PM, Dan Williams wrote:

> 
> At least for block-i/o it seems the only place we really need struct
> page infrastructure is for kmap().  Given we already need a kmap_pfn()
> solution for option 2 a "dynamic allocation" stop along that
> development path may just naturally fall out.

Really? What about networked block I/O, RDMA, FCoE, emulated targets,
mmapped pointers, virtual-machine bdev drivers?

The block layer sits in the middle of the stack, not at the low end as you
make it appear. There are lots of subsystems below the bio that tie into
a page struct, and they will now stop working, unless you do:

pfn_to_page(), which means a page-less pfn will now crash or will need
to be rejected. So anywhere you end up with
	if (page_less_pfn(pfn))
		... /* fail, or take some other path such as a copy */
	else
		page = pfn_to_page(pfn);

you have a double code path in the Kernel, and that is a nightmare to maintain.
(I'm here for you believe me ;-) )

Thanks
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:06           ` Boaz Harrosh
@ 2015-03-22 17:22             ` Dan Williams
  2015-03-22 17:39               ` Boaz Harrosh
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton, linux-kernel,
	linux-arch, Jens Axboe, linux-nvdimm, Dave Hansen, linux-raid,
	Mel Gorman, Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/20/2015 11:08 PM, Rik van Riel wrote:
>> On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> <>
>>> There's a lot of code out there that relies on struct page being PAGE_SIZE
>>> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
>>> [1] in the biovec and auditing all of the code which touches it ... but
>>> that's going to be a lot of code!  I'm not sure it's less code than
>>> going directly to 'just do I/O on PFNs'.
>>
>> Totally agreed here. I see absolutely no advantage to teaching the
>> IO layer about a "struct superpage" when it could operate on PFNs
>> just as easily.
>>
>
> Or teach 'struct page' to be variable length. This is already the case at
> the bio and sg level, so you have fixed nothing.
>
> Moving to pfn-only means that all this unnamed code above that
> "relies on struct page being PAGE_SIZE" is now not allowed to
> interface with the bio and sg list. In current code, and in Dan's patches,
> that means tons of BUG_ONs and return -ENOTSUPP for all the
> subsystems below the bio and sglist that operate on page structs.

I'm not convinced it will be that bad.  In hyperbolic terms,
continuing to overload struct page means we get to let floppy.c do i/o
from pmem, who needs that level of compatibility?

Similar to sg_chain support I think it's fine to let sub-systems /
archs add pmem i/o support over time.  It's a scaling problem our
development model is good at.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:22             ` Dan Williams
@ 2015-03-22 17:39               ` Boaz Harrosh
  0 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton, linux-kernel,
	linux-arch, Jens Axboe, linux-nvdimm, Dave Hansen, linux-raid,
	Mel Gorman, Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On 03/22/2015 07:22 PM, Dan Williams wrote:
> On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
>>
>> Moving to pfn-only means that all this unnamed code above that
>> "relies on struct page being PAGE_SIZE" is now not allowed to
>> interface with the bio and sg list. In current code, and in Dan's patches,
>> that means tons of BUG_ONs and return -ENOTSUPP for all the
>> subsystems below the bio and sglist that operate on page structs.
> 
> I'm not convinced it will be that bad.  In hyperbolic terms,
> continuing to overload struct page means we get to let floppy.c do i/o
> from pmem, who needs that level of compatibility?
> 

But you do need to make sure it does not crash, right?

> Similar to sg_chain support I think it's fine to let sub-systems /
> archs add pmem i/o support over time.  It's a scaling problem our
> development model is good at.
> 

You are so eager to make all this massive change, and willing to do it
over a decade (judging by your own example of sg-chain).

But you completely ignore that what I'm saying is that
nothing needs to fundamentally change at all. No "support over time"
and no "scaling problem" at all. All we need to fix is that a page-struct
may mean not PAGE_SIZE but some other size.

A much smaller change, and full cross-Kernel compatibility. What's
not to like?

Cheers
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 15:51       ` Boaz Harrosh
@ 2015-03-23 15:19         ` Rik van Riel
  2015-03-23 19:30           ` Christoph Hellwig
  2015-03-24  9:41           ` Boaz Harrosh
  0 siblings, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2015-03-23 15:19 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/22/2015 11:51 AM, Boaz Harrosh wrote:
> On 03/20/2015 06:21 PM, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
>>
>> This should work for both DRAM and persistent memory.
> 
> My thoughts as well: this need *not* be a huge invasive change. It is, however,
> careful surgery in very core code, and lots of sleepless, scary nights
> of testing to make sure all the side effects are ironed out.

Even the above IS a huge invasive change, and I do not see it
as much better than the work Dan and Matthew are doing.

> If we want copy-less I/O, we need a common memory-descriptor carrier. Today this
> is page-struct. So for me your statement above means:
> 	"still not convinced I care about copy-less pmem"
>
> Otherwise you either enhance what you have today or devise a new
> system, which means changing the whole Kernel.

We do not necessarily need a common descriptor, as much as
one that abstracts out what is happening. Something like a
struct bio could be a good I/O descriptor, and releasing the
backing memory after IO completion could be a function of the
bio freeing function itself.
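
For example, something like this (a sketch only, using the 4.0-era completion
prototype; release_ondemand_page() is a made-up helper, not an existing function):

	/* drop the on-demand 4kB struct pages once the I/O has completed */
	static void pmem_bio_end_io(struct bio *bio, int err)
	{
		struct bio_vec *bvec;
		int i;

		bio_for_each_segment_all(bvec, bio, i)
			release_ondemand_page(bvec->bv_page);	/* hypothetical helper */

		bio_put(bio);
	}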

> Lastly: why does pmem need to wait out-of-tree? Even you say above that
> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
> not let pmem waste 4k pages like everyone else and fix it as above
> down the line, for both pmem and RAM, and save both ways?
> Why do we need to first change the whole Kernel, and only then have pmem? Why not
> use the current infrastructure, for better or for worse, and incrementally
> do better?

There are two things going on here:

1) You want to keep using struct page for now, while there are
   subsystems that require it. This is perfectly legitimate.

2) Matthew and Dan are changing over some subsystems to no longer
   require struct page. This is perfectly legitimate.

I do not understand why either of you would have to object to what
the other is doing. There is room to keep using struct page until
the rest of the kernel no longer requires it.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 15:19         ` Rik van Riel
@ 2015-03-23 19:30           ` Christoph Hellwig
  2015-03-24  9:41           ` Boaz Harrosh
  1 sibling, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2015-03-23 19:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Boaz Harrosh, Matthew Wilcox, Andrew Morton, Dan Williams,
	linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin

On Mon, Mar 23, 2015 at 11:19:07AM -0400, Rik van Riel wrote:
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 
> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.

*nod*

I'd really like to merge the struct page based pmem driver ASAP.  We can
then look into work that avoids the need for struct page, and I think Dan
is doing some good work in that direction.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 15:19         ` Rik van Riel
  2015-03-23 19:30           ` Christoph Hellwig
@ 2015-03-24  9:41           ` Boaz Harrosh
  2015-03-24 16:57             ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2015-03-24  9:41 UTC (permalink / raw)
  To: Rik van Riel, Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/23/2015 05:19 PM, Rik van Riel wrote:
>>> Michael Tsirkin and I have been doing some thinking about what
>>> it would take to allocate struct pages per 2MB area permanently,
>>> and allocate additional struct pages for 4kB pages on demand,
>>> when a 2MB area is broken up into 4kB pages.
>>
>> My thoughts as well: this need *not* be a huge invasive change. It is, however,
>> careful surgery in very core code, and lots of sleepless, scary nights
>> of testing to make sure all the side effects are ironed out.
> 
> Even the above IS a huge invasive change, and I do not see it
> as much better than the work Dan and Matthew are doing.
> 

You lost me again. Sorry for my slowness. The code I envision is not
invasive at all. Nothing is touched at all, except a few core places
at the page level.

The contract with the Kernel stays the same:
	page_to_pfn, pfn_to_page, page_address (which is what kmap_atomic boils down to on 64-bit),
	virt_to_page, get_page/put_page, and so on...

So none of the Kernel code needs to change at all. You were saying that we
might have a 2M page and, on demand, allocate 4k pages and shove them down the
stack (which does not change at all), and once the I/O is back, the 4k pages can be
freed and recycled for reuse with other I/O. This is what I thought you said.

This is doable, and not that much work, and for the life of me I do not see anything
"invasive" about it. (Yes, a few core headers so that everything compiles ;-).)

That said, I do not even think we need that (2M split to 4k on demand); we can do
even better and make sure 2M pages just work as-is. It is very possible today
(tested) to push a 2M page into a bio and write it to a bdev. Yes, lots of side
code will break, but the core path is clean. Let us fix that, then.

(Need I send code to show you how a 2M page is written with a single
 bvec?)
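
Roughly along these lines (a sketch only, not the exact tested code; it assumes
the 4.0-era block API, order 9 == 2M on x86-64, and skips error handling):

	#include <linux/bio.h>
	#include <linux/gfp.h>

	/* one 2M compound page carried by a single bvec */
	static int write_2m(struct block_device *bdev, sector_t sector)
	{
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_COMP, 9);
		struct bio *bio = bio_alloc(GFP_KERNEL, 1);

		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = sector;
		bio_add_page(bio, page, 2 * 1024 * 1024, 0);	/* bv_len = 2M */

		return submit_bio_wait(WRITE, bio);
	}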

>> If we want copy-less I/O, we need a common memory-descriptor carrier. Today this
>> is page-struct. So for me your statement above means:
>> 	"still not convinced I care about copy-less pmem"
>>
>> Otherwise you either enhance what you have today or devise a new
>> system, which means changing the whole Kernel.
> 
> We do not necessarily need a common descriptor, as much as
> one that abstracts out what is happening. Something like a
> struct bio could be a good I/O descriptor, and releasing the
> backing memory after IO completion could be a function of the
> bio freeing function itself.
> 

You lost me again, sorry. What backing memory? struct bio is already
an I/O descriptor which gets freed after use. How is that relevant
to pfn vs. page?

>> Lastly: why does pmem need to wait out-of-tree? Even you say above that
>> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
>> not let pmem waste 4k pages like everyone else and fix it as above
>> down the line, for both pmem and RAM, and save both ways?
>> Why do we need to first change the whole Kernel, and only then have pmem? Why not
>> use the current infrastructure, for better or for worse, and incrementally
>> do better?
> 
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 

How is this legitimate when you need to interface the [1] subsystems
under the [2] subsystem? A subsystem that expects pages is now not
usable by [2].

Today *all* the Kernel subsystems are [1]. Period. How does it become
legitimate to now start *two* competing abstractions that do the same thing
differently in our kernel? We have too much diversity, not too little.

> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.
> 

So this is your vision: "until the rest of the kernel no longer requires
pages". Really? Sigh. Coming from other Kernels, I thought pages were
a breath of fresh air; I thought the design was very clever. And BTW, good luck
with that.

BTW: you have not solved the basic problem yet. Take pfn_kmap(): given a
pfn, what is its virtual address (sketched below)? Would you like to loop through the
Kernel's range tables looking for the registered ioremap? It is a long,
annoying loop. The page was invented exactly for this reason: to go through
the section object. And it is actually not that easy, because an ioremapped
pointer lives in one list and a page is looked up another way, and on top of
all this it is ARCH dependent. You are also trashing highmem, because the
state and the locks there live at the page level. Not that I care about highmem,
but I hate double coding. For god's sake, what do you guys have against poor old
pages? They were invented to do exactly this: abstract away the management of a
single pfn-to-virt mapping.
All I see are complaints about a page being 4K. Well, it need not be: a page can be
any size, and hell, it can be variable size. (And no, we do not need to add an extra
size member; all we need is the one bit.)
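
To make the problem concrete (a sketch only; pfn_kmap() and
lookup_ioremapped_range() are hypothetical names, not existing kernel APIs):

	/* given a bare pfn, finding a kernel virtual address needs a second
	 * lookup path once struct page is out of the picture */
	static void *pfn_kmap(unsigned long pfn)
	{
		if (pfn_valid(pfn))			/* ordinary RAM: go through the page */
			return page_address(pfn_to_page(pfn));

		/* page-less pmem: walk whatever list of ioremapped ranges
		 * the driver keeps. The long, annoying loop. */
		return lookup_ioremapped_range(pfn);	/* hypothetical */
	}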

Cheers
Boaz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-24  9:41           ` Boaz Harrosh
@ 2015-03-24 16:57             ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2015-03-24 16:57 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/24/2015 05:41 AM, Boaz Harrosh wrote:
> On 03/23/2015 05:19 PM, Rik van Riel wrote:

>> There are two things going on here:
>>
>> 1) You want to keep using struct page for now, while there are
>>    subsystems that require it. This is perfectly legitimate.
>>
>> 2) Matthew and Dan are changing over some subsystems to no longer
>>    require struct page. This is perfectly legitimate.
>>
> 
> How is this legitimate when you need to interface the [1] subsystems
> under the [2] subsystem? A subsystem that expects pages is now not
> usable by [2].
>
> Today *all* the Kernel subsystems are [1]. Period.

That's not true. In the graphics subsystem it is very normal to
mmap graphics memory without ever using a struct page. There are
other callers of remap_pfn_range() too.

In these cases, refcounting is done by keeping a refcount on the
entire object, not on individual pages (since we have none).

> How does it become
> legitimate to now start *two* competing abstractions that do the same thing
> differently in our kernel? We have too much diversity, not too little.

We are already able to refcount either the whole object, or an
individual page.

One issue is that not every subsystem can do the whole object
refcounting, and that it would be nice to have the refcounting
done by one single interface.

If we want the code to be the same everywhere, we could achieve
that just as well with an abstraction as with a single data
structure.

Maybe even something as simplistic as these, with the internals
automatically taking and releasing a refcount on the proper object:

get_reference(file, memory_address)

put_reference(file, memory_address)
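
One possible shape, purely as a sketch (every helper below is a made-up name):

	/* the internals decide which refcount the address maps to */
	void get_reference(struct file *file, unsigned long addr)
	{
		if (file_backed_by_struct_pages(file))			/* hypothetical predicate */
			get_page(lookup_backing_page(file, addr));	/* hypothetical lookup; per-4kB-page refcount */
		else
			get_whole_object(file);				/* hypothetical; one refcount for the whole mapping */
	}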

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2015-03-24 16:58 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
2015-03-16 20:25 ` [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page Dan Williams
2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
2015-03-16 23:05   ` Al Viro
2015-03-17 13:02     ` Matthew Wilcox
2015-03-17 15:53       ` Dan Williams
2015-03-16 20:25 ` [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
2015-03-18 11:21   ` [Linux-nvdimm] " Boaz Harrosh
2015-03-16 20:25 ` [RFC PATCH 4/7] scatterlist: use sg_phys() Dan Williams
2015-03-16 20:25 ` [RFC PATCH 5/7] scatterlist: support "page-less" (__pfn_t only) entries Dan Williams
2015-03-16 20:25 ` [RFC PATCH 6/7] x86: support dma_map_pfn() Dan Williams
2015-03-16 20:26 ` [RFC PATCH 7/7] block: base support for pfn i/o Dan Williams
2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
2015-03-18 13:06   ` Matthew Wilcox
2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
2015-03-20 15:56       ` Rik van Riel
2015-03-22 11:53         ` Boaz Harrosh
2015-03-18 15:35   ` Dan Williams
2015-03-18 20:26 ` Andrew Morton
2015-03-19 13:43   ` Matthew Wilcox
2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
2015-03-19 19:59       ` Andrew Morton
2015-03-19 20:59         ` Dan Williams
2015-03-22 17:22           ` Boaz Harrosh
2015-03-20 17:32         ` Wols Lists
2015-03-22 10:30         ` Boaz Harrosh
2015-03-19 18:17     ` Christoph Hellwig
2015-03-19 19:31       ` Matthew Wilcox
2015-03-22 16:46       ` Boaz Harrosh
2015-03-20 16:21     ` Rik van Riel
2015-03-20 20:31       ` Matthew Wilcox
2015-03-20 21:08         ` Rik van Riel
2015-03-22 17:06           ` Boaz Harrosh
2015-03-22 17:22             ` Dan Williams
2015-03-22 17:39               ` Boaz Harrosh
2015-03-20 21:17         ` Wols Lists
2015-03-22 16:24         ` Boaz Harrosh
2015-03-22 15:51       ` Boaz Harrosh
2015-03-23 15:19         ` Rik van Riel
2015-03-23 19:30           ` Christoph Hellwig
2015-03-24  9:41           ` Boaz Harrosh
2015-03-24 16:57             ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).