linux-pci.vger.kernel.org archive mirror
* [RFC PATCH 00/28] Removing struct page from P2PDMA
@ 2019-06-20 16:12 Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 01/28] block: Introduce DMA direct request type Logan Gunthorpe
                   ` (29 more replies)
  0 siblings, 30 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

For eons there has been a debate over whether or not to use
struct pages for peer-to-peer DMA transactions. Pro-pagers have
argued that struct pages are necessary for interacting with
existing code like scatterlists or the bio_vecs. Anti-pagers
assert that the tracking of the memory is unnecessary and
allocating the pages is a waste of memory. Both viewpoints are
valid; however, developers working on GPUs and RDMA tend to be
able to do away with struct pages relatively easily compared to
those wanting to work with NVMe devices through the block layer.
So it would be of great value to be able to universally do P2PDMA
transactions without the use of struct pages.

Previously, there have been multiple attempts[1][2] to replace
struct page usage with pfn_t, but this has been unpopular because
it creates dangerous edge cases where unsuspecting code might
run across pfn_t's it is not ready for.

Currently, we have P2PDMA using struct pages through the block layer
and the dangerous cases are avoided by using a queue flag that
indicates support for the special pages.

This RFC proposes a new solution: allow the block layer to take
DMA addresses directly for queues that indicate support. This will
provide a more general path for doing P2PDMA-like requests and will
allow us to remove the struct pages that back P2PDMA memory thus paving
the way to build a more uniform P2PDMA ecosystem.

This is a fairly long patch set but most of the patches are quite
small. Patches 1 through 18 introduce the concept of a dma_vec that
is similar to a bio_vec (except it takes dma_addr_t's instead of pages
and offsets) as well as a special dma-direct bio/request. Most of these
patches just prevent the new type of bio from being misused and
add support for splitting and mapping dma-direct bios in the same way
that struct page bios can be operated on. Patches 19 through 22 modify
the existing P2PDMA support in nvme-pci, ib-core and nvmet to use DMA
addresses directly. Patches 23 through 25 remove the P2PDMA-specific
code from nvme-pci, the block layer and ib-core. Finally, patches 26 through 28
remove the struct pages from the PCI P2PDMA code.
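
To make the intended usage concrete, here is a rough sketch (not part
of any patch in this series) of how a caller might build and submit a
dma-direct write once everything is in place. The caller owns the DMA
mapping, and the signature of bio_add_dma_addr() is only assumed here;
the real helper arrives in patch 18:

/*
 * Hypothetical sketch only: submit a dma-direct write to a queue that
 * advertises QUEUE_FLAG_DMA_DIRECT (patch 17). The block layer never
 * touches the data behind the DMA address.
 */
static int submit_dma_direct_write(struct block_device *bdev,
				   sector_t sector, dma_addr_t dma_addr,
				   unsigned int len)
{
	struct bio *bio;
	int ret;

	if (!blk_queue_dma_direct(bdev_get_queue(bdev)))
		return -EOPNOTSUPP;

	bio = bio_alloc(GFP_KERNEL, 1);
	if (!bio)
		return -ENOMEM;

	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_WRITE | REQ_DMA_DIRECT;
	bio->bi_iter.bi_sector = sector;

	/* assumed signature; see patch 18 for the actual helper */
	bio_add_dma_addr(bio, dma_addr, len);

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}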

This RFC is based on v5.2-rc5 and a git branch is available here:

https://github.com/sbates130272/linux-p2pmem.git dma_direct_rfc1

[1] https://lwn.net/Articles/647404/
[2] https://lore.kernel.org/lkml/1495662147-18277-1-git-send-email-logang@deltatee.com/

--

Logan Gunthorpe (28):
  block: Introduce DMA direct request type
  block: Add dma_vec structure
  block: Warn on mis-use of dma-direct bios
  block: Never bounce dma-direct bios
  block: Skip dma-direct bios in bio_integrity_prep()
  block: Support dma-direct bios in bio_advance_iter()
  block: Use dma_vec length in bio_cur_bytes() for dma-direct bios
  block: Introduce dmavec_phys_mergeable()
  block: Introduce vec_gap_to_prev()
  block: Create generic vec_split_segs() from bvec_split_segs()
  block: Create blk_segment_split_ctx
  block: Create helper for bvec_should_split()
  block: Generalize bvec_should_split()
  block: Support splitting dma-direct bios
  block: Support counting dma-direct bio segments
  block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg()
  block: Introduce queue flag to indicate support for dma-direct bios
  block: Introduce bio_add_dma_addr()
  nvme-pci: Support dma-direct bios
  IB/core: Introduce API for initializing a RW ctx from a DMA address
  nvmet: Split nvmet_bdev_execute_rw() into a helper function
  nvmet: Use DMA addresses instead of struct pages for P2P
  nvme-pci: Remove support for PCI_P2PDMA requests
  block: Remove PCI_P2PDMA queue flag
  IB/core: Remove P2PDMA mapping support in rdma_rw_ctx
  PCI/P2PDMA: Remove SGL helpers
  PCI/P2PDMA: Remove struct pages that back P2PDMA memory
  memremap: Remove PCI P2PDMA page memory type

 Documentation/driver-api/pci/p2pdma.rst |   9 +-
 block/bio-integrity.c                   |   4 +
 block/bio.c                             |  71 +++++++
 block/blk-core.c                        |   3 +
 block/blk-merge.c                       | 256 ++++++++++++++++++------
 block/blk.h                             |  49 ++++-
 block/bounce.c                          |   8 +
 drivers/infiniband/core/rw.c            |  85 ++++++--
 drivers/nvme/host/core.c                |   4 +-
 drivers/nvme/host/nvme.h                |   2 +-
 drivers/nvme/host/pci.c                 |  29 ++-
 drivers/nvme/target/core.c              |  12 +-
 drivers/nvme/target/io-cmd-bdev.c       |  82 +++++---
 drivers/nvme/target/nvmet.h             |   5 +-
 drivers/nvme/target/rdma.c              |  43 +++-
 drivers/pci/p2pdma.c                    | 202 +++----------------
 include/linux/bio.h                     |  32 ++-
 include/linux/blk_types.h               |  14 +-
 include/linux/blkdev.h                  |  16 +-
 include/linux/bvec.h                    |  43 ++++
 include/linux/memremap.h                |   5 -
 include/linux/mm.h                      |  13 --
 include/linux/pci-p2pdma.h              |  19 --
 include/rdma/rw.h                       |   6 +
 24 files changed, 648 insertions(+), 364 deletions(-)

--
2.20.1


* [RFC PATCH 01/28] block: Introduce DMA direct request type
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 02/28] block: Add dma_vec structure Logan Gunthorpe
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

A DMA direct request allows passing DMA addresses directly through
the block layer, instead of struct pages. This allows the calling
layer to take care of the mapping and unmapping and also creates
a path to doing peer-to-peer transactions without using struct pages.
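
As a purely hypothetical sketch (not part of this patch), a driver
that opts in would branch on the new helper when mapping a request;
foo_map_dma_addrs() and foo_map_pages() are made-up names standing in
for a driver's two mapping paths:

/* Hypothetical sketch: dma-direct requests already carry DMA
 * addresses, so the usual page-based mapping path must be skipped. */
static blk_status_t foo_map_request(struct request *rq)
{
	if (blk_rq_is_dma_direct(rq))
		return foo_map_dma_addrs(rq);	/* made-up helper */

	return foo_map_pages(rq);		/* made-up helper */
}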

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/blk_types.h |  9 ++++++++-
 include/linux/blkdev.h    | 10 ++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 95202f80676c..f3cabfdb6774 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -322,6 +322,7 @@ enum req_flag_bits {
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
 	__REQ_HIPRI,
+	__REQ_DMA_DIRECT,	/* DMA address direct request */
 
 	/* for driver use */
 	__REQ_DRV,
@@ -345,6 +346,7 @@ enum req_flag_bits {
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
+#define REQ_DMA_DIRECT		(1ULL << __REQ_DMA_DIRECT)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
@@ -353,7 +355,7 @@ enum req_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_DMA_DIRECT)
 
 enum stat_group {
 	STAT_READ,
@@ -412,6 +414,11 @@ static inline int op_stat_group(unsigned int op)
 	return op_is_write(op);
 }
 
+static inline int op_is_dma_direct(unsigned int op)
+{
+	return op & REQ_DMA_DIRECT;
+}
+
 typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 592669bcc536..ce70d5dded5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -271,6 +271,16 @@ static inline bool bio_is_passthrough(struct bio *bio)
 	return blk_op_is_scsi(op) || blk_op_is_private(op);
 }
 
+static inline bool bio_is_dma_direct(struct bio *bio)
+{
+	return op_is_dma_direct(bio->bi_opf);
+}
+
+static inline bool blk_rq_is_dma_direct(struct request *rq)
+{
+	return op_is_dma_direct(rq->cmd_flags);
+}
+
 static inline unsigned short req_get_ioprio(struct request *req)
 {
 	return req->ioprio;
-- 
2.20.1



* [RFC PATCH 02/28] block: Add dma_vec structure
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 01/28] block: Introduce DMA direct request type Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 03/28] block: Warn on mis-use of dma-direct bios Logan Gunthorpe
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

The dma_vec structure is similar to the bio_vec structure except
it stores a DMA address and length instead of a struct page
pointer and offset.

struct bios will be able to make use of dma_vecs through a union and,
therefore, we need to ensure that struct dma_vec is no larger
than struct bio_vec, as they will share the allocated memory.

dma_vecs can make the same use of the bvec_iter structure
to iterate through the vectors.

This will be used for passing DMA addresses directly through the block
layer. I expect something like struct dma_vec will also be used in
Christoph's work to improve the dma_mapping layer and remove sgls.
At some point, these would use the same structure.
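
As a rough sketch (not part of this patch), a dma-direct bio can then
be walked with bio_for_each_dvec() much like a normal bio is walked
with bio_for_each_bvec():

/* Hypothetical sketch: print the DMA address/length pairs of a
 * dma-direct bio using the iterator helpers added here. */
static void foo_dump_dvecs(struct bio *bio)
{
	struct bvec_iter iter;
	struct dma_vec dv;

	bio_for_each_dvec(dv, bio, iter)
		pr_info("dvec: addr %pad len %u\n", &dv.dv_addr, dv.dv_len);
}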

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/bio.h       | 12 +++++++++++
 include/linux/blk_types.h |  5 ++++-
 include/linux/bvec.h      | 43 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0f23b5682640..8180309123d7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -28,6 +28,8 @@
 
 #define bio_iter_iovec(bio, iter)				\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))
+#define bio_iter_dma_vec(bio, iter)				\
+	bvec_iter_dvec((bio)->bi_dma_vec, (iter))
 
 #define bio_iter_page(bio, iter)				\
 	bvec_iter_page((bio)->bi_io_vec, (iter))
@@ -39,6 +41,7 @@
 #define bio_page(bio)		bio_iter_page((bio), (bio)->bi_iter)
 #define bio_offset(bio)		bio_iter_offset((bio), (bio)->bi_iter)
 #define bio_iovec(bio)		bio_iter_iovec((bio), (bio)->bi_iter)
+#define bio_dma_vec(bio)	bio_iter_dma_vec((bio), (bio)->bi_iter)
 
 #define bio_multiple_segments(bio)				\
 	((bio)->bi_iter.bi_size != bio_iovec(bio).bv_len)
@@ -155,6 +158,15 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_bvec(bvl, bio, iter)			\
 	__bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_dvec(dvl, bio, iter, start)		\
+	for (iter = (start);						\
+	     (iter).bi_size &&						\
+		((dvl = bvec_iter_dvec((bio)->bi_dma_vec, (iter))), 1); \
+	     dvec_iter_advance((bio)->bi_dma_vec, &(iter), (dvl).dv_len))
+
+#define bio_for_each_dvec(dvl, bio, iter)			\
+	__bio_for_each_dvec(dvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index f3cabfdb6774..7f76ea73b77d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -191,7 +191,10 @@ struct bio {
 
 	atomic_t		__bi_cnt;	/* pin count */
 
-	struct bio_vec		*bi_io_vec;	/* the actual vec list */
+	union {
+		struct bio_vec	*bi_io_vec;	/* the actual vec list */
+		struct dma_vec	*bi_dma_vec;	/* for dma direct bios*/
+	};
 
 	struct bio_set		*bi_pool;
 
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index a032f01e928c..f680e96132ef 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -21,6 +21,11 @@ struct bio_vec {
 	unsigned int	bv_offset;
 };
 
+struct dma_vec {
+	dma_addr_t	dv_addr;
+	unsigned int	dv_len;
+};
+
 struct bvec_iter {
 	sector_t		bi_sector;	/* device address in 512 byte
 						   sectors */
@@ -84,6 +89,18 @@ struct bvec_iter_all {
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
 
+#define bvec_iter_dvec_addr(dvec, iter)	\
+	(__bvec_iter_bvec((dvec), (iter))->dv_addr + (iter).bi_bvec_done)
+#define bvec_iter_dvec_len(dvec, iter)	\
+	min((iter).bi_size,					\
+	    __bvec_iter_bvec((dvec), (iter))->dv_len - (iter).bi_bvec_done)
+
+#define bvec_iter_dvec(dvec, iter)				\
+((struct dma_vec) {						\
+	.dv_addr	= bvec_iter_dvec_addr((dvec), (iter)),	\
+	.dv_len		= bvec_iter_dvec_len((dvec), (iter)),	\
+})
+
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
 		struct bvec_iter *iter, unsigned bytes)
 {
@@ -110,6 +127,32 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline bool dvec_iter_advance(const struct dma_vec *dv,
+		struct bvec_iter *iter, unsigned bytes)
+{
+	if (WARN_ONCE(bytes > iter->bi_size,
+		      "Attempted to advance past end of dvec iter\n")) {
+		iter->bi_size = 0;
+		return false;
+	}
+
+	while (bytes) {
+		const struct dma_vec *cur = dv + iter->bi_idx;
+		unsigned len = min3(bytes, iter->bi_size,
+				    cur->dv_len - iter->bi_bvec_done);
+
+		bytes -= len;
+		iter->bi_size -= len;
+		iter->bi_bvec_done += len;
+
+		if (iter->bi_bvec_done == cur->dv_len) {
+			iter->bi_bvec_done = 0;
+			iter->bi_idx++;
+		}
+	}
+	return true;
+}
+
 #define for_each_bvec(bvl, bio_vec, iter, start)			\
 	for (iter = (start);						\
 	     (iter).bi_size &&						\
-- 
2.20.1



* [RFC PATCH 03/28] block: Warn on mis-use of dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 01/28] block: Introduce DMA direct request type Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 02/28] block: Add dma_vec structure Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 04/28] block: Never bounce " Logan Gunthorpe
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

This is a result of an audit of users of 'bi_io_vec'. A number of
warnings and blocking conditions are added to ensure dma-direct bios
do not incorrectly access 'bi_io_vec' when they should access
'bi_dma_vec'. These largely just protect against misuse in future
development, so depending on taste and public opinion some or all
of these checks may not be necessary.

A few other issues with dma-direct bios will be tackled in subsequent
patches.
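
For reference, the guard pattern repeated throughout the patch is
simply:

	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
		return;

i.e. any code path that walks 'bi_io_vec' refuses (loudly in most
cases, silently via unlikely() in the hotter paths) to operate on a
bio that actually carries a 'bi_dma_vec'.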

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c      | 33 +++++++++++++++++++++++++++++++++
 block/blk-core.c |  3 +++
 2 files changed, 36 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 683cbb40f051..6998fceddd36 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -525,6 +525,9 @@ void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 	struct bio_vec bv;
 	struct bvec_iter iter;
 
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return;
+
 	__bio_for_each_segment(bv, bio, iter, start) {
 		char *data = bvec_kmap_irq(&bv, &flags);
 		memset(data, 0, bv.bv_len);
@@ -707,6 +710,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	 */
 	if (unlikely(bio_flagged(bio, BIO_CLONED)))
 		return 0;
+	if (unlikely(bio_is_dma_direct(bio)))
+		return 0;
 
 	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
 		return 0;
@@ -783,6 +788,8 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 {
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return false;
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return false;
 
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
@@ -814,6 +821,7 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
 	WARN_ON_ONCE(bio_full(bio));
+	WARN_ON_ONCE(bio_is_dma_direct(bio));
 
 	bv->bv_page = page;
 	bv->bv_offset = off;
@@ -851,6 +859,8 @@ static void bio_get_pages(struct bio *bio)
 	struct bvec_iter_all iter_all;
 	struct bio_vec *bvec;
 
+	WARN_ON_ONCE(bio_is_dma_direct(bio));
+
 	bio_for_each_segment_all(bvec, bio, iter_all)
 		get_page(bvec->bv_page);
 }
@@ -956,6 +966,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	if (WARN_ON_ONCE(bio->bi_vcnt))
 		return -EINVAL;
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return -EINVAL;
 
 	do {
 		if (is_bvec)
@@ -1029,6 +1041,9 @@ void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 	void *src_p, *dst_p;
 	unsigned bytes;
 
+	if (WARN_ON_ONCE(bio_is_dma_direct(src) || bio_is_dma_direct(dst)))
+		return;
+
 	while (src_iter->bi_size && dst_iter->bi_size) {
 		src_bv = bio_iter_iovec(src, *src_iter);
 		dst_bv = bio_iter_iovec(dst, *dst_iter);
@@ -1143,6 +1158,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return -EINVAL;
+
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		ssize_t ret;
 
@@ -1174,6 +1192,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return -EINVAL;
+
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		ssize_t ret;
 
@@ -1197,6 +1218,9 @@ void bio_free_pages(struct bio *bio)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
+	if (WARN_ON_ONCE(bio_is_dma_direct(bio)))
+		return;
+
 	bio_for_each_segment_all(bvec, bio, iter_all)
 		__free_page(bvec->bv_page);
 }
@@ -1653,6 +1677,9 @@ void bio_set_pages_dirty(struct bio *bio)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
+	if (unlikely(bio_is_dma_direct(bio)))
+		return;
+
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (!PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
@@ -1704,6 +1731,9 @@ void bio_check_pages_dirty(struct bio *bio)
 	unsigned long flags;
 	struct bvec_iter_all iter_all;
 
+	if (unlikely(bio_is_dma_direct(bio)))
+		return;
+
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
 			goto defer;
@@ -1777,6 +1807,9 @@ void bio_flush_dcache_pages(struct bio *bi)
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 
+	if (unlikely(bio_is_dma_direct(bi)))
+		return;
+
 	bio_for_each_segment(bvec, bi, iter)
 		flush_dcache_page(bvec.bv_page);
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index 8340f69670d8..ea152d54c7ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1467,6 +1467,9 @@ void rq_flush_dcache_pages(struct request *rq)
 	struct req_iterator iter;
 	struct bio_vec bvec;
 
+	if (unlikely(blk_rq_is_dma_direct(rq)))
+		return;
+
 	rq_for_each_segment(bvec, rq, iter)
 		flush_dcache_page(bvec.bv_page);
 }
-- 
2.20.1



* [RFC PATCH 04/28] block: Never bounce dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 03/28] block: Warn on mis-use of dma-direct bios Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 17:23   ` Jason Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 05/28] block: Skip dma-direct bios in bio_integrity_prep() Logan Gunthorpe
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

It is expected that the creator of a dma-direct bio will ensure the
target device can access the DMA addresses it is creating bios for.
It's also not possible to bounce a dma-direct bio because the block
layer doesn't have any way to access the underlying data behind
the DMA address.

Thus, never bounce dma-direct bios.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bounce.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..17e020a40cca 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -367,6 +367,14 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
 	if (!bio_has_data(*bio_orig))
 		return;
 
+	/*
+	 * For DMA direct bios, upper layers are expected to ensure
+	 * the device in question can access the DMA addresses. So
+	 * it never makes sense to bounce a DMA direct bio.
+	 */
+	if (bio_is_dma_direct(*bio_orig))
+		return;
+
 	/*
 	 * for non-isa bounce case, just check if the bounce pfn is equal
 	 * to or bigger than the highest pfn in the system -- in that case,
-- 
2.20.1



* [RFC PATCH 05/28] block: Skip dma-direct bios in bio_integrity_prep()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 04/28] block: Never bounce " Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 06/28] block: Support dma-direct bios in bio_advance_iter() Logan Gunthorpe
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

The block layer will not be able to handle integrity for dma-direct
bios because it does not have access to the underlying data.

If users of dma-direct require integrity, they will have to handle it
in the layer creating the bios. This is left as future work should
somebody care about handling such a case.

Thus, bio_integrity_prep() should ignore dma-direct bios.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio-integrity.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 4db620849515..10fdf456fcd8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -221,6 +221,10 @@ bool bio_integrity_prep(struct bio *bio)
 	if (bio_integrity(bio))
 		return true;
 
+	/* The block layer cannot handle integrity for dma-direct bios */
+	if (bio_is_dma_direct(bio))
+		return true;
+
 	if (bio_data_dir(bio) == READ) {
 		if (!bi->profile->verify_fn ||
 		    !(bi->flags & BLK_INTEGRITY_VERIFY))
-- 
2.20.1



* [RFC PATCH 06/28] block: Support dma-direct bios in bio_advance_iter()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 05/28] block: Skip dma-direct bios in bio_integrity_prep() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 07/28] block: Use dma_vec length in bio_cur_bytes() for dma-direct bios Logan Gunthorpe
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Dma-direct bio iterators need to be advanced using the analogous
dvec_iter_advance() helper.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/bio.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8180309123d7..e212e5958a75 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -134,6 +134,8 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 
 	if (bio_no_advance_iter(bio))
 		iter->bi_size -= bytes;
+	else if (op_is_dma_direct(bio->bi_opf))
+		dvec_iter_advance(bio->bi_dma_vec, iter, bytes);
 	else
 		bvec_iter_advance(bio->bi_io_vec, iter, bytes);
 		/* TODO: It is reasonable to complete bio with error here. */
-- 
2.20.1



* [RFC PATCH 07/28] block: Use dma_vec length in bio_cur_bytes() for dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 06/28] block: Support dma-direct bios in bio_advance_iter() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 08/28] block: Introduce dmavec_phys_mergeable() Logan Gunthorpe
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

For dma-direct bios, use the dv_len of the current dma_vec
because the bio_vecs are not valid in this context.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/bio.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index e212e5958a75..df7973932525 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -91,10 +91,12 @@ static inline bool bio_mergeable(struct bio *bio)
 
 static inline unsigned int bio_cur_bytes(struct bio *bio)
 {
-	if (bio_has_data(bio))
-		return bio_iovec(bio).bv_len;
-	else /* dataless requests such as discard */
+	if (!bio_has_data(bio)) /* dataless requests such as discard */
 		return bio->bi_iter.bi_size;
+	else if (op_is_dma_direct(bio->bi_opf))
+		return bio_dma_vec(bio).dv_len;
+	else
+		return bio_iovec(bio).bv_len;
 }
 
 static inline void *bio_data(struct bio *bio)
-- 
2.20.1



* [RFC PATCH 08/28] block: Introduce dmavec_phys_mergeable()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 07/28] block: Use dma_vec length in bio_cur_bytes() for dma-direct bios Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 09/28] block: Introduce vec_gap_to_prev() Logan Gunthorpe
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Introduce a new helper which is an analog of biovec_phys_mergeable()
for dma-direct vectors.

This also provides a common helper vec_phys_mergeable() for use in
code that's general to both bio_vecs and dma_vecs.
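
For illustration (hypothetical, not from the patch), two dma_vecs are
mergeable when they are physically contiguous and stay within the
queue's segment boundary:

/* Hypothetical sketch: adjacent, contiguous dma_vecs that do not
 * cross queue_segment_boundary(q) collapse into a single segment. */
static bool foo_dvecs_mergeable(struct request_queue *q, dma_addr_t addr)
{
	struct dma_vec a = { .dv_addr = addr,       .dv_len = 512 };
	struct dma_vec b = { .dv_addr = addr + 512, .dv_len = 512 };

	return dmavec_phys_mergeable(q, &a, &b);
}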

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk.h | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 7814aa207153..4142383eed7a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -66,20 +66,36 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 	percpu_ref_get(&q->q_usage_counter);
 }
 
+static inline bool vec_phys_mergeable(struct request_queue *q,
+				      unsigned long addr1, unsigned int len1,
+				      unsigned long addr2, unsigned int len2)
+{
+	unsigned long mask = queue_segment_boundary(q);
+
+	if (addr1 + len1 != addr2)
+		return false;
+	if ((addr1 | mask) != ((addr2 + len2 - 1) | mask))
+		return false;
+	return true;
+}
+
 static inline bool biovec_phys_mergeable(struct request_queue *q,
 		struct bio_vec *vec1, struct bio_vec *vec2)
 {
-	unsigned long mask = queue_segment_boundary(q);
 	phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
 	phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
 
-	if (addr1 + vec1->bv_len != addr2)
-		return false;
 	if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page))
 		return false;
-	if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
-		return false;
-	return true;
+
+	return vec_phys_mergeable(q, addr1, vec1->bv_len, addr2, vec2->bv_len);
+}
+
+static inline bool dmavec_phys_mergeable(struct request_queue *q,
+		struct dma_vec *vec1, struct dma_vec *vec2)
+{
+	return vec_phys_mergeable(q, vec1->dv_addr, vec1->dv_len,
+				  vec2->dv_addr, vec2->dv_len);
 }
 
 static inline bool __bvec_gap_to_prev(struct request_queue *q,
-- 
2.20.1



* [RFC PATCH 09/28] block: Introduce vec_gap_to_prev()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 08/28] block: Introduce dmavec_phys_mergeable() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 10/28] block: Create generic vec_split_segs() from bvec_split_segs() Logan Gunthorpe
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Introduce vec_gap_to_prev() which is a more general
form of bvec_gap_to_prev().

In order to support splitting dma_vecs, we will need to do a similar
calculation using the DMA address and length.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 4142383eed7a..c5512fefe703 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -98,11 +98,19 @@ static inline bool dmavec_phys_mergeable(struct request_queue *q,
 				  vec2->dv_addr, vec2->dv_len);
 }
 
+static inline bool __vec_gap_to_prev(struct request_queue *q,
+		unsigned int prv_offset, unsigned int prv_len,
+		unsigned int nxt_offset)
+{
+	return (nxt_offset & queue_virt_boundary(q)) ||
+		((prv_offset + prv_len) & queue_virt_boundary(q));
+}
+
 static inline bool __bvec_gap_to_prev(struct request_queue *q,
 		struct bio_vec *bprv, unsigned int offset)
 {
-	return (offset & queue_virt_boundary(q)) ||
-		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
+	return __vec_gap_to_prev(q, bprv->bv_offset, bprv->bv_len,
+				 offset);
 }
 
 /*
@@ -117,6 +125,15 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
 	return __bvec_gap_to_prev(q, bprv, offset);
 }
 
+static inline bool vec_gap_to_prev(struct request_queue *q,
+		unsigned int prv_offset, unsigned int prv_len,
+		unsigned int nxt_offset)
+{
+	if (!queue_virt_boundary(q))
+		return false;
+	return __vec_gap_to_prev(q, prv_offset, prv_len, nxt_offset);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
-- 
2.20.1



* [RFC PATCH 10/28] block: Create generic vec_split_segs() from bvec_split_segs()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 09/28] block: Introduce vec_gap_to_prev() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 11/28] block: Create blk_segment_split_ctx Logan Gunthorpe
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

bvec_split_segs() only requires the address and length of the
vector. In order to generalize it to work with dma_vecs, we just
take the address and length directly instead of the bio_vec.

The function is renamed to vec_split_segs() and a helper is added
to avoid having to adjust the existing callsites.

Note: the new bvec_split_segs() helper will be removed in a subsequent
patch.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 17713d7d98d5..3581c7ac3c1b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,13 +158,13 @@ static unsigned get_max_segment_size(struct request_queue *q,
 }
 
 /*
- * Split the bvec @bv into segments, and update all kinds of
- * variables.
+ * Split an address/offset and length into segments, and
+ * update all kinds of variables.
  */
-static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
-		unsigned *nsegs, unsigned *sectors, unsigned max_segs)
+static bool vec_split_segs(struct request_queue *q, unsigned offset,
+			   unsigned len, unsigned *nsegs, unsigned *sectors,
+			   unsigned max_segs)
 {
-	unsigned len = bv->bv_len;
 	unsigned total_len = 0;
 	unsigned new_nsegs = 0, seg_size = 0;
 
@@ -173,14 +173,14 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
 	 * current bvec has to be splitted as multiple segments.
 	 */
 	while (len && new_nsegs + *nsegs < max_segs) {
-		seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+		seg_size = get_max_segment_size(q, offset + total_len);
 		seg_size = min(seg_size, len);
 
 		new_nsegs++;
 		total_len += seg_size;
 		len -= seg_size;
 
-		if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+		if ((offset + total_len) & queue_virt_boundary(q))
 			break;
 	}
 
@@ -194,6 +194,13 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
 	return !!len;
 }
 
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+		unsigned *nsegs, unsigned *sectors, unsigned max_segs)
+{
+	return vec_split_segs(q, bv->bv_offset, bv->bv_len, nsegs,
+			      sectors, max_segs);
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
-- 
2.20.1



* [RFC PATCH 11/28] block: Create blk_segment_split_ctx
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 10/28] block: Create generic vec_split_segs() from bvec_split_segs() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 12/28] block: Create helper for bvec_should_split() Logan Gunthorpe
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

In order to support dma-direct bios, blk_bio_segment_split() will
need to operate on both bio_vecs and dma_vecs. To do this, the
code inside bio_for_each_bvec() needs to be moved into a generic
helper. Step one is to put some of the variables used inside the
loop into a context structure so we don't need to pass a dozen
arguments to the new function.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 55 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3581c7ac3c1b..414e61a714bf 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -201,63 +201,78 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
 			      sectors, max_segs);
 }
 
+struct blk_segment_split_ctx {
+	unsigned nsegs;
+	unsigned sectors;
+
+	bool prv_valid;
+	struct bio_vec bvprv;
+
+	const unsigned max_sectors;
+	const unsigned max_segs;
+};
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
 					 unsigned *segs)
 {
-	struct bio_vec bv, bvprv, *bvprvp = NULL;
+	struct bio_vec bv;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, sectors = 0;
 	bool do_split = true;
 	struct bio *new = NULL;
-	const unsigned max_sectors = get_max_io_size(q, bio);
-	const unsigned max_segs = queue_max_segments(q);
+
+	struct blk_segment_split_ctx ctx = {
+		.max_sectors = get_max_io_size(q, bio),
+		.max_segs = queue_max_segments(q),
+	};
 
 	bio_for_each_bvec(bv, bio, iter) {
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
+		if (ctx.prv_valid && bvec_gap_to_prev(q, &ctx.bvprv,
+						      bv.bv_offset))
 			goto split;
 
-		if (sectors + (bv.bv_len >> 9) > max_sectors) {
+		if (ctx.sectors + (bv.bv_len >> 9) > ctx.max_sectors) {
 			/*
 			 * Consider this a new segment if we're splitting in
 			 * the middle of this vector.
 			 */
-			if (nsegs < max_segs &&
-			    sectors < max_sectors) {
+			if (ctx.nsegs < ctx.max_segs &&
+			    ctx.sectors < ctx.max_sectors) {
 				/* split in the middle of bvec */
-				bv.bv_len = (max_sectors - sectors) << 9;
-				bvec_split_segs(q, &bv, &nsegs,
-						&sectors, max_segs);
+				bv.bv_len =
+					(ctx.max_sectors - ctx.sectors) << 9;
+				bvec_split_segs(q, &bv, &ctx.nsegs,
+						&ctx.sectors, ctx.max_segs);
 			}
 			goto split;
 		}
 
-		if (nsegs == max_segs)
+		if (ctx.nsegs == ctx.max_segs)
 			goto split;
 
-		bvprv = bv;
-		bvprvp = &bvprv;
+		ctx.bvprv = bv;
+		ctx.prv_valid = true;
 
 		if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
-			nsegs++;
-			sectors += bv.bv_len >> 9;
-		} else if (bvec_split_segs(q, &bv, &nsegs, &sectors,
-				max_segs)) {
+			ctx.nsegs++;
+			ctx.sectors += bv.bv_len >> 9;
+		} else if (bvec_split_segs(q, &bv, &ctx.nsegs, &ctx.sectors,
+				ctx.max_segs)) {
 			goto split;
 		}
 	}
 
 	do_split = false;
 split:
-	*segs = nsegs;
+	*segs = ctx.nsegs;
 
 	if (do_split) {
-		new = bio_split(bio, sectors, GFP_NOIO, bs);
+		new = bio_split(bio, ctx.sectors, GFP_NOIO, bs);
 		if (new)
 			bio = new;
 	}
-- 
2.20.1



* [RFC PATCH 12/28] block: Create helper for bvec_should_split()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (10 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 11/28] block: Create blk_segment_split_ctx Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 13/28] block: Generalize bvec_should_split() Logan Gunthorpe
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

In order to support dma-direct bios, blk_bio_segment_split() will
need to operate on both bio_vecs and dma_vecs. To do this, the code
inside bio_for_each_bvec() is moved into a generic helper called
bvec_should_split().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 86 +++++++++++++++++++++++++----------------------
 1 file changed, 46 insertions(+), 40 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 414e61a714bf..d9e89c0ad40d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -212,6 +212,48 @@ struct blk_segment_split_ctx {
 	const unsigned max_segs;
 };
 
+static bool bvec_should_split(struct request_queue *q, struct bio_vec *bv,
+			      struct blk_segment_split_ctx *ctx)
+{
+	/*
+	 * If the queue doesn't support SG gaps and adding this
+	 * offset would create a gap, disallow it.
+	 */
+	if (ctx->prv_valid && bvec_gap_to_prev(q, &ctx->bvprv, bv->bv_offset))
+		return true;
+
+	if (ctx->sectors + (bv->bv_len >> 9) > ctx->max_sectors) {
+		/*
+		 * Consider this a new segment if we're splitting in
+		 * the middle of this vector.
+		 */
+		if (ctx->nsegs < ctx->max_segs &&
+		    ctx->sectors < ctx->max_sectors) {
+			/* split in the middle of bvec */
+			bv->bv_len = (ctx->max_sectors - ctx->sectors) << 9;
+			bvec_split_segs(q, bv, &ctx->nsegs,
+					&ctx->sectors, ctx->max_segs);
+		}
+		return true;
+	}
+
+	if (ctx->nsegs == ctx->max_segs)
+		return true;
+
+	ctx->bvprv = *bv;
+	ctx->prv_valid = true;
+
+	if (bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
+		ctx->nsegs++;
+		ctx->sectors += bv->bv_len >> 9;
+	} else if (bvec_split_segs(q, bv, &ctx->nsegs, &ctx->sectors,
+				   ctx->max_segs)) {
+		return true;
+	}
+
+	return false;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
@@ -219,7 +261,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 {
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	bool do_split = true;
+	bool do_split = false;
 	struct bio *new = NULL;
 
 	struct blk_segment_split_ctx ctx = {
@@ -228,47 +270,11 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	};
 
 	bio_for_each_bvec(bv, bio, iter) {
-		/*
-		 * If the queue doesn't support SG gaps and adding this
-		 * offset would create a gap, disallow it.
-		 */
-		if (ctx.prv_valid && bvec_gap_to_prev(q, &ctx.bvprv,
-						      bv.bv_offset))
-			goto split;
-
-		if (ctx.sectors + (bv.bv_len >> 9) > ctx.max_sectors) {
-			/*
-			 * Consider this a new segment if we're splitting in
-			 * the middle of this vector.
-			 */
-			if (ctx.nsegs < ctx.max_segs &&
-			    ctx.sectors < ctx.max_sectors) {
-				/* split in the middle of bvec */
-				bv.bv_len =
-					(ctx.max_sectors - ctx.sectors) << 9;
-				bvec_split_segs(q, &bv, &ctx.nsegs,
-						&ctx.sectors, ctx.max_segs);
-			}
-			goto split;
-		}
-
-		if (ctx.nsegs == ctx.max_segs)
-			goto split;
-
-		ctx.bvprv = bv;
-		ctx.prv_valid = true;
-
-		if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
-			ctx.nsegs++;
-			ctx.sectors += bv.bv_len >> 9;
-		} else if (bvec_split_segs(q, &bv, &ctx.nsegs, &ctx.sectors,
-				ctx.max_segs)) {
-			goto split;
-		}
+		do_split = bvec_should_split(q, &bv, &ctx);
+		if (do_split)
+			break;
 	}
 
-	do_split = false;
-split:
 	*segs = ctx.nsegs;
 
 	if (do_split) {
-- 
2.20.1



* [RFC PATCH 13/28] block: Generalize bvec_should_split()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 12/28] block: Create helper for bvec_should_split() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 14/28] block: Support splitting dma-direct bios Logan Gunthorpe
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

bvec_should_split() will also need to operate on dma_vecs, so
generalize it to take an offset and length instead of a bio_vec.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d9e89c0ad40d..32653fca53ce 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -206,23 +206,25 @@ struct blk_segment_split_ctx {
 	unsigned sectors;
 
 	bool prv_valid;
-	struct bio_vec bvprv;
+	unsigned prv_offset;
+	unsigned prv_len;
 
 	const unsigned max_sectors;
 	const unsigned max_segs;
 };
 
-static bool bvec_should_split(struct request_queue *q, struct bio_vec *bv,
-			      struct blk_segment_split_ctx *ctx)
+static bool vec_should_split(struct request_queue *q, unsigned offset,
+			     unsigned len, struct blk_segment_split_ctx *ctx)
 {
 	/*
 	 * If the queue doesn't support SG gaps and adding this
 	 * offset would create a gap, disallow it.
 	 */
-	if (ctx->prv_valid && bvec_gap_to_prev(q, &ctx->bvprv, bv->bv_offset))
+	if (ctx->prv_valid &&
+	    vec_gap_to_prev(q, ctx->prv_offset, ctx->prv_len, offset))
 		return true;
 
-	if (ctx->sectors + (bv->bv_len >> 9) > ctx->max_sectors) {
+	if (ctx->sectors + (len >> 9) > ctx->max_sectors) {
 		/*
 		 * Consider this a new segment if we're splitting in
 		 * the middle of this vector.
@@ -230,9 +232,9 @@ static bool bvec_should_split(struct request_queue *q, struct bio_vec *bv,
 		if (ctx->nsegs < ctx->max_segs &&
 		    ctx->sectors < ctx->max_sectors) {
 			/* split in the middle of bvec */
-			bv->bv_len = (ctx->max_sectors - ctx->sectors) << 9;
-			bvec_split_segs(q, bv, &ctx->nsegs,
-					&ctx->sectors, ctx->max_segs);
+			len = (ctx->max_sectors - ctx->sectors) << 9;
+			vec_split_segs(q, offset, len, &ctx->nsegs,
+				       &ctx->sectors, ctx->max_segs);
 		}
 		return true;
 	}
@@ -240,14 +242,15 @@ static bool bvec_should_split(struct request_queue *q, struct bio_vec *bv,
 	if (ctx->nsegs == ctx->max_segs)
 		return true;
 
-	ctx->bvprv = *bv;
+	ctx->prv_offset = offset;
+	ctx->prv_len = len;
 	ctx->prv_valid = true;
 
-	if (bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
+	if (offset + len <= PAGE_SIZE) {
 		ctx->nsegs++;
-		ctx->sectors += bv->bv_len >> 9;
-	} else if (bvec_split_segs(q, bv, &ctx->nsegs, &ctx->sectors,
-				   ctx->max_segs)) {
+		ctx->sectors += len >> 9;
+	} else if (vec_split_segs(q, offset, len, &ctx->nsegs, &ctx->sectors,
+				  ctx->max_segs)) {
 		return true;
 	}
 
@@ -270,7 +273,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	};
 
 	bio_for_each_bvec(bv, bio, iter) {
-		do_split = bvec_should_split(q, &bv, &ctx);
+		do_split = vec_should_split(q, bv.bv_offset, bv.bv_len, &ctx);
 		if (do_split)
 			break;
 	}
-- 
2.20.1



* [RFC PATCH 14/28] block: Support splitting dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (12 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 13/28] block: Generalize bvec_should_split() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 15/28] block: Support counting dma-direct bio segments Logan Gunthorpe
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

If the bio is a dma-direct bio, loop through the dma_vecs instead
of the bio_vecs when calling vec_should_split().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 32653fca53ce..c4c016f994f6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -257,14 +257,44 @@ static bool vec_should_split(struct request_queue *q, unsigned offset,
 	return false;
 }
 
+static bool bio_should_split(struct request_queue *q, struct bio *bio,
+			     struct blk_segment_split_ctx *ctx)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	bool ret;
+
+	bio_for_each_bvec(bv, bio, iter) {
+		ret = vec_should_split(q, bv.bv_offset, bv.bv_len, ctx);
+		if (ret)
+			return true;
+	}
+
+	return false;
+}
+
+static bool bio_dma_should_split(struct request_queue *q, struct bio *bio,
+				 struct blk_segment_split_ctx *ctx)
+{
+	struct bvec_iter iter;
+	struct dma_vec dv;
+	bool ret;
+
+	bio_for_each_dvec(dv, bio, iter) {
+		ret = vec_should_split(q, dv.dv_addr, dv.dv_len, ctx);
+		if (ret)
+			return true;
+	}
+
+	return false;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
 					 unsigned *segs)
 {
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	bool do_split = false;
+	bool do_split;
 	struct bio *new = NULL;
 
 	struct blk_segment_split_ctx ctx = {
@@ -272,11 +302,10 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		.max_segs = queue_max_segments(q),
 	};
 
-	bio_for_each_bvec(bv, bio, iter) {
-		do_split = vec_should_split(q, bv.bv_offset, bv.bv_len, &ctx);
-		if (do_split)
-			break;
-	}
+	if (bio_is_dma_direct(bio))
+		do_split = bio_dma_should_split(q, bio, &ctx);
+	else
+		do_split = bio_should_split(q, bio, &ctx);
 
 	*segs = ctx.nsegs;
 
-- 
2.20.1



* [RFC PATCH 15/28] block: Support counting dma-direct bio segments
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (13 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 14/28] block: Support splitting dma-direct bios Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 16/28] block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg() Logan Gunthorpe
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Change __blk_recalc_rq_segments() to loop through dma_vecs when
appropriate. It calls vec_split_segs() for each dma_vec or bio_vec.

Once this is done the bvec_split_segs() helper is no longer used.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c4c016f994f6..a7a5453987f9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -194,13 +194,6 @@ static bool vec_split_segs(struct request_queue *q, unsigned offset,
 	return !!len;
 }
 
-static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
-		unsigned *nsegs, unsigned *sectors, unsigned max_segs)
-{
-	return vec_split_segs(q, bv->bv_offset, bv->bv_len, nsegs,
-			      sectors, max_segs);
-}
-
 struct blk_segment_split_ctx {
 	unsigned nsegs;
 	unsigned sectors;
@@ -366,12 +359,36 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 }
 EXPORT_SYMBOL(blk_queue_split);
 
+static unsigned int bio_calc_segs(struct request_queue *q, struct bio *bio)
+{
+	unsigned int nsegs = 0;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	bio_for_each_bvec(bv, bio, iter)
+		vec_split_segs(q, bv.bv_offset, bv.bv_len, &nsegs,
+			       NULL, UINT_MAX);
+
+	return nsegs;
+}
+
+static unsigned int bio_dma_calc_segs(struct request_queue *q, struct bio *bio)
+{
+	unsigned int nsegs = 0;
+	struct bvec_iter iter;
+	struct dma_vec dv;
+
+	bio_for_each_dvec(dv, bio, iter)
+		vec_split_segs(q, dv.dv_addr, dv.dv_len, &nsegs,
+			       NULL, UINT_MAX);
+
+	return nsegs;
+}
+
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio)
 {
 	unsigned int nr_phys_segs = 0;
-	struct bvec_iter iter;
-	struct bio_vec bv;
 
 	if (!bio)
 		return 0;
@@ -386,8 +403,10 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	}
 
 	for_each_bio(bio) {
-		bio_for_each_bvec(bv, bio, iter)
-			bvec_split_segs(q, &bv, &nr_phys_segs, NULL, UINT_MAX);
+		if (bio_is_dma_direct(bio))
+			nr_phys_segs += bio_dma_calc_segs(q, bio);
+		else
+			nr_phys_segs += bio_calc_segs(q, bio);
 	}
 
 	return nr_phys_segs;
-- 
2.20.1



* [RFC PATCH 16/28] block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (14 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 15/28] block: Support counting dma-direct bio segments Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 17/28] block: Introduce queue flag to indicate support for dma-direct bios Logan Gunthorpe
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

blk_rq_map_sg() just needs to copy each dma_vec into the dma_address
of the sgl. Callers will need to ensure they do not call dma_map_sg()
on dma-direct requests.

This will likely get less ugly with Christoph's proposed cleanup of
the DMA API. It will be much simpler if devices just call a
dma_map_bio() and don't have to worry about dma-direct requests.
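
As a hypothetical sketch of the caller side (not part of the patch),
the sgl is filled by blk_rq_map_sg() either way, but dma_map_sg() must
be skipped when the request already carries DMA addresses:

/* Hypothetical driver sketch: map a request to an sgl; only
 * page-backed requests still need dma_map_sg(). */
static int foo_map_sgl(struct device *dev, struct request_queue *q,
		       struct request *rq, struct scatterlist *sgl)
{
	int nents = blk_rq_map_sg(q, rq, sgl);

	if (blk_rq_is_dma_direct(rq))
		return nents;	/* sgl already holds DMA addresses */

	return dma_map_sg(dev, sgl, nents,
			  rq_data_dir(rq) == WRITE ?
				DMA_TO_DEVICE : DMA_FROM_DEVICE);
}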

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-merge.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index a7a5453987f9..ccd6c44b9f6e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -545,6 +545,69 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 	return nsegs;
 }
 
+static unsigned blk_dvec_map_sg(struct request_queue *q,
+		struct dma_vec *dvec, struct scatterlist *sglist,
+		struct scatterlist **sg)
+{
+	unsigned nbytes = dvec->dv_len;
+	unsigned nsegs = 0, total = 0;
+
+	while (nbytes > 0) {
+		unsigned seg_size;
+
+		*sg = blk_next_sg(sg, sglist);
+
+		seg_size = get_max_segment_size(q, total);
+		seg_size = min(nbytes, seg_size);
+
+		(*sg)->dma_address = dvec->dv_addr + total;
+		sg_dma_len(*sg) = seg_size;
+
+		total += seg_size;
+		nbytes -= seg_size;
+		nsegs++;
+	}
+
+	return nsegs;
+}
+
+static inline void
+__blk_segment_dma_map_sg(struct request_queue *q, struct dma_vec *dvec,
+			 struct scatterlist *sglist, struct dma_vec *dvprv,
+			 struct scatterlist **sg, int *nsegs)
+{
+	int nbytes = dvec->dv_len;
+
+	if (*sg) {
+		if ((*sg)->length + nbytes > queue_max_segment_size(q))
+			goto new_segment;
+		if (!dmavec_phys_mergeable(q, dvprv, dvec))
+			goto new_segment;
+
+		(*sg)->length += nbytes;
+	} else {
+new_segment:
+		(*nsegs) += blk_dvec_map_sg(q, dvec, sglist, sg);
+	}
+	*dvprv = *dvec;
+}
+
+static int __blk_dma_bios_map_sg(struct request_queue *q, struct bio *bio,
+				 struct scatterlist *sglist,
+				 struct scatterlist **sg)
+{
+	struct dma_vec dvec, dvprv = {};
+	struct bvec_iter iter;
+	int nsegs = 0;
+
+	for_each_bio(bio)
+		bio_for_each_dvec(dvec, bio, iter)
+			__blk_segment_dma_map_sg(q, &dvec, sglist, &dvprv,
+						 sg, &nsegs);
+
+	return nsegs;
+}
+
 /*
  * map a request to scatterlist, return number of sg entries setup. Caller
  * must make sure sg can hold rq->nr_phys_segments entries
@@ -559,6 +622,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, &sg);
 	else if (rq->bio && bio_op(rq->bio) == REQ_OP_WRITE_SAME)
 		nsegs = __blk_bvec_map_sg(bio_iovec(rq->bio), sglist, &sg);
+	else if (blk_rq_is_dma_direct(rq))
+		nsegs = __blk_dma_bios_map_sg(q, rq->bio, sglist, &sg);
 	else if (rq->bio)
 		nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 17/28] block: Introduce queue flag to indicate support for dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (15 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 16/28] block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 18/28] block: Introduce bio_add_dma_addr() Logan Gunthorpe
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Queues will need to advertise support to accept dma-direct requests.

The existing PCI P2PDMA support will be replaced by this, so the
PCI_P2PDMA queue flag will be dropped in a subsequent patch.
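
For example, a driver that can consume dma-direct bios would set the
flag at queue setup time (as nvme-pci does later in this series):

    blk_queue_flag_set(QUEUE_FLAG_DMA_DIRECT, q);

and a submitter would test blk_queue_dma_direct(q) before attempting to
build such a bio for that queue.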

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/blkdev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ce70d5dded5f..a5b856324276 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -616,6 +616,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 23	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED	24	/* queue has been quiesced */
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
+#define QUEUE_FLAG_DMA_DIRECT	26	/* device supports dma-addr requests */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -642,6 +643,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
 #define blk_queue_pci_p2pdma(q)	\
 	test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
+#define blk_queue_dma_direct(q)	\
+	test_bit(QUEUE_FLAG_DMA_DIRECT, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 18/28] block: Introduce bio_add_dma_addr()
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (16 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 17/28] block: Introduce queue flag to indicate support for dma-direct bios Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 19/28] nvme-pci: Support dma-direct bios Logan Gunthorpe
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

bio_add_dma_addr() is analogous to bio_add_page() except that it adds
a dma address to a dma-direct bio instead of a struct page.

It also checks that the queue supports dma address bios and that we
are not mixing dma addresses and struct pages in the same bio.
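
A rough sketch of how a submitter might use it (q, bdev, sector,
dma_addr and len are placeholders; REQ_DMA_DIRECT comes from earlier
in this series and must be set before adding dma addresses):

    bio = bio_alloc(GFP_KERNEL, 1);
    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_DMA_DIRECT);

    if (bio_add_dma_addr(q, bio, dma_addr, len) != len) {
            bio_put(bio);
            return -EINVAL;
    }

    submit_bio(bio);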

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c         | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/bio.h | 10 ++++++++++
 2 files changed, 48 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 6998fceddd36..02ae72e3ccfa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -874,6 +874,44 @@ static void bio_release_pages(struct bio *bio)
 		put_page(bvec->bv_page);
 }
 
+/**
+ *	bio_add_dma_addr - attempt to add a dma address to a bio
+ *	@q: the target queue
+ *	@bio: destination bio
+ *	@dma_addr: dma address to add
+ *	@len: vec entry length
+ *
+ *	Attempt to add a dma address to the dma_vec maplist. This can
+ *	fail for a number of reasons, such as the bio being full or
+ *	target block device limitations. The target request queue must
+ *	support dma-only bios and bios cannot mix pages and dma addresses.
+ */
+int bio_add_dma_addr(struct request_queue *q, struct bio *bio,
+		     dma_addr_t dma_addr, unsigned int len)
+{
+	struct dma_vec *dv = &bio->bi_dma_vec[bio->bi_vcnt];
+
+	if (!blk_queue_dma_direct(q))
+		return -EINVAL;
+
+	if (!bio_is_dma_direct(bio))
+		return -EINVAL;
+
+	if (bio_dma_full(bio))
+		return 0;
+
+	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+
+	dv->dv_addr = dma_addr;
+	dv->dv_len = len;
+
+	bio->bi_iter.bi_size += len;
+	bio->bi_vcnt++;
+
+	return len;
+}
+EXPORT_SYMBOL_GPL(bio_add_dma_addr);
+
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
 	const struct bio_vec *bv = iter->bvec;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index df7973932525..d775f381ae00 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -112,6 +112,13 @@ static inline bool bio_full(struct bio *bio)
 	return bio->bi_vcnt >= bio->bi_max_vecs;
 }
 
+static inline bool bio_dma_full(struct bio *bio)
+{
+	size_t vec_size = bio->bi_max_vecs * sizeof(struct bio_vec);
+
+	return bio->bi_vcnt >= (vec_size / sizeof(struct dma_vec));
+}
+
 static inline bool bio_next_segment(const struct bio *bio,
 				    struct bvec_iter_all *iter)
 {
@@ -438,6 +445,9 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+extern int bio_add_dma_addr(struct request_queue *q, struct bio *bio,
+			    dma_addr_t dma_addr, unsigned int len);
+
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 19/28] nvme-pci: Support dma-direct bios
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (17 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 18/28] block: Introduce bio_add_dma_addr() Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address Logan Gunthorpe
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Adding support for dma-direct bios only requires putting a condition
around the call to dma_map_sg() so that it is skipped when the request
has the REQ_DMA_DIRECT flag.

We then need to indicate support on the queue in much the same way we
did with PCI P2PDMA. Since this provides the same support as PCI
P2PDMA, those flags will be removed in a subsequent patch.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  2 ++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 10 +++++++---
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 120fb593d1da..8e876417c44b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3259,6 +3259,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+	if (ctrl->ops->flags & NVME_F_DMA_DIRECT)
+		blk_queue_flag_set(QUEUE_FLAG_DMA_DIRECT, ns->queue);
 
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 55553d293a98..f1dddc95c6a8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -362,6 +362,7 @@ struct nvme_ctrl_ops {
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
 #define NVME_F_PCI_P2PDMA		(1 << 2)
+#define NVME_F_DMA_DIRECT		(1 << 3)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 524d6bd6d095..5957f3a4f261 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -565,7 +565,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 	WARN_ON_ONCE(!iod->nents);
 
 	/* P2PDMA requests do not need to be unmapped */
-	if (!is_pci_p2pdma_page(sg_page(iod->sg)))
+	if (!is_pci_p2pdma_page(sg_page(iod->sg)) &&
+	    !blk_rq_is_dma_direct(req))
 		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
 
 
@@ -824,7 +825,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int nr_mapped;
 
-	if (blk_rq_nr_phys_segments(req) == 1) {
+	if (blk_rq_nr_phys_segments(req) == 1 && !blk_rq_is_dma_direct(req)) {
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -851,6 +852,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (is_pci_p2pdma_page(sg_page(iod->sg)))
 		nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
 					      rq_dma_dir(req));
+	else if (blk_rq_is_dma_direct(req))
+		nr_mapped = iod->nents;
 	else
 		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
 					     rq_dma_dir(req), DMA_ATTR_NO_WARN);
@@ -2639,7 +2642,8 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
 	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+				  NVME_F_PCI_P2PDMA |
+				  NVME_F_DMA_DIRECT,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (18 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 19/28] nvme-pci: Support dma-direct bios Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:49   ` Jason Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 21/28] nvmet: Split nvmet_bdev_execute_rw() into a helper function Logan Gunthorpe
                   ` (9 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Introduce rdma_rw_ctx_dma_init() and rdma_rw_ctx_dma_destroy(), which
perform the same operations as rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() respectively, except that they operate on a DMA
address and length instead of an SGL.

This will be used for struct page-less P2PDMA, but opinions have also
been expressed in favour of migrating away from SGLs and struct pages
in the RDMA APIs, and this will likely fit with that effort.
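
A rough usage sketch (mirroring how nvmet-rdma uses it later in this
series; ctx, qp, port_num, dma_addr, len, remote_addr, rkey and dir
are placeholders):

    ret = rdma_rw_ctx_dma_init(&ctx, qp, port_num, dma_addr, len,
                               remote_addr, rkey, dir);
    if (ret < 0)
            return ret;
    n_rdma = ret;   /* number of WQEs needed */

    /* ... post with rdma_rw_ctx_post() and wait for completion ... */

    rdma_rw_ctx_dma_destroy(&ctx, qp, port_num);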

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c | 74 ++++++++++++++++++++++++++++++------
 include/rdma/rw.h            |  6 +++
 2 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 32ca8429eaae..cefa6b930bc8 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -319,6 +319,39 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
 
+/**
+ * rdma_rw_ctx_dma_init - initialize a RDMA READ/WRITE context from a
+ *	DMA address instead of SGL
+ * @ctx:	context to initialize
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @addr:	DMA address to READ/WRITE from/to
+ * @len:	length of memory to operate on
+ * @remote_addr:remote address to read/write (relative to @rkey)
+ * @rkey:	remote key to operate on
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * Returns the number of WQEs that will be needed on the workqueue if
+ * successful, or a negative error code.
+ */
+int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
+		u32 rkey, enum dma_data_direction dir)
+{
+	struct scatterlist sg;
+
+	sg_dma_address(&sg) = addr;
+	sg_dma_len(&sg) = len;
+
+	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, 1))
+		return rdma_rw_init_mr_wrs(ctx, qp, port_num, &sg, 1, 0,
+					   remote_addr, rkey, dir);
+	else
+		return rdma_rw_init_single_wr(ctx, qp, &sg, 0, remote_addr,
+					      rkey, dir);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_dma_init);
+
 /**
  * rdma_rw_ctx_signature_init - initialize a RW context with signature offload
  * @ctx:	context to initialize
@@ -566,17 +599,7 @@ int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_post);
 
-/**
- * rdma_rw_ctx_destroy - release all resources allocated by rdma_rw_ctx_init
- * @ctx:	context to release
- * @qp:		queue pair to operate on
- * @port_num:	port num to which the connection is bound
- * @sg:		scatterlist that was used for the READ/WRITE
- * @sg_cnt:	number of entries in @sg
- * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
- */
-void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
-		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+static void __rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp)
 {
 	int i;
 
@@ -596,6 +619,21 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		BUG();
 		break;
 	}
+}
+
+/**
+ * rdma_rw_ctx_destroy - release all resources allocated by rdma_rw_ctx_init
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @sg:		scatterlist that was used for the READ/WRITE
+ * @sg_cnt:	number of entries in @sg
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ */
+void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+{
+	__rdma_rw_ctx_destroy(ctx, qp);
 
 	/* P2PDMA contexts do not need to be unmapped */
 	if (!is_pci_p2pdma_page(sg_page(sg)))
@@ -603,6 +641,20 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
+/**
+ * rdma_rw_ctx_dma_destroy - release all resources allocated by
+ *	rdma_rw_ctx_dma_init
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ */
+void rdma_rw_ctx_dma_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+			     u8 port_num)
+{
+	__rdma_rw_ctx_destroy(ctx, qp);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_dma_destroy);
+
 /**
  * rdma_rw_ctx_destroy_signature - release all resources allocated by
  *	rdma_rw_ctx_init_signature
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index 494f79ca3e62..e47f8053af6e 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -58,6 +58,12 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt,
 		enum dma_data_direction dir);
 
+int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
+		u32 rkey, enum dma_data_direction dir);
+void rdma_rw_ctx_dma_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+			     u8 port_num);
+
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u8 port_num, struct scatterlist *sg, u32 sg_cnt,
 		struct scatterlist *prot_sg, u32 prot_sg_cnt,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 21/28] nvmet: Split nvmet_bdev_execute_rw() into a helper function
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (19 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 22/28] nvmet: Use DMA addresses instead of struct pages for P2P Logan Gunthorpe
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Move the mapping of the SG list into the bio and the submission of the
bio into a static helper function to reduce complexity.

This will be useful in the next patch which submits dma-direct bios
for P2P requests.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/io-cmd-bdev.c | 52 ++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 7a1cf6437a6a..061d40b020c7 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -103,13 +103,41 @@ static void nvmet_bio_done(struct bio *bio)
 		bio_put(bio);
 }
 
+static void nvmet_submit_sg(struct nvmet_req *req, struct bio *bio,
+			    sector_t sector)
+{
+	int sg_cnt = req->sg_cnt;
+	struct scatterlist *sg;
+	int i;
+
+	for_each_sg(req->sg, sg, req->sg_cnt, i) {
+		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
+				!= sg->length) {
+			struct bio *prev = bio;
+
+			bio = bio_alloc(GFP_KERNEL,
+					min(sg_cnt, BIO_MAX_PAGES));
+			bio_set_dev(bio, req->ns->bdev);
+			bio->bi_iter.bi_sector = sector;
+			bio->bi_opf = prev->bi_opf;
+
+			bio_chain(bio, prev);
+			submit_bio(prev);
+		}
+
+		sector += sg->length >> 9;
+		sg_cnt--;
+	}
+
+	submit_bio(bio);
+}
+
 static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 {
 	int sg_cnt = req->sg_cnt;
 	struct bio *bio;
-	struct scatterlist *sg;
 	sector_t sector;
-	int op, op_flags = 0, i;
+	int op, op_flags = 0;
 
 	if (!req->sg_cnt) {
 		nvmet_req_complete(req, 0);
@@ -143,25 +171,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	bio->bi_end_io = nvmet_bio_done;
 	bio_set_op_attrs(bio, op, op_flags);
 
-	for_each_sg(req->sg, sg, req->sg_cnt, i) {
-		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
-				!= sg->length) {
-			struct bio *prev = bio;
-
-			bio = bio_alloc(GFP_KERNEL, min(sg_cnt, BIO_MAX_PAGES));
-			bio_set_dev(bio, req->ns->bdev);
-			bio->bi_iter.bi_sector = sector;
-			bio_set_op_attrs(bio, op, op_flags);
-
-			bio_chain(bio, prev);
-			submit_bio(prev);
-		}
-
-		sector += sg->length >> 9;
-		sg_cnt--;
-	}
-
-	submit_bio(bio);
+	nvmet_submit_sg(req, bio, sector);
 }
 
 static void nvmet_bdev_execute_flush(struct nvmet_req *req)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 22/28] nvmet: Use DMA addresses instead of struct pages for P2P
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (20 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 21/28] nvmet: Split nvmet_bdev_execute_rw() into a helper function Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 23/28] nvme-pci: Remove support for PCI_P2PDMA requests Logan Gunthorpe
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

Start using dma-direct bios and the DMA address rdma_rw_ctx API.

This removes struct pages from all P2P transactions.
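
The end-to-end flow for a P2P request then looks roughly like this (a
sketch stitched together from the helpers used in this patch; error
handling, the ordering of the RDMA and block halves, and the
completion paths are omitted):

    buf = pci_alloc_p2pmem(p2p_dev, transfer_len);
    dma_addr = pci_p2pmem_virt_to_bus(p2p_dev, buf);

    /* RDMA side: move the data between the wire and the P2P buffer */
    rdma_rw_ctx_dma_init(&rsp->rw, qp, port_num, dma_addr, transfer_len,
                         remote_addr, rkey, dir);

    /* block side: hand the same bus address to the backing bdev */
    bio_add_dma_addr(ns->bdev->bd_queue, bio, dma_addr, transfer_len);
    submit_bio(bio);

    ...

    pci_free_p2pmem(p2p_dev, buf, transfer_len);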

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/core.c        | 12 +++++----
 drivers/nvme/target/io-cmd-bdev.c | 32 ++++++++++++++++++++---
 drivers/nvme/target/nvmet.h       |  5 +++-
 drivers/nvme/target/rdma.c        | 43 +++++++++++++++++++++++--------
 4 files changed, 71 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 7734a6acff85..230e99b63320 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -420,7 +420,7 @@ static int nvmet_p2pmem_ns_enable(struct nvmet_ns *ns)
 		return -EINVAL;
 	}
 
-	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+	if (!blk_queue_dma_direct(ns->bdev->bd_queue)) {
 		pr_err("peer-to-peer DMA is not supported by the driver of %s\n",
 		       ns->device_path);
 		return -EINVAL;
@@ -926,9 +926,9 @@ int nvmet_req_alloc_sgl(struct nvmet_req *req)
 
 		req->p2p_dev = NULL;
 		if (req->sq->qid && p2p_dev) {
-			req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
-						       req->transfer_len);
-			if (req->sg) {
+			req->p2p_dma_buf = pci_alloc_p2pmem(p2p_dev,
+							    req->transfer_len);
+			if (req->p2p_dma_buf) {
 				req->p2p_dev = p2p_dev;
 				return 0;
 			}
@@ -951,10 +951,12 @@ EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
 void nvmet_req_free_sgl(struct nvmet_req *req)
 {
 	if (req->p2p_dev)
-		pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
+		pci_free_p2pmem(req->p2p_dev, req->p2p_dma_buf,
+				req->transfer_len);
 	else
 		sgl_free(req->sg);
 
+	req->p2p_dev = NULL;
 	req->sg = NULL;
 	req->sg_cnt = 0;
 }
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 061d40b020c7..f5621aeb1d6c 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <linux/pci-p2pdma.h>
 #include "nvmet.h"
 
 int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
@@ -132,6 +133,24 @@ static void nvmet_submit_sg(struct nvmet_req *req, struct bio *bio,
 	submit_bio(bio);
 }
 
+static void nvmet_submit_p2p(struct nvmet_req *req, struct bio *bio)
+{
+	dma_addr_t addr;
+	int ret;
+
+	addr = pci_p2pmem_virt_to_bus(req->p2p_dev, req->p2p_dma_buf);
+
+	ret = bio_add_dma_addr(req->ns->bdev->bd_queue, bio,
+			       addr, req->transfer_len);
+	if (WARN_ON_ONCE(ret != req->transfer_len)) {
+		bio->bi_status = BLK_STS_NOTSUPP;
+		nvmet_bio_done(bio);
+		return;
+	}
+
+	submit_bio(bio);
+}
+
 static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 {
 	int sg_cnt = req->sg_cnt;
@@ -139,7 +158,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	sector_t sector;
 	int op, op_flags = 0;
 
-	if (!req->sg_cnt) {
+	if (!req->sg_cnt && !req->p2p_dev) {
 		nvmet_req_complete(req, 0);
 		return;
 	}
@@ -153,8 +172,10 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 		op = REQ_OP_READ;
 	}
 
-	if (is_pci_p2pdma_page(sg_page(req->sg)))
-		op_flags |= REQ_NOMERGE;
+	if (req->p2p_dev) {
+		op_flags |= REQ_DMA_DIRECT;
+		sg_cnt = 1;
+	}
 
 	sector = le64_to_cpu(req->cmd->rw.slba);
 	sector <<= (req->ns->blksize_shift - 9);
@@ -171,7 +192,10 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	bio->bi_end_io = nvmet_bio_done;
 	bio_set_op_attrs(bio, op, op_flags);
 
-	nvmet_submit_sg(req, bio, sector);
+	if (req->p2p_dev)
+		nvmet_submit_p2p(req, bio);
+	else
+		nvmet_submit_sg(req, bio, sector);
 }
 
 static void nvmet_bdev_execute_flush(struct nvmet_req *req)
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index c25d88fc9dec..5714e5b5ef04 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -288,7 +288,10 @@ struct nvmet_req {
 	struct nvmet_sq		*sq;
 	struct nvmet_cq		*cq;
 	struct nvmet_ns		*ns;
-	struct scatterlist	*sg;
+	union {
+		struct scatterlist	*sg;
+		void			*p2p_dma_buf;
+	};
 	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
 	union {
 		struct {
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 36d906a7f70d..92bfc7207814 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -15,6 +15,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/inet.h>
+#include <linux/pci-p2pdma.h>
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
@@ -495,6 +496,18 @@ static void nvmet_rdma_process_wr_wait_list(struct nvmet_rdma_queue *queue)
 	spin_unlock(&queue->rsp_wr_wait_lock);
 }
 
+static void nvmet_rdma_ctx_destroy(struct nvmet_rdma_rsp *rsp)
+{
+	struct nvmet_rdma_queue *queue = rsp->queue;
+
+	if (rsp->req.p2p_dev)
+		rdma_rw_ctx_dma_destroy(&rsp->rw, queue->cm_id->qp,
+					queue->cm_id->port_num);
+	else
+		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
+				queue->cm_id->port_num, rsp->req.sg,
+				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+}
 
 static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 {
@@ -502,11 +515,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	atomic_add(1 + rsp->n_rdma, &queue->sq_wr_avail);
 
-	if (rsp->n_rdma) {
-		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
-	}
+	if (rsp->n_rdma)
+		nvmet_rdma_ctx_destroy(rsp);
 
 	if (rsp->req.sg != rsp->cmd->inline_sg)
 		nvmet_req_free_sgl(&rsp->req);
@@ -587,9 +597,9 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	WARN_ON(rsp->n_rdma <= 0);
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
-	rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-			queue->cm_id->port_num, rsp->req.sg,
-			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+
+	nvmet_rdma_ctx_destroy(rsp);
+
 	rsp->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -663,6 +673,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
 	u64 addr = le64_to_cpu(sgl->addr);
 	u32 key = get_unaligned_le32(sgl->key);
+	dma_addr_t dma_addr;
 	int ret;
 
 	rsp->req.transfer_len = get_unaligned_le24(sgl->length);
@@ -675,9 +686,19 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (ret < 0)
 		goto error_out;
 
-	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
-			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req));
+	if (rsp->req.p2p_dev) {
+		dma_addr = pci_p2pmem_virt_to_bus(rsp->req.p2p_dev,
+						  rsp->req.p2p_dma_buf);
+
+		ret = rdma_rw_ctx_dma_init(&rsp->rw, cm_id->qp,
+					   cm_id->port_num, dma_addr,
+					   rsp->req.transfer_len, addr, key,
+					   nvmet_data_dir(&rsp->req));
+	} else {
+		ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
+				       rsp->req.sg, rsp->req.sg_cnt, 0, addr,
+				       key, nvmet_data_dir(&rsp->req));
+	}
 	if (ret < 0)
 		goto error_out;
 	rsp->n_rdma += ret;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 23/28] nvme-pci: Remove support for PCI_P2PDMA requests
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (21 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 22/28] nvmet: Use DMA addresses instead of struct pages for P2P Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 24/28] block: Remove PCI_P2PDMA queue flag Logan Gunthorpe
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

These requests have been superseded by dma-direct requests and are
therefore no longer needed.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  2 --
 drivers/nvme/host/nvme.h |  3 +--
 drivers/nvme/host/pci.c  | 27 ++++++++++-----------------
 3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8e876417c44b..63d132c478b4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3257,8 +3257,6 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	}
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
-		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 	if (ctrl->ops->flags & NVME_F_DMA_DIRECT)
 		blk_queue_flag_set(QUEUE_FLAG_DMA_DIRECT, ns->queue);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f1dddc95c6a8..d103cecc14dd 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -361,8 +361,7 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
-#define NVME_F_PCI_P2PDMA		(1 << 2)
-#define NVME_F_DMA_DIRECT		(1 << 3)
+#define NVME_F_DMA_DIRECT		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5957f3a4f261..7f806e76230a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -564,9 +564,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	WARN_ON_ONCE(!iod->nents);
 
-	/* P2PDMA requests do not need to be unmapped */
-	if (!is_pci_p2pdma_page(sg_page(iod->sg)) &&
-	    !blk_rq_is_dma_direct(req))
+	/* DMA direct requests do not need to be unmapped */
+	if (!blk_rq_is_dma_direct(req))
 		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
 
 
@@ -828,16 +827,14 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (blk_rq_nr_phys_segments(req) == 1 && !blk_rq_is_dma_direct(req)) {
 		struct bio_vec bv = req_bvec(req);
 
-		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (bv.bv_offset + bv.bv_len <= dev->ctrl.page_size * 2)
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
+		if (bv.bv_offset + bv.bv_len <= dev->ctrl.page_size * 2)
+			return nvme_setup_prp_simple(dev, req,
+						     &cmnd->rw, &bv);
 
-			if (iod->nvmeq->qid &&
-			    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
-				return nvme_setup_sgl_simple(dev, req,
-							     &cmnd->rw, &bv);
-		}
+		if (iod->nvmeq->qid &&
+		    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
+			return nvme_setup_sgl_simple(dev, req,
+						     &cmnd->rw, &bv);
 	}
 
 	iod->dma_len = 0;
@@ -849,10 +846,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (!iod->nents)
 		goto out;
 
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
-					      rq_dma_dir(req));
-	else if (blk_rq_is_dma_direct(req))
+	if (blk_rq_is_dma_direct(req))
 		nr_mapped = iod->nents;
 	else
 		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
@@ -2642,7 +2636,6 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
 	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA |
 				  NVME_F_DMA_DIRECT,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 24/28] block: Remove PCI_P2PDMA queue flag
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (22 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 23/28] nvme-pci: Remove support for PCI_P2PDMA requests Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 25/28] IB/core: Remove P2PDMA mapping support in rdma_rw_ctx Logan Gunthorpe
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

This flag has been superseded by the DMA_DIRECT functionality.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/blkdev.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a5b856324276..9ea800645cf5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,8 +615,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED	22	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 23	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED	24	/* queue has been quiesced */
-#define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
-#define QUEUE_FLAG_DMA_DIRECT	26	/* device supports dma-addr requests */
+#define QUEUE_FLAG_DMA_DIRECT	25	/* device supports dma-addr requests */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -641,8 +640,6 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)	\
 	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
-#define blk_queue_pci_p2pdma(q)	\
-	test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 #define blk_queue_dma_direct(q)	\
 	test_bit(QUEUE_FLAG_DMA_DIRECT, &(q)->queue_flags)
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 25/28] IB/core: Remove P2PDMA mapping support in rdma_rw_ctx
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (23 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 24/28] block: Remove PCI_P2PDMA queue flag Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 26/28] PCI/P2PDMA: Remove SGL helpers Logan Gunthorpe
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

There are no longer any users submitting P2PDMA struct pages to
rdma_rw_ctx, so this special handling can be removed.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index cefa6b930bc8..350b9b730ddc 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -4,7 +4,6 @@
  */
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
-#include <linux/pci-p2pdma.h>
 #include <rdma/mr_pool.h>
 #include <rdma/rw.h>
 
@@ -271,11 +270,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 	struct ib_device *dev = qp->pd->device;
 	int ret;
 
-	if (is_pci_p2pdma_page(sg_page(sg)))
-		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
-	else
-		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
-
+	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
 	if (!ret)
 		return -ENOMEM;
 	sg_cnt = ret;
@@ -635,9 +630,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 {
 	__rdma_rw_ctx_destroy(ctx, qp);
 
-	/* P2PDMA contexts do not need to be unmapped */
-	if (!is_pci_p2pdma_page(sg_page(sg)))
-		ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 26/28] PCI/P2PDMA: Remove SGL helpers
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (24 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 25/28] IB/core: Remove P2PDMA mapping support in rdma_rw_ctx Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 27/28] PCI/P2PDMA: Remove struct pages that back P2PDMA memory Logan Gunthorpe
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

The functions pci_p2pmem_alloc_sgl(), pci_p2pmem_free_sgl() and
pci_p2pdma_map_sg() no longer have any callers, so remove them.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 Documentation/driver-api/pci/p2pdma.rst |  9 +--
 drivers/pci/p2pdma.c                    | 95 -------------------------
 include/linux/pci-p2pdma.h              | 19 -----
 3 files changed, 3 insertions(+), 120 deletions(-)

diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
index 44deb52beeb4..5b19c420d921 100644
--- a/Documentation/driver-api/pci/p2pdma.rst
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -84,9 +84,8 @@ Client Drivers
 --------------
 
 A client driver typically only has to conditionally change its DMA map
-routine to use the mapping function :c:func:`pci_p2pdma_map_sg()` instead
-of the usual :c:func:`dma_map_sg()` function. Memory mapped in this
-way does not need to be unmapped.
+routine to use the PCI bus address with :c:func:`pci_p2pmem_virt_to_bus()`
+for the DMA address instead of the usual :c:func:`dma_map_sg()` function.
 
 The client may also, optionally, make use of
 :c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
@@ -117,9 +116,7 @@ returned with pci_dev_put().
 
 Once a provider is selected, the orchestrator can then use
 :c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
-allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
-and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
-allocating scatter-gather lists with P2P memory.
+allocate P2P memory from the provider.
 
 Struct Page Caveats
 -------------------
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a98126ad9c3a..9b82e13f802c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -666,60 +666,6 @@ pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
 
-/**
- * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
- * @pdev: the device to allocate memory from
- * @nents: the number of SG entries in the list
- * @length: number of bytes to allocate
- *
- * Return: %NULL on error or &struct scatterlist pointer and @nents on success
- */
-struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
-					 unsigned int *nents, u32 length)
-{
-	struct scatterlist *sg;
-	void *addr;
-
-	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
-	if (!sg)
-		return NULL;
-
-	sg_init_table(sg, 1);
-
-	addr = pci_alloc_p2pmem(pdev, length);
-	if (!addr)
-		goto out_free_sg;
-
-	sg_set_buf(sg, addr, length);
-	*nents = 1;
-	return sg;
-
-out_free_sg:
-	kfree(sg);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
-
-/**
- * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
- * @pdev: the device to allocate memory from
- * @sgl: the allocated scatterlist
- */
-void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
-{
-	struct scatterlist *sg;
-	int count;
-
-	for_each_sg(sgl, sg, INT_MAX, count) {
-		if (!sg)
-			break;
-
-		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
-	}
-	kfree(sgl);
-}
-EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
-
 /**
  * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
  *	other devices with pci_p2pmem_find()
@@ -738,47 +684,6 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-/**
- * pci_p2pdma_map_sg - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- *
- * Scatterlists mapped with this function should not be unmapped in any way.
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
-		      enum dma_data_direction dir)
-{
-	struct dev_pagemap *pgmap;
-	struct scatterlist *s;
-	phys_addr_t paddr;
-	int i;
-
-	/*
-	 * p2pdma mappings are not compatible with devices that use
-	 * dma_virt_ops. If the upper layers do the right thing
-	 * this should never happen because it will be prevented
-	 * by the check in pci_p2pdma_add_client()
-	 */
-	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_DMA_VIRT_OPS) &&
-			 dev->dma_ops == &dma_virt_ops))
-		return 0;
-
-	for_each_sg(sg, s, nents, i) {
-		pgmap = sg_page(s)->pgmap;
-		paddr = sg_phys(s);
-
-		s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
-		sg_dma_len(s) = s->length;
-	}
-
-	return nents;
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
-
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  *		to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index bca9bc3e5be7..4a75a3f43444 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -26,12 +26,7 @@ struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients);
 void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
 void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
 pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
-struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
-					 unsigned int *nents, u32 length);
-void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
-		      enum dma_data_direction dir);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -69,23 +64,9 @@ static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
 {
 	return 0;
 }
-static inline struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
-		unsigned int *nents, u32 length)
-{
-	return NULL;
-}
-static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
-		struct scatterlist *sgl)
-{
-}
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg(struct device *dev,
-		struct scatterlist *sg, int nents, enum dma_data_direction dir)
-{
-	return 0;
-}
 static inline int pci_p2pdma_enable_store(const char *page,
 		struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 27/28] PCI/P2PDMA: Remove struct pages that back P2PDMA memory
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (25 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 26/28] PCI/P2PDMA: Remove SGL helpers Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 16:12 ` [RFC PATCH 28/28] memremap: Remove PCI P2PDMA page memory type Logan Gunthorpe
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

There are no more users of the struct pages that back P2P memory,
so convert the devm_memremap_pages() call to devm_memremap() to remove
them.

The percpu_ref and completion are retained in struct pci_p2pdma to
track when there is no memory allocated out of the genpool and it is
therefore safe to free it.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 107 +++++++++++++------------------------------
 1 file changed, 33 insertions(+), 74 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 9b82e13f802c..83d93911f792 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -22,10 +22,6 @@
 struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
-};
-
-struct p2pdma_pagemap {
-	struct dev_pagemap pgmap;
 	struct percpu_ref ref;
 	struct completion ref_done;
 };
@@ -78,29 +74,12 @@ static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
-static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
-{
-	return container_of(ref, struct p2pdma_pagemap, ref);
-}
-
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
-	struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
+	struct pci_p2pdma *p2pdma =
+		container_of(ref, struct pci_p2pdma, ref);
 
-	complete(&p2p_pgmap->ref_done);
-}
-
-static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
-{
-	percpu_ref_kill(ref);
-}
-
-static void pci_p2pdma_percpu_cleanup(struct percpu_ref *ref)
-{
-	struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
-
-	wait_for_completion(&p2p_pgmap->ref_done);
-	percpu_ref_exit(&p2p_pgmap->ref);
+	complete(&p2pdma->ref_done);
 }
 
 static void pci_p2pdma_release(void *data)
@@ -111,6 +90,10 @@ static void pci_p2pdma_release(void *data)
 	if (!p2pdma)
 		return;
 
+	percpu_ref_kill(&p2pdma->ref);
+	wait_for_completion(&p2pdma->ref_done);
+	percpu_ref_exit(&p2pdma->ref);
+
 	/* Flush and disable pci_alloc_p2p_mem() */
 	pdev->p2pdma = NULL;
 	synchronize_rcu();
@@ -128,10 +111,17 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	if (!p2p)
 		return -ENOMEM;
 
+	init_completion(&p2p->ref_done);
+
 	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
 	if (!p2p->pool)
 		goto out;
 
+	error = percpu_ref_init(&p2p->ref, pci_p2pdma_percpu_release, 0,
+				GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
 	if (error)
 		goto out_pool_destroy;
@@ -165,8 +155,7 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			    u64 offset)
 {
-	struct p2pdma_pagemap *p2p_pgmap;
-	struct dev_pagemap *pgmap;
+	struct resource res;
 	void *addr;
 	int error;
 
@@ -188,50 +177,26 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			return error;
 	}
 
-	p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
-	if (!p2p_pgmap)
-		return -ENOMEM;
+	res.start = pci_resource_start(pdev, bar) + offset;
+	res.end = res.start + size - 1;
+	res.flags = pci_resource_flags(pdev, bar);
 
-	init_completion(&p2p_pgmap->ref_done);
-	error = percpu_ref_init(&p2p_pgmap->ref,
-			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
-	if (error)
-		goto pgmap_free;
-
-	pgmap = &p2p_pgmap->pgmap;
-
-	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
-	pgmap->res.end = pgmap->res.start + size - 1;
-	pgmap->res.flags = pci_resource_flags(pdev, bar);
-	pgmap->ref = &p2p_pgmap->ref;
-	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
-	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
-		pci_resource_start(pdev, bar);
-	pgmap->kill = pci_p2pdma_percpu_kill;
-	pgmap->cleanup = pci_p2pdma_percpu_cleanup;
-
-	addr = devm_memremap_pages(&pdev->dev, pgmap);
-	if (IS_ERR(addr)) {
-		error = PTR_ERR(addr);
-		goto pgmap_free;
-	}
+	addr = devm_memremap(&pdev->dev, res.start, size, MEMREMAP_WC);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
 
-	error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
-			pci_bus_address(pdev, bar) + offset,
-			resource_size(&pgmap->res), dev_to_node(&pdev->dev),
-			&p2p_pgmap->ref);
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+				  pci_bus_address(pdev, bar) + offset, size,
+				  dev_to_node(&pdev->dev));
 	if (error)
 		goto pages_free;
 
-	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
-		 &pgmap->res);
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n", &res);
 
 	return 0;
 
 pages_free:
-	devm_memunmap_pages(&pdev->dev, pgmap);
-pgmap_free:
-	devm_kfree(&pdev->dev, p2p_pgmap);
+	devm_memunmap(&pdev->dev, addr);
 	return error;
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
@@ -601,7 +566,6 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_find_many);
 void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
 {
 	void *ret = NULL;
-	struct percpu_ref *ref;
 
 	/*
 	 * Pairs with synchronize_rcu() in pci_p2pdma_release() to
@@ -612,16 +576,13 @@ void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
 	if (unlikely(!pdev->p2pdma))
 		goto out;
 
-	ret = (void *)gen_pool_alloc_owner(pdev->p2pdma->pool, size,
-			(void **) &ref);
-	if (!ret)
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->ref)))
 		goto out;
 
-	if (unlikely(!percpu_ref_tryget_live(ref))) {
-		gen_pool_free(pdev->p2pdma->pool, (unsigned long) ret, size);
-		ret = NULL;
-		goto out;
-	}
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+	if (!ret)
+		percpu_ref_put(&pdev->p2pdma->ref);
+
 out:
 	rcu_read_unlock();
 	return ret;
@@ -636,11 +597,9 @@ EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
  */
 void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
 {
-	struct percpu_ref *ref;
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
 
-	gen_pool_free_owner(pdev->p2pdma->pool, (uintptr_t)addr, size,
-			(void **) &ref);
-	percpu_ref_put(ref);
+	percpu_ref_put(&pdev->p2pdma->ref);
 }
 EXPORT_SYMBOL_GPL(pci_free_p2pmem);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [RFC PATCH 28/28] memremap: Remove PCI P2PDMA page memory type
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (26 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 27/28] PCI/P2PDMA: Remove struct pages that back P2PDMA memory Logan Gunthorpe
@ 2019-06-20 16:12 ` Logan Gunthorpe
  2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
  2019-06-24  7:27 ` Christoph Hellwig
  29 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates,
	Logan Gunthorpe

There are no more users of MEMORY_DEVICE_PCI_P2PDMA and
is_pci_p2pdma_page(), so remove them.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/memremap.h |  5 -----
 include/linux/mm.h       | 13 -------------
 2 files changed, 18 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1732dea030b2..2e5d9fcd4d69 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,16 +51,11 @@ struct vmem_altmap {
  * wakeup event whenever a page is unpinned and becomes idle. This
  * wakeup is used to coordinate physical address space management (ex:
  * fs truncate/hole punch) vs pinned pages (ex: device dma).
- *
- * MEMORY_DEVICE_PCI_P2PDMA:
- * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
- * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_PRIVATE = 1,
 	MEMORY_DEVICE_PUBLIC,
 	MEMORY_DEVICE_FS_DAX,
-	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd0b5f4e1e45..f5fa9ec440e3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -966,19 +966,6 @@ static inline bool is_device_public_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
 }
 
-#ifdef CONFIG_PCI_P2PDMA
-static inline bool is_pci_p2pdma_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
-}
-#else /* CONFIG_PCI_P2PDMA */
-static inline bool is_pci_p2pdma_page(const struct page *page)
-{
-	return false;
-}
-#endif /* CONFIG_PCI_P2PDMA */
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline void dev_pagemap_get_ops(void)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address
  2019-06-20 16:12 ` [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address Logan Gunthorpe
@ 2019-06-20 16:49   ` Jason Gunthorpe
  2019-06-20 16:59     ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-20 16:49 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 10:12:32AM -0600, Logan Gunthorpe wrote:
> Introduce rdma_rw_ctx_dma_init() and rdma_rw_ctx_dma_destroy() which
> peform the same operation as rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() respectively except they operate on a DMA
> address and length instead of an SGL.
> 
> This will be used for struct page-less P2PDMA, but there's also
> been opinions expressed to migrate away from SGLs and struct
> pages in the RDMA APIs and this will likely fit with that
> effort.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  drivers/infiniband/core/rw.c | 74 ++++++++++++++++++++++++++++++------
>  include/rdma/rw.h            |  6 +++
>  2 files changed, 69 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
> index 32ca8429eaae..cefa6b930bc8 100644
> +++ b/drivers/infiniband/core/rw.c
> @@ -319,6 +319,39 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>  }
>  EXPORT_SYMBOL(rdma_rw_ctx_init);
>  
> +/**
> + * rdma_rw_ctx_dma_init - initialize a RDMA READ/WRITE context from a
> + *	DMA address instead of SGL
> + * @ctx:	context to initialize
> + * @qp:		queue pair to operate on
> + * @port_num:	port num to which the connection is bound
> + * @addr:	DMA address to READ/WRITE from/to
> + * @len:	length of memory to operate on
> + * @remote_addr:remote address to read/write (relative to @rkey)
> + * @rkey:	remote key to operate on
> + * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + *
> + * Returns the number of WQEs that will be needed on the workqueue if
> + * successful, or a negative error code.
> + */
> +int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> +		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
> +		u32 rkey, enum dma_data_direction dir)

Why not keep the same basic signature here but replace the scatterlist
with the dma vec ?

> +{
> +	struct scatterlist sg;
> +
> +	sg_dma_address(&sg) = addr;
> +	sg_dma_len(&sg) = len;

This needs to fail if the driver is one of the few that require
struct page to work..

Really what I want to do is to have this new 'dma vec' pushed through
the RDMA APIs so we know that if a driver is using the dma vec
interface it is struct page free.

This is not so hard to do, as most drivers are already struct page
free, but is pretty much blocked on needing some way to go from the
block layer SGL world to the dma vec world that does not hurt storage
performance.

I am hoping that the biovec dma mapping that CH has talked about will
give the missing pieces.

FWIW, rdma is one of the places that is largely struct page free, and
would have few problems natively handling a 'dma vec' from top to bottom, so
I do like this approach.

Someone would have to look carefully at siw, rxe and hfi/qib to see
how they could continue to work with a dma vec, as they do actually
seem to need to kmap the data they are transferring. However, I
thought they were using custom dma ops these days, so maybe they just
encode a struct page in their dma vec and reject p2p entirely?

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address
  2019-06-20 16:49   ` Jason Gunthorpe
@ 2019-06-20 16:59     ` Logan Gunthorpe
  2019-06-20 17:11       ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 16:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-20 10:49 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 20, 2019 at 10:12:32AM -0600, Logan Gunthorpe wrote:
>> Introduce rdma_rw_ctx_dma_init() and rdma_rw_ctx_dma_destroy() which
>> perform the same operation as rdma_rw_ctx_init() and
>> rdma_rw_ctx_destroy() respectively except they operate on a DMA
>> address and length instead of an SGL.
>>
>> This will be used for struct page-less P2PDMA, but there's also
>> been opinions expressed to migrate away from SGLs and struct
>> pages in the RDMA APIs and this will likely fit with that
>> effort.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>  drivers/infiniband/core/rw.c | 74 ++++++++++++++++++++++++++++++------
>>  include/rdma/rw.h            |  6 +++
>>  2 files changed, 69 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
>> index 32ca8429eaae..cefa6b930bc8 100644
>> +++ b/drivers/infiniband/core/rw.c
>> @@ -319,6 +319,39 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>>  }
>>  EXPORT_SYMBOL(rdma_rw_ctx_init);
>>  
>> +/**
>> + * rdma_rw_ctx_dma_init - initialize a RDMA READ/WRITE context from a
>> + *	DMA address instead of SGL
>> + * @ctx:	context to initialize
>> + * @qp:		queue pair to operate on
>> + * @port_num:	port num to which the connection is bound
>> + * @addr:	DMA address to READ/WRITE from/to
>> + * @len:	length of memory to operate on
>> + * @remote_addr:remote address to read/write (relative to @rkey)
>> + * @rkey:	remote key to operate on
>> + * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
>> + *
>> + * Returns the number of WQEs that will be needed on the workqueue if
>> + * successful, or a negative error code.
>> + */
>> +int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>> +		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
>> +		u32 rkey, enum dma_data_direction dir)
> 
> Why not keep the same basic signature here but replace the scatterlist
> with the dma vec ?

Could do. At the moment, I had no need for dma_vec in this interface.

>> +{
>> +	struct scatterlist sg;
>> +
>> +	sg_dma_address(&sg) = addr;
>> +	sg_dma_len(&sg) = len;
> 
> This needs to fail if the driver is one of the few that require
> struct page to work..

Yes, right. Currently P2PDMA checks for the use of dma_virt_ops. And
that probably should also be done here. But is that sufficient? You're
probably right that it'll take an audit of the RDMA tree to sort that out.

> Really what I want to do is to have this new 'dma vec' pushed through
> the RDMA APIs so we know that if a driver is using the dma vec
> interface it is struct page free.

Yeah, I know you were talking about heading this way during LSF/MM and
that's partly what inspired this series. However, my focus for this RFC
was largely the block layer, to see if this is an acceptable approach -- I
just kind of hacked RDMA for now.

> This is not so hard to do, as most drivers are already struct page
> free, but is pretty much blocked on needing some way to go from the
> block layer SGL world to the dma vec world that does not hurt storage
> performance.

Maybe I can end up helping with that if it helps push the ideas here
through. (And assuming people think it's an acceptable approach for the
block-layer side of things).

Thanks,

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address
  2019-06-20 16:59     ` Logan Gunthorpe
@ 2019-06-20 17:11       ` Jason Gunthorpe
  2019-06-20 18:24         ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-20 17:11 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 10:59:44AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-20 10:49 a.m., Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2019 at 10:12:32AM -0600, Logan Gunthorpe wrote:
> >> Introduce rdma_rw_ctx_dma_init() and rdma_rw_ctx_dma_destroy() which
> >> peform the same operation as rdma_rw_ctx_init() and
> >> rdma_rw_ctx_destroy() respectively except they operate on a DMA
> >> address and length instead of an SGL.
> >>
> >> This will be used for struct page-less P2PDMA, but there's also
> >> been opinions expressed to migrate away from SGLs and struct
> >> pages in the RDMA APIs and this will likely fit with that
> >> effort.
> >>
> >> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> >>  drivers/infiniband/core/rw.c | 74 ++++++++++++++++++++++++++++++------
> >>  include/rdma/rw.h            |  6 +++
> >>  2 files changed, 69 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
> >> index 32ca8429eaae..cefa6b930bc8 100644
> >> +++ b/drivers/infiniband/core/rw.c
> >> @@ -319,6 +319,39 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
> >>  }
> >>  EXPORT_SYMBOL(rdma_rw_ctx_init);
> >>  
> >> +/**
> >> + * rdma_rw_ctx_dma_init - initialize a RDMA READ/WRITE context from a
> >> + *	DMA address instead of SGL
> >> + * @ctx:	context to initialize
> >> + * @qp:		queue pair to operate on
> >> + * @port_num:	port num to which the connection is bound
> >> + * @addr:	DMA address to READ/WRITE from/to
> >> + * @len:	length of memory to operate on
> >> + * @remote_addr:remote address to read/write (relative to @rkey)
> >> + * @rkey:	remote key to operate on
> >> + * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> >> + *
> >> + * Returns the number of WQEs that will be needed on the workqueue if
> >> + * successful, or a negative error code.
> >> + */
> >> +int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> >> +		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
> >> +		u32 rkey, enum dma_data_direction dir)
> > 
> > Why not keep the same basic signature here but replace the scatterlist
> > with the dma vec ?
> 
> Could do. At the moment, I had no need for dma_vec in this interface.

I think that is because you only did nvme not srp/iser :)

> >> +{
> >> +	struct scatterlist sg;
> >> +
> >> +	sg_dma_address(&sg) = addr;
> >> +	sg_dma_len(&sg) = len;
> > 
> > This needs to fail if the driver is one of the few that require
> > struct page to work..
> 
> Yes, right. Currently P2PDMA checks for the use of dma_virt_ops. And
> that probably should also be done here. But is that sufficient? You're
> probably right that it'll take an audit of the RDMA tree to sort that out.

For this purpose I'd be fine if you added a flag to the struct
ib_device_ops that is set on drivers that we know are OK.. We can make
that list bigger over time.
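
Roughly, the opt-in could look like the below -- the flag name is made up
here and is not an existing ib_device_ops member, it just illustrates the
idea:

struct ib_device_ops {
	/* ... existing ops ... */

	/* hypothetical: set only by drivers known to never require a
	 * struct page behind the addresses they DMA to/from */
	u8 supports_pageless_dma:1;
};

and then rdma_rw_ctx_dma_init() would start with something like:

	if (!qp->device->ops.supports_pageless_dma)
		return -EOPNOTSUPP;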

> > This is not so hard to do, as most drivers are already struct page
> > free, but is pretty much blocked on needing some way to go from the
> > block layer SGL world to the dma vec world that does not hurt storage
> > performance.
> 
> Maybe I can end up helping with that if it helps push the ideas here
> through. (And assuming people think it's an acceptable approach for the
> block-layer side of things).

Let us hope for a clear decision then

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 04/28] block: Never bounce dma-direct bios
  2019-06-20 16:12 ` [RFC PATCH 04/28] block: Never bounce " Logan Gunthorpe
@ 2019-06-20 17:23   ` Jason Gunthorpe
  2019-06-20 18:38     ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-20 17:23 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 10:12:16AM -0600, Logan Gunthorpe wrote:
> It is expected the creator of the dma-direct bio will ensure the
> target device can access the DMA address it's creating bios for.
> It's also not possible to bounce a dma-direct bio seeing the block
> layer doesn't have any way to access the underlying data behind
> the DMA address.
> 
> Thus, never bounce dma-direct bios.

I wonder how feasible it would be to implement a 'dma vec' copy
from/to? 

That is about the only operation you could safely do on P2P BAR
memory. 

I wonder if a copy implementation could somehow query the iommu layer
to get a kmap of the memory pointed at by the dma address so we don't
need to carry struct page around?

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address
  2019-06-20 17:11       ` Jason Gunthorpe
@ 2019-06-20 18:24         ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-20 11:11 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 20, 2019 at 10:59:44AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-20 10:49 a.m., Jason Gunthorpe wrote:
>>> On Thu, Jun 20, 2019 at 10:12:32AM -0600, Logan Gunthorpe wrote:
>>>> Introduce rdma_rw_ctx_dma_init() and rdma_rw_ctx_dma_destroy() which
>>>> peform the same operation as rdma_rw_ctx_init() and
>>>> rdma_rw_ctx_destroy() respectively except they operate on a DMA
>>>> address and length instead of an SGL.
>>>>
>>>> This will be used for struct page-less P2PDMA, but there's also
>>>> been opinions expressed to migrate away from SGLs and struct
>>>> pages in the RDMA APIs and this will likely fit with that
>>>> effort.
>>>>
>>>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>>>  drivers/infiniband/core/rw.c | 74 ++++++++++++++++++++++++++++++------
>>>>  include/rdma/rw.h            |  6 +++
>>>>  2 files changed, 69 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
>>>> index 32ca8429eaae..cefa6b930bc8 100644
>>>> +++ b/drivers/infiniband/core/rw.c
>>>> @@ -319,6 +319,39 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>>>>  }
>>>>  EXPORT_SYMBOL(rdma_rw_ctx_init);
>>>>  
>>>> +/**
>>>> + * rdma_rw_ctx_dma_init - initialize a RDMA READ/WRITE context from a
>>>> + *	DMA address instead of SGL
>>>> + * @ctx:	context to initialize
>>>> + * @qp:		queue pair to operate on
>>>> + * @port_num:	port num to which the connection is bound
>>>> + * @addr:	DMA address to READ/WRITE from/to
>>>> + * @len:	length of memory to operate on
>>>> + * @remote_addr:remote address to read/write (relative to @rkey)
>>>> + * @rkey:	remote key to operate on
>>>> + * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
>>>> + *
>>>> + * Returns the number of WQEs that will be needed on the workqueue if
>>>> + * successful, or a negative error code.
>>>> + */
>>>> +int rdma_rw_ctx_dma_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>>>> +		u8 port_num, dma_addr_t addr, u32 len, u64 remote_addr,
>>>> +		u32 rkey, enum dma_data_direction dir)
>>>
>>> Why not keep the same basic signature here but replace the scatterlist
>>> with the dma vec ?
>>
>> Could do. At the moment, I had no need for dma_vec in this interface.
> 
> I think that is because you only did nvme not srp/iser :)

I'm not sure that's true, at least for the P2P case. With P2P we are able
to allocate one contiguous region of memory for each transaction. It
would be quite weird to allocate multiple regions for a single transaction.

>>>> +{
>>>> +	struct scatterlist sg;
>>>> +
>>>> +	sg_dma_address(&sg) = addr;
>>>> +	sg_dma_len(&sg) = len;
>>>
>>> This needs to fail if the driver is one of the few that require
>>> struct page to work..
>>
>> Yes, right. Currently P2PDMA checks for the use of dma_virt_ops. And
>> that probably should also be done here. But is that sufficient? You're
>> probably right that it'll take an audit of the RDMA tree to sort that out.
> 
> For this purpose I'd be fine if you added a flag to the struct
> ib_device_ops that is set on drivers that we know are OK.. We can make
> that list bigger over time.

Ok, that would mirror what we did for the block layer. I'll look at
doing something like that in the near future.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 04/28] block: Never bounce dma-direct bios
  2019-06-20 17:23   ` Jason Gunthorpe
@ 2019-06-20 18:38     ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-20 11:23 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 20, 2019 at 10:12:16AM -0600, Logan Gunthorpe wrote:
>> It is expected the creator of the dma-direct bio will ensure the
>> target device can access the DMA address it's creating bios for.
>> It's also not possible to bounce a dma-direct bio seeing the block
>> layer doesn't have any way to access the underlying data behind
>> the DMA address.
>>
>> Thus, never bounce dma-direct bios.
> 
> I wonder how feasible it would be to implement a 'dma vec' copy
> from/to? 

> That is about the only operation you could safely do on P2P BAR
> memory. 
> 
> I wonder if a copy implementation could somehow query the iommu layer
> to get a kmap of the memory pointed at by the dma address so we don't
> need to carry struct page around?

That sounds a bit nasty. First we'd have to determine what the
dma_addr_t points to; and with P2P it may be a bus address or it may be
an IOVA address and it would probably have to be based on whether the
IOVA is reserved or not (PCI bus addresses should all be reserved).
Second, if it is an IOVA then we'd have to get the physical address
back from the IOMMU tables and hope we can then get it back to a
sensible kernel mapping -- and if it points to a PCI bus address we'd
then have to somehow get back to the kernel mapping which could be
anywhere in the VMALLOC region as we no longer have the linear mapping
that struct page provides.
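
The IOVA half of that lookup is at least expressible with today's IOMMU
API -- a rough sketch, which still leaves the harder problem of turning
the physical address back into a usable kernel mapping:

/* sketch only: recover a physical address from a device's dma_addr_t */
static phys_addr_t dma_addr_to_phys(struct device *dev, dma_addr_t addr)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

	if (!domain)
		return (phys_addr_t)addr; /* ignores any dma-direct offset */

	return iommu_iova_to_phys(domain, addr);
}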

I think if we need access to the memory, then this is the wrong approach
and we should keep struct page or try pfn_t so we can map the memory in
a way that would perform better.

In theory, I could relatively easily do the same thing I did for dma_vec
but with a pfn_t_vec. Though we'd still have the problem of determining
virtual address from physical address for memory that isn't linearly
mapped. We'd probably have to introduce some arch-specific thing to
linearly map an io region or something which may be possible on some
arches and not on others (same problems we have with struct page).

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (27 preceding siblings ...)
  2019-06-20 16:12 ` [RFC PATCH 28/28] memremap: Remove PCI P2PDMA page memory type Logan Gunthorpe
@ 2019-06-20 18:45 ` Dan Williams
  2019-06-20 19:33   ` Jason Gunthorpe
  2019-06-20 19:34   ` Logan Gunthorpe
  2019-06-24  7:27 ` Christoph Hellwig
  29 siblings, 2 replies; 89+ messages in thread
From: Dan Williams @ 2019-06-20 18:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Christoph Hellwig, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> For eons there has been a debate over whether or not to use
> struct pages for peer-to-peer DMA transactions. Pro-pagers have
> argued that struct pages are necessary for interacting with
> existing code like scatterlists or the bio_vecs. Anti-pagers
> assert that the tracking of the memory is unecessary and
> allocating the pages is a waste of memory. Both viewpoints are
> valid, however developers working on GPUs and RDMA tend to be
> able to do away with struct pages relatively easily

Presumably because they have historically never tried to be
inter-operable with the block layer or drivers outside graphics and
RDMA.

>  compared to
> those wanting to work with NVMe devices through the block layer.
> So it would be of great value to be able to universally do P2PDMA
> transactions without the use of struct pages.

Please spell out the value, it is not immediately obvious to me
outside of some memory capacity savings.

> Previously, there have been multiple attempts[1][2] to replace
> struct page usage with pfn_t but this has been unpopular seeing
> it creates dangerous edge cases where unsuspecting code might
> run accross pfn_t's they are not ready for.

That's not the conclusion I arrived at because pfn_t is specifically
an opaque type precisely to force "unsuspecting" code to throw
compiler assertions. Instead pfn_t was dealt its death blow here:

https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

...and I think that feedback also reads on this proposal.

> Currently, we have P2PDMA using struct pages through the block layer
> and the dangerous cases are avoided by using a queue flag that
> indicates support for the special pages.
>
> This RFC proposes a new solution: allow the block layer to take
> DMA addresses directly for queues that indicate support. This will
> provide a more general path for doing P2PDMA-like requests and will
> allow us to remove the struct pages that back P2PDMA memory thus paving
> the way to build a more uniform P2PDMA ecosystem.

My primary concern with this is that ascribes a level of generality
that just isn't there for peer-to-peer dma operations. "Peer"
addresses are not "DMA" addresses, and the rules about what can and
can't do peer-DMA are not generically known to the block layer. At
least with a side object there's a chance to describe / recall those
restrictions as these things get passed around the I/O stack, but an
undecorated "DMA" address passed through the block layer with no other
benefit to any subsystem besides RDMA does not feel like it advances
the state of the art.

Again, what are the benefits of plumbing this RDMA special case?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
@ 2019-06-20 19:33   ` Jason Gunthorpe
  2019-06-20 20:18     ` Dan Williams
  2019-06-24  7:31     ` Christoph Hellwig
  2019-06-20 19:34   ` Logan Gunthorpe
  1 sibling, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-20 19:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Linux Kernel Mailing List, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote:

> > Previously, there have been multiple attempts[1][2] to replace
> > struct page usage with pfn_t but this has been unpopular seeing
> > it creates dangerous edge cases where unsuspecting code might
> > run accross pfn_t's they are not ready for.
> 
> That's not the conclusion I arrived at because pfn_t is specifically
> an opaque type precisely to force "unsuspecting" code to throw
> compiler assertions. Instead pfn_t was dealt its death blow here:
> 
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/
> 
> ...and I think that feedback also reads on this proposal.

I read through Linus's remarks and he seems completely right that
anything that touches a filesystem needs a struct page, because FS's
rely heavily on that.

It is much less clear to me why a GPU BAR or a NVME CMB that never
touches a filesystem needs a struct page.. The best reason I've seen
is that it must have struct page because the block layer heavily
depends on struct page.

Since that thread was so DAX/pmem centric (and Linus did say he liked
the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
users?

This P2P is quite distinct from DAX as the struct page* would point to
non-cacheable weird memory that few struct page users would even be
able to work with, while I understand DAX use cases focused on CPU
cache coherent memory, and filesystem involvement.

> My primary concern with this is that ascribes a level of generality
> that just isn't there for peer-to-peer dma operations. "Peer"
> addresses are not "DMA" addresses, and the rules about what can and
> can't do peer-DMA are not generically known to the block layer.

?? The P2P infrastructure produces a DMA bus address for the
initiating device that is absolutely a DMA address. There is some
intermediate CPU centric representation, but after mapping it is the
same as any other DMA bus address.

The map function can tell if the device pair combination can do p2p or
not.

> Again, what are the benefits of plumbing this RDMA special case?

It is not just RDMA, this is interesting for GPU and vfio use cases
too. RDMA is just the most complete in-tree user we have today.

ie GPU people would really like to do read() and have P2P
transparently happen to on-GPU pages. With GPUs having huge amounts of
memory loading file data into them is really a performance critical
thing.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
  2019-06-20 19:33   ` Jason Gunthorpe
@ 2019-06-20 19:34   ` Logan Gunthorpe
  2019-06-20 23:40     ` Dan Williams
  1 sibling, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 19:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Christoph Hellwig, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-20 12:45 p.m., Dan Williams wrote:
> On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>> For eons there has been a debate over whether or not to use
>> struct pages for peer-to-peer DMA transactions. Pro-pagers have
>> argued that struct pages are necessary for interacting with
>> existing code like scatterlists or the bio_vecs. Anti-pagers
>> assert that the tracking of the memory is unecessary and
>> allocating the pages is a waste of memory. Both viewpoints are
>> valid, however developers working on GPUs and RDMA tend to be
>> able to do away with struct pages relatively easily
> 
> Presumably because they have historically never tried to be
> inter-operable with the block layer or drivers outside graphics and
> RDMA.

Yes, but really there are three main sets of users for P2P right now:
graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA
people they don't bother with struct page. I seem to be the only one
trying to push P2P with NVMe and it seems to be a losing battle.

> Please spell out the value, it is not immediately obvious to me
> outside of some memory capacity savings.

There are a few things:

* Have consistency with P2P efforts as most other efforts have been
avoiding struct page. Nobody else seems to want
pci_p2pdma_add_resource() or any devm_memremap_pages() call.

* Avoid all arch-specific dependencies for P2P. With struct page the IO
memory must fit in the linear mapping. This requires some work with
RISC-V and I remember some complaints from the powerpc people regarding
this. Certainly not all arches will be able to fit the IO region into
the linear mapping space.

* Remove a bunch of PCI P2PDMA special case mapping stuff from the block
layer and RDMA interface (which I've been hearing complaints over).

* Save the struct page memory that is largely unused (as you note).

>> Previously, there have been multiple attempts[1][2] to replace
>> struct page usage with pfn_t but this has been unpopular seeing
>> it creates dangerous edge cases where unsuspecting code might
>> run accross pfn_t's they are not ready for.
> 
> That's not the conclusion I arrived at because pfn_t is specifically
> an opaque type precisely to force "unsuspecting" code to throw
> compiler assertions. Instead pfn_t was dealt its death blow here:
> 
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

Ok, well yes the special pages are what we've done for P2PDMA today. But
I don't think Linus's criticism really applies to what's in this RFC.
For starters, P2PDMA doesn't, and never has, used struct page to
look up the reference count. PCI BARs have no relation to the cache so
there's no need to serialize their access but this can be done
before/after the DMA addresses are submitted to the block/rdma layer if
it was required.

In fact, the only thing the struct page is used for in the current
P2PDMA implementation is a single flag indicating it's special and needs
to be mapped in a special way.
> My primary concern with this is that ascribes a level of generality
> that just isn't there for peer-to-peer dma operations. "Peer"
> addresses are not "DMA" addresses, and the rules about what can and
> can't do peer-DMA are not generically known to the block layer.

Correct, but I don't think we should teach the block layer about these
rules. In the current code, the rules are enforced outside the block
layer before the bios are submitted and this patch set doesn't change
that. The driver orchestrating P2P will always have to check the rules
and derive addresses from them (as appropriate). With the RFC the block
layer then doesn't have to care and can just handle the DMA addresses
directly.

> At least with a side object there's a chance to describe / recall those
> restrictions as these things get passed around the I/O stack, but an
> undecorated "DMA" address passed through the block layer with no other
> benefit to any subsystem besides RDMA does not feel like it advances
> the state of the art.
> 
> Again, what are the benefits of plumbing this RDMA special case?

Because I don't think it is an RDMA special case.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 19:33   ` Jason Gunthorpe
@ 2019-06-20 20:18     ` Dan Williams
  2019-06-20 20:51       ` Logan Gunthorpe
  2019-06-21 17:47       ` Jason Gunthorpe
  2019-06-24  7:31     ` Christoph Hellwig
  1 sibling, 2 replies; 89+ messages in thread
From: Dan Williams @ 2019-06-20 20:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Linux Kernel Mailing List, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 12:34 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote:
>
> > > Previously, there have been multiple attempts[1][2] to replace
> > > struct page usage with pfn_t but this has been unpopular seeing
> > > it creates dangerous edge cases where unsuspecting code might
> > > run accross pfn_t's they are not ready for.
> >
> > That's not the conclusion I arrived at because pfn_t is specifically
> > an opaque type precisely to force "unsuspecting" code to throw
> > compiler assertions. Instead pfn_t was dealt its death blow here:
> >
> > https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/
> >
> > ...and I think that feedback also reads on this proposal.
>
> I read through Linus's remarks and he seems completely right that
> anything that touches a filesystem needs a struct page, because FS's
> rely heavily on that.
>
> It is much less clear to me why a GPU BAR or a NVME CMB that never
> touches a filesystem needs a struct page.. The best reason I've seen
> is that it must have struct page because the block layer heavily
> depends on struct page.
>
> Since that thread was so DAX/pmem centric (and Linus did say he liked
> the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
> users?
>
> This P2P is quite distinct from DAX as the struct page* would point to
> non-cacheable weird memory that few struct page users would even be
> able to work with, while I understand DAX use cases focused on CPU
> cache coherent memory, and filesystem involvement.

What I'm poking at is whether this block layer capability can pick up
users outside of RDMA, more on this below...

>
> > My primary concern with this is that ascribes a level of generality
> > that just isn't there for peer-to-peer dma operations. "Peer"
> > addresses are not "DMA" addresses, and the rules about what can and
> > can't do peer-DMA are not generically known to the block layer.
>
> ?? The P2P infrastructure produces a DMA bus address for the
> initiating device that is absolutely a DMA address. There is some
> intermediate CPU centric representation, but after mapping it is the
> same as any other DMA bus address.

Right, this goes back to the confusion caused by the hardware / bus /
address that a dma-engine would consume directly, and Linux "DMA"
address as a device-specific translation of host memory.

Is the block layer representation of this address going to go through
a peer / "bus" address translation when it reaches the RDMA driver? In
other words if we tried to use this facility with other drivers how
would the driver know it was passed a traditional Linux DMA address,
vs a peer bus address that the device may not be able to handle?

> The map function can tell if the device pair combination can do p2p or
> not.

Ok, if this map step is still there then that reduces a significant
portion of my concern and it becomes a quibble about the naming and how a
non-RDMA device driver might figure out if it was handed an address
it can't handle.

>
> > Again, what are the benefits of plumbing this RDMA special case?
>
> It is not just RDMA, this is interesting for GPU and vfio use cases
> too. RDMA is just the most complete in-tree user we have today.
>
> ie GPU people would really like to do read() and have P2P
> transparently happen to on-GPU pages. With GPUs having huge amounts of
> memory loading file data into them is really a performance critical
> thing.

A direct-i/o read(2) into a page-less GPU mapping? Through a regular
file or a device special file?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 20:18     ` Dan Williams
@ 2019-06-20 20:51       ` Logan Gunthorpe
  2019-06-21 17:47       ` Jason Gunthorpe
  1 sibling, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 20:51 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Christoph Hellwig, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-20 2:18 p.m., Dan Williams wrote:
>> Since that thread was so DAX/pmem centric (and Linus did say he liked
>> the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
>> users?
>>
>> This P2P is quite distinct from DAX as the struct page* would point to
>> non-cacheable weird memory that few struct page users would even be
>> able to work with, while I understand DAX use cases focused on CPU
>> cache coherent memory, and filesystem involvement.
> 
> What I'm poking at is whether this block layer capability can pick up
> users outside of RDMA, more on this below...

I assume you mean outside of P2PDMA....

This new block layer capability is more likely to pick up additional
users compared to the existing block layer changes that are *very*
specific to PCI P2PDMA.

I also have (probably significantly controversial) plans to use this to
allow P2P through user space with O_DIRECT using an idea Jerome had in a
previous patch set that was discussed a bit informally at LSF/MM this
year. But that's a whole other RFC and requires a bunch of work I
haven't done yet.

>>
>>> My primary concern with this is that ascribes a level of generality
>>> that just isn't there for peer-to-peer dma operations. "Peer"
>>> addresses are not "DMA" addresses, and the rules about what can and
>>> can't do peer-DMA are not generically known to the block layer.
>>
>> ?? The P2P infrastructure produces a DMA bus address for the
>> initiating device that is absolutely a DMA address. There is some
>> intermediate CPU centric representation, but after mapping it is the
>> same as any other DMA bus address.
> 
> Right, this goes back to the confusion caused by the hardware / bus /
> address that a dma-engine would consume directly, and Linux "DMA"
> address as a device-specific translation of host memory.
> 
> Is the block layer representation of this address going to go through
> a peer / "bus" address translation when it reaches the RDMA driver? In
> other words if we tried to use this facility with other drivers how
> would the driver know it was passed a traditional Linux DMA address,
> vs a peer bus address that the device may not be able to handle?

The idea is that the driver doesn't need to know. There's no distinction
between a Linux DMA address and a peer bus address. They are both used
for the same purpose: to program into a DMA engine. If the device cannot
handle such a DMA address then it shouldn't indicate support for this
feature or the P2PDMA layer needs a way to detect this. Really, this
property depends more on the bus than the device and that's what all the
P2PDMA code in the PCI tree handles.

>> The map function can tell if the device pair combination can do p2p or
>> not.
> 
> Ok, if this map step is still there then that reduces a significant
> portion of my concern and it becomes a quibble about the naming and how a
> non-RDMA device driver might figure out if it was handed an address
> it can't handle.

Yes, there will always be a map step, but it should be done by the
orchestrator because it requires both devices (the client and the
provider) and the block layer really should not know about both devices.

In this RFC, the map step is kind of hidden but would probably come back
in the future. It's currently a call to pci_p2pmem_virt_to_bus() but
would eventually need to be a pci_p2pmem_map_resource() or similar which
takes a pointer to the pci_dev provider and the struct device client
doing the mapping.
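
In other words, something along these lines -- the first prototype exists
today, the second is only a guess at what the future interface could look
like:

/* exists: translate a p2pmem kernel address to a PCI bus address */
pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);

/* hypothetical: map p2pmem from @provider for DMA by @client, returning
 * a dma_addr_t that is valid for @client regardless of the topology */
dma_addr_t pci_p2pmem_map_resource(struct pci_dev *provider, void *addr,
				   size_t size, struct device *client,
				   enum dma_data_direction dir);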

Logan


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 19:34   ` Logan Gunthorpe
@ 2019-06-20 23:40     ` Dan Williams
  2019-06-20 23:42       ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dan Williams @ 2019-06-20 23:40 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Christoph Hellwig, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Thu, Jun 20, 2019 at 12:35 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2019-06-20 12:45 p.m., Dan Williams wrote:
> > On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >> For eons there has been a debate over whether or not to use
> >> struct pages for peer-to-peer DMA transactions. Pro-pagers have
> >> argued that struct pages are necessary for interacting with
> >> existing code like scatterlists or the bio_vecs. Anti-pagers
> >> assert that the tracking of the memory is unecessary and
> >> allocating the pages is a waste of memory. Both viewpoints are
> >> valid, however developers working on GPUs and RDMA tend to be
> >> able to do away with struct pages relatively easily
> >
> > Presumably because they have historically never tried to be
> > inter-operable with the block layer or drivers outside graphics and
> > RDMA.
>
> Yes, but really there are three main sets of users for P2P right now:
> graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA
> people they don't bother with struct page. I seem to be the only one
> trying to push P2P with NVMe and it seems to be a losing battle.
>
> > Please spell out the value, it is not immediately obvious to me
> > outside of some memory capacity savings.
>
> There are a few things:
>
> * Have consistency with P2P efforts as most other efforts have been
> avoiding struct page. Nobody else seems to want
> pci_p2pdma_add_resource() or any devm_memremap_pages() call.
>
> * Avoid all arch-specific dependencies for P2P. With struct page the IO
> memory must fit in the linear mapping. This requires some work with
> RISC-V and I remember some complaints from the powerpc people regarding
> this. Certainly not all arches will be able to fit the IO region into
> the linear mapping space.
>
> * Remove a bunch of PCI P2PDMA special case mapping stuff from the block
> layer and RDMA interface (which I've been hearing complaints over).

This seems to be the most salient point. I was missing the fact that
this replaces custom hacks and "special" pages with an explicit "just
pass this pre-mapped address down the stack". It's functionality that
might plausibly be used outside of p2p, as long as the driver can
assert that it never needs to touch the data with the cpu before
handing it off to a dma-engine.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 23:40     ` Dan Williams
@ 2019-06-20 23:42       ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-20 23:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Christoph Hellwig, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-20 5:40 p.m., Dan Williams wrote:
> This seems to be the most salient point. I was missing the fact that
> this replaces custom hacks and "special" pages with an explicit "just
> pass this pre-mapped address down the stack". It's functionality that
> might plausibly be used outside of p2p, as long as the driver can
> assert that it never needs to touch the data with the cpu before
> handing it off to a dma-engine.

Yup, that's a good way to put it. If I resend this patchset, I'll
include wording like yours in the cover letter.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 20:18     ` Dan Williams
  2019-06-20 20:51       ` Logan Gunthorpe
@ 2019-06-21 17:47       ` Jason Gunthorpe
  2019-06-21 17:54         ` Dan Williams
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-21 17:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Linux Kernel Mailing List, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:

> > This P2P is quite distinct from DAX as the struct page* would point to
> > non-cacheable weird memory that few struct page users would even be
> > able to work with, while I understand DAX use cases focused on CPU
> > cache coherent memory, and filesystem involvement.
> 
> What I'm poking at is whether this block layer capability can pick up
> users outside of RDMA, more on this below...

The generic capability is to do a transfer through the block layer and
scatter/gather the resulting data to some PCIe BAR memory. Currently
the block layer can only scatter/gather data into CPU cache coherent
memory.

We know of several useful places to put PCIe BAR memory already:
 - On a GPU (or FPGA, accelerator, etc), ie the GB's of GPU private
   memory that is standard these days.
 - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely
 - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that
   can be used like a CMB and avoids a DMA

RDMA doesn't really get so involved here, except that RDMA is often
the preferred way to source/sink the data buffers after the block layer has
scatter/gathered to them. (and of course RDMA is often for a block
driver, ie NVMe over fabrics)

> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not "DMA" addresses, and the rules about what can and
> > > can't do peer-DMA are not generically known to the block layer.
> >
> > ?? The P2P infrastructure produces a DMA bus address for the
> > initiating device that is absolutely a DMA address. There is some
> > intermediate CPU centric representation, but after mapping it is the
> > same as any other DMA bus address.
> 
> Right, this goes back to the confusion caused by the hardware / bus /
> address that a dma-engine would consume directly, and Linux "DMA"
> address as a device-specific translation of host memory.

I don't think there is a confusion :) Logan explained it, the
dma_addr_t is always the thing you program into the DMA engine of the
device it was created for, and this changes nothing about that.

Think of the dma vec as the same as a dma mapped SGL, just with no
available struct page.
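
i.e. roughly (field names approximated from the RFC's dma_vec patch):

struct dma_vec {
	dma_addr_t	addr;	/* already mapped for the target device */
	u32		len;
};

which carries exactly what sg_dma_address()/sg_dma_len() carry after
dma_map_sg(), minus the page/offset half of the scatterlist.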

> Is the block layer representation of this address going to go through
> a peer / "bus" address translation when it reaches the RDMA driver? 

No, it is just like any other dma mapped SGL, it is ready to go for
the device it was mapped for, and can be used for nothing other than
programming DMA on that device.

> > ie GPU people would really like to do read() and have P2P
> > transparently happen to on-GPU pages. With GPUs having huge amounts of
> > memory loading file data into them is really a performance critical
> > thing.
> 
> A direct-i/o read(2) into a page-less GPU mapping? 

The interesting case is probably an O_DIRECT read into a
DEVICE_PRIVATE page owned by the GPU driver and mmaped into the
process calling read(). The GPU driver can dynamically arrange for
that DEVICE_PRIVATE page to be linked to P2P targetable BAR memory so
the HW is capable of a direct CPU bypass transfer from the underlying
block device (ie NVMe or RDMA) to the GPU.

One way to approach this problem is to use this new dma_addr path in
the block layer.

Another way is to feed the DEVICE_PRIVATE pages into the block layer
and have it DMA map them to a P2P address.

In either case we have a situation where the block layer cannot touch
the target struct page buffers with the CPU because there is no cache
coherent CPU mapping for them, and we have to create a CPU clean path
in the block layer.

At best you could do memcpy to/from on these things, but if a GPU is
involved even that is incredibly inefficient. The GPU can do the
memcpy with DMA much faster than a memcpy_to/from_io.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-21 17:47       ` Jason Gunthorpe
@ 2019-06-21 17:54         ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2019-06-21 17:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Linux Kernel Mailing List, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Fri, Jun 21, 2019 at 10:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:
>
> > > This P2P is quite distinct from DAX as the struct page* would point to
> > > non-cacheable weird memory that few struct page users would even be
> > > able to work with, while I understand DAX use cases focused on CPU
> > > cache coherent memory, and filesystem involvement.
> >
> > What I'm poking at is whether this block layer capability can pick up
> > users outside of RDMA, more on this below...
>
> The generic capability is to do a transfer through the block layer and
> scatter/gather the resulting data to some PCIe BAR memory. Currently
> the block layer can only scatter/gather data into CPU cache coherent
> memory.
>
> We know of several useful places to put PCIe BAR memory already:
>  - On a GPU (or FPGA, accelerator, etc), ie the GB's of GPU private
>    memory that is standard these days.
>  - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely
>  - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that
>    can be used like a CMB and avoids a DMA
>
> RDMA doesn't really get so involved here, except that RDMA is often
> the preferred way to source/sink the data buffers after the block layer has
> scatter/gathered to them. (and of course RDMA is often for a block
> driver, ie NVMe over fabrics)
>
> > > > My primary concern with this is that ascribes a level of generality
> > > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > > addresses are not "DMA" addresses, and the rules about what can and
> > > > can't do peer-DMA are not generically known to the block layer.
> > >
> > > ?? The P2P infrastructure produces a DMA bus address for the
> > > initiating device that is absolutely a DMA address. There is some
> > > intermediate CPU centric representation, but after mapping it is the
> > > same as any other DMA bus address.
> >
> > Right, this goes back to the confusion caused by the hardware / bus /
> > address that a dma-engine would consume directly, and Linux "DMA"
> > address as a device-specific translation of host memory.
>
> I don't think there is a confusion :) Logan explained it, the
> dma_addr_t is always the thing you program into the DMA engine of the
> device it was created for, and this changes nothing about that.

Yup, Logan and I already settled that point on our last exchange and
offered to make that clearer in the changelog.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
                   ` (28 preceding siblings ...)
  2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
@ 2019-06-24  7:27 ` Christoph Hellwig
  2019-06-24 16:07   ` Logan Gunthorpe
  29 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-24  7:27 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Christoph Hellwig, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

This is not going to fly.

For one passing a dma_addr_t through the block layer is a layering
violation, and one that I think will also bite us in practice.
The host physical to PCIe bus address mapping can have offsets, and
those offsets absolutely can be different for different root ports.
So with your caller generated dma_addr_t everything works fine with
a switched setup as the one you are probably testing on, but on a
sufficiently complicated setup with multiple root ports it can break.

Also duplicating the whole block I/O stack, including hooks all over
the fast path is pretty much a no-go.

I've been pondering for a while if we wouldn't be better off just
passing a phys_addr_t + len instead of the page, offset, len tuple
in the bio_vec, though.  If you look at the normal I/O path here
is what we normally do:

 - we get a page as input, either because we have it at hand (e.g.
   from the page cache) or from get_user_pages (which actually calculates
   it from a pfn in the page tables)
 - once in the bio all the merging decisions are based on the physical
   address, so we have to convert it to the physical address there,
   potentially multiple times
 - then dma mapping all works off the physical address, which it gets
   from the page at the start
 - then only the dma address is used for the I/O
 - on I/O completion we often but not always need the page again.  In
   the direct I/O case for reference counting and dirty status, in the
   file system also for things like marking the page uptodate

So if we move to a phys_addr_t we'd need to go back to the page at least
once.  But because of how the merging works we really only need to do
it once per segment, as we can just do pointer arithmetic to get the
following pages.  As we generally go at least once from a physical
address to a page in the merging code anyway, even a relatively expensive
vmemmap lookup shouldn't be too bad.  Even more so given that the super hot path
(small blkdev direct I/O) can actually trivially cache the affected pages
as well.
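
To make that concrete, a sketch of the idea (the names are invented
here): the vec carries only a physical address and the page is recovered
lazily, once per segment, and only for memory that actually has a struct
page behind it:

struct phys_vec {
	phys_addr_t	paddr;
	u32		len;
};

/* not valid for P2P/BAR memory that has no struct page backing */
static inline struct page *phys_vec_page(const struct phys_vec *pv)
{
	return pfn_to_page(PHYS_PFN(pv->paddr));
}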

Linus kinda hates the pfn approach, but part of that was really that
it was proposed for file system data, which we all found out really
can't work as-is without pages the hard way.  Another part probably
was potential performance issue, but between the few page lookups, and
the fact that using a single phys_addr_t instead of pfn/page + offset
should avoid quite a few calculations performance should not actually
be affected, although we'll have to be careful to actually verify that.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-20 19:33   ` Jason Gunthorpe
  2019-06-20 20:18     ` Dan Williams
@ 2019-06-24  7:31     ` Christoph Hellwig
  2019-06-24 13:46       ` Jason Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-24  7:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Logan Gunthorpe, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Christoph Hellwig, Bjorn Helgaas, Sagi Grimberg, Keith Busch,
	Stephen Bates

On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > My primary concern with this is that ascribes a level of generality
> > that just isn't there for peer-to-peer dma operations. "Peer"
> > addresses are not "DMA" addresses, and the rules about what can and
> > can't do peer-DMA are not generically known to the block layer.
> 
> ?? The P2P infrastructure produces a DMA bus address for the
> initiating device that is absolutely a DMA address. There is some
> intermediate CPU centric representation, but after mapping it is the
> same as any other DMA bus address.
> 
> The map function can tell if the device pair combination can do p2p or
> not.

At the PCIe level there is no such thing as a DMA address, it is all
bus addresses with MMIO and DMA in the same address space (without
that P2P would have no chance of actually working obviously).  But
that bus address space is different per "bus" (which would be a
root port in PCIe), and we need to be careful about that.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24  7:31     ` Christoph Hellwig
@ 2019-06-24 13:46       ` Jason Gunthorpe
  2019-06-24 13:50         ` Christoph Hellwig
  2019-06-24 16:10         ` Logan Gunthorpe
  0 siblings, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-24 13:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Logan Gunthorpe, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not "DMA" addresses, and the rules about what can and
> > > can't do peer-DMA are not generically known to the block layer.
> > 
> > ?? The P2P infrastructure produces a DMA bus address for the
> > initiating device that is absolutely a DMA address. There is some
> > intermediate CPU centric representation, but after mapping it is the
> > same as any other DMA bus address.
> > 
> > The map function can tell if the device pair combination can do p2p or
> > not.
> 
> At the PCIe level there is no such thing as a DMA address, it is all
> bus addresses with MMIO and DMA in the same address space (without
> that P2P would have no chance of actually working obviously).  But
> that bus address space is different per "bus" (which would be a
> root port in PCIe), and we need to be careful about that.

Sure, that is how dma_addr_t is supposed to work - it is always a
device specific value that can be used only by the device that it was
created for, and different devices could have different dma_addr_t
values for the same memory. 

So when Logan goes and puts dma_addr_t into the block stack he must
also invert things so that the DMA map happens at the start of the
process to create the right dma_addr_t early.

I'm not totally clear if this series did that inversion, if it didn't
then it should not be using the dma_addr_t label at all, or referring
to anything as a 'dma address' as it is just confusing.

BTW, it is not just offset right? It is possible that the IOMMU can
generate unique dma_addr_t values for each device?? Simple offset is
just something we saw in certain embedded cases, IIRC.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 13:46       ` Jason Gunthorpe
@ 2019-06-24 13:50         ` Christoph Hellwig
  2019-06-24 13:55           ` Jason Gunthorpe
  2019-06-24 16:10         ` Logan Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-24 13:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Dan Williams, Logan Gunthorpe,
	Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Bjorn Helgaas, Sagi Grimberg,
	Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> BTW, it is not just offset right? It is possible that the IOMMU can
> generate unique dma_addr_t values for each device?? Simple offset is
> just something we saw in certain embedded cases, IIRC.

Yes, it could.  If we are trying to do P2P between two devices on
different root ports and with the IOMMU enabled we'll generate
a new bus address for the BAR on the other side dynamically every time
we map.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 13:50         ` Christoph Hellwig
@ 2019-06-24 13:55           ` Jason Gunthorpe
  2019-06-24 16:53             ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-24 13:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Logan Gunthorpe, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> > BTW, it is not just offset right? It is possible that the IOMMU can
> > generate unique dma_addr_t values for each device?? Simple offset is
> > just something we saw in certain embedded cases, IIRC.
> 
> Yes, it could.  If we are trying to do P2P between two devices on
> different root ports and with the IOMMU enabled we'll generate
> a new bus address for the BAR on the other side dynamically everytime
> we map.

Even with the same root port, if ACS is turned on it could behave like this.

It is only a very narrow case where you can take shortcuts with
dma_addr_t, and I don't think shortcuts like that are appropriate for
the mainline kernel..

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24  7:27 ` Christoph Hellwig
@ 2019-06-24 16:07   ` Logan Gunthorpe
  2019-06-25  7:20     ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 16:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Bjorn Helgaas, Dan Williams, Sagi Grimberg,
	Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-24 1:27 a.m., Christoph Hellwig wrote:
> This is not going to fly.
> 
> For one passing a dma_addr_t through the block layer is a layering
> violation, and one that I think will also bite us in practice.
> The host physical to PCIe bus address mapping can have offsets, and
> those offsets absolutely can be different for different root ports.
> So with your caller generated dma_addr_t everything works fine with
> a switched setup as the one you are probably testing on, but on a
> sufficiently complicated setup with multiple root ports it can break.

I don't follow this argument. Yes, I understand PCI Bus offsets and yes
I understand that they only apply beyond the bus they're working with.
But this isn't *that* complicated and it should be the responsibility of
the P2PDMA code to sort out and provide a dma_addr_t. The dma_addr_t
that's passed through the block layer could be a bus address or it could
be the result of a dma_map_* request (if the transaction is found to go
through an RC) depending on the requirements of the devices being used.
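
Roughly, I'm picturing the submitting side doing something like the
sketch below (the p2p_* helpers are placeholder names; only
dma_map_resource() is existing API):

	/*
	 * Illustrative only: produce the address the dma-direct bio would
	 * carry.  p2p_path_avoids_root_complex() and p2p_bus_offset() are
	 * hypothetical helpers standing in for the P2PDMA decision.
	 */
	static dma_addr_t p2p_prepare_addr(struct device *dma_dev,
					   struct pci_dev *provider,
					   phys_addr_t bar_phys, size_t len)
	{
		/* whole path stays below a common switch: use the bus address */
		if (p2p_path_avoids_root_complex(provider, dma_dev))
			return bar_phys - p2p_bus_offset(provider);

		/* otherwise treat it like any other MMIO resource and map it */
		return dma_map_resource(dma_dev, bar_phys, len,
					DMA_BIDIRECTIONAL, 0);
	}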

> Also duplicating the whole block I/O stack, including hooks all over
> the fast path is pretty much a no-go.

There was very little duplicate code in the patch set. (Really just the
mapping code). There are a few hooks, but in practice not that many if
we ignore the WARN_ONs. We might be able to work to reduce this further.
The main hooks are: when we skip bouncing, when we skip integrity prep,
when we split, and when we map. And the patchset drops the PCI_P2PDMA
hook when we map. So we're talking about maybe three or four extra ifs
that would likely normally be fast due to the branch predictor.
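
To illustrate the shape of those hooks (the flag and the helper below
are placeholder names, not necessarily what the series calls them):

	/* hypothetical flag check used by the handful of hooks above */
	static inline bool bio_is_dma_direct(const struct bio *bio)
	{
		return bio->bi_opf & REQ_DMA_DIRECT;	/* invented flag name */
	}

	/* e.g. the bounce path would just get an early exit, roughly: */
	void blk_queue_bounce(struct request_queue *q, struct bio **bio)
	{
		if (bio_is_dma_direct(*bio))
			return;		/* dma-direct bios are never bounced */
		/* ... existing bounce handling ... */
	}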

> I've been pondering for a while if we wouldn't be better off just
> passing a phys_addr_t + len instead of the page, offset, len tuple
> in the bio_vec, though.  If you look at the normal I/O path here
> is what we normally do:
> 
>  - we get a page as input, either because we have it at hand (e.g.
>    from the page cache) or from get_user_pages (which actually calculates
>    it from a pfn in the page tables)
>  - once in the bio all the merging decisions are based on the physical
>    address, so we have to convert it to the physical address there,
>    potentially multiple times
>  - then dma mapping all works off the physical address, which it gets
>    from the page at the start
>  - then only the dma address is used for the I/O
>  - on I/O completion we often but not always need the page again.  In
>    the direct I/O case for reference counting and dirty status, in the
>    file system also for things like marking the page uptodate
> 
> So if we move to a phys_addr_t we'd need to go back to the page at least
> once.  But because of how the merging works we really only need to do
> it once per segment, as we can just do pointer arithmetic to get the
> following pages.  As we generally go at least once from a physical
> address to a page in the merging code even a relatively expensive vmem_map
> lookup shouldn't be too bad.  Even more so given that the super hot path
> (small blkdev direct I/O) can actually trivially cache the affected pages
> as well.

I've always wondered why it wasn't done this way. Passing around a page
pointer *and* an offset always seemed less efficient than just a
physical address. If we did do this, the proposed dma_addr_t and
phys_addr_t paths through the block layer could be a lot more similar as
things like the split calculation could work on either address type.
We'd just have to prevent bouncing and integrity prep and have a hook
for how it's mapped.

> Linus kinda hates the pfn approach, but part of that was really that
> it was proposed for file system data, which we all found out really
> can't work as-is without pages the hard way.  Another part probably
> was potential performance issue, but between the few page lookups, and
> the fact that using a single phys_addr_t instead of pfn/page + offset
> should avoid quite a few calculations performance should not actually
> be affected, although we'll have to be careful to actually verify that.

Yes, I'd agree that removing the offset should make things simpler. But
that requires changing a lot of stuff and doesn't really help what I'm
trying to do.

Logan


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 13:46       ` Jason Gunthorpe
  2019-06-24 13:50         ` Christoph Hellwig
@ 2019-06-24 16:10         ` Logan Gunthorpe
  2019-06-25  7:18           ` Christoph Hellwig
  1 sibling, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Dan Williams, Linux Kernel Mailing List, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Sagi Grimberg,
	Keith Busch, Stephen Bates



On 2019-06-24 7:46 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
>> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
>>>> My primary concern with this is that ascribes a level of generality
>>>> that just isn't there for peer-to-peer dma operations. "Peer"
>>>> addresses are not "DMA" addresses, and the rules about what can and
>>>> can't do peer-DMA are not generically known to the block layer.
>>>
>>> ?? The P2P infrastructure produces a DMA bus address for the
>>> initiating device that is is absolutely a DMA address. There is some
>>> intermediate CPU centric representation, but after mapping it is the
>>> same as any other DMA bus address.
>>>
>>> The map function can tell if the device pair combination can do p2p or
>>> not.
>>
>> At the PCIe level there is no such thing as a DMA address, it all
>> is bus address with MMIO and DMA in the same address space (without
>> that P2P would have not chance of actually working obviously).  But
>> that bus address space is different per "bus" (which would be an
>> root port in PCIe), and we need to be careful about that.
> 
> Sure, that is how dma_addr_t is supposed to work - it is always a
> device specific value that can be used only by the device that it was
> created for, and different devices could have different dma_addr_t
> values for the same memory. 
> 
> So when Logan goes and puts dma_addr_t into the block stack he must
> also invert things so that the DMA map happens at the start of the
> process to create the right dma_addr_t early.

Yes, that's correct. The intent was to invert it so the dma_map could
happen at the start of the process so that P2PDMA code could be called
with all the information it needs to make its decision on how to map,
without having to hook into the mapping process of every driver that
wants to participate.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 13:55           ` Jason Gunthorpe
@ 2019-06-24 16:53             ` Logan Gunthorpe
  2019-06-24 18:16               ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 16:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Dan Williams, Linux Kernel Mailing List, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Sagi Grimberg,
	Keith Busch, Stephen Bates



On 2019-06-24 7:55 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
>> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
>>> BTW, it is not just offset right? It is possible that the IOMMU can
>>> generate unique dma_addr_t values for each device?? Simple offset is
>>> just something we saw in certain embedded cases, IIRC.
>>
>> Yes, it could.  If we are trying to do P2P between two devices on
>> different root ports and with the IOMMU enabled we'll generate
>> a new bus address for the BAR on the other side dynamically everytime
>> we map.
> 
> Even with the same root port if ACS is turned on could behave like this.

Yup.

> It is only a very narrow case where you can take shortcuts with
> dma_addr_t, and I don't think shortcuts like are are appropriate for
> the mainline kernel..

I don't think it's that narrow and it opens up a lot of avenues for
system design that people are wanting to go. If your high speed data
path can avoid the root complex and CPU, you can design a system which a
much smaller CPU and fewer lanes directed at the CPU.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 16:53             ` Logan Gunthorpe
@ 2019-06-24 18:16               ` Jason Gunthorpe
  2019-06-24 18:28                 ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-24 18:16 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Dan Williams, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
> > It is only a very narrow case where you can take shortcuts with
> > dma_addr_t, and I don't think shortcuts like are are appropriate for
> > the mainline kernel..
> 
> I don't think it's that narrow and it opens up a lot of avenues for
> system design that people are wanting to go. If your high speed data
> path can avoid the root complex and CPU, you can design a system which a
> much smaller CPU and fewer lanes directed at the CPU.

I mean the shortcut that something generates dma_addr_t for Device A
and then passes it to Device B - that is too hacky for mainline.

Sounded like this series does generate the dma_addr for the correct
device..

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 18:16               ` Jason Gunthorpe
@ 2019-06-24 18:28                 ` Logan Gunthorpe
  2019-06-24 18:54                   ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 18:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Dan Williams, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-24 12:16 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
>>> It is only a very narrow case where you can take shortcuts with
>>> dma_addr_t, and I don't think shortcuts like are are appropriate for
>>> the mainline kernel..
>>
>> I don't think it's that narrow and it opens up a lot of avenues for
>> system design that people are wanting to go. If your high speed data
>> path can avoid the root complex and CPU, you can design a system which a
>> much smaller CPU and fewer lanes directed at the CPU.
> 
> I mean the shortcut that something generates dma_addr_t for Device A
> and then passes it to Device B - that is too hacky for mainline.

Oh, that's not a shortcut. It's completely invalid and not likely to
work in any case. If you're mapping something you have to pass the
device that the dma_addr_t is being programmed into.

> Sounded like this series does generate the dma_addr for the correct
> device..

This series doesn't generate any DMA addresses with dma_map(). The
current p2pdma code ensures everything is behind the same root port and
only uses the pci bus address. This is valid and correct, but yes it's
something to expand upon.

I'll be doing some work shortly to add transactions that go through the
IOMMU and calls dma_map_* when appropriate.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 18:28                 ` Logan Gunthorpe
@ 2019-06-24 18:54                   ` Jason Gunthorpe
  2019-06-24 19:37                     ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-24 18:54 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Dan Williams, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:

> > Sounded like this series does generate the dma_addr for the correct
> > device..
> 
> This series doesn't generate any DMA addresses with dma_map(). The
> current p2pdma code ensures everything is behind the same root port and
> only uses the pci bus address. This is valid and correct, but yes it's
> something to expand upon.

I think if you do this it still has to be presented with the same API
as dma_map, one that takes in the target device * and produces the
device-specific dma_addr_t.

Otherwise this whole thing is confusing and looks like *all* of it can
only work under the switch assumption

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 18:54                   ` Jason Gunthorpe
@ 2019-06-24 19:37                     ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 19:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Dan Williams, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-24 12:54 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:
> 
>>> Sounded like this series does generate the dma_addr for the correct
>>> device..
>>
>> This series doesn't generate any DMA addresses with dma_map(). The
>> current p2pdma code ensures everything is behind the same root port and
>> only uses the pci bus address. This is valid and correct, but yes it's
>> something to expand upon.
> 
> I think if you do this it still has to be presented as the same API
> like dma_map that takes in the target device * and produces the device
> specific dma_addr_t

Yes, once we consider the case where it can go through the root complex,
we will need an API similar to dma_map(). We got rid of that API because
it wasn't yet required or used by anything and, per our best practices,
we don't add features that aren't used, as that is more confusing for
people reading/reworking the code.

> Otherwise this whole thing is confusing and looks like *all* of it can
> only work under the switch assumption

Hopefully it'll be clearer once we do the work to map transfers going through
the root complex. It's not that confusing to me. But it's all orthogonal
to the dma_addr_t through the block layer concept.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 16:10         ` Logan Gunthorpe
@ 2019-06-25  7:18           ` Christoph Hellwig
  0 siblings, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-25  7:18 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Christoph Hellwig, Dan Williams,
	Linux Kernel Mailing List, linux-block, linux-nvme, linux-pci,
	linux-rdma, Jens Axboe, Bjorn Helgaas, Sagi Grimberg,
	Keith Busch, Stephen Bates

On Mon, Jun 24, 2019 at 10:10:16AM -0600, Logan Gunthorpe wrote:
> Yes, that's correct. The intent was to invert it so the dma_map could
> happen at the start of the process so that P2PDMA code could be called
> with all the information it needs to make it's decision on how to map;
> without having to hook into the mapping process of every driver that
> wants to participate.

And that just isn't how things work in layering.  We need to keep
generating the dma addresses in the driver on the receiving end, as
there are all kinds of interesting ideas for how we do that.  E.g. for the
Mellanox NICs addressing their own BARs is not done by PCIe bus
addresses but by relative offsets.  And while NVMe has refused to go
down that route in the current band-aid fix for CMB addressing, I suspect
it will sooner or later have to do the same to deal with the addressing
problems in a multiple PASID world.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-24 16:07   ` Logan Gunthorpe
@ 2019-06-25  7:20     ` Christoph Hellwig
  2019-06-25 15:57       ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-25  7:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote:
> > For one passing a dma_addr_t through the block layer is a layering
> > violation, and one that I think will also bite us in practice.
> > The host physical to PCIe bus address mapping can have offsets, and
> > those offsets absolutely can be different for differnet root ports.
> > So with your caller generated dma_addr_t everything works fine with
> > a switched setup as the one you are probably testing on, but on a
> > sufficiently complicated setup with multiple root ports it can break.
> 
> I don't follow this argument. Yes, I understand PCI Bus offsets and yes
> I understand that they only apply beyond the bus they're working with.
> But this isn't *that* complicated and it should be the responsibility of
> the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t
> that's passed through the block layer could be a bus address or it could
> be the result of a dma_map_* request (if the transaction is found to go
> through an RC) depending on the requirements of the devices being used.

You assume all addressing is done by the PCI bus address.  If a device
is addressing its own BAR there is no reason to use the PCI bus address,
as it might have much more intelligent schemes (usually bar + offset).
> 
> > Also duplicating the whole block I/O stack, including hooks all over
> > the fast path is pretty much a no-go.
> 
> There was very little duplicate code in the patch set. (Really just the
> mapping code). There are a few hooks, but in practice not that many if
> we ignore the WARN_ONs. We might be able to work to reduce this further.
> The main hooks are: when we skip bouncing, when we skip integrity prep,
> when we split, and when we map. And the patchset drops the PCI_P2PDMA
> hook when we map. So we're talking about maybe three or four extra ifs
> that would likely normally be fast due to the branch predictor.

And all of those add code to the block layer fast path.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-25  7:20     ` Christoph Hellwig
@ 2019-06-25 15:57       ` Logan Gunthorpe
  2019-06-25 17:01         ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-25 15:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Bjorn Helgaas, Dan Williams, Sagi Grimberg,
	Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-25 1:20 a.m., Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote:
>>> For one passing a dma_addr_t through the block layer is a layering
>>> violation, and one that I think will also bite us in practice.
>>> The host physical to PCIe bus address mapping can have offsets, and
>>> those offsets absolutely can be different for differnet root ports.
>>> So with your caller generated dma_addr_t everything works fine with
>>> a switched setup as the one you are probably testing on, but on a
>>> sufficiently complicated setup with multiple root ports it can break.
>>
>> I don't follow this argument. Yes, I understand PCI Bus offsets and yes
>> I understand that they only apply beyond the bus they're working with.
>> But this isn't *that* complicated and it should be the responsibility of
>> the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t
>> that's passed through the block layer could be a bus address or it could
>> be the result of a dma_map_* request (if the transaction is found to go
>> through an RC) depending on the requirements of the devices being used.
> 
> You assume all addressing is done by the PCI bus address.  If a device
> is addressing its own BAR there is no reason to use the PCI bus address,
> as it might have much more intelligent schemes (usually bar + offset).

Yes, that will be a bit tricky regardless of what we do.

>>> Also duplicating the whole block I/O stack, including hooks all over
>>> the fast path is pretty much a no-go.
>>
>> There was very little duplicate code in the patch set. (Really just the
>> mapping code). There are a few hooks, but in practice not that many if
>> we ignore the WARN_ONs. We might be able to work to reduce this further.
>> The main hooks are: when we skip bouncing, when we skip integrity prep,
>> when we split, and when we map. And the patchset drops the PCI_P2PDMA
>> hook when we map. So we're talking about maybe three or four extra ifs
>> that would likely normally be fast due to the branch predictor.
> 
> And all of those add code to the block layer fast path.

If we can't add any ifs to the block layer, there's really nothing we
can do.

So then we're committed to using struct page for P2P?

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-25 15:57       ` Logan Gunthorpe
@ 2019-06-25 17:01         ` Christoph Hellwig
  2019-06-25 19:54           ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-25 17:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote:
> > You assume all addressing is done by the PCI bus address.  If a device
> > is addressing its own BAR there is no reason to use the PCI bus address,
> > as it might have much more intelligent schemes (usually bar + offset).
> 
> Yes, that will be a bit tricky regardless of what we do.

At least right now it isn't at all.  I've implemented support for
a draft NVMe proposal for that, and it basically boils down to this
in the p2p path:

	addr = sg_phys(sg);

	if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING) {
		addr -= ctrl->cmb_start_addr;

		// set magic flag in the SGL
	} else {
		addr -= pgmap->pci_p2pdma_bus_offset;
	}

without the pagemap it would require a range compare instead, which
isn't all that hard either.
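
E.g. roughly like this (cmb_end_addr and the bus-offset lookup are
made up for the sake of the sketch):

	addr = sg_phys(sg);

	if (HAS_RELATIVE_ADDRESSING &&
	    addr >= ctrl->cmb_start_addr && addr < ctrl->cmb_end_addr) {
		addr -= ctrl->cmb_start_addr;

		// set magic flag in the SGL
	} else {
		addr -= p2pdma_bus_offset_for(addr);	// hypothetical lookup
	}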

> >>> Also duplicating the whole block I/O stack, including hooks all over
> >>> the fast path is pretty much a no-go.
> >>
> >> There was very little duplicate code in the patch set. (Really just the
> >> mapping code). There are a few hooks, but in practice not that many if
> >> we ignore the WARN_ONs. We might be able to work to reduce this further.
> >> The main hooks are: when we skip bouncing, when we skip integrity prep,
> >> when we split, and when we map. And the patchset drops the PCI_P2PDMA
> >> hook when we map. So we're talking about maybe three or four extra ifs
> >> that would likely normally be fast due to the branch predictor.
> > 
> > And all of those add code to the block layer fast path.
> 
> If we can't add any ifs to the block layer, there's really nothing we
> can do.

That is not what I said.  Of course we can.  But we'd better have a
really good reason.  And adding a parallel I/O path violating the
high-level model is not one.

> So then we're committed to using struct page for P2P?

Only until we have a significantly better solution.  And I think
using a physical address in some form instead of pages is that;
adding a parallel path with dma_addr_t is not, and it actually is worse
than the current code in many respects.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-25 17:01         ` Christoph Hellwig
@ 2019-06-25 19:54           ` Logan Gunthorpe
  2019-06-26  6:57             ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-25 19:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Bjorn Helgaas, Dan Williams, Sagi Grimberg,
	Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-25 11:01 a.m., Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote:
>>> You assume all addressing is done by the PCI bus address.  If a device
>>> is addressing its own BAR there is no reason to use the PCI bus address,
>>> as it might have much more intelligent schemes (usually bar + offset).
>>
>> Yes, that will be a bit tricky regardless of what we do.
> 
> At least right now it isn't at all.  I've implemented support for
> a draft NVMe proposal for that, and it basically boils down to this
> in the p2p path:
> 
> 	addr = sg_phys(sg);
> 
> 	if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING)
> 		addr -= ctrl->cmb_start_addr;
> 
> 		// set magic flag in the SGL
> 	} else {
> 		addr -= pgmap->pci_p2pdma_bus_offset;
> 	}
> 
> without the pagemap it would require a range compare instead, which
> isn't all that hard either.
> 
>>>>> Also duplicating the whole block I/O stack, including hooks all over
>>>>> the fast path is pretty much a no-go.
>>>>
>>>> There was very little duplicate code in the patch set. (Really just the
>>>> mapping code). There are a few hooks, but in practice not that many if
>>>> we ignore the WARN_ONs. We might be able to work to reduce this further.
>>>> The main hooks are: when we skip bouncing, when we skip integrity prep,
>>>> when we split, and when we map. And the patchset drops the PCI_P2PDMA
>>>> hook when we map. So we're talking about maybe three or four extra ifs
>>>> that would likely normally be fast due to the branch predictor.
>>>
>>> And all of those add code to the block layer fast path.
>>
>> If we can't add any ifs to the block layer, there's really nothing we
>> can do.
> 
> That is not what I said.  Of course we can.  But we rather have a
> really good reason.  And adding a parallel I/O path violating the
> highlevel model is not one.
> 
>> So then we're committed to using struct page for P2P?
> 
> Only until we have a significantly better soltution.  And I think
> using physical address in some form instead of pages is that,
> adding a parallel path with dma_addr_t is not, it actually is worse
> than the current code in many respects.

Well, whether it's dma_addr_t, phys_addr_t, or pfn_t, the result isn't all
that different. You still need roughly the same 'if' hooks for any
backed memory that isn't in the linear mapping and you can't get a
kernel mapping for directly.

It wouldn't be too hard to do a similar patch set that uses something
like phys_addr_t instead and have a request and queue flag for support
of non-mappable memory. But you'll end up with very similar 'if' hooks
and we'd have to clean up all bio-using drivers that access the struct
pages directly.

Though, we'd also still have the problem of how to recognize when the
address points to P2PDMA and needs to be translated to the bus offset.
The map-first inversion was what helped here because the driver
submitting the requests had all the information. Though it could be
another request flag and indicating non-mappable memory could be a flag
group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.

If you think any of the above ideas sound workable I'd be happy to try
to code up another prototype.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-25 19:54           ` Logan Gunthorpe
@ 2019-06-26  6:57             ` Christoph Hellwig
  2019-06-26 18:31               ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-26  6:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote:
> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all
> that different. You still need roughly the same 'if' hooks for any
> backed memory that isn't in the linear mapping and you can't get a
> kernel mapping for directly.
> 
> It wouldn't be too hard to do a similar patch set that uses something
> like phys_addr_t instead and have a request and queue flag for support
> of non-mappable memory. But you'll end up with very similar 'if' hooks
> and we'd have to clean up all bio-using drivers that access the struct
> pages directly.

We'll need to clean that mess up anyway, and I've been chugging
along doing some of that.  A lot still assume no highmem, so we need
to convert them over to something that kmaps anyway.  If we get
the abstraction right that will actually help converting over to
a better representation.

> Though, we'd also still have the problem of how to recognize when the
> address points to P2PDMA and needs to be translated to the bus offset.
> The map-first inversion was what helped here because the driver
> submitting the requests had all the information. Though it could be
> another request flag and indicating non-mappable memory could be a flag
> group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.

That assumes the whole request is the same kind of memory, which is a
simplifying assumption.  My idea was that if we had our new bio_vec like this:

struct bio_vec {
	phys_addr_t		paddr; // 64-bit on 64-bit systems
	unsigned long		len;
};

we have a hole behind len where we could store a flag.  Preferably
optionally based on a P2P or other magic memory types config
option so that 32-bit systems with 32-bit phys_addr_t actually
benefit from the smaller and better packing structure.
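
To sketch it out (the config symbol and the flags field are invented,
and len is shrunk to 32 bits like the current bv_len so the padding
actually exists):

	struct bio_vec {
		phys_addr_t		paddr;	/* 64-bit on 64-bit systems */
		unsigned int		len;
	#ifdef CONFIG_BLOCK_SPECIAL_MEMORY	/* invented config option */
		unsigned int		flags;	/* lives in what would be padding */
	#endif
	};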

> If you think any of the above ideas sound workable I'd be happy to try
> to code up another prototype.

It sounds workable.  Some of the first steps are cleanups independent
of how the bio_vec is eventually going to look.  That is making
the DMA-API internals work on the phys_addr_t, which also unifies the
map_resource implementation with map_page.  I plan to do that relatively
soon.  The next is sorting out access to bio data by virtual address.
All these need nice kmapping helpers that avoid too much open coding.
I was going to look into that next, mostly to kill the block layer
bounce buffering code.  Similar things will also be needed at the
scatterlist level I think.  After that we need more audits of
how bv_page is still used.  Something like a bv_phys() helper that
does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy
for example.
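
I.e. something along the lines of:

	/* possible shape of the helper mentioned above */
	static inline phys_addr_t bv_phys(const struct bio_vec *bv)
	{
		return page_to_phys(bv->bv_page) + bv->bv_offset;
	}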

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26  6:57             ` Christoph Hellwig
@ 2019-06-26 18:31               ` Logan Gunthorpe
  2019-06-26 20:21                 ` Jason Gunthorpe
  2019-06-27  9:01                 ` Christoph Hellwig
  0 siblings, 2 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-26 18:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, linux-nvme, linux-pci, linux-rdma,
	Jens Axboe, Bjorn Helgaas, Dan Williams, Sagi Grimberg,
	Keith Busch, Jason Gunthorpe, Stephen Bates



On 2019-06-26 12:57 a.m., Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote:
>> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all
>> that different. You still need roughly the same 'if' hooks for any
>> backed memory that isn't in the linear mapping and you can't get a
>> kernel mapping for directly.
>>
>> It wouldn't be too hard to do a similar patch set that uses something
>> like phys_addr_t instead and have a request and queue flag for support
>> of non-mappable memory. But you'll end up with very similar 'if' hooks
>> and we'd have to clean up all bio-using drivers that access the struct
>> pages directly.
> 
> We'll need to clean that mess up anyway, and I've been chugging
> along doing some of that.  A lot still assume no highmem, so we need
> to convert them over to something that kmaps anyway.  If we get
> the abstraction right that will actually help converting over to
> a better reprsentation.
> 
>> Though, we'd also still have the problem of how to recognize when the
>> address points to P2PDMA and needs to be translated to the bus offset.
>> The map-first inversion was what helped here because the driver
>> submitting the requests had all the information. Though it could be
>> another request flag and indicating non-mappable memory could be a flag
>> group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.
> 
> The assumes the request all has the same memory, which is a simplifing
> assuption.  My idea was that if had our new bio_vec like this:
> 
> struct bio_vec {
> 	phys_addr_t		paddr; // 64-bit on 64-bit systems
> 	unsigned long		len;
> };
> 
> we have a hole behind len where we could store flag.  Preferably
> optionally based on a P2P or other magic memory types config
> option so that 32-bit systems with 32-bit phys_addr_t actually
> benefit from the smaller and better packing structure.

That seems sensible. The one thing that's unclear though is how to get
the PCI Bus address when appropriate. Can we pass that in instead of the
phys_addr with an appropriate flag? Or will we need to pass the actual
physical address and then, at the map step, the driver has to somehow
look up the PCI device to figure out the bus offset?

>> If you think any of the above ideas sound workable I'd be happy to try
>> to code up another prototype.
> 
> Іt sounds workable.  To some of the first steps are cleanups independent
> of how the bio_vec is eventually going to look like.  That is making
> the DMA-API internals work on the phys_addr_t, which also unifies the
> map_resource implementation with map_page.  I plan to do that relatively
> soon.  The next is sorting out access to bios data by virtual address.
> All these need nice kmapping helper that avoid too much open coding.
> I was going to look into that next, mostly to kill the block layer
> bounce buffering code.  Similar things will also be needed at the
> scatterlist level I think.  After that we need to more audits of
> how bv_page is still used.  something like a bv_phys() helper that
> does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy
> for example.

Ok, I should be able to help with that. When I have a chance I'll try to
look at the bv_phys() helper.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 18:31               ` Logan Gunthorpe
@ 2019-06-26 20:21                 ` Jason Gunthorpe
  2019-06-26 20:39                   ` Dan Williams
  2019-06-26 20:45                   ` Logan Gunthorpe
  2019-06-27  9:01                 ` Christoph Hellwig
  1 sibling, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-26 20:21 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > we have a hole behind len where we could store flag.  Preferably
> > optionally based on a P2P or other magic memory types config
> > option so that 32-bit systems with 32-bit phys_addr_t actually
> > benefit from the smaller and better packing structure.
> 
> That seems sensible. The one thing that's unclear though is how to get
> the PCI Bus address when appropriate. Can we pass that in instead of the
> phys_addr with an appropriate flag? Or will we need to pass the actual
> physical address and then, at the map step, the driver has to some how
> lookup the PCI device to figure out the bus offset?

I agree with CH, if we go down this path it is a layering violation
for the thing injecting bio's into the block stack to know what struct
device they egress&dma map on just to be able to do the dma_map up
front.

So we must be able to go from this new phys_addr_t&flags to some BAR
information during dma_map.

For instance we could use a small hash table of the upper phys addr
bits, or an interval tree, to do the lookup.

The bar info would give the exporting struct device and any other info
we need to make the iommu mapping.
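
For example, building on the existing interval tree library, something
roughly like this (all the names here are made up):

	struct p2p_bar_info {
		struct interval_tree_node node;	/* [start, last] covers the BAR */
		struct pci_dev *provider;	/* exporting device */
		u64 bus_offset;
	};

	static struct rb_root_cached p2p_bar_tree = RB_ROOT_CACHED;

	static struct p2p_bar_info *p2p_bar_lookup(phys_addr_t addr)
	{
		struct interval_tree_node *node;

		node = interval_tree_iter_first(&p2p_bar_tree, addr, addr);
		return node ? container_of(node, struct p2p_bar_info, node) : NULL;
	}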

This phys_addr_t seems like a good approach to me as it avoids the
struct page overheads and will lets us provide copy from/to bio
primitives that could work on BAR memory. I think we can surely use
this approach in RDMA as well.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:21                 ` Jason Gunthorpe
@ 2019-06-26 20:39                   ` Dan Williams
  2019-06-26 20:54                     ` Jason Gunthorpe
  2019-06-26 20:55                     ` Logan Gunthorpe
  2019-06-26 20:45                   ` Logan Gunthorpe
  1 sibling, 2 replies; 89+ messages in thread
From: Dan Williams @ 2019-06-26 20:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Christoph Hellwig, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > > we have a hole behind len where we could store flag.  Preferably
> > > optionally based on a P2P or other magic memory types config
> > > option so that 32-bit systems with 32-bit phys_addr_t actually
> > > benefit from the smaller and better packing structure.
> >
> > That seems sensible. The one thing that's unclear though is how to get
> > the PCI Bus address when appropriate. Can we pass that in instead of the
> > phys_addr with an appropriate flag? Or will we need to pass the actual
> > physical address and then, at the map step, the driver has to some how
> > lookup the PCI device to figure out the bus offset?
>
> I agree with CH, if we go down this path it is a layering violation
> for the thing injecting bio's into the block stack to know what struct
> device they egress&dma map on just to be able to do the dma_map up
> front.
>
> So we must be able to go from this new phys_addr_t&flags to some BAR
> information during dma_map.
>
> For instance we could use a small hash table of the upper phys addr
> bits, or an interval tree, to do the lookup.

Hmm, that sounds like dev_pagemap without the pages.

There's already no requirement that dev_pagemap point to real /
present pages (DEVICE_PRIVATE), so it seems a straightforward extension to
use it for helping coordinate phys_addr_t in 'struct bio'. Then
Logan's future plans to let userspace coordinate p2p operations could
build on PTE_DEVMAP.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:21                 ` Jason Gunthorpe
  2019-06-26 20:39                   ` Dan Williams
@ 2019-06-26 20:45                   ` Logan Gunthorpe
  2019-06-26 21:00                     ` Jason Gunthorpe
  2019-06-27  9:08                     ` Christoph Hellwig
  1 sibling, 2 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-26 20:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>> we have a hole behind len where we could store flag.  Preferably
>>> optionally based on a P2P or other magic memory types config
>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>> benefit from the smaller and better packing structure.
>>
>> That seems sensible. The one thing that's unclear though is how to get
>> the PCI Bus address when appropriate. Can we pass that in instead of the
>> phys_addr with an appropriate flag? Or will we need to pass the actual
>> physical address and then, at the map step, the driver has to some how
>> lookup the PCI device to figure out the bus offset?
> 
> I agree with CH, if we go down this path it is a layering violation
> for the thing injecting bio's into the block stack to know what struct
> device they egress&dma map on just to be able to do the dma_map up
> front.

Not sure I agree with this statement. The p2pdma code already *must*
know and access the pci_dev of the dma device ahead of when it submits
the IO to know if it's valid to allocate and use P2P memory at all. This
is why the submitting driver has a lot of the information needed to map
this memory that the mapping driver does not.

> So we must be able to go from this new phys_addr_t&flags to some BAR
> information during dma_map.

> For instance we could use a small hash table of the upper phys addr
> bits, or an interval tree, to do the lookup.

Yes, if we're going to take a hard stance on this. But using an interval
tree (or similar) is a lot more work for the CPU to figure out these
mappings that may not be strictly necessary if we could just pass better
information down from the submitting driver to the mapping driver.

> The bar info would give the exporting struct device and any other info
> we need to make the iommu mapping.

Well, the IOMMU mapping is the normal thing the mapping driver will
always do. We'd really just need the submitting driver to, when
appropriate, inform the mapping driver that this is a pci bus address
and not to call dma_map_xxx(). Then, for special mappings for the CMB
like Christoph is talking about, it's simply a matter of doing a range
compare on the PCI Bus address and converting the bus address to a BAR
and offset.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:39                   ` Dan Williams
@ 2019-06-26 20:54                     ` Jason Gunthorpe
  2019-06-26 20:55                     ` Logan Gunthorpe
  1 sibling, 0 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-26 20:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Christoph Hellwig, Linux Kernel Mailing List,
	linux-block, linux-nvme, linux-pci, linux-rdma, Jens Axboe,
	Bjorn Helgaas, Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 01:39:01PM -0700, Dan Williams wrote:
> Hmm, that sounds like dev_pagemap without the pages.

Yes, and other page related overhead. Maybe both ideas can exist in
the pagemap code?

All that is needed here is to map a bar phys_addr_t to some 'bar info'
that helps the mapping.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:39                   ` Dan Williams
  2019-06-26 20:54                     ` Jason Gunthorpe
@ 2019-06-26 20:55                     ` Logan Gunthorpe
  1 sibling, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-26 20:55 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-26 2:39 p.m., Dan Williams wrote:
> On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>>> we have a hole behind len where we could store flag.  Preferably
>>>> optionally based on a P2P or other magic memory types config
>>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>>> benefit from the smaller and better packing structure.
>>>
>>> That seems sensible. The one thing that's unclear though is how to get
>>> the PCI Bus address when appropriate. Can we pass that in instead of the
>>> phys_addr with an appropriate flag? Or will we need to pass the actual
>>> physical address and then, at the map step, the driver has to some how
>>> lookup the PCI device to figure out the bus offset?
>>
>> I agree with CH, if we go down this path it is a layering violation
>> for the thing injecting bio's into the block stack to know what struct
>> device they egress&dma map on just to be able to do the dma_map up
>> front.
>>
>> So we must be able to go from this new phys_addr_t&flags to some BAR
>> information during dma_map.
>>
>> For instance we could use a small hash table of the upper phys addr
>> bits, or an interval tree, to do the lookup.
> 
> Hmm, that sounds like dev_pagemap without the pages.

Yup, that's why I'd like to avoid it, but IMO it would still be an
improvement to use a interval tree over struct pages because without
struct page we just have a range and a length and it's relatively easy
to check that the whole range belongs to a specific pci_dev. To be
correct with the struct page approach we really have to loop through all
pages to ensure they all belong to the same pci_dev, which is a big pain.

> There's already no requirement that dev_pagemap point to real /
> present pages (DEVICE_PRIVATE) seems a straightforward extension to
> use it for helping coordinate phys_addr_t in 'struct bio'. Then
> Logan's future plans to let userspace coordinate p2p operations could
> build on PTE_DEVMAP.

Well I think the biggest difficulty with struct page for user space is
dealing with cases when the struct pages of different types get mixed
together (or even struct pages that are all P2P pages but from different
PCI devices). We'd have to go through each page and ensure that each
type gets its own bio_vec with appropriate flags.

Though really, the whole mixed IO from userspace poses a bunch of
problems. I'd prefer to just be able to say that a single IO can be all
or nothing P2P memory from a single device.

Logan


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:45                   ` Logan Gunthorpe
@ 2019-06-26 21:00                     ` Jason Gunthorpe
  2019-06-26 21:18                       ` Logan Gunthorpe
  2019-06-27  9:08                     ` Christoph Hellwig
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-26 21:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
> > On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> >>> we have a hole behind len where we could store flag.  Preferably
> >>> optionally based on a P2P or other magic memory types config
> >>> option so that 32-bit systems with 32-bit phys_addr_t actually
> >>> benefit from the smaller and better packing structure.
> >>
> >> That seems sensible. The one thing that's unclear though is how to get
> >> the PCI Bus address when appropriate. Can we pass that in instead of the
> >> phys_addr with an appropriate flag? Or will we need to pass the actual
> >> physical address and then, at the map step, the driver has to some how
> >> lookup the PCI device to figure out the bus offset?
> > 
> > I agree with CH, if we go down this path it is a layering violation
> > for the thing injecting bio's into the block stack to know what struct
> > device they egress&dma map on just to be able to do the dma_map up
> > front.
> 
> Not sure I agree with this statement. The p2pdma code already *must*
> know and access the pci_dev of the dma device ahead of when it submits
> the IO to know if it's valid to allocate and use P2P memory at all.

I don't think we should make drivers do that. What if it got CMB memory
on some other device?

> > For instance we could use a small hash table of the upper phys addr
> > bits, or an interval tree, to do the lookup.
> 
> Yes, if we're going to take a hard stance on this. But using an interval
> tree (or similar) is a lot more work for the CPU to figure out these
> mappings that may not be strictly necessary if we could just pass better
> information down from the submitting driver to the mapping driver.

Right, this is coming down to an optimization argument. I think there
are very few cases (Basically yours) where the caller will know this
info, so we need to support the other cases anyhow.

I think with some simple caching this will become negligible for cases
you care about

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 21:00                     ` Jason Gunthorpe
@ 2019-06-26 21:18                       ` Logan Gunthorpe
  2019-06-27  6:32                         ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-26 21:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-26 3:00 p.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
>>> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>>>> we have a hole behind len where we could store flag.  Preferably
>>>>> optionally based on a P2P or other magic memory types config
>>>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>>>> benefit from the smaller and better packing structure.
>>>>
>>>> That seems sensible. The one thing that's unclear though is how to get
>>>> the PCI Bus address when appropriate. Can we pass that in instead of the
>>>> phys_addr with an appropriate flag? Or will we need to pass the actual
>>>> physical address and then, at the map step, the driver has to some how
>>>> lookup the PCI device to figure out the bus offset?
>>>
>>> I agree with CH, if we go down this path it is a layering violation
>>> for the thing injecting bio's into the block stack to know what struct
>>> device they egress&dma map on just to be able to do the dma_map up
>>> front.
>>
>> Not sure I agree with this statement. The p2pdma code already *must*
>> know and access the pci_dev of the dma device ahead of when it submits
>> the IO to know if it's valid to allocate and use P2P memory at all.
> 
> I don't think we should make drives do that. What if it got CMB memory
> on some other device?

Huh? A driver submitting P2P requests finds appropriate memory to use
based on the DMA device that will be doing the mapping. It *has* to. It
doesn't necessarily have control over which P2P provider it might find
(ie. it may get CMB memory from a random NVMe device), but it easily
knows the NVMe device it got the CMB memory for. Look at the existing
code in the nvme target.

>>> For instance we could use a small hash table of the upper phys addr
>>> bits, or an interval tree, to do the lookup.
>>
>> Yes, if we're going to take a hard stance on this. But using an interval
>> tree (or similar) is a lot more work for the CPU to figure out these
>> mappings that may not be strictly necessary if we could just pass better
>> information down from the submitting driver to the mapping driver.
> 
> Right, this is coming down to an optimization argument. I think there
> are very few cases (Basically yours) where the caller will know this
> info, so we need to support the other cases anyhow.

I disagree. I think it has to be a common pattern. A driver doing a P2P
transaction *must* find some device to obtain memory from (or it may be
itself)  and check if it is compatible with the device that's going to
be mapping the memory or vice versa. So no matter what we do, a driver
submitting P2P requests must have access to both the PCI device that's
going to be mapping the memory and the device that's providing the memory.

> I think with some simple caching this will become negligible for cases
> you care about

Well *maybe* it will be negligible performance-wise, but it's also a lot
more complicated, code-wise. Tree lookups will always be a lot more
expensive than just checking a flag.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 21:18                       ` Logan Gunthorpe
@ 2019-06-27  6:32                         ` Jason Gunthorpe
  2019-06-27 16:09                           ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-27  6:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
> > I don't think we should make drives do that. What if it got CMB memory
> > on some other device?
> 
> Huh? A driver submitting P2P requests finds appropriate memory to use
> based on the DMA device that will be doing the mapping. It *has* to. It
> doesn't necessarily have control over which P2P provider it might find
> (ie. it may get CMB memory from a random NVMe device), but it easily
> knows the NVMe device it got the CMB memory for. Look at the existing
> code in the nvme target.

No, this is all thinking about things from the CMB perspective. With CMB
you don't care about the BAR location because it is just a temporary
buffer. That is a unique use model.

Every other case has data residing in BAR memory that can really only
reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
against that is run it should succeed, even if that means bounce
buffering the IO - as the user has really asked for this transfer to
happen.

We certainly don't get to generally pick where the data resides before
starting the IO, that luxury is only for CMB.

> > I think with some simple caching this will become negligible for cases
> > you care about
> 
> Well *maybe* it will be negligible performance wise, but it's also a lot
> more complicated, code wise. Tree lookups will always be a lot more
> expensive than just checking a flag.

Interval trees are pretty simple API-wise, and if we only populate
them with P2P providers you'll probably find the tree depth is negligible
in current systems with one or two P2P providers.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 18:31               ` Logan Gunthorpe
  2019-06-26 20:21                 ` Jason Gunthorpe
@ 2019-06-27  9:01                 ` Christoph Hellwig
  1 sibling, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-27  9:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Jason Gunthorpe, Stephen Bates

On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > we have a hole behind len where we could store flag.  Preferably
> > optionally based on a P2P or other magic memory types config
> > option so that 32-bit systems with 32-bit phys_addr_t actually
> > benefit from the smaller and better packing structure.
> 
> That seems sensible. The one thing that's unclear though is how to get
> the PCI Bus address when appropriate. Can we pass that in instead of the
> phys_addr with an appropriate flag? Or will we need to pass the actual
> physical address and then, at the map step, the driver has to somehow
> look up the PCI device to figure out the bus offset?

Yes, I think we'll need a lookup mechanism of some kind.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-26 20:45                   ` Logan Gunthorpe
  2019-06-26 21:00                     ` Jason Gunthorpe
@ 2019-06-27  9:08                     ` Christoph Hellwig
  2019-06-27 16:30                       ` Logan Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-27  9:08 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-kernel, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas,
	Dan Williams, Sagi Grimberg, Keith Busch, Stephen Bates

On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
> > The bar info would give the exporting struct device and any other info
> > we need to make the iommu mapping.
> 
> Well, the IOMMU mapping is the normal thing the mapping driver will
> always do. We'd really just need the submitting driver to, when
> appropriate, inform the mapping driver that this is a pci bus address
> and not to call dma_map_xxx(). Then, for special mappings for the CMB
> like Christoph is talking about, it's simply a matter of doing a range
> compare on the PCI Bus address and converting the bus address to a BAR
> and offset.

Well, range compare on the physical address.  We have a few different
options here:

 (a) a range is normal RAM, DMA mapping works as usual
 (b) a range is another device's BAR, in which case we need to do a
     map_resource equivalent (which really just means don't bother with
     cache flush on non-coherent architectures) and apply any needed
     offset, fixed or iommu based
 (c) a range points to a BAR on the acting device. In which case we
     don't need to DMA map at all, because no dma is happening but just an
     internal transfer.  And depending on the device that might also require
     a different addressing mode

I guess it might make sense to just have a block layer flag indicating
that (b) or (c) might be contained in a bio.  Then we always look up the
data structure, but can still fall back to (a) if nothing was found.
That even allows free mixing and matching of memory types, at least as
long as they are contained to separate bio_vec segments.
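
As a rough sketch of what the mapping side could then do -- p2p_range and
p2p_range_lookup() are made up here, and case (c) is simply punted back to
the caller:

static dma_addr_t map_phys_range(struct device *dma_dev, phys_addr_t addr,
                                 size_t len, enum dma_data_direction dir)
{
        struct p2p_range *r = p2p_range_lookup(addr);   /* hypothetical */

        if (!r)                                 /* (a): normal RAM */
                return dma_map_page(dma_dev, phys_to_page(addr),
                                    offset_in_page(addr), len, dir);

        if (r->provider == dma_dev)             /* (c): our own BAR */
                return DMA_MAPPING_ERROR;       /* caller does an internal xfer */

        /* (b): peer BAR, map_resource equivalent plus any needed offset */
        return dma_map_resource(dma_dev, addr, len, dir, 0);
}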

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27  6:32                         ` Jason Gunthorpe
@ 2019-06-27 16:09                           ` Logan Gunthorpe
  2019-06-27 16:35                             ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 16:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
>>> I don't think we should make drives do that. What if it got CMB memory
>>> on some other device?
>>
>> Huh? A driver submitting P2P requests finds appropriate memory to use
>> based on the DMA device that will be doing the mapping. It *has* to. It
>> doesn't necessarily have control over which P2P provider it might find
>> (ie. it may get CMB memory from a random NVMe device), but it easily
>> knows the NVMe device it got the CMB memory for. Look at the existing
>> code in the nvme target.
> 
> No, this all thinking about things from the CMB perspective. With CMB
> you don't care about the BAR location because it is just a temporary
> buffer. That is a unique use model.
> 
> Every other case has data residing in BAR memory that can really only
> reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
> against that is run it should succeed, even if that means bounce
> buffering the IO - as the user has really asked for this transfer to
> happen.
> 
> We certainly don't get to generally pick where the data resides before
> starting the IO, that luxury is only for CMB.

I disagree. If we were going to implement a "bounce" we'd probably want
to do it in two DMA requests. So the GPU/FPGA driver would first decide
whether it can do it P2P directly and, if it can't, would want to submit
a DMA request to copy the data to host memory and then submit an IO
normally to the data's final destination.

I think it would be a larger layering violation to have the NVMe driver
(for example) memcpy data off a GPU's BAR during a dma_map step to
support this bouncing. And it's even crazier to expect a DMA transfer to
be set up in the map step.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27  9:08                     ` Christoph Hellwig
@ 2019-06-27 16:30                       ` Logan Gunthorpe
  2019-06-27 17:00                         ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 16:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-27 3:08 a.m., Christoph Hellwig wrote:
> On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
>>> The bar info would give the exporting struct device and any other info
>>> we need to make the iommu mapping.
>>
>> Well, the IOMMU mapping is the normal thing the mapping driver will
>> always do. We'd really just need the submitting driver to, when
>> appropriate, inform the mapping driver that this is a pci bus address
>> and not to call dma_map_xxx(). Then, for special mappings for the CMB
>> like Christoph is talking about, it's simply a matter of doing a range
>> compare on the PCI Bus address and converting the bus address to a BAR
>> and offset.
> 
> Well, range compare on the physical address.  We have a few different
> options here:
> 
>  (a) a range is normal RAM, DMA mapping works as usual
>  (b) a range is another devices BAR, in which case we need to do a
>      map_resource equivalent (which really just means don't bother with
>      cache flush on non-coherent architectures) and apply any needed
>      offset, fixed or iommu based

Well I would split this into two cases: (b1) ranges in another device's
BAR that will pass through the root complex and require a map_resource
equivalent, and (b2) ranges in another device's BAR that don't pass
through the root complex and require applying an offset to the bus
address. Both require rather different handling and the submitting
driver should already know ahead of time what type we have.

>  (c) a range points to a BAR on the acting device. In which case we
>      don't need to DMA map at all, because no dma is happening but just an
>      internal transfer.  And depending on the device that might also require
>      a different addressing mode

I think (c) is actually just a special case of (b2). Any device that has
a special protocol for addressing the local BAR can just do a range
compare on the address to determine if it's local or not. Devices that
don't have a special protocol for this would handle both (c) and (b2)
the same.

> I guess it might make sense to just have a block layer flag that (b) or
> (c) might be contained in a bio.  Then we always look up the data
> structure, but can still fall back to (a) if nothing was found.  That
> even allows free mixing and matching of memory types, at least as long
> as they are contained to separate bio_vec segments.

IMO these three cases should be reflected in flags in the bio_vec. We'd
probably still need a queue flag to indicate support for mapping these,
but a flag on the bio that only indicates special cases *might* exist in
the bio_vec, leaving the driver to do extra work to somehow distinguish
the three types, doesn't seem useful. bio_vec flags also make it easy to
support mixing segments from different memory types.
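
To be explicit about what I mean (only a sketch; the field and flag names
are invented and I'm ignoring the 32-bit packing question for now):

/* hypothetical bio_vec flag bits */
#define BVEC_MAP_BUS_ADDR       (1 << 0) /* bv_addr is already a PCI bus address */
#define BVEC_MAP_RESOURCE       (1 << 1) /* peer BAR through the host bridge,
                                          * map with dma_map_resource() */

struct bio_vec {
        phys_addr_t     bv_addr;        /* replaces bv_page + bv_offset */
        unsigned int    bv_len;
        unsigned int    bv_flags;       /* 0 == regular RAM, case (a) */
};

So no flags means case (a), BVEC_MAP_RESOURCE means (b1), and
BVEC_MAP_BUS_ADDR covers (b2) and (c).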

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 16:09                           ` Logan Gunthorpe
@ 2019-06-27 16:35                             ` Jason Gunthorpe
  2019-06-27 16:49                               ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-27 16:35 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
> > On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
> >>> I don't think we should make drives do that. What if it got CMB memory
> >>> on some other device?
> >>
> >> Huh? A driver submitting P2P requests finds appropriate memory to use
> >> based on the DMA device that will be doing the mapping. It *has* to. It
> >> doesn't necessarily have control over which P2P provider it might find
> >> (ie. it may get CMB memory from a random NVMe device), but it easily
> >> knows the NVMe device it got the CMB memory for. Look at the existing
> >> code in the nvme target.
> > 
> > No, this all thinking about things from the CMB perspective. With CMB
> > you don't care about the BAR location because it is just a temporary
> > buffer. That is a unique use model.
> > 
> > Every other case has data residing in BAR memory that can really only
> > reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
> > against that is run it should succeed, even if that means bounce
> > buffering the IO - as the user has really asked for this transfer to
> > happen.
> > 
> > We certainly don't get to generally pick where the data resides before
> > starting the IO, that luxury is only for CMB.
> 
> I disagree. If we we're going to implement a "bounce" we'd probably want
> to do it in two DMA requests.

How do you mean?

> So the GPU/FPGA driver would first decide whether it can do it P2P
> directly and, if it can't, would want to submit a DMA request copy
> the data to host memory and then submit an IO normally to the data's
> final destination.

I don't think a GPU/FPGA driver will be involved; this would enter the
block layer through the O_DIRECT path or something generic. This is the
general flow I was suggesting to Dan earlier.

> I think it would be a larger layering violation to have the NVMe driver
> (for example) memcpy data off a GPU's bar during a dma_map step to
> support this bouncing. And it's even crazier to expect a DMA transfer to
> be setup in the map step.

Why? Don't we already expect the DMA mapper to handle bouncing for
lots of cases, how is this case different? This is the best place to
place it to make it shared.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 16:35                             ` Jason Gunthorpe
@ 2019-06-27 16:49                               ` Logan Gunthorpe
  2019-06-28  4:57                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 16:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-27 10:35 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
>>> On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
>>>>> I don't think we should make drives do that. What if it got CMB memory
>>>>> on some other device?
>>>>
>>>> Huh? A driver submitting P2P requests finds appropriate memory to use
>>>> based on the DMA device that will be doing the mapping. It *has* to. It
>>>> doesn't necessarily have control over which P2P provider it might find
>>>> (ie. it may get CMB memory from a random NVMe device), but it easily
>>>> knows the NVMe device it got the CMB memory for. Look at the existing
>>>> code in the nvme target.
>>>
>>> No, this all thinking about things from the CMB perspective. With CMB
>>> you don't care about the BAR location because it is just a temporary
>>> buffer. That is a unique use model.
>>>
>>> Every other case has data residing in BAR memory that can really only
>>> reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
>>> against that is run it should succeed, even if that means bounce
>>> buffering the IO - as the user has really asked for this transfer to
>>> happen.
>>>
>>> We certainly don't get to generally pick where the data resides before
>>> starting the IO, that luxury is only for CMB.
>>
>> I disagree. If we we're going to implement a "bounce" we'd probably want
>> to do it in two DMA requests.
> 
> How do you mean?
> 
>> So the GPU/FPGA driver would first decide whether it can do it P2P
>> directly and, if it can't, would want to submit a DMA request copy
>> the data to host memory and then submit an IO normally to the data's
>> final destination.
> 
> I don't think a GPU/FPGA driver will be involved, this would enter the
> block layer through the O_DIRECT path or something generic.. This the
> general flow I was suggesting to Dan earlier

I would say the O_DIRECT path has to somehow call into the driver
backing the VMA to get an address for appropriate memory (in some way
vaguely similar to how we were discussing at LSF/MM). If P2P can't be
done at that point, then the provider driver would do the copy to system
memory, in the most appropriate way, and return regular pages for
O_DIRECT to submit to the block device.

>> I think it would be a larger layering violation to have the NVMe driver
>> (for example) memcpy data off a GPU's bar during a dma_map step to
>> support this bouncing. And it's even crazier to expect a DMA transfer to
>> be setup in the map step.
> 
> Why? Don't we already expect the DMA mapper to handle bouncing for
> lots of cases, how is this case different? This is the best place to
> place it to make it shared.

This is different because it's special memory where the DMA mapper can't
possibly know the best way to transfer the data. The best way to
transfer the data is almost certainly going to be a DMA request handled
by the GPU/FPGA. So, one way or another, the GPU/FPGA driver has to be
involved.

One could argue that the hook to the GPU/FPGA driver could be in the
mapping step but then we'd have to do lookups based on an address --
whereas the VMA could more easily have a hook back to whatever driver
exported it.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 16:30                       ` Logan Gunthorpe
@ 2019-06-27 17:00                         ` Christoph Hellwig
  2019-06-27 18:00                           ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-27 17:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Jason Gunthorpe, linux-kernel, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas,
	Dan Williams, Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 27, 2019 at 10:30:42AM -0600, Logan Gunthorpe wrote:
> >  (a) a range is normal RAM, DMA mapping works as usual
> >  (b) a range is another devices BAR, in which case we need to do a
> >      map_resource equivalent (which really just means don't bother with
> >      cache flush on non-coherent architectures) and apply any needed
> >      offset, fixed or iommu based
> 
> Well I would split this into two cases: (b1) ranges in another device's
> BAR that will pass through the root complex and require a map_resource
> equivalent and (b2) ranges in another device's bar that don't pass
> through the root complex and require applying an offset to the bus
> address. Both require rather different handling and the submitting
> driver should already know ahead of time what type we have.

True.

> 
> >  (c) a range points to a BAR on the acting device. In which case we
> >      don't need to DMA map at all, because no dma is happening but just an
> >      internal transfer.  And depending on the device that might also require
> >      a different addressing mode
> 
> I think (c) is actually just a special case of (b2). Any device that has
> a special protocol for addressing the local BAR can just do a range
> compare on the address to determine if it's local or not. Devices that
> don't have a special protocol for this would handle both (c) and (b2)
> the same.

It is not.  (c) is fundamentally very different as it is not actually
an operation that ever goes out to the wire at all, which is why the
actual physical address on the wire does not matter at all.
Some interfaces like NVMe have designed it in a way that the commands
used to do this internal transfer look like (b2), but that is just their
(IMHO very questionable) interface design choice, which produces a whole
chain of problems.

> > I guess it might make sense to just have a block layer flag that (b) or
> > (c) might be contained in a bio.  Then we always look up the data
> > structure, but can still fall back to (a) if nothing was found.  That
> > even allows free mixing and matching of memory types, at least as long
> > as they are contained to separate bio_vec segments.
> 
> IMO these three cases should be reflected in flags in the bio_vec. We'd
> probably still need a queue flag to indicate support for mapping these,
> but a flag on the bio that indicates special cases *might* exist in the
> bio_vec and the driver has to do extra work to somehow distinguish the
> three types doesn't seem useful. bio_vec flags also make it easy to
> support mixing segments from different memory types.

So I initially suggested these flags.  But without a pgmap we absolutely
need a lookup operation to find which phys address ranges map to which
device.  And once we do that, the only thing we need is a flag saying
that we need that information, and everything else can be in the data
structure returned from that lookup.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 17:00                         ` Christoph Hellwig
@ 2019-06-27 18:00                           ` Logan Gunthorpe
  2019-06-28 13:38                             ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 18:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-27 11:00 a.m., Christoph Hellwig wrote:
> It is not.  (c) is fundamentally very different as it is not actually
> an operation that ever goes out to the wire at all, and which is why the
> actual physical address on the wire does not matter at all.
> Some interfaces like NVMe have designed it in a way that it the commands
> used to do this internal transfer look like (b2), but that is just their
> (IMHO very questionable) interface design choice, that produces a whole
> chain of problems.

From the mapping device's driver's perspective yes, but from the
perspective of a submitting driver they would be the same.

>>> I guess it might make sense to just have a block layer flag that (b) or
>>> (c) might be contained in a bio.  Then we always look up the data
>>> structure, but can still fall back to (a) if nothing was found.  That
>>> even allows free mixing and matching of memory types, at least as long
>>> as they are contained to separate bio_vec segments.
>>
>> IMO these three cases should be reflected in flags in the bio_vec. We'd
>> probably still need a queue flag to indicate support for mapping these,
>> but a flag on the bio that indicates special cases *might* exist in the
>> bio_vec and the driver has to do extra work to somehow distinguish the
>> three types doesn't seem useful. bio_vec flags also make it easy to
>> support mixing segments from different memory types.
> 
> So I initially suggested these flags.  But without a pgmap we absolutely
> need a lookup operation to find which phys address ranges map to which
> device.  And once we do that the data structure the only thing we need
> is a flag saying that we need that information, and everything else
> can be in the data structure returned from that lookup.

Yes, you did suggest them. But what I'm trying to suggest is we don't
*necessarily* need the lookup. For demonstration purposes only, a
submitting driver could very roughly potentially do:

struct bio_vec vec;
dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
if (dist < 0) {
     /* use regular memory */
     vec.bv_addr = virt_to_phys(kmalloc(...));
     vec.bv_flags = 0;
} else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
     vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_RESOURCE;
} else {
     vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_BUS_ADDR;
}

-- And a mapping driver would roughly just do:

dma_addr_t dma_addr;
if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
     if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
          /* case (c) */
          /* program the DMA engine with bar and off */
     } else {
          /* case (b2) */
          dma_addr = vec.bv_addr;
     }
} else if (vec.bv_flags & BVEC_MAP_RESOURCE) {
     /* case (b1) */
     dma_addr = dma_map_resource(mapping_dev, vec.bv_addr, ...);
} else {
     /* case (a) */
     dma_addr = dma_map_page(..., phys_to_page(vec.bv_addr), ...);
}

The real difficulty here is that you'd really want all the above handled
by a dma_map_bvec() so it can combine every vector hitting the IOMMU
into a single continuous IOVA -- but it's hard to fit case (c) into that
equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
and (b2) and the mapping driver has to then check each resulting DMA
vector for pci_bus_addr_in_bar() while it is programming the DMA engine
to deal with case (c).
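
In other words, once a hypothetical dma_map_bvec() has produced the
dma_vec entries, the mapping driver's inner loop might look roughly like
this (the device context, queueing helpers and dma_vec field names are all
invented):

static void program_dma_descriptors(struct my_dev *md, struct dma_vec *dvec,
                                    int nents)
{
        int i, bar;
        u64 off;

        for (i = 0; i < nents; i++) {
                if (pci_bus_addr_in_bar(md->pdev, dvec[i].addr, &bar, &off))
                        /* case (c): our own BAR, program an internal transfer */
                        my_dev_queue_internal_copy(md, bar, off, dvec[i].len);
                else
                        /* cases (a)/(b1)/(b2): an ordinary DMA descriptor */
                        my_dev_queue_dma(md, dvec[i].addr, dvec[i].len);
        }
}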

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 16:49                               ` Logan Gunthorpe
@ 2019-06-28  4:57                                 ` Jason Gunthorpe
  2019-06-28 16:22                                   ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-28  4:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote:

> > I don't think a GPU/FPGA driver will be involved, this would enter the
> > block layer through the O_DIRECT path or something generic.. This the
> > general flow I was suggesting to Dan earlier
> 
> I would say the O_DIRECT path has to somehow call into the driver
> backing the VMA to get an address to appropriate memory (in some way
> vaguely similar to how we were discussing at LSF/MM)

Maybe, maybe not. For something like VFIO the PTE already has the
correct phys_addr_t and we don't need to do anything.

For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably
through a new pagemap op?

> If P2P can't be done at that point, then the provider driver would
> do the copy to system memory, in the most appropriate way, and
> return regular pages for O_DIRECT to submit to the block device.

That only makes sense for the migratable DEVICE_PRIVATE case; it
doesn't help the VFIO-like case, where you'd need to bounce buffer.

> >> I think it would be a larger layering violation to have the NVMe driver
> >> (for example) memcpy data off a GPU's bar during a dma_map step to
> >> support this bouncing. And it's even crazier to expect a DMA transfer to
> >> be setup in the map step.
> > 
> > Why? Don't we already expect the DMA mapper to handle bouncing for
> > lots of cases, how is this case different? This is the best place to
> > place it to make it shared.
> 
> This is different because it's special memory where the DMA mapper
> can't possibly know the best way to transfer the data.

Why not?  If we have a 'bar info' structure it could have data
transfer op callbacks -- in fact, I think we might already have similar
callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.

> One could argue that the hook to the GPU/FPGA driver could be in the
> mapping step but then we'd have to do lookups based on an address --
> where as the VMA could more easily have a hook back to whatever driver
> exported it.

The trouble with a VMA hook is that it is only really available when
working with the VA, and it is not actually available during GUP; you
have to have a GUP-like thing such as hmm_range_snapshot that is
specifically VMA based. And it is certainly not available during dma_map.

When working with VMA's/etc it seems there are some good reasons to
drive things off of the PTE content (either via struct page & pgmap or
via phys_addr_t & barmap)

I think the best reason to prefer a uniform phys_addr_t is that it
does give us the option to copy the data to/from CPU memory. That
option goes away as soon as the bio sometimes provides a dma_addr_t.

At least for RDMA, we do have some cases (like siw/rxe, hfi) where
they sometimes need to do that copy. I suspect the block stack is
similar, in the general case.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-27 18:00                           ` Logan Gunthorpe
@ 2019-06-28 13:38                             ` Christoph Hellwig
  2019-06-28 15:54                               ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2019-06-28 13:38 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Jason Gunthorpe, linux-kernel, linux-block,
	linux-nvme, linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas,
	Dan Williams, Sagi Grimberg, Keith Busch, Stephen Bates

On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote:
> > It is not.  (c) is fundamentally very different as it is not actually
> > an operation that ever goes out to the wire at all, and which is why the
> > actual physical address on the wire does not matter at all.
> > Some interfaces like NVMe have designed it in a way that it the commands
> > used to do this internal transfer look like (b2), but that is just their
> > (IMHO very questionable) interface design choice, that produces a whole
> > chain of problems.
> 
> From the mapping device's driver's perspective yes, but from the
> perspective of a submitting driver they would be the same.

With your dma_addr_t scheme it won't be the same, as you'd need
a magic way to generate the internal addressing and stuff it into
the dma_addr_t.  With a phys_addr_t based scheme they should basically
be all the same.

> Yes, you did suggest them. But what I'm trying to suggest is we don't
> *necessarily* need the lookup. For demonstration purposes only, a
> submitting driver could very roughly potentially do:
> 
> struct bio_vec vec;
> dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
> if (dist < 0) {
>      /* use regular memory */
>      vec.bv_addr = virt_to_phys(kmalloc(...));
>      vec.bv_flags = 0;
> } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
>      vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
>      vec.bv_flags = BVEC_MAP_RESOURCE;
> } else {
>      vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
>      vec.bv_flags = BVEC_MAP_BUS_ADDR;
> }

That doesn't look too bad, except..

> -- And a mapping driver would roughly just do:
> 
> dma_addr_t dma_addr;
> if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
>      if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
>           /* case (c) */
>           /* program the DMA engine with bar and off */

Why bother with that here if we could also let the caller handle
that? pci_p2pdma_dist() should be able to trivially find that out
based on provider_dev == mapping_dev.

> The real difficulty here is that you'd really want all the above handled
> by a dma_map_bvec() so it can combine every vector hitting the IOMMU
> into a single continuous IOVA -- but it's hard to fit case (c) into that
> equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
> and (b2) and the mapping driver has to then check each resulting DMA
> vector for pci_bus_addr_in_bar() while it is programming the DMA engine
> to deal with case (c).

I'd do it the other way around.  pci_p2pdma_dist is used to find
the p2p type.  The p2p type is stuffed into the bio_vec, and we then:

 (1) manually check for case (c) in the driver, for drivers that want to
     treat it differently from (b)
 (2) have a dma mapping wrapper that checks the p2p type
     and does the right thing for the rest.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 13:38                             ` Christoph Hellwig
@ 2019-06-28 15:54                               ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-28 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-28 7:38 a.m., Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote:
>>> It is not.  (c) is fundamentally very different as it is not actually
>>> an operation that ever goes out to the wire at all, and which is why the
>>> actual physical address on the wire does not matter at all.
>>> Some interfaces like NVMe have designed it in a way that it the commands
>>> used to do this internal transfer look like (b2), but that is just their
>>> (IMHO very questionable) interface design choice, that produces a whole
>>> chain of problems.
>>
>> From the mapping device's driver's perspective yes, but from the
>> perspective of a submitting driver they would be the same.
> 
> With your dma_addr_t scheme it won't be the same, as you'd need
> a magic way to generate the internal addressing and stuff it into
> the dma_addr_t.  With a phys_addr_t based scheme they should basically
> be all the same.

Yes, I see the folly in the dma_addr_t scheme now. I like the
phys_addr_t ideas we have been discussing.

>> Yes, you did suggest them. But what I'm trying to suggest is we don't
>> *necessarily* need the lookup. For demonstration purposes only, a
>> submitting driver could very roughly potentially do:
>>
>> struct bio_vec vec;
>> dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
>> if (dist < 0) {
>>      /* use regular memory */
>>      vec.bv_addr = virt_to_phys(kmalloc(...));
>>      vec.bv_flags = 0;
>> } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
>>      vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
>>      vec.bv_flags = BVEC_MAP_RESOURCE;
>> } else {
>>      vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
>>      vec.bv_flags = BVEC_MAP_BUS_ADDR;
>> }
> 
> That doesn't look too bad, except..
> 
>> -- And a mapping driver would roughly just do:
>>
>> dma_addr_t dma_addr;
>> if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
>>      if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
>>           /* case (c) */
>>           /* program the DMA engine with bar and off */
> 
> Why bother with that here if we could also let the caller handle
> that? pci_p2pdma_dist() should be able to trivially find that out
> based on provider_dev == mapping_dev.

True, in fact pci_p2pdma_dist() should return 0 in that case.

Though the driver will still have to do a range compare to figure out
which BAR the address belongs to and find the offset.

>> The real difficulty here is that you'd really want all the above handled
>> by a dma_map_bvec() so it can combine every vector hitting the IOMMU
>> into a single continuous IOVA -- but it's hard to fit case (c) into that
>> equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
>> and (b2) and the mapping driver has to then check each resulting DMA
>> vector for pci_bus_addr_in_bar() while it is programming the DMA engine
>> to deal with case (c).
> 
> I'd do it the other way around.  pci_p2pdma_dist is used to find
> the p2p type.  The p2p type is stuff into the bio_vec, and we then:
> 
>  (1) manually check for case (c) in driver for drivers that want to
>      treat it different from (b)
>  (2) we then have a dma mapping wrapper that checks the p2p type
>      and does the right thing for the rest.

Sure, that could make sense.

I imagine there are a lot of details that are wrong or could be done
better in my example. The purpose of it was just to demonstrate that we
can do it without a lookup in an interval tree on the physical address.

Logan


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28  4:57                                 ` Jason Gunthorpe
@ 2019-06-28 16:22                                   ` Logan Gunthorpe
  2019-06-28 17:29                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-28 16:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-27 10:57 p.m., Jason Gunthorpe wrote:
> On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote:
> 
>>> I don't think a GPU/FPGA driver will be involved, this would enter the
>>> block layer through the O_DIRECT path or something generic.. This the
>>> general flow I was suggesting to Dan earlier
>>
>> I would say the O_DIRECT path has to somehow call into the driver
>> backing the VMA to get an address to appropriate memory (in some way
>> vaguely similar to how we were discussing at LSF/MM)
> 
> Maybe, maybe no. For something like VFIO the PTE already has the
> correct phys_addr_t and we don't need to do anything..
> 
> For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably
> through a new pagemap op?

I don't know much about either VFIO or DEVICE_PRIVATE, but I'd still
wager there would be a better way to handle it before they submit it to
the block layer.

>> If P2P can't be done at that point, then the provider driver would
>> do the copy to system memory, in the most appropriate way, and
>> return regular pages for O_DIRECT to submit to the block device.
> 
> That only makes sense for the migratable DEVICE_PRIVATE case, it
> doesn't help the VFIO-like case, there you'd need to bounce buffer.
> 
>>>> I think it would be a larger layering violation to have the NVMe driver
>>>> (for example) memcpy data off a GPU's bar during a dma_map step to
>>>> support this bouncing. And it's even crazier to expect a DMA transfer to
>>>> be setup in the map step.
>>>
>>> Why? Don't we already expect the DMA mapper to handle bouncing for
>>> lots of cases, how is this case different? This is the best place to
>>> place it to make it shared.
>>
>> This is different because it's special memory where the DMA mapper
>> can't possibly know the best way to transfer the data.
> 
> Why not?  If we have a 'bar info' structure that could have data
> transfer op callbacks, infact, I think we might already have similar
> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..

Well it could, in theory, be done, but it just seems wrong to set up and
wait for more DMA requests while we are in mid-progress setting up
another DMA request, especially when the block layer has historically
had issues with stack sizes. It's also possible you might have multiple
bio_vecs that each have to do a migration, and with a hook here they'd
have to be done serially.

>> One could argue that the hook to the GPU/FPGA driver could be in the
>> mapping step but then we'd have to do lookups based on an address --
>> where as the VMA could more easily have a hook back to whatever driver
>> exported it.
> 
> The trouble with a VMA hook is that it is only really avaiable when
> working with the VA, and it is not actually available during GUP, you
> have to have a GUP-like thing such as hmm_range_snapshot that is
> specifically VMA based. And it is certainly not available during dma_map.

Yup, I'm hoping some of the GUP cleanups will help with that but it's
definitely a problem. I never said the VMA would be available at dma_map
time nor would I want it to be. I expect it to be available before we
submit the request to the block layer and it really only applies to the
O_DIRECT path and maybe a similar thing in the RDMA path.

> When working with VMA's/etc it seems there are some good reasons to
> drive things off of the PTE content (either via struct page & pgmap or
> via phys_addr_t & barmap)
> 
> I think the best reason to prefer a uniform phys_addr_t is that it
> does give us the option to copy the data to/from CPU memory. That
> option goes away as soon as the bio sometimes provides a dma_addr_t.

Not really. phys_addr_t alone doesn't give us a way to copy data. You
need a lookup table on that address and a couple of hooks.

> At least for RDMA, we do have some cases (like siw/rxe, hfi) where
> they sometimes need to do that copy. I suspect the block stack is
> similar, in the general case.

But the whole point of the use cases I'm trying to serve is to avoid the
root complex. If the block layer randomly decides to ephemerally copy
the data back to the CPU (for integrity or something) then we've
accomplished nothing and shouldn't have put the data in the BAR to begin
with. Similarly, for DEVICE_PRIVATE, I'd have guessed it wouldn't want
to use ephemeral copies but actually migrate the memory semi-permanently
to the CPU for more than one transaction and I would argue that it makes
the most sense to make these decisions before the data gets to the block
layer.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 16:22                                   ` Logan Gunthorpe
@ 2019-06-28 17:29                                     ` Jason Gunthorpe
  2019-06-28 18:29                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-28 17:29 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:

> > Why not?  If we have a 'bar info' structure that could have data
> > transfer op callbacks, infact, I think we might already have similar
> > callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
> 
> Well it could, in theory be done, but It just seems wrong to setup and
> wait for more DMA requests while we are in mid-progress setting up
> another DMA request. Especially when the block layer has historically
> had issues with stack sizes. It's also possible you might have multiple
> bio_vec's that have to each do a migration and with a hook here they'd
> have to be done serially.

*shrug* this is just standard bounce buffering stuff...
 
> > I think the best reason to prefer a uniform phys_addr_t is that it
> > does give us the option to copy the data to/from CPU memory. That
> > option goes away as soon as the bio sometimes provides a dma_addr_t.
> 
> Not really. phys_addr_t alone doesn't give us a way to copy data. You
> need a lookup table on that address and a couple of hooks.

Yes, I'm not sure how you envision using phys_addr_t without a
lookup. At the end of the day we must get the src and target 'struct
device' in the dma_map area (at the minimum to compute the offset to
translate phys_addr_t to dma_addr_t), and the only way to do that from
phys_addr_t is via a lookup?

> > At least for RDMA, we do have some cases (like siw/rxe, hfi) where
> > they sometimes need to do that copy. I suspect the block stack is
> > similar, in the general case.
> 
> But the whole point of the use cases I'm trying to serve is to avoid the
> root complex. 

Well, I think this is sort of a separate issue. Generically I think
the dma layer should continue to work largely transparently, and if I
feed in BAR memory that can't be P2P'd it should bounce, just like
all the other DMA limitations it already supports. That is pretty much
its whole purpose in life.

The issue of having the caller optimize what it sends is kind of
separate - yes you definitely still need the egress DMA device to
drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide
if it should migrate or not.

What I see as the question is how to lay out the BIO.

If we agree the bio should only have phys_addr_t then we need some
'bar info' (ie at least the offset) in the dma map and some 'bar info'
(ie the DMA device) during the bio construction.

What you are trying to do is optimize the passing of that 'bar info'
with a limited number of bits in the BIO.

A single flag means an interval tree, 4-8 bits could build a probably
O(1) hash lookup, 64 bits could store a pointer, etc.

If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
table'. Assign each registered P2P 'bar info' a small 4 bit id and
hash on that. It should be fast enough to not worry about the double
lookup.
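
ie something like the following (a sketch only, names invented,
registration locking omitted) -- with this few ids a directly indexed
table is the degenerate 'perfect hash':

#define P2P_BAR_ID_BITS 4
#define P2P_BAR_ID_MAX  (1 << P2P_BAR_ID_BITS)

static struct p2p_bar_info *p2p_bar_table[P2P_BAR_ID_MAX];

/* provider registration: hand out the small id stored in the bio_vec bits */
static int p2p_bar_info_register(struct p2p_bar_info *info)
{
        int id;

        for (id = 1; id < P2P_BAR_ID_MAX; id++) {
                if (!p2p_bar_table[id]) {
                        p2p_bar_table[id] = info;
                        return id;
                }
        }
        return -ENOSPC;
}

/* dma_map side: the id comes straight out of the bio_vec, no tree walk */
static inline struct p2p_bar_info *p2p_bar_info_lookup(unsigned int id)
{
        return id ? p2p_bar_table[id] : NULL;
}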

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 17:29                                     ` Jason Gunthorpe
@ 2019-06-28 18:29                                       ` Logan Gunthorpe
  2019-06-28 19:09                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-28 18:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
> 
>>> Why not?  If we have a 'bar info' structure that could have data
>>> transfer op callbacks, infact, I think we might already have similar
>>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
>>
>> Well it could, in theory be done, but It just seems wrong to setup and
>> wait for more DMA requests while we are in mid-progress setting up
>> another DMA request. Especially when the block layer has historically
>> had issues with stack sizes. It's also possible you might have multiple
>> bio_vec's that have to each do a migration and with a hook here they'd
>> have to be done serially.
> 
> *shrug* this is just standard bounce buffering stuff...

I don't know of any "standard" bounce buffering stuff that uses random
other devices' DMA engines where appropriate.

>>> I think the best reason to prefer a uniform phys_addr_t is that it
>>> does give us the option to copy the data to/from CPU memory. That
>>> option goes away as soon as the bio sometimes provides a dma_addr_t.
>>
>> Not really. phys_addr_t alone doesn't give us a way to copy data. You
>> need a lookup table on that address and a couple of hooks.
> 
> Yes, I'm not sure how you envision using phys_addr_t without a
> lookup.. At the end of the day we must get the src and target 'struct
> device' in the dma_map area (at the minimum to compute the offset to
> translate phys_addr_t to dma_addr_t) and the only way to do that from
> phys_addr_t is via lookup??

I thought my other email to Christoph laid it out pretty cleanly...

>>> At least for RDMA, we do have some cases (like siw/rxe, hfi) where
>>> they sometimes need to do that copy. I suspect the block stack is
>>> similar, in the general case.
>>
>> But the whole point of the use cases I'm trying to serve is to avoid the
>> root complex. 
> 
> Well, I think this is sort of a seperate issue. Generically I think
> the dma layer should continue to work largely transparently, and if I
> feed in BAR memory that can't be P2P'd it should bounce, just like
> all the other DMA limitations it already supports. That is pretty much
> its whole purpose in life.

I disagree. It's one thing for the DMA layer to work around architecture
limitations like HighMem/LowMem and just do a memcpy when it can't
handle it -- it's a whole different thing for the DMA layer to know about
the varieties of memory on different peripheral devices and the nuances
of how and when to transfer between them. I think the submitting driver
has the best information about when to do these transfers.

IMO the bouncing in the DMA layer isn't a desirable thing; it was a
necessary addition to work around various legacy platform issues and
have existing code still work correctly. It's always better for a driver
to allocate memory appropriate for the DMA than to just use random
memory and rely on it being bounced by the lower layer. For P2P, we
don't have existing code to worry about so I don't think a magic
automatic bouncing design is appropriate.

> The issue of having the caller optimize what it sends is kind of
> separate - yes you definately still need the egress DMA device to
> drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide
> if it should migrate or not.

Yes, but my contention is that I don't want to have to make these
decisions twice: once before I submit it to the block layer, then again
at mapping time.

> What I see as the question is how to layout the BIO. 
> 
> If we agree the bio should only have phys_addr_t then we need some
> 'bar info' (ie at least the offset) in the dma map and some 'bar info'
> (ie the DMA device) during the bio construciton.

Per my other email, it was phys_addr_t plus hints on how to map the
memory (bus address, dma_map_resource, or regular). This requires
exactly two flag bits in the bio_vec and no interval tree or hash table.
I don't want to have to pass bar info, other hooks, or anything like
that to the block layer.

> What you are trying to do is optimize the passing of that 'bar info'
> with a limited number of bits in the BIO.
> 
> A single flag means an interval tree, 4-8 bits could build a probably
> O(1) hash lookup, 64 bits could store a pointer, etc.

Well, an interval tree can get you the backing device for a given
phys_addr_t; however, for P2PDMA, we need a second lookup based on the
mapping device. This is because exactly how you map the data depends on
both devices. Currently I'm working on this for the existing
implementation and struct page gets me the backing device but I need
another xarray cache based on the mapping device to figure out exactly
how to map the memory. But none of this mess is required if we can just
pass the mapping hints through the block layer as flags (per my other
email) because the submitting driver should already know ahead of time
what it's trying to do.
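
(For reference, the sort of cache I mean looks roughly like this -- just a
sketch, with the slow path helper invented:)

struct p2p_cache {
        struct xarray map_types;        /* keyed on the mapping struct device */
};

static int p2p_cached_map_type(struct p2p_cache *cache,
                               struct device *mapping_dev)
{
        void *entry = xa_load(&cache->map_types, (unsigned long)mapping_dev);

        if (entry)
                return xa_to_value(entry);

        /* slow path: walk the PCI topology once, store with xa_mk_value() */
        return p2p_calc_and_cache_map_type(cache, mapping_dev);
}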

> If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
> table'. Assign each registered P2P 'bar info' a small 4 bit id and
> hash on that. It should be fast enough to not worry about the double
> lookup.

This feels like it's just setting us up to run into nasty limits based
on the number of bits we actually have. The number of bits in a bio_vec
will always be a precious resource. If I have a server chassis that
exists today with 24 NVMe devices, and each device has a CMB, I'm already
using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
uses on top of that.

Logan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 18:29                                       ` Logan Gunthorpe
@ 2019-06-28 19:09                                         ` Jason Gunthorpe
  2019-06-28 19:35                                           ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-06-28 19:09 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
> > On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
> > 
> >>> Why not?  If we have a 'bar info' structure that could have data
> >>> transfer op callbacks, infact, I think we might already have similar
> >>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
> >>
> >> Well it could, in theory be done, but It just seems wrong to setup and
> >> wait for more DMA requests while we are in mid-progress setting up
> >> another DMA request. Especially when the block layer has historically
> >> had issues with stack sizes. It's also possible you might have multiple
> >> bio_vec's that have to each do a migration and with a hook here they'd
> >> have to be done serially.
> > 
> > *shrug* this is just standard bounce buffering stuff...
> 
> I don't know of any "standard" bounce buffering stuff that uses random
> other device's DMA engines where appropriate.

IMHO, it is conceptually the same as memcpy, and probably we will never
need such an optimization in dma map. Other copy places might be
different; at least we have the option.
 
> IMO the bouncing in the DMA layer isn't a desirable thing, it was a
> necessary addition to work around various legacy platform issues and
> have existing code still work correctly. 

Of course it is not desirable! But there are many situations where we
do not have the luxury to work around the HW limits in the caller, so
those callers either have to not do DMA or they have to open code
bounce buffering - both are wrong.

> > What I see as the question is how to layout the BIO. 
> > 
> > If we agree the bio should only have phys_addr_t then we need some
> > 'bar info' (ie at least the offset) in the dma map and some 'bar info'
> > (ie the DMA device) during the bio construciton.
> 
> Per my other email, it was phys_addr_t plus hints on how to map the
> memory (bus address, dma_map_resource, or regular). This requires
> exactly two flag bits in the bio_vec and no interval tree or hash table.
> I don't want to have to pass bar info, other hooks, or anything like
> that to the block layer.

This scheme makes the assumption that the dma mapping struct device is
all you need, and we never need to know the originating struct device
during dma map. This is clearly safe if the two devices are on the
same PCIe segment.

However, I'd feel more comfortable about that assumption if we had
code to support the IOMMU case, and know for sure it doesn't require
more info :(

But I suppose it is also reasonable that only the IOMMU case would
have the expensive 'bar info' lookup or something.

Maybe you can hide these flags behind some dma_map helper; then the
layering might be nicer:

  dma_map_set_bio_p2p_flags(bio, phys_addr, source dev, dest_dev) 

?

ie the choice of flag scheme to use is opaque to the DMA layer.
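
For instance, something like this per-bvec variant (a sketch only, reusing
the made-up BVEC_* flags and pci_p2pdma_dist() from your earlier example):

static int dma_map_set_bvec_p2p_flags(struct bio_vec *bv, phys_addr_t addr,
                                      struct pci_dev *provider,
                                      struct pci_dev *mapping_dev)
{
        int dist = pci_p2pdma_dist(provider, mapping_dev);

        if (dist < 0)
                return dist;            /* no P2P path between the two */

        bv->bv_addr = addr;
        if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE)
                bv->bv_flags = BVEC_MAP_RESOURCE;       /* (b1) */
        else
                bv->bv_flags = BVEC_MAP_BUS_ADDR;       /* (b2)/(c) */
        return 0;
}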

> > If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
> > table'. Assign each registered P2P 'bar info' a small 4 bit id and
> > hash on that. It should be fast enough to not worry about the double
> > lookup.
> 
> This feels like it's just setting us up to run into nasty limits based
> on the number of bits we actually have. The number of bits in a bio_vec
> will always be a precious resource. If I have a server chassis that
> exist today with 24 NVMe devices, and each device has a CMB, I'm already
> using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
> uses on top of that.

A hash is an alternative data structure to an interval tree that has
better scaling for small numbers of BARs, which I think is our
case.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 19:09                                         ` Jason Gunthorpe
@ 2019-06-28 19:35                                           ` Logan Gunthorpe
  2019-07-02 22:45                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Logan Gunthorpe @ 2019-06-28 19:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-06-28 1:09 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
>>> On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
>>>
>>>>> Why not?  If we have a 'bar info' structure that could have data
>>>>> transfer op callbacks, infact, I think we might already have similar
>>>>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
>>>>
>>>> Well it could, in theory be done, but It just seems wrong to setup and
>>>> wait for more DMA requests while we are in mid-progress setting up
>>>> another DMA request. Especially when the block layer has historically
>>>> had issues with stack sizes. It's also possible you might have multiple
>>>> bio_vec's that have to each do a migration and with a hook here they'd
>>>> have to be done serially.
>>>
>>> *shrug* this is just standard bounce buffering stuff...
>>
>> I don't know of any "standard" bounce buffering stuff that uses random
>> other device's DMA engines where appropriate.
> 
> IMHO, it is conceptually the same as memcpy.. And probably we will not
> ever need such optimization in dma map. Other copy places might be
> different at least we have the option.
>  
>> IMO the bouncing in the DMA layer isn't a desirable thing, it was a
>> necessary addition to work around various legacy platform issues and
>> have existing code still work correctly. 
> 
> Of course it is not desireable! But there are many situations where we
> do not have the luxury to work around the HW limits in the caller, so
> those callers either have to not do DMA or they have to open code
> bounce buffering - both are wrong.

They don't have to open code it, they can use helpers and good coding
practices. But the submitting driver is the one that's in the best
position to figure this stuff out. Just like it is with the dma_map
bouncing -- all it has to do is use dma_alloc_coherent(). If we don't
write any submitting drivers that assume the dma_map API bounces then we
should never have to deal with it.

>>> What I see as the question is how to layout the BIO. 
>>>
>>> If we agree the bio should only have phys_addr_t then we need some
>>> 'bar info' (ie at least the offset) in the dma map and some 'bar info'
>>> (ie the DMA device) during the bio construciton.
>>
>> Per my other email, it was phys_addr_t plus hints on how to map the
>> memory (bus address, dma_map_resource, or regular). This requires
>> exactly two flag bits in the bio_vec and no interval tree or hash table.
>> I don't want to have to pass bar info, other hooks, or anything like
>> that to the block layer.
> 
> This scheme makes the assumption that the dma mapping struct device is
> all you need, and we never need to know the originating struct device
> during dma map. This is clearly safe if the two devices are on the
> same PCIe segment
> 
> However, I'd feel more comfortable about that assumption if we had
> code to support the IOMMU case, and know for sure it doesn't require
> more info :(

The example I posted *does* support the IOMMU case. That was case (b1)
in the description. The idea is that pci_p2pdma_dist() returns a
distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
mapping is required and the appropriate flag tells it to call
dma_map_resource(). This way, it supports both same-segment and
different-segments without needing any look ups in the map step.
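
Roughly, the map step would then look something like this (the flag
names and the exact bvec plumbing are placeholders I'm making up to
illustrate the two hint bits; dma_map_resource() and dma_map_page() are
the existing DMA API calls):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical hint bits carried in the bio_vec -- not real names */
#define BVEC_MAP_BUS_ADDR	0x1	/* same segment: use the PCI bus address */
#define BVEC_MAP_RESOURCE	0x2	/* through the host bridge: map via the IOMMU */

static dma_addr_t map_one_vec(struct device *dma_dev, phys_addr_t paddr,
			      size_t len, unsigned int flags,
			      u64 bus_offset, enum dma_data_direction dir)
{
	if (flags & BVEC_MAP_BUS_ADDR)
		/* Same PCIe segment: translate straight to a bus address */
		return paddr - bus_offset;

	if (flags & BVEC_MAP_RESOURCE)
		/* Different segments or an IOMMU in the way: map the BAR
		 * resource through the IOMMU */
		return dma_map_resource(dma_dev, paddr, len, dir, 0);

	/* Regular host memory */
	return dma_map_page(dma_dev, pfn_to_page(PHYS_PFN(paddr)),
			    offset_in_page(paddr), len, dir);
}

The point is that everything the map step needs is either already in
the bvec or already known to the dma_map caller; there's no per-vector
lookup.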

For the only existing upstream use case (NVMe-of), this is ideal because
we can calculate the mapping requirements exactly once ahead of any
transfers. Then populating the bvecs and dma-mapping for each transfer
is fast and doesn't require any additional work besides deciding where
to get the memory from.
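
To sketch that with the names from my earlier example (the exact
signatures are placeholders), at P2P setup time nvmet would do
something like:

	/* once, when the P2P memory is chosen for a controller/queue */
	dist = pci_p2pdma_dist(provider, dma_dev);
	flags = (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) ?
			BVEC_MAP_RESOURCE : BVEC_MAP_BUS_ADDR;

and then each request just fills its bvecs with CMB physical addresses
plus those pre-computed flags; nothing on the per-I/O path has to look
anything up.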

For O_DIRECT and userspace RDMA, this should also be ideal; the real
problem is how to get the necessary information out of the VMA. This
isn't helped by having a lookup at the dma map step. But the provider
driver is certainly going to be involved in creating the VMA, so it
should be able to provide the necessary hooks easily. Though there are
still a bunch of challenges here.

Maybe other use cases are not this ideal, but I suspect they should still
be able to make use of the same flags. It's hard to say right now,
though, because we haven't seen any other use cases.


> Maybe you can hide these flags as some dma_map helper, then the
> layering might be nicer:
> 
>   dma_map_set_bio_p2p_flags(bio, phys_addr, source dev, dest_dev) 
> 
> ?
> 
> ie the choice of flag scheme to use is opaque to the DMA layer.

If there were such a use case, I suppose you could use a couple of flag
bits to tell you how to interpret the other flag bits; but, at the
moment, I only see a need for two bits, so we'll probably have a lot of
spares for a long time. You could certainly have a third bit which says to
do a lookup and try to figure out bouncing, but I don't think it's a good idea.

>>> If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
>>> table'. Assign each registered P2P 'bar info' a small 4 bit id and
>>> hash on that. It should be fast enough to not worry about the double
>>> lookup.
>>
>> This feels like it's just setting us up to run into nasty limits based
>> on the number of bits we actually have. The number of bits in a bio_vec
>> will always be a precious resource. If I have a server chassis that
>> exists today with 24 NVMe devices, and each device has a CMB, I'm already
>> using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
>> uses on top of that.
> 
> A hash is an alternative data structure to an interval tree that has
> better scaling for small numbers of BARs, which I think is our
> case.

But then you need a large and not necessarily future-proof number of
bits in the bio_vec to store the hash.

Logan


* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-06-28 19:35                                           ` Logan Gunthorpe
@ 2019-07-02 22:45                                             ` Jason Gunthorpe
  2019-07-02 22:52                                               ` Logan Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2019-07-02 22:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates

On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:

> > However, I'd feel more comfortable about that assumption if we had
> > code to support the IOMMU case, and know for sure it doesn't require
> > more info :(
> 
> The example I posted *does* support the IOMMU case. That was case (b1)
> in the description. The idea is that pci_p2pdma_dist() returns a
> distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
> mapping is required and the appropriate flag tells it to call
> dma_map_resource(). This way, it supports both same-segment and
> different-segments without needing any look ups in the map step.

I mean we actually have some iommu drivers that can set up P2P in real
HW. I'm worried that real IOMMUs will need to have the BDF of the
completer to route completions back to the requester - which we can't
trivially get through this scheme.

However, maybe that is just a future problem, and certainly we can see
that with an interval tree or otherwise such an IOMMU could get the
information it needs.

Jason


* Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
  2019-07-02 22:45                                             ` Jason Gunthorpe
@ 2019-07-02 22:52                                               ` Logan Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Logan Gunthorpe @ 2019-07-02 22:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-kernel, linux-block, linux-nvme,
	linux-pci, linux-rdma, Jens Axboe, Bjorn Helgaas, Dan Williams,
	Sagi Grimberg, Keith Busch, Stephen Bates



On 2019-07-02 4:45 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:
> 
>>> However, I'd feel more comfortable about that assumption if we had
>>> code to support the IOMMU case, and know for sure it doesn't require
>>> more info :(
>>
>> The example I posted *does* support the IOMMU case. That was case (b1)
>> in the description. The idea is that pci_p2pdma_dist() returns a
>> distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
>> mapping is required and the appropriate flag tells it to call
>> dma_map_resource(). This way, it supports both same-segment and
>> different-segments without needing any look ups in the map step.
> 
> I mean we actually have some iommu drivers that can set up P2P in real
> HW. I'm worried that real IOMMUs will need to have the BDF of the
> completer to route completions back to the requester - which we can't
> trivially get through this scheme.

I've never seen such an IOMMU, but I guess, in theory, it could exist.
The IOMMUs that set up P2P-like transactions in real hardware make use of
dma_map_resource(). There aren't a lot of users of this function (it was
actually broken with the Intel IOMMU until I fixed it recently, and I'd
expect there are other broken implementations); but, to my knowledge,
none of them have needed the BDF of the provider to date.
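
For reference, that call only ever sees the device doing the mapping
and a physical address -- there's no provider device anywhere in the
interface:

	dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
				    size_t size, enum dma_data_direction dir,
				    unsigned long attrs);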

> However, maybe that is just a future problem, and certainly we can see
> that with an interval tree or otherwise such an IOMMU could get the
> information it needs.

Yup, the rule of thumb is to design for the needs we have today, not
for imagined future problems.

Logan



Thread overview: 89+ messages
2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 01/28] block: Introduce DMA direct request type Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 02/28] block: Add dma_vec structure Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 03/28] block: Warn on mis-use of dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 04/28] block: Never bounce " Logan Gunthorpe
2019-06-20 17:23   ` Jason Gunthorpe
2019-06-20 18:38     ` Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 05/28] block: Skip dma-direct bios in bio_integrity_prep() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 06/28] block: Support dma-direct bios in bio_advance_iter() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 07/28] block: Use dma_vec length in bio_cur_bytes() for dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 08/28] block: Introduce dmavec_phys_mergeable() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 09/28] block: Introduce vec_gap_to_prev() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 10/28] block: Create generic vec_split_segs() from bvec_split_segs() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 11/28] block: Create blk_segment_split_ctx Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 12/28] block: Create helper for bvec_should_split() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 13/28] block: Generalize bvec_should_split() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 14/28] block: Support splitting dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 15/28] block: Support counting dma-direct bio segments Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 16/28] block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 17/28] block: Introduce queue flag to indicate support for dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 18/28] block: Introduce bio_add_dma_addr() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 19/28] nvme-pci: Support dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address Logan Gunthorpe
2019-06-20 16:49   ` Jason Gunthorpe
2019-06-20 16:59     ` Logan Gunthorpe
2019-06-20 17:11       ` Jason Gunthorpe
2019-06-20 18:24         ` Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 21/28] nvmet: Split nvmet_bdev_execute_rw() into a helper function Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 22/28] nvmet: Use DMA addresses instead of struct pages for P2P Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 23/28] nvme-pci: Remove support for PCI_P2PDMA requests Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 24/28] block: Remove PCI_P2PDMA queue flag Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 25/28] IB/core: Remove P2PDMA mapping support in rdma_rw_ctx Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 26/28] PCI/P2PDMA: Remove SGL helpers Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 27/28] PCI/P2PDMA: Remove struct pages that back P2PDMA memory Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 28/28] memremap: Remove PCI P2PDMA page memory type Logan Gunthorpe
2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
2019-06-20 19:33   ` Jason Gunthorpe
2019-06-20 20:18     ` Dan Williams
2019-06-20 20:51       ` Logan Gunthorpe
2019-06-21 17:47       ` Jason Gunthorpe
2019-06-21 17:54         ` Dan Williams
2019-06-24  7:31     ` Christoph Hellwig
2019-06-24 13:46       ` Jason Gunthorpe
2019-06-24 13:50         ` Christoph Hellwig
2019-06-24 13:55           ` Jason Gunthorpe
2019-06-24 16:53             ` Logan Gunthorpe
2019-06-24 18:16               ` Jason Gunthorpe
2019-06-24 18:28                 ` Logan Gunthorpe
2019-06-24 18:54                   ` Jason Gunthorpe
2019-06-24 19:37                     ` Logan Gunthorpe
2019-06-24 16:10         ` Logan Gunthorpe
2019-06-25  7:18           ` Christoph Hellwig
2019-06-20 19:34   ` Logan Gunthorpe
2019-06-20 23:40     ` Dan Williams
2019-06-20 23:42       ` Logan Gunthorpe
2019-06-24  7:27 ` Christoph Hellwig
2019-06-24 16:07   ` Logan Gunthorpe
2019-06-25  7:20     ` Christoph Hellwig
2019-06-25 15:57       ` Logan Gunthorpe
2019-06-25 17:01         ` Christoph Hellwig
2019-06-25 19:54           ` Logan Gunthorpe
2019-06-26  6:57             ` Christoph Hellwig
2019-06-26 18:31               ` Logan Gunthorpe
2019-06-26 20:21                 ` Jason Gunthorpe
2019-06-26 20:39                   ` Dan Williams
2019-06-26 20:54                     ` Jason Gunthorpe
2019-06-26 20:55                     ` Logan Gunthorpe
2019-06-26 20:45                   ` Logan Gunthorpe
2019-06-26 21:00                     ` Jason Gunthorpe
2019-06-26 21:18                       ` Logan Gunthorpe
2019-06-27  6:32                         ` Jason Gunthorpe
2019-06-27 16:09                           ` Logan Gunthorpe
2019-06-27 16:35                             ` Jason Gunthorpe
2019-06-27 16:49                               ` Logan Gunthorpe
2019-06-28  4:57                                 ` Jason Gunthorpe
2019-06-28 16:22                                   ` Logan Gunthorpe
2019-06-28 17:29                                     ` Jason Gunthorpe
2019-06-28 18:29                                       ` Logan Gunthorpe
2019-06-28 19:09                                         ` Jason Gunthorpe
2019-06-28 19:35                                           ` Logan Gunthorpe
2019-07-02 22:45                                             ` Jason Gunthorpe
2019-07-02 22:52                                               ` Logan Gunthorpe
2019-06-27  9:08                     ` Christoph Hellwig
2019-06-27 16:30                       ` Logan Gunthorpe
2019-06-27 17:00                         ` Christoph Hellwig
2019-06-27 18:00                           ` Logan Gunthorpe
2019-06-28 13:38                             ` Christoph Hellwig
2019-06-28 15:54                               ` Logan Gunthorpe
2019-06-27  9:01                 ` Christoph Hellwig
