* [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support
@ 2019-11-05 16:20 Max Gurtovoy
  2019-11-05 16:20 ` [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd Max Gurtovoy
                   ` (15 more replies)
  0 siblings, 16 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

Hello Sagi, Christoph, Keith, Jason and Co

This patchset adds metadata (T10-PI) support to the NVMeoF/RDMA host and
target sides, using the signature verbs API. The set starts with a few
preparation commits to the NVMe host core layer, continues with the
NVMeoF/RDMA host implementation, a new bip flag patch and a few preparation
commits to the NVMe target core layer, and ends with the NVMeoF/RDMA target
implementation.

Configuration:
Host:
 - nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme

Target:
 - modprobe nvmet-rdma pi_enable=Y
 - echo 1 > /config/nvmet/subsystems/${NAME}/attr_pi_enable
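
For reference, a fuller target bring-up around the new attribute could look
like the sketch below (illustrative only: it assumes configfs mounted at
/config as in the example above, the standard nvmet configfs layout, and a
backing device that already carries a T10-PI integrity profile; the device
path is a placeholder):
 - mkdir /config/nvmet/subsystems/${NAME}
 - echo 1 > /config/nvmet/subsystems/${NAME}/attr_allow_any_host
 - echo 1 > /config/nvmet/subsystems/${NAME}/attr_pi_enable
 - mkdir /config/nvmet/subsystems/${NAME}/namespaces/1
 - echo -n /dev/nvme0n1 > /config/nvmet/subsystems/${NAME}/namespaces/1/device_path
 - echo 1 > /config/nvmet/subsystems/${NAME}/namespaces/1/enable
 - mkdir /config/nvmet/ports/1
 - echo rdma > /config/nvmet/ports/1/addr_trtype
 - echo ipv4 > /config/nvmet/ports/1/addr_adrfam
 - echo 10.0.1.1 > /config/nvmet/ports/1/addr_traddr
 - echo 4420 > /config/nvmet/ports/1/addr_trsvcid
 - ln -s /config/nvmet/subsystems/${NAME} /config/nvmet/ports/1/subsystems/${NAME}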

The code was tested using Mellanox ConnectX-4/ConnectX-5 HCAs.
This series applies cleanly on top of the nvme_5.5 branch.

Israel Rukshin (11):
  nvme-fabrics: allow user enabling metadata/T10-PI support
  block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  nvmet: Prepare metadata request
  nvmet: Add metadata characteristics for a namespace
  nvmet: Rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi
  nvme: Add Metadata Capabilities enumerations
  nvmet: Add metadata/T10-PI support
  nvmet: Add metadata support for block devices
  nvmet-rdma: Add metadata/T10-PI support

Max Gurtovoy (4):
  nvme: Fail __nvme_revalidate_disk in case of a spec violation
  nvme: Introduce max_integrity_segments ctrl attribute
  nvme: Inline nvme_ns_has_pi function
  nvme-rdma: Add metadata/T10-PI support

 block/t10-pi.c                    |   7 +
 drivers/nvme/host/core.c          |  55 +++---
 drivers/nvme/host/fabrics.c       |   5 +
 drivers/nvme/host/fabrics.h       |   3 +
 drivers/nvme/host/nvme.h          |   7 +
 drivers/nvme/host/pci.c           |   7 +
 drivers/nvme/host/rdma.c          | 346 ++++++++++++++++++++++++++++++++------
 drivers/nvme/target/admin-cmd.c   |  37 ++--
 drivers/nvme/target/configfs.c    |  24 +++
 drivers/nvme/target/core.c        |  54 ++++--
 drivers/nvme/target/discovery.c   |   8 +-
 drivers/nvme/target/fabrics-cmd.c |  19 ++-
 drivers/nvme/target/io-cmd-bdev.c | 105 +++++++++++-
 drivers/nvme/target/io-cmd-file.c |   8 +-
 drivers/nvme/target/nvmet.h       |  27 ++-
 drivers/nvme/target/rdma.c        | 221 ++++++++++++++++++++++--
 include/linux/bio.h               |   1 +
 include/linux/nvme.h              |   2 +
 18 files changed, 811 insertions(+), 125 deletions(-)

-- 
1.8.3.1


* [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-07  2:36   ` Martin K. Petersen
  2019-11-05 16:20 ` [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support Max Gurtovoy
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

Add a 'pi_enable' option to the 'connect' command so users can enable
metadata (T10-PI) support.

usage examples:
nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
nvme connect -p -t rdma -a 10.0.1.1 -n test-nvme
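
Under the hood the flag simply appends a ",pi_enable" token to the options
string written to /dev/nvme-fabrics, next to the other boolean options such
as hdr_digest and data_digest (see the two fabrics.c hunks below).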

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 fabrics.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fabrics.c b/fabrics.c
index 8982ae4..b22ea14 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -72,6 +72,7 @@ static struct config {
 	int  disable_sqflow;
 	int  hdr_digest;
 	int  data_digest;
+	int  pi_enable;
 	bool persistent;
 	bool quiet;
 } cfg = { NULL };
@@ -756,7 +757,9 @@ static int build_options(char *argstr, int max_len, bool discover)
 	    add_bool_argument(&argstr, &max_len, "disable_sqflow",
 				cfg.disable_sqflow) ||
 	    add_bool_argument(&argstr, &max_len, "hdr_digest", cfg.hdr_digest) ||
-	    add_bool_argument(&argstr, &max_len, "data_digest", cfg.data_digest))
+	    add_bool_argument(&argstr, &max_len, "data_digest",
+				cfg.data_digest) ||
+	    add_bool_argument(&argstr, &max_len, "pi_enable", cfg.pi_enable))
 		return -EINVAL;
 
 	return 0;
@@ -885,6 +888,13 @@ retry:
 		p += len;
 	}
 
+	if (cfg.pi_enable) {
+		len = sprintf(p, ",pi_enable");
+		if (len < 0)
+			return -EINVAL;
+		p += len;
+	}
+
 	switch (e->trtype) {
 	case NVMF_TRTYPE_RDMA:
 	case NVMF_TRTYPE_TCP:
@@ -1171,6 +1181,7 @@ int discover(const char *desc, int argc, char **argv, bool connect)
 		OPT_INT("tos",             'T', &cfg.tos,             "type of service"),
 		OPT_FLAG("hdr_digest",     'g', &cfg.hdr_digest,      "enable transport protocol header digest (TCP transport)"),
 		OPT_FLAG("data_digest",    'G', &cfg.data_digest,     "enable transport protocol data digest (TCP transport)"),
+		OPT_FLAG("pi_enable",      'p', &cfg.pi_enable,       "enable metadata (T10-PI) support (default false)"),
 		OPT_INT("nr-io-queues",    'i', &cfg.nr_io_queues,    "number of io queues to use (default is core count)"),
 		OPT_INT("nr-write-queues", 'W', &cfg.nr_write_queues, "number of write queues to use (default 0)"),
 		OPT_INT("nr-poll-queues",  'P', &cfg.nr_poll_queues,  "number of poll queues to use (default 0)"),
@@ -1231,6 +1242,7 @@ int connect(const char *desc, int argc, char **argv)
 		OPT_FLAG("disable-sqflow",    'd', &cfg.disable_sqflow,    "disable controller sq flow control (default false)"),
 		OPT_FLAG("hdr-digest",        'g', &cfg.hdr_digest,        "enable transport protocol header digest (TCP transport)"),
 		OPT_FLAG("data-digest",       'G', &cfg.data_digest,       "enable transport protocol data digest (TCP transport)"),
+		OPT_FLAG("pi_enable",         'p', &cfg.pi_enable,         "enable metadata (T10-PI) support (default false)"),
 		OPT_END()
 	};
 
-- 
1.8.3.1


* [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
  2019-11-05 16:20 ` [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-12 17:48   ` Sagi Grimberg
  2019-11-05 16:20 ` [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation Max Gurtovoy
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

Preparation for adding metadata (T10-PI) over fabrics support. This will
allow end-to-end protection information passthrough and validation for
NVMe over Fabrics.
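
pi_enable is a flag-style option (it takes no value and defaults to false),
parsed the same way as hdr_digest/data_digest; a transport only sees it if it
advertises NVMF_OPT_PI_ENABLE in its allowed_opts, which the nvme-rdma patch
later in this series does.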

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/host/fabrics.c | 5 +++++
 drivers/nvme/host/fabrics.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 74b8818..0a05eff 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -612,6 +612,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
 	{ NVMF_OPT_TOS,			"tos=%d"		},
+	{ NVMF_OPT_PI_ENABLE,		"pi_enable"		},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -634,6 +635,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
+	opts->pi_enable = false;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
@@ -867,6 +869,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 			}
 			opts->tos = token;
 			break;
+		case NVMF_OPT_PI_ENABLE:
+			opts->pi_enable = true;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 93f08d7..fbb7a31 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -56,6 +56,7 @@ enum {
 	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
 	NVMF_OPT_TOS		= 1 << 19,
+	NVMF_OPT_PI_ENABLE	= 1 << 20,
 };
 
 /**
@@ -89,6 +90,7 @@ enum {
  * @nr_write_queues: number of queues for write I/O
  * @nr_poll_queues: number of queues for polling I/O
  * @tos: type of service
+ * @pi_enable: Enable metadata (T10-PI) support
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -111,6 +113,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_write_queues;
 	unsigned int		nr_poll_queues;
 	int			tos;
+	bool			pi_enable;
 };
 
 /*
-- 
1.8.3.1


* [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
  2019-11-05 16:20 ` [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:49   ` Christoph Hellwig
  2019-11-05 16:20 ` [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute Max Gurtovoy
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

An extended LBA is a larger LBA that is created when the metadata associated
with the LBA is transferred contiguously with the LBA data (also known as
interleaved). The metadata may be transferred either as part of the LBA
(creating an extended LBA) or as a separate contiguous buffer of data.
According to the NVMeoF spec, a fabrics ctrl supports only an extended LBA
format, so fail revalidation in case of a spec violation. Also initialize the
integrity profile of the block device for a fabrics ctrl.
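
For example, a namespace formatted with 512-byte LBAs and 8 bytes of PI
metadata transfers 520 bytes per block in the extended LBA format (data and
PI interleaved), while the separate-buffer format would transfer 512 bytes of
data plus a distinct 8-byte-per-block metadata buffer; only the former is
accepted for a fabrics controller here.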

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
---
 drivers/nvme/host/core.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 44b247f..832b0fd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1754,9 +1754,10 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	blk_queue_io_min(disk->queue, phys_bs);
 	blk_queue_io_opt(disk->queue, io_opt);
 
-	if (ns->ms && !ns->ext &&
-	    (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
-		nvme_init_integrity(disk, ns->ms, ns->pi_type);
+	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) {
+		if ((ns->ctrl->opts && ns->ctrl->opts->pi_enable) || !ns->ext)
+			nvme_init_integrity(disk, ns->ms, ns->pi_type);
+	}
 	if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) ||
 	    ns->lba_shift > PAGE_SHIFT)
 		capacity = 0;
@@ -1774,7 +1775,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	blk_mq_unfreeze_queue(disk->queue);
 }
 
-static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
+static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
 	struct nvme_ns *ns = disk->private_data;
 
@@ -1788,6 +1789,16 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	ns->noiob = le16_to_cpu(id->noiob);
 	ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms);
 	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
+
+	/*
+	 * For Fabrics, only metadata as part of extended data LBA is supported.
+	 * fail in case of a spec violation.
+	 */
+	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_FABRICS)) {
+		if (WARN_ON_ONCE(!ns->ext))
+			return -EINVAL;
+	}
+
 	/* the PI implementation requires metadata equal t10 pi tuple size */
 	if (ns->ms == sizeof(struct t10_pi_tuple))
 		ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
@@ -1804,6 +1815,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 		revalidate_disk(ns->head->disk);
 	}
 #endif
+	return 0;
 }
 
 static int nvme_revalidate_disk(struct gendisk *disk)
@@ -1828,7 +1840,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto free_id;
 	}
 
-	__nvme_revalidate_disk(disk, id);
+	ret = __nvme_revalidate_disk(disk, id);
+	if (ret)
+		goto free_id;
+
 	ret = nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids);
 	if (ret)
 		goto free_id;
@@ -3496,7 +3511,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
 	ns->disk = disk;
 
-	__nvme_revalidate_disk(disk, id);
+	if (__nvme_revalidate_disk(disk, id))
+		goto out_free_disk;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		ret = nvme_nvm_register(ns, disk_name, node);
@@ -3519,18 +3535,20 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	kfree(id);
 
 	return 0;
- out_put_disk:
+out_put_disk:
 	put_disk(ns->disk);
- out_unlink_ns:
+out_free_disk:
+	del_gendisk(ns->disk);
+out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
 	mutex_unlock(&ctrl->subsys->lock);
 	nvme_put_ns_head(ns->head);
- out_free_id:
+out_free_id:
 	kfree(id);
- out_free_queue:
+out_free_queue:
 	blk_cleanup_queue(ns->queue);
- out_free_ns:
+out_free_ns:
 	kfree(ns);
 	if (ret > 0)
 		ret = blk_status_to_errno(nvme_error_status(ret));
-- 
1.8.3.1


* [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (2 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:49   ` Christoph Hellwig
  2019-11-12 17:53   ` Sagi Grimberg
  2019-11-05 16:20 ` [PATCH 04/15] nvme: Inline nvme_ns_has_pi function Max Gurtovoy
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

This patch doesn't change any logic, and is needed as a preparation
for adding PI support for fabrics drivers that will use an extended
LBA format for metadata.
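
The new attribute is plumbed into blk_queue_max_integrity_segments() in place
of the hard-coded 1. The PCI driver keeps 1, since its metadata travels as a
single separate contiguous buffer, while nvme-rdma will set it from the
fast-registration page list size once PI is enabled (see the later patch in
this series).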

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 11 +++++++----
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  7 +++++++
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 832b0fd..9ec8322 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1577,7 +1577,8 @@ static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type)
+static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type,
+				u32 max_integrity_segments)
 {
 	struct blk_integrity integrity;
 
@@ -1600,10 +1601,11 @@ static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type)
 	}
 	integrity.tuple_size = ms;
 	blk_integrity_register(disk, &integrity);
-	blk_queue_max_integrity_segments(disk->queue, 1);
+	blk_queue_max_integrity_segments(disk->queue, max_integrity_segments);
 }
 #else
-static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type)
+static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type,
+				u32 max_integrity_segments)
 {
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
@@ -1756,7 +1758,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
 
 	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) {
 		if ((ns->ctrl->opts && ns->ctrl->opts->pi_enable) || !ns->ext)
-			nvme_init_integrity(disk, ns->ms, ns->pi_type);
+			nvme_init_integrity(disk, ns->ms, ns->pi_type,
+					    ns->ctrl->max_integrity_segments);
 	}
 	if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) ||
 	    ns->lba_shift > PAGE_SHIFT)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2637d9d..1a8b9bb 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -216,6 +216,7 @@ struct nvme_ctrl {
 	u32 page_size;
 	u32 max_hw_sectors;
 	u32 max_segments;
+	u32 max_integrity_segments;
 	u16 crdt[3];
 	u16 oncs;
 	u16 oacs;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 86bd379..5397774 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2573,6 +2573,13 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 	}
 
+	/*
+	 * NVMe PCI driver doesn't support Extended LBA format so using 1
+	 * integrity segment should be enough for a separate contiguous
+	 * buffer of metadata.
+	 */
+	dev->ctrl.max_integrity_segments = 1;
+
 	result = nvme_init_identify(&dev->ctrl);
 	if (result)
 		goto out;
-- 
1.8.3.1


* [PATCH 04/15] nvme: Inline nvme_ns_has_pi function
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (3 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:50   ` Christoph Hellwig
  2019-11-05 16:20 ` [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

This function will be used by transports that support the metadata/T10-PI
mechanism.
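
Moving the helper to nvme.h lets nvme-rdma (later in this series) call
nvme_ns_has_pi() from its queue_rq path to decide whether a read/write should
be mapped through a signature MR.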

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/host/core.c | 6 ------
 drivers/nvme/host/nvme.h | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9ec8322..6f51471 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -18,7 +18,6 @@
 #include <linux/pr.h>
 #include <linux/ptrace.h>
 #include <linux/nvme_ioctl.h>
-#include <linux/t10-pi.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
 
@@ -194,11 +193,6 @@ static int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
-static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
-{
-	return ns->pi_type && ns->ms == sizeof(struct t10_pi_tuple);
-}
-
 static blk_status_t nvme_error_status(u16 status)
 {
 	switch (status & 0x7ff) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1a8b9bb..4293de8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -11,6 +11,7 @@
 #include <linux/pci.h>
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
+#include <linux/t10-pi.h>
 #include <linux/lightnvm.h>
 #include <linux/sed-opal.h>
 #include <linux/fault-inject.h>
@@ -666,4 +667,9 @@ static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
 	return dev_to_disk(dev)->private_data;
 }
 
+static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
+{
+	return ns->pi_type && ns->ms == sizeof(struct t10_pi_tuple);
+}
+
 #endif /* _NVME_H */
-- 
1.8.3.1


* [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (4 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 04/15] nvme: Inline nvme_ns_has_pi function Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:58   ` Christoph Hellwig
  2019-11-12 18:22   ` Sagi Grimberg
  2019-11-05 16:20 ` [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag Max Gurtovoy
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
protection information passthrough and validation for the NVMe over RDMA
transport. Metadata offload support is implemented over the new RDMA
signature verbs API and is enabled per controller using nvme-cli.

usage example:
nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
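
In short: for each PI-enabled I/O queue a pool of IB_MR_TYPE_INTEGRITY MRs is
allocated alongside the regular MR pool, the data and PI scatterlists of a
protected request are registered together via ib_map_mr_sg_pi() and an
IB_WR_REG_MR_INTEGRITY work request, the NVMe PRINFO bits are translated into
the wire/memory signature domains (with PRACT set the memory domain is
IB_SIG_TYPE_NONE, so the HCA inserts/strips the PI), and on completion
ib_check_mr_status() maps any signature error to the matching NVMe
guard/reftag/apptag status code.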

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
---
 drivers/nvme/host/rdma.c | 346 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 298 insertions(+), 48 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 05f2dfa..16263b8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -48,6 +48,12 @@ struct nvme_rdma_qe {
 	u64			dma;
 };
 
+struct nvme_rdma_sgl {
+	int			nents;
+	struct sg_table		sg_table;
+	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
+};
+
 struct nvme_rdma_queue;
 struct nvme_rdma_request {
 	struct nvme_request	req;
@@ -58,12 +64,14 @@ struct nvme_rdma_request {
 	refcount_t		ref;
 	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32			num_sge;
-	int			nents;
 	struct ib_reg_wr	reg_wr;
 	struct ib_cqe		reg_cqe;
 	struct nvme_rdma_queue  *queue;
-	struct sg_table		sg_table;
-	struct scatterlist	first_sgl[];
+	/* T10-PI support */
+	bool			is_protected;
+
+	struct nvme_rdma_sgl	data_sgl;
+	struct nvme_rdma_sgl	pi_sgl[];
 };
 
 enum nvme_rdma_queue_flags {
@@ -85,6 +93,7 @@ struct nvme_rdma_queue {
 	struct rdma_cm_id	*cm_id;
 	int			cm_error;
 	struct completion	cm_done;
+	bool			pi_support;
 };
 
 struct nvme_rdma_ctrl {
@@ -112,6 +121,7 @@ struct nvme_rdma_ctrl {
 	struct nvme_ctrl	ctrl;
 	bool			use_inline_data;
 	u32			io_queues[HCTX_MAX_TYPES];
+	bool			pi_support;
 };
 
 static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
@@ -269,6 +279,8 @@ static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
 	init_attr.qp_type = IB_QPT_RC;
 	init_attr.send_cq = queue->ib_cq;
 	init_attr.recv_cq = queue->ib_cq;
+	if (queue->pi_support)
+		init_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
 
 	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
 
@@ -408,6 +420,8 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 	dev = queue->device;
 	ibdev = dev->dev;
 
+	if (queue->pi_support)
+		ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
 	ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
 
 	/*
@@ -424,10 +438,14 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 	nvme_rdma_dev_put(dev);
 }
 
-static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev)
+static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev, bool pi_support)
 {
-	return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
-		     ibdev->attrs.max_fast_reg_page_list_len - 1);
+	if (pi_support)
+		return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
+			     ibdev->attrs.max_pi_fast_reg_page_list_len - 1);
+	else
+		return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
+			     ibdev->attrs.max_fast_reg_page_list_len - 1);
 }
 
 static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
@@ -484,7 +502,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	 * misaligned we'll end up using two entries for a single data page,
 	 * so one additional entry is required.
 	 */
-	pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev) + 1;
+	pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
 	ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
 			      queue->queue_size,
 			      IB_MR_TYPE_MEM_REG,
@@ -496,10 +514,24 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 		goto out_destroy_ring;
 	}
 
+	if (queue->pi_support) {
+		ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
+				      queue->queue_size, IB_MR_TYPE_INTEGRITY,
+				      pages_per_mr, pages_per_mr);
+		if (ret) {
+			dev_err(queue->ctrl->ctrl.device,
+				"failed to initialize PI MR pool sized %d for QID %d\n",
+				queue->queue_size, idx);
+			goto out_destroy_mr_pool;
+		}
+	}
+
 	set_bit(NVME_RDMA_Q_TR_READY, &queue->flags);
 
 	return 0;
 
+out_destroy_mr_pool:
+	ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
 out_destroy_ring:
 	nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
 			    sizeof(struct nvme_completion), DMA_FROM_DEVICE);
@@ -521,6 +553,7 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
 
 	queue = &ctrl->queues[idx];
 	queue->ctrl = ctrl;
+	queue->pi_support = idx && ctrl->pi_support;
 	init_completion(&queue->cm_done);
 
 	if (idx > 0)
@@ -730,8 +763,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 		set->reserved_tags = 2; /* connect + keep-alive */
 		set->numa_node = nctrl->numa_node;
-		set->cmd_size = sizeof(struct nvme_rdma_request) +
-			SG_CHUNK_SIZE * sizeof(struct scatterlist);
+		set->cmd_size = sizeof(struct nvme_rdma_request);
 		set->driver_data = ctrl;
 		set->nr_hw_queues = 1;
 		set->timeout = ADMIN_TIMEOUT;
@@ -745,7 +777,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->numa_node = nctrl->numa_node;
 		set->flags = BLK_MQ_F_SHOULD_MERGE;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
-			SG_CHUNK_SIZE * sizeof(struct scatterlist);
+			(ctrl->pi_support * sizeof(struct nvme_rdma_sgl));
 		set->driver_data = ctrl;
 		set->nr_hw_queues = nctrl->queue_count - 1;
 		set->timeout = NVME_IO_TIMEOUT;
@@ -787,7 +819,22 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	ctrl->device = ctrl->queues[0].device;
 	ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
 
-	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
+	/* T10-PI support */
+	if (ctrl->ctrl.opts->pi_enable) {
+		if (!(ctrl->device->dev->attrs.device_cap_flags &
+		      IB_DEVICE_INTEGRITY_HANDOVER)) {
+			dev_warn(ctrl->ctrl.device,
+				 "T10-PI requested but not supported on %s, continue without T10-PI\n",
+				 ctrl->device->dev->name);
+			ctrl->pi_support = false;
+			ctrl->ctrl.opts->pi_enable = false;
+		} else {
+			ctrl->pi_support = true;
+		}
+	}
+
+	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev,
+							ctrl->pi_support);
 
 	/*
 	 * Bind the async event SQE DMA mapping to the admin queue lifetime.
@@ -829,6 +876,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 
 	ctrl->ctrl.max_segments = ctrl->max_fr_pages;
 	ctrl->ctrl.max_hw_sectors = ctrl->max_fr_pages << (ilog2(SZ_4K) - 9);
+	if (ctrl->pi_support)
+		ctrl->ctrl.max_integrity_segments = ctrl->max_fr_pages;
 
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 
@@ -1154,13 +1203,24 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 	if (!blk_rq_nr_phys_segments(rq))
 		return;
 
+	if (blk_integrity_rq(rq)) {
+		ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
+			req->pi_sgl->nents, rq_dma_dir(rq));
+		sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
+	}
+
 	if (req->mr) {
-		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
+		if (req->is_protected)
+			ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
+		else
+			ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs,
+				       req->mr);
 		req->mr = NULL;
 	}
 
-	ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
-	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
+	ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
+			req->data_sgl.nents, rq_dma_dir(rq));
+	sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
 }
 
 static int nvme_rdma_set_sg_null(struct nvme_command *c)
@@ -1179,7 +1239,7 @@ static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
 		int count)
 {
 	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
-	struct scatterlist *sgl = req->sg_table.sgl;
+	struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
 	struct ib_sge *sge = &req->sge[1];
 	u32 len = 0;
 	int i;
@@ -1204,40 +1264,115 @@ static int nvme_rdma_map_sg_single(struct nvme_rdma_queue *queue,
 {
 	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
 
-	sg->addr = cpu_to_le64(sg_dma_address(req->sg_table.sgl));
-	put_unaligned_le24(sg_dma_len(req->sg_table.sgl), sg->length);
+	sg->addr = cpu_to_le64(sg_dma_address(req->data_sgl.sg_table.sgl));
+	put_unaligned_le24(sg_dma_len(req->data_sgl.sg_table.sgl), sg->length);
 	put_unaligned_le32(queue->device->pd->unsafe_global_rkey, sg->key);
 	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
 	return 0;
 }
 
-static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
-		struct nvme_rdma_request *req, struct nvme_command *c,
-		int count)
+static void nvme_rdma_set_diff_domain(struct nvme_command *cmd, struct bio *bio,
+		struct ib_sig_domain *domain, struct request *rq)
 {
-	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
-	int nr;
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
+	struct nvme_ns *ns = rq->rq_disk->private_data;
 
-	req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
-	if (WARN_ON_ONCE(!req->mr))
-		return -EAGAIN;
+	WARN_ON(bi == NULL);
+
+	domain->sig_type = IB_SIG_TYPE_T10_DIF;
+	domain->sig.dif.bg_type = IB_T10DIF_CRC;
+	domain->sig.dif.pi_interval = 1 << bi->interval_exp;
+	domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
 
 	/*
-	 * Align the MR to a 4K page size to match the ctrl page size and
-	 * the block virtual boundary.
+	 * At the moment we hard code those, but in the future
+	 * we will take them from cmd.
 	 */
-	nr = ib_map_mr_sg(req->mr, req->sg_table.sgl, count, NULL, SZ_4K);
-	if (unlikely(nr < count)) {
-		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
-		req->mr = NULL;
-		if (nr < 0)
-			return nr;
-		return -EINVAL;
+	domain->sig.dif.apptag_check_mask = 0xffff;
+	domain->sig.dif.app_escape = true;
+	domain->sig.dif.ref_escape = true;
+	if (ns->pi_type != NVME_NS_DPS_PI_TYPE3)
+		domain->sig.dif.ref_remap = true;
+}
+
+static void nvme_rdma_set_sig_attrs(struct bio *bio,
+		struct ib_sig_attrs *sig_attrs, struct nvme_command *c,
+		struct request *rq)
+{
+	u16 control = le16_to_cpu(c->rw.control);
+
+	memset(sig_attrs, 0, sizeof(*sig_attrs));
+
+	if (control & NVME_RW_PRINFO_PRACT) {
+		/* for WRITE_INSERT/READ_STRIP no memory domain */
+		sig_attrs->mem.sig_type = IB_SIG_TYPE_NONE;
+		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
+		/* Clear the PRACT bit since HCA will generate/verify the PI */
+		control &= ~NVME_RW_PRINFO_PRACT;
+		c->rw.control = cpu_to_le16(control);
+	} else {
+		/* for WRITE_PASS/READ_PASS both wire/memory domains exist */
+		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
+		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->mem, rq);
 	}
+}
+
+static void nvme_rdma_set_prot_checks(struct nvme_command *cmd, u8 *mask)
+{
+	*mask = 0;
+	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_REF)
+		*mask |= IB_SIG_CHECK_REFTAG;
+	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_GUARD)
+		*mask |= IB_SIG_CHECK_GUARD;
+}
+
+static void nvme_rdma_sig_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	if (unlikely(wc->status != IB_WC_SUCCESS))
+		nvme_rdma_wr_error(cq, wc, "SIG");
+}
+
+static void nvme_rdma_set_pi_wr(struct nvme_rdma_request *req,
+				struct nvme_command *c)
+{
+	struct ib_sig_attrs *sig_attrs = req->mr->sig_attrs;
+	struct ib_reg_wr *wr = &req->reg_wr;
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	struct bio *bio = rq->bio;
+	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
+
+	nvme_rdma_set_sig_attrs(bio, sig_attrs, c, rq);
+
+	nvme_rdma_set_prot_checks(c, &sig_attrs->check_mask);
 
 	ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
 
+	req->reg_cqe.done = nvme_rdma_sig_done;
+
+	memset(wr, 0, sizeof(*wr));
+	wr->wr.opcode = IB_WR_REG_MR_INTEGRITY;
+	wr->wr.wr_cqe = &req->reg_cqe;
+	wr->wr.num_sge = 0;
+	wr->wr.send_flags = 0;
+	wr->mr = req->mr;
+	wr->key = req->mr->rkey;
+	wr->access = IB_ACCESS_LOCAL_WRITE |
+		     IB_ACCESS_REMOTE_READ |
+		     IB_ACCESS_REMOTE_WRITE;
+
+	sg->addr = cpu_to_le64(req->mr->iova);
+	put_unaligned_le24(req->mr->length, sg->length);
+	put_unaligned_le32(req->mr->rkey, sg->key);
+	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_INVALIDATE;
+}
+
+static void nvme_rdma_set_reg_wr(struct nvme_rdma_request *req,
+				 struct nvme_command *c)
+{
+	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
+
 	req->reg_cqe.done = nvme_rdma_memreg_done;
+
 	memset(&req->reg_wr, 0, sizeof(req->reg_wr));
 	req->reg_wr.wr.opcode = IB_WR_REG_MR;
 	req->reg_wr.wr.wr_cqe = &req->reg_cqe;
@@ -1253,8 +1388,52 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 	put_unaligned_le32(req->mr->rkey, sg->key);
 	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
 			NVME_SGL_FMT_INVALIDATE;
+}
+
+static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
+		struct nvme_rdma_request *req, struct nvme_command *c,
+		int count, int pi_count)
+{
+	struct nvme_rdma_sgl *sgl = &req->data_sgl;
+	int nr;
+
+	if (req->is_protected) {
+		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
+		if (WARN_ON_ONCE(!req->mr))
+			return -EAGAIN;
+
+		nr = ib_map_mr_sg_pi(req->mr, sgl->sg_table.sgl, count, 0,
+				     req->pi_sgl->sg_table.sgl, pi_count, 0,
+				     SZ_4K);
+		if (unlikely(nr))
+			goto mr_put;
+
+		nvme_rdma_set_pi_wr(req, c);
+	} else {
+		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
+		if (WARN_ON_ONCE(!req->mr))
+			return -EAGAIN;
+		/*
+		 * Align the MR to a 4K page size to match the ctrl page size
+		 * and the block virtual boundary.
+		 */
+		nr = ib_map_mr_sg(req->mr, sgl->sg_table.sgl, count, 0, SZ_4K);
+		if (unlikely(nr < count))
+			goto mr_put;
+
+		nvme_rdma_set_reg_wr(req, c);
+	}
 
 	return 0;
+mr_put:
+	if (req->is_protected)
+		ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
+	else
+		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
+	req->mr = NULL;
+	if (nr < 0)
+		return nr;
+	return -EINVAL;
 }
 
 static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
@@ -1263,6 +1442,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_device *ibdev = dev->dev;
+	int pi_count = 0;
 	int count, ret;
 
 	req->num_sge = 1;
@@ -1273,23 +1453,44 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	if (!blk_rq_nr_phys_segments(rq))
 		return nvme_rdma_set_sg_null(c);
 
-	req->sg_table.sgl = req->first_sgl;
-	ret = sg_alloc_table_chained(&req->sg_table,
-			blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
-			SG_CHUNK_SIZE);
+	req->data_sgl.sg_table.sgl = req->data_sgl.first_sgl;
+	ret = sg_alloc_table_chained(&req->data_sgl.sg_table,
+			blk_rq_nr_phys_segments(rq),
+			req->data_sgl.sg_table.sgl, SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
-	req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl);
+	req->data_sgl.nents = blk_rq_map_sg(rq->q, rq,
+					    req->data_sgl.sg_table.sgl);
 
-	count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents,
-			      rq_dma_dir(rq));
+	count = ib_dma_map_sg(ibdev, req->data_sgl.sg_table.sgl,
+			      req->data_sgl.nents, rq_dma_dir(rq));
 	if (unlikely(count <= 0)) {
 		ret = -EIO;
 		goto out_free_table;
 	}
 
-	if (count <= dev->num_inline_segments) {
+	if (blk_integrity_rq(rq)) {
+		req->pi_sgl->sg_table.sgl = req->pi_sgl->first_sgl;
+		ret = sg_alloc_table_chained(&req->pi_sgl->sg_table,
+				blk_rq_count_integrity_sg(rq->q, rq->bio),
+				req->pi_sgl->sg_table.sgl, SG_CHUNK_SIZE);
+		if (unlikely(ret)) {
+			ret = -ENOMEM;
+			goto out_unmap_sg;
+		}
+
+		req->pi_sgl->nents = blk_rq_map_integrity_sg(rq->q, rq->bio,
+					req->pi_sgl->sg_table.sgl);
+		pi_count = ib_dma_map_sg(ibdev, req->pi_sgl->sg_table.sgl,
+				req->pi_sgl->nents, rq_dma_dir(rq));
+		if (unlikely(pi_count <= 0)) {
+			ret = -EIO;
+			goto out_free_pi_table;
+		}
+	}
+
+	if (count <= dev->num_inline_segments && !req->is_protected) {
 		if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
 		    queue->ctrl->use_inline_data &&
 		    blk_rq_payload_bytes(rq) <=
@@ -1304,17 +1505,25 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 		}
 	}
 
-	ret = nvme_rdma_map_sg_fr(queue, req, c, count);
+	ret = nvme_rdma_map_sg_fr(queue, req, c, count, pi_count);
 out:
 	if (unlikely(ret))
-		goto out_unmap_sg;
+		goto out_unmap_pi_sg;
 
 	return 0;
 
+out_unmap_pi_sg:
+	if (blk_integrity_rq(rq))
+		ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
+				req->pi_sgl->nents, rq_dma_dir(rq));
+out_free_pi_table:
+	if (blk_integrity_rq(rq))
+		sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
 out_unmap_sg:
-	ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
+	ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
+			req->data_sgl.nents, rq_dma_dir(rq));
 out_free_table:
-	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
+	sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
 	return ret;
 }
 
@@ -1754,6 +1963,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(rq);
 
+	req->is_protected = false;
+	if (nvme_rdma_queue_idx(queue) && queue->pi_support) {
+		if (c->common.opcode == nvme_cmd_write ||
+		    c->common.opcode == nvme_cmd_read)
+			req->is_protected = nvme_ns_has_pi(ns);
+	}
+
 	err = nvme_rdma_map_data(queue, rq, c);
 	if (unlikely(err < 0)) {
 		dev_err(queue->ctrl->ctrl.device,
@@ -1794,12 +2010,46 @@ static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx)
 	return ib_process_cq_direct(queue->ib_cq, -1);
 }
 
+static void nvme_rdma_check_pi_status(struct nvme_rdma_request *req)
+{
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	struct ib_mr_status mr_status;
+	int ret;
+
+	ret = ib_check_mr_status(req->mr, IB_MR_CHECK_SIG_STATUS, &mr_status);
+	if (ret) {
+		pr_err("ib_check_mr_status failed, ret %d\n", ret);
+		nvme_req(rq)->status = NVME_SC_INVALID_PI;
+		return;
+	}
+
+	if (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS) {
+		switch (mr_status.sig_err.err_type) {
+		case IB_SIG_BAD_GUARD:
+			nvme_req(rq)->status = NVME_SC_GUARD_CHECK;
+			break;
+		case IB_SIG_BAD_REFTAG:
+			nvme_req(rq)->status = NVME_SC_REFTAG_CHECK;
+			break;
+		case IB_SIG_BAD_APPTAG:
+			nvme_req(rq)->status = NVME_SC_APPTAG_CHECK;
+			break;
+		}
+		pr_err("PI error found type %d expected 0x%x vs actual 0x%x\n",
+		       mr_status.sig_err.err_type, mr_status.sig_err.expected,
+		       mr_status.sig_err.actual);
+	}
+}
+
 static void nvme_rdma_complete_rq(struct request *rq)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_queue *queue = req->queue;
 	struct ib_device *ibdev = queue->device->dev;
 
+	if (req->is_protected)
+		nvme_rdma_check_pi_status(req);
+
 	nvme_rdma_unmap_data(queue, rq);
 	ib_dma_unmap_single(ibdev, req->sqe.dma, sizeof(struct nvme_command),
 			    DMA_TO_DEVICE);
@@ -1919,7 +2169,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
 	.name			= "rdma",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_FABRICS,
+	.flags			= NVME_F_FABRICS | NVME_F_METADATA_SUPPORTED,
 	.reg_read32		= nvmf_reg_read32,
 	.reg_read64		= nvmf_reg_read64,
 	.reg_write32		= nvmf_reg_write32,
@@ -2063,7 +2313,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
 			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
-			  NVMF_OPT_TOS,
+			  NVMF_OPT_TOS | NVMF_OPT_PI_ENABLE,
 	.create_ctrl	= nvme_rdma_create_ctrl,
 };
 
-- 
1.8.3.1


* [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (5 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:52   ` Christoph Hellwig
  2019-11-07  2:43   ` Martin K. Petersen
  2019-11-05 16:20 ` [PATCH 07/15] nvmet: Prepare metadata request Max Gurtovoy
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

The new flag indicates that the reference tag shouldn't be remapped. It will
be used when some other layer has already done the reftag remapping (e.g. in
NVMe/RDMA the initiator controller performs the remapping, so the target side
shouldn't remap it again).
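
A submitter that builds such bios (the nvmet block-device backend later in
this series) can set the flag in bip->bip_flags when attaching the integrity
payload, so that t10_pi_type1_prepare()/t10_pi_type1_complete() leave the
reference tags untouched.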

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 block/t10-pi.c      | 7 +++++++
 include/linux/bio.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index f4907d9..6de380b 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -144,6 +144,9 @@ static void t10_pi_type1_prepare(struct request *rq)
 		/* Already remapped? */
 		if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
 			break;
+		/* No need to remap */
+		if (bip->bip_flags & BIP_NOMAP_INTEGRITY)
+			break;
 
 		bip_for_each_vec(iv, bip, iter) {
 			void *p, *pmap;
@@ -193,6 +196,10 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 		struct bio_vec iv;
 		struct bvec_iter iter;
 
+		/* No need to remap */
+		if (bip->bip_flags & BIP_NOMAP_INTEGRITY)
+			break;
+
 		bip_for_each_vec(iv, bip, iter) {
 			void *p, *pmap;
 			unsigned int j;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3cdb84c..2ba09e6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -299,6 +299,7 @@ enum bip_flags {
 	BIP_CTRL_NOCHECK	= 1 << 2, /* disable HBA integrity checking */
 	BIP_DISK_NOCHECK	= 1 << 3, /* disable disk integrity checking */
 	BIP_IP_CHECKSUM		= 1 << 4, /* IP checksum */
+	BIP_NOMAP_INTEGRITY	= 1 << 5, /* ref tag shouldn't be remapped */
 };
 
 /*
-- 
1.8.3.1


* [PATCH 07/15] nvmet: Prepare metadata request
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (6 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 08/15] nvmet: Add metadata characteristics for a namespace Max Gurtovoy
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

Allocate the metadata SGL buffers and add metadata fields for the
request.
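
The transport is expected to fill in req->prot_len before calling
nvmet_req_alloc_sgl(), which then allocates separate data and PI
scatterlists, preferring P2P memory when available and falling back to
sgl_alloc() otherwise.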

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/core.c  | 48 ++++++++++++++++++++++++++++++++++++++-------
 drivers/nvme/target/nvmet.h |  5 +++++
 2 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 28438b8..d545ff5 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -863,13 +863,17 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
 	req->sq = sq;
 	req->ops = ops;
 	req->sg = NULL;
+	req->prot_sg = NULL;
 	req->sg_cnt = 0;
+	req->prot_sg_cnt = 0;
 	req->transfer_len = 0;
+	req->prot_len = 0;
 	req->cqe->status = 0;
 	req->cqe->sq_head = 0;
 	req->ns = NULL;
 	req->error_loc = NVMET_NO_ERROR_LOC;
 	req->error_slba = 0;
+	req->is_protected = false;
 
 	trace_nvmet_req_init(req, req->cmd);
 
@@ -941,6 +945,7 @@ bool nvmet_check_data_len(struct nvmet_req *req, size_t data_len)
 int nvmet_req_alloc_sgl(struct nvmet_req *req)
 {
 	struct pci_dev *p2p_dev = NULL;
+	int data_len = req->transfer_len - req->prot_len;
 
 	if (IS_ENABLED(CONFIG_PCI_P2PDMA)) {
 		if (req->sq->ctrl && req->ns)
@@ -950,11 +955,23 @@ int nvmet_req_alloc_sgl(struct nvmet_req *req)
 		req->p2p_dev = NULL;
 		if (req->sq->qid && p2p_dev) {
 			req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
-						       req->transfer_len);
-			if (req->sg) {
-				req->p2p_dev = p2p_dev;
-				return 0;
+						       data_len);
+			if (!req->sg)
+				goto fallback;
+
+			if (req->prot_len) {
+				req->prot_sg =
+					pci_p2pmem_alloc_sgl(p2p_dev,
+							     &req->prot_sg_cnt,
+							     req->prot_len);
+				if (!req->prot_sg) {
+					pci_p2pmem_free_sgl(req->p2p_dev,
+							    req->sg);
+					goto fallback;
+				}
 			}
+			req->p2p_dev = p2p_dev;
+			return 0;
 		}
 
 		/*
@@ -963,23 +980,40 @@ int nvmet_req_alloc_sgl(struct nvmet_req *req)
 		 */
 	}
 
-	req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
+fallback:
+	req->sg = sgl_alloc(data_len, GFP_KERNEL, &req->sg_cnt);
 	if (unlikely(!req->sg))
 		return -ENOMEM;
 
+	if (req->prot_len) {
+		req->prot_sg = sgl_alloc(req->prot_len, GFP_KERNEL,
+					 &req->prot_sg_cnt);
+		if (unlikely(!req->prot_sg)) {
+			sgl_free(req->sg);
+			return -ENOMEM;
+		}
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
 
 void nvmet_req_free_sgl(struct nvmet_req *req)
 {
-	if (req->p2p_dev)
+	if (req->p2p_dev) {
 		pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
-	else
+		if (req->prot_sg)
+			pci_p2pmem_free_sgl(req->p2p_dev, req->prot_sg);
+	} else {
 		sgl_free(req->sg);
+		if (req->prot_sg)
+			sgl_free(req->prot_sg);
+	}
 
 	req->sg = NULL;
+	req->prot_sg = NULL;
 	req->sg_cnt = 0;
+	req->prot_sg_cnt = 0;
 }
 EXPORT_SYMBOL_GPL(nvmet_req_free_sgl);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 46df45e..8433218 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -291,6 +291,7 @@ struct nvmet_req {
 	struct nvmet_cq		*cq;
 	struct nvmet_ns		*ns;
 	struct scatterlist	*sg;
+	struct scatterlist	*prot_sg;
 	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
 	union {
 		struct {
@@ -304,8 +305,10 @@ struct nvmet_req {
 		} f;
 	};
 	int			sg_cnt;
+	int			prot_sg_cnt;
 	/* data length as parsed from the SGL descriptor: */
 	size_t			transfer_len;
+	size_t			prot_len;
 
 	struct nvmet_port	*port;
 
@@ -316,6 +319,8 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	/* Metadata (T10-PI) support */
+	bool			is_protected;
 };
 
 extern struct workqueue_struct *buffered_io_wq;
-- 
1.8.3.1


* [PATCH 08/15] nvmet: Add metadata characteristics for a namespace
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (7 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 07/15] nvmet: Prepare metadata request Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:59   ` Christoph Hellwig
  2019-11-05 16:20 ` [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len Max Gurtovoy
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

This is a preparation for adding metadata (T10-PI) over fabrics support.
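
For example, a backing block device with a Type 1 CRC integrity profile and
an 8-byte tuple will be exposed with ms = 8 and prot_type =
NVME_NS_DPS_PI_TYPE1; profiles other than t10_pi_type1_crc/t10_pi_type3_crc
are rejected when the namespace is enabled.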

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/io-cmd-bdev.c | 21 +++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  3 +++
 2 files changed, 24 insertions(+)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index b6fca0e..b6122aa7 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -50,6 +50,7 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 {
 	int ret;
+	struct blk_integrity *bi;
 
 	ns->bdev = blkdev_get_by_path(ns->device_path,
 			FMODE_READ | FMODE_WRITE, NULL);
@@ -64,6 +65,26 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 	}
 	ns->size = i_size_read(ns->bdev->bd_inode);
 	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
+
+	ns->prot_type = 0;
+	bi = bdev_get_integrity(ns->bdev);
+	if (bi) {
+		ns->ms = bi->tuple_size;
+		if (bi->profile == &t10_pi_type1_crc) {
+			ns->prot_type = NVME_NS_DPS_PI_TYPE1;
+		} else if (bi->profile == &t10_pi_type3_crc) {
+			ns->prot_type = NVME_NS_DPS_PI_TYPE3;
+		} else {
+			pr_err("block device %s: unsupported metadata type\n",
+			       ns->device_path);
+			return -EINVAL;
+		}
+	} else {
+		ns->ms = 0;
+	}
+
+	pr_debug("ms %d pi_type %d\n", ns->ms, ns->prot_type);
+
 	return 0;
 }
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 8433218..8ae7c70 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -19,6 +19,7 @@
 #include <linux/rcupdate.h>
 #include <linux/blkdev.h>
 #include <linux/radix-tree.h>
+#include <linux/t10-pi.h>
 
 #define NVMET_ASYNC_EVENTS		4
 #define NVMET_ERROR_LOG_SLOTS		128
@@ -76,6 +77,8 @@ struct nvmet_ns {
 
 	int			use_p2pmem;
 	struct pci_dev		*p2p_dev;
+	int			prot_type;
+	int			ms;
 };
 
 static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
-- 
1.8.3.1


* [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (8 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 08/15] nvmet: Add metadata characteristics for a namespace Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 17:59   ` Christoph Hellwig
  2019-11-12 18:39   ` Sagi Grimberg
  2019-11-05 16:20 ` [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len Max Gurtovoy
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

The function returns only the data length and does not include the metadata
length, so rename it accordingly. This is preparation for adding metadata
(T10-PI) support.
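
For example, a 4-block read of a namespace with 512-byte LBAs yields
nvmet_rw_data_len() == 2048 regardless of any per-block PI; the metadata
bytes are accounted for by the nvmet_rw_prot_len() helper introduced later in
this series.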

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/io-cmd-bdev.c | 2 +-
 drivers/nvme/target/io-cmd-file.c | 2 +-
 drivers/nvme/target/nvmet.h       | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index b6122aa7..dbada1f 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -172,7 +172,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	sector_t sector;
 	int op, i;
 
-	if (!nvmet_check_data_len(req, nvmet_rw_len(req)))
+	if (!nvmet_check_data_len(req, nvmet_rw_data_len(req)))
 		return;
 
 	if (!req->sg_cnt) {
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index caebfce..e0a4859 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -232,7 +232,7 @@ static void nvmet_file_execute_rw(struct nvmet_req *req)
 {
 	ssize_t nr_bvec = req->sg_cnt;
 
-	if (!nvmet_check_data_len(req, nvmet_rw_len(req)))
+	if (!nvmet_check_data_len(req, nvmet_rw_data_len(req)))
 		return;
 
 	if (!req->sg_cnt || !nr_bvec) {
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 8ae7c70..50bc21b 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -495,7 +495,7 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 u16 nvmet_file_flush(struct nvmet_req *req);
 void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid);
 
-static inline u32 nvmet_rw_len(struct nvmet_req *req)
+static inline u32 nvmet_rw_data_len(struct nvmet_req *req)
 {
 	return ((u32)le16_to_cpu(req->cmd->rw.length) + 1) <<
 			req->ns->blksize_shift;
-- 
1.8.3.1


* [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (9 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 18:00   ` Christoph Hellwig
  2019-11-12 18:43   ` Sagi Grimberg
  2019-11-05 16:20 ` [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi Max Gurtovoy
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

The function doesn't check only the data length: in some cases the transfer
length also includes the metadata length. This is preparation for adding
metadata (T10-PI) support.
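
For example, once metadata is interleaved with the data over the fabric, an
8-block write of 512-byte LBAs with 8 bytes of PI per block is expected to
transfer 8 * (512 + 8) = 4160 bytes, so comparing against the plain data
length would no longer be correct.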

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/admin-cmd.c   | 14 +++++++-------
 drivers/nvme/target/core.c        |  6 +++---
 drivers/nvme/target/discovery.c   |  8 ++++----
 drivers/nvme/target/fabrics-cmd.c |  8 ++++----
 drivers/nvme/target/io-cmd-bdev.c |  8 ++++----
 drivers/nvme/target/io-cmd-file.c |  8 ++++----
 drivers/nvme/target/nvmet.h       |  2 +-
 7 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 56c21b50..19ae155 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -284,7 +284,7 @@ static void nvmet_execute_get_log_page_ana(struct nvmet_req *req)
 
 static void nvmet_execute_get_log_page(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, nvmet_get_log_page_len(req->cmd)))
+	if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd)))
 		return;
 
 	switch (req->cmd->get_log_page.lid) {
@@ -597,7 +597,7 @@ static void nvmet_execute_identify_desclist(struct nvmet_req *req)
 
 static void nvmet_execute_identify(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, NVME_IDENTIFY_DATA_SIZE))
+	if (!nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE))
 		return;
 
 	switch (req->cmd->identify.cns) {
@@ -626,7 +626,7 @@ static void nvmet_execute_identify(struct nvmet_req *req)
  */
 static void nvmet_execute_abort(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 	nvmet_set_result(req, 1);
 	nvmet_req_complete(req, 0);
@@ -712,7 +712,7 @@ static void nvmet_execute_set_features(struct nvmet_req *req)
 	u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10);
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	switch (cdw10 & 0xff) {
@@ -778,7 +778,7 @@ static void nvmet_execute_get_features(struct nvmet_req *req)
 	u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10);
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	switch (cdw10 & 0xff) {
@@ -845,7 +845,7 @@ void nvmet_execute_async_event(struct nvmet_req *req)
 {
 	struct nvmet_ctrl *ctrl = req->sq->ctrl;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	mutex_lock(&ctrl->lock);
@@ -864,7 +864,7 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
 {
 	struct nvmet_ctrl *ctrl = req->sq->ctrl;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	pr_debug("ctrl %d update keep-alive timer for %d secs\n",
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index d545ff5..6125116 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -930,9 +930,9 @@ void nvmet_req_uninit(struct nvmet_req *req)
 }
 EXPORT_SYMBOL_GPL(nvmet_req_uninit);
 
-bool nvmet_check_data_len(struct nvmet_req *req, size_t data_len)
+bool nvmet_check_transfer_len(struct nvmet_req *req, size_t len)
 {
-	if (unlikely(data_len != req->transfer_len)) {
+	if (unlikely(len != req->transfer_len)) {
 		req->error_loc = offsetof(struct nvme_common_command, dptr);
 		nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR);
 		return false;
@@ -940,7 +940,7 @@ bool nvmet_check_data_len(struct nvmet_req *req, size_t data_len)
 
 	return true;
 }
-EXPORT_SYMBOL_GPL(nvmet_check_data_len);
+EXPORT_SYMBOL_GPL(nvmet_check_transfer_len);
 
 int nvmet_req_alloc_sgl(struct nvmet_req *req)
 {
diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c
index 0c2274b..40cf0b6 100644
--- a/drivers/nvme/target/discovery.c
+++ b/drivers/nvme/target/discovery.c
@@ -171,7 +171,7 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req)
 	u16 status = 0;
 	void *buffer;
 
-	if (!nvmet_check_data_len(req, data_len))
+	if (!nvmet_check_transfer_len(req, data_len))
 		return;
 
 	if (req->cmd->get_log_page.lid != NVME_LOG_DISC) {
@@ -244,7 +244,7 @@ static void nvmet_execute_disc_identify(struct nvmet_req *req)
 	const char model[] = "Linux";
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, NVME_IDENTIFY_DATA_SIZE))
+	if (!nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE))
 		return;
 
 	if (req->cmd->identify.cns != NVME_ID_CNS_CTRL) {
@@ -298,7 +298,7 @@ static void nvmet_execute_disc_set_features(struct nvmet_req *req)
 	u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10);
 	u16 stat;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	switch (cdw10 & 0xff) {
@@ -324,7 +324,7 @@ static void nvmet_execute_disc_get_features(struct nvmet_req *req)
 	u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10);
 	u16 stat = 0;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	switch (cdw10 & 0xff) {
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index f729747..ee41b2b 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -12,7 +12,7 @@ static void nvmet_execute_prop_set(struct nvmet_req *req)
 	u64 val = le64_to_cpu(req->cmd->prop_set.value);
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	if (req->cmd->prop_set.attrib & 1) {
@@ -41,7 +41,7 @@ static void nvmet_execute_prop_get(struct nvmet_req *req)
 	u16 status = 0;
 	u64 val = 0;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	if (req->cmd->prop_get.attrib & 1) {
@@ -151,7 +151,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, sizeof(struct nvmf_connect_data)))
+	if (!nvmet_check_transfer_len(req, sizeof(struct nvmf_connect_data)))
 		return;
 
 	d = kmalloc(sizeof(*d), GFP_KERNEL);
@@ -218,7 +218,7 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 	u16 qid = le16_to_cpu(c->qid);
 	u16 status = 0;
 
-	if (!nvmet_check_data_len(req, sizeof(struct nvmf_connect_data)))
+	if (!nvmet_check_transfer_len(req, sizeof(struct nvmf_connect_data)))
 		return;
 
 	d = kmalloc(sizeof(*d), GFP_KERNEL);
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index dbada1f..2099c82 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -172,7 +172,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	sector_t sector;
 	int op, i;
 
-	if (!nvmet_check_data_len(req, nvmet_rw_data_len(req)))
+	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
 		return;
 
 	if (!req->sg_cnt) {
@@ -233,7 +233,7 @@ static void nvmet_bdev_execute_flush(struct nvmet_req *req)
 {
 	struct bio *bio = &req->b.inline_bio;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
@@ -301,7 +301,7 @@ static void nvmet_bdev_execute_discard(struct nvmet_req *req)
 
 static void nvmet_bdev_execute_dsm(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, nvmet_dsm_len(req)))
+	if (!nvmet_check_transfer_len(req, nvmet_dsm_len(req)))
 		return;
 
 	switch (le32_to_cpu(req->cmd->dsm.attributes)) {
@@ -325,7 +325,7 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	sector_t nr_sector;
 	int ret;
 
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
 	sector = le64_to_cpu(write_zeroes->slba) <<
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index e0a4859..1492f55 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -232,7 +232,7 @@ static void nvmet_file_execute_rw(struct nvmet_req *req)
 {
 	ssize_t nr_bvec = req->sg_cnt;
 
-	if (!nvmet_check_data_len(req, nvmet_rw_data_len(req)))
+	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
 		return;
 
 	if (!req->sg_cnt || !nr_bvec) {
@@ -276,7 +276,7 @@ static void nvmet_file_flush_work(struct work_struct *w)
 
 static void nvmet_file_execute_flush(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 	INIT_WORK(&req->f.work, nvmet_file_flush_work);
 	schedule_work(&req->f.work);
@@ -336,7 +336,7 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, nvmet_dsm_len(req)))
+	if (!nvmet_check_transfer_len(req, nvmet_dsm_len(req)))
 		return;
 	INIT_WORK(&req->f.work, nvmet_file_dsm_work);
 	schedule_work(&req->f.work);
@@ -366,7 +366,7 @@ static void nvmet_file_write_zeroes_work(struct work_struct *w)
 
 static void nvmet_file_execute_write_zeroes(struct nvmet_req *req)
 {
-	if (!nvmet_check_data_len(req, 0))
+	if (!nvmet_check_transfer_len(req, 0))
 		return;
 	INIT_WORK(&req->f.work, nvmet_file_write_zeroes_work);
 	schedule_work(&req->f.work);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 50bc21b..39d5300 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -381,7 +381,7 @@ static inline bool nvmet_aen_bit_disabled(struct nvmet_ctrl *ctrl, u32 bn)
 bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
 		struct nvmet_sq *sq, const struct nvmet_fabrics_ops *ops);
 void nvmet_req_uninit(struct nvmet_req *req);
-bool nvmet_check_data_len(struct nvmet_req *req, size_t data_len);
+bool nvmet_check_transfer_len(struct nvmet_req *req, size_t len);
 void nvmet_req_complete(struct nvmet_req *req, u16 status);
 int nvmet_req_alloc_sgl(struct nvmet_req *req);
 void nvmet_req_free_sgl(struct nvmet_req *req);
-- 
1.8.3.1



* [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (10 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 18:00   ` Christoph Hellwig
  2019-11-05 16:20 ` [PATCH 12/15] nvme: Add Metadata Capabilities enumerations Max Gurtovoy
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

This is preparation for adding metadata (T10-PI) over fabric support.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/nvmet.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 39d5300..574ee55 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -501,6 +501,16 @@ static inline u32 nvmet_rw_data_len(struct nvmet_req *req)
 			req->ns->blksize_shift;
 }
 
+static inline u32 nvmet_rw_prot_len(struct nvmet_req *req)
+{
+	return ((u32)le16_to_cpu(req->cmd->rw.length) + 1) * req->ns->ms;
+}
+
+static inline bool nvmet_ns_has_pi(struct nvmet_ns *ns)
+{
+	return ns->prot_type && ns->ms == sizeof(struct t10_pi_tuple);
+}
+
 static inline u32 nvmet_dsm_len(struct nvmet_req *req)
 {
 	return (le32_to_cpu(req->cmd->dsm.nr) + 1) *
-- 
1.8.3.1



* [PATCH 12/15] nvme: Add Metadata Capabilities enumerations
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (11 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 13/15] nvmet: Add metadata/T10-PI support Max Gurtovoy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

This is preparation for adding metadata (T10-PI) over fabric support.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 include/linux/nvme.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3eca4f7..9eda901 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -396,6 +396,8 @@ enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
+	NVME_NS_MC_META_EXT	= 1 << 0,
+	NVME_NS_MC_META_BUF	= 1 << 1,
 	NVME_LBAF_RP_BEST	= 0,
 	NVME_LBAF_RP_BETTER	= 1,
 	NVME_LBAF_RP_GOOD	= 2,
-- 
1.8.3.1



* [PATCH 13/15] nvmet: Add metadata/T10-PI support
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (12 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 12/15] nvme: Add Metadata Capabilities enumerations Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 14/15] nvmet: Add metadata support for block devices Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
  15 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

Expose the namespace metadata format when PI is enabled. The user needs
to enable the capability per subsystem and the other metadata properties
are taken from the namespace/bdev.

Usage example:
echo 1 > /config/nvmet/subsystems/${NAME}/attr_pi_enable

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/admin-cmd.c   | 23 +++++++++++++++++++----
 drivers/nvme/target/configfs.c    | 24 ++++++++++++++++++++++++
 drivers/nvme/target/fabrics-cmd.c | 11 +++++++++++
 drivers/nvme/target/nvmet.h       |  5 +++++
 4 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 19ae155..20e7f08 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -346,8 +346,8 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	/* we support multiple ports, multiples hosts and ANA: */
 	id->cmic = (1 << 0) | (1 << 1) | (1 << 3);
 
-	/* no limit on data transfer sizes for now */
-	id->mdts = 0;
+	/* Limit MDTS for metadata capable controllers (assume mpsmin is 4k) */
+	id->mdts = ctrl->pi_support ? ilog2(NVMET_T10_PI_MAX_KB_SZ >> 2) : 0;
 	id->cntlid = cpu_to_le16(ctrl->cntlid);
 	id->ver = cpu_to_le32(ctrl->subsys->ver);
 
@@ -405,9 +405,13 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 
 	strlcpy(id->subnqn, ctrl->subsys->subsysnqn, sizeof(id->subnqn));
 
-	/* Max command capsule size is sqe + single page of in-capsule data */
+	/*
+	 * Max command capsule size is sqe + single page of in-capsule data.
+	 * Disable inline data for Metadata capable controllers.
+	 */
 	id->ioccsz = cpu_to_le32((sizeof(struct nvme_command) +
-				  req->port->inline_data_size) / 16);
+				  req->port->inline_data_size *
+				  !ctrl->pi_support) / 16);
 	/* Max response capsule size is cqe */
 	id->iorcsz = cpu_to_le32(sizeof(struct nvme_completion) / 16);
 
@@ -437,6 +441,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 
 static void nvmet_execute_identify_ns(struct nvmet_req *req)
 {
+	struct nvmet_ctrl *ctrl = req->sq->ctrl;
 	struct nvmet_ns *ns;
 	struct nvme_id_ns *id;
 	u16 status = 0;
@@ -493,6 +498,16 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	id->lbaf[0].ds = ns->blksize_shift;
 
+	if (ctrl->pi_support && nvmet_ns_has_pi(ns)) {
+		id->dpc = NVME_NS_DPC_PI_FIRST | NVME_NS_DPC_PI_LAST |
+			  NVME_NS_DPC_PI_TYPE1 | NVME_NS_DPC_PI_TYPE2 |
+			  NVME_NS_DPC_PI_TYPE3;
+		id->mc = NVME_NS_MC_META_EXT;
+		id->dps = ns->prot_type;
+		id->flbas = NVME_NS_FLBAS_META_EXT;
+		id->lbaf[0].ms = ns->ms;
+	}
+
 	if (ns->readonly)
 		id->nsattr |= (1 << 0);
 	nvmet_put_namespace(ns);
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a4..316e3c9 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -862,10 +862,34 @@ static ssize_t nvmet_subsys_attr_serial_store(struct config_item *item,
 }
 CONFIGFS_ATTR(nvmet_subsys_, attr_serial);
 
+static ssize_t nvmet_subsys_attr_pi_enable_show(struct config_item *item,
+						char *page)
+{
+	return snprintf(page, PAGE_SIZE, "%d\n", to_subsys(item)->pi_support);
+}
+
+static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
+						 const char *page, size_t count)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+	bool pi_enable;
+
+	if (strtobool(page, &pi_enable))
+		return -EINVAL;
+
+	down_write(&nvmet_config_sem);
+	subsys->pi_support = pi_enable;
+	up_write(&nvmet_config_sem);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
+
 static struct configfs_attribute *nvmet_subsys_attrs[] = {
 	&nvmet_subsys_attr_attr_allow_any_host,
 	&nvmet_subsys_attr_attr_version,
 	&nvmet_subsys_attr_attr_serial,
+	&nvmet_subsys_attr_attr_pi_enable,
 	NULL,
 };
 
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index ee41b2b..5a99878 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -192,6 +192,17 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 		goto out;
 	}
 
+	if (ctrl->subsys->pi_support) {
+		if (ctrl->port->pi_capable) {
+			ctrl->pi_support = true;
+			pr_info("controller %d T10-PI enabled\n", ctrl->cntlid);
+		} else {
+			ctrl->pi_support = false;
+			pr_warn("T10-PI is not supported on controller %d\n",
+				ctrl->cntlid);
+		}
+	}
+
 	uuid_copy(&ctrl->hostid, &d->hostid);
 
 	status = nvmet_install_queue(ctrl, req);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 574ee55..3d102d6 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -144,6 +144,7 @@ struct nvmet_port {
 	bool				enabled;
 	int				inline_data_size;
 	const struct nvmet_fabrics_ops	*tr_ops;
+	bool				pi_capable;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -203,6 +204,7 @@ struct nvmet_ctrl {
 	spinlock_t		error_lock;
 	u64			err_counter;
 	struct nvme_error_slot	slots[NVMET_ERROR_LOG_SLOTS];
+	bool			pi_support;
 };
 
 struct nvmet_subsys {
@@ -225,6 +227,7 @@ struct nvmet_subsys {
 	u64			ver;
 	u64			serial;
 	char			*subsysnqn;
+	bool			pi_support;
 
 	struct config_group	group;
 
@@ -511,6 +514,8 @@ static inline bool nvmet_ns_has_pi(struct nvmet_ns *ns)
 	return ns->prot_type && ns->ms == sizeof(struct t10_pi_tuple);
 }
 
+#define NVMET_T10_PI_MAX_KB_SZ 128
+
 static inline u32 nvmet_dsm_len(struct nvmet_req *req)
 {
 	return (le32_to_cpu(req->cmd->dsm.nr) + 1) *
-- 
1.8.3.1



* [PATCH 14/15] nvmet: Add metadata support for block devices
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (13 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 13/15] nvmet: Add metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 16:20 ` [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
  15 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

Create a block IO request for the metadata from the protection SG list.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/io-cmd-bdev.c | 78 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 2099c82..10d8fd0 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -163,6 +163,52 @@ static void nvmet_bio_done(struct bio *bio)
 		bio_put(bio);
 }
 
+static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
+				struct sg_mapping_iter *miter)
+{
+	struct blk_integrity *bi;
+	struct bio_integrity_payload *bip;
+	struct block_device *bdev = req->ns->bdev;
+	int rc;
+	size_t resid, len;
+
+	bi = bdev_get_integrity(bdev);
+	if (unlikely(!bi)) {
+		pr_err("Unable to locate bio_integrity\n");
+		return -ENODEV;
+	}
+
+	bip = bio_integrity_alloc(bio, GFP_NOIO,
+			min_t(unsigned int, req->prot_sg_cnt, BIO_MAX_PAGES));
+	if (IS_ERR(bip)) {
+		pr_err("Unable to allocate bio_integrity_payload\n");
+		return PTR_ERR(bip);
+	}
+
+	bip->bip_flags |= BIP_NOMAP_INTEGRITY;
+	bip->bip_iter.bi_size = bio_integrity_bytes(bi, bio_sectors(bio));
+	bip_set_seed(bip, bio->bi_iter.bi_sector);
+
+	resid = bip->bip_iter.bi_size;
+	while (resid > 0 && sg_miter_next(miter)) {
+		len = min_t(size_t, miter->length, resid);
+		rc = bio_integrity_add_page(bio, miter->page, len,
+					    offset_in_page(miter->addr));
+		if (unlikely(rc != len)) {
+			pr_err("bio_integrity_add_page() failed; %d\n", rc);
+			sg_miter_stop(miter);
+			return -ENOMEM;
+		}
+
+		resid -= len;
+		if (len < miter->length)
+			miter->consumed -= miter->length - len;
+	}
+	sg_miter_stop(miter);
+
+	return 0;
+}
+
 static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 {
 	int sg_cnt = req->sg_cnt;
@@ -170,9 +216,11 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	struct scatterlist *sg;
 	struct blk_plug plug;
 	sector_t sector;
-	int op, i;
+	int op, i, rc;
+	struct sg_mapping_iter prot_miter;
 
-	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
+	if (!nvmet_check_transfer_len(req,
+				      nvmet_rw_data_len(req) + req->prot_len))
 		return;
 
 	if (!req->sg_cnt) {
@@ -207,11 +255,25 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	bio->bi_opf = op;
 
 	blk_start_plug(&plug);
+	if (req->is_protected)
+		sg_miter_start(&prot_miter, req->prot_sg, req->prot_sg_cnt,
+			       op == REQ_OP_READ ? SG_MITER_FROM_SG :
+						   SG_MITER_TO_SG);
+
 	for_each_sg(req->sg, sg, req->sg_cnt, i) {
 		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
 				!= sg->length) {
 			struct bio *prev = bio;
 
+			if (req->is_protected) {
+				rc = nvmet_bdev_alloc_bip(req, bio,
+							  &prot_miter);
+				if (unlikely(rc)) {
+					bio_io_error(bio);
+					return;
+				}
+			}
+
 			bio = bio_alloc(GFP_KERNEL, min(sg_cnt, BIO_MAX_PAGES));
 			bio_set_dev(bio, req->ns->bdev);
 			bio->bi_iter.bi_sector = sector;
@@ -225,6 +287,14 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 		sg_cnt--;
 	}
 
+	if (req->is_protected) {
+		rc = nvmet_bdev_alloc_bip(req, bio, &prot_miter);
+		if (unlikely(rc)) {
+			bio_io_error(bio);
+			return;
+		}
+	}
+
 	submit_bio(bio);
 	blk_finish_plug(&plug);
 }
@@ -352,6 +422,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_read:
 	case nvme_cmd_write:
 		req->execute = nvmet_bdev_execute_rw;
+		if (req->sq->ctrl->pi_support && nvmet_ns_has_pi(req->ns)) {
+			req->is_protected = true;
+			req->prot_len = nvmet_rw_prot_len(req);
+		}
 		return 0;
 	case nvme_cmd_flush:
 		req->execute = nvmet_bdev_execute_flush;
-- 
1.8.3.1



* [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
                   ` (14 preceding siblings ...)
  2019-11-05 16:20 ` [PATCH 14/15] nvmet: Add metadata support for block devices Max Gurtovoy
@ 2019-11-05 16:20 ` Max Gurtovoy
  2019-11-05 18:02   ` Christoph Hellwig
  2019-11-12 18:34   ` Sagi Grimberg
  15 siblings, 2 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-05 16:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi
  Cc: vladimirk, idanb, israelr, shlomin, oren, maxg

From: Israel Rukshin <israelr@mellanox.com>

For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
protection information passthrough and validation for NVMe over RDMA
transport. Metadata support was implemented over the new RDMA signature
verbs API. Add pi_enable module parameter to avoid allocating unnecessary
resources when this feature is not in use.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/target/rdma.c | 221 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 204 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 37d262a..9c61b50 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -54,6 +54,7 @@ struct nvmet_rdma_rsp {
 	struct nvmet_rdma_queue	*queue;
 
 	struct ib_cqe		read_cqe;
+	struct ib_cqe		write_cqe;
 	struct rdma_rw_ctx	rw;
 
 	struct nvmet_req	req;
@@ -112,12 +113,17 @@ struct nvmet_rdma_device {
 	struct list_head	entry;
 	int			inline_data_size;
 	int			inline_page_count;
+	bool			pi_capable;
 };
 
 static bool nvmet_rdma_use_srq;
 module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
 MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
 
+static bool nvmet_rdma_pi_enable;
+module_param_named(pi_enable, nvmet_rdma_pi_enable, bool, 0444);
+MODULE_PARM_DESC(pi_enable, "Enable metadata (T10-PI) support");
+
 static DEFINE_IDA(nvmet_rdma_queue_ida);
 static LIST_HEAD(nvmet_rdma_queue_list);
 static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
@@ -129,6 +135,7 @@ struct nvmet_rdma_device {
 static void nvmet_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc);
+static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_qp_event(struct ib_event *event, void *priv);
 static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue);
 static void nvmet_rdma_free_rsp(struct nvmet_rdma_device *ndev,
@@ -386,6 +393,9 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
 
 	/* Data In / RDMA READ */
 	r->read_cqe.done = nvmet_rdma_read_data_done;
+	/* Data Out / RDMA WRITE */
+	r->write_cqe.done = nvmet_rdma_write_data_done;
+
 	return 0;
 
 out_free_rsp:
@@ -495,6 +505,126 @@ static void nvmet_rdma_process_wr_wait_list(struct nvmet_rdma_queue *queue)
 	spin_unlock(&queue->rsp_wr_wait_lock);
 }
 
+static u16 nvmet_rdma_check_pi_status(struct ib_mr *sig_mr)
+{
+	struct ib_mr_status mr_status;
+	int ret;
+	u16 status = 0;
+
+	ret = ib_check_mr_status(sig_mr, IB_MR_CHECK_SIG_STATUS, &mr_status);
+	if (ret) {
+		pr_err("ib_check_mr_status failed, ret %d\n", ret);
+		return NVME_SC_INVALID_PI;
+	}
+
+	if (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS) {
+		switch (mr_status.sig_err.err_type) {
+		case IB_SIG_BAD_GUARD:
+			status = NVME_SC_GUARD_CHECK;
+			break;
+		case IB_SIG_BAD_REFTAG:
+			status = NVME_SC_REFTAG_CHECK;
+			break;
+		case IB_SIG_BAD_APPTAG:
+			status = NVME_SC_APPTAG_CHECK;
+			break;
+		}
+		pr_err("PI error found type %d expected 0x%x vs actual 0x%x\n",
+		       mr_status.sig_err.err_type,
+		       mr_status.sig_err.expected,
+		       mr_status.sig_err.actual);
+	}
+
+	return status;
+}
+
+static void nvmet_rdma_set_diff_domain(struct nvme_command *cmd,
+		struct ib_sig_domain *domain, struct nvmet_ns *ns)
+{
+	struct blk_integrity *bi = bdev_get_integrity(ns->bdev);
+
+	WARN_ON(bi == NULL);
+
+	domain->sig_type = IB_SIG_TYPE_T10_DIF;
+	domain->sig.dif.bg_type = IB_T10DIF_CRC;
+	domain->sig.dif.pi_interval = 1 << bi->interval_exp;
+	domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
+
+	/*
+	 * At the moment we hard code those, but in the future
+	 * we will take them from cmd.
+	 */
+	domain->sig.dif.apptag_check_mask = 0xffff;
+	domain->sig.dif.app_escape = true;
+	domain->sig.dif.ref_escape = true;
+	if (ns->prot_type != NVME_NS_DPS_PI_TYPE3)
+		domain->sig.dif.ref_remap = true;
+}
+
+static void nvmet_rdma_set_sig_attrs(struct nvmet_req *req,
+				     struct ib_sig_attrs *sig_attrs)
+{
+	struct nvme_command *cmd = req->cmd;
+	u16 control = le16_to_cpu(cmd->rw.control);
+
+	memset(sig_attrs, 0, sizeof(*sig_attrs));
+
+	if (control & NVME_RW_PRINFO_PRACT) {
+		/* for WRITE_INSERT/READ_STRIP no wire domain */
+		sig_attrs->wire.sig_type = IB_SIG_TYPE_NONE;
+		nvmet_rdma_set_diff_domain(cmd, &sig_attrs->mem, req->ns);
+		/* Clear the PRACT bit since HCA will generate/verify the PI */
+		control &= ~NVME_RW_PRINFO_PRACT;
+		cmd->rw.control = cpu_to_le16(control);
+		/* PI is added by the HW */
+		req->transfer_len += req->prot_len;
+	} else {
+		/* for WRITE_PASS/READ_PASS both wire/memory domains exist */
+		nvmet_rdma_set_diff_domain(cmd, &sig_attrs->wire, req->ns);
+		nvmet_rdma_set_diff_domain(cmd, &sig_attrs->mem, req->ns);
+	}
+
+	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_REF)
+		sig_attrs->check_mask |= IB_SIG_CHECK_REFTAG;
+	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_GUARD)
+		sig_attrs->check_mask |= IB_SIG_CHECK_GUARD;
+	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_APP)
+		sig_attrs->check_mask |= IB_SIG_CHECK_APPTAG;
+}
+
+static int nvmet_rdma_rw_ctx_init(struct nvmet_rdma_rsp *rsp, u64 addr, u32 key,
+				  struct ib_sig_attrs *sig_attrs)
+{
+	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
+	struct nvmet_req *req = &rsp->req;
+	int ret;
+
+	if (req->is_protected)
+		ret = rdma_rw_ctx_signature_init(&rsp->rw, cm_id->qp,
+			cm_id->port_num, req->sg, req->sg_cnt, req->prot_sg,
+			req->prot_sg_cnt, sig_attrs, addr, key,
+			nvmet_data_dir(req));
+	else
+		ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
+				       req->sg, req->sg_cnt, 0, addr, key,
+				       nvmet_data_dir(req));
+
+	return ret;
+}
+
+static void nvmet_rdma_rw_ctx_destroy(struct nvmet_rdma_rsp *rsp)
+{
+	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
+	struct nvmet_req *req = &rsp->req;
+
+	if (req->is_protected)
+		rdma_rw_ctx_destroy_signature(&rsp->rw, cm_id->qp,
+			cm_id->port_num, req->sg, req->sg_cnt, req->prot_sg,
+			req->prot_sg_cnt, nvmet_data_dir(req));
+	else
+		rdma_rw_ctx_destroy(&rsp->rw, cm_id->qp, cm_id->port_num,
+				    req->sg, req->sg_cnt, nvmet_data_dir(req));
+}
 
 static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 {
@@ -502,11 +632,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	atomic_add(1 + rsp->n_rdma, &queue->sq_wr_avail);
 
-	if (rsp->n_rdma) {
-		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
-	}
+	if (rsp->n_rdma)
+		nvmet_rdma_rw_ctx_destroy(rsp);
 
 	if (rsp->req.sg != rsp->cmd->inline_sg)
 		nvmet_req_free_sgl(&rsp->req);
@@ -561,11 +688,16 @@ static void nvmet_rdma_queue_response(struct nvmet_req *req)
 		rsp->send_wr.opcode = IB_WR_SEND;
 	}
 
-	if (nvmet_rdma_need_data_out(rsp))
-		first_wr = rdma_rw_ctx_wrs(&rsp->rw, cm_id->qp,
-				cm_id->port_num, NULL, &rsp->send_wr);
-	else
+	if (nvmet_rdma_need_data_out(rsp)) {
+		if (rsp->req.is_protected)
+			first_wr = rdma_rw_ctx_wrs(&rsp->rw, cm_id->qp,
+					cm_id->port_num, &rsp->write_cqe, NULL);
+		else
+			first_wr = rdma_rw_ctx_wrs(&rsp->rw, cm_id->qp,
+					cm_id->port_num, NULL, &rsp->send_wr);
+	} else {
 		first_wr = &rsp->send_wr;
+	}
 
 	nvmet_rdma_post_recv(rsp->queue->dev, rsp->cmd);
 
@@ -584,15 +716,14 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct nvmet_rdma_rsp *rsp =
 		container_of(wc->wr_cqe, struct nvmet_rdma_rsp, read_cqe);
 	struct nvmet_rdma_queue *queue = cq->cq_context;
+	u16 status = 0;
 
 	WARN_ON(rsp->n_rdma <= 0);
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
-	rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-			queue->cm_id->port_num, rsp->req.sg,
-			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
 	rsp->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		nvmet_rdma_rw_ctx_destroy(rsp);
 		nvmet_req_uninit(&rsp->req);
 		nvmet_rdma_release_rsp(rsp);
 		if (wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -603,7 +734,55 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
-	rsp->req.execute(&rsp->req);
+	if (rsp->req.is_protected)
+		status = nvmet_rdma_check_pi_status(rsp->rw.reg->mr);
+	nvmet_rdma_rw_ctx_destroy(rsp);
+
+	if (unlikely(status))
+		nvmet_req_complete(&rsp->req, status);
+	else
+		rsp->req.execute(&rsp->req);
+}
+
+static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct nvmet_rdma_rsp *rsp =
+		container_of(wc->wr_cqe, struct nvmet_rdma_rsp, write_cqe);
+	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
+	u16 status;
+
+	WARN_ON(rsp->n_rdma <= 0);
+	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
+	rsp->n_rdma = 0;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		nvmet_rdma_rw_ctx_destroy(rsp);
+		nvmet_req_uninit(&rsp->req);
+		nvmet_rdma_release_rsp(rsp);
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			pr_info("RDMA WRITE for CQE 0x%p failed with status %s (%d).\n",
+				wc->wr_cqe, ib_wc_status_msg(wc->status),
+				wc->status);
+			nvmet_rdma_error_comp(queue);
+		}
+		return;
+	}
+
+	/*
+	 * Upon RDMA completion check the signature status
+	 * - if succeeded send good NVMe response
+	 * - if failed send bad NVMe response with appropriate error
+	 */
+	status = nvmet_rdma_check_pi_status(rsp->rw.reg->mr);
+	if (unlikely(status))
+		rsp->req.cqe->status = cpu_to_le16(status << 1);
+	nvmet_rdma_rw_ctx_destroy(rsp);
+
+	if (unlikely(ib_post_send(cm_id->qp, &rsp->send_wr, NULL))) {
+		pr_err("sending cmd response failed\n");
+		nvmet_rdma_release_rsp(rsp);
+	}
 }
 
 static void nvmet_rdma_use_inline_sg(struct nvmet_rdma_rsp *rsp, u32 len,
@@ -660,9 +839,9 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
 static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 		struct nvme_keyed_sgl_desc *sgl, bool invalidate)
 {
-	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
 	u64 addr = le64_to_cpu(sgl->addr);
 	u32 key = get_unaligned_le32(sgl->key);
+	struct ib_sig_attrs sig_attrs;
 	int ret;
 
 	rsp->req.transfer_len = get_unaligned_le24(sgl->length);
@@ -671,13 +850,14 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!rsp->req.transfer_len)
 		return 0;
 
+	if (rsp->req.is_protected)
+		nvmet_rdma_set_sig_attrs(&rsp->req, &sig_attrs);
+
 	ret = nvmet_req_alloc_sgl(&rsp->req);
 	if (unlikely(ret < 0))
 		goto error_out;
 
-	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
-			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req));
+	ret = nvmet_rdma_rw_ctx_init(rsp, addr, key, &sig_attrs);
 	if (unlikely(ret < 0))
 		goto error_out;
 	rsp->n_rdma += ret;
@@ -956,6 +1136,10 @@ static void nvmet_rdma_free_dev(struct kref *ref)
 			goto out_free_pd;
 	}
 
+	ndev->pi_capable = ndev->device->attrs.device_cap_flags &
+			IB_DEVICE_INTEGRITY_HANDOVER ? true : false;
+	port->pi_capable = ndev->pi_capable && nvmet_rdma_pi_enable;
+
 	list_add(&ndev->entry, &device_list);
 out_unlock:
 	mutex_unlock(&device_list_mutex);
@@ -1020,6 +1204,9 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 		qp_attr.cap.max_recv_sge = 1 + ndev->inline_page_count;
 	}
 
+	if (ndev->pi_capable && nvmet_rdma_pi_enable && queue->host_qid)
+		qp_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
+
 	ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
 	if (ret) {
 		pr_err("failed to create_qp ret= %d\n", ret);
-- 
1.8.3.1



* Re: [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation
  2019-11-05 16:20 ` [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation Max Gurtovoy
@ 2019-11-05 17:49   ` Christoph Hellwig
  2019-11-12 17:52     ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:49 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

On Tue, Nov 05, 2019 at 06:20:13PM +0200, Max Gurtovoy wrote:
> An extended LBA is a larger LBA that is created when metadata associated
> with the LBA is transferred contiguously with the LBA data (AKA
> interleaved). The metadata may be either transferred as part of the LBA
> (creating an extended LBA) or it may be transferred as a separate
> contiguous buffer of data. According to the NVMeoF spec, a fabrics ctrl
> supports only an Extended LBA format. Fail revalidation in case we have a
> spec violation. Also init the integrity profile for the block device for
> fabrics ctrl.

I don't think the subject describes very well what this patch does,
as it really wires up parsing and checking the ext flag for fabrics.

> +	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) {
> +		if ((ns->ctrl->opts && ns->ctrl->opts->pi_enable) || !ns->ext)
> +			nvme_init_integrity(disk, ns->ms, ns->pi_type);
> +	}

Can we just have a flag that says we have working metadata support instead
of having to duplicate the checks all over?

>  	ns->noiob = le16_to_cpu(id->noiob);
>  	ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms);
>  	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
> +
> +	/*
> +	 * For Fabrics, only metadata as part of extended data LBA is supported.
> +	 * fail in case of a spec violation.
> +	 */
> +	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_FABRICS)) {
> +		if (WARN_ON_ONCE(!ns->ext))
> +			return -EINVAL;
> +	}
> +
>  	/* the PI implementation requires metadata equal t10 pi tuple size */
>  	if (ns->ms == sizeof(struct t10_pi_tuple))

This is getting a little convoluted.  I think this should rely on the
fact that we zero the whole namespace structure and clean a lot of the
mess up.  Also I think we should reject non-external metadata on PCIe
as we can't really properly support it right here.  Also instead of
using NVME_F_FABRICS I'd rather split NVME_F_METADATA_SUPPORTED into
a flag each for external or inline metadata supported.  That also means
your patch 1 can be moved much later in series where it belongs.

	if (ns->ms) {
		ns->ext = id->flbas & NVME_NS_FLBAS_META_EXT;
		if (ns->ext) {
			if (!(ns->ctrl->ops->flags & NVME_F_EXTENDED_LBA))
				return -EINVAL;
		} else {
			if (!(ns->ctrl->ops->flags & NVME_F_METADATA_P))
				return -EINVAL;
		}
		ns->metadata_supported = true;

		/*
		 * The PI implementation requires metadata equal t10 pi tuple
		 * size:
		 */
		 if (ns->ms == sizeof(struct t10_pi_tuple))
		 	ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
	}

> - out_put_disk:
> +out_put_disk:
>  	put_disk(ns->disk);
> - out_unlink_ns:
> +out_free_disk:
> +	del_gendisk(ns->disk);
> +out_unlink_ns:
>  	mutex_lock(&ctrl->subsys->lock);
>  	list_del_rcu(&ns->siblings);
>  	mutex_unlock(&ctrl->subsys->lock);
>  	nvme_put_ns_head(ns->head);
> - out_free_id:
> +out_free_id:
>  	kfree(id);
> - out_free_queue:
> +out_free_queue:
>  	blk_cleanup_queue(ns->queue);
> - out_free_ns:
> +out_free_ns:

No real need to reformat this.


* Re: [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute
  2019-11-05 16:20 ` [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute Max Gurtovoy
@ 2019-11-05 17:49   ` Christoph Hellwig
  2019-11-12 17:53   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:49 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

> +	/*
> +	 * NVMe PCI driver doesn't support Extended LBA format so using 1
> +	 * integrity segment should be enough for a separate contiguous
> +	 * buffer of metadata.
> +	 */
> +	dev->ctrl.max_integrity_segments = 1;

Not just should, but as-is we simply can't support anything but a single
segment.


* Re: [PATCH 04/15] nvme: Inline nvme_ns_has_pi function
  2019-11-05 16:20 ` [PATCH 04/15] nvme: Inline nvme_ns_has_pi function Max Gurtovoy
@ 2019-11-05 17:50   ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:50 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

On Tue, Nov 05, 2019 at 06:20:15PM +0200, Max Gurtovoy wrote:
> This function will be used by transports that support metadata/T10-PI
> mechanism.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  drivers/nvme/host/core.c | 6 ------
>  drivers/nvme/host/nvme.h | 6 ++++++
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9ec8322..6f51471 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -18,7 +18,6 @@
>  #include <linux/pr.h>
>  #include <linux/ptrace.h>
>  #include <linux/nvme_ioctl.h>
> -#include <linux/t10-pi.h>
>  #include <linux/pm_qos.h>
>  #include <asm/unaligned.h>
>  
> @@ -194,11 +193,6 @@ static int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
>  	return ret;
>  }
>  
> -static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
> -{
> -	return ns->pi_type && ns->ms == sizeof(struct t10_pi_tuple);
> -}

I'd rather add a flag.  Maybe we can just add a features field in
struct nvme_ns, and then set one bit for metadata supported, one for
extended LBAs, and one for PI supported.
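
For instance, something along these lines (just a sketch -- the field
and flag names below are made up, not from this series):

	enum nvme_ns_features {
		NVME_NS_METADATA_SUPPORTED	= 1 << 0, /* ctrl can pass metadata */
		NVME_NS_EXT_LBAS		= 1 << 1, /* metadata is interleaved */
		NVME_NS_PI_SUPPORTED		= 1 << 2, /* T10-PI usable on this ns */
	};

	/* assuming struct nvme_ns grows an "unsigned long features" member
	 * set once in __nvme_revalidate_disk(), the helper collapses to a
	 * single test:
	 */
	static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
	{
		return ns->features & NVME_NS_PI_SUPPORTED;
	}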

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-05 16:20 ` [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag Max Gurtovoy
@ 2019-11-05 17:52   ` Christoph Hellwig
  2019-11-07  2:43   ` Martin K. Petersen
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:52 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren,
	martin.petersen, kbusch, hch

Please also send this to Martin.  In fact, adding him to the whole
series might help you to get some good feedback.

On Tue, Nov 05, 2019 at 06:20:17PM +0200, Max Gurtovoy wrote:
> From: Israel Rukshin <israelr@mellanox.com>
> 
> It means that reftag shouldn't be remapped. This flag will be used in
> case some other layer did the reftag remapping (e.g. in NVMe/RDMA the
> initiator controller performs the remapping so target side shouldn't
> map it again).

> 
> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>  block/t10-pi.c      | 7 +++++++
>  include/linux/bio.h | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/block/t10-pi.c b/block/t10-pi.c
> index f4907d9..6de380b 100644
> --- a/block/t10-pi.c
> +++ b/block/t10-pi.c
> @@ -144,6 +144,9 @@ static void t10_pi_type1_prepare(struct request *rq)
>  		/* Already remapped? */
>  		if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
>  			break;
> +		/* No need to remap */
> +		if (bip->bip_flags & BIP_NOMAP_INTEGRITY)
> +			break;
>  
>  		bip_for_each_vec(iv, bip, iter) {
>  			void *p, *pmap;
> @@ -193,6 +196,10 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
>  		struct bio_vec iv;
>  		struct bvec_iter iter;
>  
> +		/* No need to remap */
> +		if (bip->bip_flags & BIP_NOMAP_INTEGRITY)
> +			break;
> +
>  		bip_for_each_vec(iv, bip, iter) {
>  			void *p, *pmap;
>  			unsigned int j;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 3cdb84c..2ba09e6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -299,6 +299,7 @@ enum bip_flags {
>  	BIP_CTRL_NOCHECK	= 1 << 2, /* disable HBA integrity checking */
>  	BIP_DISK_NOCHECK	= 1 << 3, /* disable disk integrity checking */
>  	BIP_IP_CHECKSUM		= 1 << 4, /* IP checksum */
> +	BIP_NOMAP_INTEGRITY	= 1 << 5, /* ref tag shouldn't be remapped */
>  };
>  
>  /*
> -- 
> 1.8.3.1
---end quoted text---


* Re: [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 ` [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 17:58   ` Christoph Hellwig
  2019-11-20 10:41     ` Max Gurtovoy
  2019-11-12 18:22   ` Sagi Grimberg
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

On Tue, Nov 05, 2019 at 06:20:16PM +0200, Max Gurtovoy wrote:
> For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
> protection information passthrough and validation for NVMe over RDMA
> transport. Metadata offload support was implemented over the new RDMA
> signature verbs API and it is enabled per controller by using nvme-cli.
> 
> usage example:
> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
> 
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
> ---
>  drivers/nvme/host/rdma.c | 346 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 298 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 05f2dfa..16263b8 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -48,6 +48,12 @@ struct nvme_rdma_qe {
>  	u64			dma;
>  };
>  
> +struct nvme_rdma_sgl {
> +	int			nents;
> +	struct sg_table		sg_table;
> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];

I think this needs some rework.  Kill the first_sgl pointer, and then
just set the pointer in the table to the address found by pointer
arithmetic behind the nvme_request in the allocation.

We also have an open todo item to not actually allocate SG_CHUNK_SIZE
entries, but a much smaller value like in SCSI.

Also the whole switch to use struct sg_table should be a separate prep
patch.
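
Roughly what that could look like in .init_request (a sketch only;
"data_sgl" stands in for whatever the new nvme_rdma_sgl member ends up
being called, and the extra space comes from the tag set's cmd_size):

	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

	/* SGL storage sits directly behind the request pdu instead of a
	 * fixed first_sgl[SG_CHUNK_SIZE] array inside the structure
	 */
	req->data_sgl.sg_table.sgl = (struct scatterlist *)(req + 1);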

> -	struct sg_table		sg_table;
> -	struct scatterlist	first_sgl[];
> +	/* T10-PI support */
> +	bool			is_protected;

This is a bit of an odd variable name.  Why not use_pi or something like
that?

Also all the new code should only be built if CONFIG_BLK_DEV_INTEGRITY
is set.
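
As one example, the connect-time opt-in could be refused outright when
integrity isn't compiled in (sketch only, opts->pi_enable is the option
added earlier in the series):

	if (opts->pi_enable && !IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) {
		pr_err("T10-PI support requires CONFIG_BLK_DEV_INTEGRITY\n");
		opts->pi_enable = false;
	}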


* Re: [PATCH 08/15] nvmet: Add metadata characteristics for a namespace
  2019-11-05 16:20 ` [PATCH 08/15] nvmet: Add metadata characteristics for a namespace Max Gurtovoy
@ 2019-11-05 17:59   ` Christoph Hellwig
  2019-11-12 18:38     ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:59 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

On Tue, Nov 05, 2019 at 06:20:19PM +0200, Max Gurtovoy wrote:
> +	ns->prot_type = 0;
> +	bi = bdev_get_integrity(ns->bdev);
> +	if (bi) {
> +		ns->ms = bi->tuple_size;
> +		if (bi->profile == &t10_pi_type1_crc) {
> +			ns->prot_type = NVME_NS_DPS_PI_TYPE1;
> +		} else if (bi->profile == &t10_pi_type3_crc) {
> +			ns->prot_type = NVME_NS_DPS_PI_TYPE3;
> +		} else {
> +			pr_err("block device %s: unsupported metadata type\n",
> +			       ns->device_path);
> +			return -EINVAL;
> +		}
> +	} else {
> +		ns->ms = 0;
> +	}

I don't think we should report these fields unless the transport
actually supports metadata and PI.


* Re: [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len
  2019-11-05 16:20 ` [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len Max Gurtovoy
@ 2019-11-05 17:59   ` Christoph Hellwig
  2019-11-12 18:39   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 17:59 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len
  2019-11-05 16:20 ` [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len Max Gurtovoy
@ 2019-11-05 18:00   ` Christoph Hellwig
  2019-11-12 18:43   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 18:00 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

On Tue, Nov 05, 2019 at 06:20:21PM +0200, Max Gurtovoy wrote:
> From: Israel Rukshin <israelr@mellanox.com>
> 
> The function doesn't check only the data length, because in some cases
> the transfer length also includes the metadata length. This is
> preparation for adding metadata (T10-PI) support.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi
  2019-11-05 16:20 ` [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi Max Gurtovoy
@ 2019-11-05 18:00   ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 18:00 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

Looks fine, but should be folded into the main patch.


* Re: [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 ` [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
@ 2019-11-05 18:02   ` Christoph Hellwig
  2019-11-07 13:43     ` Max Gurtovoy
  2019-11-12 18:34   ` Sagi Grimberg
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2019-11-05 18:02 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch

> +static bool nvmet_rdma_pi_enable;
> +module_param_named(pi_enable, nvmet_rdma_pi_enable, bool, 0444);
> +MODULE_PARM_DESC(pi_enable, "Enable metadata (T10-PI) support");

This needs to be a configfs attribute, not a module parameter.
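
A per-port attribute in the style of the existing nvmet port params
could look roughly like this (a sketch; the attribute and the
port->pi_enable field are illustrative, not part of this series, and it
would also need an entry in the port's configfs_attribute list):

static ssize_t nvmet_param_pi_enable_show(struct config_item *item,
		char *page)
{
	struct nvmet_port *port = to_nvmet_port(item);

	return snprintf(page, PAGE_SIZE, "%d\n", port->pi_enable);
}

static ssize_t nvmet_param_pi_enable_store(struct config_item *item,
		const char *page, size_t count)
{
	struct nvmet_port *port = to_nvmet_port(item);
	bool val;

	if (strtobool(page, &val))
		return -EINVAL;
	port->pi_enable = val;	/* consumed when the port is enabled */
	return count;
}
CONFIGFS_ATTR(nvmet_, param_pi_enable);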

> +	/* Data Out / RDMA WRITE */
> +	r->write_cqe.done = nvmet_rdma_write_data_done;
> +

We should only do that for actual PI enabled controllers.  Also please
use IS_ENABLED or ifdef to only compile this code if we actually support
PI in the kernel build.
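
For example (sketch only; the exact condition would depend on how the
configfs knob ends up looking):

	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && nvmet_rdma_pi_enable)
		r->write_cqe.done = nvmet_rdma_write_data_done;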


* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-05 16:20 ` [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd Max Gurtovoy
@ 2019-11-07  2:36   ` Martin K. Petersen
  2019-11-07 12:02     ` Max Gurtovoy
  0 siblings, 1 reply; 52+ messages in thread
From: Martin K. Petersen @ 2019-11-07  2:36 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, shlomin, israelr, linux-nvme, idanb, oren, kbusch, hch


Hi Max,

> Added 'pi_enable' to 'connect' command so users can enable metadata
> support.
>
> usage examples:
> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
> nvme connect -p -t rdma -a 10.0.1.1 -n test_nvme

Why does it have to be explicitly enabled? I can understand why you'd
want the capability to disable PI in case you encounter a bad target.
But having to opt-in is different from how the rest of the block
integrity code works. In SCSI, if HBA and target are capable, both DIX
and PI enabled unless you explicitly disable them.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-05 16:20 ` [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag Max Gurtovoy
  2019-11-05 17:52   ` Christoph Hellwig
@ 2019-11-07  2:43   ` Martin K. Petersen
  2019-11-07 13:29     ` Max Gurtovoy
  1 sibling, 1 reply; 52+ messages in thread
From: Martin K. Petersen @ 2019-11-07  2:43 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, shlomin, israelr, linux-nvme, idanb, oren, kbusch, hch


Max,

> It means that reftag shouldn't be remapped. This flag will be used in
> case some other layer did the reftag remapping (e.g. in NVMe/RDMA the
> initiator controller performs the remapping so target side shouldn't
> map it again).

This is what BIP_MAPPED_INTEGRITY was meant for. Why do you need a
second flag?

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-07  2:36   ` Martin K. Petersen
@ 2019-11-07 12:02     ` Max Gurtovoy
  2019-11-09  1:55       ` Martin K. Petersen
  0 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-07 12:02 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


On 11/7/2019 4:36 AM, Martin K. Petersen wrote:
> Hi Max,

Hi Martin,

>
>> Added 'pi_enable' to 'connect' command so users can enable metadata
>> support.
>>
>> usage examples:
>> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
>> nvme connect -p -t rdma -a 10.0.1.1 -n test_nvme
> Why does it have to be explicitly enabled? I can understand why you'd
> want the capability to disable PI in case you encounter a bad target.
> But having to opt-in is different from how the rest of the block
> integrity code works. In SCSI, if HBA and target are capable, both DIX
> and PI enabled unless you explicitly disable them.

In the SCSI/iSER solution (the only fabrics solution in Linux) there
is a module param to enable PI on the initiator side.

I guess it was implemented this way since people don't want to mess
with the iscsi daemon.

So iSER will create signature resources for every capable device and
connection, without giving the user a way to distinguish between
controllers that need PI and controllers that don't.

We don't have a format command in fabrics, so this is the best option
we could think of for giving users flexibility.

I spoke with Christoph about the module param possibility and we both 
agreed it's not the way to go.

Let me know if you have an idea that would give users this flexibility.

>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-07  2:43   ` Martin K. Petersen
@ 2019-11-07 13:29     ` Max Gurtovoy
  2019-11-09  2:10       ` Martin K. Petersen
  0 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-07 13:29 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


On 11/7/2019 4:43 AM, Martin K. Petersen wrote:
> Max,
>
>> It means that reftag shouldn't be remapped. This flag will be used in
>> case some other layer did the reftag remapping (e.g. in NVMe/RDMA the
>> initiator controller performs the remapping so target side shouldn't
>> map it again).
> This is what BIP_MAPPED_INTEGRITY was meant for. Why do you need a
> second flag?

In the fabrics solution the initiator side is the SW controller of the
namespace and the target is effectively the HW controller.

The SW controller does the reftag remapping in the prepare/complete
functions, and it should be done only once.

The HW controller should write/read the data from the media as is (and 
maybe verify/add it).

When you're using an NVMe disk as the bdev for the target namespace,
prepare/complete will also be called on the "target" side, and we
don't want double remapping to happen.

In the complete_fn there is no option to skip the remapping...

So we added a flag for it.

>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-05 18:02   ` Christoph Hellwig
@ 2019-11-07 13:43     ` Max Gurtovoy
  0 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-07 13:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch


On 11/5/2019 8:02 PM, Christoph Hellwig wrote:
>> +static bool nvmet_rdma_pi_enable;
>> +module_param_named(pi_enable, nvmet_rdma_pi_enable, bool, 0444);
>> +MODULE_PARM_DESC(pi_enable, "Enable metadata (T10-PI) support");
> This needs to be a configfs attribute, not a module parameter.

Ok, we'll move it to be a per-port configuration.
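
Just to sketch the direction (hypothetical, untested code; port->pi_enable
would be a new field in struct nvmet_port):

static ssize_t nvmet_param_pi_enable_show(struct config_item *item,
		char *page)
{
	struct nvmet_port *port = to_nvmet_port(item);

	return snprintf(page, PAGE_SIZE, "%d\n", port->pi_enable);
}

static ssize_t nvmet_param_pi_enable_store(struct config_item *item,
		const char *page, size_t count)
{
	struct nvmet_port *port = to_nvmet_port(item);
	bool val;

	if (strtobool(page, &val))
		return -EINVAL;

	if (port->enabled) {
		pr_err("Disable port before setting pi_enable value.\n");
		return -EACCES;
	}

	port->pi_enable = val;
	return count;
}
CONFIGFS_ATTR(nvmet_, param_pi_enable);

Usage would then be along the lines of (attribute name hypothetical):
echo 1 > /config/nvmet/ports/${PORT}/param_pi_enable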

>
>> +	/* Data Out / RDMA WRITE */
>> +	r->write_cqe.done = nvmet_rdma_write_data_done;
>> +
> We should only do that for actual PI enabled controllers.  Also please

Are you talking about the assignment? I guess we can add an "if" check.

> use IS_ENABLED or ifdef to only compile this code if we actually support
> PI in the kernel build.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-07 12:02     ` Max Gurtovoy
@ 2019-11-09  1:55       ` Martin K. Petersen
  2019-11-10  9:25         ` Max Gurtovoy
  0 siblings, 1 reply; 52+ messages in thread
From: Martin K. Petersen @ 2019-11-09  1:55 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren,
	Martin K. Petersen, kbusch, hch


Max,

> So iSER will create a signature resources for every capable device and
> connection without giving the user the possibility to distinguish
> between needed PI controllers and un-needed PI controllers.
>
> We don't have a format command in the fabrics so this is the best
> option we can think of for adding flexibility to users.
>
> I spoke with Christoph about the module param possibility and we both
> agreed it's not the way to go.
>
> Let me know if you have an idea that will enable flexibility to users.

The way it works in SCSI is that if a user wants to enable PI, they do
so when provisioning the device: either via a format for disk drives or
by selecting an option while creating a LUN in a storage array
management interface.

There are some knobs that can be twiddled on the initiator side to
enable/disable PI and/or DIX but these are meant for test or buggy
device workaround purposes. Not as a means for the user to select or
deselect the feature.

The user can decide on a per-device basis whether the block layer should
generate PI on write and verify on read. Those are the "proper" policy
knobs on the initiator side.
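
(For reference, those per-device policy knobs are the block layer
integrity sysfs attributes, along the lines of:

  cat /sys/block/sda/integrity/device_is_integrity_capable
  echo 1 > /sys/block/sda/integrity/write_generate
  echo 1 > /sys/block/sda/integrity/read_verify

with "sda" standing in for whatever device is being configured.)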

When an I/O request is received by the SCSI stack, we inspect it to
determine whether we need to allocate one or two scatterlists. If the
request has a bip attached, we'll allocate a separate scatterlist which
is then used to map the protection information before the I/O is
submitted to the device driver.
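
Roughly this pattern (illustrative fragment only; sdb/prot_sdb stand in
for the driver's data and protection descriptors):

	/* the data scatterlist is always set up */
	sdb->nents = blk_rq_map_sg(rq->q, rq, sdb->table.sgl);

	/* only requests that carry a bip get a second, protection sglist */
	if (blk_integrity_rq(rq)) {
		int ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);

		if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
					   prot_sdb->table.sgl, SG_CHUNK_SIZE))
			return BLK_STS_RESOURCE;

		prot_sdb->nents = blk_rq_map_integrity_sg(rq->q, rq->bio,
							  prot_sdb->table.sgl);
	}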

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-07 13:29     ` Max Gurtovoy
@ 2019-11-09  2:10       ` Martin K. Petersen
  2019-11-12 10:40         ` Max Gurtovoy
  0 siblings, 1 reply; 52+ messages in thread
From: Martin K. Petersen @ 2019-11-09  2:10 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren,
	Martin K. Petersen, kbusch, hch


Max,

> In the fabric solution the Initiator side is the SW controller of the
> namespace and the target is kinda HW controller.
>
> The SW controller is doing the reftag remapping during
> prepare/complete functions and it should be done only once.
>
> The HW controller should write/read the data from the media as is (and
> maybe verify/add it).
>
> When you're using an NVMe disk as the bdev for the target namespace,
> then prepare/complete will be called also in the "target" side and we
> don't want double remapping to happen.

In a typical SCSI setup you'll have remapping several times as the file
block offset visible to an application is different from the LBA on the
LUN. And the LUN LBA is different from the LBA of the drive physically
storing the data block inside the disk array.

To me it sounds like your 1:1 mapping between host LBA and target
back-end LBA is just a special case. What happens in your scenario if
the target bdev is a partition and not a full namespace?

> In the complete_fn there is no option to skip the remapping...

Only because there hasn't been a need. Feel free to add a
BIP_MAPPED_INTEGRITY check to complete_fn.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-09  1:55       ` Martin K. Petersen
@ 2019-11-10  9:25         ` Max Gurtovoy
  2019-11-12 17:47           ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-10  9:25 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


On 11/9/2019 3:55 AM, Martin K. Petersen wrote:
> Max,

Martin,


>
>> So iSER will create a signature resources for every capable device and
>> connection without giving the user the possibility to distinguish
>> between needed PI controllers and un-needed PI controllers.
>>
>> We don't have a format command in the fabrics so this is the best
>> option we can think of for adding flexibility to users.
>>
>> I spoke with Christoph about the module param possibility and we both
>> agreed it's not the way to go.
>>
>> Let me know if you have an idea that will enable flexibility to users.
> The way it works in SCSI is that if a user wants to enable PI, they do
> so when provision the device. So either a format for disk drives or by
> selecting an option while creating a LUN in a storage array management
> interface.
>
> There are some knobs that can be twiddled on the initiator side to
> enable/disable PI and/or DIX but these are meant for test or buggy
> device workaround purposes. Not as a means for the user to select or
> deselect the feature.
>
> The user can decide on a per-device basis whether the block layer should
> generate PI on write and verify on read. Those are the "proper" policy
> knobs on the initiator side.

The block layer option to enable/disable DIX is there, but there is no 
way to disable DIF (wire PI check and generation).

Fabrics storage doesn't support formatting a LUN/namespace, so we need
to give users flexibility.

Why should we enforce the use of E2E PI for all controllers if the user
is not interested (because of a lack of resources, for example)?

Maybe the application has some kind of QoS, e.g. a protected namespace
is ClassA (expensive storage) and a non-protected namespace is ClassB
(cheap storage), and both use the NVMe/RDMA initiator...

For example, we do the same for the TOS feature.

>
> When an I/O request is received by the SCSI stack, we inspect it to
> determine whether we need to allocate one or two scatterlists. If the
> request has a bip attached, we'll allocate a separate scatterlist which
> is then used to map the protection information before the I/O is
> submitted to the device driver.

For an NVMe/RDMA controller we must allocate a pool of PI MRs in the 
constructor function in order to support E2E PI (and it's expensive).

We can't do it in the queue_rq function. In queue_rq we get the needed
MR according to the namespace format from a pre-allocated pool (a PI MR
for a PI namespace and a regular MR for a regular namespace).
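
(Roughly what the rdma patch in this series does per I/O:

	if (req->is_protected)
		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
	else
		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
	if (WARN_ON_ONCE(!req->mr))
		return -EAGAIN;

so the expensive part is sizing the sig_mrs pool up front.)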

We can go with the module parameter approach but it will be very limited 
for users and wasteful.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag
  2019-11-09  2:10       ` Martin K. Petersen
@ 2019-11-12 10:40         ` Max Gurtovoy
  0 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-12 10:40 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


On 11/9/2019 4:10 AM, Martin K. Petersen wrote:
> Max,
>
>> In the fabric solution the Initiator side is the SW controller of the
>> namespace and the target is kinda HW controller.
>>
>> The SW controller is doing the reftag remapping during
>> prepare/complete functions and it should be done only once.
>>
>> The HW controller should write/read the data from the media as is (and
>> maybe verify/add it).
>>
>> When you're using an NVMe disk as the bdev for the target namespace,
>> then prepare/complete will be called also in the "target" side and we
>> don't want double remapping to happen.
> In a typical SCSI setup you'll have remapping several times as the file
> block offset visible to an application is different from the LBA on the
> LUN. And the LUN LBA is different from the LBA of the drive physically
> storing the data block inside the disk array.
>
> To me it sounds like your 1:1 mapping between host LBA and target
> back-end LBA is just a special case. What happens in your scenario if
> the target bdev is a partition and not a full namespace?

We'll add partition tests to our suite and make sure they work.


>
>> In the complete_fn there is no option to skip the remapping...
> Only because there hasn't been a need. Feel free to add a
> BIP_MAPPED_INTEGRITY check to complete_fn.

Sure.

thanks



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-10  9:25         ` Max Gurtovoy
@ 2019-11-12 17:47           ` Sagi Grimberg
  2019-11-12 18:14             ` James Smart
  0 siblings, 1 reply; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 17:47 UTC (permalink / raw)
  To: Max Gurtovoy, Martin K. Petersen
  Cc: vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


>> When an I/O request is received by the SCSI stack, we inspect it to
>> determine whether we need to allocate one or two scatterlists. If the
>> request has a bip attached, we'll allocate a separate scatterlist which
>> is then used to map the protection information before the I/O is
>> submitted to the device driver.
> 
> For an NVMe/RDMA controller we must allocate a pool of PI MRs in the 
> constructor function in order to support E2E PI (and it's expensive).
> 
> We can't do it in queue_rq function. In queue_rq we get the needed MR 
> according to the namespace format from a pre-allocated pool (PI_MR for 
> PI_namespace and REGULAR_MR for REGULAR_namespace).
> 
> We can go with the module parameter approach but it will be very limited 
> for users and wasteful.

Yep, this is a fundamental difference for the RDMA devices that support
integrity. You effectively need substantially more resources to track
all the inflight contexts.

We could in theory lazily allocate the special resources, but the
extra queue-pair sizing is not something that you can change at runtime...

I'm fine with explicitly enabling PI for controllers that we know
support PI.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support
  2019-11-05 16:20 ` [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support Max Gurtovoy
@ 2019-11-12 17:48   ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 17:48 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin

> From: Israel Rukshin <israelr@mellanox.com>
> 
> Preparation for adding metadata (T10-PI) over fabric support. This will
> allow end-to-end protection information passthrough and validation for
> NVMe over Fabric.

I think that this patch needs to come after the host supports PI.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation
  2019-11-05 17:49   ` Christoph Hellwig
@ 2019-11-12 17:52     ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 17:52 UTC (permalink / raw)
  To: Christoph Hellwig, Max Gurtovoy
  Cc: vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch


>> +	if (ns->ms && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) {
>> +		if ((ns->ctrl->opts && ns->ctrl->opts->pi_enable) || !ns->ext)
>> +			nvme_init_integrity(disk, ns->ms, ns->pi_type);
>> +	}
> 
> Can we just have a flag that says we have working metadata support instead
> of having to duplicate the checks all over?

Agree with the comment.
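
Something like a single predicate could capture it (sketch only, name
made up):

static inline bool nvme_ns_metadata_usable(struct nvme_ns *ns)
{
	if (!ns->ms || !(ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
		return false;
	return !ns->ext || (ns->ctrl->opts && ns->ctrl->opts->pi_enable);
}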

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute
  2019-11-05 16:20 ` [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute Max Gurtovoy
  2019-11-05 17:49   ` Christoph Hellwig
@ 2019-11-12 17:53   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 17:53 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin

Other than the comment given,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-12 17:47           ` Sagi Grimberg
@ 2019-11-12 18:14             ` James Smart
  2019-11-12 18:23               ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: James Smart @ 2019-11-12 18:14 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, Martin K. Petersen
  Cc: vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


On 11/12/2019 9:47 AM, Sagi Grimberg wrote:
>
>>> When an I/O request is received by the SCSI stack, we inspect it to
>>> determine whether we need to allocate one or two scatterlists. If the
>>> request has a bip attached, we'll allocate a separate scatterlist which
>>> is then used to map the protection information before the I/O is
>>> submitted to the device driver.
>>
>> For an NVMe/RDMA controller we must allocate a pool of PI MRs in the 
>> constructor function in order to support E2E PI (and it's expensive).
>>
>> We can't do it in queue_rq function. In queue_rq we get the needed MR 
>> according to the namespace format from a pre-allocated pool (PI_MR 
>> for PI_namespace and REGULAR_MR for REGULAR_namespace).
>>
>> We can go with the module parameter approach but it will be very 
>> limited for users and wasteful.
>
> Yep, this is a fundamental difference for the RDMA devices that supports
> integrity.. You effectively need substantially more resources to track
> all the inflight contexts.
>
> We could in theory lazily allocate the special resources, but the 
> queue-pair extra sizing is not something that you can change in 
> runtime...
>
> I'm fine with explicitly enabling PI for controllers that we know that
> support PI.

Ok - but we're still going to want the ability to dynamically recognize 
support in the transport and in the controller and turn it on even 
without the user-space flag.

-- james


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 ` [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support Max Gurtovoy
  2019-11-05 17:58   ` Christoph Hellwig
@ 2019-11-12 18:22   ` Sagi Grimberg
  2019-11-13 14:35     ` Max Gurtovoy
  1 sibling, 1 reply; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:22 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin



On 11/5/19 8:20 AM, Max Gurtovoy wrote:
> For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
> protection information passthrough and validation for NVMe over RDMA
> transport. Metadata offload support was implemented over the new RDMA
> signature verbs API and it is enabled per controller by using nvme-cli.
> 
> usage example:
> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
> 
> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
> ---
>   drivers/nvme/host/rdma.c | 346 ++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 298 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 05f2dfa..16263b8 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -48,6 +48,12 @@ struct nvme_rdma_qe {
>   	u64			dma;
>   };
>   
> +struct nvme_rdma_sgl {
> +	int			nents;
> +	struct sg_table		sg_table;
> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
> +};

How about we dynamically allocate this in nvme_rdma_init_request (with
a prep patch for it)? Both for PI and data, this is getting quite
large...
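
Something along these lines, assuming the sgl structs hold pointers
instead of embedded arrays (hypothetical sketch):

static int nvme_rdma_init_request(struct blk_mq_tag_set *set,
		struct request *rq, unsigned int hctx_idx,
		unsigned int numa_node)
{
	struct nvme_rdma_ctrl *ctrl = set->driver_data;
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

	req->data_sgl.first_sgl = kcalloc_node(SG_CHUNK_SIZE,
			sizeof(struct scatterlist), GFP_KERNEL, numa_node);
	if (!req->data_sgl.first_sgl)
		return -ENOMEM;

	if (ctrl->pi_support) {
		req->pi_sgl->first_sgl = kcalloc_node(SG_CHUNK_SIZE,
				sizeof(struct scatterlist), GFP_KERNEL,
				numa_node);
		if (!req->pi_sgl->first_sgl) {
			kfree(req->data_sgl.first_sgl);
			return -ENOMEM;
		}
	}

	/* ... the existing sqe setup would stay as it is today ... */
	return 0;
}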

> +
>   struct nvme_rdma_queue;
>   struct nvme_rdma_request {
>   	struct nvme_request	req;
> @@ -58,12 +64,14 @@ struct nvme_rdma_request {
>   	refcount_t		ref;
>   	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
>   	u32			num_sge;
> -	int			nents;
>   	struct ib_reg_wr	reg_wr;
>   	struct ib_cqe		reg_cqe;
>   	struct nvme_rdma_queue  *queue;
> -	struct sg_table		sg_table;
> -	struct scatterlist	first_sgl[];
> +	/* T10-PI support */
> +	bool			is_protected;
> +
> +	struct nvme_rdma_sgl	data_sgl;
> +	struct nvme_rdma_sgl	pi_sgl[];
>   };
>   
>   enum nvme_rdma_queue_flags {
> @@ -85,6 +93,7 @@ struct nvme_rdma_queue {
>   	struct rdma_cm_id	*cm_id;
>   	int			cm_error;
>   	struct completion	cm_done;
> +	bool			pi_support;

Any specific reason for the queue-local pi_support?

>   };
>   
>   struct nvme_rdma_ctrl {
> @@ -112,6 +121,7 @@ struct nvme_rdma_ctrl {
>   	struct nvme_ctrl	ctrl;
>   	bool			use_inline_data;
>   	u32			io_queues[HCTX_MAX_TYPES];
> +	bool			pi_support;
>   };
>   
>   static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
> @@ -269,6 +279,8 @@ static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
>   	init_attr.qp_type = IB_QPT_RC;
>   	init_attr.send_cq = queue->ib_cq;
>   	init_attr.recv_cq = queue->ib_cq;
> +	if (queue->pi_support)
> +		init_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
>   
>   	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
>   
> @@ -408,6 +420,8 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
>   	dev = queue->device;
>   	ibdev = dev->dev;
>   
> +	if (queue->pi_support)
> +		ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
>   	ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>   
>   	/*
> @@ -424,10 +438,14 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
>   	nvme_rdma_dev_put(dev);
>   }
>   
> -static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev)
> +static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev, bool pi_support)
>   {
> -	return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
> -		     ibdev->attrs.max_fast_reg_page_list_len - 1);
> +	if (pi_support)
> +		return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
> +			     ibdev->attrs.max_pi_fast_reg_page_list_len - 1);
> +	else
> +		return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
> +			     ibdev->attrs.max_fast_reg_page_list_len - 1);
>   }

Maybe just pick the max_page_list_len based on the condition and take
the min once.
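
E.g. (untested):

static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev, bool pi_support)
{
	u32 max_page_list_len;

	if (pi_support)
		max_page_list_len = ibdev->attrs.max_pi_fast_reg_page_list_len;
	else
		max_page_list_len = ibdev->attrs.max_fast_reg_page_list_len;

	return min_t(u32, NVME_RDMA_MAX_SEGMENTS, max_page_list_len - 1);
}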

>   
>   static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
> @@ -484,7 +502,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>   	 * misaligned we'll end up using two entries for a single data page,
>   	 * so one additional entry is required.
>   	 */
> -	pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev) + 1;
> +	pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
>   	ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
>   			      queue->queue_size,
>   			      IB_MR_TYPE_MEM_REG,
> @@ -496,10 +514,24 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>   		goto out_destroy_ring;
>   	}
>   
> +	if (queue->pi_support) {
> +		ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
> +				      queue->queue_size, IB_MR_TYPE_INTEGRITY,
> +				      pages_per_mr, pages_per_mr);
> +		if (ret) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"failed to initialize PI MR pool sized %d for QID %d\n",
> +				queue->queue_size, idx);
> +			goto out_destroy_mr_pool;
> +		}
> +	}
> +
>   	set_bit(NVME_RDMA_Q_TR_READY, &queue->flags);
>   
>   	return 0;
>   
> +out_destroy_mr_pool:
> +	ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>   out_destroy_ring:
>   	nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
>   			    sizeof(struct nvme_completion), DMA_FROM_DEVICE);
> @@ -521,6 +553,7 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
>   
>   	queue = &ctrl->queues[idx];
>   	queue->ctrl = ctrl;
> +	queue->pi_support = idx && ctrl->pi_support;

That's why...

>   	init_completion(&queue->cm_done);
>   
>   	if (idx > 0)
> @@ -730,8 +763,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>   		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
>   		set->reserved_tags = 2; /* connect + keep-alive */
>   		set->numa_node = nctrl->numa_node;
> -		set->cmd_size = sizeof(struct nvme_rdma_request) +
> -			SG_CHUNK_SIZE * sizeof(struct scatterlist);
> +		set->cmd_size = sizeof(struct nvme_rdma_request);
>   		set->driver_data = ctrl;
>   		set->nr_hw_queues = 1;
>   		set->timeout = ADMIN_TIMEOUT;
> @@ -745,7 +777,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>   		set->numa_node = nctrl->numa_node;
>   		set->flags = BLK_MQ_F_SHOULD_MERGE;
>   		set->cmd_size = sizeof(struct nvme_rdma_request) +
> -			SG_CHUNK_SIZE * sizeof(struct scatterlist);
> +			(ctrl->pi_support * sizeof(struct nvme_rdma_sgl));
>   		set->driver_data = ctrl;
>   		set->nr_hw_queues = nctrl->queue_count - 1;
>   		set->timeout = NVME_IO_TIMEOUT;
> @@ -787,7 +819,22 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   	ctrl->device = ctrl->queues[0].device;
>   	ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
>   
> -	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
> +	/* T10-PI support */
> +	if (ctrl->ctrl.opts->pi_enable) {
> +		if (!(ctrl->device->dev->attrs.device_cap_flags &
> +		      IB_DEVICE_INTEGRITY_HANDOVER)) {
> +			dev_warn(ctrl->ctrl.device,
> +				 "T10-PI requested but not supported on %s, continue without T10-PI\n",
> +				 ctrl->device->dev->name);
> +			ctrl->pi_support = false;
> +			ctrl->ctrl.opts->pi_enable = false;
> +		} else {
> +			ctrl->pi_support = true;
> +		}
> +	}
> +
> +	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev,
> +							ctrl->pi_support);
>   
>   	/*
>   	 * Bind the async event SQE DMA mapping to the admin queue lifetime.
> @@ -829,6 +876,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   
>   	ctrl->ctrl.max_segments = ctrl->max_fr_pages;
>   	ctrl->ctrl.max_hw_sectors = ctrl->max_fr_pages << (ilog2(SZ_4K) - 9);
> +	if (ctrl->pi_support)
> +		ctrl->ctrl.max_integrity_segments = ctrl->max_fr_pages;
>   
>   	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>   
> @@ -1154,13 +1203,24 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
>   	if (!blk_rq_nr_phys_segments(rq))
>   		return;
>   
> +	if (blk_integrity_rq(rq)) {
> +		ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
> +			req->pi_sgl->nents, rq_dma_dir(rq));
> +		sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
> +	}
> +
>   	if (req->mr) {
> -		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
> +		if (req->is_protected)
> +			ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
> +		else
> +			ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs,
> +				       req->mr);
>   		req->mr = NULL;
>   	}
>   
> -	ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
> -	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
> +	ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
> +			req->data_sgl.nents, rq_dma_dir(rq));
> +	sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>   }
>   
>   static int nvme_rdma_set_sg_null(struct nvme_command *c)
> @@ -1179,7 +1239,7 @@ static int nvme_rdma_map_sg_inline(struct nvme_rdma_queue *queue,
>   		int count)
>   {
>   	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> -	struct scatterlist *sgl = req->sg_table.sgl;
> +	struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
>   	struct ib_sge *sge = &req->sge[1];
>   	u32 len = 0;
>   	int i;
> @@ -1204,40 +1264,115 @@ static int nvme_rdma_map_sg_single(struct nvme_rdma_queue *queue,
>   {
>   	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>   
> -	sg->addr = cpu_to_le64(sg_dma_address(req->sg_table.sgl));
> -	put_unaligned_le24(sg_dma_len(req->sg_table.sgl), sg->length);
> +	sg->addr = cpu_to_le64(sg_dma_address(req->data_sgl.sg_table.sgl));
> +	put_unaligned_le24(sg_dma_len(req->data_sgl.sg_table.sgl), sg->length);
>   	put_unaligned_le32(queue->device->pd->unsafe_global_rkey, sg->key);
>   	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
>   	return 0;
>   }
>   
> -static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
> -		struct nvme_rdma_request *req, struct nvme_command *c,
> -		int count)
> +static void nvme_rdma_set_diff_domain(struct nvme_command *cmd, struct bio *bio,
> +		struct ib_sig_domain *domain, struct request *rq)
>   {
> -	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
> -	int nr;
> +	struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
> +	struct nvme_ns *ns = rq->rq_disk->private_data;
>   
> -	req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
> -	if (WARN_ON_ONCE(!req->mr))
> -		return -EAGAIN;
> +	WARN_ON(bi == NULL);
> +
> +	domain->sig_type = IB_SIG_TYPE_T10_DIF;
> +	domain->sig.dif.bg_type = IB_T10DIF_CRC;
> +	domain->sig.dif.pi_interval = 1 << bi->interval_exp;
> +	domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
>   
>   	/*
> -	 * Align the MR to a 4K page size to match the ctrl page size and
> -	 * the block virtual boundary.
> +	 * At the moment we hard code those, but in the future
> +	 * we will take them from cmd.
>   	 */
> -	nr = ib_map_mr_sg(req->mr, req->sg_table.sgl, count, NULL, SZ_4K);
> -	if (unlikely(nr < count)) {
> -		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
> -		req->mr = NULL;
> -		if (nr < 0)
> -			return nr;
> -		return -EINVAL;
> +	domain->sig.dif.apptag_check_mask = 0xffff;
> +	domain->sig.dif.app_escape = true;
> +	domain->sig.dif.ref_escape = true;
> +	if (ns->pi_type != NVME_NS_DPS_PI_TYPE3)
> +		domain->sig.dif.ref_remap = true;
> +}
> +
> +static void nvme_rdma_set_sig_attrs(struct bio *bio,
> +		struct ib_sig_attrs *sig_attrs, struct nvme_command *c,
> +		struct request *rq)
> +{
> +	u16 control = le16_to_cpu(c->rw.control);
> +
> +	memset(sig_attrs, 0, sizeof(*sig_attrs));
> +
> +	if (control & NVME_RW_PRINFO_PRACT) {
> +		/* for WRITE_INSERT/READ_STRIP no memory domain */
> +		sig_attrs->mem.sig_type = IB_SIG_TYPE_NONE;
> +		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
> +		/* Clear the PRACT bit since HCA will generate/verify the PI */
> +		control &= ~NVME_RW_PRINFO_PRACT;
> +		c->rw.control = cpu_to_le16(control);
> +	} else {
> +		/* for WRITE_PASS/READ_PASS both wire/memory domains exist */
> +		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
> +		nvme_rdma_set_diff_domain(c, bio, &sig_attrs->mem, rq);
>   	}
> +}
> +
> +static void nvme_rdma_set_prot_checks(struct nvme_command *cmd, u8 *mask)
> +{
> +	*mask = 0;
> +	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_REF)
> +		*mask |= IB_SIG_CHECK_REFTAG;
> +	if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_GUARD)
> +		*mask |= IB_SIG_CHECK_GUARD;
> +}
> +
> +static void nvme_rdma_sig_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	if (unlikely(wc->status != IB_WC_SUCCESS))
> +		nvme_rdma_wr_error(cq, wc, "SIG");
> +}
> +
> +static void nvme_rdma_set_pi_wr(struct nvme_rdma_request *req,
> +				struct nvme_command *c)
> +{
> +	struct ib_sig_attrs *sig_attrs = req->mr->sig_attrs;
> +	struct ib_reg_wr *wr = &req->reg_wr;
> +	struct request *rq = blk_mq_rq_from_pdu(req);
> +	struct bio *bio = rq->bio;
> +	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
> +
> +	nvme_rdma_set_sig_attrs(bio, sig_attrs, c, rq);
> +
> +	nvme_rdma_set_prot_checks(c, &sig_attrs->check_mask);
>   
>   	ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>   
> +	req->reg_cqe.done = nvme_rdma_sig_done;
> +
> +	memset(wr, 0, sizeof(*wr));
> +	wr->wr.opcode = IB_WR_REG_MR_INTEGRITY;
> +	wr->wr.wr_cqe = &req->reg_cqe;
> +	wr->wr.num_sge = 0;
> +	wr->wr.send_flags = 0;
> +	wr->mr = req->mr;
> +	wr->key = req->mr->rkey;
> +	wr->access = IB_ACCESS_LOCAL_WRITE |
> +		     IB_ACCESS_REMOTE_READ |
> +		     IB_ACCESS_REMOTE_WRITE;
> +
> +	sg->addr = cpu_to_le64(req->mr->iova);
> +	put_unaligned_le24(req->mr->length, sg->length);
> +	put_unaligned_le32(req->mr->rkey, sg->key);
> +	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_INVALIDATE;
> +}
> +
> +static void nvme_rdma_set_reg_wr(struct nvme_rdma_request *req,
> +				 struct nvme_command *c)
> +{
> +	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
> +
>   	req->reg_cqe.done = nvme_rdma_memreg_done;
> +
>   	memset(&req->reg_wr, 0, sizeof(req->reg_wr));
>   	req->reg_wr.wr.opcode = IB_WR_REG_MR;
>   	req->reg_wr.wr.wr_cqe = &req->reg_cqe;
> @@ -1253,8 +1388,52 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>   	put_unaligned_le32(req->mr->rkey, sg->key);
>   	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
>   			NVME_SGL_FMT_INVALIDATE;
> +}
> +
> +static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
> +		struct nvme_rdma_request *req, struct nvme_command *c,
> +		int count, int pi_count)
> +{
> +	struct nvme_rdma_sgl *sgl = &req->data_sgl;
> +	int nr;
> +
> +	if (req->is_protected) {
> +		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
> +		if (WARN_ON_ONCE(!req->mr))
> +			return -EAGAIN;
> +
> +		nr = ib_map_mr_sg_pi(req->mr, sgl->sg_table.sgl, count, 0,
> +				     req->pi_sgl->sg_table.sgl, pi_count, 0,
> +				     SZ_4K);
> +		if (unlikely(nr))
> +			goto mr_put;
> +
> +		nvme_rdma_set_pi_wr(req, c);
> +	} else {
> +		req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
> +		if (WARN_ON_ONCE(!req->mr))
> +			return -EAGAIN;
> +		/*
> +		 * Align the MR to a 4K page size to match the ctrl page size
> +		 * and the block virtual boundary.
> +		 */
> +		nr = ib_map_mr_sg(req->mr, sgl->sg_table.sgl, count, 0, SZ_4K);
> +		if (unlikely(nr < count))
> +			goto mr_put;
> +
> +		nvme_rdma_set_reg_wr(req, c);
> +	}
>   
>   	return 0;
> +mr_put:
> +	if (req->is_protected)
> +		ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
> +	else
> +		ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
> +	req->mr = NULL;
> +	if (nr < 0)
> +		return nr;
> +	return -EINVAL;
>   }
>   
>   static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
> @@ -1263,6 +1442,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>   	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>   	struct nvme_rdma_device *dev = queue->device;
>   	struct ib_device *ibdev = dev->dev;
> +	int pi_count = 0;
>   	int count, ret;
>   
>   	req->num_sge = 1;
> @@ -1273,23 +1453,44 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>   	if (!blk_rq_nr_phys_segments(rq))
>   		return nvme_rdma_set_sg_null(c);
>   
> -	req->sg_table.sgl = req->first_sgl;
> -	ret = sg_alloc_table_chained(&req->sg_table,
> -			blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
> -			SG_CHUNK_SIZE);
> +	req->data_sgl.sg_table.sgl = req->data_sgl.first_sgl;
> +	ret = sg_alloc_table_chained(&req->data_sgl.sg_table,
> +			blk_rq_nr_phys_segments(rq),
> +			req->data_sgl.sg_table.sgl, SG_CHUNK_SIZE);
>   	if (ret)
>   		return -ENOMEM;
>   
> -	req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl);
> +	req->data_sgl.nents = blk_rq_map_sg(rq->q, rq,
> +					    req->data_sgl.sg_table.sgl);
>   
> -	count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents,
> -			      rq_dma_dir(rq));
> +	count = ib_dma_map_sg(ibdev, req->data_sgl.sg_table.sgl,
> +			      req->data_sgl.nents, rq_dma_dir(rq));
>   	if (unlikely(count <= 0)) {
>   		ret = -EIO;
>   		goto out_free_table;
>   	}
>   
> -	if (count <= dev->num_inline_segments) {
> +	if (blk_integrity_rq(rq)) {
> +		req->pi_sgl->sg_table.sgl = req->pi_sgl->first_sgl;
> +		ret = sg_alloc_table_chained(&req->pi_sgl->sg_table,
> +				blk_rq_count_integrity_sg(rq->q, rq->bio),
> +				req->pi_sgl->sg_table.sgl, SG_CHUNK_SIZE);
> +		if (unlikely(ret)) {
> +			ret = -ENOMEM;
> +			goto out_unmap_sg;
> +		}
> +
> +		req->pi_sgl->nents = blk_rq_map_integrity_sg(rq->q, rq->bio,
> +					req->pi_sgl->sg_table.sgl);
> +		pi_count = ib_dma_map_sg(ibdev, req->pi_sgl->sg_table.sgl,
> +				req->pi_sgl->nents, rq_dma_dir(rq));
> +		if (unlikely(pi_count <= 0)) {
> +			ret = -EIO;
> +			goto out_free_pi_table;
> +		}
> +	}
> +
> +	if (count <= dev->num_inline_segments && !req->is_protected) {
>   		if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
>   		    queue->ctrl->use_inline_data &&
>   		    blk_rq_payload_bytes(rq) <=
> @@ -1304,17 +1505,25 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>   		}
>   	}
>   
> -	ret = nvme_rdma_map_sg_fr(queue, req, c, count);
> +	ret = nvme_rdma_map_sg_fr(queue, req, c, count, pi_count);
>   out:
>   	if (unlikely(ret))
> -		goto out_unmap_sg;
> +		goto out_unmap_pi_sg;
>   
>   	return 0;
>   
> +out_unmap_pi_sg:
> +	if (blk_integrity_rq(rq))
> +		ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
> +				req->pi_sgl->nents, rq_dma_dir(rq));
> +out_free_pi_table:
> +	if (blk_integrity_rq(rq))
> +		sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
>   out_unmap_sg:
> -	ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
> +	ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
> +			req->data_sgl.nents, rq_dma_dir(rq));
>   out_free_table:
> -	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
> +	sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>   	return ret;
>   }
>   
> @@ -1754,6 +1963,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>   
>   	blk_mq_start_request(rq);
>   
> +	req->is_protected = false;
> +	if (nvme_rdma_queue_idx(queue) && queue->pi_support) {
> +		if (c->common.opcode == nvme_cmd_write ||
> +		    c->common.opcode == nvme_cmd_read)
> +			req->is_protected = nvme_ns_has_pi(ns);
> +	}
> +

This check naturally belongs in nvme-core. Why not test
blk_integrity_rq()?
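
Something like (hypothetical alternative to checking the opcode
explicitly):

	req->is_protected = nvme_rdma_queue_idx(queue) && queue->pi_support &&
			    blk_integrity_rq(rq);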

Overall this looks good!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd
  2019-11-12 18:14             ` James Smart
@ 2019-11-12 18:23               ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:23 UTC (permalink / raw)
  To: James Smart, Max Gurtovoy, Martin K. Petersen
  Cc: vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch, hch


>>>> When an I/O request is received by the SCSI stack, we inspect it to
>>>> determine whether we need to allocate one or two scatterlists. If the
>>>> request has a bip attached, we'll allocate a separate scatterlist which
>>>> is then used to map the protection information before the I/O is
>>>> submitted to the device driver.
>>>
>>> For an NVMe/RDMA controller we must allocate a pool of PI MRs in the 
>>> constructor function in order to support E2E PI (and it's expensive).
>>>
>>> We can't do it in queue_rq function. In queue_rq we get the needed MR 
>>> according to the namespace format from a pre-allocated pool (PI_MR 
>>> for PI_namespace and REGULAR_MR for REGULAR_namespace).
>>>
>>> We can go with the module parameter approach but it will be very 
>>> limited for users and wasteful.
>>
>> Yep, this is a fundamental difference for the RDMA devices that supports
>> integrity.. You effectively need substantially more resources to track
>> all the inflight contexts.
>>
>> We could in theory lazily allocate the special resources, but the 
>> queue-pair extra sizing is not something that you can change in 
>> runtime...
>>
>> I'm fine with explicitly enabling PI for controllers that we know that
>> support PI.
> 
> Ok - but we're still going to want the ability to dynamically recognize 
> support in the transport and in the controller and turn it on even 
> without the user-space flag.

Sure, nothing says that fc can't turn it on based on detection.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-05 16:20 ` [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
  2019-11-05 18:02   ` Christoph Hellwig
@ 2019-11-12 18:34   ` Sagi Grimberg
  2019-11-13 13:56     ` Max Gurtovoy
  1 sibling, 1 reply; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:34 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin



> +	if (ndev->pi_capable && nvmet_rdma_pi_enable && queue->host_qid)
> +		qp_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
> +

Max,

It would really be much better if this came in the rdma_cm
private_data... Have you considered getting this through the TWG?

>   	ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
>   	if (ret) {
>   		pr_err("failed to create_qp ret= %d\n", ret);
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 08/15] nvmet: Add metadata characteristics for a namespace
  2019-11-05 17:59   ` Christoph Hellwig
@ 2019-11-12 18:38     ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:38 UTC (permalink / raw)
  To: Christoph Hellwig, Max Gurtovoy
  Cc: vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch


>> +	ns->prot_type = 0;
>> +	bi = bdev_get_integrity(ns->bdev);
>> +	if (bi) {
>> +		ns->ms = bi->tuple_size;
>> +		if (bi->profile == &t10_pi_type1_crc) {
>> +			ns->prot_type = NVME_NS_DPS_PI_TYPE1;
>> +		} else if (bi->profile == &t10_pi_type3_crc) {
>> +			ns->prot_type = NVME_NS_DPS_PI_TYPE3;
>> +		} else {
>> +			pr_err("block device %s: unsupported metadata type\n",
>> +			       ns->device_path);
>> +			return -EINVAL;
>> +		}
>> +	} else {
>> +		ns->ms = 0;
>> +	}
> 
> I don't think we should report these fields unless the transport
> actually supports metadata and PI.

Agreed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len
  2019-11-05 16:20 ` [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len Max Gurtovoy
  2019-11-05 17:59   ` Christoph Hellwig
@ 2019-11-12 18:39   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:39 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len
  2019-11-05 16:20 ` [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len Max Gurtovoy
  2019-11-05 18:00   ` Christoph Hellwig
@ 2019-11-12 18:43   ` Sagi Grimberg
  1 sibling, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-12 18:43 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-12 18:34   ` Sagi Grimberg
@ 2019-11-13 13:56     ` Max Gurtovoy
  2019-11-14 23:45       ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-13 13:56 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin


On 11/12/2019 8:34 PM, Sagi Grimberg wrote:
>
>
>> +    if (ndev->pi_capable && nvmet_rdma_pi_enable && queue->host_qid)
>> +        qp_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
>> +
>
> Max,
>
> It will really be much better if this comes in the rdma_cm
> private_data... Have you considered to get this trhough the TWG?

Can you explain what the TWG is?

In general, this sounds like a good idea.

I haven't considered it since it's not in the nvmf spec.

I guess we can add it to the specification in the future, to be
compatible with other nvmf implementations besides the Linux kernel.

>
>>       ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
>>       if (ret) {
>>           pr_err("failed to create_qp ret= %d\n", ret);
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-12 18:22   ` Sagi Grimberg
@ 2019-11-13 14:35     ` Max Gurtovoy
  2019-11-14 23:57       ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-13 14:35 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin


On 11/12/2019 8:22 PM, Sagi Grimberg wrote:
>
>
> On 11/5/19 8:20 AM, Max Gurtovoy wrote:
>> For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
>> protection information passthrough and validation for NVMe over RDMA
>> transport. Metadata offload support was implemented over the new RDMA
>> signature verbs API and it is enabled per controller by using nvme-cli.
>>
>> usage example:
>> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 
>> --nqn=test-nvme
>>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
>> ---
>>   drivers/nvme/host/rdma.c | 346 
>> ++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 298 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 05f2dfa..16263b8 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -48,6 +48,12 @@ struct nvme_rdma_qe {
>>       u64            dma;
>>   };
>>   +struct nvme_rdma_sgl {
>> +    int            nents;
>> +    struct sg_table        sg_table;
>> +    struct scatterlist    first_sgl[SG_CHUNK_SIZE];
>> +};
>
> How about we dynamically allocate this in nvme_rdma_init_request (with
> a prep patch for it)? both for pi and data, this is getting quite
> large...
>
>> +
>>   struct nvme_rdma_queue;
>>   struct nvme_rdma_request {
>>       struct nvme_request    req;
>> @@ -58,12 +64,14 @@ struct nvme_rdma_request {
>>       refcount_t        ref;
>>       struct ib_sge        sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
>>       u32            num_sge;
>> -    int            nents;
>>       struct ib_reg_wr    reg_wr;
>>       struct ib_cqe        reg_cqe;
>>       struct nvme_rdma_queue  *queue;
>> -    struct sg_table        sg_table;
>> -    struct scatterlist    first_sgl[];
>> +    /* T10-PI support */
>> +    bool            is_protected;
>> +
>> +    struct nvme_rdma_sgl    data_sgl;
>> +    struct nvme_rdma_sgl    pi_sgl[];
>>   };
>>     enum nvme_rdma_queue_flags {
>> @@ -85,6 +93,7 @@ struct nvme_rdma_queue {
>>       struct rdma_cm_id    *cm_id;
>>       int            cm_error;
>>       struct completion    cm_done;
>> +    bool            pi_support;
>
> Any specific reason for the queue local pi_support?
>
>>   };
>>     struct nvme_rdma_ctrl {
>> @@ -112,6 +121,7 @@ struct nvme_rdma_ctrl {
>>       struct nvme_ctrl    ctrl;
>>       bool            use_inline_data;
>>       u32            io_queues[HCTX_MAX_TYPES];
>> +    bool            pi_support;
>>   };
>>     static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct 
>> nvme_ctrl *ctrl)
>> @@ -269,6 +279,8 @@ static int nvme_rdma_create_qp(struct 
>> nvme_rdma_queue *queue, const int factor)
>>       init_attr.qp_type = IB_QPT_RC;
>>       init_attr.send_cq = queue->ib_cq;
>>       init_attr.recv_cq = queue->ib_cq;
>> +    if (queue->pi_support)
>> +        init_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
>>         ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
>>   @@ -408,6 +420,8 @@ static void nvme_rdma_destroy_queue_ib(struct 
>> nvme_rdma_queue *queue)
>>       dev = queue->device;
>>       ibdev = dev->dev;
>>   +    if (queue->pi_support)
>> +        ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
>>       ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>>         /*
>> @@ -424,10 +438,14 @@ static void nvme_rdma_destroy_queue_ib(struct 
>> nvme_rdma_queue *queue)
>>       nvme_rdma_dev_put(dev);
>>   }
>>   -static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev)
>> +static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev, bool 
>> pi_support)
>>   {
>> -    return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>> -             ibdev->attrs.max_fast_reg_page_list_len - 1);
>> +    if (pi_support)
>> +        return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>> +                 ibdev->attrs.max_pi_fast_reg_page_list_len - 1);
>> +    else
>> +        return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>> +                 ibdev->attrs.max_fast_reg_page_list_len - 1);
>>   }
>
> Maybe just take the max_page_list_len on the condition and take the 
> min once.
>
>>     static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>> @@ -484,7 +502,7 @@ static int nvme_rdma_create_queue_ib(struct 
>> nvme_rdma_queue *queue)
>>        * misaligned we'll end up using two entries for a single data 
>> page,
>>        * so one additional entry is required.
>>        */
>> -    pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev) + 1;
>> +    pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, 
>> queue->pi_support) + 1;
>>       ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
>>                     queue->queue_size,
>>                     IB_MR_TYPE_MEM_REG,
>> @@ -496,10 +514,24 @@ static int nvme_rdma_create_queue_ib(struct 
>> nvme_rdma_queue *queue)
>>           goto out_destroy_ring;
>>       }
>>   +    if (queue->pi_support) {
>> +        ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
>> +                      queue->queue_size, IB_MR_TYPE_INTEGRITY,
>> +                      pages_per_mr, pages_per_mr);
>> +        if (ret) {
>> +            dev_err(queue->ctrl->ctrl.device,
>> +                "failed to initialize PI MR pool sized %d for QID 
>> %d\n",
>> +                queue->queue_size, idx);
>> +            goto out_destroy_mr_pool;
>> +        }
>> +    }
>> +
>>       set_bit(NVME_RDMA_Q_TR_READY, &queue->flags);
>>         return 0;
>>   +out_destroy_mr_pool:
>> +    ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>>   out_destroy_ring:
>>       nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
>>                   sizeof(struct nvme_completion), DMA_FROM_DEVICE);
>> @@ -521,6 +553,7 @@ static int nvme_rdma_alloc_queue(struct 
>> nvme_rdma_ctrl *ctrl,
>>         queue = &ctrl->queues[idx];
>>       queue->ctrl = ctrl;
>> +    queue->pi_support = idx && ctrl->pi_support;
>
> Thats why..
>
>> init_completion(&queue->cm_done);
>>         if (idx > 0)
>> @@ -730,8 +763,7 @@ static struct blk_mq_tag_set 
>> *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>>           set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
>>           set->reserved_tags = 2; /* connect + keep-alive */
>>           set->numa_node = nctrl->numa_node;
>> -        set->cmd_size = sizeof(struct nvme_rdma_request) +
>> -            SG_CHUNK_SIZE * sizeof(struct scatterlist);
>> +        set->cmd_size = sizeof(struct nvme_rdma_request);
>>           set->driver_data = ctrl;
>>           set->nr_hw_queues = 1;
>>           set->timeout = ADMIN_TIMEOUT;
>> @@ -745,7 +777,7 @@ static struct blk_mq_tag_set 
>> *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>>           set->numa_node = nctrl->numa_node;
>>           set->flags = BLK_MQ_F_SHOULD_MERGE;
>>           set->cmd_size = sizeof(struct nvme_rdma_request) +
>> -            SG_CHUNK_SIZE * sizeof(struct scatterlist);
>> +            (ctrl->pi_support * sizeof(struct nvme_rdma_sgl));
>>           set->driver_data = ctrl;
>>           set->nr_hw_queues = nctrl->queue_count - 1;
>>           set->timeout = NVME_IO_TIMEOUT;
>> @@ -787,7 +819,22 @@ static int 
>> nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>       ctrl->device = ctrl->queues[0].device;
>>       ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
>>   -    ctrl->max_fr_pages = 
>> nvme_rdma_get_max_fr_pages(ctrl->device->dev);
>> +    /* T10-PI support */
>> +    if (ctrl->ctrl.opts->pi_enable) {
>> +        if (!(ctrl->device->dev->attrs.device_cap_flags &
>> +              IB_DEVICE_INTEGRITY_HANDOVER)) {
>> +            dev_warn(ctrl->ctrl.device,
>> +                 "T10-PI requested but not supported on %s, continue 
>> without T10-PI\n",
>> +                 ctrl->device->dev->name);
>> +            ctrl->pi_support = false;
>> +            ctrl->ctrl.opts->pi_enable = false;
>> +        } else {
>> +            ctrl->pi_support = true;
>> +        }
>> +    }
>> +
>> +    ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev,
>> +                            ctrl->pi_support);
>>         /*
>>        * Bind the async event SQE DMA mapping to the admin queue 
>> lifetime.
>> @@ -829,6 +876,8 @@ static int nvme_rdma_configure_admin_queue(struct 
>> nvme_rdma_ctrl *ctrl,
>>         ctrl->ctrl.max_segments = ctrl->max_fr_pages;
>>       ctrl->ctrl.max_hw_sectors = ctrl->max_fr_pages << (ilog2(SZ_4K) 
>> - 9);
>> +    if (ctrl->pi_support)
>> +        ctrl->ctrl.max_integrity_segments = ctrl->max_fr_pages;
>>         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>>   @@ -1154,13 +1203,24 @@ static void nvme_rdma_unmap_data(struct 
>> nvme_rdma_queue *queue,
>>       if (!blk_rq_nr_phys_segments(rq))
>>           return;
>>   +    if (blk_integrity_rq(rq)) {
>> +        ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
>> +            req->pi_sgl->nents, rq_dma_dir(rq));
>> +        sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
>> +    }
>> +
>>       if (req->mr) {
>> -        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>> +        if (req->is_protected)
>> +            ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
>> +        else
>> +            ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs,
>> +                       req->mr);
>>           req->mr = NULL;
>>       }
>>   -    ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, 
>> rq_dma_dir(rq));
>> -    sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
>> +    ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
>> +            req->data_sgl.nents, rq_dma_dir(rq));
>> +    sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>>   }
>>     static int nvme_rdma_set_sg_null(struct nvme_command *c)
>> @@ -1179,7 +1239,7 @@ static int nvme_rdma_map_sg_inline(struct 
>> nvme_rdma_queue *queue,
>>           int count)
>>   {
>>       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
>> -    struct scatterlist *sgl = req->sg_table.sgl;
>> +    struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
>>       struct ib_sge *sge = &req->sge[1];
>>       u32 len = 0;
>>       int i;
>> @@ -1204,40 +1264,115 @@ static int nvme_rdma_map_sg_single(struct 
>> nvme_rdma_queue *queue,
>>   {
>>       struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>   -    sg->addr = cpu_to_le64(sg_dma_address(req->sg_table.sgl));
>> -    put_unaligned_le24(sg_dma_len(req->sg_table.sgl), sg->length);
>> +    sg->addr = cpu_to_le64(sg_dma_address(req->data_sgl.sg_table.sgl));
>> + put_unaligned_le24(sg_dma_len(req->data_sgl.sg_table.sgl), 
>> sg->length);
>> put_unaligned_le32(queue->device->pd->unsafe_global_rkey, sg->key);
>>       sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
>>       return 0;
>>   }
>>   -static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>> -        struct nvme_rdma_request *req, struct nvme_command *c,
>> -        int count)
>> +static void nvme_rdma_set_diff_domain(struct nvme_command *cmd, 
>> struct bio *bio,
>> +        struct ib_sig_domain *domain, struct request *rq)
>>   {
>> -    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>> -    int nr;
>> +    struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
>> +    struct nvme_ns *ns = rq->rq_disk->private_data;
>>   -    req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>> -    if (WARN_ON_ONCE(!req->mr))
>> -        return -EAGAIN;
>> +    WARN_ON(bi == NULL);
>> +
>> +    domain->sig_type = IB_SIG_TYPE_T10_DIF;
>> +    domain->sig.dif.bg_type = IB_T10DIF_CRC;
>> +    domain->sig.dif.pi_interval = 1 << bi->interval_exp;
>> +    domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
>>         /*
>> -     * Align the MR to a 4K page size to match the ctrl page size and
>> -     * the block virtual boundary.
>> +     * At the moment we hard code those, but in the future
>> +     * we will take them from cmd.
>>        */
>> -    nr = ib_map_mr_sg(req->mr, req->sg_table.sgl, count, NULL, SZ_4K);
>> -    if (unlikely(nr < count)) {
>> -        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>> -        req->mr = NULL;
>> -        if (nr < 0)
>> -            return nr;
>> -        return -EINVAL;
>> +    domain->sig.dif.apptag_check_mask = 0xffff;
>> +    domain->sig.dif.app_escape = true;
>> +    domain->sig.dif.ref_escape = true;
>> +    if (ns->pi_type != NVME_NS_DPS_PI_TYPE3)
>> +        domain->sig.dif.ref_remap = true;
>> +}
>> +
>> +static void nvme_rdma_set_sig_attrs(struct bio *bio,
>> +        struct ib_sig_attrs *sig_attrs, struct nvme_command *c,
>> +        struct request *rq)
>> +{
>> +    u16 control = le16_to_cpu(c->rw.control);
>> +
>> +    memset(sig_attrs, 0, sizeof(*sig_attrs));
>> +
>> +    if (control & NVME_RW_PRINFO_PRACT) {
>> +        /* for WRITE_INSERT/READ_STRIP no memory domain */
>> +        sig_attrs->mem.sig_type = IB_SIG_TYPE_NONE;
>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
>> +        /* Clear the PRACT bit since HCA will generate/verify the PI */
>> +        control &= ~NVME_RW_PRINFO_PRACT;
>> +        c->rw.control = cpu_to_le16(control);
>> +    } else {
>> +        /* for WRITE_PASS/READ_PASS both wire/memory domains exist */
>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->mem, rq);
>>       }
>> +}
>> +
>> +static void nvme_rdma_set_prot_checks(struct nvme_command *cmd, u8 
>> *mask)
>> +{
>> +    *mask = 0;
>> +    if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_REF)
>> +        *mask |= IB_SIG_CHECK_REFTAG;
>> +    if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_GUARD)
>> +        *mask |= IB_SIG_CHECK_GUARD;
>> +}
>> +
>> +static void nvme_rdma_sig_done(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +    if (unlikely(wc->status != IB_WC_SUCCESS))
>> +        nvme_rdma_wr_error(cq, wc, "SIG");
>> +}
>> +
>> +static void nvme_rdma_set_pi_wr(struct nvme_rdma_request *req,
>> +                struct nvme_command *c)
>> +{
>> +    struct ib_sig_attrs *sig_attrs = req->mr->sig_attrs;
>> +    struct ib_reg_wr *wr = &req->reg_wr;
>> +    struct request *rq = blk_mq_rq_from_pdu(req);
>> +    struct bio *bio = rq->bio;
>> +    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>> +
>> +    nvme_rdma_set_sig_attrs(bio, sig_attrs, c, rq);
>> +
>> +    nvme_rdma_set_prot_checks(c, &sig_attrs->check_mask);
>>         ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>>   +    req->reg_cqe.done = nvme_rdma_sig_done;
>> +
>> +    memset(wr, 0, sizeof(*wr));
>> +    wr->wr.opcode = IB_WR_REG_MR_INTEGRITY;
>> +    wr->wr.wr_cqe = &req->reg_cqe;
>> +    wr->wr.num_sge = 0;
>> +    wr->wr.send_flags = 0;
>> +    wr->mr = req->mr;
>> +    wr->key = req->mr->rkey;
>> +    wr->access = IB_ACCESS_LOCAL_WRITE |
>> +             IB_ACCESS_REMOTE_READ |
>> +             IB_ACCESS_REMOTE_WRITE;
>> +
>> +    sg->addr = cpu_to_le64(req->mr->iova);
>> +    put_unaligned_le24(req->mr->length, sg->length);
>> +    put_unaligned_le32(req->mr->rkey, sg->key);
>> +    sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) | 
>> NVME_SGL_FMT_INVALIDATE;
>> +}
>> +
>> +static void nvme_rdma_set_reg_wr(struct nvme_rdma_request *req,
>> +                 struct nvme_command *c)
>> +{
>> +    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>> +
>>       req->reg_cqe.done = nvme_rdma_memreg_done;
>> +
>>       memset(&req->reg_wr, 0, sizeof(req->reg_wr));
>>       req->reg_wr.wr.opcode = IB_WR_REG_MR;
>>       req->reg_wr.wr.wr_cqe = &req->reg_cqe;
>> @@ -1253,8 +1388,52 @@ static int nvme_rdma_map_sg_fr(struct 
>> nvme_rdma_queue *queue,
>>       put_unaligned_le32(req->mr->rkey, sg->key);
>>       sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
>>               NVME_SGL_FMT_INVALIDATE;
>> +}
>> +
>> +static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>> +        struct nvme_rdma_request *req, struct nvme_command *c,
>> +        int count, int pi_count)
>> +{
>> +    struct nvme_rdma_sgl *sgl = &req->data_sgl;
>> +    int nr;
>> +
>> +    if (req->is_protected) {
>> +        req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
>> +        if (WARN_ON_ONCE(!req->mr))
>> +            return -EAGAIN;
>> +
>> +        nr = ib_map_mr_sg_pi(req->mr, sgl->sg_table.sgl, count, 0,
>> +                     req->pi_sgl->sg_table.sgl, pi_count, 0,
>> +                     SZ_4K);
>> +        if (unlikely(nr))
>> +            goto mr_put;
>> +
>> +        nvme_rdma_set_pi_wr(req, c);
>> +    } else {
>> +        req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>> +        if (WARN_ON_ONCE(!req->mr))
>> +            return -EAGAIN;
>> +        /*
>> +         * Align the MR to a 4K page size to match the ctrl page size
>> +         * and the block virtual boundary.
>> +         */
>> +        nr = ib_map_mr_sg(req->mr, sgl->sg_table.sgl, count, 0, SZ_4K);
>> +        if (unlikely(nr < count))
>> +            goto mr_put;
>> +
>> +        nvme_rdma_set_reg_wr(req, c);
>> +    }
>>         return 0;
>> +mr_put:
>> +    if (req->is_protected)
>> +        ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
>> +    else
>> +        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>> +    req->mr = NULL;
>> +    if (nr < 0)
>> +        return nr;
>> +    return -EINVAL;
>>   }
>>     static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>> @@ -1263,6 +1442,7 @@ static int nvme_rdma_map_data(struct 
>> nvme_rdma_queue *queue,
>>       struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>>       struct nvme_rdma_device *dev = queue->device;
>>       struct ib_device *ibdev = dev->dev;
>> +    int pi_count = 0;
>>       int count, ret;
>>         req->num_sge = 1;
>> @@ -1273,23 +1453,44 @@ static int nvme_rdma_map_data(struct 
>> nvme_rdma_queue *queue,
>>       if (!blk_rq_nr_phys_segments(rq))
>>           return nvme_rdma_set_sg_null(c);
>>   -    req->sg_table.sgl = req->first_sgl;
>> -    ret = sg_alloc_table_chained(&req->sg_table,
>> -            blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
>> -            SG_CHUNK_SIZE);
>> +    req->data_sgl.sg_table.sgl = req->data_sgl.first_sgl;
>> +    ret = sg_alloc_table_chained(&req->data_sgl.sg_table,
>> +            blk_rq_nr_phys_segments(rq),
>> +            req->data_sgl.sg_table.sgl, SG_CHUNK_SIZE);
>>       if (ret)
>>           return -ENOMEM;
>>   -    req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl);
>> +    req->data_sgl.nents = blk_rq_map_sg(rq->q, rq,
>> +                        req->data_sgl.sg_table.sgl);
>>   -    count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents,
>> -                  rq_dma_dir(rq));
>> +    count = ib_dma_map_sg(ibdev, req->data_sgl.sg_table.sgl,
>> +                  req->data_sgl.nents, rq_dma_dir(rq));
>>       if (unlikely(count <= 0)) {
>>           ret = -EIO;
>>           goto out_free_table;
>>       }
>>   -    if (count <= dev->num_inline_segments) {
>> +    if (blk_integrity_rq(rq)) {
>> +        req->pi_sgl->sg_table.sgl = req->pi_sgl->first_sgl;
>> +        ret = sg_alloc_table_chained(&req->pi_sgl->sg_table,
>> +                blk_rq_count_integrity_sg(rq->q, rq->bio),
>> +                req->pi_sgl->sg_table.sgl, SG_CHUNK_SIZE);
>> +        if (unlikely(ret)) {
>> +            ret = -ENOMEM;
>> +            goto out_unmap_sg;
>> +        }
>> +
>> +        req->pi_sgl->nents = blk_rq_map_integrity_sg(rq->q, rq->bio,
>> +                    req->pi_sgl->sg_table.sgl);
>> +        pi_count = ib_dma_map_sg(ibdev, req->pi_sgl->sg_table.sgl,
>> +                req->pi_sgl->nents, rq_dma_dir(rq));
>> +        if (unlikely(pi_count <= 0)) {
>> +            ret = -EIO;
>> +            goto out_free_pi_table;
>> +        }
>> +    }
>> +
>> +    if (count <= dev->num_inline_segments && !req->is_protected) {
>>           if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
>>               queue->ctrl->use_inline_data &&
>>               blk_rq_payload_bytes(rq) <=
>> @@ -1304,17 +1505,25 @@ static int nvme_rdma_map_data(struct 
>> nvme_rdma_queue *queue,
>>           }
>>       }
>>   -    ret = nvme_rdma_map_sg_fr(queue, req, c, count);
>> +    ret = nvme_rdma_map_sg_fr(queue, req, c, count, pi_count);
>>   out:
>>       if (unlikely(ret))
>> -        goto out_unmap_sg;
>> +        goto out_unmap_pi_sg;
>>         return 0;
>>   +out_unmap_pi_sg:
>> +    if (blk_integrity_rq(rq))
>> +        ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
>> +                req->pi_sgl->nents, rq_dma_dir(rq));
>> +out_free_pi_table:
>> +    if (blk_integrity_rq(rq))
>> +        sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
>>   out_unmap_sg:
>> -    ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, 
>> rq_dma_dir(rq));
>> +    ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
>> +            req->data_sgl.nents, rq_dma_dir(rq));
>>   out_free_table:
>> -    sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
>> +    sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>>       return ret;
>>   }
>>   @@ -1754,6 +1963,13 @@ static blk_status_t 
>> nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>>         blk_mq_start_request(rq);
>>   +    req->is_protected = false;
>> +    if (nvme_rdma_queue_idx(queue) && queue->pi_support) {
>> +        if (c->common.opcode == nvme_cmd_write ||
>> +            c->common.opcode == nvme_cmd_read)
>> +            req->is_protected = nvme_ns_has_pi(ns);
>> +    }
>> +
>
>> This check naturally belongs in nvme-core. Why not test
>> blk_integrity_rq()?

If write_generate and read_verify are set to 0, blk_integrity_rq() 
returns 0 as well, so it's not good enough for us.

The is_protected bool is a helper for the RDMA transport to distinguish 
between PI and non-PI requests.

I prefer not to carry it in the generic "struct nvme_request" when only 
one transport actually uses it for internal needs.

This is why we've added it to "struct nvme_rdma_request".

But if you prefer adding it to "struct nvme_request", it's doable.
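
For reference, blk_integrity_rq() only reports whether the block layer 
actually attached an integrity payload to the request, which depends on 
the write_generate/read_verify sysfs knobs rather than on the namespace 
format. A rough sketch of the relevant block-layer helper (from memory, 
not from this series; double-check against the tree):

--
/* include/linux/blkdev.h */
static inline unsigned short blk_integrity_rq(struct request *rq)
{
	return rq->cmd_flags & REQ_INTEGRITY;
}

/*
 * bio_integrity_prep() skips attaching a bip (and therefore setting
 * REQ_INTEGRITY) when generation/verification is disabled via sysfs,
 * even though the namespace is still formatted with PI and the
 * transport still has to program the wire signature domain.
 */
--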

>
> Overall this looks good!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support
  2019-11-13 13:56     ` Max Gurtovoy
@ 2019-11-14 23:45       ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-14 23:45 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: israelr, oren, idanb, vladimirk, shlomin


>> Max,
>>
>> It will really be much better if this comes in the rdma_cm
>> private_data... Have you considered getting this through the TWG?
> 
> can you explain what TWG is?

NVMe Technical Working Group

> In general, this sounds like a good idea.
> 
> I haven't considered it since it's not in the NVMf spec.
> 
> I guess we can add it to the specification in the future, to stay 
> compatible with other NVMf implementations besides the Linux kernel.

Would be good to have it for fabrics 1.2 (with a ratified TP
before that).
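
As a purely illustrative sketch (this is NOT in the NVMe-oF spec or in 
the posted series), advertising PI support through the RDMA/CM connect 
private data could mean carving a flags field out of the reserved bytes 
of struct nvme_rdma_cm_req:

--
/* include/linux/nvme-rdma.h, hypothetical extension */
struct nvme_rdma_cm_req {
	__le16		recfmt;
	__le16		qid;
	__le16		hrqsize;
	__le16		hsqsize;
	__le16		flags;		/* hypothetical: bit 0 = PI requested */
	u8		rsvd[22];	/* was rsvd[24] */
};
--

Any such layout would of course have to be ratified by the TWG first.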


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-13 14:35     ` Max Gurtovoy
@ 2019-11-14 23:57       ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2019-11-14 23:57 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme, kbusch, hch
  Cc: shlomin, israelr, oren, vladimirk, idanb



On 11/13/19 6:35 AM, Max Gurtovoy wrote:
> 
> On 11/12/2019 8:22 PM, Sagi Grimberg wrote:
>>
>>
>> On 11/5/19 8:20 AM, Max Gurtovoy wrote:
>>> For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
>>> protection information passthrough and validation for NVMe over RDMA
>>> transport. Metadata offload support was implemented over the new RDMA
>>> signature verbs API and it is enabled per controller by using nvme-cli.
>>>
>>> usage example:
>>> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 
>>> --nqn=test-nvme
>>>
>>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>>> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
>>> ---
>>>   drivers/nvme/host/rdma.c | 346 
>>> ++++++++++++++++++++++++++++++++++++++++-------
>>>   1 file changed, 298 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index 05f2dfa..16263b8 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -48,6 +48,12 @@ struct nvme_rdma_qe {
>>>       u64            dma;
>>>   };
>>>   +struct nvme_rdma_sgl {
>>> +    int            nents;
>>> +    struct sg_table        sg_table;
>>> +    struct scatterlist    first_sgl[SG_CHUNK_SIZE];
>>> +};
>>
>> How about we dynamically allocate this in nvme_rdma_init_request (with
>> a prep patch for it)? both for pi and data, this is getting quite
>> large...
>>
>>> +
>>>   struct nvme_rdma_queue;
>>>   struct nvme_rdma_request {
>>>       struct nvme_request    req;
>>> @@ -58,12 +64,14 @@ struct nvme_rdma_request {
>>>       refcount_t        ref;
>>>       struct ib_sge        sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
>>>       u32            num_sge;
>>> -    int            nents;
>>>       struct ib_reg_wr    reg_wr;
>>>       struct ib_cqe        reg_cqe;
>>>       struct nvme_rdma_queue  *queue;
>>> -    struct sg_table        sg_table;
>>> -    struct scatterlist    first_sgl[];
>>> +    /* T10-PI support */
>>> +    bool            is_protected;
>>> +
>>> +    struct nvme_rdma_sgl    data_sgl;
>>> +    struct nvme_rdma_sgl    pi_sgl[];
>>>   };
>>>     enum nvme_rdma_queue_flags {
>>> @@ -85,6 +93,7 @@ struct nvme_rdma_queue {
>>>       struct rdma_cm_id    *cm_id;
>>>       int            cm_error;
>>>       struct completion    cm_done;
>>> +    bool            pi_support;
>>
>> Any specific reason for the queue local pi_support?
>>
>>>   };
>>>     struct nvme_rdma_ctrl {
>>> @@ -112,6 +121,7 @@ struct nvme_rdma_ctrl {
>>>       struct nvme_ctrl    ctrl;
>>>       bool            use_inline_data;
>>>       u32            io_queues[HCTX_MAX_TYPES];
>>> +    bool            pi_support;
>>>   };
>>>     static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct 
>>> nvme_ctrl *ctrl)
>>> @@ -269,6 +279,8 @@ static int nvme_rdma_create_qp(struct 
>>> nvme_rdma_queue *queue, const int factor)
>>>       init_attr.qp_type = IB_QPT_RC;
>>>       init_attr.send_cq = queue->ib_cq;
>>>       init_attr.recv_cq = queue->ib_cq;
>>> +    if (queue->pi_support)
>>> +        init_attr.create_flags |= IB_QP_CREATE_INTEGRITY_EN;
>>>         ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
>>>   @@ -408,6 +420,8 @@ static void nvme_rdma_destroy_queue_ib(struct 
>>> nvme_rdma_queue *queue)
>>>       dev = queue->device;
>>>       ibdev = dev->dev;
>>>   +    if (queue->pi_support)
>>> +        ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
>>>       ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>>>         /*
>>> @@ -424,10 +438,14 @@ static void nvme_rdma_destroy_queue_ib(struct 
>>> nvme_rdma_queue *queue)
>>>       nvme_rdma_dev_put(dev);
>>>   }
>>>   -static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev)
>>> +static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev, bool 
>>> pi_support)
>>>   {
>>> -    return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>>> -             ibdev->attrs.max_fast_reg_page_list_len - 1);
>>> +    if (pi_support)
>>> +        return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>>> +                 ibdev->attrs.max_pi_fast_reg_page_list_len - 1);
>>> +    else
>>> +        return min_t(u32, NVME_RDMA_MAX_SEGMENTS,
>>> +                 ibdev->attrs.max_fast_reg_page_list_len - 1);
>>>   }
>>
>> Maybe just take the max_page_list_len on the condition and take the 
>> min once.
>>
>>>     static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>>> @@ -484,7 +502,7 @@ static int nvme_rdma_create_queue_ib(struct 
>>> nvme_rdma_queue *queue)
>>>        * misaligned we'll end up using two entries for a single data 
>>> page,
>>>        * so one additional entry is required.
>>>        */
>>> -    pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev) + 1;
>>> +    pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, 
>>> queue->pi_support) + 1;
>>>       ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
>>>                     queue->queue_size,
>>>                     IB_MR_TYPE_MEM_REG,
>>> @@ -496,10 +514,24 @@ static int nvme_rdma_create_queue_ib(struct 
>>> nvme_rdma_queue *queue)
>>>           goto out_destroy_ring;
>>>       }
>>>   +    if (queue->pi_support) {
>>> +        ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
>>> +                      queue->queue_size, IB_MR_TYPE_INTEGRITY,
>>> +                      pages_per_mr, pages_per_mr);
>>> +        if (ret) {
>>> +            dev_err(queue->ctrl->ctrl.device,
>>> +                "failed to initialize PI MR pool sized %d for QID 
>>> %d\n",
>>> +                queue->queue_size, idx);
>>> +            goto out_destroy_mr_pool;
>>> +        }
>>> +    }
>>> +
>>>       set_bit(NVME_RDMA_Q_TR_READY, &queue->flags);
>>>         return 0;
>>>   +out_destroy_mr_pool:
>>> +    ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
>>>   out_destroy_ring:
>>>       nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
>>>                   sizeof(struct nvme_completion), DMA_FROM_DEVICE);
>>> @@ -521,6 +553,7 @@ static int nvme_rdma_alloc_queue(struct 
>>> nvme_rdma_ctrl *ctrl,
>>>         queue = &ctrl->queues[idx];
>>>       queue->ctrl = ctrl;
>>> +    queue->pi_support = idx && ctrl->pi_support;
>>
>> That's why...
>>
>>> init_completion(&queue->cm_done);
>>>         if (idx > 0)
>>> @@ -730,8 +763,7 @@ static struct blk_mq_tag_set 
>>> *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>>>           set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
>>>           set->reserved_tags = 2; /* connect + keep-alive */
>>>           set->numa_node = nctrl->numa_node;
>>> -        set->cmd_size = sizeof(struct nvme_rdma_request) +
>>> -            SG_CHUNK_SIZE * sizeof(struct scatterlist);
>>> +        set->cmd_size = sizeof(struct nvme_rdma_request);
>>>           set->driver_data = ctrl;
>>>           set->nr_hw_queues = 1;
>>>           set->timeout = ADMIN_TIMEOUT;
>>> @@ -745,7 +777,7 @@ static struct blk_mq_tag_set 
>>> *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>>>           set->numa_node = nctrl->numa_node;
>>>           set->flags = BLK_MQ_F_SHOULD_MERGE;
>>>           set->cmd_size = sizeof(struct nvme_rdma_request) +
>>> -            SG_CHUNK_SIZE * sizeof(struct scatterlist);
>>> +            (ctrl->pi_support * sizeof(struct nvme_rdma_sgl));
>>>           set->driver_data = ctrl;
>>>           set->nr_hw_queues = nctrl->queue_count - 1;
>>>           set->timeout = NVME_IO_TIMEOUT;
>>> @@ -787,7 +819,22 @@ static int 
>>> nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>>       ctrl->device = ctrl->queues[0].device;
>>>       ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
>>>   -    ctrl->max_fr_pages = 
>>> nvme_rdma_get_max_fr_pages(ctrl->device->dev);
>>> +    /* T10-PI support */
>>> +    if (ctrl->ctrl.opts->pi_enable) {
>>> +        if (!(ctrl->device->dev->attrs.device_cap_flags &
>>> +              IB_DEVICE_INTEGRITY_HANDOVER)) {
>>> +            dev_warn(ctrl->ctrl.device,
>>> +                 "T10-PI requested but not supported on %s, continue 
>>> without T10-PI\n",
>>> +                 ctrl->device->dev->name);
>>> +            ctrl->pi_support = false;
>>> +            ctrl->ctrl.opts->pi_enable = false;
>>> +        } else {
>>> +            ctrl->pi_support = true;
>>> +        }
>>> +    }
>>> +
>>> +    ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev,
>>> +                            ctrl->pi_support);
>>>         /*
>>>        * Bind the async event SQE DMA mapping to the admin queue 
>>> lifetime.
>>> @@ -829,6 +876,8 @@ static int nvme_rdma_configure_admin_queue(struct 
>>> nvme_rdma_ctrl *ctrl,
>>>         ctrl->ctrl.max_segments = ctrl->max_fr_pages;
>>>       ctrl->ctrl.max_hw_sectors = ctrl->max_fr_pages << (ilog2(SZ_4K) 
>>> - 9);
>>> +    if (ctrl->pi_support)
>>> +        ctrl->ctrl.max_integrity_segments = ctrl->max_fr_pages;
>>>         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>>>   @@ -1154,13 +1203,24 @@ static void nvme_rdma_unmap_data(struct 
>>> nvme_rdma_queue *queue,
>>>       if (!blk_rq_nr_phys_segments(rq))
>>>           return;
>>>   +    if (blk_integrity_rq(rq)) {
>>> +        ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
>>> +            req->pi_sgl->nents, rq_dma_dir(rq));
>>> +        sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
>>> +    }
>>> +
>>>       if (req->mr) {
>>> -        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>>> +        if (req->is_protected)
>>> +            ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
>>> +        else
>>> +            ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs,
>>> +                       req->mr);
>>>           req->mr = NULL;
>>>       }
>>>   -    ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, 
>>> rq_dma_dir(rq));
>>> -    sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
>>> +    ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
>>> +            req->data_sgl.nents, rq_dma_dir(rq));
>>> +    sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>>>   }
>>>     static int nvme_rdma_set_sg_null(struct nvme_command *c)
>>> @@ -1179,7 +1239,7 @@ static int nvme_rdma_map_sg_inline(struct 
>>> nvme_rdma_queue *queue,
>>>           int count)
>>>   {
>>>       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
>>> -    struct scatterlist *sgl = req->sg_table.sgl;
>>> +    struct scatterlist *sgl = req->data_sgl.sg_table.sgl;
>>>       struct ib_sge *sge = &req->sge[1];
>>>       u32 len = 0;
>>>       int i;
>>> @@ -1204,40 +1264,115 @@ static int nvme_rdma_map_sg_single(struct 
>>> nvme_rdma_queue *queue,
>>>   {
>>>       struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>>   -    sg->addr = cpu_to_le64(sg_dma_address(req->sg_table.sgl));
>>> -    put_unaligned_le24(sg_dma_len(req->sg_table.sgl), sg->length);
>>> +    sg->addr = cpu_to_le64(sg_dma_address(req->data_sgl.sg_table.sgl));
>>> + put_unaligned_le24(sg_dma_len(req->data_sgl.sg_table.sgl), 
>>> sg->length);
>>> put_unaligned_le32(queue->device->pd->unsafe_global_rkey, sg->key);
>>>       sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
>>>       return 0;
>>>   }
>>>   -static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>>> -        struct nvme_rdma_request *req, struct nvme_command *c,
>>> -        int count)
>>> +static void nvme_rdma_set_diff_domain(struct nvme_command *cmd, 
>>> struct bio *bio,
>>> +        struct ib_sig_domain *domain, struct request *rq)
>>>   {
>>> -    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> -    int nr;
>>> +    struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
>>> +    struct nvme_ns *ns = rq->rq_disk->private_data;
>>>   -    req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>>> -    if (WARN_ON_ONCE(!req->mr))
>>> -        return -EAGAIN;
>>> +    WARN_ON(bi == NULL);
>>> +
>>> +    domain->sig_type = IB_SIG_TYPE_T10_DIF;
>>> +    domain->sig.dif.bg_type = IB_T10DIF_CRC;
>>> +    domain->sig.dif.pi_interval = 1 << bi->interval_exp;
>>> +    domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
>>>         /*
>>> -     * Align the MR to a 4K page size to match the ctrl page size and
>>> -     * the block virtual boundary.
>>> +     * At the moment we hard code those, but in the future
>>> +     * we will take them from cmd.
>>>        */
>>> -    nr = ib_map_mr_sg(req->mr, req->sg_table.sgl, count, NULL, SZ_4K);
>>> -    if (unlikely(nr < count)) {
>>> -        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>>> -        req->mr = NULL;
>>> -        if (nr < 0)
>>> -            return nr;
>>> -        return -EINVAL;
>>> +    domain->sig.dif.apptag_check_mask = 0xffff;
>>> +    domain->sig.dif.app_escape = true;
>>> +    domain->sig.dif.ref_escape = true;
>>> +    if (ns->pi_type != NVME_NS_DPS_PI_TYPE3)
>>> +        domain->sig.dif.ref_remap = true;
>>> +}
>>> +
>>> +static void nvme_rdma_set_sig_attrs(struct bio *bio,
>>> +        struct ib_sig_attrs *sig_attrs, struct nvme_command *c,
>>> +        struct request *rq)
>>> +{
>>> +    u16 control = le16_to_cpu(c->rw.control);
>>> +
>>> +    memset(sig_attrs, 0, sizeof(*sig_attrs));
>>> +
>>> +    if (control & NVME_RW_PRINFO_PRACT) {
>>> +        /* for WRITE_INSERT/READ_STRIP no memory domain */
>>> +        sig_attrs->mem.sig_type = IB_SIG_TYPE_NONE;
>>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
>>> +        /* Clear the PRACT bit since HCA will generate/verify the PI */
>>> +        control &= ~NVME_RW_PRINFO_PRACT;
>>> +        c->rw.control = cpu_to_le16(control);
>>> +    } else {
>>> +        /* for WRITE_PASS/READ_PASS both wire/memory domains exist */
>>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->wire, rq);
>>> +        nvme_rdma_set_diff_domain(c, bio, &sig_attrs->mem, rq);
>>>       }
>>> +}
>>> +
>>> +static void nvme_rdma_set_prot_checks(struct nvme_command *cmd, u8 
>>> *mask)
>>> +{
>>> +    *mask = 0;
>>> +    if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_REF)
>>> +        *mask |= IB_SIG_CHECK_REFTAG;
>>> +    if (le16_to_cpu(cmd->rw.control) & NVME_RW_PRINFO_PRCHK_GUARD)
>>> +        *mask |= IB_SIG_CHECK_GUARD;
>>> +}
>>> +
>>> +static void nvme_rdma_sig_done(struct ib_cq *cq, struct ib_wc *wc)
>>> +{
>>> +    if (unlikely(wc->status != IB_WC_SUCCESS))
>>> +        nvme_rdma_wr_error(cq, wc, "SIG");
>>> +}
>>> +
>>> +static void nvme_rdma_set_pi_wr(struct nvme_rdma_request *req,
>>> +                struct nvme_command *c)
>>> +{
>>> +    struct ib_sig_attrs *sig_attrs = req->mr->sig_attrs;
>>> +    struct ib_reg_wr *wr = &req->reg_wr;
>>> +    struct request *rq = blk_mq_rq_from_pdu(req);
>>> +    struct bio *bio = rq->bio;
>>> +    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> +
>>> +    nvme_rdma_set_sig_attrs(bio, sig_attrs, c, rq);
>>> +
>>> +    nvme_rdma_set_prot_checks(c, &sig_attrs->check_mask);
>>>         ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>>>   +    req->reg_cqe.done = nvme_rdma_sig_done;
>>> +
>>> +    memset(wr, 0, sizeof(*wr));
>>> +    wr->wr.opcode = IB_WR_REG_MR_INTEGRITY;
>>> +    wr->wr.wr_cqe = &req->reg_cqe;
>>> +    wr->wr.num_sge = 0;
>>> +    wr->wr.send_flags = 0;
>>> +    wr->mr = req->mr;
>>> +    wr->key = req->mr->rkey;
>>> +    wr->access = IB_ACCESS_LOCAL_WRITE |
>>> +             IB_ACCESS_REMOTE_READ |
>>> +             IB_ACCESS_REMOTE_WRITE;
>>> +
>>> +    sg->addr = cpu_to_le64(req->mr->iova);
>>> +    put_unaligned_le24(req->mr->length, sg->length);
>>> +    put_unaligned_le32(req->mr->rkey, sg->key);
>>> +    sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) | 
>>> NVME_SGL_FMT_INVALIDATE;
>>> +}
>>> +
>>> +static void nvme_rdma_set_reg_wr(struct nvme_rdma_request *req,
>>> +                 struct nvme_command *c)
>>> +{
>>> +    struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> +
>>>       req->reg_cqe.done = nvme_rdma_memreg_done;
>>> +
>>>       memset(&req->reg_wr, 0, sizeof(req->reg_wr));
>>>       req->reg_wr.wr.opcode = IB_WR_REG_MR;
>>>       req->reg_wr.wr.wr_cqe = &req->reg_cqe;
>>> @@ -1253,8 +1388,52 @@ static int nvme_rdma_map_sg_fr(struct 
>>> nvme_rdma_queue *queue,
>>>       put_unaligned_le32(req->mr->rkey, sg->key);
>>>       sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
>>>               NVME_SGL_FMT_INVALIDATE;
>>> +}
>>> +
>>> +static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>>> +        struct nvme_rdma_request *req, struct nvme_command *c,
>>> +        int count, int pi_count)
>>> +{
>>> +    struct nvme_rdma_sgl *sgl = &req->data_sgl;
>>> +    int nr;
>>> +
>>> +    if (req->is_protected) {
>>> +        req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
>>> +        if (WARN_ON_ONCE(!req->mr))
>>> +            return -EAGAIN;
>>> +
>>> +        nr = ib_map_mr_sg_pi(req->mr, sgl->sg_table.sgl, count, 0,
>>> +                     req->pi_sgl->sg_table.sgl, pi_count, 0,
>>> +                     SZ_4K);
>>> +        if (unlikely(nr))
>>> +            goto mr_put;
>>> +
>>> +        nvme_rdma_set_pi_wr(req, c);
>>> +    } else {
>>> +        req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>>> +        if (WARN_ON_ONCE(!req->mr))
>>> +            return -EAGAIN;
>>> +        /*
>>> +         * Align the MR to a 4K page size to match the ctrl page size
>>> +         * and the block virtual boundary.
>>> +         */
>>> +        nr = ib_map_mr_sg(req->mr, sgl->sg_table.sgl, count, 0, SZ_4K);
>>> +        if (unlikely(nr < count))
>>> +            goto mr_put;
>>> +
>>> +        nvme_rdma_set_reg_wr(req, c);
>>> +    }
>>>         return 0;
>>> +mr_put:
>>> +    if (req->is_protected)
>>> +        ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
>>> +    else
>>> +        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>>> +    req->mr = NULL;
>>> +    if (nr < 0)
>>> +        return nr;
>>> +    return -EINVAL;
>>>   }
>>>     static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>>> @@ -1263,6 +1442,7 @@ static int nvme_rdma_map_data(struct 
>>> nvme_rdma_queue *queue,
>>>       struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>>>       struct nvme_rdma_device *dev = queue->device;
>>>       struct ib_device *ibdev = dev->dev;
>>> +    int pi_count = 0;
>>>       int count, ret;
>>>         req->num_sge = 1;
>>> @@ -1273,23 +1453,44 @@ static int nvme_rdma_map_data(struct 
>>> nvme_rdma_queue *queue,
>>>       if (!blk_rq_nr_phys_segments(rq))
>>>           return nvme_rdma_set_sg_null(c);
>>>   -    req->sg_table.sgl = req->first_sgl;
>>> -    ret = sg_alloc_table_chained(&req->sg_table,
>>> -            blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
>>> -            SG_CHUNK_SIZE);
>>> +    req->data_sgl.sg_table.sgl = req->data_sgl.first_sgl;
>>> +    ret = sg_alloc_table_chained(&req->data_sgl.sg_table,
>>> +            blk_rq_nr_phys_segments(rq),
>>> +            req->data_sgl.sg_table.sgl, SG_CHUNK_SIZE);
>>>       if (ret)
>>>           return -ENOMEM;
>>>   -    req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl);
>>> +    req->data_sgl.nents = blk_rq_map_sg(rq->q, rq,
>>> +                        req->data_sgl.sg_table.sgl);
>>>   -    count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents,
>>> -                  rq_dma_dir(rq));
>>> +    count = ib_dma_map_sg(ibdev, req->data_sgl.sg_table.sgl,
>>> +                  req->data_sgl.nents, rq_dma_dir(rq));
>>>       if (unlikely(count <= 0)) {
>>>           ret = -EIO;
>>>           goto out_free_table;
>>>       }
>>>   -    if (count <= dev->num_inline_segments) {
>>> +    if (blk_integrity_rq(rq)) {
>>> +        req->pi_sgl->sg_table.sgl = req->pi_sgl->first_sgl;
>>> +        ret = sg_alloc_table_chained(&req->pi_sgl->sg_table,
>>> +                blk_rq_count_integrity_sg(rq->q, rq->bio),
>>> +                req->pi_sgl->sg_table.sgl, SG_CHUNK_SIZE);
>>> +        if (unlikely(ret)) {
>>> +            ret = -ENOMEM;
>>> +            goto out_unmap_sg;
>>> +        }
>>> +
>>> +        req->pi_sgl->nents = blk_rq_map_integrity_sg(rq->q, rq->bio,
>>> +                    req->pi_sgl->sg_table.sgl);
>>> +        pi_count = ib_dma_map_sg(ibdev, req->pi_sgl->sg_table.sgl,
>>> +                req->pi_sgl->nents, rq_dma_dir(rq));
>>> +        if (unlikely(pi_count <= 0)) {
>>> +            ret = -EIO;
>>> +            goto out_free_pi_table;
>>> +        }
>>> +    }
>>> +
>>> +    if (count <= dev->num_inline_segments && !req->is_protected) {
>>>           if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
>>>               queue->ctrl->use_inline_data &&
>>>               blk_rq_payload_bytes(rq) <=
>>> @@ -1304,17 +1505,25 @@ static int nvme_rdma_map_data(struct 
>>> nvme_rdma_queue *queue,
>>>           }
>>>       }
>>>   -    ret = nvme_rdma_map_sg_fr(queue, req, c, count);
>>> +    ret = nvme_rdma_map_sg_fr(queue, req, c, count, pi_count);
>>>   out:
>>>       if (unlikely(ret))
>>> -        goto out_unmap_sg;
>>> +        goto out_unmap_pi_sg;
>>>         return 0;
>>>   +out_unmap_pi_sg:
>>> +    if (blk_integrity_rq(rq))
>>> +        ib_dma_unmap_sg(ibdev, req->pi_sgl->sg_table.sgl,
>>> +                req->pi_sgl->nents, rq_dma_dir(rq));
>>> +out_free_pi_table:
>>> +    if (blk_integrity_rq(rq))
>>> +        sg_free_table_chained(&req->pi_sgl->sg_table, SG_CHUNK_SIZE);
>>>   out_unmap_sg:
>>> -    ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, 
>>> rq_dma_dir(rq));
>>> +    ib_dma_unmap_sg(ibdev, req->data_sgl.sg_table.sgl,
>>> +            req->data_sgl.nents, rq_dma_dir(rq));
>>>   out_free_table:
>>> -    sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
>>> +    sg_free_table_chained(&req->data_sgl.sg_table, SG_CHUNK_SIZE);
>>>       return ret;
>>>   }
>>>   @@ -1754,6 +1963,13 @@ static blk_status_t 
>>> nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>         blk_mq_start_request(rq);
>>>   +    req->is_protected = false;
>>> +    if (nvme_rdma_queue_idx(queue) && queue->pi_support) {
>>> +        if (c->common.opcode == nvme_cmd_write ||
>>> +            c->common.opcode == nvme_cmd_read)
>>> +            req->is_protected = nvme_ns_has_pi(ns);
>>> +    }
>>> +
>>
>> This check naturally belongs in nvme-core. Why not test
>> blk_integrity_rq()?
> 
> If write_generate and read_verify are set to 0, blk_integrity_rq() 
> returns 0 as well, so it's not good enough for us.
> 
> The is_protected bool is a helper for the RDMA transport to distinguish 
> between PI and non-PI requests.

This is a hack. You need to take this info from the command.
How about flagging it when PRINFO is set?

Something like:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4be64703aa47..12ef260296e1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -434,6 +434,7 @@ static inline void nvme_clear_nvme_request(struct request *req)
         if (!(req->rq_flags & RQF_DONTPREP)) {
                 nvme_req(req)->retries = 0;
                 nvme_req(req)->flags = 0;
+               nvme_req(req)->has_pi = false;
                 req->rq_flags |= RQF_DONTPREP;
         }
  }
@@ -684,6 +685,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
                         cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
                         break;
                 }
+               nvme_req(req)->has_pi = true;
         }

         cmnd->rw.control = cpu_to_le16(control);
--
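
A hedged sketch of the consuming side, assuming the has_pi flag from the 
diff above were adopted (this is not part of the posted series):

--
	/* in nvme_rdma_queue_rq(): the transport no longer inspects the
	 * opcode itself, it just combines the per-queue capability with
	 * the core-provided flag.
	 */
	req->is_protected = queue->pi_support && nvme_req(rq)->has_pi;
--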


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support
  2019-11-05 17:58   ` Christoph Hellwig
@ 2019-11-20 10:41     ` Max Gurtovoy
  0 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2019-11-20 10:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: sagi, vladimirk, idanb, israelr, linux-nvme, shlomin, oren, kbusch


On 11/5/2019 7:58 PM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2019 at 06:20:16PM +0200, Max Gurtovoy wrote:
>> For capable HCAs (e.g. ConnectX-4/ConnectX-5) this will allow end-to-end
>> protection information passthrough and validation for NVMe over RDMA
>> transport. Metadata offload support was implemented over the new RDMA
>> signature verbs API and it is enabled per controller by using nvme-cli.
>>
>> usage example:
>> nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
>>
>> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
>> ---
>>   drivers/nvme/host/rdma.c | 346 ++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 298 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 05f2dfa..16263b8 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -48,6 +48,12 @@ struct nvme_rdma_qe {
>>   	u64			dma;
>>   };
>>   
>> +struct nvme_rdma_sgl {
>> +	int			nents;
>> +	struct sg_table		sg_table;
>> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
> I think this needs some rework.  Kill the first_sgl pointer, and then
> just set the pointer in the table to the address found by pointer
> arithmetic behind the nvme_request in the allocation.
>
> We also have an open todo item to not actually allocate SG_CHUNK_SIZE
> entries, but a much smaller value like in SCSI.

do you mean to use a "real" SGL chain in the NVMf/RDMA code?

We can do that as a separate preparation commit, but we need to make sure 
that performance is not affected by this change.
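
A possible shape for that preparation step, loosely modeled on what SCSI 
does with its small inline scatterlist (the names here are illustrative, 
not from the posted series):

--
#define NVME_RDMA_INLINE_SG_CNT	2	/* hypothetical, like SCSI_INLINE_SG_CNT */

struct nvme_rdma_sgl {
	int			nents;
	struct sg_table		sg_table;
	struct scatterlist	first_sgl[NVME_RDMA_INLINE_SG_CNT];
};

	/* in nvme_rdma_map_data(): sg_alloc_table_chained() falls back to
	 * dynamically allocated chained chunks once the request has more
	 * physical segments than the inline count.
	 */
	req->data_sgl.sg_table.sgl = req->data_sgl.first_sgl;
	ret = sg_alloc_table_chained(&req->data_sgl.sg_table,
			blk_rq_nr_phys_segments(rq),
			req->data_sgl.sg_table.sgl,
			NVME_RDMA_INLINE_SG_CNT);
--

Whether a two-entry inline SGL is the right trade-off is exactly what the 
performance runs mentioned above would have to confirm.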


>
> Also the whole switch to use struct sg_table should be a separate prep
> patch.
>
>> -	struct sg_table		sg_table;
>> -	struct scatterlist	first_sgl[];
>> +	/* T10-PI support */
>> +	bool			is_protected;
> This is a bit of an odd variable name.  Why not use_pi or something like
> that?
>
> Also all the new code should only be built if CONFIG_BLK_DEV_INTEGRITY
> is set.
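
One hedged way to honor that (illustrative only, not from the posted 
series) is to gate the PI path on CONFIG_BLK_DEV_INTEGRITY so it compiles 
away when the option is off:

--
	/* e.g. in nvme_rdma_queue_rq(), before the mapping step */
	req->is_protected = false;
	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
	    queue->pi_support && nvme_rdma_queue_idx(queue) &&
	    (c->common.opcode == nvme_cmd_write ||
	     c->common.opcode == nvme_cmd_read))
		req->is_protected = nvme_ns_has_pi(ns);
--

The signature-MR pool setup and the bip mapping helpers could be wrapped 
the same way, or moved into a block compiled only when the option is 
enabled.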


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2019-11-20 10:43 UTC | newest]

Thread overview: 52+ messages
2019-11-05 16:20 [PATCH 00/15] nvme-rdma/nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
2019-11-05 16:20 ` [PATCH] nvme-cli/fabrics: Add pi_enable param to connect cmd Max Gurtovoy
2019-11-07  2:36   ` Martin K. Petersen
2019-11-07 12:02     ` Max Gurtovoy
2019-11-09  1:55       ` Martin K. Petersen
2019-11-10  9:25         ` Max Gurtovoy
2019-11-12 17:47           ` Sagi Grimberg
2019-11-12 18:14             ` James Smart
2019-11-12 18:23               ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 01/15] nvme-fabrics: allow user enabling metadata/T10-PI support Max Gurtovoy
2019-11-12 17:48   ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 02/15] nvme: Fail __nvme_revalidate_disk in case of a spec violation Max Gurtovoy
2019-11-05 17:49   ` Christoph Hellwig
2019-11-12 17:52     ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 03/15] nvme: Introduce max_integrity_segments ctrl attribute Max Gurtovoy
2019-11-05 17:49   ` Christoph Hellwig
2019-11-12 17:53   ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 04/15] nvme: Inline nvme_ns_has_pi function Max Gurtovoy
2019-11-05 17:50   ` Christoph Hellwig
2019-11-05 16:20 ` [PATCH 05/15] nvme-rdma: Add metadata/T10-PI support Max Gurtovoy
2019-11-05 17:58   ` Christoph Hellwig
2019-11-20 10:41     ` Max Gurtovoy
2019-11-12 18:22   ` Sagi Grimberg
2019-11-13 14:35     ` Max Gurtovoy
2019-11-14 23:57       ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 06/15] block: Introduce BIP_NOMAP_INTEGRITY bip_flag Max Gurtovoy
2019-11-05 17:52   ` Christoph Hellwig
2019-11-07  2:43   ` Martin K. Petersen
2019-11-07 13:29     ` Max Gurtovoy
2019-11-09  2:10       ` Martin K. Petersen
2019-11-12 10:40         ` Max Gurtovoy
2019-11-05 16:20 ` [PATCH 07/15] nvmet: Prepare metadata request Max Gurtovoy
2019-11-05 16:20 ` [PATCH 08/15] nvmet: Add metadata characteristics for a namespace Max Gurtovoy
2019-11-05 17:59   ` Christoph Hellwig
2019-11-12 18:38     ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 09/15] nvmet: Rename nvmet_rw_len to nvmet_rw_data_len Max Gurtovoy
2019-11-05 17:59   ` Christoph Hellwig
2019-11-12 18:39   ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 10/15] nvmet: Rename nvmet_check_data_len to nvmet_check_transfer_len Max Gurtovoy
2019-11-05 18:00   ` Christoph Hellwig
2019-11-12 18:43   ` Sagi Grimberg
2019-11-05 16:20 ` [PATCH 11/15] nvmet: Introduce nvmet_rw_prot_len and nvmet_ns_has_pi Max Gurtovoy
2019-11-05 18:00   ` Christoph Hellwig
2019-11-05 16:20 ` [PATCH 12/15] nvme: Add Metadata Capabilities enumerations Max Gurtovoy
2019-11-05 16:20 ` [PATCH 13/15] nvmet: Add metadata/T10-PI support Max Gurtovoy
2019-11-05 16:20 ` [PATCH 14/15] nvmet: Add metadata support for block devices Max Gurtovoy
2019-11-05 16:20 ` [PATCH 15/15] nvmet-rdma: Add metadata/T10-PI support Max Gurtovoy
2019-11-05 18:02   ` Christoph Hellwig
2019-11-07 13:43     ` Max Gurtovoy
2019-11-12 18:34   ` Sagi Grimberg
2019-11-13 13:56     ` Max Gurtovoy
2019-11-14 23:45       ` Sagi Grimberg
