* [RFC 0/8] Reliably generate large request from SRP
@ 2011-01-19  4:27 David Dillow
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
       [not found] ` <1300148888.2772.15.camel@lap75545.ornl.gov>
  0 siblings, 2 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

A persistent thorn in our side has been getting large (1 MB+) requests
from SRP on a system that has been up for any period of time. As we're
using RAID6 8+2 LUNs, we need to generate a full 1 MB IO to avoid a R/M/W
cycle on some hardware, and other hardware just likes the larger requests,
even without the penalty of an R/M/W cycle. The existing code could not
reliably generate such requests because its sg_tablesize was capped at
255 or fewer entries by the number of descriptors that fit in the
SRP_CMD message.

Now that at least one vendor is implementing full support for the SRP
indirect memory descriptor tables, we can safely expand the sg_tablesize,
and realize some performance gains, in many cases quite large. I don't
have vendor code that implements the full support needed for safety, but
the rareness of FMR mapping failures allows the mapping code to function,
at some risk, with existing targets.

I've done some quick testing against an older generation of hardware RAID6
for these numbers.  They are streaming writes using a queue depth of 64.
The SATA numbers are against a LUN built with 8+2 1 TB SATA drives; the SAS
numbers are against a LUN built with two volumes of 8+2 1 TB SAS drives in
a RAID 0 config.  In all cases, the write cache is disabled, and
dma_boundary on the SRP initiator is set such that no coalescing occurs on
the SG list. The IOMMU has been disabled, and max_sectors_kb has been set
to the IO size under test, which matches the IO request size from the
application. For the baseline testing, the IO request is broken into
multiple pieces before being sent due to the sg_tablesize being capped at
255. For the patched numbers, the request was sent intact. These numbers
are for SRP_FMR_SIZE == 256, but I expect the 512 numbers to be similar.


Device	Size	Baseline	Patched
SAS	1M	524 MB/s	1004 MB/s
SAS	2M	520 MB/s	861 MB/s
SAS	4M	529 MB/s	921 MB/s
SAS	8M	600 MB/s	951 MB/s

SATA	1M	385 MB/s	515 MB/s
SATA	2M	394 MB/s	591 MB/s
SATA	4M	377 MB/s	565 MB/s
SATA	8M	419 MB/s	616 MB/s


Similar gains are found at other queue depths, but I've not done a full
parameter search.

Testing the lock scaling capability with fio indicates an increase in
command throughput except in the single-threaded case. This is an
unexpected improvement and needs further examination.

I've only played with performance testing; I need to test data integrity
as well.



David Dillow (8):
  IB/srp: always avoid non-zero offsets into an FMR
  IB/srp: move IB CM setup completion into its own function
  IB/srp: allow sg_tablesize to be set for each target
  IB/srp: rework mapping engine to use multiple FMR entries
  IB/srp: add safety valve for large SG tables without HW support
  IB/srp: add support for indirect tables that don't fit in SRP_CMD
  IB/srp: try to use larger FMR sizes to cover our mappings
  IB/srp and direct IO: patches for testing large indirect tables

 drivers/infiniband/ulp/srp/ib_srp.c |  736 +++++++++++++++++++++++------------
 drivers/infiniband/ulp/srp/ib_srp.h |   38 ++-
 fs/direct-io.c                      |    1 +
 3 files changed, 525 insertions(+), 250 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC 1/8] IB/srp: always avoid non-zero offsets into an FMR
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
@ 2011-01-19  4:27   ` David Dillow
  2011-01-19  4:27   ` [RFC 2/8] IB/srp: move IB CM setup completion into its own function David Dillow
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

It is unclear what bug this workaround fixed, so it is not clear how
safe it is to remove, or whether the problem was actually an HCA issue
with FMR. Always enable the workaround for now.

TODO: Better description
---
 drivers/infiniband/ulp/srp/ib_srp.c |   17 +----------------
 1 files changed, 1 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 83664ed..197e26c 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -72,12 +72,6 @@ module_param(topspin_workarounds, int, 0444);
 MODULE_PARM_DESC(topspin_workarounds,
 		 "Enable workarounds for Topspin/Cisco SRP target bugs if != 0");
 
-static int mellanox_workarounds = 1;
-
-module_param(mellanox_workarounds, int, 0444);
-MODULE_PARM_DESC(mellanox_workarounds,
-		 "Enable workarounds for Mellanox SRP target bugs if != 0");
-
 static void srp_add_one(struct ib_device *device);
 static void srp_remove_one(struct ib_device *device);
 static void srp_recv_completion(struct ib_cq *cq, void *target_ptr);
@@ -114,14 +108,6 @@ static int srp_target_is_topspin(struct srp_target_port *target)
 		 !memcmp(&target->ioc_guid, cisco_oui, sizeof cisco_oui));
 }
 
-static int srp_target_is_mellanox(struct srp_target_port *target)
-{
-	static const u8 mellanox_oui[3] = { 0x00, 0x02, 0xc9 };
-
-	return mellanox_workarounds &&
-		!memcmp(&target->ioc_guid, mellanox_oui, sizeof mellanox_oui);
-}
-
 static struct srp_iu *srp_alloc_iu(struct srp_host *host, size_t size,
 				   gfp_t gfp_mask,
 				   enum dma_data_direction direction)
@@ -662,8 +648,7 @@ static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat,
 	if (!dev->fmr_pool)
 		return -ENODEV;
 
-	if (srp_target_is_mellanox(target) &&
-	    (ib_sg_dma_address(ibdev, &scat[0]) & ~dev->fmr_page_mask))
+	if (ib_sg_dma_address(ibdev, &scat[0]) & ~dev->fmr_page_mask)
 		return -EINVAL;
 
 	len = page_cnt = 0;
--


* [RFC 2/8] IB/srp: move IB CM setup completion into its own function
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
  2011-01-19  4:27   ` [RFC 1/8] IB/srp: always avoid non-zero offsets into an FMR David Dillow
@ 2011-01-19  4:27   ` David Dillow
  2011-01-19  4:27   ` [RFC 3/8] IB/srp: allow sg_tablesize to be set for each target David Dillow
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

This is a cleanup in preparation for further changes.
---
 drivers/infiniband/ulp/srp/ib_srp.c |  144 ++++++++++++++++++-----------------
 1 files changed, 73 insertions(+), 71 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 197e26c..060e6a8 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1213,6 +1213,78 @@ err:
 	return -ENOMEM;
 }
 
+static void srp_cm_rep_handler(struct ib_cm_id *cm_id,
+			       struct srp_login_rsp *lrsp,
+			       struct srp_target_port *target)
+{
+	struct ib_qp_attr *qp_attr = NULL;
+	int attr_mask = 0;
+	int ret;
+	int i;
+
+	if (lrsp->opcode == SRP_LOGIN_RSP) {
+		target->max_ti_iu_len = be32_to_cpu(lrsp->max_ti_iu_len);
+		target->req_lim       = be32_to_cpu(lrsp->req_lim_delta);
+
+		/*
+		 * Reserve credits for task management so we don't
+		 * bounce requests back to the SCSI mid-layer.
+		 */
+		target->scsi_host->can_queue
+			= min(target->req_lim - SRP_TSK_MGMT_SQ_SIZE,
+			      target->scsi_host->can_queue);
+	} else {
+		shost_printk(KERN_WARNING, target->scsi_host,
+			     PFX "Unhandled RSP opcode %#x\n", lrsp->opcode);
+		ret = -ECONNRESET;
+		goto error;
+	}
+
+	if (!target->rx_ring[0]) {
+		ret = srp_alloc_iu_bufs(target);
+		if (ret)
+			goto error;
+	}
+
+	ret = -ENOMEM;
+	qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
+	if (!qp_attr)
+		goto error;
+
+	qp_attr->qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask);
+	if (ret)
+		goto error_free;
+
+	ret = ib_modify_qp(target->qp, qp_attr, attr_mask);
+	if (ret)
+		goto error_free;
+
+	for (i = 0; i < SRP_RQ_SIZE; i++) {
+		struct srp_iu *iu = target->rx_ring[i];
+		ret = srp_post_recv(target, iu);
+		if (ret)
+			goto error_free;
+	}
+
+	qp_attr->qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask);
+	if (ret)
+		goto error_free;
+
+	ret = ib_modify_qp(target->qp, qp_attr, attr_mask);
+	if (ret)
+		goto error_free;
+
+	ret = ib_send_cm_rtu(cm_id, NULL, 0);
+
+error_free:
+	kfree(qp_attr);
+
+error:
+	target->status = ret;
+}
+
 static void srp_cm_rej_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event,
 			       struct srp_target_port *target)
@@ -1296,11 +1368,7 @@ static void srp_cm_rej_handler(struct ib_cm_id *cm_id,
 static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 {
 	struct srp_target_port *target = cm_id->context;
-	struct ib_qp_attr *qp_attr = NULL;
-	int attr_mask = 0;
 	int comp = 0;
-	int opcode = 0;
-	int i;
 
 	switch (event->event) {
 	case IB_CM_REQ_ERROR:
@@ -1312,71 +1380,7 @@ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 
 	case IB_CM_REP_RECEIVED:
 		comp = 1;
-		opcode = *(u8 *) event->private_data;
-
-		if (opcode == SRP_LOGIN_RSP) {
-			struct srp_login_rsp *rsp = event->private_data;
-
-			target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len);
-			target->req_lim       = be32_to_cpu(rsp->req_lim_delta);
-
-			/*
-			 * Reserve credits for task management so we don't
-			 * bounce requests back to the SCSI mid-layer.
-			 */
-			target->scsi_host->can_queue
-				= min(target->req_lim - SRP_TSK_MGMT_SQ_SIZE,
-				      target->scsi_host->can_queue);
-		} else {
-			shost_printk(KERN_WARNING, target->scsi_host,
-				    PFX "Unhandled RSP opcode %#x\n", opcode);
-			target->status = -ECONNRESET;
-			break;
-		}
-
-		if (!target->rx_ring[0]) {
-			target->status = srp_alloc_iu_bufs(target);
-			if (target->status)
-				break;
-		}
-
-		qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
-		if (!qp_attr) {
-			target->status = -ENOMEM;
-			break;
-		}
-
-		qp_attr->qp_state = IB_QPS_RTR;
-		target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask);
-		if (target->status)
-			break;
-
-		target->status = ib_modify_qp(target->qp, qp_attr, attr_mask);
-		if (target->status)
-			break;
-
-		for (i = 0; i < SRP_RQ_SIZE; i++) {
-			struct srp_iu *iu = target->rx_ring[i];
-			target->status = srp_post_recv(target, iu);
-			if (target->status)
-				break;
-		}
-		if (target->status)
-			break;
-
-		qp_attr->qp_state = IB_QPS_RTS;
-		target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask);
-		if (target->status)
-			break;
-
-		target->status = ib_modify_qp(target->qp, qp_attr, attr_mask);
-		if (target->status)
-			break;
-
-		target->status = ib_send_cm_rtu(cm_id, NULL, 0);
-		if (target->status)
-			break;
-
+		srp_cm_rep_handler(cm_id, event->private_data, target);
 		break;
 
 	case IB_CM_REJ_RECEIVED:
@@ -1416,8 +1420,6 @@ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 	if (comp)
 		complete(&target->done);
 
-	kfree(qp_attr);
-
 	return 0;
 }
 
--


* [RFC 3/8] IB/srp: allow sg_tablesize to be set for each target
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
  2011-01-19  4:27   ` [RFC 1/8] IB/srp: always avoid non-zero offsets into an FMR David Dillow
  2011-01-19  4:27   ` [RFC 2/8] IB/srp: move IB CM setup completion into its own function David Dillow
@ 2011-01-19  4:27   ` David Dillow
  2011-01-19  4:27   ` [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries David Dillow
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

Different configurations of target software allow differing max sizes of
the command IU. Allowing this to be changed per-target allows all
targets on an initiator to get an optimal setting.

We deprecate srp_sg_tablesize and replace it with cmd_sg_entries in
preparation for allowing more indirect descriptors than can fit in the
IU.
---
As a further cleanup, topspin_workarounds could become a bool here as well.

 drivers/infiniband/ulp/srp/ib_srp.c |   81 ++++++++++++++++++++++++----------
 drivers/infiniband/ulp/srp/ib_srp.h |    2 +
 2 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 060e6a8..6f8ee0c 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -59,14 +59,16 @@ MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol initiator "
 		   "v" DRV_VERSION " (" DRV_RELDATE ")");
 MODULE_LICENSE("Dual BSD/GPL");
 
-static int srp_sg_tablesize = SRP_DEF_SG_TABLESIZE;
-static int srp_max_iu_len;
+static unsigned int srp_sg_tablesize;
+static unsigned int cmd_sg_entries;
+static int topspin_workarounds = 1;
 
-module_param(srp_sg_tablesize, int, 0444);
-MODULE_PARM_DESC(srp_sg_tablesize,
-		 "Max number of gather/scatter entries per I/O (default is 12, max 255)");
+module_param(srp_sg_tablesize, uint, 0444);
+MODULE_PARM_DESC(srp_sg_tablesize, "Deprecated name for cmd_sg_entries");
 
-static int topspin_workarounds = 1;
+module_param(cmd_sg_entries, uint, 0444);
+MODULE_PARM_DESC(cmd_sg_entries,
+		 "Default number of gather/scatter entries in the SRP command (default is 12, max 255)");
 
 module_param(topspin_workarounds, int, 0444);
 MODULE_PARM_DESC(topspin_workarounds,
@@ -364,7 +366,7 @@ static int srp_send_req(struct srp_target_port *target)
 
 	req->priv.opcode     	= SRP_LOGIN_REQ;
 	req->priv.tag        	= 0;
-	req->priv.req_it_iu_len = cpu_to_be32(srp_max_iu_len);
+	req->priv.req_it_iu_len = cpu_to_be32(target->max_iu_len);
 	req->priv.req_buf_fmt 	= cpu_to_be16(SRP_BUF_FORMAT_DIRECT |
 					      SRP_BUF_FORMAT_INDIRECT);
 	/*
@@ -1125,7 +1127,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 	spin_unlock_irqrestore(&target->lock, flags);
 
 	dev = target->srp_host->srp_dev->dev;
-	ib_dma_sync_single_for_cpu(dev, iu->dma, srp_max_iu_len,
+	ib_dma_sync_single_for_cpu(dev, iu->dma, target->max_iu_len,
 				   DMA_TO_DEVICE);
 
 	scmnd->result        = 0;
@@ -1149,7 +1151,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 		goto err_iu;
 	}
 
-	ib_dma_sync_single_for_device(dev, iu->dma, srp_max_iu_len,
+	ib_dma_sync_single_for_device(dev, iu->dma, target->max_iu_len,
 				      DMA_TO_DEVICE);
 
 	if (srp_post_send(target, iu, len)) {
@@ -1189,7 +1191,7 @@ static int srp_alloc_iu_bufs(struct srp_target_port *target)
 
 	for (i = 0; i < SRP_SQ_SIZE; ++i) {
 		target->tx_ring[i] = srp_alloc_iu(target->srp_host,
-						  srp_max_iu_len,
+						  target->max_iu_len,
 						  GFP_KERNEL, DMA_TO_DEVICE);
 		if (!target->tx_ring[i])
 			goto err;
@@ -1645,6 +1647,14 @@ static ssize_t show_local_ib_device(struct device *dev,
 	return sprintf(buf, "%s\n", target->srp_host->srp_dev->dev->name);
 }
 
+static ssize_t show_cmd_sg_entries(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct srp_target_port *target = host_to_target(class_to_shost(dev));
+
+	return sprintf(buf, "%u\n", target->cmd_sg_cnt);
+}
+
 static DEVICE_ATTR(id_ext,	    S_IRUGO, show_id_ext,	   NULL);
 static DEVICE_ATTR(ioc_guid,	    S_IRUGO, show_ioc_guid,	   NULL);
 static DEVICE_ATTR(service_id,	    S_IRUGO, show_service_id,	   NULL);
@@ -1655,6 +1665,7 @@ static DEVICE_ATTR(req_lim,         S_IRUGO, show_req_lim,         NULL);
 static DEVICE_ATTR(zero_req_lim,    S_IRUGO, show_zero_req_lim,	   NULL);
 static DEVICE_ATTR(local_ib_port,   S_IRUGO, show_local_ib_port,   NULL);
 static DEVICE_ATTR(local_ib_device, S_IRUGO, show_local_ib_device, NULL);
+static DEVICE_ATTR(cmd_sg_entries,  S_IRUGO, show_cmd_sg_entries,  NULL);
 
 static struct device_attribute *srp_host_attrs[] = {
 	&dev_attr_id_ext,
@@ -1667,6 +1678,7 @@ static struct device_attribute *srp_host_attrs[] = {
 	&dev_attr_zero_req_lim,
 	&dev_attr_local_ib_port,
 	&dev_attr_local_ib_device,
+	&dev_attr_cmd_sg_entries,
 	NULL
 };
 
@@ -1679,6 +1691,7 @@ static struct scsi_host_template srp_template = {
 	.eh_abort_handler		= srp_abort,
 	.eh_device_reset_handler	= srp_reset_device,
 	.eh_host_reset_handler		= srp_reset_host,
+	.sg_tablesize			= SRP_DEF_SG_TABLESIZE,
 	.can_queue			= SRP_CMD_SQ_SIZE,
 	.this_id			= -1,
 	.cmd_per_lun			= SRP_CMD_SQ_SIZE,
@@ -1750,6 +1763,7 @@ enum {
 	SRP_OPT_MAX_CMD_PER_LUN	= 1 << 6,
 	SRP_OPT_IO_CLASS	= 1 << 7,
 	SRP_OPT_INITIATOR_EXT	= 1 << 8,
+	SRP_OPT_CMD_SG_ENTRIES	= 1 << 9,
 	SRP_OPT_ALL		= (SRP_OPT_ID_EXT	|
 				   SRP_OPT_IOC_GUID	|
 				   SRP_OPT_DGID		|
@@ -1767,6 +1781,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_MAX_CMD_PER_LUN,	"max_cmd_per_lun=%d" 	},
 	{ SRP_OPT_IO_CLASS,		"io_class=%x"		},
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
+	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
 	{ SRP_OPT_ERR,			NULL 			}
 };
 
@@ -1894,6 +1909,14 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target)
 			kfree(p);
 			break;
 
+		case SRP_OPT_CMD_SG_ENTRIES:
+			if (match_int(args, &token) || token < 1 || token > 255) {
+				printk(KERN_WARNING PFX "bad max cmd_sg_entries parameter '%s'\n", p);
+				goto out;
+			}
+			target->cmd_sg_cnt = token;
+			break;
+
 		default:
 			printk(KERN_WARNING PFX "unknown parameter or missing value "
 			       "'%s' in target creation request\n", p);
@@ -1932,17 +1955,18 @@ static ssize_t srp_create_target(struct device *dev,
 	if (!target_host)
 		return -ENOMEM;
 
-	target_host->transportt = ib_srp_transport_template;
+	target_host->transportt  = ib_srp_transport_template;
 	target_host->max_lun     = SRP_MAX_LUN;
 	target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb;
 
 	target = host_to_target(target_host);
 
-	target->io_class   = SRP_REV16A_IB_IO_CLASS;
-	target->scsi_host  = target_host;
-	target->srp_host   = host;
-	target->lkey	   = host->srp_dev->mr->lkey;
-	target->rkey	   = host->srp_dev->mr->rkey;
+	target->io_class	= SRP_REV16A_IB_IO_CLASS;
+	target->scsi_host	= target_host;
+	target->srp_host	= host;
+	target->lkey		= host->srp_dev->mr->lkey;
+	target->rkey		= host->srp_dev->mr->rkey;
+	target->cmd_sg_cnt	= cmd_sg_entries;
 
 	spin_lock_init(&target->lock);
 	INIT_LIST_HEAD(&target->free_tx);
@@ -1956,6 +1980,11 @@ static ssize_t srp_create_target(struct device *dev,
 	if (ret)
 		goto err;
 
+	target_host->sg_tablesize = target->cmd_sg_cnt;
+	target->max_iu_len = sizeof (struct srp_cmd) +
+			     sizeof (struct srp_indirect_buf) +
+			     target->cmd_sg_cnt * sizeof (struct srp_direct_buf);
+
 	ib_query_gid(host->srp_dev->dev, host->port, 0, &target->path.sgid);
 
 	shost_printk(KERN_DEBUG, target->scsi_host, PFX
@@ -2217,9 +2246,18 @@ static int __init srp_init_module(void)
 
 	BUILD_BUG_ON(FIELD_SIZEOF(struct ib_wc, wr_id) < sizeof(void *));
 
-	if (srp_sg_tablesize > 255) {
-		printk(KERN_WARNING PFX "Clamping srp_sg_tablesize to 255\n");
-		srp_sg_tablesize = 255;
+	if (srp_sg_tablesize) {
+		printk(KERN_WARNING PFX "srp_sg_tablesize is deprecated, please use cmd_sg_entries\n");
+		if (!cmd_sg_entries)
+			cmd_sg_entries = srp_sg_tablesize;
+	}
+
+	if (!cmd_sg_entries)
+		cmd_sg_entries = SRP_DEF_SG_TABLESIZE;
+
+	if (cmd_sg_entries > 255) {
+		printk(KERN_WARNING PFX "Clamping cmd_sg_entries to 255\n");
+		cmd_sg_entries = 255;
 	}
 
 	ib_srp_transport_template =
@@ -2227,11 +2265,6 @@ static int __init srp_init_module(void)
 	if (!ib_srp_transport_template)
 		return -ENOMEM;
 
-	srp_template.sg_tablesize = srp_sg_tablesize;
-	srp_max_iu_len = (sizeof (struct srp_cmd) +
-			  sizeof (struct srp_indirect_buf) +
-			  srp_sg_tablesize * 16);
-
 	ret = class_register(&srp_class);
 	if (ret) {
 		printk(KERN_ERR PFX "couldn't register class infiniband_srp\n");
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 9dc6fc3..db39dbf 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -130,6 +130,8 @@ struct srp_target_port {
 	u32			lkey;
 	u32			rkey;
 	enum srp_target_state	state;
+	unsigned int		max_iu_len;
+	unsigned int		cmd_sg_cnt;
 
 	/* Everything above this point is used in the hot path of
 	 * command processing. Try to keep them packed into cachelines.
--


* [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (2 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 3/8] IB/srp: allow sg_tablesize to be set for each target David Dillow
@ 2011-01-19  4:27   ` David Dillow
       [not found]     ` <1295411242-26148-5-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
  2011-01-19  4:27   ` [RFC 5/8] IB/srp: add safety valve for large SG tables without HW support David Dillow
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

Instead of forcing all of the S/G entries to fit in one FMR, and falling
back to indirect descriptors if that fails, allow the use of as many
FMRs as needed to map the request. This lays the groundwork for allowing
indirect descriptor tables that are larger than can fit in the command
IU, but should marginally improve performance now by reducing the number
of indirect descriptors needed.

We raise the minimum page size for the FMR pool, as it is rare for the
kernel to send down requests with scattered 512-byte fragments, and no
HCA supports such a small FMR mapping in any case.

This patch also moves some of the target initialization code after the
parsing of options, to keep it together with the new code that needs to
allocate memory based on the given options.
---
 drivers/infiniband/ulp/srp/ib_srp.c |  367 +++++++++++++++++++++++------------
 drivers/infiniband/ulp/srp/ib_srp.h |   28 +++-
 2 files changed, 266 insertions(+), 129 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 6f8ee0c..9ce129a 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -444,6 +444,17 @@ static bool srp_change_state(struct srp_target_port *target,
 	return changed;
 }
 
+static void srp_free_req_data(struct srp_target_port *target)
+{
+	struct srp_request *req;
+	int i;
+
+	for (i = 0, req = target->req_ring; i < SRP_CMD_SQ_SIZE; ++i, ++req) {
+		kfree(req->fmr_list);
+		kfree(req->map_page);
+	}
+}
+
 static void srp_remove_work(struct work_struct *work)
 {
 	struct srp_target_port *target =
@@ -460,6 +471,7 @@ static void srp_remove_work(struct work_struct *work)
 	scsi_remove_host(target->scsi_host);
 	ib_destroy_cm_id(target->cm_id);
 	srp_free_target_ib(target);
+	srp_free_req_data(target);
 	scsi_host_put(target->scsi_host);
 }
 
@@ -523,18 +535,20 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
 			   struct srp_target_port *target,
 			   struct srp_request *req)
 {
+	struct ib_device *ibdev = target->srp_host->srp_dev->dev;
+	struct ib_pool_fmr **pfmr;
+
 	if (!scsi_sglist(scmnd) ||
 	    (scmnd->sc_data_direction != DMA_TO_DEVICE &&
 	     scmnd->sc_data_direction != DMA_FROM_DEVICE))
 		return;
 
-	if (req->fmr) {
-		ib_fmr_pool_unmap(req->fmr);
-		req->fmr = NULL;
-	}
+	pfmr = req->fmr_list;
+	while (req->nfmr--)
+		ib_fmr_pool_unmap(*pfmr++);
 
-	ib_dma_unmap_sg(target->srp_host->srp_dev->dev, scsi_sglist(scmnd),
-			scsi_sg_count(scmnd), scmnd->sc_data_direction);
+	ib_dma_unmap_sg(ibdev, scsi_sglist(scmnd), scsi_sg_count(scmnd),
+			scmnd->sc_data_direction);
 }
 
 static void srp_remove_req(struct srp_target_port *target,
@@ -633,95 +647,152 @@ err:
 	return ret;
 }
 
-static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat,
-		       int sg_cnt, struct srp_request *req,
-		       struct srp_direct_buf *buf)
+static void srp_map_desc(struct srp_map_state *state, dma_addr_t dma_addr,
+			 unsigned int dma_len, u32 rkey)
 {
-	u64 io_addr = 0;
-	u64 *dma_pages;
-	u32 len;
-	int page_cnt;
-	int i, j;
-	int ret;
-	struct srp_device *dev = target->srp_host->srp_dev;
-	struct ib_device *ibdev = dev->dev;
-	struct scatterlist *sg;
+	struct srp_direct_buf *desc = state->desc;
 
-	if (!dev->fmr_pool)
-		return -ENODEV;
+	desc->va = cpu_to_be64(dma_addr);
+	desc->key = cpu_to_be32(rkey);
+	desc->len = cpu_to_be32(dma_len);
 
-	if (ib_sg_dma_address(ibdev, &scat[0]) & ~dev->fmr_page_mask)
-		return -EINVAL;
+	state->total_len += dma_len;
+	state->desc++;
+	state->ndesc++;
+}
 
-	len = page_cnt = 0;
-	scsi_for_each_sg(req->scmnd, sg, sg_cnt, i) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
+static int srp_map_finish_fmr(struct srp_map_state *state,
+			      struct srp_target_port *target)
+{
+	struct srp_device *dev = target->srp_host->srp_dev;
+	struct ib_pool_fmr *fmr;
+	u64 io_addr = 0;
 
-		if (ib_sg_dma_address(ibdev, sg) & ~dev->fmr_page_mask) {
-			if (i > 0)
-				return -EINVAL;
-			else
-				++page_cnt;
-		}
-		if ((ib_sg_dma_address(ibdev, sg) + dma_len) &
-		    ~dev->fmr_page_mask) {
-			if (i < sg_cnt - 1)
-				return -EINVAL;
-			else
-				++page_cnt;
-		}
+	if (!state->npages)
+		return 0;
 
-		len += dma_len;
+	if (state->npages == 1) {
+		srp_map_desc(state, state->base_dma_addr, state->fmr_len,
+			     target->rkey);
+		state->npages = state->fmr_len = 0;
+		return 0;
 	}
 
-	page_cnt += len >> dev->fmr_page_shift;
-	if (page_cnt > SRP_FMR_SIZE)
-		return -ENOMEM;
+	fmr = ib_fmr_pool_map_phys(dev->fmr_pool, state->pages,
+				   state->npages, io_addr);
+	if (IS_ERR(fmr))
+		return PTR_ERR(fmr);
 
-	dma_pages = kmalloc(sizeof (u64) * page_cnt, GFP_ATOMIC);
-	if (!dma_pages)
-		return -ENOMEM;
+	*state->next_fmr++ = fmr;
+	state->nfmr++;
 
-	page_cnt = 0;
-	scsi_for_each_sg(req->scmnd, sg, sg_cnt, i) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
+	srp_map_desc(state, 0, state->fmr_len, fmr->fmr->rkey);
+	state->npages = state->fmr_len = 0;
+	return 0;
+}
+
+static void srp_map_update_start(struct srp_map_state *state,
+				 struct scatterlist *sg, int sg_index,
+				 dma_addr_t dma_addr)
+{
+	state->unmapped_sg = sg;
+	state->unmapped_index = sg_index;
+	state->unmapped_addr = dma_addr;
+}
 
-		for (j = 0; j < dma_len; j += dev->fmr_page_size)
-			dma_pages[page_cnt++] =
-				(ib_sg_dma_address(ibdev, sg) &
-				 dev->fmr_page_mask) + j;
+static int srp_map_sg_entry(struct srp_map_state *state,
+			    struct srp_target_port *target,
+			    struct scatterlist *sg, int sg_index,
+			    int use_fmr)
+{
+	struct srp_device *dev = target->srp_host->srp_dev;
+	struct ib_device *ibdev = dev->dev;
+	dma_addr_t dma_addr = ib_sg_dma_address(ibdev, sg);
+	unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
+	unsigned int len;
+	int ret;
+
+	if (!dma_len)
+		return 0;
+
+	if (use_fmr == SRP_MAP_NO_FMR) {
+		/* Once we're in direct map mode for a request, we don't
+		 * go back to FMR mode, so no need to update anything
+		 * other than the descriptor.
+		 */
+		srp_map_desc(state, dma_addr, dma_len, target->rkey);
+		return 0;
 	}
 
-	req->fmr = ib_fmr_pool_map_phys(dev->fmr_pool,
-					dma_pages, page_cnt, io_addr);
-	if (IS_ERR(req->fmr)) {
-		ret = PTR_ERR(req->fmr);
-		req->fmr = NULL;
-		goto out;
+	/* If we start at an offset into the FMR page, don't merge into
+	 * the current FMR. Finish it out, and use the kernel's MR for this
+	 * sg entry. This is to avoid potential bugs on some SRP targets
+	 * that were never quite defined, but went away when the initiator
+	 * avoided using FMR on such page fragments.
+	 */
+	if (dma_addr & ~dev->fmr_page_mask || dma_len > dev->fmr_max_size) {
+		ret = srp_map_finish_fmr(state, target);
+		if (ret)
+			return ret;
+
+		srp_map_desc(state, dma_addr, dma_len, target->rkey);
+		srp_map_update_start(state, NULL, 0, 0);
+		return 0;
 	}
 
-	buf->va  = cpu_to_be64(ib_sg_dma_address(ibdev, &scat[0]) &
-			       ~dev->fmr_page_mask);
-	buf->key = cpu_to_be32(req->fmr->fmr->rkey);
-	buf->len = cpu_to_be32(len);
+	/* If this is the first sg to go into the FMR, save our position.
+	 * We need to know the first unmapped entry, its index, and the
+	 * first unmapped address within that entry to be able to restart
+	 * mapping after an error.
+	 */
+	if (!state->unmapped_sg)
+		srp_map_update_start(state, sg, sg_index, dma_addr);
 
-	ret = 0;
+	while (dma_len) {
+		if (state->npages == SRP_FMR_SIZE) {
+			ret = srp_map_finish_fmr(state, target);
+			if (ret)
+				return ret;
 
-out:
-	kfree(dma_pages);
+			srp_map_update_start(state, sg, sg_index, dma_addr);
+		}
+
+		len = min_t(unsigned int, dma_len, dev->fmr_page_size);
 
+		if (!state->npages)
+			state->base_dma_addr = dma_addr;
+		state->pages[state->npages++] = dma_addr;
+		state->fmr_len += len;
+		dma_addr += len;
+		dma_len -= len;
+	}
+
+	/* If the last entry of the FMR wasn't a full page, then we need to
+	 * close it out and start a new one -- we can only merge at page
+	 * boundaries.
+	 */
+	ret = 0;
+	if (len != dev->fmr_page_size) {
+		ret = srp_map_finish_fmr(state, target);
+		if (!ret)
+			srp_map_update_start(state, NULL, 0, 0);
+	}
 	return ret;
 }
 
 static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 			struct srp_request *req)
 {
-	struct scatterlist *scat;
+	struct scatterlist *scat, *sg;
 	struct srp_cmd *cmd = req->cmd->buf;
-	int len, nents, count;
-	u8 fmt = SRP_DATA_DESC_DIRECT;
+	int i, len, nents, count, use_fmr;
 	struct srp_device *dev;
 	struct ib_device *ibdev;
+	struct srp_map_state state;
+	struct srp_indirect_buf *indirect_hdr;
+	dma_addr_t indirect_addr;
+	u32 table_len;
+	u8 fmt;
 
 	if (!scsi_sglist(scmnd) || scmnd->sc_data_direction == DMA_NONE)
 		return sizeof (struct srp_cmd);
@@ -741,6 +812,8 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 	ibdev = dev->dev;
 
 	count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction);
+	if (unlikely(count == 0))
+		return -EIO;
 
 	fmt = SRP_DATA_DESC_DIRECT;
 	len = sizeof (struct srp_cmd) +	sizeof (struct srp_direct_buf);
@@ -757,49 +830,80 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 		buf->va  = cpu_to_be64(ib_sg_dma_address(ibdev, scat));
 		buf->key = cpu_to_be32(target->rkey);
 		buf->len = cpu_to_be32(ib_sg_dma_len(ibdev, scat));
-	} else if (srp_map_fmr(target, scat, count, req,
-			       (void *) cmd->add_data)) {
-		/*
-		 * FMR mapping failed, and the scatterlist has more
-		 * than one entry.  Generate an indirect memory
-		 * descriptor.
-		 */
-		struct srp_indirect_buf *buf = (void *) cmd->add_data;
-		struct scatterlist *sg;
-		u32 datalen = 0;
-		int i;
-
-		fmt = SRP_DATA_DESC_INDIRECT;
-		len = sizeof (struct srp_cmd) +
-			sizeof (struct srp_indirect_buf) +
-			count * sizeof (struct srp_direct_buf);
-
-		scsi_for_each_sg(scmnd, sg, count, i) {
-			unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
-			buf->desc_list[i].va  =
-				cpu_to_be64(ib_sg_dma_address(ibdev, sg));
-			buf->desc_list[i].key =
-				cpu_to_be32(target->rkey);
-			buf->desc_list[i].len = cpu_to_be32(dma_len);
-			datalen += dma_len;
+
+		req->nfmr = 0;
+		goto map_complete;
+	}
+
+	/* We have more than one scatter/gather entry, so build our indirect
+	 * descriptor table, trying to merge as many entries with FMR as we
+	 * can.
+	 */
+	indirect_hdr = (void *) cmd->add_data;
+
+	memset(&state, 0, sizeof(state));
+	state.desc	= indirect_hdr->desc_list;
+	state.pages	= req->map_page;
+	state.next_fmr	= req->fmr_list;
+
+	use_fmr = dev->fmr_pool ? SRP_MAP_ALLOW_FMR : SRP_MAP_NO_FMR;
+
+	for_each_sg(scat, sg, count, i) {
+		if (srp_map_sg_entry(&state, target, sg, i, use_fmr)) {
+			/* FMR mapping failed, so backtrack to the first
+			 * unmapped entry and continue on without using FMR.
+			 */
+			dma_addr_t dma_addr;
+			unsigned int dma_len;
+
+backtrack:
+			sg = state.unmapped_sg;
+			i = state.unmapped_index;
+
+			dma_addr = ib_sg_dma_address(ibdev, sg);
+			dma_len = ib_sg_dma_len(ibdev, sg);
+			dma_len -= (state.unmapped_addr - dma_addr);
+			dma_addr = state.unmapped_addr;
+			use_fmr = SRP_MAP_NO_FMR;
+			srp_map_desc(&state, dma_addr, dma_len, target->rkey);
 		}
+	}
 
-		if (scmnd->sc_data_direction == DMA_TO_DEVICE)
-			cmd->data_out_desc_cnt = count;
-		else
-			cmd->data_in_desc_cnt = count;
+	if (use_fmr == SRP_MAP_ALLOW_FMR && srp_map_finish_fmr(&state, target))
+		goto backtrack;
 
-		buf->table_desc.va  =
-			cpu_to_be64(req->cmd->dma + sizeof *cmd + sizeof *buf);
-		buf->table_desc.key =
-			cpu_to_be32(target->rkey);
-		buf->table_desc.len =
-			cpu_to_be32(count * sizeof (struct srp_direct_buf));
+	/* We've mapped the request, fill in the command buffer.
+	 */
+	req->nfmr = state.nfmr;
+	if (state.ndesc == 1) {
+		/* FMR mapping was able to collapse this to one entry,
+		 * so use a direct descriptor.
+		 */
+		struct srp_direct_buf *buf = (void *) cmd->add_data;
 
-		buf->len = cpu_to_be32(datalen);
+		*buf = indirect_hdr->desc_list[0];
+		goto map_complete;
 	}
 
+	table_len = state.ndesc * sizeof (struct srp_direct_buf);
+
+	fmt = SRP_DATA_DESC_INDIRECT;
+	len = sizeof(struct srp_cmd) + sizeof (struct srp_indirect_buf);
+	len += table_len;
+
+	indirect_addr = req->cmd->dma + sizeof *cmd + sizeof *indirect_hdr;
+
+	indirect_hdr->table_desc.va = cpu_to_be64(indirect_addr);
+	indirect_hdr->table_desc.key = cpu_to_be32(target->rkey);
+	indirect_hdr->table_desc.len = cpu_to_be32(table_len);
+	indirect_hdr->len = cpu_to_be32(state.total_len);
+
+	if (scmnd->sc_data_direction == DMA_TO_DEVICE)
+		cmd->data_out_desc_cnt = state.ndesc;
+	else
+		cmd->data_in_desc_cnt = state.ndesc;
+
+map_complete:
 	if (scmnd->sc_data_direction == DMA_TO_DEVICE)
 		cmd->buf_fmt = fmt << 4;
 	else
@@ -1947,8 +2051,7 @@ static ssize_t srp_create_target(struct device *dev,
 		container_of(dev, struct srp_host, dev);
 	struct Scsi_Host *target_host;
 	struct srp_target_port *target;
-	int ret;
-	int i;
+	int i, ret;
 
 	target_host = scsi_host_alloc(&srp_template,
 				      sizeof (struct srp_target_port));
@@ -1968,14 +2071,6 @@ static ssize_t srp_create_target(struct device *dev,
 	target->rkey		= host->srp_dev->mr->rkey;
 	target->cmd_sg_cnt	= cmd_sg_entries;
 
-	spin_lock_init(&target->lock);
-	INIT_LIST_HEAD(&target->free_tx);
-	INIT_LIST_HEAD(&target->free_reqs);
-	for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
-		target->req_ring[i].index = i;
-		list_add_tail(&target->req_ring[i].list, &target->free_reqs);
-	}
-
 	ret = srp_parse_options(buf, target);
 	if (ret)
 		goto err;
@@ -1985,6 +2080,23 @@ static ssize_t srp_create_target(struct device *dev,
 			     sizeof (struct srp_indirect_buf) +
 			     target->cmd_sg_cnt * sizeof (struct srp_direct_buf);
 
+	spin_lock_init(&target->lock);
+	INIT_LIST_HEAD(&target->free_tx);
+	INIT_LIST_HEAD(&target->free_reqs);
+	for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
+		struct srp_request *req = &target->req_ring[i];
+
+		req->fmr_list = kmalloc(target->cmd_sg_cnt * sizeof (void *),
+					GFP_KERNEL);
+		req->map_page = kmalloc(SRP_FMR_SIZE * sizeof (void *),
+					GFP_KERNEL);
+		if (!req->fmr_list || !req->map_page)
+			goto err_free_mem;
+
+		req->index = i;
+		list_add_tail(&req->list, &target->free_reqs);
+	}
+
 	ib_query_gid(host->srp_dev->dev, host->port, 0, &target->path.sgid);
 
 	shost_printk(KERN_DEBUG, target->scsi_host, PFX
@@ -1998,11 +2110,11 @@ static ssize_t srp_create_target(struct device *dev,
 
 	ret = srp_create_target_ib(target);
 	if (ret)
-		goto err;
+		goto err_free_mem;
 
 	ret = srp_new_cm_id(target);
 	if (ret)
-		goto err_free;
+		goto err_free_ib;
 
 	target->qp_in_error = 0;
 	ret = srp_connect_target(target);
@@ -2024,9 +2136,12 @@ err_disconnect:
 err_cm_id:
 	ib_destroy_cm_id(target->cm_id);
 
-err_free:
+err_free_ib:
 	srp_free_target_ib(target);
 
+err_free_mem:
+	srp_free_req_data(target);
+
 err:
 	scsi_host_put(target_host);
 
@@ -2099,7 +2214,7 @@ static void srp_add_one(struct ib_device *device)
 	struct ib_device_attr *dev_attr;
 	struct ib_fmr_pool_param fmr_param;
 	struct srp_host *host;
-	int s, e, p;
+	int fmr_page_shift, s, e, p;
 
 	dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
 	if (!dev_attr)
@@ -2117,12 +2232,13 @@ static void srp_add_one(struct ib_device *device)
 
 	/*
 	 * Use the smallest page size supported by the HCA, down to a
-	 * minimum of 512 bytes (which is the smallest sector that a
-	 * SCSI command will ever carry).
+	 * minimum of 4096 bytes. We're unlikely to build large sglists
+	 * out of smaller entries.
 	 */
-	srp_dev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1);
-	srp_dev->fmr_page_size  = 1 << srp_dev->fmr_page_shift;
-	srp_dev->fmr_page_mask  = ~((u64) srp_dev->fmr_page_size - 1);
+	fmr_page_shift		= max(12, ffs(dev_attr->page_size_cap) - 1);
+	srp_dev->fmr_page_size	= 1 << fmr_page_shift;
+	srp_dev->fmr_page_mask	= ~((u64) srp_dev->fmr_page_size - 1);
+	srp_dev->fmr_max_size	= srp_dev->fmr_page_size * SRP_FMR_SIZE;
 
 	INIT_LIST_HEAD(&srp_dev->dev_list);
 
@@ -2143,7 +2259,7 @@ static void srp_add_one(struct ib_device *device)
 	fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
 	fmr_param.cache		    = 1;
 	fmr_param.max_pages_per_fmr = SRP_FMR_SIZE;
-	fmr_param.page_shift	    = srp_dev->fmr_page_shift;
+	fmr_param.page_shift	    = fmr_page_shift;
 	fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
 				       IB_ACCESS_REMOTE_WRITE |
 				       IB_ACCESS_REMOTE_READ);
@@ -2223,6 +2339,7 @@ static void srp_remove_one(struct ib_device *device)
 			srp_disconnect_target(target);
 			ib_destroy_cm_id(target->cm_id);
 			srp_free_target_ib(target);
+			srp_free_req_data(target);
 			scsi_host_put(target->scsi_host);
 		}
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index db39dbf..b43b5e7 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -71,7 +71,10 @@ enum {
 
 	SRP_FMR_SIZE		= 256,
 	SRP_FMR_POOL_SIZE	= 1024,
-	SRP_FMR_DIRTY_SIZE	= SRP_FMR_POOL_SIZE / 4
+	SRP_FMR_DIRTY_SIZE	= SRP_FMR_POOL_SIZE / 4,
+
+	SRP_MAP_ALLOW_FMR	= 0,
+	SRP_MAP_NO_FMR		= 1,
 };
 
 enum srp_target_state {
@@ -93,9 +96,9 @@ struct srp_device {
 	struct ib_pd	       *pd;
 	struct ib_mr	       *mr;
 	struct ib_fmr_pool     *fmr_pool;
-	int			fmr_page_shift;
-	int			fmr_page_size;
 	u64			fmr_page_mask;
+	int			fmr_page_size;
+	int			fmr_max_size;
 };
 
 struct srp_host {
@@ -112,7 +115,9 @@ struct srp_request {
 	struct list_head	list;
 	struct scsi_cmnd       *scmnd;
 	struct srp_iu	       *cmd;
-	struct ib_pool_fmr     *fmr;
+	struct ib_pool_fmr    **fmr_list;
+	u64		       *map_page;
+	short			nfmr;
 	short			index;
 };
 
@@ -181,4 +186,19 @@ struct srp_iu {
 	enum dma_data_direction	direction;
 };
 
+struct srp_map_state {
+	struct ib_pool_fmr    **next_fmr;
+	struct srp_direct_buf  *desc;
+	u64		       *pages;
+	dma_addr_t		base_dma_addr;
+	u32			fmr_len;
+	u32			total_len;
+	unsigned int		npages;
+	unsigned int		nfmr;
+	unsigned int		ndesc;
+	struct scatterlist     *unmapped_sg;
+	int			unmapped_index;
+	dma_addr_t		unmapped_addr;
+};
+
 #endif /* IB_SRP_H */
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 5/8] IB/srp: add safety valve for large SG tables without HW support
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (3 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries David Dillow
@ 2011-01-19  4:27   ` David Dillow
  2011-01-19  4:27   ` [RFC 6/8] IB/srp: add support for indirect tables that don't fit in SRP_CMD David Dillow
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Most targets don't support indirect tables that do not fit in the
command, but FMR failures are exceedingly rare. Allow these targets to
reap the benefits of the large tables but fail in a manner that lets the
user know that the data didn't make it there.

This could/should be merged with the next patch, but it is broken out
separately for discussion. I'm not sure failing with HOST_BUSY is the
right thing to do -- what if there is a persistent FMR failure? panic()
seems a bit too strong, but there is no way to recover unless FMR starts
working again. Hence the danger in not having full target support for
the spec.
---
 drivers/infiniband/ulp/srp/ib_srp.c |   33 +++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/srp/ib_srp.h |    1 +
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 9ce129a..df4c3b9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -61,6 +61,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 
 static unsigned int srp_sg_tablesize;
 static unsigned int cmd_sg_entries;
+static bool allow_ext_sg;
 static int topspin_workarounds = 1;
 
 module_param(srp_sg_tablesize, uint, 0444);
@@ -70,6 +71,10 @@ module_param(cmd_sg_entries, uint, 0444);
 MODULE_PARM_DESC(cmd_sg_entries,
 		 "Default number of gather/scatter entries in the SRP command (default is 12, max 255)");
 
+module_param(allow_ext_sg, bool, 0444);
+MODULE_PARM_DESC(allow_ext_sg,
+		  "Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false)");
+
 module_param(topspin_workarounds, int, 0444);
 MODULE_PARM_DESC(topspin_workarounds,
 		 "Enable workarounds for Topspin/Cisco SRP target bugs if != 0");
@@ -885,6 +890,13 @@ backtrack:
 		goto map_complete;
 	}
 
+	if (unlikely(target->cmd_sg_cnt < state.ndesc &&
+						!target->allow_ext_sg)) {
+		shost_printk(KERN_ERR, target->scsi_host,
+			     "Could not fit S/G list into SRP_CMD\n");
+		return -EIO;
+	}
+
 	table_len = state.ndesc * sizeof (struct srp_direct_buf);
 
 	fmt = SRP_DATA_DESC_INDIRECT;
@@ -1759,6 +1771,14 @@ static ssize_t show_cmd_sg_entries(struct device *dev,
 	return sprintf(buf, "%u\n", target->cmd_sg_cnt);
 }
 
+static ssize_t show_allow_ext_sg(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct srp_target_port *target = host_to_target(class_to_shost(dev));
+
+	return sprintf(buf, "%s\n", target->allow_ext_sg ? "true" : "false");
+}
+
 static DEVICE_ATTR(id_ext,	    S_IRUGO, show_id_ext,	   NULL);
 static DEVICE_ATTR(ioc_guid,	    S_IRUGO, show_ioc_guid,	   NULL);
 static DEVICE_ATTR(service_id,	    S_IRUGO, show_service_id,	   NULL);
@@ -1770,6 +1790,7 @@ static DEVICE_ATTR(zero_req_lim,    S_IRUGO, show_zero_req_lim,	   NULL);
 static DEVICE_ATTR(local_ib_port,   S_IRUGO, show_local_ib_port,   NULL);
 static DEVICE_ATTR(local_ib_device, S_IRUGO, show_local_ib_device, NULL);
 static DEVICE_ATTR(cmd_sg_entries,  S_IRUGO, show_cmd_sg_entries,  NULL);
+static DEVICE_ATTR(allow_ext_sg,    S_IRUGO, show_allow_ext_sg,    NULL);
 
 static struct device_attribute *srp_host_attrs[] = {
 	&dev_attr_id_ext,
@@ -1783,6 +1804,7 @@ static struct device_attribute *srp_host_attrs[] = {
 	&dev_attr_local_ib_port,
 	&dev_attr_local_ib_device,
 	&dev_attr_cmd_sg_entries,
+	&dev_attr_allow_ext_sg,
 	NULL
 };
 
@@ -1868,6 +1890,7 @@ enum {
 	SRP_OPT_IO_CLASS	= 1 << 7,
 	SRP_OPT_INITIATOR_EXT	= 1 << 8,
 	SRP_OPT_CMD_SG_ENTRIES	= 1 << 9,
+	SRP_OPT_ALLOW_EXT_SG	= 1 << 10,
 	SRP_OPT_ALL		= (SRP_OPT_ID_EXT	|
 				   SRP_OPT_IOC_GUID	|
 				   SRP_OPT_DGID		|
@@ -1886,6 +1909,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_IO_CLASS,		"io_class=%x"		},
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
+	{ SRP_OPT_ALLOW_EXT_SG,		"allow_ext_sg=%u"	},
 	{ SRP_OPT_ERR,			NULL 			}
 };
 
@@ -2021,6 +2045,14 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target)
 			target->cmd_sg_cnt = token;
 			break;
 
+		case SRP_OPT_ALLOW_EXT_SG:
+			if (match_int(args, &token)) {
+				printk(KERN_WARNING PFX "bad allow_ext_sg parameter '%s'\n", p);
+				goto out;
+			}
+			target->allow_ext_sg = !!token;
+			break;
+
 		default:
 			printk(KERN_WARNING PFX "unknown parameter or missing value "
 			       "'%s' in target creation request\n", p);
@@ -2070,6 +2102,7 @@ static ssize_t srp_create_target(struct device *dev,
 	target->lkey		= host->srp_dev->mr->lkey;
 	target->rkey		= host->srp_dev->mr->rkey;
 	target->cmd_sg_cnt	= cmd_sg_entries;
+	target->allow_ext_sg	= allow_ext_sg;
 
 	ret = srp_parse_options(buf, target);
 	if (ret)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index b43b5e7..dd7c9fe 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -137,6 +137,7 @@ struct srp_target_port {
 	enum srp_target_state	state;
 	unsigned int		max_iu_len;
 	unsigned int		cmd_sg_cnt;
+	bool			allow_ext_sg;
 
 	/* Everything above this point is used in the hot path of
 	 * command processing. Try to keep them packed into cachelines.

* [RFC 6/8] IB/srp: add support for indirect tables that don't fit in SRP_CMD
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (4 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 5/8] IB/srp: add safety valve for large SG tables without HW support David Dillow
@ 2011-01-19  4:27   ` David Dillow
  2011-01-19  4:27   ` [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings David Dillow
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

This allows us to guarantee the ability to submit up to 8 MB requests
based on the current value of SCSI_MAX_SG_CHAIN_SEGMENTS. While FMR will
usually condense the requests into 8 SG entries, it is imperative that
the target support external tables in case the FMR mapping fails or is
not supported.

If indirect_sg_entries is not specified in the modules options, then
the sg_tablesize for the target will default to cmd_sg_entries unless
overridden by the target options.
---
 drivers/infiniband/ulp/srp/ib_srp.c |   84 ++++++++++++++++++++++++++++++-----
 drivers/infiniband/ulp/srp/ib_srp.h |    4 ++
 2 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index df4c3b9..0121530 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -61,6 +61,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 
 static unsigned int srp_sg_tablesize;
 static unsigned int cmd_sg_entries;
+static unsigned int indirect_sg_entries;
 static bool allow_ext_sg;
 static int topspin_workarounds = 1;
 
@@ -71,6 +72,10 @@ module_param(cmd_sg_entries, uint, 0444);
 MODULE_PARM_DESC(cmd_sg_entries,
 		 "Default number of gather/scatter entries in the SRP command (default is 12, max 255)");
 
+module_param(indirect_sg_entries, uint, 0444);
+MODULE_PARM_DESC(indirect_sg_entries,
+		 "Default max number of gather/scatter entries (default is 12, max is " __stringify(SCSI_MAX_SG_CHAIN_SEGMENTS) ")");
+
 module_param(allow_ext_sg, bool, 0444);
 MODULE_PARM_DESC(allow_ext_sg,
 		  "Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false)");
@@ -451,12 +456,19 @@ static bool srp_change_state(struct srp_target_port *target,
 
 static void srp_free_req_data(struct srp_target_port *target)
 {
+	struct ib_device *ibdev = target->srp_host->srp_dev->dev;
 	struct srp_request *req;
 	int i;
 
 	for (i = 0, req = target->req_ring; i < SRP_CMD_SQ_SIZE; ++i, ++req) {
 		kfree(req->fmr_list);
 		kfree(req->map_page);
+		if (req->indirect_dma_addr) {
+			ib_dma_unmap_single(ibdev, req->indirect_dma_addr,
+					    target->indirect_size,
+					    DMA_TO_DEVICE);
+		}
+		kfree(req->indirect_desc);
 	}
 }
 
@@ -795,7 +807,6 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 	struct ib_device *ibdev;
 	struct srp_map_state state;
 	struct srp_indirect_buf *indirect_hdr;
-	dma_addr_t indirect_addr;
 	u32 table_len;
 	u8 fmt;
 
@@ -846,8 +857,11 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 	 */
 	indirect_hdr = (void *) cmd->add_data;
 
+	ib_dma_sync_single_for_cpu(ibdev, req->indirect_dma_addr,
+				   target->indirect_size, DMA_TO_DEVICE);
+
 	memset(&state, 0, sizeof(state));
-	state.desc	= indirect_hdr->desc_list;
+	state.desc	= req->indirect_desc;
 	state.pages	= req->map_page;
 	state.next_fmr	= req->fmr_list;
 
@@ -877,7 +891,11 @@ backtrack:
 	if (use_fmr == SRP_MAP_ALLOW_FMR && srp_map_finish_fmr(&state, target))
 		goto backtrack;
 
-	/* We've mapped the request, fill in the command buffer.
+	/* We've mapped the request, now pull as much of the indirect
+	 * descriptor table as we can into the command buffer. If this
+	 * target is not using an external indirect table, we are
+	 * guaranteed to fit into the command, as the SCSI layer won't
+	 * give us more S/G entries than we allow.
 	 */
 	req->nfmr = state.nfmr;
 	if (state.ndesc == 1) {
@@ -886,7 +904,7 @@ backtrack:
 		 */
 		struct srp_direct_buf *buf = (void *) cmd->add_data;
 
-		*buf = indirect_hdr->desc_list[0];
+		*buf = req->indirect_desc[0];
 		goto map_complete;
 	}
 
@@ -897,23 +915,28 @@ backtrack:
 		return -EIO;
 	}
 
+	count = min(state.ndesc, target->cmd_sg_cnt);
 	table_len = state.ndesc * sizeof (struct srp_direct_buf);
 
 	fmt = SRP_DATA_DESC_INDIRECT;
 	len = sizeof(struct srp_cmd) + sizeof (struct srp_indirect_buf);
-	len += table_len;
+	len += count * sizeof (struct srp_direct_buf);
 
-	indirect_addr = req->cmd->dma + sizeof *cmd + sizeof *indirect_hdr;
+	memcpy(indirect_hdr->desc_list, req->indirect_desc,
+	       count * sizeof (struct srp_direct_buf));
 
-	indirect_hdr->table_desc.va = cpu_to_be64(indirect_addr);
+	indirect_hdr->table_desc.va = cpu_to_be64(req->indirect_dma_addr);
 	indirect_hdr->table_desc.key = cpu_to_be32(target->rkey);
 	indirect_hdr->table_desc.len = cpu_to_be32(table_len);
 	indirect_hdr->len = cpu_to_be32(state.total_len);
 
 	if (scmnd->sc_data_direction == DMA_TO_DEVICE)
-		cmd->data_out_desc_cnt = state.ndesc;
+		cmd->data_out_desc_cnt = count;
 	else
-		cmd->data_in_desc_cnt = state.ndesc;
+		cmd->data_in_desc_cnt = count;
+
+	ib_dma_sync_single_for_device(ibdev, req->indirect_dma_addr, table_len,
+				      DMA_TO_DEVICE);
 
 map_complete:
 	if (scmnd->sc_data_direction == DMA_TO_DEVICE)
@@ -1891,6 +1914,7 @@ enum {
 	SRP_OPT_INITIATOR_EXT	= 1 << 8,
 	SRP_OPT_CMD_SG_ENTRIES	= 1 << 9,
 	SRP_OPT_ALLOW_EXT_SG	= 1 << 10,
+	SRP_OPT_SG_TABLESIZE	= 1 << 11,
 	SRP_OPT_ALL		= (SRP_OPT_ID_EXT	|
 				   SRP_OPT_IOC_GUID	|
 				   SRP_OPT_DGID		|
@@ -1910,6 +1934,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
 	{ SRP_OPT_ALLOW_EXT_SG,		"allow_ext_sg=%u"	},
+	{ SRP_OPT_SG_TABLESIZE,		"sg_tablesize=%u" 	},
 	{ SRP_OPT_ERR,			NULL 			}
 };
 
@@ -2053,6 +2078,15 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target)
 			target->allow_ext_sg = !!token;
 			break;
 
+		case SRP_OPT_SG_TABLESIZE:
+			if (match_int(args, &token) || token < 1 ||
+					token > SCSI_MAX_SG_CHAIN_SEGMENTS) {
+				printk(KERN_WARNING PFX "bad max sg_tablesize parameter '%s'\n", p);
+				goto out;
+			}
+			target->sg_tablesize = token;
+			break;
+
 		default:
 			printk(KERN_WARNING PFX "unknown parameter or missing value "
 			       "'%s' in target creation request\n", p);
@@ -2083,6 +2117,8 @@ static ssize_t srp_create_target(struct device *dev,
 		container_of(dev, struct srp_host, dev);
 	struct Scsi_Host *target_host;
 	struct srp_target_port *target;
+	struct ib_device *ibdev = host->srp_dev->dev;
+	dma_addr_t dma_addr;
 	int i, ret;
 
 	target_host = scsi_host_alloc(&srp_template,
@@ -2102,13 +2138,22 @@ static ssize_t srp_create_target(struct device *dev,
 	target->lkey		= host->srp_dev->mr->lkey;
 	target->rkey		= host->srp_dev->mr->rkey;
 	target->cmd_sg_cnt	= cmd_sg_entries;
+	target->sg_tablesize	= indirect_sg_entries ? : cmd_sg_entries;
 	target->allow_ext_sg	= allow_ext_sg;
 
 	ret = srp_parse_options(buf, target);
 	if (ret)
 		goto err;
 
-	target_host->sg_tablesize = target->cmd_sg_cnt;
+	if (!host->srp_dev->fmr_pool && !target->allow_ext_sg &&
+				target->cmd_sg_cnt < target->sg_tablesize) {
+		printk(KERN_WARNING PFX "No FMR pool and no external indirect descriptors, limiting sg_tablesize to cmd_sg_cnt\n");
+		target->sg_tablesize = target->cmd_sg_cnt;
+	}
+
+	target_host->sg_tablesize = target->sg_tablesize;
+	target->indirect_size = target->sg_tablesize *
+				sizeof (struct srp_direct_buf);
 	target->max_iu_len = sizeof (struct srp_cmd) +
 			     sizeof (struct srp_indirect_buf) +
 			     target->cmd_sg_cnt * sizeof (struct srp_direct_buf);
@@ -2123,14 +2168,22 @@ static ssize_t srp_create_target(struct device *dev,
 					GFP_KERNEL);
 		req->map_page = kmalloc(SRP_FMR_SIZE * sizeof (void *),
 					GFP_KERNEL);
-		if (!req->fmr_list || !req->map_page)
+		req->indirect_desc = kmalloc(target->indirect_size, GFP_KERNEL);
+		if (!req->fmr_list || !req->map_page || !req->indirect_desc)
 			goto err_free_mem;
 
+		dma_addr = ib_dma_map_single(ibdev, req->indirect_desc,
+					     target->indirect_size,
+					     DMA_TO_DEVICE);
+		if (ib_dma_mapping_error(ibdev, dma_addr))
+			goto err_free_mem;
+
+		req->indirect_dma_addr = dma_addr;
 		req->index = i;
 		list_add_tail(&req->list, &target->free_reqs);
 	}
 
-	ib_query_gid(host->srp_dev->dev, host->port, 0, &target->path.sgid);
+	ib_query_gid(ibdev, host->port, 0, &target->path.sgid);
 
 	shost_printk(KERN_DEBUG, target->scsi_host, PFX
 		     "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
@@ -2410,6 +2463,13 @@ static int __init srp_init_module(void)
 		cmd_sg_entries = 255;
 	}
 
+	if (!indirect_sg_entries)
+		indirect_sg_entries = cmd_sg_entries;
+	else if (indirect_sg_entries < cmd_sg_entries) {
+		printk(KERN_WARNING PFX "Bumping up indirect_sg_entries to match cmd_sg_entries (%u)\n", cmd_sg_entries);
+		indirect_sg_entries = cmd_sg_entries;
+	}
+
 	ib_srp_transport_template =
 		srp_attach_transport(&ib_srp_transport_functions);
 	if (!ib_srp_transport_template)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index dd7c9fe..cf69621 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -117,6 +117,8 @@ struct srp_request {
 	struct srp_iu	       *cmd;
 	struct ib_pool_fmr    **fmr_list;
 	u64		       *map_page;
+	struct srp_direct_buf  *indirect_desc;
+	dma_addr_t		indirect_dma_addr;
 	short			nfmr;
 	short			index;
 };
@@ -137,6 +139,7 @@ struct srp_target_port {
 	enum srp_target_state	state;
 	unsigned int		max_iu_len;
 	unsigned int		cmd_sg_cnt;
+	unsigned int		indirect_size;
 	bool			allow_ext_sg;
 
 	/* Everything above this point is used in the hot path of
@@ -152,6 +155,7 @@ struct srp_target_port {
 	struct Scsi_Host       *scsi_host;
 	char			target_name[32];
 	unsigned int		scsi_id;
+	unsigned int		sg_tablesize;
 
 	struct ib_sa_path_rec	path;
 	__be16			orig_dgid[8];

* [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (5 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 6/8] IB/srp: add support for indirect tables that don't fit in SRP_CMD David Dillow
@ 2011-01-19  4:27   ` David Dillow
       [not found]     ` <1295411242-26148-8-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
  2011-01-19  4:27   ` [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables David Dillow
  2011-01-19  5:31   ` [RFC 0/8] Reliably generate large request from SRP Roland Dreier
  8 siblings, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Now that we can get larger SG lists, we can take advantage of HCAs that
allow us to use larger FMR sizes. In many cases, we can use up to 512
entries, so start there and work our way down.
---
 drivers/infiniband/ulp/srp/ib_srp.c |   31 +++++++++++++++++++------------
 drivers/infiniband/ulp/srp/ib_srp.h |    3 ++-
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0121530..1da8b25 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2300,7 +2300,7 @@ static void srp_add_one(struct ib_device *device)
 	struct ib_device_attr *dev_attr;
 	struct ib_fmr_pool_param fmr_param;
 	struct srp_host *host;
-	int fmr_page_shift, s, e, p;
+	int max_pages_per_fmr, fmr_page_shift, s, e, p;
 
 	dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
 	if (!dev_attr)
@@ -2340,17 +2340,24 @@ static void srp_add_one(struct ib_device *device)
 	if (IS_ERR(srp_dev->mr))
 		goto err_pd;
 
-	memset(&fmr_param, 0, sizeof fmr_param);
-	fmr_param.pool_size	    = SRP_FMR_POOL_SIZE;
-	fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
-	fmr_param.cache		    = 1;
-	fmr_param.max_pages_per_fmr = SRP_FMR_SIZE;
-	fmr_param.page_shift	    = fmr_page_shift;
-	fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
-				       IB_ACCESS_REMOTE_WRITE |
-				       IB_ACCESS_REMOTE_READ);
-
-	srp_dev->fmr_pool = ib_create_fmr_pool(srp_dev->pd, &fmr_param);
+	for (max_pages_per_fmr = SRP_FMR_SIZE;
+			max_pages_per_fmr >= SRP_FMR_MIN_SIZE;
+			max_pages_per_fmr /= 2, srp_dev->fmr_max_size /= 2) {
+		memset(&fmr_param, 0, sizeof fmr_param);
+		fmr_param.pool_size	    = SRP_FMR_POOL_SIZE;
+		fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
+		fmr_param.cache		    = 1;
+		fmr_param.max_pages_per_fmr = max_pages_per_fmr;
+		fmr_param.page_shift	    = fmr_page_shift;
+		fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
+					       IB_ACCESS_REMOTE_WRITE |
+					       IB_ACCESS_REMOTE_READ);
+
+		srp_dev->fmr_pool = ib_create_fmr_pool(srp_dev->pd, &fmr_param);
+		if (!IS_ERR(srp_dev->fmr_pool))
+			break;
+	}
+
 	if (IS_ERR(srp_dev->fmr_pool))
 		srp_dev->fmr_pool = NULL;
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index cf69621..020caf0 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -69,7 +69,8 @@ enum {
 	SRP_TAG_NO_REQ		= ~0U,
 	SRP_TAG_TSK_MGMT	= 1U << 31,
 
-	SRP_FMR_SIZE		= 256,
+	SRP_FMR_SIZE		= 512,
+	SRP_FMR_MIN_SIZE	= 128,
 	SRP_FMR_POOL_SIZE	= 1024,
 	SRP_FMR_DIRTY_SIZE	= SRP_FMR_POOL_SIZE / 4,
 

* [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (6 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings David Dillow
@ 2011-01-19  4:27   ` David Dillow
       [not found]     ` <1295411242-26148-9-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
  2011-01-19  5:31   ` [RFC 0/8] Reliably generate large request from SRP Roland Dreier
  8 siblings, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-01-19  4:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Add .dma_boundary to force each page into its own S/G entry to give us
worst case fragmentation.

Include scatterlist.h to pick up ARCH_HAS_SG_CHAIN for scsi.h -- the
patch to fix this is floating in the ether.

Fix direct IO when doing more than 1 MB IOs -- the proper fix is
floating in the ether as well, and has now landed in Andrew Morton's
tree.
---
 drivers/infiniband/ulp/srp/ib_srp.c |   11 ++++++++++-
 fs/direct-io.c                      |    1 +
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 1da8b25..b9daf2f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -41,6 +41,10 @@
 
 #include <asm/atomic.h>
 
+/* XXX This has to be included before scsi.h to pick up ARCH_HAS_SG_CHAIN
+ * There is a patch for this floating on LKML and linux-scsi now
+ */
+#include <linux/scatterlist.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_dbg.h>
@@ -1845,7 +1849,12 @@ static struct scsi_host_template srp_template = {
 	.this_id			= -1,
 	.cmd_per_lun			= SRP_CMD_SQ_SIZE,
 	.use_clustering			= ENABLE_CLUSTERING,
-	.shost_attrs			= srp_host_attrs
+	.shost_attrs			= srp_host_attrs,
+
+	/* XXX Force a new SG entry for every page crossing to simulate
+	 * maximum fragmentation.
+	 */
+	.dma_boundary			= (4096 - 1),
 };
 
 static int srp_add_target(struct srp_host *host, struct srp_target_port *target)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 85882f6..9eb0553 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -583,6 +583,7 @@ static int dio_new_bio(struct dio *dio, sector_t start_sector)
 		goto out;
 	sector = start_sector << (dio->blkbits - 9);
 	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
+	nr_pages = min(nr_pages, BIO_MAX_PAGES);
 	BUG_ON(nr_pages <= 0);
 	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
 	dio->boundary = 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
                     ` (7 preceding siblings ...)
  2011-01-19  4:27   ` [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables David Dillow
@ 2011-01-19  5:31   ` Roland Dreier
       [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  8 siblings, 1 reply; 35+ messages in thread
From: Roland Dreier @ 2011-01-19  5:31 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Dave,

 > Now that at least one vendor is implementing full support for the SRP
 > indirect memory descriptor tables, we can safely expand the sg_tablesize,
 > and realize some performance gains, in many cases quite large. I don't
 > have vendor code that implements the full support needed for safety, but
 > the rareness of FMR mapping failures allows the mapping code to function,
 > at a risk, with existing targets.

Have you considered using memory registration through a send queue (from
the base memory management extensions)?  mlx4 at least has support for
this operation, which would let you pre-allocate everything and avoid
the possibility of failure (I think).

When do we get FMR mapping failures now?

 - R.

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2011-01-19 12:01       ` David Dillow
  2011-01-20  9:52       ` Or Gerlitz
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-19 12:01 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2011-01-18 at 21:31 -0800, Roland Dreier wrote:
> Hi Dave,
> 
>  > Now that at least one vendor is implementing full support for the SRP
>  > indirect memory descriptor tables, we can safely expand the sg_tablesize,
>  > and realize some performance gains, in many cases quite large. I don't
>  > have vendor code that implements the full support needed for safety, but
>  > the rareness of FMR mapping failures allows the mapping code to function,
>  > at a risk, with existing targets.
> 
> Have you considered using memory registration through a send queue (from
> the base memory management extensions)?  mlx4 at least has support for
> this operation, which would let you pre-allocate everything and avoid
> the possibility of failure (I think).

I'd rather keep it to one code path if possible, rather than mix FMRs
and BMM. I also need this to work on mthca cards, though I haven't
checked if they support it.

Still, I'll check into it; avoiding the need for vendor support would be
nice.

> When do we get FMR mapping failures now?

I don't know that we ever do, but reading through the code in fmr_pool.c
and the HW implementations, there is no telling what goes on in ehca
since it is hidden in the hypervisor, or so it appeared to me last I
looked. There seemed to be error cases elsewhere, but they should also
be quite rare.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  2011-01-19 12:01       ` David Dillow
@ 2011-01-20  9:52       ` Or Gerlitz
       [not found]         ` <4D3805C4.6010203-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  2011-02-19  0:06       ` David Dillow
  2011-02-19  0:07       ` David Dillow
  3 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-01-20  9:52 UTC (permalink / raw)
  To: Roland Dreier, David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Roland Dreier wrote:
>> [...] support for the SRP indirect memory descriptor tables, we can safely expand
>> the sg_tablesize, and realize some performance gains, in many cases quite large.
>> [..] the rareness of FMR mapping failures allows the mapping code to function,
>> at a risk, with existing targets.

> Have you considered using memory registration through a send queue (from
> the base memory management extensions)?  mlx4 at least has support for
> this operation, which would let you pre-allocate everything and avoid
> the possibility of failure (I think). When do we get FMR mapping failures now?

Dave, with myself being a little behind on SRP... would it be correct to
say that on the initiator side, indirect mapping <--> using FMR?

> Device	Size	Baseline	Patched
> SAS	1M	524 MB/s	1004 MB/s

Starting with the features (perf improvements) that this patch series brings:
if we look at the 50% for SAS/1M IOs that you're presenting, can you tell
what made the difference? srp went from an sg_tablesize of 255 to 256, so the
upper layers were able to provide 1M as one IO -- were you FMR-ing here or
not? If not, is this due to the mlx4 patch for dma_max_seg_size and your
special environment that allows you to get 1M as a single SG entry?
Anything else in that patch set?

Now moving to the bugs (FMR mapping failures) this series addresses: can
you report/shed any light on the failures, and/or how to reproduce them?

Roland, I wasn't sure whether the usage of FMRs a la the IB spec, which
mlx4 supports, that you were suggesting was meant to address the bugs or
the features...

Or.

* Re: [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries
       [not found]     ` <1295411242-26148-5-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
@ 2011-01-20 10:04       ` Or Gerlitz
       [not found]         ` <AANLkTim6H063ta0w2A+zo9QH0jY5qL5uu1OxN4iqMFEm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-01-20 10:04 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> We up the minimum page size for the FMR pool, as it is rare that the
> kernel would send down requests with scattered 512 byte fragments, and
> no HCA supports such a small FMR mapping in any case.


I believe you are wrong here, as ConnectX has supported FMR-ing 512 byte
segments ("block lists") since firmware 2.6 or so.

> # ibv_devinfo -v | grep -E "(page_size_cap|fw_ver)"
>         fw_ver:                         2.7.000
>         page_size_cap:                  0xfffffe00
Or.

* Re: [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables
       [not found]     ` <1295411242-26148-9-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
@ 2011-01-20 10:07       ` Or Gerlitz
       [not found]         ` <4D38094B.9090101-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-01-20 10:07 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

David Dillow wrote:
> Fix direct IO when doing more than 1 MB IOs -- proper patch to fix is
> floating in the ether as well, now landed in Andrew Morton's tree.

So what was the behavior without this patch? Also, can you send a pointer
to the tree and the mailing list this patch was submitted to? I'd like to
follow the discussion there, thanks.

Or.

> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -583,6 +583,7 @@ static int dio_new_bio(struct dio *dio, sector_t start_sector)
>  		goto out;
>  	sector = start_sector << (dio->blkbits - 9);
>  	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
> +	nr_pages = min(nr_pages, BIO_MAX_PAGES);
>  	BUG_ON(nr_pages <= 0);
>  	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
>  	dio->boundary = 0;

* Re: [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings
       [not found]     ` <1295411242-26148-8-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
@ 2011-01-20 10:24       ` Or Gerlitz
       [not found]         ` <4D380D4B.6060404-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-01-20 10:24 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

David Dillow wrote:
> Now that we can get larger SG lists, we can take advantage of HCAs that
> allow us to use larger FMR sizes. In many cases, we can use up to 512
> entries, so start there and work our way down.

I'm not clear on what this patch really tries to do, but before that,
going to the low-level details:

> +	for (max_pages_per_fmr = SRP_FMR_SIZE;
> +			max_pages_per_fmr >= SRP_FMR_MIN_SIZE;
> +			max_pages_per_fmr /= 2, srp_dev->fmr_max_size /= 2) {
> +		fmr_param.pool_size	    = SRP_FMR_POOL_SIZE;
> +		fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
> +		fmr_param.cache		    = 1;
> +		fmr_param.max_pages_per_fmr = SRP_FMR_SIZE;
> +		fmr_param.page_shift	    = fmr_page_shift;

> +		srp_dev->fmr_pool = ib_create_fmr_pool(srp_dev->pd, &fmr_param);
> +		if (!IS_ERR(srp_dev->fmr_pool))
> +			break;
> +	}

Aren't we stepping on srp_dev->fmr_pool on each iteration of the for loop?
Didn't you want page_shift and/or max_pages_per_fmr to change throughout
the loop?

* Re: [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables
       [not found]         ` <4D38094B.9090101-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2011-01-20 12:33           ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-20 12:33 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 2011-01-20 at 12:07 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > Fix direct IO when doing more than 1 MB IOs -- proper patch to fix is
> > floating in the ether as well, now landed in Andrew Morton's tree.
> 
> So what was the behavior without this patch?

It would oops in dio_bio_alloc() because we tried to allocate too many
entries in the bio -- bio_alloc() returned NULL and it was not checked.

> also can you send pointer to that tree and mailing list this patch was
> submitted, I'd like to follow on the discussion there, thanks,

http://article.gmane.org/gmane.linux.file-systems/50288


> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -583,6 +583,7 @@ static int dio_new_bio(struct dio *dio, sector_t start_sector)
> >  		goto out;
> >  	sector = start_sector << (dio->blkbits - 9);
> >  	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
> > +	nr_pages = min(nr_pages, BIO_MAX_PAGES);
> >  	BUG_ON(nr_pages <= 0);
> >  	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
> >  	dio->boundary = 0;

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries
       [not found]         ` <AANLkTim6H063ta0w2A+zo9QH0jY5qL5uu1OxN4iqMFEm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-01-20 12:36           ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-20 12:36 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 2011-01-20 at 12:04 +0200, Or Gerlitz wrote:
> > We up the minimum page size for the FMR pool, as it is rare that the
> > kernel would send down requests with scattered 512 byte fragments, and
> > no HCA supports such a small FMR mapping in any case.
> 
> 
> I believe you are wrong here as connectx support fmr-ing 512 byte
> segments ("block lists") since firmware 2.6 or so

Yep, that's what I get for working from memory of mthca -- though I
thought I had looked at mlx4 some time back. Thanks for the heads up.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings
       [not found]         ` <4D380D4B.6060404-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2011-01-20 12:40           ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-20 12:40 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 2011-01-20 at 12:24 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > Now that we can get larger SG lists, we can take advantage of HCAs that
> > allow us to use larger FMR sizes. In many cases, we can use up to 512
> > entries, so start there and work our way down.
> 
> I'm not clear what this patch really tries to do but before that, going
> to low level details

I want to use 512 entry FMRs instead of 256 entry ones. I'll accept 128
entries if I have to. This reduces the number of indirect entries in the
list going to the client.

> > +	for (max_pages_per_fmr = SRP_FMR_SIZE;
> > +			max_pages_per_fmr >= SRP_FMR_MIN_SIZE;
> > +			max_pages_per_fmr /= 2, srp_dev->fmr_max_size /= 2) {
> > +		fmr_param.pool_size	    = SRP_FMR_POOL_SIZE;
> > +		fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
> > +		fmr_param.cache		    = 1;
> > +		fmr_param.max_pages_per_fmr = SRP_FMR_SIZE;
> > +		fmr_param.page_shift	    = fmr_page_shift;
> 
> > +		srp_dev->fmr_pool = ib_create_fmr_pool(srp_dev->pd, &fmr_param);
> > +		if (!IS_ERR(srp_dev->fmr_pool))
> > +			break;
> > +	}
> 
> aren't stepping on srp_dev->fmr_pool on each invocation of the for loop?

No, we break out of the loop if we got a good allocation -- ! IS_ERR().

> didn't you want page_shift and/or max_pages_per_fmr to change
> throughout the loop?

Uhm. Yeah. Forgot to s/SRP_FMR_SIZE/max_pages_per_fmr/ there, didn't I.
Good catch, thanks.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]         ` <4D3805C4.6010203-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2011-01-20 12:54           ` David Dillow
       [not found]             ` <1295528044.22825.64.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
  2011-01-20 17:50           ` Roland Dreier
  1 sibling, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-01-20 12:54 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 2011-01-20 at 11:52 +0200, Or Gerlitz wrote:
> Roland Dreier wrote:
> >> [...] support for the SRP indirect memory descriptor tables, we can safely expand
> >> the sg_tablesize, and realize some performance gains, in many cases quite large.
> >> [..] the rareness of FMR mapping failures allows the mapping code to function,
> >> at a risk, with existing targets.
> 
> > Have you considered using memory registration through a send queue (from
> > the base memory management extensions)?  mlx4 at least has support for
> > this operation, which would let you pre-allocate everything and avoid
> > the possibility of failure (I think). When do we get FMR mapping failures now?
> 
> Dave, with myself being a little behind on srp... would it be correct to 
> say that in the initiator side indirect mapping <--> using FMR?

In general, yes. But there are cases where we use FMR and a direct
memory descriptor. And we can use indirect memory descriptors without
FMR.

> > Device	Size	Baseline	Patched
> > SAS	1M	524 MB/s	1004 MB/s
> 
> Starting with the features (perf improvements) that this patch series brings, 
> if we look on the 50% for SAS/1M IOs that you're presenting, can you tell 
> what made the difference, srp went from sg_tablesize of 255 to 256 so the 
> upper layers were able to provide 1M as one IO, were you fmr-ing here or 
> not?

This win is from sg_tablesize going from 255 to 256 in this case; the HW
really likes that better than getting two requests -- one for 1020 KB
and one for 4 KB. FMR was used for the mapping, so it dropped to doing a
direct memory descriptor. The larger IO sizes also used FMR, but used
indirect memory descriptors, as I was using SRP_FMR_SIZE == 256 for the
testing.

> if not, is this as of the mlx4 patch for the dma_max_seg_size and your 
> special environment that allows you to get 1M as single SG entry? 
> anything else in that patch set?

The mlx4 patch was not used for this testing. I set dma_boundary on the
SCSI host so that I got each 4 KB page in its own SG entry to simulate
maximum memory fragmentation on the host.

> Now moving to the bugs (fmr mapping failures) this series addresses, can
> you report/shed any light on the failures? and/or how to produce them?

I don't believe that they happen in practice, but there are corner cases
in the code. However, I'm not willing to risk silent corruption in those
cases, so I think having the target support is important -- even if it
is only used as a fallback on error. I also provide a way to let users
take advantage of the performance improvement even if the target doesn't
implement the full SRP spec -- it is noisier and will retry the command,
hoping the failure was transient.

> Roland, I wasn't sure to follow if the usage of the FMRs ala the IB spec, 
> which are supported by mlx4 you were suggesting came to address the bugs
> or the features... 

I'm sure Roland will answer for himself, but I took his suggestion as a
way to guarantee no failures, so that vendor support wouldn't be needed.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]         ` <4D3805C4.6010203-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  2011-01-20 12:54           ` David Dillow
@ 2011-01-20 17:50           ` Roland Dreier
  1 sibling, 0 replies; 35+ messages in thread
From: Roland Dreier @ 2011-01-20 17:50 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Dillow, linux-rdma-u79uwXL29TY76Z2rM5mHXA

 > Roland, I wasn't sure to follow if the usage of the FMRs ala the IB spec, 
 > which are supported by mlx4 you were suggesting came to address the bugs
 > or the features... 

I just have a general preference for using the fast mem reg through a
work queue (from BMME), since that is cleaner (e.g. allows deterministic
unmap with local invalidate work requests, instead of having possibly
still valid FMRs floating around until the batched invalidate happens)
and also what all future devices should support, instead of having hacks
to support the old legacy FMRs.

It's not a problem for SRP (since that's really IB-only) but also I
would expect iWARP devices to have better support for mem reg through
work queues, so anything transport agnostic is better off avoiding FMRs.

 - R.

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]             ` <1295528044.22825.64.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2011-01-24 15:32               ` Or Gerlitz
       [not found]                 ` <4D3D9B74.8090607-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-01-24 15:32 UTC (permalink / raw)
  To: David Dillow
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA

David Dillow wrote:
>> if we look on the 50% for SAS/1M IOs that you're presenting, can you tell 
>> what made the difference, srp went from sg_tablesize of 255 to 256 so the 
>> upper layers where able to provide 1M as one IO

> This win is from sg_tablesize going from 255 to 256 in this case; the HW
> really likes that better than getting two requests -- one for 1020 KB
> and one for 4 KB. 

It's always nice to find the simplest explanation for the greatest improvement... moving on to the 2nd largest gains:

> SAS	2M	520 MB/s	861 MB/s
> SAS	4M	529 MB/s	921 MB/s
> SAS	8M	600 MB/s	951 MB/s

I wonder what made the difference here? It can't be only the 255 --> 256 sg_tablesize change; for the 2M case, the change to use 512-page FMRs could let you use one rkey/FMR for the whole IO, but not for 4M/8M.

Or.

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]                 ` <4D3D9B74.8090607-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2011-01-24 16:14                   ` Bart Van Assche
       [not found]                     ` <AANLkTikapxELx5B6knAm6CQaeLsKHWd9EMQeexmFdF1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-01-24 17:53                   ` David Dillow
  1 sibling, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2011-01-24 16:14 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Dillow, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
> David Dillow wrote:
>>> if we look on the 50% for SAS/1M IOs that you're presenting, can you tell
>>> what made the difference, srp went from sg_tablesize of 255 to 256 so the
>>> upper layers where able to provide 1M as one IO
>
>> This win is from sg_tablesize going from 255 to 256 in this case; the HW
>> really likes that better than getting two requests -- one for 1020 KB
>> and one for 4 KB.
>
> Its always nice to find the simplest explanation to the greatest improvement... going to the 2nd largest gains
>
>> SAS   2M      520 MB/s        861 MB/s
>> SAS   4M      529 MB/s        921 MB/s
>> SAS   8M      600 MB/s        951 MB/s
>
> I wonder what made the difference here? it can't be only the 255 --> 256 sg_tablesize change, for the 2M case
> the change to use 512 pages FMRs could let you use one rkey/fmr for the whole IO but not for 4M/8M

I think it would be interesting to have performance measurements with
a RAM disk as the target too, because for someone not familiar with the
internals of the target used in this test, it is hard to tell which
performance gain is due to the initiator changes and which is due to
the target behavior.

Bart.

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]                 ` <4D3D9B74.8090607-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  2011-01-24 16:14                   ` Bart Van Assche
@ 2011-01-24 17:53                   ` David Dillow
  1 sibling, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-24 17:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 2011-01-24 at 10:32 -0500, Or Gerlitz wrote:
> David Dillow wrote:
> >> if we look on the 50% for SAS/1M IOs that you're presenting, can you tell 
> >> what made the difference, srp went from sg_tablesize of 255 to 256 so the 
> >> upper layers where able to provide 1M as one IO
> 
> > This win is from sg_tablesize going from 255 to 256 in this case; the HW
> > really likes that better than getting two requests -- one for 1020 KB
> > and one for 4 KB. 
> 
> Its always nice to find the simplest explanation to the greatest
> improvement... going to the 2nd largest gains
> 
> > SAS	2M	520 MB/s	861 MB/s
> > SAS	4M	529 MB/s	921 MB/s
> > SAS	8M	600 MB/s	951 MB/s
> 
> I wonder what made the difference here? it can't be only the 255 -->
> 256 sg_tablesize change, for the 2M case the change to use 512 pages
> FMRs could let you use one rkey/fmr for the whole IO but not for 4M/8M

Actually, it very much was the sg_tablesize going from 255 to 2048 (512
actually used for 2 MB requests). This was pushed into 2 FMRs, as I was
only using 256 entry FMRs for the test; I did not rerun the tests after
I added the 512 entry FMR change.

So, for above 1 MB, we always used indirect descriptors. The target used
has a high overhead per command, and cannot aggregate back-end disk
commands when the write cache is disabled, so it thrives on large
requests.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]                     ` <AANLkTikapxELx5B6knAm6CQaeLsKHWd9EMQeexmFdF1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-01-24 18:00                       ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-01-24 18:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 2011-01-24 at 11:14 -0500, Bart Van Assche wrote:
> On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
> > David Dillow wrote:
> >>> if we look on the 50% for SAS/1M IOs that you're presenting, can you tell
> >>> what made the difference, srp went from sg_tablesize of 255 to 256 so the
> >>> upper layers where able to provide 1M as one IO
> >
> >> This win is from sg_tablesize going from 255 to 256 in this case; the HW
> >> really likes that better than getting two requests -- one for 1020 KB
> >> and one for 4 KB.
> >
> > Its always nice to find the simplest explanation to the greatest
> improvement... going to the 2nd largest gains
> >
> >> SAS   2M      520 MB/s        861 MB/s
> >> SAS   4M      529 MB/s        921 MB/s
> >> SAS   8M      600 MB/s        951 MB/s
> >
> > I wonder what made the difference here? it can't be only the 255 -->
> > 256 sg_tablesize change, for the 2M case
> > the change to use 512 pages FMRs could let you use one rkey/fmr for
> > the whole IO but not for 4M/8M
> 
> I think it would be interesting to have performance measurements with
> a RAM disk as target too because it is hard to tell for someone not
> familiar with the internals of the target used in this test which
> performance gain is due to the initiator changes and which is due to
> the target behavior.

I think it is pretty obvious that the gain is due to the initiator
changes allowing us to drive the target the way it likes to be driven,
but perhaps I haven't given you enough information. The HW is backed by
a RAID6 (really RAID3 + two parity drives). Each 4 KB block is broken
into stripes across 8 512 byte sectors, and there is no write combining
when the write cache is disabled. So, when we're splitting 1 MB into a
1020 KB and a 4 KB request, that translates into a 127.5 KB and a 512
byte request to each backend storage device. With the patches, that
remains a single 128 KB request, or 256 KB for 2M, etc. The low-level
drives can optimize that much better.

I did runs against my IOP test harness, and it showed better performance
there as well, though that was unexpected -- I figured we'd see a slight
decline in IOPS. I have not yet investigated further, but you have the
code and are welcome to run tests and report results.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  2011-01-19 12:01       ` David Dillow
  2011-01-20  9:52       ` Or Gerlitz
@ 2011-02-19  0:06       ` David Dillow
  2011-02-19  0:07       ` David Dillow
  3 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-02-19  0:06 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2011-01-18 at 21:31 -0800, Roland Dreier wrote:
> Hi Dave,
> 
>  > Now that at least one vendor is implementing full support for the SRP
>  > indirect memory descriptor tables, we can safely expand the sg_tablesize,
>  > and realize some performance gains, in many cases quite large. I don't
>  > have vendor code that implements the full support needed for safety, but
>  > the rareness of FMR mapping failures allows the mapping code to function,
>  > at a risk, with existing targets.
> 
> Have you considered using memory registration through a send queue (from
> the base memory management extensions)?  mlx4 at least has support for
> this operation, which would let you pre-allocate everything and avoid
> the possibility of failure (I think).

I'm looking at this now, and it may not be too bad; I think I can re-use
the mapping machinery pretty easily without it getting too ugly.

I do have a few questions, though. It looks like I can only have up to
256 keys for each fast page registration MR, so that seems to limit the
mappings in flight -- I can see cases where we may run out and have to
push the request back to the mid-layer. We'd also be cycling through the
key space pretty quickly, negating some of the advantages of being able
to invalidate the mappings quickly.

I'm thinking about just allocating an MR per request -- this seems like
a simple way to avoid the restrictions, but would of course use more
MRs. mlx4 seems to allow just shy of 512 K of them, so that's probably
not a large concern.

What do you think? Is that overkill? Do I misunderstand the BMM
extensions?
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
                         ` (2 preceding siblings ...)
  2011-02-19  0:06       ` David Dillow
@ 2011-02-19  0:07       ` David Dillow
       [not found]         ` <1298074037.15679.17.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
  3 siblings, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-02-19  0:07 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

[ sorry for the dupe, mis-edited Roland's address ]

On Tue, 2011-01-18 at 21:31 -0800, Roland Dreier wrote:
> Hi Dave,
> 
>  > Now that at least one vendor is implementing full support for the SRP
>  > indirect memory descriptor tables, we can safely expand the sg_tablesize,
>  > and realize some performance gains, in many cases quite large. I don't
>  > have vendor code that implements the full support needed for safety, but
>  > the rareness of FMR mapping failures allows the mapping code to function,
>  > at a risk, with existing targets.
> 
> Have you considered using memory registration through a send queue (from
> the base memory management extensions)?  mlx4 at least has support for
> this operation, which would let you pre-allocate everything and avoid
> the possibility of failure (I think).

I'm looking at this now, and it may not be too bad; I think I can re-use
the mapping machinery pretty easily without it getting too ugly.

I do have a few questions, though. It looks like I can only have up to
256 keys for each fast page registration MR, so that seems to limit the
mappings in flight -- I can see cases where we may run out and have to
push the request back to the mid-layer. We'd also be cycling through the
key space pretty quickly, negating some of the advantages of being able
to invalidate the mappings quickly.

I'm thinking about just allocating an MR per request -- this seems like
a simple way to avoid the restrictions, but would of course use more
MRs. mlx4 seems to allow just shy of 512 K of them, so that's probably
not a large concern.

What do you think? Is that overkill? Do I misunderstand the BMM
extensions?
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]         ` <1298074037.15679.17.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
@ 2011-02-22  6:36           ` Or Gerlitz
       [not found]             ` <4D635974.10807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-02-22  6:36 UTC (permalink / raw)
  To: David Dillow; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

David Dillow wrote:
> It looks like I can only have up to 256 keys for each fast page 
> registration MR, so that seems to limit the mappings in flight

My understanding is that for a --specific-- MR, its associated rkey would
wrap every 256 mappings; see ib_update_fast_reg_key and friends. And
anyway, for a specific MR you can have only one mapping live at any
point in time (e.g. set before sending the request, invalidated remotely
with the response or locally after getting the response).

Or.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]             ` <4D635974.10807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2011-02-22 14:49               ` David Dillow
       [not found]                 ` <1298386190.18945.1.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: David Dillow @ 2011-02-22 14:49 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2011-02-22 at 08:36 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > It looks like I can only have up to 256 keys for each fast page 
> > registration MR, so that seems to limit the mappings in flight
> 
> My understanding is that for a --specific-- MR, its associated rkey would
> wrap every 256 mappings; see ib_update_fast_reg_key and friends. And
> anyway, for a specific MR you can have only one mapping live at any
> point in time (e.g. set before sending the request, invalidated remotely
> with the response or locally after getting the response)

Thanks for the note, Or -- you confirmed my suspicions after testing
last night. Looks like multiple MRs per request, then.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]                 ` <1298386190.18945.1.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2011-02-22 23:38                   ` Roland Dreier
       [not found]                     ` <AANLkTikxDu5b=p4fXHMm8W+tF3Lru4vB7xRZEF+HDpyu-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Roland Dreier @ 2011-02-22 23:38 UTC (permalink / raw)
  To: David Dillow; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 22, 2011 at 6:49 AM, David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> Thanks for the note, Or -- you confirmed my suspicions after testing
> last night. Looks like multiple MRs per request, then.

Multiple MRs per request, or one fast reg MR per request?

This really should be roughly the same as old-style FMRs -- in both
cases, we need one "registration" per request that might be in flight.
Old-style FMRs and new IB-spec fast reg MRs should be roughly
the same in terms of resources consumed...

 - R.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]                     ` <AANLkTikxDu5b=p4fXHMm8W+tF3Lru4vB7xRZEF+HDpyu-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-02-23  1:23                       ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-02-23  1:23 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2011-02-22 at 15:38 -0800, Roland Dreier wrote:
> On Tue, Feb 22, 2011 at 6:49 AM, David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> > Thanks for the note, Or -- you confirmed my suspicions after testing
> > last night. Looks like multiple MRs per request, then.
> 
> Multiple MRs per request, or one fast reg MR per request?

Multiple fast reg MRs per request -- on mlx4, they max out at 511
entries, so I'd need 5 FRMRs to get to 8 MB requests. I'm scaling based
on max_sect and the device attributes, so if one doesn't want more than
1 MB requests, they don't pay for the extra MRs -- or if other cards
support larger/smaller fast reg MRs, then the code does the right thing.

> This really should be roughly the same as old-style FMRs -- in both
> cases, we need one "registration" per request that might be in flight.
> Old-style FMRs and new IB-spec fast reg MRs should be roughly
> the same in terms of resources consumed...

Yeah, it seems pretty close overall. Fast reg MRs put more of it in view
of SRP, whereas old-style FMRs hid the bookkeeping in ib_pool_fmr.

As an aside, is there any guarantee that ib_post_send() will not modify
the list of work requests? I currently make no assumptions about that,
but it would avoid some work if I could re-use a work request, and just
change the fields I need instead of fully re-initializing it every time.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]     ` <AANLkTinC9QcE8E_O3M0+dapVGEAZq_tw-3cb3GN4qf-q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-15 23:51       ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-03-15 23:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Or Gerlitz

[Fixed linux-rdma address]

On Tue, 2011-03-15 at 07:13 -0400, Bart Van Assche wrote:
> On Tue, Mar 15, 2011 at 1:28 AM, David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> > On Tue, 2011-01-18 at 23:27 -0500, David Dillow wrote:
> > > Now that at least one vendor is implementing full support for the SRP
> > > indirect memory descriptor tables, we can safely expand the sg_tablesize,
> > > and realize some performance gains, in many cases quite large. I don't
> > > have vendor code that implements the full support needed for safety, but
> > > the rareness of FMR mapping failures allows the mapping code to function,
> > > at a risk, with existing targets.
> >
> > I'm getting ready to re-roll this series to address Or's review
> > comments. I'm going to leave BMME support until later, as I have a few
> > concerns about the error cases.
> >
> > I'd like to see this go into 2.6.39 -- do you guys have any other
> > questions or comments I should address?
> 
> I still have to take a closer look at these patches. But it would be
> convenient if the patches could be made available in a (branch of a)
> public git repository - that makes testing easier. And as you probably
> know temporary branches can be deleted at any time via git push
> <repo_name> :<branch_name>.

I've pushed these to

git://git.kernel.org/pub/scm/linux/kernel/git/dad/srp-initiator.git external-indirect

They should mirror out shortly.

I'll also make another review pass over the code.

Here's the diff from the last version I posted:
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 1da8b25..376d640 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1934,7 +1934,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
 	{ SRP_OPT_ALLOW_EXT_SG,		"allow_ext_sg=%u"	},
-	{ SRP_OPT_SG_TABLESIZE,		"sg_tablesize=%u" 	},
+	{ SRP_OPT_SG_TABLESIZE,		"sg_tablesize=%u"	},
 	{ SRP_OPT_ERR,			NULL 			}
 };
 
@@ -2347,7 +2347,7 @@ static void srp_add_one(struct ib_device *device)
 		fmr_param.pool_size	    = SRP_FMR_POOL_SIZE;
 		fmr_param.dirty_watermark   = SRP_FMR_DIRTY_SIZE;
 		fmr_param.cache		    = 1;
-		fmr_param.max_pages_per_fmr = SRP_FMR_SIZE;
+		fmr_param.max_pages_per_fmr = max_pages_per_fmr;
 		fmr_param.page_shift	    = fmr_page_shift;
 		fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
 					       IB_ACCESS_REMOTE_WRITE |


-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]   ` <1300148888.2772.15.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
@ 2011-03-16  8:27     ` Or Gerlitz
       [not found]       ` <4D807472.2060000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Or Gerlitz @ 2011-03-16  8:27 UTC (permalink / raw)
  To: David Dillow
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Bart Van Assche

David Dillow wrote:
> I'm getting ready to re-roll this series to address Or's review
> comments. I'm going to leave BMME support until later, as I have a few
> concerns about the error cases.
> 
> I'd like to see this go into 2.6.39 -- do you guys have any other
> questions or comments I should address?

Nothing special; I just want to make sure that at this point we're
talking only about patches 1-8, i.e. excluding the "increase DMA
max_segment_size on Mellanox hardware" patch, correct?

Or.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]       ` <4D807472.2060000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2011-03-16 15:07         ` David Dillow
  2011-03-16 16:50         ` Roland Dreier
  1 sibling, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-03-16 15:07 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Bart Van Assche

On Wed, 2011-03-16 at 10:27 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > I'm getting ready to re-roll this series to address Or's review
> > comments. I'm going to leave BMME support until later, as I have a few
> > concerns about the error cases.
> > 
> > I'd like to see this go into 2.6.39 -- do you guys have any other
> > questions or comments I should address?
> 
> Nothing special; I just want to make sure that at this point we're
> talking only about patches 1-8, i.e. excluding the "increase DMA
> max_segment_size on Mellanox hardware" patch, correct?

Correct. That patch is orthogonal to SRP, so it wouldn't go through my
git tree anyway -- it should go through the maintainers of those
drivers.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]       ` <4D807472.2060000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2011-03-16 15:07         ` David Dillow
@ 2011-03-16 16:50         ` Roland Dreier
       [not found]           ` <AANLkTimY74Wmsfc3F35SBuR2YyDW=ao78B=9uGh4LZNJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Roland Dreier @ 2011-03-16 16:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Dillow, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche

On Wed, Mar 16, 2011 at 1:27 AM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Nothing special; I just want to make sure that at this point we're
> talking only about patches 1-8, i.e. excluding the "increase DMA
> max_segment_size on Mellanox hardware" patch, correct?

Is there still any concern about that patch possibly being wrong?

Because I'd be inclined to merge it at this point.

 - R.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Reliably generate large request from SRP
       [not found]           ` <AANLkTimY74Wmsfc3F35SBuR2YyDW=ao78B=9uGh4LZNJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-16 21:41             ` David Dillow
  0 siblings, 0 replies; 35+ messages in thread
From: David Dillow @ 2011-03-16 21:41 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche

On Wed, 2011-03-16 at 09:50 -0700, Roland Dreier wrote:
> On Wed, Mar 16, 2011 at 1:27 AM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> > Nothing special; I just want to make sure that at this point we're
> > talking only about patches 1-8, i.e. excluding the "increase DMA
> > max_segment_size on Mellanox hardware" patch, correct?
> 
> Is there still any concern about that patch possibly being wrong?

Or can speak to his concerns, but the mlx4 hunk of the patch should drop
the limit to 1 GB as that seems to be a hardware limit.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-03-16 21:41 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-19  4:27 [RFC 0/8] Reliably generate large request from SRP David Dillow
     [not found] ` <1295411242-26148-1-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
2011-01-19  4:27   ` [RFC 1/8] IB/srp: always avoid non-zero offsets into an FMR David Dillow
2011-01-19  4:27   ` [RFC 2/8] IB/srp: move IB CM setup completion into its own function David Dillow
2011-01-19  4:27   ` [RFC 3/8] IB/srp: allow sg_tablesize to be set for each target David Dillow
2011-01-19  4:27   ` [RFC 4/8] IB/srp: rework mapping engine to use multiple FMR entries David Dillow
     [not found]     ` <1295411242-26148-5-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
2011-01-20 10:04       ` Or Gerlitz
     [not found]         ` <AANLkTim6H063ta0w2A+zo9QH0jY5qL5uu1OxN4iqMFEm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-20 12:36           ` David Dillow
2011-01-19  4:27   ` [RFC 5/8] IB/srp: add safety valve for large SG tables without HW support David Dillow
2011-01-19  4:27   ` [RFC 6/8] IB/srp: add support for indirect tables that don't fit in SRP_CMD David Dillow
2011-01-19  4:27   ` [RFC 7/8] IB/srp: try to use larger FMR sizes to cover our mappings David Dillow
     [not found]     ` <1295411242-26148-8-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
2011-01-20 10:24       ` Or Gerlitz
     [not found]         ` <4D380D4B.6060404-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2011-01-20 12:40           ` David Dillow
2011-01-19  4:27   ` [RFC 8/8] IB/srp and direct IO: patches for testing large indirect tables David Dillow
     [not found]     ` <1295411242-26148-9-git-send-email-dillowda-1Heg1YXhbW8@public.gmane.org>
2011-01-20 10:07       ` Or Gerlitz
     [not found]         ` <4D38094B.9090101-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2011-01-20 12:33           ` David Dillow
2011-01-19  5:31   ` [RFC 0/8] Reliably generate large request from SRP Roland Dreier
     [not found]     ` <aday66hxxwe.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2011-01-19 12:01       ` David Dillow
2011-01-20  9:52       ` Or Gerlitz
     [not found]         ` <4D3805C4.6010203-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2011-01-20 12:54           ` David Dillow
     [not found]             ` <1295528044.22825.64.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2011-01-24 15:32               ` Or Gerlitz
     [not found]                 ` <4D3D9B74.8090607-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2011-01-24 16:14                   ` Bart Van Assche
     [not found]                     ` <AANLkTikapxELx5B6knAm6CQaeLsKHWd9EMQeexmFdF1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-24 18:00                       ` David Dillow
2011-01-24 17:53                   ` David Dillow
2011-01-20 17:50           ` Roland Dreier
2011-02-19  0:06       ` David Dillow
2011-02-19  0:07       ` David Dillow
     [not found]         ` <1298074037.15679.17.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
2011-02-22  6:36           ` Or Gerlitz
     [not found]             ` <4D635974.10807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2011-02-22 14:49               ` David Dillow
     [not found]                 ` <1298386190.18945.1.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2011-02-22 23:38                   ` Roland Dreier
     [not found]                     ` <AANLkTikxDu5b=p4fXHMm8W+tF3Lru4vB7xRZEF+HDpyu-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-02-23  1:23                       ` David Dillow
     [not found] ` <1300148888.2772.15.camel@lap75545.ornl.gov>
     [not found]   ` <AANLkTinC9QcE8E_O3M0+dapVGEAZq_tw-3cb3GN4qf-q@mail.gmail.com>
     [not found]     ` <AANLkTinC9QcE8E_O3M0+dapVGEAZq_tw-3cb3GN4qf-q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-15 23:51       ` David Dillow
     [not found]   ` <1300148888.2772.15.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
2011-03-16  8:27     ` Or Gerlitz
     [not found]       ` <4D807472.2060000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2011-03-16 15:07         ` David Dillow
2011-03-16 16:50         ` Roland Dreier
     [not found]           ` <AANLkTimY74Wmsfc3F35SBuR2YyDW=ao78B=9uGh4LZNJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-16 21:41             ` David Dillow
