* [patch,v2 00/10] make I/O path allocations more numa-friendly
@ 2012-11-02 21:45 Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 01/10] scsi: add scsi_host_alloc_node Jeff Moyer
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche

Hi,

This patch set makes memory allocations for data structures used in
the I/O path more numa-friendly by allocating them from the same numa
node as the storage device.  I've only converted a handful of drivers
at this point.  My testing showed that, for workloads where the I/O
processes were not tied to the numa node housing the device, a speedup
of around 6% was observed.  When the I/O processes were tied to the
numa node of the device, there was no measurable difference in my test
setup.  Given my relatively low-end setup[1], I wouldn't be surprised
if others could show a more significant performance advantage.
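
The pattern used throughout the series is to derive the node from the
underlying struct device with dev_to_node() and hand it to a *_node()
allocation variant.  A minimal sketch of how an LLD would use the new
helper from patch 1 (the probe helper below is illustrative only, not
taken from any of the converted drivers):

#include <linux/pci.h>
#include <scsi/scsi_host.h>

static struct Scsi_Host *example_alloc_host(struct pci_dev *pdev,
					    struct scsi_host_template *sht,
					    int privsize)
{
	/* dev_to_node() returns -1 (NUMA_NO_NODE) when the platform has
	 * no NUMA information for this device. */
	int node = dev_to_node(&pdev->dev);

	return scsi_host_alloc_node(sht, privsize, node);
}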

Comments would be greatly appreciated.

Cheers,
Jeff

[1] LSI Megaraid SAS controller with 1GB battery-backed cache,
fronting a RAID 6 10+2.  The workload I used was tuned to not
have to hit disk.  Fio file attached.

--
changes from v1->v2:
- got rid of the vfs patch, as Al pointed out some fundamental
  problems with it
- credited Bart Van Assche properly

Jeff Moyer (10):
  scsi: add scsi_host_alloc_node
  scsi: make __scsi_alloc_queue numa-aware
  scsi: make scsi_alloc_sdev numa-aware
  scsi: allocate scsi_cmnd-s from the device's local numa node
  sd: use alloc_disk_node
  ata: use scsi_host_alloc_node
  megaraid_sas: use scsi_host_alloc_node
  mpt2sas: use scsi_host_alloc_node
  lpfc: use scsi_host_alloc_node
  cciss: use blk_init_queue_node

 drivers/ata/libata-scsi.c                 |    3 ++-
 drivers/block/cciss.c                     |    3 ++-
 drivers/scsi/hosts.c                      |   13 +++++++++++--
 drivers/scsi/lpfc/lpfc_init.c             |   10 ++++++----
 drivers/scsi/megaraid/megaraid_sas_base.c |    5 +++--
 drivers/scsi/mpt2sas/mpt2sas_scsih.c      |    4 ++--
 drivers/scsi/scsi.c                       |   17 +++++++++++------
 drivers/scsi/scsi_lib.c                   |    2 +-
 drivers/scsi/scsi_scan.c                  |    4 ++--
 drivers/scsi/sd.c                         |    2 +-
 include/scsi/scsi_host.h                  |    8 ++++++++
 11 files changed, 49 insertions(+), 22 deletions(-)



* [patch,v2 01/10] scsi: add scsi_host_alloc_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-03 16:35   ` Bart Van Assche
  2012-11-02 21:45 ` [patch,v2 02/10] scsi: make __scsi_alloc_queue numa-aware Jeff Moyer
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley

Allow an LLD to specify on which numa node to allocate scsi data
structures.  Thanks to Bart Van Assche for the suggestion.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/hosts.c     |   13 +++++++++++--
 include/scsi/scsi_host.h |    8 ++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 593085a..7d7ad8b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
  **/
 struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 {
+	return scsi_host_alloc_node(sht, privsize, -1);
+}
+EXPORT_SYMBOL(scsi_host_alloc);
+
+struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
+				       int privsize, int node)
+{
 	struct Scsi_Host *shost;
 	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (sht->unchecked_isa_dma && privsize)
 		gfp_mask |= __GFP_DMA;
 
-	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, gfp_mask);
+	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize,
+			     gfp_mask, node);
 	if (!shost)
 		return NULL;
 
+	shost->numa_node = node;
 	shost->host_lock = &shost->default_lock;
 	spin_lock_init(shost->host_lock);
 	shost->shost_state = SHOST_CREATED;
@@ -443,7 +452,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	kfree(shost);
 	return NULL;
 }
-EXPORT_SYMBOL(scsi_host_alloc);
+EXPORT_SYMBOL(scsi_host_alloc_node);
 
 struct Scsi_Host *scsi_register(struct scsi_host_template *sht, int privsize)
 {
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 4908480..a1b5c8e 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -733,6 +733,12 @@ struct Scsi_Host {
 	struct device *dma_dev;
 
 	/*
+	 * Numa node this device is closest to, used for allocating
+	 * data structures locally.
+	 */
+	int numa_node;
+
+	/*
 	 * We should ensure that this is aligned, both for better performance
 	 * and also because some compilers (m68k) don't automatically force
 	 * alignment to a long boundary.
@@ -776,6 +782,8 @@ extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
 extern struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *,
+					      int, int);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
 					       struct device *,
 					       struct device *);
-- 
1.7.1



* [patch,v2 02/10] scsi: make __scsi_alloc_queue numa-aware
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 01/10] scsi: add scsi_host_alloc_node Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 03/10] scsi: make scsi_alloc_sdev numa-aware Jeff Moyer
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley

Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/scsi_lib.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..8662a09 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
+	q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
 	if (!q)
 		return NULL;
 
-- 
1.7.1



* [patch,v2 03/10] scsi: make scsi_alloc_sdev numa-aware
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 01/10] scsi: add scsi_host_alloc_node Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 02/10] scsi: make __scsi_alloc_queue numa-aware Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-02 21:45 ` [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node Jeff Moyer
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley

Use the numa node id set in the Scsi_Host to allocate the sdev structure
on the device-local numa node.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/scsi_scan.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3e58b22..f10a308 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -232,8 +232,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	extern void scsi_evt_thread(struct work_struct *work);
 	extern void scsi_requeue_run_queue(struct work_struct *work);
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_ATOMIC);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_ATOMIC, shost->numa_node);
 	if (!sdev)
 		goto out;
 
-- 
1.7.1



* [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (2 preceding siblings ...)
  2012-11-02 21:45 ` [patch,v2 03/10] scsi: make scsi_alloc_sdev numa-aware Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-03 16:36   ` Bart Van Assche
  2012-11-02 21:45 ` [patch,v2 05/10] sd: use alloc_disk_node Jeff Moyer
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/scsi.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..4db6973 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
  * NULL on failure
  */
 static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+			int node)
 {
 	struct scsi_cmnd *cmd;
 
-	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+	cmd = kmem_cache_alloc_node(pool->cmd_slab,
+				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
+				    node);
 	if (!cmd)
 		return NULL;
 
-	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-					     gfp_mask | pool->gfp_mask);
+	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+					gfp_mask | pool->gfp_mask | __GFP_ZERO,
+					node);
 	if (!cmd->sense_buffer) {
 		kmem_cache_free(pool->cmd_slab, cmd);
 		return NULL;
@@ -223,7 +227,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
 {
 	struct scsi_cmnd *cmd;
 
-	cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+	cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+				      shost->numa_node);
 	if (!cmd)
 		return NULL;
 
@@ -435,7 +440,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
 	if (!pool)
 		return NULL;
 
-	return scsi_pool_alloc_command(pool, gfp_mask);
+	return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(scsi_allocate_command);
 
-- 
1.7.1



* [patch,v2 05/10] sd: use alloc_disk_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (3 preceding siblings ...)
  2012-11-02 21:45 ` [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-03 16:37   ` Bart Van Assche
  2012-11-02 21:45 ` [patch,v2 06/10] ata: use scsi_host_alloc_node Jeff Moyer
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/sd.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..8deb915 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
 	if (!sdkp)
 		goto out;
 
-	gd = alloc_disk(SD_MINORS);
+	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
 	if (!gd)
 		goto out_free;
 
-- 
1.7.1



* [patch,v2 06/10] ata: use scsi_host_alloc_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (4 preceding siblings ...)
  2012-11-02 21:45 ` [patch,v2 05/10] sd: use alloc_disk_node Jeff Moyer
@ 2012-11-02 21:45 ` Jeff Moyer
  2012-11-02 21:46 ` [patch,v2 07/10] megaraid_sas: " Jeff Moyer
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:45 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, Jeff Garzik, linux-ide


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/ata/libata-scsi.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..9d5dd09 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3586,7 +3586,8 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
 		struct Scsi_Host *shost;
 
 		rc = -ENOMEM;
-		shost = scsi_host_alloc(sht, sizeof(struct ata_port *));
+		shost = scsi_host_alloc_node(sht, sizeof(struct ata_port *),
+					     dev_to_node(host->dev));
 		if (!shost)
 			goto err_alloc;
 
-- 
1.7.1



* [patch,v2 07/10] megaraid_sas: use scsi_host_alloc_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (5 preceding siblings ...)
  2012-11-02 21:45 ` [patch,v2 06/10] ata: use scsi_host_alloc_node Jeff Moyer
@ 2012-11-02 21:46 ` Jeff Moyer
  2012-11-02 21:46 ` [patch,v2 08/10] mpt2sas: " Jeff Moyer
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:46 UTC (permalink / raw)
  To: linux-kernel, linux-scsi
  Cc: Bart Van Assche, Neela Syam Kolli, James E.J. Bottomley


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/megaraid/megaraid_sas_base.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (megasas_set_dma_mask(pdev))
 		goto fail_set_dma_mask;
 
-	host = scsi_host_alloc(&megasas_template,
-			       sizeof(struct megasas_instance));
+	host = scsi_host_alloc_node(&megasas_template,
+				    sizeof(struct megasas_instance),
+				    dev_to_node(&pdev->dev));
 
 	if (!host) {
 		printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
-- 
1.7.1



* [patch,v2 08/10] mpt2sas: use scsi_host_alloc_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (6 preceding siblings ...)
  2012-11-02 21:46 ` [patch,v2 07/10] megaraid_sas: " Jeff Moyer
@ 2012-11-02 21:46 ` Jeff Moyer
  2012-11-02 21:46 ` [patch,v2 09/10] lpfc: " Jeff Moyer
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:46 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, James E.J. Bottomley


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/mpt2sas/mpt2sas_scsih.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index af4e6c4..a4d6b36 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -8011,8 +8011,8 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct MPT2SAS_ADAPTER *ioc;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(&scsih_driver_template,
-	    sizeof(struct MPT2SAS_ADAPTER));
+	shost = scsi_host_alloc_node(&scsih_driver_template,
+	    sizeof(struct MPT2SAS_ADAPTER), dev_to_node(&pdev->dev));
 	if (!shost)
 		return -ENODEV;
 
-- 
1.7.1



* [patch,v2 09/10] lpfc: use scsi_host_alloc_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (7 preceding siblings ...)
  2012-11-02 21:46 ` [patch,v2 08/10] mpt2sas: " Jeff Moyer
@ 2012-11-02 21:46 ` Jeff Moyer
  2012-11-02 21:46 ` [patch,v2 10/10] cciss: use blk_init_queue_node Jeff Moyer
  2012-11-06 15:41 ` [patch,v2 00/10] make I/O path allocations more numa-friendly Elliott, Robert (Server Storage)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:46 UTC (permalink / raw)
  To: linux-kernel, linux-scsi
  Cc: Bart Van Assche, James Smart, James E.J. Bottomley

Acked-By: James Smart  <james.smart@emulex.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/scsi/lpfc/lpfc_init.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
 	int error = 0;
 
 	if (dev != &phba->pcidev->dev)
-		shost = scsi_host_alloc(&lpfc_vport_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_vport_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
 	else
-		shost = scsi_host_alloc(&lpfc_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
 	if (!shost)
 		goto out;
 
-- 
1.7.1



* [patch,v2 10/10] cciss: use blk_init_queue_node
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (8 preceding siblings ...)
  2012-11-02 21:46 ` [patch,v2 09/10] lpfc: " Jeff Moyer
@ 2012-11-02 21:46 ` Jeff Moyer
  2012-11-06 15:41 ` [patch,v2 00/10] make I/O path allocations more numa-friendly Elliott, Robert (Server Storage)
  10 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-02 21:46 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: Bart Van Assche, Mike Miller, iss_storagedev


Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/block/cciss.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b0f553b..5fe5546 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1930,7 +1930,8 @@ static void cciss_get_serial_no(ctlr_info_t *h, int logvol,
 static int cciss_add_disk(ctlr_info_t *h, struct gendisk *disk,
 				int drv_index)
 {
-	disk->queue = blk_init_queue(do_cciss_request, &h->lock);
+	disk->queue = blk_init_queue_node(do_cciss_request, &h->lock,
+					  dev_to_node(&h->dev));
 	if (!disk->queue)
 		goto init_queue_failure;
 	sprintf(disk->disk_name, "cciss/c%dd%d", h->ctlr, drv_index);
-- 
1.7.1



* Re: [patch,v2 01/10] scsi: add scsi_host_alloc_node
  2012-11-02 21:45 ` [patch,v2 01/10] scsi: add scsi_host_alloc_node Jeff Moyer
@ 2012-11-03 16:35   ` Bart Van Assche
  2012-11-05 14:06     ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-03 16:35 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 593085a..7d7ad8b 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
>    **/
>   struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>   {
> +	return scsi_host_alloc_node(sht, privsize, -1);

Using NUMA_NO_NODE here might improve readability.

> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 4908480..a1b5c8e 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -733,6 +733,12 @@ struct Scsi_Host {
>   	struct device *dma_dev;
>
>   	/*
> +	 * Numa node this device is closest to, used for allocating
> +	 * data structures locally.
> +	 */
> +	int numa_node;

Have you considered using #ifdef CONFIG_NUMA / #endif here? I've 
noticed that all other numa_node members in structures under include/ 
have this.
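
A sketch of what that could look like, loosely modeled on how struct
device guards its numa_node member (the shost_numa_node() accessor is
made up here for illustration):

	/* in struct Scsi_Host */
#ifdef CONFIG_NUMA
	/*
	 * Numa node this device is closest to, used for allocating
	 * data structures locally.
	 */
	int numa_node;
#endif

/* in include/scsi/scsi_host.h, outside the structure definition */
static inline int shost_numa_node(const struct Scsi_Host *shost)
{
#ifdef CONFIG_NUMA
	return shost->numa_node;
#else
	return NUMA_NO_NODE;
#endif
}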

Bart.



* Re: [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
  2012-11-02 21:45 ` [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node Jeff Moyer
@ 2012-11-03 16:36   ` Bart Van Assche
  2012-11-05 14:09     ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-03 16:36 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 2936b44..4db6973 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
>    * NULL on failure
>    */
>   static struct scsi_cmnd *
> -scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
> +scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
> +			int node)
>   {
>   	struct scsi_cmnd *cmd;
>
> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
> +				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +				    node);
>   	if (!cmd)
>   		return NULL;
>
> -	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
> -					     gfp_mask | pool->gfp_mask);
> +	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
> +					gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +					node);

It's not clear to me why __GFP_ZERO is added to the allocation flags?

Bart.


* Re: [patch,v2 05/10] sd: use alloc_disk_node
  2012-11-02 21:45 ` [patch,v2 05/10] sd: use alloc_disk_node Jeff Moyer
@ 2012-11-03 16:37   ` Bart Van Assche
  2012-11-05 14:12     ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-03 16:37 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

On 11/02/12 22:45, Jeff Moyer wrote:
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> ---
>   drivers/scsi/sd.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 12f6fdf..8deb915 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>   	if (!sdkp)
>   		goto out;
>
> -	gd = alloc_disk(SD_MINORS);
> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>   	if (!gd)
>   		goto out_free;

shost->numa_node can be a different NUMA node than dev_to_node(dev). Have 
you considered using shost->numa_node here?

Bart.



* Re: [patch,v2 01/10] scsi: add scsi_host_alloc_node
  2012-11-03 16:35   ` Bart Van Assche
@ 2012-11-05 14:06     ` Jeff Moyer
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-05 14:06 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

Bart Van Assche <bvanassche@acm.org> writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
>> index 593085a..7d7ad8b 100644
>> --- a/drivers/scsi/hosts.c
>> +++ b/drivers/scsi/hosts.c
>> @@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
>>    **/
>>   struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>>   {
>> +	return scsi_host_alloc_node(sht, privsize, -1);
>
> Using NUMA_NO_NODE here might improve readability.

Agreed, I'll fix that.

>> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
>> index 4908480..a1b5c8e 100644
>> --- a/include/scsi/scsi_host.h
>> +++ b/include/scsi/scsi_host.h
>> @@ -733,6 +733,12 @@ struct Scsi_Host {
>>   	struct device *dma_dev;
>>
>>   	/*
>> +	 * Numa node this device is closest to, used for allocating
>> +	 * data structures locally.
>> +	 */
>> +	int numa_node;
>
> Have you considered using #ifdef CONFIG_NUMA / #endif here ? I've
> noticed that all other numa_node members in structures under include/
> have this.

That was an oversight, thanks for pointing it out.  I'll fix it up.

Cheers,
Jeff


* Re: [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
  2012-11-03 16:36   ` Bart Van Assche
@ 2012-11-05 14:09     ` Jeff Moyer
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-05 14:09 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

Bart Van Assche <bvanassche@acm.org> writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>> index 2936b44..4db6973 100644
>> --- a/drivers/scsi/scsi.c
>> +++ b/drivers/scsi/scsi.c
>> @@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
>>    * NULL on failure
>>    */
>>   static struct scsi_cmnd *
>> -scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
>> +scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
>> +			int node)
>>   {
>>   	struct scsi_cmnd *cmd;
>>
>> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
>> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
>> +				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
>> +				    node);
>>   	if (!cmd)
>>   		return NULL;
>>
>> -	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
>> -					     gfp_mask | pool->gfp_mask);
>> +	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
>> +					gfp_mask | pool->gfp_mask | __GFP_ZERO,
>> +					node);
>
> It's not clear to me why __GFP_ZERO is added to the allocation flags ?

Hmm, seems I thought this was another case of kmem_cache_zalloc.  I'll
fix it up.
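
A sketch of what the fixed-up hunk could look like, keeping the zeroing
only for the command itself (which the old kmem_cache_zalloc() call
provided) and leaving the sense buffer unzeroed as before:

	cmd = kmem_cache_alloc_node(pool->cmd_slab,
				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
				    node);
	if (!cmd)
		return NULL;

	/* the sense buffer was never zeroed here, so no __GFP_ZERO */
	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
						  gfp_mask | pool->gfp_mask,
						  node);
	if (!cmd->sense_buffer) {
		kmem_cache_free(pool->cmd_slab, cmd);
		return NULL;
	}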

Cheers,
Jeff


* Re: [patch,v2 05/10] sd: use alloc_disk_node
  2012-11-03 16:37   ` Bart Van Assche
@ 2012-11-05 14:12     ` Jeff Moyer
  2012-11-05 14:57       ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-05 14:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

Bart Van Assche <bvanassche@acm.org> writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>> ---
>>   drivers/scsi/sd.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 12f6fdf..8deb915 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>>   	if (!sdkp)
>>   		goto out;
>>
>> -	gd = alloc_disk(SD_MINORS);
>> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>>   	if (!gd)
>>   		goto out_free;
>
> shost->numa_node can be another NUMA node than dev_to_node(dev). Have
> you considered using shost->numa_node here ?

It can?  How?

Just so I'm clear, you're suggesting I use the scsi_device's host
pointer to get to the Scsi_Host, and that *will* be filled in at this
point, right?

Cheers,
Jeff


* Re: [patch,v2 05/10] sd: use alloc_disk_node
  2012-11-05 14:12     ` Jeff Moyer
@ 2012-11-05 14:57       ` Bart Van Assche
  2012-11-05 15:32         ` taco
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-05 14:57 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, linux-scsi, James E.J. Bottomley

On 11/05/12 15:12, Jeff Moyer wrote:
> Bart Van Assche <bvanassche@acm.org> writes:
>
>> On 11/02/12 22:45, Jeff Moyer wrote:
>>> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>>> ---
>>>    drivers/scsi/sd.c |    2 +-
>>>    1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>> index 12f6fdf..8deb915 100644
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>>>    	if (!sdkp)
>>>    		goto out;
>>>
>>> -	gd = alloc_disk(SD_MINORS);
>>> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>>>    	if (!gd)
>>>    		goto out_free;
>>
>> shost->numa_node can be another NUMA node than dev_to_node(dev). Have
>> you considered using shost->numa_node here ?
>
> It can?  How?

E.g. if the LLD allows the user to specify the value of numa_node and 
passes that value to scsi_host_alloc_node() (see also 
http://lkml.org/lkml/2012/10/23/477 for further information).

> Just so I'm clear, you're suggesting I use the scsi_device's host
> pointer to get to the Scsi_Host, and that *will* be filled in that this
> point, right?

As far as I can see the sdev->host pointer is set in scsi_alloc_sdev() 
and that happens before sd_probe() is invoked.
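
In that case the hunk could take the node from the host instead; a rough
sketch, reusing the scsi_device pointer that sd_probe() already derives
from its device argument:

	struct scsi_device *sdp = to_scsi_device(dev);
	/* ... */
	gd = alloc_disk_node(SD_MINORS, sdp->host->numa_node);
	if (!gd)
		goto out_free;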

Bart.



* Re: [patch,v2 05/10] sd: use alloc_disk_node
  2012-11-05 14:57       ` Bart Van Assche
@ 2012-11-05 15:32         ` taco
  0 siblings, 0 replies; 26+ messages in thread
From: taco @ 2012-11-05 15:32 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jeff Moyer, linux-kernel, linux-scsi, James E.J. Bottomley

On 11/05/2012 10:57 PM, Bart Van Assche wrote:
> On 11/05/12 15:12, Jeff Moyer wrote:
>> Bart Van Assche <bvanassche@acm.org> writes:
>>
>>> On 11/02/12 22:45, Jeff Moyer wrote:
>>>> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>>>> ---
>>>>    drivers/scsi/sd.c |    2 +-
>>>>    1 files changed, 1 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>>> index 12f6fdf..8deb915 100644
>>>> --- a/drivers/scsi/sd.c
>>>> +++ b/drivers/scsi/sd.c
>>>> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>>>>        if (!sdkp)
>>>>            goto out;
>>>>
>>>> -    gd = alloc_disk(SD_MINORS);
>>>> +    gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>>>>        if (!gd)
>>>>            goto out_free;
>>>
>>> shost->numa_node can be another NUMA node than dev_to_node(dev). Have
>>> you considered using shost->numa_node here ?
>>
>> It can?  How?
>
> E.g. if the LLD allows the user to specify the value of numa_node and 
> passes that value to scsi_host_alloc_node() (see also 
> http://lkml.org/lkml/2012/10/23/477 for further information).
>
>> Just so I'm clear, you're suggesting I use the scsi_device's host
>> pointer to get to the Scsi_Host, and that *will* be filled in that this
>> point, right?
>
> As far as I can see the sdev->host pointer is set in scsi_alloc_sdev() 
> and that happens before sd_probe() is invoked.
>
Yes, struct scsi_device is created before sd; sd is the top layer.
> Bart.
>




* RE: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
                   ` (9 preceding siblings ...)
  2012-11-02 21:46 ` [patch,v2 10/10] cciss: use blk_init_queue_node Jeff Moyer
@ 2012-11-06 15:41 ` Elliott, Robert (Server Storage)
  2012-11-06 19:12   ` Bart Van Assche
  10 siblings, 1 reply; 26+ messages in thread
From: Elliott, Robert (Server Storage) @ 2012-11-06 15:41 UTC (permalink / raw)
  To: linux-scsi

It's certainly better to tie them all to one node than let them be randomly scattered across nodes; your 6% observation may simply be from that.

How do you think these compare, though (for structures that are per-IO)?
- tying the structures to the node hosting the storage device 
- tying the structures to the node running the application 

The latter means that PCI Express traffic must spend more time winding its way through the CPU complex. For example, the Memory Writes to the OQ and to deliver the MSI-X interrupt take longer to reach the destination CPU memory, snooping the other CPUs along the way. Once there, though, application reads should be faster.

We're trying to design the SCSI Express standards (SOP and PQI) to be non-uniform memory and non-uniform I/O friendly.  Some concepts we've included:
- one driver thread per CPU core
- each driver thread processes IOs from application threads on that CPU core
- each driver thread has its own inbound queue (IQ) for command submission
- each driver thread has its own outbound queue (OQ) for status reception
- each OQ has its own MSI-X interrupt that is directed to that CPU core

This should work best if the application threads also run on the right CPU cores.  Most OSes seem to lack a way for an application to determine that its IOs will be heading to an I/O device on another node, and to request (but not demand) that its threads run on that closer node.  Thread affinities seem to be treated as hard requirements rather than suggestions, which causes all applications doing IOs to converge on that poor node and leave the others unused.  There's a tradeoff between the extra latency vs. the extra CPU processing power and memory bandwidth.
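
As a rough illustration of the last bullet, a driver with one MSI-X
vector per outbound queue could steer each vector toward its core along
these lines (a sketch only; the queue structure is invented for the
example and this is not SOP/PQI code):

#include <linux/interrupt.h>
#include <linux/cpumask.h>

struct example_queue_pair {
	unsigned int cpu;	/* core this IQ/OQ pair serves */
	int irq;		/* MSI-X vector assigned to the OQ */
};

/* Hint (not force) that each OQ interrupt should land on its own core. */
static void example_set_oq_affinity(struct example_queue_pair *qp,
				    int nr_queues)
{
	int i;

	for (i = 0; i < nr_queues; i++)
		irq_set_affinity_hint(qp[i].irq, cpumask_of(qp[i].cpu));
}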




* Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-06 15:41 ` [patch,v2 00/10] make I/O path allocations more numa-friendly Elliott, Robert (Server Storage)
@ 2012-11-06 19:12   ` Bart Van Assche
  2012-11-09 20:46     ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-06 19:12 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage); +Cc: linux-scsi, Jeff Moyer

On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
> It's certainly better to tie them all to one node then let them be
 > randomly scattered across nodes; your 6% observation may simply be
 > from that.
>
> How do you think these compare, though (for structures that are per-IO)?
> - tying the structures to the node hosting the storage device
> - tying the structures to the node running the application
>
> The latter means that PCI Express traffic must spend more time winding
 > its way through the CPU complex. For example, the Memory Writes to the
 > OQ and to deliver the MSI-X interrupt take longer to reach the 
destination
 > CPU memory, snooping the other CPUs along the way. Once there, though,
 > application reads should be faster.
>
> We're trying to design the SCSI Express standards (SOP and PQI) to be
 > non-uniform memory and non-uniform I/O friendly.  Some concepts we've 
included:
> - one driver thread per CPU core
> - each driver thread processes IOs from application threads on that CPU core
> - each driver thread has its own inbound queue (IQ) for command submission
> - each driver thread has its own outbound queue (OQ) for status reception
> - each OQ has its own MSI-X interrupt that is directed to that CPU core
>
> This should work best if the application threads also run on the right
 > CPU cores.  Most OSes seem to lack a way for an application to determine
 > that its IOs will be heading to an I/O device on another node, and to
 > request (but not demand) that its threads run on that closer node.
 > Thread affinities seem to be treated as hard requirements rather than
 > suggestions, which causes all applications doing IOs to converge on that
 > poor node and leave the others unused.  There's a tradeoff between the
 > extra latency vs. the extra CPU processing power and memory bandwidth.

The first five patches in this series already provide an infrastructure 
that allows tying the data structures needed for I/O to the node 
running the application. That can be realized by passing the proper NUMA 
node to scsi_host_alloc_node(). The only part that is missing is a user 
interface for specifying that node. If anyone could come up with a 
proposal for adding such a user interface without having to reimplement 
it in every LLD, that would be great.
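
One possible shape for such a knob, sketched here as a module parameter
with made-up names (putting the attribute in the midlayer instead would
avoid duplicating this in every LLD):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/numa.h>

static int numa_node_override = NUMA_NO_NODE;
module_param(numa_node_override, int, 0444);
MODULE_PARM_DESC(numa_node_override,
		 "Node for host data structures (-1 = node of the PCI device)");

static int example_pick_node(struct pci_dev *pdev)
{
	if (numa_node_override != NUMA_NO_NODE)
		return numa_node_override;
	return dev_to_node(&pdev->dev);
}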

Bart.



* Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-06 19:12   ` Bart Van Assche
@ 2012-11-09 20:46     ` Jeff Moyer
  2012-11-10  8:56       ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-09 20:46 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Elliott, Robert (Server Storage), linux-scsi

Bart Van Assche <bvanassche@acm.org> writes:

> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>> It's certainly better to tie them all to one node then let them be
>> randomly scattered across nodes; your 6% observation may simply be
>> from that.
>>
>> How do you think these compare, though (for structures that are per-IO)?
>> - tying the structures to the node hosting the storage device
>> - tying the structures to the node running the application

This is a great question, thanks for asking it!  I went ahead and
modified the megaraid_sas driver to take a module parameter that
specifies on which node to allocate the scsi_host data structure (and
all other structures on top that are tied to that).  I then booted the
system 4 times, specifying a different node each time.  Here are the
results as compared to a vanilla kernel:

data structures tied to node 0

application tied to:
node 0:  +6% +/-1%
node 1:  +9% +/-2%
node 2:  +10% +/-3%
node 3:  +0% +/-4%

The first number is the percent gain (or loss) w.r.t. the vanilla
kernel.  The second number is the standard deviation as a percent of the
bandwidth.  So, when data structures are tied to node 0, we see an
increase in performance for nodes 0-2.  However, on node 3, which is the
node the megaraid_sas controller is attached to, we see no gain in
performance, and we see an increase in the run to run variation.  The
standard deviation for the vanilla kernel was 1% across all nodes.

Given that the results are mixed, depending on which node the workload
is running, I can't really draw any conclusions from this.  The node 3
number is really throwing me for a loop.  If it were positive, I'd do
some handwaving about all data structures getting allocated on node 0
at boot, and the addition of getting the scsi_cmnd structure on the same
node is what resulted in the net gain.

data structures tied to node 1

application tied to:
node 0:  +6% +/-1%
node 1:  +0% +/-2%
node 2:  +0% +/-6%
node 3:  -7% +/-13%

Now this is interesting!  Tying data structures to node 1 results in a
performance boost for node 0?  That would seem to validate your question
of whether it just helps out to have everything come from the same node,
as opposed to allocated close to the storage controller.  However, node
3 sees a decrease in performance, and a huge standard deviation.  Node 2
also sees an increased standard deviation.  That leaves me wondering why
node 0 didn't also experience an increase....

data structures tied to node 2

application tied to:
node 0:  +5% +/-3%
node 1:  +0% +/-5%
node 2:  +0% +/-4%
node 3:  +0% +/-5%

Here, we *mostly* just see an increase in standard deviation, with no
appreciable change in application performance.

data structures tied to node 3

application tied to:
node 0:  +0% +/-6%
node 1:  +6% +/-4%
node 2:  +7% +/-4%
node 3:  +0% +/-4%

Now, this is the case where I'd expect to see the best performance,
since the HBA is on node 3.  However, that's not what we get!  Instead,
we get maybe a couple percent improvement on nodes 1 and 2, and an
increased run-to-run variation for nodes 0 and 3.

Overall, I'd say that my testing is inconclusive, and I may just pull
the patch set until I can get some reasonable results.


And now to address your question from a completely theoretical point of
view (since empirical data has only succeeded in baffling me).  You have
to keep in mind that some of these data structures are long-lived.
Things like the Scsi_Host and request_queue will be around as long as
the device is present (and the module is not removed).  So, it doesn't
make sense to try to allocate these data structures on the node running
the application, unless you are pinning the application to a single node
that is not the node hosting the storage (which would be weird).  So, I
think it does make sense to pin these data structures to a single node,
that node being the one closest to the storage.  We do have to keep in
mind that there are architectures for which there could be multiple
nodes equidistant to the storage.

>> The latter means that PCI Express traffic must spend more time winding
>> its way through the CPU complex. For example, the Memory Writes to the
>> OQ and to deliver the MSI-X interrupt take longer to reach the 
> destination
>> CPU memory, snooping the other CPUs along the way. Once there, though,
>> application reads should be faster.

I'm using direct I/O in my testing, which means that the DMA is going to
whatever node the memory allocation (for application buffers) was
satisfied.  For buffered I/O, you're going to end up dma-ing from the
page cache, and that will also likely come from the node on which the
application was running at the time of the read/write.  So, what I'm
getting at is you're very likely to have a split between the data being
transferred and the data structures used to manage the transfer.

>> We're trying to design the SCSI Express standards (SOP and PQI) to be
>> non-uniform memory and non-uniform I/O friendly.  Some concepts
>> we've 
> included:
>> - one driver thread per CPU core

This sounds like a bad idea to me.  We already have a huge proliferation
of kernel threads, and this will only make that problem worse.  Do you
really need (for example) 4096 kernel threads for a single driver?

>> - each driver thread processes IOs from application threads on that CPU core
>> - each driver thread has its own inbound queue (IQ) for command submission
>> - each driver thread has its own outbound queue (OQ) for status reception
>> - each OQ has its own MSI-X interrupt that is directed to that CPU core
>>
>> This should work best if the application threads also run on the right
>> CPU cores.  Most OSes seem to lack a way for an application to determine
>> that its IOs will be heading to an I/O device on another node, and to
>> request (but not demand) that its threads run on that closer node.

Right now that tuning has to be done manually.  There are sysfs files
that will tell the admin on which node a particular adapter is located
(and it is this very information that I have leveraged in this patch
set).  Then, users can run the application under numactl using
--cpunodebind.  I do believe libnuma has recently added support for
detecting where I/O adapters live, as well.  However, mapping from an
application's use of files all the way down to the HBA is not the
easiest thing on the planet, especially once you add in stacking drivers
like dm or md.

>> Thread affinities seem to be treated as hard requirements rather than
>> suggestions, which causes all applications doing IOs to converge on that
>> poor node and leave the others unused.  There's a tradeoff between the
>> extra latency vs. the extra CPU processing power and memory bandwidth.

I have heard others request a soft cpu affinity mechanism.  I don't know
if any progress is being made on that front, though.  Best to ask the
scheduler folks, I think.  Pie in the sky, it sounds like what you're
asking for is some scheduler awareness of the fact that applications are
doing I/O, and have it somehow schedule processes close to the devices
that are being used.  Is that right?  That would be cool....

> The first five patches in this series already provide an
> infrastructure that allows to tie the data structures needed for I/O
> to the node running the application. That can be realized by passing
> the proper NUMA node to scsi_host_alloc_node(). The only part that is
> missing is a user interface for specifying that node. If anyone could
> come up with a proposal for adding such a user interface without
> having to reimplement it in every LLD that would be great.

I guess that would have to live in the SCSI midlayer somewhere, right?
However, as I mentioned above, I think doing the right thing
automatically is what Robert is getting at.  We'd need some input from
scheduler folks to make progress there.  And this wouldn't just apply to
block I/O, btw.  I could see the networking folks being interested as
well.

Cheers,
Jeff


* Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-09 20:46     ` Jeff Moyer
@ 2012-11-10  8:56       ` Bart Van Assche
  2012-11-12 21:26         ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2012-11-10  8:56 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Robert Elliott, linux-scsi

On 11/09/12 21:46, Jeff Moyer wrote:
>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>>> It's certainly better to tie them all to one node then let them be
>>> randomly scattered across nodes; your 6% observation may simply be
>>> from that.
>>>
>>> How do you think these compare, though (for structures that are per-IO)?
>>> - tying the structures to the node hosting the storage device
>>> - tying the structures to the node running the application
>
> This is a great question, thanks for asking it!  I went ahead and
> modified the megaraid_sas driver to take a module parameter that
> specifies on which node to allocate the scsi_host data structure (and
> all other structures on top that are tied to that).  I then booted the
> system 4 times, specifying a different node each time.  Here are the
> results as compared to a vanilla kernel:
>
> data structures tied to node 0
>
> application tied to:
> node 0:  +6% +/-1%
> node 1:  +9% +/-2%
> node 2:  +10% +/-3%
> node 3:  +0% +/-4%
>
> The first number is the percent gain (or loss) w.r.t. the vanilla
> kernel.  The second number is the standard deviation as a percent of the
> bandwidth.  So, when data structures are tied to node 0, we see an
> increase in performance for nodes 0-3.  However, on node 3, which is the
> node the megaraid_sas controller is attached to, we see no gain in
> performance, and we see an increase in the run to run variation.  The
> standard deviation for the vanilla kernel was 1% across all nodes.
>
> Given that the results are mixed, depending on which node the workload
> is running, I can't really draw any conclusions from this.  The node 3
> number is really throwing me for a loop.  If it were positive, I'd do
> some handwaving about all data structures getting allocated one node 0
> at boot, and the addition of getting the scsi_cmnd structure on the same
> node is what resulted in the net gain.
>
> data structures tied to node 1
>
> application tied to:
> node 0:  +6% +/-1%
> node 1:  +0% +/-2%
> node 2:  +0% +/-6%
> node 3:  -7% +/-13%
>
> Now this is interesting!  Tying data structures to node 1 results in a
> performance boost for node 0?  That would seem to validate your question
> of whether it just helps out to have everything come from the same node,
> as opposed to allocated close to the storage controller.  However, node
> 3 sees a decrease in performance, and a huge standard devation.  Node 2
> also sees an increased standard deviation.  That leaves me wondering why
> node 0 didn't also experience an increase....
>
> data structures tied to node 2
>
> application tied to:
> node 0:  +5% +/-3%
> node 1:  +0% +/-5%
> node 2:  +0% +/-4%
> node 3:  +0% +/-5%
>
> Here, we *mostly* just see an increase in standard deviation, with no
> appreciable change in application performance.
>
> data structures tied to node 3
>
> application tied to:
> node 0:  +0% +/-6%
> node 1:  +6% +/-4%
> node 2:  +7% +/-4%
> node 3:  +0% +/-4%
>
> Now, this is the case where I'd expect to see the best performance,
> since the HBA is on node 3.  However, that's not what we get!  Instead,
> we get maybe a couple percent improvement on nodes 1 and 2, and an
> increased run-to-run variation for nodes 0 and 3.
>
> Overall, I'd say that my testing is inconclusive, and I may just pull
> the patch set until I can get some reasonable results.

Which NUMA node was processing the megaraid_sas interrupts in these 
tests? Was irqbalance running during these tests or were interrupts 
manually pinned to a specific CPU core?

Thanks,

Bart.


* Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-10  8:56       ` Bart Van Assche
@ 2012-11-12 21:26         ` Jeff Moyer
  2012-11-13  1:26           ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Moyer @ 2012-11-12 21:26 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Robert Elliott, linux-scsi

Bart Van Assche <bvanassche@acm.org> writes:

> On 11/09/12 21:46, Jeff Moyer wrote:
>>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>>>> It's certainly better to tie them all to one node then let them be
>>>> randomly scattered across nodes; your 6% observation may simply be
>>>> from that.
>>>>
>>>> How do you think these compare, though (for structures that are per-IO)?
>>>> - tying the structures to the node hosting the storage device
>>>> - tying the structures to the node running the application
>>
>> This is a great question, thanks for asking it!  I went ahead and
>> modified the megaraid_sas driver to take a module parameter that
>> specifies on which node to allocate the scsi_host data structure (and
>> all other structures on top that are tied to that).  I then booted the
>> system 4 times, specifying a different node each time.  Here are the
>> results as compared to a vanilla kernel:
>>
[snip]
> Which NUMA node was processing the megaraid_sas interrupts in these
> tests ? Was irqbalance running during these tests or were interrupts
> manually pinned to a specific CPU core ?

irqbalance was indeed running, so I can't say for sure what node the
irq was pinned to during my tests (I didn't record that information).

I re-ran the tests, this time turning off irqbalance (well, I set it to
one-shot), and pinning the irq to the node running the benchmark.
In this configuration, I saw no regressions in performance.

As a reminder:

>> The first number is the percent gain (or loss) w.r.t. the vanilla
>> kernel.  The second number is the standard deviation as a percent of the
>> bandwidth.  So, when data structures are tied to node 0, we see an
>> increase in performance for nodes 0-3.  However, on node 3, which is the
>> node the megaraid_sas controller is attached to, we see no gain in
>> performance, and we see an increase in the run to run variation.  The
>> standard deviation for the vanilla kernel was 1% across all nodes.

Here are the updated numbers:

data structures tied to node 0

application tied to:
node 0:  0 +/-4%
node 1:  9 +/-1%
node 2: 10 +/-2%
node 3:  0 +/-2%

data structures tied to node 1

application tied to:
node 0:  5 +/-2%
node 1:  6 +/-8%
node 2: 10 +/-1%
node 3:  0 +/-3%

data structures tied to node 2

application tied to:
node 0:  6 +/-2%
node 1:  9 +/-2%
node 2:  7 +/-6%
node 3:  0 +/-3%

data structures tied to node 3

application tied to:
node 0:  0 +/-4%
node 1: 10 +/-2%
node 2: 11 +/-1%
node 3:  0 +/-5%

Now, the above is apples to oranges, since the vanilla kernel was run
w/o any tuning of irqs.  So, I went ahead and booted with
numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.

When we compare a vanilla kernel with and without irq binding, we get
this:

node 0:  0 +/-3%
node 1:  9 +/-1%
node 2:  8 +/-3%
node 3:  0 +/-1%

As you can see, binding irqs helps nodes 1 and 2 quite substantially.
What this boils down to, when you compare a patched kernel with the
vanilla kernel, where they are both tying irqs to the node hosting the
application, is a net gain of zero, but an increase in standard
deviation.

Let me try to make that more readable.  The patch set does not appear
to help at all with my benchmark configuration.  ;-)  One other
conclusion I can draw from this data is that irqbalance could do a
better job.

An interesting (to me) tidbit about this hardware is that, while it has
4 numa nodes, it only has 2 sockets.  Based on the numbers above, I'd
guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.

Cheers,
Jeff


* RE: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-12 21:26         ` Jeff Moyer
@ 2012-11-13  1:26           ` Elliott, Robert (Server Storage)
  2012-11-13 15:44             ` Jeff Moyer
  0 siblings, 1 reply; 26+ messages in thread
From: Elliott, Robert (Server Storage) @ 2012-11-13  1:26 UTC (permalink / raw)
  To: Jeff Moyer, Bart Van Assche; +Cc: linux-scsi

What do these commands report about the NUMA and non-uniform IO topology on the test system?
	numactl --hardware
	lspci -t


> -----Original Message-----
> From: Jeff Moyer [mailto:jmoyer@redhat.com]
> Sent: Monday, 12 November, 2012 3:27 PM
> To: Bart Van Assche
> Cc: Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org
> Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
> 
> Bart Van Assche <bvanassche@acm.org> writes:
> 
> > On 11/09/12 21:46, Jeff Moyer wrote:
> >>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
> >>>> It's certainly better to tie them all to one node then let them be
> >>>> randomly scattered across nodes; your 6% observation may simply be
> >>>> from that.
> >>>>
> >>>> How do you think these compare, though (for structures that are per-IO)?
> >>>> - tying the structures to the node hosting the storage device
> >>>> - tying the structures to the node running the application
> >>
> >> This is a great question, thanks for asking it!  I went ahead and
> >> modified the megaraid_sas driver to take a module parameter that
> >> specifies on which node to allocate the scsi_host data structure (and
> >> all other structures on top that are tied to that).  I then booted the
> >> system 4 times, specifying a different node each time.  Here are the
> >> results as compared to a vanilla kernel:
> >>
> [snip]
> > Which NUMA node was processing the megaraid_sas interrupts in these
> > tests ? Was irqbalance running during these tests or were interrupts
> > manually pinned to a specific CPU core ?
> 
> irqbalanced was indeed running, so I can't say for sure what node the
> irq was pinned to during my tests (I didn't record that information).
> 
> I re-ran the tests, this time turning off irqbalance (well, I set it to
> one-shot), and then pinning the irq to the node running the benchmark.
> In this configuration, I saw no regressions in performance.
> 
> As a reminder:
> 
> >> The first number is the percent gain (or loss) w.r.t. the vanilla
> >> kernel.  The second number is the standard deviation as a percent of the
> >> bandwidth.  So, when data structures are tied to node 0, we see an
> >> increase in performance for nodes 0-3.  However, on node 3, which is the
> >> node the megaraid_sas controller is attached to, we see no gain in
> >> performance, and we see an increase in the run to run variation.  The
> >> standard deviation for the vanilla kernel was 1% across all nodes.
> 
> Here are the updated numbers:
> 
> data structures tied to node 0
> 
> application tied to:
> node 0:  0 +/-4%
> node 1:  9 +/-1%
> node 2: 10 +/-2%
> node 3:  0 +/-2%
> 
> data structures tied to node 1
> 
> application tied to:
> node 0:  5 +/-2%
> node 1:  6 +/-8%
> node 2: 10 +/-1%
> node 3:  0 +/-3%
> 
> data structures tied to node 2
> 
> application tied to:
> node 0:  6 +/-2%
> node 1:  9 +/-2%
> node 2:  7 +/-6%
> node 3:  0 +/-3%
> 
> data structures tied to node 3
> 
> application tied to:
> node 0:  0 +/-4%
> node 1: 10 +/-2%
> node 2: 11 +/-1%
> node 3:  0 +/-5%
> 
> Now, the above is apples to oranges, since the vanilla kernel was run
> w/o any tuning of irqs.  So, I went ahead and booted with
> numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.
> 
> When we compare a vanilla kernel with and without irq binding, we get
> this:
> 
> node 0:  0 +/-3%
> node 1:  9 +/-1%
> node 2:  8 +/-3%
> node 3:  0 +/-1%
> 
> As you can see, binding irqs helps nodes 1 and 2 quite substantially.
> What this boils down to, when you compare a patched kernel with the
> vanilla kernel, where they are both tying irqs to the node hosting the
> application, is a net gain of zero, but an increase in standard
> deviation.
> 
> Let me try to make that more readable.  The patch set does not appear
> to help at all with my benchmark configuration.  ;-)  One other
> conclusion I can draw from this data is that irqbalance could do a
> better job.
> 
> An interesting (to me) tidbit about this hardware is that, while it has
> 4 numa nodes, it only has 2 sockets.  Based on the numbers above, I'd
> guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.
> 
> Cheers,
> Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
  2012-11-13  1:26           ` Elliott, Robert (Server Storage)
@ 2012-11-13 15:44             ` Jeff Moyer
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Moyer @ 2012-11-13 15:44 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage); +Cc: Bart Van Assche, linux-scsi

"Elliott, Robert (Server Storage)" <Elliott@hp.com> writes:

> What do these commands report about the NUMA and non-uniform IO topology on the test system?

This is a DELL PowerEdge R715.  See chapter 7 of this document for
details on how the I/O bridges are connected:
  http://www.dell.com/downloads/global/products/pedge/en/Poweredge-r715-technicalguide.pdf

> 	numactl --hardware

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6
node 0 size: 8182 MB
node 0 free: 7856 MB
node 1 cpus: 8 10 12 14
node 1 size: 8192 MB
node 1 free: 8008 MB
node 2 cpus: 9 11 13 15
node 2 size: 8192 MB
node 2 free: 7994 MB
node 3 cpus: 1 3 5 7
node 3 size: 8192 MB
node 3 free: 7982 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 
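
(As an aside, the "application tied to node N" runs earlier in the thread
amount to binding the benchmark against these cpu lists; the usual way to
do that is something like

# numactl --cpunodebind=1 --membind=1 fio <jobfile>

with the job file being whatever workload you want pinned.)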

> 	lspci -t

# lspci -vt
-+-[0000:20]-+-00.0  ATI Technologies Inc RD890 Northbridge only dual slot (2x8) PCI-e GFX Hydra part
 |           +-02.0-[21]--
 |           +-03.0-[22]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator]
 |           \-0b.0-[23]--+-00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
 |                        \-00.1  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
 \-[0000:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port A)
             +-02.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             +-03.0-[02]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             +-04.0-[03-08]----00.0-[04-08]--+-00.0-[05]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
             +-09.0-[09]--
             +-12.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
             +-12.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
             +-12.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
             +-13.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
             +-13.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
             +-13.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
             +-14.0  ATI Technologies Inc SBx00 SMBus Controller
             +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
             +-14.4-[0a]----03.0  Matrox Graphics, Inc. MGA G200eW WPCM450
             +-18.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-18.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-18.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-18.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-18.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-19.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-19.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-19.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-19.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-19.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-1a.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-1a.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-1a.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-1a.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-1a.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-1b.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-1b.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-1b.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-1b.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             \-1b.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control

# cat /sys/bus/pci/devices/0000\:20\:03.0/0000\:22\:00.0/numa_node 
3
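
(For completeness, the same sysfs hierarchy also exposes the adapter's
local cpu list via

# cat /sys/bus/pci/devices/0000\:22\:00.0/local_cpulist

which should match node 3's cpus from the numactl output above.)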

-Jeff

>
>
>> -----Original Message-----
>> From: Jeff Moyer [mailto:jmoyer@redhat.com]
>> Sent: Monday, 12 November, 2012 3:27 PM
>> To: Bart Van Assche
>> Cc: Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org
>> Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
>> 
>> Bart Van Assche <bvanassche@acm.org> writes:
>> 
>> > On 11/09/12 21:46, Jeff Moyer wrote:
>> >>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>> >>>> It's certainly better to tie them all to one node then let them be
>> >>>> randomly scattered across nodes; your 6% observation may simply be
>> >>>> from that.
>> >>>>
>> >>>> How do you think these compare, though (for structures that are per-IO)?
>> >>>> - tying the structures to the node hosting the storage device
>> >>>> - tying the structures to the node running the application
>> >>
>> >> This is a great question, thanks for asking it!  I went ahead and
>> >> modified the megaraid_sas driver to take a module parameter that
>> >> specifies on which node to allocate the scsi_host data structure (and
>> >> all other structures on top that are tied to that).  I then booted the
>> >> system 4 times, specifying a different node each time.  Here are the
>> >> results as compared to a vanilla kernel:
>> >>
>> [snip]
>> > Which NUMA node was processing the megaraid_sas interrupts in these
>> > tests ? Was irqbalance running during these tests or were interrupts
>> > manually pinned to a specific CPU core ?
>> 
>> irqbalanced was indeed running, so I can't say for sure what node the
>> irq was pinned to during my tests (I didn't record that information).
>> 
>> I re-ran the tests, this time turning off irqbalance (well, I set it to
>> one-shot), and then pinning the irq to the node running the benchmark.
>> In this configuration, I saw no regressions in performance.
>> 
>> As a reminder:
>> 
>> >> The first number is the percent gain (or loss) w.r.t. the vanilla
>> >> kernel.  The second number is the standard deviation as a percent of the
>> >> bandwidth.  So, when data structures are tied to node 0, we see an
>> >> increase in performance for nodes 0-3.  However, on node 3, which is the
>> >> node the megaraid_sas controller is attached to, we see no gain in
>> >> performance, and we see an increase in the run to run variation.  The
>> >> standard deviation for the vanilla kernel was 1% across all nodes.
>> 
>> Here are the updated numbers:
>> 
>> data structures tied to node 0
>> 
>> application tied to:
>> node 0:  0 +/-4%
>> node 1:  9 +/-1%
>> node 2: 10 +/-2%
>> node 3:  0 +/-2%
>> 
>> data structures tied to node 1
>> 
>> application tied to:
>> node 0:  5 +/-2%
>> node 1:  6 +/-8%
>> node 2: 10 +/-1%
>> node 3:  0 +/-3%
>> 
>> data structures tied to node 2
>> 
>> application tied to:
>> node 0:  6 +/-2%
>> node 1:  9 +/-2%
>> node 2:  7 +/-6%
>> node 3:  0 +/-3%
>> 
>> data structures tied to node 3
>> 
>> application tied to:
>> node 0:  0 +/-4%
>> node 1: 10 +/-2%
>> node 2: 11 +/-1%
>> node 3:  0 +/-5%
>> 
>> Now, the above is apples to oranges, since the vanilla kernel was run
>> w/o any tuning of irqs.  So, I went ahead and booted with
>> numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.
>> 
>> When we compare a vanilla kernel with and without irq binding, we get
>> this:
>> 
>> node 0:  0 +/-3%
>> node 1:  9 +/-1%
>> node 2:  8 +/-3%
>> node 3:  0 +/-1%
>> 
>> As you can see, binding irqs helps nodes 1 and 2 quite substantially.
>> What this boils down to, when you compare a patched kernel with the
>> vanilla kernel, where they are both tying irqs to the node hosting the
>> application, is a net gain of zero, but an increase in standard
>> deviation.
>> 
>> Let me try to make that more readable.  The patch set does not appear
>> to help at all with my benchmark configuration.  ;-)  One other
>> conclusion I can draw from this data is that irqbalance could do a
>> better job.
>> 
>> An interesting (to me) tidbit about this hardware is that, while it has
>> 4 numa nodes, it only has 2 sockets.  Based on the numbers above, I'd
>> guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.
>> 
>> Cheers,
>> Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2012-11-02 21:45 [patch,v2 00/10] make I/O path allocations more numa-friendly Jeff Moyer
2012-11-02 21:45 ` [patch,v2 01/10] scsi: add scsi_host_alloc_node Jeff Moyer
2012-11-03 16:35   ` Bart Van Assche
2012-11-05 14:06     ` Jeff Moyer
2012-11-02 21:45 ` [patch,v2 02/10] scsi: make __scsi_alloc_queue numa-aware Jeff Moyer
2012-11-02 21:45 ` [patch,v2 03/10] scsi: make scsi_alloc_sdev numa-aware Jeff Moyer
2012-11-02 21:45 ` [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node Jeff Moyer
2012-11-03 16:36   ` Bart Van Assche
2012-11-05 14:09     ` Jeff Moyer
2012-11-02 21:45 ` [patch,v2 05/10] sd: use alloc_disk_node Jeff Moyer
2012-11-03 16:37   ` Bart Van Assche
2012-11-05 14:12     ` Jeff Moyer
2012-11-05 14:57       ` Bart Van Assche
2012-11-05 15:32         ` taco
2012-11-02 21:45 ` [patch,v2 06/10] ata: use scsi_host_alloc_node Jeff Moyer
2012-11-02 21:46 ` [patch,v2 07/10] megaraid_sas: " Jeff Moyer
2012-11-02 21:46 ` [patch,v2 08/10] mpt2sas: " Jeff Moyer
2012-11-02 21:46 ` [patch,v2 09/10] lpfc: " Jeff Moyer
2012-11-02 21:46 ` [patch,v2 10/10] cciss: use blk_init_queue_node Jeff Moyer
2012-11-06 15:41 ` [patch,v2 00/10] make I/O path allocations more numa-friendly Elliott, Robert (Server Storage)
2012-11-06 19:12   ` Bart Van Assche
2012-11-09 20:46     ` Jeff Moyer
2012-11-10  8:56       ` Bart Van Assche
2012-11-12 21:26         ` Jeff Moyer
2012-11-13  1:26           ` Elliott, Robert (Server Storage)
2012-11-13 15:44             ` Jeff Moyer
