* [patch/rfc/rft] sd: allocate request_queue on device's local numa node
@ 2012-10-22 19:01 Jeff Moyer
  2012-10-22 19:19 ` Jens Axboe
  2012-10-23  6:45 ` Bart Van Assche
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff Moyer @ 2012-10-22 19:01 UTC (permalink / raw)
  To: axboe, linux-kernel, SCSI Mailing List

Hi,

All of the infrastructure is available to allocate a request_queue on a
particular numa node, but it isn't being utilized at all.  Wire up the
sd driver to allocate the request_queue on the HBA's local numa node.

This is a request for comments and testing (I've built and booted it,
nothing more).  I believe that this should be a performance win, but I
have no numbers to back it up as yet.  Suggestions for workloads to test
are welcome.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..7986483 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
+	q = blk_init_queue_node(request_fn, NULL,
+				dev_to_node(&shost->shost_dev));
 	if (!q)
 		return NULL;
 


* Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
  2012-10-22 19:01 [patch/rfc/rft] sd: allocate request_queue on device's local numa node Jeff Moyer
@ 2012-10-22 19:19 ` Jens Axboe
  2012-10-23  6:45 ` Bart Van Assche
  1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2012-10-22 19:19 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, SCSI Mailing List

On 2012-10-22 21:01, Jeff Moyer wrote:
> Hi,
> 
> All of the infrastructure is available to allocate a request_queue on a
> particular numa node, but it isn't being utilized at all.  Wire up the
> sd driver to allocate the request_queue on the HBA's local numa node.
> 
> This is a request for comments and testing (I've built and booted it,
> nothing more).  I believe that this should be a performance win, but I
> have no numbers to back it up as yet.  Suggestions for workloads to test
> are welcome.

It would seem pointless _not_ to do it, if we have the info. A fio
microbenchmark against scsi_debug with the delay disabled should show it
easily. Combine with perf as needed to verify.


-- 
Jens Axboe



* Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
  2012-10-22 19:01 [patch/rfc/rft] sd: allocate request_queue on device's local numa node Jeff Moyer
  2012-10-22 19:19 ` Jens Axboe
@ 2012-10-23  6:45 ` Bart Van Assche
  2012-10-23 16:52   ` Jeff Moyer
  1 sibling, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2012-10-23  6:45 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: axboe, linux-kernel, SCSI Mailing List

On 10/22/12 21:01, Jeff Moyer wrote:
> All of the infrastructure is available to allocate a request_queue on a
> particular numa node, but it isn't being utilized at all.  Wire up the
> sd driver to allocate the request_queue on the HBA's local numa node.
>
> This is a request for comments and testing (I've built and booted it,
> nothing more).  I believe that this should be a performance win, but I
> have no numbers to back it up as yet.  Suggestions for workloads to test
> are welcome.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..7986483 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>   	struct request_queue *q;
>   	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> +	q = blk_init_queue_node(request_fn, NULL,
> +				dev_to_node(&shost->shost_dev));
>   	if (!q)
>   		return NULL;

Are you sure this approach will always result in the queue being
allocated on the same NUMA node as the HCA? If, for example, a user
triggers LUN scanning via sysfs, the above code may be invoked on a NUMA
node other than the one to which the HCA is connected. Also, if you look
at scsi_request_fn() or scsi_device_unbusy(), you will see that in
order to avoid inter-node traffic it's important to allocate the sdev
and shost data structures on the same NUMA node. How about the following
approach?
- Add a variant of scsi_host_alloc() that allows specifying the NUMA
   node on which to allocate the shost structure and that also stores
   the identity of that node in the shost structure.
- Modify __scsi_alloc_queue() such that it allocates the sdev structure
   on the same NUMA node as the shost structure.
- Modify the SCSI LLD of your choice such that it uses the new
   scsi_host_alloc() call. Depending on what is appropriate, the NUMA
   node on which to allocate the shost could be specified by the user or
   could be the NUMA node of the HCA controlled by the SCSI LLD
   (see e.g. /sys/devices/pci*/*/numa_node). Please keep in mind that a
   single PCIe bus may have a minimal distance to more than one NUMA
   node. See e.g. the diagram at the top of page 8 in
   http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
   for a system diagram of a NUMA system where each PCIe bus has a
   minimal distance to two different NUMA nodes.
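
The first two steps of this proposal can be modelled in userspace as a
rough sketch. Everything below is hypothetical: the toy_* structures and
functions stand in for struct Scsi_Host, struct scsi_device and a
node-aware scsi_host_alloc() variant that does not exist in this thread,
and the allocator only records the node hint instead of placing memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for struct Scsi_Host and struct scsi_device. */
struct toy_shost {
	int numa_node;		/* node the host structure was requested on */
};

struct toy_sdev {
	struct toy_shost *host;
	int numa_node;		/* node the device structure was requested on */
};

/*
 * Hypothetical node-aware allocator. Kernel code would use something
 * like kzalloc_node(); this userspace model ignores the node hint.
 */
void *toy_alloc_node(size_t size, int node)
{
	(void)node;
	return calloc(1, size);
}

/* Hypothetical scsi_host_alloc() variant that takes and stores a node. */
struct toy_shost *toy_host_alloc_node(int node)
{
	struct toy_shost *shost = toy_alloc_node(sizeof(*shost), node);

	if (shost)
		shost->numa_node = node;
	return shost;
}

/* Allocate a device on the node recorded in its host. */
struct toy_sdev *toy_sdev_alloc(struct toy_shost *shost)
{
	struct toy_sdev *sdev;

	assert(shost != NULL);
	sdev = toy_alloc_node(sizeof(*sdev), shost->numa_node);
	if (sdev) {
		sdev->host = shost;
		sdev->numa_node = shost->numa_node;
	}
	return sdev;
}
```

The point of storing the node in the host is that later allocations no
longer depend on which CPU happens to run the scan.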

Bart.


* Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
  2012-10-23  6:45 ` Bart Van Assche
@ 2012-10-23 16:52   ` Jeff Moyer
  2012-10-23 17:42     ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Moyer @ 2012-10-23 16:52 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe, linux-kernel, SCSI Mailing List

Bart Van Assche <bvanassche@acm.org> writes:

> On 10/22/12 21:01, Jeff Moyer wrote:
>> All of the infrastructure is available to allocate a request_queue on a
>> particular numa node, but it isn't being utilized at all.  Wire up the
>> sd driver to allocate the request_queue on the HBA's local numa node.
>>
>> This is a request for comments and testing (I've built and booted it,
>> nothing more).  I believe that this should be a performance win, but I
>> have no numbers to back it up as yet.  Suggestions for workloads to test
>> are welcome.
>>
>> Cheers,
>> Jeff
>>
>> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index da36a3a..7986483 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>>   	struct request_queue *q;
>>   	struct device *dev = shost->dma_dev;
>>
>> -	q = blk_init_queue(request_fn, NULL);
>> +	q = blk_init_queue_node(request_fn, NULL,
>> +				dev_to_node(&shost->shost_dev));
>>   	if (!q)
>>   		return NULL;
>
> Are you sure this approach will always result in the queue being
> allocated on the same NUMA node as the HCA ? If e.g. a user triggers
> LUN scanning via sysfs the above code may be invoked on another NUMA
> node than the node to which the HCA is connected.

shost->shost_dev should inherit the numa node from the pci bus to which
it is attached.  So long as that works, there should be no concern over
which numa node the probe code is running on.
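
As a rough illustration of the inheritance described here, the sketch
below models a device chain in userspace. The toy_device type and its
helpers are hypothetical and only mirror the idea that a child device
without a node of its own takes its parent's node; it is an assumption,
not a statement about what the driver core does in any given kernel.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_NODE (-1)	/* stands in for "no NUMA affinity known" */

/* Toy stand-in for struct device: only the fields needed here. */
struct toy_device {
	struct toy_device *parent;
	int numa_node;
};

/* Initialise a device with no parent and no NUMA affinity. */
void toy_device_init(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->parent = NULL;
	dev->numa_node = TOY_NO_NODE;
}

/*
 * Register dev under parent. A device that has no node of its own
 * takes over the parent's node; an explicitly set node is kept.
 */
void toy_device_add(struct toy_device *dev, struct toy_device *parent)
{
	assert(dev != NULL);
	dev->parent = parent;
	if (parent && dev->numa_node == TOY_NO_NODE)
		dev->numa_node = parent->numa_node;
}
```

With a PCI device on node 1 at the top of the chain, every descendant
added this way reports node 1, regardless of where the adding code runs.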

> Also, if you have a look at e.g. scsi_request_fn() or
> scsi_device_unbusy() you will see that in order to avoid inter-node
> traffic it's important to allocate the sdev and shost data structures
> on the same NUMA node.

Yes, good point.

> How about the following approach ?
> - Add a variant of scsi_host_alloc() that allows to specify on which
>   NUMA node to allocate the shost structure and also that stores the
>   identity of that node in the shost structure.
> - Modify __scsi_alloc_queue() such that it allocates the sdev structure
>   on the same NUMA node as the shost structure.
> - Modify the SCSI LLD of your choice such that it uses the new
>   scsi_host_alloc() call. According to what is appropriate the NUMA node
>   on which to allocate the shost could be specified by the user or could
>   be identical to the NUMA node of the HCA controlled by the SCSI LLD
>   (see e.g. /sys/devices/pci*/*/numa_node). Please keep in mind that a
>   single PCIe bus may have a minimal distance to more than one NUMA
>   node. See e.g. the diagram at the top of page 8 in
>
> http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
>   for a system diagram of a NUMA system where each PCIe bus has a
>   minimal distance to two different NUMA nodes.

That's an interesting configuration.  I wonder what the numa_node sysfs
file contains for such systems--do you know?  I'm not sure how we could
allow this to be user-controlled at probe time.  Did you have a specific
mechanism in mind?  Module parameters?  Something else?

Thanks for your input, Bart.

Cheers,
Jeff


* Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
  2012-10-23 16:52   ` Jeff Moyer
@ 2012-10-23 17:42     ` Bart Van Assche
  2012-10-23 17:58       ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2012-10-23 17:42 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: axboe, linux-kernel, SCSI Mailing List

On 10/23/12 18:52, Jeff Moyer wrote:
> Bart Van Assche <bvanassche@acm.org> writes:
>> Please keep in mind that a
>> single PCIe bus may have a minimal distance to more than one NUMA
>> node. See e.g. the diagram at the top of page 8 in
>> http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
>> for a system diagram of a NUMA system where each PCIe bus has a
>> minimal distance to two different NUMA nodes.
>
> That's an interesting configuration.  I wonder what the numa_node sysfs
> file contains for such systems--do you know?  I'm not sure how we could
> allow this to be user-controlled at probe time.  Did you have a specific
> mechanism in mind?  Module parameters?  Something else?

As far as I can see in drivers/pci/pci-sysfs.c the numa_node sysfs 
attribute contains a single number, even for a topology like the one 
described above.
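
For anyone who wants to check this from userspace, a minimal sketch of
reading such an attribute follows. The helper name read_numa_node() is
made up for this example, and the only assumption about the file is what
is stated above: it holds a single decimal number.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a numa_node-style sysfs file: one signed decimal integer.
 * Returns 0 and stores the value in *node on success, -1 on failure.
 */
int read_numa_node(const char *path, int *node)
{
	FILE *f;
	int ok;

	assert(path != NULL && node != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", node) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass the numa_node path of the PCI device of interest. A
value of -1 in the file means that no node affinity is recorded for the
device.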

With regard to user control of the numa node: I'm not sure how to solve 
this in general. But for the ib_srp driver this should be easy to do: 
SCSI host creation is triggered by sending a login string to a sysfs 
attribute ("add_target"). It wouldn't take much time to add a parameter 
to that login string that specifies the NUMA node.
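
A userspace sketch of what parsing such a parameter could look like is
given below. The "numa_node=" key and the parse_numa_node_opt() helper
are hypothetical; the sketch only assumes that the login string is a
comma-separated list of key=value pairs, and it is not taken from the
ib_srp source.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look for "numa_node=<n>" in a comma-separated key=value string.
 * Returns 0 and stores n in *node if the key is present with a
 * non-negative decimal value, otherwise returns -1.
 * The "numa_node" key is hypothetical.
 */
int parse_numa_node_opt(const char *login, int *node)
{
	static const char key[] = "numa_node=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = login;

	assert(login != NULL && node != NULL);
	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len > keylen && strncmp(p, key, keylen) == 0) {
			char *stop;
			long v = strtol(p + keylen, &stop, 10);

			/* The number must fill the rest of this token. */
			if (stop != p + len || v < 0 || v > INT_MAX)
				return -1;
			*node = (int)v;
			return 0;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return -1;
}
```

Returning -1 when the key is absent lets the caller fall back to
whatever default node it would otherwise use.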

Bart.



* Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
  2012-10-23 17:42     ` Bart Van Assche
@ 2012-10-23 17:58       ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2012-10-23 17:58 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jeff Moyer, linux-kernel, SCSI Mailing List

On 2012-10-23 19:42, Bart Van Assche wrote:
> On 10/23/12 18:52, Jeff Moyer wrote:
>> Bart Van Assche <bvanassche@acm.org> writes:
>>> Please keep in mind that a
>>> single PCIe bus may have a minimal distance to more than one NUMA
>>> node. See e.g. the diagram at the top of page 8 in
>>> http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
>>> for a system diagram of a NUMA system where each PCIe bus has a
>>> minimal distance to two different NUMA nodes.
>>
>> That's an interesting configuration.  I wonder what the numa_node sysfs
>> file contains for such systems--do you know?  I'm not sure how we could
>> allow this to be user-controlled at probe time.  Did you have a specific
>> mechanism in mind?  Module parameters?  Something else?
> 
> As far as I can see in drivers/pci/pci-sysfs.c the numa_node sysfs 
> attribute contains a single number, even for a topology like the one 
> described above.

This is an artifact of how ACPI works: it's not possible to have it be a
mask of nodes. But obviously that is how most Intel-based systems from
the last few years work, so the kernel parts should be updated to at
least allow it to be a mask. How to get this information is a separate
problem.
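
To make the "mask of nodes" idea concrete, here is a small userspace
sketch. It assumes a per-device table of distances to each node, which
is a hypothetical input (where that information would come from is
exactly the open question), and the helper name nearest_node_mask() is
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the distance from a device to each of nr_nodes NUMA nodes,
 * return a bitmask with one bit set per node at the minimal distance.
 * This sketch supports at most 64 nodes.
 */
unsigned long long nearest_node_mask(const int *distance, size_t nr_nodes)
{
	unsigned long long mask = 0;
	int min;
	size_t i;

	assert(distance != NULL && nr_nodes > 0 && nr_nodes <= 64);
	min = distance[0];
	for (i = 1; i < nr_nodes; i++)
		if (distance[i] < min)
			min = distance[i];
	for (i = 0; i < nr_nodes; i++)
		if (distance[i] == min)
			mask |= 1ULL << i;
	return mask;
}
```

For a bus that is equally close to nodes 0 and 1 on a four-node system,
the result has bits 0 and 1 set, which a single integer attribute cannot
express.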

-- 
Jens Axboe

