* [MINI SUMMIT] SCSI core performance
From: Nicholas A. Bellinger @ 2012-07-18  2:39 UTC (permalink / raw)
  To: KS-2012-discuss
  Cc: Roland Dreier, Christoph Hellwig, Hannes Reinecke, Jens Axboe,
	James Bottomley, linux-scsi, target-devel

Hi KS-PCs,

I'd like to propose a SCSI performance mini-summit to see how interested
folks are in helping address the long-term issues that SCSI core is
currently facing wrt multi-lun per host and heavy small block random
I/O workloads.

I know this would probably be better suited for LSF (for the record it
was proposed this year) but now that we've acknowledged there is a
problem with SCSI LLDs vs. raw block drivers vs. other SCSI subsystems,
it would be useful to get the storage folks into a single room at some
point during KS/LPC to figure out what is actually going on with SCSI
core.

As mentioned in the recent tcm_vhost thread, there are a number of cases
where drivers/target/ code can demonstrate this limitation pretty
vividly now.

This includes the following scenarios using raw block flash export with
target_core_mod + target_core_iblock export and the same small block
(4k) mixed random I/O workload with fio:

*) tcm_loop local SCSI LLD performance is an order of magnitude slower 
   than the same local raw block flash backend.
*) tcm_qla2xxx performs better using MSFT Server hosts than Linux v3.x
   based hosts using 2x socket Nehalem hardware w/ PCI-e Gen2 HBAs
*) ib_srpt performs better using MSFT Server hosts than RHEL 6.x
   (2.6.32-based) hosts using 2x socket Romley hardware w/ PCI-e Gen3 HCAs
*) Raw block IBLOCK export into KVM guest v3.5-rc w/ virtio-scsi is 
   behind in performance vs. raw local block flash.  (cmwq on the host 
   is helping here, but we still need to compare with the MSFT SCSI mini-port)
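
For reference, the fio workload used in the scenarios above can be
sketched as a job file along these lines (the device path, queue depth,
job count, and read/write mix are my guesses, not the exact settings
used in the runs above):

```ini
; 4k mixed random I/O against a raw flash block device
; /dev/sdX, iodepth, numjobs, and rwmixread are illustrative only
[global]
ioengine=libaio
direct=1
rw=randrw
rwmixread=70
bs=4k
iodepth=32
numjobs=4
runtime=60
time_based
group_reporting

[flash-randrw]
filename=/dev/sdX
```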

Also, with 1M IOPS into a single VM guest now being done by other,
non-Linux based hypervisors, the race for high performance KVM SCSI
based storage is quickly heating up.

So all of that said, I'd like to at least have a discussion with the key
SCSI + block folks who will be present in San Diego on a path forward to
address these without having to wait until LSF-2013 + hope for a topic
slot to materialize then.

Thank you for your consideration,

--nab


* Re: [MINI SUMMIT] SCSI core performance
From: James Bottomley @ 2012-07-18  8:00 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: KS-2012-discuss, Roland Dreier, Christoph Hellwig,
	Hannes Reinecke, Jens Axboe, linux-scsi, target-devel

On Tue, 2012-07-17 at 19:39 -0700, Nicholas A. Bellinger wrote:
> Hi KS-PCs,
> 
> I'd like to propose a SCSI performance mini-summit to see how interested
> folks are in helping address the long-term issues that SCSI core is
> currently facing wrt multi-lun per host and heavy small block random
> I/O workloads.
> 
> I know this would probably be better suited for LSF (for the record it
> was proposed this year) but now that we've acknowledged there is a
> problem with SCSI LLDs vs. raw block drivers vs. other SCSI subsystems,
> it would be useful to get the storage folks into a single room at some
> point during KS/LPC to figure out what is actually going on with SCSI
> core.

You seem to have a short memory:  The last time it was discussed

http://marc.info/?t=134155373900003

It rapidly became apparent there isn't a problem.  Enabling high IOPS in
the SCSI stack is what I think you mean.

> As mentioned in the recent tcm_vhost thread, there are a number of cases
> where drivers/target/ code can demonstrate this limitation pretty
> vividly now.
> 
> This includes the following scenarios using raw block flash export with
> target_core_mod + target_core_iblock export and the same small block
> (4k) mixed random I/O workload with fio:
> 
> *) tcm_loop local SCSI LLD performance is an order of magnitude slower 
>    than the same local raw block flash backend.
> *) tcm_qla2xxx performs better using MSFT Server hosts than Linux v3.x
>    based hosts using 2x socket Nehalem hardware w/ PCI-e Gen2 HBAs
> *) ib_srpt performs better using MSFT Server hosts than RHEL 6.x
>    (2.6.32-based) hosts using 2x socket Romley hardware w/ PCI-e Gen3 HCAs
> *) Raw block IBLOCK export into KVM guest v3.5-rc w/ virtio-scsi is 
>    behind in performance vs. raw local block flash.  (cmwq on the host 
>    is helping here, but we still need to compare with the MSFT SCSI mini-port)
> 
> Also, with 1M IOPS into a single VM guest now being done by other,
> non-Linux based hypervisors, the race for high performance KVM SCSI
> based storage is quickly heating up.
> 
> So all of that said, I'd like to at least have a discussion with the key
> SCSI + block folks who will be present in San Diego on a path forward to
> address these without having to wait until LSF-2013 + hope for a topic
> slot to materialize then.
> 
> Thank you for your consideration,

Well, your proposal is devoid of an actual proposal.

Enabling high IOPS involves reducing locking overhead and path length
through the code.  I think most of the low hanging fruit in this area is
already picked, but if you have an idea, please say.  There might be
something we can extract from the lockless queue work Jens is doing, but
we need that to materialise first.
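
[As an illustration only, a toy model in Python; this has nothing to do
with the actual kernel code paths, but it shows the shape of the
host-lock problem: all submitters funnelled through one lock vs. a lock
per queue, with the same total work either way.]

```python
import threading

def submit(n_queues, n_threads=4, n_ops=5000):
    """Toy model: n_threads each submit n_ops 'commands', serialized
    either by one global lock (n_queues=1, like a single host lock)
    or by per-queue locks (n_queues=n_threads)."""
    locks = [threading.Lock() for _ in range(n_queues)]
    counts = [0] * n_queues

    def worker(tid):
        q = tid % n_queues          # each thread maps to one queue
        for _ in range(n_ops):
            with locks[q]:          # heavily contended when n_queues == 1
                counts[q] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)

# Same amount of work completes either way; only the serialization
# (and hence the achievable submission rate) differs.
print(submit(1))   # one global lock
print(submit(4))   # per-queue locks
```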

Without a concrete thing to discuss, shooting the breeze on high IOPS in
the SCSI stack is about as useful as discussing what happened in last
night's episode of Coronation Street, which, when it happens in my house,
always helps me see how incredibly urgent fixing the leaky tap I've been
putting off for months actually is.

If someone can come up with a proposal ... or even perhaps another path
trace showing where the reducible overhead and lock problems are we can
discuss it on the list and we might have a real topic by the time LSF
rolls around.

James




* Re: [MINI SUMMIT] SCSI core performance
From: Nicholas A. Bellinger @ 2012-07-18 22:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: KS-2012-discuss, Roland Dreier, Christoph Hellwig,
	Hannes Reinecke, Jens Axboe, linux-scsi, target-devel

On Wed, 2012-07-18 at 09:00 +0100, James Bottomley wrote:
> On Tue, 2012-07-17 at 19:39 -0700, Nicholas A. Bellinger wrote:
> > Hi KS-PCs,
> > 
> > I'd like to propose a SCSI performance mini-summit to see how interested
> > folks are in helping address the long-term issues that SCSI core is
> > currently facing wrt multi-lun per host and heavy small block random
> > I/O workloads.
> > 
> > I know this would probably be better suited for LSF (for the record it
> > was proposed this year) but now that we've acknowledged there is a
> > problem with SCSI LLDs vs. raw block drivers vs. other SCSI subsystems,
> > it would be useful to get the storage folks into a single room at some
> > point during KS/LPC to figure out what is actually going on with SCSI
> > core.
> 
> You seem to have a short memory:  The last time it was discussed
> 
> http://marc.info/?t=134155373900003
> 
> It rapidly became apparent there isn't a problem.  Enabling high IOPS in
> the SCSI stack is what I think you mean.
> 

small block random I/O == performance, that is correct.

The host-lock-less stuff is doing better these days for small-ish
multi-lun setups with large block sequential I/O workloads.

Doing ~1 GB/sec per LUN is achievable with multi-lun per host (say up to
6-8 LUNs depending on your setup) using PCI-e Gen3 hardware.
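
[Back-of-envelope check on those numbers; the ~7.9 GB/sec Gen3 x8
ceiling is my assumed figure for the slot, not something measured in
these runs. 6-8 LUNs at ~1 GB/sec each lands right around what a single
PCI-e Gen3 x8 link can move:]

```python
# Rough aggregate-bandwidth arithmetic; the x8 ceiling is an assumption.
GBPS_PER_LUN = 1.0
PCIE_GEN3_X8_GBPS = 7.9   # 8 GT/s * 8 lanes * 128b/130b encoding ~= 7.9 GB/s

for luns in (6, 7, 8):
    agg = luns * GBPS_PER_LUN
    print(f"{luns} LUNs -> ~{agg:.1f} GB/s "
          f"({agg / PCIE_GEN3_X8_GBPS:.0%} of a Gen3 x8 link)")
```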

> > As mentioned in the recent tcm_vhost thread, there are a number of cases
> > where drivers/target/ code can demonstrate this limitation pretty
> > vividly now.
> > 
> > This includes the following scenarios using raw block flash export with
> > target_core_mod + target_core_iblock export and the same small block
> > (4k) mixed random I/O workload with fio:
> > 
> > *) tcm_loop local SCSI LLD performance is an order of magnitude slower 
> >    than the same local raw block flash backend.
> > *) tcm_qla2xxx performs better using MSFT Server hosts than Linux v3.x
> >    based hosts using 2x socket Nehalem hardware w/ PCI-e Gen2 HBAs
> > *) ib_srpt performs better using MSFT Server hosts than RHEL 6.x
> >    (2.6.32-based) hosts using 2x socket Romley hardware w/ PCI-e Gen3 HCAs
> > *) Raw block IBLOCK export into KVM guest v3.5-rc w/ virtio-scsi is 
> >    behind in performance vs. raw local block flash.  (cmwq on the host 
> >    is helping here, but we still need to compare with the MSFT SCSI mini-port)
> > 
> > Also, with 1M IOPS into a single VM guest now being done by other,
> > non-Linux based hypervisors, the race for high performance KVM SCSI
> > based storage is quickly heating up.
> > 
> > So all of that said, I'd like to at least have a discussion with the key
> > SCSI + block folks who will be present in San Diego on a path forward to
> > address these without having to wait until LSF-2013 + hope for a topic
> > slot to materialize then.
> > 
> > Thank you for your consideration,
> 
> Well, your proposal is devoid of an actual proposal.
> 

Huh..?  It's a proposal for a discussion to (hopefully) identify the
main culprit(s) and figure out an incremental way forward.

Since 1M IOPS machines aren't quite the norm (yet), the idea is to get
storage folks in the same room who do have access to 1M IOPS systems and
have an interest in making SCSI core go faster for random small block
I/O workloads.

This can be vendors / LLD maintainers who've run into similar
limitations with SCSI core, or folks who have an interest in KVM guest
SCSI performance.

> Enabling high IOPS involves reducing locking overhead and path length
> through the code.  I think most of the low hanging fruit in this area is
> already picked, but if you have an idea, please say.  There might be
> something we can extract from the lockless queue work Jens is doing, but
> we need that to materialise first.
> 

Would really like to hear from Jens here, but I don't know how much time
he is spending on the SCSI layer these days..

I've been more interested recently in working on a fabric that can
demonstrate this bottleneck with raw block flash into KVM guest <->
virtio-scsi, as I think it's an important vehicle for short-term
diagnosis.

> Without a concrete thing to discuss, shooting the breeze on high IOPS in
> the SCSI stack is about as useful as discussing what happened in last
> night's episode of Coronation Street which, when it happens in my house,
> always helps me see how incredibly urgent fixing the leaky tap I've been
> putting off for months actually is.
> 

Sorry, I've never heard of that show.  

> If someone can come up with a proposal ... or even perhaps another path
> trace showing where the reducible overhead and lock problems are we can
> discuss it on the list and we might have a real topic by the time LSF
> rolls around.
> 

So identifying the root culprit(s) is still a WIP at this point.

In the next weeks I'll be spending time back on 1M IOPS machines
with raw block flash + qla2xxx/srpt/vhost + Linux/MSFT SCSI clients, and
should be getting some more data-points by then.

Anyways, if it ends up taking until LSF, then it ends up at LSF.  I figured
since things are heating up for virtio-scsi that KS might be a good
venue for a discussion like this.

--nab

