* Proposal for a scalable SCSI midlayer
@ 2014-02-05 12:39 Christoph Hellwig
  2014-02-23 20:10 ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2014-02-05 12:39 UTC (permalink / raw)
  To: Jens Axboe, James Bottomley, Nicholas Bellinger; +Cc: linux-scsi

We've run into many issues where the SCSI layer simply does not scale to
keep up with today's hardware, be that in simple single-thread IOPs, or
in lock contention when using multiple LUNs or targets under a single
SCSI host.  This proposal tries to chart a path to fixing this properly,
and to avoid workarounds where various drivers that speak a SCSI command
set are implemented at the block layer because of these issues.

After the dramatic improvements that the scsi-mq prototype from
Nic Bellinger showed, it is clear that the block multiqueue
infrastructure will play a big role in this effort, but the effort goes
much further than that code base.

As an important goal of this project I want to replace the whole I/O
path in the SCSI midlayer, and not create largely parallel code paths
for small and fast devices.  We will have to find out whether this is
actually feasible for all cases, but I'd like to get as broad a set of
drivers as possible to use the new I/O path, and avoid API differences
if we have to keep the two paths around.

A specific non-goal is support for multiple hardware queues.  While
we will have to support this soon, the improvements from just using the
blk-mq code and fixing the obvious scalability issues in the SCSI midlayer
are large enough to deal with this as a first step, and postpone problems
related to queue synchronization into the near future.


1) Summary of the scalability issues

The biggest problem in the current SCSI midlayer is the old block layer
request model in general, with its large number of lock round trips on
the queue_lock for every request, and the large number of cache lines it
touches.

The way the old request code is used by the SCSI midlayer makes this even
worse by using the queue_lock to protect additional internal state, and
round tripping on a host-wide lock multiple times for each command.

Even when avoiding the host lock by replacing it with atomic counters we'd
run into multiple host or target-wide shared cache lines for each I/O
submission or completion.
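
As a rough illustration of that last point, here is a minimal user-space
sketch of per-command busy accounting with the host lock replaced by
atomic counters.  The struct and function names are hypothetical and the
limits are made up; it is not code from the midlayer or from any patch
in this series.  It only shows that each submission still writes three
shared counters (device, target, host), and each completion writes them
again, even though no lock is taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one level of busy accounting. */
struct busy_counter {
	atomic_int busy;	/* commands currently in flight */
	int limit;		/* maximum commands allowed in flight */
};

static void counter_init(struct busy_counter *c, int limit)
{
	atomic_init(&c->busy, 0);
	c->limit = limit;
}

/* Take one slot; back the increment out again if the limit was hit. */
static bool counter_get(struct busy_counter *c)
{
	if (atomic_fetch_add(&c->busy, 1) >= c->limit) {
		atomic_fetch_sub(&c->busy, 1);
		return false;
	}
	return true;
}

static void counter_put(struct busy_counter *c)
{
	int old = atomic_fetch_sub(&c->busy, 1);

	assert(old > 0);
	(void)old;	/* keep builds with NDEBUG free of warnings */
}

/*
 * Submission path: no lock, but three shared cache lines are written
 * for every command, and all CPUs submitting to this host write the
 * same host counter.
 */
static bool submit_accounting(struct busy_counter *host,
			      struct busy_counter *target,
			      struct busy_counter *device)
{
	if (!counter_get(device))
		return false;
	if (!counter_get(target)) {
		counter_put(device);
		return false;
	}
	if (!counter_get(host)) {
		counter_put(target);
		counter_put(device);
		return false;
	}
	return true;
}

/* Completion path: the same three shared cache lines are written again. */
static void complete_accounting(struct busy_counter *host,
				struct busy_counter *target,
				struct busy_counter *device)
{
	counter_put(host);
	counter_put(target);
	counter_put(device);
}
```

The counters are correct without a lock, but the cache lines holding
them still bounce between every CPU that issues I/O to the same host or
target, which is the cost described above.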


2) Suggested way forward

I would suggest attacking the problems from two sides:

a) fixing the easy-to-hit scalability issues in the SCSI layer in small
   patch sets where we can, even if they are overshadowed by the block
   layer ones.

b) gradually moving the whole SCSI layer to be backed by blk-mq.  This
   is a different approach from Nic's current scsi-mq tree, in that it
   keeps all the per-device/target/shost accounting and fairness code in
   the SCSI midlayer in place for now, and uses the same APIs to talk
   to the LLDDs.  While this is certain to get less stellar results than
   a hard cut, it will make a full move to the new infrastructure much
   easier, and avoid long-term maintenance of parallel code paths.
   Additional optimizations can and should be implemented on top of this
   baseline work.

3) Current status

I will send the first batch of patches implementing easy optimizations
in the SCSI midlayer after this RFC, together with a very early
prototype of the blk-mq work based on them, and performance numbers.
We'll need to work from there to make it generally usable, mostly by
adding missing features to the blk-mq core.

4) Major TODO items

 - add support for partial completions, as the SCSI drivers might
   complete only part of a request for a given I/O completion.

 - either make the blk-mq tag allocator usable on a per-host basis for
   those drivers that currently use host-wide tagging, or find a way
   that they can use their own per-host tagging without getting in the
   way of blk-mq.

 - implement BIDI support in blk-mq.  This is currently missing entirely
   and will be needed to support the OSD2 protocol, as well as a few
   SBC commands through sg_io.

 - fix the tag allocation for sequenced FLUSH commands.
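
To make the host-wide tagging item above concrete, here is a minimal
user-space sketch of a tag map shared by every device on one host.  All
names are hypothetical, the depth of 64 is an arbitrary assumption so
the map fits in one word, and this is neither the blk-mq allocator nor
any driver's.  It only shows the property such drivers rely on: a tag is
unique across the whole host, not just within one device queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define HOST_TAGS 64	/* assumed host queue depth, one bit per tag */

struct host_tags {
	_Atomic uint64_t map;	/* bit n set means tag n is in use */
};

static void host_tags_init(struct host_tags *t)
{
	atomic_init(&t->map, 0);
}

/* Index of the lowest clear bit, or -1 if every bit is set. */
static int first_zero_bit(uint64_t word)
{
	for (int i = 0; i < HOST_TAGS; i++)
		if (!(word & (UINT64_C(1) << i)))
			return i;
	return -1;
}

/* Returns a free tag in [0, HOST_TAGS), or -1 if all tags are in use. */
static int host_tag_get(struct host_tags *t)
{
	uint64_t old = atomic_load(&t->map);

	for (;;) {
		int tag = first_zero_bit(old);

		if (tag < 0)
			return -1;
		/* On failure the CAS reloads 'old' and we retry. */
		if (atomic_compare_exchange_weak(&t->map, &old,
						 old | (UINT64_C(1) << tag)))
			return tag;
	}
}

static void host_tag_put(struct host_tags *t, int tag)
{
	uint64_t bit, old;

	assert(tag >= 0 && tag < HOST_TAGS);
	bit = UINT64_C(1) << tag;
	old = atomic_fetch_and(&t->map, ~bit);
	assert(old & bit);	/* the tag must have been allocated */
	(void)old;
}
```

Note that a single shared word like this is exactly the kind of
host-wide cache line described in section 1, so a usable allocator would
need something smarter; the sketch only pins down the semantics.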



* Re: Proposal for a scalable SCSI midlayer
  2014-02-05 12:39 Proposal for a scalable SCSI midlayer Christoph Hellwig
@ 2014-02-23 20:10 ` James Bottomley
  2014-02-24  4:25   ` Christoph Hellwig
  2014-02-26 10:59   ` Bart Van Assche
  0 siblings, 2 replies; 5+ messages in thread
From: James Bottomley @ 2014-02-23 20:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Nicholas Bellinger, linux-scsi


On Wed, 2014-02-05 at 04:39 -0800, Christoph Hellwig wrote:
> We've run into many issues where the SCSI layer simply does not scale to
> keep up with today's hardware, be that in simple single-thread IOPs, or
> in lock contention when using multiple LUNs or targets under a single
> SCSI host.  This proposal tries to draw a path how to fix this properly
> and avoids workarounds where various drivers that speak a SCSI command
> set are implemented at the block layer because of these issues.
> 
> After the dramatic improvements that the scsi-mq prototype from
> Nic Bellinger showed it is clear that using the block multiqueue
> infrastructure will take a big role in this effort, but it goes much
> further than that code base.
> 
> As an important goal of this project I want to replace the whole I/O
> path in the SCSI midlayer, and not create largely parallel code paths
> for small and fast devices.  We will have to find if this is actually
> feasible for all cases, but I'd like to get an as broad as possible set
> of drivers to use the new I/O path, and avoid API differences if we have
> to keep the two paths around.

If we can do this, that would be great, because it cuts down on the
maintenance burden for all of us and gives some benefits at least to
non-MQ hardware.

> A specific non-goal is support for multiple hardware queues.  While
> we will have to support this soon, the improvements from just using the
> blk-mq code and fixing the obvious scalability issues in the SCSI midlayer
> are large enough to deal with this as a first step, and postpone problems
> related to queue synchronization into the near future.
> 
> 
> 1) Summary of the scalability issues
> 
> The biggest problem in the current SCSI midlayer is the old block layer
> request model in general, with its large amount of lock round trips on
> the queue_lock for every request, and a large amount of touched cache lines.

Agree with this.

> The way the old request code is used by the SCSI midlayer makes this even
> worse by using the queue_lock to protect additional internal state, and
> round tripping on a host-wide lock multiple times for each command.

That's largely a result of the above.  The premise was that if block
already induces heavy lock shuffling, then using the same locks to
protect internal state at the points where we've already acquired them
is essentially free ... obviously that hasn't quite worked out.

> Even when avoiding the host lock by replacing it with atomic counters we'd
> run into multiple host or target-wide shared cache lines for each I/O
> submission or completion.

Right ... my ideal here, if we can achieve it, would be lockless threaded
models, where we could make guarantees like a single thread of execution
per command, so all command state could be lockless.  Even a CPU
dedicated to a single device would give us lockless device state and the
necessary cache hotness (although this may not be feasible).

> 2) Suggested way forward
> 
> I would suggest to attack the problems from two sides:
> 
> a) fixing the easy to hit scalability issues in the SCSI layer where we
>    can, even if they are overshadowed by the block layer ones in small
>    patch sets.

Fine with this.  Anywhere we can obviously remove a lock or an atomic is
great with me.

> b) gradually moving the whole SCSI layer to be backed by blk-mq.  This
>    is a different approach from Nic's current scsi-mq tree, in that it
>    keeps all the per-device/target/shost accounting and fairness code in
>    the SCSI midlayer in place for now, and uses the same APIs to talk
>    to the LLDDs.  While this is certain to get less stellar results than
>    a hard cut, it will allow to do a full move to the new infrastructure
>    much easier, and avoid long term maintenance of parallel code paths.
>    Additional optimization can and should be implemented on top of this
>    baseline work.

Yes, but I would like to see a clearer picture of how this would be
achieved and what it would entail.

> 3) Current status
> 
> I will send the first batch of patches implementing easy optimizations
> in the SCSI midlayer after this RFC, as well as a very early prototype
> of the blk-mq work based on that, as well as performance numbers.  We'll
> need to work from there to improve it to be generally usable, mostly by
> adding missing features to the blk-mq core.
> 
> 4) Major TODO items
> 
>  - add support for partial completions, as the SCSI drivers might
>    complete only part of a request for a given I/O completion.

Agreed, it's required at least for bad sector handling.
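
A minimal sketch of what a partial completion means for the bad sector
case, assuming a request that tracks how many bytes have been finished.
The struct and function are hypothetical illustrations, not the block
layer or blk-mq API.  The device returns the good part of the transfer
first, and the remainder is completed (here, failed) separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: only the fields needed for the illustration. */
struct io_request {
	size_t total;	/* bytes covered by the request */
	size_t done;	/* bytes completed so far */
	int error;	/* first error seen (negative errno), 0 if none */
};

/*
 * Complete 'bytes' of the request with the given status.
 * Returns true if part of the request is still pending, false once
 * the whole request has been completed.
 */
static bool request_complete_bytes(struct io_request *rq, size_t bytes,
				   int error)
{
	assert(bytes <= rq->total - rq->done);

	rq->done += bytes;
	if (error && !rq->error)
		rq->error = error;

	return rq->done < rq->total;
}
```

For an eight-sector read that hits a bad sector after five good ones,
the first call completes 5 * 512 bytes with status 0 and reports the
request as still pending; a second call then finishes the remaining
three sectors with an error.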

>  - either make the blk-mq tag allocator usable on a per-host basis for
>    those drivers that currently use host-wide tagging, or find a way
>    that they can use their own per-host tagging without getting into the
>    way of blk-mq.

Agree.

>  - implement BIDI support in blk-mq.  This is currently missing entirely
>    and will be needed to support the OSD2 protocol, as well as a few
>    SBC commands through sg_io.

Ambivalent.  We need bidi support for OSD arrays and some of the more
esoteric commands, but I'm not convinced they're necessary to the
functioning of the stack.

>  - fix the tag allocation for sequenced FLUSH commands.

Agree.

James





* Re: Proposal for a scalable SCSI midlayer
  2014-02-23 20:10 ` James Bottomley
@ 2014-02-24  4:25   ` Christoph Hellwig
  2014-02-26 10:59   ` Bart Van Assche
  1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2014-02-24  4:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Jens Axboe, Nicholas Bellinger, linux-scsi

On Sun, Feb 23, 2014 at 02:10:18PM -0600, James Bottomley wrote:
> If we can do this, that would be great, because it cuts down on the
> maintenance burden for all of us and gives some benefits at least to
> non-MQ hardware.

So far this seems to work out great, and I think we will be able to
stick to it.

> > Even when avoiding the host lock by replacing it with atomic counters we'd
> > run into multiple host or target-wide shared cache lines for each I/O
> > submission or completion.
> 
> Right ... my ideal here if we can achieve it would be lockless threaded
> models, where we could make guarantees like single thread of execution
> per command, so all command state could be lockless.  Even a CPU
> dedicated to a single device would give us lockless device state and
> the necessary cache hotness (although this may not be feasible).

That doesn't fit the current blk-mq model, which I'd much prefer to
follow for now.  As Jens pointed out in the previous discussion, blk-mq
tries to map to CPU-local queues as much as possible, but there's no
hard guarantee.

> >  - implement BIDI support in blk-mq.  This is currently missing entirely
> >    and will be needed to support the OSD2 protocol, as well as a few
> >    SBC commands through sg_io.
> 
> Ambivalent.  We need bidi support for OSD arrays and some of the more
> esoteric commands, but I'm not convinced they're necessary to the
> functioning of the stack.

It's needed so that we get a full replacement of the old code, so we'll
have to tackle it eventually.  I wish we could simply avoid it, but life
ain't that easy.



* Re: Proposal for a scalable SCSI midlayer
  2014-02-23 20:10 ` James Bottomley
  2014-02-24  4:25   ` Christoph Hellwig
@ 2014-02-26 10:59   ` Bart Van Assche
  2014-02-26 17:12     ` James Bottomley
  1 sibling, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2014-02-26 10:59 UTC (permalink / raw)
  To: James Bottomley, Christoph Hellwig
  Cc: Jens Axboe, Nicholas Bellinger, linux-scsi

On 02/23/14 21:10, James Bottomley wrote:
> Right ... my ideal here if we can achieve it would be lockless threaded
> models, where we could make guarantees like single thread of execution
> per command, so all command state could be lockless.

This approach sounds interesting but could be challenging to implement.
With this approach it would no longer be safe to access the SCSI command
state from interrupt or tasklet context.  That means the I/O completion
path would have to be modified: instead of using an IPI to invoke a
tasklet on the CPU that submitted the SCSI command, a new mechanism
would be needed that runs the I/O completion code directly in the
context of the thread that submitted the SCSI command.

Bart.



* Re: Proposal for a scalable SCSI midlayer
  2014-02-26 10:59   ` Bart Van Assche
@ 2014-02-26 17:12     ` James Bottomley
  0 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2014-02-26 17:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Nicholas Bellinger, linux-scsi

On Wed, 2014-02-26 at 11:59 +0100, Bart Van Assche wrote:
> On 02/23/14 21:10, James Bottomley wrote:
> > Right ... my ideal here if we can achieve it would be lockless threaded
> > models, where we could make guarantees like single thread of execution
> > per command, so all command state could be lockless.
> 
> This approach sounds interesting but could be challenging to implement.
> With this approach it would no longer be safe to access the SCSI command
> state from interrupt nor from tasklet context.

I don't think so: a SCSI command has to be in a distinct state, either
preparing, dispatching (at HBA) or returning (or a variety of states in
between).  It can't be in more than one state, so we can still guarantee
the per command variables can be accessed locklessly from any context
because if more than one CPU is operating on it simultaneously, we just
broke the state machine.
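
A minimal user-space sketch of that argument.  The state names follow
the ones above, but the struct, the helpers, and the use of a
compare-and-swap on the state word are assumptions of the sketch: the
argument is that the state machine itself provides the exclusion, and
the CAS here only makes the handoff (and a violation of it) explicit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum cmd_state {
	CMD_PREPARING,	/* owned by the submitting context */
	CMD_DISPATCHED,	/* owned by the HBA / driver */
	CMD_RETURNING,	/* owned by the completion context */
	CMD_DONE,
};

struct cmd {
	atomic_int state;	/* holds an enum cmd_state value */
	int result;		/* plain field: touched only by the owner */
};

static void cmd_init(struct cmd *c)
{
	atomic_init(&c->state, CMD_PREPARING);
	c->result = 0;
}

static enum cmd_state cmd_state(struct cmd *c)
{
	return (enum cmd_state)atomic_load_explicit(&c->state,
						    memory_order_acquire);
}

/*
 * Hand the command from state 'from' to state 'to'.  The release half
 * publishes the plain fields written by the old owner, the acquire
 * half makes them visible to the new one.  Returns false if the
 * command was not in 'from', i.e. two contexts tried to operate on it
 * at once and the state machine is broken.
 */
static bool cmd_advance(struct cmd *c, enum cmd_state from,
			enum cmd_state to)
{
	int expected = from;

	return atomic_compare_exchange_strong_explicit(&c->state, &expected,
						       to,
						       memory_order_acq_rel,
						       memory_order_acquire);
}
```

Between two successful transitions the current owner reads and writes
`result` with ordinary loads and stores; no lock or atomic is needed for
the per-command fields themselves.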

The challenge comes from the device variables:  a device can have a
bunch of commands in various states.  However, if we can guarantee
single CPU dispatch per device queue, then we could also use lockless
plus sloppy counters on the device variables because the CPU can only be
in a single state per queue even if multiple commands in the queue are
in different states.  The times we need to see the per-queue state
rolled up are fairly well defined, so we can shift the expense of the
rollup operation to the aggregation phase, hence sloppy counters.
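
A minimal user-space sketch of the sloppy counter idea, assuming one
counter slot per dispatch queue and exactly one CPU writing each slot.
The names, the queue count, and the 64-byte cache line size are all
assumptions of the sketch.  Updates are plain load/store pairs with no
atomic read-modify-write, and the cost of producing a device-wide total
is paid only when someone asks for it.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_QUEUES	4	/* assumed number of per-CPU dispatch queues */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per queue, each on its own cache line. */
struct queue_count {
	_Alignas(CACHE_LINE) atomic_long in_flight;
};

struct device_stats {
	struct queue_count q[NR_QUEUES];
};

static void stats_init(struct device_stats *s)
{
	for (int i = 0; i < NR_QUEUES; i++)
		atomic_init(&s->q[i].in_flight, 0);
}

/*
 * Single-writer update: only the CPU that owns queue 'q' calls this,
 * so a relaxed load followed by a relaxed store is enough.
 */
static void stats_add(struct device_stats *s, unsigned int q, long delta)
{
	atomic_long *slot;
	long cur;

	assert(q < NR_QUEUES);
	slot = &s->q[q].in_flight;
	cur = atomic_load_explicit(slot, memory_order_relaxed);
	atomic_store_explicit(slot, cur + delta, memory_order_relaxed);
}

/*
 * Aggregation phase: sum all slots.  While queues are active the
 * result may already be stale when it is returned, hence "sloppy".
 */
static long stats_rollup(struct device_stats *s)
{
	long sum = 0;

	for (int i = 0; i < NR_QUEUES; i++)
		sum += atomic_load_explicit(&s->q[i].in_flight,
					    memory_order_relaxed);
	return sum;
}
```

The fast path touches only a cache line private to the owning CPU; the
reader pays for walking all slots, and has to tolerate a total that is
approximate while I/O is in flight.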


James


>  That means that the I/O
> completion path would have to be modified such that instead of using an
> IPI to invoke a tasklet at the CPU that submitted the SCSI command a new
> mechanism would have to be used that causes the I/O completion code to
> run directly on the context of the thread that submitted the SCSI command.
> 
> Bart.
> 
