From: Vladislav Bolkhovitin <vst@vlnb.net>
To: Chris Mason <chris.mason@oracle.com>,
	Christoph Hellwig <hch@lst.de>, Tejun Heo <tj@kernel.org>,
	Vivek Goyal <vgoyal@redhat.com>, Jan Kara <jack@suse.cz>,
	jaxboe@fusionio.com, James.B
Subject: Re: [RFC] relaxed barrier semantics
Date: Thu, 05 Aug 2010 23:48:19 +0400
Message-ID: <4C5B1583.6070706__25908.5265374326$1281037743$gmane$org@vlnb.net>
In-Reply-To: <20100805133225.GF29846@think>

Chris Mason, on 08/05/2010 05:32 PM wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>>
>> But if the ordering isn't in the common (block) code, how to
>> implement the "hardware offload" for ordering, i.e. ORDERED
>> commands, in an acceptable way?
>>
>> I believe, the decision was right, but the flags and magic requests
>> based interface (and, hence, implementation) was wrong. That's it
>> which stinks of magic and voodoo.
>
> The interface definitely has flaws.  We didn't expand it because James
> popped up with a long list of error handling problems.

Could you point me to the corresponding message, please? I can't find it
in my archive.

> Basically how
> do the hardware and the kernel deal with a failed request at the start
> of the chain.  Somehow the easy way of failing them all turned out to be
> extremely difficult.

Have you considered not failing them all, but instead using the SCSI ACA
facility to suspend the queue, requeue the failed request, and then
restart processing? I might be missing something, but with this approach
recovery of failed requests should look quite simple and, most
importantly, compact, hence easily audited. Something like the below.
Sorry, since it's low-level recovery, it requires some deep SCSI
knowledge to follow.

We need:

1. A low level driver that has no internal queue and does not mask the
returned status and sense. At first look, many of the existing drivers
more or less satisfy this requirement, including the drivers of direct
interest to me: qla2xxx, iscsi and ib_srp.

2. A device that supports ORDERED commands as well as the ACA and
UA_INTLCK facilities, in QERR mode 0.

Assume we have N ORDERED requests queued to a device and one of them
fails. Submission of new requests to the device would then be suspended
and a recovery thread woken up.

Let's say we have a list of the requests queued to the device, in the
order they were queued. Then the recovery thread would need to deal with
the following cases:

1. The command failed with CHECK_CONDITION and was at the head of the
queue. (The device has now established ACA and suspended its internal
queue.) Then the command should be sent to the device as an ACA task
and, after it has finished, ACA should be cleared. (The device would now
restart its queue.) Then submission of new requests to the device would
also be resumed.

2. The command failed with CHECK_CONDITION and was not at the head of
the queue.

2.1. The failed command is the last in the queue. ACA should be cleared
and the failed command should simply be restarted. Then submission of
new requests to the device would also be resumed.

2.2. The failed command isn't the last in the queue. Then the recovery
thread would send a TEST UNIT READY command as an ACA task to be sure
all in-flight commands have reached the device. Then it would abort all
the commands after the failed one using the ABORT TASK task management
function. Then ACA should be cleared and the failed command as well as
all the aborted commands would be resent to the device. Then submission
of new requests to the device would also be resumed.

3. The command failed with a status other than CHECK_CONDITION and was
at the head of the queue.

3.1. The failed command is the only queued command. Then a TEST UNIT
READY command should be sent to the device to get the post UA_INTLCK
CHECK CONDITION and trigger ACA. Then ACA should be cleared and the
failed command restarted. Then submission of new requests to the device
would also be resumed.

3.2. There are other queued commands. Then the recovery thread should
remember the failed command and exit. The next command would get the
post UA_INTLCK CHECK CONDITION and trigger ACA. Then recovery would
proceed as in (1), except that the 2 failed commands would be restarted
as ACA commands before clearing ACA.

4. The command wasn't at the head of the queue and failed with a status
other than CHECK_CONDITION. This might happen in case of a TASK SET FULL
(queue full) condition. This case would be handled similarly to cases
(3.x), then (2.2).

That's all. Simple, compact and easy to audit.
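
The case analysis above can be sketched as a small dispatcher. This is
only an illustration under stated assumptions, not kernel code: the
names Status, recovery_plan and abort_and_resend are hypothetical, the
steps are plain strings rather than real SCSI commands, and the handling
of case 4 is one possible reading of "similarly to cases (3.x), then
(2.2)".

```python
from enum import Enum


class Status(Enum):
    CHECK_CONDITION = "CHECK_CONDITION"
    OTHER = "OTHER"  # any other failure status, e.g. a queue-full condition


RESUME = "resume submitting new requests"


def abort_and_resend(failed, later):
    """Steps of case 2.2: fence with TEST UNIT READY, abort, clear ACA, resend."""
    steps = ["send TEST UNIT READY as ACA task"]
    steps += [f"ABORT TASK {cmd}" for cmd in later]
    steps.append("clear ACA")
    steps += [f"resend {cmd}" for cmd in [failed] + later]
    steps.append(RESUME)
    return steps


def recovery_plan(queue, failed, status):
    """Return (case label, ordered steps) for one failed ORDERED command.

    `queue` holds command names in submission order and includes `failed`.
    """
    idx = queue.index(failed)
    at_head = idx == 0
    later = queue[idx + 1:]  # commands queued after the failed one

    if status is Status.CHECK_CONDITION:
        if at_head:
            return "1", [f"send {failed} as ACA task", "clear ACA", RESUME]
        if not later:
            return "2.1", ["clear ACA", f"restart {failed}", RESUME]
        return "2.2", abort_and_resend(failed, later)

    if at_head:
        if not later:
            return "3.1", [
                "send TEST UNIT READY to get the post-UA_INTLCK CHECK CONDITION",
                "clear ACA",
                f"restart {failed}",
                RESUME,
            ]
        nxt = later[0]
        return "3.2", [
            f"remember {failed} and wait for {nxt} to get CHECK CONDITION",
            f"send {failed} as ACA task",
            f"send {nxt} as ACA task",
            "clear ACA",
            RESUME,
        ]

    # Case 4 (assumed reading): get ACA established as in 3.x, then do 2.2.
    return "4", [f"remember {failed}", "establish ACA as in case 3.x"] \
        + abort_and_resend(failed, later)
```

For example, recovery_plan(["w1", "w2", "w3"], "w2",
Status.CHECK_CONDITION) selects case 2.2: w3 is aborted, ACA is cleared,
and w2 and w3 are resent in their original order.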

> Even if that part had been refined, I think trusting the ordering down
> to the lower layers was a doomed idea.  The list of ways it could go
> wrong is much much longer (and harder to debug) than the list of
> benefits.

It's hard to debug because it's currently an overloaded-flags nightmare.
It isn't the idea of trusting the lower levels that is doomed; everybody
trusts lower levels everywhere in the kernel. What is doomed is the idea
of providing the requested functionality via a set of flags and
artificial barrier requests with obscure side effects. Linux just needs
a clear and _natural_ interface for that, like the one I proposed in
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Yes, I am
proposing to slowly start thinking about moving to a new interface and
implementation, out of the current hell. It's obvious that what Linux
has now in this area is a dead end. The new flag Christoph is going to
add makes it even worse.

> With all of that said, I did go ahead and benchmark real ordered tags
> extensively on a scsi drive in the initial implementation.  There was
> very little performance difference.

It isn't a surprise that you didn't see much difference with a local
(Wide?) SCSI drive. Such drives sit on a low-latency link, are simple
enough to have small internal latencies, and are dumb enough not to
benefit much from internal reordering. But how about external arrays? Or
even clusters? Nowadays anybody can build such arrays and clusters from
any Linux (or other *nix) box using any OSS SCSI target implementation,
starting with SCST, which I have been developing. Such array/cluster
devices use links with an order of magnitude higher latency; they are
very sophisticated inside, so they have much bigger internal latencies
as well as much bigger opportunities to optimize the I/O pattern by
internal reordering. All the record numbers I've seen so far were
reached with deep queues. For instance, the latest SCST record (>500K 4K
IOPS from a single target) was achieved with queue depth 128!
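
The link between queue depth and throughput can be made concrete with
Little's law (commands in flight = throughput x mean latency). The
sketch below is only arithmetic on the figures quoted above, assuming a
steady state with the queue kept full; the function names are
hypothetical and nothing here is a measurement.

```python
def mean_latency_us(iops, queue_depth):
    """Little's law solved for latency: queue_depth = iops * latency."""
    return queue_depth / iops * 1e6


def max_iops(queue_depth, latency_us):
    """Little's law solved for throughput at a given per-command latency."""
    return queue_depth / (latency_us * 1e-6)


# Figures from the text: 500K IOPS at queue depth 128.
latency = mean_latency_us(500_000, 128)

# Same per-command latency, but draining the queue before every request
# so that only one command is ever in flight.
drained = max_iops(1, latency)

print(f"implied mean latency: {latency:.0f} us")
print(f"IOPS at queue depth 1: {drained:.0f}")
```

Under these assumptions, 500K IOPS at queue depth 128 implies a mean
per-command latency of about 256 us, and at that latency a queue depth
of 1 would sustain only about 3,900 IOPS. This is why any scheme that
drains the queue is costly on a high-latency target.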

So, I believe, Linux must use that possibility to get full storage 
performance and to finally simplify its storage stack.

Vlad
