From: Ric Wheeler <ric@emc.com>
To: Linux-ide <linux-ide@vger.kernel.org>,
	linux-scsi <linux-scsi@vger.kernel.org>,
	linux-raid@vger.kernel.org, Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@SteelEye.com>,
	Mark Lord <mlord@pobox.com>, Neil Brown <neilb@suse.de>,
	Jens Axboe <jens.axboe@oracle.com>,
	"Clark, Nathan" <Clark_Nathan@emc.com>,
	"Singh, Arvinder" <Singh_Arvinder@emc.com>,
	"De Smet, Jochen" <DeSmet_Jochen@emc.com>,
	"Farmer, Matt" <Farmer_Matt@emc.com>Mark Lord <mlord@pobox.com>,
	linux-fsdevel@vger.kernel.org, "Mizar,
	Sunita" <Mizar_Sunita@emc.com>
Subject: end to end error recovery musings
Date: Fri, 23 Feb 2007 09:15:11 -0500	[thread overview]
Message-ID: <45DEF6EF.3020509@emc.com> (raw)

In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.

My group has been working to stabilize a relatively up-to-date libata +
MD based box, so I can try to lay out at least one typical "appliance
like" configuration to help frame the issue. We are working on a
relatively large appliance, but you can buy (or build) similar home
appliances that use Linux to provide a NAS in a box for end users.

The use case that we have is an ICH6R/AHCI box with 4 large (500+ GB)
drives, with some of the small system partitions on a 4-way RAID1
device. The libata version we have is a back port of 2.6.18 onto
SLES10, so the error handling at the libata level is a huge improvement
over what we had before.

Each box has a watchdog timer that can be set to fire after at most 2 
minutes.
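
For anyone unfamiliar with the mechanism, here is a minimal user space
sketch of arming and petting a 2 minute watchdog through the generic
Linux /dev/watchdog interface. This is only an assumption about how
such a box might drive its watchdog; the real appliance may use a
hardware/BMC watchdog with a different interface entirely.

/* Sketch only: arm a 2 minute watchdog and pet it once a minute.
 * Assumes the generic /dev/watchdog character device; the actual box
 * may use a different (hardware/BMC) mechanism. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
	int fd = open("/dev/watchdog", O_WRONLY);
	int timeout = 120;			/* seconds */

	if (fd < 0) {
		perror("open /dev/watchdog");
		return 1;
	}
	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0)
		perror("WDIOC_SETTIMEOUT");	/* driver may not support it */

	for (;;) {
		if (write(fd, "\0", 1) != 1)	/* pet the watchdog */
			perror("watchdog write");
		sleep(60);			/* miss two of these and we reboot */
	}
}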

(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).

Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.

We still have the following challenges:

    (1) Read-ahead often means that we will retry every bad sector at
least twice from the file system level. The first time, the fs
read-ahead request triggers a speculative read that includes the bad
sector (triggering the error handling mechanisms) right before the real
application read does the same thing. Not sure what the answer is here,
since read-ahead is obviously a huge win in the normal case (one
application-level knob is sketched after this list).

    (2) The patches that were floating around on how to make sure that
we effectively handle single sector errors in a large IO request are
critical. On one hand, we want to combine adjacent IO requests into
larger IOs whenever possible. On the other hand, when the combined IO
fails, we need to isolate the error to the correct range, avoid
reissuing a request that touches that sector again, and communicate up
the stack to the file system/MD what really failed. All of this needs
to complete in tens of seconds, not multiple minutes (a user space
sketch of the isolation idea follows this list).

    (3) The timeout values on the failed IOs need to be tuned well (as
was discussed in an earlier linux-ide thread). We cannot afford to hang
for 30 seconds, especially in the MD case, since you might need to fail
more than one device for a single IO. Prompt error propagation (say
that 4 times quickly!) can allow MD to mask the underlying errors as
you would hope; hanging on too long will almost certainly cause a
watchdog reboot (the per-disk command timeout knob is shown in a sketch
after this list)...

    (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.
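
For (1), there is no application-level fix for read-ahead issued by the
file system on its own behalf, but an application that knows it is
reading from a suspect region can at least keep its own descriptor from
triggering speculative reads via posix_fadvise(). A rough sketch, with
the helper name being purely illustrative:

/* Sketch for (1): ask the kernel not to do read-ahead on this file,
 * so the application read is the only one that touches a bad sector.
 * This does nothing about read-ahead done for other files or callers. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

int open_no_readahead(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0 && posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0)
		perror("posix_fadvise");	/* advisory only */
	return fd;
}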
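
For (2), the real work belongs in the block layer/MD, where the failed
range has to be reported up the stack; purely to illustrate the
isolation idea, here is a user space sketch that falls back to
sector-sized reads when a large read fails, so only the genuinely bad
sectors are lost rather than the whole combined IO (sector size and the
zero-fill policy are assumptions for the example):

/* Sketch for (2): when a big read fails with EIO, drop to 512 byte
 * reads so we only lose the sectors that are really bad, instead of
 * failing the whole request. Assumes len is a multiple of 512. */
#define _XOPEN_SOURCE 600
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512

ssize_t read_isolating(int fd, char *buf, size_t len, off_t off)
{
	ssize_t n = pread(fd, buf, len, off);
	size_t i;

	if (n >= 0 || errno != EIO)
		return n;

	/* Combined IO failed: retry one sector at a time. */
	for (i = 0; i < len; i += SECTOR) {
		if (pread(fd, buf + i, SECTOR, off + i) < 0) {
			fprintf(stderr, "bad sector at byte offset %lld\n",
				(long long)(off + i));
			memset(buf + i, 0, SECTOR);	/* skip the bad range */
		}
	}
	return len;
}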
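
For (3), the per-command timeout that the SCSI layer applies to each
(libata-backed) disk is already tunable at run time through sysfs; a
sketch follows, where /dev/sda and the 7 second value are assumptions
for the example rather than a recommendation. Picking a value that
plays well with MD failover is exactly the tuning question above.

/* Sketch for (3): lower the SCSI command timeout for one disk.
 * Equivalent to: echo 7 > /sys/block/sda/device/timeout
 * Device name and the 7 second value are illustrative only. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/sda/device/timeout", "w");

	if (!f) {
		perror("open timeout attribute");
		return 1;
	}
	fprintf(f, "%d\n", 7);	/* seconds per SCSI command */
	fclose(f);
	return 0;
}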

We will follow up with specific issues as they arise, but I wanted to
lay out a use case that can help frame part of the discussion. I also
want to encourage people to inject real disk errors with Mark's patches
so we can share the pain ;-)

ric



