* end to end error recovery musings
@ 2007-02-23 14:15 ` Ric Wheeler
  0 siblings, 0 replies; 44+ messages in thread
From: Ric Wheeler @ 2007-02-23 14:15 UTC (permalink / raw)
  To: Linux-ide, linux-scsi, linux-raid, Tejun Heo
  Cc: James Bottomley, Mark Lord, Neil Brown, Jens Axboe, Clark,
	Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt

In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.

My group has been working to stabilize a relatively up-to-date libata + 
MD based box, so I can try to lay out at least one "appliance like" 
typical configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy similar home appliances (or 
build them) that use Linux to provide a NAS in a box for end users.

The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a back port of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.

Each box has a watchdog timer that can be set to fire after at most 2 
minutes.

(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).

Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.

We still have the following challenges:

    (1) read-ahead often means that we will retry every bad sector at 
least twice from the file system level. The first time, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) right before the real 
application read does the same thing.  Not sure what the answer is here 
since read-ahead is obviously a huge win in the normal case.

    (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again, and communicate up 
the stack to the file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.

    (3) The timeout values on the failed IO's need to be tuned well (as 
was discussed in an earlier linux-ide thread). We cannot afford to hang 
for 30 seconds, especially in the MD case, since you might need to fail 
more than one device for a single IO.  Prompt error propagation (say 
that 4 times quickly!) can allow MD to mask the underlying errors as you 
would hope; hanging on too long will almost certainly cause a watchdog 
reboot...

    (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.

We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with Mark's 
patches so we can share the pain ;-)

ric




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-23 14:15 ` Ric Wheeler
  (?)
@ 2007-02-24  0:03 ` H. Peter Anvin
  2007-02-24  0:37   ` Andreas Dilger
  2007-02-26  6:01   ` Douglas Gilbert
  -1 siblings, 2 replies; 44+ messages in thread
From: H. Peter Anvin @ 2007-02-24  0:03 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Linux-ide, linux-scsi, linux-raid, Tejun Heo, James Bottomley,
	Mark Lord, Neil Brown, Jens Axboe, Clark, Nathan, Singh,
	Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar,
	Sunita

Ric Wheeler wrote:
> 
> We still have the following challenges:
> 
>    (1) read-ahead often means that we will retry every bad sector at 
> least twice from the file system level. The first time, the fs read-ahead 
> request triggers a speculative read that includes the bad sector 
> (triggering the error handling mechanisms) right before the real 
> application read does the same thing.  Not sure what the answer is here 
> since read-ahead is obviously a huge win in the normal case.
> 

Probably the only sane thing to do is to remember the bad sectors and 
avoid attempting reading them; that would mean marking "automatic" 
versus "explicitly requested" requests to determine whether or not to 
filter them against a list of discovered bad blocks.
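
As a very rough sketch of the data structure and policy that implies --
every name below is invented for illustration, this is not an existing
kernel interface:

#include <stdbool.h>
#include <stdint.h>

/* One discovered bad region, kept per block device. */
struct bad_extent {
	uint64_t start;		/* first bad sector */
	uint32_t len;		/* length in sectors */
};

struct badblock_list {
	struct bad_extent *ext;
	unsigned int nr;
};

/* Does [sector, sector + len) overlap anything already known to be bad? */
static bool bb_overlaps(const struct badblock_list *bb,
			uint64_t sector, uint32_t len)
{
	for (unsigned int i = 0; i < bb->nr; i++) {
		if (sector < bb->ext[i].start + bb->ext[i].len &&
		    bb->ext[i].start < sector + len)
			return true;
	}
	return false;
}

/*
 * The policy described above: speculative (read-ahead) requests are
 * quietly dropped if they touch a known-bad region, so they never
 * re-trigger error handling; explicitly requested reads are still
 * issued so the application sees the real error.
 */
static int submit_read(struct badblock_list *bb, uint64_t sector,
		       uint32_t len, bool automatic)
{
	if (automatic && bb_overlaps(bb, sector, len))
		return -1;	/* skip the read-ahead */
	/* ... queue the I/O to the driver as usual ... */
	return 0;
}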

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  0:03 ` H. Peter Anvin
@ 2007-02-24  0:37   ` Andreas Dilger
  2007-02-24  2:05     ` H. Peter Anvin
  2007-02-24  2:32     ` Theodore Tso
  2007-02-26  6:01   ` Douglas Gilbert
  1 sibling, 2 replies; 44+ messages in thread
From: Andreas Dilger @ 2007-02-24  0:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ric Wheeler, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Neil Brown, Jens Axboe, Clark,
	Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

On Feb 23, 2007  16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
> >   (1) read-ahead often means that we will retry every bad sector at 
> >least twice from the file system level. The first time, the fs read-ahead 
> >request triggers a speculative read that includes the bad sector 
> >(triggering the error handling mechanisms) right before the real 
> >application read does the same thing.  Not sure what the answer is here 
> >since read-ahead is obviously a huge win in the normal case.
> 
> Probably the only sane thing to do is to remember the bad sectors and 
> avoid attempting reading them; that would mean marking "automatic" 
> versus "explicitly requested" requests to determine whether or not to 
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.  For that matter, a huge win
would be to have the MD RAID layer rewrite only the bad sector (in hopes
of the disk relocating it) instead of failing the whole disk.  Otherwise,
a few read errors on different disks in a RAID set can take the whole
system offline.  Apologies if this is already done in recent kernels...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  0:37   ` Andreas Dilger
@ 2007-02-24  2:05     ` H. Peter Anvin
  2007-02-24  2:32     ` Theodore Tso
  1 sibling, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2007-02-24  2:05 UTC (permalink / raw)
  To: H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, James Bottomley, Mark Lord, Neil Brown, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.

Certainly if the overwrite is successful.

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  0:37   ` Andreas Dilger
  2007-02-24  2:05     ` H. Peter Anvin
@ 2007-02-24  2:32     ` Theodore Tso
  2007-02-24 18:39       ` Chris Wedgwood
  2007-02-26  5:33       ` Neil Brown
  1 sibling, 2 replies; 44+ messages in thread
From: Theodore Tso @ 2007-02-24  2:32 UTC (permalink / raw)
  To: H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, James Bottomley, Mark Lord, Neil Brown, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > Probably the only sane thing to do is to remember the bad sectors and 
> > avoid attempting reading them; that would mean marking "automatic" 
> > versus "explicitly requested" requests to determine whether or not to 
> > filter them against a list of discovered bad blocks.
> 
> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.  For that matter, a huge win
> would be to have the MD RAID layer rewrite only the bad sector (in hopes
> of the disk relocating it) instead of failing the whole disk.  Otherwise,
> a few read errors on different disks in a RAID set can take the whole
> system offline.  Apologies if this is already done in recent kernels...

And having a way of making this list available to both the filesystem
and to a userspace utility, so they can more easily deal with doing a
forced rewrite of the bad sector, after determining which file is
involved and perhaps doing something intelligent (up to and including
automatically requesting a backup system to fetch a backup version of
the file, and if it can be determined that the file shouldn't have
been changed since the last backup, automatically fixing up the
corrupted data block :-).

						- Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  2:32     ` Theodore Tso
@ 2007-02-24 18:39       ` Chris Wedgwood
  2007-02-26  5:33       ` Neil Brown
  1 sibling, 0 replies; 44+ messages in thread
From: Chris Wedgwood @ 2007-02-24 18:39 UTC (permalink / raw)
  To: Theodore Tso, H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Neil Brown,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:

> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + perl script that would take a badblocks list
and figure out which files on an xfs filesystem were trashed, though
in the case of xfs it's somewhat easier because you can dump the
extents for a file.  Something more generic wouldn't be hard to make
work.  It also wouldn't be hard to extend this to inodes in some cases,
though I'm not sure that there is much you can do there beyond fsck.
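
A minimal, generic version of that idea in userspace C, for anyone who
wants to play with it (FIBMAP and FIGETBSZ are real ioctls; the rest is
only a sketch -- it takes one file and one filesystem block number and
reports whether the file covers it):

/* badfile.c - does <file> cover filesystem block <blkno>?
 * FIBMAP generally needs root and a filesystem that implements bmap. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <fs block #>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	unsigned long bad = strtoul(argv[2], NULL, 0);

	int blksize;
	if (ioctl(fd, FIGETBSZ, &blksize) < 0) {
		perror("FIGETBSZ");
		return 1;
	}

	struct stat st;
	if (fstat(fd, &st) < 0) {
		perror("fstat");
		return 1;
	}

	long nblocks = (st.st_size + blksize - 1) / blksize;

	for (long i = 0; i < nblocks; i++) {
		int blk = (int)i;	/* in: logical block, out: physical block */
		if (ioctl(fd, FIBMAP, &blk) < 0) {
			perror("FIBMAP");
			return 1;
		}
		if (blk && (unsigned long)blk == bad) {
			printf("%s: logical block %ld sits on bad fs block %lu\n",
			       argv[1], i, bad);
			return 2;
		}
	}
	printf("%s: does not cover fs block %lu\n", argv[1], bad);
	return 0;
}

Run over every file in the filesystem it gives you the trashed-file list
for a badblocks report; converting bad sectors to fs block numbers is
left out since it depends on the partition offset and block size.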


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  2:32     ` Theodore Tso
  2007-02-24 18:39       ` Chris Wedgwood
@ 2007-02-26  5:33       ` Neil Brown
  2007-02-26 13:25         ` Theodore Tso
  1 sibling, 1 reply; 44+ messages in thread
From: Neil Brown @ 2007-02-26  5:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, James Bottomley, Mark Lord, Neil Brown, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

On Friday February 23, tytso@mit.edu wrote:
> On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > > Probably the only sane thing to do is to remember the bad sectors and 
> > > avoid attempting reading them; that would mean marking "automatic" 
> > > versus "explicitly requested" requests to determine whether or not to 
> > > filter them against a list of discovered bad blocks.
> > 
> > And clearing this list when the sector is overwritten, as it will almost
> > certainly be relocated at the disk level.  For that matter, a huge win
> > would be to have the MD RAID layer rewrite only the bad sector (in hopes
> > of the disk relocating it) instead of failing the whole disk.  Otherwise,
> > a few read errors on different disks in a RAID set can take the whole
> > system offline.  Apologies if this is already done in recent kernels...

Yes, current md does this.

> 
> And having a way of making this list available to both the filesystem
> and to a userspace utility, so they can more easily deal with doing a
> forced rewrite of the bad sector, after determining which file is
> involved and perhaps doing something intelligent (up to and including
> automatically requesting a backup system to fetch a backup version of
> the file, and if it can be determined that the file shouldn't have
> been changed since the last backup, automatically fixing up the
> corrupted data block :-).
> 
> 						- Ted

So we want a clear path for media read errors from the device up to
user-space.  Stacked devices (like md) would maybe do appropriate
mappings (for raid0/linear at least; other levels wouldn't tolerate
errors).
There would need to be a limit on the number of 'bad blocks' that is
recorded.  Maybe a mechanism to clear old bad blocks from the list is
needed.

Maybe if generic_make_request gets a request for a block which
overlaps a 'bad block' it could return an error immediately.

Do we want a path in the other direction to handle write errors?  The
file system could say "Don't worry too much if this block cannot be
written, just return an error and I will write it somewhere else"?
This might allow md not to fail a whole drive if there is a single
write error.
Or is that completely unnecessary as all modern devices do bad-block
relocation for us?
Is there any need for a bad-block-relocating layer in md or dm?

What about corrected-error counts?  Drives provide them with SMART.
The SCSI layer could provide some as well.  Md can do a similar thing
to some extent.  Where these are actually useful predictors of pending
failure is unclear, but there could be some value.
e.g. after a certain number of recovered errors raid5 could trigger a
background consistency check, or a filesystem could trigger a
background fsck should it support that.
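
To make that last idea concrete, a sketch of the sort of policy hook
this could be -- the threshold and names are made up, and what the
callback actually kicks off (raid5 check, online fsck) is up to the
layer that owns it:

#include <stdint.h>

#define CE_CHECK_THRESHOLD 16	/* arbitrary, for illustration */

struct dev_health {
	uint64_t corrected;		/* recovered errors seen so far */
	uint64_t last_checked_at;	/* value of 'corrected' at last check */
};

/* Call whenever SMART / sense data / md reports a recovered error. */
static void note_corrected_error(struct dev_health *h,
				 void (*start_background_check)(void))
{
	h->corrected++;
	if (h->corrected - h->last_checked_at >= CE_CHECK_THRESHOLD) {
		h->last_checked_at = h->corrected;
		start_background_check();	/* e.g. raid5 consistency check */
	}
}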


Lots of interesting questions... not so many answers.

NeilBrown

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-24  0:03 ` H. Peter Anvin
  2007-02-24  0:37   ` Andreas Dilger
@ 2007-02-26  6:01   ` Douglas Gilbert
  1 sibling, 0 replies; 44+ messages in thread
From: Douglas Gilbert @ 2007-02-26  6:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ric Wheeler, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Neil Brown, Jens Axboe, Clark,
	Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

H. Peter Anvin wrote:
> Ric Wheeler wrote:
>>
>> We still have the following challenges:
>>
>>    (1) read-ahead often means that we will retry every bad sector at
>> least twice from the file system level. The first time, the fs read-ahead
>> request triggers a speculative read that includes the bad sector
>> (triggering the error handling mechanisms) right before the real
>> application read does the same thing.  Not sure what the answer is
>> here since read-ahead is obviously a huge win in the normal case.
>>
> 
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking "automatic"
> versus "explicitly requested" requests to determine whether or not to
> filter them against a list of discovered bad blocks.

Some disks are doing their own "read-ahead" in the form
of a background media scan. Scans are done on request or
periodically (e.g. once per day or once per week) and we
have tools that can fetch the scan results from a disk
(e.g. a list of unreadable sectors). What we don't have
is any way to feed such information to a file system
that may be impacted.

Doug Gilbert



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26  5:33       ` Neil Brown
@ 2007-02-26 13:25         ` Theodore Tso
  2007-02-26 15:15           ` Alan
                             ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Theodore Tso @ 2007-02-26 13:25 UTC (permalink / raw)
  To: Neil Brown
  Cc: H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan,
	Singh, Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel,
	Mizar, Sunita

On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote:
> Do we want a path in the other direction to handle write errors?  The
> file system could say "Don't worry too much if this block cannot be
> written, just return an error and I will write it somewhere else"?
> This might allow md not to fail a whole drive if there is a single
> write error.

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to
the new location.  I believe this should be always true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.  

(Or that someone was in the middle of reconfiguring a FC network and
they're running a kernel that doesn't understand why short-duration FC
timeouts should be retried.  :-)

> Or is that completely un-necessary as all modern devices do bad-block
> relocation for us?
> Is there any need for a bad-block-relocating layer in md or dm?

That's the question.  It wouldn't be that hard for filesystems to be
able to remap a data block, but (a) it would be much more difficult
for fundamental metadata (for example, the inode table), and (b) it's
unnecessary complexity if the lower levels in the storage stack should
always be doing this for us in the case of media errors anyway.

> What about corrected-error counts?  Drives provide them with SMART.
> The SCSI layer could provide some as well.  Md can do a similar thing
> to some extent.  Where these are actually useful predictors of pending
> failure is unclear, but there could be some value.
> e.g. after a certain number of recovered errors raid5 could trigger a
> background consistency check, or a filesystem could trigger a
> background fsck should it support that.

Somewhat off-topic, but my one big regret with how the dm vs. evms
competition settled out was that evms had the ability to perform block
device snapshots using a non-LVM volume as the base --- and that EVMS
allowed a single drive to be partially managed by the LVM layer, and
partially managed by evms.  

What this allowed is the ability to do device snapshots and therefore
background fsck's without needing to convert the entire laptop disk to
using an LVM solution (since to this day I still don't trust initrd's
to always do the right thing when I am constantly replacing the kernel
for kernel development).

I know, I'm weird, distro users have initrds that seem to mostly work,
and it's only weird developers that try to use bleeding edge kernels
with a RHEL4 userspace that suffer, but it's one of the reasons why
I've avoided initrd's like the plague --- I've wasted entire days
trying to debug problems with the userspace-provided initrd being too
old to support newer 2.6 development kernels.

In any case, the reason why I bring this up is that it would be really
nice if there was a way with a single laptop drive to be able to do
snapshots and background fsck's without having to use initrd's with
device mapper.

						- Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 13:25         ` Theodore Tso
@ 2007-02-26 15:15           ` Alan
  2007-02-26 15:18             ` Ric Wheeler
  2007-02-26 15:17           ` James Bottomley
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Alan @ 2007-02-26 15:15 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Neil Brown, H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

> the new location.  I believe this should be always true, so presumably
> with all modern disk drives a write error should mean something very
> serious has happened. 

Not quite that simple.

If you write a block-aligned request the same size as the physical media
block size, maybe this is true. If you write a sector on a device with
physical sector size larger than logical block size (as allowed by say
ATA7) then it's less clear what happens. I don't know if the drive
firmware implements multiple "tails" in this case.

On a read error it is worth trying the other parts of the I/O.
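
For illustration, the shape of that piecewise retry at userspace/block
granularity (a sketch only -- sector size and error handling are
simplified):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

/* If a big read fails, re-try it piece by piece so that one bad sector
 * doesn't cost us the whole buffer.  Still-unreadable sectors are
 * zero-filled and reported; returns the number of bytes recovered. */
ssize_t read_with_retry(int fd, void *buf, size_t len, off_t off)
{
	ssize_t ret = pread(fd, buf, len, off);
	if (ret == (ssize_t)len)
		return ret;			/* common case: it all worked */

	size_t good = 0;
	for (size_t pos = 0; pos < len; pos += SECTOR) {
		size_t chunk = (len - pos < SECTOR) ? len - pos : SECTOR;
		if (pread(fd, (char *)buf + pos, chunk, off + pos) == (ssize_t)chunk) {
			good += chunk;
		} else {
			memset((char *)buf + pos, 0, chunk);
			fprintf(stderr, "unreadable at offset %lld\n",
				(long long)(off + pos));
		}
	}
	return good;
}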

Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 13:25         ` Theodore Tso
  2007-02-26 15:15           ` Alan
@ 2007-02-26 15:17           ` James Bottomley
  2007-02-26 18:59           ` H. Peter Anvin
  2007-02-26 22:46           ` Jeff Garzik
  3 siblings, 0 replies; 44+ messages in thread
From: James Bottomley @ 2007-02-26 15:17 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Neil Brown, H. Peter Anvin, Ric Wheeler, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, Mark Lord, Jens Axboe, Clark, Nathan,
	Singh, Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel,
	Mizar, Sunita

On Mon, 2007-02-26 at 08:25 -0500, Theodore Tso wrote:
> Somewhat off-topic, but my one big regret with how the dm vs. evms
> competition settled out was that evms had the ability to perform block
> device snapshots using a non-LVM volume as the base --- and that EVMS
> allowed a single drive to be partially managed by the LVM layer, and
> partially managed by evms.  

If all you want is a snapshot, md can do this today ... you just create
a RAID-1, resync it and then break it ... of course, you have to have the
filesystem mounted above an md device initially ...

James



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 15:15           ` Alan
@ 2007-02-26 15:18             ` Ric Wheeler
  2007-02-26 17:01               ` Alan
  0 siblings, 1 reply; 44+ messages in thread
From: Ric Wheeler @ 2007-02-26 15:18 UTC (permalink / raw)
  To: Alan
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita



Alan wrote:
>> the new location.  I believe this should be always true, so presumably
>> with all modern disk drives a write error should mean something very
>> serious has happened. 
> 
> Not quite that simple.

I think that write errors are normally quite serious, but there are exceptions 
which might be able to be worked around with retries.  To Ted's point, in 
general, a write to a bad spot on the media will cause a remapping which should 
be transparent (if a bit slow) to us.

> 
> If you write a block aligned size the same size as the physical media
> block size maybe this is true. If you write a sector on a device with
> physical sector size larger than logical block size (as allowed by say
> ATA7) then it's less clear what happens. I don't know if the drive
> firmware implements multiple "tails" in this case.
> 
> On a read error it is worth trying the other parts of the I/O.
> 

I think that this is mostly true, but we also need to balance this against the 
need for higher levels to get a timely response.  In a really large IO, a naive 
retry of a very large write could lead to a non-responsive system for a very 
long time...

ric




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 17:01               ` Alan
@ 2007-02-26 16:42                 ` Ric Wheeler
  0 siblings, 0 replies; 44+ messages in thread
From: Ric Wheeler @ 2007-02-26 16:42 UTC (permalink / raw)
  To: Alan
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita


Alan wrote:
>> I think that this is mostly true, but we also need to balance this against the 
>> need for higher levels to get a timely response.  In a really large IO, a naive 
>> retry of a very large write could lead to a non-responsive system for a very 
>> long time...
> 
> And losing the I/O could result in a system that is non responsive until
> the tape restore completes two days later....

Which brings us back to a recent discussion at the file system workshop on being 
more repair oriented in file system design so we can survive situations like 
this a bit more reliably ;-)

ric

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 15:18             ` Ric Wheeler
@ 2007-02-26 17:01               ` Alan
  2007-02-26 16:42                 ` Ric Wheeler
  0 siblings, 1 reply; 44+ messages in thread
From: Alan @ 2007-02-26 17:01 UTC (permalink / raw)
  To: ric
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

> I think that this is mostly true, but we also need to balance this against the 
> need for higher levels to get a timely response.  In a really large IO, a naive 
> retry of a very large write could lead to a non-responsive system for a very 
> long time...

And losing the I/O could result in a system that is non responsive until
the tape restore completes two days later....


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 13:25         ` Theodore Tso
  2007-02-26 15:15           ` Alan
  2007-02-26 15:17           ` James Bottomley
@ 2007-02-26 18:59           ` H. Peter Anvin
  2007-02-26 22:46           ` Jeff Garzik
  3 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2007-02-26 18:59 UTC (permalink / raw)
  To: Theodore Tso, Neil Brown, H. Peter Anvin, Ric Wheeler, Linux-ide,
	linux-scsi, linux-raid, Tejun Heo, James Bottomley, Mark Lord,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

Theodore Tso wrote:
> 
> In any case, the reason why I bring this up is that it would be really
> nice if there was a way with a single laptop drive to be able to do
> snapshots and background fsck's without having to use initrd's with
> device mapper.
> 

This is a major part of why I've been trying to push integrated klibc to 
have all that stuff as a unified "kernel" deliverable.  Unfortunately, 
as you know, Linus apparently rejected the concept "at least for now" at 
LKS last year.

With klibc this stuff could still be in one single wrapper without funny 
dependencies, but wouldn't have to be ported to kernel space.

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 13:25         ` Theodore Tso
                             ` (2 preceding siblings ...)
  2007-02-26 18:59           ` H. Peter Anvin
@ 2007-02-26 22:46           ` Jeff Garzik
  2007-02-26 22:53             ` Ric Wheeler
  3 siblings, 1 reply; 44+ messages in thread
From: Jeff Garzik @ 2007-02-26 22:46 UTC (permalink / raw)
  To: Theodore Tso, Neil Brown, H. Peter Anvin, Ric Wheeler, Linux-ide,
	linux-scsi, linux-raid, Tejun Heo, James Bottomley, Mark Lord,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

Theodore Tso wrote:
> Can someone with knowledge of current disk drive behavior confirm that
> for all drives that support bad block sparing, if an attempt to write
> to a particular spot on disk results in an error due to bad media at
> that spot, the disk drive will automatically rewrite the sector to a
> sector in its spare pool, and automatically redirect that sector to
> the new location.  I believe this should be always true, so presumably
> with all modern disk drives a write error should mean something very
> serious has happened.  


This is what will /probably/ happen.  The drive should indeed find a 
spare sector and remap it, if the write attempt encounters a bad spot on 
the media.

However, with a large enough write, large enough bad-spot-on-media, and 
a firmware programmed to never take more than X seconds to complete 
their enterprise customers' I/O, it might just fail.


IMO, somewhere in the kernel, when we receive a read-op or write-op 
media error, we should immediately try to plaster that area with small 
writes.  Sure, if it's a read-op you lost data, but this method will 
maximize the chance that you can refresh/reuse the logical sectors in 
question.
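
A blunt sketch of that in C (illustrative only; in practice the fd
would need O_DIRECT or an fsync for the writes to actually reach the
media, and in the read case the data written back is already lost or
has to come from a mirror/parity):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

/* After a media error on sectors [start, start + count), rewrite the
 * range one sector at a time so the drive gets a chance to remap each
 * bad spot individually.  Returns the number of sectors that still
 * fail, which is the point at which to escalate. */
int plaster_range(int fd, uint64_t start, uint32_t count)
{
	char zeroes[SECTOR];
	int still_bad = 0;

	memset(zeroes, 0, sizeof(zeroes));
	for (uint32_t i = 0; i < count; i++) {
		off_t off = (off_t)(start + i) * SECTOR;
		if (pwrite(fd, zeroes, SECTOR, off) != SECTOR)
			still_bad++;
	}
	return still_bad;
}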

	Jeff



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 22:46           ` Jeff Garzik
@ 2007-02-26 22:53             ` Ric Wheeler
  2007-02-27  1:19               ` Alan
  0 siblings, 1 reply; 44+ messages in thread
From: Ric Wheeler @ 2007-02-26 22:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita



Jeff Garzik wrote:
> Theodore Tso wrote:
>> Can someone with knowledge of current disk drive behavior confirm that
>> for all drives that support bad block sparing, if an attempt to write
>> to a particular spot on disk results in an error due to bad media at
>> that spot, the disk drive will automatically rewrite the sector to a
>> sector in its spare pool, and automatically redirect that sector to
>> the new location.  I believe this should be always true, so presumably
>> with all modern disk drives a write error should mean something very
>> serious has happened.  
> 
> 
> This is what will /probably/ happen.  The drive should indeed find a 
> spare sector and remap it, if the write attempt encounters a bad spot on 
> the media.
> 
> However, with a large enough write, large enough bad-spot-on-media, and 
> a firmware programmed to never take more than X seconds to complete 
> their enterprise customers' I/O, it might just fail.
> 
> 
> IMO, somewhere in the kernel, when we receive a read-op or write-op 
> media error, we should immediately try to plaster that area with small 
> writes.  Sure, if it's a read-op you lost data, but this method will 
> maximize the chance that you can refresh/reuse the logical sectors in 
> question.
> 
>     Jeff

One interesting counter example is a smaller write than a full page - say 512 
bytes out of 4k.

If we need to do a read-modify-write and it just so happens that 1 of the 7 
sectors we need to read is flaky, will this "look" like a write failure?

ric

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-26 22:53             ` Ric Wheeler
@ 2007-02-27  1:19               ` Alan
  0 siblings, 0 replies; 44+ messages in thread
From: Alan @ 2007-02-27  1:19 UTC (permalink / raw)
  To: ric
  Cc: Jeff Garzik, Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide,
	linux-scsi, linux-raid, Tejun Heo, James Bottomley, Mark Lord,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

> One interesting counter example is a smaller write than a full page - say 512 
> bytes out of 4k.
> 
> If we need to do a read-modify-write and it just so happens that 1 of the 7 
> sectors we need to read is flaky, will this "look" like a write failure?

The current core kernel code can't handle propagating sub-page sized
errors up to the file system layers (there is nowhere in the page cache
to store 'part of this page is missing'). This is a long standing (four
year plus) problem with CD-RW support as well.

For ATA we can at least retrieve the true media sector size now, which
may be helpful at the physical layer but the page cache would need to
grow some brains to do anything with it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-03-01 14:25                       ` James Bottomley
@ 2007-03-01 17:19                         ` H. Peter Anvin
  0 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2007-03-01 17:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, dougg, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

James Bottomley wrote:
> On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
>> James Bottomley wrote:
>>> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>>>> 4104.  It's 8 bytes per hardware sector.  At least for T10...
>>> Er ... that won't look good to the 512 ATA compatibility remapping ...
>>>
>> Well, in that case you'd only see 8x512 data bytes, no metadata...
> 
> i.e. no support for block guard in the 512 byte sector emulation
> mode ...
> 

That makes sense, though... if the raw sector size is 4096 bytes, that 
metadata would presumably not exist on a per-sector basis.

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-03-01  1:28                     ` H. Peter Anvin
@ 2007-03-01 14:25                       ` James Bottomley
  2007-03-01 17:19                         ` H. Peter Anvin
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2007-03-01 14:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, dougg, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> James Bottomley wrote:
> > On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> >> 4104.  It's 8 bytes per hardware sector.  At least for T10...
> > 
> > Er ... that won't look good to the 512 ATA compatibility remapping ...
> > 
> 
> Well, in that case you'd only see 8x512 data bytes, no metadata...

i.e. no support for block guard in the 512 byte sector emulation
mode ...

James



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-28 17:52                   ` James Bottomley
@ 2007-03-01  1:28                     ` H. Peter Anvin
  2007-03-01 14:25                       ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: H. Peter Anvin @ 2007-03-01  1:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, dougg, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>> 4104.  It's 8 bytes per hardware sector.  At least for T10...
> 
> Er ... that won't look good to the 512 ATA compatibility remapping ...
> 

Well, in that case you'd only see 8x512 data bytes, no metadata...

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-28 17:42                 ` Martin K. Petersen
@ 2007-02-28 17:52                   ` James Bottomley
  2007-03-01  1:28                     ` H. Peter Anvin
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2007-02-28 17:52 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: dougg, Alan, Moore, Eric, ric, Theodore Tso, Neil Brown,
	H. Peter Anvin, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> 4104.  It's 8 bytes per hardware sector.  At least for T10...

Er ... that won't look good to the 512 ATA compatibility remapping ...

James



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-28 17:30               ` James Bottomley
@ 2007-02-28 17:42                 ` Martin K. Petersen
  2007-02-28 17:52                   ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-28 17:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: dougg, Martin K. Petersen, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder,
	De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

>>>>> "James" == James Bottomley <James.Bottomley@SteelEye.com> writes:

James> However, I could see the SATA manufacturers selling capacity at
James> 512 (or the new 4096) sectors but allowing their OEMs to
James> reformat them 520 (or 4160)

4104.  It's 8 bytes per hardware sector.  At least for T10...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-28 17:16             ` Martin K. Petersen
@ 2007-02-28 17:30               ` James Bottomley
  2007-02-28 17:42                 ` Martin K. Petersen
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2007-02-28 17:30 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: dougg, Martin K. Petersen, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder,
	De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

On Wed, 2007-02-28 at 12:16 -0500, Martin K. Petersen wrote:
> It's cool that it's on the radar in terms of the protocol.  
> 
> That doesn't mean that drive manufacturers are going to implement it,
> though.  The ones I've talked to were unwilling to sacrifice capacity
> because that's the main competitive factor in the SATA/consumer space.
> 
> Maybe we'll see it in the nearline product ranges?  That would be a
> good start...

They wouldn't necessarily have to sacrifice capacity per se.  The
current problem is that unlike SCSI disks, you can't seem to reformat
SATA ones to arbitrary sector sizes.  However, I could see the SATA
manufacturers selling capacity at 512 (or the new 4096) sectors but
allowing their OEMs to reformat them to 520 (or 4160) and then implementing
block guard on top of this.  The OEMs who did this would obviously lose
1.6% of the capacity, but that would be their choice ...

James



^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: end to end error recovery musings
  2007-02-28 15:19         ` Moore, Eric
  (?)
@ 2007-02-28 17:27         ` Martin K. Petersen
  -1 siblings, 0 replies; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-28 17:27 UTC (permalink / raw)
  To: Moore, Eric; +Cc: linux-scsi

>>>>> "Eric" == Moore, Eric <Eric.Moore@lsi.com> writes:

[Trimmed the worldwide broadcast CC: list down to linux-scsi]

Eric> From the scsi lld perspective, all we need is 32 byte cdbs, and a
Eric> mechanism to pass the tags down from above.  

Ok, so your board only supports Type 2 protection?


Eric> It appears our driver to firmware interface is only providing
Eric> the reference and application tags. 

My current code allows the submitter to specify which tags are valid
between the OS and the HBA.  Your inbound scsi_cmnd will have a 
protection_tag_mask which tells you which fields are provided.

Similarly, there's a mask in scsi_host which allows the HBA to
identify which protection types it supports.  I hadn't envisioned that
an HBA might only provide a subset.  I'll ponder a bit.
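
Purely to illustrate the shape of that (none of the names below beyond
protection_tag_mask are anything real, and even that is only the
proposal as described here):

#include <stdint.h>

#define TAG_GUARD	(1u << 0)
#define TAG_APP		(1u << 1)
#define TAG_REF		(1u << 2)

struct prot_request {
	unsigned int protection_tag_mask;	/* tags the buffer actually carries */
	void *prot_buffer;			/* tuples, kept separate from data */
};

/* An LLD advertising host_prot_mask = TAG_REF | TAG_APP (like the case
 * Eric describes, where the firmware generates the guard tag itself)
 * would refuse or adjust commands carrying tags it cannot pass through. */
static int lld_check_protection(const struct prot_request *rq,
				unsigned int host_prot_mask)
{
	if (rq->protection_tag_mask & ~host_prot_mask)
		return -1;	/* can't pass these tags to the hardware */
	return 0;
}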


Eric> I assume that for transfers greater than a sector, that the
Eric> controller firmware updates the tags for all the other sectors
Eric> within the boundary.  

In other words you only support one app tag per request and not per
sector?

-- 
Martin K. Petersen	Oracle Linux Engineering


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-28 13:46           ` Douglas Gilbert
@ 2007-02-28 17:16             ` Martin K. Petersen
  2007-02-28 17:30               ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-28 17:16 UTC (permalink / raw)
  To: dougg
  Cc: Martin K. Petersen, Alan, Moore, Eric, ric, Theodore Tso,
	Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi, linux-raid,
	Tejun Heo, James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan,
	Singh, Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel,
	Mizar, Sunita

>>>>> "Doug" == Douglas Gilbert <dougg@torque.net> writes:

Doug> Work on SAT-2 is now underway and one of the agenda items is
Doug> "end to end data protection" and is in the hands of the t13
Doug> ATA8-ACS technical editor. So it looks like data integrity is on
Doug> the radar in the SATA world.

It's cool that it's on the radar in terms of the protocol.  

That doesn't mean that drive manufacturers are going to implement it,
though.  The ones I've talked to were unwilling to sacrifice capacity
because that's the main competitive factor in the SATA/consumer space.

Maybe we'll see it in the nearline product ranges?  That would be a
good start...

-- 
Martin K. Petersen      http://mkp.net/


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: end to end error recovery musings
  2007-02-27 19:07       ` Martin K. Petersen
@ 2007-02-28 15:19         ` Moore, Eric
  -1 siblings, 0 replies; 44+ messages in thread
From: Moore, Eric @ 2007-02-28 15:19 UTC (permalink / raw)
  To: Martin K. Petersen, Alan
  Cc: ric, Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide,
	linux-scsi, linux-raid, Tejun Heo, James Bottomley, Mark Lord,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

On Tuesday, February 27, 2007 12:07 PM, Martin K. Petersen wrote: 
> 
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk
> arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
> protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
> user-defined tag (APP) and a 4-byte reference tag (REF).  Depending on
> how the drive is formatted, the REF tag usually needs to match the
> lower 32-bits of the target sector #.
> 

From the scsi lld perspective, all we need is 32 byte cdbs, and a
mechanism to pass the tags down from above.  It appears our driver to
firmware interface is only providing the reference and application
tags. It seems the guard tag is not present, so I guess the mpt fusion
controller firmware is setting it (I will have to check with others).  I
assume that for transfers greater than a sector, the controller
firmware updates the tags for all the other sectors within the boundary.
I'm sure the flags probably tell whether EEDP is enabled or not.  I
will have to check if there are some manufacturing pages that say
whether the controller is capable of EEDP (as not all our controllers
support it).


Here are the EEDP associated fields we provide in our scsi passthru, as
well as target assist.


u32 SecondaryReferenceTag
u16 SecondaryApplicationTag
u16 EEDPFlags
u16 ApplicationTagTranslationMask
u32 EEDPBlockSize

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 22:51           ` Martin K. Petersen
  (?)
@ 2007-02-28 13:46           ` Douglas Gilbert
  2007-02-28 17:16             ` Martin K. Petersen
  -1 siblings, 1 reply; 44+ messages in thread
From: Douglas Gilbert @ 2007-02-28 13:46 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Alan, Moore, Eric, ric, Theodore Tso, Neil Brown, H. Peter Anvin,
	Linux-ide, linux-scsi, linux-raid, Tejun Heo, James Bottomley,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

Martin K. Petersen wrote:
>>>>>> "Alan" == Alan  <alan@lxorguk.ukuu.org.uk> writes:
> 
>>> Not sure you're up-to-date on the T10 data integrity feature.
>>> Essentially it's an extension of the 520 byte sectors common in
>>> disk
> 
> [...]
> 
> Alan> but here's a minor bit of passing bad news - quite a few older
> Alan> ATA controllers can't issue DMA transfers that are not a
> Alan> multiple of 512 bytes without crapping themselves (eg
> Alan> READ_LONG). Guess we may need to add
> Alan> ap-> i_do_not_suck or similar 8)
> 
> I'm afraid it stops even before you get that far.  There doesn't seem
> to be any interest in adopting the Data Integrity Feature (or anything
> similar) in the ATA camp.  So for now it's a SCSI-only thing.
> 
> I encourage people to lean on their favorite disk manufacturer.  This
> would be a great feature to have on SATA too...

Martin,
SCSI to ATA Translation (SAT) is now a standard
(ANSI INCITS 431-2007) [and libata is somewhat
short of compliance].

Work on SAT-2 is now underway and one of the agenda
items is "end to end data protection" and is in the
hands of the t13 ATA8-ACS technical editor. So it
looks like data integrity is on the radar in the SATA
world.

See http://www.t10.org/ftp/t10/document.06/06-497r4.pdf
for more evidence of how SAS and SATA are converging
at the command and feature set level.

Doug Gilbert



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 19:07       ` Martin K. Petersen
@ 2007-02-27 23:39         ` Alan
  -1 siblings, 0 replies; 44+ messages in thread
From: Alan @ 2007-02-27 23:39 UTC (permalink / raw)
  Cc: Martin K. Petersen, Moore, Eric, ric, Theodore Tso, Neil Brown,
	H. Peter Anvin, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan, Singh,
	Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar,
	Sunita

> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk

I saw the basics but not the detail. Thanks for the explanation, it was
most helpful and promises to fix a few things for some controllers... but
here's a minor bit of passing bad news - quite a few older ATA controllers
can't issue DMA transfers that are not a multiple of 512 bytes without
crapping themselves (eg READ_LONG). Guess we may need to add
ap->i_do_not_suck or similar 8)

On the bright side I believe the Intel ICH is the only one with this
problem (and a workaround) which is SATA capable 8)

Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 23:39         ` Alan
@ 2007-02-27 22:51           ` Martin K. Petersen
  -1 siblings, 0 replies; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-27 22:51 UTC (permalink / raw)
  To: Alan
  Cc: Martin K. Petersen, Moore, Eric, ric, Theodore Tso, Neil Brown,
	H. Peter Anvin, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan, Singh,
	Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar,
	Sunita

>>>>> "Alan" == Alan  <alan@lxorguk.ukuu.org.uk> writes:

>> Not sure you're up-to-date on the T10 data integrity feature.
>> Essentially it's an extension of the 520 byte sectors common in
>> disk

[...]

Alan> but here's a minor bit of passing bad news - quite a few older
Alan> ATA controllers can't issue DMA transfers that are not a
Alan> multiple of 512 bytes without crapping themselves (eg
Alan> READ_LONG). Guess we may need to add
Alan> ap->i_do_not_suck or similar 8)

I'm afraid it stops even before you get that far.  There doesn't seem
to be any interest in adopting the Data Integrity Feature (or anything
similar) in the ATA camp.  So for now it's a SCSI-only thing.

I encourage people to lean on their favorite disk manufacturer.  This
would be a great feature to have on SATA too...

-- 
Martin K. Petersen	Oracle Linux Engineering


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 19:02     ` Alan
@ 2007-02-27 19:07       ` Martin K. Petersen
  -1 siblings, 0 replies; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-27 19:07 UTC (permalink / raw)
  To: Alan
  Cc: Martin K. Petersen, Moore, Eric, ric, Theodore Tso, Neil Brown,
	H. Peter Anvin, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan, Singh,
	Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar,
	Sunita

>>>>> "Alan" == Alan  <alan@lxorguk.ukuu.org.uk> writes:

>> These features make the most sense in terms of WRITE.  Disks
>> already have plenty of CRC on the data so if a READ fails on a
>> regular drive we already know about it.

Alan> Don't bet on it. 

This is why I mentioned that I want to expose the protection data to
the host.  As written, DIF only protects the path between initiator
and target.

See below...


Alan> If you want to do this seriously you need an end to end (media
Alan> to host ram) checksum. We do see bizarre and quite evil things
Alan> happen to people occasionally because they rely on bus level
Alan> protection - both faulty network cards and faulty disk or
Alan> controller RAM can cause very bad things to happen in a critical
Alan> environment and are very very hard to detect and test for.

Not sure you're up-to-date on the T10 data integrity feature.
Essentially it's an extension of the 520 byte sectors common in disk
arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
user-defined tag (APP) and a 4-byte reference tag (REF).  Depending on
how the drive is formatted, the REF tag usually needs to match the
lower 32-bits of the target sector #.

For each sector coming in, the disk firmware verifies that the CRC and
the reference tag are in accordance with the contents of the sector and
the CDB start sector + offset.  If they don't match, the drive will
reject the request.

If an HBA is capable of exposing the protection tuples to the host we
can precalculate the checksum and the LBA when submitting a WRITE.  My
current proposal involves passing them down in two separate buffers to
minimize the risk of in-memory corruption (Besides, it would suck if
you had to interleave data and protection data.  The scatterlists
would become long and twisted).

And that's when the READ case becomes interesting, because then the
fs can verify that the checksum of the incoming buffer matches the
GUARD tag.  In that case we'll know there's been no corruption in the
middle.

And of course this also opens up using the APP field to tag sector
contents.
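
Just to make that concrete, here is a rough sketch in C of the tuple and
the host-side READ check being described.  This is only an illustration,
not the proposed API: the struct layout, the crc16_t10() and dif_verify()
helpers, and the 0x8BB7 polynomial / zero seed for the GUARD CRC are
assumptions to be checked against the spec, and the fields are kept in
host byte order even though the on-the-wire tuple is big-endian.

#include <stddef.h>
#include <stdint.h>

struct dif_tuple {                  /* 8 bytes per 512-byte sector */
        uint16_t guard;             /* CRC of the 512 data bytes (GUARD)    */
        uint16_t app;               /* owner-defined tag (APP)              */
        uint32_t ref;               /* low 32 bits of target sector # (REF) */
};

/* GUARD CRC: assumed CRC-16, polynomial 0x8BB7, seed 0. */
static uint16_t crc16_t10(const uint8_t *buf, size_t len)
{
        uint16_t crc = 0;
        int i;

        while (len--) {
                crc ^= (uint16_t)(*buf++) << 8;
                for (i = 0; i < 8; i++)
                        crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
                                             : (crc << 1);
        }
        return crc;
}

/* Host-side READ check: does the buffer we received still match the
 * protection tuple the disk sent alongside it? */
static int dif_verify(const uint8_t sector[512],
                      const struct dif_tuple *t, uint64_t lba)
{
        if (crc16_t10(sector, 512) != t->guard)
                return -1;          /* data mangled somewhere in the middle */
        if (t->ref != (uint32_t)lba)
                return -1;          /* right data, wrong sector             */
        return 0;
}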

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 16:50   ` Martin K. Petersen
@ 2007-02-27 19:02     ` Alan
  -1 siblings, 0 replies; 44+ messages in thread
From: Alan @ 2007-02-27 19:02 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Moore, Eric, ric, Theodore Tso, Neil Brown, H. Peter Anvin,
	Linux-ide, linux-scsi, linux-raid, Tejun Heo, James Bottomley,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

Don't bet on it. If you want to do this seriously you need an end to end
(media to host ram) checksum. We do see bizarre and quite evil things
happen to people occasionally because they rely on bus level protection -
both faulty network cards and faulty disk or controller RAM can cause very
bad things to happen in a critical environment and are very very hard to
detect and test for.

IDE has another hideously evil feature in this area. Command blocks are
sent by PIO cycles, and are therefore unprotected from corruption. So
while a data burst with corruption will error and be retried, a command
whose block number gets corrupted, although very very much less likely
(fewer bits and much lower speed), will not be caught on a PATA system
for read or for write and will hit the wrong block.

With networking you can turn off hardware IP checksumming (and many
cluster people do); with disks we don't yet have a proper end to end
checksum-to-media system in the fs or block layers.

> It would be great if the app tag was more than 16 bits.  Ted mentioned
> that ideally he'd like to store the inode number in the app tag.  But
> as it stands there isn't room.

The lowest few bits are the most important with ext2/ext3 because you
normally lose a sector of inodes, which means you've got dangly bits
associated with a sequence of inodes sharing the same upper bits. More
problematic is losing indirect blocks, and being able to keep some kind
of [inode low bits/block index] would help put stuff back together.
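
Purely as an illustration (the 8/8 split below is invented, not anything
actually proposed), packing both hints into the 16-bit APP tag could look
something like:

#include <stdint.h>

/* Hypothetical: low 8 bits of the inode number plus an 8-bit block index
 * hint, crammed into the 16-bit APP tag. */
static uint16_t app_tag_pack(uint32_t inode, uint32_t block_index)
{
        return (uint16_t)(((inode & 0xff) << 8) | (block_index & 0xff));
}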

Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 16:50   ` Martin K. Petersen
  (?)
@ 2007-02-27 18:51   ` Ric Wheeler
  -1 siblings, 0 replies; 44+ messages in thread
From: Ric Wheeler @ 2007-02-27 18:51 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Moore, Eric, Alan, Theodore Tso, Neil Brown, H. Peter Anvin,
	Linux-ide, linux-scsi, linux-raid, Tejun Heo, James Bottomley,
	Mark Lord, Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet,
	Jochen, Farmer, Matt, linux-fsdevel, Mizar, Sunita

Martin K. Petersen wrote:
>>>>>> "Eric" == Moore, Eric <Eric.Moore@lsi.com> writes:
> 
> Eric> Martin K. Petersen on Data Intergrity Feature, which is also
> Eric> called EEDP(End to End Data Protection), which he presented some
> Eric> ideas/suggestions of adding an API in linux for this.  
> 
> T10 DIF is interesting for a few things: 
> 
>  - Ensuring that the data integrity is preserved when writing a buffer
>    to disk
> 
>  - Ensuring that the write ends up on the right hardware sector
> 
> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

There are paths through a read that could still benefit from the extra 
data integrity.  The CRC gets validated on the physical sector, but we 
don't have the same level of strict data checking once it is read into 
the disk's cache or while it is being transferred out of the cache on 
the way to the transport...

> 
> We can, however, leverage DIF with my proposal to expose the
> protection data to host memory.  This will allow us to verify the data
> integrity information before passing it to the filesystem or
> application.  We can say "this is really the information the disk
> sent. It hasn't been mangled along the way".
> 
> And by using the APP tag we can mark a sector as - say - metadata or
> data to ease putting the recovery puzzle back together.
> 
> It would be great if the app tag was more than 16 bits.  Ted mentioned
> that ideally he'd like to store the inode number in the app tag.  But
> as it stands there isn't room.
> 
> In any case this is all slightly orthogonal to Ric's original post
> about finding the right persistence heuristics in the error handling
> path...
> 

Still all a very relevant discussion - I agree that we could really use 
more than just 16 bits...

ric


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27 19:02     ` Alan
  (?)
@ 2007-02-27 18:39     ` Andreas Dilger
  -1 siblings, 0 replies; 44+ messages in thread
From: Andreas Dilger @ 2007-02-27 18:39 UTC (permalink / raw)
  To: Alan
  Cc: Martin K. Petersen, Moore, Eric, ric, Theodore Tso, Neil Brown,
	H. Peter Anvin, Linux-ide, linux-scsi, linux-raid, Tejun Heo,
	James Bottomley, Mark Lord, Jens Axboe, Clark, Nathan, Singh,
	Arvinder, De Smet, Jochen, Farmer, Matt, linux-fsdevel, Mizar,
	Sunita

On Feb 27, 2007  19:02 +0000, Alan wrote:
> > It would be great if the app tag was more than 16 bits.  Ted mentioned
> > that ideally he'd like to store the inode number in the app tag.  But
> > as it stands there isn't room.
> 
> The lowest few bits are the most important with ext2/ext3 because you
> normally lose a sector of inodes which means you've got dangly bits
> associated with a sequence of inodes with the same upper bits. More
> problematic is losing indirect blocks, and being able to keep some kind
> of [inode low bits/block index] would help put stuff back together.

In the ext4 extents format there is the ability (not implemented yet)
to add some extra information into the extent index blocks (previously
referred to as the ext3_extent_tail).  This is planned to be a checksum
of the index block, and a back-pointer to the inode which is using this
extent block.

This allows online detection of corrupt index blocks, and also detection
of an index block that is written to the wrong location.  There is as
yet no plan that I'm aware of to have in-filesystem checksums of the
extent data.
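
For illustration only, the kind of tail being described might look
roughly like this (hypothetical field names, not the actual on-disk
definition):

#include <stdint.h>

struct extent_tail_sketch {
        uint32_t et_checksum;   /* checksum over the extent index block  */
        uint32_t et_inode;      /* back-pointer: inode owning this block */
};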

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: end to end error recovery musings
  2007-02-27  1:10 ` Moore, Eric
@ 2007-02-27 16:50   ` Martin K. Petersen
  -1 siblings, 0 replies; 44+ messages in thread
From: Martin K. Petersen @ 2007-02-27 16:50 UTC (permalink / raw)
  To: Moore, Eric
  Cc: ric, Alan, Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide,
	linux-scsi, linux-raid, Tejun Heo, James Bottomley, Mark Lord,
	Jens Axboe, Clark, Nathan, Singh, Arvinder, De Smet, Jochen,
	Farmer, Matt, linux-fsdevel, Mizar, Sunita

>>>>> "Eric" == Moore, Eric <Eric.Moore@lsi.com> writes:

Eric> Martin K. Petersen on Data Intergrity Feature, which is also
Eric> called EEDP(End to End Data Protection), which he presented some
Eric> ideas/suggestions of adding an API in linux for this.  

T10 DIF is interesting for a few things: 

 - Ensuring that the data integrity is preserved when writing a buffer
   to disk

 - Ensuring that the write ends up on the right hardware sector

These features make the most sense in terms of WRITE.  Disks already
have plenty of CRC on the data so if a READ fails on a regular drive
we already know about it.

We can, however, leverage DIF with my proposal to expose the
protection data to host memory.  This will allow us to verify the data
integrity information before passing it to the filesystem or
application.  We can say "this is really the information the disk
sent. It hasn't been mangled along the way".

And by using the APP tag we can mark a sector as - say - metadata or
data to ease putting the recovery puzzle back together.
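
For example (a made-up convention, not anything defined by DIF itself),
a filesystem could reserve a couple of APP tag values for that:

/* Hypothetical APP tag values used to classify sector contents. */
enum app_tag_class {
        APP_TAG_DATA     = 0x0001,
        APP_TAG_METADATA = 0x0002,
};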

It would be great if the app tag was more than 16 bits.  Ted mentioned
that ideally he'd like to store the inode number in the app tag.  But
as it stands there isn't room.

In any case this is all slightly orthogonal to Ric's original post
about finding the right persistence heuristics in the error handling
path...

-- 
Martin K. Petersen	Oracle Linux Engineering


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: end to end error recovery musings
@ 2007-02-27  1:10 ` Moore, Eric
  0 siblings, 0 replies; 44+ messages in thread
From: Moore, Eric @ 2007-02-27  1:10 UTC (permalink / raw)
  To: ric, Alan
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

On Monday, February 26, 2007 9:42 AM,  Ric Wheeler wrote:
> Which brings us back to a recent discussion at the file 
> system workshop on being 
> more repair oriented in file system design so we can survive 
> situations like 
> this a bit more reliably ;-)
> 

On the second day of the workshop, there was a presentation given by
Martin K. Petersen on the Data Integrity Feature, also called
EEDP (End to End Data Protection), in which he presented some
ideas/suggestions for adding an API in Linux for this.  I have his
presentation if anyone is interested.  One thing to note is that the
SCSI mid layer needs 32 byte CDB support.

mpt fusion supports EEDP for some versions of Fibre products, and we
plan to add this for next generation SAS products.  We support EEDP in
the Windows driver, where the driver generates its own tags.  Our Linux
driver doesn't.

Here is our 32 byte passthru structure for SCSI_IO, defined in
mpi_init.h, which as you may notice has some tags and flags for EEDP.


typedef struct _MSG_SCSI_IO32_REQUEST
{
    U8                          Port;                           /* 00h */
    U8                          Reserved1;                      /* 01h */
    U8                          ChainOffset;                    /* 02h */
    U8                          Function;                       /* 03h */
    U8                          CDBLength;                      /* 04h */
    U8                          SenseBufferLength;              /* 05h */
    U8                          Flags;                          /* 06h */
    U8                          MsgFlags;                       /* 07h */
    U32                         MsgContext;                     /* 08h */
    U8                          LUN[8];                         /* 0Ch */
    U32                         Control;                        /* 14h */
    MPI_SCSI_IO32_CDB_UNION     CDB;                            /* 18h */
    U32                         DataLength;                     /* 38h */
    U32                         BidirectionalDataLength;        /* 3Ch */
    U32                         SecondaryReferenceTag;          /* 40h */
    U16                         SecondaryApplicationTag;        /* 44h */
    U16                         Reserved2;                      /* 46h */
    U16                         EEDPFlags;                      /* 48h */
    U16                         ApplicationTagTranslationMask;  /* 4Ah */
    U32                         EEDPBlockSize;                  /* 4Ch */
    MPI_SCSI_IO32_ADDRESS       DeviceAddress;                  /* 50h */
    U8                          SGLOffset0;                     /* 58h */
    U8                          SGLOffset1;                     /* 59h */
    U8                          SGLOffset2;                     /* 5Ah */
    U8                          SGLOffset3;                     /* 5Bh */
    U32                         Reserved3;                      /* 5Ch */
    U32                         Reserved4;                      /* 60h */
    U32                         SenseBufferLowAddr;             /* 64h */
    SGE_IO_UNION                SGL;                            /* 68h */
} MSG_SCSI_IO32_REQUEST, MPI_POINTER PTR_MSG_SCSI_IO32_REQUEST,
  SCSIIO32Request_t, MPI_POINTER pSCSIIO32Request_t;

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2007-03-01 17:19 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-23 14:15 end to end error recovery musings Ric Wheeler
2007-02-23 14:15 ` Ric Wheeler
2007-02-24  0:03 ` H. Peter Anvin
2007-02-24  0:37   ` Andreas Dilger
2007-02-24  2:05     ` H. Peter Anvin
2007-02-24  2:32     ` Theodore Tso
2007-02-24 18:39       ` Chris Wedgwood
2007-02-26  5:33       ` Neil Brown
2007-02-26 13:25         ` Theodore Tso
2007-02-26 15:15           ` Alan
2007-02-26 15:18             ` Ric Wheeler
2007-02-26 17:01               ` Alan
2007-02-26 16:42                 ` Ric Wheeler
2007-02-26 15:17           ` James Bottomley
2007-02-26 18:59           ` H. Peter Anvin
2007-02-26 22:46           ` Jeff Garzik
2007-02-26 22:53             ` Ric Wheeler
2007-02-27  1:19               ` Alan
2007-02-26  6:01   ` Douglas Gilbert
2007-02-27  1:10 Moore, Eric
2007-02-27  1:10 ` Moore, Eric
2007-02-27 16:50 ` Martin K. Petersen
2007-02-27 16:50   ` Martin K. Petersen
2007-02-27 18:51   ` Ric Wheeler
2007-02-27 19:02   ` Alan
2007-02-27 19:02     ` Alan
2007-02-27 18:39     ` Andreas Dilger
2007-02-27 19:07     ` Martin K. Petersen
2007-02-27 19:07       ` Martin K. Petersen
2007-02-27 23:39       ` Alan
2007-02-27 23:39         ` Alan
2007-02-27 22:51         ` Martin K. Petersen
2007-02-27 22:51           ` Martin K. Petersen
2007-02-28 13:46           ` Douglas Gilbert
2007-02-28 17:16             ` Martin K. Petersen
2007-02-28 17:30               ` James Bottomley
2007-02-28 17:42                 ` Martin K. Petersen
2007-02-28 17:52                   ` James Bottomley
2007-03-01  1:28                     ` H. Peter Anvin
2007-03-01 14:25                       ` James Bottomley
2007-03-01 17:19                         ` H. Peter Anvin
2007-02-28 15:19       ` Moore, Eric
2007-02-28 15:19         ` Moore, Eric
2007-02-28 17:27         ` Martin K. Petersen
