* RE: end to end error recovery musings
@ 2007-02-27  1:10 ` Moore, Eric
From: Moore, Eric @ 2007-02-27  1:10 UTC (permalink / raw)
  To: ric, Alan
  Cc: Theodore Tso, Neil Brown, H. Peter Anvin, Linux-ide, linux-scsi,
	linux-raid, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe,
	Clark, Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt,
	linux-fsdevel, Mizar, Sunita

On Monday, February 26, 2007 9:42 AM,  Ric Wheeler wrote:
> Which brings us back to a recent discussion at the file 
> system workshop on being 
> more repair oriented in file system design so we can survive 
> situations like 
> this a bit more reliably ;-)
> 

On the second day of the workshop, Martin K. Petersen gave a
presentation on the Data Integrity Feature, also called EEDP (End to End
Data Protection), with some ideas and suggestions for adding a Linux API
for it.  I have his presentation if anyone is interested.  One
prerequisite is that the SCSI midlayer needs support for 32-byte CDBs.

mpt fusion supports EEDP for some versions of our Fibre Channel products,
and we plan to add it for our next-generation SAS products.  We support
EEDP in the Windows driver, where the driver generates its own tags.  Our
Linux driver doesn't.
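
To make "the driver generates its own tags" concrete, here is a minimal sketch that assumes the T10 DIF layout: an 8-byte tuple per data block holding a CRC-16 guard tag (polynomial 0x8BB7), an application tag, and a reference tag taken from the low 32 bits of the LBA. The names dif_tuple, crc16_t10dif, dif_generate and dif_verify are made up for illustration; none of this is taken from the mpt fusion driver or from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-byte protection tuple stored after each data block, big-endian fields. */
struct dif_tuple {
    uint8_t guard[2];   /* CRC-16 over the block data */
    uint8_t app_tag[2]; /* owned by the application, opaque to the device */
    uint8_t ref_tag[4]; /* assumed here: low 32 bits of the target LBA */
};
static_assert(sizeof(struct dif_tuple) == 8, "tuple must be 8 bytes");

/* Bitwise CRC-16: polynomial 0x8BB7, initial value 0, no bit reflection. */
uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Fill in the tuple for one block that will be written at the given LBA. */
void dif_generate(struct dif_tuple *t, const uint8_t *block, size_t len,
                  uint16_t app_tag, uint32_t lba)
{
    uint16_t crc = crc16_t10dif(block, len);
    t->guard[0] = (uint8_t)(crc >> 8);
    t->guard[1] = (uint8_t)(crc & 0xFF);
    t->app_tag[0] = (uint8_t)(app_tag >> 8);
    t->app_tag[1] = (uint8_t)(app_tag & 0xFF);
    t->ref_tag[0] = (uint8_t)(lba >> 24);
    t->ref_tag[1] = (uint8_t)(lba >> 16);
    t->ref_tag[2] = (uint8_t)(lba >> 8);
    t->ref_tag[3] = (uint8_t)(lba & 0xFF);
}

/* Return 1 if the guard and reference tags match the data and LBA, else 0.
 * The application tag is deliberately not checked in this sketch. */
int dif_verify(const struct dif_tuple *t, const uint8_t *block, size_t len,
               uint32_t lba)
{
    struct dif_tuple expect;
    dif_generate(&expect, block, len, 0, lba);
    return memcmp(expect.guard, t->guard, 2) == 0 &&
           memcmp(expect.ref_tag, t->ref_tag, 4) == 0;
}
```

The guard tag catches corrupted data and the reference tag catches a block that landed at the wrong LBA, which is the end-to-end property under discussion.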

Here is our 32 byte passthru structure for SCSI_IO, defined in
mpi_init.h, which as you may notice has some tags and flags for EEDP.


typedef struct _MSG_SCSI_IO32_REQUEST
{
    U8                          Port;                           /* 00h */
    U8                          Reserved1;                      /* 01h */
    U8                          ChainOffset;                    /* 02h */
    U8                          Function;                       /* 03h */
    U8                          CDBLength;                      /* 04h */
    U8                          SenseBufferLength;              /* 05h */
    U8                          Flags;                          /* 06h */
    U8                          MsgFlags;                       /* 07h */
    U32                         MsgContext;                     /* 08h */
    U8                          LUN[8];                         /* 0Ch */
    U32                         Control;                        /* 14h */
    MPI_SCSI_IO32_CDB_UNION     CDB;                            /* 18h */
    U32                         DataLength;                     /* 38h */
    U32                         BidirectionalDataLength;        /* 3Ch */
    U32                         SecondaryReferenceTag;          /* 40h */
    U16                         SecondaryApplicationTag;        /* 44h */
    U16                         Reserved2;                      /* 46h */
    U16                         EEDPFlags;                      /* 48h */
    U16                         ApplicationTagTranslationMask;  /* 4Ah */
    U32                         EEDPBlockSize;                  /* 4Ch */
    MPI_SCSI_IO32_ADDRESS       DeviceAddress;                  /* 50h */
    U8                          SGLOffset0;                     /* 58h */
    U8                          SGLOffset1;                     /* 59h */
    U8                          SGLOffset2;                     /* 5Ah */
    U8                          SGLOffset3;                     /* 5Bh */
    U32                         Reserved3;                      /* 5Ch */
    U32                         Reserved4;                      /* 60h */
    U32                         SenseBufferLowAddr;             /* 64h */
    SGE_IO_UNION                SGL;                            /* 68h */
} MSG_SCSI_IO32_REQUEST, MPI_POINTER PTR_MSG_SCSI_IO32_REQUEST,
  SCSIIO32Request_t, MPI_POINTER pSCSIIO32Request_t;
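
The offset comments in the structure above can be checked mechanically. The sketch below is a stand-in, not the real header: the U8/U16/U32 typedefs, the 32-byte CDB and 8-byte address placeholders (sizes inferred only from the gaps between the commented offsets), and the name scsi_io32_sketch are all assumptions. The trailing SGL member is left out because its size is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;

/* Placeholders sized from the offset comments, not copied from mpi_init.h. */
typedef struct { U8 bytes[32]; } CDB32_STANDIN;  /* 18h through 37h */
typedef struct { U8 bytes[8];  } ADDR_STANDIN;   /* 50h through 57h */

struct scsi_io32_sketch {
    U8  Port, Reserved1, ChainOffset, Function;
    U8  CDBLength, SenseBufferLength, Flags, MsgFlags;
    U32 MsgContext;
    U8  LUN[8];
    U32 Control;
    CDB32_STANDIN CDB;
    U32 DataLength;
    U32 BidirectionalDataLength;
    U32 SecondaryReferenceTag;
    U16 SecondaryApplicationTag;
    U16 Reserved2;
    U16 EEDPFlags;
    U16 ApplicationTagTranslationMask;
    U32 EEDPBlockSize;
    ADDR_STANDIN DeviceAddress;
    U8  SGLOffset0, SGLOffset1, SGLOffset2, SGLOffset3;
    U32 Reserved3;
    U32 Reserved4;
    U32 SenseBufferLowAddr;
    /* SGL would start here, at 68h. */
};

/* Fail the build if a field drifts from its documented offset. */
#define CHECK(field, off) \
    static_assert(offsetof(struct scsi_io32_sketch, field) == (off), \
                  #field " offset")

CHECK(MsgContext, 0x08);
CHECK(LUN, 0x0C);
CHECK(Control, 0x14);
CHECK(CDB, 0x18);
CHECK(DataLength, 0x38);
CHECK(SecondaryReferenceTag, 0x40);
CHECK(SecondaryApplicationTag, 0x44);
CHECK(EEDPFlags, 0x48);
CHECK(ApplicationTagTranslationMask, 0x4A);
CHECK(EEDPBlockSize, 0x4C);
CHECK(DeviceAddress, 0x50);
CHECK(SGLOffset0, 0x58);
CHECK(SenseBufferLowAddr, 0x64);

/* Size of the fixed part of the request, i.e. where the SGL begins. */
size_t scsi_io32_fixed_size(void)
{
    return sizeof(struct scsi_io32_sketch);
}
```

With natural alignment every field lands on its commented offset without packing pragmas, and the fixed part ends at 68h where the SGL begins.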

* end to end error recovery musings
@ 2007-02-23 14:15 ` Ric Wheeler
From: Ric Wheeler @ 2007-02-23 14:15 UTC (permalink / raw)
  To: Linux-ide, linux-scsi, linux-raid, Tejun Heo
  Cc: James Bottomley, Mark Lord, Neil Brown, Jens Axboe, Clark,
	Nathan, Singh, Arvinder, De Smet, Jochen, Farmer, Matt

In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.

My group has been working to stabilize a relatively up to date libata + 
MD based box, so I can try to lay out at least one "appliance like" 
typical configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy (or build) similar home 
appliances that use Linux to provide a NAS in a box for end users.

The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a backport of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.

Each box has a watchdog timer that can be set to fire after at most 2 
minutes.

(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).

Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.
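
As a rough way to reason about the watchdog budget, the sketch below models the worst case as failing devices times per-command timeout times the number of passes over the same bad sectors. The model and the function names are assumptions for illustration, and the 30-second timeout in the usage note is an assumed value; real error handling adds resets and retries that this ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Worst-case seconds if each failing device is tried in turn and every
 * attempt runs to its full timeout (a deliberately simple model). */
unsigned worst_case_seconds(unsigned failing_devices, unsigned timeout_s,
                            unsigned passes)
{
    return failing_devices * timeout_s * passes;
}

/* True if error handling finishes strictly before the watchdog fires. */
bool fits_watchdog(unsigned worst_s, unsigned watchdog_s)
{
    assert(watchdog_s > 0);
    return worst_s < watchdog_s;
}
```

With 3 of 4 mirrors failing, an assumed 30-second timeout and the 2-minute watchdog, one pass costs 90 seconds and fits; a second pass over the same sectors costs 180 seconds and does not.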

We still have the following challenges:

    (1) Read-ahead often means that we will retry every bad sector at 
least twice from the file system level. First, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms); then the real application 
read does the same thing.  Not sure what the answer is here since 
read-ahead is obviously a huge win in the normal case.

    (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again and communicate up 
the stack to file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.

    (3) The timeout values on the failed IO's need to be tuned well (as 
was discussed in an earlier linux-ide thread). We cannot afford to hang 
for 30 seconds, especially in the MD case, since you might need to fail 
more than one device for a single IO.  Prompt error propagation (say 
that 4 times quickly!) can allow MD to mask the underlying errors as you 
would hope; hanging on too long will almost certainly cause a watchdog 
reboot...

    (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.
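
Point (2) can be sketched as follows, assuming the device reports the first failing LBA when a merged request hits a media error: record that LBA and resubmit only the sectors after it, so each bad sector is touched once. The sim_dev model, sim_read and read_isolating_errors are hypothetical names for illustration and do not correspond to libata, SCSI or MD interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model: bad LBAs in ascending order, plus a count of
 * how many times a request ran into one of them. */
struct sim_dev {
    const uint64_t *bad;
    size_t nbad;
    unsigned bad_touches;
};

/* Simulated read of [start, start + count). On a media error it reports
 * the first bad LBA in the range and returns false. */
bool sim_read(struct sim_dev *d, uint64_t start, uint64_t count,
              uint64_t *failed_lba)
{
    for (size_t i = 0; i < d->nbad; i++) {
        if (d->bad[i] >= start && d->bad[i] < start + count) {
            d->bad_touches++;
            *failed_lba = d->bad[i];
            return false;
        }
    }
    return true;
}

/* Read a merged range, isolating each failure to a single LBA. Stores up
 * to out_cap bad LBAs in out[] and returns how many were stored. */
size_t read_isolating_errors(struct sim_dev *d, uint64_t start,
                             uint64_t count, uint64_t *out, size_t out_cap)
{
    assert(out != NULL || out_cap == 0);
    size_t found = 0;
    uint64_t end = start + count;

    while (start < end) {
        uint64_t bad;
        if (sim_read(d, start, end - start, &bad))
            break;              /* the remainder read cleanly */
        if (found < out_cap)
            out[found++] = bad; /* what gets reported up the stack */
        start = bad + 1;        /* never reissue I/O covering the bad LBA */
    }
    return found;
}
```

The number of requests issued is one more than the number of bad sectors in the range, which keeps total error handling time proportional to the damage, not to the size of the merged IO.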

We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with Mark's 
patches so we can share the pain ;-)

ric
