From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Martin K. Petersen"
Subject: Re: end to end error recovery musings
Date: Tue, 27 Feb 2007 14:07:12 -0500
Message-ID:
References: <664A4EBB07F29743873A87CF62C26D705D6DDB@NAMAIL4.ad.lsil.com> <20070227190236.58323a40@lxorguk.ukuu.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
In-Reply-To: <20070227190236.58323a40@lxorguk.ukuu.org.uk> (alan@lxorguk.ukuu.org.uk's message of "Tue, 27 Feb 2007 19:02:36 +0000")
Sender: linux-scsi-owner@vger.kernel.org
To: Alan
Cc: "Martin K. Petersen", "Moore, Eric", ric@emc.com, Theodore Tso, Neil Brown, "H. Peter Anvin", Linux-ide, linux-scsi, linux-raid@vger.kernel.org, Tejun Heo, James Bottomley, Mark Lord, Jens Axboe, "Clark, Nathan", "Singh, Arvinder", "De Smet, Jochen", "Farmer, Matt", linux-fsdevel@vger.kernel.org, "Mizar, Sunita"
List-Id: linux-raid.ids

>>>>> "Alan" == Alan writes:

>> These features make the most sense in terms of WRITE. Disks
>> already have plenty of CRC on the data so if a READ fails on a
>> regular drive we already know about it.

Alan> Don't bet on it.

This is why I mentioned that I want to expose the protection data to the host. As written, DIF only protects the path between initiator and target. See below...

Alan> If you want to do this seriously you need an end to end (media
Alan> to host ram) checksum. We do see bizarre and quite evil things
Alan> happen to people occasionally because they rely on bus level
Alan> protection - both faulty network cards and faulty disk or
Alan> controller RAM can cause very bad things to happen in a critical
Alan> environment and are very very hard to detect and test for.

Not sure you're up-to-date on the T10 data integrity feature. Essentially it's an extension of the 520 byte sectors common in disk arrays. For each 512 byte sector (or 4KB ditto) you get 8 bytes of protection data. There's a 2 byte CRC (GUARD tag), a 2 byte user-defined tag (APP) and a 4-byte reference tag (REF).
Depending on how the drive is formatted, the REF tag usually needs to match the lower 32 bits of the target sector #. For each sector coming in, the disk firmware verifies that the CRC and the reference tags are in accordance with the contents of the sector and the CDB start sector + offset. If they don't match the drive will reject the request.

If an HBA is capable of exposing the protection tuples to the host we can precalculate the checksum and the LBA when submitting a WRITE. My current proposal involves passing them down in two separate buffers to minimize the risk of in-memory corruption. (Besides, it would suck if you had to interleave data and protection data. The scatterlists would become long and twisted.)

And that's when the READ case becomes interesting. Because then the fs can verify that the checksum of the in-buffer matches the GUARD tag. In that case we'll know there's been no corruption in the middle.

And of course this also opens up using the APP field to tag sector contents.

-- 
Martin K. Petersen	Oracle Linux Engineering
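
Illustrative note (not part of the original mail): the WRITE and READ cases described above can be sketched in a few lines of Python. This is only a rough model under stated assumptions, not kernel or firmware code. The mail gives the tuple layout (2-byte GUARD, 2-byte APP, 4-byte REF) but not the CRC polynomial or the byte order; the 0x8BB7 polynomial, the zero initial value, and the big-endian packing are assumed here from the T10 DIF definition. The function names `crc16_t10dif`, `make_tuple`, and `verify_tuple` are hypothetical helpers invented for this sketch.

```python
import struct

# Assumption: CRC-16 generator polynomial used by T10 DIF (not stated in the mail).
T10_DIF_POLY = 0x8BB7


def crc16_t10dif(data: bytes) -> int:
    """Bitwise MSB-first CRC-16 with initial value 0 and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte protection tuple for one sector (WRITE side).

    GUARD = CRC of the sector data, APP = user-defined tag,
    REF = lower 32 bits of the target sector number.
    Big-endian packing is an assumption of this sketch.
    """
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_tuple(sector: bytes, lba: int, prot: bytes) -> bool:
    """Check GUARD and REF against the data and the expected LBA (READ side)."""
    guard, _app, ref = struct.unpack(">HHI", prot)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)


# Data and protection tuples travel in two separate buffers, as in the proposal.
start_lba = 1000
data_buf = [bytes([i]) * 512 for i in range(4)]
prot_buf = [make_tuple(s, start_lba + i) for i, s in enumerate(data_buf)]

# On READ the host repeats the check the drive firmware did on WRITE.
all_ok = all(
    verify_tuple(s, start_lba + i, p)
    for i, (s, p) in enumerate(zip(data_buf, prot_buf))
)
```

Keeping `data_buf` and `prot_buf` as separate lists mirrors the two-buffer idea: a stray write into one buffer does not silently fix up the other, so a mismatch still shows up when the tuple is verified.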