All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Alan <alan@lxorguk.ukuu.org.uk>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	"Moore, Eric" <Eric.Moore@lsi.com>,
	ric@emc.com, Theodore Tso <tytso@mit.edu>,
	Neil Brown <neilb@suse.de>, "H. Peter Anvin" <hpa@zytor.com>,
	Linux-ide <linux-ide@vger.kernel.org>,
	linux-scsi <linux-scsi@vger.kernel.org>,
	linux-raid@vger.kernel.org, Tejun Heo <htejun@gmail.com>,
	James Bottomley <James.Bottomley@SteelEye.com>,
	Mark Lord <mlord@pobox.com>, Jens Axboe <jens.axboe@oracle.com>,
	"Clark, Nathan" <Clark_Nathan@emc.com>,
	"Singh, Arvinder" <Singh_Arvinder@emc.com>,
	"De Smet, Jochen" <DeSmet_Jochen@emc.com>,
	"Farmer, Matt" <Farmer_Matt@emc.com>,
	linux-fsdevel@vger.kernel.org, "Mizar,
	Sunita" <Mizar_Sunita@emc.com>
Subject: Re: end to end error recovery musings
Date: Tue, 27 Feb 2007 14:07:12 -0500	[thread overview]
Message-ID: <yq1r6sb7733.fsf@sermon.lab.mkp.net> (raw)
In-Reply-To: <20070227190236.58323a40@lxorguk.ukuu.org.uk> (alan@lxorguk.ukuu.org.uk's message of "Tue, 27 Feb 2007 19:02:36 +0000")

>>>>> "Alan" == Alan  <alan@lxorguk.ukuu.org.uk> writes:

>> These features make the most sense in terms of WRITE.  Disks
>> already have plenty of CRC on the data so if a READ fails on a
>> regular drive we already know about it.

Alan> Don't bet on it. 

This is why I mentioned that I want to expose the protection data to
the host.  As written, DIF only protects the path between initiator
and target.

See below...


Alan> If you want to do this seriously you need an end to end (media
Alan> to host ram) checksum. We do see bizarre and quite evil things
Alan> happen to people occasionally because they rely on bus level
Alan> protection - both faulty network cards and faulty disk or
Alan> controller RAM can cause very bad things to happen in a critical
Alan> environment and are very very hard to detect and test for.

Not sure you're up-to-date on the T10 data integrity feature.
Essentially it's an extension of the 520 byte sectors common in disk
arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
user-defined tag (APP) and a 4-byte reference tag (REF).  Depending on
how the drive is formatted, the REF tag usually needs to match the
lower 32-bits of the target sector #.

For each sector coming in the disk firmware verifies that the CRC and
the reference tags are in accordance with the contents of the sector
and the CDB start sector + offset.  If they don't match the drive will
reject the request.

If an HBA is capable of exposing the protection tuples to the host we
can precalculate the checksum and the LBA when submitting a WRITE.  My
current proposal involves passing them down in two separate buffers to
minimize the risk of in-memory corruption (Besides, it would suck if
you had to interleave data and protection data.  The scatterlists
would become long and twisted).

And that's when the READ case becomes interesting.  Because then the
fs can verify that the checksum of the in-buffer matches of the GUARD
tag.  In that case we'll know there's been no corruption in the
middle.

And of course this also opens up using the APP field to tag sector
contents.

-- 
Martin K. Petersen	Oracle Linux Engineering

WARNING: multiple messages have this Message-ID (diff)
From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Alan <alan@lxorguk.ukuu.org.uk>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	"Moore, Eric" <Eric.Moore@lsi.com>, <ric@emc.com>,
	"Theodore Tso" <tytso@mit.edu>, "Neil Brown" <neilb@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Linux-ide" <linux-ide@vger.kernel.org>,
	"linux-scsi" <linux-scsi@vger.kernel.org>,
	<linux-raid@vger.kernel.org>, "Tejun Heo" <htejun@gmail.com>,
	"James Bottomley" <James.Bottomley@SteelEye.com>,
	"Mark Lord" <mlord@pobox.com>,
	"Jens Axboe" <jens.axboe@oracle.com>,
	"Clark, Nathan" <Clark_Nathan@emc.com>,
	"Singh, Arvinder" <Singh_Arvinder@emc.com>,
	"De Smet, Jochen" <DeSmet_Jochen@emc.com>,
	"Farmer, Matt" <Farmer_Matt@emc.com>,
	<linux-fsdevel@vger.kernel.org>,
	"Mizar, Sunita" <Mizar_Sunita@emc.com>
Subject: Re: end to end error recovery musings
Date: Tue, 27 Feb 2007 14:07:12 -0500	[thread overview]
Message-ID: <yq1r6sb7733.fsf@sermon.lab.mkp.net> (raw)
In-Reply-To: <20070227190236.58323a40@lxorguk.ukuu.org.uk> (alan@lxorguk.ukuu.org.uk's message of "Tue, 27 Feb 2007 19:02:36 +0000")

>>>>> "Alan" == Alan  <alan@lxorguk.ukuu.org.uk> writes:

>> These features make the most sense in terms of WRITE.  Disks
>> already have plenty of CRC on the data so if a READ fails on a
>> regular drive we already know about it.

Alan> Don't bet on it. 

This is why I mentioned that I want to expose the protection data to
the host.  As written, DIF only protects the path between initiator
and target.

See below...


Alan> If you want to do this seriously you need an end to end (media
Alan> to host ram) checksum. We do see bizarre and quite evil things
Alan> happen to people occasionally because they rely on bus level
Alan> protection - both faulty network cards and faulty disk or
Alan> controller RAM can cause very bad things to happen in a critical
Alan> environment and are very very hard to detect and test for.

Not sure you're up-to-date on the T10 data integrity feature.
Essentially it's an extension of the 520 byte sectors common in disk
arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
user-defined tag (APP) and a 4-byte reference tag (REF).  Depending on
how the drive is formatted, the REF tag usually needs to match the
lower 32-bits of the target sector #.

For each sector coming in the disk firmware verifies that the CRC and
the reference tags are in accordance with the contents of the sector
and the CDB start sector + offset.  If they don't match the drive will
reject the request.

If an HBA is capable of exposing the protection tuples to the host we
can precalculate the checksum and the LBA when submitting a WRITE.  My
current proposal involves passing them down in two separate buffers to
minimize the risk of in-memory corruption (Besides, it would suck if
you had to interleave data and protection data.  The scatterlists
would become long and twisted).

And that's when the READ case becomes interesting.  Because then the
fs can verify that the checksum of the in-buffer matches of the GUARD
tag.  In that case we'll know there's been no corruption in the
middle.

And of course this also opens up using the APP field to tag sector
contents.

-- 
Martin K. Petersen	Oracle Linux Engineering

  parent reply	other threads:[~2007-02-27 19:07 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-02-27  1:10 end to end error recovery musings Moore, Eric
2007-02-27  1:10 ` Moore, Eric
2007-02-27 16:50 ` Martin K. Petersen
2007-02-27 16:50   ` Martin K. Petersen
2007-02-27 18:51   ` Ric Wheeler
2007-02-27 19:02   ` Alan
2007-02-27 19:02     ` Alan
2007-02-27 18:39     ` Andreas Dilger
2007-02-27 19:07     ` Martin K. Petersen [this message]
2007-02-27 19:07       ` Martin K. Petersen
2007-02-27 23:39       ` Alan
2007-02-27 23:39         ` Alan
2007-02-27 22:51         ` Martin K. Petersen
2007-02-27 22:51           ` Martin K. Petersen
2007-02-28 13:46           ` Douglas Gilbert
2007-02-28 17:16             ` Martin K. Petersen
2007-02-28 17:30               ` James Bottomley
2007-02-28 17:42                 ` Martin K. Petersen
2007-02-28 17:52                   ` James Bottomley
2007-03-01  1:28                     ` H. Peter Anvin
2007-03-01 14:25                       ` James Bottomley
2007-03-01 17:19                         ` H. Peter Anvin
2007-02-28 15:19       ` Moore, Eric
2007-02-28 15:19         ` Moore, Eric
2007-02-28 17:27         ` Martin K. Petersen
  -- strict thread matches above, loose matches on Subject: below --
2007-02-23 14:15 Ric Wheeler
2007-02-23 14:15 ` Ric Wheeler
2007-02-24  0:03 ` H. Peter Anvin
2007-02-24  0:37   ` Andreas Dilger
2007-02-24  2:05     ` H. Peter Anvin
2007-02-24  2:32     ` Theodore Tso
2007-02-24 18:39       ` Chris Wedgwood
2007-02-26  5:33       ` Neil Brown
2007-02-26 13:25         ` Theodore Tso
2007-02-26 15:15           ` Alan
2007-02-26 15:18             ` Ric Wheeler
2007-02-26 17:01               ` Alan
2007-02-26 16:42                 ` Ric Wheeler
2007-02-26 15:17           ` James Bottomley
2007-02-26 18:59           ` H. Peter Anvin
2007-02-26 22:46           ` Jeff Garzik
2007-02-26 22:53             ` Ric Wheeler
2007-02-27  1:19               ` Alan
2007-02-26  6:01   ` Douglas Gilbert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=yq1r6sb7733.fsf@sermon.lab.mkp.net \
    --to=martin.petersen@oracle.com \
    --cc=Clark_Nathan@emc.com \
    --cc=DeSmet_Jochen@emc.com \
    --cc=Eric.Moore@lsi.com \
    --cc=Farmer_Matt@emc.com \
    --cc=James.Bottomley@SteelEye.com \
    --cc=Mizar_Sunita@emc.com \
    --cc=Singh_Arvinder@emc.com \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=hpa@zytor.com \
    --cc=htejun@gmail.com \
    --cc=jens.axboe@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-ide@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=mlord@pobox.com \
    --cc=neilb@suse.de \
    --cc=ric@emc.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.