Re: [PATCH v4 1/3] block: centralize PI remapping logic to the block layer

From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Max Gurtovoy <maxg@mellanox.com>
Cc: axboe@kernel.dk, keith.busch@intel.com, sagi@grimberg.me,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	israelr@mellanox.com, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, shlomin@mellanox.com, hch@lst.de
Subject: Re: [PATCH v4 1/3] block: centralize PI remapping logic to the block layer
Date: Mon, 09 Sep 2019 22:29:34 -0400
Message-ID: <yq1d0g8hoj5.fsf@oracle.com>
In-Reply-To: <d6cfe6e5-508a-f01c-267d-c8009fafc571@mellanox.com> (Max Gurtovoy's message of "Mon, 9 Sep 2019 16:55:57 +0300")

Max,

> maybe we can add profiles to type0 and type2 in the future and have
> more readable code.

It's a deliberate feature that we treat DIX Type 0, 1, and 2 the
same. It's very common to mix and match legacy drives, T10 PI Type 1,
and T10 PI Type 2 devices in a system. In order for MD/DM stacking,
multipathing, etc. to work, it is important that all devices share the
same protection format, interpretation of the tags, etc.

Type 2, where the ref tag can be different from the LBA, was designed
exclusively for use inside disk arrays where the array firmware is the
sole entity accessing blocks on media and thus always knows what the
expected ref tag should be for a given LBA (typically the LUN LBA as
seen by the host interface and not the target LBA on the back-end
drive).

For Linux, however, where we need to support dd'ing from the device node
without any knowledge an application or filesystem may have about the
written PI, it's imperative that the reference tag is something
predictable. Therefore it is deliberate that we always use the LBA (or
a derivative thereof for the smaller intervals) for the reference tag,
even though T10 PI Type 2 in theory allows the tag to be an arbitrary
number. Linux is a general-purpose OS and not array controller
firmware, so we can't really leverage that capability.
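
To make the "predictable reference tag" point concrete, here is a
minimal sketch of deriving the expected tag from the position on the
device alone. The helper name expected_ref_tag() and its parameters are
hypothetical; the sketch assumes positions are counted in 512-byte
sectors and that the tag is the low 32 bits of the protection interval
number.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumption: positions are in 512-byte sectors */

/*
 * Hypothetical helper: expected reference tag for the protection
 * interval containing 'sector'. 'interval_exp' is log2 of the
 * protection interval size in bytes (9 for 512, 12 for 4096).
 */
static uint32_t expected_ref_tag(uint64_t sector, unsigned int interval_exp)
{
	assert(interval_exp >= SECTOR_SHIFT &&
	       interval_exp - SECTOR_SHIFT < 64);

	/* Scale sectors to intervals; the tag keeps only the low 32 bits. */
	return (uint32_t)(sector >> (interval_exp - SECTOR_SHIFT));
}
```

Because the tag depends on nothing but the position, a reader such as dd
can have the PI verified without knowing who wrote the block.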

Also, take MD, for instance. The same I/O could be going to a mirror of
Type 1 and Type 2 devices. We obviously can't have two different types
of PI hanging off a bio. Nor do we have the capability to handle
arbitrary MD/DM stacking with PI format properties potentially changing
many times within the block range constituting a single I/O.

That's why we have the integrity profile which describes a common block
layer PI format that's somewhat orthogonal to how the underlying device
is formatted.
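
As an illustration only, the mapping from device format to block layer
profile could be sketched like this. The enum, pick_profile_name() and
same_profile() are hypothetical names, not kernel API; the profile
strings follow the T10-DIF-TYPE1-CRC naming used in this thread, and
the IP variants are included on the assumption that the legacy IP
checksum is selected per host.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum for how the underlying device is formatted. */
enum pi_type { PI_TYPE0 = 0, PI_TYPE1 = 1, PI_TYPE2 = 2, PI_TYPE3 = 3 };

/*
 * Type 0, 1 and 2 devices all map to the same Type 1 profile, so the
 * block layer sees one PI format regardless of the drive. Only Type 3
 * is special-cased.
 */
static const char *pick_profile_name(enum pi_type dev_type, bool ip_checksum)
{
	if (dev_type == PI_TYPE3)
		return ip_checksum ? "T10-DIF-TYPE3-IP" : "T10-DIF-TYPE3-CRC";

	return ip_checksum ? "T10-DIF-TYPE1-IP" : "T10-DIF-TYPE1-CRC";
}

/* Two devices can sit under one MD/DM device if their profiles match. */
static bool same_profile(enum pi_type a, enum pi_type b, bool ip_checksum)
{
	return strcmp(pick_profile_name(a, ip_checksum),
		      pick_profile_name(b, ip_checksum)) == 0;
}
```

With this mapping, a mirror of a Type 1 and a Type 2 device ends up
with a single profile, which is what lets one bio carry one kind of PI.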

There are a couple of warts in that department. One is the IP checksum
which is now mostly a legacy thing and not implemented/relevant for
NVMe. The other is Type 3 devices that need special care and
feeding. But Type 3 does not appear to be actively used by anyone
anymore. We recently discovered that it's completely broken in the NVMe
spec and nobody ever noticed. And I don't think it was ever used
as-written in SCSI (Type 3 was an attempt to standardize a particular
vendor's existing, proprietary format).

Anyway. So my take on all this is that the T10-DIF-TYPE1-CRC profile is
"it" and everything else is legacy.

> I think I'll prepare dummy/empty callbacks for type3 and for nop
> profiles instead of setting it to NULL.
>
> agreed ?

Sure. Whatever works.
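
For reference, one way the dummy-callback approach could look. This is
a sketch under assumptions, not the patch itself: struct pi_profile,
the pi_nop_* functions and remap_for_submit() are hypothetical
stand-ins for whatever the series ends up defining.

```c
#include <assert.h>
#include <stddef.h>

struct request;	/* opaque here; only passed through */

/* Hypothetical profile with the two remapping hooks. */
struct pi_profile {
	const char *name;
	void (*prepare)(struct request *rq);
	void (*complete)(struct request *rq, unsigned int nr_bytes);
};

/* Empty callbacks: nothing to remap for this profile. */
static void pi_nop_prepare(struct request *rq)
{
	(void)rq;
}

static void pi_nop_complete(struct request *rq, unsigned int nr_bytes)
{
	(void)rq;
	(void)nr_bytes;
}

static const struct pi_profile nop_profile = {
	.name		= "nop",
	.prepare	= pi_nop_prepare,
	.complete	= pi_nop_complete,
};

/* Callers can invoke the hook without checking for NULL first. */
static void remap_for_submit(const struct pi_profile *p, struct request *rq)
{
	assert(p->prepare != NULL);
	p->prepare(rq);
}
```

The trade-off is an extra indirect call on profiles that need no
remapping, in exchange for call sites without NULL checks.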

-- 
Martin K. Petersen	Oracle Linux Engineering
