All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-25 20:13 Walker, Benjamin
  0 siblings, 0 replies; 11+ messages in thread
From: Walker, Benjamin @ 2018-09-25 20:13 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 9753 bytes --]

On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> Sorry for resend, I revised a little.
> 
> 
> Hi All,
> 
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.

There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.

There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.

> 
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
> 
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?

I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.

> 
> Current implementation is Option 1 but NVMe-oF target cuts off the size of
> metadata now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
> 
> Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
> 
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
> 
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
> 
> Option 1 looks reasonable to me. I will be able to provide an new bdev module
> to use extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take Option 2, we will have to provide the set of ELBA aware APIs.

The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.

> 
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
> 
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.

> 
> About T10 PI, I will send questions as separate mail later.
> 
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> 
> Thanks,
> Shuhei
> 
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月25日 9:14
> 宛先: spdk(a)lists.01.org
> 件名: Design policy to support extended LBA and T10 PI in bdevs.
> 
> 
> Hi All,
> 
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.
> 
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
> 
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?
> 
> Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
> Even if we take Option 1
> 
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
> 
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
> A1 looks reasonable to me. I will be able to provide an new bdev module to use
> extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take A2, we will have to provide the set of ELBA aware APIs.
> 
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.
> 
> About T10 PI, I will send questions as separate mail later.
> 
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> 
> Thanks,
> Shuhei
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-27  7:23 Liu, Changpeng
  0 siblings, 0 replies; 11+ messages in thread
From: Liu, Changpeng @ 2018-09-27  7:23 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 29879 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Thursday, September 27, 2018 3:09 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> Hi Changpeng,
> 
> 
> > Besides, new bdev APIs which
> > - suports separate metadata, and
> > - converts separate metadata into extended LBA if the backend bdev is
> > formatted with extended LBA.
> > may be nice to have.
> > Separate metadata is easy-to-use for the upper layer that is not aware of PI but
> > NVMe-oF supports only extende LBA and the type of metadata is fixed at
> format.
> The NVMeoF specification says only extended LBA format is supported, so we
> don't need
> to consider separate metadata option.
> 
> I think GPT bdev should be able to examine any NVMe bdev formatted with
> extended LBA easily.
> 
> Besides, application may need separate metadata like Linux kernel.
> 
> 
> For NVMe-oF, Yes, we don't need to consider separate metadata option.
> 
> So priority is not high but if we consider SPDK overall, separate metadata APIs
> will be nice to have.
> 
> 
> Make sense to you?
Thanks Shuhei, make sense to me.  We may consider adding APIs to bdev layer to support separate metadata case.
> 
> 
> Thanks,
> 
> Shuhei
> 
> 
> 
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng
> <changpeng.liu(a)intel.com> の代理で送信
> 送信日時: 2018年9月27日 15:53
> 宛先: Storage Performance Development Kit
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> 
> 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> > MATSUMOTO,SHUUHEI
> > Sent: Thursday, September 27, 2018 2:41 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> > Hi Changpeng,
> >
> >
> > Thanks for your feedback.
> >
> >
> > > There is, of course, two ways to handle per-block metadata. The first is to
> > make
> > > it transparent to the block layer by extending the block size (typically by 8
> > > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > > any changes need to be made to the bdev layer - it should already support
> 520
> > > byte blocks, for example.
> > > Yes, any change is not necessary in the bdev layer but currently 520 byte
> blocks
> > > cannot be visible from nvmf-target to nvmf-initiator.
> > >
> > > What I want to do first is to make SPDK nvmf-target support (virtual) PI and
> its
> > > expected backends will be nvme bdevs first.
> > Hi Shuhei, I understand your point here, the question I have is that: how can
> we
> > map the virtual
> > Namespace bdev with the real backend, especially the real backend can't
> > support extended LBA format ?
> >
> > Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> > format the NVMe backend
> > with TYPE0, SPDK can fill the PI according to the specification.
> >
> > I think on option we can take is an new bdev module that strips metadata from
> > read and inserts metadata to write.
> That's make sense to me.
> > We can set this bdev on top of the bdev which doesn't support extended LBA.
> >
> > Besides, new bdev APIs which
> > - suports separate metadata, and
> > - converts separate metadata into extended LBA if the backend bdev is
> > formatted with extended LBA.
> > may be nice to have.
> > Separate metadata is easy-to-use for the upper layer that is not aware of PI but
> > NVMe-oF supports only extende LBA and the type of metadata is fixed at
> format.
> The NVMeoF specification says only extended LBA format is supported, so we
> don't need
> to consider separate metadata option.
> >
> > I hope these are align with Ben's suggestive feedback and your thought.
> > Thanks,
> > Shuhei
> >
> > ________________________________
> > 差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng
> > <changpeng.liu(a)intel.com> の代理で送信
> > 送信日時: 2018年9月27日 14:21
> > 宛先: Storage Performance Development Kit
> > 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> > > MATSUMOTO,SHUUHEI
> > > Sent: Wednesday, September 26, 2018 11:13 AM
> > > To: spdk(a)lists.01.org
> > > Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in
> bdevs.
> > >
> > > Hi Ben and All,
> > >
> > >
> > > I updated slightly. Will you read this mail and discard previous one?
> > >
> > >
> > >
> > > There is, of course, two ways to handle per-block metadata. The first is to
> > make
> > > it transparent to the block layer by extending the block size (typically by 8
> > > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > > any changes need to be made to the bdev layer - it should already support
> 520
> > > byte blocks, for example.
> > > Yes, any change is not necessary in the bdev layer but currently 520 byte
> blocks
> > > cannot be visible from nvmf-target to nvmf-initiator.
> > >
> > > What I want to do first is to make SPDK nvmf-target support (virtual) PI and
> its
> > > expected backends will be nvme bdevs first.
> > Hi Shuhei, I understand your point here, the question I have is that: how can
> we
> > map the virtual
> > Namespace bdev with the real backend, especially the real backend can't
> > support extended LBA format ?
> >
> > Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> > format the NVMe backend
> > with TYPE0, SPDK can fill the PI according to the specification.
> > > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends
> next.
> > > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> > that
> > > it exists and its size), and therefore can't strip or insert it. The bdev module
> > > implementing T10 PI should strip and insert automatically.
> > > OK, by this bdevs will be able to inform existence and size of metadata but
> > > additional information will be necessary to create virtual namespaces in
> which
> > PI
> > > is enabled.
> > > Do you think adding optional parameters related with PI to
> > > nvmf_subsystem_add_ns RPC is reasonable?
> > >
> > > Or bdev module will provide detailed PI information at virtual namespace
> > > creation?
> > >
> > > The second case is as you said, where the metadata is stored in a separate
> > place
> > > in memory. We support this in the NVMe driver via special command
> > submission
> > > functions that additionally take a pointer to the associated metadata. I think
> > > support for this second case, case #2, is what you are interested in
> > > implementing.
> > > nvmf-target supports only interleaved metadata and don't support separate
> > > metadata buffer. Which option sounds reasonable for you?
> > >
> > > Option 1:
> > > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > > current bdev APIs.
> > > This will be able to keep zero-copy and be good for performance.
> > >
> > > Option 2:
> > > nvmf-target separates interleaved metadata into two buffers and then
> > transfers
> > > two buffers to the backend bdev by new bdev APIs.
> > >
> > >
> > > >
> > > > ELBA awareness means that the upper layer must extract data from
> > interleaved
> > > > extended blocks.
> > > >
> > > > Q3. To support any application which requires separate metadata,
> providing
> > > the
> > > > following APIs is reasonable?
> > > > - add md_buf and md_len to parameters
> > > > - add the suffix "_md" to the function name.
> > >
> > > Sounds good.
> > > OK, I will do.
> > > But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> > > command to applications.
> > > nvme bdev will be able to hold pichk internally but nvme bdev will have to
> > > provide app_tag/mask and ref_tag in bdev APIs from the end-to-end
> > perspective.
> > > Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
> > >
> > >
> > > There is also the question of what is actually stored in the metadata. T10 PI is
> > > the obvious choice, but really anything can be put there. Below you mention
> > > writing a bdev module that transparently handles T10 PI and I think that's the
> > > right design choice. Basic metadata support can be in the bdev layer, but the
> > > T10 PI code can be in its own bdev module.
> > > I got it.
> > > I haven't thought through all of the ramifications here, but I think it might
> > > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > > except it to return the total length of data + metadata (the user may not even
> > > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > > return just the length of the data.
> > > I got it.
> > >
> > >
> > > Thanks,
> > > Shuhei
> > >
> > >
> > > ________________________________
> > > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > > 送信日時: 2018年9月26日 8:24
> > > 宛先: spdk(a)lists.01.org
> > > 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> > >
> > >
> > > Hi Ben,
> > >
> > >
> > > Thanks for your feedback. Your feedback made my head cleaner.
> > >
> > > I have additional few questions. Will you provide your feedback again?
> > >
> > >
> > >
> > > There is, of course, two ways to handle per-block metadata. The first is to
> > make
> > > it transparent to the block layer by extending the block size (typically by 8
> > > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > > any changes need to be made to the bdev layer - it should already support
> 520
> > > byte blocks, for example.
> > > Yes, any change is not necessary in the bdev layer but currently 520 byte
> blocks
> > > cannot be visible from nvmf-target to nvmf-initiator.
> > >
> > > What I want to do first is to make SPDK nvmf-target support (virtual) PI and
> its
> > > expected backends will be nvme bdevs first.
> > > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends
> next.
> > > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> > that
> > > it exists and its size), and therefore can't strip or insert it. The bdev module
> > > implementing T10 PI should strip and insert automatically.
> > > OK, by this bdevs will be able to inform existence and size of metadata but
> > > additional information will be necessary to create virtual namespaces with PI.
> > > Do you think adding optional parameters related with PI to
> > > nvmf_subsystem_add_ns RPC is reasonable?
> > > Or bdev module will have detailed PI information and provide them at virtual
> > > namespace creation?
> > >
> > > The second case is as you said, where the metadata is stored in a separate
> > place
> > > in memory. We support this in the NVMe driver via special command
> > submission
> > > functions that additionally take a pointer to the associated metadata. I think
> > > support for this second case, case #2, is what you are interested in
> > > implementing.
> > > nvmf-target supports only interleaved metadata and don't support separate
> > > metadata buffer. Which option sounds reasonable for you?
> > >
> > > Option 1:
> > > nvmf-target separates interleaved metadata into two buffers and then
> > transfers
> > > two buffers to the backend bdev by new bdev APIs.
> > >
> > > Option 2:
> > > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > > current bdev APIs.
> > > But separate metadata may be better for crypto bdev and raid bdev.
> > > Hence for this option I may have to provide a special bdev module to convert
> > > between interleaved metadata and separate metadata.
> > >
> > > There is also the question of what is actually stored in the metadata. T10 PI is
> > > the obvious choice, but really anything can be put there. Below you mention
> > > writing a bdev module that transparently handles T10 PI and I think that's the
> > > right design choice. Basic metadata support can be in the bdev layer, but the
> > > T10 PI code can be in its own bdev module.
> > > I got it.
> > > I haven't thought through all of the ramifications here, but I think it might
> > > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > > except it to return the total length of data + metadata (the user may not even
> > > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > > return just the length of the data.
> > > I got it.
> > >
> > > >
> > > > ELBA awareness means that the upper layer must extract data from
> > interleaved
> > > > extended blocks.
> > > >
> > > > Q3. To support any application which requires separate metadata,
> providing
> > > the
> > > > following APIs is reasonable?
> > > > - add md_buf and md_len to parameters
> > > > - add the suffix "_md" to the function name.
> > >
> > > Sounds good.
> > > OK, I will do.
> > >
> > > Thanks,
> > > Shuhei
> > >
> > >
> > > ________________________________
> > > 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> > > <benjamin.walker(a)intel.com> の代理で送信
> > > 送信日時: 2018年9月26日 5:13
> > > 宛先: spdk(a)lists.01.org
> > > 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in
> bdevs.
> > >
> > > On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI
> > wrote:
> > > > Sorry for resend, I revised a little.
> > > >
> > > >
> > > > Hi All,
> > > >
> > > > I've worked on extended LBA in bdevs first.
> > > > I will do T10 PI on top of the extended LBA next.
> > > > I expect some applications or users will need separate metadata and will do
> > > > this too.
> > >
> > > There is, of course, two ways to handle per-block metadata. The first is to
> > make
> > > it transparent to the block layer by extending the block size (typically by 8
> > > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > > any changes need to be made to the bdev layer - it should already support
> 520
> > > byte blocks, for example.
> > >
> > > The second case is as you said, where the metadata is stored in a separate
> > place
> > > in memory. We support this in the NVMe driver via special command
> > submission
> > > functions that additionally take a pointer to the associated metadata. I think
> > > support for this second case, case #2, is what you are interested in
> > > implementing.
> > >
> > > There is also the question of what is actually stored in the metadata. T10 PI is
> > > the obvious choice, but really anything can be put there. Below you mention
> > > writing a bdev module that transparently handles T10 PI and I think that's the
> > > right design choice. Basic metadata support can be in the bdev layer, but the
> > > T10 PI code can be in its own bdev module.
> > >
> > > >
> > > > About extended LBA in bdevs, I would like to hear any feedback before
> > > > submitting patches.
> > > > Any feedback is very very appreciated.
> > > >
> > > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > > >   Option 1: length of extended block (data + metadata)
> > > >   Option 2: length of only data block. user will get length of metadata by
> > > > spdk_bdev_get_md_size(bdev)
> > > >   Or any other idea?
> > >
> > > I haven't thought through all of the ramifications here, but I think it might
> > > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > > except it to return the total length of data + metadata (the user may not even
> > > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > > return just the length of the data.
> > >
> > > >
> > > > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > > > metadata now even if metadata is enabled.
> > > > Keeping current implementation, Option1, sounds reasonable for me.
> > > >
> > > > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev)
> > anyway.
> > > >
> > > > Q2. Which behavior should bdev IO APIs have by default?
> > > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > > extended
> > > > LBA by default)
> > > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> > the
> > > > upper layer by default)
> > > >  Or any other idea?
> > > >
> > > > READ_STRIP reads data and metadata from the target, discards metadata,
> > and
> > > > transfers only data to the upper layer.
> > > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > > writes both data and metadata to the target.
> > > > READ_PASS reads data and metadata from the target and transfers both
> data
> > > and
> > > > metadata to the upper layer.
> > > > WRITE_PASS transfers data and metadata from the upper layer and writes
> > both
> > > > data and metadata to the target.
> > > >
> > > > Option 1 looks reasonable to me. I will be able to provide an new bdev
> > module
> > > > to use extended LBA bdevs transparently.
> > > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > > If we take Option 2, we will have to provide the set of ELBA aware APIs.
> > >
> > > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> > that
> > > it exists and its size), and therefore can't strip or insert it. The bdev module
> > > implementing T10 PI should strip and insert automatically.
> > >
> > > >
> > > > ELBA awareness means that the upper layer must extract data from
> > interleaved
> > > > extended blocks.
> > > >
> > > > Q3. To support any application which requires separate metadata,
> providing
> > > the
> > > > following APIs is reasonable?
> > > > - add md_buf and md_len to parameters
> > > > - add the suffix "_md" to the function name.
> > >
> > > Sounds good.
> > >
> > > >
> > > > About T10 PI, I will send questions as separate mail later.
> > > >
> > > > Bdev provides the following APIs:
> > > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> > cb_arg)
> > > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > > cb_arg)
> > > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > > md_buf,
> > > > md_len, cb, cb_arg)
> > > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > > cb_arg)
> > > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks,
> > > > cb, cb_arg)
> > > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > > cb_arg)
> > > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > > num_blocks, cb, cb_arg)
> > > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks,
> cb,
> > > > cb_arg)
> > > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > >
> > > > Thanks,
> > > > Shuhei
> > > >
> > > > ________________________________
> > > > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > > > 送信日時: 2018年9月25日 9:14
> > > > 宛先: spdk(a)lists.01.org
> > > > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> > > >
> > > >
> > > > Hi All,
> > > >
> > > > I've worked on extended LBA in bdevs first.
> > > > I will do T10 PI on top of the extended LBA next.
> > > > I expect some applications or users will need separate metadata and will do
> > > > this too.
> > > >
> > > > About extended LBA in bdevs, I would like to hear any feedback before
> > > > submitting patches.
> > > > Any feedback is very very appreciated.
> > > >
> > > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > > >   Option 1: length of extended block (data + metadata)
> > > >   Option 2: length of only data block. user will get length of metadata by
> > > > spdk_bdev_get_md_size(bdev)
> > > >   Or any other idea?
> > > >
> > > > Current implementation is A1 but NVMe-oF target cuts off the size of
> > metadata
> > > > now even if metadata is enabled.
> > > > Keeping current implementation, Option1, sounds reasonable for me.
> > > > Even if we take Option 1
> > > >
> > > > Q2. Which behavior should bdev IO APIs have by default?
> > > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > > extended
> > > > LBA by default)
> > > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> > the
> > > > upper layer by default)
> > > >  Or any other idea?
> > > >
> > > > READ_STRIP reads data and metadata from the target, discards metadata,
> > and
> > > > transfers only data to the upper layer.
> > > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > > writes both data and metadata to the target.
> > > > READ_PASS reads data and metadata from the target and transfers both
> data
> > > and
> > > > metadata to the upper layer.
> > > > WRITE_PASS transfers data and metadata from the upper layer and writes
> > both
> > > > data and metadata to the target.
> > > > A1 looks reasonable to me. I will be able to provide an new bdev module to
> > > use
> > > > extended LBA bdevs transparently.
> > > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > > If we take A2, we will have to provide the set of ELBA aware APIs.
> > > >
> > > > Q3. To support any application which requires separate metadata,
> providing
> > > the
> > > > following APIs is reasonable?
> > > > - add md_buf and md_len to parameters
> > > > - add the suffix "_md" to the function name.
> > > >
> > > > About T10 PI, I will send questions as separate mail later.
> > > >
> > > > Bdev provides the following APIs:
> > > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> > cb_arg)
> > > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > > cb_arg)
> > > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > > md_buf,
> > > > md_len, cb, cb_arg)
> > > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > > cb_arg)
> > > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks,
> > > > cb, cb_arg)
> > > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > > cb_arg)
> > > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > > num_blocks, cb, cb_arg)
> > > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks,
> cb,
> > > > cb_arg)
> > > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > >
> > > > Thanks,
> > > > Shuhei
> > > >
> > > > _______________________________________________
> > > > SPDK mailing list
> > > > SPDK(a)lists.01.org
> > > > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> > > xtu4EvOXanN4spdgP_iGZmwJk-
> > >
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> > >
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > > DAMUHjlWNsxNQDsRJWSREbYx-
> > > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> > >
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > > F-
> > >
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > > noOomGZDSIU2GAsK9V-
> > >
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> > >
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> > xtu4EvOXanN4spdgP_iGZmwJk-
> > >
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> > >
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > > DAMUHjlWNsxNQDsRJWSREbYx-
> > > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> > >
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > > F-
> > >
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > > noOomGZDSIU2GAsK9V-
> > >
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > >
> >
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> > DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> > 6UHJ0qLI-
> >
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> > MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> > 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> > z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> >
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> > EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> > XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> >
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> > mailman%2Flistinfo%2Fspdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> >
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> > DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> > 6UHJ0qLI-
> >
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> > MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> > 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> > z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> >
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> > EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> > XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> >
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> > mailman%2Flistinfo%2Fspdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/SgW8Kk_DH8-
> KnX4rCqRk8n00vOZuzsp2tH63Mz4EStE=?d=aYNksLrj2T-
> slulsyrbDV9zvEY9AUMYdlw_o2kRkY2rPS6-_KksL8tGr8YQ1JV6IfR1DB9G29xS9fs-
> y4vu7e3IrTMV8JVUHWq_hSBFHB9pxD0Khwhe-
> RBUxC_NhsXf9izvkqZE0Bh1d8nyQeIf1uZh-NKqW3nhMd2Tp-R1LKVXA7-
> VEl1Jye3eDhyj9lGlZBQa1DXej5Vz14nm0wk4C9v7lZ_32UK8sRyzlLGYQ86k7_b1KIir
> WWr3vd3nxYJf1rEzNUHsPlTczvGnfx_Lme2MrSxjeLpA7qSRJRmsD-Svx-
> ztZlxf02hMswE8d3wL_kFHfVqABEt1iFVUvLzqAs00S8YDhEBqqe2UyB4Shh4Kc7Pgx
> lij_6VsXm7S4MTRCQXY9a0P-9JekrBJ_1twEOV0FeUzThQ_J-
> OTsjanx5yixLu2784KERwKKOsKB77dtZT72nFkWTUg%3D&u=https%3A%2F%2Flist
> s.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/SgW8Kk_DH8-
> KnX4rCqRk8n00vOZuzsp2tH63Mz4EStE=?d=aYNksLrj2T-
> slulsyrbDV9zvEY9AUMYdlw_o2kRkY2rPS6-_KksL8tGr8YQ1JV6IfR1DB9G29xS9fs-
> y4vu7e3IrTMV8JVUHWq_hSBFHB9pxD0Khwhe-
> RBUxC_NhsXf9izvkqZE0Bh1d8nyQeIf1uZh-NKqW3nhMd2Tp-R1LKVXA7-
> VEl1Jye3eDhyj9lGlZBQa1DXej5Vz14nm0wk4C9v7lZ_32UK8sRyzlLGYQ86k7_b1KIir
> WWr3vd3nxYJf1rEzNUHsPlTczvGnfx_Lme2MrSxjeLpA7qSRJRmsD-Svx-
> ztZlxf02hMswE8d3wL_kFHfVqABEt1iFVUvLzqAs00S8YDhEBqqe2UyB4Shh4Kc7Pgx
> lij_6VsXm7S4MTRCQXY9a0P-9JekrBJ_1twEOV0FeUzThQ_J-
> OTsjanx5yixLu2784KERwKKOsKB77dtZT72nFkWTUg%3D&u=https%3A%2F%2Flist
> s.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-27  7:08 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-27  7:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 27793 bytes --]

Hi Changpeng,


> Besides, new bdev APIs which
> - suports separate metadata, and
> - converts separate metadata into extended LBA if the backend bdev is
> formatted with extended LBA.
> may be nice to have.
> Separate metadata is easy-to-use for the upper layer that is not aware of PI but
> NVMe-oF supports only extende LBA and the type of metadata is fixed at format.
The NVMeoF specification says only extended LBA format is supported, so we don't need
to consider separate metadata option.

I think GPT bdev should be able to examine any NVMe bdev formatted with extended LBA easily.

Besides, application may need separate metadata like Linux kernel.


For NVMe-oF, Yes, we don't need to consider separate metadata option.

So priority is not high but if we consider SPDK overall, separate metadata APIs will be nice to have.


Make sense to you?


Thanks,

Shuhei



________________________________
差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng <changpeng.liu(a)intel.com> の代理で送信
送信日時: 2018年9月27日 15:53
宛先: Storage Performance Development Kit
件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Thursday, September 27, 2018 2:41 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
> Hi Changpeng,
>
>
> Thanks for your feedback.
>
>
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> Hi Shuhei, I understand your point here, the question I have is that: how can we
> map the virtual
> Namespace bdev with the real backend, especially the real backend can't
> support extended LBA format ?
>
> Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> format the NVMe backend
> with TYPE0, SPDK can fill the PI according to the specification.
>
> I think on option we can take is an new bdev module that strips metadata from
> read and inserts metadata to write.
That's make sense to me.
> We can set this bdev on top of the bdev which doesn't support extended LBA.
>
> Besides, new bdev APIs which
> - suports separate metadata, and
> - converts separate metadata into extended LBA if the backend bdev is
> formatted with extended LBA.
> may be nice to have.
> Separate metadata is easy-to-use for the upper layer that is not aware of PI but
> NVMe-oF supports only extende LBA and the type of metadata is fixed at format.
The NVMeoF specification says only extended LBA format is supported, so we don't need
to consider separate metadata option.
>
> I hope these are align with Ben's suggestive feedback and your thought.
> Thanks,
> Shuhei
>
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng
> <changpeng.liu(a)intel.com> の代理で送信
> 送信日時: 2018年9月27日 14:21
> 宛先: Storage Performance Development Kit
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
>
>
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> > MATSUMOTO,SHUUHEI
> > Sent: Wednesday, September 26, 2018 11:13 AM
> > To: spdk(a)lists.01.org
> > Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> > Hi Ben and All,
> >
> >
> > I updated slightly. Will you read this mail and discard previous one?
> >
> >
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> Hi Shuhei, I understand your point here, the question I have is that: how can we
> map the virtual
> Namespace bdev with the real backend, especially the real backend can't
> support extended LBA format ?
>
> Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> format the NVMe backend
> with TYPE0, SPDK can fill the PI according to the specification.
> > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> > OK, by this bdevs will be able to inform existence and size of metadata but
> > additional information will be necessary to create virtual namespaces in which
> PI
> > is enabled.
> > Do you think adding optional parameters related with PI to
> > nvmf_subsystem_add_ns RPC is reasonable?
> >
> > Or bdev module will provide detailed PI information at virtual namespace
> > creation?
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> > nvmf-target supports only interleaved metadata and don't support separate
> > metadata buffer. Which option sounds reasonable for you?
> >
> > Option 1:
> > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > current bdev APIs.
> > This will be able to keep zero-copy and be good for performance.
> >
> > Option 2:
> > nvmf-target separates interleaved metadata into two buffers and then
> transfers
> > two buffers to the backend bdev by new bdev APIs.
> >
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> > OK, I will do.
> > But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> > command to applications.
> > nvme bdev will be able to hold pichk internally but nvme bdev will have to
> > provide app_tag/mask and ref_tag in bdev APIs from the end-to-end
> perspective.
> > Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
> >
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> > I got it.
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> > I got it.
> >
> >
> > Thanks,
> > Shuhei
> >
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月26日 8:24
> > 宛先: spdk(a)lists.01.org
> > 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi Ben,
> >
> >
> > Thanks for your feedback. Your feedback made my head cleaner.
> >
> > I have additional few questions. Will you provide your feedback again?
> >
> >
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> > OK, by this bdevs will be able to inform existence and size of metadata but
> > additional information will be necessary to create virtual namespaces with PI.
> > Do you think adding optional parameters related with PI to
> > nvmf_subsystem_add_ns RPC is reasonable?
> > Or bdev module will have detailed PI information and provide them at virtual
> > namespace creation?
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> > nvmf-target supports only interleaved metadata and don't support separate
> > metadata buffer. Which option sounds reasonable for you?
> >
> > Option 1:
> > nvmf-target separates interleaved metadata into two buffers and then
> transfers
> > two buffers to the backend bdev by new bdev APIs.
> >
> > Option 2:
> > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > current bdev APIs.
> > But separate metadata may be better for crypto bdev and raid bdev.
> > Hence for this option I may have to provide a special bdev module to convert
> > between interleaved metadata and separate metadata.
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> > I got it.
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> > I got it.
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> > OK, I will do.
> >
> > Thanks,
> > Shuhei
> >
> >
> > ________________________________
> > 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> > <benjamin.walker(a)intel.com> の代理で送信
> > 送信日時: 2018年9月26日 5:13
> > 宛先: spdk(a)lists.01.org
> > 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> > On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI
> wrote:
> > > Sorry for resend, I revised a little.
> > >
> > >
> > > Hi All,
> > >
> > > I've worked on extended LBA in bdevs first.
> > > I will do T10 PI on top of the extended LBA next.
> > > I expect some applications or users will need separate metadata and will do
> > > this too.
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> >
> > >
> > > About extended LBA in bdevs, I would like to hear any feedback before
> > > submitting patches.
> > > Any feedback is very very appreciated.
> > >
> > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > >   Option 1: length of extended block (data + metadata)
> > >   Option 2: length of only data block. user will get length of metadata by
> > > spdk_bdev_get_md_size(bdev)
> > >   Or any other idea?
> >
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> >
> > >
> > > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > > metadata now even if metadata is enabled.
> > > Keeping current implementation, Option1, sounds reasonable for me.
> > >
> > > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev)
> anyway.
> > >
> > > Q2. Which behavior should bdev IO APIs have by default?
> > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > extended
> > > LBA by default)
> > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > > upper layer by default)
> > >  Or any other idea?
> > >
> > > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > > transfers only data to the upper layer.
> > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > writes both data and metadata to the target.
> > > READ_PASS reads data and metadata from the target and transfers both data
> > and
> > > metadata to the upper layer.
> > > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > > data and metadata to the target.
> > >
> > > Option 1 looks reasonable to me. I will be able to provide an new bdev
> module
> > > to use extended LBA bdevs transparently.
> > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > If we take Option 2, we will have to provide the set of ELBA aware APIs.
> >
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> >
> > >
> > > About T10 PI, I will send questions as separate mail later.
> > >
> > > Bdev provides the following APIs:
> > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > cb_arg)
> > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > md_buf,
> > > md_len, cb, cb_arg)
> > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > > cb, cb_arg)
> > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > num_blocks, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > >
> > > Thanks,
> > > Shuhei
> > >
> > > ________________________________
> > > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > > 送信日時: 2018年9月25日 9:14
> > > 宛先: spdk(a)lists.01.org
> > > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> > >
> > >
> > > Hi All,
> > >
> > > I've worked on extended LBA in bdevs first.
> > > I will do T10 PI on top of the extended LBA next.
> > > I expect some applications or users will need separate metadata and will do
> > > this too.
> > >
> > > About extended LBA in bdevs, I would like to hear any feedback before
> > > submitting patches.
> > > Any feedback is very very appreciated.
> > >
> > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > >   Option 1: length of extended block (data + metadata)
> > >   Option 2: length of only data block. user will get length of metadata by
> > > spdk_bdev_get_md_size(bdev)
> > >   Or any other idea?
> > >
> > > Current implementation is A1 but NVMe-oF target cuts off the size of
> metadata
> > > now even if metadata is enabled.
> > > Keeping current implementation, Option1, sounds reasonable for me.
> > > Even if we take Option 1
> > >
> > > Q2. Which behavior should bdev IO APIs have by default?
> > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > extended
> > > LBA by default)
> > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > > upper layer by default)
> > >  Or any other idea?
> > >
> > > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > > transfers only data to the upper layer.
> > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > writes both data and metadata to the target.
> > > READ_PASS reads data and metadata from the target and transfers both data
> > and
> > > metadata to the upper layer.
> > > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > > data and metadata to the target.
> > > A1 looks reasonable to me. I will be able to provide an new bdev module to
> > use
> > > extended LBA bdevs transparently.
> > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > If we take A2, we will have to provide the set of ELBA aware APIs.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> > >
> > > About T10 PI, I will send questions as separate mail later.
> > >
> > > Bdev provides the following APIs:
> > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > cb_arg)
> > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > md_buf,
> > > md_len, cb, cb_arg)
> > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > > cb, cb_arg)
> > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > num_blocks, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > >
> > > Thanks,
> > > Shuhei
> > >
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> > xtu4EvOXanN4spdgP_iGZmwJk-
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > DAMUHjlWNsxNQDsRJWSREbYx-
> > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > F-
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > noOomGZDSIU2GAsK9V-
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> xtu4EvOXanN4spdgP_iGZmwJk-
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > DAMUHjlWNsxNQDsRJWSREbYx-
> > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > F-
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > noOomGZDSIU2GAsK9V-
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> >
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> 6UHJ0qLI-
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> mailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> 6UHJ0qLI-
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> mailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/SgW8Kk_DH8-KnX4rCqRk8n00vOZuzsp2tH63Mz4EStE=?d=aYNksLrj2T-slulsyrbDV9zvEY9AUMYdlw_o2kRkY2rPS6-_KksL8tGr8YQ1JV6IfR1DB9G29xS9fs-y4vu7e3IrTMV8JVUHWq_hSBFHB9pxD0Khwhe-RBUxC_NhsXf9izvkqZE0Bh1d8nyQeIf1uZh-NKqW3nhMd2Tp-R1LKVXA7-VEl1Jye3eDhyj9lGlZBQa1DXej5Vz14nm0wk4C9v7lZ_32UK8sRyzlLGYQ86k7_b1KIirWWr3vd3nxYJf1rEzNUHsPlTczvGnfx_Lme2MrSxjeLpA7qSRJRmsD-Svx-ztZlxf02hMswE8d3wL_kFHfVqABEt1iFVUvLzqAs00S8YDhEBqqe2UyB4Shh4Kc7Pgxlij_6VsXm7S4MTRCQXY9a0P-9JekrBJ_1twEOV0FeUzThQ_J-OTsjanx5yixLu2784KERwKKOsKB77dtZT72nFkWTUg%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://clicktime.symantec.com/a/1/SgW8Kk_DH8-KnX4rCqRk8n00vOZuzsp2tH63Mz4EStE=?d=aYNksLrj2T-slulsyrbDV9zvEY9AUMYdlw_o2kRkY2rPS6-_KksL8tGr8YQ1JV6IfR1DB9G29xS9fs-y4vu7e3IrTMV8JVUHWq_hSBFHB9pxD0Khwhe-RBUxC_NhsXf9izvkqZE0Bh1d8nyQeIf1uZh-NKqW3nhMd2Tp-R1LKVXA7-VEl1Jye3eDhyj9lGlZBQa1DXej5Vz14nm0wk4C9v7lZ_32UK8sRyzlLGYQ86k7_b1KIirWWr3vd3nxYJf1rEzNUHsPlTczvGnfx_Lme2MrSxjeLpA7qSRJRmsD-Svx-ztZlxf02hMswE8d3wL_kFHfVqABEt1iFVUvLzqAs00S8YDhEBqqe2UyB4Shh4Kc7Pgxlij_6VsXm7S4MTRCQXY9a0P-9JekrBJ_1twEOV0FeUzThQ_J-OTsjanx5yixLu2784KERwKKOsKB77dtZT72nFkWTUg%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-27  6:53 Liu, Changpeng
  0 siblings, 0 replies; 11+ messages in thread
From: Liu, Changpeng @ 2018-09-27  6:53 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 25335 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Thursday, September 27, 2018 2:41 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> Hi Changpeng,
> 
> 
> Thanks for your feedback.
> 
> 
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> Hi Shuhei, I understand your point here, the question I have is that: how can we
> map the virtual
> Namespace bdev with the real backend, especially the real backend can't
> support extended LBA format ?
> 
> Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> format the NVMe backend
> with TYPE0, SPDK can fill the PI according to the specification.
> 
> I think on option we can take is an new bdev module that strips metadata from
> read and inserts metadata to write.
That's make sense to me.
> We can set this bdev on top of the bdev which doesn't support extended LBA.
> 
> Besides, new bdev APIs which
> - suports separate metadata, and
> - converts separate metadata into extended LBA if the backend bdev is
> formatted with extended LBA.
> may be nice to have.
> Separate metadata is easy-to-use for the upper layer that is not aware of PI but
> NVMe-oF supports only extende LBA and the type of metadata is fixed at format.
The NVMeoF specification says only extended LBA format is supported, so we don't need
to consider separate metadata option.
> 
> I hope these are align with Ben's suggestive feedback and your thought.
> Thanks,
> Shuhei
> 
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng
> <changpeng.liu(a)intel.com> の代理で送信
> 送信日時: 2018年9月27日 14:21
> 宛先: Storage Performance Development Kit
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> 
> 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> > MATSUMOTO,SHUUHEI
> > Sent: Wednesday, September 26, 2018 11:13 AM
> > To: spdk(a)lists.01.org
> > Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> > Hi Ben and All,
> >
> >
> > I updated slightly. Will you read this mail and discard previous one?
> >
> >
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> Hi Shuhei, I understand your point here, the question I have is that: how can we
> map the virtual
> Namespace bdev with the real backend, especially the real backend can't
> support extended LBA format ?
> 
> Or NVMeoF virtual Namespace can support different types of PI, and SPDK
> format the NVMe backend
> with TYPE0, SPDK can fill the PI according to the specification.
> > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> > OK, by this bdevs will be able to inform existence and size of metadata but
> > additional information will be necessary to create virtual namespaces in which
> PI
> > is enabled.
> > Do you think adding optional parameters related with PI to
> > nvmf_subsystem_add_ns RPC is reasonable?
> >
> > Or bdev module will provide detailed PI information at virtual namespace
> > creation?
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> > nvmf-target supports only interleaved metadata and don't support separate
> > metadata buffer. Which option sounds reasonable for you?
> >
> > Option 1:
> > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > current bdev APIs.
> > This will be able to keep zero-copy and be good for performance.
> >
> > Option 2:
> > nvmf-target separates interleaved metadata into two buffers and then
> transfers
> > two buffers to the backend bdev by new bdev APIs.
> >
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> > OK, I will do.
> > But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> > command to applications.
> > nvme bdev will be able to hold pichk internally but nvme bdev will have to
> > provide app_tag/mask and ref_tag in bdev APIs from the end-to-end
> perspective.
> > Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
> >
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> > I got it.
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> > I got it.
> >
> >
> > Thanks,
> > Shuhei
> >
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月26日 8:24
> > 宛先: spdk(a)lists.01.org
> > 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi Ben,
> >
> >
> > Thanks for your feedback. Your feedback made my head cleaner.
> >
> > I have additional few questions. Will you provide your feedback again?
> >
> >
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> > Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> > cannot be visible from nvmf-target to nvmf-initiator.
> >
> > What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> > expected backends will be nvme bdevs first.
> > Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> > OK, by this bdevs will be able to inform existence and size of metadata but
> > additional information will be necessary to create virtual namespaces with PI.
> > Do you think adding optional parameters related with PI to
> > nvmf_subsystem_add_ns RPC is reasonable?
> > Or bdev module will have detailed PI information and provide them at virtual
> > namespace creation?
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> > nvmf-target supports only interleaved metadata and don't support separate
> > metadata buffer. Which option sounds reasonable for you?
> >
> > Option 1:
> > nvmf-target separates interleaved metadata into two buffers and then
> transfers
> > two buffers to the backend bdev by new bdev APIs.
> >
> > Option 2:
> > nvmf-target transfers interleaved metadata directly to the backend bdev by
> > current bdev APIs.
> > But separate metadata may be better for crypto bdev and raid bdev.
> > Hence for this option I may have to provide a special bdev module to convert
> > between interleaved metadata and separate metadata.
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> > I got it.
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> > I got it.
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> > OK, I will do.
> >
> > Thanks,
> > Shuhei
> >
> >
> > ________________________________
> > 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> > <benjamin.walker(a)intel.com> の代理で送信
> > 送信日時: 2018年9月26日 5:13
> > 宛先: spdk(a)lists.01.org
> > 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> >
> > On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI
> wrote:
> > > Sorry for resend, I revised a little.
> > >
> > >
> > > Hi All,
> > >
> > > I've worked on extended LBA in bdevs first.
> > > I will do T10 PI on top of the extended LBA next.
> > > I expect some applications or users will need separate metadata and will do
> > > this too.
> >
> > There is, of course, two ways to handle per-block metadata. The first is to
> make
> > it transparent to the block layer by extending the block size (typically by 8
> > bytes). For this case, which I'll call case #1, I don't necessarily think that
> > any changes need to be made to the bdev layer - it should already support 520
> > byte blocks, for example.
> >
> > The second case is as you said, where the metadata is stored in a separate
> place
> > in memory. We support this in the NVMe driver via special command
> submission
> > functions that additionally take a pointer to the associated metadata. I think
> > support for this second case, case #2, is what you are interested in
> > implementing.
> >
> > There is also the question of what is actually stored in the metadata. T10 PI is
> > the obvious choice, but really anything can be put there. Below you mention
> > writing a bdev module that transparently handles T10 PI and I think that's the
> > right design choice. Basic metadata support can be in the bdev layer, but the
> > T10 PI code can be in its own bdev module.
> >
> > >
> > > About extended LBA in bdevs, I would like to hear any feedback before
> > > submitting patches.
> > > Any feedback is very very appreciated.
> > >
> > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > >   Option 1: length of extended block (data + metadata)
> > >   Option 2: length of only data block. user will get length of metadata by
> > > spdk_bdev_get_md_size(bdev)
> > >   Or any other idea?
> >
> > I haven't thought through all of the ramifications here, but I think it might
> > depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> > except it to return the total length of data + metadata (the user may not even
> > be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> > return just the length of the data.
> >
> > >
> > > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > > metadata now even if metadata is enabled.
> > > Keeping current implementation, Option1, sounds reasonable for me.
> > >
> > > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev)
> anyway.
> > >
> > > Q2. Which behavior should bdev IO APIs have by default?
> > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > extended
> > > LBA by default)
> > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > > upper layer by default)
> > >  Or any other idea?
> > >
> > > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > > transfers only data to the upper layer.
> > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > writes both data and metadata to the target.
> > > READ_PASS reads data and metadata from the target and transfers both data
> > and
> > > metadata to the upper layer.
> > > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > > data and metadata to the target.
> > >
> > > Option 1 looks reasonable to me. I will be able to provide an new bdev
> module
> > > to use extended LBA bdevs transparently.
> > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > If we take Option 2, we will have to provide the set of ELBA aware APIs.
> >
> > The bdev APIs shouldn't understand the meaning of the metadata at all (just
> that
> > it exists and its size), and therefore can't strip or insert it. The bdev module
> > implementing T10 PI should strip and insert automatically.
> >
> > >
> > > ELBA awareness means that the upper layer must extract data from
> interleaved
> > > extended blocks.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> >
> > Sounds good.
> >
> > >
> > > About T10 PI, I will send questions as separate mail later.
> > >
> > > Bdev provides the following APIs:
> > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > cb_arg)
> > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > md_buf,
> > > md_len, cb, cb_arg)
> > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > > cb, cb_arg)
> > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > num_blocks, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > >
> > > Thanks,
> > > Shuhei
> > >
> > > ________________________________
> > > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > > 送信日時: 2018年9月25日 9:14
> > > 宛先: spdk(a)lists.01.org
> > > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> > >
> > >
> > > Hi All,
> > >
> > > I've worked on extended LBA in bdevs first.
> > > I will do T10 PI on top of the extended LBA next.
> > > I expect some applications or users will need separate metadata and will do
> > > this too.
> > >
> > > About extended LBA in bdevs, I would like to hear any feedback before
> > > submitting patches.
> > > Any feedback is very very appreciated.
> > >
> > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> > >   Option 1: length of extended block (data + metadata)
> > >   Option 2: length of only data block. user will get length of metadata by
> > > spdk_bdev_get_md_size(bdev)
> > >   Or any other idea?
> > >
> > > Current implementation is A1 but NVMe-oF target cuts off the size of
> metadata
> > > now even if metadata is enabled.
> > > Keeping current implementation, Option1, sounds reasonable for me.
> > > Even if we take Option 1
> > >
> > > Q2. Which behavior should bdev IO APIs have by default?
> > >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> > extended
> > > LBA by default)
> > >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > > upper layer by default)
> > >  Or any other idea?
> > >
> > > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > > transfers only data to the upper layer.
> > > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > > writes both data and metadata to the target.
> > > READ_PASS reads data and metadata from the target and transfers both data
> > and
> > > metadata to the upper layer.
> > > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > > data and metadata to the target.
> > > A1 looks reasonable to me. I will be able to provide an new bdev module to
> > use
> > > extended LBA bdevs transparently.
> > > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > > If we take A2, we will have to provide the set of ELBA aware APIs.
> > >
> > > Q3. To support any application which requires separate metadata, providing
> > the
> > > following APIs is reasonable?
> > > - add md_buf and md_len to parameters
> > > - add the suffix "_md" to the function name.
> > >
> > > About T10 PI, I will send questions as separate mail later.
> > >
> > > Bdev provides the following APIs:
> > > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> > cb_arg)
> > > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> > md_buf,
> > > md_len, cb, cb_arg)
> > > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > > cb, cb_arg)
> > > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > > num_blocks, cb, cb_arg)
> > > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > > cb_arg)
> > > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > >
> > > Thanks,
> > > Shuhei
> > >
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> > xtu4EvOXanN4spdgP_iGZmwJk-
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > DAMUHjlWNsxNQDsRJWSREbYx-
> > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > F-
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > noOomGZDSIU2GAsK9V-
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> xtu4EvOXanN4spdgP_iGZmwJk-
> >
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> > A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> >
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> > DAMUHjlWNsxNQDsRJWSREbYx-
> > _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> >
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> > F-
> >
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> > noOomGZDSIU2GAsK9V-
> >
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> > D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> >
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> 6UHJ0qLI-
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> mailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhtt
> DWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-
> 6UHJ0qLI-
> MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3
> MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-
> 0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-
> z8sOCYAV78V8VaeM5IwzajddE0_IfZ-
> DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4
> EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-
> XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-
> iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2F
> mailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-27  6:40 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-27  6:40 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 23458 bytes --]

Hi Changpeng,


Thanks for your feedback.


> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
>
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
Hi Shuhei, I understand your point here, the question I have is that: how can we map the virtual
Namespace bdev with the real backend, especially the real backend can't support extended LBA format ?

Or NVMeoF virtual Namespace can support different types of PI, and SPDK format the NVMe backend
with TYPE0, SPDK can fill the PI according to the specification.

I think on option we can take is an new bdev module that strips metadata from read and inserts metadata to write.
We can set this bdev on top of the bdev which doesn't support extended LBA.

Besides, new bdev APIs which
- suports separate metadata, and
- converts separate metadata into extended LBA if the backend bdev is formatted with extended LBA.
may be nice to have.
Separate metadata is easy-to-use for the upper layer that is not aware of PI but NVMe-oF supports only extende LBA and the type of metadata is fixed at format.

I hope these are align with Ben's suggestive feedback and your thought.
Thanks,
Shuhei

________________________________
差出人: SPDK <spdk-bounces(a)lists.01.org> が Liu, Changpeng <changpeng.liu(a)intel.com> の代理で送信
送信日時: 2018年9月27日 14:21
宛先: Storage Performance Development Kit
件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Wednesday, September 26, 2018 11:13 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
> Hi Ben and All,
>
>
> I updated slightly. Will you read this mail and discard previous one?
>
>
>
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
>
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
Hi Shuhei, I understand your point here, the question I have is that: how can we map the virtual
Namespace bdev with the real backend, especially the real backend can't support extended LBA format ?

Or NVMeoF virtual Namespace can support different types of PI, and SPDK format the NVMe backend
with TYPE0, SPDK can fill the PI according to the specification.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces in which PI
> is enabled.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
>
> Or bdev module will provide detailed PI information at virtual namespace
> creation?
>
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
>
> Option 1:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> This will be able to keep zero-copy and be good for performance.
>
> Option 2:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
>
>
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
> OK, I will do.
> But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> command to applications.
> nvme bdev will be able to hold pichk internally but nvme bdev will have to
> provide app_tag/mask and ref_tag in bdev APIs from the end-to-end perspective.
> Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
>
>
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
>
>
> Thanks,
> Shuhei
>
>
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月26日 8:24
> 宛先: spdk(a)lists.01.org
> 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
>
> Hi Ben,
>
>
> Thanks for your feedback. Your feedback made my head cleaner.
>
> I have additional few questions. Will you provide your feedback again?
>
>
>
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
>
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces with PI.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> Or bdev module will have detailed PI information and provide them at virtual
> namespace creation?
>
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
>
> Option 1:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
>
> Option 2:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> But separate metadata may be better for crypto bdev and raid bdev.
> Hence for this option I may have to provide a special bdev module to convert
> between interleaved metadata and separate metadata.
>
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
>
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
> OK, I will do.
>
> Thanks,
> Shuhei
>
>
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> <benjamin.walker(a)intel.com> の代理で送信
> 送信日時: 2018年9月26日 5:13
> 宛先: spdk(a)lists.01.org
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
> On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > Sorry for resend, I revised a little.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
>
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
>
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
>
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
>
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
>
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
>
> >
> > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > metadata now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> >
> > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> >
> > Option 1 looks reasonable to me. I will be able to provide an new bdev module
> > to use extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take Option 2, we will have to provide the set of ELBA aware APIs.
>
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
>
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
>
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月25日 9:14
> > 宛先: spdk(a)lists.01.org
> > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> >
> > Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> > now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> > Even if we take Option 1
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> > A1 looks reasonable to me. I will be able to provide an new bdev module to
> use
> > extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take A2, we will have to provide the set of ELBA aware APIs.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhttDWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-6UHJ0qLI-MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-z8sOCYAV78V8VaeM5IwzajddE0_IfZ-DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://clicktime.symantec.com/a/1/bAMGNliZrGfYoWyvjY08H_oLf2QzPVPrRhttDWDmUVc=?d=XHAlPzaFDFAVoP-6qW9pNNwubD8aIVzOKfy00R6DW9Jzey6-6UHJ0qLI-MZKALXeLtOlKdk6qFmSyLvNgLtbNYCob29X2spBy3XZi5Pm1Of9h2xAA7aTOWBJ3MuCBwX96kdUYJOXklKnQgL6O53bl38gMcW4R5qUTXeBc8LgApuzwsYDWAqb-0AM4yxUw7SJisdYoiEHUqX8dcKuIVoKad3ZmHbxdC8a_i1-z8sOCYAV78V8VaeM5IwzajddE0_IfZ-DOhXHkbteVfqB_Tu6d2u5g5FQmn4Nf_YhhY6teUJbXdFD5yXCxHfNkpvlJjBuEzEz4EnX8jE1QAXKIvQRxb4as71m4RgoATrUXPeOuU-GXjHdOID-XA2fTjdjbUs3Ddf1WiWBeL5Alf5Ojfj-h8IOHciEs2HGpSx2Uo0chLb3X-iz7DvV8h2zDV2NVnBErhSIRhyoaQ%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-27  5:21 Liu, Changpeng
  0 siblings, 0 replies; 11+ messages in thread
From: Liu, Changpeng @ 2018-09-27  5:21 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 20254 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Wednesday, September 26, 2018 11:13 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> Hi Ben and All,
> 
> 
> I updated slightly. Will you read this mail and discard previous one?
> 
> 
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
> 
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
Hi Shuhei, I understand your point here, the question I have is that: how can we map the virtual
Namespace bdev with the real backend, especially the real backend can't support extended LBA format ?

Or NVMeoF virtual Namespace can support different types of PI, and SPDK format the NVMe backend
with TYPE0, SPDK can fill the PI according to the specification.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces in which PI
> is enabled.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> 
> Or bdev module will provide detailed PI information at virtual namespace
> creation?
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
> 
> Option 1:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> This will be able to keep zero-copy and be good for performance.
> 
> Option 2:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
> 
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> OK, I will do.
> But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> command to applications.
> nvme bdev will be able to hold pichk internally but nvme bdev will have to
> provide app_tag/mask and ref_tag in bdev APIs from the end-to-end perspective.
> Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
> 
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
> 
> 
> Thanks,
> Shuhei
> 
> 
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月26日 8:24
> 宛先: spdk(a)lists.01.org
> 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> 
> Hi Ben,
> 
> 
> Thanks for your feedback. Your feedback made my head cleaner.
> 
> I have additional few questions. Will you provide your feedback again?
> 
> 
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
> 
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces with PI.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> Or bdev module will have detailed PI information and provide them at virtual
> namespace creation?
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
> 
> Option 1:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
> 
> Option 2:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> But separate metadata may be better for crypto bdev and raid bdev.
> Hence for this option I may have to provide a special bdev module to convert
> between interleaved metadata and separate metadata.
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> OK, I will do.
> 
> Thanks,
> Shuhei
> 
> 
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> <benjamin.walker(a)intel.com> の代理で送信
> 送信日時: 2018年9月26日 5:13
> 宛先: spdk(a)lists.01.org
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > Sorry for resend, I revised a little.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> 
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> 
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> 
> >
> > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > metadata now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> >
> > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> >
> > Option 1 looks reasonable to me. I will be able to provide an new bdev module
> > to use extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take Option 2, we will have to provide the set of ELBA aware APIs.
> 
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> 
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月25日 9:14
> > 宛先: spdk(a)lists.01.org
> > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> >
> > Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> > now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> > Even if we take Option 1
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> > A1 looks reasonable to me. I will be able to provide an new bdev module to
> use
> > extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take A2, we will have to provide the set of ELBA aware APIs.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-26 20:26 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-26 20:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 20028 bytes --]

Hi Ben,

I spent a time to read again and found your feedback was very suggestive.
I can forward my investigation and the last mail was not coordinated.
Please stay tuned for now.

Thanks,
Shuhei


On Wed, Sep 26, 2018, 12:13 PM 松本周平 / MATSUMOTO,SHUUHEI <
shuhei.matsumoto.xt(a)hitachi.com> wrote:

> Hi Ben and All,
>
>
> I updated slightly. Will you read this mail and discard previous one?
>
>
>
> There is, of course, two ways to handle per-block metadata. The first is
> to make
> it transparent to the block layer by extending the block size (typically
> by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think
> that
> any changes need to be made to the bdev layer - it should already support
> 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte
> blocks cannot be visible from nvmf-target to nvmf-initiator.
>
> What I want to do first is to make SPDK nvmf-target support (virtual) PI
> and its expected backends will be nvme bdevs first.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends
> next.
> The bdev APIs shouldn't understand the meaning of the metadata at all
> (just that
> it exists and its size), and therefore can't strip or insert it. The bdev
> module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata
> but additional information will be necessary to create virtual namespaces
> in which PI is enabled.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
>
> Or bdev module will provide detailed PI information at virtual namespace
> creation?
>
> The second case is as you said, where the metadata is stored in a separate
> place
> in memory. We support this in the NVMe driver via special command
> submission
> functions that additionally take a pointer to the associated metadata. I
> think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
>
> Option 1:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> This will be able to keep zero-copy and be good for performance.
>
> Option 2:
> nvmf-target separates interleaved metadata into two buffers and then
> transfers two buffers to the backend bdev by new bdev APIs.
>
>
> >
> > ELBA awareness means that the upper layer must extract data from
> interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata,
> providing the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
> OK, I will do.
> But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> command to applications.
> nvme bdev will be able to hold pichk internally but nvme bdev will have to
> provide app_tag/mask and ref_tag in bdev APIs from the end-to-end
> perspective.
> Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
>
>
> There is also the question of what is actually stored in the metadata. T10
> PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's
> the
> right design choice. Basic metadata support can be in the bdev layer, but
> the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it
> might
> depend on whether we're doing case #1 or case #2 from above. In case #1,
> I'd
> except it to return the total length of data + metadata (the user may not
> even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect
> it to
> return just the length of the data.
> I got it.
>
>
> Thanks,
> Shuhei
>
>
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月26日 8:24
> 宛先: spdk(a)lists.01.org
> 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
>
>
> Hi Ben,
>
>
> Thanks for your feedback. Your feedback made my head cleaner.
>
> I have additional few questions. Will you provide your feedback again?
>
>
>
> There is, of course, two ways to handle per-block metadata. The first is
> to make
> it transparent to the block layer by extending the block size (typically
> by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think
> that
> any changes need to be made to the bdev layer - it should already support
> 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte
> blocks cannot be visible from nvmf-target to nvmf-initiator.
>
> What I want to do first is to make SPDK nvmf-target support (virtual) PI
> and its expected backends will be nvme bdevs first.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends
> next.
> The bdev APIs shouldn't understand the meaning of the metadata at all
> (just that
> it exists and its size), and therefore can't strip or insert it. The bdev
> module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata
> but additional information will be necessary to create virtual namespaces
> with PI.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> Or bdev module will have detailed PI information and provide them at
> virtual namespace creation?
>
> The second case is as you said, where the metadata is stored in a separate
> place
> in memory. We support this in the NVMe driver via special command
> submission
> functions that additionally take a pointer to the associated metadata. I
> think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
>
> Option 1:
> nvmf-target separates interleaved metadata into two buffers and then
> transfers two buffers to the backend bdev by new bdev APIs.
>
> Option 2:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> But separate metadata may be better for crypto bdev and raid bdev.
> Hence for this option I may have to provide a special bdev module to
> convert between interleaved metadata and separate metadata.
>
> There is also the question of what is actually stored in the metadata. T10
> PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's
> the
> right design choice. Basic metadata support can be in the bdev layer, but
> the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it
> might
> depend on whether we're doing case #1 or case #2 from above. In case #1,
> I'd
> except it to return the total length of data + metadata (the user may not
> even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect
> it to
> return just the length of the data.
> I got it.
>
> >
> > ELBA awareness means that the upper layer must extract data from
> interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata,
> providing the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
> OK, I will do.
>
> Thanks,
> Shuhei
>
>
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin <
> benjamin.walker(a)intel.com> の代理で送信
> 送信日時: 2018年9月26日 5:13
> 宛先: spdk(a)lists.01.org
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in
> bdevs.
>
> On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > Sorry for resend, I revised a little.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will
> do
> > this too.
>
> There is, of course, two ways to handle per-block metadata. The first is
> to make
> it transparent to the block layer by extending the block size (typically
> by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think
> that
> any changes need to be made to the bdev layer - it should already support
> 520
> byte blocks, for example.
>
> The second case is as you said, where the metadata is stored in a separate
> place
> in memory. We support this in the NVMe driver via special command
> submission
> functions that additionally take a pointer to the associated metadata. I
> think
> support for this second case, case #2, is what you are interested in
> implementing.
>
> There is also the question of what is actually stored in the metadata. T10
> PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's
> the
> right design choice. Basic metadata support can be in the bdev layer, but
> the
> T10 PI code can be in its own bdev module.
>
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata
> by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
>
> I haven't thought through all of the ramifications here, but I think it
> might
> depend on whether we're doing case #1 or case #2 from above. In case #1,
> I'd
> except it to return the total length of data + metadata (the user may not
> even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect
> it to
> return just the length of the data.
>
> >
> > Current implementation is Option 1 but NVMe-oF target cuts off the size
> of
> > metadata now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> >
> > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev)
> anyway.
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both
> data and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > data and metadata to the target.
> >
> > Option 1 looks reasonable to me. I will be able to provide an new bdev
> module
> > to use extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take Option 2, we will have to provide the set of ELBA aware APIs.
>
> The bdev APIs shouldn't understand the meaning of the metadata at all
> (just that
> it exists and its size), and therefore can't strip or insert it. The bdev
> module
> implementing T10 PI should strip and insert automatically.
>
> >
> > ELBA awareness means that the upper layer must extract data from
> interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata,
> providing the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
>
> Sounds good.
>
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks,
> cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks,
> cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月25日 9:14
> > 宛先: spdk(a)lists.01.org
> > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will
> do
> > this too.
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata
> by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> >
> > Current implementation is A1 but NVMe-oF target cuts off the size of
> metadata
> > now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> > Even if we take Option 1
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to
> the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata,
> and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both
> data and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes
> both
> > data and metadata to the target.
> > A1 looks reasonable to me. I will be able to provide an new bdev module
> to use
> > extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take A2, we will have to provide the set of ELBA aware APIs.
> >
> > Q3. To support any application which requires separate metadata,
> providing the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks,
> cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks,
> cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> >
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
>
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-26  3:13 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-26  3:13 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 18508 bytes --]

Hi Ben and All,


I updated slightly. Will you read this mail and discard previous one?



There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.
Yes, any change is not necessary in the bdev layer but currently 520 byte blocks cannot be visible from nvmf-target to nvmf-initiator.

What I want to do first is to make SPDK nvmf-target support (virtual) PI and its expected backends will be nvme bdevs first.
Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.
OK, by this bdevs will be able to inform existence and size of metadata but additional information will be necessary to create virtual namespaces in which PI is enabled.
Do you think adding optional parameters related with PI to nvmf_subsystem_add_ns RPC is reasonable?

Or bdev module will provide detailed PI information at virtual namespace creation?

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.
nvmf-target supports only interleaved metadata and don't support separate metadata buffer. Which option sounds reasonable for you?

Option 1:
nvmf-target transfers interleaved metadata directly to the backend bdev by current bdev APIs.
This will be able to keep zero-copy and be good for performance.

Option 2:
nvmf-target separates interleaved metadata into two buffers and then transfers two buffers to the backend bdev by new bdev APIs.


>
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.
OK, I will do.
But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write command to applications.
nvme bdev will be able to hold pichk internally but nvme bdev will have to provide app_tag/mask and ref_tag in bdev APIs from the end-to-end perspective.
Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?


There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.
I got it.
I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.
I got it.


Thanks,
Shuhei


________________________________
差出人: 松本周平 / MATSUMOTO,SHUUHEI
送信日時: 2018年9月26日 8:24
宛先: spdk(a)lists.01.org
件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.


Hi Ben,


Thanks for your feedback. Your feedback made my head cleaner.

I have additional few questions. Will you provide your feedback again?



There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.
Yes, any change is not necessary in the bdev layer but currently 520 byte blocks cannot be visible from nvmf-target to nvmf-initiator.

What I want to do first is to make SPDK nvmf-target support (virtual) PI and its expected backends will be nvme bdevs first.
Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.
OK, by this bdevs will be able to inform existence and size of metadata but additional information will be necessary to create virtual namespaces with PI.
Do you think adding optional parameters related with PI to nvmf_subsystem_add_ns RPC is reasonable?
Or bdev module will have detailed PI information and provide them at virtual namespace creation?

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.
nvmf-target supports only interleaved metadata and don't support separate metadata buffer. Which option sounds reasonable for you?

Option 1:
nvmf-target separates interleaved metadata into two buffers and then transfers two buffers to the backend bdev by new bdev APIs.

Option 2:
nvmf-target transfers interleaved metadata directly to the backend bdev by current bdev APIs.
But separate metadata may be better for crypto bdev and raid bdev.
Hence for this option I may have to provide a special bdev module to convert between interleaved metadata and separate metadata.

There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.
I got it.
I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.
I got it.

>
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.
OK, I will do.

Thanks,
Shuhei


________________________________
差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin <benjamin.walker(a)intel.com> の代理で送信
送信日時: 2018年9月26日 5:13
宛先: spdk(a)lists.01.org
件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.

On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> Sorry for resend, I revised a little.
>
>
> Hi All,
>
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.

There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.

There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.

>
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
>
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?

I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.

>
> Current implementation is Option 1 but NVMe-oF target cuts off the size of
> metadata now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
>
> Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
>
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
>
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
>
> Option 1 looks reasonable to me. I will be able to provide an new bdev module
> to use extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take Option 2, we will have to provide the set of ELBA aware APIs.

The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.

>
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.

>
> About T10 PI, I will send questions as separate mail later.
>
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
>
> Thanks,
> Shuhei
>
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月25日 9:14
> 宛先: spdk(a)lists.01.org
> 件名: Design policy to support extended LBA and T10 PI in bdevs.
>
>
> Hi All,
>
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.
>
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
>
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?
>
> Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
> Even if we take Option 1
>
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
>
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
> A1 looks reasonable to me. I will be able to provide an new bdev module to use
> extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take A2, we will have to provide the set of ELBA aware APIs.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.
>
> About T10 PI, I will send questions as separate mail later.
>
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
>
> Thanks,
> Shuhei
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-25 23:24 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-25 23:24 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14689 bytes --]

Hi Ben,


Thanks for your feedback. Your feedback made my head cleaner.

I have additional few questions. Will you provide your feedback again?



There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.
Yes, any change is not necessary in the bdev layer but currently 520 byte blocks cannot be visible from nvmf-target to nvmf-initiator.

What I want to do first is to make SPDK nvmf-target support (virtual) PI and its expected backends will be nvme bdevs first.
Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.
OK, by this bdevs will be able to inform existence and size of metadata but additional information will be necessary to create virtual namespaces with PI.
Do you think adding optional parameters related with PI to nvmf_subsystem_add_ns RPC is reasonable?
Or bdev module will have detailed PI information and provide them at virtual namespace creation?

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.
nvmf-target supports only interleaved metadata and don't support separate metadata buffer. Which option sounds reasonable for you?

Option 1:
nvmf-target separates interleaved metadata into two buffers and then transfers two buffers to the backend bdev by new bdev APIs.

Option 2:
nvmf-target transfers interleaved metadata directly to the backend bdev by current bdev APIs.
But separate metadata may be better for crypto bdev and raid bdev.
Hence for this option I may have to provide a special bdev module to convert between interleaved metadata and separate metadata.

There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.
I got it.
I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.
I got it.

>
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.
OK, I will do.

Thanks,
Shuhei


________________________________
差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin <benjamin.walker(a)intel.com> の代理で送信
送信日時: 2018年9月26日 5:13
宛先: spdk(a)lists.01.org
件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.

On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> Sorry for resend, I revised a little.
>
>
> Hi All,
>
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.

There is, of course, two ways to handle per-block metadata. The first is to make
it transparent to the block layer by extending the block size (typically by 8
bytes). For this case, which I'll call case #1, I don't necessarily think that
any changes need to be made to the bdev layer - it should already support 520
byte blocks, for example.

The second case is as you said, where the metadata is stored in a separate place
in memory. We support this in the NVMe driver via special command submission
functions that additionally take a pointer to the associated metadata. I think
support for this second case, case #2, is what you are interested in
implementing.

There is also the question of what is actually stored in the metadata. T10 PI is
the obvious choice, but really anything can be put there. Below you mention
writing a bdev module that transparently handles T10 PI and I think that's the
right design choice. Basic metadata support can be in the bdev layer, but the
T10 PI code can be in its own bdev module.

>
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
>
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?

I haven't thought through all of the ramifications here, but I think it might
depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
except it to return the total length of data + metadata (the user may not even
be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
return just the length of the data.

>
> Current implementation is Option 1 but NVMe-oF target cuts off the size of
> metadata now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
>
> Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
>
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
>
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
>
> Option 1 looks reasonable to me. I will be able to provide an new bdev module
> to use extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take Option 2, we will have to provide the set of ELBA aware APIs.

The bdev APIs shouldn't understand the meaning of the metadata at all (just that
it exists and its size), and therefore can't strip or insert it. The bdev module
implementing T10 PI should strip and insert automatically.

>
> ELBA awareness means that the upper layer must extract data from interleaved
> extended blocks.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.

Sounds good.

>
> About T10 PI, I will send questions as separate mail later.
>
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
>
> Thanks,
> Shuhei
>
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月25日 9:14
> 宛先: spdk(a)lists.01.org
> 件名: Design policy to support extended LBA and T10 PI in bdevs.
>
>
> Hi All,
>
> I've worked on extended LBA in bdevs first.
> I will do T10 PI on top of the extended LBA next.
> I expect some applications or users will need separate metadata and will do
> this too.
>
> About extended LBA in bdevs, I would like to hear any feedback before
> submitting patches.
> Any feedback is very very appreciated.
>
> Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
>   Option 1: length of extended block (data + metadata)
>   Option 2: length of only data block. user will get length of metadata by
> spdk_bdev_get_md_size(bdev)
>   Or any other idea?
>
> Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> now even if metadata is enabled.
> Keeping current implementation, Option1, sounds reasonable for me.
> Even if we take Option 1
>
> Q2. Which behavior should bdev IO APIs have by default?
>  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended
> LBA by default)
>  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> upper layer by default)
>  Or any other idea?
>
> READ_STRIP reads data and metadata from the target, discards metadata, and
> transfers only data to the upper layer.
> WRITE_INSERT transfers only data from the upper layer, add metadata, and
> writes both data and metadata to the target.
> READ_PASS reads data and metadata from the target and transfers both data and
> metadata to the upper layer.
> WRITE_PASS transfers data and metadata from the upper layer and writes both
> data and metadata to the target.
> A1 looks reasonable to me. I will be able to provide an new bdev module to use
> extended LBA bdevs transparently.
> The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> If we take A2, we will have to provide the set of ELBA aware APIs.
>
> Q3. To support any application which requires separate metadata, providing the
> following APIs is reasonable?
> - add md_buf and md_len to parameters
> - add the suffix "_md" to the function name.
>
> About T10 PI, I will send questions as separate mail later.
>
> Bdev provides the following APIs:
> - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
> - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf,
> md_len, cb, cb_arg)
> - uint32_t spdk_bdev_get_blocks_size(bdev)
> - uint32_t spdk_bdev_get_num_blocks(bdev)
> - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> cb, cb_arg)
> - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> num_blocks, cb, cb_arg)
> - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> cb_arg)
> - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
>
> Thanks,
> Shuhei
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-25  0:27 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-25  0:27 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7499 bytes --]

Sorry for resend, I revised a little.


Hi All,

I've worked on extended LBA in bdevs first.
I will do T10 PI on top of the extended LBA next.
I expect some applications or users will need separate metadata and will do this too.

About extended LBA in bdevs, I would like to hear any feedback before submitting patches.
Any feedback is very very appreciated.

Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
  Option 1: length of extended block (data + metadata)
  Option 2: length of only data block. user will get length of metadata by spdk_bdev_get_md_size(bdev)
  Or any other idea?

Current implementation is Option 1 but NVMe-oF target cuts off the size of metadata now even if metadata is enabled.
Keeping current implementation, Option1, sounds reasonable for me.

Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.

Q2. Which behavior should bdev IO APIs have by default?
 Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended LBA by default)
 Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the upper layer by default)
 Or any other idea?

READ_STRIP reads data and metadata from the target, discards metadata, and transfers only data to the upper layer.
WRITE_INSERT transfers only data from the upper layer, add metadata, and writes both data and metadata to the target.
READ_PASS reads data and metadata from the target and transfers both data and metadata to the upper layer.
WRITE_PASS transfers data and metadata from the upper layer and writes both data and metadata to the target.

Option 1 looks reasonable to me. I will be able to provide an new bdev module to use extended LBA bdevs transparently.
The new bdev module will do READ_STRIP and WRITE_INSERT internally.
If we take Option 2, we will have to provide the set of ELBA aware APIs.

ELBA awareness means that the upper layer must extract data from interleaved extended blocks.

Q3. To support any application which requires separate metadata, providing the following APIs is reasonable?
- add md_buf and md_len to parameters
- add the suffix "_md" to the function name.

About T10 PI, I will send questions as separate mail later.

Bdev provides the following APIs:
- int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
- int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
- int spdk_bdev_reset(desc, ch, cb, cb_arg)
- int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
- int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
- int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
- int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf, md_len, cb, cb_arg)
- uint32_t spdk_bdev_get_blocks_size(bdev)
- uint32_t spdk_bdev_get_num_blocks(bdev)
- int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)

Thanks,
Shuhei

________________________________
差出人: 松本周平 / MATSUMOTO,SHUUHEI
送信日時: 2018年9月25日 9:14
宛先: spdk(a)lists.01.org
件名: Design policy to support extended LBA and T10 PI in bdevs.


Hi All,

I've worked on extended LBA in bdevs first.
I will do T10 PI on top of the extended LBA next.
I expect some applications or users will need separate metadata and will do this too.

About extended LBA in bdevs, I would like to hear any feedback before submitting patches.
Any feedback is very very appreciated.

Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
  Option 1: length of extended block (data + metadata)
  Option 2: length of only data block. user will get length of metadata by spdk_bdev_get_md_size(bdev)
  Or any other idea?

Current implementation is A1 but NVMe-oF target cuts off the size of metadata now even if metadata is enabled.
Keeping current implementation, Option1, sounds reasonable for me.
Even if we take Option 1

Q2. Which behavior should bdev IO APIs have by default?
 Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended LBA by default)
 Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the upper layer by default)
 Or any other idea?

READ_STRIP reads data and metadata from the target, discards metadata, and transfers only data to the upper layer.
WRITE_INSERT transfers only data from the upper layer, add metadata, and writes both data and metadata to the target.
READ_PASS reads data and metadata from the target and transfers both data and metadata to the upper layer.
WRITE_PASS transfers data and metadata from the upper layer and writes both data and metadata to the target.
A1 looks reasonable to me. I will be able to provide an new bdev module to use extended LBA bdevs transparently.
The new bdev module will do READ_STRIP and WRITE_INSERT internally.
If we take A2, we will have to provide the set of ELBA aware APIs.

Q3. To support any application which requires separate metadata, providing the following APIs is reasonable?
- add md_buf and md_len to parameters
- add the suffix "_md" to the function name.

About T10 PI, I will send questions as separate mail later.

Bdev provides the following APIs:
- int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
- int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
- int spdk_bdev_reset(desc, ch, cb, cb_arg)
- int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
- int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
- int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
- int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf, md_len, cb, cb_arg)
- uint32_t spdk_bdev_get_blocks_size(bdev)
- uint32_t spdk_bdev_get_num_blocks(bdev)
- int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)

Thanks,
Shuhei


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
@ 2018-09-25  0:14 
  0 siblings, 0 replies; 11+ messages in thread
From:  @ 2018-09-25  0:14 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3503 bytes --]

Hi All,

I've worked on extended LBA in bdevs first.
I will do T10 PI on top of the extended LBA next.
I expect some applications or users will need separate metadata and will do this too.

About extended LBA in bdevs, I would like to hear any feedback before submitting patches.
Any feedback is very very appreciated.

Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
  Option 1: length of extended block (data + metadata)
  Option 2: length of only data block. user will get length of metadata by spdk_bdev_get_md_size(bdev)
  Or any other idea?

Current implementation is A1 but NVMe-oF target cuts off the size of metadata now even if metadata is enabled.
Keeping current implementation, A1, sounds reasonable for me.

Q2. Which behavior should bdev IO APIs have by default?
 Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of extended LBA by default)
 Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the upper layer by default)
 Or any other idea?

READ_STRIP reads data and metadata from the target, discards metadata, and transfers only data to the upper layer.
WRITE_INSERT transfers only data from the upper layer, add metadata, and writes both data and metadata to the target.
READ_PASS reads data and metadata from the target and transfers both data and metadata to the upper layer.
WRITE_PASS transfers data and metadata from the upper layer and writes both data and metadata to the target.
A1 looks reasonable to me. I will be able to provide an new bdev module to use extended LBA bdevs transparently.
The new bdev module will do READ_STRIP and WRITE_INSERT internally.
If we take A2, we will have to provide the set of ELBA aware APIs.

Q3. To support any application which requires separate metadata, providing the following APIs is reasonable?
- add md_buf and md_len to parameters
- add the suffix "_md" to the function name.

About T10 PI, I will send questions as separate mail later.

Bdev provides the following APIs:
- int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
- int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
- int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
- int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
- int spdk_bdev_reset(desc, ch, cb, cb_arg)
- int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
- int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
- int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_arg)
- int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_buf, md_len, cb, cb_arg)
- uint32_t spdk_bdev_get_blocks_size(bdev)
- uint32_t spdk_bdev_get_num_blocks(bdev)
- int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
- int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)

Thanks,
Shuhei


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-09-27  7:23 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-25 20:13 [SPDK] Design policy to support extended LBA and T10 PI in bdevs Walker, Benjamin
  -- strict thread matches above, loose matches on Subject: below --
2018-09-27  7:23 Liu, Changpeng
2018-09-27  7:08 
2018-09-27  6:53 Liu, Changpeng
2018-09-27  6:40 
2018-09-27  5:21 Liu, Changpeng
2018-09-26 20:26 
2018-09-26  3:13 
2018-09-25 23:24 
2018-09-25  0:27 
2018-09-25  0:14 

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.