All of lore.kernel.org
 help / color / mirror / Atom feed
From: Liu, Changpeng <changpeng.liu at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
Date: Thu, 27 Sep 2018 05:21:45 +0000	[thread overview]
Message-ID: <FF7FC980937D6342B9D289F5F3C7C2625B7362EB@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: OS1PR01MB0229B4FE538ACB9F0581684DA2150@OS1PR01MB0229.jpnprd01.prod.outlook.com

[-- Attachment #1: Type: text/plain, Size: 20254 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Wednesday, September 26, 2018 11:13 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> Hi Ben and All,
> 
> 
> I updated slightly. Will you read this mail and discard previous one?
> 
> 
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
> 
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
Hi Shuhei, I understand your point here, the question I have is that: how can we map the virtual
Namespace bdev with the real backend, especially the real backend can't support extended LBA format ?

Or NVMeoF virtual Namespace can support different types of PI, and SPDK format the NVMe backend
with TYPE0, SPDK can fill the PI according to the specification.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces in which PI
> is enabled.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> 
> Or bdev module will provide detailed PI information at virtual namespace
> creation?
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
> 
> Option 1:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> This will be able to keep zero-copy and be good for performance.
> 
> Option 2:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
> 
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> OK, I will do.
> But nvme provides pichk, app_tag/mask and ref_tag fields in the read/write
> command to applications.
> nvme bdev will be able to hold pichk internally but nvme bdev will have to
> provide app_tag/mask and ref_tag in bdev APIs from the end-to-end perspective.
> Adding app_tag/mask and ref_tag to the new APIs will be reasonable too?
> 
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
> 
> 
> Thanks,
> Shuhei
> 
> 
> ________________________________
> 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> 送信日時: 2018年9月26日 8:24
> 宛先: spdk(a)lists.01.org
> 件名: RE: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> 
> Hi Ben,
> 
> 
> Thanks for your feedback. Your feedback made my head cleaner.
> 
> I have additional few questions. Will you provide your feedback again?
> 
> 
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> Yes, any change is not necessary in the bdev layer but currently 520 byte blocks
> cannot be visible from nvmf-target to nvmf-initiator.
> 
> What I want to do first is to make SPDK nvmf-target support (virtual) PI and its
> expected backends will be nvme bdevs first.
> Then I want to use crypto or raid bdevs on top of nvme bdevs as backends next.
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> OK, by this bdevs will be able to inform existence and size of metadata but
> additional information will be necessary to create virtual namespaces with PI.
> Do you think adding optional parameters related with PI to
> nvmf_subsystem_add_ns RPC is reasonable?
> Or bdev module will have detailed PI information and provide them at virtual
> namespace creation?
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> nvmf-target supports only interleaved metadata and don't support separate
> metadata buffer. Which option sounds reasonable for you?
> 
> Option 1:
> nvmf-target separates interleaved metadata into two buffers and then transfers
> two buffers to the backend bdev by new bdev APIs.
> 
> Option 2:
> nvmf-target transfers interleaved metadata directly to the backend bdev by
> current bdev APIs.
> But separate metadata may be better for crypto bdev and raid bdev.
> Hence for this option I may have to provide a special bdev module to convert
> between interleaved metadata and separate metadata.
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> I got it.
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> I got it.
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> OK, I will do.
> 
> Thanks,
> Shuhei
> 
> 
> ________________________________
> 差出人: SPDK <spdk-bounces(a)lists.01.org> が Walker, Benjamin
> <benjamin.walker(a)intel.com> の代理で送信
> 送信日時: 2018年9月26日 5:13
> 宛先: spdk(a)lists.01.org
> 件名: [!]Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs.
> 
> On Tue, 2018-09-25 at 00:27 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > Sorry for resend, I revised a little.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
> 
> There is, of course, two ways to handle per-block metadata. The first is to make
> it transparent to the block layer by extending the block size (typically by 8
> bytes). For this case, which I'll call case #1, I don't necessarily think that
> any changes need to be made to the bdev layer - it should already support 520
> byte blocks, for example.
> 
> The second case is as you said, where the metadata is stored in a separate place
> in memory. We support this in the NVMe driver via special command submission
> functions that additionally take a pointer to the associated metadata. I think
> support for this second case, case #2, is what you are interested in
> implementing.
> 
> There is also the question of what is actually stored in the metadata. T10 PI is
> the obvious choice, but really anything can be put there. Below you mention
> writing a bdev module that transparently handles T10 PI and I think that's the
> right design choice. Basic metadata support can be in the bdev layer, but the
> T10 PI code can be in its own bdev module.
> 
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> 
> I haven't thought through all of the ramifications here, but I think it might
> depend on whether we're doing case #1 or case #2 from above. In case #1, I'd
> except it to return the total length of data + metadata (the user may not even
> be using the extra 8 bytes for metadata at all). For case #2, I'd expect it to
> return just the length of the data.
> 
> >
> > Current implementation is Option 1 but NVMe-oF target cuts off the size of
> > metadata now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> >
> > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway.
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> >
> > Option 1 looks reasonable to me. I will be able to provide an new bdev module
> > to use extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take Option 2, we will have to provide the set of ELBA aware APIs.
> 
> The bdev APIs shouldn't understand the meaning of the metadata at all (just that
> it exists and its size), and therefore can't strip or insert it. The bdev module
> implementing T10 PI should strip and insert automatically.
> 
> >
> > ELBA awareness means that the upper layer must extract data from interleaved
> > extended blocks.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> 
> Sounds good.
> 
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > ________________________________
> > 差出人: 松本周平 / MATSUMOTO,SHUUHEI
> > 送信日時: 2018年9月25日 9:14
> > 宛先: spdk(a)lists.01.org
> > 件名: Design policy to support extended LBA and T10 PI in bdevs.
> >
> >
> > Hi All,
> >
> > I've worked on extended LBA in bdevs first.
> > I will do T10 PI on top of the extended LBA next.
> > I expect some applications or users will need separate metadata and will do
> > this too.
> >
> > About extended LBA in bdevs, I would like to hear any feedback before
> > submitting patches.
> > Any feedback is very very appreciated.
> >
> > Q1. Which length should spdk_bdev_get_block_size(bdev) describe?
> >   Option 1: length of extended block (data + metadata)
> >   Option 2: length of only data block. user will get length of metadata by
> > spdk_bdev_get_md_size(bdev)
> >   Or any other idea?
> >
> > Current implementation is A1 but NVMe-oF target cuts off the size of metadata
> > now even if metadata is enabled.
> > Keeping current implementation, Option1, sounds reasonable for me.
> > Even if we take Option 1
> >
> > Q2. Which behavior should bdev IO APIs have by default?
> >  Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of
> extended
> > LBA by default)
> >  Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the
> > upper layer by default)
> >  Or any other idea?
> >
> > READ_STRIP reads data and metadata from the target, discards metadata, and
> > transfers only data to the upper layer.
> > WRITE_INSERT transfers only data from the upper layer, add metadata, and
> > writes both data and metadata to the target.
> > READ_PASS reads data and metadata from the target and transfers both data
> and
> > metadata to the upper layer.
> > WRITE_PASS transfers data and metadata from the upper layer and writes both
> > data and metadata to the target.
> > A1 looks reasonable to me. I will be able to provide an new bdev module to
> use
> > extended LBA bdevs transparently.
> > The new bdev module will do READ_STRIP and WRITE_INSERT internally.
> > If we take A2, we will have to provide the set of ELBA aware APIs.
> >
> > Q3. To support any application which requires separate metadata, providing
> the
> > following APIs is reasonable?
> > - add md_buf and md_len to parameters
> > - add the suffix "_md" to the function name.
> >
> > About T10 PI, I will send questions as separate mail later.
> >
> > Bdev provides the following APIs:
> > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg)
> > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg)
> > - int spdk_bdev_reset(desc, ch, cb, cb_arg)
> > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg)
> > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_arg)
> > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc,
> cb_arg)
> > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes,
> md_buf,
> > md_len, cb, cb_arg)
> > - uint32_t spdk_bdev_get_blocks_size(bdev)
> > - uint32_t spdk_bdev_get_num_blocks(bdev)
> > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_blocks,
> > cb, cb_arg)
> > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks,
> > num_blocks, cb, cb_arg)
> > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, cb,
> > cb_arg)
> > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_arg)
> >
> > Thanks,
> > Shuhei
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://clicktime.symantec.com/a/1/_YuXBcwzu-
> xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-
> npnVUNg=?d=9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMD
> A-H70FTUf_PnxY0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-
> AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S9Yink5uuxORDFO2pf8ozMZG
> DAMUHjlWNsxNQDsRJWSREbYx-
> _D0Q2F2O8i5q_xN8Lh2K4_7foUPvqHOGu34rn4D6-
> 8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8CSjbk
> F-
> lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCE
> noOomGZDSIU2GAsK9V-
> HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJeZA%3
> D%3D&u=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

             reply	other threads:[~2018-09-27  5:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-27  5:21 Liu, Changpeng [this message]
  -- strict thread matches above, loose matches on Subject: below --
2018-09-27  7:23 [SPDK] Design policy to support extended LBA and T10 PI in bdevs Liu, Changpeng
2018-09-27  7:08 
2018-09-27  6:53 Liu, Changpeng
2018-09-27  6:40 
2018-09-26 20:26 
2018-09-26  3:13 
2018-09-25 23:24 
2018-09-25 20:13 Walker, Benjamin
2018-09-25  0:27 
2018-09-25  0:14 

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=FF7FC980937D6342B9D289F5F3C7C2625B7362EB@SHSMSX103.ccr.corp.intel.com \
    --to=spdk@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.