From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============4526312729204536840==" MIME-Version: 1.0 From: =?utf-8?q?=E6=9D=BE=E6=9C=AC=E5=91=A8=E5=B9=B3_/_MATSUMOTO=EF=BC=8CSHUUHE?= =?utf-8?q?I_=3Cshuhei=2Ematsumoto=2Ext_at_hitachi=2Ecom=3E?= Subject: Re: [SPDK] Design policy to support extended LBA and T10 PI in bdevs. Date: Tue, 25 Sep 2018 23:24:06 +0000 Message-ID: In-Reply-To: 8bec590e4bd223439d18deb9c05553fb9b95093c.camel@intel.com List-ID: To: spdk@lists.01.org --===============4526312729204536840== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Hi Ben, Thanks for your feedback. Your feedback made my head cleaner. I have additional few questions. Will you provide your feedback again? There is, of course, two ways to handle per-block metadata. The first is to= make it transparent to the block layer by extending the block size (typically by= 8 bytes). For this case, which I'll call case #1, I don't necessarily think t= hat any changes need to be made to the bdev layer - it should already support 5= 20 byte blocks, for example. Yes, any change is not necessary in the bdev layer but currently 520 byte b= locks cannot be visible from nvmf-target to nvmf-initiator. What I want to do first is to make SPDK nvmf-target support (virtual) PI an= d its expected backends will be nvme bdevs first. Then I want to use crypto or raid bdevs on top of nvme bdevs as backends ne= xt. The bdev APIs shouldn't understand the meaning of the metadata at all (just= that it exists and its size), and therefore can't strip or insert it. The bdev m= odule implementing T10 PI should strip and insert automatically. OK, by this bdevs will be able to inform existence and size of metadata but= additional information will be necessary to create virtual namespaces with= PI. Do you think adding optional parameters related with PI to nvmf_subsystem_a= dd_ns RPC is reasonable? Or bdev module will have detailed PI information and provide them at virtua= l namespace creation? The second case is as you said, where the metadata is stored in a separate = place in memory. We support this in the NVMe driver via special command submission functions that additionally take a pointer to the associated metadata. I th= ink support for this second case, case #2, is what you are interested in implementing. nvmf-target supports only interleaved metadata and don't support separate m= etadata buffer. Which option sounds reasonable for you? Option 1: nvmf-target separates interleaved metadata into two buffers and then transf= ers two buffers to the backend bdev by new bdev APIs. Option 2: nvmf-target transfers interleaved metadata directly to the backend bdev by = current bdev APIs. But separate metadata may be better for crypto bdev and raid bdev. Hence for this option I may have to provide a special bdev module to conver= t between interleaved metadata and separate metadata. There is also the question of what is actually stored in the metadata. T10 = PI is the obvious choice, but really anything can be put there. Below you mention writing a bdev module that transparently handles T10 PI and I think that's = the right design choice. Basic metadata support can be in the bdev layer, but t= he T10 PI code can be in its own bdev module. I got it. I haven't thought through all of the ramifications here, but I think it mig= ht depend on whether we're doing case #1 or case #2 from above. In case #1, I'd except it to return the total length of data + metadata (the user may not e= ven be using the extra 8 bytes for metadata at all). For case #2, I'd expect it= to return just the length of the data. I got it. > > ELBA awareness means that the upper layer must extract data from interlea= ved > extended blocks. > > Q3. To support any application which requires separate metadata, providin= g the > following APIs is reasonable? > - add md_buf and md_len to parameters > - add the suffix "_md" to the function name. Sounds good. OK, I will do. Thanks, Shuhei ________________________________ =E5=B7=AE=E5=87=BA=E4=BA=BA: SPDK =E3=81=8C W= alker, Benjamin =E3=81=AE=E4=BB=A3=E7=90=86= =E3=81=A7=E9=80=81=E4=BF=A1 =E9=80=81=E4=BF=A1=E6=97=A5=E6=99=82: 2018=E5=B9=B49=E6=9C=8826=E6=97=A5 5:= 13 =E5=AE=9B=E5=85=88: spdk(a)lists.01.org =E4=BB=B6=E5=90=8D: [!]Re: [SPDK] Design policy to support extended LBA and= T10 PI in bdevs. On Tue, 2018-09-25 at 00:27 +0000, =E6=9D=BE=E6=9C=AC=E5=91=A8=E5=B9=B3 / M= ATSUMOTO=EF=BC=8CSHUUHEI wrote: > Sorry for resend, I revised a little. > > > Hi All, > > I've worked on extended LBA in bdevs first. > I will do T10 PI on top of the extended LBA next. > I expect some applications or users will need separate metadata and will = do > this too. There is, of course, two ways to handle per-block metadata. The first is to= make it transparent to the block layer by extending the block size (typically by= 8 bytes). For this case, which I'll call case #1, I don't necessarily think t= hat any changes need to be made to the bdev layer - it should already support 5= 20 byte blocks, for example. The second case is as you said, where the metadata is stored in a separate = place in memory. We support this in the NVMe driver via special command submission functions that additionally take a pointer to the associated metadata. I th= ink support for this second case, case #2, is what you are interested in implementing. There is also the question of what is actually stored in the metadata. T10 = PI is the obvious choice, but really anything can be put there. Below you mention writing a bdev module that transparently handles T10 PI and I think that's = the right design choice. Basic metadata support can be in the bdev layer, but t= he T10 PI code can be in its own bdev module. > > About extended LBA in bdevs, I would like to hear any feedback before > submitting patches. > Any feedback is very very appreciated. > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe? > Option 1: length of extended block (data + metadata) > Option 2: length of only data block. user will get length of metadata by > spdk_bdev_get_md_size(bdev) > Or any other idea? I haven't thought through all of the ramifications here, but I think it mig= ht depend on whether we're doing case #1 or case #2 from above. In case #1, I'd except it to return the total length of data + metadata (the user may not e= ven be using the extra 8 bytes for metadata at all). For case #2, I'd expect it= to return just the length of the data. > > Current implementation is Option 1 but NVMe-oF target cuts off the size of > metadata now even if metadata is enabled. > Keeping current implementation, Option1, sounds reasonable for me. > > Even if we take Option 1, we will need spdk_bdev_get_md_size(bdev) anyway. > > Q2. Which behavior should bdev IO APIs have by default? > Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of ext= ended > LBA by default) > Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the > upper layer by default) > Or any other idea? > > READ_STRIP reads data and metadata from the target, discards metadata, and > transfers only data to the upper layer. > WRITE_INSERT transfers only data from the upper layer, add metadata, and > writes both data and metadata to the target. > READ_PASS reads data and metadata from the target and transfers both data= and > metadata to the upper layer. > WRITE_PASS transfers data and metadata from the upper layer and writes bo= th > data and metadata to the target. > > Option 1 looks reasonable to me. I will be able to provide an new bdev mo= dule > to use extended LBA bdevs transparently. > The new bdev module will do READ_STRIP and WRITE_INSERT internally. > If we take Option 2, we will have to provide the set of ELBA aware APIs. The bdev APIs shouldn't understand the meaning of the metadata at all (just= that it exists and its size), and therefore can't strip or insert it. The bdev m= odule implementing T10 PI should strip and insert automatically. > > ELBA awareness means that the upper layer must extract data from interlea= ved > extended blocks. > > Q3. To support any application which requires separate metadata, providin= g the > following APIs is reasonable? > - add md_buf and md_len to parameters > - add the suffix "_md" to the function name. Sounds good. > > About T10 PI, I will send questions as separate mail later. > > Bdev provides the following APIs: > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg) > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg) > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg) > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg) > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg) > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg) > - int spdk_bdev_reset(desc, ch, cb, cb_arg) > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg) > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_ar= g) > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_= arg) > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_b= uf, > md_len, cb, cb_arg) > - uint32_t spdk_bdev_get_blocks_size(bdev) > - uint32_t spdk_bdev_get_num_blocks(bdev) > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, > cb_arg) > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_bl= ocks, > cb, cb_arg) > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, > cb_arg) > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks, > num_blocks, cb, cb_arg) > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, = cb, > cb_arg) > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_= arg) > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_= arg) > > Thanks, > Shuhei > > ________________________________ > =E5=B7=AE=E5=87=BA=E4=BA=BA: =E6=9D=BE=E6=9C=AC=E5=91=A8=E5=B9=B3 / MATSU= MOTO=EF=BC=8CSHUUHEI > =E9=80=81=E4=BF=A1=E6=97=A5=E6=99=82: 2018=E5=B9=B49=E6=9C=8825=E6=97=A5 = 9:14 > =E5=AE=9B=E5=85=88: spdk(a)lists.01.org > =E4=BB=B6=E5=90=8D: Design policy to support extended LBA and T10 PI in b= devs. > > > Hi All, > > I've worked on extended LBA in bdevs first. > I will do T10 PI on top of the extended LBA next. > I expect some applications or users will need separate metadata and will = do > this too. > > About extended LBA in bdevs, I would like to hear any feedback before > submitting patches. > Any feedback is very very appreciated. > > Q1. Which length should spdk_bdev_get_block_size(bdev) describe? > Option 1: length of extended block (data + metadata) > Option 2: length of only data block. user will get length of metadata by > spdk_bdev_get_md_size(bdev) > Or any other idea? > > Current implementation is A1 but NVMe-oF target cuts off the size of meta= data > now even if metadata is enabled. > Keeping current implementation, Option1, sounds reasonable for me. > Even if we take Option 1 > > Q2. Which behavior should bdev IO APIs have by default? > Option 1: READ_PASS and WRITE_PASS (the upper layer must be aware of ext= ended > LBA by default) > Option 2: READ_STRIP and WRITE_INSERT (extended LBA is transparent to the > upper layer by default) > Or any other idea? > > READ_STRIP reads data and metadata from the target, discards metadata, and > transfers only data to the upper layer. > WRITE_INSERT transfers only data from the upper layer, add metadata, and > writes both data and metadata to the target. > READ_PASS reads data and metadata from the target and transfers both data= and > metadata to the upper layer. > WRITE_PASS transfers data and metadata from the upper layer and writes bo= th > data and metadata to the target. > A1 looks reasonable to me. I will be able to provide an new bdev module t= o use > extended LBA bdevs transparently. > The new bdev module will do READ_STRIP and WRITE_INSERT internally. > If we take A2, we will have to provide the set of ELBA aware APIs. > > Q3. To support any application which requires separate metadata, providin= g the > following APIs is reasonable? > - add md_buf and md_len to parameters > - add the suffix "_md" to the function name. > > About T10 PI, I will send questions as separate mail later. > > Bdev provides the following APIs: > - int spdk_bdev_read(desc, ch, buf, offset, nbytes, cb, cb_arg) > - int spdk_bdev_readv(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg) > - int spdk_bdev_write(desc, ch, buf, offset, nbytes, cb, cb_arg) > - int spdk_bdev_writev(desc, ch, iov, iovcnt, offset, nbytes, cb, cb_arg) > - int spdk_bdev_write_zeroes(desc, ch, offset, len, cb, cb_arg) > - int spdk_bdev_unmap(desc, ch, offset, nbytes, cb, cb_arg) > - int spdk_bdev_reset(desc, ch, cb, cb_arg) > - int spdk_bdev_flush(desc, ch, offset, length, cb, cb_arg) > - int spdk_bdev_nvme_admin_passthru(desc, ch, cmd, buf, nbytes, cb, cb_ar= g) > - int spdk_bdev_nvme_io_passthru(bdev_desc, ch, cmd, buf, nbytes, bc, cb_= arg) > - int spdk_bdev_nvme_io_passthru_md(bdev_desc, ch, cmd, buf, nbytes, md_b= uf, > md_len, cb, cb_arg) > - uint32_t spdk_bdev_get_blocks_size(bdev) > - uint32_t spdk_bdev_get_num_blocks(bdev) > - int spdk_bdev_read_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, > cb_arg) > - int spdk_bdev_readv_blocks(desc, ch, iov, iovcnt, offset_blocks, num_bl= ocks, > cb, cb_arg) > - int spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks, cb, > cb_arg) > - int spdk_bdev_writev_blocks(desc, ch, iov, iovcnt, offset_blocks, > num_blocks, cb, cb_arg) > - int spdk_bdev_write_zeroes_blocks(desc, ch, offset_blocks, num_blocks, = cb, > cb_arg) > - int spdk_bdev_unmap_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_= arg) > - int spdk_bdev_flush_blocks(desc, ch, offset_blocks, num_blocks, cb, cb_= arg) > > Thanks, > Shuhei > > _______________________________________________ > SPDK mailing list > SPDK(a)lists.01.org > https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-np= nVUNg=3D?d=3D9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_Pnx= Y0Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP3= 9S9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7fo= UPvqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRP= L8CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEn= oOomGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2= sJeZA%3D%3D&u=3Dhttps%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk _______________________________________________ SPDK mailing list SPDK(a)lists.01.org https://clicktime.symantec.com/a/1/_YuXBcwzu-xtu4EvOXanN4spdgP_iGZmwJk-npnV= UNg=3D?d=3D9NA6TQrg1y3x2LNYVh4PW5R9mxBq4nSAmNKdauz45YcmP7WMDA-H70FTUf_PnxY0= Bp5jb5luBgHDxciVsChJclld_tKjA0lmW-AdkzLd0YvqKO7Til_4_C7yeqWXomVDupRM4FoP39S= 9Yink5uuxORDFO2pf8ozMZGDAMUHjlWNsxNQDsRJWSREbYx-_D0Q2F2O8i5q_xN8Lh2K4_7foUP= vqHOGu34rn4D6-8lfypZmy14ks3MY2Xjod6Vq8cFkcydwZNCgHMh7xISObupvbxQco025KLRPL8= CSjbkF-lRRnfO_e1yuE3VHYZqG2XE3xWOcShRkr3JfzE5Pca1wQCrcGhX60pJvD9umuLRCCEnoO= omGZDSIU2GAsK9V-HMEkaCLnvK39bC3mC1nPnTW3kL0SgMYZ10VUPUqSyEslJJ8QDXrTibGp2sJ= eZA%3D%3D&u=3Dhttps%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk --===============4526312729204536840==--