* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-02 21:06 ` Matias Bjørling 0 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-02 21:06 UTC (permalink / raw) To: lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

Hi,

The open-channel SSD subsystem is maturing, and drives are beginning to become available on the market. The open-channel SSD interface is very similar to the one exposed by SMR hard drives. Both expose a set of chunks (zones), and zones are managed using open/close logic. The main difference is that an open-channel SSD additionally exposes multiple sets of zones through a hierarchical interface, which covers a number of levels (X channels, Y LUNs per channel, Z zones per LUN).

Given that the SMR interface is similar to the OCSSD interface, I would like to propose discussing this at LSF/MM to align the efforts and make a clear path forward:

1. SMR Compatibility

Can the SMR host interface be adapted to open-channel SSDs? For example, the interface may be exposed as a single-level set of zones, which ignores the channel and LUN concepts for simplicity. Another approach might be to extend the SMR implementation's sysfs entries to expose the hierarchy of the device (channels with X LUNs, where each LUN has a set of zones).

2. How to expose the tens of LUNs that OCSSDs have?

An open-channel SSD typically has 64-256 LUNs that each act as a parallel unit. How can these be efficiently exposed? One may expose them as separate namespaces/partitions. For a DAS box with 24 drives, that would be 1536-6144 separate LUNs to manage, and that many LUNs will blow up the host with gendisk instances. If we do go that way, however, we get an excellent 1:1 mapping between the SMR interface and the OCSSD interface. On the other hand, one could expose the device LUNs within a single LBA address space and lay the LUNs out linearly. In that case, the block layer may expose a variable that enables applications to understand this hierarchy, mainly the channels with their LUNs. Any warm feelings towards this? Currently, a shortcut is taken with the geometry and hierarchy, which are exposed through the /lightnvm sysfs entries. These (or a subset thereof) could be moved to the block layer /queue directory. If the LUNs are kept exposed on the same gendisk, vector I/Os become a viable path:

3. Vector I/Os

To derive parallelism from an open-channel SSD (and from SSDs in parallel), one needs to access them in parallel. Parallelism is achieved either by issuing I/Os to each LUN (similar to driving multiple SSDs today) or by passing a vector interface (encapsulating a list of LBAs, a length, and a data buffer) into the kernel. The latter approach allows I/Os to be vectorized and sent as a single unit to hardware. Implementing this in generic block layer code might be overkill if only open-channel SSDs use it. I would like to hear about other use cases (e.g., preadv/pwritev, file systems, virtio?) that can take advantage of vectored I/Os. If it makes sense, then at which level should it be implemented: bio/request level, SGLs, or a new structure? Device drivers that support vectored I/Os should be able to opt into the interface, while the block layer may automatically unroll vectors for device drivers that don't have the support. What has the history been in the Linux kernel regarding vector I/Os? What were the reasons in the past that such an interface was not adopted?

I will post RFC SMR patches before LSF/MM, so that we have firm ground to discuss how it may be integrated.

-- Besides OCSSDs, I would also like to participate in the discussions of XCOPY, NVMe, multipath, and multi-queue interrupt management.

-Matias

_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-02 23:12 ` Viacheslav Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Viacheslav Dubeyko @ 2017-01-02 23:12 UTC (permalink / raw) To: Matias Bjørling, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
> Hi,
>
> The open-channel SSD subsystem is maturing, and drives are beginning
> to
> become available on the market.

What do you mean? We still have nothing on the market. I haven't had the opportunity to access any such device. Could you share your knowledge of where and what devices can be bought on the market?

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-02 23:12 ` Viacheslav Dubeyko @ 2017-01-03 8:56 ` Matias Bjørling 0 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-03 8:56 UTC (permalink / raw) To: Viacheslav Dubeyko, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

On 01/03/2017 12:12 AM, Viacheslav Dubeyko wrote:
> On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
>> Hi,
>>
>> The open-channel SSD subsystem is maturing, and drives are beginning
>> to
>> become available on the market.
>
> What do you mean? We still have nothing on the market. I haven't
> opportunity to access to any of such device. Could you share your
> knowledge where and what device can be bought on the market?
>
Hi Vyacheslav,

You are right that they are not available off the shelf at a convenience store. You may contact one of these vendors for availability: CNEX Labs (Westlake LightNVM SDK), Radian Memory Systems (RMS-325), and/or EMC (OX Controller + Dragon Fire card).

> Thanks,
> Vyacheslav Dubeyko.
>

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-03 17:35 ` Viacheslav Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Viacheslav Dubeyko @ 2017-01-03 17:35 UTC (permalink / raw) To: Matias Bjørling, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme, Vyacheslav.Dubeyko

Hi Matias,

On Tue, 2017-01-03 at 09:56 +0100, Matias Bjørling wrote:
> On 01/03/2017 12:12 AM, Viacheslav Dubeyko wrote:
> >
> > On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
> > >
> > > Hi,
> > >
> > > The open-channel SSD subsystem is maturing, and drives are
> > > beginning
> > > to
> > > become available on the market.
> > What do you mean? We still have nothing on the market. I haven't
> > opportunity to access to any of such device. Could you share your
> > knowledge where and what device can be bought on the market?
> >
> Hi Vyacheslav,
>
> You are right that they are not available off the shelf at a
> convenient
> store. You may contact one of these vendors for availability: CNEX
> Labs
> (Westlake LightNVM SDK), Radian Memory Systems (RMS-325), and/or EMC
> (OX
> Controller + Dragon Fire card).

We, Western Digital, contacted CNEX Labs about half a year ago. Our request was refused. We also contacted Radian Memory Systems about a year ago; our negotiations finished with no success at all. And I doubt that EMC will share anything with us. So, this situation looks really weird, especially for an open-source community. We cannot access or test any open-channel SSD, neither for money nor under NDA. Usually, open source means that everybody has access to the hardware and we can discuss implementation, architecture, or approach without any restrictions. But we have no access to the hardware right now. I understand the business model and blah, blah, blah. But it looks like, finally, we have nothing like an open-channel SSD on the market, from my personal point of view. And I suppose it's a really tricky thing to discuss a software interface, or any other details, of something that doesn't exist at all. Because if I cannot take and test some hardware, then I cannot form my own opinion about this technology.

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-03 17:35 ` Viacheslav Dubeyko @ 2017-01-03 19:10 ` Matias Bjørling 0 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-03 19:10 UTC (permalink / raw) To: Viacheslav Dubeyko, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme, Vyacheslav.Dubeyko

On 01/03/2017 06:35 PM, Viacheslav Dubeyko wrote:
> Hi Matias,
>
> We, Western Digital, contacted with CNEX Labs about a half year ago.
> Our request was refused. Also we contacted with Radian Memory Systems
> about a year ago. Our negotiations finished with no sucess at all. And
> I doubt that EMC will share with us something. So, such situation looks
> really weird, especially for the case of open-source community. We
> cannot access or test any Open-channel SSD nor for money nor under NDA.
> Usually, open-source means that everybody has access to hardware and we
> can discuss implementation, architecture or approach without any
> restrictions. But we haven't access to hardware right now. I understand
> the business model and blah, blah, blah. But it looks like that,
> finally, we have nothing like Open-channel SSD on the market, from my
> personal point of view. And I suppose that it's really tricky way to
> discuss software interface or any other details about something that
> doesn't exist at all. Because if I cannot take and test some hardware
> then I cannot build my own opinion about this technology.
>

I understand your frustration. It is annoying not having easy access to hardware. As you probably are aware, it is similar with host-managed SMR drives: there are customers that use your drives, while the drives are not available off the shelf.

All of the open-channel SSD work is done in the open. Patches, new targets, and so forth are being developed for everyone to see. Similarly, the NVMe host interface is developed in the open as well. The interface allows one to implement supporting firmware. The "front-end" of the FTL on the SSD is removed, and the "back-end" engine is exposed. It is not much work, and HGST already has an SSD firmware implementation. I bet you guys can whip up an internal implementation in a matter of weeks. If you choose to do so, I will bend over backwards to help you sort out any quirks that might arise.

Another option is to use the qemu extension. We are improving it continuously to make sure it follows the behavior of real hardware OCSSDs. Today we do 90% of our FTL work using qemu, and most of the time the FTL code just works when we run it on real hardware.

Similarly to vendors that provide new CPUs, NVDIMMs, and graphics drivers, some code and refactoring go in years in advance. What I am proposing here is to discuss how OCSSDs fit into the storage stack, and what we can do to improve it. Optimally, most of the lightnvm subsystem can be removed by exposing vectored I/Os, which then enables a target to be implemented as a traditional device mapper module. That would be great!

^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-04 2:59 ` Slava Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Slava Dubeyko @ 2017-01-04 2:59 UTC (permalink / raw) To: Matias Bjørling, Viacheslav Dubeyko, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme

-----Original Message-----
From: Matias Bjørling [mailto:m@bjorling.me]
Sent: Tuesday, January 3, 2017 11:11 AM
To: Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> All of the open-channel SSD work is done in the open. Patches, new
> targets, and so forth are developed for everyone to see. Similarly,
> the NVMe host interface is developed in the open as well. The
> interface allows one to implement supporting firmware. The "front-end"
> of the FTL on the SSD is removed, and the "back-end" engine is
> exposed. It is not much work, given that HGST already has an SSD
> firmware implementation. I bet you guys can whip up an internal
> implementation in a matter of weeks. If you choose to do so, I will
> bend over backwards to help you sort out any quirks that might come up.

I see your point. But I am a research guy and I have a software project, so it's completely unreasonable for me to spend my time on SSD firmware. I simply need ready-made hardware for testing/benchmarking my software and for checking the assumptions that were made. That's all. If I don't have the hardware right now, then I need to wait for better times.

> Another option is to use the qemu extension. We are improving it
> continuously to make sure it follows the implementation of real
> hardware OCSSDs. Today we do 90% of our FTL work using qemu, and most
> of the time it just works when we run the FTL code on real hardware.

I really dislike using qemu for file system benchmarking.

> Similarly to vendors that provide new CPUs, NVDIMMs, and graphics
> drivers, some code and refactoring go in years in advance. What I am
> proposing here is to discuss how OCSSDs fit into the storage stack,
> and what we can do to improve it. Optimally, most of the lightnvm
> subsystem can be removed by exposing vectored I/Os, which then enables
> a target to be implemented as a traditional device mapper module. That
> would be great!

OK. From one point of view, I like the idea of SMR compatibility. But, from another point of view, I am slightly skeptical about such an approach. I believe you see the bright side of your suggestion, so let me take a look at your approach from the dark side.

What's the goal of SMR compatibility? Any unification or interface abstraction has the goal of hiding the peculiarities of the underlying hardware. But we already have the block device abstraction that hides all of the hardware's peculiarities perfectly. Also, an FTL (or any other translation layer) is able to represent the device as a sequence of physical sectors without any real knowledge on the software side about the sophisticated management activity on the device side. And, finally, people will be completely happy to use the regular file systems (ext4, xfs) without any necessity to modify the software stack. But I believe that the goal of the open-channel SSD approach is the complete opposite: namely, to provide the opportunity for the software side (a file system, for example) to manage the open-channel SSD device with a smarter policy.

So, my key worry is that trying to hide two different technologies (SMR and NAND flash) under the same interface will result in the loss of the opportunity to manage the device in a smarter way. Any unification has the goal of creating a simple interface, but SMR and NAND flash are significantly different technologies. And if somebody creates a technology-oriented file system, for example, then it needs access to the really special features of that technology. Otherwise, the interface will be overloaded by the features of both technologies and will look like a mess.

An SMR zone and a NAND flash erase block look comparable but are, in the end, significantly different things. Usually, an SMR zone is 256 MB in size, but a NAND flash erase block can vary from 512 KB to 8 MB (it will be slightly larger in the future, but not more than 32 MB, I suppose). It is possible to group several erase blocks into an aggregated entity, but that could be a poor policy from the file system's point of view. Another point is that QLC devices could have trickier erase block management. Also, we have to apply the erase operation to a NAND flash erase block, but that is not mandatory for an SMR zone, because an SMR zone can simply be re-written in sequential order if all of the zone's data is invalid, for example. The conventional zone could also be a really tricky point, because it is the one zone of the whole device that can be updated in-place. Raw NAND flash usually has no comparable conventional zone.

Finally, if I really want to develop an SMR- or NAND-flash-oriented file system, then I would like to play with the peculiarities of the concrete technologies, and any unified interface will destroy the opportunity to create a really efficient solution. In the end, if my software solution is unable to provide some fancy and efficient features, then people will prefer to use the regular stack (ext4, xfs + block layer).

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-04 2:59 ` Slava Dubeyko @ 2017-01-04 7:24 ` Damien Le Moal -1 siblings, 0 replies; 63+ messages in thread From: Damien Le Moal @ 2017-01-04 7:24 UTC (permalink / raw) To: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme

Slava,

On 1/4/17 11:59, Slava Dubeyko wrote:
> What's the goal of SMR compatibility? Any unification or interface
> abstraction has the goal of hiding the peculiarities of the underlying
> hardware. But we already have the block device abstraction that hides
> all of the hardware's peculiarities perfectly. Also, an FTL (or any
> other translation layer) is able to represent the device as a sequence
> of physical sectors without any real knowledge on the software side
> about the sophisticated management activity on the device side. And,
> finally, people will be completely happy to use the regular file
> systems (ext4, xfs) without any necessity to modify the software
> stack. But I believe that the goal of the open-channel SSD approach is
> the complete opposite: namely, to provide the opportunity for the
> software side (a file system, for example) to manage the open-channel
> SSD device with a smarter policy.

The Zoned Block Device API is part of the block layer. As such, it does abstract many aspects of the device characteristics, as so many other APIs of the block layer do (look at the blkdev_issue_discard or zeroout implementations to see how far this can be pushed).

Regarding the use of open-channel SSDs, I think you are absolutely correct: (1) some users may be very happy to use a regular, unmodified ext4 or xfs on top of an open-channel SSD, as long as the FTL implementation does a complete abstraction of the device's special features and presents a regular block device to the upper layers. And conversely, (2) some file system implementations may prefer to directly use those special features and characteristics of open-channel SSDs. No arguing with this.

But you are missing the parallel with SMR. For SMR, or more correctly zoned block devices, since the ZBC and ZAC standards can equally apply to HDDs and SSDs, three models exist: drive-managed, host-aware and host-managed. Case (1) above corresponds *exactly* to the drive-managed model, with the difference that the abstraction of the device characteristics (SMR here) is in the drive FW and not in a host-level FTL implementation as it would be for open-channel SSDs. Case (2) above corresponds to the host-managed model, that is, the device user has to deal with the device characteristics itself and use them correctly. The host-aware model lies in between these two extremes: it offers the possibility of complete abstraction by default, but also allows a user to optimize its operation for the device by allowing access to the device characteristics. So this would correspond to a possible third way of implementing an FTL for open-channel SSDs.

> So, my key worry is that trying to hide two different technologies
> (SMR and NAND flash) under the same interface will result in the loss
> of the opportunity to manage the device in a smarter way. Any
> unification has the goal of creating a simple interface, but SMR and
> NAND flash are significantly different technologies. And if somebody
> creates a technology-oriented file system, for example, then it needs
> access to the really special features of that technology. Otherwise,
> the interface will be overloaded by the features of both technologies
> and will look like a mess.

I do not think so, as long as the device "model" is exposed to the user, as the zoned block device interface does. This allows a user to adjust its operation depending on the device. This is true of course as long as each "model" has a clearly defined set of features associated. Again, that is the case for zoned block devices, and an example of how this can be used is now in f2fs (which allows different operation modes for host-aware devices, but only one mode for host-managed devices). Again, I can see a clear parallel with open-channel SSDs here.

> An SMR zone and a NAND flash erase block look comparable but are, in
> the end, significantly different things. Usually, an SMR zone is 256
> MB in size, but a NAND flash erase block can vary from 512 KB to 8 MB
> (it will be slightly larger in the future, but not more than 32 MB, I
> suppose). It is possible to group several erase blocks into an
> aggregated entity, but that could be a poor policy from the file
> system's point of view.

Why not? For f2fs, the 2MB segments are grouped together into sections with a size matching the device zone size. That works well and can actually even reduce the garbage collection overhead in some cases. Nothing in the kernel zoned block device support limits the zone size to a particular minimum or maximum. The only direct implication of the zone size on the block I/O stack is that BIOs and requests cannot cross zone boundaries. In an extreme setup, a zone size of 4KB would work too and result in read/write commands of 4KB at most to the device.

> Another point is that QLC devices could have trickier erase block
> management. Also, we have to apply the erase operation to a NAND flash
> erase block, but that is not mandatory for an SMR zone.

Incorrect: host-managed devices require a zone "reset" (equivalent to discard/trim) to be reused after being written once. So again, the "tricky features" you mention will depend on the device "model", whatever this ends up being for an open-channel SSD.

> Because an SMR zone can simply be re-written in sequential order if
> all of the zone's data is invalid, for example. The conventional zone
> could also be a really tricky point, because it is the one zone of the
> whole device that can be updated in-place. Raw NAND flash usually has
> no comparable conventional zone.

Conventional zones are optional in zoned block devices. There may be none at all, and an implementation may well decide to not support a device without any conventional zones if some are required. In the case of open-channel SSDs, the FTL implementation may well decide to expose a particular range of LBAs as "conventional zones" and have a lower-level exposure for the remaining capacity, which can then be optimally used by the file system based on the features available for that remaining LBA range. Again, a parallel is possible with SMR.

> Finally, if I really want to develop an SMR- or NAND-flash-oriented
> file system, then I would like to play with the peculiarities of the
> concrete technologies, and any unified interface will destroy the
> opportunity to create a really efficient solution. In the end, if my
> software solution is unable to provide some fancy and efficient
> features, then people will prefer to use the regular stack (ext4, xfs
> + block layer).

Not necessarily. Again, think in terms of the device "model" and its associated feature set. An FS implementation may decide to support all possible models, likely with incredible resulting complexity. More likely, similarly to what is happening with SMR, only the models that make sense will be supported by FS implementations that can be easily modified. Take f2fs again as an example: the changes to support SMR were rather simple, whereas the initial effort to support SMR with ext4 was pretty much abandoned, as it was too complex to integrate in the existing code while keeping the existing on-disk format. Your argument above is actually making the same point: you want your implementation to use the device features directly. That is, your implementation wants a "host-managed"-like device model. Using ext4 would require a "host-aware" or "drive-managed" model, which could be provided through a different FTL or device-mapper implementation in the case of open-channel SSDs.

I am not trying to argue that open-channel SSDs and zoned block devices should be supported under the exact same API. But I can definitely see clear parallels worth a discussion. As a first step, I would suggest trying to define open-channel SSD "models" and their feature sets, seeing how these fit with the existing ZBC/ZAC-defined models, and at least estimating the implications on the block I/O stack. If adding the new models only results in the addition of a few top-level functions or ioctls, it may be entirely feasible to integrate the two together.

Best regards.

--
Damien Le Moal, Ph.D. Sr Manager, System Software Research Group, Western Digital Damien.LeMoal@hgst.com Tel: (+81) 0466-98-3593 (Ext. 51-3593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread
* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-04 7:24 ` Damien Le Moal 0 siblings, 0 replies; 63+ messages in thread From: Damien Le Moal @ 2017-01-04 7:24 UTC (permalink / raw) Slava, On 1/4/17 11:59, Slava Dubeyko wrote: > What's the goal of SMR compatibility? Any unification or interface > abstraction has the goal to hide the peculiarities of underlying > hardware. But we have block device abstraction that hides all > hardware's peculiarities perfectly. Also FTL (or any other > Translation Layer) is able to represent the device as sequence of > physical sectors without real knowledge on software side about > sophisticated management activity on the device side. And, finally, > guys will be completely happy to use the regular file systems (ext4, > xfs) without necessity to modify software stack. But I believe that > the goal of Open-channel SSD approach is completely opposite. Namely, > provide the opportunity for software side (file system, for example) > to manage the Open-channel SSD device with smarter policy. The Zoned Block Device API is part of the block layer. So as such, it does abstract many aspects of the device characteristics, as so many other API of the block layer do (look at blkdev_issue_discard or zeroout implementations to see how far this can be pushed). Regarding the use of open channel SSDs, I think you are absolutely correct: (1) some users may be very happy to use a regular, unmodified ext4 or xfs on top of an open channel SSD, as long as the FTL implementation does a complete abstraction of the device special features and presents a regular block device to upper layers. And conversely, (2) some file system implementations may prefer to directly use those special features and characteristics of open channel SSDs. No arguing with this. But you are missing the parallel with SMR. 
For SMR, or more correctly zoned block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs, 3 models exists: drive-managed, host-aware and host-managed. Case (1) above corresponds *exactly* to the drive managed model, with the difference that the abstraction of the device characteristics (SMR here) is in the drive FW and not in a host-level FTL implementation as it would be for open channel SSDs. Case (2) above corresponds to the host-managed model, that is, the device user has to deal with the device characteristics itself and use it correctly. The host-aware model lies in between these 2 extremes: it offers the possibility of complete abstraction by default, but also allows a user to optimize its operation for the device by allowing access to the device characteristics. So this would correspond to a possible third way of implementing an FTL for open channel SSDs. > So, my key worry that the trying to hide under the same interface the > two different technologies (SMR and NAND flash) will be resulted in > the loss of opportunity to manage the device in more smarter way. > Because any unification has the goal to create a simple interface. > But SMR and NAND flash are significantly different technologies. And > if somebody creates technology-oriented file system, for example, > then it needs to have access to really special features of the > technology. Otherwise, interface will be overloaded by features of > both technologies and it will looks like as a mess. I do not think so, as long as the device "model" is exposed to the user as the zoned block device interface does. This allows a user to adjust its operation depending on the device. This is true of course as long as each "model" has a clearly defined set of features associated. Again, that is the case for zoned block devices and an example of how this can be used is now in f2fs (which allows different operation modes for host-aware devices, but only one for host-managed devices). 
Again, I can see a clear parallel with open channel SSDs here. > SMR zone and NAND flash erase block look comparable but, finally, it > significantly different stuff. Usually, SMR zone has 265 MB in size > but NAND flash erase block can vary from 512 KB to 8 MB (it will be > slightly larger in the future but not more than 32 MB, I suppose). It > is possible to group several erase blocks into aggregated entity but > it could be not very good policy from file system point of view. Why not? For f2fs, the 2MB segments are grouped together into sections with a size matching the device zone size. That works well and can actually even reduce the garbage collection overhead in some cases. Nothing in the kernel zoned block device support limits the zone size to a particular minimum or maximum. The only direct implication of the zone size on the block I/O stack is that BIOs and requests cannot cross zone boundaries. In an extreme setup, a zone size of 4KB would work too and result in read/write commands of 4KB at most to the device. > Another point that QLC device could have more tricky features of > erase blocks management. Also we should apply erase operation on NAND > flash erase block but it is not mandatory for the case of SMR zone. Incorrect: host-managed devices require a zone "reset" (equivalent to discard/trim) to be reused after being written once. So again, the "tricky features" you mention will depend on the device "model", whatever this ends up to be for an open channel SSD. > Because SMR zone could be simply re-written in sequential order if > all zone's data is invalid, for example. Also conventional zone could > be really tricky point. Because it is one zone only for the whole > device that could be updated in-place. Raw NAND flash, usually, > hasn't likewise conventional zone. Conventional zones are optional in zoned block devices. 
There may be none at all and an implementation may well decide to not support a device without any conventional zones if some are required. In the case of open channel SSDs, the FTL implementation may well decide to expose a particular range of LBAs as "conventional zones" and have a lower level exposure for the remaining capacity whcih can then be optimally used by the file system based on the features available for that remaining LBA range. Again, a parallel is possible with SMR. > Finally, if I really like to develop SMR- or NAND flash oriented file > system then I would like to play with peculiarities of concrete > technologies. And any unified interface will destroy the opportunity > to create the really efficient solution. Finally, if my software > solution is unable to provide some fancy and efficient features then > guys will prefer to use the regular stack (ext4, xfs + block layer). Not necessarily. Again think in terms of device "model" and associated feature set. An FS implementation may decide to support all possible models, with likely a resulting incredible complexity. More likely, similarly with what is happening with SMR, only models that make sense will be supported by FS implementation that can be easily modified. Example again here of f2fs: changes to support SMR were rather simple, whereas the initial effort to support SMR with ext4 was pretty much abandoned as it was too complex to integrate in the existing code while keeping the existing on-disk format. Your argument above is actually making the same point: you want your implementation to use the device features directly. That is, your implementation wants a "host-managed" like device model. Using ext4 will require a "host-aware" or "drive-managed" model, which could be provided through a different FTL or device-mapper implementation in the case of open channel SSDs. I am not trying to argue that open channel SSDs and zoned block devices should be supported under the exact same API. 
But I can definitely see clear parallels worth a discussion. As a first step, I would suggest trying to define open channel SSD "models" and their feature set and see how these fit with the existing ZBC/ZAC defined models and at least estimate the implications on the block I/O stack. If adding the new models only results in the addition of a few top level functions or ioctls, it may be entirely feasible to integrate the two together. Best regards. -- Damien Le Moal, Ph.D. Sr Manager, System Software Research Group, Western Digital Damien.LeMoal at hgst.com Tel: (+81) 0466-98-3593 (Ext. 51-3593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com ^ permalink raw reply [flat|nested] 63+ messages in thread
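The zone-size discussion above (the block layer's only hard constraint being that BIOs and requests must not cross zone boundaries, plus f2fs grouping 2 MB segments into sections matching the zone size) can be sketched as a toy Python model; the 256 MB zone size is an example value, not something mandated by the kernel:

```python
ZONE_SIZE = 256 << 20     # example zone size in bytes; real devices report their own
SEGMENT_SIZE = 2 << 20    # f2fs segment size

def split_at_zone_boundaries(offset, length, zone_size=ZONE_SIZE):
    """Split an I/O [offset, offset+length) into pieces that each stay
    inside a single zone, mirroring the block layer rule that a
    BIO/request may not cross a zone boundary."""
    pieces = []
    while length > 0:
        room = zone_size - (offset % zone_size)  # bytes left in the current zone
        chunk = min(length, room)
        pieces.append((offset, chunk))
        offset += chunk
        length -= chunk
    return pieces

# f2fs-style grouping: segments aggregated into sections whose size
# matches the device zone size.
segments_per_section = ZONE_SIZE // SEGMENT_SIZE  # 128 for a 256 MB zone
```

An I/O that straddles a zone boundary comes back as two pieces, one per zone; an I/O contained in a single zone is returned unchanged.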
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-04 7:24 ` Damien Le Moal @ 2017-01-04 12:39 ` Matias Bjørling -1 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-04 12:39 UTC (permalink / raw) To: Damien Le Moal, Slava Dubeyko, Viacheslav Dubeyko, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme On 01/04/2017 08:24 AM, Damien Le Moal wrote: > > Slava, > > On 1/4/17 11:59, Slava Dubeyko wrote: >> What's the goal of SMR compatibility? Any unification or interface >> abstraction has the goal to hide the peculiarities of underlying >> hardware. But we have block device abstraction that hides all >> hardware's peculiarities perfectly. Also FTL (or any other >> Translation Layer) is able to represent the device as sequence of >> physical sectors without real knowledge on software side about >> sophisticated management activity on the device side. And, finally, >> guys will be completely happy to use the regular file systems (ext4, >> xfs) without necessity to modify software stack. But I believe that >> the goal of Open-channel SSD approach is completely opposite. Namely, >> provide the opportunity for software side (file system, for example) >> to manage the Open-channel SSD device with smarter policy. > > The Zoned Block Device API is part of the block layer. So as such, it > does abstract many aspects of the device characteristics, as so many > other API of the block layer do (look at blkdev_issue_discard or zeroout > implementations to see how far this can be pushed). > > Regarding the use of open channel SSDs, I think you are absolutely > correct: (1) some users may be very happy to use a regular, unmodified > ext4 or xfs on top of an open channel SSD, as long as the FTL > implementation does a complete abstraction of the device special > features and presents a regular block device to upper layers. 
And > conversely, (2) some file system implementations may prefer to directly > use those special features and characteristics of open channel SSDs. No > arguing with this. > > But you are missing the parallel with SMR. For SMR, or more correctly > zoned block devices since the ZBC or ZAC standards can equally apply to > HDDs and SSDs, 3 models exist: drive-managed, host-aware and host-managed. > > Case (1) above corresponds *exactly* to the drive managed model, with > the difference that the abstraction of the device characteristics (SMR > here) is in the drive FW and not in a host-level FTL implementation as > it would be for open channel SSDs. Case (2) above corresponds to the > host-managed model, that is, the device user has to deal with the device > characteristics itself and use it correctly. The host-aware model lies > in between these 2 extremes: it offers the possibility of complete > abstraction by default, but also allows a user to optimize its operation > for the device by allowing access to the device characteristics. So this > would correspond to a possible third way of implementing an FTL for open > channel SSDs. > >> So, my key worry that the trying to hide under the same interface the >> two different technologies (SMR and NAND flash) will be resulted in >> the loss of opportunity to manage the device in more smarter way. >> Because any unification has the goal to create a simple interface. >> But SMR and NAND flash are significantly different technologies. And >> if somebody creates technology-oriented file system, for example, >> then it needs to have access to really special features of the >> technology. Otherwise, interface will be overloaded by features of >> both technologies and it will looks like as a mess. > > I do not think so, as long as the device "model" is exposed to the user > as the zoned block device interface does. This allows a user to adjust > its operation depending on the device.
This is true of course as long as > each "model" has a clearly defined set of features associated. Again, > that is the case for zoned block devices and an example of how this can > be used is now in f2fs (which allows different operation modes for > host-aware devices, but only one for host-managed devices). Again, I can > see a clear parallel with open channel SSDs here. > >> SMR zone and NAND flash erase block look comparable but, finally, it >> significantly different stuff. Usually, SMR zone has 265 MB in size >> but NAND flash erase block can vary from 512 KB to 8 MB (it will be >> slightly larger in the future but not more than 32 MB, I suppose). It >> is possible to group several erase blocks into aggregated entity but >> it could be not very good policy from file system point of view. > > Why not? For f2fs, the 2MB segments are grouped together into sections > with a size matching the device zone size. That works well and can > actually even reduce the garbage collection overhead in some cases. > Nothing in the kernel zoned block device support limits the zone size to > a particular minimum or maximum. The only direct implication of the zone > size on the block I/O stack is that BIOs and requests cannot cross zone > boundaries. In an extreme setup, a zone size of 4KB would work too and > result in read/write commands of 4KB at most to the device. > >> Another point that QLC device could have more tricky features of >> erase blocks management. Also we should apply erase operation on NAND >> flash erase block but it is not mandatory for the case of SMR zone. > > Incorrect: host-managed devices require a zone "reset" (equivalent to > discard/trim) to be reused after being written once. So again, the > "tricky features" you mention will depend on the device "model", > whatever this ends up to be for an open channel SSD. > >> Because SMR zone could be simply re-written in sequential order if >> all zone's data is invalid, for example. 
Also conventional zone could >> be really tricky point. Because it is one zone only for the whole >> device that could be updated in-place. Raw NAND flash, usually, >> hasn't likewise conventional zone. > > Conventional zones are optional in zoned block devices. There may be > none at all and an implementation may well decide to not support a > device without any conventional zones if some are required. > In the case of open channel SSDs, the FTL implementation may well decide > to expose a particular range of LBAs as "conventional zones" and have a > lower level exposure for the remaining capacity which can then be > optimally used by the file system based on the features available for > that remaining LBA range. Again, a parallel is possible with SMR. > >> Finally, if I really like to develop SMR- or NAND flash oriented file >> system then I would like to play with peculiarities of concrete >> technologies. And any unified interface will destroy the opportunity >> to create the really efficient solution. Finally, if my software >> solution is unable to provide some fancy and efficient features then >> guys will prefer to use the regular stack (ext4, xfs + block layer). > > Not necessarily. Again think in terms of device "model" and associated > feature set. An FS implementation may decide to support all possible > models, with likely a resulting incredible complexity. More likely, > similarly with what is happening with SMR, only models that make sense > will be supported by FS implementations that can be easily modified. > Example again here of f2fs: changes to support SMR were rather simple, > whereas the initial effort to support SMR with ext4 was pretty much > abandoned as it was too complex to integrate in the existing code while > keeping the existing on-disk format. > > Your argument above is actually making the same point: you want your > implementation to use the device features directly.
That is, your > implementation wants a "host-managed" like device model. Using ext4 will > require a "host-aware" or "drive-managed" model, which could be provided > through a different FTL or device-mapper implementation in the case of > open channel SSDs. > > I am not trying to argue that open channel SSDs and zoned block devices > should be supported under the exact same API. But I can definitely see > clear parallels worth a discussion. As a first step, I would suggest > trying to define open channel SSD "models" and their feature set > and see how these fit with the existing ZBC/ZAC defined models and at > least estimate the implications on the block I/O stack. If adding the > new models only results in the addition of a few top level functions or > ioctls, it may be entirely feasible to integrate the two together. > Thanks Damien. I couldn't have said it better myself. The OCSSD 1.3 specification has been made with an eye towards the SMR interface: - "Identification" - Follows the same "global" size definitions, and also supports that each zone has its own local size. - "Get Report" command follows a structure very similar to SMR, such that it can sit behind the "Report Zones" interface. - "Erase/Prepare Block" command follows the Reset block interface. Those should fit right in. If the layout is planar, such that the OCSSD only exposes a set of zones, it should be able to fit right into the framework with minor modifications. A couple of details are added when going towards managing multiple parallel units, which are some of the things that require a bit of discussion. ^ permalink raw reply [flat|nested] 63+ messages in thread
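The "planar" layout Matias mentions, where the OCSSD exposes only a linear set of zones, amounts to a simple flattening of the hierarchical (channel, LUN, zone) address. A hypothetical sketch in Python; the geometry parameters are illustrative, not values from the OCSSD 1.3 specification:

```python
def flatten(channel, lun, zone, luns_per_channel, zones_per_lun):
    """Map a hierarchical OCSSD address (channel, LUN, zone) onto the
    single linear zone index a planar, SMR-like report would expose."""
    return (channel * luns_per_channel + lun) * zones_per_lun + zone

def unflatten(index, luns_per_channel, zones_per_lun):
    """Inverse mapping: recover (channel, LUN, zone) from the linear
    zone index, e.g. so an FTL can route the I/O to a parallel unit."""
    lun_index, zone = divmod(index, zones_per_lun)
    channel, lun = divmod(lun_index, luns_per_channel)
    return channel, lun, zone
```

With such a mapping the hierarchy is still recoverable by the host (e.g. via sysfs geometry attributes), while the reported zone list stays one-dimensional.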
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-04 7:24 ` Damien Le Moal @ 2017-01-04 16:57 ` Theodore Ts'o -1 siblings, 0 replies; 63+ messages in thread From: Theodore Ts'o @ 2017-01-04 16:57 UTC (permalink / raw) To: Damien Le Moal Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme I agree with Damien, but I'd also add that in the future there may very well be some new Zone types added to the ZBC model. So we shouldn't assume that the ZBC model is a fixed one. And who knows? Perhaps T10 standards body will come up with a simpler model for interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC model --- or not. Either way, that's not really relevant as far as the Linux block layer is concerned, since the Linux block layer is designed to be an abstraction on top of hardware --- and in some cases we can use a similar abstraction on top of eMMC's, SCSI's, and SATA's implementation definition of TRIM/DISCARD/WRITE SAME/SECURE TRIM/QUEUED TRIM, even though they are different in some subtle ways, and may have different performance characteristics and semantics. The trick is to expose similarities where the differences won't matter to the upper layers, but also to expose the fine distinctions and allow the file system and/or user space to use the protocol-specific differences when it matters to them. Designing that is going to be important, and I can guarantee we won't get it right at first. Which is why it's a good thing that internal kernel interfaces aren't cast into concrete, and can be subject to change as new revisions to ZBC, or new interfaces (like perhaps OCSSD's) get promulgated by various standards bodies or by various vendors. > > Another point that QLC device could have more tricky features of > > erase blocks management. Also we should apply erase operation on NAND > > flash erase block but it is not mandatory for the case of SMR zone. 
> > Incorrect: host-managed devices require a zone "reset" (equivalent to > discard/trim) to be reused after being written once. So again, the > "tricky features" you mention will depend on the device "model", > whatever this ends up to be for an open channel SSD. ... and this is exposed by having different zone types (sequential write required vs sequential write preferred vs conventional). And if OCSSD's "zones" don't fit into the current ZBC zone types, we can easily add new ones. I would suggest, however, that we explicitly disclaim that the block device layer's code points for zone types are an exact match with the ZBC zone types numbering, precisely so we can add new zone types that correspond to abstractions from different hardware types, such as OCSSD. > Not necessarily. Again think in terms of device "model" and associated > feature set. An FS implementation may decide to support all possible > models, with likely a resulting incredible complexity. More likely, > similarly with what is happening with SMR, only models that make sense > will be supported by FS implementation that can be easily modified. > Example again here of f2fs: changes to support SMR were rather simple, > whereas the initial effort to support SMR with ext4 was pretty much > abandoned as it was too complex to integrate in the existing code while > keeping the existing on-disk format. I'll note that Abutalib Aghayev and I will be presenting a paper at the 2017 FAST conference detailing a way to optimize ext4 for Host-Aware SMR drives by making a surprisingly small set of changes to ext4's journalling layer, with some very promising performance improvements for certain workloads, which we tested on both Seagate and WD HA drives and achieved 2x performance improvements. Patches are on the unstable portion of the ext4 patch queue, and I hope to get them into an upstream acceptable shape (as opposed to "good enough for a research paper") in the next few months.
So it may very well be that small changes can be made to file systems to support exotic devices if there are ways that we can expose the right information about underlying storage devices, and offering the right abstractions to enable the right kind of minimal I/O tagging, or hints, or commands as necessary such that the changes we do need to make to the file system can be kept small, and kept easily testable even if hardware is not available. For example, by creating device mapper emulators of the feature sets of these advanced storage interfaces that are exposed via the block layer abstractions, whether it be for ZBC zones, or hardware encryption acceleration, etc. Cheers, - Ted ^ permalink raw reply [flat|nested] 63+ messages in thread
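Ted's suggestion to decouple the block layer's zone-type code points from the ZBC numbering can be sketched as follows. A toy model in Python: the ZBC wire values 0x1-0x3 are the ones the standard defines for conventional / sequential-write-required / sequential-write-preferred zones, while the OCSSD-specific type is purely hypothetical:

```python
from enum import Enum, auto

class BlkZoneType(Enum):
    """Block-layer zone types: deliberately NOT the ZBC wire values,
    so new types (e.g. for OCSSD) can be added without needing a ZBC
    standard number."""
    CONVENTIONAL = auto()
    SEQ_WRITE_REQUIRED = auto()
    SEQ_WRITE_PREFERRED = auto()
    OCSSD_PARALLEL_UNIT = auto()   # hypothetical non-ZBC zone type

# Translation done once at the driver boundary (e.g. in the SCSI/ATA
# driver when parsing REPORT ZONES); in-kernel users only ever see
# BlkZoneType, never raw ZBC numbers.
ZBC_TO_BLK = {
    0x1: BlkZoneType.CONVENTIONAL,
    0x2: BlkZoneType.SEQ_WRITE_REQUIRED,
    0x3: BlkZoneType.SEQ_WRITE_PREFERRED,
}

def zone_type_from_zbc(wire_value):
    return ZBC_TO_BLK[wire_value]
```

An OCSSD driver would then fill in `BlkZoneType` values directly, with no ZBC translation table involved.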
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-04 16:57 ` Theodore Ts'o @ 2017-01-10 1:42 ` Damien Le Moal -1 siblings, 0 replies; 63+ messages in thread From: Damien Le Moal @ 2017-01-10 1:42 UTC (permalink / raw) To: Theodore Ts'o Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme Ted, On 1/5/17 01:57, Theodore Ts'o wrote: > I agree with Damien, but I'd also add that in the future there may > very well be some new Zone types added to the ZBC model. So we > shouldn't assume that the ZBC model is a fixed one. And who knows? > Perhaps T10 standards body will come up with a simpler model for > interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC > model --- or not. Totally agree. There is already some activity in T10 for a ZBC V2 standard which indeed may include new zone types (for instance a "circular buffer" zone type that can be sequentially rewritten without a reset, preserving previously written data for reads after the write pointer). Such a zone type could be a perfect match for an FS journal log space for instance. > Either way, that's not really relevant as far as the Linux block layer > is concerned, since the Linux block layer is designed to be an > abstraction on top of hardware --- and in some cases we can use a > similar abstraction on top of eMMC's, SCSI's, and SATA's > implementation definition of TRIM/DISCARD/WRITE SAME/SECURE > TRIM/QUEUED TRIM, even though they are different in some subtle ways, > and may have different performance characteristics and semantics. > > The trick is to expose similarities where the differences won't matter > to the upper layers, but also to expose the fine distinctions and > allow the file system and/or user space to use the protocol-specific > differences when it matters to them. Absolutely. The initial zoned block device support was written to match what ZBC/ZAC defines.
It was simple this way and there were no other users of the zone concept. But the device models and zone types are just numerical values reported to the device user. The block I/O stack currently does not use these values beyond the device initialization. It is up to the users (e.g. FS) of the device to determine what to do to correctly use the device according to the types reported. So this basic design is definitely extensible to new zone types and device models. > > Designing that is going to be important, and I can guarantee we won't > get it right at first. Which is why it's a good thing that internal > kernel interfaces aren't cast into concrete, and can be subject to > change as new revisions to ZBC, or new interfaces (like perhaps > OCSSD's) get promulgated by various standards bodies or by various > vendors. Indeed. The ZBC case was simple as we matched the standard defined models, which in any case are not really used in any way directly by the block I/O stack itself. Only upper layers use that. In the case of OCSSD, this adds one hardware-defined model set by the standard, plus a potential collection of software defined models through different FTL implementations on the host. Getting these models and their API right will indeed be tricky. In a first step, providing a ZBC-like host-aware model and a host-managed model may be a good idea as upper layer code already written for ZBC disks will work out-of-the-box for OCSSDs too. From there, I can see a lot of possibilities for more SSD optimized models though. >>> Another point that QLC device could have more tricky features of >>> erase blocks management. Also we should apply erase operation on NAND >>> flash erase block but it is not mandatory for the case of SMR zone. >> >> Incorrect: host-managed devices require a zone "reset" (equivalent to >> discard/trim) to be reused after being written once.
So again, the >> "tricky features" you mention will depend on the device "model", >> whatever this ends up to be for an open channel SSD. > > ... and this is exposed by having different zone types (sequential > write required vs sequential write preferred vs conventional). And if > OCSSD's "zones" don't fit into the current ZBC zone types, we can > easily add new ones. I would suggest however, that we explicitly > disclaim that the block device layer's code points for zone types is > an exact match with the ZBC zone types numbering, precisely so we can > add new zone types that correspond to abstractions from different > hardware types, such as OCSSD. The struct blk_zone type is 64B in size but currently uses only 32B. So there is room for new fields, and existing fields can have newly defined values too, as the ZBC standard uses only a few of the possible values in the structure fields. >> Not necessarily. Again think in terms of device "model" and associated >> feature set. An FS implementation may decide to support all possible >> models, with likely a resulting incredible complexity. More likely, >> similarly with what is happening with SMR, only models that make sense >> will be supported by FS implementation that can be easily modified. >> Example again here of f2fs: changes to support SMR were rather simple, >> whereas the initial effort to support SMR with ext4 was pretty much >> abandoned as it was too complex to integrate in the existing code while >> keeping the existing on-disk format. > > I'll note that Abutalib Aghayev and I will be presenting a paper at > the 2017 FAST conference detailing a way to optimize ext4 for > Host-Aware SMR drives by making a surprisingly small set of changes to > ext4's journalling layer, with some very promising performance > improvements for certain workloads, which we tested on both Seagate > and WD HA drives and achieved 2x performance improvements.
Patches > are on the unstable portion of the ext4 patch queue, and I hope to get > them into an upstream acceptable shape (as opposed to "good enough for > a research paper") in the next few months. Thank you for the information. I will check this out. Is it the optimization that aggressively delays meta-data updates by allowing reading of meta-data blocks directly from the journal (for blocks that are not yet updated in place)? > So it may very well be that small changes can be made to file systems > to support exotic devices if there are ways that we can expose the > right information about underlying storage devices, and offering the > right abstractions to enable the right kind of minimal I/O tagging, or > hints, or commands as necessary such that the changes we do need to > make to the file system can be kept small, and kept easily testable > even if hardware is not available. > > For example, by creating device mapper emulators of the feature sets > of these advanced storage interfaces that are exposed via the block > layer abstractions, whether it be for ZBC zones, or hardware > encryption acceleration, etc. Emulators may indeed be very useful for development. But we could also go further and implement the different models using device mappers too. Doing so, the same device could be used with different FTLs through the same DM interface. And this may also simplify the implementation of complex models using DM stacking (e.g. the host-aware model can be implemented on top of a host-managed model). Best regards. -- Damien Le Moal, Ph.D. Sr. Manager, System Software Research Group, Western Digital Corporation Damien.LeMoal@wdc.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-10 1:42 ` Damien Le Moal @ 2017-01-10 4:24 ` Theodore Ts'o -1 siblings, 0 replies; 63+ messages in thread From: Theodore Ts'o @ 2017-01-10 4:24 UTC (permalink / raw) To: Damien Le Moal Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme On Tue, Jan 10, 2017 at 10:42:45AM +0900, Damien Le Moal wrote: > Thank you for the information. I will check this out. Is it the > optimization that aggressively delay meta-data update by allowing > reading of meta-data blocks directly from the journal (for blocks that > are not yet updated in place) ? Essentially, yes. In some cases, the metadata might never be written back to its permanent location in disk. Instead, we might copy a metadata block from the tail of the journal to the head, if there is enough space. So effectively, it turns the journal into a log structured store for metadata blocks. Over time, if metadata blocks become cold we can evict them from the journal to make room for more frequently updated metadata blocks (given the currently running workload). Eliminating the random 4k updates to the allocation bitmaps, inode table blocks, etc., really helps with host-aware SMR drives, without requiring the massive amount of changes needed to make a file system compatible with the host-managed model. When you consider that for most files, they are never updated in place, but are usually just replaced, I suspect that with some additional adjustments to a traditional file system's block allocator, we can get even further wins. But that's for future work... > Emulators may indeed be very useful for development. But we could also > go further and implement the different models using device mappers too. > Doing so, the same device could be used with different FTL through the > same DM interface. 
And this may also simplify the implementation of > complex models using DM stacking (e.g. the host-aware model can be > implemented on top of a host-managed model). Yes, indeed. That would also allow people to experiment with how much benefit can be derived if we were to give additional side channel information to STL / FTL's --- since it's much easier to adjust kernel code than to go through negotiations with a HDD/SSD vendor to make firmware changes! This may be an area where if we can create the right framework, and fund some research work, we might be able to get some researchers and their graduate students interested in doing some work in figuring out what sort of divisions of responsibilities and hints back and forth between the storage device and host have the most benefit. Cheers, - Ted ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-10 4:24 ` Theodore Ts'o @ 2017-01-10 13:06 ` Matias Bjorling -1 siblings, 0 replies; 63+ messages in thread From: Matias Bjorling @ 2017-01-10 13:06 UTC (permalink / raw) To: Theodore Ts'o, Damien Le Moal Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme On 01/10/2017 05:24 AM, Theodore Ts'o wrote: > This may be an area where if we can create the right framework, and > fund some research work, we might be able to get some researchers and > their graduate students interested in doing some work in figuring out > what sort of divisions of responsibilities and hints back and forth > between the storage device and host have the most benefit. > That is a good idea. There are a couple of papers at FAST on Open-Channel SSDs this year. They look into the interface and various ways to reduce latency fluctuations. One thing I've heard a couple of times is the feature to move the GC read/write process into the firmware, enabling the host to offload GC data movement while keeping control. Would this be beneficial for SMR? -Matias ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-10 13:06 ` Matias Bjorling @ 2017-01-11 4:07 ` Damien Le Moal -1 siblings, 0 replies; 63+ messages in thread From: Damien Le Moal @ 2017-01-11 4:07 UTC (permalink / raw) To: Matias Bjorling, Theodore Ts'o Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme Matias, On 1/10/17 22:06, Matias Bjorling wrote: > On 01/10/2017 05:24 AM, Theodore Ts'o wrote: >> This may be an area where if we can create the right framework, and >> fund some research work, we might be able to get some researchers and >> their graduate students interested in doing some work in figuring out >> what sort of divisions of responsibilities and hints back and forth >> between the storage device and host have the most benefit. >> > > That is a good idea. There is a couple of papers at FAST with > Open-Channel SSDs this year. They look into the interface and various > ways to reduce latency fluctuations. > > One thing I've heard a couple of times is the feature to move the GC > read/write process into the firmware. Enabling the host to offload GC > data movement, while the keeping the host in control. Would this be > beneficial for SMR? Host-aware SMR drives already have GC internally implemented (for cases when the host does not write sequentially). Host-managed drives do not. As for moving application-specific GC code into the device, well, code injection in the storage device is not for tomorrow, and likely not ever. There are however other clever ways to reduce GC-related host overhead with basic commands. For SCSI, commands such as WRITE SCATTERED and EXTENDED COPY can greatly reduce overhead compared to a simple read+write loop. A better approach to GC offload may not be a "GC" command, but something more generic for moving around LBAs internally within the device. That is, if existing commands are not satisfactory. Best. -- Damien Le Moal, Ph.D. Sr.
Manager, System Software Research Group, Western Digital Corporation Damien.LeMoal@wdc.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-11 4:07 ` Damien Le Moal @ 2017-01-11 6:06 ` Matias Bjorling -1 siblings, 0 replies; 63+ messages in thread From: Matias Bjorling @ 2017-01-11 6:06 UTC (permalink / raw) To: Damien Le Moal, Theodore Ts'o Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme On 01/11/2017 05:07 AM, Damien Le Moal wrote: > > Matias, > > On 1/10/17 22:06, Matias Bjorling wrote: >> On 01/10/2017 05:24 AM, Theodore Ts'o wrote: >>> This may be an area where if we can create the right framework, and >>> fund some research work, we might be able to get some researchers and >>> their graduate students interested in doing some work in figuring out >>> what sort of divisions of responsibilities and hints back and forth >>> between the storage device and host have the most benefit. >>> >> >> That is a good idea. There is a couple of papers at FAST with >> Open-Channel SSDs this year. They look into the interface and various >> ways to reduce latency fluctuations. >> >> One thing I've heard a couple of times is the feature to move the GC >> read/write process into the firmware. Enabling the host to offload GC >> data movement, while the keeping the host in control. Would this be >> beneficial for SMR? > > Host-aware SMR drives already have GC internally implemented (for cases > when the host does not write sequentially). Host-managed drives do not. > As for moving an application specific GC code into the device, well, > code injection in the storage device is not for tomorrow, and likely not > ever. > > There are however other clever ways to reduce GC related host overhead > with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED > COPY, and some others can greatly improve overhead over a simple > read+write loop. 
A better approach to GC offload may not be a "GC" > command, but something more generic for moving around LBAs internally > within the device. That is, if existing commands are not satisfactory. Hi Damien, You're right. I was thinking of something similar to scattered read/write to move data from one place to another. There is no sector-granularity mapping table maintained by the OCSSD, which leaves the logic up to the host. Let me know if you decide to kick off a standardized interface for code injection. Such an interface is long overdue. ;) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-11 4:07 ` Damien Le Moal @ 2017-01-11 7:49 ` Hannes Reinecke -1 siblings, 0 replies; 63+ messages in thread From: Hannes Reinecke @ 2017-01-11 7:49 UTC (permalink / raw) To: Damien Le Moal, Matias Bjorling, Theodore Ts'o Cc: Slava Dubeyko, linux-nvme, linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc On 01/11/2017 05:07 AM, Damien Le Moal wrote: > > Matias, > > On 1/10/17 22:06, Matias Bjorling wrote: >> On 01/10/2017 05:24 AM, Theodore Ts'o wrote: >>> This may be an area where if we can create the right framework, and >>> fund some research work, we might be able to get some researchers and >>> their graduate students interested in doing some work in figuring out >>> what sort of divisions of responsibilities and hints back and forth >>> between the storage device and host have the most benefit. >>> >> >> That is a good idea. There is a couple of papers at FAST with >> Open-Channel SSDs this year. They look into the interface and various >> ways to reduce latency fluctuations. >> >> One thing I've heard a couple of times is the feature to move the GC >> read/write process into the firmware. Enabling the host to offload GC >> data movement, while the keeping the host in control. Would this be >> beneficial for SMR? > > Host-aware SMR drives already have GC internally implemented (for cases > when the host does not write sequentially). Host-managed drives do not. > As for moving an application specific GC code into the device, well, > code injection in the storage device is not for tomorrow, and likely not > ever. > > There are however other clever ways to reduce GC related host overhead > with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED > COPY, and some others can greatly improve overhead over a simple > read+write loop. 
A better approach to GC offload may not be a "GC" > command, but something more generic for moving around LBAs internally > within the device. That is, if existing commands are not satisfactory. > Logical head depop rears its head again... But yes, I think it's more sensible to have I/O functions which help GC (like UNMAP) instead of influencing the GC itself. Anyway. Given the length of this thread I guess this is a worthy topic for LSF. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 63+ messages in thread
* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-11 7:49 ` Hannes Reinecke 0 siblings, 0 replies; 63+ messages in thread From: Hannes Reinecke @ 2017-01-11 7:49 UTC (permalink / raw) On 01/11/2017 05:07 AM, Damien Le Moal wrote: > > Matias, > > On 1/10/17 22:06, Matias Bjorling wrote: >> On 01/10/2017 05:24 AM, Theodore Ts'o wrote: >>> This may be an area where if we can create the right framework, and >>> fund some research work, we might be able to get some researchers and >>> their graduate students interested in doing some work in figuring out >>> what sort of divisions of responsibilities and hints back and forth >>> between the storage device and host have the most benefit. >>> >> >> That is a good idea. There is a couple of papers at FAST with >> Open-Channel SSDs this year. They look into the interface and various >> ways to reduce latency fluctuations. >> >> One thing I've heard a couple of times is the feature to move the GC >> read/write process into the firmware. Enabling the host to offload GC >> data movement, while the keeping the host in control. Would this be >> beneficial for SMR? > > Host-aware SMR drives already have GC internally implemented (for cases > when the host does not write sequentially). Host-managed drives do not. > As for moving an application specific GC code into the device, well, > code injection in the storage device is not for tomorrow, and likely not > ever. > > There are however other clever ways to reduce GC related host overhead > with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED > COPY, and some others can greatly improve overhead over a simple > read+write loop. A better approach to GC offload may not be a "GC" > command, but something more generic for moving around LBAs internally > within the device. That is, if existing commands are not satisfactory. > Logical head depop rears its head again... 
But yes, I think it's more sensible to have I/O functions which help GC (like UNMAP) instead of influencing the GC itself. Anyway. Given the length of this thread I guess this is a worthy topic for LSF. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare at suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 63+ messages in thread
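Hannes's idea of a generic "move these LBAs around internally" primitive, rather than a full GC offload, can be sketched in host code. The model below is purely illustrative: the `toy_dev` structure and `move_lbas` helper are hypothetical, not an existing kernel or SCSI interface; real candidates would be commands like the EXTENDED COPY mentioned above. The point it shows is that the host names *what* moves, while the device does the data movement and invalidates the sources (UNMAP-like).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of a device-internal "move LBAs" command.
 * The host chooses which LBAs to relocate; the "device" copies the data
 * and invalidates the sources, so the host stays in control of placement
 * without doing a read+write loop itself. */
#define LBA_COUNT 64

struct toy_dev {
    uint32_t data[LBA_COUNT];   /* payload per LBA */
    uint8_t  valid[LBA_COUNT];  /* 1 = holds live data */
};

/* Move n source LBAs to a contiguous free run starting at dst.
 * Returns 0 on success, -1 if any destination LBA is still live. */
static int move_lbas(struct toy_dev *dev, const uint32_t *src,
                     size_t n, uint32_t dst)
{
    for (size_t i = 0; i < n; i++)
        if (dev->valid[dst + i])
            return -1;              /* destination must be free */
    for (size_t i = 0; i < n; i++) {
        dev->data[dst + i]  = dev->data[src[i]];
        dev->valid[dst + i] = 1;
        dev->valid[src[i]]  = 0;    /* like UNMAP on the source */
    }
    return 0;
}
```

A host-side GC would then only pick the source list (the policy) and issue one such command, instead of streaming all the data through host memory.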
* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-05 22:58 ` Slava Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Slava Dubeyko @ 2017-01-05 22:58 UTC (permalink / raw) -----Original Message----- From: Damien Le Moal Sent: Tuesday, January 3, 2017 11:25 PM To: Slava Dubeyko <Vyacheslav.Dubeyko at wdc.com>; Matias Bjørling <m at bjorling.me>; Viacheslav Dubeyko <slava at dubeyko.com>; lsf-pc at lists.linux-foundation.org Cc: Linux FS Devel <linux-fsdevel at vger.kernel.org>; linux-block at vger.kernel.org; linux-nvme at lists.infradead.org Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os <skipped> > But you are missing the parallel with SMR. For SMR, or more correctly zoned > block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs, > 3 models exist: drive-managed, host-aware and host-managed. > Case (1) above corresponds *exactly* to the drive managed model, with > the difference that the abstraction of the device characteristics (SMR > here) is in the drive FW and not in a host-level FTL implementation > as it would be for open channel SSDs. Case (2) above corresponds to the host-managed > model, that is, the device user has to deal with the device characteristics > itself and use it correctly. The host-aware model lies in between these 2 extremes: > it offers the possibility of complete abstraction by default, but also allows a user > to optimize its operation for the device by allowing access to the device characteristics. > So this would correspond to a possible third way of implementing an FTL for open channel SSDs. I see your point.
And I think that, historically, we need to distinguish 4 cases for NAND flash: (1) drive-managed: regular file systems (ext4, xfs and so on); (2) host-aware: flash-friendly file systems (NILFS2, F2FS and so on); (3) host-managed: <file systems under implementation>; (4) old-fashioned flash-oriented file systems for raw NAND (jffs, yaffs, ubifs and so on). But, frankly speaking, even regular file systems are slightly flash-aware today because of blkdev_issue_discard (TRIM) or the REQ_META flag. So, the next really important question is: what can/should be exposed for the host-managed and host-aware cases? What's the principal difference between these models? In the end, the difference is not so clear. Let's start with error correction. Only flash-oriented file systems take care of error correction. But I assume that the drive-managed, host-aware and host-managed cases all expect hardware-based error correction. So, we can treat our logical page/block as an ideal byte stream that always contains valid data. We have no difference and no contradiction here. The next point is read disturbance. If the BER of a physical page/block reaches some threshold then we need to move data from one page/block into another one. What subsystem will be responsible for this activity? The drive-managed case expects that the device's GC will manage the read disturbance issue. But what about the host-aware or host-managed cases? If the host side has no information about BER then the host's software is unable to manage this issue. In the end, it sounds like we will have a GC subsystem both on the file system side and on the device side. As a result, it means possible unpredictable performance degradation and decreased device lifetime. Let's imagine that the host-aware case could be unaware of read disturbance management. But how can the host-managed case manage this issue? Bad block management... Drive-managed and host-aware cases should be completely unaware of bad blocks. But what about the host-managed case?
If a device hides bad blocks from the host then it implies a mapping table, access through logical pages/blocks and so on. If the host has no access to bad block management then it's not a host-managed model, and that sounds like a completely unmanageable situation for the host-managed model. Because if the host has access to bad block management (but how?) then we have a really simple model. Otherwise, the host has access to logical pages/blocks only and the device needs internal GC. As a result, it means possible unpredictable performance degradation and decreased device lifetime because of competition between GC on the device side and GC on the host side. Wear leveling... The device will be responsible for managing wear-leveling in the device-managed and host-aware models. It looks like the host side should be responsible for managing wear-leveling in the host-managed case. But that means the host should manage bad blocks and have direct access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by the device's indirection layer and wear-leveling management will be unavailable on the host side. As a result, the device will have internal GC and the traditional issues (possible unpredictable performance degradation and decreased device lifetime). But even if the SSD provides access to all internals, how will a file system be able to implement wear-leveling or bad block management in the course of regular I/O operations? The block device creates an LBA abstraction for us. Does it mean that a software FTL at the block layer level is able to manage SSD internals directly? And, again, the file system cannot manage SSD internals directly in the software FTL case. And where should a software FTL keep its mapping table, for example? So, F2FS and NILFS2 look like the host-aware case because they are LFS file systems oriented toward regular SSDs. So, it could be desirable to have some knowledge (page size, erase block size and so on) about SSD internals.
But, mostly, such knowledge should be shared with the mkfs tool during file system volume creation. The rest looks not very promising and not very different from the device-managed model. Because even if F2FS and NILFS2 have a GC subsystem and mostly look like the LFS case (F2FS has an in-place updated area; NILFS2 has in-place updated superblocks at the beginning/end of the volume), both these file systems still completely rely on the device's indirection layer and GC subsystem. We are still in the same hell of competing GCs. So, what's the point of the host-aware model? So, I am not completely convinced that, in the end, we will have really distinctive features for the device-managed, host-aware and host-managed models. Also I have many questions about the host-managed model if we use the block device abstraction. How can direct management of SSD internals be organized for the host-managed model when it is hidden under the block device abstraction? Another interesting question... Let's imagine that we create a file system volume for one device geometry. It means that geometry details will be stored in the file system metadata during volume creation for the host-aware or host-managed case. Then we back up this volume and restore it on a device with a completely different geometry. So, what will we have in such a case? Performance degradation? Or will we kill the device? > The open-channel SSD interface is very > similar to the one exposed by SMR hard-drives. They both have a set of > chunks (zones) exposed, and zones are managed using open/close logic. > The main difference on open-channel SSDs is that it additionally exposes > multiple sets of zones through a hierarchical interface, which covers a > number of levels (X channels, Y LUNs per channel, Z zones per LUN). I would like to have access to channels/LUNs/zones at the file system level. If, for example, a LUN is associated with a partition, then we will need to aggregate several partitions inside one volume.
First of all, not every file system is ready for aggregating several partitions inside one volume. Secondly, what about aggregating several physical devices inside one volume? It looks slightly tricky to distinguish partitions of the same device from different devices at the file system level, doesn't it? > I agree with Damien, but I'd also add that in the future there may very > well be some new Zone types added to the ZBC model. > So we shouldn't assume that the ZBC model is a fixed one. And who knows? > Perhaps the T10 standards body will come up with a simpler model for > interfacing with SCSI/SATA-attached SSDs that might leverage the ZBC model --- or not. Different zone types are good. But maybe the LUN is the better place for distinguishing the different zone types. Because if a zone can have a type then it's possible to imagine any combination of zones. But mostly zones of some type will be inside some contiguous area (inside a NAND die, for example). So, a LUN looks like a representation of a NAND die. >> SMR zones and NAND flash erase blocks look comparable but are, in the end, >> significantly different things. Usually, an SMR zone is 256 MB in size >> but a NAND flash erase block can vary from 512 KB to 8 MB (it will be >> slightly larger in the future but not more than 32 MB, I suppose). It >> is possible to group several erase blocks into an aggregated entity but >> that could be a poor policy from the file system's point of view. > > Why not? For f2fs, the 2MB segments are grouped together into sections > with a size matching the device zone size. That works well and can actually > even reduce the garbage collection overhead in some cases. > Nothing in the kernel zoned block device support limits the zone size > to a particular minimum or maximum. The only direct implication of the zone > size on the block I/O stack is that BIOs and requests cannot cross zone > boundaries.
> In an extreme setup, a zone size of 4KB would work too > and result in read/write commands of 4KB at most to the device. The situation with grouping segments into sections in the case of F2FS is not so simple. First of all, you need to fill such an aggregation with data. F2FS distinguishes several types of segments, which means the current segment/section will be larger. If you mix different types of segments in one section (but I believe that F2FS doesn't provide the opportunity to do this) then GC overhead could be larger, I suppose. Otherwise, using one section per segment type means that a current section larger than a segment (2MB) will change the speed at which sections fill with different types of data. As a result, it will dramatically change the distribution of different types of sections on the file system volume. Does it reduce GC overhead? I am not sure. And if the file system's segment should be equal to the zone size (for example, the NILFS2 case) then it could mean that you need to prepare the whole segment before a real flush. And if you need to process the O_DIRECT or synchronous mount case then, most probably, you will need to flush a segment with a huge hole. I suppose that could significantly decrease the file system's free space, increase GC activity and decrease device lifetime. >> Another point is that QLC devices could have trickier features of erase >> block management. Also, we must apply an erase operation to a NAND flash >> erase block, but it is not mandatory in the case of an SMR zone. > > Incorrect: host-managed devices require a zone "reset" (equivalent to > discard/trim) to be reused after being written once. So again, the > "tricky features" you mention will depend on the device "model", > whatever this ends up to be for an open channel SSD. OK. But I assume that an SMR zone "reset" is significantly cheaper than a NAND flash block erase operation.
And you can fill your SMR zone with data, then "reset" it and fill it again with data without significant penalty. Also, TRIM and a zone "reset" are different, I suppose, because TRIM looks like a hint for the SSD controller. If the SSD controller receives a TRIM for some erase block, that doesn't mean the erase operation will be done immediately. Usually, it is done in the background because a real erase is an expensive operation. Thanks, Vyacheslav Dubeyko. ^ permalink raw reply [flat|nested] 63+ messages in thread
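The hierarchical geometry quoted above (X channels, Y LUNs per channel, Z zones per LUN) and the proposal's idea of laying the LUNs out linearly in a single LBA space reduce to simple index arithmetic. The sketch below is illustrative only: the `ocssd_geo` structure and function names are made up for this example and do not correspond to the actual lightnvm sysfs layout; it shows how a file system could still recover the parallel unit a zone belongs to from a flat zone index.

```c
#include <stdint.h>

/* Hypothetical flattened view of an open-channel SSD geometry: zones
 * from all channels/LUNs laid out linearly in one index space, the way
 * a single-gendisk exposure of the device might arrange them. */
struct ocssd_geo {
    uint32_t nr_channels;       /* X */
    uint32_t luns_per_channel;  /* Y */
    uint32_t zones_per_lun;     /* Z */
};

/* (channel, lun, zone) -> linear zone index */
static uint64_t zone_to_linear(const struct ocssd_geo *g,
                               uint32_t ch, uint32_t lun, uint32_t zone)
{
    return ((uint64_t)ch * g->luns_per_channel + lun) * g->zones_per_lun + zone;
}

/* linear zone index -> (channel, lun, zone), so the host can still
 * tell which parallel unit a zone lives on and stripe accordingly */
static void linear_to_zone(const struct ocssd_geo *g, uint64_t idx,
                           uint32_t *ch, uint32_t *lun, uint32_t *zone)
{
    *zone = (uint32_t)(idx % g->zones_per_lun);
    idx  /= g->zones_per_lun;
    *lun  = (uint32_t)(idx % g->luns_per_channel);
    *ch   = (uint32_t)(idx / g->luns_per_channel);
}
```

With such a mapping, striping writes across consecutive (channel, LUN) pairs gives the parallelism discussed in point 3 of the original proposal while keeping a single block device.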
* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-05 22:58 ` Slava Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Slava Dubeyko @ 2017-01-05 22:58 UTC (permalink / raw) To: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Theodore Ts'o Cc: Linux FS Devel, linux-block, linux-nvme -----Original Message----- From: Damien Le Moal Sent: Tuesday, January 3, 2017 11:25 PM To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os <skipped> > But you are missing the parallel with SMR. For SMR, or more correctly zoned > block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs, > 3 models exists: drive-managed, host-aware and host-managed. > Case (1) above corresponds *exactly* to the drive managed model, with > the difference that the abstraction of the device characteristics (SMR > here) is in the drive FW and not in a host-level FTL implementation > as it would be for open channel SSDs. Case (2) above corresponds to the host-managed > model, that is, the device user has to deal with the device characteristics > itself and use it correctly. The host-aware model lies in between these 2 extremes: > it offers the possibility of complete abstraction by default, but also allows a user > to optimize its operation for the device by allowing access to the device characteristics. > So this would correspond to a possible third way of implementing an FTL for open channel SSDs. I see your point. 
And I think that, historically, we need to distinguish 4 cases for the case of NAND flash: (1) drive-managed: regular file systems (ext4, xfs and so on); (2) host-aware: flash-friendly file systems (NILFS2, F2FS and so on); (3) host-managed: <file systems under implementation>; (4) old-fashioned flash-oriented file systems for raw NAND (jffs, yaffs, ubifs and so on). But, frankly speaking, even regular file systems are slightly flash-aware today because of blkdev_issue_discard (TRIM) or REQ_META flag. So, the next really important question is: what can/should be exposed for the host-managed and host-aware cases? What's principal difference between these models? And, finally, the difference is not so clear. Let's start from error corrections. Only flash-oriented file systems take care about error corrections. But I assume that drive-managed, host-aware and host-managed cases expect hardware-based error correction. So, we can treat our logical page/block as ideal byte stream that always contains valid data. So, we have no difference and no contradiction here. Next point is read disturbance. If BER of physical page/block achieves some threshold then we need to move data from one page/block into another one. What subsystem will be responsible for this activity? The drive-managed case expects that device's GC will manage read disturbance issue. But what's about host-aware or host-managed case? If the host side hasn't information about BER then the host's software is unable to manage this issue. Finally, it sounds that we will have GC subsystem as on file system side as on device side. As a result, it means possible unpredictable performance degradation and decreasing device lifetime. Let's imagine that host-aware case could be unaware about read disturbance management. But how host-managed case can manage this issue? Bad block management... So, drive-managed and host-aware cases should be completely unaware about bad blocks. But what's about host-managed case? 
If a device hides bad blocks from the host, that implies a mapping table, access through logical pages/blocks, and so on. If the host has no access to bad block management, then it is not really a host-managed model, and that sounds like a completely unmanageable situation for the host-managed case. Because if the host does have access to bad block management (but how?), then we have a really simple model. Otherwise, the host has access to logical pages/blocks only, and the device must have an internal GC. As a result, we again get possible unpredictable performance degradation and reduced device lifetime because of competition between the GC on the device side and the GC on the host side. Wear leveling... The device will be responsible for wear leveling in the drive-managed and host-aware models. It looks like the host side should be responsible for wear leveling in the host-managed case. But that means the host has to manage bad blocks and have direct access to physical pages/blocks. Otherwise, the physical erase blocks will be hidden by the device's indirection layer and wear-leveling management will be unavailable on the host side. As a result, the device will have an internal GC, with the traditional issues (possible unpredictable performance degradation and reduced device lifetime). But even if an SSD provides access to all of its internals, how would a file system be able to implement wear leveling or bad block management in the course of regular I/O operations? The block device creates the LBA abstraction for us. Does that mean a software FTL at the block layer level could manage the SSD internals directly? Then again, the file system cannot manage the SSD internals directly in the software FTL case. And where should a software FTL keep its mapping table, for example? So, F2FS and NILFS2 look like the host-aware case, because they are LFS file systems oriented toward regular SSDs. So, it could be desirable to have some knowledge (page size, erase block size and so on) about the SSD internals.
But such knowledge should mostly be shared with the mkfs tool during file system volume creation. The rest looks not very promising and not very different from the drive-managed model. Even though F2FS and NILFS2 have a GC subsystem and mostly look like the LFS case (F2FS has an in-place updated area; NILFS2 has in-place updated superblocks at the beginning/end of the volume), both of these file systems still rely completely on the device's indirection layer and GC subsystem. We are still in the same hell of competing GCs. So, what is the point of the host-aware model? In the end, I am not completely convinced that we will have really distinctive features across the drive-managed, host-aware and host-managed models. I also have many questions about the host-managed model if we use the block device abstraction. How can direct management of SSD internals be organized for a host-managed model that is hidden under the block device abstraction? Another interesting question... Let's imagine that we create a file system volume for one device geometry. This means the geometry details will be stored in the file system metadata during volume creation for the host-aware or host-managed case. Then we back this volume up and restore it onto a device with a completely different geometry. What will we get in that case? Performance degradation? Or will we kill the device? > The open-channel SSD interface is very > similar to the one exposed by SMR hard-drives. They both have a set of > chunks (zones) exposed, and zones are managed using open/close logic. > The main difference on open-channel SSDs is that it additionally exposes > multiple sets of zones through a hierarchical interface, which covers a > numbers levels (X channels, Y LUNs per channel, Z zones per LUN). I would like to have access to channels/LUNs/zones at the file system level. If, for example, a LUN is associated with a partition, then we will need to aggregate several partitions inside one volume. 
First of all, not every file system is ready to aggregate several partitions inside one volume. Secondly, what about aggregating several physical devices inside one volume? It looks slightly tricky to distinguish partitions of the same device from different devices at the file system level, doesn't it? > I agree with Damien, but I'd also add that in the future there may very > well be some new Zone types added to the ZBC model. > So we shouldn't assume that the ZBC model is a fixed one. And who knows? > Perhaps T10 standards body will come up with a simpler model for > interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC model --- or not. Different zone types are good. But maybe the LUN is the better place for distinguishing different zone types, because if a zone can have a type, then any combination of zones is possible. Mostly, though, zones of a given type will live inside some contiguous area (inside a NAND die, for example). So a LUN looks like a representation of a NAND die. >> SMR zone and NAND flash erase block look comparable but, finally, it >> significantly different stuff. Usually, SMR zone has 265 MB in size >> but NAND flash erase block can vary from 512 KB to 8 MB (it will be >> slightly larger in the future but not more than 32 MB, I suppose). It >> is possible to group several erase blocks into aggregated entity but >> it could be not very good policy from file system point of view. > > Why not? For f2fs, the 2MB segments are grouped together into sections > with a size matching the device zone size. That works well and can actually > even reduce the garbage collection overhead in some cases. > Nothing in the kernel zoned block device support limits the zone size > to a particular minimum or maximum. The only direct implication of the zone > size on the block I/O stack is that BIOs and requests cannot cross zone > boundaries. 
> In an extreme setup, a zone size of 4KB would work too > and result in read/write commands of 4KB at most to the device. The situation with grouping segments into sections in the case of F2FS is not so simple. First of all, you need to fill such an aggregation with data. F2FS distinguishes several types of segments, which means the current segment/section will be larger. If you mix different types of segments in one section (but I believe F2FS does not provide an opportunity to do this), then the GC overhead could be larger, I suppose. Otherwise, using one section per segment type means that a section larger than a segment (2MB) will change the speed at which sections of different data types fill up. As a result, it will dramatically change the distribution of the different section types across the file system volume. Does that reduce GC overhead? I am not sure. And if the file system's segment has to be equal to the zone size (the NILFS2 case, for example), then you may need to prepare the whole segment before a real flush. And if you need to handle O_DIRECT or a synchronous mount, then most probably you will have to flush a segment with a huge hole. I suppose this could significantly decrease the file system's free space, increase GC activity and decrease device lifetime. >> Another point that QLC device could have more tricky features of erase >> blocks management. Also we should apply erase operation on NAND flash >> erase block but it is not mandatory for the case of SMR zone. > > Incorrect: host-managed devices require a zone "reset" (equivalent to > discard/trim) to be reused after being written once. So again, the > "tricky features" you mention will depend on the device "model", > whatever this ends up to be for an open channel SSD. OK. But I assume that an SMR zone "reset" is significantly cheaper than a NAND flash block erase operation. 
And you can fill your SMR zone with data, then "reset" it and fill it again without a significant penalty. Also, TRIM and a zone "reset" are different, I suppose, because TRIM looks like a hint for the SSD controller. If the SSD controller receives a TRIM for some erase block, that does not mean the erase operation will be done immediately. Usually it should be done in the background, because a real erase operation is expensive. Thanks, Vyacheslav Dubeyko. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-05 22:58 ` Slava Dubeyko @ 2017-01-06 1:11 ` Theodore Ts'o -1 siblings, 0 replies; 63+ messages in thread From: Theodore Ts'o @ 2017-01-06 1:11 UTC (permalink / raw) To: Slava Dubeyko Cc: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote: > > Next point is read disturbance. If BER of physical page/block achieves some threshold then > we need to move data from one page/block into another one. What subsystem will be > responsible for this activity? The drive-managed case expects that device's GC will manage > read disturbance issue. But what's about host-aware or host-managed case? If the host side > hasn't information about BER then the host's software is unable to manage this issue. Finally, > it sounds that we will have GC subsystem as on file system side as on device side. As a result, > it means possible unpredictable performance degradation and decreasing device lifetime. > Let's imagine that host-aware case could be unaware about read disturbance management. > But how host-managed case can manage this issue? One of the ways this could be done in the ZBC specification (assuming that erase blocks == zones) would be to set the "reset" bit in the zone descriptor which is returned by the REPORT ZONES EXT command. This is a hint that a reset write pointer command should be sent to the zone in question, and it could be set when you start seeing soft ECC errors or when the flash management layer has decided that the zone should be rewritten in the near future. A simple way to do this is to ask the Host OS to copy the data to another zone and then send a reset write pointer command for the zone. 
So I think it very much could be done, and done within the framework of the ZBC model --- although whether SSD manufacturers will choose to do this, and/or choose to engage the T10/T13 standards committees to add the necessary extensions to the ZBC specification, is a question that we probably can't answer in this venue or by the participants on this thread. > Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed > and host-aware models. It looks like that the host side should be responsible to manage wear-leveling > for the host-managed case. But it means that the host should manage bad blocks and to have direct > access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection > layer and wear-leveling management will be unavailable on the host side. As a result, device will have > internal GC and the traditional issues (possible unpredictable performance degradation and decreasing > device lifetime). So I can imagine a setup where the flash translation layer manages the mapping between zone numbers and the physical erase blocks, such that when the host OS issues a "reset write pointer", it immediately gets a new erase block assigned to the specific zone in question. The original erase block would then get erased in the background, when the flash chip in question is available for maintenance activities. I think you've been thinking about a model where *either* the host has complete control over all aspects of the flash management, or the FTL has complete control --- and it may be that there are more clever ways that the work could be split between the flash device and the host OS. > Another interesting question... Let's imagine that we create file system volume for one device > geometry. It means that geometry details will be stored in the file system metadata during volume > creation for the case host-aware or host-managed case. 
> Then we backups this volume and restore the volume on device with > completely different geometry. So, what will we have for such case? > Performance degradation? Or will we kill the device? This is why I suspect that exposing the full details of the Flash layout via LUNs is a bad, bad, BAD idea. It's much better to use an abstraction such as Zones, and then have an abstraction layer that hides the low-level details of the hardware from the OS. The trick is picking an abstraction that exposes the _right_ set of details so that the division of labor between the Host OS and the storage device ends up at a better place. Hence my suggestion of perhaps providing a virtual mapping layer between "Zone number" and the low-level physical erase block. > I would like to have access channels/LUNs/zones on file system level. > If, for example, LUN will be associated with partition then it means > that it will need to aggregate several partitions inside of one volume. > First of all, not every file system is ready for the aggregation several > partitions inside of the one volume. Secondly, what's about aggregation > several physical devices inside of one volume? It looks like as slightly > tricky to distinguish partitions of the same device and different devices > on file system level. Isn't it? Yes, this is why using LUNs is a BAD idea. There's too much code --- in file systems, in the block layer in terms of how we expose block devices, etc., that assumes that different LUNs are used for different logical containers of storage. There have been decades of usage of this concept by enterprise storage arrays. Trying to appropriate LUNs for another use case is stupid. And maybe we can't stop the OCSSD folks if they have gone down that questionable design path, but there's nothing that says we have to expose it as a SCSI LUN inside of Linux! > OK. But I assume that SMR zone "reset" is significantly cheaper than > NAND flash block erase operation. 
> And you can fill your SMR zone with data then "reset" it and to fill again with data without significant penalty. If you have a virtual mapping layer between zones and erase blocks, a reset write pointer could be fast for SSDs as well. And that allows the implementation of your suggestion below: > Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks > like as a hint for SSD controller. If SSD controller receives TRIM for some > erase block then it doesn't mean that erase operation will be done > immediately. Usually, it should be done in the background because real > erase operation is expensive operation. Cheers, - Ted ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-06 1:11 ` Theodore Ts'o (?) @ 2017-01-06 12:51 ` Matias Bjørling -1 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-06 12:51 UTC (permalink / raw) To: Theodore Ts'o, Slava Dubeyko Cc: Damien Le Moal, linux-nvme, linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc On 01/06/2017 02:11 AM, Theodore Ts'o wrote: > On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote: >> >> Next point is read disturbance. If BER of physical page/block achieves some threshold then >> we need to move data from one page/block into another one. What subsystem will be >> responsible for this activity? The drive-managed case expects that device's GC will manage >> read disturbance issue. But what's about host-aware or host-managed case? If the host side >> hasn't information about BER then the host's software is unable to manage this issue. Finally, >> it sounds that we will have GC subsystem as on file system side as on device side. As a result, >> it means possible unpredictable performance degradation and decreasing device lifetime. >> Let's imagine that host-aware case could be unaware about read disturbance management. >> But how host-managed case can manage this issue? > > One of the ways this could be done in the ZBC specification (assuming > that erase blocks == zones) would be set the "reset" bit in the zone > descriptor which is returned by the REPORT ZONES EXT command. This is > a hint that the a reset write pointer should be sent to the zone in > question, and it could be set when you start seeing soft ECC errors or > the flash management layer has decided that the zone should be > rewritten in the near future. A simple way to do this is to ask the > Host OS to copy the data to another zone and then send a reset write > pointer command for the zone. This is an interesting approach. 
Currently, the OCSSD interface both uses the soft ECC mark to tell the host to rewrite data and has an explicit method to make the host rewrite it, e.g. in the case where read scrubbing on the device requires the host to move data for durability reasons. Adding the information to "Report zones" is a good idea. It enables the device to keep a list of "zones" that should be refreshed by the host but have not yet been. I will add that to the specification. > > So I think it very much could be done, and done within the framework > of the ZBC model --- although whether SSD manufactuers will chose to > do this, and/or choose to engage the T10/T13 standards committees to > add the necessary extensions to the ZBC specification is a question > that we probably can't answer in this venue or by the participants on > this thread. > >> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed >> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling >> for the host-managed case. But it means that the host should manage bad blocks and to have direct >> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection >> layer and wear-leveling management will be unavailable on the host side. As a result, device will have >> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing >> device lifetime). > > So I can imagine a setup where the flash translation layer manages the > mapping between zone numbers and the physical erase blocks, such that > when the host OS issues an "reset write pointer", it immediately gets > a new erase block assigned to the specific zone in question. The > original erase block would then get erased in the background, when the > flash chip in question is available for maintenance activities. 
> > I think you've been thinking about a model where *either* the host as > complete control over all aspects of the flash management, or the FTL > has complete control --- and it may be that there are more clever ways > that the work could be split between flash device and the host OS. > >> Another interesting question... Let's imagine that we create file system volume for one device >> geometry. It means that geometry details will be stored in the file system metadata during volume >> creation for the case host-aware or host-managed case. Then we backups this volume and restore >> the volume on device with completely different geometry. So, what will we have for such case? >> Performance degradation? Or will we kill the device? > > This is why I suspect that exposing the full details of the details of > the Flash layout via LUNS is a bad, bad, BAD idea. It's much better > to use an abstraction such as Zones, and then have an abstraction > layer that hides the low-level details of the hardware from the OS. > The trick is picking an abstraction that exposes the _right_ set of > details so that the division of labor betewen the Host OS and the > storage device is at a better place. Hence my suggestion of perhaps > providing a virtual mapping layer betewen "Zone number" and the > low-level physical erase block. Agree. The first approach was taken in the first iteration of the specification. After its release, we began to understand the chaos we had brought onto ourselves, so we moved to the zone/chunk approach in the second iteration to simplify the interface. > >> I would like to have access channels/LUNs/zones on file system level. >> If, for example, LUN will be associated with partition then it means >> that it will need to aggregate several partitions inside of one volume. >> First of all, not every file system is ready for the aggregation several >> partitions inside of the one volume. Secondly, what's about aggregation >> several physical devices inside of one volume? 
It looks like as slightly >> tricky to distinguish partitions of the same device and different devices >> on file system level. Isn't it? > > Yes, this is why using LUN's are a BAD idea. There's too much code > --- in file systems, in the block layer in terms of how we expose > block devices, etc., that assumes that different LUN's are used for > different logical containers of storage. There has been decades of > usage of this concept by enterprise storage arrays. Trying to > appropriate LUN's for another use case is stupid. And maybe we can't > stop OCSSD folks if they have gone down that questionable design path, > but there's nothing that says we have to expose it as a SCSI LUN > inside of Linux! Heh, yes, really bad idea. The naming of "LUNs" for OCSSDs could have been chosen better. In the future, it is being renamed to "parallel unit". For OCSSDs, all the device's parallel units are exposed through the same block device "LUN", which then has to be managed by the layers above. > >> OK. But I assume that SMR zone "reset" is significantly cheaper than >> NAND flash block erase operation. And you can fill your SMR zone with >> data then "reset" it and to fill again with data without significant penalty. > > If you have virtual mapping layer between zones and erase blocks, a > reset write pointer could be fast for SSD's as well. And that allows > the implementation of your suggestion below: > >> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks >> like as a hint for SSD controller. If SSD controller receives TRIM for some >> erase block then it doesn't mean that erase operation will be done >> immediately. Usually, it should be done in the background because real >> erase operation is expensive operation. 
> > Cheers, > > - Ted ^ permalink raw reply [flat|nested] 63+ messages in thread
* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os @ 2017-01-06 12:51 ` Matias Bjørling 0 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-06 12:51 UTC (permalink / raw) On 01/06/2017 02:11 AM, Theodore Ts'o wrote: > On Thu, Jan 05, 2017@10:58:57PM +0000, Slava Dubeyko wrote: >> >> Next point is read disturbance. If BER of physical page/block achieves some threshold then >> we need to move data from one page/block into another one. What subsystem will be >> responsible for this activity? The drive-managed case expects that device's GC will manage >> read disturbance issue. But what's about host-aware or host-managed case? If the host side >> hasn't information about BER then the host's software is unable to manage this issue. Finally, >> it sounds that we will have GC subsystem as on file system side as on device side. As a result, >> it means possible unpredictable performance degradation and decreasing device lifetime. >> Let's imagine that host-aware case could be unaware about read disturbance management. >> But how host-managed case can manage this issue? > > One of the ways this could be done in the ZBC specification (assuming > that erase blocks == zones) would be set the "reset" bit in the zone > descriptor which is returned by the REPORT ZONES EXT command. This is > a hint that the a reset write pointer should be sent to the zone in > question, and it could be set when you start seeing soft ECC errors or > the flash management layer has decided that the zone should be > rewritten in the near future. A simple way to do this is to ask the > Host OS to copy the data to another zone and then send a reset write > pointer command for the zone. This is an interesting approach. Currently, the OCSSD interface uses both the soft ECC mark to tell the host to rewrite, while the interface also has an explicit method to make the host rewrite the data. 
E.g., in the case where read scrubbing on the device requires the host to move data due to durability. Adding the information to the "Report zones" is a good idea. It enables the device to keep a list of "zones" that should be refreshed by the host but have yet to have it done. I will add that to the specification. > > So I think it very much could be done, and done within the framework > of the ZBC model --- although whether SSD manufactuers will chose to > do this, and/or choose to engage the T10/T13 standards committees to > add the necessary extensions to the ZBC specification is a question > that we probably can't answer in this venue or by the participants on > this thread. > >> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed >> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling >> for the host-managed case. But it means that the host should manage bad blocks and to have direct >> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection >> layer and wear-leveling management will be unavailable on the host side. As a result, device will have >> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing >> device lifetime). > > So I can imagine a setup where the flash translation layer manages the > mapping between zone numbers and the physical erase blocks, such that > when the host OS issues an "reset write pointer", it immediately gets > a new erase block assigned to the specific zone in question. The > original erase block would then get erased in the background, when the > flash chip in question is available for maintenance activities. 
>
> I think you've been thinking about a model where *either* the host has
> complete control over all aspects of the flash management, or the FTL
> has complete control --- and it may be that there are more clever ways
> that the work could be split between the flash device and the host OS.
>
>> Another interesting question... Let's imagine that we create a file system volume for
>> one device geometry. It means that the geometry details will be stored in the file
>> system metadata during volume creation in the host-aware or host-managed case. Then
>> we back up this volume and restore it on a device with a completely different
>> geometry. So, what will we have in such a case? Performance degradation? Or will we
>> kill the device?
>
> This is why I suspect that exposing the full details of the Flash
> layout via LUNs is a bad, bad, BAD idea. It's much better
> to use an abstraction such as zones, and then have an abstraction
> layer that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of
> details so that the division of labor between the host OS and the
> storage device is at a better place. Hence my suggestion of perhaps
> providing a virtual mapping layer between "zone number" and the
> low-level physical erase block.

Agree. That first approach was taken in the first iteration of the specification. After release, when we began to understand the chaos we had just brought onto ourselves, we moved to the zone/chunk approach in the second iteration to simplify the interface.

>> I would like to have access to channels/LUNs/zones at the file system level.
>> If, for example, a LUN is associated with a partition, then it means
>> that it will need to aggregate several partitions inside one volume.
>> First of all, not every file system is ready to aggregate several
>> partitions inside one volume. Secondly, what about aggregating
>> several physical devices inside one volume?
>> It looks slightly tricky to distinguish partitions of the same
>> device from different devices at the file system level, doesn't it?
>
> Yes, this is why using LUNs is a BAD idea. There's too much code
> --- in file systems, in the block layer in terms of how we expose
> block devices, etc. --- that assumes that different LUNs are used for
> different logical containers of storage. There have been decades of
> usage of this concept by enterprise storage arrays. Trying to
> appropriate LUNs for another use case is stupid. And maybe we can't
> stop the OCSSD folks if they have gone down that questionable design path,
> but there's nothing that says we have to expose it as a SCSI LUN
> inside of Linux!

Heh, yes, really bad idea. The name "LUN" for OCSSDs could have been chosen better; in a future revision it is being renamed to "parallel unit". For OCSSDs, all of the device's parallel units are exposed through the same block device "LUN", which then has to be managed by the layers above.

>> OK. But I assume that an SMR zone "reset" is significantly cheaper than
>> a NAND flash block erase operation. And you can fill your SMR zone with
>> data, then "reset" it, and fill it again with data without a significant penalty.
>
> If you have a virtual mapping layer between zones and erase blocks, a
> reset write pointer could be fast for SSDs as well. And that allows
> the implementation of your suggestion below:
>
>> Also, TRIM and zone "reset" are different, I suppose. Because TRIM looks
>> like a hint for the SSD controller. If the SSD controller receives TRIM for some
>> erase block, then it doesn't mean that the erase operation will be done
>> immediately. Usually, it should be done in the background, because a real
>> erase operation is an expensive operation.
>
> Cheers,
>
> - Ted

^ permalink raw reply [flat|nested] 63+ messages in thread
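Exposing all parallel units behind one block device implies some layout convention for the single LBA space. A minimal sketch of one such convention (linear concatenation of channels, then LUNs, then zones); the struct name and geometry values are hypothetical, standing in for whatever the device reports:

```c
#include <stdint.h>

/* Hypothetical geometry; in practice these values come from the
 * device (e.g. via the lightnvm sysfs entries discussed earlier). */
struct ocssd_geo {
    uint32_t nr_channels;
    uint32_t luns_per_channel;
    uint32_t zones_per_lun;
    uint64_t zone_len;        /* LBAs per zone */
};

/* One way to lay parallel units out linearly inside a single LBA
 * space: channel-major, then LUN, then zone, then offset in-zone.
 * This is an illustrative convention, not an on-the-wire format. */
static uint64_t ocssd_to_linear_lba(const struct ocssd_geo *g,
                                    uint32_t ch, uint32_t lun,
                                    uint32_t zone, uint64_t off)
{
    uint64_t unit = (uint64_t)ch * g->luns_per_channel + lun;
    uint64_t z    = unit * g->zones_per_lun + zone;
    return z * g->zone_len + off;
}
```

The upper layers would then need the geometry (however it is exported) to invert this mapping and keep I/O to distinct parallel units from serializing behind one another.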
* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-06 1:11 ` Theodore Ts'o @ 2017-01-09 6:49 ` Slava Dubeyko 0 siblings, 0 replies; 63+ messages in thread From: Slava Dubeyko @ 2017-01-09 6:49 UTC (permalink / raw) To: Theodore Ts'o, Matias Bjørling Cc: Damien Le Moal, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme

-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu]
Sent: Thursday, January 5, 2017 5:12 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> I think you've been thinking about a model where *either* the host has complete control
> over all aspects of the flash management, or the FTL has complete control --- and it may
> be that there are more clever ways that the work could be split between
> the flash device and the host OS.

Yes, I totally agree that the better way is to split the responsibilities between the flash device and the host (the file system, for example). I would like to consider an SSD as a set of FTL primitives. Let's imagine the SSD as an automaton that is able to execute FTL primitives, while the file system issues the commands that orchestrate the SSD's activity. I believe it makes sense to think of the SSD as a data processing accelerator engine. It means that we need a good interface that can be the basis for offloading data processing operations. And I clearly see many cases where a file system would like to say: "Hey, SSD. Please, execute this primitive for me right now". Let's consider the operation of moving zones (or erase blocks) with high BER.
If we have a completely passive SSD, then it sounds to me like every such operation will look like: (1) read the data on the host side; (2) "reset" the zone; (3) write the data back into the SSD. But if some zone (erase block(s)) with high BER is full of valid data, then why does the host need to execute the whole operation in such a dumb "read-write" way? I mean that it makes no sense to spend the host's resources on such an operation. The responsibility of the host is simply to initiate the operation at the proper time, and the responsibility of the SSD is to execute the operation internally (offloading it). So here we could have an FTL primitive for moving zones (erase blocks) to overcome read disturbance.

Let's consider GC operations... Right now, we have a GC subsystem on the SSD side (the device-managed and host-aware cases) and a GC subsystem on the host side (LFS file systems in the host-aware case). So it's clear that the SSD is able to provide some primitives for GC operations. It is also completely unreasonable to have a GC subsystem both on the SSD side and on the host side. If we have a GC subsystem on the host only, then we need to follow the dumb "read-modify-write" paradigm and spend the host's resources on GC operations. Otherwise, if the GC subsystem lives on the SSD side, then GC suffers from a lack of knowledge about valid data locations (the file system keeps this knowledge), and such a solution produces a wide range of cases of unexpected performance degradation. So we need a much smarter solution. What could it be? Again, the file system (host) has to initiate the GC operation at the proper time, but the SSD should execute the requested operation (offloading it). So we will have the GC subsystem on the file system side, but the real GC operation on a zone (erase block(s)) will be executed by the SSD.
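The move-zone primitive sketched above, combined with Ted's virtual zone-to-erase-block mapping, could look like the following. Everything here (names, the per-zone erase-block map, the free-block allocator) is a hypothetical illustration of the idea, not an existing ZBC or OCSSD command:

```c
#include <stdint.h>

#define NR_ZONES 8

/* Hypothetical FTL state: each virtual zone ID maps to a physical
 * erase-block ID, so the host-visible addressing never changes. */
struct ftl {
    uint32_t zone_to_eb[NR_ZONES]; /* virtual zone -> erase block */
    uint32_t next_free_eb;         /* trivial free-block allocator */
};

/* The offloaded "move this zone for me" primitive: the host only names
 * the aged zone; the device copies the data to a fresh erase block and
 * retargets the virtual zone. Returns the freed erase block, which the
 * device can erase in the background later. */
static uint32_t ftl_move_zone(struct ftl *f, uint32_t zone)
{
    uint32_t old_eb = f->zone_to_eb[zone];
    uint32_t new_eb = f->next_free_eb++;
    /* ... device copies valid data old_eb -> new_eb internally ... */
    f->zone_to_eb[zone] = new_eb;   /* same zone ID, new physical home */
    return old_eb;
}
```

Because the zone ID (and hence every LBA inside it) is unchanged, the file system's mapping table needs no update after the move, which is exactly the property argued for below.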
The key points here are that: (1) the file system chooses a good time for the GC operation; (2) the file system is able to select a zone (erase block(s)) that makes GC cost-efficient in terms of the amount of valid data in the aged zone; (3) the file system shares information about the valid pages in the zone (erase block(s)); (4) the SSD executes the GC operation on the zone internally.

We need to take into account three possible cases: (1) the zone is completely invalid; (2) the zone is partially invalid; (3) the zone contains only valid data. If the file system's GC selects a zone that doesn't contain valid data (the "invalid" zone case), then GC simply needs to request a zone "reset" or send a TRIM command. The rest is the responsibility of the SSD. If the zone is completely filled with valid data, then the file system's GC needs to request a moving operation on the SSD side. If we use virtual zones, then such a moving operation on the SSD side changes nothing for the file system (the logical block numbers stay the same). So the file system doesn't need to change its internal mapping table for such an operation.

The case of a partially invalid zone (one containing some amount of valid data) is more tricky. But let's consider the situation. If the file system knows the positions of valid logical blocks or pages inside a zone, then it is able to share the zone's bitmap with the SSD. It means that if we have 4 KB logical blocks and a 256 MB zone, then we need an 8 KB bitmap to represent the positions of valid logical blocks inside the zone. So the file system is able to send such a valid-pages bitmap with the command initiating a GC operation on some zone. The responsibility of the SSD side will be to: (1) "reset" the zone; (2) move the valid logical blocks from the aged zone into new ones using a compaction scheme. I mean that all valid pages should be written in a contiguous manner into the newly allocated zone (erase blocks).
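The bitmap arithmetic above, and the recalculation a file system could do after such a compaction, can be sketched as follows. This is a minimal sketch under the stated assumptions (one validity bit per logical block, valid blocks packed contiguously in their original order); the function names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitmap bytes needed for a zone: one bit per logical block.
 * E.g. a 256 MB zone of 4 KB blocks has 65536 blocks -> 8192 bytes. */
static size_t zone_bitmap_bytes(uint64_t zone_bytes, uint64_t block_bytes)
{
    return (size_t)(zone_bytes / block_bytes / 8);
}

static int bit_set(const uint8_t *bm, uint64_t i)
{
    return (bm[i / 8] >> (i % 8)) & 1;
}

/* After the device compacts the zone (valid blocks packed contiguously,
 * order preserved), the new in-zone index of valid block i is simply
 * the number of valid blocks preceding it. The file system can
 * recompute this from its own bitmap without re-reading any data.
 * Returns -1 if block i was invalid (it no longer exists). */
static int64_t compacted_index(const uint8_t *bm, uint64_t i)
{
    if (!bit_set(bm, i))
        return -1;
    int64_t rank = 0;
    for (uint64_t j = 0; j < i; j++)
        rank += bit_set(bm, j);
    return rank;
}
```

This determinism is what lets F2FS or NILFS2 update their mapping tables (or prepare a new log header) before the device has even performed the move.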
Finally, it means that the SSD can reposition logical blocks inside the zone without changing the initial order of the logical pages (a compaction scheme). Such a compaction scheme can easily be implemented on the SSD side. And if we do not change the order of the logical blocks, then we have a deterministic case that can easily be processed on the file system side. If the file system has the initial bitmap, then it can easily recalculate the valid logical blocks' positions after the compaction; F2FS, for example, can easily do such a recalculation, after which the new positions are stored in the file system's mapping table. NILFS2 is a slightly more complex case, because NILFS2 describes the logical blocks inside a log by means of a special btree in the log's header. But again, the compaction scheme is deterministic, which provides the opportunity to recalculate the logical blocks' positions before the real GC operation. It means that NILFS2 is able to prepare both the valid logical blocks' bitmap and the log's header before the GC operation, and to share all of this with the SSD.

However, every GC operation on a partially invalid zone results in a zone that is partially filled with valid data (the rest of the zone is completely free). What should be done in that case? I can see four possible approaches:

(1) Re-use the partially filled zone. If the file system tracks the state of every zone (in a mapping table, for example), or the state of a zone can be extracted, then the aged zone simply changes state after the GC operation. So, the partially filled zone can be used as the current zone for writing new data.

(2) Add the valid data of the aged zone to the tail of the current zone. Let's imagine that the file system is using some zone as the current zone for adding new data. If we know that an aged zone contains some number of valid pages, then it's possible to reserve space in the tail of the current zone.
Finally, it is possible to combine a flush operation (writing data from the current zone's page cache) with a GC operation on the aged zone on the SSD side.

(3) Re-use the aged zone as the current zone. Let's imagine that we have some aged zone with a small number of valid pages. It means that we can select this zone as the current zone for new data. First of all, we need to: (1) "reset" the zone; (2) initiate the GC operation on the SSD side. We know how many valid pages we will have at the beginning of the current zone, so we simply need to add new logical blocks into the current zone's page cache after the area reserved for the data from the aged zone. Our GC operation thus runs in the background of new data preparation in the current zone's page cache, and, finally, the whole zone will be full of data after the flush operation.

(4) Merge several aged zones into a new one.

> It's much better to use an abstraction such as Zones, and then have an abstraction layer
> that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of details so that the division
> of labor between the Host OS and the storage device is at a better place. Hence my suggestion
> of perhaps providing a virtual mapping layer between "Zone number" and
> the low-level physical erase block.

I like the idea of some abstraction that hides the low-level details. But it sounds like we will still have two mapping tables, one on the SSD side and one on the file system side. Again, we need to distribute the responsibilities between the file system and the SSD. If the file system manages GC activity but the real GC operation is delegated to the SSD side (at the proper time), then it sounds like all maintenance operations will be done by the SSD itself. It means that the SSD is able to manage the only mapping table, and the file system simply needs to keep an up-to-date copy of that table.
Or, oppositely, the file system can manage the only mapping table and share its actual state with the SSD. But a single mapping table looks like a really complicated technique. From another point of view, a virtual zone can always keep the same ID. So, the responsibility of the SSD will be to map the virtual zone ID to physical erase block IDs. Such a mapping table (virtual zone ID <-> erase block(s)) can be more compact than a mapping table (LBA <-> physical page). The responsibility of the file system (host) will be the mapping inside the virtual zone (LBA <-> logical block inside the virtual zone). If the virtual zone ID is always the same, then such a mapping table could be smaller. But I don't see how such a mapping table can be smaller for the current implementations of F2FS or NILFS2. However, if a log is made equal to a whole zone, then the header of the log can include a similar mapping table for the log/zone.

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-09  6:49 ` Slava Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Slava Dubeyko @ 2017-01-09  6:49 UTC (permalink / raw)
  To: Theodore Ts'o, Matias Bjørling
  Cc: Damien Le Moal, Viacheslav Dubeyko, lsf-pc, Linux FS Devel,
	linux-block, linux-nvme

-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu]
Sent: Thursday, January 5, 2017 5:12 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> I think you've been thinking about a model where *either* the host has complete control
> over all aspects of the flash management, or the FTL has complete control --- and it may
> be that there are more clever ways that the work could be split between
> flash device and the host OS.

Yes, I totally agree that the better way is to split the responsibilities between the flash device and the host (the file system, for example). I would like to consider an SSD as a set of FTL primitives. Imagine the SSD as an automaton that is able to execute FTL primitives, while the file system issues the commands that orchestrate the SSD's activity. I believe it makes sense to think of the SSD as a data-processing accelerator engine, which means we need a good interface that can serve as the basis for offloading data-processing operations. And I can clearly see many cases where a file system would like to say: "Hey, SSD. Please execute this primitive for me right now." Let's consider the operation of moving zones (or erase blocks) with a high BER.
If we have a completely passive SSD, then every such operation looks like: (1) read the data on the host side; (2) "reset" the zone; (3) write the data back to the SSD. But if some zone (erase block(s)) with a high BER is full of valid data, why should the host execute the whole operation in such a wasteful "read-write" way? It makes no sense to spend the host's resources on it. The host's responsibility is simply to initiate the operation at the proper time; the SSD's responsibility is to execute the operation internally (offloading it). So here we could have an FTL primitive that moves zones (erase blocks) in order to overcome read disturbance.

Let's consider GC operations. Right now we have a GC subsystem on the SSD side (the drive-managed and host-aware cases) and a GC subsystem on the host side (LFS file systems in the host-aware case). So it's clear that an SSD is able to provide some GC primitives. It is also completely unreasonable to have a GC subsystem both on the SSD side and on the host side. If we have GC on the host only, then we must follow the wasteful "read-modify-write" paradigm and spend the host's resources on GC operations. Conversely, if the GC subsystem lives on the SSD side, it suffers from a lack of knowledge about valid data location (the file system keeps this knowledge), and such a solution opens a wide range of cases for unexpected performance degradation. So we need a much smarter solution. What could it be? Again, the file system (host) has to initiate the GC operation at the proper time, but the SSD should execute the requested operation (offloading it). So we have the GC subsystem on the file system side, but the real GC operation on a zone (erase block(s)) is executed by the SSD.
The key points here are that: (1) the file system chooses a good time for the GC operation; (2) the file system selects a zone (erase block(s)) that makes GC cost-efficient in terms of the amount of valid data in the aged zone; (3) the file system shares information about the valid pages in the zone (erase block(s)); (4) the SSD executes the GC operation on the zone internally.

We need to take three possible cases into account: (1) the zone is completely invalid; (2) the zone is partially invalid; (3) the zone contains only valid data. If the file system's GC selects a zone that contains no valid data (the "invalid" zone case), then GC simply needs to request a zone "reset" or send a TRIM command; the rest is the SSD's responsibility. If the zone is completely filled with valid data, then the file system's GC requests a move operation on the SSD side. If we use virtual zones, such a move on the SSD side changes nothing for the file system (the logical block numbers stay the same), so the file system doesn't need to update its internal mapping table.

The case of a partially invalid zone (one containing some amount of valid data) is trickier. If the file system knows the positions of the valid logical blocks or pages inside a zone, then it can share the zone's bitmap with the SSD. With a 4 KB logical block and a 256 MB zone, an 8 KB bitmap is enough to represent the positions of the valid logical blocks inside the zone. So the file system can send such a valid-page bitmap along with the command that initiates the GC operation for a zone. The SSD's responsibility is then to: (1) "reset" the zone; (2) move the valid logical blocks from the aged zone into a new one using a compaction scheme, i.e., write all valid pages contiguously into the newly allocated zone (erase blocks).
Finally, this means the SSD can reposition logical blocks inside a zone without changing their initial order (the compaction scheme). Such a scheme can be implemented easily on the SSD side. And if the order of logical blocks does not change, we have a deterministic case that can be processed easily on the file system side: given the initial bitmap, the file system can recalculate the valid logical blocks' positions after compaction. For example, F2FS can do such a recalculation easily, and the new positions of the valid logical blocks are then stored in the file system's mapping table. NILFS2 is a slightly more complex case, because NILFS2 describes the logical blocks inside a log by means of a special btree in the log's header. But again, the compaction scheme is deterministic, which makes it possible to recalculate the logical blocks' positions before the real GC operation. So NILFS2 is able to prepare both the valid-block bitmap and the log's header before the GC operation and share all of this with the SSD.

However, every GC operation on a partially invalid zone results in a zone that is only partially filled with valid data (the rest of the zone is completely free). What should be done in that case? I can see four possible approaches:

(1) Re-use the partially filled zone. If the file system tracks the state of every zone (in a mapping table, for example), or the state of a zone can be extracted, then the aged zone simply changes state after the GC operation, and the partially filled zone can be used as the current zone for writing new data.

(2) Append the valid data of the aged zone to the tail of the current zone. Imagine the file system is using some zone as the current zone for adding new data. If we know how many valid pages an aged zone contains, then it's possible to reserve that much space at the tail of the current zone.
Finally, it is possible to combine the flush operation (writing data from the current zone's page cache) with the GC operation on the aged zone on the SSD side.

(3) Re-use the aged zone as the current zone. Imagine we have an aged zone with a small number of valid pages; we can select it as the current zone for new data. First of all, we need to: (1) "reset" the zone; (2) initiate the GC operation on the SSD side. We know how many valid pages will occupy the beginning of the current zone, so we simply add new logical blocks into the current zone's page cache after the area reserved for the aged zone's data. The GC operation then runs in the background of new data preparation in the page cache, and after the flush operation the whole zone is full of data.

(4) Merge several aged zones into a new one.

> It's much better to use an abstraction such as Zones, and then have an abstraction layer
> that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of details so that the division
> of labor between the Host OS and the storage device is at a better place. Hence my suggestion
> of perhaps providing a virtual mapping layer between "Zone number" and
> the low-level physical erase block.

I like the idea of an abstraction that hides the low-level details. But it sounds like we would still have two mapping tables, one on the SSD side and one on the file system side. Again, we need to distribute the responsibilities between the file system and the SSD. If the file system manages GC activity but the real GC operation is delegated to the SSD (at the proper time), then all maintenance operations end up being done by the SSD itself. That suggests the SSD could manage the only mapping table, with the file system simply keeping an up-to-date copy of it.
Or, the other way around, the file system could manage the only mapping table and share its actual state with the SSD. But a single shared mapping table looks like a really complicated technique. From another point of view, a virtual zone can always keep the same ID, so the SSD's responsibility would be to map the virtual zone ID to the physical erase block IDs. Such a mapping table (virtual zone ID <-> erase block(s)) can be much more compact than an (LBA <-> physical page) mapping table. The file system's (host's) responsibility would then be the mapping inside the virtual zone (LBA <-> logical block inside the virtual zone). If the virtual zone ID always stays the same, that mapping table could also be smaller, although I don't see how it can be smaller for the current implementations of F2FS or NILFS2. However, if a log were equal to a whole zone, then the log's header could likewise include a mapping table for the log/zone.

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-09  6:49 ` Slava Dubeyko
  (?)
@ 2017-01-09 14:55 ` Theodore Ts'o
  -1 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2017-01-09 14:55 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Damien Le Moal, Matias Bjørling, linux-nvme, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

So in the model where the flash side is tracking the logical-to-physical zone mapping, and the host merely expects the ZBC interface, one way it could work is as follows.

1) The flash signals that a particular zone should be reset soon.

2) If the host does not honor the request, eventually the flash will have to do a forced copy of the zone to a new erase block. (This is a fail-safe and shouldn't happen under normal circumstances.)

(By the way, this model can be used for any number of things. For example, for cloud workloads where tail latency is really important, it would be really cool if T10/T13 adopted a way for the host to be notified about the potential need for ATI remediation in a particular disk region, so the host could schedule it when it would be least likely to impact high-priority, low-latency workloads. If the host fails to give the firmware permission to do the ATI remediation before the "gotta go" deadline is exceeded, the disk could do the ATI remediation at that point to assure data integrity as a fail-safe.)

3) The host, since it has better knowledge of which blocks belong to which inode, and which inodes are likely to have identical object lifetimes (for example, all of the .o files in a directory are likely to be deleted at the same time when the user runs "make clean"; there was a Usenix or FAST paper over a decade ago that pointed out that heuristics based on file names are likely to be helpful), can do a better job of distributing the blocks to different partially filled sequential-write-preferred / sequential-write-required zones.
The idea here is that you might have multiple zones that are partially filled based on expected object-lifetime predictions. Or the host could move blocks based on the knowledge that a particular zone already has blocks that will share the same fate (e.g., belong to the same inode) --- this is knowledge that the FTL cannot have, so with a sufficiently smart host file system, it ought to be able to do a better job than the FTL.

4) Since we assumed that the flash is tracking logical-to-physical zone mappings, and the host is responsible for everything else, if the host decides to move blocks to different SMR zones, the host file system will be responsible for updating its existing (inode, logical block) to physical block (SMR zone plus offset) mapping tables.

The main advantage of this model is that, to the extent there are cloud/enterprise customers already implementing Host Aware SMR storage solutions, they might be able to reuse code already written for SMR HDDs for this model/interface. Yes, some tweaks would probably be needed, since the design tradeoffs for disks and flash are very different. But the point is that the Host Managed and Host Aware SMR models are well understood by everyone.

----

There is another model you might consider, one which Christoph Hellwig suggested at an LSF/MM at least 2-3 years ago: a model where the flash or the SMR disk uses a division of labor similar to Object-Based Disks (except hopefully with a less awful interface). The idea here is that you give up on LBA numbers, and instead move the entire responsibility of mapping (inode, logical block) to (physical location) into the storage device. The file system would then be responsible for managing metadata (mod times, user/group ownership, permission bits/ACLs, etc.) and namespace issues (e.g., directory pathname to inode lookups).
So this solves the problem you seem to be concerned about in terms of keeping mapping information at two layers, and it solves it completely, since the file system no longer has to map inode+logical offset to an LBA number, which it would in the models you've outlined to date. It also solves the problem of giving the storage device more information about which blocks belong to which inode/object, and it would make it easier for the OS to pass object-lifetime and shared-fate hints to the storage device. This should allow the FTL or STL to do a better job, since it now has access to low-level hardware information (e.g., BER / soft ECC failures) as well as higher-level object information when making storage-layout and garbage-collection decisions.

----

A fair criticism of all of the models discussed to date (the ZBC-based one, the object-based storage model, and OCSSD) is that none of them has a mature implementation, in either the open source or the closed source world. But since that's true for *all* of them, we should be using other criteria for deciding which model is the best one to choose for the long term. The advantage of the ZBC model is that people have had several years to consider and understand the model, so in terms of mind share it has an advantage. The advantage of the object-based model is that it transfers a lot of the complexity to the storage device, so the job that needs to be done by the file system is much simpler than in either of the other two models. The advantage of the OCSSD model is that it exposes a lot of the raw flash complexities to the host. This can be good in that the host can now do a really good job of optimizing for a particular flash technology.
The downside is that by exposing all of that complexity to the host, it makes file system design very fragile: as the number of chips changes, or the size of erase blocks changes, or flash develops new capabilities such as erase suspend/resume, *all* of that hair gets exposed to the file system implementor. Personally, I think that's why either the ZBC model or the object-based model makes a lot more sense than something that exposes all of the vagaries of NAND flash to the file system.

Cheers,

- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-05 22:58 ` Slava Dubeyko
  (?)
@ 2017-01-06 13:05 ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-06 13:05 UTC (permalink / raw)
  To: Slava Dubeyko, Damien Le Moal, Viacheslav Dubeyko, lsf-pc, Theodore Ts'o
  Cc: Linux FS Devel, linux-block, linux-nvme

On 01/05/2017 11:58 PM, Slava Dubeyko wrote:
> Next point is read disturbance. If BER of physical page/block achieves some threshold then
> we need to move data from one page/block into another one. What subsystem will be
> responsible for this activity? The drive-managed case expects that device's GC will manage
> read disturbance issue. But what's about host-aware or host-managed case? If the host side
> hasn't information about BER then the host's software is unable to manage this issue. Finally,
> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime.
> Let's imagine that host-aware case could be unaware about read disturbance management.
> But how host-managed case can manage this issue?

The OCSSD interface uses a couple of methods:

1) Piggy-back soft ECC errors onto the completion entry. This tells the host that a block probably should be refreshed when appropriate.

2) Use an asynchronous interface, e.g., NVMe get log page, and report read-disturbed blocks through it. This may be coupled with the various processes running on the SSD.

3) (As Ted suggested) Expose a "reset" bit in the Report Zones command to let the host know which blocks should be reset. If the plumbing for 2) is not available, or the information has been lost on the host side, this method can be used to "resync".

>
> Bad block management... So, drive-managed and host-aware cases should be completely unaware
> about bad blocks.
> But what's about host-managed case? If a device will hide bad blocks from
> the host then it means mapping table presence, access to logical pages/blocks and so on. If the host
> hasn't access to the bad block management then it's not host-managed model. And it sounds as
> completely unmanageable situation for the host-managed model. Because if the host has access
> to bad block management (but how?) then we have really simple model. Otherwise, the host
> has access to logical pages/blocks only and device should have internal GC. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime because
> of competition of GC on device side and GC on the host side.

Agree. Depending on the use case, one may expose a "perfect" interface to the host, or an interface where media errors are reported to the host. The former is great for consumer units, where I/O predictability isn't critical; when I/O predictability is critical, the media errors can be reported, and the host can deal with them appropriately.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-04 7:24 ` Damien Le Moal @ 2017-01-06 1:09 ` Jaegeuk Kim -1 siblings, 0 replies; 63+ messages in thread From: Jaegeuk Kim @ 2017-01-06 1:09 UTC (permalink / raw) To: Damien Le Moal Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc, Linux FS Devel, linux-block, linux-nvme Hello, On 01/04, Damien Le Moal wrote: ... > > > Finally, if I really like to develop SMR- or NAND flash oriented file > > system then I would like to play with peculiarities of concrete > > technologies. And any unified interface will destroy the opportunity > > to create the really efficient solution. Finally, if my software > > solution is unable to provide some fancy and efficient features then > > guys will prefer to use the regular stack (ext4, xfs + block layer). > > Not necessarily. Again think in terms of device "model" and associated > feature set. An FS implementation may decide to support all possible > models, with likely a resulting incredible complexity. More likely, > similarly with what is happening with SMR, only models that make sense > will be supported by FS implementation that can be easily modified. > Example again here of f2fs: changes to support SMR were rather simple, > whereas the initial effort to support SMR with ext4 was pretty much > abandoned as it was too complex to integrate in the existing code while > keeping the existing on-disk format. From the f2fs viewpoint, we now support a single host-managed SMR drive having a portion of conventional zones. In addition, f2fs supports multiple devices [1], which enables us to use a pure host-managed SMR drive which has no conventional zones, working with another small conventional partition. I think current lightNVM with OCSSD aims towards a drive-managed device for generic filesystems. Depending on the FTL, however, OCSSD can report conventional or sequential zones. 1) If the FTL handles random 4K writes pretty well, it would be better to report conventional zones. Otherwise, 2) if the FTL has almost nothing to map between LBA and PBA, it is able to report sequential zones like a pure host-managed SMR drive. Interestingly, for the 1) host-aware model, there is no need to change f2fs at all. In order to explore the 2) pure host-managed model, I introduced aligned write IO [2] to make the FTL simpler by eliminating partial page writes. IMHO, it'd be fun to evaluate the several zoned models of SMR and OCSSD accordingly. [1] https://lkml.org/lkml/2016/11/9/727 [2] https://lkml.org/lkml/2016/12/30/242 Thanks, ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-06 1:09 ` Jaegeuk Kim (?) @ 2017-01-06 12:55 ` Matias Bjørling -1 siblings, 0 replies; 63+ messages in thread From: Matias Bjørling @ 2017-01-06 12:55 UTC (permalink / raw) To: Jaegeuk Kim, Damien Le Moal Cc: Slava Dubeyko, linux-nvme, linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc On 01/06/2017 02:09 AM, Jaegeuk Kim wrote: > Hello, > > On 01/04, Damien Le Moal wrote: > > ... >> >>> Finally, if I really like to develop SMR- or NAND flash oriented file >>> system then I would like to play with peculiarities of concrete >>> technologies. And any unified interface will destroy the opportunity >>> to create the really efficient solution. Finally, if my software >>> solution is unable to provide some fancy and efficient features then >>> guys will prefer to use the regular stack (ext4, xfs + block layer). >> >> Not necessarily. Again think in terms of device "model" and associated >> feature set. An FS implementation may decide to support all possible >> models, with likely a resulting incredible complexity. More likely, >> similarly with what is happening with SMR, only models that make sense >> will be supported by FS implementation that can be easily modified. >> Example again here of f2fs: changes to support SMR were rather simple, >> whereas the initial effort to support SMR with ext4 was pretty much >> abandoned as it was too complex to integrate in the existing code while >> keeping the existing on-disk format. > > From the f2fs viewpoint, now we support single host-managed SMR drive having > a portion of conventional zones. In addition, f2fs supports multiple devices > [1], which enables us to use pure host-managed SMR which has no conventional > zone, working with another small conventional partition. That is a good approach. SSD controllers may even implement a small FTL inside the device for the "conventional" zones. 
The size wouldn't be that big, and it may only be used to bootstrap the rest of the unit. A zone with a couple hundred megabytes should do. That'll simplify having pblk on the side next to f2fs. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 63+ messages in thread
* [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-02 21:06 ` Matias Bjørling ` (2 preceding siblings ...) (?) @ 2017-01-12 1:33 ` Damien Le Moal 2017-01-12 2:18 ` James Bottomley -1 siblings, 1 reply; 63+ messages in thread From: Damien Le Moal @ 2017-01-12 1:33 UTC (permalink / raw) To: Matias Bjørling, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme Hello, A long discussion on the list followed this initial topic proposal from Matias. I think this is a worthy topic to discuss at LSF in order to steer development of the zoned block device interface in the right direction. Considering the relation and implication to ZBC/ZAC support, I would like to attend LSF/MM to participate in this discussion. Thank you. Best regards. On 1/3/17 06:06, Matias Bjørling wrote: > Hi, > > The open-channel SSD subsystem is maturing, and drives are beginning to > become available on the market. The open-channel SSD interface is very > similar to the one exposed by SMR hard-drives. They both have a set of > chunks (zones) exposed, and zones are managed using open/close logic. > The main difference on open-channel SSDs is that it additionally exposes > multiple sets of zones through a hierarchical interface, which covers a > numbers levels (X channels, Y LUNs per channel, Z zones per LUN). > > Given that the SMR interface is similar to OCSSDs interface, I like to > propose to discuss this at LSF/MM to align the efforts and make a clear > path forward: > > 1. SMR Compatibility > > Can the SMR host interface be adapted to Open-Channel SSDs? For example, > the interface may be exposed as a single-level set of zones, which > ignore the channel and lun concept for simplicity. Another approach > might be to extend the SMR implementation sysfs entries to expose the > hierarchy of the device (channels with X LUNs and each luns have a set > of zones). > > 2. How to expose the tens of LUNs that OCSSDs have?
> > An open-channel SSDs typically has 64-256 LUNs that each acts as a > parallel unit. How can these be efficiently exposed? > > One may expose these as separate namespaces/partitions. For a DAS with > 24 drives, that will be 1536-6144 separate LUNs to manage. That many > LUNs will blow up the host with gendisk instances. While if we do, then > we have an excellent 1:1 mapping between the SMR interface and the OCSSD > interface. > > On the other hand, one could expose the device LUNs within a single LBA > address space and lay the LUNs out linearly. In that case, the block > layer may expose a variable that enables applications to understand this > hierarchy. Mainly the channels with LUNs. Any warm feelings towards this? > > Currently, a shortcut is taken with the geometry and hierarchy, which > expose it through the /lightnvm sysfs entries. These (or a type thereof) > can be moved to the block layer /queue directory. > > If keeping the LUNs exposed on the same gendisk, vector I/Os becomes a > viable path: > > 3. Vector I/Os > > To derive parallelism from an open-channel SSD (and SSDs in parallel), > one need to access them in parallel. Parallelism is achieved either by > issuing I/Os for each LUN (similar to driving multiple SSDs today) or > using a vector interface (encapsulating a list of LBAs, length, and data > buffer) into the kernel. The latter approach allows I/Os to be > vectorized and sent as a single unit to hardware. > > Implementing this in generic block layer code might be overkill if only > open-channel SSDs use it. I like to hear other use-cases (e.g., > preadv/pwritev, file-systems, virtio?) that can take advantage of > vectored I/Os. If it makes sense, then which level to implement: > bio/request level, SGLs, or a new structure? > > Device drivers that support vectored I/Os should be able to opt into the > interface, while the block layer may automatically roll out for device > drivers that don't have the support. 
> > What has the history been in the Linux kernel about vector I/Os? What > have reasons in the past been that such an interface was not adopted? > > I will post RFC SMR patches before LSF/MM, such that we have a firm > ground to discuss how it may be integrated. > > -- Besides OCSSDs, I also like to participate in the discussions of > XCOPY, NVMe, multipath, multi-queue interrupt management as well. > > -Matias > > _______________________________________________ > Linux-nvme mailing list > Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme > -- Damien Le Moal, Ph.D. Sr. Manager, System Software Research Group, Western Digital Corporation Damien.LeMoal@wdc.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-12 1:33 ` [LSF/MM " Damien Le Moal @ 2017-01-12 2:18 ` James Bottomley 0 siblings, 0 replies; 63+ messages in thread From: James Bottomley @ 2017-01-12 2:18 UTC (permalink / raw) To: Damien Le Moal, Matias Bjørling, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme On Thu, 2017-01-12 at 10:33 +0900, Damien Le Moal wrote: > Hello, > > A long discussion on the list followed this initial topic proposal > from Matias. I think this is a worthy topic to discuss at LSF in > order to steer development of the zoned block device interface in the > right direction. Considering the relation and implication to ZBC/ZAC > support, I would like to attend LSF/MM to participate in this > discussion. Just a note for the poor admin looking after the lists: to find all the ATTEND and TOPIC requests for the lists I fold up the threads to the top. If you frame your attend request as a reply, it's possible it won't get counted because I didn't find it, so please *start a new thread* for ATTEND and TOPIC requests. Thanks, James PS If you think you sent a TOPIC/ATTEND request in reply to something, then I really haven't seen it because this is the first one I noticed, and you should resend. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-12 2:18 ` James Bottomley @ 2017-01-12 2:35 ` Damien Le Moal -1 siblings, 0 replies; 63+ messages in thread From: Damien Le Moal @ 2017-01-12 2:35 UTC (permalink / raw) To: James Bottomley, Matias Bjørling, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme James, On 1/12/17 11:18, James Bottomley wrote: > On Thu, 2017-01-12 at 10:33 +0900, Damien Le Moal wrote: >> Hello, >> >> A long discussion on the list followed this initial topic proposal >> from Matias. I think this is a worthy topic to discuss at LSF in >> order to steer development of the zoned block device interface in the >> right direction. Considering the relation and implication to ZBC/ZAC >> support, I would like to attend LSF/MM to participate in this >> discussion. > > Just a note for the poor admin looking after the lists: to find all the > ATTEND and TOPIC requests for the lists I fold up the threads to the > top. If you frame your attend request as a reply, it's possible it > won't get counted because I didn't find it > > so please *start a new thread* for ATTEND and TOPIC requests. My apologies for the overhead. I will resend. Thank you. Best regards. -- Damien Le Moal, Ph.D. Sr. Manager, System Software Research Group, Western Digital Corporation Damien.LeMoal@wdc.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os 2017-01-12 2:35 ` Damien Le Moal @ 2017-01-12 2:38 ` James Bottomley -1 siblings, 0 replies; 63+ messages in thread From: James Bottomley @ 2017-01-12 2:38 UTC (permalink / raw) To: Damien Le Moal, Matias Bjørling, lsf-pc Cc: Linux FS Devel, linux-block, linux-nvme On Thu, 2017-01-12 at 11:35 +0900, Damien Le Moal wrote: > > Just a note for the poor admin looking after the lists: to find all > > the ATTEND and TOPIC requests for the lists I fold up the threads > > to the top. If you frame your attend request as a reply, it's > > possible it won't get counted because I didn't find it > > > > so please *start a new thread* for ATTEND and TOPIC requests. > > My apologies for the overhead. I will resend. > Thank you. You don't need to resend ... I've got you on the list. I replied publicly just in case there were any other people who did this that I didn't notice. James ^ permalink raw reply [flat|nested] 63+ messages in thread