* [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-02 21:06 ` Matias Bjørling
  0 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-02 21:06 UTC (permalink / raw)
  To: lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

Hi,

The open-channel SSD subsystem is maturing, and drives are beginning to 
become available on the market. The open-channel SSD interface is very 
similar to the one exposed by SMR hard drives. Both expose a set of 
chunks (zones), and the zones are managed using open/close logic. The 
main difference is that an open-channel SSD additionally exposes 
multiple sets of zones through a hierarchical interface, which covers a 
number of levels (X channels, Y LUNs per channel, Z zones per LUN).

Given that the SMR interface is similar to the OCSSD interface, I would 
like to propose discussing this at LSF/MM to align the efforts and agree 
on a clear path forward:

1. SMR Compatibility

Can the SMR host interface be adapted to open-channel SSDs? For example, 
the interface may be exposed as a single-level set of zones, which 
ignores the channel and LUN concepts for simplicity. Another approach 
might be to extend the SMR implementation's sysfs entries to expose the 
hierarchy of the device (channels with X LUNs, where each LUN has a set 
of zones).
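
For illustration only, such an extended hierarchy could be exported 
roughly as follows (the attribute names below are hypothetical and do 
not exist in the kernel today):

  /sys/block/<disk>/queue/ocssd/num_channels    # X (hypothetical)
  /sys/block/<disk>/queue/ocssd/num_luns        # Y, LUNs per channel (hypothetical)
  /sys/block/<disk>/queue/ocssd/num_zones       # Z, zones per LUN (hypothetical)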

2. How to expose the tens of LUNs that OCSSDs have?

An open-channel SSD typically has 64-256 LUNs, each of which acts as a 
parallel unit. How can these be efficiently exposed?

One may expose these as separate namespaces/partitions. For a DAS with 
24 drives, that would be 1536-6144 separate LUNs to manage, and that many 
LUNs would blow up the host with gendisk instances. If we do go that way, 
however, we get an excellent 1:1 mapping between the SMR interface and 
the OCSSD interface.

On the other hand, one could expose the device LUNs within a single LBA 
address space and lay the LUNs out linearly. In that case, the block 
layer could expose a variable that enables applications to understand 
this hierarchy, mainly the channels and their LUNs. Any warm feelings 
towards this?
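
As a minimal sketch of what a linear layout implies for an application 
(the geometry structure and helper below are hypothetical, purely for 
illustration, not an existing kernel interface):

  /* Hypothetical geometry description, for illustration only. */
  struct ocssd_geo {
          unsigned int nr_channels;       /* X */
          unsigned int luns_per_chnl;     /* Y */
          unsigned int zones_per_lun;     /* Z */
          unsigned long sectors_per_zone;
  };

  /* Map (channel, lun, zone, offset) to an LBA in a linear layout. */
  static unsigned long ocssd_linear_lba(const struct ocssd_geo *g,
                                        unsigned int ch, unsigned int lun,
                                        unsigned int zone, unsigned long off)
  {
          unsigned long zone_idx;

          zone_idx = ((unsigned long)ch * g->luns_per_chnl + lun) *
                     g->zones_per_lun + zone;
          return zone_idx * g->sectors_per_zone + off;
  }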

Currently, a shortcut is taken with the geometry and hierarchy, which 
are exposed through the /lightnvm sysfs entries. These (or some form 
thereof) could be moved to the block layer /queue directory.

If the LUNs are kept exposed through the same gendisk, vector I/Os 
become a viable path:

3. Vector I/Os

To derive parallelism from an open-channel SSD (and from SSDs in 
parallel), one needs to access them in parallel. Parallelism is achieved 
either by issuing I/Os to each LUN separately (similar to driving 
multiple SSDs today) or by passing a vector interface (encapsulating a 
list of LBAs, a length, and a data buffer) into the kernel. The latter 
approach allows I/Os to be vectorized and sent as a single unit to 
hardware.
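
A hypothetical sketch of what such a vector request could carry (this is 
not an existing kernel structure or syscall, just an illustration of the 
shape of the interface):

  /* Hypothetical vector I/O descriptor, for illustration only. */
  struct vec_io {
          unsigned long *lba_list;  /* one LBA per logical block */
          unsigned int nr_lbas;     /* entries in lba_list */
          void *data;               /* data buffer, nr_lbas * block size */
          unsigned int flags;       /* direction, hints, etc. */
  };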

Implementing this in generic block layer code might be overkill if only 
open-channel SSDs use it. I would like to hear about other use cases 
(e.g., preadv/pwritev, file systems, virtio?) that can take advantage of 
vectored I/Os. If it makes sense, at which level should it be 
implemented: the bio/request level, SGLs, or a new structure?

Device drivers that support vectored I/Os should be able to opt into the 
interface, while the block layer could automatically unroll vector I/Os 
for device drivers that don't support them.
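
One possible shape of that opt-in and fallback, sketched with the 
hypothetical vec_io descriptor above and hypothetical helpers (none of 
the functions or flags below exist today):

  /* Illustration only: submit a vector, or unroll it for legacy drivers. */
  static int submit_vec_io(struct block_device *bdev, struct vec_io *vio,
                           unsigned int lba_size)
  {
          char *buf = vio->data;
          unsigned int i;

          if (bdev_supports_vec_io(bdev))                /* hypothetical */
                  return bdev_submit_vec_io(bdev, vio);  /* one unit to HW */

          /* Fallback: one conventional I/O per LBA in the vector. */
          for (i = 0; i < vio->nr_lbas; i++)
                  submit_single_lba(bdev, vio->lba_list[i],      /* hypothetical */
                                    buf + (size_t)i * lba_size);
          return 0;
  }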

What is the history of vector I/Os in the Linux kernel? What were the 
reasons such an interface was not adopted in the past?

I will post RFC SMR patches before LSF/MM, so that we have firm ground 
to discuss how this may be integrated.

Besides OCSSDs, I would also like to participate in the discussions of 
XCOPY, NVMe, multipath, and multi-queue interrupt management.

-Matias


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-02 23:12   ` Viacheslav Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Viacheslav Dubeyko @ 2017-01-02 23:12 UTC (permalink / raw)
  To: Matias Bjørling, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
> Hi,
> 
> The open-channel SSD subsystem is maturing, and drives are beginning
> to 
> become available on the market. 

What do you mean? We still have nothing on the market. I haven't had the
opportunity to access any such device. Could you share your knowledge of
where and what devices can be bought on the market?

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-02 23:12   ` Viacheslav Dubeyko
@ 2017-01-03  8:56     ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-03  8:56 UTC (permalink / raw)
  To: Viacheslav Dubeyko, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

On 01/03/2017 12:12 AM, Viacheslav Dubeyko wrote:
> On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
>> Hi,
>>
>> The open-channel SSD subsystem is maturing, and drives are beginning
>> to 
>> become available on the market. 
> 
> What do you mean? We still have nothing on the market. I haven't
> opportunity to access to any of such device. Could you share your
> knowledge where and what device can be bought on the market?
> 

Hi Vyacheslav,

You are right that they are not available off the shelf at a convenience
store. You may contact one of these vendors for availability: CNEX Labs
(Westlake LightNVM SDK), Radian Memory Systems (RMS-325), and/or EMC (OX
Controller + Dragon Fire card).

> Thanks,
> Vyacheslav Dubeyko.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-03 17:35       ` Viacheslav Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Viacheslav Dubeyko @ 2017-01-03 17:35 UTC (permalink / raw)
  To: Matias Bjørling, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme, Vyacheslav.Dubeyko

Hi Matias,

On Tue, 2017-01-03 at 09:56 +0100, Matias Bjørling wrote:
> On 01/03/2017 12:12 AM, Viacheslav Dubeyko wrote:
> > 
> > On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
> > > 
> > > Hi,
> > > 
> > > The open-channel SSD subsystem is maturing, and drives are
> > > beginning
> > > to 
> > > become available on the market. 
> > What do you mean? We still have nothing on the market. I haven't
> > opportunity to access to any of such device. Could you share your
> > knowledge where and what device can be bought on the market?
> > 
> Hi Vyacheslav,
> 
> You are right that they are not available off the shelf at a
> convenient
> store. You may contact one of these vendors for availability: CNEX
> Labs
> (Westlake LightNVM SDK), Radian Memory Systems (RMS-325), and/or EMC
> (OX
> Controller + Dragon Fire card).

We, Western Digital, contacted CNEX Labs about half a year ago. Our
request was refused. We also contacted Radian Memory Systems about a
year ago; our negotiations finished with no success at all. And I doubt
that EMC will share anything with us. So the situation looks really
weird, especially for an open-source community. We cannot access or test
any Open-channel SSD, neither for money nor under NDA. Usually, open
source means that everybody has access to the hardware and we can
discuss implementation, architecture, or approach without any
restrictions. But we have no access to the hardware right now. I
understand the business model and blah, blah, blah. But it looks like,
in the end, we have nothing like an Open-channel SSD on the market, from
my personal point of view. And I suppose it is a really tricky thing to
discuss a software interface or any other details of something that does
not exist at all. Because if I cannot take and test some hardware, then
I cannot form my own opinion about this technology.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-03 17:35       ` Viacheslav Dubeyko
@ 2017-01-03 19:10         ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-03 19:10 UTC (permalink / raw)
  To: Viacheslav Dubeyko, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme, Vyacheslav.Dubeyko

On 01/03/2017 06:35 PM, Viacheslav Dubeyko wrote:
> Hi Matias,
>
> On Tue, 2017-01-03 at 09:56 +0100, Matias Bjørling wrote:
>> On 01/03/2017 12:12 AM, Viacheslav Dubeyko wrote:
>>>
>>> On Mon, 2017-01-02 at 22:06 +0100, Matias Bjørling wrote:
>>>>
>>>> Hi,
>>>>
>>>> The open-channel SSD subsystem is maturing, and drives are
>>>> beginning
>>>> to
>>>> become available on the market.
>>> What do you mean? We still have nothing on the market. I haven't
>>> opportunity to access to any of such device. Could you share your
>>> knowledge where and what device can be bought on the market?
>>>
>> Hi Vyacheslav,
>>
>> You are right that they are not available off the shelf at a
>> convenient
>> store. You may contact one of these vendors for availability: CNEX
>> Labs
>> (Westlake LightNVM SDK), Radian Memory Systems (RMS-325), and/or EMC
>> (OX
>> Controller + Dragon Fire card).
>
> We, Western Digital, contacted with CNEX Labs about a half year ago.
> Our request was refused. Also we contacted with Radian Memory Systems
> about a year ago. Our negotiations finished with no sucess at all. And
> I doubt that EMC will share with us something. So, such situation looks
> really weird, especially for the case of open-source community. We
> cannot access or test any Open-channel SSD nor for money nor under NDA.
> Usually, open-source means that everybody has access to hardware and we
> can discuss implementation, architecture or approach without any
> restrictions. But we haven't access to hardware right now. I understand
> the business model and blah, blah, blah. But it looks like that,
> finally, we have nothing like Open-channel SSD on the market, from my
> personal point of view. And I suppose that it's really tricky way to
> discuss software interface or any other details about something that
> doesn't exist at all. Because if I cannot take and test some hardware
> then I cannot build my own opinion about this technology.
>

I understand your frustration. It is annoying not having easy access to 
hardware. As you are probably aware, it is similar with host-managed SMR 
drives: there are customers that use your drives, even though they are 
not available off the shelf.

All of the open-channel SSD work is done in the open. Patches, new 
targets, and so forth are being developed for everyone to see. 
Similarly, the NVMe host interface is developed in the open as well. The 
interface allows one to implement supporting firmware. The "front-end" 
of the FTL on the SSD is removed, and the "back-end" engine is exposed. 
It is not much work, given that HGST already has an SSD firmware 
implementation; I bet you guys can whip up an internal implementation in 
a matter of weeks. If you choose to do so, I will bend over backwards to 
help you sort out any quirks that might arise.

Another option is to use the qemu extension. We are improving it 
continuously to make sure it follows the implementation of real OCSSD 
hardware. Today we do 90% of our FTL work using qemu, and most of the 
time it just works when we run the FTL code on real hardware.

This is similar to vendors that provide new CPUs, NVDIMMs, and graphics 
drivers: some code and refactoring go in years in advance. What I am 
proposing here is to discuss how OCSSDs fit into the storage stack and 
what we can do to improve it. Optimally, most of the lightnvm subsystem 
can be removed by exposing vectored I/Os, which would then enable a 
target to be implemented as a traditional device mapper module. That 
would be great!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-04  2:59           ` Slava Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Slava Dubeyko @ 2017-01-04  2:59 UTC (permalink / raw)
  To: Matias Bjørling, Viacheslav Dubeyko, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme


-----Original Message-----
From: Matias Bjørling [mailto:m@bjorling.me] 
Sent: Tuesday, January 3, 2017 11:11 AM
To: Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> All of the open-channel SSD work is done in the open.
> Patches, new targets, and so forth are being developed for everyone to see. 
> Similarly, the NVMe host interface is developed in the open as well.
> The interface allows one to implements supporting firmware. The "front-end"
> of the FTL on the SSD, is removed, and the "back-end" engine is exposed. 
> It is not much work and given HGST already have an SSD firmware implementation.
> I bet you guys can whip up an internal implementation in a matter of weeks.
> If you choose to do so, I will bend over backwards to help you sort out any quirks that might be.

I see your point. But I am a research guy and I have a software project,
so it is completely unreasonable for me to spend time on SSD firmware. I
simply need ready-made hardware for testing and benchmarking my software
and for checking the assumptions it was built on. That's all. If I don't
have the hardware right now, then I need to wait for better times.

> Another option is to use the qemu extension. We are improving it continuously
> to make sure it follows the implementation of a real hardware OCSSDs.
> Today we do 90% of our FTL work using qemu, and most of the time
> it just works when we run the FTL code on real hardware.

I really dislike using qemu for file system benchmarking.

> Similarly to vendors that provide new CPUs, NVDIMMs, and graphic drivers.
> Some code and refactoring go in years in advance. What I am proposing here is to discuss how OCSSDs
> fits into the storage stack, and what we can do to improve it. Optimally, most of the lightnvm subsystem
> can be removed by exposing vectored I/Os. Which then enables implementation of a target to be
> a traditional device mapper module. That would be great!

OK. From one point of view, I like the idea of SMR compatibility. But,
from another point of view, I am slightly skeptical about such an
approach. I believe you see the bright side of your suggestion, so let
me take a look at your approach from the dark side.

What is the goal of SMR compatibility? Any unification or interface
abstraction has the goal of hiding the peculiarities of the underlying
hardware. But we already have the block device abstraction, which hides
all of the hardware's peculiarities perfectly. Also, an FTL (or any
other translation layer) is able to represent the device as a sequence
of physical sectors, without the software side needing any real
knowledge of the sophisticated management activity on the device side.
And, finally, people will be completely happy to use the regular file
systems (ext4, xfs) without any need to modify the software stack. But I
believe that the goal of the Open-channel SSD approach is completely the
opposite: namely, to give the software side (a file system, for example)
the opportunity to manage the Open-channel SSD device with a smarter
policy.

So, my key worry is that trying to hide two different technologies (SMR
and NAND flash) under the same interface will result in losing the
opportunity to manage the device in a smarter way. Any unification has
the goal of creating a simple interface, but SMR and NAND flash are
significantly different technologies. And if somebody creates a
technology-oriented file system, for example, then it needs access to
the really special features of that technology. Otherwise, the interface
will be overloaded with the features of both technologies and it will
look like a mess.

An SMR zone and a NAND flash erase block look comparable, but in the end
they are significantly different things. Usually, an SMR zone is 256 MB
in size, but a NAND flash erase block can vary from 512 KB to 8 MB (it
will be slightly larger in the future, but not more than 32 MB, I
suppose). It is possible to group several erase blocks into an
aggregated entity, but that could be a poor policy from the file
system's point of view. Another point is that QLC devices could have
trickier erase block management. Also, we have to apply an erase
operation to a NAND flash erase block, but that is not mandatory for an
SMR zone, because an SMR zone can simply be re-written in sequential
order if all of the zone's data is invalid, for example. The
conventional zone could also be a really tricky point, because it is the
only zone on the whole device that can be updated in place. Raw NAND
flash usually has no equivalent of a conventional zone.
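
To put rough (purely illustrative) numbers on that aggregation, using
the sizes above: emulating a single 256 MB zone would mean grouping 32
erase blocks of 8 MB, or as many as 512 erase blocks of 512 KB, behind
one zone.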

Finally, if I really want to develop an SMR- or NAND-flash-oriented file
system, then I would like to play with the peculiarities of the concrete
technologies, and any unified interface would destroy the opportunity to
create a really efficient solution. In the end, if my software solution
is unable to provide some fancy and efficient features, then people will
prefer to use the regular stack (ext4, xfs + block layer).

Thanks,
Vyacheslav Dubeyko.





^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  2:59           ` Slava Dubeyko
@ 2017-01-04  7:24             ` Damien Le Moal
  -1 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2017-01-04  7:24 UTC (permalink / raw)
  To: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme


Slava,

On 1/4/17 11:59, Slava Dubeyko wrote:
> What's the goal of SMR compatibility? Any unification or interface
> abstraction has the goal to hide the peculiarities of underlying
> hardware. But we have block device abstraction that hides all 
> hardware's peculiarities perfectly. Also FTL (or any other
> Translation Layer) is able to represent the device as sequence of
> physical sectors without real knowledge on software side about 
> sophisticated management activity on the device side. And, finally,
> guys will be completely happy to use the regular file systems (ext4,
> xfs) without necessity to modify software stack. But I believe that 
> the goal of Open-channel SSD approach is completely opposite. Namely,
> provide the opportunity for software side (file system, for example)
> to manage the Open-channel SSD device with smarter policy.

The Zoned Block Device API is part of the block layer. So as such, it
does abstract many aspects of the device characteristics, as so many
other API of the block layer do (look at blkdev_issue_discard or zeroout
implementations to see how far this can be pushed).

Regarding the use of open channel SSDs, I think you are absolutely
correct: (1) some users may be very happy to use a regular, unmodified
ext4 or xfs on top of an open channel SSD, as long as the FTL
implementation does a complete abstraction of the device special
features and presents a regular block device to upper layers. And
conversely, (2) some file system implementations may prefer to directly
use those special features and characteristics of open channel SSDs. No
arguing with this.

But you are missing the parallel with SMR. For SMR, or more correctly
zoned block devices since the ZBC or ZAC standards can equally apply to
HDDs and SSDs, 3 models exists: drive-managed, host-aware and host-managed.

Case (1) above corresponds *exactly* to the drive managed model, with
the difference that the abstraction of the device characteristics (SMR
here) is in the drive FW and not in a host-level FTL implementation as
it would be for open channel SSDs. Case (2) above corresponds to the
host-managed model, that is, the device user has to deal with the device
characteristics itself and use it correctly. The host-aware model lies
in between these 2 extremes: it offers the possibility of complete
abstraction by default, but also allows a user to optimize its operation
for the device by allowing access to the device characteristics. So this
would correspond to a possible third way of implementing an FTL for open
channel SSDs.

> So, my key worry that the trying to hide under the same interface the
> two different technologies (SMR and NAND flash) will be resulted in
> the loss of opportunity to manage the device in more smarter way.
> Because any unification has the goal to create a simple interface.
> But SMR and NAND flash are significantly different technologies. And
> if somebody creates technology-oriented file system, for example,
> then it needs to have access to really special features of the
> technology. Otherwise, interface will be overloaded by features of
> both technologies and it will looks like as a mess.

I do not think so, as long as the device "model" is exposed to the user
as the zoned block device interface does. This allows a user to adjust
its operation depending on the device. This is true of course as long as
each "model" has a clearly defined set of features associated. Again,
that is the case for zoned block devices and an example of how this can
be used is now in f2fs (which allows different operation modes for
host-aware devices, but only one for host-managed devices). Again, I can
see a clear parallel with open channel SSDs here.

> SMR zone and NAND flash erase block look comparable but, finally, it
> significantly different stuff. Usually, SMR zone has 265 MB in size
> but NAND flash erase block can vary from 512 KB to 8 MB (it will be
> slightly larger in the future but not more than 32 MB, I suppose). It
> is possible to group several erase blocks into aggregated entity but
> it could be not very good policy from file system point of view.

Why not? For f2fs, the 2MB segments are grouped together into sections
with a size matching the device zone size. That works well and can
actually even reduce the garbage collection overhead in some cases.
Nothing in the kernel zoned block device support limits the zone size to
a particular minimum or maximum. The only direct implication of the zone
size on the block I/O stack is that BIOs and requests cannot cross zone
boundaries. In an extreme setup, a zone size of 4KB would work too and
result in read/write commands of 4KB at most to the device.
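
As a small illustration of that boundary rule (a sketch with made-up
names, not actual block layer code):

  /* Illustration only: clamp an I/O so it does not cross a zone boundary. */
  static unsigned int clamp_to_zone(unsigned long long sector,
                                    unsigned int nr_sectors,
                                    unsigned long long zone_sectors)
  {
          unsigned long long zone_end =
                  (sector / zone_sectors + 1) * zone_sectors;

          if (sector + nr_sectors > zone_end)
                  nr_sectors = zone_end - sector;
          return nr_sectors;
  }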

> Another point that QLC device could have more tricky features of
> erase blocks management. Also we should apply erase operation on NAND
> flash erase block but it is not mandatory for the case of SMR zone.

Incorrect: host-managed devices require a zone "reset" (equivalent to
discard/trim) to be reused after being written once. So again, the
"tricky features" you mention will depend on the device "model",
whatever this ends up to be for an open channel SSD.

> Because SMR zone could be simply re-written in sequential order if
> all zone's data is invalid, for example. Also conventional zone could
> be really tricky point. Because it is one zone only for the whole
> device that could be updated in-place. Raw NAND flash, usually,
> hasn't likewise conventional zone.

Conventional zones are optional in zoned block devices. There may be
none at all and an implementation may well decide to not support a
device without any conventional zones if some are required.
In the case of open channel SSDs, the FTL implementation may well decide
to expose a particular range of LBAs as "conventional zones" and have a
lower level exposure for the remaining capacity which can then be
optimally used by the file system based on the features available for
that remaining LBA range. Again, a parallel is possible with SMR.

> Finally, if I really like to develop SMR- or NAND flash oriented file
> system then I would like to play with peculiarities of concrete
> technologies. And any unified interface will destroy the opportunity 
> to create the really efficient solution. Finally, if my software
> solution is unable to provide some fancy and efficient features then
> guys will prefer to use the regular stack (ext4, xfs + block layer).

Not necessarily. Again, think in terms of device "model" and associated
feature set. An FS implementation may decide to support all possible
models, likely resulting in incredible complexity. More likely, similarly
to what is happening with SMR, only the models that make sense will be
supported by FS implementations that can be easily modified. f2fs is
again an example here: the changes to support SMR were rather simple,
whereas the initial effort to support SMR with ext4 was pretty much
abandoned as it was too complex to integrate into the existing code
while keeping the existing on-disk format.

Your argument above is actually making the same point: you want your
implementation to use the device features directly. That is, your
implementation wants a "host-managed" like device model. Using ext4 will
require a "host-aware" or "drive-managed" model, which could be provided
through a different FTL or device-mapper implementation in the case of
open channel SSDs.

I am not trying to argue that open channel SSDs and zoned block devices
should be supported under the exact same API. But I can definitely see
clear parallels worth a discussion. As a first step, I would suggest
trying to define open channel SSD "models" and their feature set
and see how these fit with the existing ZBC/ZAC defined models and at
least estimate the implications on the block I/O stack. If adding the
new models only results in the addition of a few top level functions or
ioctls, it may be entirely feasible to integrate the two together.
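
Just to give an idea of the shape this first step could take, a purely
hypothetical sketch (the model names and feature bits below are only
illustrative, nothing standard):

enum dev_model {                        /* hypothetical device models */
        MODEL_DRIVE_MANAGED,
        MODEL_HOST_AWARE,
        MODEL_HOST_MANAGED,
        MODEL_OCSSD_FLAT,               /* single flat set of zones */
        MODEL_OCSSD_HIERARCHICAL,       /* channels/LUNs exposed */
};

#define FEAT_SEQ_WRITE_REQUIRED (1U << 0) /* zones written sequentially */
#define FEAT_ZONE_RESET         (1U << 1) /* reset/erase before rewrite */
#define FEAT_CONVENTIONAL_ZONES (1U << 2) /* in-place updatable range */
#define FEAT_PARALLEL_UNITS     (1U << 3) /* channel/LUN hierarchy visible */
#define FEAT_VECTOR_IO          (1U << 4) /* vectored read/write/erase */

struct dev_model_desc {
        enum dev_model model;
        unsigned int features;
};

/* e.g. an OCSSD exposed as a flat zoned device, host-managed style. */
static const struct dev_model_desc ocssd_flat = {
        .model    = MODEL_OCSSD_FLAT,
        .features = FEAT_SEQ_WRITE_REQUIRED | FEAT_ZONE_RESET |
                    FEAT_VECTOR_IO,
};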

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr Manager, System Software Research Group,
Western Digital
Damien.LeMoal@hgst.com
Tel: (+81) 0466-98-3593 (Ext. 51-3593)
1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24             ` Damien Le Moal
@ 2017-01-04 12:39               ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-04 12:39 UTC (permalink / raw)
  To: Damien Le Moal, Slava Dubeyko, Viacheslav Dubeyko, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme

On 01/04/2017 08:24 AM, Damien Le Moal wrote:
> 
> Slava,
> 
> On 1/4/17 11:59, Slava Dubeyko wrote:
>> What's the goal of SMR compatibility? Any unification or interface
>> abstraction has the goal to hide the peculiarities of underlying
>> hardware. But we have block device abstraction that hides all 
>> hardware's peculiarities perfectly. Also FTL (or any other
>> Translation Layer) is able to represent the device as sequence of
>> physical sectors without real knowledge on software side about 
>> sophisticated management activity on the device side. And, finally,
>> guys will be completely happy to use the regular file systems (ext4,
>> xfs) without necessity to modify software stack. But I believe that 
>> the goal of Open-channel SSD approach is completely opposite. Namely,
>> provide the opportunity for software side (file system, for example)
>> to manage the Open-channel SSD device with smarter policy.
> 
> The Zoned Block Device API is part of the block layer. So as such, it
> does abstract many aspects of the device characteristics, as so many
> other API of the block layer do (look at blkdev_issue_discard or zeroout
> implementations to see how far this can be pushed).
> 
> Regarding the use of open channel SSDs, I think you are absolutely
> correct: (1) some users may be very happy to use a regular, unmodified
> ext4 or xfs on top of an open channel SSD, as long as the FTL
> implementation does a complete abstraction of the device special
> features and presents a regular block device to upper layers. And
> conversely, (2) some file system implementations may prefer to directly
> use those special features and characteristics of open channel SSDs. No
> arguing with this.
> 
> But you are missing the parallel with SMR. For SMR, or more correctly
> zoned block devices since the ZBC or ZAC standards can equally apply to
> HDDs and SSDs, 3 models exists: drive-managed, host-aware and host-managed.
> 
> Case (1) above corresponds *exactly* to the drive managed model, with
> the difference that the abstraction of the device characteristics (SMR
> here) is in the drive FW and not in a host-level FTL implementation as
> it would be for open channel SSDs. Case (2) above corresponds to the
> host-managed model, that is, the device user has to deal with the device
> characteristics itself and use it correctly. The host-aware model lies
> in between these 2 extremes: it offers the possibility of complete
> abstraction by default, but also allows a user to optimize its operation
> for the device by allowing access to the device characteristics. So this
> would correspond to a possible third way of implementing an FTL for open
> channel SSDs.
> 
>> So, my key worry that the trying to hide under the same interface the
>> two different technologies (SMR and NAND flash) will be resulted in
>> the loss of opportunity to manage the device in more smarter way.
>> Because any unification has the goal to create a simple interface.
>> But SMR and NAND flash are significantly different technologies. And
>> if somebody creates technology-oriented file system, for example,
>> then it needs to have access to really special features of the
>> technology. Otherwise, interface will be overloaded by features of
>> both technologies and it will looks like as a mess.
> 
> I do not think so, as long as the device "model" is exposed to the user
> as the zoned block device interface does. This allows a user to adjust
> its operation depending on the device. This is true of course as long as
> each "model" has a clearly defined set of features associated. Again,
> that is the case for zoned block devices and an example of how this can
> be used is now in f2fs (which allows different operation modes for
> host-aware devices, but only one for host-managed devices). Again, I can
> see a clear parallel with open channel SSDs here.
> 
>> SMR zone and NAND flash erase block look comparable but, finally, it
>> significantly different stuff. Usually, SMR zone has 265 MB in size
>> but NAND flash erase block can vary from 512 KB to 8 MB (it will be
>> slightly larger in the future but not more than 32 MB, I suppose). It
>> is possible to group several erase blocks into aggregated entity but
>> it could be not very good policy from file system point of view.
> 
> Why not? For f2fs, the 2MB segments are grouped together into sections
> with a size matching the device zone size. That works well and can
> actually even reduce the garbage collection overhead in some cases.
> Nothing in the kernel zoned block device support limits the zone size to
> a particular minimum or maximum. The only direct implication of the zone
> size on the block I/O stack is that BIOs and requests cannot cross zone
> boundaries. In an extreme setup, a zone size of 4KB would work too and
> result in read/write commands of 4KB at most to the device.
> 
>> Another point that QLC device could have more tricky features of
>> erase blocks management. Also we should apply erase operation on NAND
>> flash erase block but it is not mandatory for the case of SMR zone.
> 
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up being for an open channel SSD.
> 
>> Because SMR zone could be simply re-written in sequential order if
>> all zone's data is invalid, for example. Also conventional zone could
>> be really tricky point. Because it is one zone only for the whole
>> device that could be updated in-place. Raw NAND flash, usually,
>> hasn't likewise conventional zone.
> 
> Conventional zones are optional in zoned block devices. There may be
> none at all and an implementation may well decide to not support a
> device without any conventional zones if some are required.
> In the case of open channel SSDs, the FTL implementation may well decide
> to expose a particular range of LBAs as "conventional zones" and have a
> lower level exposure for the remaining capacity which can then be
> optimally used by the file system based on the features available for
> that remaining LBA range. Again, a parallel is possible with SMR.
> 
>> Finally, if I really like to develop SMR- or NAND flash oriented file
>> system then I would like to play with peculiarities of concrete
>> technologies. And any unified interface will destroy the opportunity 
>> to create the really efficient solution. Finally, if my software
>> solution is unable to provide some fancy and efficient features then
>> guys will prefer to use the regular stack (ext4, xfs + block layer).
> 
> Not necessarily. Again, think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, likely resulting in incredible complexity. More likely, similarly
> to what is happening with SMR, only the models that make sense will be
> supported by FS implementations that can be easily modified. f2fs is
> again an example here: the changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate into the existing code
> while keeping the existing on-disk format.
> 
> Your argument above is actually making the same point: you want your
> implementation to use the device features directly. That is, your
> implementation wants a "host-managed" like device model. Using ext4 will
> require a "host-aware" or "drive-managed" model, which could be provided
> through a different FTL or device-mapper implementation in the case of
> open channel SSDs.
> 
> I am not trying to argue that open channel SSDs and zoned block devices
> should be supported under the exact same API. But I can definitely see
> clear parallels worth a discussion. As a first step, I would suggest
> trying to define open channel SSD "models" and their feature set
> and see how these fit with the existing ZBC/ZAC defined models and at
> least estimate the implications on the block I/O stack. If adding the
> new models only results in the addition of a few top level functions or
> ioctls, it may be entirely feasible to integrate the two together.
> 

Thanks Damien. I couldn't have said it better myself.

The OCSSD 1.3 specification has been made with an eye towards the SMR
interface:

 - "Identification" - Follows the same "global" size definitions, and
also supports that each zone has its own local size.
 - "Get Report" command follows a very similar structure as SMR, such
that it can sit behind the "Report Zones" interface.
 - "Erase/Prepare Block" command follows the Reset block interface.

Those should fit right in. If the layout is planar, such that the OCSSD
only exposes a set of zones, it should be able to fit right into the
framework with minor modifications.
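
To sketch what sitting behind the "Report Zones" interface could look
like in the planar case (all structure and field names below are
invented for illustration, not the OCSSD 1.3 or ZBC definitions):

enum zone_cond { ZONE_EMPTY, ZONE_OPEN, ZONE_FULL, ZONE_OFFLINE };

struct ocssd_chunk_info {               /* hypothetical per-chunk entry */
        unsigned long long start_lba;
        unsigned long long nr_lbas;     /* per-chunk ("local") size */
        unsigned long long write_ptr;
        unsigned char state;            /* 0 free, 1 open, 2 full, else bad */
};

struct zone_descriptor {                /* hypothetical report-zones entry */
        unsigned long long start, len, wp;
        enum zone_cond cond;
};

static void chunk_to_zone(const struct ocssd_chunk_info *c,
                          struct zone_descriptor *z)
{
        z->start = c->start_lba;
        z->len   = c->nr_lbas;
        z->wp    = c->write_ptr;

        switch (c->state) {
        case 0:  z->cond = ZONE_EMPTY;   break;
        case 1:  z->cond = ZONE_OPEN;    break;
        case 2:  z->cond = ZONE_FULL;    break;
        default: z->cond = ZONE_OFFLINE; break;
        }
}

The reverse direction (turning a zone reset into the erase/prepare
command) would be handled the same way.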

A couple of details are added when moving towards managing multiple
parallel units, which is one of the things that requires a bit of
discussion.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24             ` Damien Le Moal
@ 2017-01-04 16:57               ` Theodore Ts'o
  -1 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2017-01-04 16:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Linux FS Devel, linux-block, linux-nvme

I agree with Damien, but I'd also add that in the future there may
very well be some new Zone types added to the ZBC model.  So we
shouldn't assume that the ZBC model is a fixed one.  And who knows?
Perhaps the T10 standards body will come up with a simpler model for
interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC
model --- or not.

Either way, that's not really relevant as far as the Linux block layer
is concerned, since the Linux block layer is designed to be an
abstraction on top of hardware --- and in some cases we can use a
similar abstraction on top of eMMC's, SCSI's, and SATA's
implementation definition of TRIM/DISCARD/WRITE SAME/SECURE
TRIM/QUEUED TRIM, even though they are different in some subtle ways,
and may have different performance characteristics and semantics.

The trick is to expose similarities where the differences won't matter
to the upper layers, but also to expose the fine distinctions and
allow the file system and/or user space to use the protocol-specific
differences when it matters to them.

Designing that is going to be important, and I can guarantee we won't
get it right at first.  Which is why it's a good thing that internal
kernel interfaces aren't cast into concrete, and can be subject to
change as new revisions to ZBC, or new interfaces (like perhaps
OCSSD's) get promulgated by various standards bodies or by various
vendors.

> > Another point that QLC device could have more tricky features of
> > erase blocks management. Also we should apply erase operation on NAND
> > flash erase block but it is not mandatory for the case of SMR zone.
> 
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up being for an open channel SSD.

... and this is exposed by having different zone types (sequential
write required vs sequential write preferred vs conventional).  And if
OCSSD's "zones" don't fit into the current ZBC zone types, we can
easily add new ones.  I would suggest, however, that we explicitly
disclaim that the block device layer's code points for zone types are
an exact match with the ZBC zone type numbering, precisely so we can
add new zone types that correspond to abstractions from different
hardware types, such as OCSSD.
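
Something along these lines, purely as a sketch with made-up names on
the block layer side (only the ZBC REPORT ZONES type codes below are
the standard values):

/* Hypothetical block-layer zone types, decoupled from on-the-wire codes. */
enum blk_zone_type_abstract {
        BLKZ_CONVENTIONAL = 1,
        BLKZ_SEQ_WRITE_REQUIRED,
        BLKZ_SEQ_WRITE_PREFERRED,
        BLKZ_VENDOR_FIRST = 0x80,       /* room for OCSSD-like abstractions */
};

/* ZBC REPORT ZONES zone type codes. */
#define ZBC_ZT_CONVENTIONAL     0x1
#define ZBC_ZT_SEQ_REQUIRED     0x2
#define ZBC_ZT_SEQ_PREFERRED    0x3

/* Explicit translation instead of assuming the numberings match. */
static int zbc_zone_type_to_blk(unsigned int zbc_type)
{
        switch (zbc_type) {
        case ZBC_ZT_CONVENTIONAL:  return BLKZ_CONVENTIONAL;
        case ZBC_ZT_SEQ_REQUIRED:  return BLKZ_SEQ_WRITE_REQUIRED;
        case ZBC_ZT_SEQ_PREFERRED: return BLKZ_SEQ_WRITE_PREFERRED;
        default:                   return -1;   /* unknown at this layer */
        }
}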

> Not necessarily. Again, think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, likely resulting in incredible complexity. More likely, similarly
> to what is happening with SMR, only the models that make sense will be
> supported by FS implementations that can be easily modified. f2fs is
> again an example here: the changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate into the existing code
> while keeping the existing on-disk format.

I'll note that Abutalib Aghayev and I will be presenting a paper at
the 2017 FAST conference detailing a way to optimize ext4 for
Host-Aware SMR drives by making a surprisingly small set of changes to
ext4's journalling layer, with some very promising results for certain
workloads: on both Seagate and WD HA drives we achieved 2x performance
improvements.  Patches are on the unstable portion of the ext4 patch
queue, and I hope to get them into an upstream-acceptable shape (as
opposed to "good enough for a research paper") in the next few months.

So it may very well be that small changes can be made to file systems
to support exotic devices if there are ways that we can expose the
right information about the underlying storage devices and offer the
right abstractions to enable the right kind of minimal I/O tagging,
hints, or commands as necessary, such that the changes we do need to
make to the file system can be kept small and easily testable even if
hardware is not available.

For example, by creating device mapper emulators of the feature sets
of these advanced storage interfaces that are exposed via the block
layer abstractions, whether it be for ZBC zones, or hardware
encryption acceleration, etc.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-05 22:58               ` Slava Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Slava Dubeyko @ 2017-01-05 22:58 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Theodore Ts'o
  Cc: Linux FS Devel, linux-block, linux-nvme


-----Original Message-----
From: Damien Le Moal 
Sent: Tuesday, January 3, 2017 11:25 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> But you are missing the parallel with SMR. For SMR, or more correctly zoned
> block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs,
> 3 models exists: drive-managed, host-aware and host-managed.
> Case (1) above corresponds *exactly* to the drive managed model, with
> the difference that the abstraction of the device characteristics (SMR
> here) is in the drive FW and not in a host-level FTL implementation
> as it would be for open channel SSDs. Case (2) above corresponds to the host-managed
> model, that is, the device user has to deal with the device characteristics
> itself and use it correctly. The host-aware model lies in between these 2 extremes:
> it offers the possibility of complete abstraction by default, but also allows a user
> to optimize its operation for the device by allowing access to the device characteristics.
> So this would correspond to a possible third way of implementing an FTL for open channel SSDs.

I see your point. And I think that, historically, we need to distinguish
four cases for NAND flash:
(1) drive-managed: regular file systems (ext4, xfs and so on);
(2) host-aware: flash-friendly file systems (NILFS2, F2FS and so on);
(3) host-managed: <file systems under implementation>;
(4) old-fashioned flash-oriented file systems for raw NAND (jffs, yaffs, ubifs and so on).

But, frankly speaking, even regular file systems are slightly flash-aware
today because of blkdev_issue_discard() (TRIM) and the REQ_META flag. So, the
next really important question is: what can/should be exposed for the
host-managed and host-aware cases? What is the principal difference between
these models? In the end, the difference is not so clear.

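As a minimal sketch of that flash-awareness (not taken from any particular
file system; the helper name and the sector-based units are assumptions made
for the example, and error handling is left to the caller):

/*
 * Sketch: hint to the device that a freed file system segment no longer
 * holds valid data. Assumes 'sb' is the file system's super_block and
 * that segment_start/segment_len are in 512-byte sectors; both names
 * are invented for this example.
 */
#include <linux/fs.h>
#include <linux/blkdev.h>

static int example_discard_segment(struct super_block *sb,
				   sector_t segment_start,
				   sector_t segment_len)
{
	return blkdev_issue_discard(sb->s_bdev, segment_start, segment_len,
				    GFP_NOFS, 0);
}
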
Let's start with error correction. Only flash-oriented file systems take
care of error correction themselves. I assume that the drive-managed,
host-aware and host-managed cases all expect hardware-based error
correction, so we can treat a logical page/block as an ideal byte stream
that always contains valid data. There is no difference and no
contradiction here.

The next point is read disturbance. If the BER of a physical page/block
reaches some threshold, then we need to move the data from that page/block
into another one. What subsystem will be responsible for this activity? The
drive-managed case expects that the device's GC will handle read
disturbance. But what about the host-aware or host-managed case? If the
host side has no information about the BER, then the host's software is
unable to manage this issue. In the end, it sounds like we will have a GC
subsystem both on the file system side and on the device side. As a result,
it means possible unpredictable performance degradation and decreased
device lifetime. Let's suppose the host-aware case can stay unaware of read
disturbance management. But how can the host-managed case manage this
issue?

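As an invented illustration only (no device or file system is claimed to do
exactly this): without BER feedback, about the best a host-side layer could
do is count reads per zone and schedule a copy-out/rewrite once a zone
crosses some threshold. The names and the threshold below are made up:

/*
 * Invented host-side read-disturb approximation: count reads per zone
 * and flag the zone for relocation after a fixed number of reads.
 */
#include <stdint.h>
#include <stdbool.h>

#define EXAMPLE_READ_LIMIT	100000ULL	/* arbitrary threshold */

struct example_zone_stats {
	uint64_t reads_since_last_rewrite;
};

/* Call on every read that touches the zone; returns true when the zone
 * should be queued for copy-out and reset. */
static bool example_note_read(struct example_zone_stats *z)
{
	return ++z->reads_since_last_rewrite >= EXAMPLE_READ_LIMIT;
}
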
Bad block management... The drive-managed and host-aware cases should be
completely unaware of bad blocks. But what about the host-managed case? If
a device hides bad blocks from the host, then that implies a mapping table,
access to logical pages/blocks only, and so on. If the host has no access
to bad block management, then it is not really a host-managed model, and
that sounds like a completely unmanageable situation for the host-managed
model. If the host does have access to bad block management (but how?),
then we have a really simple model. Otherwise, the host has access to
logical pages/blocks only and the device must have an internal GC. As a
result, it again means possible unpredictable performance degradation and
decreased device lifetime because of the competition between the GC on the
device side and the GC on the host side.

Wear leveling... The device is responsible for wear leveling in the
device-managed and host-aware models. It looks like the host side should be
responsible for wear leveling in the host-managed case. But that means the
host has to manage bad blocks and have direct access to physical
pages/blocks. Otherwise, physical erase blocks are hidden by the device's
indirection layer and wear-leveling management is unavailable on the host
side. As a result, the device has an internal GC and the traditional issues
(possible unpredictable performance degradation and decreased device
lifetime). But even if an SSD provides access to all of its internals, how
would a file system be able to implement wear leveling or bad block
management through regular I/O operations? The block device creates an LBA
abstraction for us. Does that mean a software FTL at the block layer level
manages the SSD internals directly? Then, again, the file system cannot
manage the SSD internals directly in the software-FTL case. And where
should the software FTL keep its mapping table, for example?

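For what it is worth, the core of such a host-side FTL is just a
logical-to-physical table, roughly like the sketch below (plain C, purely
illustrative; every name is invented, and persistence, locking, bad-block
and wear-leveling state are exactly the parts left open above):

/*
 * Illustrative host-side FTL mapping table: a flat logical-to-physical
 * array indexed by logical page number.
 */
#include <stdint.h>
#include <stdlib.h>

#define L2P_UNMAPPED	UINT64_MAX

struct example_ftl {
	uint64_t nr_logical_pages;
	uint64_t *l2p;		/* logical page -> physical page address */
};

static struct example_ftl *example_ftl_create(uint64_t nr_logical_pages)
{
	struct example_ftl *ftl = malloc(sizeof(*ftl));
	uint64_t i;

	if (!ftl)
		return NULL;
	ftl->nr_logical_pages = nr_logical_pages;
	ftl->l2p = malloc(nr_logical_pages * sizeof(uint64_t));
	if (!ftl->l2p) {
		free(ftl);
		return NULL;
	}
	for (i = 0; i < nr_logical_pages; i++)
		ftl->l2p[i] = L2P_UNMAPPED;
	return ftl;
}

/* Remap a logical page after its data has been written to a new
 * physical location (e.g. by GC or by an out-of-place write). */
static void example_ftl_update(struct example_ftl *ftl, uint64_t lpn,
			       uint64_t ppa)
{
	ftl->l2p[lpn] = ppa;
}
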
So, F2FS and NILFS2 look like the host-aware case, because they are LFS
file systems oriented towards regular SSDs. It could be desirable for them
to have some knowledge about SSD internals (page size, erase block size and
so on), but mostly such knowledge only needs to be shared with the mkfs
tool during file system volume creation. Beyond that, the host-aware case
does not look very promising or very different from the device-managed
model. Even though F2FS and NILFS2 have a GC subsystem and are mostly
log-structured (F2FS has an in-place updated area; NILFS2 has in-place
updated superblocks at the beginning/end of the volume), both of these file
systems still rely completely on the device's indirection layer and GC
subsystem. We are still in the same hell of competing GCs. So, what is the
point of the host-aware model?

So, I am not completely convinced that we will end up with really
distinctive features for the device-managed, host-aware and host-managed
models. I also have many questions about the host-managed model if we use
the block device abstraction. How can direct management of SSD internals be
organized for the host-managed model if those internals are hidden under
the block device abstraction?

Another interesting question... Let's imagine that we create a file system
volume for one device geometry. For the host-aware or host-managed case,
the geometry details will be stored in the file system metadata during
volume creation. Then we back up this volume and restore it on a device
with a completely different geometry. What will we get in such a case?
Performance degradation? Or will we kill the device?

> The open-channel SSD interface is very 
> similar to the one exposed by SMR hard-drives. They both have a set of 
> chunks (zones) exposed, and zones are managed using open/close logic. 
> The main difference on open-channel SSDs is that it additionally exposes 
> multiple sets of zones through a hierarchical interface, which covers a 
> numbers levels (X channels, Y LUNs per channel, Z zones per LUN).

I would like to have access to channels/LUNs/zones at the file system
level. If, for example, a LUN is associated with a partition, then the file
system will need to aggregate several partitions inside one volume. First
of all, not every file system is ready to aggregate several partitions
inside one volume. Secondly, what about aggregating several physical
devices inside one volume? It looks slightly tricky to distinguish
partitions of the same device from different devices at the file system
level, doesn't it?

> I agree with Damien, but I'd also add that in the future there may very
> well be some new Zone types added to the ZBC model. 
> So we shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC model --- or not.

Different zone types are good. But maybe the LUN would be the better place
for distinguishing the different zone types. If a zone can have a type,
then any combination of zones becomes possible; but mostly, zones of a
given type will sit inside some contiguous area (inside a NAND die, for
example). So, the LUN looks like a representation of a NAND die.

>> SMR zone and NAND flash erase block look comparable but, finally, it 
>> significantly different stuff. Usually, SMR zone has 265 MB in size 
>> but NAND flash erase block can vary from 512 KB to 8 MB (it will be 
>> slightly larger in the future but not more than 32 MB, I suppose). It 
>> is possible to group several erase blocks into aggregated entity but 
>> it could be not very good policy from file system point of view.
>
> Why not? For f2fs, the 2MB segments are grouped together into sections
> with a size matching the device zone size. That works well and can actually
> even reduce the garbage collection overhead in some cases.
> Nothing in the kernel zoned block device support limits the zone size
> to a particular minimum or maximum. The only direct implication of the zone
> size on the block I/O stack is that BIOs and requests cannot cross zone
> boundaries. In an extreme setup, a zone size of 4KB would work too
> and result in read/write commands of 4KB at most to the device.

The situation with grouping segments into sections in F2FS is not so
simple. First of all, you need to fill such an aggregation with data. F2FS
distinguishes several types of segments, and that means the current
segment/section becomes larger. If you mix different types of segments
into one section (though I believe F2FS does not allow this), then the GC
overhead could be larger, I suppose. Otherwise, using one section per
segment type means that a section larger than a segment (2MB) changes the
rate at which sections of different data types fill up. As a result, it
dramatically changes the distribution of the different section types
across the file system volume. Does it reduce GC overhead? I am not sure.
And if a file system's segment has to be equal to the zone size (the
NILFS2 case, for example), then it could mean that you need to prepare the
whole segment before the real flush. And if you need to handle O_DIRECT or
a synchronous mount, then most probably you will need to flush a segment
with a huge hole. I suppose that could significantly decrease the file
system's free space, increase GC activity and decrease device lifetime.

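Just to make the size mismatch concrete, assuming f2fs's default 2 MB
segment and a typical 256 MB zone, a section sized to match one zone holds
128 segments; a trivial worked example:

/* Segments per section when a section is sized to match one zone.
 * Example numbers only: 2 MB f2fs segment, 256 MB zone -> 128. */
#include <stdio.h>

int main(void)
{
	unsigned long long segment_bytes = 2ULL << 20;	/* 2 MB */
	unsigned long long zone_bytes = 256ULL << 20;	/* 256 MB */

	printf("segments per section: %llu\n", zone_bytes / segment_bytes);
	return 0;
}
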
>> Another point that QLC device could have more tricky features of erase 
>> blocks management. Also we should apply erase operation on NAND flash 
>> erase block but it is not mandatory for the case of SMR zone.
>
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up to be for an open channel SSD.

OK. But I assume that an SMR zone "reset" is significantly cheaper than a
NAND flash block erase operation: you can fill your SMR zone with data,
"reset" it, and fill it with data again without a significant penalty.
Also, TRIM and a zone "reset" are different, I suppose, because TRIM is
only a hint for the SSD controller. If the SSD controller receives a TRIM
for some erase block, it doesn't mean that the erase operation will be done
immediately. Usually it is done in the background, because a real erase
operation is expensive.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24             ` Damien Le Moal
@ 2017-01-06  1:09               ` Jaegeuk Kim
  -1 siblings, 0 replies; 63+ messages in thread
From: Jaegeuk Kim @ 2017-01-06  1:09 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Linux FS Devel, linux-block, linux-nvme

Hello,

On 01/04, Damien Le Moal wrote:

...
> 
> > Finally, if I really like to develop SMR- or NAND flash oriented file
> > system then I would like to play with peculiarities of concrete
> > technologies. And any unified interface will destroy the opportunity 
> > to create the really efficient solution. Finally, if my software
> > solution is unable to provide some fancy and efficient features then
> > guys will prefer to use the regular stack (ext4, xfs + block layer).
> 
> Not necessarily. Again think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, with likely a resulting incredible complexity. More likely,
> similarly with what is happening with SMR, only models that make sense
> will be supported by FS implementation that can be easily modified.
> Example again here of f2fs: changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate in the existing code while
> keeping the existing on-disk format.

From the f2fs viewpoint, we now support a single host-managed SMR drive
that has a portion of conventional zones. In addition, f2fs supports
multiple devices [1], which enables us to use a pure host-managed SMR drive
with no conventional zones, working together with another small
conventional partition.

I think the current lightNVM with OCSSD aims towards a drive-managed
device for generic filesystems. Depending on the FTL, however, an OCSSD can
report conventional or sequential zones. 1) If the FTL handles random 4K
writes pretty well, it would be better to report conventional zones.
Otherwise, 2) if the FTL has almost nothing to map between LBA and PBA, it
can report sequential zones, like a pure host-managed SMR drive.

Interestingly, for the 1) host-aware model, there is no need to change
f2fs at all. In order to explore the 2) pure host-managed model, I
introduced aligned write IO [2] to make the FTL simpler by eliminating
partial page writes. IMHO, it'd be interesting to evaluate the several
zoned models of SMR and OCSSD accordingly.

[1] https://lkml.org/lkml/2016/11/9/727
[2] https://lkml.org/lkml/2016/12/30/242

Thanks,

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-05 22:58               ` Slava Dubeyko
@ 2017-01-06  1:11                 ` Theodore Ts'o
  -1 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2017-01-06  1:11 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Linux FS Devel, linux-block, linux-nvme

On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
> 
> Next point is read disturbance. If BER of physical page/block achieves some threshold then
> we need to move data from one page/block into another one. What subsystem will be
> responsible for this activity? The drive-managed case expects that device's GC will manage
> read disturbance issue. But what's about host-aware or host-managed case? If the host side
> hasn't information about BER then the host's software is unable to manage this issue. Finally,
> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime.
> Let's imagine that host-aware case could be unaware about read disturbance management.
> But how host-managed case can manage this issue?

One of the ways this could be done in the ZBC specification (assuming
that erase blocks == zones) would be to set the "reset" bit in the zone
descriptor which is returned by the REPORT ZONES EXT command.  This is
a hint that a reset write pointer should be sent to the zone in
question, and it could be set when you start seeing soft ECC errors or
the flash management layer has decided that the zone should be
rewritten in the near future.  A simple way to do this is to ask the
Host OS to copy the data to another zone and then send a reset write
pointer command for the zone.

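A minimal user-space sketch of that policy, using the zoned block device
ioctls from linux/blkzoned.h (this assumes a kernel with ZBC/ZAC support;
the copy-out step and most error handling are elided):

/*
 * Scan the zone report for zones whose "reset write pointer
 * recommended" hint is set and, after relocating their data
 * (elided), reset them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

#define NR_ZONES	128	/* zones fetched per report, arbitrary */

int main(int argc, char **argv)
{
	struct blk_zone_report *rep;
	struct blk_zone_range range;
	unsigned int i;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	if (fd < 0)
		return 1;

	rep = calloc(1, sizeof(*rep) + NR_ZONES * sizeof(struct blk_zone));
	if (!rep)
		return 1;
	rep->sector = 0;
	rep->nr_zones = NR_ZONES;
	if (ioctl(fd, BLKREPORTZONE, rep) < 0)
		return 1;

	for (i = 0; i < rep->nr_zones; i++) {
		struct blk_zone *z = &rep->zones[i];

		if (!z->reset)	/* "reset recommended" hint not set */
			continue;

		/* ... copy still-valid data out of the zone here ... */

		range.sector = z->start;
		range.nr_sectors = z->len;
		if (ioctl(fd, BLKRESETZONE, &range) < 0)
			perror("BLKRESETZONE");
	}
	return 0;
}
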
So I think it very much could be done, and done within the framework
of the ZBC model --- although whether SSD manufacturers will choose to
do this, and/or choose to engage the T10/T13 standards committees to
add the necessary extensions to the ZBC specification, is a question
that we probably can't answer in this venue or by the participants on
this thread.

> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
> for the host-managed case. But it means that the host should manage bad blocks and to have direct
> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
> layer and wear-leveling management will be unavailable on the host side. As a result, device will have
> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
> device lifetime).

So I can imagine a setup where the flash translation layer manages the
mapping between zone numbers and the physical erase blocks, such that
when the host OS issues a "reset write pointer", it immediately gets
a new erase block assigned to the specific zone in question.  The
original erase block would then get erased in the background, when the
flash chip in question is available for maintenance activities.

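As a rough sketch of that split (purely illustrative; every name below is
invented, and real allocation would of course be wear-aware):

/*
 * Device-side FTL keeps a zone -> erase-block binding. A "reset write
 * pointer" immediately rebinds the zone to a pre-erased block and
 * defers the expensive erase of the old block to background work.
 */
#include <stdint.h>

#define EXAMPLE_NR_BLOCKS	1024

static uint32_t free_blocks[EXAMPLE_NR_BLOCKS];	/* already erased */
static uint32_t nr_free;
static uint32_t pending_erase[EXAMPLE_NR_BLOCKS];
static uint32_t nr_pending;

struct example_zone {
	uint32_t erase_block;	/* physical erase block backing the zone */
	uint64_t write_pointer;	/* next write offset inside the zone */
};

static uint32_t example_pop_free_erased_block(void)
{
	return free_blocks[--nr_free];	/* caller ensures nr_free > 0 */
}

static void example_queue_background_erase(uint32_t blk)
{
	pending_erase[nr_pending++] = blk;	/* erased when die is idle */
}

/* Handle RESET WRITE POINTER for one zone. */
static void example_reset_zone(struct example_zone *zone)
{
	example_queue_background_erase(zone->erase_block);
	zone->erase_block = example_pop_free_erased_block();
	zone->write_pointer = 0;
}
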
I think you've been thinking about a model where *either* the host has
complete control over all aspects of the flash management, or the FTL
has complete control --- and it may be that there are more clever ways
that the work could be split between the flash device and the host OS.

> Another interesting question... Let's imagine that we create file system volume for one device
> geometry. It means that geometry details will be stored in the file system metadata during volume
> creation for the case host-aware or host-managed case. Then we backups this volume and restore
> the volume on device with completely different geometry. So, what will we have for such case?
> Performance degradation? Or will we kill the device?

This is why I suspect that exposing the full details of the Flash layout
via LUNs is a bad, bad, BAD idea.  It's much better to use an abstraction
such as Zones, and then have an abstraction layer that hides the low-level
details of the hardware from the OS.  The trick is picking an abstraction
that exposes the _right_ set of details so that the division of labor
between the Host OS and the storage device ends up in a better place.
Hence my suggestion of perhaps providing a virtual mapping layer between
"Zone number" and the low-level physical erase block.

> I would like to have access channels/LUNs/zones on file system level.
> If, for example, LUN will be associated with partition then it means
> that it will need to aggregate several partitions inside of one volume.
> First of all, not every file system is ready for the aggregation several
> partitions inside of the one volume. Secondly, what's about aggregation
> several physical devices inside of one volume? It looks like as slightly
> tricky to distinguish partitions of the same device and different devices
> on file system level. Isn't it?

Yes, this is why using LUNs is a BAD idea.  There's too much code
--- in file systems, in the block layer in terms of how we expose
block devices, etc., that assumes that different LUNs are used for
different logical containers of storage.  There have been decades of
usage of this concept by enterprise storage arrays.  Trying to
appropriate LUNs for another use case is stupid.  And maybe we can't
stop the OCSSD folks if they have gone down that questionable design path,
but there's nothing that says we have to expose it as a SCSI LUN
inside of Linux!

> OK. But I assume that SMR zone "reset" is significantly cheaper than
> NAND flash block erase operation. And you can fill your SMR zone with
> data then "reset" it and to fill again with data without significant penalty.

If you have a virtual mapping layer between zones and erase blocks, a
reset write pointer could be fast for SSDs as well.  And that allows
the implementation of your suggestion below:

> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
> like as a hint for SSD controller. If SSD controller receives TRIM for some
> erase block then it doesn't mean  that erase operation will be done
> immediately. Usually, it should be done in the background because real
> erase operation is expensive operation.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-06  1:11                 ` Theodore Ts'o
  (?)
@ 2017-01-06 12:51                   ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-06 12:51 UTC (permalink / raw)
  To: Theodore Ts'o, Slava Dubeyko
  Cc: Damien Le Moal, linux-nvme, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On 01/06/2017 02:11 AM, Theodore Ts'o wrote:
> On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
>>
>> Next point is read disturbance. If BER of physical page/block achieves some threshold then
>> we need to move data from one page/block into another one. What subsystem will be
>> responsible for this activity? The drive-managed case expects that device's GC will manage
>> read disturbance issue. But what's about host-aware or host-managed case? If the host side
>> hasn't information about BER then the host's software is unable to manage this issue. Finally,
>> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
>> it means possible unpredictable performance degradation and decreasing device lifetime.
>> Let's imagine that host-aware case could be unaware about read disturbance management.
>> But how host-managed case can manage this issue?
> 
> One of the ways this could be done in the ZBC specification (assuming
> that erase blocks == zones) would be set the "reset" bit in the zone
> descriptor which is returned by the REPORT ZONES EXT command.  This is
> a hint that the a reset write pointer should be sent to the zone in
> question, and it could be set when you start seeing soft ECC errors or
> the flash management layer has decided that the zone should be
> rewritten in the near future.  A simple way to do this is to ask the
> Host OS to copy the data to another zone and then send a reset write
> pointer command for the zone.

This is an interesting approach. Currently, the OCSSD interface has both a
soft ECC mark to tell the host to rewrite data and an explicit method to
make the host rewrite it, e.g., for the case where read scrubbing on the
device requires the host to move data for durability reasons.

Adding the information to "Report zones" is a good idea. It enables the
device to keep a list of "zones" that should be refreshed by the host but
have not yet been. I will add that to the specification.

> 
> So I think it very much could be done, and done within the framework
> of the ZBC model --- although whether SSD manufactuers will chose to
> do this, and/or choose to engage the T10/T13 standards committees to
> add the necessary extensions to the ZBC specification is a question
> that we probably can't answer in this venue or by the participants on
> this thread.
> 
>> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
>> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
>> for the host-managed case. But it means that the host should manage bad blocks and to have direct
>> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
>> layer and wear-leveling management will be unavailable on the host side. As a result, device will have
>> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
>> device lifetime).
> 
> So I can imagine a setup where the flash translation layer manages the
> mapping between zone numbers and the physical erase blocks, such that
> when the host OS issues an "reset write pointer", it immediately gets
> a new erase block assigned to the specific zone in question.  The
> original erase block would then get erased in the background, when the
> flash chip in question is available for maintenance activities.
> 
> I think you've been thinking about a model where *either* the host as
> complete control over all aspects of the flash management, or the FTL
> has complete control --- and it may be that there are more clever ways
> that the work could be split between flash device and the host OS.
> 
>> Another interesting question... Let's imagine that we create file system volume for one device
>> geometry. It means that geometry details will be stored in the file system metadata during volume
>> creation for the case host-aware or host-managed case. Then we backups this volume and restore
>> the volume on device with completely different geometry. So, what will we have for such case?
>> Performance degradation? Or will we kill the device?
> 
> This is why I suspect that exposing the full details of the details of
> the Flash layout via LUNS is a bad, bad, BAD idea.  It's much better
> to use an abstraction such as Zones, and then have an abstraction
> layer that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of
> details so that the division of labor betewen the Host OS and the
> storage device is at a better place.  Hence my suggestion of perhaps
> providing a virtual mapping layer betewen "Zone number" and the
> low-level physical erase block.

Agreed. The first approach was taken in the first iteration of the
specification. After its release we began to understand the chaos we had
just brought onto ourselves, so we moved to the zone/chunk approach in the
second iteration to simplify the interface.

> 
>> I would like to have access channels/LUNs/zones on file system level.
>> If, for example, LUN will be associated with partition then it means
>> that it will need to aggregate several partitions inside of one volume.
>> First of all, not every file system is ready for the aggregation several
>> partitions inside of the one volume. Secondly, what's about aggregation
>> several physical devices inside of one volume? It looks like as slightly
>> tricky to distinguish partitions of the same device and different devices
>> on file system level. Isn't it?
> 
> Yes, this is why using LUN's are a BAD idea.  There's too much code
> --- in file systems, in the block layer in terms of how we expose
> block devices, etc., that assumes that different LUN's are used for
> different logical containers of storage.  There has been decades of
> usage of this concept by enterprise storage arrays.  Trying to
> appropriate LUN's for another use case is stupid.  And maybe we can't
> stop OCSSD folks if they have gone down that questionable design path,
> but there's nothing that says we have to expose it as a SCSI LUN
> inside of Linux!

Heh, yes, really bad idea. The name "LUN" for OCSSDs could have been
chosen better; going forward, it is being renamed to "parallel unit". For
OCSSDs, all of the device's parallel units are exposed through the same
block device "LUN", which then has to be managed by the layers above.

> 
>> OK. But I assume that SMR zone "reset" is significantly cheaper than
>> NAND flash block erase operation. And you can fill your SMR zone with
>> data then "reset" it and to fill again with data without significant penalty.
> 
> If you have virtual mapping layer between zones and erase blocks, a
> reset write pointer could be fast for SSD's as well.  And that allows
> the implementation of your suggestion below:
> 
>> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
>> like as a hint for SSD controller. If SSD controller receives TRIM for some
>> erase block then it doesn't mean  that erase operation will be done
>> immediately. Usually, it should be done in the background because real
>> erase operation is expensive operation.
> 
> Cheers,
> 
> 					- Ted
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-06  1:09               ` Jaegeuk Kim
  (?)
@ 2017-01-06 12:55                 ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-06 12:55 UTC (permalink / raw)
  To: Jaegeuk Kim, Damien Le Moal
  Cc: Slava Dubeyko, linux-nvme, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc



On 01/06/2017 02:09 AM, Jaegeuk Kim wrote:
> Hello,
> 
> On 01/04, Damien Le Moal wrote:
> 
> ...
>>
>>> Finally, if I really want to develop an SMR- or NAND-flash-oriented file
>>> system, then I would like to play with the peculiarities of the concrete
>>> technologies. And any unified interface will destroy the opportunity
>>> to create a really efficient solution. Finally, if my software
>>> solution is unable to provide some fancy and efficient features, then
>>> people will prefer to use the regular stack (ext4, xfs + block layer).
>>
>> Not necessarily. Again, think in terms of a device "model" and its
>> associated feature set. An FS implementation may decide to support all
>> possible models, likely resulting in incredible complexity. More likely,
>> similarly to what is happening with SMR, only models that make sense
>> will be supported, by FS implementations that can be easily modified.
>> Take f2fs again as an example: the changes to support SMR were rather simple,
>> whereas the initial effort to support SMR with ext4 was pretty much
>> abandoned, as it was too complex to integrate into the existing code while
>> keeping the existing on-disk format.
> 
> From the f2fs viewpoint, we now support a single host-managed SMR drive that
> has a portion of conventional zones. In addition, f2fs supports multiple devices
> [1], which enables us to use a pure host-managed SMR drive that has no conventional
> zone, working together with another small conventional partition.

That is a good approach. SSD controllers may even implement a small FTL
inside the device for the "conventional" zones. The size wouldn't need to
be that big, and it may only be used to bootstrap the rest of the unit; a
zone of a couple hundred megabytes should do. That would simplify having
pblk on the side next to f2fs.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-05 22:58               ` Slava Dubeyko
  (?)
@ 2017-01-06 13:05                 ` Matias Bjørling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjørling @ 2017-01-06 13:05 UTC (permalink / raw)
  To: Slava Dubeyko, Damien Le Moal, Viacheslav Dubeyko, lsf-pc,
	Theodore Ts'o
  Cc: Linux FS Devel, linux-block, linux-nvme

On 01/05/2017 11:58 PM, Slava Dubeyko wrote:
> The next point is read disturbance. If the BER of a physical page/block reaches some threshold, then
> we need to move data from one page/block into another. What subsystem will be
> responsible for this activity? The drive-managed case expects that the device's GC will manage
> the read-disturbance issue. But what about the host-aware or host-managed case? If the host side
> has no information about BER, then the host's software is unable to manage this issue. Finally,
> it sounds like we will have a GC subsystem both on the file system side and on the device side. As a result,
> it means possible unpredictable performance degradation and decreased device lifetime.
> Let's imagine that the host-aware case could be unaware of read-disturbance management.
> But how can the host-managed case manage this issue?

The OCSSD interface uses a couple of methods:

1) Piggyback soft ECC errors onto the completion entry. This tells the host
that a block probably should be refreshed when appropriate.
2) Use an asynchronous interface, e.g., NVMe get log page. Blocks that have
been read-disturbed are reported through this interface. This may be coupled
with the various processes running on the SSD.
3) (As Ted suggested) Expose a "reset" bit in the Report Zones
command to let the host know which blocks should be reset. If the
plumbing for 2) is not available, or the information has been lost on
the host side, this method can be used to "resync".
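
For illustration only, here is a tiny sketch of how a host might consume such a
"needs refresh" report. The struct layout, field names, and threshold are
invented for this sketch; they are not taken from the OCSSD or NVMe
specifications.

/* Illustrative host-side view of a "blocks/zones to refresh" report. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct refresh_report_entry {
	uint32_t zone_id;       /* zone (parallel unit + chunk) to rewrite */
	uint16_t soft_ecc_bits; /* worst corrected-bit count seen on reads */
	uint16_t flags;         /* e.g. bit 0: "reset before deadline"     */
};

/* Host policy: queue a data move when the report crosses a threshold. */
static bool zone_needs_refresh(const struct refresh_report_entry *e,
			       uint16_t ecc_threshold)
{
	return e->soft_ecc_bits >= ecc_threshold || (e->flags & 0x1);
}

int main(void)
{
	struct refresh_report_entry e = { .zone_id = 42, .soft_ecc_bits = 9 };

	if (zone_needs_refresh(&e, 8))
		printf("queue zone %u for host-side rewrite\n", e.zone_id);
	return 0;
}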

> 
> Bad block management... So, the drive-managed and host-aware cases should be completely unaware
> of bad blocks. But what about the host-managed case? If a device hides bad blocks from
> the host, then it implies a mapping table, access to logical pages/blocks, and so on. If the host
> has no access to bad block management, then it's not a host-managed model. And it sounds like
> a completely unmanageable situation for the host-managed model. Because if the host has access
> to bad block management (but how?) then we have a really simple model. Otherwise, the host
> has access to logical pages/blocks only and the device must have internal GC. As a result,
> it means possible unpredictable performance degradation and decreased device lifetime because
> of competition between GC on the device side and GC on the host side.

Agree. Depending on the use case, one may expose a "perfect" interface
to the host, or one may expose an interface where media errors are
reported to the host. The former case is great for consumer units,
where I/O predictability isn't critical; conversely, if I/O
predictability is critical, the media errors can be reported, and the
host can deal with them appropriately.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
@ 2017-01-09  6:49                   ` Slava Dubeyko
  0 siblings, 0 replies; 63+ messages in thread
From: Slava Dubeyko @ 2017-01-09  6:49 UTC (permalink / raw)
  To: Theodore Ts'o, Matias Bjørling
  Cc: Damien Le Moal, Viacheslav Dubeyko, lsf-pc, Linux FS Devel,
	linux-block, linux-nvme


-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu] 
Sent: Thursday, January 5, 2017 5:12 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> I think you've been thinking about a model where *either* the host has complete control
> over all aspects of the flash management, or the FTL has complete control --- and it may
> be that there are more clever ways that the work could be split between
> the flash device and the host OS.

Yes, I totally agree that the better way is to split the responsibilities between the flash
device and the host (the file system, for example). I would like to consider an SSD device as a set
of FTL primitives. Let's imagine the SSD as an automaton that is able to execute FTL primitives,
while the file system issues the commands that orchestrate the SSD's activity. I believe it makes sense
to think of the SSD as a data-processing accelerator engine. It means that we need a good
interface that can be the basis for offloading data-processing operations. And I clearly see
many cases where a file system would like to say: "Hey, SSD, please execute this primitive
for me right now".

Let's consider the operation of moving zones (or erase blocks) with high BER.
If we have a completely passive SSD, then it sounds to me like every such operation will look like:
(1) read the data on the host side; (2) "reset" the zone; (3) write the data back into the
SSD. But if some zone (erase block(s)) with high BER is full of valid data,
then why does the host need to execute the whole operation in such a clumsy "read-write" way?
It simply doesn't make sense to spend the host's resources on such an operation.
The responsibility of the host is simply to initiate the operation at the proper time, and the responsibility
of the SSD is to execute the operation internally (offloading the operation). So, here we could
have an FTL primitive for moving zones (erase blocks) to overcome read disturbance.
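
A rough sketch of what such an offload command could look like follows. All of
the names and the layout are invented for this example; nothing here comes
from the OCSSD or ZBC specifications.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

enum ftl_op {
	FTL_RESET_ZONE,		/* reset write pointer, drop all data  */
	FTL_MOVE_ZONE,		/* copy all valid data to a fresh zone */
	FTL_COMPACT_ZONE,	/* copy only blocks marked in a bitmap */
};

struct ftl_cmd {
	enum ftl_op	op;
	uint32_t	src_zone;	/* e.g. the zone with high BER */
	uint32_t	dst_zone;	/* 0 = let the device pick     */
	const uint8_t	*valid_bitmap;	/* only for FTL_COMPACT_ZONE   */
	size_t		bitmap_bytes;
};

/* Stand-in for submitting the command to the device. */
static int ftl_submit(const struct ftl_cmd *cmd)
{
	printf("offload op %d on zone %u\n", cmd->op, cmd->src_zone);
	return 0;
}

int main(void)
{
	/* The host decides *when*; the device moves the disturbed zone. */
	struct ftl_cmd cmd = { .op = FTL_MOVE_ZONE, .src_zone = 17 };

	return ftl_submit(&cmd);
}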

Let's consider GC operations... Right now, we have a GC subsystem on the SSD side (the device-managed and
host-aware cases) and a GC subsystem on the host side (LFS file systems in the host-aware case).
So, it's clear that the SSD device is able to provide some GC primitives. It is also completely
unreasonable to have a GC subsystem both on the SSD side and on the host side. If we have a GC subsystem on
the host only, then we need to follow the wasteful "read-modify-write" paradigm and spend the host's
resources on GC operations. Conversely, if the GC subsystem is on the SSD side, then GC suffers from a lack of
knowledge about valid data location (the file system keeps this knowledge), and such a solution opens a
wide range of cases of unexpected performance degradation. So, we need a much smarter solution.
What could it be?

Again, the file system (host) has to initiate the GC operation at the proper time, but the SSD should execute
the requested operation (offloading the operation). So, we will have the GC subsystem on the file system
side, but the real GC operation on a zone (erase block(s)) will be executed by the SSD device. The key points
here are that: (1) the file system chooses a good time for the GC operation; (2) the file system is able to select a zone
(erase block(s)) that makes the GC activity cost-efficient in terms of the amount of valid data
in the aged zone; (3) the file system shares information about the valid pages in the zone (erase block(s));
(4) the SSD executes the GC operation on the zone internally.

We need to take into account three possible cases: (1) the zone is completely invalid; (2) the zone is partially
invalid; (3) the zone contains only valid data. If the file system's GC selects a zone that doesn't contain valid data
(the "invalid" zone case), then GC simply needs to request a zone "reset" or send a TRIM command. The rest is
the responsibility of the SSD device. If the zone is completely filled with valid data, then the file system's GC needs to
request a moving operation on the SSD side. If we use virtual zones, then such a moving
operation on the SSD side changes nothing for the file system (the logical block numbers stay the same).
So, the file system doesn't need to update its internal mapping table for that operation.
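
A minimal sketch of this host-side dispatch over the three cases, with zone
states and action descriptions invented purely for illustration:

#include <stdio.h>

enum zone_fill { ZONE_ALL_INVALID, ZONE_PARTIAL, ZONE_ALL_VALID };

static const char *gc_action(enum zone_fill fill)
{
	switch (fill) {
	case ZONE_ALL_INVALID:
		return "reset the zone (or TRIM); the device does the rest";
	case ZONE_ALL_VALID:
		return "ask the device to move the whole zone; LBAs unchanged";
	case ZONE_PARTIAL:
		return "send the valid-block bitmap; the device compacts the zone";
	}
	return "unknown";
}

int main(void)
{
	printf("%s\n", gc_action(ZONE_PARTIAL));
	return 0;
}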

The case of a partially invalid zone (one that contains some amount of valid data) is trickier. But let's consider
the situation. If the file system knows the positions of the valid logical blocks or pages inside a zone,
then the file system is able to share the zone's bitmap with the SSD device. It means that if we have a 4 KB
logical block and a 256 MB zone, then we need an 8 KB bitmap to represent the positions of the valid
logical blocks inside the zone. So, the file system is able to send such a valid-pages bitmap with the
command that initiates the GC operation for some zone. The responsibility of the SSD side will be to: (1) "reset" the
zone; (2) move the valid logical blocks from the aged zone into a new one using a compaction scheme.
I mean that all valid pages should be written in a contiguous manner into the newly allocated zone
(erase blocks). Finally, it means that the SSD device can reposition logical blocks inside the zone
without changing the initial order of the logical pages (the compaction scheme). Such a compaction scheme
can easily be implemented on the SSD side. And if we do not change the order of the logical blocks,
then we have a deterministic case that can easily be processed on the file system side. If the file system has
the initial bitmap, then it can easily recalculate the valid logical blocks' positions after the compaction
scheme is applied. For example, F2FS can easily do such a recalculation. Finally, the new values of the valid
logical blocks' positions should be stored in the file system's mapping table. NILFS2 is a slightly more complex case,
because NILFS2 describes the logical blocks inside a log by means of a special btree in the log's header.
So, again, the compaction scheme is a deterministic case that provides the opportunity to recalculate the
logical blocks' positions before the real GC operation. It means that NILFS2 is able to prepare both the valid
logical blocks' bitmap and the log's header before the GC operation and to share all of this with the SSD device.
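
A quick check of the bitmap arithmetic above, with a toy helper for marking a
block valid. The 4 KB block and 256 MB zone sizes are the ones used in this
mail; nothing else here is from a real driver.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE	(4u * 1024u)
#define ZONE_SIZE	(256u * 1024u * 1024u)
#define BLOCKS_PER_ZONE	(ZONE_SIZE / BLOCK_SIZE)	/* 65536 blocks */
#define BITMAP_BYTES	(BLOCKS_PER_ZONE / 8u)		/* 8192 = 8 KB  */

static void mark_valid(uint8_t *bitmap, uint32_t blk)
{
	bitmap[blk / 8] |= 1u << (blk % 8);
}

int main(void)
{
	static uint8_t bitmap[BITMAP_BYTES];

	memset(bitmap, 0, sizeof(bitmap));
	mark_valid(bitmap, 0);
	mark_valid(bitmap, BLOCKS_PER_ZONE - 1);
	printf("%u blocks per zone -> %u byte bitmap\n",
	       BLOCKS_PER_ZONE, BITMAP_BYTES);
	return 0;
}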

However, every GC operation on a partially invalid zone results in the creation of a zone that is only
partially filled with valid data (the rest of the zone is completely free). What should be done in such
a case? I can see four possible approaches:

(1) Re-use the partially filled zone. If the file system tracks the state of every zone (in a mapping table,
for example), or if it is possible to extract the state of a zone, then the aged zone will
change its state after the GC operation. So, the partially filled zone can be used as the current zone for
writing new data.

(2) Add the valid data of the aged zone to the tail of the current zone. Let's imagine that the file system is using
some zone as the current zone for adding new data. If we know that an aged zone contains some
number of valid pages, then it's possible to reserve space in the tail of the current zone. Finally,
it is possible to combine the flush operation (writing data from the current zone's page cache)
with the GC operation on the aged zone on the SSD side.

(3) Re-use the aged zone as the current zone. Let's imagine that we have some aged zone with a small
number of valid pages. It means that we can select this zone as the current zone for new data.
First of all, we need to: (1) "reset" the zone; (2) initiate the GC operation on the SSD device side. We know
how many valid pages we will have at the beginning of the current zone. So, we simply need
to add new logical blocks into the page cache of the current zone after the area reserved for the data from
the aged zone. Our GC operation will then run in the background of the new data preparation in the
page cache of the current zone. And, finally, we will have the whole zone full of data after the
flush operation.

(4) Merge several aged zones into a new one.

> It's much better to use an abstraction such as Zones, and then have an abstraction layer
> that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of details so that the division
> of labor between the Host OS and the storage device is at a better place.  Hence my suggestion
> of perhaps providing a virtual mapping layer between "Zone number" and
> the low-level physical erase block.

I like the idea of some abstraction that hides the low-level details. But it sounds like we will still
have two mapping tables, one on the SSD side and one on the file system side. Again, we need to distribute
the responsibilities between the file system and the SSD device. If the file system manages the GC activity
but the real GC operation is delegated to the SSD side (at the proper time), then it sounds like
all maintenance operations will be done by the SSD itself. It means that the SSD device can manage
only one mapping table, and the file system simply needs to have an up-to-date copy of that mapping table.
Or, conversely, the file system can manage the only mapping table and share its actual state
with the SSD device. But a single shared mapping table looks like a really complicated technique. From
another point of view, a virtual zone can always keep the same ID. So, the responsibility of the
SSD device will be to map the virtual zone ID to physical erase block IDs. Such a mapping
table (virtual zone ID <-> erase block(s)) can be more compact than a mapping table (LBA <->
physical page). The responsibility of the file system (host) will be the mapping inside of
the virtual zone (LBA <-> logical block inside the virtual zone). If the virtual zone ID is
always the same, then such a mapping table could be smaller in size. But I don't see how
such a mapping table can be smaller in size for the current implementations of F2FS or NILFS2.
However, if we imagine that a log is equal to a whole zone, then the header of the log
can include a likewise mapping table for the log/zone.
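
To get a feel for the size difference, here is a toy comparison of the two
mapping granularities. The 1 TB capacity and 4-byte entry size are my own
assumptions, purely for illustration.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t capacity = 1ULL << 40;		/* 1 TB device, assumed     */
	uint64_t page_sz  = 4096;		/* 4 KB logical block       */
	uint64_t zone_sz  = 256ULL << 20;	/* 256 MB virtual zone      */
	uint64_t entry_sz = 4;			/* per-entry bytes, assumed */

	/* LBA <-> physical page: one entry per logical block. */
	printf("page-level map : %llu MB\n",
	       (unsigned long long)(capacity / page_sz * entry_sz >> 20));
	/* virtual zone ID <-> erase block(s): one entry per zone. */
	printf("zone-level map : %llu KB\n",
	       (unsigned long long)(capacity / zone_sz * entry_sz >> 10));
	return 0;
}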

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-09  6:49                   ` Slava Dubeyko
  (?)
@ 2017-01-09 14:55                     ` Theodore Ts'o
  -1 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2017-01-09 14:55 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Damien Le Moal, Matias Bjørling, linux-nvme, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

So in the model where the Flash side is tracking the logical to physical
zone mapping, and the host is merely expecting the ZBC interface, one way
it could work is as follows.

1)  The flash signals that a particular zone should be reset soon.

2)  If the host does not honor the request, eventually the flash will
    have to do a forced copy of the zone to a new erase block.  (This
    is a fail-safe and shouldn't happen under normal circumstances.)

    (By the way, this model can be used for any number of things.  For
    example, for cloud workloads where tail latency is really
    important, it would be really cool if T10/T13 adopted a way that
    the host could be notified about the potential need for ATI
    remediation in a particular disk region, so the host could
    schedule it when it would be least likely to impact high priority,
    low latency workloads.  If the host fails to give permission to
    the firmware to do the ATI remediation before the "gotta go"
    deadline is exceeded, the disk could do the ATI remediation at
    that point to assure data integrity as a fail-safe.)

3) The host, since it has better knowledge of which blocks belong to
    which inode, and which inodes are more likely to have identical
    object lifetimes (example, all of the .o files in a directory are
    likely to be deleted at the same time when the user runs "make
    clean"; there was a Usenix or FAST paper over a decade ago the
    pointed out that doing hueristics based on file names were likely
    to be helpful), can do a better job of distributing the blocks to
    different partially filled sequential write preferred / sequential
    write required zones.

    The idea here is that you might have multiple zones that are
    partially filled based on expected object lifetime predictions.
    Or the host could move blocks based on the knowledge that a
    particular zone already has blocks that will share the same fate (e.g.,
    belong to the same inode) --- this is knowledge that the FTL cannot
    know, so with a sufficiently smart host file system, it ought
    to be able to do a better job than the FTL.

4) Since we assumed that the Flash is tracking logical to physical zone
   mappings, and the host is responsible for everything else, if the
   host decides to move blocks to different SMR zones, the host file
   system will be responsible for updating its existing (inode,
   logical block) to physical block (SMR zone plus offset) mapping
   tables.

The main advantage of this model is that, to the extent that there are
cloud/enterprise customers who are already implementing Host Aware SMR
storage solutions, they might be able to reuse code already written for
SMR HDD's for this model/interface.  Yes, some tweaks would probably be
needed since the design tradeoffs for disks and flash are very
different.  But the point is that the Host Managed and Host Aware SMR
models are ones that are well understood by everyone.
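
A rough host-side sketch of steps 1) and 3), assuming purely hypothetical
structure and helper names (this is not an existing kernel or ZBC interface):

#include <stdint.h>

/* Hypothetical per-zone hint reported by the device. */
struct zone_hint {
	uint64_t zone_start;	/* first LBA of the zone */
	int reset_recommended;	/* set by the device firmware */
};

/* Hypothetical helpers implemented by the host file system. */
void migrate_live_blocks_by_lifetime(uint64_t zone_start);
void reset_zone(uint64_t zone_start);

void handle_zone_hints(struct zone_hint *hints, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (!hints[i].reset_recommended)
			continue;
		/* The host groups still-live blocks by expected lifetime,
		 * copies them to partially filled zones, updates its own
		 * (inode, logical block) -> (zone, offset) map, and then
		 * resets the zone before the device's fail-safe kicks in. */
		migrate_live_blocks_by_lifetime(hints[i].zone_start);
		reset_zone(hints[i].zone_start);
	}
}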

				 ----

There is another model you might consider, and it's one which
Christoph Hellwig suggested at an LSF/MM at least 2-3 years ago, and
this is a model where the flash or the SMR disk could use a division
of labor similar to Object Based Disks (except hopefully with a less
awful interface).  The idea here is that you give up on LBA numbers,
and instead you move the entire responsibility of mapping (inode,
logical block) to (physical location) to the storage device.  The file
system would then be responsible for managing metadata (mod times,
user/group ownership, permission bits/ACL's, etc) and namespace issues
(e.g., directory pathnames to inode lookups).
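
Just to make the shape of such an interface concrete (this is a purely
hypothetical sketch, not the T10 OSD command set), the device-facing
operations would be keyed by object and offset rather than by LBA:

#include <stdint.h>

/* Hypothetical object-addressed I/O: the device never exposes LBAs. */
struct obj_io {
	uint64_t object_id;	/* e.g. derived from the inode number */
	uint64_t offset;	/* byte offset within the object */
	uint32_t length;
	void *buf;
};

int obj_create(uint64_t object_id, uint64_t lifetime_hint);
int obj_write(const struct obj_io *io);
int obj_read(struct obj_io *io);
int obj_delete(uint64_t object_id);	/* the device reclaims the space itself */

The file system would keep only namespace and attribute metadata; placement,
wear leveling, and GC would live entirely behind calls like these.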

So this solves the problem you seem to be concerned about in terms of
keeping mapping information at two layers, and it solves it
completely, since the file system no longer has to do a mapping
between inode+logical offset and LBA number, which it would in the
models you've outlined to date.  It also solves the problem of giving
the storage device more information about which blocks belong to which
inode/object, and it would also make it easier for the OS to pass
object lifetime and shared fate hints to the storage device.  This
should hopefully allow the FTL or STL to do a better job, since it now
has access to low-level hardware information (e.g., BER / Soft ECC
failures) as well as higher-level object information, and so can do a
better job with storage layout and garbage collection decisions.

				 ----

A fair criticism of all of the models discussed to date (the one based
on ZBC, the object-based storage model, and OCSSD) is that none of them
has a mature implementation, in either the open source or the closed
source world.  But since that's true for *all* of them, we should be
using other criteria for deciding which model is the best one to choose
for the long term.

The advantage of the ZBC model is that people have had several years
to consider and understand the model, so in terms of mind share it has
an advantage.

The advantage of the object-based model is that it transfers a lot of
the complexity to the storage device, so the job that needs to be done
by the file system is much simpler than either of the other two models.

The advantage of the OCSSD model is that it exposes a lot of the raw
flash complexities to the host.  This can be good in that the host can
now do a really good job of optimizing for a particular flash technology.
The downside is that by exposing all of that complexity to the host,
it makes file system design very fragile, since as the number of chips
changes, or the size of erase blocks changes, or as flash develops
new capabilities such as erase suspend/resume, *all* of that hair gets
exposed to the file system implementor.

Personally, I think that's why either the ZBC model or the
object-based model makes a lot more sense than something where we
expose all of the vagaries of NAND flash to the file system.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04 16:57               ` Theodore Ts'o
@ 2017-01-10  1:42                 ` Damien Le Moal
  -1 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2017-01-10  1:42 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Linux FS Devel, linux-block, linux-nvme


Ted,

On 1/5/17 01:57, Theodore Ts'o wrote:
> I agree with Damien, but I'd also add that in the future there may
> very well be some new Zone types added to the ZBC model.  So we
> shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC
> model --- or not.

Totally agree. There is already some activity in T10 for a ZBC V2
standard which may indeed include new zone types (for instance a
"circular buffer" zone type that can be sequentially rewritten without a
reset, preserving previously written data for reads after the write
pointer). Such a zone type could be a perfect match for an FS journal
log space, for instance.

> Either way, that's not really relevant as far as the Linux block layer
> is concerned, since the Linux block layer is designed to be an
> abstraction on top of hardware --- and in some cases we can use a
> similar abstraction on top of eMMC's, SCSI's, and SATA's
> implementation definition of TRIM/DISCARD/WRITE SAME/SECURE
> TRIM/QUEUED TRIM, even though they are different in some subtle ways,
> and may have different performance characteristics and semantics.
> 
> The trick is to expose similarities where the differences won't matter
> to the upper layers, but also to expose the fine distinctions and
> allow the file system and/or user space to use the protocol-specific
> differences when it matters to them.

Absolutely. The initial zoned block device support was written to match
what ZBC/ZAC defines. It was simple this way and there were no other
users of the zone concept. But the device models and zone types are just
numerical values reported to the device user. The block I/O stack
currently does not use these values beyond the device initialization. It
is up to the users (e.g. FS) of the device to determine what to do to
correctly use the device according to the types reported. So this basic
design is definitely extensible to new zone types and device models.

> 
> Designing that is going to be important, and I can guarantee we won't
> get it right at first.  Which is why it's a good thing that internal
> kernel interfaces aren't cast into concrete, and can be subject to
> change as new revisions to ZBC, or new interfaces (like perhaps
> OCSSD's) get promulgated by various standards bodies or by various
> vendors.

Indeed. The ZBC case was simple as we matched the standard-defined
models. Which in any case are not really used in any way directly by the
block I/O stack itself. Only upper layers use them.

In the case of OCSSDs, this adds one hardware-defined model set by the
standard, plus a potential collection of software-defined models through
different FTL implementations on the host. Getting these models and
their APIs right will indeed be tricky. As a first step, providing a
ZBC-like host-aware model and a host-managed model may be a good idea, as
upper layer code already ready for ZBC disks will work out-of-the-box
for OCSSDs too. From there, I can see a lot of possibilities for more
SSD-optimized models though.

>>> Another point that QLC device could have more tricky features of
>>> erase blocks management. Also we should apply erase operation on NAND
>>> flash erase block but it is not mandatory for the case of SMR zone.
>>
>> Incorrect: host-managed devices require a zone "reset" (equivalent to
>> discard/trim) to be reused after being written once. So again, the
>> "tricky features" you mention will depend on the device "model",
>> whatever this ends up to be for an open channel SSD.
> 
> ... and this is exposed by having different zone types (sequential
> write required vs sequential write preferred vs conventional).  And if
> OCSSD's "zones" don't fit into the current ZBC zone types, we can
> easily add new ones.  I would suggest however, that we explicitly
> disclaim that the block device layer's code points for zone types is
> an exact match with the ZBC zone types numbering, precisely so we can
> add new zone types that correspond to abstractions from different
> hardware types, such as OCSSD.

The struct blk_zone type is 64B in size but currently only uses 32B. So
there is room for new fields, and existing fields can have newly defined
values too, as the ZBC standard uses only a few of the possible values in
the structure fields.
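
For reference, this is roughly what the current uapi structure looks like
(from include/uapi/linux/blkzoned.h, quoted from memory, so check the header
for the authoritative layout):

struct blk_zone {
	__u64	start;		/* zone start sector */
	__u64	len;		/* zone length in number of sectors */
	__u64	wp;		/* zone write pointer position */
	__u8	type;		/* zone type */
	__u8	cond;		/* zone condition */
	__u8	non_seq;	/* non-sequential write resources active */
	__u8	reset;		/* reset write pointer recommended */
	__u8	reserved[36];	/* room for new fields and zone types */
};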

>> Not necessarily. Again think in terms of device "model" and associated
>> feature set. An FS implementation may decide to support all possible
>> models, with likely a resulting incredible complexity. More likely,
>> similarly with what is happening with SMR, only models that make sense
>> will be supported by FS implementation that can be easily modified.
>> Example again here of f2fs: changes to support SMR were rather simple,
>> whereas the initial effort to support SMR with ext4 was pretty much
>> abandoned as it was too complex to integrate in the existing code while
>> keeping the existing on-disk format.
> 
> I'll note that Abutalib Aghayev and I will be presenting a paper at
> the 2017 FAST conference detailing a way to optimize ext4 for
> Host-Aware SMR drives by making a surprisingly small set of changes to
> ext4's journalling layer, with some very promising performance
> improvements for certain workloads, which we tested on both Seagate
> and WD HA drives and achieved 2x performance improvements.  Patches
> are on the unstable portion of the ext4 patch queue, and I hope to get
> them into an upstream acceptable shape (as opposed to "good enough for
> a research paper") in the next few months.

Thank you for the information. I will check this out. Is it the
optimization that aggressively delays meta-data updates by allowing
meta-data blocks to be read directly from the journal (for blocks that
are not yet updated in place)?

> So it may very well be that small changes can be made to file systems
> to support exotic devices if there are ways that we can expose the
> right information about underlying storage devices, and offering the
> right abstractions to enable the right kind of minimal I/O tagging, or
> hints, or commands as necessary such that the changes we do need to
> make to the file system can be kept small, and kept easily testable
> even if hardware is not available.
> 
> For example, by creating device mapper emulators of the feature sets
> of these advanced storage interfaces that are exposed via the block
> layer abstractions, whether it be for ZBC zones, or hardware
> encryption acceleration, etc.

Emulators may indeed be very useful for development. But we could also
go further and implement the different models as device mapper targets too.
Doing so, the same device could be used with different FTLs through the
same DM interface. And this may also simplify the implementation of
complex models using DM stacking (e.g. the host-aware model can be
implemented on top of a host-managed model).
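
As a sketch of that direction (the "zoned-ha" target name and the empty
callback bodies below are hypothetical, only meant to show the shape a
software FTL implemented as a DM target could take):

#include <linux/module.h>
#include <linux/device-mapper.h>

static int zoned_ha_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	/* open the underlying host-managed zoned device, load the zone map */
	return 0;
}

static void zoned_ha_dtr(struct dm_target *ti)
{
	/* flush the zone map and release the underlying device */
}

static int zoned_ha_map(struct dm_target *ti, struct bio *bio)
{
	/* a real target would stage non-sequential writes into staging zones
	 * and remap bio->bi_bdev / bi_iter.bi_sector before returning */
	return DM_MAPIO_REMAPPED;
}

static struct target_type zoned_ha_target = {
	.name    = "zoned-ha",		/* hypothetical target name */
	.version = {1, 0, 0},
	.module  = THIS_MODULE,
	.ctr     = zoned_ha_ctr,
	.dtr     = zoned_ha_dtr,
	.map     = zoned_ha_map,
};

static int __init zoned_ha_init(void)
{
	return dm_register_target(&zoned_ha_target);
}

static void __exit zoned_ha_exit(void)
{
	dm_unregister_target(&zoned_ha_target);
}

module_init(zoned_ha_init);
module_exit(zoned_ha_exit);
MODULE_LICENSE("GPL");

Stacking the models would then just be a matter of layering such targets
with dmsetup.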

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10  1:42                 ` Damien Le Moal
@ 2017-01-10  4:24                   ` Theodore Ts'o
  -1 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2017-01-10  4:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko, lsf-pc,
	Linux FS Devel, linux-block, linux-nvme

On Tue, Jan 10, 2017 at 10:42:45AM +0900, Damien Le Moal wrote:
> Thank you for the information. I will check this out. Is it the
> optimization that aggressively delay meta-data update by allowing
> reading of meta-data blocks directly from the journal (for blocks that
> are not yet updated in place) ?

Essentially, yes.  In some cases, the metadata might never be written
back to its permanent location on disk.  Instead, we might copy a
metadata block from the tail of the journal to the head, if there is
enough space.  So effectively, it turns the journal into a log
structured store for metadata blocks.  Over time, if metadata blocks
become cold we can evict them from the journal to make room for more
frequently updated metadata blocks (given the currently running
workload).  Eliminating the random 4k updates to the allocation
bitmaps, inode table blocks, etc., really helps with host-aware SMR
drives, without requiring the massive amount of changes needed to make
a file system compatible with the host-managed model.
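
A minimal sketch of the read path, assuming a hypothetical jmap_lookup()
helper (the names are mine, not the actual ext4/jbd2 patches):

#include <stdint.h>

/* Hypothetical journal map entry: metadata block -> newest copy in journal. */
struct jmap_entry {
	uint64_t fs_block;	/* "home" location of the metadata block */
	uint64_t journal_block;	/* where its newest copy currently lives */
};

struct jmap_entry *jmap_lookup(uint64_t fs_block);	/* hypothetical lookup */

uint64_t metadata_read_location(uint64_t fs_block)
{
	struct jmap_entry *e = jmap_lookup(fs_block);

	/* If the newest copy is still in the sequentially written journal,
	 * read it from there and skip the random in-place update entirely. */
	return e ? e->journal_block : fs_block;
}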

When you consider that for most files, they are never updated in
place, but are usually just replaced, I suspect that with some
additional adjustments to a traditional file system's block allocator,
we can get even further wins.  But that's for future work...

> Emulators may indeed be very useful for development. But we could also
> go further and implement the different models using device mappers too.
> Doing so, the same device could be used with different FTL through the
> same DM interface. And this may also simplify the implementation of
> complex models using DM stacking (e.g. the host-aware model can be
> implemented on top of a host-managed model).

Yes, indeed.  That would also allow people to experiment with how much
benefit can be derived if we were to give additional side channel
information to STLs / FTLs --- since it's much easier to adjust kernel
code than to go through negotiations with an HDD/SSD vendor to make
firmware changes!

This may be an area where if we can create the right framework, and
fund some research work, we might be able to get some researchers and
their graduate students interested in doing some work in figuring out
what sort of divisions of responsibilities and hints back and forth
between the storage device and host have the most benefit.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10  4:24                   ` Theodore Ts'o
@ 2017-01-10 13:06                     ` Matias Bjorling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjorling @ 2017-01-10 13:06 UTC (permalink / raw)
  To: Theodore Ts'o, Damien Le Moal
  Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel,
	linux-block, linux-nvme

On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
> This may be an area where if we can create the right framework, and
> fund some research work, we might be able to get some researchers and
> their graduate students interested in doing some work in figuring out
> what sort of divisions of responsibilities and hints back and forth
> between the storage device and host have the most benefit.
> 

That is a good idea. There are a couple of papers at FAST on
Open-Channel SSDs this year.  They look into the interface and various
ways to reduce latency fluctuations.

One thing I've heard a couple of times is the idea of moving the GC
read/write process into the firmware, enabling the host to offload GC
data movement while keeping the host in control. Would this be
beneficial for SMR?

-Matias

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10 13:06                     ` Matias Bjorling
@ 2017-01-11  4:07                       ` Damien Le Moal
  -1 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2017-01-11  4:07 UTC (permalink / raw)
  To: Matias Bjorling, Theodore Ts'o
  Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel,
	linux-block, linux-nvme


Matias,

On 1/10/17 22:06, Matias Bjorling wrote:
> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>> This may be an area where if we can create the right framework, and
>> fund some research work, we might be able to get some researchers and
>> their graduate students interested in doing some work in figuring out
>> what sort of divisions of responsibilities and hints back and forth
>> between the storage device and host have the most benefit.
>>
> 
> That is a good idea. There is a couple of papers at FAST with
> Open-Channel SSDs this year.  They look into the interface and various
> ways to reduce latency fluctuations.
> 
> One thing I've heard a couple of times is the feature to move the GC
> read/write process into the firmware. Enabling the host to offload GC
> data movement, while the keeping the host in control. Would this be
> beneficial for SMR?

Host-aware SMR drives already have GC internally implemented (for cases
when the host does not write sequentially). Host-managed drives do not.
As for moving application-specific GC code into the device, well,
code injection into the storage device is not for tomorrow, and likely not
ever.

There are however other clever ways to reduce GC-related host overhead
with basic commands. For SCSI, commands such as WRITE SCATTERED and
EXTENDED COPY, among others, can greatly reduce the overhead of a simple
read+write loop. A better approach to GC offload may not be a "GC"
command, but something more generic for moving LBAs around internally
within the device. That is, if the existing commands are not satisfactory.
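
Purely as a hypothetical sketch (not an existing SCSI command or kernel
interface), a generic "move these LBAs internally" offload could look like:

#include <stdint.h>

struct lba_extent {
	uint64_t lba;
	uint32_t nr_blocks;
};

/* Hypothetical in-device copy offload: the host names scattered source
 * extents (e.g. the valid data of a victim zone) and a destination, and
 * the device moves the data without it ever crossing the host bus. */
struct device_move_cmd {
	const struct lba_extent *src;
	uint32_t nr_src;
	uint64_t dst_lba;	/* e.g. the write pointer of the destination zone */
};

int device_move(int fd, const struct device_move_cmd *cmd);	/* hypothetical ioctl-style call */

The host stays in control of what moves where; the device only saves the
read+write round trip.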

Best.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-11  4:07                       ` Damien Le Moal
@ 2017-01-11  6:06                         ` Matias Bjorling
  -1 siblings, 0 replies; 63+ messages in thread
From: Matias Bjorling @ 2017-01-11  6:06 UTC (permalink / raw)
  To: Damien Le Moal, Theodore Ts'o
  Cc: Slava Dubeyko, Viacheslav Dubeyko, lsf-pc, Linux FS Devel,
	linux-block, linux-nvme

On 01/11/2017 05:07 AM, Damien Le Moal wrote:
> 
> Matias,
> 
> On 1/10/17 22:06, Matias Bjorling wrote:
>> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>>> This may be an area where if we can create the right framework, and
>>> fund some research work, we might be able to get some researchers and
>>> their graduate students interested in doing some work in figuring out
>>> what sort of divisions of responsibilities and hints back and forth
>>> between the storage device and host have the most benefit.
>>>
>>
>> That is a good idea. There is a couple of papers at FAST with
>> Open-Channel SSDs this year.  They look into the interface and various
>> ways to reduce latency fluctuations.
>>
>> One thing I've heard a couple of times is the feature to move the GC
>> read/write process into the firmware. Enabling the host to offload GC
>> data movement, while the keeping the host in control. Would this be
>> beneficial for SMR?
> 
> Host-aware SMR drives already have GC internally implemented (for cases
> when the host does not write sequentially). Host-managed drives do not.
> As for moving an application specific GC code into the device, well,
> code injection in the storage device is not for tomorrow, and likely not
> ever.
> 
> There are however other clever ways to reduce GC related host overhead
> with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED
> COPY, and some others can greatly improve overhead over a simple
> read+write loop. A better approach to GC offload may not be a "GC"
> command, but something more generic for moving around LBAs internally
> within the device. That is, if existing commands are not satisfactory.

Hi Damien,

You're right. I was thinking of something similar to scattered
read/write to move data from one place to another. There is no
sector-granularity mapping table maintained by the OCSSD, which leaves
the logic up to the host.

Let me know if you decide to kick off a standardized interface for code
injection. Such an interface is long overdue. ;)



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-11  4:07                       ` Damien Le Moal
@ 2017-01-11  7:49                         ` Hannes Reinecke
  -1 siblings, 0 replies; 63+ messages in thread
From: Hannes Reinecke @ 2017-01-11  7:49 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjorling, Theodore Ts'o
  Cc: Slava Dubeyko, linux-nvme, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On 01/11/2017 05:07 AM, Damien Le Moal wrote:
> 
> Matias,
> 
> On 1/10/17 22:06, Matias Bjorling wrote:
>> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>>> This may be an area where if we can create the right framework, and
>>> fund some research work, we might be able to get some researchers and
>>> their graduate students interested in doing some work in figuring out
>>> what sort of divisions of responsibilities and hints back and forth
>>> between the storage device and host have the most benefit.
>>>
>>
>> That is a good idea. There is a couple of papers at FAST with
>> Open-Channel SSDs this year.  They look into the interface and various
>> ways to reduce latency fluctuations.
>>
>> One thing I've heard a couple of times is the feature to move the GC
>> read/write process into the firmware. Enabling the host to offload GC
>> data movement, while the keeping the host in control. Would this be
>> beneficial for SMR?
> 
> Host-aware SMR drives already have GC internally implemented (for cases
> when the host does not write sequentially). Host-managed drives do not.
> As for moving an application specific GC code into the device, well,
> code injection in the storage device is not for tomorrow, and likely not
> ever.
> 
> There are however other clever ways to reduce GC related host overhead
> with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED
> COPY, and some others can greatly improve overhead over a simple
> read+write loop. A better approach to GC offload may not be a "GC"
> command, but something more generic for moving around LBAs internally
> within the device. That is, if existing commands are not satisfactory.
> 
Logical head depop rears its head again...

But yes, I think it's more sensible to have I/O functions which help GC
(like UNMAP) instead of influencing the GC itself.

Anyway. Given the length of this thread I guess this is a worthy topic
for LSF.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-02 21:06 ` Matias Bjørling
                   ` (2 preceding siblings ...)
  (?)
@ 2017-01-12  1:33 ` Damien Le Moal
  2017-01-12  2:18     ` James Bottomley
  -1 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2017-01-12  1:33 UTC (permalink / raw)
  To: Matias Bjørling, lsf-pc; +Cc: Linux FS Devel, linux-block, linux-nvme

Hello,

A long discussion on the list followed this initial topic proposal from
Matias. I think this is a worthy topic to discuss at LSF in order to
steer development of the zoned block device interface in the right
direction. Considering the relation to and implications for ZBC/ZAC support,
I would like to attend LSF/MM to participate in this discussion.

Thank you.

Best regards.

On 1/3/17 06:06, Matias Bjørling wrote:
> Hi,
> 
> The open-channel SSD subsystem is maturing, and drives are beginning to 
> become available on the market. The open-channel SSD interface is very 
> similar to the one exposed by SMR hard-drives. They both have a set of 
> chunks (zones) exposed, and zones are managed using open/close logic. 
> The main difference on open-channel SSDs is that it additionally exposes 
> multiple sets of zones through a hierarchical interface, which covers a 
> numbers levels (X channels, Y LUNs per channel, Z zones per LUN).
> 
> Given that the SMR interface is similar to OCSSDs interface, I like to 
> propose to discuss this at LSF/MM to align the efforts and make a clear 
> path forward:
> 
> 1. SMR Compatibility
> 
> Can the SMR host interface be adapted to Open-Channel SSDs? For example, 
> the interface may be exposed as a single-level set of zones, which 
> ignore the channel and lun concept for simplicity. Another approach 
> might be to extend the SMR implementation sysfs entries to expose the 
> hierarchy of the device (channels with X LUNs and each luns have a set 
> of zones).
> 
> 2. How to expose the tens of LUNs that OCSSDs have?
> 
> An open-channel SSDs typically has 64-256 LUNs that each acts as a 
> parallel unit. How can these be efficiently exposed?
> 
> One may expose these as separate namespaces/partitions. For a DAS with 
> 24 drives, that will be 1536-6144 separate LUNs to manage. That many 
> LUNs will blow up the host with gendisk instances. While if we do, then 
> we have an excellent 1:1 mapping between the SMR interface and the OCSSD 
> interface.
> 
> On the other hand, one could expose the device LUNs within a single LBA 
> address space and lay the LUNs out linearly. In that case, the block 
> layer may expose a variable that enables applications to understand this 
> hierarchy. Mainly the channels with LUNs. Any warm feelings towards this?
> 
> Currently, a shortcut is taken with the geometry and hierarchy, which 
> expose it through the /lightnvm sysfs entries. These (or a type thereof) 
> can be moved to the block layer /queue directory.
> 
> If keeping the LUNs exposed on the same gendisk, vector I/Os becomes a 
> viable path:
> 
> 3. Vector I/Os
> 
> To derive parallelism from an open-channel SSD (and SSDs in parallel), 
> one need to access them in parallel. Parallelism is achieved either by 
> issuing I/Os for each LUN (similar to driving multiple SSDs today) or 
> using a vector interface (encapsulating a list of LBAs, length, and data 
> buffer) into the kernel. The latter approach allows I/Os to be 
> vectorized and sent as a single unit to hardware.
> 
> Implementing this in generic block layer code might be overkill if only 
> open-channel SSDs use it. I like to hear other use-cases (e.g., 
> preadv/pwritev, file-systems, virtio?) that can take advantage of 
> vectored I/Os. If it makes sense, then which level to implement: 
> bio/request level, SGLs, or a new structure?
> 
> Device drivers that support vectored I/Os should be able to opt into the 
> interface, while the block layer may automatically roll out for device 
> drivers that don't have the support.
> 
> What has the history been in the Linux kernel about vector I/Os? What 
> have reasons in the past been that such an interface was not adopted?
> 
> I will post RFC SMR patches before LSF/MM, such that we have a firm 
> ground to discuss how it may be integrated.
> 
> -- Besides OCSSDs, I also like to participate in the discussions of 
> XCOPY, NVMe, multipath, multi-queue interrupt management as well.
> 
> -Matias
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-12  1:33 ` [LSF/MM " Damien Le Moal
@ 2017-01-12  2:18     ` James Bottomley
  0 siblings, 0 replies; 63+ messages in thread
From: James Bottomley @ 2017-01-12  2:18 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjørling, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme

On Thu, 2017-01-12 at 10:33 +0900, Damien Le Moal wrote:
> Hello,
> 
> A long discussion on the list followed this initial topic proposal 
> from Matias. I think this is a worthy topic to discuss at LSF in 
> order to steer development of the zoned block device interface in the 
> right direction. Considering the relation and implication to ZBC/ZAC
> support, I would like to attend LSF/MM to participate in this
> discussion.

Just a note for the poor admin looking after the lists: to find all the
ATTEND and TOPIC requests for the lists, I fold up the threads to the
top.  If you frame your attend request as a reply, it's possible it
won't get counted because I didn't find it,

so please *start a new thread* for ATTEND and TOPIC requests.

Thanks,

James

PS If you think you sent a TOPIC/ATTEND request in reply to something,
then I really haven't seen it because this is the first one I noticed,
and you should resend.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-12  2:18     ` James Bottomley
@ 2017-01-12  2:35       ` Damien Le Moal
  -1 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2017-01-12  2:35 UTC (permalink / raw)
  To: James Bottomley, Matias Bjørling, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme

James,

On 1/12/17 11:18, James Bottomley wrote:
> On Thu, 2017-01-12 at 10:33 +0900, Damien Le Moal wrote:
>> Hello,
>>
>> A long discussion on the list followed this initial topic proposal 
>> from Matias. I think this is a worthy topic to discuss at LSF in 
>> order to steer development of the zoned block device interface in the 
>> right direction. Considering the relation and implication to ZBC/ZAC
>> support, I would like to attend LSF/MM to participate in this
>> discussion.
> 
> Just a note for the poor admin looking after the lists: to find all the
> ATTEND and TOPIC requests for the lists I fold up the threads to the
> top.  If you frame your attend request as a reply, it's possible it
> won't get counted because I didn't find it
> 
> so please *start a new thread* for ATTEND and TOPIC requests.

My apologies for the overhead. I will resend.
Thank you.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-12  2:35       ` Damien Le Moal
@ 2017-01-12  2:38         ` James Bottomley
  -1 siblings, 0 replies; 63+ messages in thread
From: James Bottomley @ 2017-01-12  2:38 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjørling, lsf-pc
  Cc: Linux FS Devel, linux-block, linux-nvme

On Thu, 2017-01-12 at 11:35 +0900, Damien Le Moal wrote:
> > Just a note for the poor admin looking after the lists: to find all 
> > the ATTEND and TOPIC requests for the lists I fold up the threads 
> > to the top.  If you frame your attend request as a reply, it's 
> > possible it won't get counted because I didn't find it
> > 
> > so please *start a new thread* for ATTEND and TOPIC requests.
> 
> My apologies for the overhead. I will resend.
> Thank you.

You don't need to resend ... I've got you on the list.  I replied
publicly just in case there were any other people who did this that I
didn't notice.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2017-01-12  2:38 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-02 21:06 [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os Matias Bjørling
2017-01-02 21:06 ` Matias Bjørling
2017-01-02 21:06 ` Matias Bjørling
2017-01-02 23:12 ` Viacheslav Dubeyko
2017-01-02 23:12   ` Viacheslav Dubeyko
2017-01-02 23:12   ` Viacheslav Dubeyko
2017-01-03  8:56   ` Matias Bjørling
2017-01-03  8:56     ` Matias Bjørling
2017-01-03 17:35     ` Viacheslav Dubeyko
2017-01-03 17:35       ` Viacheslav Dubeyko
2017-01-03 17:35       ` Viacheslav Dubeyko
2017-01-03 19:10       ` Matias Bjørling
2017-01-03 19:10         ` Matias Bjørling
2017-01-04  2:59         ` Slava Dubeyko
2017-01-04  2:59           ` Slava Dubeyko
2017-01-04  2:59           ` Slava Dubeyko
2017-01-04  7:24           ` Damien Le Moal
2017-01-04  7:24             ` Damien Le Moal
2017-01-04 12:39             ` Matias Bjørling
2017-01-04 12:39               ` Matias Bjørling
2017-01-04 16:57             ` Theodore Ts'o
2017-01-04 16:57               ` Theodore Ts'o
2017-01-10  1:42               ` Damien Le Moal
2017-01-10  1:42                 ` Damien Le Moal
2017-01-10  4:24                 ` Theodore Ts'o
2017-01-10  4:24                   ` Theodore Ts'o
2017-01-10 13:06                   ` Matias Bjorling
2017-01-10 13:06                     ` Matias Bjorling
2017-01-11  4:07                     ` Damien Le Moal
2017-01-11  4:07                       ` Damien Le Moal
2017-01-11  6:06                       ` Matias Bjorling
2017-01-11  6:06                         ` Matias Bjorling
2017-01-11  7:49                       ` Hannes Reinecke
2017-01-11  7:49                         ` Hannes Reinecke
2017-01-05 22:58             ` Slava Dubeyko
2017-01-05 22:58               ` Slava Dubeyko
2017-01-05 22:58               ` Slava Dubeyko
2017-01-06  1:11               ` Theodore Ts'o
2017-01-06  1:11                 ` Theodore Ts'o
2017-01-06 12:51                 ` Matias Bjørling
2017-01-06 12:51                   ` Matias Bjørling
2017-01-06 12:51                   ` Matias Bjørling
2017-01-09  6:49                 ` Slava Dubeyko
2017-01-09  6:49                   ` Slava Dubeyko
2017-01-09  6:49                   ` Slava Dubeyko
2017-01-09 14:55                   ` Theodore Ts'o
2017-01-09 14:55                     ` Theodore Ts'o
2017-01-09 14:55                     ` Theodore Ts'o
2017-01-06 13:05               ` Matias Bjørling
2017-01-06 13:05                 ` Matias Bjørling
2017-01-06 13:05                 ` Matias Bjørling
2017-01-06  1:09             ` Jaegeuk Kim
2017-01-06  1:09               ` Jaegeuk Kim
2017-01-06 12:55               ` Matias Bjørling
2017-01-06 12:55                 ` Matias Bjørling
2017-01-06 12:55                 ` Matias Bjørling
2017-01-12  1:33 ` [LSF/MM " Damien Le Moal
2017-01-12  2:18   ` [Lsf-pc] " James Bottomley
2017-01-12  2:18     ` James Bottomley
2017-01-12  2:35     ` Damien Le Moal
2017-01-12  2:35       ` Damien Le Moal
2017-01-12  2:38       ` James Bottomley
2017-01-12  2:38         ` James Bottomley
