From: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
To: "Damien Le Moal" <Damien.LeMoal@wdc.com>,
	"Matias Bjørling" <m@bjorling.me>,
	"Viacheslav Dubeyko" <slava@dubeyko.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"Theodore Ts'o" <tytso@mit.edu>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
Date: Thu, 5 Jan 2017 22:58:57 +0000	[thread overview]
Message-ID: <SN2PR04MB2191BE43398C84C4D262960488600@SN2PR04MB2191.namprd04.prod.outlook.com> (raw)
In-Reply-To: <a5acdf89-99c0-774d-cb08-76cb5d846f29@wdc.com>

-----Original Message-----
From: Damien Le Moal 
Sent: Tuesday, January 3, 2017 11:25 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

<skipped>

> But you are missing the parallel with SMR. For SMR, or more correctly zoned
> block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs,
> 3 models exists: drive-managed, host-aware and host-managed.
> Case (1) above corresponds *exactly* to the drive managed model, with
> the difference that the abstraction of the device characteristics (SMR
> here) is in the drive FW and not in a host-level FTL implementation
> as it would be for open channel SSDs. Case (2) above corresponds to the host-managed
> model, that is, the device user has to deal with the device characteristics
> itself and use it correctly. The host-aware model lies in between these 2 extremes:
> it offers the possibility of complete abstraction by default, but also allows a user
> to optimize its operation for the device by allowing access to the device characteristics.
> So this would correspond to a possible third way of implementing an FTL for open channel SSDs.

I see your point. And I think that, historically, we need to distinguish four cases for
NAND flash:
(1) drive-managed: regular file systems (ext4, xfs and so on);
(2) host-aware: flash-friendly file systems (NILFS2, F2FS and so on);
(3) host-managed: <file systems under implementation>;
(4) old-fashioned flash-oriented file systems for raw NAND (jffs, yaffs, ubifs and so on).

But, frankly speaking, even regular file systems are slightly flash-aware today because of
blkdev_issue_discard() (TRIM) and the REQ_META flag. So the next really important question is:
what can/should be exposed for the host-managed and host-aware cases? What is the principal
difference between these models? In the end, the difference is not so clear.
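
Just to make the TRIM point concrete, here is a minimal userspace sketch (the device path and
range are placeholders) of how a discard hint reaches a block device via the BLKDISCARD ioctl;
in-kernel file systems call blkdev_issue_discard() for the same purpose:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKDISCARD */

int main(void)
{
	int fd = open("/dev/nvme0n1", O_WRONLY);   /* placeholder device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* range[0] = byte offset, range[1] = byte length to discard */
	uint64_t range[2] = { 0, 1ULL << 20 };

	/* This is only a hint: the controller may erase the blocks much later. */
	if (ioctl(fd, BLKDISCARD, &range) < 0)
		perror("BLKDISCARD");
	return 0;
}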

Let's start with error correction. Only flash-oriented file systems take care of error
correction themselves. I assume that the drive-managed, host-aware and host-managed cases
all expect hardware-based error correction. So we can treat a logical page/block as an ideal
byte stream that always contains valid data. There is no difference and no contradiction
here.

The next point is read disturbance. If the BER of a physical page/block reaches some threshold,
the data has to be moved from that page/block to another one. Which subsystem will be
responsible for this activity? The drive-managed case expects the device's GC to handle the
read disturbance issue. But what about the host-aware or host-managed case? If the host side
has no information about BER, then the host's software is unable to manage this issue. In the
end, it sounds like we will have a GC subsystem both on the file system side and on the device
side. As a result, we get possible unpredictable performance degradation and reduced device
lifetime. Let's accept that the host-aware case can stay unaware of read disturbance management.
But how can the host-managed case manage this issue?

Bad block management... The drive-managed and host-aware cases should be completely unaware
of bad blocks. But what about the host-managed case? If a device hides bad blocks from
the host, that implies a mapping table, access through logical pages/blocks, and so on. If the
host has no access to bad block management, then it's not a host-managed model, and that sounds
like a completely unmanageable situation for the host-managed model. If the host does have access
to bad block management (but how?), then we have a really simple model. Otherwise, the host
has access to logical pages/blocks only and the device has to have an internal GC. As a result,
we again get possible unpredictable performance degradation and reduced device lifetime because
of the competition between the GC on the device side and the GC on the host side.

Wear leveling... The device will be responsible for wear leveling in the device-managed
and host-aware models. It looks like the host side should be responsible for wear leveling
in the host-managed case. But that means the host has to manage bad blocks and to have direct
access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by the device's
indirection layer and wear-leveling management will be unavailable on the host side. As a result,
the device will have an internal GC and the traditional issues (possible unpredictable performance
degradation and reduced device lifetime). But even if the SSD exposes all of its internals, how
will a file system be able to implement wear leveling or bad block management through regular
I/O operations, given that the block device presents an LBA abstraction to us? Does it mean that
a software FTL at the block layer level is able to manage the SSD internals directly? Then, again,
the file system cannot manage the SSD internals directly in the software FTL case. And where
should a software FTL keep its mapping table, for example?
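
Purely as an illustration (all names here are hypothetical, not from any existing implementation),
this is the kind of state a host-side software FTL would have to keep somewhere -- in host RAM,
in a reserved region of the device, or both -- which is exactly the open question above:

#include <stdint.h>
#include <stdlib.h>

#define FTL_UNMAPPED UINT32_MAX

struct ftl_map {
	uint32_t *l2p;          /* logical page -> physical page */
	uint32_t *erase_count;  /* per erase block, for a wear-leveling policy */
	uint32_t nr_lpages;
	uint32_t nr_blocks;
};

/* Out-of-place update: a write went to a new physical page, so remap the logical page. */
void ftl_remap(struct ftl_map *m, uint32_t lpage, uint32_t new_ppage)
{
	m->l2p[lpage] = new_ppage;
}

/* Account an erase so a wear-leveling policy can prefer less-worn blocks later. */
void ftl_note_erase(struct ftl_map *m, uint32_t block)
{
	m->erase_count[block]++;
}

int main(void)
{
	struct ftl_map m = {
		.l2p = malloc(1024 * sizeof(uint32_t)),
		.erase_count = calloc(16, sizeof(uint32_t)),
		.nr_lpages = 1024,
		.nr_blocks = 16,
	};

	for (uint32_t i = 0; i < m.nr_lpages; i++)
		m.l2p[i] = FTL_UNMAPPED;    /* nothing mapped yet */

	ftl_remap(&m, 42, 7);    /* logical page 42 now lives in physical page 7 */
	ftl_note_erase(&m, 0);   /* erase block 0 has been erased once */
	return 0;
}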

So F2FS and NILFS2 look like the host-aware case, because they are LFS file systems oriented
towards regular SSDs. It could be desirable for them to have some knowledge (page size, erase
block size and so on) about the SSD internals. But mostly such knowledge only needs to be shared
with the mkfs tool during file system volume creation. The rest does not look very promising,
and not very different from the device-managed model. Even though F2FS and NILFS2 have a GC
subsystem and mostly behave like the LFS case (F2FS has an in-place updated area; NILFS2 has
in-place updated superblocks at the beginning/end of the volume), both of these file systems
still rely completely on the device's indirection layer and GC subsystem. We are still in the
same hell of competing GCs. So what's the point of the host-aware model?
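
For the zoned case, at least, an mkfs tool can already query the geometry it needs at volume
creation time. A minimal sketch (the device path is a placeholder, error handling is trimmed)
using the BLKREPORTZONE ioctl from <linux/blkzoned.h>, assuming a kernel with zoned block
device support:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	struct blk_zone_report *rep;
	int fd = open("/dev/sdb", O_RDONLY);       /* placeholder device */

	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	if (fd < 0 || !rep) {
		perror("open/calloc");
		return 1;
	}

	/* Ask for a single zone starting at sector 0. */
	rep->sector = 0;
	rep->nr_zones = 1;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
		perror("BLKREPORTZONE");
		return 1;
	}

	/* Zone length in 512-byte sectors; this is what mkfs would record on disk. */
	printf("zone size: %llu MiB\n",
	       (unsigned long long)rep->zones[0].len * 512 / (1024 * 1024));
	return 0;
}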

So, I am not completely convinced that we will end up with really distinctive features for the
device-managed, host-aware and host-managed models. I also have many questions about the
host-managed model if we use the block device abstraction. How can direct management of the
SSD internals be organized for a host-managed model that is hidden under a block device
abstraction?

Another interesting question... Let's imagine that we create a file system volume for one device
geometry. It means the geometry details will be stored in the file system metadata during volume
creation for the host-aware or host-managed case. Then we back up this volume and restore it
onto a device with a completely different geometry. What will we get in such a case?
Performance degradation? Or will we kill the device?

> The open-channel SSD interface is very 
> similar to the one exposed by SMR hard-drives. They both have a set of 
> chunks (zones) exposed, and zones are managed using open/close logic. 
> The main difference on open-channel SSDs is that it additionally exposes 
> multiple sets of zones through a hierarchical interface, which covers a 
> numbers levels (X channels, Y LUNs per channel, Z zones per LUN).

I would like to have access to channels/LUNs/zones at the file system level.
If, for example, a LUN is associated with a partition, then the file system
will need to aggregate several partitions inside one volume.
First of all, not every file system is ready to aggregate several
partitions inside one volume. Secondly, what about aggregating
several physical devices inside one volume? It looks slightly
tricky to distinguish partitions of the same device from different devices
at the file system level, doesn't it?

> I agree with Damien, but I'd also add that in the future there may very
> well be some new Zone types added to the ZBC model. 
> So we shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC model --- or not.

Different zone types are good. But maybe the LUN would be the better place
for distinguishing the different zone types. If a zone can have a type,
then any combination of zones is possible. But mostly, zones
of a given type will sit inside some contiguous area (inside a NAND
die, for example). So a LUN looks like a representation of a NAND die.

>> SMR zone and NAND flash erase block look comparable but, finally, it 
>> significantly different stuff. Usually, SMR zone has 265 MB in size 
>> but NAND flash erase block can vary from 512 KB to 8 MB (it will be 
>> slightly larger in the future but not more than 32 MB, I suppose). It 
>> is possible to group several erase blocks into aggregated entity but 
>> it could be not very good policy from file system point of view.
>
> Why not? For f2fs, the 2MB segments are grouped together into sections
> with a size matching the device zone size. That works well and can actually
> even reduce the garbage collection overhead in some cases.
> Nothing in the kernel zoned block device support limits the zone size
> to a particular minimum or maximum. The only direct implication of the zone
> size on the block I/O stack is that BIOs and requests cannot cross zone
> boundaries. In an extreme setup, a zone size of 4KB would work too
> and result in read/write commands of 4KB at most to the device.

The situation with grouping segments into sections in F2FS
is not so simple. First of all, you need to fill such an aggregation with data.
F2FS distinguishes several types of segments, and this means the current
segment/section becomes larger. If you mix different types of segments in
one section (though I believe F2FS doesn't provide an opportunity to do this),
then the GC overhead could be larger, I suppose. Otherwise, using one section
per segment type means that a current section larger than a
segment (2 MB) changes the speed at which sections fill with
different types of data. As a result, it will dramatically change the distribution
of the different section types across the file system volume. Does it reduce GC overhead?
I am not sure. And if the file system's segment has to be equal to the zone size
(for example, the NILFS2 case), then you may need to prepare the
whole segment before a real flush. And if you need to handle O_DIRECT
or a synchronous mount, then most probably you will have to flush a
segment with a huge hole. I suppose that this could significantly decrease the file system's
free space, increase GC activity and decrease device lifetime.
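
For reference, the arithmetic behind matching sections to zones, as described above, is simple;
the 256 MB zone size below is only an example value:

#include <stdio.h>

#define F2FS_SEGMENT_BYTES (2ULL * 1024 * 1024)   /* 2 MB segment, as in f2fs */

int main(void)
{
	unsigned long long zone_bytes = 256ULL * 1024 * 1024;   /* example zone size */
	unsigned long long segs_per_sec = zone_bytes / F2FS_SEGMENT_BYTES;

	/* 256 MB zone / 2 MB segment = 128 segments per section */
	printf("segments per section: %llu\n", segs_per_sec);
	return 0;
}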

>> Another point that QLC device could have more tricky features of erase 
>> blocks management. Also we should apply erase operation on NAND flash 
>> erase block but it is not mandatory for the case of SMR zone.
>
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up to be for an open channel SSD.

OK. But I assume that an SMR zone "reset" is significantly cheaper than a
NAND flash block erase operation, and you can fill your SMR zone with
data, "reset" it, and fill it with data again without a significant penalty.
Also, TRIM and zone "reset" are different, I suppose, because TRIM looks
like a hint for the SSD controller. If the SSD controller receives a TRIM for some
erase block, it doesn't mean that the erase operation will be done
immediately. Usually it is done in the background, because the real
erase operation is expensive.
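
At the block device interface the difference is visible already. A minimal sketch, assuming a
kernel with zoned block device support (the device path, zone start and sizes are placeholders,
and on a real system the two calls would target different devices): a zone "reset" is an
explicit, mandatory rewind of the zone write pointer before rewriting, while a discard/TRIM
is only advisory:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* BLKDISCARD */
#include <linux/blkzoned.h>  /* BLKRESETZONE, struct blk_zone_range */

int main(void)
{
	int fd = open("/dev/sdb", O_WRONLY);   /* placeholder device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Host-managed zone: must be reset before it can be rewritten. */
	struct blk_zone_range zr = {
		.sector     = 0,        /* zone start, in 512-byte sectors */
		.nr_sectors = 524288,   /* one 256 MB zone */
	};
	if (ioctl(fd, BLKRESETZONE, &zr) < 0)
		perror("BLKRESETZONE");

	/* Conventional SSD: discard is a hint, the erase may happen much later. */
	uint64_t range[2] = { 0, 256ULL * 1024 * 1024 };   /* byte offset, byte length */
	if (ioctl(fd, BLKDISCARD, &range) < 0)
		perror("BLKDISCARD");

	return 0;
}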

Thanks,
Vyacheslav Dubeyko.

