* [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
@ 2016-02-23  2:56 Damien Le Moal
  2016-02-23  3:56 ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2016-02-23  2:56 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-block, linux-scsi


Hello,

I would like to attend LSF/MM 2016 to discuss the following topics.

1) Online Logical Head Depop

Some disk drives available on the market already provide a "logical
depop" function which allows a system to decommission a defective
disk head, reformat the disk, and continue using the same disk with
a reduced capacity. Such a feature can reduce operation costs
(delayed HDD replacement) but has the drawbacks of data loss (including
the data under the remaining valid heads) and disk downtime during
re-formatting.

Online logical head depop is a proposed new feature that retains the
disk's valid data and eliminates the need for a disk re-format.
The basic idea is to introduce new commands allowing the host to discover
the ranges of LBAs impacted by a defective head. Using this information,
the host can take action when a disk head failure is suspected
or reported:
(a) The impacted LBAs can be depopulated, resulting in the disk
operating as a “thin provisioned” device.
(b) The impacted LBAs can be amputated, resulting in errors for all
subsequent accesses to the LBAs under the defective head.
(c) Optionally, a host may decide to reformat (compact) the disk to
restore operation as a fully-provisioned device with a lower capacity.

The goal of the discussion would be to gather the developers' opinions
on drafting a command standard that minimizes the impact of this
feature on the block I/O stack and allows simple use of the
feature by file systems and device mapper drivers (including the logical
volume manager).


2) Write back of dirty pages to SMR block devices:

Dirty pages of a block device inode are currently processed using the
generic_writepages function, which can be executed simultaneously
by multiple contexts (e.g. sync, fsync, msync, sync_file_range, etc.).
Since mutual exclusion of the dirty page processing is achieved only at
the page level (page lock & page writeback flag), multiple processes
executing a "sync" of overlapping block ranges over the same zone of
an SMR disk can cause an out-of-LBA-order sequence of write requests
to be sent to the underlying device. On a host-managed SMR disk, where
sequential writes within disk zones are mandatory, this results in errors
and makes it impossible to guarantee an application using raw sequential
disk write accesses the successful completion of its write or fsync
requests.

Using the zone information attached to the SMR block device queue
(introduced by Hannes), calls to the generic_writepages function can
be made mutually exclusive on a per-zone basis by locking the zones.
This guarantees sequential request generation for each zone and avoids
write errors without any modification to the generic code implementing
generic_writepages.
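
For illustration, here is a minimal sketch of what such a zone-aligned,
zone-locked wrapper around generic_writepages could look like. This is not
the actual patch: the smr_zone_bytes/smr_zone_lock/smr_zone_unlock helpers
(and the blkdev_writepages_zoned name) are invented stand-ins for whatever
zone size and per-zone locking the SMR/ZBC patches attach to the device,
the zone size is assumed to be a power of two, and writeback accounting
details are omitted.

/*
 * Hypothetical sketch, not the actual patch: split a writeback request at
 * zone boundaries and serialize each zone's writeback so that pages are
 * always issued in increasing LBA order within a zone.
 */
#include <linux/fs.h>
#include <linux/writeback.h>

static int blkdev_writepages_zoned(struct address_space *mapping,
				   struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
	loff_t zone_bytes = smr_zone_bytes(inode);	/* assumed helper */
	loff_t start = wbc->range_start;
	loff_t end = (wbc->range_end == LLONG_MAX) ?
			i_size_read(inode) - 1 : wbc->range_end;
	int ret = 0;

	while (start <= end && !ret) {
		/* Clamp this pass to the single zone containing 'start'. */
		loff_t zone_end = round_down(start, zone_bytes) + zone_bytes - 1;
		struct writeback_control zone_wbc = *wbc;

		zone_wbc.range_start = start;
		zone_wbc.range_end = min(end, zone_end);

		/* Serialize all writeback of this zone (assumed helpers). */
		smr_zone_lock(inode, start);
		ret = generic_writepages(mapping, &zone_wbc);
		smr_zone_unlock(inode, start);

		start = zone_end + 1;
	}

	return ret;
}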

This is but one possible solution for supporting SMR host-managed
devices without any major rewrite of page cache management and
write-back processing. The audience's opinion on this solution, as
well as a discussion of other potential solutions, would be greatly
appreciated.

Thank you.

Best regards.


------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 


* Re: [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-23  2:56 [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages Damien Le Moal
@ 2016-02-23  3:56 ` Bart Van Assche
  2016-02-23  5:31   ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2016-02-23  3:56 UTC (permalink / raw)
  To: Damien Le Moal, lsf-pc; +Cc: linux-block, linux-scsi, Matias Bjorling

On 02/22/16 18:56, Damien Le Moal wrote:
> 2) Write back of dirty pages to SMR block devices:
>
> Dirty pages of a block device inode are currently processed using the
> generic_writepages function, which can be executed simultaneously
> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
> Mutual exclusion of the dirty page processing being achieved only at
> the page level (page lock & page writeback flag), multiple processes
> executing a "sync" of overlapping block ranges over the same zone of
> an SMR disk can cause an out-of-LBA-order sequence of write requests
> being sent to the underlying device. On a host managed SMR disk, where
> sequential write to disk zones is mandatory, this result in errors and
> the impossibility for an application using raw sequential disk write
> accesses to be guaranteed successful completion of its write or fsync
> requests.
>
> Using the zone information attached to the SMR block device queue
> (introduced by Hannes), calls to the generic_writepages function can
> be made mutually exclusive on a per zone basis by locking the zones.
> This guarantees sequential request generation for each zone and avoid
> write errors without any modification to the generic code implementing
> generic_writepages.
>
> This is but one possible solution for supporting SMR host-managed
> devices without any major rewrite of page cache management and
> write-back processing. The opinion of the audience regarding this
> solution and discussing other potential solutions would be greatly
> appreciated.

Hello Damien,

Is it sufficient to support filesystems like BTRFS on top of SMR drives,
or would you also like to see filesystems like ext4 be able to use SMR
drives? In the latter case: the behavior of SMR drives differs so
significantly from that of other block devices that I'm not sure we
should try to support these directly from infrastructure like the page
cache. If we look e.g. at NAND SSDs then we see that the characteristics
of NAND do not match what filesystems expect (e.g. large erase blocks).
That is why every SSD vendor provides an FTL (Flash Translation Layer),
either inside the SSD or as a separate software driver. An FTL
implements a so-called LFS (log-structured filesystem). From what I know
about SMR, this technology also looks suitable for the implementation of
an LFS. Has implementing an LFS driver for SMR drives already been
considered? That would make it possible for any filesystem to access an
SMR drive like any other block device. I'm not sure of this, but maybe it
will be possible to share some infrastructure with the LightNVM driver
(directory drivers/lightnvm in the Linux kernel tree), since that driver
implements an FTL.

Bart.


* Re: [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-23  3:56 ` Bart Van Assche
@ 2016-02-23  5:31   ` Damien Le Moal
  2016-02-23  8:40     ` [Lsf-pc] " Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2016-02-23  5:31 UTC (permalink / raw)
  To: Bart Van Assche, lsf-pc; +Cc: linux-block, linux-scsi, Matias Bjorling


>On 02/22/16 18:56, Damien Le Moal wrote:
>> 2) Write back of dirty pages to SMR block devices:
>>
>> Dirty pages of a block device inode are currently processed using the
>> generic_writepages function, which can be executed simultaneously
>> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>> Mutual exclusion of the dirty page processing being achieved only at
>> the page level (page lock & page writeback flag), multiple processes
>> executing a "sync" of overlapping block ranges over the same zone of
>> an SMR disk can cause an out-of-LBA-order sequence of write requests
>> being sent to the underlying device. On a host managed SMR disk, where
>> sequential write to disk zones is mandatory, this result in errors and
>> the impossibility for an application using raw sequential disk write
>> accesses to be guaranteed successful completion of its write or fsync
>> requests.
>>
>> Using the zone information attached to the SMR block device queue
>> (introduced by Hannes), calls to the generic_writepages function can
>> be made mutually exclusive on a per zone basis by locking the zones.
>> This guarantees sequential request generation for each zone and avoid
>> write errors without any modification to the generic code implementing
>> generic_writepages.
>>
>> This is but one possible solution for supporting SMR host-managed
>> devices without any major rewrite of page cache management and
>> write-back processing. The opinion of the audience regarding this
>> solution and discussing other potential solutions would be greatly
>> appreciated.
>
>Hello Damien,
>
>Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>or would you also like to see that filesystems like ext4 can use SMR 
>drives ? In the latter case: the behavior of SMR drives differs so 
>significantly from that of other block devices that I'm not sure that we 
>should try to support these directly from infrastructure like the page 
>cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>of NAND do not match what filesystems expect (e.g. large erase blocks). 
>That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>either inside the SSD or as a separate software driver. An FTL 
>implements a so-called LFS (log-structured filesystem). With what I know 
>about SMR this technology looks also suitable for implementation of a 
>LFS. Has it already been considered to implement an LFS driver for SMR 
>drives ? That would make it possible for any filesystem to access an SMR 
>drive as any other block device. I'm not sure of this but maybe it will 
>be possible to share some infrastructure with the LightNVM driver 
>(directory drivers/lightnvm in the Linux kernel tree). This driver 
>namely implements an FTL.

Hello Bart,


Thank you for your comments.

I totally agree with you that trying to support SMR disks by only modifying
the page cache, so that unmodified standard file systems like BTRFS or ext4
remain operational, is not realistic at best, and more likely simply impossible.
For this kind of use case, as you said, an FTL or a device mapper driver is
much more suitable.

The case I am considering for this discussion is raw block device accesses
by an application (writes from user space to /dev/sdxx). This is a very likely
use case for high-capacity SMR disks with applications like distributed
object stores / key-value stores.

In this case, write-back of dirty pages in the block device file inode mapping
is handled in fs/block_dev.c using the generic helper function generic_writepages.
This does not guarantee the generation of the sequential write pattern per zone
required by host-managed disks. As I explained, aligning calls of this
function to zone boundaries while locking the zones under write-back simply
solves the problem (implemented and tested). This is of course only one possible
solution. Pushing modifications deeper into the code or providing a
"generic_sequential_writepages" helper function are other potential solutions
that in my opinion are worth discussing, as other types of devices may also
benefit in terms of performance (e.g. regular disk drives, and SSDs as well,
prefer sequential writes) and/or a lighter overhead on an underlying FTL or
device mapper driver.

For a file system, an SMR-compliant implementation of a file inode mapping
writepages method should be provided by the file system itself, since the
sequentiality of the write pattern further depends on the block allocation
mechanism of the file system.

Note that the goal here is not to hide the sequential write constraint of
SMR disks from applications. The page cache itself (the mapping of the block
device inode) remains unchanged. But the proposed modification guarantees that
a well-behaved application writing sequentially to zones through the page cache
will see successful sync operations.
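
As an illustration of what "well-behaved" means here, the short user-space
sketch below writes one zone of a raw SMR block device strictly sequentially
through the page cache and then calls fsync(). The device path and the
256 MiB zone size are placeholders, not values taken from this discussion.

/*
 * Illustration of a "well-behaved" buffered access pattern: strictly
 * sequential writes within one zone of a raw SMR block device, followed
 * by fsync(). The device path and zone size are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/dev/sdX";			/* placeholder device */
	const off_t zone_start = 0;			/* first zone of the disk */
	const size_t zone_size = 256UL << 20;		/* assumed 256 MiB zone */
	const size_t buf_size = 1UL << 20;
	char *buf = malloc(buf_size);
	int fd = open(dev, O_WRONLY);
	off_t off;

	if (fd < 0 || !buf)
		return 1;
	memset(buf, 0xab, buf_size);

	/* Fill the zone from its start, in order, leaving no holes. */
	for (off = zone_start; off < zone_start + (off_t)zone_size; off += buf_size)
		if (pwrite(fd, buf, buf_size, off) != (ssize_t)buf_size)
			return 1;

	/*
	 * With per-zone serialized writeback as proposed above, this sync is
	 * expected to reach the disk as one sequential stream and succeed.
	 */
	if (fsync(fd) < 0)
		perror("fsync");

	close(fd);
	free(buf);
	return 0;
}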

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com


* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-23  5:31   ` Damien Le Moal
@ 2016-02-23  8:40     ` Jan Kara
  2016-02-24  1:53       ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-02-23  8:40 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, lsf-pc, linux-block, Matias Bjorling, linux-scsi

On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> 
> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> 2) Write back of dirty pages to SMR block devices:
> >>
> >> Dirty pages of a block device inode are currently processed using the
> >> generic_writepages function, which can be executed simultaneously
> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
> >> Mutual exclusion of the dirty page processing being achieved only at
> >> the page level (page lock & page writeback flag), multiple processes
> >> executing a "sync" of overlapping block ranges over the same zone of
> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
> >> being sent to the underlying device. On a host managed SMR disk, where
> >> sequential write to disk zones is mandatory, this result in errors and
> >> the impossibility for an application using raw sequential disk write
> >> accesses to be guaranteed successful completion of its write or fsync
> >> requests.
> >>
> >> Using the zone information attached to the SMR block device queue
> >> (introduced by Hannes), calls to the generic_writepages function can
> >> be made mutually exclusive on a per zone basis by locking the zones.
> >> This guarantees sequential request generation for each zone and avoid
> >> write errors without any modification to the generic code implementing
> >> generic_writepages.
> >>
> >> This is but one possible solution for supporting SMR host-managed
> >> devices without any major rewrite of page cache management and
> >> write-back processing. The opinion of the audience regarding this
> >> solution and discussing other potential solutions would be greatly
> >> appreciated.
> >
> >Hello Damien,
> >
> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
> >or would you also like to see that filesystems like ext4 can use SMR 
> >drives ? In the latter case: the behavior of SMR drives differs so 
> >significantly from that of other block devices that I'm not sure that we 
> >should try to support these directly from infrastructure like the page 
> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
> >either inside the SSD or as a separate software driver. An FTL 
> >implements a so-called LFS (log-structured filesystem). With what I know 
> >about SMR this technology looks also suitable for implementation of a 
> >LFS. Has it already been considered to implement an LFS driver for SMR 
> >drives ? That would make it possible for any filesystem to access an SMR 
> >drive as any other block device. I'm not sure of this but maybe it will 
> >be possible to share some infrastructure with the LightNVM driver 
> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
> >namely implements an FTL.
> 
> I totally agree with you that trying to support SMR disks by only modifying
> the page cache so that unmodified standard file systems like BTRFS or ext4
> remain operational is not realistic at best, and more likely simply impossible.
> For this kind of use case, as you said, an FTL or a device mapper driver are
> much more suitable.
> 
> The case I am considering for this discussion is for raw block device accesses
> by an application (writes from user space to /dev/sdxx). This is a very likely
> use case scenario for high capacity SMR disks with applications like distributed
> object stores / key value stores.
> 
> In this case, write-back of dirty pages in the block device file inode mapping
> is handled in fs/block_dev.c using the generic helper function generic_writepages.
> This does not guarantee the generation of the required sequential write pattern
> per zone necessary for host-managed disks. As I explained, aligning calls of this
> function to zone boundaries while locking the zones under write-back solves
> simply the problem (implemented and tested). This is of course only one possible
> solution. Pushing modifications deeper in the code or providing a
> "generic_sequential_writepages" helper function are other potential solutions
> that in my opinion are worth discussing as other types of devices may benefit also
> in terms of performance (e.g. regular disk drives prefer sequential writes, and
> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
> driver.
> 
> For a file system, an SMR compliant implementation of a file inode mapping
> writepages method should be provided by the file system itself as the sequentiality
> of the write pattern depends further on the block allocation mechanism of the file
> system.
> 
> Note that the goal here is not to hide to applications the sequential write
> constraint of SMR disks. The page cache itself (the mapping of the block
> device inode) remains unchanged. But the modification proposed guarantees that
> a well behaved application writing sequentially to zones through the page cache
> will see successful sync operations.

So the easiest solution for the OS, when the application is already aware
of the storage constraints, would be for the application to use direct IO.
Because when using page-cache and writeback there are all sorts of
unexpected things that can happen (e.g. writeback decides to skip a page
because someone else locked it temporarily). So it will work in 99.9% of
cases but sometimes things will be out of order for hard-to-track-down
reasons. And for ordinary drives this is not an issue because we just slow
down writeback a bit, and the rarity of this makes it a non-issue. But for
host-managed SMR the IO fails and that is something the application does
not expect.

So I would really say just avoid using page-cache when you are using SMR
drives directly without a translation layer. For writes your throughput
won't suffer anyway since you have to do big sequential writes. Using
page-cache for reads may still be beneficial and if you are careful enough
not to do direct IO writes to the same range as you do buffered reads, this
will work fine.
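
A minimal user-space sketch of that approach follows, assuming a 4096-byte
logical block size and a placeholder device path: writes go through O_DIRECT
with aligned buffers and offsets, so the page cache never has to replay them,
and any buffered reads would simply target ranges that are never written
this way.

/*
 * Sketch of the suggested approach: aligned, sequential O_DIRECT writes
 * issued by the application itself, bypassing the page cache entirely so
 * that no writeback reordering can violate the zone write pointer.
 * The 4096-byte alignment and the device path are assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN_BYTES 4096UL	/* assumed logical block size */

int main(void)
{
	const char *dev = "/dev/sdX";	/* placeholder device */
	size_t len = 64 * ALIGN_BYTES;
	void *buf;
	int fd;

	/* O_DIRECT requires buffer, offset and length alignment. */
	if (posix_memalign(&buf, ALIGN_BYTES, len))
		return 1;
	memset(buf, 0, len);

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* A sequential direct write at the start of a zone (offset 0 here). */
	if (pwrite(fd, buf, len, 0) != (ssize_t)len)
		return 1;

	close(fd);
	free(buf);
	return 0;
}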

Thinking some more - if you want to make it foolproof, you could implement
something like a read-only page cache for block devices. Any write would in
fact be a direct IO write, writeable mmaps would be disallowed, and reads
would honor the O_DIRECT flag.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-23  8:40     ` [Lsf-pc] " Jan Kara
@ 2016-02-24  1:53       ` Damien Le Moal
  2016-02-24  8:47         ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2016-02-24  1:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Bart Van Assche, lsf-pc, linux-block, Matias Bjorling, linux-scsi


>On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>> 
>> >On 02/22/16 18:56, Damien Le Moal wrote:
>> >> 2) Write back of dirty pages to SMR block devices:
>> >>
>> >> Dirty pages of a block device inode are currently processed using the
>> >> generic_writepages function, which can be executed simultaneously
>> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>> >> Mutual exclusion of the dirty page processing being achieved only at
>> >> the page level (page lock & page writeback flag), multiple processes
>> >> executing a "sync" of overlapping block ranges over the same zone of
>> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
>> >> being sent to the underlying device. On a host managed SMR disk, where
>> >> sequential write to disk zones is mandatory, this result in errors and
>> >> the impossibility for an application using raw sequential disk write
>> >> accesses to be guaranteed successful completion of its write or fsync
>> >> requests.
>> >>
>> >> Using the zone information attached to the SMR block device queue
>> >> (introduced by Hannes), calls to the generic_writepages function can
>> >> be made mutually exclusive on a per zone basis by locking the zones.
>> >> This guarantees sequential request generation for each zone and avoid
>> >> write errors without any modification to the generic code implementing
>> >> generic_writepages.
>> >>
>> >> This is but one possible solution for supporting SMR host-managed
>> >> devices without any major rewrite of page cache management and
>> >> write-back processing. The opinion of the audience regarding this
>> >> solution and discussing other potential solutions would be greatly
>> >> appreciated.
>> >
>> >Hello Damien,
>> >
>> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>> >or would you also like to see that filesystems like ext4 can use SMR 
>> >drives ? In the latter case: the behavior of SMR drives differs so 
>> >significantly from that of other block devices that I'm not sure that we 
>> >should try to support these directly from infrastructure like the page 
>> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
>> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>> >either inside the SSD or as a separate software driver. An FTL 
>> >implements a so-called LFS (log-structured filesystem). With what I know 
>> >about SMR this technology looks also suitable for implementation of a 
>> >LFS. Has it already been considered to implement an LFS driver for SMR 
>> >drives ? That would make it possible for any filesystem to access an SMR 
>> >drive as any other block device. I'm not sure of this but maybe it will 
>> >be possible to share some infrastructure with the LightNVM driver 
>> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
>> >namely implements an FTL.
>> 
>> I totally agree with you that trying to support SMR disks by only modifying
>> the page cache so that unmodified standard file systems like BTRFS or ext4
>> remain operational is not realistic at best, and more likely simply impossible.
>> For this kind of use case, as you said, an FTL or a device mapper driver are
>> much more suitable.
>> 
>> The case I am considering for this discussion is for raw block device accesses
>> by an application (writes from user space to /dev/sdxx). This is a very likely
>> use case scenario for high capacity SMR disks with applications like distributed
>> object stores / key value stores.
>> 
>> In this case, write-back of dirty pages in the block device file inode mapping
>> is handled in fs/block_dev.c using the generic helper function generic_writepages.
>> This does not guarantee the generation of the required sequential write pattern
>> per zone necessary for host-managed disks. As I explained, aligning calls of this
>> function to zone boundaries while locking the zones under write-back solves
>> simply the problem (implemented and tested). This is of course only one possible
>> solution. Pushing modifications deeper in the code or providing a
>> "generic_sequential_writepages" helper function are other potential solutions
>> that in my opinion are worth discussing as other types of devices may benefit also
>> in terms of performance (e.g. regular disk drives prefer sequential writes, and
>> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
>> driver.
>> 
>> For a file system, an SMR compliant implementation of a file inode mapping
>> writepages method should be provided by the file system itself as the sequentiality
>> of the write pattern depends further on the block allocation mechanism of the file
>> system.
>> 
>> Note that the goal here is not to hide to applications the sequential write
>> constraint of SMR disks. The page cache itself (the mapping of the block
>> device inode) remains unchanged. But the modification proposed guarantees that
>> a well behaved application writing sequentially to zones through the page cache
>> will see successful sync operations.
>
>So the easiest solution for the OS, when the application is already aware
>of the storage constraints, would be for an application to use direct IO.
>Because when using page-cache and writeback there are all sorts of
>unexpected things that can happen (e.g. writeback decides to skip a page
>because someone else locked it temporarily). So it will work in 99.9% of
>cases but sometimes things will be out of order for hard-to-track down
>reasons. And for ordinary drives this is not an issue because we just slow
>down writeback a bit but rareness of this makes it non-issue. But for host
>managed SMR the IO fails and that is something the application does not
>expect.
>
>So I would really say just avoid using page-cache when you are using SMR
>drives directly without a translation layer. For writes your throughput
>won't suffer anyway since you have to do big sequential writes. Using
>page-cache for reads may still be beneficial and if you are careful enough
>not to do direct IO writes to the same range as you do buffered reads, this
>will work fine.
>
>Thinking some more - if you want to make it foolproof, you could implement
>something like read-only page cache for block devices. Any write will be in
>fact direct IO write, writeable mmaps will be disallowed, reads will honor
>O_DIRECT flag.

Hi Jan,

Indeed, using O_DIRECT for raw block device writes is an obvious solution to
guarantee the application successful sequential writes within a zone. However,
host-managed SMR disks (and to a lesser extent host-aware drives too) already
put on applications the constraint of ensuring sequential writes. Adding on top
of this a mandatory rewrite to support direct I/O is in my opinion asking a lot,
if not too much.

The example you mention above of writeback skipping a locked page and resulting
in I/O errors is precisely what the proposed patch avoids by first locking the
zone the page belongs to. In the same spirit as writeback page locking, if
the zone is already locked, it is skipped. That is, zones are treated in a sense
as gigantic pages, ensuring that the actual dirty pages within each one are
processed in one go, sequentially.

This allows preserving all possible application-level accesses (buffered, direct
or mmapped). The only constraint is the one the disk imposes: writes must be
sequential.

Granted, this view may be too simplistic and may be overlooking some hard-to-track
page locking paths which will compete with this. But I think that this can be easily
solved by forcing the zone-aligned generic_writepages calls to not skip any page
(a flag in struct writeback_control would do the trick). And no modification is
necessary on the read side (i.e. page locking alone is enough) since reading an SMR
disk's blocks after a zone's write-pointer position does not make sense (in Hannes'
code, this is possible, but the request does not go to the disk and returns garbage
data).
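
To make the writeback_control flag idea a bit more concrete, the fragment
below shows roughly where it would act in the generic writeback loop
(write_cache_pages in mm/page-writeback.c): pages already under writeback are
currently skipped for WB_SYNC_NONE passes, and a zone-aligned pass would
instead wait for them. The flag name is invented and this is only a sketch,
not a tested change.

	/*
	 * Fragment from the page iteration loop in write_cache_pages(),
	 * with a hypothetical wbc flag (name invented) that prevents a
	 * zone-aligned pass from skipping a busy page and leaving a hole
	 * in the sequential stream:
	 */
	if (PageWriteback(page)) {
		if (wbc->sync_mode != WB_SYNC_NONE ||
		    wbc->no_zone_skip)		/* hypothetical flag */
			wait_on_page_writeback(page);
		else
			goto continue_unlock;
	}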

Bottom line: this works with no fundamental change to the page caching mechanism,
only a change in how it is used/controlled for writeback. Considering the benefits
on the application side, it is in my opinion a valid modification to have.

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com



* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-24  1:53       ` Damien Le Moal
@ 2016-02-24  8:47         ` Jan Kara
  2016-02-29  2:02           ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-02-24  8:47 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jan Kara, Bart Van Assche, lsf-pc, linux-block, Matias Bjorling,
	linux-scsi

On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
> 
> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> >> 
> >> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> >> 2) Write back of dirty pages to SMR block devices:
> >> >>
> >> >> Dirty pages of a block device inode are currently processed using the
> >> >> generic_writepages function, which can be executed simultaneously
> >> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
> >> >> Mutual exclusion of the dirty page processing being achieved only at
> >> >> the page level (page lock & page writeback flag), multiple processes
> >> >> executing a "sync" of overlapping block ranges over the same zone of
> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
> >> >> being sent to the underlying device. On a host managed SMR disk, where
> >> >> sequential write to disk zones is mandatory, this result in errors and
> >> >> the impossibility for an application using raw sequential disk write
> >> >> accesses to be guaranteed successful completion of its write or fsync
> >> >> requests.
> >> >>
> >> >> Using the zone information attached to the SMR block device queue
> >> >> (introduced by Hannes), calls to the generic_writepages function can
> >> >> be made mutually exclusive on a per zone basis by locking the zones.
> >> >> This guarantees sequential request generation for each zone and avoid
> >> >> write errors without any modification to the generic code implementing
> >> >> generic_writepages.
> >> >>
> >> >> This is but one possible solution for supporting SMR host-managed
> >> >> devices without any major rewrite of page cache management and
> >> >> write-back processing. The opinion of the audience regarding this
> >> >> solution and discussing other potential solutions would be greatly
> >> >> appreciated.
> >> >
> >> >Hello Damien,
> >> >
> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
> >> >or would you also like to see that filesystems like ext4 can use SMR 
> >> >drives ? In the latter case: the behavior of SMR drives differs so 
> >> >significantly from that of other block devices that I'm not sure that we 
> >> >should try to support these directly from infrastructure like the page 
> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
> >> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
> >> >either inside the SSD or as a separate software driver. An FTL 
> >> >implements a so-called LFS (log-structured filesystem). With what I know 
> >> >about SMR this technology looks also suitable for implementation of a 
> >> >LFS. Has it already been considered to implement an LFS driver for SMR 
> >> >drives ? That would make it possible for any filesystem to access an SMR 
> >> >drive as any other block device. I'm not sure of this but maybe it will 
> >> >be possible to share some infrastructure with the LightNVM driver 
> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
> >> >namely implements an FTL.
> >> 
> >> I totally agree with you that trying to support SMR disks by only modifying
> >> the page cache so that unmodified standard file systems like BTRFS or ext4
> >> remain operational is not realistic at best, and more likely simply impossible.
> >> For this kind of use case, as you said, an FTL or a device mapper driver are
> >> much more suitable.
> >> 
> >> The case I am considering for this discussion is for raw block device accesses
> >> by an application (writes from user space to /dev/sdxx). This is a very likely
> >> use case scenario for high capacity SMR disks with applications like distributed
> >> object stores / key value stores.
> >> 
> >> In this case, write-back of dirty pages in the block device file inode mapping
> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
> >> This does not guarantee the generation of the required sequential write pattern
> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
> >> function to zone boundaries while locking the zones under write-back solves
> >> simply the problem (implemented and tested). This is of course only one possible
> >> solution. Pushing modifications deeper in the code or providing a
> >> "generic_sequential_writepages" helper function are other potential solutions
> >> that in my opinion are worth discussing as other types of devices may benefit also
> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
> >> driver.
> >> 
> >> For a file system, an SMR compliant implementation of a file inode mapping
> >> writepages method should be provided by the file system itself as the sequentiality
> >> of the write pattern depends further on the block allocation mechanism of the file
> >> system.
> >> 
> >> Note that the goal here is not to hide to applications the sequential write
> >> constraint of SMR disks. The page cache itself (the mapping of the block
> >> device inode) remains unchanged. But the modification proposed guarantees that
> >> a well behaved application writing sequentially to zones through the page cache
> >> will see successful sync operations.
> >
> >So the easiest solution for the OS, when the application is already aware
> >of the storage constraints, would be for an application to use direct IO.
> >Because when using page-cache and writeback there are all sorts of
> >unexpected things that can happen (e.g. writeback decides to skip a page
> >because someone else locked it temporarily). So it will work in 99.9% of
> >cases but sometimes things will be out of order for hard-to-track down
> >reasons. And for ordinary drives this is not an issue because we just slow
> >down writeback a bit but rareness of this makes it non-issue. But for host
> >managed SMR the IO fails and that is something the application does not
> >expect.
> >
> >So I would really say just avoid using page-cache when you are using SMR
> >drives directly without a translation layer. For writes your throughput
> >won't suffer anyway since you have to do big sequential writes. Using
> >page-cache for reads may still be beneficial and if you are careful enough
> >not to do direct IO writes to the same range as you do buffered reads, this
> >will work fine.
> >
> >Thinking some more - if you want to make it foolproof, you could implement
> >something like read-only page cache for block devices. Any write will be in
> >fact direct IO write, writeable mmaps will be disallowed, reads will honor
> >O_DIRECT flag.
> 
> Hi Jan,
> 
> Indeed, using O_DIRECT for raw block device write is an obvious solution to
> guarantee the application successful sequential writes within a zone. However,
> host-managed SMR disks (and to a lesser extent host-aware drives too) already
> put on applications the constraint of ensuring sequential writes. Adding to this
> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
> if not too much.

So I don't think adding O_DIRECT to the open flags is such a burden -
sequential writes are IMO much harder to do :). And furthermore this could
happen magically inside the kernel, in which case the app needn't be aware of
this at all (similarly to how we handle writes to persistent memory).
 
> The example you mention above of writeback skipping a locked page and resulting
> in I/O errors is precisely what the proposed patch avoids by first locking the
> zone the page belongs to. In the same spirit as the writeback page locking, if
> the zone is already locked, it is skipped. That is, zones are treated in a sense
> as gigantic pages, ensuring that the actual dirty pages within each one are
> processed in one go, sequentially.

But you cannot rule out the mm subsystem locking a page to do something (e.g.
migrating the page to help with compaction of large-order pages). These other
places accessing and locking pages are what I'm worried about. Furthermore,
kswapd can decide to write back a particular page under memory pressure and
that will just make the SMR disk freak out.

> This allows preserving all possible application level accesses (buffered,
> direct or mmapped). The only constraint is the one the disk imposes:
> writes must be sequential.
> 
> Granted, this view may be too simplistic and may be overlooking some hard
> to track page locking paths which will compete with this. But I think
> that this can be easily solved by forcing the zone-aligned
> generic_writepages calls to not skip any page (a flag in struct
> writeback_control would do the trick). And no modification is necessary
> on the read side (i.e. page locking only is enough) since reading an SMR
> disks blocks after a zone write-pointer position does not make sense (in
> Hannes code, this is possible, but the request does not go to the disk
> and returns garbage data).
> 
> Bottom line: no fundamental change to the page caching mechanism, only
> how it is being used/controlled for writeback makes this work.
> Considering the benefits on the application side, it is in my opinion a
> valid modification to have.

See above, there are quite a few places which will break your assumptions.
And I don't think changing them all to handle SMR is worth it. IMO caching
sequential writes to SMR disks has little benefit (if any) anyway, so I would
just avoid that. We can talk about how to make this as seamless to
applications as possible. The only thing which I don't think is reasonably
doable without dirtying the pagecache is writeable mmaps of an SMR device, so
applications would have to avoid that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-24  8:47         ` Jan Kara
@ 2016-02-29  2:02           ` Damien Le Moal
  2016-02-29  3:06             ` Hannes Reinecke
  2016-02-29 13:40             ` Jan Kara
  0 siblings, 2 replies; 13+ messages in thread
From: Damien Le Moal @ 2016-02-29  2:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Bart Van Assche, lsf-pc, linux-block, Matias Bjorling, linux-scsi


>On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>> 
>> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>> >> 
>> >> >On 02/22/16 18:56, Damien Le Moal wrote:
>> >> >> 2) Write back of dirty pages to SMR block devices:
>> >> >>
>> >> >> Dirty pages of a block device inode are currently processed using the
>> >> >> generic_writepages function, which can be executed simultaneously
>> >> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>> >> >> Mutual exclusion of the dirty page processing being achieved only at
>> >> >> the page level (page lock & page writeback flag), multiple processes
>> >> >> executing a "sync" of overlapping block ranges over the same zone of
>> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
>> >> >> being sent to the underlying device. On a host managed SMR disk, where
>> >> >> sequential write to disk zones is mandatory, this result in errors and
>> >> >> the impossibility for an application using raw sequential disk write
>> >> >> accesses to be guaranteed successful completion of its write or fsync
>> >> >> requests.
>> >> >>
>> >> >> Using the zone information attached to the SMR block device queue
>> >> >> (introduced by Hannes), calls to the generic_writepages function can
>> >> >> be made mutually exclusive on a per zone basis by locking the zones.
>> >> >> This guarantees sequential request generation for each zone and avoid
>> >> >> write errors without any modification to the generic code implementing
>> >> >> generic_writepages.
>> >> >>
>> >> >> This is but one possible solution for supporting SMR host-managed
>> >> >> devices without any major rewrite of page cache management and
>> >> >> write-back processing. The opinion of the audience regarding this
>> >> >> solution and discussing other potential solutions would be greatly
>> >> >> appreciated.
>> >> >
>> >> >Hello Damien,
>> >> >
>> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>> >> >or would you also like to see that filesystems like ext4 can use SMR 
>> >> >drives ? In the latter case: the behavior of SMR drives differs so 
>> >> >significantly from that of other block devices that I'm not sure that we 
>> >> >should try to support these directly from infrastructure like the page 
>> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>> >> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
>> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>> >> >either inside the SSD or as a separate software driver. An FTL 
>> >> >implements a so-called LFS (log-structured filesystem). With what I know 
>> >> >about SMR this technology looks also suitable for implementation of a 
>> >> >LFS. Has it already been considered to implement an LFS driver for SMR 
>> >> >drives ? That would make it possible for any filesystem to access an SMR 
>> >> >drive as any other block device. I'm not sure of this but maybe it will 
>> >> >be possible to share some infrastructure with the LightNVM driver 
>> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
>> >> >namely implements an FTL.
>> >> 
>> >> I totally agree with you that trying to support SMR disks by only modifying
>> >> the page cache so that unmodified standard file systems like BTRFS or ext4
>> >> remain operational is not realistic at best, and more likely simply impossible.
>> >> For this kind of use case, as you said, an FTL or a device mapper driver are
>> >> much more suitable.
>> >> 
>> >> The case I am considering for this discussion is for raw block device accesses
>> >> by an application (writes from user space to /dev/sdxx). This is a very likely
>> >> use case scenario for high capacity SMR disks with applications like distributed
>> >> object stores / key value stores.
>> >> 
>> >> In this case, write-back of dirty pages in the block device file inode mapping
>> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
>> >> This does not guarantee the generation of the required sequential write pattern
>> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
>> >> function to zone boundaries while locking the zones under write-back solves
>> >> simply the problem (implemented and tested). This is of course only one possible
>> >> solution. Pushing modifications deeper in the code or providing a
>> >> "generic_sequential_writepages" helper function are other potential solutions
>> >> that in my opinion are worth discussing as other types of devices may benefit also
>> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
>> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
>> >> driver.
>> >> 
>> >> For a file system, an SMR compliant implementation of a file inode mapping
>> >> writepages method should be provided by the file system itself as the sequentiality
>> >> of the write pattern depends further on the block allocation mechanism of the file
>> >> system.
>> >> 
>> >> Note that the goal here is not to hide to applications the sequential write
>> >> constraint of SMR disks. The page cache itself (the mapping of the block
>> >> device inode) remains unchanged. But the modification proposed guarantees that
>> >> a well behaved application writing sequentially to zones through the page cache
>> >> will see successful sync operations.
>> >
>> >So the easiest solution for the OS, when the application is already aware
>> >of the storage constraints, would be for an application to use direct IO.
>> >Because when using page-cache and writeback there are all sorts of
>> >unexpected things that can happen (e.g. writeback decides to skip a page
>> >because someone else locked it temporarily). So it will work in 99.9% of
>> >cases but sometimes things will be out of order for hard-to-track down
>> >reasons. And for ordinary drives this is not an issue because we just slow
>> >down writeback a bit but rareness of this makes it non-issue. But for host
>> >managed SMR the IO fails and that is something the application does not
>> >expect.
>> >
>> >So I would really say just avoid using page-cache when you are using SMR
>> >drives directly without a translation layer. For writes your throughput
>> >won't suffer anyway since you have to do big sequential writes. Using
>> >page-cache for reads may still be beneficial and if you are careful enough
>> >not to do direct IO writes to the same range as you do buffered reads, this
>> >will work fine.
>> >
>> >Thinking some more - if you want to make it foolproof, you could implement
>> >something like read-only page cache for block devices. Any write will be in
>> >fact direct IO write, writeable mmaps will be disallowed, reads will honor
>> >O_DIRECT flag.
>> 
>> Hi Jan,
>> 
>> Indeed, using O_DIRECT for raw block device write is an obvious solution to
>> guarantee the application successful sequential writes within a zone. However,
>> host-managed SMR disks (and to a lesser extent host-aware drives too) already
>> put on applications the constraint of ensuring sequential writes. Adding to this
>> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
>> if not too much.
>
>So I don't think adding O_DIRECT to open flags is such a burden -
>sequential writes are IMO much harder to do :). And furthermore this could
>happen magically inside the kernel in which case app needn't be aware about
>this at all (similarly to how we handle writes to persistent memory).
> 
>> The example you mention above of writeback skipping a locked page and resulting
>> in I/O errors is precisely what the proposed patch avoids by first locking the
>> zone the page belongs to. In the same spirit as the writeback page locking, if
>> the zone is already locked, it is skipped. That is, zones are treated in a sense
>> as gigantic pages, ensuring that the actual dirty pages within each one are
>> processed in one go, sequentially.
>
>But you cannot rule out mm subsystem locking a page to do something (e.g.
>migrate the page to help with compaction of large order pages). These other
>places accessing and locking pages are what I'm worried about. Furthermore
>kswapd can decide to writeback particular page under memory pressure and
>that will just make SMR disk freak out.
>
>> This allows preserving all possible application level accesses (buffered,
>> direct or mmapped). The only constraint is the one the disk imposes:
>> writes must be sequential.
>> 
>> Granted, this view may be too simplistic and may be overlooking some hard
>> to track page locking paths which will compete with this. But I think
>> that this can be easily solved by forcing the zone-aligned
>> generic_writepages calls to not skip any page (a flag in struct
>> writeback_control would do the trick). And no modification is necessary
>> on the read side (i.e. page locking only is enough) since reading an SMR
>> disks blocks after a zone write-pointer position does not make sense (in
>> Hannes code, this is possible, but the request does not go to the disk
>> and returns garbage data).
>> 
>> Bottom line: no fundamental change to the page caching mechanism, only
>> how it is being used/controlled for writeback makes this work.
>> Considering the benefits on the application side, it is in my opinion a
>> valid modification to have.
>
>See above, there are quite a few places which will break your assumptions.
>And I don't think changing them all to handle SMR is worth it. IMO caching
>sequential writes to SMR disks has low effect (if any) anyway so I would
>just avoid that. We can talk about how to make this as seamless to
>applications as possible. The only thing which I don't think is reasonably
>doable without dirtying pagecache are writeable mmaps of an SMR device so
>applications would have to avoid that.

Jan,

Thank you for your insight.
These "few places" breaking sequential write sequences are indeed problematic
for SMR drives. At the same time, I wonder how these paths would react to an I/O
error generated by the check "write at write pointer" in the request submission
path at the SCSI level. Could these be ignored in the case of an "unaligned write
error" ? That is, the page is left dirty and hopefully the regular writeback path
catches them later in the proper sequence. This may however be dangerous as there
is no way to determine if the unaligned error is due to kswapd or other kernel
threads trying to write back the "wrong" page, or the application having submitted
an out of sequence write.

Until now, the discussion has focused on avoiding unaligned write errors for cached
writes. But these happen only on host-managed SMR disks. Another aspect of SMR
support should also be to avoid random writes to zones on host-aware disks. These
drives will not return an error on unaligned writes and will silently process them
as a regular disk would. However, this can degrade performance over time, as the
disk FW has to handle more and more internal zone defragmentation.

If possible, I look forward to more discussions about this at LSF/MM.

Thank you.

Best regards.


------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com



* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-29  2:02           ` Damien Le Moal
@ 2016-02-29  3:06             ` Hannes Reinecke
  2016-02-29  5:54               ` Damien Le Moal
  2016-02-29 13:40             ` Jan Kara
  1 sibling, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2016-02-29  3:06 UTC (permalink / raw)
  To: Damien Le Moal, Jan Kara
  Cc: Bart Van Assche, lsf-pc, linux-block, Matias Bjorling, linux-scsi

On 02/29/2016 10:02 AM, Damien Le Moal wrote:
> 
>> On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>>>
>>>> On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>>>>>
>>>>>> On 02/22/16 18:56, Damien Le Moal wrote:
>>>>>>> 2) Write back of dirty pages to SMR block devices:
>>>>>>>
>>>>>>> Dirty pages of a block device inode are currently processed using the
>>>>>>> generic_writepages function, which can be executed simultaneously
>>>>>>> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>>>>>>> Mutual exclusion of the dirty page processing being achieved only at
>>>>>>> the page level (page lock & page writeback flag), multiple processes
>>>>>>> executing a "sync" of overlapping block ranges over the same zone of
>>>>>>> an SMR disk can cause an out-of-LBA-order sequence of write requests
>>>>>>> being sent to the underlying device. On a host managed SMR disk, where
>>>>>>> sequential write to disk zones is mandatory, this result in errors and
>>>>>>> the impossibility for an application using raw sequential disk write
>>>>>>> accesses to be guaranteed successful completion of its write or fsync
>>>>>>> requests.
>>>>>>>
>>>>>>> Using the zone information attached to the SMR block device queue
>>>>>>> (introduced by Hannes), calls to the generic_writepages function can
>>>>>>> be made mutually exclusive on a per zone basis by locking the zones.
>>>>>>> This guarantees sequential request generation for each zone and avoid
>>>>>>> write errors without any modification to the generic code implementing
>>>>>>> generic_writepages.
>>>>>>>
>>>>>>> This is but one possible solution for supporting SMR host-managed
>>>>>>> devices without any major rewrite of page cache management and
>>>>>>> write-back processing. The opinion of the audience regarding this
>>>>>>> solution and discussing other potential solutions would be greatly
>>>>>>> appreciated.
>>>>>>
>>>>>> Hello Damien,
>>>>>>
>>>>>> Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>>>>>> or would you also like to see that filesystems like ext4 can use SMR 
>>>>>> drives ? In the latter case: the behavior of SMR drives differs so 
>>>>>> significantly from that of other block devices that I'm not sure that we 
>>>>>> should try to support these directly from infrastructure like the page 
>>>>>> cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>>>>>> of NAND do not match what filesystems expect (e.g. large erase blocks). 
>>>>>> That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>>>>>> either inside the SSD or as a separate software driver. An FTL 
>>>>>> implements a so-called LFS (log-structured filesystem). With what I know 
>>>>>> about SMR this technology looks also suitable for implementation of a 
>>>>>> LFS. Has it already been considered to implement an LFS driver for SMR 
>>>>>> drives ? That would make it possible for any filesystem to access an SMR 
>>>>>> drive as any other block device. I'm not sure of this but maybe it will 
>>>>>> be possible to share some infrastructure with the LightNVM driver 
>>>>>> (directory drivers/lightnvm in the Linux kernel tree). This driver 
>>>>>> namely implements an FTL.
>>>>>
>>>>> I totally agree with you that trying to support SMR disks by only modifying
>>>>> the page cache so that unmodified standard file systems like BTRFS or ext4
>>>>> remain operational is not realistic at best, and more likely simply impossible.
>>>>> For this kind of use case, as you said, an FTL or a device mapper driver are
>>>>> much more suitable.
>>>>>
>>>>> The case I am considering for this discussion is for raw block device accesses
>>>>> by an application (writes from user space to /dev/sdxx). This is a very likely
>>>>> use case scenario for high capacity SMR disks with applications like distributed
>>>>> object stores / key value stores.
>>>>>
>>>>> In this case, write-back of dirty pages in the block device file inode mapping
>>>>> is handled in fs/block_dev.c using the generic helper function generic_writepages.
>>>>> This does not guarantee the generation of the required sequential write pattern
>>>>> per zone necessary for host-managed disks. As I explained, aligning calls of this
>>>>> function to zone boundaries while locking the zones under write-back solves
>>>>> simply the problem (implemented and tested). This is of course only one possible
>>>>> solution. Pushing modifications deeper in the code or providing a
>>>>> "generic_sequential_writepages" helper function are other potential solutions
>>>>> that in my opinion are worth discussing as other types of devices may benefit also
>>>>> in terms of performance (e.g. regular disk drives prefer sequential writes, and
>>>>> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
>>>>> driver.
>>>>>
>>>>> For a file system, an SMR compliant implementation of a file inode mapping
>>>>> writepages method should be provided by the file system itself as the sequentiality
>>>>> of the write pattern depends further on the block allocation mechanism of the file
>>>>> system.
>>>>>
>>>>> Note that the goal here is not to hide to applications the sequential write
>>>>> constraint of SMR disks. The page cache itself (the mapping of the block
>>>>> device inode) remains unchanged. But the modification proposed guarantees that
>>>>> a well behaved application writing sequentially to zones through the page cache
>>>>> will see successful sync operations.
>>>>
>>>> So the easiest solution for the OS, when the application is already aware
>>>> of the storage constraints, would be for an application to use direct IO.
>>>> Because when using page-cache and writeback there are all sorts of
>>>> unexpected things that can happen (e.g. writeback decides to skip a page
>>>> because someone else locked it temporarily). So it will work in 99.9% of
>>>> cases but sometimes things will be out of order for hard-to-track down
>>>> reasons. And for ordinary drives this is not an issue because we just slow
>>>> down writeback a bit but rareness of this makes it non-issue. But for host
>>>> managed SMR the IO fails and that is something the application does not
>>>> expect.
>>>>
>>>> So I would really say just avoid using page-cache when you are using SMR
>>>> drives directly without a translation layer. For writes your throughput
>>>> won't suffer anyway since you have to do big sequential writes. Using
>>>> page-cache for reads may still be beneficial and if you are careful enough
>>>> not to do direct IO writes to the same range as you do buffered reads, this
>>>> will work fine.
>>>>
>>>> Thinking some more - if you want to make it foolproof, you could implement
>>>> something like read-only page cache for block devices. Any write will be in
>>>> fact direct IO write, writeable mmaps will be disallowed, reads will honor
>>>> O_DIRECT flag.
>>>
>>> Hi Jan,
>>>
>>> Indeed, using O_DIRECT for raw block device write is an obvious solution to
>>> guarantee the application successful sequential writes within a zone. However,
>>> host-managed SMR disks (and to a lesser extent host-aware drives too) already
>>> put on applications the constraint of ensuring sequential writes. Adding to this
>>> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
>>> if not too much.
>>
>> So I don't think adding O_DIRECT to open flags is such a burden -
>> sequential writes are IMO much harder to do :). And furthermore this could
>> happen magically inside the kernel in which case app needn't be aware about
>> this at all (similarly to how we handle writes to persistent memory).
>>
>>> The example you mention above of writeback skipping a locked page and resulting
>>> in I/O errors is precisely what the proposed patch avoids by first locking the
>>> zone the page belongs to. In the same spirit as the writeback page locking, if
>>> the zone is already locked, it is skipped. That is, zones are treated in a sense
>>> as gigantic pages, ensuring that the actual dirty pages within each one are
>>> processed in one go, sequentially.
>>
>> But you cannot rule out mm subsystem locking a page to do something (e.g.
>> migrate the page to help with compaction of large order pages). These other
>> places accessing and locking pages are what I'm worried about. Furthermore
>> kswapd can decide to writeback particular page under memory pressure and
>> that will just make SMR disk freak out.
>>
>>> This allows preserving all possible application level accesses (buffered,
>>> direct or mmapped). The only constraint is the one the disk imposes:
>>> writes must be sequential.
>>>
>>> Granted, this view may be too simplistic and may be overlooking some hard
>>> to track page locking paths which will compete with this. But I think
>>> that this can be easily solved by forcing the zone-aligned
>>> generic_writepages calls to not skip any page (a flag in struct
>>> writeback_control would do the trick). And no modification is necessary
>>> on the read side (i.e. page locking only is enough) since reading an SMR
>>> disks blocks after a zone write-pointer position does not make sense (in
>>> Hannes code, this is possible, but the request does not go to the disk
>>> and returns garbage data).
>>>
>>> Bottom line: no fundamental change to the page caching mechanism, only
>>> how it is being used/controlled for writeback makes this work.
>>> Considering the benefits on the application side, it is in my opinion a
>>> valid modification to have.
>>
>> See above, there are quite a few places which will break your assumptions.
>> And I don't think changing them all to handle SMR is worth it. IMO caching
>> sequential writes to SMR disks has low effect (if any) anyway so I would
>> just avoid that. We can talk about how to make this as seamless to
>> applications as possible. The only thing which I don't think is reasonably
>> doable without dirtying pagecache are writeable mmaps of an SMR device so
>> applications would have to avoid that.
> 
> Jan,
> 
> Thank you for your insight.
> These "few places" breaking sequential write sequences are indeed problematic
> for SMR drives. At the same time, I wonder how these paths would react to an I/O
> error generated by the check "write at write pointer" in the request submission
> path at the SCSI level. Could these be ignored in the case of an "unaligned write
> error" ? That is, the page is left dirty and hopefully the regular writeback path
> catches them later in the proper sequence. This may however be dangerous as there
> is no way to determine if the unaligned error is due to kswapd or other kernel
> threads trying to write back the "wrong" page, or the application having submitted
> an out of sequence write.
> 
> Until now, the discussion has focused on avoiding unaligned write errors for cached
> writes. But this happens only on host-managed SMR disks. Another aspect of the SMR
> support should also be to avoid random write to zones on host-aware disks. These will
> not return an error on unaligned writes and silently process them as a regular disk.
> However, this can over time degrade performance as the disk FW has to handle more and
> more internal zone defragmentation.
> 
To chime in here, we _might_ be able to fix this via a totally different
route.
If we were allowed to pass _linked_ bios to ->make_request_fn (i.e. bios
where the ->bi_next field is already populated) we would have an easy
marker for merging those requests. At the same time we would be able to
process these linked bios as a single unit, allowing other bios to be
added only to the front or the back of the linked chain.
That would guarantee in-order delivery for SMR, and at the same time
allow us to get merging running for block-mq.

Alternatively one could try to use plugging here, but I'm not sure if
that would be sufficient; will need to test.
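
To make this a bit more concrete, here is a minimal sketch of what the
submitting side could look like. blk_queue_linked_bio() is made up for this
example and does not exist today; generic_make_request() currently expects
bio->bi_next to be NULL, so the point is only that the whole chain would be
passed down in a single call:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: build a per-zone chain of bios via ->bi_next and hand the
 * whole chain to the block layer in one call, so that the chain is merged
 * and dispatched as a unit and nothing can be inserted in the middle.
 */
static void submit_zone_bio_chain(struct request_queue *q,
                                  struct bio **bios, unsigned int nr)
{
        struct bio *head = NULL, *tail = NULL;
        unsigned int i;

        if (!nr)
                return;

        /* bios[] is assumed to already be sorted in increasing LBA order */
        for (i = 0; i < nr; i++) {
                if (!head)
                        head = bios[i];
                else
                        tail->bi_next = bios[i];
                tail = bios[i];
        }
        tail->bi_next = NULL;

        blk_queue_linked_bio(q, head);  /* hypothetical entry point */
}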

> If possible, I look forward to more discussions about this at LSF/MM.
> 
Same here.
Btw, I do like the idea of Online logical head depop.
No idea how we could implement that, but the idea is nice.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-29  3:06             ` Hannes Reinecke
@ 2016-02-29  5:54               ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2016-02-29  5:54 UTC (permalink / raw)
  To: Hannes Reinecke, Jan Kara
  Cc: Bart Van Assche, lsf-pc, linux-block, linux-scsi


>On 02/29/2016 10:02 AM, Damien Le Moal wrote:
>> 
>>> On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>>>>
>>>>> On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>>>>>>
>>>>>>> On 02/22/16 18:56, Damien Le Moal wrote:
>>>>>>>> 2) Write back of dirty pages to SMR block devices:
>>>>>>>>
>>>>>>>> Dirty pages of a block device inode are currently processed using the
>>>>>>>> generic_writepages function, which can be executed simultaneously
>>>>>>>> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>>>>>>>> Mutual exclusion of the dirty page processing being achieved only at
>>>>>>>> the page level (page lock & page writeback flag), multiple processes
>>>>>>>> executing a "sync" of overlapping block ranges over the same zone of
>>>>>>>> an SMR disk can cause an out-of-LBA-order sequence of write requests
>>>>>>>> being sent to the underlying device. On a host managed SMR disk, where
>>>>>>>> sequential write to disk zones is mandatory, this result in errors and
>>>>>>>> the impossibility for an application using raw sequential disk write
>>>>>>>> accesses to be guaranteed successful completion of its write or fsync
>>>>>>>> requests.
>>>>>>>>
>>>>>>>> Using the zone information attached to the SMR block device queue
>>>>>>>> (introduced by Hannes), calls to the generic_writepages function can
>>>>>>>> be made mutually exclusive on a per zone basis by locking the zones.
>>>>>>>> This guarantees sequential request generation for each zone and avoid
>>>>>>>> write errors without any modification to the generic code implementing
>>>>>>>> generic_writepages.
>>>>>>>>
>>>>>>>> This is but one possible solution for supporting SMR host-managed
>>>>>>>> devices without any major rewrite of page cache management and
>>>>>>>> write-back processing. The opinion of the audience regarding this
>>>>>>>> solution and discussing other potential solutions would be greatly
>>>>>>>> appreciated.
>>>>>>>
>>>>>>> Hello Damien,
>>>>>>>
>>>>>>> Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>>>>>>> or would you also like to see that filesystems like ext4 can use SMR 
>>>>>>> drives ? In the latter case: the behavior of SMR drives differs so 
>>>>>>> significantly from that of other block devices that I'm not sure that we 
>>>>>>> should try to support these directly from infrastructure like the page 
>>>>>>> cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>>>>>>> of NAND do not match what filesystems expect (e.g. large erase blocks). 
>>>>>>> That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>>>>>>> either inside the SSD or as a separate software driver. An FTL 
>>>>>>> implements a so-called LFS (log-structured filesystem). With what I know 
>>>>>>> about SMR this technology looks also suitable for implementation of a 
>>>>>>> LFS. Has it already been considered to implement an LFS driver for SMR 
>>>>>>> drives ? That would make it possible for any filesystem to access an SMR 
>>>>>>> drive as any other block device. I'm not sure of this but maybe it will 
>>>>>>> be possible to share some infrastructure with the LightNVM driver 
>>>>>>> (directory drivers/lightnvm in the Linux kernel tree). This driver 
>>>>>>> namely implements an FTL.
>>>>>>
>>>>>> I totally agree with you that trying to support SMR disks by only modifying
>>>>>> the page cache so that unmodified standard file systems like BTRFS or ext4
>>>>>> remain operational is not realistic at best, and more likely simply impossible.
>>>>>> For this kind of use case, as you said, an FTL or a device mapper driver are
>>>>>> much more suitable.
>>>>>>
>>>>>> The case I am considering for this discussion is for raw block device accesses
>>>>>> by an application (writes from user space to /dev/sdxx). This is a very likely
>>>>>> use case scenario for high capacity SMR disks with applications like distributed
>>>>>> object stores / key value stores.
>>>>>>
>>>>>> In this case, write-back of dirty pages in the block device file inode mapping
>>>>>> is handled in fs/block_dev.c using the generic helper function generic_writepages.
>>>>>> This does not guarantee the generation of the required sequential write pattern
>>>>>> per zone necessary for host-managed disks. As I explained, aligning calls of this
>>>>>> function to zone boundaries while locking the zones under write-back solves
>>>>>> simply the problem (implemented and tested). This is of course only one possible
>>>>>> solution. Pushing modifications deeper in the code or providing a
>>>>>> "generic_sequential_writepages" helper function are other potential solutions
>>>>>> that in my opinion are worth discussing as other types of devices may benefit also
>>>>>> in terms of performance (e.g. regular disk drives prefer sequential writes, and
>>>>>> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
>>>>>> driver.
>>>>>>
>>>>>> For a file system, an SMR compliant implementation of a file inode mapping
>>>>>> writepages method should be provided by the file system itself as the sequentiality
>>>>>> of the write pattern depends further on the block allocation mechanism of the file
>>>>>> system.
>>>>>>
>>>>>> Note that the goal here is not to hide to applications the sequential write
>>>>>> constraint of SMR disks. The page cache itself (the mapping of the block
>>>>>> device inode) remains unchanged. But the modification proposed guarantees that
>>>>>> a well behaved application writing sequentially to zones through the page cache
>>>>>> will see successful sync operations.
>>>>>
>>>>> So the easiest solution for the OS, when the application is already aware
>>>>> of the storage constraints, would be for an application to use direct IO.
>>>>> Because when using page-cache and writeback there are all sorts of
>>>>> unexpected things that can happen (e.g. writeback decides to skip a page
>>>>> because someone else locked it temporarily). So it will work in 99.9% of
>>>>> cases but sometimes things will be out of order for hard-to-track down
>>>>> reasons. And for ordinary drives this is not an issue because we just slow
>>>>> down writeback a bit but rareness of this makes it non-issue. But for host
>>>>> managed SMR the IO fails and that is something the application does not
>>>>> expect.
>>>>>
>>>>> So I would really say just avoid using page-cache when you are using SMR
>>>>> drives directly without a translation layer. For writes your throughput
>>>>> won't suffer anyway since you have to do big sequential writes. Using
>>>>> page-cache for reads may still be beneficial and if you are careful enough
>>>>> not to do direct IO writes to the same range as you do buffered reads, this
>>>>> will work fine.
>>>>>
>>>>> Thinking some more - if you want to make it foolproof, you could implement
>>>>> something like read-only page cache for block devices. Any write will be in
>>>>> fact direct IO write, writeable mmaps will be disallowed, reads will honor
>>>>> O_DIRECT flag.
>>>>
>>>> Hi Jan,
>>>>
>>>> Indeed, using O_DIRECT for raw block device write is an obvious solution to
>>>> guarantee the application successful sequential writes within a zone. However,
>>>> host-managed SMR disks (and to a lesser extent host-aware drives too) already
>>>> put on applications the constraint of ensuring sequential writes. Adding to this
>>>> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
>>>> if not too much.
>>>
>>> So I don't think adding O_DIRECT to open flags is such a burden -
>>> sequential writes are IMO much harder to do :). And furthermore this could
>>> happen magically inside the kernel in which case app needn't be aware about
>>> this at all (similarly to how we handle writes to persistent memory).
>>>
>>>> The example you mention above of writeback skipping a locked page and resulting
>>>> in I/O errors is precisely what the proposed patch avoids by first locking the
>>>> zone the page belongs to. In the same spirit as the writeback page locking, if
>>>> the zone is already locked, it is skipped. That is, zones are treated in a sense
>>>> as gigantic pages, ensuring that the actual dirty pages within each one are
>>>> processed in one go, sequentially.
>>>
>>> But you cannot rule out mm subsystem locking a page to do something (e.g.
>>> migrate the page to help with compaction of large order pages). These other
>>> places accessing and locking pages are what I'm worried about. Furthermore
>>> kswapd can decide to writeback particular page under memory pressure and
>>> that will just make SMR disk freak out.
>>>
>>>> This allows preserving all possible application level accesses (buffered,
>>>> direct or mmapped). The only constraint is the one the disk imposes:
>>>> writes must be sequential.
>>>>
>>>> Granted, this view may be too simplistic and may be overlooking some hard
>>>> to track page locking paths which will compete with this. But I think
>>>> that this can be easily solved by forcing the zone-aligned
>>>> generic_writepages calls to not skip any page (a flag in struct
>>>> writeback_control would do the trick). And no modification is necessary
>>>> on the read side (i.e. page locking only is enough) since reading an SMR
>>>> disks blocks after a zone write-pointer position does not make sense (in
>>>> Hannes code, this is possible, but the request does not go to the disk
>>>> and returns garbage data).
>>>>
>>>> Bottom line: no fundamental change to the page caching mechanism, only
>>>> how it is being used/controlled for writeback makes this work.
>>>> Considering the benefits on the application side, it is in my opinion a
>>>> valid modification to have.
>>>
>>> See above, there are quite a few places which will break your assumptions.
>>> And I don't think changing them all to handle SMR is worth it. IMO caching
>>> sequential writes to SMR disks has low effect (if any) anyway so I would
>>> just avoid that. We can talk about how to make this as seamless to
>>> applications as possible. The only thing which I don't think is reasonably
>>> doable without dirtying pagecache are writeable mmaps of an SMR device so
>>> applications would have to avoid that.
>> 
>> Jan,
>> 
>> Thank you for your insight.
>> These "few places" breaking sequential write sequences are indeed problematic
>> for SMR drives. At the same time, I wonder how these paths would react to an I/O
>> error generated by the check "write at write pointer" in the request submission
>> path at the SCSI level. Could these be ignored in the case of an "unaligned write
>> error" ? That is, the page is left dirty and hopefully the regular writeback path
>> catches them later in the proper sequence. This may however be dangerous as there
>> is no way to determine if the unaligned error is due to kswapd or other kernel
>> threads trying to write back the "wrong" page, or the application having submitted
>> an out of sequence write.
>> 
>> Until now, the discussion has focused on avoiding unaligned write errors for cached
>> writes. But this happens only on host-managed SMR disks. Another aspect of the SMR
>> support should also be to avoid random write to zones on host-aware disks. These will
>> not return an error on unaligned writes and silently process them as a regular disk.
>> However, this can over time degrade performance as the disk FW has to handle more and
>> more internal zone defragmentation.
>> 
>To chime in here, we _might_ be able to fix this via a totally different
>route.
>If we were allow to pass _linked_ bios to ->make_request_fn (ie bios
>where the ->bi_next field was already populated) we would have an easy
>marker for merging those requests. At the same time we would be able to
>process these linked bios as a single unit, allowing other bios only to
>be added to the front or the back of these linked bios.
>That would guarantee in-order delivery for SMR, and at the same time
>allow us to get merging running for block-mq.
>
>Alternatively one could try to use plugging here, but I'm not sure if
>that would be sufficient; will need to test.

Hannes,

I also like the idea of linked BIOs, as it may simplify fixing a lot of the ordering
problems we have throughout the stack. However, in the case of writeback of
buffered writes, as Jan pointed out, the problem comes first from potential out-of-order
dirty page writeback from different paths, which generates a non-sequential BIO ordering.
I do not see how linking BIOs can cover all possible cases.
Or are you suggesting to basically move the "write pointer position check" upward in
the stack, into the ->make_request_fn function ? This would indeed ensure that
out-of-order page writeback selection fails early, always within the writeback BIO
issuing context. If so, I am afraid however that error handling may be tricky, as some
failed BIO submissions could be retried but not others (e.g. a BIO corresponding to a
non-sequential selection of a dirty page within an otherwise correct sequence can be
retried, while one for a page written at a genuinely random offset cannot).
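
Just to make the question concrete, the check I am thinking of would be
something like the sketch below, called from the submission path. All names
are illustrative only and loosely follow your zone cache code, not an
existing in-tree API:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch: blk_lookup_zone(), blk_zone_is_seq_req() and the wp field stand
 * in for the zone cache introduced by Hannes.
 */
static bool bio_write_is_aligned(struct request_queue *q, struct bio *bio)
{
        struct blk_zone *zone;

        if (bio_data_dir(bio) != WRITE)
                return true;

        zone = blk_lookup_zone(q, bio->bi_iter.bi_sector);
        if (!zone || !blk_zone_is_seq_req(zone))
                return true;

        /* sequential-write-required zone: the write must start at the WP */
        return bio->bi_iter.bi_sector == zone->wp;
}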

>> If possible, I look forward to more discussions about this at LSF/MM.
>> 
>Same here.
>Btw, I do like the idea of Online logical head depop.
>No idea how we could implement that, but the idea is nice.

Right now, I am exploring extending the SMR zone management code, reusing
the zone condition/state to reflect the state of the LBAs of a disk (the disk is
"chunked" so that regularly sized LBA ranges correspond to logical zones).
For instance, the ZBC-defined "read-only" and "offline" zone conditions could be
used for LBAs under a failing head and under a depopulated head, respectively.
New conditions can be added for other states as required.
File systems can access that information through the zone management functions,
but interfacing all this with applications may be very tricky.
All very fuzzy for now. I would like to start a discussion at LSF/MM, including
also the standardization aspects of the feature.
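
To give a rough idea of the direction (everything below is made up for
illustration and only mirrors the ZBC zone conditions mentioned above):

#include <linux/types.h>

/*
 * Illustration only: conditions for fixed-size LBA chunks ("logical
 * zones") of the disk. The read-only and offline values mirror zone
 * conditions that ZBC already defines; the others are hypothetical.
 */
enum blk_lba_chunk_cond {
        BLK_CHUNK_COND_OK,              /* fully usable LBA range */
        BLK_CHUNK_COND_READONLY,        /* e.g. LBAs under a failing head */
        BLK_CHUNK_COND_OFFLINE,         /* e.g. LBAs under a depopulated head */
        BLK_CHUNK_COND_REFORMATTING,    /* optional compaction in progress */
};

struct blk_lba_chunk {
        sector_t                start;  /* first sector of the chunk */
        sector_t                len;    /* chunk length in sectors */
        enum blk_lba_chunk_cond cond;   /* current condition of the chunk */
};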

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-29  2:02           ` Damien Le Moal
  2016-02-29  3:06             ` Hannes Reinecke
@ 2016-02-29 13:40             ` Jan Kara
  2016-03-01  0:43               ` Damien Le Moal
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-02-29 13:40 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jan Kara, linux-block, Bart Van Assche, Matias Bjorling,
	linux-scsi, lsf-pc

On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
> 
> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
> >> 
> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> >> >> 
> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> >> >> 2) Write back of dirty pages to SMR block devices:
> >> >> >>
> >> >> >> Dirty pages of a block device inode are currently processed using the
> >> >> >> generic_writepages function, which can be executed simultaneously
> >> >> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
> >> >> >> Mutual exclusion of the dirty page processing being achieved only at
> >> >> >> the page level (page lock & page writeback flag), multiple processes
> >> >> >> executing a "sync" of overlapping block ranges over the same zone of
> >> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
> >> >> >> being sent to the underlying device. On a host managed SMR disk, where
> >> >> >> sequential write to disk zones is mandatory, this result in errors and
> >> >> >> the impossibility for an application using raw sequential disk write
> >> >> >> accesses to be guaranteed successful completion of its write or fsync
> >> >> >> requests.
> >> >> >>
> >> >> >> Using the zone information attached to the SMR block device queue
> >> >> >> (introduced by Hannes), calls to the generic_writepages function can
> >> >> >> be made mutually exclusive on a per zone basis by locking the zones.
> >> >> >> This guarantees sequential request generation for each zone and avoid
> >> >> >> write errors without any modification to the generic code implementing
> >> >> >> generic_writepages.
> >> >> >>
> >> >> >> This is but one possible solution for supporting SMR host-managed
> >> >> >> devices without any major rewrite of page cache management and
> >> >> >> write-back processing. The opinion of the audience regarding this
> >> >> >> solution and discussing other potential solutions would be greatly
> >> >> >> appreciated.
> >> >> >
> >> >> >Hello Damien,
> >> >> >
> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
> >> >> >or would you also like to see that filesystems like ext4 can use SMR 
> >> >> >drives ? In the latter case: the behavior of SMR drives differs so 
> >> >> >significantly from that of other block devices that I'm not sure that we 
> >> >> >should try to support these directly from infrastructure like the page 
> >> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
> >> >> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
> >> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
> >> >> >either inside the SSD or as a separate software driver. An FTL 
> >> >> >implements a so-called LFS (log-structured filesystem). With what I know 
> >> >> >about SMR this technology looks also suitable for implementation of a 
> >> >> >LFS. Has it already been considered to implement an LFS driver for SMR 
> >> >> >drives ? That would make it possible for any filesystem to access an SMR 
> >> >> >drive as any other block device. I'm not sure of this but maybe it will 
> >> >> >be possible to share some infrastructure with the LightNVM driver 
> >> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
> >> >> >namely implements an FTL.
> >> >> 
> >> >> I totally agree with you that trying to support SMR disks by only modifying
> >> >> the page cache so that unmodified standard file systems like BTRFS or ext4
> >> >> remain operational is not realistic at best, and more likely simply impossible.
> >> >> For this kind of use case, as you said, an FTL or a device mapper driver are
> >> >> much more suitable.
> >> >> 
> >> >> The case I am considering for this discussion is for raw block device accesses
> >> >> by an application (writes from user space to /dev/sdxx). This is a very likely
> >> >> use case scenario for high capacity SMR disks with applications like distributed
> >> >> object stores / key value stores.
> >> >> 
> >> >> In this case, write-back of dirty pages in the block device file inode mapping
> >> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
> >> >> This does not guarantee the generation of the required sequential write pattern
> >> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
> >> >> function to zone boundaries while locking the zones under write-back solves
> >> >> simply the problem (implemented and tested). This is of course only one possible
> >> >> solution. Pushing modifications deeper in the code or providing a
> >> >> "generic_sequential_writepages" helper function are other potential solutions
> >> >> that in my opinion are worth discussing as other types of devices may benefit also
> >> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
> >> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
> >> >> driver.
> >> >> 
> >> >> For a file system, an SMR compliant implementation of a file inode mapping
> >> >> writepages method should be provided by the file system itself as the sequentiality
> >> >> of the write pattern depends further on the block allocation mechanism of the file
> >> >> system.
> >> >> 
> >> >> Note that the goal here is not to hide to applications the sequential write
> >> >> constraint of SMR disks. The page cache itself (the mapping of the block
> >> >> device inode) remains unchanged. But the modification proposed guarantees that
> >> >> a well behaved application writing sequentially to zones through the page cache
> >> >> will see successful sync operations.
> >> >
> >> >So the easiest solution for the OS, when the application is already aware
> >> >of the storage constraints, would be for an application to use direct IO.
> >> >Because when using page-cache and writeback there are all sorts of
> >> >unexpected things that can happen (e.g. writeback decides to skip a page
> >> >because someone else locked it temporarily). So it will work in 99.9% of
> >> >cases but sometimes things will be out of order for hard-to-track down
> >> >reasons. And for ordinary drives this is not an issue because we just slow
> >> >down writeback a bit but rareness of this makes it non-issue. But for host
> >> >managed SMR the IO fails and that is something the application does not
> >> >expect.
> >> >
> >> >So I would really say just avoid using page-cache when you are using SMR
> >> >drives directly without a translation layer. For writes your throughput
> >> >won't suffer anyway since you have to do big sequential writes. Using
> >> >page-cache for reads may still be beneficial and if you are careful enough
> >> >not to do direct IO writes to the same range as you do buffered reads, this
> >> >will work fine.
> >> >
> >> >Thinking some more - if you want to make it foolproof, you could implement
> >> >something like read-only page cache for block devices. Any write will be in
> >> >fact direct IO write, writeable mmaps will be disallowed, reads will honor
> >> >O_DIRECT flag.
> >> 
> >> Hi Jan,
> >> 
> >> Indeed, using O_DIRECT for raw block device write is an obvious solution to
> >> guarantee the application successful sequential writes within a zone. However,
> >> host-managed SMR disks (and to a lesser extent host-aware drives too) already
> >> put on applications the constraint of ensuring sequential writes. Adding to this
> >> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
> >> if not too much.
> >
> >So I don't think adding O_DIRECT to open flags is such a burden -
> >sequential writes are IMO much harder to do :). And furthermore this could
> >happen magically inside the kernel in which case app needn't be aware about
> >this at all (similarly to how we handle writes to persistent memory).
> > 
> >> The example you mention above of writeback skipping a locked page and resulting
> >> in I/O errors is precisely what the proposed patch avoids by first locking the
> >> zone the page belongs to. In the same spirit as the writeback page locking, if
> >> the zone is already locked, it is skipped. That is, zones are treated in a sense
> >> as gigantic pages, ensuring that the actual dirty pages within each one are
> >> processed in one go, sequentially.
> >
> >But you cannot rule out mm subsystem locking a page to do something (e.g.
> >migrate the page to help with compaction of large order pages). These other
> >places accessing and locking pages are what I'm worried about. Furthermore
> >kswapd can decide to writeback particular page under memory pressure and
> >that will just make SMR disk freak out.
> >
> >> This allows preserving all possible application level accesses (buffered,
> >> direct or mmapped). The only constraint is the one the disk imposes:
> >> writes must be sequential.
> >> 
> >> Granted, this view may be too simplistic and may be overlooking some hard
> >> to track page locking paths which will compete with this. But I think
> >> that this can be easily solved by forcing the zone-aligned
> >> generic_writepages calls to not skip any page (a flag in struct
> >> writeback_control would do the trick). And no modification is necessary
> >> on the read side (i.e. page locking only is enough) since reading an SMR
> >> disks blocks after a zone write-pointer position does not make sense (in
> >> Hannes code, this is possible, but the request does not go to the disk
> >> and returns garbage data).
> >> 
> >> Bottom line: no fundamental change to the page caching mechanism, only
> >> how it is being used/controlled for writeback makes this work.
> >> Considering the benefits on the application side, it is in my opinion a
> >> valid modification to have.
> >
> >See above, there are quite a few places which will break your assumptions.
> >And I don't think changing them all to handle SMR is worth it. IMO caching
> >sequential writes to SMR disks has low effect (if any) anyway so I would
> >just avoid that. We can talk about how to make this as seamless to
> >applications as possible. The only thing which I don't think is reasonably
> >doable without dirtying pagecache are writeable mmaps of an SMR device so
> >applications would have to avoid that.
> 
> Jan,
> 
> Thank you for your insight.
> These "few places" breaking sequential write sequences are indeed
> problematic for SMR drives. At the same time, I wonder how these paths
> would react to an I/O error generated by the check "write at write
> pointer" in the request submission path at the SCSI level. Could these be
> ignored in the case of an "unaligned write error" ? That is, the page is
> left dirty and hopefully the regular writeback path catches them later in
> the proper sequence.

You'd hope ;) But in fact what happens is that the page ends
up being clean, marked as having an error, and its buffers will not be uptodate =>
you have just lost one page worth of data. See what happens in
end_buffer_async_write(). Our behavior in the presence of IO errors has needed
improvement for a long time, so you are certainly welcome to improve on this,
but what I described is what happens now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-02-29 13:40             ` Jan Kara
@ 2016-03-01  0:43               ` Damien Le Moal
  2016-03-01  9:27                 ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2016-03-01  0:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-block, Bart Van Assche, Matias Bjorling, linux-scsi, lsf-pc

From:  Jan Kara <jack@suse.cz>
Date:  Monday, February 29, 2016 at 22:40
To:  Damien Le Moal <Damien.LeMoal@hgst.com>
Cc:  Jan Kara <jack@suse.cz>, "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>, Bart Van Assche <bart.vanassche@sandisk.com>, Matias Bjorling <m@bjorling.me>, "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>, "lsf-pc@lists.linuxfoundation.org" <lsf-pc@lists.linuxfoundation.org>
Subject:  Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages


>On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
>> 
>> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>> >> 
>> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>> >> >> 
>> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
>> >> >> >> 2) Write back of dirty pages to SMR block devices:
>> >> >> >>
>> >> >> >> Dirty pages of a block device inode are currently processed using the
>> >> >> >> generic_writepages function, which can be executed simultaneously
>> >> >> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
>> >> >> >> Mutual exclusion of the dirty page processing being achieved only at
>> >> >> >> the page level (page lock & page writeback flag), multiple processes
>> >> >> >> executing a "sync" of overlapping block ranges over the same zone of
>> >> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
>> >> >> >> being sent to the underlying device. On a host managed SMR disk, where
>> >> >> >> sequential write to disk zones is mandatory, this result in errors and
>> >> >> >> the impossibility for an application using raw sequential disk write
>> >> >> >> accesses to be guaranteed successful completion of its write or fsync
>> >> >> >> requests.
>> >> >> >>
>> >> >> >> Using the zone information attached to the SMR block device queue
>> >> >> >> (introduced by Hannes), calls to the generic_writepages function can
>> >> >> >> be made mutually exclusive on a per zone basis by locking the zones.
>> >> >> >> This guarantees sequential request generation for each zone and avoid
>> >> >> >> write errors without any modification to the generic code implementing
>> >> >> >> generic_writepages.
>> >> >> >>
>> >> >> >> This is but one possible solution for supporting SMR host-managed
>> >> >> >> devices without any major rewrite of page cache management and
>> >> >> >> write-back processing. The opinion of the audience regarding this
>> >> >> >> solution and discussing other potential solutions would be greatly
>> >> >> >> appreciated.
>> >> >> >
>> >> >> >Hello Damien,
>> >> >> >
>> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
>> >> >> >or would you also like to see that filesystems like ext4 can use SMR 
>> >> >> >drives ? In the latter case: the behavior of SMR drives differs so 
>> >> >> >significantly from that of other block devices that I'm not sure that we 
>> >> >> >should try to support these directly from infrastructure like the page 
>> >> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
>> >> >> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
>> >> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
>> >> >> >either inside the SSD or as a separate software driver. An FTL 
>> >> >> >implements a so-called LFS (log-structured filesystem). With what I know 
>> >> >> >about SMR this technology looks also suitable for implementation of a 
>> >> >> >LFS. Has it already been considered to implement an LFS driver for SMR 
>> >> >> >drives ? That would make it possible for any filesystem to access an SMR 
>> >> >> >drive as any other block device. I'm not sure of this but maybe it will 
>> >> >> >be possible to share some infrastructure with the LightNVM driver 
>> >> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
>> >> >> >namely implements an FTL.
>> >> >> 
>> >> >> I totally agree with you that trying to support SMR disks by only modifying
>> >> >> the page cache so that unmodified standard file systems like BTRFS or ext4
>> >> >> remain operational is not realistic at best, and more likely simply impossible.
>> >> >> For this kind of use case, as you said, an FTL or a device mapper driver are
>> >> >> much more suitable.
>> >> >> 
>> >> >> The case I am considering for this discussion is for raw block device accesses
>> >> >> by an application (writes from user space to /dev/sdxx). This is a very likely
>> >> >> use case scenario for high capacity SMR disks with applications like distributed
>> >> >> object stores / key value stores.
>> >> >> 
>> >> >> In this case, write-back of dirty pages in the block device file inode mapping
>> >> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
>> >> >> This does not guarantee the generation of the required sequential write pattern
>> >> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
>> >> >> function to zone boundaries while locking the zones under write-back solves
>> >> >> simply the problem (implemented and tested). This is of course only one possible
>> >> >> solution. Pushing modifications deeper in the code or providing a
>> >> >> "generic_sequential_writepages" helper function are other potential solutions
>> >> >> that in my opinion are worth discussing as other types of devices may benefit also
>> >> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
>> >> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
>> >> >> driver.
>> >> >> 
>> >> >> For a file system, an SMR compliant implementation of a file inode mapping
>> >> >> writepages method should be provided by the file system itself as the sequentiality
>> >> >> of the write pattern depends further on the block allocation mechanism of the file
>> >> >> system.
>> >> >> 
>> >> >> Note that the goal here is not to hide to applications the sequential write
>> >> >> constraint of SMR disks. The page cache itself (the mapping of the block
>> >> >> device inode) remains unchanged. But the modification proposed guarantees that
>> >> >> a well behaved application writing sequentially to zones through the page cache
>> >> >> will see successful sync operations.
>> >> >
>> >> >So the easiest solution for the OS, when the application is already aware
>> >> >of the storage constraints, would be for an application to use direct IO.
>> >> >Because when using page-cache and writeback there are all sorts of
>> >> >unexpected things that can happen (e.g. writeback decides to skip a page
>> >> >because someone else locked it temporarily). So it will work in 99.9% of
>> >> >cases but sometimes things will be out of order for hard-to-track down
>> >> >reasons. And for ordinary drives this is not an issue because we just slow
>> >> >down writeback a bit but rareness of this makes it non-issue. But for host
>> >> >managed SMR the IO fails and that is something the application does not
>> >> >expect.
>> >> >
>> >> >So I would really say just avoid using page-cache when you are using SMR
>> >> >drives directly without a translation layer. For writes your throughput
>> >> >won't suffer anyway since you have to do big sequential writes. Using
>> >> >page-cache for reads may still be beneficial and if you are careful enough
>> >> >not to do direct IO writes to the same range as you do buffered reads, this
>> >> >will work fine.
>> >> >
>> >> >Thinking some more - if you want to make it foolproof, you could implement
>> >> >something like read-only page cache for block devices. Any write will be in
>> >> >fact direct IO write, writeable mmaps will be disallowed, reads will honor
>> >> >O_DIRECT flag.
>> >> 
>> >> Hi Jan,
>> >> 
>> >> Indeed, using O_DIRECT for raw block device write is an obvious solution to
>> >> guarantee the application successful sequential writes within a zone. However,
>> >> host-managed SMR disks (and to a lesser extent host-aware drives too) already
>> >> put on applications the constraint of ensuring sequential writes. Adding to this
>> >> further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
>> >> if not too much.
>> >
>> >So I don't think adding O_DIRECT to open flags is such a burden -
>> >sequential writes are IMO much harder to do :). And furthermore this could
>> >happen magically inside the kernel in which case app needn't be aware about
>> >this at all (similarly to how we handle writes to persistent memory).
>> > 
>> >> The example you mention above of writeback skipping a locked page and resulting
>> >> in I/O errors is precisely what the proposed patch avoids by first locking the
>> >> zone the page belongs to. In the same spirit as the writeback page locking, if
>> >> the zone is already locked, it is skipped. That is, zones are treated in a sense
>> >> as gigantic pages, ensuring that the actual dirty pages within each one are
>> >> processed in one go, sequentially.
>> >
>> >But you cannot rule out mm subsystem locking a page to do something (e.g.
>> >migrate the page to help with compaction of large order pages). These other
>> >places accessing and locking pages are what I'm worried about. Furthermore
>> >kswapd can decide to writeback particular page under memory pressure and
>> >that will just make SMR disk freak out.
>> >
>> >> This allows preserving all possible application level accesses (buffered,
>> >> direct or mmapped). The only constraint is the one the disk imposes:
>> >> writes must be sequential.
>> >> 
>> >> Granted, this view may be too simplistic and may be overlooking some hard
>> >> to track page locking paths which will compete with this. But I think
>> >> that this can be easily solved by forcing the zone-aligned
>> >> generic_writepages calls to not skip any page (a flag in struct
>> >> writeback_control would do the trick). And no modification is necessary
>> >> on the read side (i.e. page locking only is enough) since reading an SMR
>> >> disks blocks after a zone write-pointer position does not make sense (in
>> >> Hannes code, this is possible, but the request does not go to the disk
>> >> and returns garbage data).
>> >> 
>> >> Bottom line: no fundamental change to the page caching mechanism, only
>> >> how it is being used/controlled for writeback makes this work.
>> >> Considering the benefits on the application side, it is in my opinion a
>> >> valid modification to have.
>> >
>> >See above, there are quite a few places which will break your assumptions.
>> >And I don't think changing them all to handle SMR is worth it. IMO caching
>> >sequential writes to SMR disks has low effect (if any) anyway so I would
>> >just avoid that. We can talk about how to make this as seamless to
>> >applications as possible. The only thing which I don't think is reasonably
>> >doable without dirtying pagecache are writeable mmaps of an SMR device so
>> >applications would have to avoid that.
>> 
>> Jan,
>> 
>> Thank you for your insight.
>> These "few places" breaking sequential write sequences are indeed
>> problematic for SMR drives. At the same time, I wonder how these paths
>> would react to an I/O error generated by the check "write at write
>> pointer" in the request submission path at the SCSI level. Could these be
>> ignored in the case of an "unaligned write error" ? That is, the page is
>> left dirty and hopefully the regular writeback path catches them later in
>> the proper sequence.
>
>You'd hope ;) But in fact what happens is that the page ends
>up being clean, marked as having error, and buffers will not be uptodate =>
>you have just lost one page worth of data. See what happens in
>end_buffer_async_write(). Now our behavior in presence of IO errors needs
>improvement for a long time so you are certainly welcome to improve on this
>but what I described is what happens now.


Jan,

Got it. Thanks for the pointers.
I will work a little more on identifying this. In any case, the first problem
to tackle is, I think, getting more information back than just -EIO on error. Without
that, there is no chance of ever being able to retry recoverable errors (unaligned writes).
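
For instance, the SCSI completion path could recognize the ZBC unaligned
write sense code and report it as something more specific than -EIO. A rough
sketch, assuming the sense buffer is available at that point (the ASC/ASCQ
pair is 0x21/0x04, UNALIGNED WRITE COMMAND, if I remember the ZBC codes
correctly):

#include <scsi/scsi_proto.h>
#include <scsi/scsi_common.h>

/* Sketch: detect a ZBC unaligned write failure from the sense data */
static bool sense_is_unaligned_write(const unsigned char *sense, int sense_len)
{
        struct scsi_sense_hdr sshdr;

        if (!scsi_normalize_sense(sense, sense_len, &sshdr))
                return false;

        return sshdr.sense_key == ILLEGAL_REQUEST &&
               sshdr.asc == 0x21 && sshdr.ascq == 0x04;
}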

Thanks !

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-03-01  0:43               ` Damien Le Moal
@ 2016-03-01  9:27                 ` Jan Kara
  2016-03-01 12:00                   ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-03-01  9:27 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jan Kara, linux-block, Bart Van Assche, Matias Bjorling,
	linux-scsi, lsf-pc

On Tue 01-03-16 00:43:37, Damien Le Moal wrote:
> From:  Jan Kara <jack@suse.cz>
> Date:  Monday, February 29, 2016 at 22:40
> To:  Damien Le Moal <Damien.LeMoal@hgst.com>
> Cc:  Jan Kara <jack@suse.cz>, "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>, Bart Van Assche <bart.vanassche@sandisk.com>, Matias Bjorling <m@bjorling.me>, "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>, "lsf-pc@lists.linuxfoundation.org" <lsf-pc@lists.linuxfoundation.org>
> Subject:  Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
> 
> 
> >On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
> >> 
> >> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
> >> >> 
> >> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> >> >> >> 
> >> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> >> >> >> 2) Write back of dirty pages to SMR block devices:
> >> >> >> >>
> >> >> >> >> Dirty pages of a block device inode are currently processed using the
> >> >> >> >> generic_writepages function, which can be executed simultaneously
> >> >> >> >> by multiple contexts (e.g sync, fsync, msync, sync_file_range, etc).
> >> >> >> >> Mutual exclusion of the dirty page processing being achieved only at
> >> >> >> >> the page level (page lock & page writeback flag), multiple processes
> >> >> >> >> executing a "sync" of overlapping block ranges over the same zone of
> >> >> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
> >> >> >> >> being sent to the underlying device. On a host managed SMR disk, where
> >> >> >> >> sequential write to disk zones is mandatory, this result in errors and
> >> >> >> >> the impossibility for an application using raw sequential disk write
> >> >> >> >> accesses to be guaranteed successful completion of its write or fsync
> >> >> >> >> requests.
> >> >> >> >>
> >> >> >> >> Using the zone information attached to the SMR block device queue
> >> >> >> >> (introduced by Hannes), calls to the generic_writepages function can
> >> >> >> >> be made mutually exclusive on a per zone basis by locking the zones.
> >> >> >> >> This guarantees sequential request generation for each zone and avoid
> >> >> >> >> write errors without any modification to the generic code implementing
> >> >> >> >> generic_writepages.
> >> >> >> >>
> >> >> >> >> This is but one possible solution for supporting SMR host-managed
> >> >> >> >> devices without any major rewrite of page cache management and
> >> >> >> >> write-back processing. The opinion of the audience regarding this
> >> >> >> >> solution and discussing other potential solutions would be greatly
> >> >> >> >> appreciated.
> >> >> >> >
> >> >> >> >Hello Damien,
> >> >> >> >
> >> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives 
> >> >> >> >or would you also like to see that filesystems like ext4 can use SMR 
> >> >> >> >drives ? In the latter case: the behavior of SMR drives differs so 
> >> >> >> >significantly from that of other block devices that I'm not sure that we 
> >> >> >> >should try to support these directly from infrastructure like the page 
> >> >> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics 
> >> >> >> >of NAND do not match what filesystems expect (e.g. large erase blocks). 
> >> >> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer), 
> >> >> >> >either inside the SSD or as a separate software driver. An FTL 
> >> >> >> >implements a so-called LFS (log-structured filesystem). With what I know 
> >> >> >> >about SMR this technology looks also suitable for implementation of a 
> >> >> >> >LFS. Has it already been considered to implement an LFS driver for SMR 
> >> >> >> >drives ? That would make it possible for any filesystem to access an SMR 
> >> >> >> >drive as any other block device. I'm not sure of this but maybe it will 
> >> >> >> >be possible to share some infrastructure with the LightNVM driver 
> >> >> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver 
> >> >> >> >namely implements an FTL.
> >> >> >> 
> >> >> >> I totally agree with you that trying to support SMR disks by only modifying
> >> >> >> the page cache so that unmodified standard file systems like BTRFS or ext4
> >> >> >> remain operational is not realistic at best, and more likely simply impossible.
> >> >> >> For this kind of use case, as you said, an FTL or a device mapper driver are
> >> >> >> much more suitable.
> >> >> >> 
> >> >> >> The case I am considering for this discussion is for raw block device accesses
> >> >> >> by an application (writes from user space to /dev/sdxx). This is a very likely
> >> >> >> use case scenario for high capacity SMR disks with applications like distributed
> >> >> >> object stores / key value stores.
> >> >> >> 
> >> >> >> In this case, write-back of dirty pages in the block device file inode mapping
> >> >> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
> >> >> >> This does not guarantee the generation of the required sequential write pattern
> >> >> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
> >> >> >> function to zone boundaries while locking the zones under write-back solves
> >> >> >> simply the problem (implemented and tested). This is of course only one possible
> >> >> >> solution. Pushing modifications deeper in the code or providing a
> >> >> >> "generic_sequential_writepages" helper function are other potential solutions
> >> >> >> that in my opinion are worth discussing as other types of devices may benefit also
> >> >> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
> >> >> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
> >> >> >> driver.
> >> >> >> 
> >> >> >> For a file system, an SMR compliant implementation of a file inode mapping
> >> >> >> writepages method should be provided by the file system itself as the sequentiality
> >> >> >> of the write pattern depends further on the block allocation mechanism of the file
> >> >> >> system.
> >> >> >> 
> >> >> >> Note that the goal here is not to hide the sequential write constraint of
> >> >> >> SMR disks from applications. The page cache itself (the mapping of the block
> >> >> >> device inode) remains unchanged. But the proposed modification guarantees that
> >> >> >> a well-behaved application writing sequentially to zones through the page cache
> >> >> >> will see successful sync operations.
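
To make the zone-aligned writeback idea above concrete, here is a rough
sketch of what such a wrapper around generic_writepages() could look like.
The zone size lookup (bdev_zone_sectors()) and the per-zone locking helpers
(blk_zone_trylock()/blk_zone_unlock()) are invented names for this sketch,
not existing kernel interfaces; only generic_writepages(), struct
writeback_control and I_BDEV() are the real interfaces referred to in the
discussion.

/*
 * Rough sketch only: write back the dirty pages of a block device inode
 * one zone at a time, holding a per-zone lock so that a single context
 * generates the writes for any given zone.  bdev_zone_sectors(),
 * blk_zone_trylock() and blk_zone_unlock() are invented names.
 */
static int blkdev_sequential_writepages(struct address_space *mapping,
                                        struct writeback_control *wbc)
{
        struct block_device *bdev = I_BDEV(mapping->host);
        loff_t zone_bytes = (loff_t)bdev_zone_sectors(bdev) << 9; /* invented */
        loff_t start = round_down(wbc->range_start, zone_bytes);  /* power-of-2 zones assumed */
        loff_t end = wbc->range_end;
        int ret = 0;

        for (; start <= end && ret == 0; start += zone_bytes) {
                struct writeback_control zone_wbc = *wbc;

                /* Another context is already writing this zone back: skip it. */
                if (!blk_zone_trylock(bdev, start >> 9))           /* invented */
                        continue;

                /* All dirty pages of the zone go out in one, in-order pass. */
                zone_wbc.range_start = start;
                zone_wbc.range_end = min(end, start + zone_bytes - 1);
                ret = generic_writepages(mapping, &zone_wbc);

                blk_zone_unlock(bdev, start >> 9);                 /* invented */
        }
        return ret;
}

The point is only that the zone, rather than the page, becomes the unit of
mutual exclusion for writeback; the page cache itself is left untouched.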
> >> >> >
> >> >> >So the easiest solution for the OS, when the application is already aware
> >> >> >of the storage constraints, would be for the application to use direct IO.
> >> >> >When using the page cache and writeback, all sorts of unexpected things
> >> >> >can happen (e.g. writeback decides to skip a page because someone else
> >> >> >locked it temporarily). So it will work in 99.9% of cases, but sometimes
> >> >> >things will be out of order for hard-to-track-down reasons. For ordinary
> >> >> >drives this is not an issue, because we just slow down writeback a bit and
> >> >> >the rareness of this makes it a non-issue. But for host-managed SMR the IO
> >> >> >fails, and that is something the application does not expect.
> >> >> >
> >> >> >So I would really say just avoid using the page cache when you are using
> >> >> >SMR drives directly without a translation layer. For writes your throughput
> >> >> >won't suffer anyway, since you have to do big sequential writes. Using the
> >> >> >page cache for reads may still be beneficial, and if you are careful not to
> >> >> >do direct IO writes to the same range as your buffered reads, this will
> >> >> >work fine.
> >> >> >
> >> >> >Thinking some more - if you want to make it foolproof, you could implement
> >> >> >something like a read-only page cache for block devices. Any write would in
> >> >> >fact be a direct IO write, writeable mmaps would be disallowed, and reads
> >> >> >would honor the O_DIRECT flag.
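
As a purely illustrative, user-space sketch of that suggestion, the program
below writes sequentially to one zone of a host-managed disk with O_DIRECT,
so the page cache is never dirtied by writes. The device path /dev/sdX, the
zone start offset and the 1 MiB I/O size are example values made up for the
sketch, not taken from the thread.

/* Illustrative sketch only: sequential O_DIRECT writes to one zone.
 * The device path, zone offset and I/O size below are example values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ZONE_START (256ULL << 20)  /* example: zone starts at 256 MiB */
#define IO_SIZE    (1U << 20)      /* 1 MiB, a multiple of the block size */

int main(void)
{
        void *buf;
        off_t wp = ZONE_START;     /* application's copy of the write pointer */
        int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, IO_SIZE))
                return 1;
        memset(buf, 0, IO_SIZE);

        /* Writes bypass the page cache and are issued strictly in LBA order. */
        for (int i = 0; i < 16; i++) {
                ssize_t ret = pwrite(fd, buf, IO_SIZE, wp);
                if (ret != (ssize_t)IO_SIZE) {
                        perror("pwrite");
                        break;
                }
                wp += ret;         /* advance in lock step with the zone write pointer */
        }
        free(buf);
        close(fd);
        return 0;
}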
> >> >> 
> >> >> Hi Jan,
> >> >> 
> >> >> Indeed, using O_DIRECT for raw block device writes is an obvious solution to
> >> >> guarantee the application successful sequential writes within a zone. However,
> >> >> host-managed SMR disks (and to a lesser extent host-aware drives too) already
> >> >> put the constraint of ensuring sequential writes on applications. Adding to
> >> >> this a further mandatory rewrite to support direct I/O is, in my opinion,
> >> >> asking a lot, if not too much.
> >> >
> >> >So I don't think adding O_DIRECT to the open flags is such a burden -
> >> >sequential writes are IMO much harder to do :). And furthermore, this could
> >> >happen magically inside the kernel, in which case the app needn't be aware of
> >> >this at all (similar to how we handle writes to persistent memory).
> >> > 
> >> >> The example you mention above of writeback skipping a locked page and resulting
> >> >> in I/O errors is precisely what the proposed patch avoids by first locking the
> >> >> zone the page belongs to. In the same spirit as the writeback page locking, if
> >> >> the zone is already locked, it is skipped. That is, zones are treated in a sense
> >> >> as gigantic pages, ensuring that the actual dirty pages within each one are
> >> >> processed in one go, sequentially.
> >> >
> >> >But you cannot rule out the mm subsystem locking a page to do something (e.g.
> >> >migrating the page to help with compaction of large-order pages). These other
> >> >places accessing and locking pages are what I'm worried about. Furthermore,
> >> >kswapd can decide to write back a particular page under memory pressure, and
> >> >that will just make the SMR disk freak out.
> >> >
> >> >> This allows preserving all possible application-level accesses (buffered,
> >> >> direct or mmapped). The only constraint is the one the disk imposes:
> >> >> writes must be sequential.
> >> >> 
> >> >> Granted, this view may be too simplistic and may be overlooking some
> >> >> hard-to-track page locking paths which will compete with this. But I think
> >> >> that this can be easily solved by forcing the zone-aligned
> >> >> generic_writepages calls to not skip any page (a flag in struct
> >> >> writeback_control would do the trick). And no modification is necessary
> >> >> on the read side (i.e. page locking alone is enough), since reading an SMR
> >> >> disk's blocks past a zone's write-pointer position does not make sense (in
> >> >> Hannes' code, this is possible, but the request does not go to the disk
> >> >> and returns garbage data).
> >> >> 
> >> >> Bottom line: no fundamental change to the page caching mechanism is needed;
> >> >> only a change to how it is used/controlled for writeback makes this work.
> >> >> Considering the benefits on the application side, it is in my opinion a
> >> >> valid modification to have.
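
Just to make the "flag in struct writeback_control" idea above concrete,
here is a purely hypothetical sketch; neither the field nor the check below
exists in the mainline kernel, and the field name is made up for this
illustration.

/* Hypothetical illustration only: assume a new bit "no_skip_locked" were
 * added to struct writeback_control.  A write_cache_pages()-style scan
 * could then wait for a locked page instead of skipping it, preserving
 * the in-zone write order.  Neither the field nor this check exists in
 * the mainline kernel; the name is made up for this sketch.
 */
if (!trylock_page(page)) {
        if (!wbc->no_skip_locked)
                continue;               /* current behavior: skip busy pages */
        lock_page(page);                /* zone-aligned caller: wait instead */
}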
> >> >
> >> >See above: there are quite a few places which will break your assumptions,
> >> >and I don't think changing them all to handle SMR is worth it. IMO caching
> >> >sequential writes to SMR disks has little effect (if any) anyway, so I would
> >> >just avoid that. We can talk about how to make this as seamless to
> >> >applications as possible. The only thing which I don't think is reasonably
> >> >doable without dirtying the page cache is writeable mmaps of an SMR device,
> >> >so applications would have to avoid that.
> >> 
> >> Jan,
> >> 
> >> Thank you for your insight.
> >> These "few places" breaking sequential write sequences are indeed
> >> problematic for SMR drives. At the same time, I wonder how these paths
> >> would react to an I/O error generated by the check "write at write
> >> pointer" in the request submission path at the SCSI level. Could these be
> >> ignored in the case of an "unaligned write error" ? That is, the page is
> >> left dirty and hopefully the regular writeback path catches them later in
> >> the proper sequence.
> >
> >You'd hope ;) But in fact what happens is that the page ends up being
> >clean, marked as having an error, and the buffers will not be uptodate =>
> >you have just lost one page's worth of data. See what happens in
> >end_buffer_async_write(). Our behavior in the presence of IO errors has
> >needed improvement for a long time, so you are certainly welcome to improve
> >on this, but what I described is what happens now.
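
For reference, the error path being pointed at looks roughly like the
following. This is an abridged paraphrase of
fs/buffer.c:end_buffer_async_write() from kernels of that era, not an exact
quote; the page was already cleaned before the I/O was submitted, so once
this runs only the error bits remain and the data is gone.

/* Abridged paraphrase of fs/buffer.c:end_buffer_async_write() (not an
 * exact quote).  On a write error the buffer loses its uptodate state
 * and only error flags are left behind; the page itself was already
 * cleaned for writeback, so the dirty data is effectively lost.
 */
static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
        struct page *page = bh->b_page;

        if (uptodate) {
                set_buffer_uptodate(bh);
        } else {
                buffer_io_error(bh, ", lost async page write");
                set_bit(AS_EIO, &page->mapping->flags);
                set_buffer_write_io_error(bh);
                clear_buffer_uptodate(bh);
                SetPageError(page);
        }
        /* ... clear the async_write bit on all of the page's buffers,
         * then, once the last one completes: ... */
        end_page_writeback(page);
}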
> 
> 
> Jan,
> 
> Got it. Thanks for the pointers. I will work a little more on
> identifying this. In any case, the first problem to tackle, I guess, is to
> get more information than just -EIO on error. Without that, there is no
> chance of ever being able to retry recoverable errors (unaligned writes).

Yes, propagating more information to the fs / writeback code so that it can
distinguish permanent errors from transient ones is certainly useful for
use cases other than SMR.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
  2016-03-01  9:27                 ` Jan Kara
@ 2016-03-01 12:00                   ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2016-03-01 12:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Damien Le Moal, linux-block, Bart Van Assche, Matias Bjorling,
	linux-scsi, lsf-pc

Any chance the two of you could occasionally trim the mails you're quoting?


Thread overview: 13+ messages
2016-02-23  2:56 [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages Damien Le Moal
2016-02-23  3:56 ` Bart Van Assche
2016-02-23  5:31   ` Damien Le Moal
2016-02-23  8:40     ` [Lsf-pc] " Jan Kara
2016-02-24  1:53       ` Damien Le Moal
2016-02-24  8:47         ` Jan Kara
2016-02-29  2:02           ` Damien Le Moal
2016-02-29  3:06             ` Hannes Reinecke
2016-02-29  5:54               ` Damien Le Moal
2016-02-29 13:40             ` Jan Kara
2016-03-01  0:43               ` Damien Le Moal
2016-03-01  9:27                 ` Jan Kara
2016-03-01 12:00                   ` Christoph Hellwig
