linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Zoned Block Devices
@ 2019-01-28 12:56 Matias Bjorling
  2019-01-28 15:07 ` Bart Van Assche
  2019-01-29  8:25 ` Javier González
  0 siblings, 2 replies; 4+ messages in thread
From: Matias Bjorling @ 2019-01-28 12:56 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel, linux-block, linux-ide, linux-scsi,
	linux-nvme, Damien Le Moal

Hi,

Damien and I would like to propose a couple of topics centering around 
zoned block devices:

1) Zoned block devices require that writes to a zone are sequential. If 
the writes are dispatched to the device out of order, the drive rejects 
the write with a write failure.

So far it has been the responsibility of the deadline I/O scheduler to 
serialize writes to zones to avoid intra-zone write command reordering. 
This I/O scheduler based approach has worked so far for HDDs, but we can 
do better for multi-queue devices. NVMe has support for multiple queues, 
and one could dedicate a single queue to writes alone. Furthermore, the 
queue is processed in order, enabling the host to serialize writes on 
the queue instead of issuing them one by one. We would like to gather 
feedback on this approach (a new HCTX_TYPE_WRITE).
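
A rough sketch of the idea, purely for discussion: HCTX_TYPE_WRITE does
not exist today (the current types are HCTX_TYPE_DEFAULT, HCTX_TYPE_READ
and HCTX_TYPE_POLL), so the enum and helper below are invented only to
illustrate how writes to a zoned device could be steered to a dedicated,
in-order queue:

#include <stdbool.h>

/*
 * Hypothetical classification of a request into a hardware queue type.
 * Writes to zoned devices go to the proposed write-only queue; everything
 * else keeps the default mapping.
 */
enum example_hctx_type {
	EX_HCTX_TYPE_DEFAULT,	/* mirrors today's HCTX_TYPE_DEFAULT */
	EX_HCTX_TYPE_WRITE,	/* the proposed write-only queue type */
};

static enum example_hctx_type ex_pick_hctx_type(bool op_is_write,
						bool dev_is_zoned)
{
	if (op_is_write && dev_is_zoned)
		return EX_HCTX_TYPE_WRITE;
	return EX_HCTX_TYPE_DEFAULT;
}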

2) Adoption of Zone Append in file-systems and user-space applications.

A Zone Append command, together with Zoned Namespaces, is being defined 
in the NVMe workgroup. The new command allows one to automatically 
direct writes to a zone's write pointer position, similar to writing to 
a file opened with O_APPEND. With this append command, the drive returns 
where the data was written in the zone. This provides two benefits:

(A) It moves fine-grained logical block allocation from the file-system 
to the device side. A file-system continues to do coarse-grained logical 
block allocation, but the specific LBAs where data is written are 
reported back by the device, thus improving file-system performance. The 
current target is XFS, but we would like to hear about the feasibility 
of using it in other file-systems.

(B) It lets the host issue multiple outstanding write I/Os to a zone 
without having to maintain I/O order, thus improving the performance of 
the drive while also reducing the need for zone locking on the host side.
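
To make the semantics concrete, here is a purely illustrative sketch of
what Zone Append could look like to a caller. The command and any kernel
interface for it are still being defined, so the function name,
parameters, and types below are invented for this example:

#include <stddef.h>
#include <stdint.h>

/*
 * Append 'len' bytes of 'buf' somewhere in the zone starting at
 * 'zone_start_lba'.  The device writes at the zone's current write
 * pointer and reports the LBA it actually used via 'written_lba',
 * much like learning the resulting offset after writing to a file
 * opened with O_APPEND.
 */
int zone_append(int fd, uint64_t zone_start_lba,
		const void *buf, size_t len,
		uint64_t *written_lba);

/*
 * Because the device resolves the exact placement, several of these
 * calls can be outstanding against the same zone at once, without
 * host-side zone locking or strict submission ordering.
 */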

Are there other use cases for this, and would an interface like this be 
valuable in the kernel? If the interface is successful, we would expect 
it to be proposed for ATA/SCSI standardization as well.

Thanks, Matias



* Re: [LSF/MM TOPIC] Zoned Block Devices
  2019-01-28 12:56 [LSF/MM TOPIC] Zoned Block Devices Matias Bjorling
@ 2019-01-28 15:07 ` Bart Van Assche
  2019-01-28 18:40   ` Matias Bjorling
  2019-01-29  8:25 ` Javier González
  1 sibling, 1 reply; 4+ messages in thread
From: Bart Van Assche @ 2019-01-28 15:07 UTC (permalink / raw)
  To: Matias Bjorling, lsf-pc, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme, Damien Le Moal

On 1/28/19 4:56 AM, Matias Bjorling wrote:
> Damien and I would like to propose a couple of topics centering around
> zoned block devices:
> 
> 1) Zoned block devices require that writes to a zone are sequential. If
> the writes are dispatched to the device out of order, the drive rejects
> the write with a write failure.
> 
> So far it has been the responsibility of the deadline I/O scheduler to
> serialize writes to zones to avoid intra-zone write command reordering.
> This I/O scheduler based approach has worked so far for HDDs, but we can
> do better for multi-queue devices. NVMe has support for multiple queues,
> and one could dedicate a single queue to writes alone. Furthermore, the
> queue is processed in order, enabling the host to serialize writes on
> the queue instead of issuing them one by one. We would like to gather
> feedback on this approach (a new HCTX_TYPE_WRITE).
> 
> 2) Adoption of Zone Append in file-systems and user-space applications.
> 
> A Zone Append command, together with Zoned Namespaces, is being defined
> in the NVMe workgroup. The new command allows one to automatically
> direct writes to a zone's write pointer position, similar to writing to
> a file opened with O_APPEND. With this append command, the drive returns
> where the data was written in the zone. This provides two benefits:
> 
> (A) It moves fine-grained logical block allocation from the file-system
> to the device side. A file-system continues to do coarse-grained logical
> block allocation, but the specific LBAs where data is written are
> reported back by the device, thus improving file-system performance. The
> current target is XFS, but we would like to hear about the feasibility
> of using it in other file-systems.
> 
> (B) It lets the host issue multiple outstanding write I/Os to a zone
> without having to maintain I/O order, thus improving the performance of
> the drive while also reducing the need for zone locking on the host side.
> 
> Are there other use cases for this, and would an interface like this be
> valuable in the kernel? If the interface is successful, we would expect
> it to be proposed for ATA/SCSI standardization as well.

Hi Matias,

This topic proposal sounds interesting to me, but I think it is 
incomplete. Shouldn't we also discuss how user-space applications are 
expected to submit "zone append" writes? Which system call should e.g. 
fio use to submit this new type of write request? How will the offset 
at which the data has been written be communicated back to user space?

Thanks,

Bart.


* Re: [LSF/MM TOPIC] Zoned Block Devices
  2019-01-28 15:07 ` Bart Van Assche
@ 2019-01-28 18:40   ` Matias Bjorling
  0 siblings, 0 replies; 4+ messages in thread
From: Matias Bjorling @ 2019-01-28 18:40 UTC (permalink / raw)
  To: Bart Van Assche, lsf-pc, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme, Damien Le Moal, axboe

On 1/28/19 4:07 PM, Bart Van Assche wrote:
> On 1/28/19 4:56 AM, Matias Bjorling wrote:
>> Damien and I would like to propose a couple of topics centering around
>> zoned block devices:
>>
>> 1) Zoned block devices require that writes to a zone are sequential. If
>> the writes are dispatched to the device out of order, the drive rejects
>> the write with a write failure.
>>
>> So far it has been the responsibility of the deadline I/O scheduler to
>> serialize writes to zones to avoid intra-zone write command reordering.
>> This I/O scheduler based approach has worked so far for HDDs, but we can
>> do better for multi-queue devices. NVMe has support for multiple queues,
>> and one could dedicate a single queue to writes alone. Furthermore, the
>> queue is processed in order, enabling the host to serialize writes on
>> the queue instead of issuing them one by one. We would like to gather
>> feedback on this approach (a new HCTX_TYPE_WRITE).
>>
>> 2) Adoption of Zone Append in file-systems and user-space applications.
>>
>> A Zone Append command, together with Zoned Namespaces, is being defined
>> in the NVMe workgroup. The new command allows one to automatically
>> direct writes to a zone's write pointer position, similar to writing to
>> a file opened with O_APPEND. With this append command, the drive returns
>> where the data was written in the zone. This provides two benefits:
>>
>> (A) It moves fine-grained logical block allocation from the file-system
>> to the device side. A file-system continues to do coarse-grained logical
>> block allocation, but the specific LBAs where data is written are
>> reported back by the device, thus improving file-system performance. The
>> current target is XFS, but we would like to hear about the feasibility
>> of using it in other file-systems.
>>
>> (B) It lets the host issue multiple outstanding write I/Os to a zone
>> without having to maintain I/O order, thus improving the performance of
>> the drive while also reducing the need for zone locking on the host side.
>>
>> Are there other use cases for this, and would an interface like this be
>> valuable in the kernel? If the interface is successful, we would expect
>> it to be proposed for ATA/SCSI standardization as well.
>
> Hi Matias,
>
> This topic proposal sounds interesting to me, but I think it is 
> incomplete. Shouldn't we also discuss how user-space applications are 
> expected to submit "zone append" writes? Which system call should e.g. 
> fio use to submit this new type of write request? How will the offset 
> at which the data has been written be communicated back to user space?
>
> Thanks,
>
> Bart.

Hi Bart,

That's a good point. Originally, we only looked into support for 
file-systems due to the complexity of exposing it to user-space (e.g., 
we do not have an easy way to support psync/libaio workloads). I would 
love for us to be able to combine this with liburing, such that an LBA 
can be returned on I/O completion. However, I'm not sure we have enough 
bits available on the completion entry.
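
For reference, the completion entry in the current io_uring/liburing
patches looks roughly like the structure below. With user_data echoing
the submission cookie, only the 32-bit res field (plus flags) is left
for a result, which is why returning a full LBA on completion is a
tight fit:

struct io_uring_cqe {
	__u64	user_data;	/* cookie copied from the submission entry */
	__s32	res;		/* result code for this event */
	__u32	flags;
};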

-Matias



* Re: [LSF/MM TOPIC] Zoned Block Devices
  2019-01-28 12:56 [LSF/MM TOPIC] Zoned Block Devices Matias Bjorling
  2019-01-28 15:07 ` Bart Van Assche
@ 2019-01-29  8:25 ` Javier González
  1 sibling, 0 replies; 4+ messages in thread
From: Javier González @ 2019-01-29  8:25 UTC (permalink / raw)
  To: Matias Bjorling
  Cc: lsf-pc, linux-fsdevel, linux-block, linux-ide, linux-scsi,
	linux-nvme, Damien Le Moal


> On 28 Jan 2019, at 13.56, Matias Bjorling <Matias.Bjorling@wdc.com> wrote:
> 
> Hi,
> 
> Damien and I would like to propose a couple of topics centering around
> zoned block devices:
> 
> 1) Zoned block devices require that writes to a zone are sequential. If
> the writes are dispatched to the device out of order, the drive rejects
> the write with a write failure.
> 
> So far it has been the responsibility of the deadline I/O scheduler to
> serialize writes to zones to avoid intra-zone write command reordering.
> This I/O scheduler based approach has worked so far for HDDs, but we can
> do better for multi-queue devices. NVMe has support for multiple queues,
> and one could dedicate a single queue to writes alone. Furthermore, the
> queue is processed in order, enabling the host to serialize writes on
> the queue instead of issuing them one by one. We would like to gather
> feedback on this approach (a new HCTX_TYPE_WRITE).
> 
> 2) Adoption of Zone Append in file-systems and user-space applications.
> 
> A Zone Append command, together with Zoned Namespaces, is being defined
> in the NVMe workgroup. The new command allows one to automatically
> direct writes to a zone's write pointer position, similar to writing to
> a file opened with O_APPEND. With this append command, the drive returns
> where the data was written in the zone. This provides two benefits:
> 
> (A) It moves fine-grained logical block allocation from the file-system
> to the device side. A file-system continues to do coarse-grained logical
> block allocation, but the specific LBAs where data is written are
> reported back by the device, thus improving file-system performance. The
> current target is XFS, but we would like to hear about the feasibility
> of using it in other file-systems.
> 
> (B) It lets the host issue multiple outstanding write I/Os to a zone
> without having to maintain I/O order, thus improving the performance of
> the drive while also reducing the need for zone locking on the host side.
> 
> Are there other use cases for this, and would an interface like this be
> valuable in the kernel? If the interface is successful, we would expect
> it to be proposed for ATA/SCSI standardization as well.
> 
> Thanks, Matias

This topic is of interest to me as well.

For the append command, I think we also need to discuss the error model, 
as writes can still fail (e.g., a zone has shrunk due to previous, 
hidden write errors and the host has not yet updated its zone metadata).
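
As a concrete example of the recovery side, one way for the host to
resynchronize after a failed write is to re-read the zone descriptor
from the drive. A minimal sketch using the existing BLKREPORTZONE ioctl
(error handling trimmed, helper name invented):

#include <linux/blkzoned.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Re-read one zone descriptor so the cached write pointer and condition
 * match what the drive currently reports. */
static int refresh_zone(int fd, uint64_t zone_start_sector,
			struct blk_zone *out)
{
	struct blk_zone_report *rep;
	int ret;

	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	if (!rep)
		return -1;

	rep->sector = zone_start_sector;	/* zone to report on */
	rep->nr_zones = 1;			/* room for one descriptor */

	ret = ioctl(fd, BLKREPORTZONE, rep);
	if (ret == 0 && rep->nr_zones >= 1)
		*out = rep->zones[0];		/* fresh wp, cond, len */
	else
		ret = -1;

	free(rep);
	return ret;
}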

Thanks,
Javier



end of thread, other threads:[~2019-01-29  8:25 UTC | newest]

Thread overview: 4+ messages
-- links below jump to the message on this page --
2019-01-28 12:56 [LSF/MM TOPIC] Zoned Block Devices Matias Bjorling
2019-01-28 15:07 ` Bart Van Assche
2019-01-28 18:40   ` Matias Bjorling
2019-01-29  8:25 ` Javier González
