* "hardware-assisted zeroing"
@ 2022-01-03 11:08 Eric Levy
  2022-01-03 11:17 ` Qu Wenruo
  2022-01-03 11:46 ` David Disseldorp
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Levy @ 2022-01-03 11:08 UTC (permalink / raw)
  To: linux-btrfs

I am operating a Btrfs file system on logical volumes provided through
an iSCSI target. The software managing the volumes shows that they are
configured for certain features, which include "hardware-assisted
zeroing" and "space reclamation". Presumably the meaning of these
features, at least the former, is that a file system, with support of
the kernel, may issue a SCSI command indicating that a region of a
block device would be cleared. For a file system, such an operation has
no direct value, because the contents of de-allocated space are
irrelevant, but for a logical volume, it creates an opportunity to free
space on the underlying file system on the back end.

I have searched the term "hardware-assisted zeroing", without finding
any useful resources on the application of the term.

Does it describe a feature supported by Btrfs or Linux? Is it possible
for a LUN manager to "know" that Btrfs has freed space on a volume, in
a region that had previously been allocated?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:08 "hardware-assisted zeroing" Eric Levy
@ 2022-01-03 11:17 ` Qu Wenruo
  2022-01-03 11:24   ` Eric Levy
  2022-01-03 11:46 ` David Disseldorp
  1 sibling, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2022-01-03 11:17 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs



On 2022/1/3 19:08, Eric Levy wrote:
> I am operating a Btrfs file system on logical volumes provided through
> an iSCSI target. The software managing the volumes shows that they are
> configured for certain features, which include "hardware-assisted
> zeroing" and "space reclamation". Presumably the meaning of these
> features, at least the former, is that a file system, with support of
> the kernel, may issue a SCSI command indicating that a region of a
> block device would be cleared.

This looks pretty much like the ATA TRIM or SCSI UNMAP command.

If they are the same, then btrfs supports it via either the fstrim command
(recommended) or the discard mount option.
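
For example (the device and mountpoint here are just placeholders):

  # one-shot trim of all currently unused space, run periodically
  fstrim -v /mnt

  # or let the filesystem discard freed space as it goes
  mount -o discard /dev/sdX /mnt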

Thanks,
Qu

> For a file system, such an operation has
> no direct value, because the contents of de-allocated space are
> irrelevant, but for a logical volume, it creates an opportunity to free
> space on the underlying file system on the back end.
>
> I have searched the term "hardware-assisted zeroing", without finding
> any useful resources on the application of the term.
>
> Does it describe a feature supported by Btrfs or Linux? Is it possible
> for a LUN manager to "know" that Btrfs has freed space on a volume, in
> a region that had previously been allocated?
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:17 ` Qu Wenruo
@ 2022-01-03 11:24   ` Eric Levy
  2022-01-03 11:51     ` Qu Wenruo
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Levy @ 2022-01-03 11:24 UTC (permalink / raw)
  To: linux-btrfs

> > an iSCSI target. The software managing the volumes shows that they
> > are
> > configured for certain features, which include "hardware-assisted
> > zeroing" and "space reclamation". Presumably the meaning of these
> > features, at least the former, is that a file system, with support
> > of
> > the kernel, may issue a SCSI command indicating that a region of a
> > block device would be cleared.
> 
> This looks pretty much like ATA TRIM or SCSI UNMAP command.
> 
> If they are the same, then btrfs supports it by either fstrim command
> (recommended) or discard mount option.

Thanks for the explanation. How does trimming work? Does the file
system maintain a register of blocks that have been cleared? Why is the
command not sent instantly, as soon as the space is freed by the file
system?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:08 "hardware-assisted zeroing" Eric Levy
  2022-01-03 11:17 ` Qu Wenruo
@ 2022-01-03 11:46 ` David Disseldorp
  2022-01-03 11:57   ` Qu Wenruo
  1 sibling, 1 reply; 21+ messages in thread
From: David Disseldorp @ 2022-01-03 11:46 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Mon, 03 Jan 2022 06:08:46 -0500, Eric Levy wrote:

> I am operating a Btrfs file system on logical volumes provided through
> an iSCSI target. The software managing the volumes shows that they are
> configured for certain features, which include "hardware-assisted
> zeroing" and "space reclamation". Presumably the meaning of these
> features, at least the former, is that a file system, with support of
> the kernel, may issue a SCSI command indicating that a region of a
> block device would be cleared. For a file system, such an operation has
> no direct value, because the contents of de-allocated space are
> irrelevant, but for a logical volume, it creates an opportunity to free
> space on the underlying file system on the back end.
> 
> I have searched the term "hardware-assisted zeroing", without finding
> any useful resources on the application of the term.

"hardware-assisted zeroing" is often marketing speak for the WRITE SAME
SCSI command, which is used by VMFS. I'm not aware of any Linux
filesystems which make use of it.
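
If you ever want to exercise WRITE SAME from userspace, sg3_utils ships a
sg_write_same utility; the device below is a placeholder, and the exact
options should be checked against sg_write_same(8) before pointing it at
anything holding data:

  sg_write_same --lba=0 --num=8 /dev/sdX
  # issues a WRITE SAME covering eight blocks starting at LBA 0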

As Qu mentioned, "space reclamation" would refer to UNMAP / discard.

Cheers, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:24   ` Eric Levy
@ 2022-01-03 11:51     ` Qu Wenruo
  2022-01-04 10:50       ` Eric Levy
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2022-01-03 11:51 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs



On 2022/1/3 19:24, Eric Levy wrote:
>>> an iSCSI target. The software managing the volumes shows that they
>>> are
>>> configured for certain features, which include "hardware-assisted
>>> zeroing" and "space reclamation". Presumably the meaning of these
>>> features, at least the former, is that a file system, with support
>>> of
>>> the kernel, may issue a SCSI command indicating that a region of a
>>> block device would be cleared.
>>
>> This looks pretty much like ATA TRIM or SCSI UNMAP command.
>>
>> If they are the same, then btrfs supports it by either fstrim command
>> (recommended) or discard mount option.
>
> Thanks for the explanation. How does trimming work?

The filesystem will call blkdev_issue_discard() to do the trimming,
which will in turn issue the corresponding command according to the driver.

For ATA devices, it will be an ATA TRIM command. For SCSI it will be a
SCSI UNMAP command, and for a loop device backed by a file, it will
punch holes (which are then handled by the filesystem holding the loop file).
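
You can see what each layer of a storage stack reports for discard support
with, for example:

  lsblk --discard
  # a non-zero DISC-GRAN/DISC-MAX on a row means that layer (disk, loop,
  # dm, ...) accepts discard requests, and at what granularity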

> Does the file
> system maintain a register of blocks that have been cleared?

The filesystem (normally) doesn't maintain such info; what a filesystem
really cares about is the unused/used space.

In the fstrim case, the filesystem will issue such a discard command for most
(if not all) unused space.

And one can call fstrim multiple times to do the same work again and
again; the filesystem won't really care
(even though the operation can be very time consuming).

The special thing in btrfs is that there is a cache recording which blocks
have been trimmed (only in memory, thus after unmount the cache is
lost and will need to be rebuilt on the next mount).

This is to reduce the trim workload with the recent async-discard optimization.

> Why is the
> command not sent instantly, as soon as the space is freed by the file
> system?

If you use the discard mount option, then most filesystems will send the
discard command to the underlying device when some space is freed.

But please keep in mind that how such a discard command gets handled is
hardware/storage-stack dependent.

Some disk firmware may choose to do discard synchronously, which can
hugely slow down other operations.
(That's why btrfs has the async-discard optimization, and also why fstrim
is preferred: to avoid unexpected slowdowns.)
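
For btrfs specifically, the mount option can select which behavior you get
(device and mountpoint are placeholders):

  mount -o discard=async /dev/sdX /mnt   # batched, throttled discards
  mount -o discard=sync /dev/sdX /mnt    # discard inline as space is freed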

Thanks,
Qu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:46 ` David Disseldorp
@ 2022-01-03 11:57   ` Qu Wenruo
  0 siblings, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2022-01-03 11:57 UTC (permalink / raw)
  To: David Disseldorp, Eric Levy; +Cc: linux-btrfs



On 2022/1/3 19:46, David Disseldorp wrote:
> On Mon, 03 Jan 2022 06:08:46 -0500, Eric Levy wrote:
>
>> I am operating a Btrfs file system on logical volumes provided through
>> an iSCSI target. The software managing the volumes shows that they are
>> configured for certain features, which include "hardware-assisted
>> zeroing" and "space reclamation". Presumably the meaning of these
>> features, at least the former, is that a file system, with support of
>> the kernel, may issue a SCSI command indicating that a region of a
>> block device would be cleared. For a file system, such an operation has
>> no direct value, because the contents of de-allocated space are
>> irrelevant, but for a logical volume, it creates an opportunity to free
>> space on the underlying file system on the back end.
>>
>> I have searched the term "hardware-assisted zeroing", without finding
>> any useful resources on the application of the term.
>
> "hardware-assisted zeroing" is often marketing speak for the WRITE SAME
> SCSI command, which is used by VMFS. I'm not aware of any Linux
> filesystems which make use of it.

Thanks for pointing this out; I'm really not familiar with the WRITE SAME
command.

After a quick search, it looks like it's kind of a hole punch for the
SCSI command set.

Then I guess that explains why Linux filesystems don't really make use
of it.

If we want a large zeroed file, both hole punching and fallocate will be
faster, and we don't need to issue any data IO.
Such ranges simply read back as zeroes. No IO is always faster than
any IO, even if it's "hardware-assisted".
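
For example, from userspace (file names and sizes are arbitrary):

  truncate -s 10G zeroed.img      # sparse file, reads back as zeroes
  fallocate --punch-hole --offset 0 --length 1G data.img
                                  # deallocate a range of an existing file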

Thanks,
Qu
>
> As Qu mentioned, "space reclamation" would refer to UNMAP / discard.
>
> Cheers, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-03 11:51     ` Qu Wenruo
@ 2022-01-04 10:50       ` Eric Levy
  2022-01-04 20:49         ` Zygo Blaxell
  2022-01-04 22:37         ` Qu Wenruo
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Levy @ 2022-01-04 10:50 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 2022-01-03 at 19:51 +0800, Qu Wenruo wrote:

> The filesystem (normally) doesn't maintain such info, what a
> filesystem
> really cares about is the unused/used space.
> 
> For fstrim case, the filesystem will issue such discard command to
> most
> (if not all) unused space.
> 
> And one can call fstrim multiple times to do the same work again and
> again, the filesystem won't really care.
> (even the operation can be very time consuming)
> 
> The special thing in btrfs is, there is a cache to record which
> blocks
> have been trimmed. (only in memory, thus after unmount, such cache is
> lost, and on next mount will need to be rebuilt)
> 
> This is to reduce the trim workload with recent async-discard
> optimization.

So in the general case (i.e. no session cache), the trim operation
scans all the allocation structures, to process all non-allocated
space?

> > Why is the
> > command not sent instantly, as soon as the space is freed by the
> > file
> > system?
> 
> If you use discard mount option, then most filesystems will send the
> discard command to the underlying device when some space is freed.
> 
> But please keep in mind that, how such discard command gets handled
> is
> hardware/storage stack dependent.
> 
> Some disk firmware may choose to do discard synchronously, which can
> hugely slow down other operations.
> (That's why btrfs has async-discard optimization, and also why fstrim
> is
> preferred, to avoid unexpected slow down).

Yes, but of course by "instantly" I meant not necessarily
synchronously, but simply close in time.

The trim operation is not avoiding bottlenecks, even if it is non-
blocking, because it operates at the level of the entire file system,
in a single operation. If Btrfs is able to process discard operations
asynchronously, then mounting with the discard option seems preferable,
as it requires no redundant work, adds no serious delay until the
calls are made, and depends on no activity (not even automatic
activity) from the admin.

I fail to see a reason for preferring trim over discard.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-04 10:50       ` Eric Levy
@ 2022-01-04 20:49         ` Zygo Blaxell
  2022-01-04 22:37           ` Eric Levy
  2022-01-04 22:37         ` Qu Wenruo
  1 sibling, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-04 20:49 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Tue, Jan 04, 2022 at 05:50:47AM -0500, Eric Levy wrote:
> On Mon, 2022-01-03 at 19:51 +0800, Qu Wenruo wrote:
> 
> > The filesystem (normally) doesn't maintain such info, what a
> > filesystem
> > really cares about is the unused/used space.
> > 
> > For fstrim case, the filesystem will issue such discard command to
> > most
> > (if not all) unused space.
> > 
> > And one can call fstrim multiple times to do the same work again and
> > again, the filesystem won't really care.
> > (even the operation can be very time consuming)
> > 
> > The special thing in btrfs is, there is a cache to record which
> > blocks
> > have been trimmed. (only in memory, thus after unmount, such cache is
> > lost, and on next mount will need to be rebuilt)
> > 
> > This is to reduce the trim workload with recent async-discard
> > optimization.
> 
> So in the general case (i.e. no session cache), the trim operation
> scans all the allocation structures, to process all non-allocated
> space?
> 
> > > Why is the
> > > command not sent instantly, as soon as the space is freed by the
> > > file
> > > system?
> > 
> > If you use discard mount option, then most filesystems will send the
> > discard command to the underlying device when some space is freed.
> > 
> > But please keep in mind that, how such discard command gets handled
> > is
> > hardware/storage stack dependent.
> > 
> > Some disk firmware may choose to do discard synchronously, which can
> > hugely slow down other operations.
> > (That's why btrfs has async-discard optimization, and also why fstrim
> > is
> > preferred, to avoid unexpected slow down).
> 
> Yes, but of course as I have used "instantly", I meant, not necessarily
> synchronously, but simply near in time.
> 
> The trim operation is not avoiding bottlenecks, even if it is non-
> blocking, because it operates at the level of the entire file system,
> in a single operation. If Btrfs is able to process discard operations
> asynchronously, then mounting with the discard option seems preferable,
> as it requires no redundant work, adds no serious delay until the
> calls are made, and depends on no activity (not even automatic
> activity) from the admin.
> 
> I fail to see a reason for preferring trim over discard.

Discard isn't free, and it can cost more than it gives back.

The gain from discard can be close to zero if you're overwriting the
same blocks over and over again (e.g. as btrfs does in metadata pages).
The SSD will keep recycling the blocks without hints from outside to help.
It depends on workload, but in many use cases 60-90% of the discards
btrfs will issue with the mount option are not necessary.  For the rest,
an fstrim every few days is sufficient.

The cost of discard can be significantly higher than zero.  Discard
requires time on the bus to send the trim command, which is a significant
hit for SATA (about the same as a short flushed write).  Popular drive
firmwares can't queue the discard command, which is a significant hit for
IO latency as the IO queue has to be brought to a full stop, the discard
command has to be sent and run, and the IO queue has to be started back
up again.  Before the 'discard=async' option was implemented, 'discard'
was unusably slow on many SSD models, some of them popular.

Cheap SSD devices wear out faster when issued a lot of discards mixed
with small writes, as they lack the specialized hardware and firmware
necessary to make discards low-wear operations.  The same flash component
is used for both FTL persistence (where discards cause wear) and user
data (where writes cause wear), so interleaved short writes and discards
cause double the wear compared to the same short writes without discards.
The fstrim man page advises not running trim more than once a week to
avoid prematurely aging SSDs in this category, while the discard mount
option is equivalent to running fstrim 2000-3000 times a day.
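
On distributions that ship util-linux's fstrim.timer, that kind of periodic
trim is already packaged as a weekly systemd timer:

  systemctl enable --now fstrim.timer
  systemctl list-timers fstrim.timer   # shows the next scheduled run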

Discard has other side-effects in btrfs as well.  While a block group
is undergoing discard, it cannot be modified, which will force btrfs
to spread allocations out across more of the logical address space.
That can cause performance issues with fragmentation later on (more CPU
usage, more metadata fan-out for extents of the same file).  The discard
mount option can affect performance benchmarks in either direction, even
if the underlying storage is RAM that doesn't implement discard at all.

You'll need to benchmark this with your hardware and your workload to
find out if trim is better than discard or the other way around for you,
but don't be surprised by either result.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-04 20:49         ` Zygo Blaxell
@ 2022-01-04 22:37           ` Eric Levy
  2022-01-04 22:46             ` Qu Wenruo
  2022-01-05  1:32             ` Zygo Blaxell
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Levy @ 2022-01-04 22:37 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 2022-01-04 at 15:49 -0500, Zygo Blaxell wrote:

> Cheap SSD devices wear out faster when issued a lot of discards mixed
> with small writes, as they lack the specialized hardware and firmware
> necessary to make discards low-wear operations.  The same flash
> component
> is used for both FTL persistence (where discards cause wear) and user
> data (where writes cause wear), so interleaved short writes and
> discards
> cause double the wear compared to the same short writes without
> discards.
> The fstrim man page advises not running trim more than once a week to
> avoid prematurely aging SSDs in this category, while the discard
> mount
> option is equivalent to running fstrim 2000-3000 times a day.

It seems much of the discussion relates to the design of physical
hardware. I would need to learn more about SSDs to understand why the
operations are useful on them, as my thought had been that they would
be helpful for thin-provisioned logical volumes, but not meaningful on
physical devices.

I wonder whether the same set of concerns as those raised, or a
different one, would be most relevant when considering management of
non-physical devices.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-04 10:50       ` Eric Levy
  2022-01-04 20:49         ` Zygo Blaxell
@ 2022-01-04 22:37         ` Qu Wenruo
  1 sibling, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2022-01-04 22:37 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs



On 2022/1/4 18:50, Eric Levy wrote:
> On Mon, 2022-01-03 at 19:51 +0800, Qu Wenruo wrote:
>
>> The filesystem (normally) doesn't maintain such info, what a
>> filesystem
>> really cares about is the unused/used space.
>>
>> For fstrim case, the filesystem will issue such discard command to
>> most
>> (if not all) unused space.
>>
>> And one can call fstrim multiple times to do the same work again and
>> again, the filesystem won't really care.
>> (even the operation can be very time consuming)
>>
>> The special thing in btrfs is, there is a cache to record which
>> blocks
>> have been trimmed. (only in memory, thus after unmount, such cache is
>> lost, and on next mount will need to be rebuilt)
>>
>> This is to reduce the trim workload with recent async-discard
>> optimization.
>
> So in the general case (i.e. no session cache), the trim operation
> scans all the allocation structures, to process all non-allocated
> space?

Yes, and that's the case for almost all filesystems supporting trim.

All read-write filesystems need to maintain such info anyway.
IIRC, for filesystems like ext4 there is a bitmap storing which sectors
are used and which are not.

Btrfs has a more complex structure (the extent tree), recording not only
which range is used, but also which tree is using it.
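
You can inspect that structure directly on an unmounted (or otherwise
quiescent) filesystem; the device name below is a placeholder:

  btrfs inspect-internal dump-tree -t extent /dev/sdX
  # dumps the extent tree, i.e. which ranges are allocated and by what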

>
>>> Why is the
>>> command not sent instantly, as soon as the space is freed by the
>>> file
>>> system?
>>
>> If you use discard mount option, then most filesystems will send the
>> discard command to the underlying device when some space is freed.
>>
>> But please keep in mind that, how such discard command gets handled
>> is
>> hardware/storage stack dependent.
>>
>> Some disk firmware may choose to do discard synchronously, which can
>> hugely slow down other operations.
>> (That's why btrfs has async-discard optimization, and also why fstrim
>> is
>> preferred, to avoid unexpected slow down).
>
> Yes, but of course as I have used "instantly", I meant, not necessarily
> synchronously, but simply near in time.
>
> The trim operation is not avoiding bottlenecks, even if it is non-
> blocking, because it operates at the level of the entire file system,
> in a single operation. If Btrfs is able to process discard operations
> asynchronously, then mounting with the discard option seems preferable,
> as it requires no redundant work, adds no serious delay until the
> calls are made, and depends on no activity (not even automatic
> activity) from the admin.

IIRC this async discard is currently specific to btrfs, thus it's
not really generic.

Another thing is, just as Zygo said, there is not much benefit from
discarding frequently used/freed metadata.

But the overhead of the discard mount option is always there, which is
why we don't recommend it even though we have the async-discard
behavior.

Thanks,
Qu

>
> I fail to see a reason for preferring trim over discard.
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-04 22:37           ` Eric Levy
@ 2022-01-04 22:46             ` Qu Wenruo
  2022-01-05  0:38               ` Paul Jones
  2022-01-05  1:32             ` Zygo Blaxell
  1 sibling, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2022-01-04 22:46 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs



On 2022/1/5 06:37, Eric Levy wrote:
> On Tue, 2022-01-04 at 15:49 -0500, Zygo Blaxell wrote:
>
>> Cheap SSD devices wear out faster when issued a lot of discards mixed
>> with small writes, as they lack the specialized hardware and firmware
>> necessary to make discards low-wear operations.  The same flash
>> component
>> is used for both FTL persistence (where discards cause wear) and user
>> data (where writes cause wear), so interleaved short writes and
>> discards
>> cause double the wear compared to the same short writes without
>> discards.
>> The fstrim man page advises not running trim more than once a week to
>> avoid prematurely aging SSDs in this category, while the discard
>> mount
>> option is equivalent to running fstrim 2000-3000 times a day.
>
> It seems much of the discussion relates to the design of physical
> hardware. I would need to learn more about SSDs to understand why the
> operations are useful on them, as my thought had been that they would
> be helpful for thin-provisioned logical volumes, but not meaningful on
> physical devices.

I'm not an expert in this area, but IMHO the trim for SSDs is to help
even out the wear, since the NAND used in most (if not all) SSDs has a
write lifespan limit.

This is a little different from a thin-provisioned device.

>
> I wonder whether the same or a different set of concerns from the ones
> raised would be most helpful when considering management of non-
> physical devices.
>

For thin-provisioned devices, it's a pretty different story.

If the thin-provisioned device is just file backed, then trim brings
little to no performance penalty.

Such a trim command will just be converted to a hole punch on the backing
filesystem, and even on filesystems like btrfs, which have rather slow
metadata operations, it's still super fast.

So in that case, you don't really need to worry about the performance
penalty.
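
A rough illustration with a loop device (paths, sizes and the loop device
name are arbitrary):

  truncate -s 20G backing.img
  losetup -f --show backing.img   # prints the loop device, e.g. /dev/loop0
  mkfs.btrfs /dev/loop0
  mount /dev/loop0 /mnt
  # ... write and then delete some files ...
  fstrim -v /mnt
  du -h backing.img               # allocated size of the backing file shrinks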

But please keep in mind that, even for heavily stacked storage, if the
final physical layer is still an SSD, the trim/discard discussion above
still applies: it's still recommended to use periodic fstrim over the
discard mount option.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: "hardware-assisted zeroing"
  2022-01-04 22:46             ` Qu Wenruo
@ 2022-01-05  0:38               ` Paul Jones
  2022-01-05  0:44                 ` Eric Levy
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Jones @ 2022-01-05  0:38 UTC (permalink / raw)
  To: Qu Wenruo, Eric Levy, linux-btrfs

> -----Original Message-----
> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Sent: Wednesday, 5 January 2022 9:47 AM
> To: Eric Levy <contact@ericlevy.name>; linux-btrfs@vger.kernel.org
> Subject: Re: "hardware-assisted zeroing"
> 
> 
> 
> On 2022/1/5 06:37, Eric Levy wrote:
> > On Tue, 2022-01-04 at 15:49 -0500, Zygo Blaxell wrote:
> >
> >> Cheap SSD devices wear out faster when issued a lot of discards mixed
> >> with small writes, as they lack the specialized hardware and firmware
> >> necessary to make discards low-wear operations.  The same flash
> >> component is used for both FTL persistence (where discards cause
> >> wear) and user data (where writes cause wear), so interleaved short
> >> writes and discards cause double the wear compared to the same short
> >> writes without discards.
> >> The fstrim man page advises not running trim more than once a week to
> >> avoid prematurely aging SSDs in this category, while the discard
> >> mount option is equivalent to running fstrim 2000-3000 times a day.
> >
> > It seems much of the discussion relates to the design of physical
> > hardware. I would need to learn more about SSDs to understand why the
> > operations are useful on them, as my thought had been that they would
> > be helpful for thin-provisioned logical volumes, but not meaningful on
> > physical devices.
> 
> I'm not an expert in this area, but IMHO the trim for SSD is to average the
> wear, since NAND used in most (if not all) SSD has a write lifespan limit.

It's also needed to keep throughput high on near-full drives, as flash can't write at anywhere near the rated speed of the drive. If there are not enough free blocks to dump incoming data then the drive needs to stop and wait for in-progress data to finish writing/erasing before processing the next command.

One particular server I manage is Linux running on pass-through disks on Hyper-V, so there is no way to send a trim command to the actual physical disks. To overcome this I have a 20G partition on each disk which I trim once and then don't use. This acts as a buffer to keep the SSDs healthy and fast. Once every few months during a maintenance window I boot the server bare metal on a rescue disk and perform a trim from there. The write workload isn't huge so this seems to work well.
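
From the rescue environment the trim itself is just something like the
following (device names are placeholders):

  fstrim -av              # trim every mounted filesystem that supports it
  blkdiscard /dev/sdX3    # or discard the unused buffer partition directly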


Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  0:38               ` Paul Jones
@ 2022-01-05  0:44                 ` Eric Levy
  2022-01-05  1:12                   ` Paul Jones
  2022-01-05  1:21                   ` Zygo Blaxell
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Levy @ 2022-01-05  0:44 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 2022-01-05 at 00:38 +0000, Paul Jones wrote:

> It's also needed to keep throughput high on near full drives, as
> flash can't write at anywhere near the rated speed of the drive. If
> there are not enough free blocks to dump incoming data then the drive
> needs to stop and wait for in-progress data to finish writing/erasing
> before processing the next command.

Isn't the address of a free block, for writing new data, resolved by
the file system, based on the allocation data it maintains, not by the
hardware?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: "hardware-assisted zeroing"
  2022-01-05  0:44                 ` Eric Levy
@ 2022-01-05  1:12                   ` Paul Jones
  2022-01-05  1:20                     ` Eric Levy
  2022-01-05  1:21                   ` Zygo Blaxell
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Jones @ 2022-01-05  1:12 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs

> -----Original Message-----
> From: Eric Levy <contact@ericlevy.name>
> Sent: Wednesday, 5 January 2022 11:44 AM
> To: linux-btrfs@vger.kernel.org
> Subject: Re: "hardware-assisted zeroing"
> 
> On Wed, 2022-01-05 at 00:38 +0000, Paul Jones wrote:
> 
> > It's also needed to keep throughput high on near full drives, as flash
> > can't write at anywhere near the rated speed of the drive. If there are
> > not enough free blocks to dump incoming data then the drive needs to
> > stop and wait for in-progress data to finish writing/erasing before
> > processing the next command.
> 
> Isn't the address of a free block, for writing new data, resolved by the file
> system, based on the allocation data it maintains, not by the hardware?

Yes, but in an SSD it will get remapped (in hardware) as part of the wear-leveling algorithm; otherwise the front part would wear faster than the rear, assuming free space is allocated from the front.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  1:12                   ` Paul Jones
@ 2022-01-05  1:20                     ` Eric Levy
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Levy @ 2022-01-05  1:20 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 2022-01-05 at 01:12 +0000, Paul Jones wrote:

> > Isn't the address of a free block, for writing new data, resolved
> > by the file
> > system, based on the allocation data it maintains, not by the
> > hardware?

I see. I think the thin-provisioning solutions operate under a similar
design, at least the lower-performance ones that provision volumes as
regular files on top of a regular base file system.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  0:44                 ` Eric Levy
  2022-01-05  1:12                   ` Paul Jones
@ 2022-01-05  1:21                   ` Zygo Blaxell
  2022-01-05  1:26                     ` Eric Levy
  1 sibling, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-05  1:21 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Tue, Jan 04, 2022 at 07:44:21PM -0500, Eric Levy wrote:
> On Wed, 2022-01-05 at 00:38 +0000, Paul Jones wrote:
> 
> > It's also needed to keep throughput high on near full drives, as
> > flash can't write at anywhere near the rated speed of the drive. If
> > there are not enough free blocks to dump incoming data then the drive
> > needs to stop and wait for in-progress data to finish writing/erasing
> > before processing the next command.
> 
> Isn't the address of a free block, for writing new data, resolved by
> the file system, based on the allocation data it maintains, not by the
> hardware?

No SSD works this way.

You say "hardware", but a SSD is an embedded computer running a minimalist
filesystem in its firmware (only one file, the nominal size of the
entire drive).  SSDs can't directly map LBA addresses to physical media,
so they need to implement data placement algorithms that have noticeable
side-effects on performance.

ZNS SSD devices do address mapping entirely in reverse--in a write
command, the host says "append this block to zone Z", the drive chooses
a block address for the data within that zone, and sends the written
block address back to the host filesystem as part of the command reply.
This allows the drive to implement writes in parallel (so they are
subject to reordering) without having to store where it put user data
in the SSD's own memory.
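
Zone layout and write pointers are visible from userspace on any zoned
block device, for example (device name is a placeholder):

  blkzone report /dev/nvme0n1
  # lists each zone's start, length, write pointer and condition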

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  1:21                   ` Zygo Blaxell
@ 2022-01-05  1:26                     ` Eric Levy
  2022-01-05  1:33                       ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Levy @ 2022-01-05  1:26 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 2022-01-04 at 20:21 -0500, Zygo Blaxell wrote:

> ZNS SSD devices do address mapping entirely in reverse--in a write
> command, the host says "append this block to zone Z", the drive
> chooses
> a block address for the data within that zone, and sends the written
> block address back to the host filesystem as part of the command
> reply.
> This allows the drive to implement writes in parallel (so they are
> subject to reordering) without having to store where it put user data
> in the SSD's own memory.

How does the host remember the mapping, and where does it get applied
during followup access?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-04 22:37           ` Eric Levy
  2022-01-04 22:46             ` Qu Wenruo
@ 2022-01-05  1:32             ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-05  1:32 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Tue, Jan 04, 2022 at 05:37:05PM -0500, Eric Levy wrote:
> On Tue, 2022-01-04 at 15:49 -0500, Zygo Blaxell wrote:
> 
> > Cheap SSD devices wear out faster when issued a lot of discards mixed
> > with small writes, as they lack the specialized hardware and firmware
> > necessary to make discards low-wear operations.  The same flash
> > component
> > is used for both FTL persistence (where discards cause wear) and user
> > data (where writes cause wear), so interleaved short writes and
> > discards
> > cause double the wear compared to the same short writes without
> > discards.
> > The fstrim man page advises not running trim more than once a week to
> > avoid prematurely aging SSDs in this category, while the discard
> > mount
> > option is equivalent to running fstrim 2000-3000 times a day.
> 
> It seems much of the discussion relates to the design of physical
> hardware. I would need to learn more about SSDs to understand why the
> operations are useful on them, as my thought had been that they would
> be helpful for thin-provisioned logical volumes, but not meaningful on
> physical devices.
> 
> I wonder whether the same or a different set of concerns from the ones
> raised would be most helpful when considering management of non-
> physical devices.

You'll still have the locked block groups with the discard mount option,
whether those are good or bad for your workload.

There are two main categories of trim command:  one that guarantees a
particular data value when reading from previously trimmed blocks, and
one that makes no such guarantee (i.e. it may leave the data unchanged,
filled with garbage, or any other contents).

The first kind is usually equivalent to at least one page write,
because it can't be reordered or dropped, and opportunities to merge
are strictly limited, but it must be persisted.

The second kind is much faster since no persistent write is required
to implement the trim itself.  The thin volume can merge the trim with
later writes that update persistent data, or persist the trim in a
background thread without blocking data writes.

This distinction holds for SSDs and also thin volumes (or caching volumes,
or in general any software, including drive firmware, that exists below
the filesystem layer).  The lower layer also usually controls which set
of trim semantics are available to the upper layer.
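
From userspace the two flavors map roughly onto, for example (the device is
a placeholder, and the second is a zero-out rather than a trim proper):

  blkdiscard /dev/sdX      # plain discard: the range's contents are undefined
  blkdiscard -z /dev/sdX   # zero-out: the range is guaranteed to read back as zeroes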

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  1:26                     ` Eric Levy
@ 2022-01-05  1:33                       ` Zygo Blaxell
  2022-01-05  1:37                         ` Eric Levy
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-05  1:33 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Tue, Jan 04, 2022 at 08:26:41PM -0500, Eric Levy wrote:
> On Tue, 2022-01-04 at 20:21 -0500, Zygo Blaxell wrote:
> 
> > ZNS SSD devices do address mapping entirely in reverse--in a write
> > command, the host says "append this block to zone Z", the drive
> > chooses
> > a block address for the data within that zone, and sends the written
> > block address back to the host filesystem as part of the command
> > reply.
> > This allows the drive to implement writes in parallel (so they are
> > subject to reordering) without having to store where it put user data
> > in the SSD's own memory.
> 
> How does the host remember the mapping, and where does it get applied
> during followup access?

That's up to the host filesystem implementation.  ZNS devices require
filesystems that speak ZNS protocol.  They don't implement a traditional
LBA-oriented interface (or if they do, they provide a separate logical
device interface for that).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  1:33                       ` Zygo Blaxell
@ 2022-01-05  1:37                         ` Eric Levy
  2022-01-05  2:20                           ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Levy @ 2022-01-05  1:37 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 2022-01-04 at 20:33 -0500, Zygo Blaxell wrote:

> That's up to the host filesystem implementation.  ZNS devices require
> filesystems that speak ZNS protocol.  They don't implement a
> traditional
> LBA-oriented interface (or if they do, they provide a separate
> logical
> device interface for that).

The entire file system must fit on one device, even the allocation
data. How would the host find the allocation information, if its
location has been remapped?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "hardware-assisted zeroing"
  2022-01-05  1:37                         ` Eric Levy
@ 2022-01-05  2:20                           ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-05  2:20 UTC (permalink / raw)
  To: Eric Levy; +Cc: linux-btrfs

On Tue, Jan 04, 2022 at 08:37:54PM -0500, Eric Levy wrote:
> On Tue, 2022-01-04 at 20:33 -0500, Zygo Blaxell wrote:
> 
> > That's up to the host filesystem implementation.  ZNS devices require
> > filesystems that speak ZNS protocol.  They don't implement a
> > traditional
> > LBA-oriented interface (or if they do, they provide a separate
> > logical
> > device interface for that).
> 
> The entire file system must fit on one device, even the allocation
> data. How would the host find the allocation information, if its
> location has been remapped?

For ZBD devices, a linear read of a superblock log zone can provide the
root pointer for the filesystem.  The rest of the trees arise from that
root.  ZNS filesystems can do something similar.

The location of the data is not entirely unknown.  It is written somewhere
within the zone designated by the filesystem--only the bottom N bits
are filled in by the device.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2022-01-03 11:08 "hardware-assisted zeroing" Eric Levy
2022-01-03 11:17 ` Qu Wenruo
2022-01-03 11:24   ` Eric Levy
2022-01-03 11:51     ` Qu Wenruo
2022-01-04 10:50       ` Eric Levy
2022-01-04 20:49         ` Zygo Blaxell
2022-01-04 22:37           ` Eric Levy
2022-01-04 22:46             ` Qu Wenruo
2022-01-05  0:38               ` Paul Jones
2022-01-05  0:44                 ` Eric Levy
2022-01-05  1:12                   ` Paul Jones
2022-01-05  1:20                     ` Eric Levy
2022-01-05  1:21                   ` Zygo Blaxell
2022-01-05  1:26                     ` Eric Levy
2022-01-05  1:33                       ` Zygo Blaxell
2022-01-05  1:37                         ` Eric Levy
2022-01-05  2:20                           ` Zygo Blaxell
2022-01-05  1:32             ` Zygo Blaxell
2022-01-04 22:37         ` Qu Wenruo
2022-01-03 11:46 ` David Disseldorp
2022-01-03 11:57   ` Qu Wenruo
