All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
@ 2014-02-14  4:32 Theodore Ts'o
  2014-02-14 13:05 ` Christoph Hellwig
  2014-02-14 17:14 ` Martin K. Petersen
  0 siblings, 2 replies; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-14  4:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jens Axboe

How do people think I should implement this functionality?  I see
three possible choices:

1) The patch attached below

2) Add this functionality to blk-lib.c, but with a new function name,
   such as blkdev_zero_blocks(), and keep blkdev_issue_zeroout()
   unchanged

3) This functionality shouldn't be in the block device layer; if you
   want something like this, add it to fs/ext4 instead, and other file
   systems can optimize sb_issue_zeroout() themselves if they want.

And if the answer is (1) or (2), do people mind if I carry this patch
in the ext4 tree, so I can use and test this right away, without
having to worry about merging with the block tree?

Thanks,

					- Ted


commit 1af2359e6ffca707c41ed40618773abe944d0c54
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Feb 13 23:27:27 2014 -0500

    block: use discard if possible in blkdev_issue_zeroout()
    
    If the block device supports discards and guarantees that subsequent
    reads will return zeros (sometimes known as DZAT, for Deterministic
    read Zeros After Trim), use this to implement blkdev_issue_zeroout()
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2da76c9..62cbf28 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -269,6 +269,32 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 
+static int issue_zeroout_or_write_same(struct block_device *bdev,
+				       sector_t sector,
+				       sector_t nr_sects, gfp_t gfp_mask)
+{
+	if (bdev_write_same(bdev)) {
+		unsigned char bdn[BDEVNAME_SIZE];
+
+		if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
+					     ZERO_PAGE(0)))
+			return 0;
+
+		bdevname(bdev, bdn);
+		pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
+	}
+
+	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
+}
+
+/*
+ * Like sector_div except don't modify s.
+ */
+static unsigned int sector_mod(sector_t s, unsigned int m)
+{
+	return sector_div(s, m);
+}
+
 /**
  * blkdev_issue_zeroout - zero-fill a block range
  * @bdev:	blockdev to write
@@ -277,23 +303,49 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
  * @gfp_mask:	memory allocation flags (for bio_alloc)
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Issues bios which zeros the requested block range.
  */
-
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			 sector_t nr_sects, gfp_t gfp_mask)
 {
-	if (bdev_write_same(bdev)) {
-		unsigned char bdn[BDEVNAME_SIZE];
+	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned int alignment, granularity;
+	unsigned int c;
+	int ret;
 
-		if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-					     ZERO_PAGE(0)))
-			return 0;
+	if (!q)
+		return -ENXIO;
 
-		bdevname(bdev, bdn);
-		pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
+	if (!blk_queue_discard(q) || !queue_discard_zeroes_data(q) ||
+	    q->limits.discard_misaligned)
+		return issue_zeroout_or_write_same(bdev, sector,
+
+						   nr_sects, gfp_mask);
+
+	alignment = q->limits.discard_alignment >> 9;
+	granularity = q->limits.discard_granularity >> 9;
+
+	c = sector_mod(granularity + alignment - sector, granularity);
+	if (c > nr_sects)
+		c = nr_sects;
+
+	if (c) {
+		int ret = issue_zeroout_or_write_same(bdev, sector,
+						      c, gfp_mask);
+		if (ret)
+			return ret;
+		nr_sects -= c;
 	}
+	if (nr_sects == 0)
+		return 0;
 
-	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
+	c = sector_mod(nr_sects, granularity);
+
+	ret = blkdev_issue_discard(bdev, sector, nr_sects - c, gfp_mask, 0);
+	if (ret || c == 0)
+		return ret;
+
+	return issue_zeroout_or_write_same(bdev, sector + nr_sects - c, c,
+					   gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-14  4:32 [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
@ 2014-02-14 13:05 ` Christoph Hellwig
  2014-02-14 14:57   ` Theodore Ts'o
  2014-02-14 17:14 ` Martin K. Petersen
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2014-02-14 13:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel, Jens Axboe

On Thu, Feb 13, 2014 at 11:32:56PM -0500, Theodore Ts'o wrote:
> 3) This functionality shouldn't be in the block device layer; if you
>    want something like this, add it to fs/ext4 instead, and other file
>    systems can optimize sb_issue_zeroout() themselves if they want.
> 
> And if the answer is (1) or (2), do people mind if I carry this patch
> in the ext4 tree, so I can use and test this right away, without
> having to worry about merging with the block tree?

 (4): add a new flag to blkdev_issue_zeroout to say if deallocating the
      blocks is okay, and if yes proceed like (1).

Requiring blocks to be zeroed, but not wanting to remove the
provisioning seems like a perfectly valid request, especially from
userspace (e.g. databases)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-14 13:05 ` Christoph Hellwig
@ 2014-02-14 14:57   ` Theodore Ts'o
  0 siblings, 0 replies; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-14 14:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, Jens Axboe

On Fri, Feb 14, 2014 at 05:05:13AM -0800, Christoph Hellwig wrote:
>  (4): add a new flag to blkdev_issue_zeroout to say if deallocating the
>       blocks is okay, and if yes proceed like (1).
> 
> Requiring blocks to be zeroed, but not wanting to remove the
> provisioning seems like a perfectly valid request, especially from
> userspace (e.g. databases)

That seems reasonable.  Should the default be to try to use discard if
it is available, or to keep the current behavior?  I'm leaning towards
defining BLKDEV_ZEROOUT_DISCARD, and changing all of the existing
calls to blkdev_issue_zeroout() to pass in 0 for the flags.  This
implies that if any file system does want to use discard for things
like fallocate(), they have to explicitly request it.

Which brings up a few additional questions, for which I'd appreciate
suggestions/opinons: 

Should we try to have some kind of informal naming scheme for a mount
option so that file systems can explicitly request this behavior,
i.e. "mount -o zeroout_discard"?  Of course, if we do this, very few
people will probably end up using it.

So should we try to set up some standard hueristics so that for
certain class of devices (such as flash) we default to using zeroout
_discard, and for certain other classes of devices (maybe thinp) we
don't?  Perhaps we should make it be tunable based on the number of
blocks that are being discarded, i.e., "mount -o zeroout_discard=128k"
means to use the zeroout_discard flag if we are trying to zeroout more
than 128k?

Also, is it worth creating a new ioctl, BLKZZEROOUT, or perhaps
BLKZEROOUTF, which takes a flags argument, so we can expose this to
userspace?  I'm not sure it is worth it, since at least for e2fsprogs,
I won't want to depend on the user using a new enough kernel to
support the new ioctl, and while it is a bit painful to query the
/sys/block/sdXX/queue/discard_* files, I'm doing it already for other
reasons.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-14  4:32 [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
  2014-02-14 13:05 ` Christoph Hellwig
@ 2014-02-14 17:14 ` Martin K. Petersen
  2014-02-15  1:29   ` Theodore Ts'o
  2014-02-17 16:41   ` Lukáš Czerner
  1 sibling, 2 replies; 13+ messages in thread
From: Martin K. Petersen @ 2014-02-14 17:14 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel, Jens Axboe

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

[issue_zeroout_or_write_same]

Ted> How do people think I should implement this functionality?  I see
Ted> three possible choices:

I did think about doing this when I originally implemented support for
WRITE SAME. However, one caveat is that there are corner cases where
devices that -- despite reporting that they return zeroed data after
TRIM -- will return non-zeroes. The issue being that TRIM is a hint and
there are no hard guarantees. Even if a device reports DRAT/RZAT.

So the reason I didn't end up adding a call like yours is that the
result is non-deterministic. The call would have to be "please zero this
block range but I don't *actually* rely on getting zeroes back". I don't
have a problem providing a function that does that as long as the
best-effort limitation is made crystal clear.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-14 17:14 ` Martin K. Petersen
@ 2014-02-15  1:29   ` Theodore Ts'o
  2014-02-17 16:44     ` Martin K. Petersen
  2014-02-17 16:41   ` Lukáš Czerner
  1 sibling, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-15  1:29 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe

On Fri, Feb 14, 2014 at 12:14:42PM -0500, Martin K. Petersen wrote:
> >>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:
> 
> Ted,
> 
> [issue_zeroout_or_write_same]
> 
> Ted> How do people think I should implement this functionality?  I see
> Ted> three possible choices:
> 
> I did think about doing this when I originally implemented support for
> WRITE SAME. However, one caveat is that there are corner cases where
> devices that -- despite reporting that they return zeroed data after
> TRIM -- will return non-zeroes. The issue being that TRIM is a hint and
> there are no hard guarantees. Even if a device reports DRAT/RZAT.

So is this the same as how some devices will turn into bricks if you
send trim commands too quickly --- i.e., they are buggy, crappy
devices?

Or does the T10/T13 spec for DRAT/RZAT really say: "determinisitc:
--- you keep using that word.  I do not think it means what you think
it means....."?

Basically, who was practicing engineering malpractice?  The SSD
vendors, or the T10/T13 spec authors?

If this is a case that there is just a bunch of crap SSD's out there,
then maybe we should still do this, but just not enable it by default,
and force users to manually configure mount options or fstrim if they
think they have devices that are competently implemented?

      	   		     	 	     - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-14 17:14 ` Martin K. Petersen
  2014-02-15  1:29   ` Theodore Ts'o
@ 2014-02-17 16:41   ` Lukáš Czerner
  1 sibling, 0 replies; 13+ messages in thread
From: Lukáš Czerner @ 2014-02-17 16:41 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Theodore Ts'o, linux-fsdevel, Jens Axboe

On Fri, 14 Feb 2014, Martin K. Petersen wrote:

> Date: Fri, 14 Feb 2014 12:14:42 -0500
> From: Martin K. Petersen <martin.petersen@oracle.com>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>
> Subject: Re: [PATCH RFC] block: use discard if possible in
>     blkdev_issue_discard()
> 
> >>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:
> 
> Ted,
> 
> [issue_zeroout_or_write_same]
> 
> Ted> How do people think I should implement this functionality?  I see
> Ted> three possible choices:
> 
> I did think about doing this when I originally implemented support for
> WRITE SAME. However, one caveat is that there are corner cases where
> devices that -- despite reporting that they return zeroed data after
> TRIM -- will return non-zeroes. The issue being that TRIM is a hint and
> there are no hard guarantees. Even if a device reports DRAT/RZAT.

That was my experience as well back when I did discard testing. The
situation however was even worse than that because I've seen the
content of discarded block to change over time - it still contained
data right after the discard, however after a while it would read
zeros.

While it's true that this was couple of years ago and those devices
I've seen this on had probably faulty firmware but, even today I would
not feel comfortable enabling this by default.

Thanks!
-Lukas

> 
> So the reason I didn't end up adding a call like yours is that the
> result is non-deterministic. The call would have to be "please zero this
> block range but I don't *actually* rely on getting zeroes back". I don't
> have a problem providing a function that does that as long as the
> best-effort limitation is made crystal clear.
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-15  1:29   ` Theodore Ts'o
@ 2014-02-17 16:44     ` Martin K. Petersen
  2014-02-17 19:19       ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Martin K. Petersen @ 2014-02-17 16:44 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

>> The issue being that TRIM is a hint and there are no hard
>> guarantees. Even if a device reports DRAT/RZAT.

Ted> So is this the same as how some devices will turn into bricks if
Ted> you send trim commands too quickly --- i.e., they are buggy, crappy
Ted> devices?

Well, it's actually per spec. Even if a device reports successful
completion on a DSM TRIM command it is not required to actually do
anything because TRIM is a hint.

The DRAT/RZAT flags indicate what the expected results are if a device
decides to honor the request (or parts of it). Some devices will report
zeroes only for blocks that are aligned to their internal allocation
units. Whereas misaligned heads/tails of the TRIM request will contain
old data, zeroes or garbage.

Early SSDs would drop TRIMs under load. I think we've now moved to a
world where TRIMs are mostly dropped when the FTL is in error
recovery. But we have no insight into internal FTL state.

Some RAID controller vendors explicitly whitelist drive models that do
the right thing in their firmware to overcome this. Others rely on WRITE
SAME to ensure that you don't get parity mismatches for RAID5/6.

Ted> Basically, who was practicing engineering malpractice?  The SSD
Ted> vendors, or the T10/T13 spec authors?

I think it's important to emphasize that T10/T13 specs are mainly
written by device vendors. And they have a very strong objection to
complicating the device firmware, keeping internal state, etc. So the
outcome is very rarely in the operating system's favor. I completely
agree that these flags are broken by definition.

The only discard approach that provides a guaranteed result is WRITE
SAME with the UNMAP bit set (i.e. SCSI only). You can also use a discard
followed by a read of the block range to verify that you actually get
zeroes. And then manually patch up any pieces that didn't stick.

Ted> If this is a case that there is just a bunch of crap SSD's out
Ted> there, then maybe we should still do this, but just not enable it
Ted> by default, and force users to manually configure mount options or
Ted> fstrim if they think they have devices that are competently
Ted> implemented?

The good news is that most devices that report DRAT/RZAT are doing the
right thing due to server/RAID vendor pressure. But SSD vendors are
generally not willing to give such guarantees in the datasheets.

Many of these gray areas or slight enhancements to what's mandated by
the T10/T13 specs are negotiated as part of a typical drive procurement
process. The vendor will implement the additional features and
guarantees requested by Dell/HP/IBM/Oracle/etc. Sometimes the
enhancements will trickle into a later versions of the generic SSD
firmware. Sometimes they won't.

It's really no different from hard drives. I'd choose a server vendor
branded version of a disk drive over the generic version any day. Both
because of binning and because of the additional data integrity and
error recovery features that are likely present in the firmware.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-17 16:44     ` Martin K. Petersen
@ 2014-02-17 19:19       ` Theodore Ts'o
  2014-02-18  1:31         ` Martin K. Petersen
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-17 19:19 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 11:44:27AM -0500, Martin K. Petersen wrote:
> Ted> Basically, who was practicing engineering malpractice?  The SSD
> Ted> vendors, or the T10/T13 spec authors?
> 
> I think it's important to emphasize that T10/T13 specs are mainly
> written by device vendors. And they have a very strong objection to
> complicating the device firmware, keeping internal state, etc. So the
> outcome is very rarely in the operating system's favor. I completely
> agree that these flags are broken by definition.

Sigh...

One of the reasons why this came up is if you are implementing a cloud
hosting service, where disk is emulated, and since you are trying to
do something cheap-cheap-cheap (for example, OpenShift from Red Hat
has a very generous free guests policy), it's likely that you're using
something like qcow2, or thinp, or something similar to emulate disks
to drive storage costs down.  So anything we can do to eliminate I/O
work at the Host OS layer is going to be really visible, and this
includes replacing zero-block writes with the equivalent of punch or
TRIM w/ ZRAT.

> The only discard approach that provides a guaranteed result is WRITE
> SAME with the UNMAP bit set (i.e. SCSI only).

So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
doesn't set the UNMAP bit, correct?  I understand there will be
environments where performance is more important than cost, where it
may not be a good idea to set the UNMAP bit.  So it sounds like what
we should do is add a flags which controls whether or not to use TRIM
w/ ZRAT or WRITE SAME with the UNMAP bit is set.

We'll then also need to work with the KVM folks to make sure that
WRITE SAME w/ UNMAP gets plumbed through to the KVM userspace, which
can then do something like FL_PUNCH if it is using a raw sparse image,
or the equivalent in qcow2, etc.

(If the KVM folks want to be even more aggressive, if they know they
are using an underlying storage system where keeping the allocated
blocks isn't really going to help performance, even if the UNMAP bit
isn't set and the data block is all zero's, maybe they might want to
unmap the block(s) anyway.  Or we could leave this up to the Guest OS
userspace, and plumb a hint from the Host to the Guest that it should
really use WRITE SAME w/ UNMAP.  But I'm not convinced it's worth it.)

Does this sound like a reasonable way to go?

> The good news is that most devices that report DRAT/RZAT are doing the
> right thing due to server/RAID vendor pressure.   But SSD vendors are
> generally not willing to give such guarantees in the datasheets.

I imagine the reason why they aren't willing to give such guarantees
is that it would cost more to do the testing to assure this, and while
they know that a certain firmwar version shipped to $BIG_HDD_CUSTOMER
does the right thing, it might regress without their knowing about it
in some future firmware version.

On the other hand, if there was a white list kept somewhere, either in
the kernel, or in some more dynamically updated list (ala what
smartctl does to get the latest vendor-specific attributes), being on
the white list might be enough of a commercial advantage that drive
vendors would be incentivized to provide such a guarantee.  Especially
if, say, a major SSD vendor such as Intel could be induced make such a
public guarantee and we publicized this fact.

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-17 19:19       ` Theodore Ts'o
@ 2014-02-18  1:31         ` Martin K. Petersen
  2014-02-18  2:17           ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Martin K. Petersen @ 2014-02-18  1:31 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
Ted> doesn't set the UNMAP bit, correct?  

Correct. Because it explicitly tells the device to write zeroes.

Ted> I understand there will be environments where performance is more
Ted> important than cost, where it may not be a good idea to set the
Ted> UNMAP bit.  So it sounds like what we should do is add a flags
Ted> which controls whether or not to use TRIM w/ ZRAT or WRITE SAME
Ted> with the UNMAP bit is set.

The rationale behind blkdev_issue_discard was to provide a facility to
mark a block range as unused by the filesystem. With the expectation
that those blocks would be deallocated/deprovisioned on the device.

The rationale behind blkdev_issue_zeroout was to provide a facility to
provide a cleared block range. With the expectation that those blocks
would be allocated/provisioned on the device.

Your variant seems to land somewhere in-between. You want a
blkdev_issue_clear() that zeroes a block range and it's then up to the
storage device to decide whether to provision or deprovision the space
as long as you are guaranteed to get zeroes back for each block in the
range on a subsequent read. Is that a correct interpretation?

I'm trying to pin down your exact use case because it can get very murky
between the SCSI and ATA variants. And the fact that the same knobs are
used for both over and under-provisioned devices. SCSI also has an
additional state: Blocks can be either mapped, anchored or deallocated.

Ted> On the other hand, if there was a white list kept somewhere, either
Ted> in the kernel, or in some more dynamically updated list (ala what
Ted> smartctl does to get the latest vendor-specific attributes), being
Ted> on the white list might be enough of a commercial advantage that
Ted> drive vendors would be incentivized to provide such a guarantee.
Ted> Especially if, say, a major SSD vendor such as Intel could be
Ted> induced make such a public guarantee and we publicized this fact.

I'm perfectly fine with maintaining a whitelist if we can get vendors to
commit.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  1:31         ` Martin K. Petersen
@ 2014-02-18  2:17           ` Theodore Ts'o
  2014-02-18  3:44             ` Martin K. Petersen
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-18  2:17 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 08:31:50PM -0500, Martin K. Petersen wrote:
> 
> Your variant seems to land somewhere in-between. You want a
> blkdev_issue_clear() that zeroes a block range and it's then up to the
> storage device to decide whether to provision or deprovision the space
> as long as you are guaranteed to get zeroes back for each block in the
> range on a subsequent read. Is that a correct interpretation?

Ultimately the storage device (or host OS) can always decide to
deprovision the space if it is given a write of all zeroes; there's
nothing in the specs that say anything at all about performance
considerations, or what the back-end storage decides to do in order to
handle a particular write command, so long as a subsequent read
returns the correct data.

So what I want is a way in which the guest (kernel, file system, etc)
should be able to request that the space be deprovisioned.  So it's a
hint insofar ultimately, the host can always decide whether or not
whether or not to honor the deprovision request (or the host could
decide to deprovision a block even if it's not explicitly requested).

However, I don't want it to be a hint insofar that a subsequent read
is *guaranteed* to return all zeros after this "blkdev_issue_clear()"
command has been successfully sent to the device.

Does that make sense?

> I'm trying to pin down your exact use case because it can get very murky
> between the SCSI and ATA variants. And the fact that the same knobs are
> used for both over and under-provisioned devices. SCSI also has an
> additional state: Blocks can be either mapped, anchored or deallocated.

What does "anchored" mean?  Does is it the equivalent of using
fallocate() to allocate the block, but it's marked uninitialized so
any attempt to read it returns zeroes?  If so, that certainly sounds
like useful functionality that would be great if KVM exposed via
virt-scsi.

						- Ted


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  2:17           ` Theodore Ts'o
@ 2014-02-18  3:44             ` Martin K. Petersen
  2014-02-18  5:47               ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Martin K. Petersen @ 2014-02-18  3:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted> Ultimately the storage device (or host OS) can always decide to
Ted> deprovision the space if it is given a write of all zeroes; there's
Ted> nothing in the specs that say anything at all about performance
Ted> considerations, or what the back-end storage decides to do in order
Ted> to handle a particular write command, so long as a subsequent read
Ted> returns the correct data.

Well, that has changed a bit with the logical block provisioning bits in
SCSI. That's why I brought up the allocation/deallocation assumptions in
the existing two blkdev_issue_foo() calls.

Ted> Does that make sense?

Yeah. So deprovision with guaranteed zero on read is what you're
after. I'll chew on that a bit tomorrow.

Ted> What does "anchored" mean?  Does is it the equivalent of using
Ted> fallocate() to allocate the block, but it's marked uninitialized so
Ted> any attempt to read it returns zeroes?

It means that any allocations internal to the storage device which are
required to subsequently perform a write to that block have been
made. So it's a way to reserve space without actually writing the
blocks. Whether reading an anchored block returns zeroes or something
else depends on the usual twisted maze of conflicting and confusing
flags.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  3:44             ` Martin K. Petersen
@ 2014-02-18  5:47               ` Theodore Ts'o
  2014-02-19  2:20                 ` Martin K. Petersen
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-02-18  5:47 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 10:44:47PM -0500, Martin K. Petersen wrote:
> 
> Well, that has changed a bit with the logical block provisioning bits in
> SCSI. That's why I brought up the allocation/deallocation assumptions in
> the existing two blkdev_issue_foo() calls.

Is it a fair assumption that the reason why T10 added these bits is
mainly so that clients of thin-provisioned storage devices can
guarantee that a subseq uent write won't fail?  Since historically the
spec writers have washed their hands of anything that might vaguely
smell of performance considerations....

> Yeah. So deprovision with guaranteed zero on read is what you're
> after. I'll chew on that a bit tomorrow.

Yes.  And also some way for the host OS (or some other underlying
storage device, more generally) to send hints to the guest OS about
the best way to tune filesystems at mkfs and/or mount time for best
performance, so we don't have to require the system administrator to
have to manually set mount options, mkfs options, and/or magic "echo"
commands to /proc or /sys files.

It would be great if we could get SATA and SCSI devices to also
deliver these hints to kernel, or to have our kernels make some
hueristics based on various SCSI mode pages, and then deliver it to
the file system or via some defined /sys files so that userspace
programs like mkfs can automatically DTRT.  I'm not sure if this is
going to require spec changes and hardware changes, or whether there
are some existing hints form the device drivers that we might be able
to leverage.

For example, right now I'm just manually using the discard mount
options on certain PCIe-attached flash where I know it's beneficial,
but it's a manual tuning based on knowledge of the underlying storage
device.  Figuring out when it's better to use fstrim, or doing it at
unlink time, etc., is something that's better done automatically
instead of manually, but this is I fear answering questions like this
in a reliable fashion is going to be a very hard problem --- and as
storage devices get more complex, such as hybrid drives with even more
varied and interesting performance characteristics, it's only going to
get harder!

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  5:47               ` Theodore Ts'o
@ 2014-02-19  2:20                 ` Martin K. Petersen
  0 siblings, 0 replies; 13+ messages in thread
From: Martin K. Petersen @ 2014-02-19  2:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> Is it a fair assumption that the reason why T10 added these bits is
Ted> mainly so that clients of thin-provisioned storage devices can
Ted> guarantee that a subseq uent write won't fail? 

Yes. It's a way to lock down blocks for future use without having to
actually write them.

Ted> Figuring out when it's better to use fstrim, or doing it at unlink
Ted> time, etc., is something that's better done automatically instead
Ted> of manually, but this is I fear answering questions like this in a
Ted> reliable fashion is going to be a very hard problem

We'll quickly get back to the problem we had with the other I/O hints.
Namely the device vendor having to provide an honest indicator of their
performance characteristics.

Recent SCSI has a way to communicate maximum values for processing
various types of commands. Going forward we can use those values to tune
our request timeouts. But as far as the normal processing time is
concerned it's hard to see a vendor putting anything in there other than
"this one goes to 11".

Over time we may be able to switch to online discards by default. But at
this stage drives with queued trim support are just starting to roll
out. And some thin-provisioned arrays were evidently also designed with
a monthly Norton SpeedDisk run in mind.

However, I'm open to having a prefer_online_discards flag that dedicated
PCIe SSD drivers could set. It's not until we enter SCSI/ATA territory
that things get murky.

On the topic of reliable discards: Apparently there are some amendments
being discussed in T10/T13 right now that may help us. Still
investigating...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-02-19  2:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-14  4:32 [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
2014-02-14 13:05 ` Christoph Hellwig
2014-02-14 14:57   ` Theodore Ts'o
2014-02-14 17:14 ` Martin K. Petersen
2014-02-15  1:29   ` Theodore Ts'o
2014-02-17 16:44     ` Martin K. Petersen
2014-02-17 19:19       ` Theodore Ts'o
2014-02-18  1:31         ` Martin K. Petersen
2014-02-18  2:17           ` Theodore Ts'o
2014-02-18  3:44             ` Martin K. Petersen
2014-02-18  5:47               ` Theodore Ts'o
2014-02-19  2:20                 ` Martin K. Petersen
2014-02-17 16:41   ` Lukáš Czerner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.