* Is it possible that certain physical disk doesn't implement flush correctly?
@ 2019-03-30 12:31 Qu Wenruo
  2019-03-30 12:57 ` Supercilious Dude
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Qu Wenruo @ 2019-03-30 12:31 UTC (permalink / raw)
  To: linux-btrfs, Linux FS Devel, linux-block


Hi,

I'm wondering whether it's possible that certain physical devices don't
handle flush correctly.

E.g. could some vendor implement complex logic in their HDD controller to
skip certain flush requests (but not all, obviously) to improve performance?

Has anyone seen such reports?

And if this has happened before, how can we users detect such a problem?

Can we just check the flush time against the writes issued before the flush?
E.g. write X random blocks to the device, call fsync() on it, and measure
the execution time. Repeat Y times, and compare the avg/std.
Then change X to 2X/4X/..., and repeat the check.
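
Something like the following minimal C sketch could do that measurement
(illustrative only: /dev/sdX is a placeholder for a scratch device that
will be overwritten, error handling is trimmed, and the block counts are
arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLK 4096

/* Write nblocks random 4 KiB blocks at random offsets within span,
 * then time how long the following fsync() takes. */
static double timed_fsync(int fd, int nblocks, off_t span)
{
    char buf[BLK];
    struct timespec t0, t1;

    for (int i = 0; i < nblocks; i++) {
        memset(buf, rand() & 0xff, BLK);
        off_t off = (rand() % (span / BLK)) * BLK;
        if (pwrite(fd, buf, BLK, off) != BLK) {
            perror("pwrite");
            exit(1);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd)) {                    /* this is what becomes the FLUSH */
        perror("fsync");
        exit(1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* Scratch device only -- this overwrites data on it. */
    int fd = open("/dev/sdX", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    for (int x = 1000; x <= 8000; x *= 2)       /* X, 2X, 4X, 8X */
        printf("X=%d  fsync took %.3f s\n", x,
               timed_fsync(fd, x, 1024LL * 1024 * 1024));

    close(fd);
    return 0;
}

(Repeating each X value Y times and collecting avg/std is left out for
brevity.)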

Thanks,
Qu




* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo
@ 2019-03-30 12:57 ` Supercilious Dude
  2019-03-30 13:00   ` Qu Wenruo
  2019-03-31 11:27 ` Alberto Bursi
  2019-04-01 12:04 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 18+ messages in thread
From: Supercilious Dude @ 2019-03-30 12:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block

Hi,

This would give a false positive on any controller with a cache, as all
requests would take 0 ms until the cache is full and the controller has
to actually flush to disk. I am using an HP P841 controller in my test
system; it has a 4 GB cache, which makes every IO effectively instant
unless enough requests arrive that they can't be flushed to the disks as
quickly as they come in - the latency variation is huge depending on load.

Regards

On Sat, 30 Mar 2019 at 12:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> Hi,
>
> I'm wondering if it's possible that certain physical device doesn't
> handle flush correctly.
>
> E.g. some vendor does some complex logical in their hdd controller to
> skip certain flush request (but not all, obviously) to improve performance?
>
> Do anyone see such reports?
>
> And if proves to happened before, how do we users detect such problem?
>
> Can we just check the flush time against the write before flush call?
> E.g. write X random blocks into that device, call fsync() on it, check
> the execution time. Repeat Y times, and compare the avg/std.
> And change X to 2X/4X/..., repeat above check.
>
> Thanks,
> Qu
>
>


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 12:57 ` Supercilious Dude
@ 2019-03-30 13:00   ` Qu Wenruo
  2019-03-30 13:04     ` Supercilious Dude
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-03-30 13:00 UTC (permalink / raw)
  To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block





On 2019/3/30 8:57 PM, Supercilious Dude wrote:
> Hi,
> 
> This would give a false positive on any controller with a cache as all
> requests would take 0ms until the cache is full and the controller has
> to actually flush to disk.

I'm proposing to measure the execution time of flush/fsync, not of the writes.

And if the flush takes 0 ms, it means the device doesn't really write its
cached data onto disk.

Thanks,
Qu

> I am using an HP P841 controller in my test
> system and it has a 4GB cache making every IO instant unless there are
> enough of them that they can't be flushed to the disks as quickly as
> they come in - the latency variation is huge depending on load.
> 
> Regards
> 
> On Sat, 30 Mar 2019 at 12:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> Hi,
>>
>> I'm wondering if it's possible that certain physical device doesn't
>> handle flush correctly.
>>
>> E.g. some vendor does some complex logical in their hdd controller to
>> skip certain flush request (but not all, obviously) to improve performance?
>>
>> Do anyone see such reports?
>>
>> And if proves to happened before, how do we users detect such problem?
>>
>> Can we just check the flush time against the write before flush call?
>> E.g. write X random blocks into that device, call fsync() on it, check
>> the execution time. Repeat Y times, and compare the avg/std.
>> And change X to 2X/4X/..., repeat above check.
>>
>> Thanks,
>> Qu
>>
>>



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 13:00   ` Qu Wenruo
@ 2019-03-30 13:04     ` Supercilious Dude
  2019-03-30 13:09       ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Supercilious Dude @ 2019-03-30 13:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block

On Sat, 30 Mar 2019 at 13:00, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> I'm purposing to measure the execution time of flush/fsync, not write.
>
> And if flush takes 0ms, it means it doesn't really write cached data
> onto disk.
>

That is correct. The controller ignores your flush requests on the
virtual disk by design. When the data hits the controller it is
considered "stored" - the physical disk(s) storing the virtual disk is
an implementation detail. The performance characteristics of these
controllers are needed to make big arrays work in a useful manner. My
controller is connected to 4 HP 2600 enclosures with 12 drives each.
Waiting for a flush on a single disk before continuing work on the
remaining 47 disks would be catastrophic for performance.

Regards


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 13:04     ` Supercilious Dude
@ 2019-03-30 13:09       ` Qu Wenruo
  2019-03-30 13:14         ` Supercilious Dude
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-03-30 13:09 UTC (permalink / raw)
  To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block





On 2019/3/30 9:04 PM, Supercilious Dude wrote:
> On Sat, 30 Mar 2019 at 13:00, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> I'm purposing to measure the execution time of flush/fsync, not write.
>>
>> And if flush takes 0ms, it means it doesn't really write cached data
>> onto disk.
>>
> 
> That is correct. The controller ignores your flush requests on the
> virtual disk by design. When the data hits the controller it is
> considered "stored" - the physical disk(s) storing the virtual disk is
> an implementation detail. The performance characteristics of these
> controllers are needed to make big arrays work in a useful manner. My
> controller is connected to 4 HP 2600 enclosures with 12 drives each.
> Waiting for a flush on a single disk before continuing work on the
> remaining 47 disks would be catastrophic for performance.

If the controller is doing so, it must have its own power backup, or at
least complete the flush once the data is written to its fast cache.

For the cached case, with enough data we could still find some clue in
the flush execution times.

That aside, for such enterprise-level usage it's OK.

But for consumer-level storage I'm not so sure, especially for HDDs, and
maybe NVMe devices.

So my question still stands.

Thanks,
Qu

> 
> Regards
> 



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 13:09       ` Qu Wenruo
@ 2019-03-30 13:14         ` Supercilious Dude
  2019-03-30 13:24           ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Supercilious Dude @ 2019-03-30 13:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block

On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> If controller is doing so, it must have its own power or at least finish
> flush when controller writes to its fast cache.
>

The controller has its own battery backup to power the DRAM cache, as
well as flash storage to dump it onto in the exceedingly unlikely
event that the battery gets depleted.

> For cache case, if we have enough data, we could still find some clue on
> the flush execution time.
>
> Despite that, for that enterprise level usage, it's OK.
>
> But for consumer level storage, I'm not sure, especially for HDDs, and
> maybe NVMe devices.
>

How do you distinguish who is who? Am I an enterprise or a consumer?

Regards


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 13:14         ` Supercilious Dude
@ 2019-03-30 13:24           ` Qu Wenruo
  2019-03-31 22:45             ` J. Bruce Fields
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-03-30 13:24 UTC (permalink / raw)
  To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block





On 2019/3/30 9:14 PM, Supercilious Dude wrote:
> On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> If controller is doing so, it must have its own power or at least finish
>> flush when controller writes to its fast cache.
>>
> 
> The controller has its own battery backup to power the DRAM cache, as
> well as flash storage to dump it onto in the exceedingly unlikely
> event that the battery gets depleted.
> 
>> For cache case, if we have enough data, we could still find some clue on
>> the flush execution time.
>>
>> Despite that, for that enterprise level usage, it's OK.
>>
>> But for consumer level storage, I'm not sure, especially for HDDs, and
>> maybe NVMe devices.
>>
> 
> How do you distinguish who is a who? Am I an enterprise or a consumer?

Easy, price. :P

To be honest, I don't really care about that fancy use case.
It's the vendor doing its job, and if something goes wrong, the
customers will yell at them.

I'm more interested in the consumer-level situation.

Thanks,
Qu

> 
> Regards
> 



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo
  2019-03-30 12:57 ` Supercilious Dude
@ 2019-03-31 11:27 ` Alberto Bursi
  2019-03-31 12:00   ` Qu Wenruo
                     ` (2 more replies)
  2019-04-01 12:04 ` Austin S. Hemmelgarn
  2 siblings, 3 replies; 18+ messages in thread
From: Alberto Bursi @ 2019-03-31 11:27 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block


On 30/03/19 13:31, Qu Wenruo wrote:
> Hi,
>
> I'm wondering if it's possible that certain physical device doesn't
> handle flush correctly.
>
> E.g. some vendor does some complex logical in their hdd controller to
> skip certain flush request (but not all, obviously) to improve performance?
>
> Do anyone see such reports?
>
> And if proves to happened before, how do we users detect such problem?
>
> Can we just check the flush time against the write before flush call?
> E.g. write X random blocks into that device, call fsync() on it, check
> the execution time. Repeat Y times, and compare the avg/std.
> And change X to 2X/4X/..., repeat above check.
>
> Thanks,
> Qu
>
>

Afaik HDDs and SSDs do lie to fsync(),
unless the write cache is turned off with hdparm:

hdparm -W0 /dev/sda

similarly to RAID controllers.

See below:

https://brad.livejournal.com/2116715.html

https://queue.acm.org/detail.cfm?id=2367378


-



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 11:27 ` Alberto Bursi
@ 2019-03-31 12:00   ` Qu Wenruo
  2019-03-31 13:36     ` Hannes Reinecke
  2019-03-31 12:21   ` Andrei Borzenkov
  2019-04-01 11:55   ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-03-31 12:00 UTC (permalink / raw)
  To: Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block





On 2019/3/31 7:27 PM, Alberto Bursi wrote:
> 
> On 30/03/19 13:31, Qu Wenruo wrote:
>> Hi,
>>
>> I'm wondering if it's possible that certain physical device doesn't
>> handle flush correctly.
>>
>> E.g. some vendor does some complex logical in their hdd controller to
>> skip certain flush request (but not all, obviously) to improve performance?
>>
>> Do anyone see such reports?
>>
>> And if proves to happened before, how do we users detect such problem?
>>
>> Can we just check the flush time against the write before flush call?
>> E.g. write X random blocks into that device, call fsync() on it, check
>> the execution time. Repeat Y times, and compare the avg/std.
>> And change X to 2X/4X/..., repeat above check.
>>
>> Thanks,
>> Qu
>>
>>
> 
> Afaik HDDs and SSDs do lie to fsync()

fsync() on a block device is translated into a FLUSH bio.

If all/most consumer-level SATA HDD/SSD devices were lying, then there
would be no power-loss safety at all for any fs, as most filesystems
rely on the FLUSH bio to implement their write barriers.

And filesystems with generation checks would all report metadata from
the future every time a crash happens, or even worse, a gracefully
unmounted fs would show corruption.
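
(For reference, the FLUSH here is just an empty bio with REQ_PREFLUSH
set - roughly what blkdev_issue_flush() does internally. A sketch only;
helper names and signatures differ between kernel versions:)

#include <linux/bio.h>
#include <linux/blkdev.h>

static int issue_empty_flush(struct block_device *bdev)
{
	struct bio *bio;
	int ret;

	bio = bio_alloc(GFP_KERNEL, 0);             /* no data payload */
	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;  /* empty flush request */
	ret = submit_bio_wait(bio);                 /* waits for device completion */
	bio_put(bio);
	return ret;
}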

Thanks,
Qu

> 
> unless the write cache is turned off with hdparm,
> 
> hdparm -W0 /dev/sda
> 
> similarly to RAID controllers.
> 
> see below
> 
> https://brad.livejournal.com/2116715.html
> 
> https://queue.acm.org/detail.cfm?id=2367378
> 
> 
> -
> 



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 11:27 ` Alberto Bursi
  2019-03-31 12:00   ` Qu Wenruo
@ 2019-03-31 12:21   ` Andrei Borzenkov
  2019-04-01 11:55   ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 18+ messages in thread
From: Andrei Borzenkov @ 2019-03-31 12:21 UTC (permalink / raw)
  To: Alberto Bursi, Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block

On 31.03.2019 14:27, Alberto Bursi wrote:
> 
> On 30/03/19 13:31, Qu Wenruo wrote:
>> Hi,
>>
>> I'm wondering if it's possible that certain physical device doesn't
>> handle flush correctly.
>>
>> E.g. some vendor does some complex logical in their hdd controller to
>> skip certain flush request (but not all, obviously) to improve performance?
>>
>> Do anyone see such reports?
>>
>> And if proves to happened before, how do we users detect such problem?
>>
>> Can we just check the flush time against the write before flush call?
>> E.g. write X random blocks into that device, call fsync() on it, check
>> the execution time. Repeat Y times, and compare the avg/std.
>> And change X to 2X/4X/..., repeat above check.
>>
>> Thanks,
>> Qu
>>
>>
> 
> Afaik HDDs and SSDs do lie to fsync()
> 
> unless the write cache is turned off with hdparm,

I know of at least one SSD model that is claimed to flush its cache in
case of power loss. I can dig up the details if anyone is interested.

> 
> hdparm -W0 /dev/sda
> 
> similarly to RAID controllers.
> 
> see below
> 
> https://brad.livejournal.com/2116715.html
> 
> https://queue.acm.org/detail.cfm?id=2367378
> 
> 
> -
> 



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 12:00   ` Qu Wenruo
@ 2019-03-31 13:36     ` Hannes Reinecke
  2019-03-31 14:17       ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2019-03-31 13:36 UTC (permalink / raw)
  To: Qu Wenruo, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block

On 3/31/19 2:00 PM, Qu Wenruo wrote:
> 
> 
> On 2019/3/31 下午7:27, Alberto Bursi wrote:
>>
>> On 30/03/19 13:31, Qu Wenruo wrote:
>>> Hi,
>>>
>>> I'm wondering if it's possible that certain physical device doesn't
>>> handle flush correctly.
>>>
>>> E.g. some vendor does some complex logical in their hdd controller to
>>> skip certain flush request (but not all, obviously) to improve performance?
>>>
>>> Do anyone see such reports?
>>>
>>> And if proves to happened before, how do we users detect such problem?
>>>
>>> Can we just check the flush time against the write before flush call?
>>> E.g. write X random blocks into that device, call fsync() on it, check
>>> the execution time. Repeat Y times, and compare the avg/std.
>>> And change X to 2X/4X/..., repeat above check.
>>>
>>> Thanks,
>>> Qu
>>>
>>>
>>
>> Afaik HDDs and SSDs do lie to fsync()
> 
> fsync() on block device is interpreted into FLUSH bio.
> 
> If all/most consumer level SATA HDD/SSD devices are lying, then there is
> no power loss safety at all for any fs. As most fs relies on FLUSH bio
> to implement barrier.
> 
> And for fs with generation check, they all should report metadata from
> the future every time a crash happens, or even worse gracefully
> umounting fs would cause corruption.
> 
Please, stop making assumptions.

Disks don't 'lie' about anything; they report things according to the
(SCSI) standard.
And the SCSI standard has two ways of ensuring that things are written
to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
bit in the command.
The latter provides a way of ensuring that a single command made it to
disk, and the former instructs the drive to:

"a) perform a write medium operation to the LBA using the logical block
data in volatile cache; or
b) write the logical block to the non-volatile cache, if any."

which means it's perfectly fine to treat the write cache as a
_non-volatile_ cache if the RAID HBA is battery backed, and thus can
make sure that outstanding I/O can be written back even in the case of a
power failure.

The FUA handling, OTOH, is another matter, and indeed is causing some 
raised eyebrows when comparing it to the spec. But that's another story.
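
(For what it's worth, SYNCHRONIZE CACHE can also be issued by hand from
userspace through the SG_IO ioctl - a rough sketch, with error and
sense-data handling omitted and the device path left to the caller:)

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

/* Issue SYNCHRONIZE CACHE (10), opcode 0x35, to a SCSI/SATA disk. */
static int sync_cache(const char *dev)
{
    unsigned char cdb[10] = { 0x35 };      /* SYNCHRONIZE CACHE (10) */
    unsigned char sense[32];
    struct sg_io_hdr hdr;
    int fd, ret;

    fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_direction = SG_DXFER_NONE;   /* no data transfer */
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.timeout = 60000;                   /* milliseconds */
    ret = ioctl(fd, SG_IO, &hdr);
    close(fd);
    return ret;
}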

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 13:36     ` Hannes Reinecke
@ 2019-03-31 14:17       ` Qu Wenruo
  2019-03-31 14:37         ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2019-03-31 14:17 UTC (permalink / raw)
  To: Hannes Reinecke, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block





On 2019/3/31 9:36 PM, Hannes Reinecke wrote:
> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/3/31 下午7:27, Alberto Bursi wrote:
>>>
>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>> Hi,
>>>>
>>>> I'm wondering if it's possible that certain physical device doesn't
>>>> handle flush correctly.
>>>>
>>>> E.g. some vendor does some complex logical in their hdd controller to
>>>> skip certain flush request (but not all, obviously) to improve
>>>> performance?
>>>>
>>>> Do anyone see such reports?
>>>>
>>>> And if proves to happened before, how do we users detect such problem?
>>>>
>>>> Can we just check the flush time against the write before flush call?
>>>> E.g. write X random blocks into that device, call fsync() on it, check
>>>> the execution time. Repeat Y times, and compare the avg/std.
>>>> And change X to 2X/4X/..., repeat above check.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>
>>> Afaik HDDs and SSDs do lie to fsync()
>>
>> fsync() on block device is interpreted into FLUSH bio.
>>
>> If all/most consumer level SATA HDD/SSD devices are lying, then there is
>> no power loss safety at all for any fs. As most fs relies on FLUSH bio
>> to implement barrier.
>>
>> And for fs with generation check, they all should report metadata from
>> the future every time a crash happens, or even worse gracefully
>> umounting fs would cause corruption.
>>
> Please, stop making assumptions.

I'm not.

> 
> Disks don't 'lie' about anything, they report things according to the
> (SCSI) standard.
> And the SCSI standard has two ways of ensuring that things are written
> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
> bit in the command.

I understand FLUSH and FUA.

> The latter provides a way of ensuring that a single command made it to
> disk, and the former instructs the driver to:
> 
> "a) perform a write medium operation to the LBA using the logical block
> data in volatile cache; or
> b) write the logical block to the non-volatile cache, if any."
> 
> which means it's perfectly fine to treat the write-cache as a
> _non-volative_ cache if the RAID HBA is battery backed, and thus can
> make sure that outstanding I/O can be written back even in the case of a
> power failure.
> 
> The FUA handling, OTOH, is another matter, and indeed is causing some
> raised eyebrows when comparing it to the spec. But that's another story.

I don't care about FUA as much, since libata still doesn't support FUA
by default and interprets it as FLUSH/WRITE/FLUSH, so it doesn't make
things worse.

What I'm more interested in is: do all SATA/NVMe disks follow this FLUSH
behavior?

In most cases I believe they do; otherwise, whatever the fs is, CoW based
or journal based, we would see tons of problems - even a gracefully
unmounted fs could end up corrupted if FLUSH is not implemented properly.

What I'm asking is whether there is some device that doesn't completely
follow the regular FLUSH requirement, but plays some tricks for certain
filesystems it was tested against.

E.g. the disk is only tested against a certain fs, and that fs always
issues something like flush, write, flush, FUA.
In that case, if the controller decides to skip the 2nd flush and only
honour the first flush and the FUA, and the 2nd write is very small
(e.g. a journal), the chance of corruption is pretty low due to the
small window.

In that case the disk could perform a little better, at the cost of an
increased possibility of corruption.

I just want to rule out this case.
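
(For concreteness, the ordering I mean, sketched with userspace calls
where fsync() stands in for the device FLUSH - purely illustrative, not
how any particular fs actually commits:)

#include <unistd.h>

/* Illustrative only: the "flush, write, flush (or FUA)" commit ordering
 * described above, with fsync() standing in for the device FLUSH. */
static int commit_transaction(int journal_fd,
                              const void *payload, size_t len,
                              const void *commit_rec, size_t rec_len)
{
    if (write(journal_fd, payload, len) != (ssize_t)len)
        return -1;
    if (fsync(journal_fd))      /* FLUSH #1: journal payload made durable */
        return -1;
    if (write(journal_fd, commit_rec, rec_len) != (ssize_t)rec_len)
        return -1;
    if (fsync(journal_fd))      /* FLUSH #2 (or a FUA write): commit record
                                 * durable -- the flush a cheating disk
                                 * might be tempted to skip */
        return -1;
    return 0;
}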

Thanks,
Qu

> 
> Cheers,
> 
> Hannes



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 14:17       ` Qu Wenruo
@ 2019-03-31 14:37         ` Hannes Reinecke
  2019-03-31 14:40           ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2019-03-31 14:37 UTC (permalink / raw)
  To: Qu Wenruo, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block

On 3/31/19 4:17 PM, Qu Wenruo wrote:
> 
> 
> On 2019/3/31 下午9:36, Hannes Reinecke wrote:
>> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2019/3/31 下午7:27, Alberto Bursi wrote:
>>>>
>>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>>> Hi,
>>>>>
>>>>> I'm wondering if it's possible that certain physical device doesn't
>>>>> handle flush correctly.
>>>>>
>>>>> E.g. some vendor does some complex logical in their hdd controller to
>>>>> skip certain flush request (but not all, obviously) to improve
>>>>> performance?
>>>>>
>>>>> Do anyone see such reports?
>>>>>
>>>>> And if proves to happened before, how do we users detect such problem?
>>>>>
>>>>> Can we just check the flush time against the write before flush call?
>>>>> E.g. write X random blocks into that device, call fsync() on it, check
>>>>> the execution time. Repeat Y times, and compare the avg/std.
>>>>> And change X to 2X/4X/..., repeat above check.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>
>>>>
>>>> Afaik HDDs and SSDs do lie to fsync()
>>>
>>> fsync() on block device is interpreted into FLUSH bio.
>>>
>>> If all/most consumer level SATA HDD/SSD devices are lying, then there is
>>> no power loss safety at all for any fs. As most fs relies on FLUSH bio
>>> to implement barrier.
>>>
>>> And for fs with generation check, they all should report metadata from
>>> the future every time a crash happens, or even worse gracefully
>>> umounting fs would cause corruption.
>>>
>> Please, stop making assumptions.
> 
> I'm not.
> 
>>
>> Disks don't 'lie' about anything, they report things according to the
>> (SCSI) standard.
>> And the SCSI standard has two ways of ensuring that things are written
>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
>> bit in the command.
> 
> I understand FLUSH and FUA.
> 
>> The latter provides a way of ensuring that a single command made it to
>> disk, and the former instructs the driver to:
>>
>> "a) perform a write medium operation to the LBA using the logical block
>> data in volatile cache; or
>> b) write the logical block to the non-volatile cache, if any."
>>
>> which means it's perfectly fine to treat the write-cache as a
>> _non-volative_ cache if the RAID HBA is battery backed, and thus can
>> make sure that outstanding I/O can be written back even in the case of a
>> power failure.
>>
>> The FUA handling, OTOH, is another matter, and indeed is causing some
>> raised eyebrows when comparing it to the spec. But that's another story.
> 
> I don't care FUA as much, since libata still doesn't support FUA by
> default and interpret it as FLUSH/WRITE/FLUSH, so it doesn't make things
> worse.
> 
> I'm more interesting in, are all SATA/NVMe disks follows this FLUSH
> behavior?
> 
They have to, to be spec compliant.

> For most case, I believe it is, or whatever the fs is, either CoW based
> or journal based, we're going to see tons of problems, even gracefully
> unmounted fs can have corruption if FLUSH is not implemented well.
> 
> I'm interested in, is there some device doesn't completely follow
> regular FLUSH requirement, but do some tricks, for certain tested fs.
> 
Not that I'm aware of.

> E.g. the disk is only tested for certain fs, and that fs always does
> something like flush, write flush, fua.
> In that case, if the controller decides to skip the 2nd flush, but only
> do the first flush and fua, if the 2nd write is very small (e.g.
> journal), the chance of corruption is pretty low due to the small window.
> 
Highly unlikely.
Tweaking flush handling in this way is IMO far too complicated, and
would only add to the complexity of implementing flush handling in
firmware in the first place.
Whereas the whole point of such an exercise would be to _reduce_
complexity in firmware (no-one really cares about the hardware here;
that's already factored in during manufacturing, and reliability is
measured in such a broad way that it doesn't make sense for the
manufacturer to try to 'improve' reliability by tweaking the flush
algorithm).
So if someone wanted to save money they'd do away with flush handling
entirely and not implement a write cache at all.
That would even save them money on the hardware, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 14:37         ` Hannes Reinecke
@ 2019-03-31 14:40           ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2019-03-31 14:40 UTC (permalink / raw)
  To: Hannes Reinecke, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block





On 2019/3/31 10:37 PM, Hannes Reinecke wrote:
> On 3/31/19 4:17 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/3/31 下午9:36, Hannes Reinecke wrote:
>>> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/3/31 下午7:27, Alberto Bursi wrote:
>>>>>
>>>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if it's possible that certain physical device doesn't
>>>>>> handle flush correctly.
>>>>>>
>>>>>> E.g. some vendor does some complex logical in their hdd controller to
>>>>>> skip certain flush request (but not all, obviously) to improve
>>>>>> performance?
>>>>>>
>>>>>> Do anyone see such reports?
>>>>>>
>>>>>> And if proves to happened before, how do we users detect such
>>>>>> problem?
>>>>>>
>>>>>> Can we just check the flush time against the write before flush call?
>>>>>> E.g. write X random blocks into that device, call fsync() on it,
>>>>>> check
>>>>>> the execution time. Repeat Y times, and compare the avg/std.
>>>>>> And change X to 2X/4X/..., repeat above check.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>
>>>>>
>>>>> Afaik HDDs and SSDs do lie to fsync()
>>>>
>>>> fsync() on block device is interpreted into FLUSH bio.
>>>>
>>>> If all/most consumer level SATA HDD/SSD devices are lying, then
>>>> there is
>>>> no power loss safety at all for any fs. As most fs relies on FLUSH bio
>>>> to implement barrier.
>>>>
>>>> And for fs with generation check, they all should report metadata from
>>>> the future every time a crash happens, or even worse gracefully
>>>> umounting fs would cause corruption.
>>>>
>>> Please, stop making assumptions.
>>
>> I'm not.
>>
>>>
>>> Disks don't 'lie' about anything, they report things according to the
>>> (SCSI) standard.
>>> And the SCSI standard has two ways of ensuring that things are written
>>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
>>> bit in the command.
>>
>> I understand FLUSH and FUA.
>>
>>> The latter provides a way of ensuring that a single command made it to
>>> disk, and the former instructs the driver to:
>>>
>>> "a) perform a write medium operation to the LBA using the logical block
>>> data in volatile cache; or
>>> b) write the logical block to the non-volatile cache, if any."
>>>
>>> which means it's perfectly fine to treat the write-cache as a
>>> _non-volative_ cache if the RAID HBA is battery backed, and thus can
>>> make sure that outstanding I/O can be written back even in the case of a
>>> power failure.
>>>
>>> The FUA handling, OTOH, is another matter, and indeed is causing some
>>> raised eyebrows when comparing it to the spec. But that's another story.
>>
>> I don't care FUA as much, since libata still doesn't support FUA by
>> default and interpret it as FLUSH/WRITE/FLUSH, so it doesn't make things
>> worse.
>>
>> I'm more interesting in, are all SATA/NVMe disks follows this FLUSH
>> behavior?
>>
> They have to to be spec compliant.
> 
>> For most case, I believe it is, or whatever the fs is, either CoW based
>> or journal based, we're going to see tons of problems, even gracefully
>> unmounted fs can have corruption if FLUSH is not implemented well.
>>
>> I'm interested in, is there some device doesn't completely follow
>> regular FLUSH requirement, but do some tricks, for certain tested fs.
>>
> Not that I'm aware of.

That's great to know.

> 
>> E.g. the disk is only tested for certain fs, and that fs always does
>> something like flush, write flush, fua.
>> In that case, if the controller decides to skip the 2nd flush, but only
>> do the first flush and fua, if the 2nd write is very small (e.g.
>> journal), the chance of corruption is pretty low due to the small window.
>>
> Highly unlikely.
> Tweaking flush handling in this way is IMO far too complicated, and
> would only add to the complexity of adding flush handling in firmware in
> the first place.
> Whereas the whole point of this exercise would be to _reduce_ complexity
> in firmware (no-one really cares about the hardware here; that's already
> factored in during manufacturing, and reliability is measured in such a
> broad way that it doesn't make sense for the manufacture to try to
> 'improve' reliability by tweaking the flush algorithm).
> So if someone would be wanting to save money they'd do away with the
> entire flush handling and do not implement a write cache at all.
> That even saves them money on the hardware, too.

If there are no reports for consumer-level HDDs/SSDs, then it should be
fine, and that matches my understanding.

Thanks,
Qu

> 
> Cheers,
> 
> Hannes



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 13:24           ` Qu Wenruo
@ 2019-03-31 22:45             ` J. Bruce Fields
  2019-03-31 23:07               ` Alberto Bursi
  0 siblings, 1 reply; 18+ messages in thread
From: J. Bruce Fields @ 2019-03-31 22:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Supercilious Dude, linux-btrfs, Linux FS Devel, linux-block

On Sat, Mar 30, 2019 at 09:24:37PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/3/30 下午9:14, Supercilious Dude wrote:
> > On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >> If controller is doing so, it must have its own power or at least finish
> >> flush when controller writes to its fast cache.
> >>
> > 
> > The controller has its own battery backup to power the DRAM cache, as
> > well as flash storage to dump it onto in the exceedingly unlikely
> > event that the battery gets depleted.
> > 
> >> For cache case, if we have enough data, we could still find some clue on
> >> the flush execution time.
> >>
> >> Despite that, for that enterprise level usage, it's OK.
> >>
> >> But for consumer level storage, I'm not sure, especially for HDDs, and
> >> maybe NVMe devices.
> >>
> > 
> > How do you distinguish who is a who? Am I an enterprise or a consumer?
> 
> Easy, price. :P
> 
> To be honest, I don't really care about that fancy use case.
> It's the vendor doing its work, and if something wrong happened,
> customer will yell at them.
> 
> I'm more interesting in the consumer level situation.

The feature seems to be advertised as "power loss protection" or
"enhanced power loss data protection".  Which makes it sound like a data
safety feature when really it's a performance feature.  E.g. these are
the Intel drives with "EPLDP":

	https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=35125&0_EPLDP=True

Last I checked there were some that weren't too expensive.

--b.


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 22:45             ` J. Bruce Fields
@ 2019-03-31 23:07               ` Alberto Bursi
  0 siblings, 0 replies; 18+ messages in thread
From: Alberto Bursi @ 2019-03-31 23:07 UTC (permalink / raw)
  To: J. Bruce Fields, Qu Wenruo
  Cc: Supercilious Dude, linux-btrfs, Linux FS Devel, linux-block


On 01/04/19 00:45, J. Bruce Fields wrote:
> On Sat, Mar 30, 2019 at 09:24:37PM +0800, Qu Wenruo wrote:
>>
>> On 2019/3/30 下午9:14, Supercilious Dude wrote:
>>> On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>> If controller is doing so, it must have its own power or at least finish
>>>> flush when controller writes to its fast cache.
>>>>
>>> The controller has its own battery backup to power the DRAM cache, as
>>> well as flash storage to dump it onto in the exceedingly unlikely
>>> event that the battery gets depleted.
>>>
>>>> For cache case, if we have enough data, we could still find some clue on
>>>> the flush execution time.
>>>>
>>>> Despite that, for that enterprise level usage, it's OK.
>>>>
>>>> But for consumer level storage, I'm not sure, especially for HDDs, and
>>>> maybe NVMe devices.
>>>>
>>> How do you distinguish who is a who? Am I an enterprise or a consumer?
>> Easy, price. :P
>>
>> To be honest, I don't really care about that fancy use case.
>> It's the vendor doing its work, and if something wrong happened,
>> customer will yell at them.
>>
>> I'm more interesting in the consumer level situation.
> The feature seems to be advertised as "power loss protection" or
> "enhanced power loss data protection".  Which makes it sound like a data
> safety feature when really it's a performance feature.  E.g. these are
> the Intel drives with "EPLDP":
>
> 	https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=35125&0_EPLDP=True
>
> Last I checked there were some that weren't too expensive.
>
> --b.


Afaik quite a few consumer Crucial SSDs do have power loss protection
(those that advertise it either have a large bank of capacitors on their
PCB or use newer flash that for some reason can do without that).

-Alberto



* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-31 11:27 ` Alberto Bursi
  2019-03-31 12:00   ` Qu Wenruo
  2019-03-31 12:21   ` Andrei Borzenkov
@ 2019-04-01 11:55   ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2019-04-01 11:55 UTC (permalink / raw)
  To: Alberto Bursi, Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block

On 2019-03-31 07:27, Alberto Bursi wrote:
> 
> On 30/03/19 13:31, Qu Wenruo wrote:
>> Hi,
>>
>> I'm wondering if it's possible that certain physical device doesn't
>> handle flush correctly.
>>
>> E.g. some vendor does some complex logical in their hdd controller to
>> skip certain flush request (but not all, obviously) to improve performance?
>>
>> Do anyone see such reports?
>>
>> And if proves to happened before, how do we users detect such problem?
>>
>> Can we just check the flush time against the write before flush call?
>> E.g. write X random blocks into that device, call fsync() on it, check
>> the execution time. Repeat Y times, and compare the avg/std.
>> And change X to 2X/4X/..., repeat above check.
>>
>> Thanks,
>> Qu
>>
>>
> 
> Afaik HDDs and SSDs do lie to fsync()
> 
> unless the write cache is turned off with hdparm,
Nope, that's not the case on modern Linux.  The issue here was that Linux
did not issue a FLUSH bio as part of completing an fsync() system call,
and that problem has long since been fixed (it wasn't actually the disk
lying, but the kernel).
> 
> hdparm -W0 /dev/sda
> 
> similarly to RAID controllers.
And most RAID controllers don't actually lie either.  The SCSI and ATA 
standards both count a write that is stored in a _non-volatile_ cache as 
completed, and any halfway-decent RAID controller will be using some 
form of non-volatile storage for its cache (classically battery-backed
SRAM, but there's been some shift to NOR or NAND flash storage recently, 
and I've seen a couple of really expensive ones using more-exotic 
non-volatile storage technologies).
> 
> see below
> 
> https://brad.livejournal.com/2116715.html
> 
> https://queue.acm.org/detail.cfm?id=2367378


* Re: Is it possible that certain physical disk doesn't implement flush correctly?
  2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo
  2019-03-30 12:57 ` Supercilious Dude
  2019-03-31 11:27 ` Alberto Bursi
@ 2019-04-01 12:04 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2019-04-01 12:04 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block

On 2019-03-30 08:31, Qu Wenruo wrote:
> Hi,
> 
> I'm wondering if it's possible that certain physical device doesn't
> handle flush correctly.
> 
> E.g. some vendor does some complex logical in their hdd controller to
> skip certain flush request (but not all, obviously) to improve performance?
> 
> Do anyone see such reports?
Some OCZ SSDs had issues that could be explained by this type of
behavior (and the associated data-loss problems are part of why they
don't make SSDs any more).

Other than that, I know of no modern _physical_ hardware that does this
(I've got 5.25-inch full-height SCSI-2 disks that have this issue at
work, and am really glad we have no systems that use them anymore).  It
is, however, pretty easy to configure _virtual_ disk drives to behave
like this.
> 
> And if proves to happened before, how do we users detect such problem?
There's unfortunately no good way to do so unless you can get the disk
to drop its write cache without writing out its contents.  Assuming
you can do that, the trivial test is to write a block, issue a FLUSH,
force-drop the cache, and then read back the block that was written.
There were some old SCSI disks that actually let you do this by issuing
some extended SCSI commands, but I don't know of any ATA disks where
this was ever possible, and most modern SCSI disks won't let you do it
unless you flash custom firmware to allow for it.
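
(Sketched out, that test might look like the following - note that
drop_volatile_cache() is hypothetical, standing in for whatever
vendor-specific command discards the drive's cache without writing it
back, and the read-back would need to bypass the kernel page cache,
e.g. O_DIRECT, with buffer alignment glossed over here:)

#include <string.h>
#include <unistd.h>

/* Hypothetical: stands in for a vendor-specific command that discards the
 * drive's volatile write cache without flushing it to media. */
extern int drop_volatile_cache(int fd);

/* Returns 1 if the flushed block survived the cache drop, 0 if it did
 * not, -1 on error. */
static int flush_really_works(int fd, off_t off)
{
    static char wbuf[4096], rbuf[4096];

    memset(wbuf, 0xa5, sizeof(wbuf));
    if (pwrite(fd, wbuf, sizeof(wbuf), off) != (ssize_t)sizeof(wbuf))
        return -1;
    if (fsync(fd))                   /* becomes the FLUSH */
        return -1;
    if (drop_volatile_cache(fd))     /* hypothetical, see above */
        return -1;
    if (pread(fd, rbuf, sizeof(rbuf), off) != (ssize_t)sizeof(rbuf))
        return -1;
    return memcmp(wbuf, rbuf, sizeof(rbuf)) == 0;
}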

Of course, you can always test with throw-away data by manually inducing 
power failures, but that's tedious and hard on the hardware.
> 
> Can we just check the flush time against the write before flush call?
> E.g. write X random blocks into that device, call fsync() on it, check
> the execution time. Repeat Y times, and compare the avg/std.
> And change X to 2X/4X/..., repeat above check.
> 
> Thanks,
> Qu
> 
> 

