From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hannes Reinecke <hare@suse.de>,
	Alberto Bursi <alberto.bursi@outlook.it>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: Is it possible that certain physical disk doesn't implement flush correctly?
Date: Sun, 31 Mar 2019 22:40:17 +0800
Message-ID: <7472d332-3e94-0452-8f6c-5eb61f499830@gmx.com>
In-Reply-To: <e67b8b50-dd2a-2106-5362-913167ea48a8@suse.de>


On 2019/3/31 10:37 PM, Hannes Reinecke wrote:
> On 3/31/19 4:17 PM, Qu Wenruo wrote:
>>
>>
>>> On 2019/3/31 9:36 PM, Hannes Reinecke wrote:
>>> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/3/31 7:27 PM, Alberto Bursi wrote:
>>>>>
>>>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering whether it's possible that a certain physical device
>>>>>> doesn't handle flush correctly.
>>>>>>
>>>>>> E.g. some vendor implements some complex logic in their HDD controller
>>>>>> to skip certain flush requests (but not all, obviously) to improve
>>>>>> performance?
>>>>>>
>>>>>> Has anyone seen such reports?
>>>>>>
>>>>>> And if it is proven to have happened before, how do we users detect
>>>>>> such a problem?
>>>>>>
>>>>>> Can we just check the flush time against the writes issued before the
>>>>>> flush call? E.g. write X random blocks into that device, call fsync()
>>>>>> on it, and check the execution time. Repeat Y times, and compare the
>>>>>> avg/stddev. Then change X to 2X/4X/..., and repeat the check.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>
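The kind of test I have in mind is roughly the following sketch (Python,
untested; /dev/sdX, the block size and the X/Y values below are just
placeholders, and it needs a scratch device that can be safely overwritten):

#!/usr/bin/env python3
# Write X random blocks to a scratch block device, fsync() it, and time
# how long the writeback + cache flush takes; repeat Y times per X.
import os
import random
import statistics
import time

DEV = "/dev/sdX"            # scratch device, WILL BE OVERWRITTEN
BLOCK = 4096                # write granularity in bytes
SPREAD = 1024 * 1024        # spread writes over the first ~4 GiB

def flush_time(fd, x_blocks):
    buf = os.urandom(BLOCK)
    for _ in range(x_blocks):
        off = random.randrange(SPREAD) * BLOCK
        os.pwrite(fd, buf, off)
    start = time.monotonic()
    os.fsync(fd)            # writeback plus a cache flush to the device
    return time.monotonic() - start

fd = os.open(DEV, os.O_WRONLY)
for x in (64, 128, 256, 512):                         # X, 2X, 4X, 8X
    samples = [flush_time(fd, x) for _ in range(16)]  # Y = 16 rounds
    print(f"X={x}: avg={statistics.mean(samples):.4f}s "
          f"stddev={statistics.stdev(samples):.4f}s")
os.close(fd)

If the fsync() time doesn't grow with X at all, that would at least be a
hint that the flush isn't doing what we expect.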
>>>>>
>>>>> Afaik HDDs and SSDs do lie to fsync()
>>>>
>>>> fsync() on a block device is translated into a FLUSH bio.
>>>>
>>>> If all/most consumer-level SATA HDD/SSD devices were lying, there would
>>>> be no power-loss safety at all for any fs, as most filesystems rely on
>>>> the FLUSH bio to implement barriers.
>>>>
>>>> And filesystems with a generation check would all report metadata from
>>>> the future every time a crash happens; even worse, a gracefully
>>>> unmounted fs could end up corrupted.
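(To be explicit about what I mean by "fsync() on a block device" above -- a
minimal example, with a hypothetical device node, is just:

import os
fd = os.open("/dev/sdX", os.O_WRONLY)   # open the block device itself
os.fsync(fd)   # kernel writes back dirty pages and asks the device
               # to flush its volatile write cache
os.close(fd)

so no filesystem is involved at all; the flush goes straight to the device.)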
>>>>
>>> Please, stop making assumptions.
>>
>> I'm not.
>>
>>>
>>> Disks don't 'lie' about anything; they report things according to the
>>> (SCSI) standard.
>>> And the SCSI standard has two ways of ensuring that things are written
>>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
>>> bit in the command.
>>
>> I understand FLUSH and FUA.
>>
>>> The latter provides a way of ensuring that a single command made it to
>>> disk, and the former instructs the drive to:
>>>
>>> "a) perform a write medium operation to the LBA using the logical block
>>> data in volatile cache; or
>>> b) write the logical block to the non-volatile cache, if any."
>>>
>>> which means it's perfectly fine to treat the write-cache as a
>>> _non-volatile_ cache if the RAID HBA is battery-backed, and thus can
>>> make sure that outstanding I/O can be written back even in the case of a
>>> power failure.
>>>
>>> The FUA handling, OTOH, is another matter, and indeed is causing some
>>> raised eyebrows when comparing it to the spec. But that's another story.
>>
>> I don't care about FUA as much, since libata still doesn't support FUA by
>> default and interprets it as FLUSH/WRITE/FLUSH, so it doesn't make things
>> worse.
>>
>> I'm more interested in whether all SATA/NVMe disks follow this FLUSH
>> behavior.
>>
> They have to, in order to be spec compliant.
> 
>> For most cases I believe they do; otherwise, whatever the fs is, whether
>> CoW-based or journal-based, we'd see tons of problems, and even a
>> gracefully unmounted fs could end up corrupted if FLUSH is not implemented
>> well.
>>
>> I'm interested in whether some device doesn't completely follow the
>> regular FLUSH requirement, but instead does some tricks tuned for certain
>> tested filesystems.
>>
> Not that I'm aware of.

That's great to know.

> 
>> E.g. the disk is only tested against a certain fs, and that fs always does
>> something like flush, write, flush, FUA.
>> In that case, if the controller decides to skip the 2nd flush and only
>> does the first flush and the FUA write, and the 2nd write is very small
>> (e.g. a journal), the chance of corruption is pretty low due to the small
>> window.
>>
> Highly unlikely.
> Tweaking flush handling in this way is IMO far too complicated, and
> would only add to the complexity of implementing flush handling in
> firmware in the first place.
> Whereas the whole point of such an exercise would be to _reduce_ complexity
> in firmware (no-one really cares about the hardware here; that's already
> factored in during manufacturing, and reliability is measured in such a
> broad way that it doesn't make sense for the manufacturer to try to
> 'improve' reliability by tweaking the flush algorithm).
> So if someone wanted to save money, they'd do away with flush handling
> entirely and not implement a write cache at all.
> That even saves them money on the hardware, too.

If there are no such reports for consumer-level HDDs/SSDs, then it should be
fine, and that matches my understanding.

Thanks,
Qu

> 
> Cheers,
> 
> Hannes



