From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hannes Reinecke <hare@suse.de>,
	Alberto Bursi <alberto.bursi@outlook.it>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: Is it possible that certain physical disk doesn't implement flush correctly?
Date: Sun, 31 Mar 2019 22:40:17 +0800
Message-ID: <7472d332-3e94-0452-8f6c-5eb61f499830@gmx.com>
In-Reply-To: <e67b8b50-dd2a-2106-5362-913167ea48a8@suse.de>


On 2019/3/31 10:37 PM, Hannes Reinecke wrote:
> On 3/31/19 4:17 PM, Qu Wenruo wrote:
>>
>>
>>> On 2019/3/31 9:36 PM, Hannes Reinecke wrote:
>>> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/3/31 7:27 PM, Alberto Bursi wrote:
>>>>>
>>>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering whether it's possible that a certain physical device
>>>>>> doesn't handle flush correctly.
>>>>>>
>>>>>> E.g. some vendor implements some complex logic in their HDD controller
>>>>>> to skip certain flush requests (but not all, obviously) to improve
>>>>>> performance?
>>>>>>
>>>>>> Has anyone seen such reports?
>>>>>>
>>>>>> And if it is proven to have happened before, how do we users detect
>>>>>> such a problem?
>>>>>>
>>>>>> Can we just check the flush time against the writes issued before the
>>>>>> flush call? E.g. write X random blocks into that device, call fsync()
>>>>>> on it, and check the execution time. Repeat Y times, and compare the
>>>>>> avg/stddev. Then change X to 2X/4X/..., and repeat the check.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>
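The kind of test I have in mind is roughly the following sketch (Python,
untested; /dev/sdX, the block size and the X/Y values below are just
placeholders, and it needs a scratch device that can be safely overwritten):

#!/usr/bin/env python3
# Write X random blocks to a scratch block device, fsync() it, and time
# how long the writeback + cache flush takes; repeat Y times per X.
import os
import random
import statistics
import time

DEV = "/dev/sdX"            # scratch device, WILL BE OVERWRITTEN
BLOCK = 4096                # write granularity in bytes
SPREAD = 1024 * 1024        # spread writes over the first ~4 GiB

def flush_time(fd, x_blocks):
    buf = os.urandom(BLOCK)
    for _ in range(x_blocks):
        off = random.randrange(SPREAD) * BLOCK
        os.pwrite(fd, buf, off)
    start = time.monotonic()
    os.fsync(fd)            # writeback plus a cache flush to the device
    return time.monotonic() - start

fd = os.open(DEV, os.O_WRONLY)
for x in (64, 128, 256, 512):                         # X, 2X, 4X, 8X
    samples = [flush_time(fd, x) for _ in range(16)]  # Y = 16 rounds
    print(f"X={x}: avg={statistics.mean(samples):.4f}s "
          f"stddev={statistics.stdev(samples):.4f}s")
os.close(fd)

If the fsync() time doesn't grow with X at all, that would at least be a
hint that the flush isn't doing what we expect.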
>>>>>
>>>>> Afaik HDDs and SSDs do lie to fsync()
>>>>
>>>> fsync() on a block device is translated into a FLUSH bio.
>>>>
>>>> If all/most consumer-level SATA HDD/SSD devices were lying, there would
>>>> be no power-loss safety at all for any fs, as most filesystems rely on
>>>> the FLUSH bio to implement barriers.
>>>>
>>>> And filesystems with a generation check would all report metadata from
>>>> the future every time a crash happens; even worse, a gracefully
>>>> unmounted fs could end up corrupted.
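(To be explicit about what I mean by "fsync() on a block device" above -- a
minimal example, with a hypothetical device node, is just:

import os
fd = os.open("/dev/sdX", os.O_WRONLY)   # open the block device itself
os.fsync(fd)   # kernel writes back dirty pages and asks the device
               # to flush its volatile write cache
os.close(fd)

so no filesystem is involved at all; the flush goes straight to the device.)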
>>>>
>>> Please, stop making assumptions.
>>
>> I'm not.
>>
>>>
>>> Disks don't 'lie' about anything; they report things according to the
>>> (SCSI) standard.
>>> And the SCSI standard has two ways of ensuring that things are written
>>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
>>> bit in the command.
>>
>> I understand FLUSH and FUA.
>>
>>> The latter provides a way of ensuring that a single command made it to
>>> disk, and the former instructs the drive to:
>>>
>>> "a) perform a write medium operation to the LBA using the logical block
>>> data in volatile cache; or
>>> b) write the logical block to the non-volatile cache, if any."
>>>
>>> which means it's perfectly fine to treat the write-cache as a
>>> _non-volatile_ cache if the RAID HBA is battery-backed, and thus can
>>> make sure that outstanding I/O can be written back even in the case of a
>>> power failure.
>>>
>>> The FUA handling, OTOH, is another matter, and indeed is causing some
>>> raised eyebrows when comparing it to the spec. But that's another story.
>>
>> I don't care about FUA as much, since libata still doesn't support FUA by
>> default and interprets it as FLUSH/WRITE/FLUSH, so it doesn't make things
>> worse.
>>
>> I'm more interested in whether all SATA/NVMe disks follow this FLUSH
>> behavior.
>>
> They have to, in order to be spec compliant.
> 
>> For most cases I believe they do; otherwise, whatever the fs is, whether
>> CoW-based or journal-based, we'd see tons of problems, and even a
>> gracefully unmounted fs could end up corrupted if FLUSH is not implemented
>> well.
>>
>> I'm interested in whether some device doesn't completely follow the
>> regular FLUSH requirement, but instead does some tricks tuned for certain
>> tested filesystems.
>>
> Not that I'm aware of.

That's great to know.

> 
>> E.g. the disk is only tested against a certain fs, and that fs always does
>> something like flush, write, flush, FUA.
>> In that case, if the controller decides to skip the 2nd flush and only
>> does the first flush and the FUA write, and the 2nd write is very small
>> (e.g. a journal), the chance of corruption is pretty low due to the small
>> window.
>>
> Highly unlikely.
> Tweaking flush handling in this way is IMO far too complicated, and
> would only add to the complexity of implementing flush handling in
> firmware in the first place.
> Whereas the whole point of such an exercise would be to _reduce_ complexity
> in firmware (no-one really cares about the hardware here; that's already
> factored in during manufacturing, and reliability is measured in such a
> broad way that it doesn't make sense for the manufacturer to try to
> 'improve' reliability by tweaking the flush algorithm).
> So if someone wanted to save money, they'd do away with flush handling
> entirely and not implement a write cache at all.
> That even saves them money on the hardware, too.

If there are no such reports for consumer-level HDDs/SSDs, then it should be
fine, and that matches my understanding.

Thanks,
Qu

> 
> Cheers,
> 
> Hannes



