* Is it possible that certain physical disk doesn't implement flush correctly? @ 2019-03-30 12:31 Qu Wenruo 2019-03-30 12:57 ` Supercilious Dude ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Qu Wenruo @ 2019-03-30 12:31 UTC (permalink / raw) To: linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 630 bytes --] Hi, I'm wondering if it's possible that certain physical device doesn't handle flush correctly. E.g. some vendor does some complex logical in their hdd controller to skip certain flush request (but not all, obviously) to improve performance? Do anyone see such reports? And if proves to happened before, how do we users detect such problem? Can we just check the flush time against the write before flush call? E.g. write X random blocks into that device, call fsync() on it, check the execution time. Repeat Y times, and compare the avg/std. And change X to 2X/4X/..., repeat above check. Thanks, Qu [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo @ 2019-03-30 12:57 ` Supercilious Dude 2019-03-30 13:00 ` Qu Wenruo 2019-03-31 11:27 ` Alberto Bursi 2019-04-01 12:04 ` Austin S. Hemmelgarn 2 siblings, 1 reply; 18+ messages in thread From: Supercilious Dude @ 2019-03-30 12:57 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block Hi, This would give a false positive on any controller with a cache as all requests would take 0ms until the cache is full and the controller has to actually flush to disk. I am using an HP P841 controller in my test system and it has a 4GB cache making every IO instant unless there are enough of them that they can't be flushed to the disks as quickly as they come in - the latency variation is huge depending on load. Regards On Sat, 30 Mar 2019 at 12:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > Hi, > > I'm wondering if it's possible that certain physical device doesn't > handle flush correctly. > > E.g. some vendor does some complex logical in their hdd controller to > skip certain flush request (but not all, obviously) to improve performance? > > Do anyone see such reports? > > And if proves to happened before, how do we users detect such problem? > > Can we just check the flush time against the write before flush call? > E.g. write X random blocks into that device, call fsync() on it, check > the execution time. Repeat Y times, and compare the avg/std. > And change X to 2X/4X/..., repeat above check. > > Thanks, > Qu > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 12:57 ` Supercilious Dude @ 2019-03-30 13:00 ` Qu Wenruo 2019-03-30 13:04 ` Supercilious Dude 0 siblings, 1 reply; 18+ messages in thread From: Qu Wenruo @ 2019-03-30 13:00 UTC (permalink / raw) To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 1464 bytes --] On 2019/3/30 下午8:57, Supercilious Dude wrote: > Hi, > > This would give a false positive on any controller with a cache as all > requests would take 0ms until the cache is full and the controller has > to actually flush to disk. I'm purposing to measure the execution time of flush/fsync, not write. And if flush takes 0ms, it means it doesn't really write cached data onto disk. Thanks, Qu > I am using an HP P841 controller in my test > system and it has a 4GB cache making every IO instant unless there are > enough of them that they can't be flushed to the disks as quickly as > they come in - the latency variation is huge depending on load. > > Regards > > On Sat, 30 Mar 2019 at 12:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> Hi, >> >> I'm wondering if it's possible that certain physical device doesn't >> handle flush correctly. >> >> E.g. some vendor does some complex logical in their hdd controller to >> skip certain flush request (but not all, obviously) to improve performance? >> >> Do anyone see such reports? >> >> And if proves to happened before, how do we users detect such problem? >> >> Can we just check the flush time against the write before flush call? >> E.g. write X random blocks into that device, call fsync() on it, check >> the execution time. Repeat Y times, and compare the avg/std. >> And change X to 2X/4X/..., repeat above check. >> >> Thanks, >> Qu >> >> [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 13:00 ` Qu Wenruo @ 2019-03-30 13:04 ` Supercilious Dude 2019-03-30 13:09 ` Qu Wenruo 0 siblings, 1 reply; 18+ messages in thread From: Supercilious Dude @ 2019-03-30 13:04 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block On Sat, 30 Mar 2019 at 13:00, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > I'm purposing to measure the execution time of flush/fsync, not write. > > And if flush takes 0ms, it means it doesn't really write cached data > onto disk. > That is correct. The controller ignores your flush requests on the virtual disk by design. When the data hits the controller it is considered "stored" - the physical disk(s) storing the virtual disk is an implementation detail. The performance characteristics of these controllers are needed to make big arrays work in a useful manner. My controller is connected to 4 HP 2600 enclosures with 12 drives each. Waiting for a flush on a single disk before continuing work on the remaining 47 disks would be catastrophic for performance. Regards ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 13:04 ` Supercilious Dude @ 2019-03-30 13:09 ` Qu Wenruo 2019-03-30 13:14 ` Supercilious Dude 0 siblings, 1 reply; 18+ messages in thread From: Qu Wenruo @ 2019-03-30 13:09 UTC (permalink / raw) To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 1319 bytes --] On 2019/3/30 下午9:04, Supercilious Dude wrote: > On Sat, 30 Mar 2019 at 13:00, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> I'm purposing to measure the execution time of flush/fsync, not write. >> >> And if flush takes 0ms, it means it doesn't really write cached data >> onto disk. >> > > That is correct. The controller ignores your flush requests on the > virtual disk by design. When the data hits the controller it is > considered "stored" - the physical disk(s) storing the virtual disk is > an implementation detail. The performance characteristics of these > controllers are needed to make big arrays work in a useful manner. My > controller is connected to 4 HP 2600 enclosures with 12 drives each. > Waiting for a flush on a single disk before continuing work on the > remaining 47 disks would be catastrophic for performance. If controller is doing so, it must have its own power or at least finish flush when controller writes to its fast cache. For cache case, if we have enough data, we could still find some clue on the flush execution time. Despite that, for that enterprise level usage, it's OK. But for consumer level storage, I'm not sure, especially for HDDs, and maybe NVMe devices. So my question still stands here. Thanks, Qu > > Regards > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 13:09 ` Qu Wenruo @ 2019-03-30 13:14 ` Supercilious Dude 2019-03-30 13:24 ` Qu Wenruo 0 siblings, 1 reply; 18+ messages in thread From: Supercilious Dude @ 2019-03-30 13:14 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, Linux FS Devel, linux-block On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > If controller is doing so, it must have its own power or at least finish > flush when controller writes to its fast cache. > The controller has its own battery backup to power the DRAM cache, as well as flash storage to dump it onto in the exceedingly unlikely event that the battery gets depleted. > For cache case, if we have enough data, we could still find some clue on > the flush execution time. > > Despite that, for that enterprise level usage, it's OK. > > But for consumer level storage, I'm not sure, especially for HDDs, and > maybe NVMe devices. > How do you distinguish who is a who? Am I an enterprise or a consumer? Regards ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 13:14 ` Supercilious Dude @ 2019-03-30 13:24 ` Qu Wenruo 2019-03-31 22:45 ` J. Bruce Fields 0 siblings, 1 reply; 18+ messages in thread From: Qu Wenruo @ 2019-03-30 13:24 UTC (permalink / raw) To: Supercilious Dude; +Cc: linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 1087 bytes --] On 2019/3/30 下午9:14, Supercilious Dude wrote: > On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> If controller is doing so, it must have its own power or at least finish >> flush when controller writes to its fast cache. >> > > The controller has its own battery backup to power the DRAM cache, as > well as flash storage to dump it onto in the exceedingly unlikely > event that the battery gets depleted. > >> For cache case, if we have enough data, we could still find some clue on >> the flush execution time. >> >> Despite that, for that enterprise level usage, it's OK. >> >> But for consumer level storage, I'm not sure, especially for HDDs, and >> maybe NVMe devices. >> > > How do you distinguish who is a who? Am I an enterprise or a consumer? Easy, price. :P To be honest, I don't really care about that fancy use case. It's the vendor doing its work, and if something wrong happened, customer will yell at them. I'm more interesting in the consumer level situation. Thanks, Qu > > Regards > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 13:24 ` Qu Wenruo @ 2019-03-31 22:45 ` J. Bruce Fields 2019-03-31 23:07 ` Alberto Bursi 0 siblings, 1 reply; 18+ messages in thread From: J. Bruce Fields @ 2019-03-31 22:45 UTC (permalink / raw) To: Qu Wenruo; +Cc: Supercilious Dude, linux-btrfs, Linux FS Devel, linux-block On Sat, Mar 30, 2019 at 09:24:37PM +0800, Qu Wenruo wrote: > > > On 2019/3/30 下午9:14, Supercilious Dude wrote: > > On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> > >> If controller is doing so, it must have its own power or at least finish > >> flush when controller writes to its fast cache. > >> > > > > The controller has its own battery backup to power the DRAM cache, as > > well as flash storage to dump it onto in the exceedingly unlikely > > event that the battery gets depleted. > > > >> For cache case, if we have enough data, we could still find some clue on > >> the flush execution time. > >> > >> Despite that, for that enterprise level usage, it's OK. > >> > >> But for consumer level storage, I'm not sure, especially for HDDs, and > >> maybe NVMe devices. > >> > > > > How do you distinguish who is a who? Am I an enterprise or a consumer? > > Easy, price. :P > > To be honest, I don't really care about that fancy use case. > It's the vendor doing its work, and if something wrong happened, > customer will yell at them. > > I'm more interesting in the consumer level situation. The feature seems to be advertised as "power loss protection" or "enhanced power loss data protection". Which makes it sound like a data safety feature when really it's a performance feature. E.g. these are the Intel drives with "EPLDP": https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=35125&0_EPLDP=True Last I checked there were some that weren't too expensive. --b. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 22:45 ` J. Bruce Fields @ 2019-03-31 23:07 ` Alberto Bursi 0 siblings, 0 replies; 18+ messages in thread From: Alberto Bursi @ 2019-03-31 23:07 UTC (permalink / raw) To: J. Bruce Fields, Qu Wenruo Cc: Supercilious Dude, linux-btrfs, Linux FS Devel, linux-block On 01/04/19 00:45, J. Bruce Fields wrote: > On Sat, Mar 30, 2019 at 09:24:37PM +0800, Qu Wenruo wrote: >> >> On 2019/3/30 下午9:14, Supercilious Dude wrote: >>> On Sat, 30 Mar 2019 at 13:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> If controller is doing so, it must have its own power or at least finish >>>> flush when controller writes to its fast cache. >>>> >>> The controller has its own battery backup to power the DRAM cache, as >>> well as flash storage to dump it onto in the exceedingly unlikely >>> event that the battery gets depleted. >>> >>>> For cache case, if we have enough data, we could still find some clue on >>>> the flush execution time. >>>> >>>> Despite that, for that enterprise level usage, it's OK. >>>> >>>> But for consumer level storage, I'm not sure, especially for HDDs, and >>>> maybe NVMe devices. >>>> >>> How do you distinguish who is a who? Am I an enterprise or a consumer? >> Easy, price. :P >> >> To be honest, I don't really care about that fancy use case. >> It's the vendor doing its work, and if something wrong happened, >> customer will yell at them. >> >> I'm more interesting in the consumer level situation. > The feature seems to be advertised as "power loss protection" or > "enhanced power loss data protection". Which makes it sound like a data > safety feature when really it's a performance feature. E.g. these are > the Intel drives with "EPLDP": > > https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=35125&0_EPLDP=True > > Last I checked there were some that weren't too expensive. > > --b. Afaik quite a few consumer Crucial SSDs do have power loss protection (those that advertise it either have a large bank of capacitors on their PCB or use newer flash that for some reason can do without that) -Alberto ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo 2019-03-30 12:57 ` Supercilious Dude @ 2019-03-31 11:27 ` Alberto Bursi 2019-03-31 12:00 ` Qu Wenruo ` (2 more replies) 2019-04-01 12:04 ` Austin S. Hemmelgarn 2 siblings, 3 replies; 18+ messages in thread From: Alberto Bursi @ 2019-03-31 11:27 UTC (permalink / raw) To: Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block On 30/03/19 13:31, Qu Wenruo wrote: > Hi, > > I'm wondering if it's possible that certain physical device doesn't > handle flush correctly. > > E.g. some vendor does some complex logical in their hdd controller to > skip certain flush request (but not all, obviously) to improve performance? > > Do anyone see such reports? > > And if proves to happened before, how do we users detect such problem? > > Can we just check the flush time against the write before flush call? > E.g. write X random blocks into that device, call fsync() on it, check > the execution time. Repeat Y times, and compare the avg/std. > And change X to 2X/4X/..., repeat above check. > > Thanks, > Qu > > Afaik HDDs and SSDs do lie to fsync() unless the write cache is turned off with hdparm, hdparm -W0 /dev/sda similarly to RAID controllers. see below https://brad.livejournal.com/2116715.html https://queue.acm.org/detail.cfm?id=2367378 - ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 11:27 ` Alberto Bursi @ 2019-03-31 12:00 ` Qu Wenruo 2019-03-31 13:36 ` Hannes Reinecke 2019-03-31 12:21 ` Andrei Borzenkov 2019-04-01 11:55 ` Austin S. Hemmelgarn 2 siblings, 1 reply; 18+ messages in thread From: Qu Wenruo @ 2019-03-31 12:00 UTC (permalink / raw) To: Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 1505 bytes --] On 2019/3/31 下午7:27, Alberto Bursi wrote: > > On 30/03/19 13:31, Qu Wenruo wrote: >> Hi, >> >> I'm wondering if it's possible that certain physical device doesn't >> handle flush correctly. >> >> E.g. some vendor does some complex logical in their hdd controller to >> skip certain flush request (but not all, obviously) to improve performance? >> >> Do anyone see such reports? >> >> And if proves to happened before, how do we users detect such problem? >> >> Can we just check the flush time against the write before flush call? >> E.g. write X random blocks into that device, call fsync() on it, check >> the execution time. Repeat Y times, and compare the avg/std. >> And change X to 2X/4X/..., repeat above check. >> >> Thanks, >> Qu >> >> > > Afaik HDDs and SSDs do lie to fsync() fsync() on block device is interpreted into FLUSH bio. If all/most consumer level SATA HDD/SSD devices are lying, then there is no power loss safety at all for any fs. As most fs relies on FLUSH bio to implement barrier. And for fs with generation check, they all should report metadata from the future every time a crash happens, or even worse gracefully umounting fs would cause corruption. Thanks, Qu > > unless the write cache is turned off with hdparm, > > hdparm -W0 /dev/sda > > similarly to RAID controllers. > > see below > > https://brad.livejournal.com/2116715.html > > https://queue.acm.org/detail.cfm?id=2367378 > > > - > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 12:00 ` Qu Wenruo @ 2019-03-31 13:36 ` Hannes Reinecke 2019-03-31 14:17 ` Qu Wenruo 0 siblings, 1 reply; 18+ messages in thread From: Hannes Reinecke @ 2019-03-31 13:36 UTC (permalink / raw) To: Qu Wenruo, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block On 3/31/19 2:00 PM, Qu Wenruo wrote: > > > On 2019/3/31 下午7:27, Alberto Bursi wrote: >> >> On 30/03/19 13:31, Qu Wenruo wrote: >>> Hi, >>> >>> I'm wondering if it's possible that certain physical device doesn't >>> handle flush correctly. >>> >>> E.g. some vendor does some complex logical in their hdd controller to >>> skip certain flush request (but not all, obviously) to improve performance? >>> >>> Do anyone see such reports? >>> >>> And if proves to happened before, how do we users detect such problem? >>> >>> Can we just check the flush time against the write before flush call? >>> E.g. write X random blocks into that device, call fsync() on it, check >>> the execution time. Repeat Y times, and compare the avg/std. >>> And change X to 2X/4X/..., repeat above check. >>> >>> Thanks, >>> Qu >>> >>> >> >> Afaik HDDs and SSDs do lie to fsync() > > fsync() on block device is interpreted into FLUSH bio. > > If all/most consumer level SATA HDD/SSD devices are lying, then there is > no power loss safety at all for any fs. As most fs relies on FLUSH bio > to implement barrier. > > And for fs with generation check, they all should report metadata from > the future every time a crash happens, or even worse gracefully > umounting fs would cause corruption. > Please, stop making assumptions. Disks don't 'lie' about anything, they report things according to the (SCSI) standard. And the SCSI standard has two ways of ensuring that things are written to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access) bit in the command. The latter provides a way of ensuring that a single command made it to disk, and the former instructs the driver to: "a) perform a write medium operation to the LBA using the logical block data in volatile cache; or b) write the logical block to the non-volatile cache, if any." which means it's perfectly fine to treat the write-cache as a _non-volative_ cache if the RAID HBA is battery backed, and thus can make sure that outstanding I/O can be written back even in the case of a power failure. The FUA handling, OTOH, is another matter, and indeed is causing some raised eyebrows when comparing it to the spec. But that's another story. Cheers, Hannes -- r. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 13:36 ` Hannes Reinecke @ 2019-03-31 14:17 ` Qu Wenruo 2019-03-31 14:37 ` Hannes Reinecke 0 siblings, 1 reply; 18+ messages in thread From: Qu Wenruo @ 2019-03-31 14:17 UTC (permalink / raw) To: Hannes Reinecke, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 3551 bytes --] On 2019/3/31 下午9:36, Hannes Reinecke wrote: > On 3/31/19 2:00 PM, Qu Wenruo wrote: >> >> >> On 2019/3/31 下午7:27, Alberto Bursi wrote: >>> >>> On 30/03/19 13:31, Qu Wenruo wrote: >>>> Hi, >>>> >>>> I'm wondering if it's possible that certain physical device doesn't >>>> handle flush correctly. >>>> >>>> E.g. some vendor does some complex logical in their hdd controller to >>>> skip certain flush request (but not all, obviously) to improve >>>> performance? >>>> >>>> Do anyone see such reports? >>>> >>>> And if proves to happened before, how do we users detect such problem? >>>> >>>> Can we just check the flush time against the write before flush call? >>>> E.g. write X random blocks into that device, call fsync() on it, check >>>> the execution time. Repeat Y times, and compare the avg/std. >>>> And change X to 2X/4X/..., repeat above check. >>>> >>>> Thanks, >>>> Qu >>>> >>>> >>> >>> Afaik HDDs and SSDs do lie to fsync() >> >> fsync() on block device is interpreted into FLUSH bio. >> >> If all/most consumer level SATA HDD/SSD devices are lying, then there is >> no power loss safety at all for any fs. As most fs relies on FLUSH bio >> to implement barrier. >> >> And for fs with generation check, they all should report metadata from >> the future every time a crash happens, or even worse gracefully >> umounting fs would cause corruption. >> > Please, stop making assumptions. I'm not. > > Disks don't 'lie' about anything, they report things according to the > (SCSI) standard. > And the SCSI standard has two ways of ensuring that things are written > to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access) > bit in the command. I understand FLUSH and FUA. > The latter provides a way of ensuring that a single command made it to > disk, and the former instructs the driver to: > > "a) perform a write medium operation to the LBA using the logical block > data in volatile cache; or > b) write the logical block to the non-volatile cache, if any." > > which means it's perfectly fine to treat the write-cache as a > _non-volative_ cache if the RAID HBA is battery backed, and thus can > make sure that outstanding I/O can be written back even in the case of a > power failure. > > The FUA handling, OTOH, is another matter, and indeed is causing some > raised eyebrows when comparing it to the spec. But that's another story. I don't care FUA as much, since libata still doesn't support FUA by default and interpret it as FLUSH/WRITE/FLUSH, so it doesn't make things worse. I'm more interesting in, are all SATA/NVMe disks follows this FLUSH behavior? For most case, I believe it is, or whatever the fs is, either CoW based or journal based, we're going to see tons of problems, even gracefully unmounted fs can have corruption if FLUSH is not implemented well. I'm interested in, is there some device doesn't completely follow regular FLUSH requirement, but do some tricks, for certain tested fs. E.g. the disk is only tested for certain fs, and that fs always does something like flush, write flush, fua. In that case, if the controller decides to skip the 2nd flush, but only do the first flush and fua, if the 2nd write is very small (e.g. journal), the chance of corruption is pretty low due to the small window. In that case, the disk could perform a little better, with increase corruption possibility. I just want to wipe out this case. Thanks, Qu > > Cheers, > > Hannes [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 14:17 ` Qu Wenruo @ 2019-03-31 14:37 ` Hannes Reinecke 2019-03-31 14:40 ` Qu Wenruo 0 siblings, 1 reply; 18+ messages in thread From: Hannes Reinecke @ 2019-03-31 14:37 UTC (permalink / raw) To: Qu Wenruo, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block On 3/31/19 4:17 PM, Qu Wenruo wrote: > > > On 2019/3/31 下午9:36, Hannes Reinecke wrote: >> On 3/31/19 2:00 PM, Qu Wenruo wrote: >>> >>> >>> On 2019/3/31 下午7:27, Alberto Bursi wrote: >>>> >>>> On 30/03/19 13:31, Qu Wenruo wrote: >>>>> Hi, >>>>> >>>>> I'm wondering if it's possible that certain physical device doesn't >>>>> handle flush correctly. >>>>> >>>>> E.g. some vendor does some complex logical in their hdd controller to >>>>> skip certain flush request (but not all, obviously) to improve >>>>> performance? >>>>> >>>>> Do anyone see such reports? >>>>> >>>>> And if proves to happened before, how do we users detect such problem? >>>>> >>>>> Can we just check the flush time against the write before flush call? >>>>> E.g. write X random blocks into that device, call fsync() on it, check >>>>> the execution time. Repeat Y times, and compare the avg/std. >>>>> And change X to 2X/4X/..., repeat above check. >>>>> >>>>> Thanks, >>>>> Qu >>>>> >>>>> >>>> >>>> Afaik HDDs and SSDs do lie to fsync() >>> >>> fsync() on block device is interpreted into FLUSH bio. >>> >>> If all/most consumer level SATA HDD/SSD devices are lying, then there is >>> no power loss safety at all for any fs. As most fs relies on FLUSH bio >>> to implement barrier. >>> >>> And for fs with generation check, they all should report metadata from >>> the future every time a crash happens, or even worse gracefully >>> umounting fs would cause corruption. >>> >> Please, stop making assumptions. > > I'm not. > >> >> Disks don't 'lie' about anything, they report things according to the >> (SCSI) standard. >> And the SCSI standard has two ways of ensuring that things are written >> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access) >> bit in the command. > > I understand FLUSH and FUA. > >> The latter provides a way of ensuring that a single command made it to >> disk, and the former instructs the driver to: >> >> "a) perform a write medium operation to the LBA using the logical block >> data in volatile cache; or >> b) write the logical block to the non-volatile cache, if any." >> >> which means it's perfectly fine to treat the write-cache as a >> _non-volative_ cache if the RAID HBA is battery backed, and thus can >> make sure that outstanding I/O can be written back even in the case of a >> power failure. >> >> The FUA handling, OTOH, is another matter, and indeed is causing some >> raised eyebrows when comparing it to the spec. But that's another story. > > I don't care FUA as much, since libata still doesn't support FUA by > default and interpret it as FLUSH/WRITE/FLUSH, so it doesn't make things > worse. > > I'm more interesting in, are all SATA/NVMe disks follows this FLUSH > behavior? > They have to to be spec compliant. > For most case, I believe it is, or whatever the fs is, either CoW based > or journal based, we're going to see tons of problems, even gracefully > unmounted fs can have corruption if FLUSH is not implemented well. > > I'm interested in, is there some device doesn't completely follow > regular FLUSH requirement, but do some tricks, for certain tested fs. > Not that I'm aware of. > E.g. the disk is only tested for certain fs, and that fs always does > something like flush, write flush, fua. > In that case, if the controller decides to skip the 2nd flush, but only > do the first flush and fua, if the 2nd write is very small (e.g. > journal), the chance of corruption is pretty low due to the small window. > Highly unlikely. Tweaking flush handling in this way is IMO far too complicated, and would only add to the complexity of adding flush handling in firmware in the first place. Whereas the whole point of this exercise would be to _reduce_ complexity in firmware (no-one really cares about the hardware here; that's already factored in during manufacturing, and reliability is measured in such a broad way that it doesn't make sense for the manufacture to try to 'improve' reliability by tweaking the flush algorithm). So if someone would be wanting to save money they'd do away with the entire flush handling and do not implement a write cache at all. That even saves them money on the hardware, too. Cheers, Hannes -- r. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 14:37 ` Hannes Reinecke @ 2019-03-31 14:40 ` Qu Wenruo 0 siblings, 0 replies; 18+ messages in thread From: Qu Wenruo @ 2019-03-31 14:40 UTC (permalink / raw) To: Hannes Reinecke, Alberto Bursi, linux-btrfs, Linux FS Devel, linux-block [-- Attachment #1.1: Type: text/plain, Size: 4675 bytes --] On 2019/3/31 下午10:37, Hannes Reinecke wrote: > On 3/31/19 4:17 PM, Qu Wenruo wrote: >> >> >> On 2019/3/31 下午9:36, Hannes Reinecke wrote: >>> On 3/31/19 2:00 PM, Qu Wenruo wrote: >>>> >>>> >>>> On 2019/3/31 下午7:27, Alberto Bursi wrote: >>>>> >>>>> On 30/03/19 13:31, Qu Wenruo wrote: >>>>>> Hi, >>>>>> >>>>>> I'm wondering if it's possible that certain physical device doesn't >>>>>> handle flush correctly. >>>>>> >>>>>> E.g. some vendor does some complex logical in their hdd controller to >>>>>> skip certain flush request (but not all, obviously) to improve >>>>>> performance? >>>>>> >>>>>> Do anyone see such reports? >>>>>> >>>>>> And if proves to happened before, how do we users detect such >>>>>> problem? >>>>>> >>>>>> Can we just check the flush time against the write before flush call? >>>>>> E.g. write X random blocks into that device, call fsync() on it, >>>>>> check >>>>>> the execution time. Repeat Y times, and compare the avg/std. >>>>>> And change X to 2X/4X/..., repeat above check. >>>>>> >>>>>> Thanks, >>>>>> Qu >>>>>> >>>>>> >>>>> >>>>> Afaik HDDs and SSDs do lie to fsync() >>>> >>>> fsync() on block device is interpreted into FLUSH bio. >>>> >>>> If all/most consumer level SATA HDD/SSD devices are lying, then >>>> there is >>>> no power loss safety at all for any fs. As most fs relies on FLUSH bio >>>> to implement barrier. >>>> >>>> And for fs with generation check, they all should report metadata from >>>> the future every time a crash happens, or even worse gracefully >>>> umounting fs would cause corruption. >>>> >>> Please, stop making assumptions. >> >> I'm not. >> >>> >>> Disks don't 'lie' about anything, they report things according to the >>> (SCSI) standard. >>> And the SCSI standard has two ways of ensuring that things are written >>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access) >>> bit in the command. >> >> I understand FLUSH and FUA. >> >>> The latter provides a way of ensuring that a single command made it to >>> disk, and the former instructs the driver to: >>> >>> "a) perform a write medium operation to the LBA using the logical block >>> data in volatile cache; or >>> b) write the logical block to the non-volatile cache, if any." >>> >>> which means it's perfectly fine to treat the write-cache as a >>> _non-volative_ cache if the RAID HBA is battery backed, and thus can >>> make sure that outstanding I/O can be written back even in the case of a >>> power failure. >>> >>> The FUA handling, OTOH, is another matter, and indeed is causing some >>> raised eyebrows when comparing it to the spec. But that's another story. >> >> I don't care FUA as much, since libata still doesn't support FUA by >> default and interpret it as FLUSH/WRITE/FLUSH, so it doesn't make things >> worse. >> >> I'm more interesting in, are all SATA/NVMe disks follows this FLUSH >> behavior? >> > They have to to be spec compliant. > >> For most case, I believe it is, or whatever the fs is, either CoW based >> or journal based, we're going to see tons of problems, even gracefully >> unmounted fs can have corruption if FLUSH is not implemented well. >> >> I'm interested in, is there some device doesn't completely follow >> regular FLUSH requirement, but do some tricks, for certain tested fs. >> > Not that I'm aware of. That's great to know. > >> E.g. the disk is only tested for certain fs, and that fs always does >> something like flush, write flush, fua. >> In that case, if the controller decides to skip the 2nd flush, but only >> do the first flush and fua, if the 2nd write is very small (e.g. >> journal), the chance of corruption is pretty low due to the small window. >> > Highly unlikely. > Tweaking flush handling in this way is IMO far too complicated, and > would only add to the complexity of adding flush handling in firmware in > the first place. > Whereas the whole point of this exercise would be to _reduce_ complexity > in firmware (no-one really cares about the hardware here; that's already > factored in during manufacturing, and reliability is measured in such a > broad way that it doesn't make sense for the manufacture to try to > 'improve' reliability by tweaking the flush algorithm). > So if someone would be wanting to save money they'd do away with the > entire flush handling and do not implement a write cache at all. > That even saves them money on the hardware, too. If there is no report for consumer level hdd/ssd, then it should be fine, and matches my understanding. Thanks, Qu > > Cheers, > > Hannes [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 11:27 ` Alberto Bursi 2019-03-31 12:00 ` Qu Wenruo @ 2019-03-31 12:21 ` Andrei Borzenkov 2019-04-01 11:55 ` Austin S. Hemmelgarn 2 siblings, 0 replies; 18+ messages in thread From: Andrei Borzenkov @ 2019-03-31 12:21 UTC (permalink / raw) To: Alberto Bursi, Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block 31.03.2019 14:27, Alberto Bursi пишет: > > On 30/03/19 13:31, Qu Wenruo wrote: >> Hi, >> >> I'm wondering if it's possible that certain physical device doesn't >> handle flush correctly. >> >> E.g. some vendor does some complex logical in their hdd controller to >> skip certain flush request (but not all, obviously) to improve performance? >> >> Do anyone see such reports? >> >> And if proves to happened before, how do we users detect such problem? >> >> Can we just check the flush time against the write before flush call? >> E.g. write X random blocks into that device, call fsync() on it, check >> the execution time. Repeat Y times, and compare the avg/std. >> And change X to 2X/4X/..., repeat above check. >> >> Thanks, >> Qu >> >> > > Afaik HDDs and SSDs do lie to fsync() > > unless the write cache is turned off with hdparm, I know at least one case of SSD that are claimed to flush cache in case of power loss. I can dig up details if anyone is interested. > > hdparm -W0 /dev/sda > > similarly to RAID controllers. > > see below > > https://brad.livejournal.com/2116715.html > > https://queue.acm.org/detail.cfm?id=2367378 > > > - > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-31 11:27 ` Alberto Bursi 2019-03-31 12:00 ` Qu Wenruo 2019-03-31 12:21 ` Andrei Borzenkov @ 2019-04-01 11:55 ` Austin S. Hemmelgarn 2 siblings, 0 replies; 18+ messages in thread From: Austin S. Hemmelgarn @ 2019-04-01 11:55 UTC (permalink / raw) To: Alberto Bursi, Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block On 2019-03-31 07:27, Alberto Bursi wrote: > > On 30/03/19 13:31, Qu Wenruo wrote: >> Hi, >> >> I'm wondering if it's possible that certain physical device doesn't >> handle flush correctly. >> >> E.g. some vendor does some complex logical in their hdd controller to >> skip certain flush request (but not all, obviously) to improve performance? >> >> Do anyone see such reports? >> >> And if proves to happened before, how do we users detect such problem? >> >> Can we just check the flush time against the write before flush call? >> E.g. write X random blocks into that device, call fsync() on it, check >> the execution time. Repeat Y times, and compare the avg/std. >> And change X to 2X/4X/..., repeat above check. >> >> Thanks, >> Qu >> >> > > Afaik HDDs and SSDs do lie to fsync() > > unless the write cache is turned off with hdparm, Nope, not the case on modern Linux. The issue here was that Linux did not issue a FLUSH bio as part of the completion of an fsync() system call, and problem has long-since been fixed (and wasn't actually the disk lying, but the kernel). > > hdparm -W0 /dev/sda > > similarly to RAID controllers. And most RAID controllers don't actually lie either. The SCSI and ATA standards both count a write that is stored in a _non-volatile_ cache as completed, and any halfway-decent RAID controller will be using some form of non-volatile storage for it's cache (classically battery-backed SRAM, but there's been some shift to NOR or NAND flash storage recently, and I've seen a couple of really expensive ones using more-exotic non-volatile storage technologies). > > see below > > https://brad.livejournal.com/2116715.html > > https://queue.acm.org/detail.cfm?id=2367378 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Is it possible that certain physical disk doesn't implement flush correctly? 2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo 2019-03-30 12:57 ` Supercilious Dude 2019-03-31 11:27 ` Alberto Bursi @ 2019-04-01 12:04 ` Austin S. Hemmelgarn 2 siblings, 0 replies; 18+ messages in thread From: Austin S. Hemmelgarn @ 2019-04-01 12:04 UTC (permalink / raw) To: Qu Wenruo, linux-btrfs, Linux FS Devel, linux-block On 2019-03-30 08:31, Qu Wenruo wrote: > Hi, > > I'm wondering if it's possible that certain physical device doesn't > handle flush correctly. > > E.g. some vendor does some complex logical in their hdd controller to > skip certain flush request (but not all, obviously) to improve performance? > > Do anyone see such reports? Some OCZ SSD's had issues that could be explained by this type of behavior (and the associated data-loss problems are part of why they don't make SSD's any more). Other than that, I know of no modern _physical_ hardware that does this (I've got 5.25 inch full-height SCSI-2 disks that have this issue at work, and am really glad we have no systems that use them anymore). It is, however, pretty easy to configure _virtual_ disk drives to behave like this. > > And if proves to happened before, how do we users detect such problem? There's unfortunately no good way to do so unless you can get the disk to drop it's write cache without writing out it's contents. Assuming you can do that, the trivial test is to write a block, issue a FLUSH, force drop the cache, and then read-back the block that was written. There were some old SCSI disks that actually let you do this by issuing some extended SCSI commands, but I don't know of any ATA disks where this was ever possible, and most modern SCSI disks won't let you do it unless you flash custom firmware to allow for it. Of course, you can always test with throw-away data by manually inducing power failures, but that's tedious and hard on the hardware. > > Can we just check the flush time against the write before flush call? > E.g. write X random blocks into that device, call fsync() on it, check > the execution time. Repeat Y times, and compare the avg/std. > And change X to 2X/4X/..., repeat above check. > > Thanks, > Qu > > ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2019-04-01 12:05 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-03-30 12:31 Is it possible that certain physical disk doesn't implement flush correctly? Qu Wenruo 2019-03-30 12:57 ` Supercilious Dude 2019-03-30 13:00 ` Qu Wenruo 2019-03-30 13:04 ` Supercilious Dude 2019-03-30 13:09 ` Qu Wenruo 2019-03-30 13:14 ` Supercilious Dude 2019-03-30 13:24 ` Qu Wenruo 2019-03-31 22:45 ` J. Bruce Fields 2019-03-31 23:07 ` Alberto Bursi 2019-03-31 11:27 ` Alberto Bursi 2019-03-31 12:00 ` Qu Wenruo 2019-03-31 13:36 ` Hannes Reinecke 2019-03-31 14:17 ` Qu Wenruo 2019-03-31 14:37 ` Hannes Reinecke 2019-03-31 14:40 ` Qu Wenruo 2019-03-31 12:21 ` Andrei Borzenkov 2019-04-01 11:55 ` Austin S. Hemmelgarn 2019-04-01 12:04 ` Austin S. Hemmelgarn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).