linux-fsdevel.vger.kernel.org archive mirror
* Why is O_DSYNC on linux so slow / what's wrong with my SSD?
@ 2013-11-20 12:12 Stefan Priebe - Profihost AG
  2013-11-20 12:54 ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-11-20 12:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, LKML, matthew

Hello,

while investigating why an application is so slow on my SSD and
causes high I/O waits while using the raw block device, I found that this
is caused by opening the block device with O_DSYNC.

I've used dd and fio with oflag=direct,dsync / --direct=1 and --sync=1

and got these "strange" results:

fio --sync=1:
WRITE: io=1694.0MB, aggrb=57806KB/s, minb=57806KB/s, maxb=57806KB/s,
mint=30008msec, maxt=30008msec

fio --sync=0:
WRITE: io=5978.0MB, aggrb=204021KB/s, minb=204021KB/s, maxb=204021KB/s,
mint=30004msec, maxt=30004msec

I get the same results on a crucial m4 as on my intel 530 ssd.
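
For reference, the invocations were roughly of the following form (an
illustrative sketch only - the device name and exact job parameters are
placeholders, not the exact commands from my runs; note that this overwrites
the raw device):

# dd: direct + per-write O_DSYNC
dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

# fio: direct 4k writes, with and without per-write sync
fio --name=sync-test --filename=/dev/sdX --rw=write --bs=4k \
    --direct=1 --sync=1 --runtime=30 --time_based
fio --name=async-test --filename=/dev/sdX --rw=write --bs=4k \
    --direct=1 --sync=0 --runtime=30 --time_based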

I also tried the same under FreeBSD 9.1, which shows roughly the same
results for sync=1 as for sync=0:

sync=0:
WRITE: io=5984.0MB, aggrb=204185KB/s, minb=204185KB/s, maxb=204185KB/s,
mint=30010msec, maxt=30010msec

sync=1:
WRITE: io=5843.0MB, aggrb=199414KB/s, minb=199414KB/s, maxb=199414KB/s,
mint=30004msec, maxt=30004msec

Can anyone explain to me why O_DSYNC for my app on linux is so slow?

The kernel used is vanilla 3.10.19.

Thanks!


Greets Stefan


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 12:12 Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe - Profihost AG
@ 2013-11-20 12:54 ` Christoph Hellwig
  2013-11-20 13:34   ` Chinmay V S
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2013-11-20 12:54 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: linux-fsdevel, viro, LKML, matthew

On Wed, Nov 20, 2013 at 01:12:43PM +0100, Stefan Priebe - Profihost AG wrote:
> Can anyone explain to me why O_DSYNC for my app on linux is so slow?

Because FreeBSD ignores O_DSYNC on block devices, it never sends a FLUSH
to the device.


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 12:54 ` Christoph Hellwig
@ 2013-11-20 13:34   ` Chinmay V S
  2013-11-20 13:38     ` Christoph Hellwig
  2013-11-20 14:12     ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 29+ messages in thread
From: Chinmay V S @ 2013-11-20 13:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stefan Priebe - Profihost AG, linux-fsdevel, Al Viro, LKML, matthew

Hi Stefan,

Christoph is bang on right. To further elaborate upon this, here is
what is happening in the above case:
By using the DIRECT and SYNC/DSYNC flags on a block device (i.e. bypassing
the filesystem layer), you are essentially enforcing a CMD_FLUSH on
each I/O command sent to the disk. This is by design of the
block-device driver in the Linux kernel, and it severely degrades
performance.

A detailed walk-through of the various I/O scenarios is available at
thecodeartist.blogspot.com/2012/08/hdd-filesystems-osync.html

Note that SYNC/DSYNC on a filesystem (e.g. ext2/3/4) does NOT issue a
CMD_FLUSH. A "sync" via the filesystem simply guarantees that the data
is sent to the disk, not that it is really flushed to the medium. It will
continue to reside in the internal cache on the disk, waiting to be
written to the disk platter in an optimal manner (a bunch of writes
re-ordered to be sequential on-disk and batched together in one go).
This can affect performance to a large extent on modern HDDs with NCQ
support (CMD_FLUSH simply cancels all the performance benefits of NCQ).
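
One rough way to observe this yourself is to trace the device while the
benchmark runs (a sketch; on reasonably recent kernels flush requests show up
with an 'F' in blkparse's RWBS column, but the exact flags vary between
blktrace/kernel versions):

# in one shell: trace requests issued to the device
blktrace -d /dev/sdX -o - | blkparse -i -
# in another shell: run the dd/fio O_DSYNC workload and compare how many
# flush requests appear versus the plain --sync=0 run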

In case of SSDs, the huge IOPS number for the disk (40,000 in case of
Crucial M4) is again typically observed with write-cache enabled.
For Crucial M4 SSDs,
http://www.crucial.com/pdf/tech_specs-letter_crucial_m4_ssd_v3-11-11_online.pdf
Footnote1 - "Typical I/O performance numbers as measured using Iometer
with a queue depth of 32 and write cache enabled. Iometer measurements
are performed on a 8GB span. 4k transfers used for Read/Write latency
values."

To simply disable this behaviour and make the SYNC/DSYNC behaviour and
performance of raw block-device I/O resemble standard filesystem
I/O, you may want to apply the following patch to your kernel -
https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

The above patch simply disables the CMD_FLUSH command support even on
disks that claim to support it.
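
You can check what your disk currently advertises - i.e. what the patch would
be masking out - from its IDENTIFY data (a sketch; the output wording varies
by drive and hdparm version):

# look for the write cache feature and the FLUSH CACHE (EXT) commands
hdparm -I /dev/sdX | grep -iE 'write cache|flush'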

regards
ChinmayVS

On Wed, Nov 20, 2013 at 6:24 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Nov 20, 2013 at 01:12:43PM +0100, Stefan Priebe - Profihost AG wrote:
>> Can anyone explain to me why O_DSYNC for my app on linux is so slow?
>
> Because FreeBSD ignores O_DSYNC on block devices, it never sends a FLUSH
> to the device.
>


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 13:34   ` Chinmay V S
@ 2013-11-20 13:38     ` Christoph Hellwig
  2013-11-20 14:12     ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2013-11-20 13:38 UTC (permalink / raw)
  To: Chinmay V S
  Cc: Christoph Hellwig, Stefan Priebe - Profihost AG, linux-fsdevel,
	Al Viro, LKML, matthew

On Wed, Nov 20, 2013 at 07:04:15PM +0530, Chinmay V S wrote:
> Note that SYNC/DSYNC on a filesystem(eg. ext2/3/4) does NOT issue a
> CMD_FLUSH. The "SYNC" via filesystem, simply guarantees that the data
> is sent to the disk and not really flushed to the disk.

While this used to be the case for ext2 and ext3, it has never been
true for modern filesystems, and it was fixed for ext3 quite
a while ago.



* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 13:34   ` Chinmay V S
  2013-11-20 13:38     ` Christoph Hellwig
@ 2013-11-20 14:12     ` Stefan Priebe - Profihost AG
  2013-11-20 15:22       ` Chinmay V S
  1 sibling, 1 reply; 29+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-11-20 14:12 UTC (permalink / raw)
  To: Chinmay V S, Christoph Hellwig; +Cc: linux-fsdevel, Al Viro, LKML, matthew

Hi ChinmayVS,

Am 20.11.2013 14:34, schrieb Chinmay V S:
> Hi Stefan,
> 
> Christoph is bang on right. To further elaborate upon this, here is
> what is happening in the above case :
> By using DIRECT, SYNC/DSYNC flags on a block device (i.e. bypassing
> the file-systems layer), essentially you are enforcing a CMD_FLUSH on
> each I/O command sent to the disk. This is by design of the
> block-device driver in the Linux kernel. This severely degrades the
> performance.
> 
> A detailed walk-through of the various I/O scenarios is available at
> thecodeartist.blogspot.com/2012/08/hdd-filesystems-osync.html
> 
> Note that SYNC/DSYNC on a filesystem(eg. ext2/3/4) does NOT issue a
> CMD_FLUSH. The "SYNC" via filesystem, simply guarantees that the data
> is sent to the disk and not really flushed to the disk. It will
> continue to reside in the internal cache on the disk, waiting to be
> written to the disk platter in a optimum manner (bunch of writes
> re-ordered to be sequential on-disk and clubbed together in one go).
> This can affect performance to a large extent on modern HDDs with NCQ
> support (CMD_FLUSH simply cancels all performance benefits of NCQ).
> 
> In case of SSDs, the huge IOPS number for the disk (40,000 in case of
> Crucial M4) is again typically observed with write-cache enabled.
> For Crucial M4 SSDs,
> http://www.crucial.com/pdf/tech_specs-letter_crucial_m4_ssd_v3-11-11_online.pdf
> Footnote1 - "Typical I/O performance numbers as measured using Iometer
> with a queue depth of 32 and write cache enabled. Iometer measurements
> are performed on a 8GB span. 4k transfers used for Read/Write latency
> values."

thanks for your great and detailed reply. I'm just wondering why an
Intel 520 SSD degrades in speed by just 2% in the case of O_SYNC, while the
Intel 530, the newer model and replacement for the 520, degrades by 75% like
the Crucial m4.

The Intel DC S3500, by contrast, delivers nearly 98% of its performance
even under O_SYNC.

> To simply disable this behaviour and make the SYNC/DSYNC behaviour and
> performance on raw block-device I/O resemble the standard filesystem
> I/O you may want to apply the following patch to your kernel -
> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
> 
> The above patch simply disables the CMD_FLUSH command support even on
> disks that claim to support it.

Is this the right one? By adding ahci_dummy_read_id we disable
CMD_FLUSH?

What is the risk of that?

Thanks!

Stefan


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 14:12     ` Stefan Priebe - Profihost AG
@ 2013-11-20 15:22       ` Chinmay V S
  2013-11-20 15:37         ` Theodore Ts'o
  2013-11-22 19:55         ` Stefan Priebe
  0 siblings, 2 replies; 29+ messages in thread
From: Chinmay V S @ 2013-11-20 15:22 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

Hi Stefan,

> thanks for your great and detailed reply. I'm just wondering why an
> intel 520 ssd degrades the speed just by 2% in case of O_SYNC. intel 530
> the newer model and replacement for the 520 degrades speed by 75% like
> the crucial m4.
>
> The Intel DC S3500 instead delivers also nearly 98% of it's performance
> even under O_SYNC.

If you have confirmed the performance numbers, then it indicates that
the Intel 530 controller is more advanced and makes better use of the
internal disk-cache to achieve better performance (as compared to the
Intel 520). Thus forcing CMD_FLUSH on each IOP (negating the benefits
of the disk write-cache and not allowing any advanced disk controller
optimisations) has a more pronounced effect of degrading the
performance on Intel 530 SSDs. (Someone with some actual info on Intel
SSDs kindly confirm this.)

>> To simply disable this behaviour and make the SYNC/DSYNC behaviour and
>> performance on raw block-device I/O resemble the standard filesystem
>> I/O you may want to apply the following patch to your kernel -
>> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
>>
>> The above patch simply disables the CMD_FLUSH command support even on
>> disks that claim to support it.
>
> Is this the right one? By assing ahci_dummy_read_id we disable the
> CMD_FLUSH?
>
> What is the risk of that one?

Yes, https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba is the
right one. The dummy read_id() provides a hook into the initial
disk-properties discovery process when the disk is plugged in. By
explicitly negating the bits that indicate cache and
flush-cache(CMD_FLUSH) support, we can ensure that the block driver
does NOT issue CMD_FLUSH commands to the disk. Note that this does NOT
disable the write-cache on the disk itself i.e. performance improves
due to the on-disk write-cache in the absence of any CMD_FLUSH
commands from the host-PC.

Theoretically, it increases the chances of data loss, i.e. if power is
removed while a write is in progress from the app. Personally though
I have found that the impact of this is minimal, because SYNC on a raw
block device with CMD_FLUSH does NOT guarantee atomicity in case of a
power loss. Hence, in the event of a power loss, applications cannot
rely on SYNC (with CMD_FLUSH) for data integrity. Rather they have to
maintain other data structures with redundant disk metadata (which is
precisely what modern file-systems do). Thus, removing CMD_FLUSH
doesn't really result in a downside as such.

The main thing to consider when applying the above simple patch is
that it is system-wide. The above patch prevents the host-PC from
issuing CMD_FLUSH for ALL drives enumerated via SATA/SCSI on the
system.

If this patch works for you, then to restrict the change in behaviour
to a specific disk, you will need to:
1. Identify the disk by its model number within the dummy read_id() (as shown below).
2. Zero the bits ONLY for your particular disk.
3. Return without modifying anything for all other disks.
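
For step 1, the model string as seen by the kernel can be read from sysfs or
from the IDENTIFY data (paths and device name illustrative):

cat /sys/block/sdX/device/model
hdparm -I /dev/sdX | grep -i 'model number'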

Try out the above patch and let me know if you have any further issues.

regards
ChinmayVS


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:22       ` Chinmay V S
@ 2013-11-20 15:37         ` Theodore Ts'o
  2013-11-20 15:55           ` J. Bruce Fields
  2013-11-20 16:02           ` Howard Chu
  2013-11-22 19:55         ` Stefan Priebe
  1 sibling, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2013-11-20 15:37 UTC (permalink / raw)
  To: Chinmay V S
  Cc: Stefan Priebe - Profihost AG, Christoph Hellwig, linux-fsdevel,
	Al Viro, LKML, matthew

On Wed, Nov 20, 2013 at 08:52:36PM +0530, Chinmay V S wrote:
> 
> If you have confirmed the performance numbers, then it indicates that
> the Intel 530 controller is more advanced and makes better use of the
> internal disk-cache to achieve better performance (as compared to the
> Intel 520). Thus forcing CMD_FLUSH on each IOP (negating the benefits
> of the disk write-cache and not allowing any advanced disk controller
> optimisations) has a more pronouced effect of degrading the
> performance on Intel 530 SSDs. (Someone with some actual info on Intel
> SSDs kindly confirm this.)

You might also want to do some power fail testing to make sure that
the SSD is actually flushing all of its internal Flash Translation
Layer (FTL) metadata to stable storage on every CMD_FLUSH command.

There are lots of flash media that don't do this, with the result that
I get lots of users whining at me when their file system stored on an
SD card has massive corruption after a power fail event.

Historically, Intel has been really good about avoiding this, but
since they've moved to using 3rd party flash controllers, I now advise
everyone who plans to use any flash storage, regardless of the
manufacturer, to do their own explicit power fail testing (hitting the
reset button is not good enough; you need to kick the power plug out
of the wall, or better yet, use a network-controlled power switch so
you can repeat the power fail test dozens or hundreds of times for
your qualification run) before using flash storage in a mission
critical situation where you care about data integrity after a power
fail event.

IOW, make sure that the SSD isn't faster because it's playing fast and
loose with the FTL metadata....

Cheers,

						- Ted


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:37         ` Theodore Ts'o
@ 2013-11-20 15:55           ` J. Bruce Fields
  2013-11-20 17:11             ` Chinmay V S
  2013-11-22 19:57             ` Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe
  2013-11-20 16:02           ` Howard Chu
  1 sibling, 2 replies; 29+ messages in thread
From: J. Bruce Fields @ 2013-11-20 15:55 UTC (permalink / raw)
  To: Theodore Ts'o, Chinmay V S, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

On Wed, Nov 20, 2013 at 10:37:03AM -0500, Theodore Ts'o wrote:
> On Wed, Nov 20, 2013 at 08:52:36PM +0530, Chinmay V S wrote:
> > 
> > If you have confirmed the performance numbers, then it indicates that
> > the Intel 530 controller is more advanced and makes better use of the
> > internal disk-cache to achieve better performance (as compared to the
> > Intel 520). Thus forcing CMD_FLUSH on each IOP (negating the benefits
> > of the disk write-cache and not allowing any advanced disk controller
> > optimisations) has a more pronouced effect of degrading the
> > performance on Intel 530 SSDs. (Someone with some actual info on Intel
> > SSDs kindly confirm this.)
> 
> You might also want to do some power fail testing to make sure that
> the SSD is actually flusing all of its internal Flash Translation
> Layer (FTL) metadata to stable storage on every CMD_FLUSH command.

Some SSDs also claim the ability to flush the cache on power loss:

	http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html

Which should in theory let them respond immediately to flush requests,
right?  Except they only seem to advertise it as a safety (rather than a
performance) feature, so I probably misunderstand something.

And the 520 doesn't claim this feature (look for "enhanced power loss
protection" at http://ark.intel.com/products/66248), so that wouldn't
explain these results anyway.

--b.

> 
> There are lots of flash media that don't do this, with the result that
> I get lots of users whining at me when their file system stored on an
> SD card has massive corruption after a power fail event.
> 
> Historically, Intel has been really good about avoiding this, but
> since they've moved to using 3rd party flash controllers, I now advise
> everyone who plans to use any flash storage, regardless of the
> manufacturer, to do their own explicit power fail testing (hitting the
> reset button is not good enough, you need to kick the power plug out
> of the wall, or better yet, use a network controlled power switch you
> so you can repeat the power fail test dozens or hundreds of times for
> your qualification run) before being using flash storage in a mission
> critical situation where you care about data integrity after a power
> fail event.
> 
> IOW, make sure that the SSD isn't faster because it's playing fast and
> loose with the FTL metadata....
> 
> Cheers,
> 
> 						- Ted


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:37         ` Theodore Ts'o
  2013-11-20 15:55           ` J. Bruce Fields
@ 2013-11-20 16:02           ` Howard Chu
  2013-11-23 20:36             ` Pavel Machek
  1 sibling, 1 reply; 29+ messages in thread
From: Howard Chu @ 2013-11-20 16:02 UTC (permalink / raw)
  To: Theodore Ts'o, Chinmay V S, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

Theodore Ts'o wrote:
> Historically, Intel has been really good about avoiding this, but
> since they've moved to using 3rd party flash controllers, I now advise
> everyone who plans to use any flash storage, regardless of the
> manufacturer, to do their own explicit power fail testing (hitting the
> reset button is not good enough, you need to kick the power plug out
> of the wall, or better yet, use a network controlled power switch you
> so you can repeat the power fail test dozens or hundreds of times for
> your qualification run) before being using flash storage in a mission
> critical situation where you care about data integrity after a power
> fail event.

Speaking of which, what would you use to automate this sort of test? I'm 
thinking an SSD connected by eSATA, with an external power supply, and the 
host running inside a VM. Drop power to the drive at the same time as doing a 
kill -9 on the VM, then you can resume the VM pretty quickly instead of 
waiting for a full reboot sequence.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:55           ` J. Bruce Fields
@ 2013-11-20 17:11             ` Chinmay V S
  2013-11-20 17:58               ` J. Bruce Fields
  2013-11-22 19:57             ` Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe
  1 sibling, 1 reply; 29+ messages in thread
From: Chinmay V S @ 2013-11-20 17:11 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Theodore Ts'o, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, Matthew Wilcox

On Wed, Nov 20, 2013 at 9:25 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> Some SSD's are also claim the ability to flush the cache on power loss:
>
>         http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
>
> Which should in theory let them respond immediately to flush requests,
> right?  Except they only seem to advertise it as a safety (rather than a
> performance) feature, so I probably misunderstand something.
>
> And the 520 doesn't claim this feature (look for "enhanced power loss
> protection" at http://ark.intel.com/products/66248), so that wouldn't
> explain these results anyway.

FYI, nowhere does Intel imply that CMD_FLUSH is instantaneous. The
product brief for the Intel 320 SSDs (above link) explains that this is
implemented by a power-fail detection circuit that detects a drop in the
power supply, following which the on-disk controller issues an internal
CMD_FLUSH-equivalent command to ensure data is moved to the
non-volatile area from the disk-cache. Large secondary capacitors
ensure backup supply for this brief duration.

Thus applications can always perform asynchronous I/O upon the disk,
taking comfort in the fact that the physical disk ensures that all
data in the volatile disk-cache is automatically transferred to the
non-volatile area even in the event of an external power-failure. Thus
the host never has to worry about issuing a CMD_FLUSH (which is still
a terribly expensive performance bottleneck, even on the Intel 320
SSDs).


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 17:11             ` Chinmay V S
@ 2013-11-20 17:58               ` J. Bruce Fields
  2013-11-20 18:43                 ` Chinmay V S
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2013-11-20 17:58 UTC (permalink / raw)
  To: Chinmay V S
  Cc: Theodore Ts'o, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, Matthew Wilcox

On Wed, Nov 20, 2013 at 10:41:54PM +0530, Chinmay V S wrote:
> On Wed, Nov 20, 2013 at 9:25 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > Some SSD's are also claim the ability to flush the cache on power loss:
> >
> >         http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
> >
> > Which should in theory let them respond immediately to flush requests,
> > right?  Except they only seem to advertise it as a safety (rather than a
> > performance) feature, so I probably misunderstand something.
> >
> > And the 520 doesn't claim this feature (look for "enhanced power loss
> > protection" at http://ark.intel.com/products/66248), so that wouldn't
> > explain these results anyway.
> 
> FYI, nowhere does Intel imply that the CMD_FLUSH is instantaneous. The
> product brief for Intel 320 SSDs (above link), explains that it is
> implemented by a power-fail detection circuit that detects drop in
> power-supply, following which the on-disk controller issues an internal
> CMD_FLUSH equivalent command to ensure data is moved to the
> non-volatile area from the disk-cache. Large secondary capacitors
> ensure backup supply for this brief duration.
> 
> Thus applications can always perform asynchronous I/O upon the disk,
> taking comfort in the fact that the physical disk ensures that all
> data in the volatile disk-cache is automatically transferred to the
> non-volatile area even in the event of an external power-failure. Thus
> the host never has to worry about issuing a CMD_FLUSH (which is still
> a terribly expensive performance bottleneck, even on the Intel 320
> SSDs).

So why is it up to the application to do this and not the drive?
Naively I'd've thought it would be simpler if the protocol allowed the
drive to respond instantly if it knows it can do so safely, and then you
could always issue flush requests, and save some poor admin from having
to read spec sheets to figure out if they can safely mount "nobarrier".

Is it that you want to eliminate CMD_FLUSH entirely because the protocol
still has some significant overhead even if the drive responds to it
quickly?

--b.


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 17:58               ` J. Bruce Fields
@ 2013-11-20 18:43                 ` Chinmay V S
  2013-11-21 10:11                   ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Chinmay V S @ 2013-11-20 18:43 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Theodore Ts'o, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, Matthew Wilcox

On Wed, Nov 20, 2013 at 11:28 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Nov 20, 2013 at 10:41:54PM +0530, Chinmay V S wrote:
>> On Wed, Nov 20, 2013 at 9:25 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > Some SSD's are also claim the ability to flush the cache on power loss:
>> >
>> >         http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
>> >
>> > Which should in theory let them respond immediately to flush requests,
>> > right?  Except they only seem to advertise it as a safety (rather than a
>> > performance) feature, so I probably misunderstand something.
>> >
>> > And the 520 doesn't claim this feature (look for "enhanced power loss
>> > protection" at http://ark.intel.com/products/66248), so that wouldn't
>> > explain these results anyway.
>>
>> FYI, nowhere does Intel imply that the CMD_FLUSH is instantaneous. The
>> product brief for Intel 320 SSDs (above link), explains that it is
>> implemented by a power-fail detection circuit that detects drop in
>> power-supply, following which the on-disk controller issues an internal
>> CMD_FLUSH equivalent command to ensure data is moved to the
>> non-volatile area from the disk-cache. Large secondary capacitors
>> ensure backup supply for this brief duration.
>>
>> Thus applications can always perform asynchronous I/O upon the disk,
>> taking comfort in the fact that the physical disk ensures that all
>> data in the volatile disk-cache is automatically transferred to the
>> non-volatile area even in the event of an external power-failure. Thus
>> the host never has to worry about issuing a CMD_FLUSH (which is still
>> a terribly expensive performance bottleneck, even on the Intel 320
>> SSDs).
>
> So why is it up to the application to do this and not the drive?
> Naively I'd've thought it would be simpler if the protocol allowed the
> drive to respond instantly if it knows it can do so safely, and then you
> could always issue flush requests, and save some poor admin from having
> to read spec sheets to figure out if they can safely mount "nobarrier".
Strictly speaking, CMD_FLUSH implies that the app/driver wants to
ensure data IS in fact in the non-volatile area. Also, the time penalty
associated with it on the majority of disks is a known fact, and hence
CMD_FLUSHes are not issued unless absolutely necessary. During I/O on
a raw block device, as this is the ONLY data barrier available, a sync
is mapped to it.

The Intel 320 SSD is an exception where the disk does NOT need a
CMD_FLUSH as it can guarantee that the cache is always flushed to the
non-volatile area automatically in case of a power loss. However, a
CMD_FLUSH is an explicit command to write to non-volatile area and is
implemented accordingly. Practically though it could have been made
a no-op on the Intel 320 series (and other similar battery-backed
disks, but not for all disks). Unfortunately this is not how the
on-disk controller firmware is implemented and hence it is up to the
app/kernel-driver to avoid issuing CMD_FLUSHes which are clearly
unnecessary as discussed above.

> Is it that you want to eliminate CMD_FLUSH entirely because the protocol
> still has some significant overhead even if the drive responds to it
> quickly?

1. Most drives do NOT respond to CMD_FLUSH immediately, i.e. they wait
until the data is actually moved to the non-volatile media (which is
the right behaviour), so performance drops.

2. Some drives may implement CMD_FLUSH to return immediately i.e. no
guarantee the data is actually on disk.

3. Anyway, CMD_FLUSH does NOT guarantee atomicity. (Consider power
failure in the middle of an ongoing CMD_FLUSH on non battery-backed
disks).

4. Throughput using CMD_FLUSH is so low that an app generating a large
amount of I/O will have to buffer most of it in the app layer itself,
i.e. it is lost in case of a power outage.

Considering the above 4 facts, ASYNC IO is almost always better on raw
block devices. This pushes the data to the disk as fast as possible
and an occasional CMD_FLUSH will ensure it is flushed to the
non-volatile area periodically.
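
As a rough sketch of that approach with fio (the device name and parameters
are illustrative, not a recommendation of specific values):

# queued async 4k writes via libaio, no flush per write;
# --fsync=N can be added to sync every N writes and bound the data at risk
fio --name=async-writes --filename=/dev/sdX --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=30 --time_based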

In case the application cannot be modified to perform ASYNC IO, there
exists a way to disable the behaviour of issuing a CMD_FLUSH for each
sync() within the block device driver for SATA/SCSI disks. This is
what is described by
https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

Just to be clear, I am NOT recommending that this change be mainlined;
rather it is a reference to improve performance in the rare cases (like
the OP Stefan's case) where neither the app performing DIRECT SYNC
block I/O nor the disk firmware implementing CMD_FLUSH can be
modified. In that case the standard block driver behaviour of issuing
a CMD_FLUSH with each write is too restrictive and is thus modified using
the patch.


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 18:43                 ` Chinmay V S
@ 2013-11-21 10:11                   ` Christoph Hellwig
  2013-11-22 20:01                     ` Stefan Priebe
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2013-11-21 10:11 UTC (permalink / raw)
  To: Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, Stefan Priebe - Profihost AG,
	linux-fsdevel, Al Viro, LKML, Matthew Wilcox

> 
> 1. Most drives do NOT respond to CMD_FLUSH immediately i.e. they wait
> until the data is actually moved to the non-volatile media (which is
> the right behaviour) i.e. performance drops.

Which is what the specification says they must do.

> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
> guarantee the data is actually on disk.

In which case they aren't spec compliant.  While I've seen countless
data integrity bugs on lower end ATA SSDs, I've not seen one that simply
ignores flush.  If you wanted to cheat that bluntly you'd be better
off just claiming to not have a writeback cache.

> 3. Anyway, CMD_FLUSH does NOT guarantee atomicity. (Consider power
> failure in the middle of an ongoing CMD_FLUSH on non battery-backed
> disks).

It does not guarantee atomicity by itself, but it's the only low-level
primitive a filesystem or database can use to build atomic transactions
at a higher level on an ATA disk with the writeback cache enabled.

> In case the application cannot be modified to perform ASYNC IO, there
> exists a way to disable the behaviour of issuing a CMD_FLUSH for each
> sync() within the block device driver for SATA/SCSI disks. This is
> what is described by
> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

Which is utterly broken, and your insistence on pushing it shows you
do not understand the problem space.

You solve your performance problem by completely disabling any chance
of having data integrity guarantees, and do so in a way that is not
detectable for applications or users.

If you have a workload with lots of small synchronous writes disabling
the writeback cache on the disk does indeed often help, especially with
the non-queueable FLUSH on all but the most recent ATA devices.

If you do have workloads where you do lots of synchronous writes

> Just to be clear, i am NOT recommending that this change be mainlined;
> rather it is a reference to improve performance in the rare cases(like
> in the OP Stefan's case) where both the app performing DIRECT SYNC
> block IO and the disk firmware implementing CMD_FLUSH can NOT be
> modified. In which case the standard block driver behaviour of issuing
> a CMD_FLUSH with each write is too restrictive and thus modified using
> the patch.

Again, what your patch does is to explicitly ignore the data integrity
request from the application.  While this will usually be way faster,
it will also cause data loss.  Simply disabling the writeback cache
feature of the disk using hdparm will give you much better performance
than issuing all the FLUSH commands, especially if they are non-queued,
but without breaking the guarantee to the application.
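
For reference, that looks something like this (device name illustrative):

# disable the drive's volatile write cache (writes complete only once on media)
hdparm -W0 /dev/sdX
# check the current setting / re-enable the cache
hdparm -W /dev/sdX
hdparm -W1 /dev/sdX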


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:22       ` Chinmay V S
  2013-11-20 15:37         ` Theodore Ts'o
@ 2013-11-22 19:55         ` Stefan Priebe
  1 sibling, 0 replies; 29+ messages in thread
From: Stefan Priebe @ 2013-11-22 19:55 UTC (permalink / raw)
  To: Chinmay V S; +Cc: Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

Am 20.11.2013 16:22, schrieb Chinmay V S:
> Hi Stefan,
>
>> thanks for your great and detailed reply. I'm just wondering why an
>> intel 520 ssd degrades the speed just by 2% in case of O_SYNC. intel 530
>> the newer model and replacement for the 520 degrades speed by 75% like
>> the crucial m4.
>>
>> The Intel DC S3500 instead delivers also nearly 98% of it's performance
>> even under O_SYNC.
>
> If you have confirmed the performance numbers, then it indicates that
> the Intel 530 controller is more advanced and makes better use of the
> internal disk-cache to achieve better performance (as compared to the
> Intel 520). Thus forcing CMD_FLUSH on each IOP (negating the benefits
> of the disk write-cache and not allowing any advanced disk controller
> optimisations) has a more pronouced effect of degrading the
> performance on Intel 530 SSDs. (Someone with some actual info on Intel
> SSDs kindly confirm this.)
>
>>> To simply disable this behaviour and make the SYNC/DSYNC behaviour and
>>> performance on raw block-device I/O resemble the standard filesystem
>>> I/O you may want to apply the following patch to your kernel -
>>> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
>>>
>>> The above patch simply disables the CMD_FLUSH command support even on
>>> disks that claim to support it.
>>
>> Is this the right one? By assing ahci_dummy_read_id we disable the
>> CMD_FLUSH?
>>
>> What is the risk of that one?
>
> Yes, https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba is the
> right one. The dummy read_id() provides a hook into the initial
> disk-properties discovery process when the disk is plugged-in. By
> explicitly negating the bits that indicate cache and
> flush-cache(CMD_FLUSH) support, we can ensure that the block driver
> does NOT issue CMD_FLUSH commands to the disk. Note that this does NOT
> disable the write-cache on the disk itself i.e. performance improves
> due to the on-disk write-cache in the absence of any CMD_FLUSH
> commands from the host-PC.

ah OK thanks.

> Theoretically, it increases the chances of data loss i.e. if power is
> removed while the write is in progress from the app. Personally though
> i have found that the impact of this is minimal because SYNC on a raw
> block device with CMD_FLUSH does NOT guarantee atomicity in case of a
> power-loss. Hence, in the event of a power loss, applications cannot
> rely on SYNC(with CMD_FLUSH) for data integrity. Rather they have to
> maintain other data-structures with redundant disk metadata (which is
> precisely what modern file-systems do). Thus, removing CMD_FLUSH
> doesn't really result in a downside as such.

In my production system I've got Crucial m500s, which have a capacitor, so in
case of power loss they flush their data to disk automatically.

> The main thing to consider when applying the above simple patch is
> that it is system-wide. The above patch prevents the host-PC from
> issuing CMD_FLUSH for ALL drives enumerated via SATA/SCSI on the
> system.
>
> If this patch works for you, then to restrict the change in behaviour
> to a specific disk, you will need to:
> 1. Identify the disk by its model number within the dummy read_id().
> 2. Zero the bits ONLY for your particular disk.
> 3. Return without modifying anything for all other disks.
>
> Try out the above patch and let me know if you have any further issues.

The best thing would be a flag under
/sys/block/sdc/device/

for SSDs with a capacitor - so everybody can decide on their own.

Stefan


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 15:55           ` J. Bruce Fields
  2013-11-20 17:11             ` Chinmay V S
@ 2013-11-22 19:57             ` Stefan Priebe
  2013-11-24  0:10               ` One Thousand Gnomes
  1 sibling, 1 reply; 29+ messages in thread
From: Stefan Priebe @ 2013-11-22 19:57 UTC (permalink / raw)
  To: J. Bruce Fields, Theodore Ts'o, Chinmay V S,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

Am 20.11.2013 16:55, schrieb J. Bruce Fields:
> On Wed, Nov 20, 2013 at 10:37:03AM -0500, Theodore Ts'o wrote:
>> On Wed, Nov 20, 2013 at 08:52:36PM +0530, Chinmay V S wrote:
>>>
>>> If you have confirmed the performance numbers, then it indicates that
>>> the Intel 530 controller is more advanced and makes better use of the
>>> internal disk-cache to achieve better performance (as compared to the
>>> Intel 520). Thus forcing CMD_FLUSH on each IOP (negating the benefits
>>> of the disk write-cache and not allowing any advanced disk controller
>>> optimisations) has a more pronouced effect of degrading the
>>> performance on Intel 530 SSDs. (Someone with some actual info on Intel
>>> SSDs kindly confirm this.)
>>
>> You might also want to do some power fail testing to make sure that
>> the SSD is actually flusing all of its internal Flash Translation
>> Layer (FTL) metadata to stable storage on every CMD_FLUSH command.
>
> Some SSD's are also claim the ability to flush the cache on power loss:
>
> 	http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
>
> Which should in theory let them respond immediately to flush requests,
> right?  Except they only seem to advertise it as a safety (rather than a
> performance) feature, so I probably misunderstand something.

Yes, but they all still support and act on CMD_FLUSH, so it's slow on
them too.

> And the 520 doesn't claim this feature (look for "enhanced power loss
> protection" at http://ark.intel.com/products/66248), so that wouldn't
> explain these results anyway.

Correct - I think Intel simply ignores CMD_FLUSH on that drive - no idea
why - and they fixed this for their 330, 530 and DC S3500 (all tested).

> --b.
>
>>
>> There are lots of flash media that don't do this, with the result that
>> I get lots of users whining at me when their file system stored on an
>> SD card has massive corruption after a power fail event.
>>
>> Historically, Intel has been really good about avoiding this, but
>> since they've moved to using 3rd party flash controllers, I now advise
>> everyone who plans to use any flash storage, regardless of the
>> manufacturer, to do their own explicit power fail testing (hitting the
>> reset button is not good enough, you need to kick the power plug out
>> of the wall, or better yet, use a network controlled power switch you
>> so you can repeat the power fail test dozens or hundreds of times for
>> your qualification run) before being using flash storage in a mission
>> critical situation where you care about data integrity after a power
>> fail event.
>>
>> IOW, make sure that the SSD isn't faster because it's playing fast and
>> loose with the FTL metadata....
>>
>> Cheers,
>>
>> 						- Ted


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-21 10:11                   ` Christoph Hellwig
@ 2013-11-22 20:01                     ` Stefan Priebe
  2013-11-22 20:37                       ` Ric Wheeler
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Priebe @ 2013-11-22 20:01 UTC (permalink / raw)
  To: Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hi Christoph,
Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>
>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>> guarantee the data is actually on disk.
>
> In which case they aren't spec complicant.  While I've seen countless
> data integrity bugs on lower end ATA SSDs I've not seen one that simpliy
> ingnores flush.  If you'd want to cheat that bluntly you'd be better
> of just claiming to not have a writeback cache.
>
> You solve your performance problem by completely disabling any chance
> of having data integrity guarantees, and do so in a way that is not
> detectable for applications or users.
>
> If you have a workload with lots of small synchronous writes disabling
> the writeback cache on the disk does indeed often help, especially with
> the non-queueable FLUSH on all but the most recent ATA devices.

But this isn't correct for drives with capacitors like the Crucial m500,
Intel DC S3500 and DC S3700, is it? Shouldn't the Linux kernel have an
option to disable this for drives like these?
/sys/block/sdX/device/ignore_flush

> Again, what your patch does is to explicitly ignore the data integrity
> request from the application.  While this will usually be way faster,
> it will also cause data loss.  Simply disabling the writeback cache
> feature of the disk using hdparm will give you much better performance
> than issueing all the FLUSH command, especially if they are non-queued,
> but without breaking the gurantee to the application.
>


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-22 20:01                     ` Stefan Priebe
@ 2013-11-22 20:37                       ` Ric Wheeler
  2013-11-22 21:05                         ` Stefan Priebe
  2013-11-23 18:27                         ` Stefan Priebe
  0 siblings, 2 replies; 29+ messages in thread
From: Ric Wheeler @ 2013-11-22 20:37 UTC (permalink / raw)
  To: Stefan Priebe, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

On 11/22/2013 03:01 PM, Stefan Priebe wrote:
> Hi Christoph,
> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>
>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>> guarantee the data is actually on disk.
>>
>> In which case they aren't spec complicant.  While I've seen countless
>> data integrity bugs on lower end ATA SSDs I've not seen one that simpliy
>> ingnores flush.  If you'd want to cheat that bluntly you'd be better
>> of just claiming to not have a writeback cache.
>>
>> You solve your performance problem by completely disabling any chance
>> of having data integrity guarantees, and do so in a way that is not
>> detectable for applications or users.
>>
>> If you have a workload with lots of small synchronous writes disabling
>> the writeback cache on the disk does indeed often help, especially with
>> the non-queueable FLUSH on all but the most recent ATA devices.
>
> But this isn't correct for drives with capicitors like Crucial m500, Intel DC 
> S3500, DC S3700 isn't it? Shouldn't the linux kernel has an option to disable 
> this for drives like these?
> /sys/block/sdX/device/ignore_flush

If you know 100% for sure that your drive has a non-volatile write cache, you 
can run the file system without the flushing by mounting "-o nobarrier".  With 
most devices, this is not needed since they tend to simply ignore the flushes if 
they know they are power failure safe.
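
For example (ext4/XFS; only safe when the write cache really is non-volatile):

mount -o nobarrier /dev/sdX1 /mnt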

At the block level, we did something similar for users who are not running
through a file system: for SCSI devices, James added support for echoing
"temporary" into the sd device's cache_type field:

See:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88
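
In practice that looks something like this (SCSI address illustrative; the
"temporary" prefix keeps the change from being written back to the drive):

echo "temporary write through" > /sys/class/scsi_disk/0:0:0:0/cache_type
cat /sys/class/scsi_disk/0:0:0:0/cache_type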

Ric

>
>> Again, what your patch does is to explicitly ignore the data integrity
>> request from the application.  While this will usually be way faster,
>> it will also cause data loss.  Simply disabling the writeback cache
>> feature of the disk using hdparm will give you much better performance
>> than issueing all the FLUSH command, especially if they are non-queued,
>> but without breaking the gurantee to the application.
>>
>



* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-22 20:37                       ` Ric Wheeler
@ 2013-11-22 21:05                         ` Stefan Priebe
  2013-11-23 18:27                         ` Stefan Priebe
  1 sibling, 0 replies; 29+ messages in thread
From: Stefan Priebe @ 2013-11-22 21:05 UTC (permalink / raw)
  To: Ric Wheeler, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hi Ric,

Am 22.11.2013 21:37, schrieb Ric Wheeler:
> On 11/22/2013 03:01 PM, Stefan Priebe wrote:
>> Hi Christoph,
>> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>>
>>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>>> guarantee the data is actually on disk.
>>>
>>> In which case they aren't spec complicant.  While I've seen countless
>>> data integrity bugs on lower end ATA SSDs I've not seen one that simpliy
>>> ingnores flush.  If you'd want to cheat that bluntly you'd be better
>>> of just claiming to not have a writeback cache.
>>>
>>> You solve your performance problem by completely disabling any chance
>>> of having data integrity guarantees, and do so in a way that is not
>>> detectable for applications or users.
>>>
>>> If you have a workload with lots of small synchronous writes disabling
>>> the writeback cache on the disk does indeed often help, especially with
>>> the non-queueable FLUSH on all but the most recent ATA devices.
>>
>> But this isn't correct for drives with capicitors like Crucial m500,
>> Intel DC S3500, DC S3700 isn't it? Shouldn't the linux kernel has an
>> option to disable this for drives like these?
>> /sys/block/sdX/device/ignore_flush
>
> If you know 100% for sure that your drive has a non-volatile write
> cache, you can run the file system without the flushing by mounting "-o
> nobarrier".  With most devices, this is not needed since they tend to
> simply ignore the flushes if they know they are power failure safe.

Thanks - but I have raw block devices that the data goes to.

> Block level, we did something similar for users who are not running
> through a file system for SCSI devices - James added support to echo
> "temporary" into the sd's device's cache_type field:
>
> See:
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88

So I have to switch to write through, but I'm still using the wb cache of
the device?

echo temporary write through > /sys/class/scsi_disk/<disk>/cache_type

Thanks!

Stefan


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-22 20:37                       ` Ric Wheeler
  2013-11-22 21:05                         ` Stefan Priebe
@ 2013-11-23 18:27                         ` Stefan Priebe
  2013-11-23 19:35                           ` Ric Wheeler
  1 sibling, 1 reply; 29+ messages in thread
From: Stefan Priebe @ 2013-11-23 18:27 UTC (permalink / raw)
  To: Ric Wheeler, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hi Ric,

Am 22.11.2013 21:37, schrieb Ric Wheeler:
> On 11/22/2013 03:01 PM, Stefan Priebe wrote:
>> Hi Christoph,
>> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>>
>>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>>> guarantee the data is actually on disk.
>>>
>>> In which case they aren't spec complicant.  While I've seen countless
>>> data integrity bugs on lower end ATA SSDs I've not seen one that simpliy
>>> ingnores flush.  If you'd want to cheat that bluntly you'd be better
>>> of just claiming to not have a writeback cache.
>>>
>>> You solve your performance problem by completely disabling any chance
>>> of having data integrity guarantees, and do so in a way that is not
>>> detectable for applications or users.
>>>
>>> If you have a workload with lots of small synchronous writes disabling
>>> the writeback cache on the disk does indeed often help, especially with
>>> the non-queueable FLUSH on all but the most recent ATA devices.
>>
>> But this isn't correct for drives with capicitors like Crucial m500,
>> Intel DC S3500, DC S3700 isn't it? Shouldn't the linux kernel has an
>> option to disable this for drives like these?
>> /sys/block/sdX/device/ignore_flush
>
> If you know 100% for sure that your drive has a non-volatile write
> cache, you can run the file system without the flushing by mounting "-o
> nobarrier".  With most devices, this is not needed since they tend to
> simply ignore the flushes if they know they are power failure safe.
>
> Block level, we did something similar for users who are not running
> through a file system for SCSI devices - James added support to echo
> "temporary" into the sd's device's cache_type field:
>
> See:
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88

At least for me this does not work. I get the same awful speed as before -
and the I/O waits stay the same. I'm still seeing CMD flushes going to the
devices.

Is there any way to check whether the temporary got accepted and works?

I simply executed:
for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary 
write back >$i; done

Stefan


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-23 18:27                         ` Stefan Priebe
@ 2013-11-23 19:35                           ` Ric Wheeler
  2013-11-23 19:48                             ` Stefan Priebe
                                               ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Ric Wheeler @ 2013-11-23 19:35 UTC (permalink / raw)
  To: Stefan Priebe, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

On 11/23/2013 01:27 PM, Stefan Priebe wrote:
> Hi Ric,
>
> Am 22.11.2013 21:37, schrieb Ric Wheeler:
>> On 11/22/2013 03:01 PM, Stefan Priebe wrote:
>>> Hi Christoph,
>>> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>>>
>>>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>>>> guarantee the data is actually on disk.
>>>>
>>>> In which case they aren't spec complicant.  While I've seen countless
>>>> data integrity bugs on lower end ATA SSDs I've not seen one that simpliy
>>>> ingnores flush.  If you'd want to cheat that bluntly you'd be better
>>>> of just claiming to not have a writeback cache.
>>>>
>>>> You solve your performance problem by completely disabling any chance
>>>> of having data integrity guarantees, and do so in a way that is not
>>>> detectable for applications or users.
>>>>
>>>> If you have a workload with lots of small synchronous writes disabling
>>>> the writeback cache on the disk does indeed often help, especially with
>>>> the non-queueable FLUSH on all but the most recent ATA devices.
>>>
>>> But this isn't correct for drives with capicitors like Crucial m500,
>>> Intel DC S3500, DC S3700 isn't it? Shouldn't the linux kernel has an
>>> option to disable this for drives like these?
>>> /sys/block/sdX/device/ignore_flush
>>
>> If you know 100% for sure that your drive has a non-volatile write
>> cache, you can run the file system without the flushing by mounting "-o
>> nobarrier".  With most devices, this is not needed since they tend to
>> simply ignore the flushes if they know they are power failure safe.
>>
>> Block level, we did something similar for users who are not running
>> through a file system for SCSI devices - James added support to echo
>> "temporary" into the sd's device's cache_type field:
>>
>> See:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88 
>>
>
> At least to me this does not work. I get the same awful speed as before - also 
> the I/O waits stay the same. I'm still seeing CMD flushes going to the devices.
>
> Is there any way to check whether the temporary got accepted and works?
>
> I simply executed:
> for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary write 
> back >$i; done
>
> Stefan

What kernel are you running?  This is a new addition....

Also, you can "cat" the same file to see what it says.

Regards,

Ric


* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-23 19:35                           ` Ric Wheeler
@ 2013-11-23 19:48                             ` Stefan Priebe
  2013-11-25  7:37                             ` Stefan Priebe
  2020-01-08  6:58                             ` slow sync performance on LSI / Broadcom MegaRaid performance with battery cache Stefan Priebe - Profihost AG
  2 siblings, 0 replies; 29+ messages in thread
From: Stefan Priebe @ 2013-11-23 19:48 UTC (permalink / raw)
  To: Ric Wheeler, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hi Ric,

Am 23.11.2013 20:35, schrieb Ric Wheeler:
> On 11/23/2013 01:27 PM, Stefan Priebe wrote:
>> Hi Ric,
>>
>> Am 22.11.2013 21:37, schrieb Ric Wheeler:
>>> On 11/22/2013 03:01 PM, Stefan Priebe wrote:
>>>> Hi Christoph,
>>>> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>>>>
>>>>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>>>>> guarantee the data is actually on disk.
>>>>>
>>>>> In which case they aren't spec complicant.  While I've seen countless
>>>>> data integrity bugs on lower end ATA SSDs I've not seen one that
>>>>> simpliy
>>>>> ingnores flush.  If you'd want to cheat that bluntly you'd be better
>>>>> of just claiming to not have a writeback cache.
>>>>>
>>>>> You solve your performance problem by completely disabling any chance
>>>>> of having data integrity guarantees, and do so in a way that is not
>>>>> detectable for applications or users.
>>>>>
>>>>> If you have a workload with lots of small synchronous writes disabling
>>>>> the writeback cache on the disk does indeed often help, especially
>>>>> with
>>>>> the non-queueable FLUSH on all but the most recent ATA devices.
>>>>
>>>> But this isn't correct for drives with capicitors like Crucial m500,
>>>> Intel DC S3500, DC S3700 isn't it? Shouldn't the linux kernel has an
>>>> option to disable this for drives like these?
>>>> /sys/block/sdX/device/ignore_flush
>>>
>>> If you know 100% for sure that your drive has a non-volatile write
>>> cache, you can run the file system without the flushing by mounting "-o
>>> nobarrier".  With most devices, this is not needed since they tend to
>>> simply ignore the flushes if they know they are power failure safe.
>>>
>>> Block level, we did something similar for users who are not running
>>> through a file system for SCSI devices - James added support to echo
>>> "temporary" into the sd's device's cache_type field:
>>>
>>> See:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88
>>>
>>
>> At least to me this does not work. I get the same awful speed as
>> before - also the I/O waits stay the same. I'm still seeing CMD
>> flushes going to the devices.
>>
>> Is there any way to check whether the temporary got accepted and works?
>>
>> I simply executed:
>> for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary
>> write back >$i; done
>>
>> Stefan
>
> What kernel are you running?  This is a new addition....

3.10.19

> Also, you can "cat" the same file to see what it says.

Sure:
[cloud2-1338: ~]# for i in /sys/class/scsi_disk/*/cache_type; do echo 
$i; echo temporary write through >$i; done
/sys/class/scsi_disk/1:0:0:0/cache_type
/sys/class/scsi_disk/2:0:0:0/cache_type
/sys/class/scsi_disk/3:0:0:0/cache_type
/sys/class/scsi_disk/4:0:0:0/cache_type
/sys/class/scsi_disk/5:0:0:0/cache_type
/sys/class/scsi_disk/6:0:0:0/cache_type
[cloud2-1338: ~]# for i in /sys/class/scsi_disk/*/cache_type; do echo 
$i; cat $i; done
/sys/class/scsi_disk/1:0:0:0/cache_type
write through
/sys/class/scsi_disk/2:0:0:0/cache_type
write through
/sys/class/scsi_disk/3:0:0:0/cache_type
write through
/sys/class/scsi_disk/4:0:0:0/cache_type
write through
/sys/class/scsi_disk/5:0:0:0/cache_type
write through
/sys/class/scsi_disk/6:0:0:0/cache_type
write through
[cloud2-1338: ~]# for i in /sys/class/scsi_disk/*/cache_type; do echo 
$i; echo temporary write back >$i; done
/sys/class/scsi_disk/1:0:0:0/cache_type
/sys/class/scsi_disk/2:0:0:0/cache_type
/sys/class/scsi_disk/3:0:0:0/cache_type
/sys/class/scsi_disk/4:0:0:0/cache_type
/sys/class/scsi_disk/5:0:0:0/cache_type
/sys/class/scsi_disk/6:0:0:0/cache_type
[cloud2-1338: ~]# for i in /sys/class/scsi_disk/*/cache_type; do echo 
$i; cat $i; done
/sys/class/scsi_disk/1:0:0:0/cache_type
write back
/sys/class/scsi_disk/2:0:0:0/cache_type
write back
/sys/class/scsi_disk/3:0:0:0/cache_type
write back
/sys/class/scsi_disk/4:0:0:0/cache_type
write back
/sys/class/scsi_disk/5:0:0:0/cache_type
write back
/sys/class/scsi_disk/6:0:0:0/cache_type
write back
[cloud2-1338: ~]#
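
For reference, one rough way to watch whether flush commands still reach a 
device is blktrace (a sketch only - /dev/sdb is just a placeholder and 
blktrace/blkparse need to be installed):

blktrace -d /dev/sdb -o - | blkparse -i -
# flush requests show up with an 'F' at the start of the RWBS column;
# whatever cache_type reports, this shows directly whether the kernel is
# still sending them to the device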

Stefan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-20 16:02           ` Howard Chu
@ 2013-11-23 20:36             ` Pavel Machek
  2013-11-23 23:01               ` Ric Wheeler
  0 siblings, 1 reply; 29+ messages in thread
From: Pavel Machek @ 2013-11-23 20:36 UTC (permalink / raw)
  To: Howard Chu
  Cc: Theodore Ts'o, Chinmay V S, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

On Wed 2013-11-20 08:02:33, Howard Chu wrote:
> Theodore Ts'o wrote:
> >Historically, Intel has been really good about avoiding this, but
> >since they've moved to using 3rd party flash controllers, I now advise
> >everyone who plans to use any flash storage, regardless of the
> >manufacturer, to do their own explicit power fail testing (hitting the
> >reset button is not good enough, you need to kick the power plug out
> >of the wall, or better yet, use a network controlled power switch
> >so you can repeat the power fail test dozens or hundreds of times for
> >your qualification run) before using flash storage in a mission
> >critical situation where you care about data integrity after a power
> >fail event.
> 
> Speaking of which, what would you use to automate this sort of test?
> I'm thinking an SSD connected by eSATA, with an external power
> supply, and the host running inside a VM. Drop power to the drive at
> the same time as doing a kill -9 on the VM, then you can resume the
> VM pretty quickly instead of waiting for a full reboot sequence.

I was just pulling power on a SATA drive.

It uncovered "interesting" stuff. I plugged the power back in, and the
kernel re-established communication with that drive, but any settings
made with hdparm were forgotten. I'd say there's some room for
improvement there...
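
A minimal sketch of how such a loop could be automated (assuming a
network-controlled power switch with some CLI - "pdu-ctl" and /dev/sdX below
are made-up placeholders, and the target device gets overwritten):

#!/bin/bash
# repeated power-fail test sketch: start sync-heavy writes, cut power to
# the drive mid-write, restore it, and leave room for a verify step
for i in $(seq 1 100); do
    fio --name=powerfail --filename=/dev/sdX --rw=write --bs=4k \
        --sync=1 --runtime=60 --time_based &
    sleep $((RANDOM % 30 + 5))
    pdu-ctl off 3        # hypothetical: drop power to the drive's outlet
    wait                 # fio exits with an error once the device is gone
    pdu-ctl on 3         # hypothetical: restore power
    sleep 30             # give the drive time to reappear
    # a real test would re-read and verify the data written so far here
done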

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-23 20:36             ` Pavel Machek
@ 2013-11-23 23:01               ` Ric Wheeler
  2013-11-24  0:22                 ` Pavel Machek
  0 siblings, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2013-11-23 23:01 UTC (permalink / raw)
  To: Pavel Machek, Howard Chu
  Cc: Theodore Ts'o, Chinmay V S, Stefan Priebe - Profihost AG,
	Christoph Hellwig, linux-fsdevel, Al Viro, LKML, matthew

On 11/23/2013 03:36 PM, Pavel Machek wrote:
> On Wed 2013-11-20 08:02:33, Howard Chu wrote:
>> Theodore Ts'o wrote:
>>> Historically, Intel has been really good about avoiding this, but
>>> since they've moved to using 3rd party flash controllers, I now advise
>>> everyone who plans to use any flash storage, regardless of the
>>> manufacturer, to do their own explicit power fail testing (hitting the
>>> reset button is not good enough, you need to kick the power plug out
>>> of the wall, or better yet, use a network controlled power switch
>>> so you can repeat the power fail test dozens or hundreds of times for
>>> your qualification run) before using flash storage in a mission
>>> critical situation where you care about data integrity after a power
>>> fail event.
>> Speaking of which, what would you use to automate this sort of test?
>> I'm thinking an SSD connected by eSATA, with an external power
>> supply, and the host running inside a VM. Drop power to the drive at
>> the same time as doing a kill -9 on the VM, then you can resume the
>> VM pretty quickly instead of waiting for a full reboot sequence.
> I was just pulling power on a SATA drive.
>
> It uncovered "interesting" stuff. I plugged the power back in, and the
> kernel re-established communication with that drive, but any settings
> made with hdparm were forgotten. I'd say there's some room for
> improvement there...
>
> 								Pavel

Hi Pavel,

When you drop power, your drive normally loses temporary settings (like a change 
to write cache, etc).

Depending on the class of the device, there are ways to make that permanent 
(look at hdparm or sdparm for details).

This is a feature of the drive and its firmware, not something we reset in the 
device each time it re-appears.
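
For a SCSI disk, for example, something like this as a sketch (WCE is the
write cache enable bit in the caching mode page; --save also updates the
drive's saved page so the change persists; /dev/sdX is a placeholder):

# disable the write cache and store the change in the drive's saved
# mode page so it survives a power cycle
sdparm --clear=WCE --save /dev/sdX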

Ric


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-22 19:57             ` Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe
@ 2013-11-24  0:10               ` One Thousand Gnomes
  0 siblings, 0 replies; 29+ messages in thread
From: One Thousand Gnomes @ 2013-11-24  0:10 UTC (permalink / raw)
  Cc: Christoph Hellwig, linux-fsdevel, Al Viro, LKML

> > And the 520 doesn't claim this feature (look for "enhanced power loss
> > protection" at http://ark.intel.com/products/66248), so that wouldn't
> > explain these results anyway.
> 
> Correct, I think Intel simply ignores CMD_FLUSH on that drive - no idea
> why, and they fixed this for their 330, 530, DC S3500 (all tested)

You are not, as I read the standard, allowed to "ignore" it. In fact, if you
advertise the property you are obliged to implement it. The late Andre
Hedrick made sure the standard was phrased the way it was to stop it
being abused for benchmarketing. The goal was that anyone cheating would
be non-compliant.

Now it's entirely possible to do clever stuff and treat it merely as a
write barrier, providing you can't lose what is queued up. What the actual
drives do I don't know... all deep magic and not my department.

A second thing to be careful about is that certain kinds of I/O barriers
and atomic write patterns that force lots of commits to flash and erase
cycles are going to wear the drive out faster and I've been told by
manufacturers that drives do respond to such patterns by limiting the
transaction rate in self defence (and presumably in the hope the OS will
then begin to block stuff up better).

Pavel - what is lost/kept over the reset of a device is also fairly
clearly defined in the standard. Much is lost because if you committed a
permanent configuration change that the controller couldn't support you
would be a bit screwed!

If you are driving an SSD I'd work very hard to avoid the need for any
kind of O_SYNC or O_DSYNC type behaviour for exactly the same reason you
avoid uncached memory accesses - the hardware can't do its job properly
without the needed freedom. Use minimal barriers and proper sync points
and your performance will be far higher.
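
As a rough fio illustration of the difference (a sketch only; /dev/sdX is a
placeholder scratch device and gets overwritten):

# flush after every single write - roughly what O_DSYNC on the raw
# block device ends up doing
fio --name=perwrite --filename=/dev/sdX --rw=write --bs=4k --sync=1 \
    --runtime=30 --time_based

# batched: write normally and only issue fdatasync every 256 writes,
# i.e. one sync point per batch instead of per request
fio --name=batched --filename=/dev/sdX --rw=write --bs=4k --fdatasync=256 \
    --runtime=30 --time_based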

Alan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-23 23:01               ` Ric Wheeler
@ 2013-11-24  0:22                 ` Pavel Machek
  2013-11-24  1:03                   ` One Thousand Gnomes
  2013-11-24  2:43                   ` Ric Wheeler
  0 siblings, 2 replies; 29+ messages in thread
From: Pavel Machek @ 2013-11-24  0:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Howard Chu, Theodore Ts'o, Chinmay V S,
	Stefan Priebe - Profihost AG, Christoph Hellwig, linux-fsdevel,
	Al Viro, LKML, matthew

On Sat 2013-11-23 18:01:32, Ric Wheeler wrote:
> On 11/23/2013 03:36 PM, Pavel Machek wrote:
> >On Wed 2013-11-20 08:02:33, Howard Chu wrote:
> >>Theodore Ts'o wrote:
> >>>Historically, Intel has been really good about avoiding this, but
> >>>since they've moved to using 3rd party flash controllers, I now advise
> >>>everyone who plans to use any flash storage, regardless of the
> >>>manufacturer, to do their own explicit power fail testing (hitting the
> >>>reset button is not good enough, you need to kick the power plug out
> >>>of the wall, or better yet, use a network controlled power switch
> >>>so you can repeat the power fail test dozens or hundreds of times for
> >>>your qualification run) before using flash storage in a mission
> >>>critical situation where you care about data integrity after a power
> >>>fail event.
> >>Speaking of which, what would you use to automate this sort of test?
> >>I'm thinking an SSD connected by eSATA, with an external power
> >>supply, and the host running inside a VM. Drop power to the drive at
> >>the same time as doing a kill -9 on the VM, then you can resume the
> >>VM pretty quickly instead of waiting for a full reboot sequence.
> >I was just pulling power on a SATA drive.
> >
> >It uncovered "interesting" stuff. I plugged the power back in, and the
> >kernel re-established communication with that drive, but any settings
> >made with hdparm were forgotten. I'd say there's some room for
> >improvement there...
> 
> Hi Pavel,
> 
> When you drop power, your drive normally loses temporary settings
> (like a change to write cache, etc).
> 
> Depending on the class of the device, there are ways to make that
> permanent (look at hdparm or sdparm for details).
> 
> This is a feature of the drive and its firmware, not something we
> reset in the device each time it re-appears.

Yes, and I'm arguing that is a bug (as in, < 0.01% people are using
hdparm correctly).

So you used hdparm to disable write cache so that ext3 can be safely
used on your hdd. Now you have a glitch on power. Then the system continues
to operate in a dangerous mode until reboot.

I guess it would be safer not to reattach drives after power
fail... (also I wonder what this does to data integrity. Drive lost
content of its writeback cache, but kernel continues... Journal will
not prevent data corruption in this case).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-24  0:22                 ` Pavel Machek
@ 2013-11-24  1:03                   ` One Thousand Gnomes
  2013-11-24  2:43                   ` Ric Wheeler
  1 sibling, 0 replies; 29+ messages in thread
From: One Thousand Gnomes @ 2013-11-24  1:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Howard Chu, Theodore Ts'o, Chinmay V S,
	Stefan Priebe - Profihost AG, Christoph Hellwig, linux-fsdevel,
	Al Viro, LKML

> Yes, and I'm arguing that is a bug (as in, < 0.01% people are using
> hdparm correctly).

Generally speaking if you are using hdparm for tuning it means we need to
fix something in the ATA layer so you don't have to !

> I guess it would be safer not to reattach drives after power
> fail... (also I wonder what this does to data integrity. Drive lost
> content of its writeback cache, but kernel continues... Journal will
> not prevent data corruption in this case).

For good or bad it's very hard to tell if a drive randomly powers off or
we merely get a bus reset in the ATA case. In the SATA case we do at
least get the relevant events to handle it nicely as we *should* see a
DevExch event. We also can't tell a power fail event from a hot drive
swap, so we most definitely want to re-attach the drive, just so long as
we ensure that it comes back on a different device node.

Alan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-24  0:22                 ` Pavel Machek
  2013-11-24  1:03                   ` One Thousand Gnomes
@ 2013-11-24  2:43                   ` Ric Wheeler
  1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2013-11-24  2:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Howard Chu, Theodore Ts'o, Chinmay V S,
	Stefan Priebe - Profihost AG, Christoph Hellwig, linux-fsdevel,
	Al Viro, LKML, matthew

On 11/23/2013 07:22 PM, Pavel Machek wrote:
> On Sat 2013-11-23 18:01:32, Ric Wheeler wrote:
>> On 11/23/2013 03:36 PM, Pavel Machek wrote:
>>> On Wed 2013-11-20 08:02:33, Howard Chu wrote:
>>>> Theodore Ts'o wrote:
>>>>> Historically, Intel has been really good about avoiding this, but
>>>>> since they've moved to using 3rd party flash controllers, I now advise
>>>>> everyone who plans to use any flash storage, regardless of the
>>>>> manufacturer, to do their own explicit power fail testing (hitting the
>>>>> reset button is not good enough, you need to kick the power plug out
>>>>> of the wall, or better yet, use a network controlled power switch
>>>>> so you can repeat the power fail test dozens or hundreds of times for
>>>>> your qualification run) before using flash storage in a mission
>>>>> critical situation where you care about data integrity after a power
>>>>> fail event.
>>>> Speaking of which, what would you use to automate this sort of test?
>>>> I'm thinking an SSD connected by eSATA, with an external power
>>>> supply, and the host running inside a VM. Drop power to the drive at
>>>> the same time as doing a kill -9 on the VM, then you can resume the
>>>> VM pretty quickly instead of waiting for a full reboot sequence.
>>> I was just pulling power on a SATA drive.
>>>
>>> It uncovered "interesting" stuff. I plugged the power back in, and the
>>> kernel re-established communication with that drive, but any settings
>>> made with hdparm were forgotten. I'd say there's some room for
>>> improvement there...
>> Hi Pavel,
>>
>> When you drop power, your drive normally loses temporary settings
>> (like a change to write cache, etc).
>>
>> Depending on the class of the device, there are ways to make that
>> permanent (look at hdparm or sdparm for details).
>>
>> This is a feature of the drive and its firmware, not something we
>> reset in the device each time it re-appears.
> Yes, and I'm arguing that is a bug (as in, < 0.01% people are using
> hdparm correctly).

Almost no end users use hdparm. Those who do should read the man page and add 
the -K flag :)

Or system scripts that tweak it should invoke it with the right flags...
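
Something along these lines, as a sketch (check the man page for the exact
flags; whether the setting also survives a full power cycle is ultimately up
to the drive's firmware):

# disable the drive's volatile write cache and ask it to keep settings
# over reset (-K1 = keep_settings_over_reset); /dev/sdX is a placeholder
hdparm -W0 -K1 /dev/sdX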

Ric

> So you used hdparm to disable write cache so that ext3 can be safely
> used on your hdd. Now you have a glitch on power. Then the system continues
> to operate in a dangerous mode until reboot.
>
> I guess it would be safer not to reattach drives after power
> fail... (also I wonder what this does to data integrity. Drive lost
> content of its writeback cache, but kernel continues... Journal will
> not prevent data corruption in this case).
>
> 									Pavel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
  2013-11-23 19:35                           ` Ric Wheeler
  2013-11-23 19:48                             ` Stefan Priebe
@ 2013-11-25  7:37                             ` Stefan Priebe
  2020-01-08  6:58                             ` slow sync performance on LSI / Broadcom MegaRaid performance with battery cache Stefan Priebe - Profihost AG
  2 siblings, 0 replies; 29+ messages in thread
From: Stefan Priebe @ 2013-11-25  7:37 UTC (permalink / raw)
  To: Ric Wheeler, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hi Ric,

Am 23.11.2013 20:35, schrieb Ric Wheeler:
> On 11/23/2013 01:27 PM, Stefan Priebe wrote:
>> Hi Ric,
>>
>> Am 22.11.2013 21:37, schrieb Ric Wheeler:
>>> On 11/22/2013 03:01 PM, Stefan Priebe wrote:
>>>> Hi Christoph,
>>>> Am 21.11.2013 11:11, schrieb Christoph Hellwig:
>>>>>>
>>>>>> 2. Some drives may implement CMD_FLUSH to return immediately i.e. no
>>>>>> guarantee the data is actually on disk.
>>>>>
>>>>> In which case they aren't spec compliant.  While I've seen countless
>>>>> data integrity bugs on lower end ATA SSDs I've not seen one that simply
>>>>> ignores flush.  If you'd want to cheat that bluntly you'd be better
>>>>> off just claiming to not have a writeback cache.
>>>>>
>>>>> You solve your performance problem by completely disabling any chance
>>>>> of having data integrity guarantees, and do so in a way that is not
>>>>> detectable for applications or users.
>>>>>
>>>>> If you have a workload with lots of small synchronous writes disabling
>>>>> the writeback cache on the disk does indeed often help, especially
>>>>> with
>>>>> the non-queueable FLUSH on all but the most recent ATA devices.
>>>>
>>>> But this isn't correct for drives with capacitors like Crucial m500,
>>>> Intel DC S3500, DC S3700, is it? Shouldn't the Linux kernel have an
>>>> option to disable this for drives like these?
>>>> /sys/block/sdX/device/ignore_flush
>>>
>>> If you know 100% for sure that your drive has a non-volatile write
>>> cache, you can run the file system without the flushing by mounting "-o
>>> nobarrier".  With most devices, this is not needed since they tend to
>>> simply ignore the flushes if they know they are power failure safe.
>>>
>>> Block level, we did something similar for users who are not running
>>> through a file system for SCSI devices - James added support to echo
>>> "temporary" into the sd's device's cache_type field:
>>>
>>> See:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88
>>>
>>
>> At least to me this does not work. I get the same awful speed as
>> before - also the I/O waits stay the same. I'm still seeing CMD
>> flushes going to the devices.
>>
>> Is there any way to check whether the temporary got accepted and works?
>>
>> I simply executed:
>> for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary
>> write back >$i; done
>>
>> Stefan
>
> What kernel are you running?  This is a new addition....
>
> Also, you can "cat" the same file to see what it says.
>
> Regards,
>
> Ric
>

Is the output I sent to you fine? Anything wrong?

Stefan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* slow sync performance on LSI / Broadcom MegaRaid performance with battery cache
  2013-11-23 19:35                           ` Ric Wheeler
  2013-11-23 19:48                             ` Stefan Priebe
  2013-11-25  7:37                             ` Stefan Priebe
@ 2020-01-08  6:58                             ` Stefan Priebe - Profihost AG
  2 siblings, 0 replies; 29+ messages in thread
From: Stefan Priebe - Profihost AG @ 2020-01-08  6:58 UTC (permalink / raw)
  To: Ric Wheeler, Christoph Hellwig, Chinmay V S
  Cc: J. Bruce Fields, Theodore Ts'o, linux-fsdevel, Al Viro, LKML,
	Matthew Wilcox

Hello list,

While we used Adaptec controllers with battery-backed cache for years, we
recently switched to Dell hardware using PERC controllers, which are
rebranded LSI/Broadcom controllers.

We're running btrfs subvolume / snapshot workloads, and those are very
fast using a btrfs RAID 0 on top of several RAID 5 arrays running on the
Adaptec controllers (battery backed) in write-back mode.

The performance really sucks on the LSI controllers, even though the one
I have has 8GB of cache instead of just 1GB on the Adaptec.

Especially sync / fsync are awfully slow, sometimes taking 30-45 minutes
while btrfs is doing snapshots. The workload on all machines is the same
and the disks are OK.

Is there a way to disable FLUSH / sync at all for those devices? Just to
test?

I'm already using the nobarrier mount option on btrfs, but this does not
help either.
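
For reference, the sysfs knob discussed earlier in this thread is one more
thing I could experiment with (a sketch only - whether it takes effect
immediately on a given kernel, and whether the MegaRaid firmware then
actually stops flushing its cache, is exactly what's unclear to me):

# tell the sd driver to treat the cache as write-through, without
# touching the device itself ("temporary"); in principle the kernel
# then no longer issues cache flushes for that disk
for f in /sys/class/scsi_disk/*/cache_type; do
    echo "temporary write through" > "$f"
done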

Thanks!

Greets,
Stefan

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-01-08  7:03 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-20 12:12 Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe - Profihost AG
2013-11-20 12:54 ` Christoph Hellwig
2013-11-20 13:34   ` Chinmay V S
2013-11-20 13:38     ` Christoph Hellwig
2013-11-20 14:12     ` Stefan Priebe - Profihost AG
2013-11-20 15:22       ` Chinmay V S
2013-11-20 15:37         ` Theodore Ts'o
2013-11-20 15:55           ` J. Bruce Fields
2013-11-20 17:11             ` Chinmay V S
2013-11-20 17:58               ` J. Bruce Fields
2013-11-20 18:43                 ` Chinmay V S
2013-11-21 10:11                   ` Christoph Hellwig
2013-11-22 20:01                     ` Stefan Priebe
2013-11-22 20:37                       ` Ric Wheeler
2013-11-22 21:05                         ` Stefan Priebe
2013-11-23 18:27                         ` Stefan Priebe
2013-11-23 19:35                           ` Ric Wheeler
2013-11-23 19:48                             ` Stefan Priebe
2013-11-25  7:37                             ` Stefan Priebe
2020-01-08  6:58                             ` slow sync performance on LSI / Broadcom MegaRaid performance with battery cache Stefan Priebe - Profihost AG
2013-11-22 19:57             ` Why is O_DSYNC on linux so slow / what's wrong with my SSD? Stefan Priebe
2013-11-24  0:10               ` One Thousand Gnomes
2013-11-20 16:02           ` Howard Chu
2013-11-23 20:36             ` Pavel Machek
2013-11-23 23:01               ` Ric Wheeler
2013-11-24  0:22                 ` Pavel Machek
2013-11-24  1:03                   ` One Thousand Gnomes
2013-11-24  2:43                   ` Ric Wheeler
2013-11-22 19:55         ` Stefan Priebe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).