linux-kernel.vger.kernel.org archive mirror
* Disabling in-memory write cache for x86-64 in Linux II
@ 2013-10-25  7:25 Artem S. Tashkinov
  2013-10-25  8:18 ` Linus Torvalds
  2013-10-25 10:49 ` NeilBrown
  0 siblings, 2 replies; 56+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25  7:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds, linux-fsdevel, axboe, linux-mm

Hello!

On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
is that the x86-64 kernel has the following problem:

When I copy large files to any storage device, be it my HDD with ext4 partitions
or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
then flushes them some time later (quite unpredictably though) or immediately upon
invoking "sync".

How can I disable this memory cache altogether (or at least minimize caching)? When
running the i686 kernel with the same configuration I don't observe this effect - files get
written out almost immediately (for instance "sync" takes less than a second, whereas
on x86-64 it can take dozens of _minutes_ depending on file size and storage
performance).

I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
- firstly this command is detrimental to the performance of my PC, secondly, it won't help
in this instance.

Swap is totally disabled, usually my memory is entirely free.

My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531

Please, advise.

Best regards,

Artem 


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov
@ 2013-10-25  8:18 ` Linus Torvalds
  2013-10-25  8:30   ` Artem S. Tashkinov
  2013-11-05  0:50   ` Andreas Dilger
  2013-10-25 10:49 ` NeilBrown
  1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-25  8:18 UTC (permalink / raw)
  To: Artem S. Tashkinov, Wu Fengguang, Andrew Morton
  Cc: Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm

On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".

Yeah, I think we default to a 10% "dirty background memory" (and
allow up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
of dirty memory for writeout before we even start writing, and twice
that before we start *waiting* for it.

On 32-bit x86, we only count the memory in the low 1GB (really
actually up to about 890MB), so "10% dirty" really means just about
90MB of buffering (and a "hard limit" of ~180MB of dirty).

And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
come from the old days of less memory (and perhaps servers that don't
much care), and the fact that x86-32 ends up having much lower limits
even if you end up having more memory.

You can easily tune it:

    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
    echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes

or similar. But you're right, we need to make the defaults much saner.
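
For reference, the same values can be set with sysctl and made persistent; a
minimal sketch, assuming a sysctl.conf-style setup (the byte values are only
illustrative, and note that setting the *_bytes knobs overrides the
corresponding *_ratio knobs):

    # Apply immediately (as root); 16MB background, 48MB hard limit:
    sysctl -w vm.dirty_background_bytes=$((16*1024*1024))
    sysctl -w vm.dirty_bytes=$((48*1024*1024))
    # To persist across reboots, add to /etc/sysctl.conf:
    #   vm.dirty_background_bytes = 16777216
    #   vm.dirty_bytes = 50331648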

Wu? Andrew? Comments?

             Linus


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:18 ` Linus Torvalds
@ 2013-10-25  8:30   ` Artem S. Tashkinov
  2013-10-25  8:43     ` Linus Torvalds
  2013-10-25  9:18     ` Theodore Ts'o
  2013-11-05  0:50   ` Andreas Dilger
  1 sibling, 2 replies; 56+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25  8:30 UTC (permalink / raw)
  To: torvalds; +Cc: fengguang.wu, akpm, linux-kernel

Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote:
On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>
>Yeah, I think we default to a 10% "dirty background memory" (and
>allow up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
>of dirty memory for writeout before we even start writing, and twice
>that before we start *waiting* for it.
>
>On 32-bit x86, we only count the memory in the low 1GB (really
>actually up to about 890MB), so "10% dirty" really means just about
>90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
>And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
>come from the old days of less memory (and perhaps servers that don't
>much care), and the fact that x86-32 ends up having much lower limits
>even if you end up having more memory.
>
>You can easily tune it:
>
>    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
>    echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
>
>or similar. But you're right, we need to make the defaults much saner.
>
>Wu? Andrew? Comments?
>

My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
more) this value becomes unrealistic (13GB) and I've already had some
unpleasant effects due to it.

E.g. when I dump a large MySQL database (its dump weighs around 10GB),
it appears on the disk almost immediately, but then, later, when the kernel
decides to flush it to the disk, the server almost stalls and other IO requests
take a lot more time to complete, even though mysqldump is run with ionice -c3,
so the use of ionice has no real effect.

Artem


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:30   ` Artem S. Tashkinov
@ 2013-10-25  8:43     ` Linus Torvalds
  2013-10-25  9:15       ` Karl Kiniger
  2013-10-25 11:28       ` Disabling in-memory write cache for x86-64 in Linux II David Lang
  2013-10-25  9:18     ` Theodore Ts'o
  1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-25  8:43 UTC (permalink / raw)
  To: Artem S. Tashkinov; +Cc: Wu Fengguang, Andrew Morton, Linux Kernel Mailing List

On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
>
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.

Right. The percentage notion really goes back to the days when we
typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
you wouldn't want to have more than one megabyte of dirty data, but if
you were "Mr Moneybags" and could afford 64MB, you might want to have
up to 8MB dirty!!

Things have changed.

So I would suggest we change the defaults. Or perhaps make the rule be
that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
semantics similar across 32-bit HIGHMEM machines and 64-bit machines.

The modern way of expressing the dirty limits is to give the actual
absolute byte amounts, but we default to the legacy ratio mode..

                Linus


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:43     ` Linus Torvalds
@ 2013-10-25  9:15       ` Karl Kiniger
  2013-10-29 20:30         ` Jan Kara
  2013-10-25 11:28       ` Disabling in-memory write cache for x86-64 in Linux II David Lang
  1 sibling, 1 reply; 56+ messages in thread
From: Karl Kiniger @ 2013-10-25  9:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Artem S. Tashkinov, Wu Fengguang, Andrew Morton,
	Linux Kernel Mailing List

On Fri 131025, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
> >
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
> 
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
> 
> Things have changed.
> 
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
> 
> The modern way of expressing the dirty limits is to give the actual
> absolute byte amounts, but we default to the legacy ratio mode..
> 
>                 Linus

Is it currently possible to somehow set the above values per block device?

I want the default behaviour for almost everything, but DVD drives in DVD+RW
packet-writing mode may easily take several minutes in case of a sync.

Karl




* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:30   ` Artem S. Tashkinov
  2013-10-25  8:43     ` Linus Torvalds
@ 2013-10-25  9:18     ` Theodore Ts'o
  2013-10-25  9:29       ` Andrew Morton
  2013-10-25 23:05       ` Fengguang Wu
  1 sibling, 2 replies; 56+ messages in thread
From: Theodore Ts'o @ 2013-10-25  9:18 UTC (permalink / raw)
  To: Artem S. Tashkinov; +Cc: torvalds, fengguang.wu, akpm, linux-kernel

On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.

What I think would make sense is to dynamically measure the speed of
writeback, so that we can set these limits as a function of the device
speed.  It's already the case that the writeback limits don't make
sense on a slow USB 2.0 storage stick; I suspect that for really huge
RAID arrays or very fast flash devices, it doesn't make much sense
either.

The problem is that if you have a system that has *both* a USB stick
_and_ a fast flash/RAID storage array both needing writeback, this
doesn't work well --- but what we have right now doesn't work all that
well anyway.

						- Ted


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:18     ` Theodore Ts'o
@ 2013-10-25  9:29       ` Andrew Morton
  2013-10-25  9:32         ` Linus Torvalds
  2013-10-25 22:37         ` Fengguang Wu
  2013-10-25 23:05       ` Fengguang Wu
  1 sibling, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2013-10-25  9:29 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Artem S. Tashkinov, torvalds, fengguang.wu, linux-kernel

On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <tytso@mit.edu> wrote:

> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed.

We attempt to do this now - have a look through struct backing_dev_info.

Apparently all this stuff isn't working as desired (and perhaps as designed)
in this case.  Will take a look after a return to normalcy ;)


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:29       ` Andrew Morton
@ 2013-10-25  9:32         ` Linus Torvalds
  2013-10-26 11:32           ` Pavel Machek
  2013-10-29 20:57           ` Jan Kara
  2013-10-25 22:37         ` Fengguang Wu
  1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-25  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Artem S. Tashkinov, Wu Fengguang,
	Linux Kernel Mailing List

On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case.  Will take a look after a return to normalcy ;)

It definitely doesn't work. I can trivially reproduce problems by just
having a cheap (==slow) USB key with an ext3 filesystem, and doing a
git clone to it. The end result is not pretty, and that's actually not
even a huge amount of data.

                 Linus


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov
  2013-10-25  8:18 ` Linus Torvalds
@ 2013-10-25 10:49 ` NeilBrown
  2013-10-25 11:26   ` David Lang
  1 sibling, 1 reply; 56+ messages in thread
From: NeilBrown @ 2013-10-25 10:49 UTC (permalink / raw)
  To: Artem S. Tashkinov; +Cc: linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm


On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
<t.artem@lycos.com> wrote:

> Hello!
> 
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
> 
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".
> 
> How can I disable this memory cache altogether (or at least minimize caching)? When
> running the i686 kernel with the same configuration I don't observe this effect - files get
> written out almost immediately (for instance "sync" takes less than a second, whereas
> on x86-64 it can take dozens of _minutes_ depending on file size and storage
> performance).

What exactly is bothering you about this?  The amount of memory used or the
time until data is flushed?

If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
look.
This defaults to 30 seconds (3000 centisecs).
You could make it smaller (providing you also shrink
dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
more quickly.
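
For example, a minimal sketch (as root; the values are only illustrative)
that shrinks both knobs by a similar factor, so dirty data is expired after
about 5 seconds instead of 30:

    # Expire dirty data after ~5s instead of the default 30s (3000):
    echo 500 > /proc/sys/vm/dirty_expire_centisecs
    # Wake the flusher threads more often as well (default is 500, i.e. 5s):
    echo 100 > /proc/sys/vm/dirty_writeback_centisecs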

NeilBrown


> 
> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
> in this instance.
> 
> Swap is totally disabled, usually my memory is entirely free.
> 
> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
> 
> Please, advise.
> 
> Best regards,
> 
> Artem 




* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 10:49 ` NeilBrown
@ 2013-10-25 11:26   ` David Lang
  2013-10-25 18:26     ` Artem S. Tashkinov
  0 siblings, 1 reply; 56+ messages in thread
From: David Lang @ 2013-10-25 11:26 UTC (permalink / raw)
  To: NeilBrown
  Cc: Artem S. Tashkinov, linux-kernel, torvalds, linux-fsdevel, axboe,
	linux-mm

On Fri, 25 Oct 2013, NeilBrown wrote:

> On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
> <t.artem@lycos.com> wrote:
>
>> Hello!
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>>
>> How can I disable this memory cache altogether (or at least minimize caching)? When
>> running the i686 kernel with the same configuration I don't observe this effect - files get
>> written out almost immediately (for instance "sync" takes less than a second, whereas
>> on x86-64 it can take dozens of _minutes_ depending on file size and storage
>> performance).
>
> What exactly is bothering you about this?  The amount of memory used or the
> time until data is flushed?

actually, I think the problem is more the impact of the huge write later on.

David Lang

> If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
> look.
> This defaults to 30 seconds (3000 centisecs).
> You could make it smaller (providing you also shrink
> dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
> more quickly.
>
> NeilBrown
>
>
>>
>> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
>> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
>> in this instance.
>>
>> Swap is totally disabled, usually my memory is entirely free.
>>
>> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>>
>> Please, advise.
>>
>> Best regards,
>>
>> Artem
>
>


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:43     ` Linus Torvalds
  2013-10-25  9:15       ` Karl Kiniger
@ 2013-10-25 11:28       ` David Lang
  1 sibling, 0 replies; 56+ messages in thread
From: David Lang @ 2013-10-25 11:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Artem S. Tashkinov, Wu Fengguang, Andrew Morton,
	Linux Kernel Mailing List

On Fri, 25 Oct 2013, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
>>
>> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
>> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
>> more) this value becomes unrealistic (13GB) and I've already had some
>> unpleasant effects due to it.
>
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
>
> Things have changed.
>
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.

If you go in this direction, allow ratios larger than 100%; some people may be
willing to have huge amounts of dirty data on large-memory machines (if the load
is extremely bursty, they don't have other needs for I/O, or they have a very
fast storage system, as a few examples).

David Lang


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 11:26   ` David Lang
@ 2013-10-25 18:26     ` Artem S. Tashkinov
  2013-10-25 19:40       ` Diego Calleja
                         ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25 18:26 UTC (permalink / raw)
  To: david; +Cc: neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

Oct 25, 2013 05:26:45 PM, david wrote:
On Fri, 25 Oct 2013, NeilBrown wrote:
>
>>
>> What exactly is bothering you about this?  The amount of memory used or the
>> time until data is flushed?
>
>actually, I think the problem is more the impact of the huge write later on.

Exactly. And not being able to use applications which show you IO performance
like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
my life without being able to see the progress of a copying operation. With the current
dirty cache there's no way to understand how your storage media actually behaves.

Hopefully this issue won't dissolve into obscurity and someone will actually make
up a plan (and a patch) for how to make the dirty write cache behave in a sane manner,
considering the fact that there are devices with very different write speeds and
requirements. It'd be even better if I could specify the dirty cache as a mount option
(though sane defaults or semi-automatic values based on runtime estimates
won't hurt).

A per-device dirty cache seems like a nice idea. I, for one, would like to disable it
altogether or set it to an absolute minimum for things like USB flash drives - because
I don't care about multithreaded performance or delayed allocation on such devices -
I'm interested in my data reaching my USB stick ASAP, because that's how most people
use them.

Regards,

Artem


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 18:26     ` Artem S. Tashkinov
@ 2013-10-25 19:40       ` Diego Calleja
  2013-10-25 23:32         ` Fengguang Wu
  2013-10-25 20:43       ` NeilBrown
  2013-10-29 20:49       ` Jan Kara
  2 siblings, 1 reply; 56+ messages in thread
From: Diego Calleja @ 2013-10-25 19:40 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> >actually, I think the problem is more the impact of the huge write later
> >on.
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but I
> cannot imagine my life without being able to see the progress of a copying
> operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.


This is a problem I have also been suffering from for a long time. It's not so much
how much dirty data the system keeps and when it syncs it, but how unresponsive the
desktop becomes when it happens (usually with rsync + large files). Most
programs become completely unresponsive, especially if they have a large memory
consumption (i.e. the browser). I need to pause rsync and wait until the
system writes out all dirty data if I want to do simple things like scrolling
or do any action that uses I/O, otherwise I need to wait minutes.

I have 16 GB of RAM and, excluding the browser (which usually uses about half
of a GB) and KDE itself, there are no memory hogs, so it seems like it's
something that shouldn't happen. I can understand that I/O operations are
laggy when there is some other intensive I/O ongoing, but right now the system
becomes completely unresponsive. If I am unlucky and Konsole also becomes
unresponsive, I need to switch to a VT (which also takes time).

I haven't reported it before in part because I didn't know how to do it; "my
browser stalls" is not a very useful description, and I didn't know what kind
of data I'm supposed to report.


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 18:26     ` Artem S. Tashkinov
  2013-10-25 19:40       ` Diego Calleja
@ 2013-10-25 20:43       ` NeilBrown
  2013-10-25 21:03         ` Artem S. Tashkinov
  2013-10-29 20:49       ` Jan Kara
  2 siblings, 1 reply; 56+ messages in thread
From: NeilBrown @ 2013-10-25 20:43 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm


On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
<t.artem@lycos.com> wrote:

> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this?  The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
> 
> Exactly. And not being able to use applications which show you IO performance
> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> my life without being able to see the progress of a copying operation. With the current
> dirty cache there's no way to understand how your storage media actually behaves.

So fix Midnight Commander.  If you want the copy to be actually finished when
it says  it is finished, then it needs to call 'fsync()' at the end.

> 
> Hopefully this issue won't dissolve into obscurity and someone will actually make
> up a plan (and a patch) how to make dirty write cache behave in a sane manner
> considering the fact that there are devices with very different write speeds and
> requirements. It'd be even better, if I could specify dirty cache as a mount option
> (though sane defaults or semi-automatic values based on runtime estimates
> won't hurt).
> 
> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> altogether or make it an absolute minimum for things like USB flash drives - because
> I don't care about multithreaded performance or delayed allocation on such devices -
> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> use them.
>

As has already been said, you can substantially disable  the cache by tuning
down various values in /proc/sys/vm/.
Have you tried?

NeilBrown



* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 20:43       ` NeilBrown
@ 2013-10-25 21:03         ` Artem S. Tashkinov
  2013-10-25 22:11           ` NeilBrown
  0 siblings, 1 reply; 56+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25 21:03 UTC (permalink / raw)
  To: neilb; +Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

Oct 26, 2013 02:44:07 AM, neil wrote:
On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
>> 
>> Exactly. And not being able to use applications which show you IO performance
>> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
>> my life without being able to see the progress of a copying operation. With the current
>> dirty cache there's no way to understand how your storage media actually behaves.
>
>So fix Midnight Commander.  If you want the copy to be actually finished when
>it says  it is finished, then it needs to call 'fsync()' at the end.

This sounds like a very bad joke. How are applications supposed to show and
calculate an _average_ write speed if there are no kernel calls/ioctls to actually
make the kernel flush dirty buffers _during_ copying? Actually, that would be a good way to
solve this problem in user space - alas, even if such calls were implemented, user
space wouldn't start using them until 2018, if not later.

>> 
>> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
>> altogether or make it an absolute minimum for things like USB flash drives - because
>> I don't care about multithreaded performance or delayed allocation on such devices -
>> I'm interested in my data reaching my USB stick ASAP - because it's how most people
>> use them.
>>
>
>As has already been said, you can substantially disable  the cache by tuning
>down various values in /proc/sys/vm/.
>Have you tried?

I don't understand who you are replying to. I asked about per-device settings; you are
again referring me to system-wide settings - they don't look that good if we're talking
about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense
to allocate 20% of physical RAM for data which doesn't belong there in the first place.

I don't know of any other OS which has similar behaviour.

And like people (including me) have already mentioned, such a huge dirty cache can
stall their PCs/servers for a considerable amount of time.

Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also,
not everyone in this world has a UPS - which means such a huge buffer can lead to
serious data loss in case of a power blackout.

Regards,

Artem


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 21:03         ` Artem S. Tashkinov
@ 2013-10-25 22:11           ` NeilBrown
       [not found]             ` <CAF7GXvpJVLYDS5NfH-NVuN9bOJjAS5c1MQqSTjoiVBHJt6bWcw@mail.gmail.com>
  0 siblings, 1 reply; 56+ messages in thread
From: NeilBrown @ 2013-10-25 22:11 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm


On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov"
<t.artem@lycos.com> wrote:

> Oct 26, 2013 02:44:07 AM, neil wrote:
> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
> >> 
> >> Exactly. And not being able to use applications which show you IO performance
> >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> >> my life without being able to see the progress of a copying operation. With the current
> >> dirty cache there's no way to understand how your storage media actually behaves.
> >
> >So fix Midnight Commander.  If you want the copy to be actually finished when
> >it says  it is finished, then it needs to call 'fsync()' at the end.
> 
> This sounds like a very bad joke. How are applications supposed to show and
> calculate an _average_ write speed if there are no kernel calls/ioctls to actually
> make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
> solve this problem in user space - alas, even if such calls are implemented, user
> space will start using them only in 2018 if not further from that.

But there is a way to flush dirty buffers *during* copies:
  man 2 sync_file_range

If giving precise feedback is of paramount importance to you, then this would
be the interface to use.
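
As a rough shell-level aside (a different technique than sync_file_range(),
shown only as an illustration; /dev/sdb is an assumed device name): O_DIRECT
writes bypass the page cache entirely, so the throughput you observe tracks
the device rather than RAM:

    # Write an image straight to a USB stick, bypassing the page cache:
    dd if=big.iso of=/dev/sdb bs=4M oflag=direct
    # From another terminal, ask GNU dd to report its progress so far:
    kill -USR1 "$(pidof dd)"
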
> 
> >> 
> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> >> altogether or make it an absolute minimum for things like USB flash drives - because
> >> I don't care about multithreaded performance or delayed allocation on such devices -
> >> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> >> use them.
> >>
> >
> >As has already been said, you can substantially disable  the cache by tuning
> >down various values in /proc/sys/vm/.
> >Have you tried?
> 
> I don't understand who you are replying to. I asked about per device settings, you are
> again referring me to system wide settings - they don't look that good if we're talking
> about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
> to allocate 20% of physical RAM for things which don't belong to it in the first place.

Sorry, missed the per-device bit.
You could try playing with
  /sys/class/bdi/XX:YY/max_ratio

where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda.
Wind it right down for slow devices and you might get something like what you
want.
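
A small sketch of that (as root; /dev/sdb and its 8:16 numbers are just an
example for a slow USB stick):

    # Look up the device's major:minor numbers:
    ls -l /dev/sdb      # e.g. "brw-rw---- 1 root disk 8, 16 ..."  -> 8:16
    # The default is 100 (percent of the global dirty limit this bdi may use):
    cat /sys/class/bdi/8:16/max_ratio
    # Cap the slow device at 1%:
    echo 1 > /sys/class/bdi/8:16/max_ratio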


> 
> I don't know any other OS which has a similar behaviour.

I don't know about the internal details of any other OS, so I cannot really
comment.

> 
> And like people (including me) have already mentioned, such a huge dirty cache can
> stall their PCs/servers for a considerable amount of time.

Yes.  But this is a different issue.
There are two very different issues that should be kept separate.

One is that when "cp" or similar completes, the data hasn't all been written out
yet.  It typically takes another 30 seconds before the flush will complete.
You seemed to primarily complain about this, so that is what I originally
addressed.  That is where the "dirty_*_centisecs" values apply.

The other, quite separate, issue is that Linux will cache more dirty data
than it can write out in a reasonable time.  All the tuning parameters refer
to the amount of data (whether as a percentage of RAM or as a number of
bytes), but what people really care about is a number of seconds.

As you might imagine, estimating how long it will take to write out a certain
amount of data is highly non-trivial.  The relationship between megabytes and
seconds can be non-linear and can change over time.

Caching nothing at all can hurt a lot of workloads.  Caching too much can
obviously hurt too.  Caching "5 seconds" worth of data would be ideal, but
would be incredibly difficult to implement.
Keeping a sliding estimate of throughput for each
device might be possible, and using that to automatically adjust the
"max_ratio" value (or some related internal thing) might be a 70% solution.

Certainly it would be an interesting project for someone.


> 
> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
> not everyone in this world has a UPS - which means such a huge buffer can lead to
> serious data loss in case of a power blackout.

I don't have a desk (just a lap), but I use Linux on all my computers and
I've never really noticed the problem.  Maybe I'm just very patient, or maybe
I don't work with large data sets and slow devices.

However I don't think data-loss is really a related issue.  Any process that
cares about data safety *must* use fsync at appropriate places.  This has
always been true.

NeilBrown

> 
> Regards,
> 
> Artem




* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:29       ` Andrew Morton
  2013-10-25  9:32         ` Linus Torvalds
@ 2013-10-25 22:37         ` Fengguang Wu
  1 sibling, 0 replies; 56+ messages in thread
From: Fengguang Wu @ 2013-10-25 22:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Artem S. Tashkinov, torvalds, linux-kernel

On Fri, Oct 25, 2013 at 02:29:37AM -0700, Andrew Morton wrote:
> On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > What I think would make sense is to dynamically measure the speed of
> > writeback, so that we can set these limits as a function of the device
> > speed.
> 
> We attempt to do this now - have a look through struct backing_dev_info.

To be exact, it's backing_dev_info.write_bandwidth which is estimated
in bdi_update_write_bandwidth() and exported as "BdiWriteBandwidth" in
debugfs file bdi.stats.
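
For example (assuming debugfs is mounted in the usual place and the disk in
question is 8:0, i.e. /dev/sda):

    # Show the kernel's current writeback bandwidth estimate for this device:
    grep BdiWriteBandwidth /sys/kernel/debug/bdi/8:0/stats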

> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case.  Will take a look after a return to normalcy ;)

Right. The write bandwidth is only estimated and used when the
background dirty threshold is reached and hence the disk is actively
doing writeback IO -- which is the case where we can do a reasonable
estimation of the writeback bandwidth.

Note that this estimated BdiWriteBandwidth may better be named
"writeback" bandwidth because it may change depending on the workload
at the time -- eg. sequential vs. random writes; whether there are
parallel reads or direct IO competing for the disk time.

BdiWriteBandwidth is only designed for use by the dirty throttling
logic and is not generally useful/reliable for other purposes.

It's a bit late here and I'd like to carry the original question as an
exercise for tomorrow's airplanes. :)

Thanks,
Fengguang


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:18     ` Theodore Ts'o
  2013-10-25  9:29       ` Andrew Morton
@ 2013-10-25 23:05       ` Fengguang Wu
  2013-10-25 23:37         ` Theodore Ts'o
  1 sibling, 1 reply; 56+ messages in thread
From: Fengguang Wu @ 2013-10-25 23:05 UTC (permalink / raw)
  To: Theodore Ts'o, Artem S. Tashkinov, torvalds, akpm, linux-kernel
  Cc: Diego Calleja, David Lang, NeilBrown

On Fri, Oct 25, 2013 at 05:18:42AM -0400, Theodore Ts'o wrote:
> On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
> 
> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed.  It's already the case that the writeback limits don't make
> sense on a slow USB 2.0 storage stick; I suspect that for really huge
> RAID arrays or very fast flash devices, it doesn't make much sense
> either.
> 
> The problem is that if you have a system that has *both* a USB stick
> _and_ a fast flash/RAID storage array both needing writeback, this
> doesn't work well --- but what we have right now doesn't work all that
> well anyway.

Ted, when trying to follow up on your email, I got a crazy idea and it'd
be better to throw it out rather than carry it to bed. :)

We could do per-bdi dirty thresholds - which has been proposed 1-2
times before by different people.

The per-bdi dirty thresholds could be auto set by the kernel this way: 
start it with an initial value of 100MB. When reached, put all the
100MB dirty data to IO and get an estimation of the write bandwidth.
From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
where N is the seconds of dirty data we'd like to cache in memory.
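
A worked example of the proposed rule (this is only the proposal under
discussion, not current kernel behaviour):

    # threshold = N * bdi_write_bandwidth, e.g. with N = 5 seconds:
    #   USB stick measured at ~5 MB/s  -> ~25 MB of dirty data allowed for that bdi
    #   SSD measured at ~500 MB/s      -> ~2.5 GB of dirty data allowed for that bdi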

Thanks,
Fengguang


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 19:40       ` Diego Calleja
@ 2013-10-25 23:32         ` Fengguang Wu
  2013-11-15 15:48           ` Diego Calleja
  0 siblings, 1 reply; 56+ messages in thread
From: Fengguang Wu @ 2013-10-25 23:32 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Artem S. Tashkinov, david, neilb, linux-kernel, torvalds,
	linux-fsdevel, axboe, linux-mm

On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote:
> On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> > Oct 25, 2013 05:26:45 PM, david wrote:
> > >actually, I think the problem is more the impact of the huge write later
> > >on.
> > Exactly. And not being able to use applications which show you IO
> > performance like Midnight Commander. You might prefer to use "cp -a" but I
> > cannot imagine my life without being able to see the progress of a copying
> > operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.
> 
> 
> This is a problem I have also been suffering from for a long time. It's not so much
> how much dirty data the system keeps and when it syncs it, but how unresponsive the
> desktop becomes when it happens (usually with rsync + large files). Most
> programs become completely unresponsive, especially if they have a large memory
> consumption (i.e. the browser). I need to pause rsync and wait until the
> system writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O, otherwise I need to wait minutes.

That's a problem. And it's kind of independent of the dirty threshold
-- if you are doing large file copies in the background, it will lead
to continuous disk writes and stalls anyway -- the large dirty threshold
merely delays the write IO time.

> I have 16 GB of RAM and, excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seems like it's
> something that shouldn't happen. I can understand that I/O operations are
> laggy when there is some other intensive I/O ongoing, but right now the system
> becomes completely unresponsive. If I am unlucky and Konsole also becomes
> unresponsive, I need to switch to a VT (which also takes time).
> 
> I haven't reported it before in part because I didn't know how to do it, "my 
> browser stalls" is not a very useful description and I didn't know what kind 
> of data I'm supposed to report.

Which kernel are you running? And is it writing to a hard disk?
The stalls are most likely caused by either one of

1) write IO starves read IO
2) direct page reclaim blocked when
   - trying to writeout PG_dirty pages
   - trying to lock PG_writeback pages

Which may be confirmed by running

        ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
or
        echo w > /proc/sysrq-trigger    # and check dmesg

during the stalls. The latter command works more reliably.

Thanks,
Fengguang


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 23:05       ` Fengguang Wu
@ 2013-10-25 23:37         ` Theodore Ts'o
  2013-10-29 20:40           ` Jan Kara
  0 siblings, 1 reply; 56+ messages in thread
From: Theodore Ts'o @ 2013-10-25 23:37 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Artem S. Tashkinov, torvalds, akpm, linux-kernel, Diego Calleja,
	David Lang, NeilBrown

On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
> 
> Ted, when trying to follow up on your email, I got a crazy idea and it'd
> be better to throw it out rather than carry it to bed. :)
> 
> We could do per-bdi dirty thresholds - which has been proposed 1-2
> times before by different people.
> 
> The per-bdi dirty thresholds could be auto set by the kernel this way: 
> start it with an initial value of 100MB. When reached, put all the
> 100MB dirty data to IO and get an estimation of the write bandwidth.
> From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> where N is the seconds of dirty data we'd like to cache in memory.

Sure, although I wonder if it would be worth it to calculate some kind of
rolling average of the write bandwidth while we are doing writeback,
so if it turns out we got unlucky with the contents of the first 100MB
of dirty data (it could be either highly random or highly sequential)
then we'll eventually correct to the right level.

This means that the VM would have to keep dirty page counters for each BDI
--- which I thought we weren't doing right now, which is why we have a
global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
have cause and effect reversed?  :-)

						- Ted


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:32         ` Linus Torvalds
@ 2013-10-26 11:32           ` Pavel Machek
  2013-10-26 20:03             ` Linus Torvalds
  2013-10-29 20:57           ` Jan Kara
  1 sibling, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-10-26 11:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List

On Fri 2013-10-25 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case.  Will take a look after a return to normalcy ;)
> 
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.

Hmm, I'd expect the result to be "dead USB key". Putting
ext3 on a cheap flash device normally just kills the device :-(.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-26 11:32           ` Pavel Machek
@ 2013-10-26 20:03             ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-26 20:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List

On Sat, Oct 26, 2013 at 4:32 AM, Pavel Machek <pavel@ucw.cz> wrote:
>
> Hmm, I'd expect the result to be "dead USB key". Putting
> ext3 on a cheap flash device normally just kills the device :-(.

Not my experience. It may be true for some really cheap devices, but
normal USB keys seem to just get really slow, probably due to having
had their flash rewrite algorithm tuned for FAT accesses.

I *do* suspect that to see the really bad behavior, you don't write
just one large file to it, but many smaller ones. "git clone" will
check out all the kernel tree files, obviously.

                     Linus


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:15       ` Karl Kiniger
@ 2013-10-29 20:30         ` Jan Kara
  2013-10-29 20:43           ` Andrew Morton
  2013-10-31 14:26           ` Karl Kiniger
  0 siblings, 2 replies; 56+ messages in thread
From: Jan Kara @ 2013-10-29 20:30 UTC (permalink / raw)
  To: Karl Kiniger
  Cc: Linus Torvalds, Artem S. Tashkinov, Wu Fengguang, Andrew Morton,
	Linux Kernel Mailing List

On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> On Fri 131025, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
> > >
> > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > > more) this value becomes unrealistic (13GB) and I've already had some
> > > unpleasant effects due to it.
> > 
> > Right. The percentage notion really goes back to the days when we
> > typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> > you wouldn't want to have more than one megabyte of dirty data, but if
> > you were "Mr Moneybags" and could afford 64MB, you might want to have
> > up to 8MB dirty!!
> > 
> > Things have changed.
> > 
> > So I would suggest we change the defaults. Or perhaps make the rule be
> > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> > semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
> > 
> > The modern way of expressing the dirty limits is to give the actual
> > absolute byte amounts, but we default to the legacy ratio mode..
> > 
> >                 Linus
> 
> Is it currently possible to somehow set the above values per block device?
  Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
the maximum proportion the device's dirty data can take from the total
amount. The caveat currently is that this setting only takes effect after
we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
total, because that is the amount of dirty data at which we start to throttle
processes. So if the device you'd like to limit is the only one which is
currently being written to, the limiting doesn't have a big effect.
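
As a worked example with the default ratios (10% background, 20% hard limit)
on a 16GB machine:

    # max_ratio only starts to matter once total dirty data exceeds
    #   (dirty_background_ratio + dirty_ratio) / 2 = (10 + 20) / 2 = 15%
    # i.e. roughly 0.15 * 16GB ~= 2.4GB of dirty data system-wide.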

Andrew has queued up a patch series from Maxim Patlasov which removes this
caveat, but currently we don't have a way an admin can switch that from
userspace. I'd like to have that tunable from userspace exactly for
cases like the one you describe below.

> I want the default behaviour for almost everything, but DVD drives in DVD+RW
> packet-writing mode may easily take several minutes in case of a sync.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 23:37         ` Theodore Ts'o
@ 2013-10-29 20:40           ` Jan Kara
  2013-10-30 10:07             ` Artem S. Tashkinov
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2013-10-29 20:40 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Fengguang Wu, Artem S. Tashkinov, torvalds, akpm, linux-kernel,
	Diego Calleja, David Lang, NeilBrown

On Fri 25-10-13 19:37:53, Ted Tso wrote:
> On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
> > 
> > Ted, when trying to follow up on your email, I got a crazy idea and it'd
> > be better to throw it out rather than carry it to bed. :)
> > 
> > We could do per-bdi dirty thresholds - which has been proposed 1-2
> > times before by different people.
> > 
> > The per-bdi dirty thresholds could be auto set by the kernel this way: 
> > start it with an initial value of 100MB. When reached, put all the
> > 100MB dirty data to IO and get an estimation of the write bandwidth.
> > From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> > where N is the seconds of dirty data we'd like to cache in memory.
> 
> Sure, although I wonder if it would be worth it to calculate some kind of
> rolling average of the write bandwidth while we are doing writeback,
> so if it turns out we got unlucky with the contents of the first 100MB
> of dirty data (it could be either highly random or highly sequential)
> then we'll eventually correct to the right level.
  We already average the measured throughput over a longer time window and
have a kind of rolling-average algorithm doing some smoothing.

> This means that VM would have to keep dirty page counters for each BDI
> --- which I thought we weren't doing right now, which is why we have a
> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
> have cause and effect reversed?  :-)
  And we do currently keep the number of dirty & under writeback pages per
BDI. We have global limits because mm wants to limit the total number of dirty
pages (as those are harder to free). It doesn't care as much to which device
these pages belong (although it probably should care a bit more because
there are huge differences in how quickly different devices can get rid
of dirty pages).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:30         ` Jan Kara
@ 2013-10-29 20:43           ` Andrew Morton
  2013-10-29 21:30             ` Jan Kara
  2013-10-29 21:36             ` Linus Torvalds
  2013-10-31 14:26           ` Karl Kiniger
  1 sibling, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2013-10-29 20:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Karl Kiniger, Linus Torvalds, Artem S. Tashkinov, Wu Fengguang,
	Linux Kernel Mailing List

On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <jack@suse.cz> wrote:

> Andrew has queued up a patch series from Maxim Patlasov which removes this
> caveat but currently we don't have a way admin can switch that from
> userspace. But I'd like to have that tunable from userspace exactly for the
> cases as you describe below.

This?

commit 5a53748568f79641eaf40e41081a2f4987f005c2
Author:     Maxim Patlasov <mpatlasov@parallels.com>
AuthorDate: Wed Sep 11 14:22:46 2013 -0700
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Sep 11 15:58:04 2013 -0700

    mm/page-writeback.c: add strictlimit feature

That's already in mainline, for 3.12.


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 18:26     ` Artem S. Tashkinov
  2013-10-25 19:40       ` Diego Calleja
  2013-10-25 20:43       ` NeilBrown
@ 2013-10-29 20:49       ` Jan Kara
  2 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2013-10-29 20:49 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

On Fri 25-10-13 18:26:23, Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this?  The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
> 
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but
> I cannot imagine my life without being able to see the progress of a
> copying operation. With the current dirty cache there's no way to
> understand how your storage media actually behaves.
  Large writes shouldn't stall your desktop - that's certain, and we must fix
that. I don't find the problem with copy progress indicators that
pressing...

> Hopefully this issue won't dissolve into obscurity and someone will
> actually make up a plan (and a patch) how to make dirty write cache
> behave in a sane manner considering the fact that there are devices with
> very different write speeds and requirements. It'd be even better, if I
> could specify dirty cache as a mount option (though sane defaults or
> semi-automatic values based on runtime estimates won't hurt).
> 
> Per device dirty cache seems like a nice idea, I, for one, would like to
> disable it altogether or make it an absolute minimum for things like USB
> flash drives - because I don't care about multithreaded performance or
> delayed allocation on such devices - I'm interested in my data reaching
> my USB stick ASAP - because it's how most people use them.
  See my other emails in this thread. There are ways to tune the amount of
dirty data allowed per device. Currently the result isn't very satisfactory
but we should have something usable after the next merge window.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  9:32         ` Linus Torvalds
  2013-10-26 11:32           ` Pavel Machek
@ 2013-10-29 20:57           ` Jan Kara
  2013-10-29 21:33             ` Linus Torvalds
  2013-10-30 12:01             ` Mel Gorman
  1 sibling, 2 replies; 56+ messages in thread
From: Jan Kara @ 2013-10-29 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List, mgorman

On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case.  Will take a look after a return to normalcy ;)
> 
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.
  I'll try to reproduce this tomorrow so that I can have a look at where
exactly we are stuck. But in the last few releases problems like this were
caused by problems in reclaim which got fed up with seeing lots of dirty
/ under-writeback pages and ended up stuck waiting for IO to finish. Mel
has been tweaking the logic here and there but maybe it hasn't been fixed
completely. Mel, do you know about any outstanding issues?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:43           ` Andrew Morton
@ 2013-10-29 21:30             ` Jan Kara
  2013-10-29 21:36             ` Linus Torvalds
  1 sibling, 0 replies; 56+ messages in thread
From: Jan Kara @ 2013-10-29 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Karl Kiniger, Linus Torvalds, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List

On Tue 29-10-13 13:43:46, Andrew Morton wrote:
> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > Andrew has queued up a patch series from Maxim Patlasov which removes this
> > caveat but currently we don't have a way admin can switch that from
> > userspace. But I'd like to have that tunable from userspace exactly for the
> > cases as you describe below.
> 
> This?
> 
> commit 5a53748568f79641eaf40e41081a2f4987f005c2
> Author:     Maxim Patlasov <mpatlasov@parallels.com>
> AuthorDate: Wed Sep 11 14:22:46 2013 -0700
> Commit:     Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Wed Sep 11 15:58:04 2013 -0700
> 
>     mm/page-writeback.c: add strictlimit feature
> 
> That's already in mainline, for 3.12.
  Yes, I should have checked the code...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:57           ` Jan Kara
@ 2013-10-29 21:33             ` Linus Torvalds
  2013-10-29 22:13               ` Jan Kara
  2013-10-30 12:01             ` Mel Gorman
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2013-10-29 21:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 4672 bytes --]

On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
>>
>> It definitely doesn't work. I can trivially reproduce problems by just
>> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
>> git clone to it. The end result is not pretty, and that's actually not
>> even a huge amount of data.
>
>   I'll try to reproduce this tomorrow so that I can have a look at where
> exactly we are stuck. But in the last few releases problems like this were
> caused by problems in reclaim, which got fed up with seeing lots of dirty
> / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't been fixed
> completely. Mel, do you know about any outstanding issues?

I'm not sure this has ever worked, and in the last few years the
common desktop memory size has continued to grow.

For servers and "serious" desktops, having tons of dirty data doesn't
tend to be as much of a problem, because those environments are pretty
much defined by also having fairly good IO subsystems, and people
seldom use crappy USB devices for more than doing things like reading
pictures off them etc. And you'd not even see the problem under any
such load.

But it's actually really easy to reproduce by just taking your average
USB key and trying to write to it. I just did it with a random ISO
image, and it's _painful_. And it's not that it's painful for doing
most other things in the background, but if you just happen to run
anything that does "sync" (and it happens in scripts), the thing just
comes to a screeching halt. For minutes.

Same obviously goes with trying to eject/unmount the media etc.

We've had this problem before with the whole "ratio of dirty memory"
thing. It was a mistake. It made sense (and came from) back in the
days when people had 16MB or 32MB of RAM, and the concept of "let's
limit dirty memory to x% of that" was actually fairly reasonable. But
that "x%" doesn't make much sense any more. x% of 16GB (which is quite
the reasonable amount of memory for any modern desktop) is a huge
thing, and in the meantime the performance of disks has gone up a lot
(largely thanks to SSD's), but the *minimum* performance of disks
hasn't really improved all that much (largely thanks to USB ;).

So how about we just admit that the whole "ratio" thing was a big
mistake, and tell people that if they want to set a dirty limit, they
should do so in bytes? Which we already really do, but we default to
that ratio nevertheless. Which is why I'd suggest we just say "the
ratio works fine up to a certain amount, and makes no sense past it".

Why not make that "the ratio works fine up to a certain amount, and
makes no sense past it" be part of the calculations? We actually
*have* exactly that on HIGHMEM machines, where we have this
configuration option of "vm_highmem_is_dirtyable" that defaults to
off. It just doesn't trigger on nonhighmem machines (today: "64-bit").

So I would suggest that we just expose that "vm_highmem_is_dirtyable"
on 64-bit too, and just say that anything over 1GB is highmem. That
means that 32-bit and 64-bit environments will basically act the same,
and I think it makes the defaults a bit saner.

Limiting the amount of dirty memory to 100MB/200MB (for "start
background writing" and "wait synchronously" respectively) even if you
happen to have 16GB of memory sounds like a good idea. Sure, it might
make some benchmarks a bit slower, but it will at least avoid the
"wait forever" symptom. And if you really have a very studly IO
subsystem, the fact that it starts writing out earlier won't really be
a problem.
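
(For anyone who wants to try that behavior today, the byte-based knobs
already exist - the numbers below are just the 100MB/200MB example from
the paragraph above:

    echo $((100*1024*1024)) > /proc/sys/vm/dirty_background_bytes
    echo $((200*1024*1024)) > /proc/sys/vm/dirty_bytes

The *_bytes and *_ratio knobs are mutually exclusive; whichever was written
last is the one in effect.)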

After all, there are two reasons to do delayed writes:

 - temp-files may not be written out at all.

   Quite frankly, if you have multi-hundred-megabyte temp-files, you've
got issues

 - coalescing writes improves throughput

   There are very much diminishing returns, and the big return is to
make sure that we write things out in a good order, which a 100MB
buffer should make more than possible.

so I really think that it's insane to default to 1.6GB of dirty data
before you even start writing it out if you happen to have 16GB of
memory.

And again: if your benchmark is to create a kernel tree and then
immediately delete it, and you used to do that without doing any
actual IO, then yes, the attached patch will make that go much slower.
But for that benchmark, maybe you should just set the dirty limits (in
bytes) by hand, rather than expect the default kernel values to prefer
benchmarks over sanity?

Suggested patch attached. Comments?

                            Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1234 bytes --]

 kernel/sysctl.c     | 2 --
 mm/page-writeback.c | 7 ++++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3c6a3f..411da56cd732 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1406,7 +1406,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_HIGHMEM
 	{
 		.procname	= "highmem_is_dirtyable",
 		.data		= &vm_highmem_is_dirtyable,
@@ -1416,7 +1415,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
 	{
 		.procname	= "scan_unevictable_pages",
 		.data		= &scan_unevictable_pages,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63807583d8e8..b3bce1cd59d5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -241,8 +241,13 @@ static unsigned long global_dirtyable_memory(void)
 	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	x -= min(x, dirty_balance_reserve);
 
-	if (!vm_highmem_is_dirtyable)
+	if (!vm_highmem_is_dirtyable) {
+		const unsigned long GB_pages = 1024*1024*1024 / PAGE_SIZE;
+
 		x -= highmem_dirtyable_memory(x);
+		if (x > GB_pages)
+			x = GB_pages;
+	}
 
 	return x + 1;	/* Ensure that we never return 0 */
 }

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:43           ` Andrew Morton
  2013-10-29 21:30             ` Jan Kara
@ 2013-10-29 21:36             ` Linus Torvalds
  1 sibling, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-29 21:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Karl Kiniger, Artem S. Tashkinov, Wu Fengguang,
	Linux Kernel Mailing List

On Tue, Oct 29, 2013 at 1:43 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <jack@suse.cz> wrote:
>
>> Andrew has queued up a patch series from Maxim Patlasov which removes this
>> caveat but currently we don't have a way admin can switch that from
>> userspace. But I'd like to have that tunable from userspace exactly for the
>> cases as you describe below.
>
> This?
>
>     mm/page-writeback.c: add strictlimit feature
>
> That's already in mainline, for 3.12.

Nothing currently actually *sets* the BDI_CAP_STRICTLIMIT flag, though.

So it's a potential fix, but it's certainly not a fix now.

               Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 21:33             ` Linus Torvalds
@ 2013-10-29 22:13               ` Jan Kara
  2013-10-29 22:42                 ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2013-10-29 22:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List, Mel Gorman

On Tue 29-10-13 14:33:53, Linus Torvalds wrote:
> On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> >>
> >> It definitely doesn't work. I can trivially reproduce problems by just
> >> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> >> git clone to it. The end result is not pretty, and that's actually not
> >> even a huge amount of data.
> >
> >   I'll try to reproduce this tomorrow so that I can have a look at where
> > exactly we are stuck. But in the last few releases problems like this were
> > caused by problems in reclaim, which got fed up with seeing lots of dirty
> > / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> > has been tweaking the logic here and there but maybe it hasn't been fixed
> > completely. Mel, do you know about any outstanding issues?
> 
> I'm not sure this has ever worked, and in the last few years the
> common desktop memory size has continued to grow.
> 
> For servers and "serious" desktops, having tons of dirty data doesn't
> tend to be as much of a problem, because those environments are pretty
> much defined by also having fairly good IO subsystems, and people
> seldom use crappy USB devices for more than doing things like reading
> pictures off them etc. And you'd not even see the problem under any
> such load.
> 
> But it's actually really easy to reproduce by just taking your average
> USB key and trying to write to it. I just did it with a random ISO
> image, and it's _painful_. And it's not that it's painful for doing
> most other things in the background, but if you just happen to run
> anything that does "sync" (and it happens in scripts), the thing just
> comes to a screeching halt. For minutes.
  Yes, I agree that caching more than a couple of seconds' worth of writeback
for a device isn't good.

> Same obviously goes with trying to eject/unmount the media etc.
> 
> We've had this problem before with the whole "ratio of dirty memory"
> thing. It was a mistake. It made sense (and came from) back in the
> days when people had 16MB or 32MB of RAM, and the concept of "let's
> limit dirty memory to x% of that" was actually fairly reasonable. But
> that "x%" doesn't make much sense any more. x% of 16GB (which is quite
> the reasonable amount of memory for any modern desktop) is a huge
> thing, and in the meantime the performance of disks has gone up a lot
> (largely thanks to SSD's), but the *minimum* performance of disks
> hasn't really improved all that much (largely thanks to USB ;).
> 
> So how about we just admit that the whole "ratio" thing was a big
> mistake, and tell people that if they want to set a dirty limit, they
> should do so in bytes? Which we already really do, but we default to
> that ratio nevertheless. Which is why I'd suggest we just say "the
> ratio works fine up to a certain amount, and makes no sense past it".
> 
> Why not make that "the ratio works fine up to a certain amount, and
> makes no sense past it" be part of the calculations? We actually
> *have* exactly that on HIGHMEM machines, where we have this
> configuration option of "vm_highmem_is_dirtyable" that defaults to
> off. It just doesn't trigger on nonhighmem machines (today: "64-bit").
> 
> So I would suggest that we just expose that "vm_highmem_is_dirtyable"
> on 64-bit too, and just say that anything over 1GB is highmem. That
> means that 32-bit and 64-bit environments will basically act the same,
> and I think it makes the defaults a bit saner.
> 
> Limiting the amount of dirty memory to 100MB/200MB (for "start
> background writing" and "wait synchronously" respectively) even if you
> happen to have 16GB of memory sounds like a good idea. Sure, it might
> make some benchmarks a bit slower, but it will at least avoid the
> "wait forever" symptom. And if you really have a very studly IO
> subsystem, the fact that it starts writing out earlier won't really be
> a problem.
  So I think we both realize this is only about what the default should be.
There will always be people who have loads which benefit from setting dirty
limits high but I agree they are a minority. The reason why we left the
limits at what they are now, despite them making less and less sense, is that
we didn't want to break user expectations. If we cap the dirty limits as
you suggest, I bet we'll get some user complaints, and the "don't break users"
policy thus tells me we shouldn't make such changes ;)

Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
but I think we should experiment with numbers a bit to check whether we
didn't miss something.
 
> After all, there are two reasons to do delayed writes:
> 
>  - temp-files may not be written out at all.
> 
>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> got issues
  Actually people do stuff like this e.g. when generating ISO images before
burning them.

>  - coalescing writes improves throughput
> 
>    There are very much diminishing returns, and the big return is to
> make sure that we write things out in a good order, which a 100MB
> buffer should make more than possible.
  True.

  There is one more aspect:
- transforming random writes into mostly sequential writes

  Different userspace programs use simple memory mapped databases which do
random writes into their data files. The less you write these back the
better (at least from a throughput POV). I'm not sure how large these
files are together on an average user desktop, but my guess would be that
100 MB *should* be enough for them. Can anyone with GNOME / KDE desktop try
running with limits set this low for some time?
 
> so I really think that it's insane to default to 1.6GB of dirty data
> before you even start writing it out if you happen to have 16GB of
> memory.
> 
> And again: if your benchmark is to create a kernel tree and then
> immediately delete it, and you used to do that without doing any
> actual IO, then yes, the attached patch will make that go much slower.
> But for that benchmark, maybe you should just set the dirty limits (in
> bytes) by hand, rather than expect the default kernel values to prefer
> benchmarks over sanity?
> 
> Suggested patch attached. Comments?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 22:13               ` Jan Kara
@ 2013-10-29 22:42                 ` Linus Torvalds
  2013-11-01 17:22                   ` Fengguang Wu
  2013-11-04 12:26                   ` Pavel Machek
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2013-10-29 22:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List, Mel Gorman,
	Maxim Patlasov

On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara <jack@suse.cz> wrote:
>
>   So I think we both realize this is only about what the default should be.

Yes. Most people will use the defaults, but there will always be
people who tune things for particular loads.

In fact, I think we have gone much too far in saying "all policy in
user space", because the fact is, user space isn't very good at
policy. Especially not at reacting to complex situations with
different devices. From what I've seen, "policy in user space" has
resulted in exactly two modes:

 - user space does something stupid and wrong (example: "nice -19 X"
to work around some scheduler oddities)

 - user space does nothing at all, and the kernel people say "hey,
user space _could_ set this value Xyz, so it's not our problem, and
it's policy, so we shouldn't touch it".

I think we in the kernel should say "our defaults should be what
everybody sane can use, and they should work fine on average". With
"policy in user space" being for crazy people that do really odd
things and can really spare the time to tune for their particular
issue.

So the "policy in user space" should be about *overriding* kernel
policy choices, not about the kernel never having them.

And this kind of "you can have many different devices and they act
quite differently" is a good example of something complicated that
user space really doesn't have a great model for. And we actually have
much better possible information in the kernel than user space ever is
likely to have.

> Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> but I think we should experiment with numbers a bit to check whether we
> didn't miss something.

Sure. That said, the patch I suggested basically makes the numbers be
at least roughly comparable across different architectures. So it's
been at least somewhat tested, even if 16GB x86-32 machines are
hopefully pretty rare (but I hear about people installing 32-bit on
modern machines much too often).

>>  - temp-files may not be written out at all.
>>
>>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
>> got issues
>   Actually people do stuff like this e.g. when generating ISO images before
> burning them.

Yes, but then the temp-file is long-lived enough that it *will* hit
the disk anyway. So it's only the "create temporary file and pretty
much immediately delete it" case that changes behavior (ie compiler
assembly files etc).

If the temp-file is for something like burning an ISO image, the
burning part is slow enough that the temp-file will hit the disk
regardless of when we start writing it.

>   There is one more aspect:
> - transforming random writes into mostly sequential writes

Sure. And I think that if you have a big database, that's when you do
end up tweaking the dirty limits.

That said, I'd certainly like it even *more* if the limits really were
per-BDI, and the global limit was in addition to the per-bdi ones.
Because when you have a USB device that gets maybe 10MB/s on
contiguous writes, and 100kB/s on random 4k writes, I think it would
make more sense to make the "start writeout" limits be 1MB/2MB, not
100MB/200MB. So my patch doesn't even take it far enough, it's just a
"let's not be ridiculous". The per-BDI limits don't seem quite ready
for prime time yet, though. Even the new "strict" limits seems to be
more about "trusted filesystems" than about really sane writeback
limits.

Fengguang, comments?

(And I added Maxim to the cc, since he's the author of the strict
mode, and while it is currently limited to FUSE, he did mention USB
storage in the commit message..).

                  Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:40           ` Jan Kara
@ 2013-10-30 10:07             ` Artem S. Tashkinov
  2013-10-30 15:12               ` Jan Kara
  0 siblings, 1 reply; 56+ messages in thread
From: Artem S. Tashkinov @ 2013-10-30 10:07 UTC (permalink / raw)
  To: jack
  Cc: tytso, fengguang.wu, torvalds, akpm, linux-kernel, diegocg, david, neilb

Oct 30, 2013 02:41:01 AM, Jack wrote:
On Fri 25-10-13 19:37:53, Ted Tso wrote:
>> Sure, although I wonder if it would be worth it to calculate some kind of
>> rolling average of the write bandwidth while we are doing writeback,
>> so if it turns out we got unlucky with the contents of the first 100MB
>> of dirty data (it could be either highly random or highly sequential)
>> then we'll eventually correct to the right level.
>  We already do average measured throughput over a longer time window and
>have a kind of rolling average algorithm doing some averaging.
>
>> This means that VM would have to keep dirty page counters for each BDI
>> --- which I thought we weren't doing right now, which is why we have a
>> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
>> have cause and effect reversed?  :-)
>  And we do currently keep the number of dirty & under writeback pages per
>BDI. We have global limits because mm wants to limit the total number of dirty
>pages (as those are harder to free). It doesn't care as much to which device
>these pages belong (although it probably should care a bit more because
>there are huge differences between how quickly different devices can get rid
>of dirty pages).

This might sound like an absolutely stupid question which makes no sense at
all, so I want to apologize for it in advance, but since the Linux kernel lacks
revoke(), does that mean that dirty buffers will always occupy the kernel memory
if I for instance remove my USB stick before the kernel has had the time to flush
these buffers?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:57           ` Jan Kara
  2013-10-29 21:33             ` Linus Torvalds
@ 2013-10-30 12:01             ` Mel Gorman
  2013-11-19 17:17               ` Rob Landley
  1 sibling, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2013-10-30 12:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Andrew Morton, Theodore Ts'o,
	Artem S. Tashkinov, Wu Fengguang, Linux Kernel Mailing List

On Tue, Oct 29, 2013 at 09:57:56PM +0100, Jan Kara wrote:
> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > > in this case.  Will take a look after a return to normalcy ;)
> > 
> > It definitely doesn't work. I can trivially reproduce problems by just
> > having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> > git clone to it. The end result is not pretty, and that's actually not
> > even a huge amount of data.
>
>   I'll try to reproduce this tomorrow so that I can have a look at where
> exactly we are stuck. But in the last few releases problems like this were
> caused by problems in reclaim, which got fed up with seeing lots of dirty
> / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't been fixed
> completely. Mel, do you know about any outstanding issues?
> 

Yeah, there are still a few. The work in that general area dealt with
such problems as dirty pages reaching the end of the LRU (excessive CPU
usage), calling wait_on_page_writeback from reclaim context (random
processes stalling even though there was not much memory pressure),
desktop applications stalling randomly (second quick write stalling on
stable writeback). The systemtap script caught those types of areas and I
believe they are fixed up.

There are still problems though. If all dirty pages were backed by a slow
device then dirty limiting is still eventually going to cause stalls in
dirty page balancing. If there is a global sync then the shit can really
hit the fan if it all gets stuck waiting on something like journal space.
Applications that are very fsync happy can still get stalled for long
periods of time behind slower writers as they wait for the IO to flush.
When all this happens there may still be spikes in CPU usage if it scans
the dirty pages excessively without sleeping.

Consciously or unconsciously my desktop applications generally do not fall
foul of these problems. At least one of the desktop environments can stall
because it calls fsync on history and preference files constantly but I
cannot remember which one, or if it has been fixed since. I did have a problem
with gnome-terminal as it depended on a library that implemented scrollback
buffering by writing single-line files to /tmp and then truncating them
which would "freeze" the terminal under IO. I now use tmpfs for /tmp to
get around this. When I'm writing to USB sticks I think it tends to stay
between the point where background writing starts and dirty throttling
occurs so I rarely notice any major problems. I'm probably unconsciously
avoiding doing any write-heavy work while a USB stick is plugged in.

Addressing this goes back to tuning dirty ratio or replacing it. Tuning
it always falls foul of "works for one person and not another" and fails
utterly when there is storage with different speeds. We talked about this a
few months ago but I still suspect that we will have to bite the bullet and
tune based on "do not dirty more data than it takes N seconds to write back"
using per-bdi writeback estimations. It's just not that trivial to implement
as the writeback speeds can change for a variety of reasons (multiple IO
sources, random vs sequential etc). Hence at one point we think we are
within our target window and then get it completely wrong. Dirty ratio
is a hard guarantee; dirty writeback estimation is a best effort that will
go wrong in some cases.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-30 10:07             ` Artem S. Tashkinov
@ 2013-10-30 15:12               ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2013-10-30 15:12 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: jack, tytso, fengguang.wu, torvalds, akpm, linux-kernel, diegocg,
	david, neilb

On Wed 30-10-13 10:07:08, Artem S. Tashkinov wrote:
> Oct 30, 2013 02:41:01 AM, Jack wrote:
> On Fri 25-10-13 19:37:53, Ted Tso wrote:
> >> Sure, although I wonder if it would be worth it to calculate some kind of
> >> rolling average of the write bandwidth while we are doing writeback,
> >> so if it turns out we got unlucky with the contents of the first 100MB
> >> of dirty data (it could be either highly random or highly sequential)
> >> then we'll eventually correct to the right level.
> >  We already do average measured throughput over a longer time window and
> >have a kind of rolling average algorithm doing some averaging.
> >
> >> This means that VM would have to keep dirty page counters for each BDI
> >> --- which I thought we weren't doing right now, which is why we have a
> >> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
> >> have cause and effect reversed?  :-)
> >  And we do currently keep the number of dirty & under writeback pages per
> >BDI. We have global limits because mm wants to limit the total number of dirty
> >pages (as those are harder to free). It doesn't care as much to which device
> >these pages belong (although it probably should care a bit more because
> >there are huge differences between how quickly different devices can get rid
> >of dirty pages).
> 
> This might sound like an absolutely stupid question which makes no sense at
> all, so I want to apologize for it in advance, but since the Linux kernel lacks
> revoke(), does that mean that dirty buffers will always occupy the kernel memory
> if I for instance remove my USB stick before the kernel has had the time to flush
> these buffers?
  That's actually a good question. And the answer is that currently when we
hit EIO while writing out dirty data, we just throw away that data. Not
an ideal solution for some cases but it solves the problem with unwriteable
data...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 20:30         ` Jan Kara
  2013-10-29 20:43           ` Andrew Morton
@ 2013-10-31 14:26           ` Karl Kiniger
  2013-11-01 14:25             ` Maxim Patlasov
  2013-11-01 14:31             ` [PATCH] mm: add strictlimit knob Maxim Patlasov
  1 sibling, 2 replies; 56+ messages in thread
From: Karl Kiniger @ 2013-10-31 14:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Artem S. Tashkinov, Wu Fengguang, Andrew Morton,
	Linux Kernel Mailing List

On Tue 131029, Jan Kara wrote:
> On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> > On Fri 131025, Linus Torvalds wrote:
.... 
> > Is it currently possible to somehow set above values per block device?
>   Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
> the maximum proportion the device's dirty data can take from the total
> amount. The caveat currently is that this setting only takes effect after
> we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
> total because that is an amount of dirty data when we start to throttle
> processes. So if the device you'd like to limit is the only one which is
> currently written to, the limiting doesn't have a big effect.

Thanks for the info - that's what I was looking for.

You are right that the limiting doesn't have a big effect right now:

on my  4x speed  DVD+RW on /dev/sr0, x86_64, 4GB,
Fedora19:

max_ratio set to 100  - about 500MB buffered, sync time 2:10 min.
max_ratio set to 1    - about 330MB buffered, sync time 1:23 min.

... way too much buffering.

(measured with strace -tt -ewrite dd if=/dev/zero of=bigfile bs=1M count=1000
by looking at the timestamps).


Karl

....
								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-31 14:26           ` Karl Kiniger
@ 2013-11-01 14:25             ` Maxim Patlasov
  2013-11-01 14:31             ` [PATCH] mm: add strictlimit knob Maxim Patlasov
  1 sibling, 0 replies; 56+ messages in thread
From: Maxim Patlasov @ 2013-11-01 14:25 UTC (permalink / raw)
  To: karl.kiniger
  Cc: jack, linux-kernel, t.artem, mgorman, tytso, akpm, fengguang.wu,
	torvalds, mpatlasov

On Thu 31-10-13 14:26:12, Karl Kiniger wrote:
> On Tue 131029, Jan Kara wrote:
> > On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> > > On Fri 131025, Linus Torvalds wrote:
> ....
> > > Is it currently possible to somehow set above values per block device?
> >   Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
> > the maximum proportion the device's dirty data can take from the total
> > amount. The caveat currently is that this setting only takes effect after
> > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
> > total because that is an amount of dirty data when we start to throttle
> > processes. So if the device you'd like to limit is the only one which is
> > currently written to, the limiting doesn't have a big effect.
>
> Thanks for the info - that's what I was looking for.
>
> You are right that the limiting doesn't have a big effect right now:
>
> on my  4x speed  DVD+RW on /dev/sr0, x86_64, 4GB,
> Fedora19:
>
> max_ratio set to 100  - about 500MB buffered, sync time 2:10 min.
> max_ratio set to 1    - about 330MB buffered, sync time 1:23 min.
>
> ... way too much buffering.

"strictlimit" feature must fit your and Artem's needs quite well. The feature
enforces per-BDI dirty limits even if the global dirty limit is not reached
yet. I'll send a patch adding knob to turn it on/off.
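
Once that knob is in, usage would look roughly like this (8:16 is just an
example major:minor for a USB stick - check /sys/class/bdi/ for the real
one):

    # give the stick a tiny share of the dirty limit and enforce it strictly
    echo 1 > /sys/class/bdi/8:16/max_ratio
    echo 1 > /sys/class/bdi/8:16/strictlimit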

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH] mm: add strictlimit knob
  2013-10-31 14:26           ` Karl Kiniger
  2013-11-01 14:25             ` Maxim Patlasov
@ 2013-11-01 14:31             ` Maxim Patlasov
  2013-11-04 22:01               ` Andrew Morton
  1 sibling, 1 reply; 56+ messages in thread
From: Maxim Patlasov @ 2013-11-01 14:31 UTC (permalink / raw)
  To: karl.kiniger
  Cc: jack, linux-kernel, t.artem, linux-mm, mgorman, tytso, akpm,
	fengguang.wu, torvalds, mpatlasov

"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://www.http.com//article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

When enabled, the feature enforces the bdi max_ratio limit even if the global (10%)
dirty limit is not reached. Of course, the effect is not visible until
max_ratio is decreased to some reasonable value.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7..4ee1d64 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(stable_pages_required);
 
+static ssize_t strictlimit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int val;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	switch (val) {
+	case 0:
+		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
+		break;
+	case 1:
+		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+static ssize_t strictlimit_show(struct device *dev,
+		struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
+}
+static DEVICE_ATTR_RW(strictlimit);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
+	&dev_attr_strictlimit.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 22:42                 ` Linus Torvalds
@ 2013-11-01 17:22                   ` Fengguang Wu
  2013-11-04 12:19                     ` Pavel Machek
  2013-11-04 12:26                   ` Pavel Machek
  1 sibling, 1 reply; 56+ messages in thread
From: Fengguang Wu @ 2013-11-01 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Linux Kernel Mailing List, Mel Gorman, Maxim Patlasov

// Sorry for the late response! I'm on marriage leave these days. :)

On Tue, Oct 29, 2013 at 03:42:08PM -0700, Linus Torvalds wrote:
> On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara <jack@suse.cz> wrote:
> >
> >   So I think we both realize this is only about what the default should be.
> 
> Yes. Most people will use the defaults, but there will always be
> people who tune things for particular loads.
> 
> In fact, I think we have gone much too far in saying "all policy in
> user space", because the fact is, user space isn't very good at
> policy. Especially not at reacting to complex situations with
> different devices. From what I've seen, "policy in user space" has
> resulted in exactly two modes:
> 
>  - user space does something stupid and wrong (example: "nice -19 X"
> to work around some scheduler oddities)
> 
>  - user space does nothing at all, and the kernel people say "hey,
> user space _could_ set this value Xyz, so it's not our problem, and
> it's policy, so we shouldn't touch it".
> 
> I think we in the kernel should say "our defaults should be what
> everybody sane can use, and they should work fine on average". With
> "policy in user space" being for crazy people that do really odd
> things and can really spare the time to tune for their particular
> issue.
> 
> So the "policy in user space" should be about *overriding* kernel
> policy choices, not about the kernel never having them.

Totally agreed. The kernel defaults should be geared to the
typical use case of the majority of users, unless that would lead to insane
behavior in some less frequent but still relevant use cases.

> And this kind of "you can have many different devices and they act
> quite differently" is a good example of something complicated that
> user space really doesn't have a great model for. And we actually have
> much better possible information in the kernel than user space ever is
> likely to have.
> 
> > Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> > but I think we should experiment with numbers a bit to check whether we
> > didn't miss something.
> 
> Sure. That said, the patch I suggested basically makes the numbers be
> at least roughly comparable across different architectures. So it's
> been at least somewhat tested, even if 16GB x86-32 machines are
> hopefully pretty rare (but I hear about people installing 32-bit on
> modern machines much too often).

Yeah, it's interesting that the new policy rule actually makes x86_64
behave more consistently with i386, and hence it has been reasonably
tested.

> >>  - temp-files may not be written out at all.
> >>
> >>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> >> got issues
> >   Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
> 
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway. So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
> 
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

The temp-file IO avoidance is an optimization, not a guarantee. If a
user seriously wants to avoid IO, he will probably use tmpfs and
disable swap.

So if we have to do some trade-offs in the optimization, I agree that
we should optimize more towards the "large copies to USB stick" use case.

The alternative solution, per-bdi dirty thresholds, could eliminate
the need to do such trade-offs. So it's worth looking at the two
solutions side by side.

> >   There is one more aspect:
> > - transforming random writes into mostly sequential writes
> 
> Sure. And I think that if you have a big database, that's when you do
> end up tweaking the dirty limits.

Sure. In general, whenever we have to make some tradeoffs, it's
probably better to "sacrifice" the embedded and super computing worlds
much more than the desktop. Because in the former areas, people tend
to have the skill and mind set to do customizations and optimizations.

I wonder if some hand-held devices will set dirty_background_bytes to
0 for better data safety.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not
> 100MB/200MB. So my patch doesn't even take it far enough, it's just a
> "let's not be ridiculous". The per-BDI limits don't seem quite ready
> for prime time yet, though. Even the new "strict" limits seems to be
> more about "trusted filesystems" than about really sane writeback
> limits.
> 
> Fengguang, comments?

Basically A) lowering the global dirty limit is a reasonable tradeoff,
and B) the time-based per-bdi dirty limits seem like the ultimate
solution that could offer sane defaults to your heart's content.

Since both will be user interface (including semantic) changes, we
have to be careful. It's obvious that if (B) can ever be implemented
properly and made mature quickly, it would be the best choice and would
eliminate the need to do (A). But as Mel said in the other email, (B)
is not that easy to implement...

> (And I added Maxim to the cc, since he's the author of the strict
> mode, and while it is currently limited to FUSE, he did mention USB
> storage in the commit message..).
 
The *bytes* based per-bdi limits are relatively easy. It's only a
question of code maturity. Once the interface is exported to user
space, we can guarantee the exact limit to the user.

However, for *time* based per-bdi limits there will always be
estimation errors, as summarized in Mel's email. They offer sane
semantics to the user, but may not always work to expectation,
since writeback bandwidth may change over time depending on the workload.

It feels much better to have some hard guarantee. So even when the
time based limits are implemented, we'll probably still want to
disable the slippery time/bandwidth estimation when the user is able
to provide some bytes based per-bdi limits: hey I don't care about
random writes etc. subtle situations. I know this disk's max write
bandwidth is 100MB/s and it's a good rule of thumb. Let's simply set
its dirty limit to 100MB.

Or shall we do the simpler and less volatile "max write bandwidth"
estimation and use it for automatic per-bdi dirty limits?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-01 17:22                   ` Fengguang Wu
@ 2013-11-04 12:19                     ` Pavel Machek
  0 siblings, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2013-11-04 12:19 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, Jan Kara, Andrew Morton, Theodore Ts'o,
	Artem S. Tashkinov, Linux Kernel Mailing List, Mel Gorman,
	Maxim Patlasov

Hi!

> > Yes, but then the temp-file is long-lived enough that it *will* hit
> > the disk anyway. So it's only the "create temporary file and pretty
> > much immediately delete it" case that changes behavior (ie compiler
> > assembly files etc).
> > 
> > If the temp-file is for something like burning an ISO image, the
> > burning part is slow enough that the temp-file will hit the disk
> > regardless of when we start writing it.
> 
> The temp-file IO avoidance is an optimization not a guarantee. If a
> user want to avoid IO seriously, he will probably use tmpfs and
> disable swap.

No, sorry, they can't. Assuming the ISO image fits in tmpfs would be
cruel.

> So if we have to do some trade-offs in the optimization, I agree that
> we should optimize more towards the "large copies to USB stick" use case.
> 
> The alternative solution, per-bdi dirty thresholds, could eliminate
> the need to do such trade-offs. So it's worth looking at the two
> solutions side by side.

Yes, please.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-29 22:42                 ` Linus Torvalds
  2013-11-01 17:22                   ` Fengguang Wu
@ 2013-11-04 12:26                   ` Pavel Machek
  1 sibling, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2013-11-04 12:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Theodore Ts'o, Artem S. Tashkinov,
	Wu Fengguang, Linux Kernel Mailing List, Mel Gorman,
	Maxim Patlasov

Hi!

> >>  - temp-files may not be written out at all.
> >>
> >>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> >> got issues
> >   Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
> 
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway. So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
> 
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

It will hit the disk, but with the proposed change, burning will still be
slower.

Before:

create 700MB iso
burn the CD, at the same time writing the iso to disk

After:

create 700MB iso and write most of it to disk
burn the CD, writing the rest.

But yes, limiting dirty amounts is a good idea.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not

Actually I believe I've seen 10kB/sec on an SD card... I'd expect that
from USB sticks, too.

And yes, there are actually real problems with this at least on N900.

You do apt-get install <big package>. apt internally does fsyncs. It
results in big enough latencies that watchdogs kick in and kill the
machine.

http://pavelmachek.livejournal.com/117089.html

People are doing

    echo 3 > /proc/sys/vm/dirty_ratio
    echo 3 > /proc/sys/vm/dirty_background_ratio
    echo 100 > /proc/sys/vm/dirty_writeback_centisecs
    echo 100 > /proc/sys/vm/dirty_expire_centisecs
    echo 4096 > /proc/sys/vm/min_free_kbytes
    echo 50 > /proc/sys/vm/swappiness
    echo 200 > /proc/sys/vm/vfs_cache_pressure
    echo 8 > /proc/sys/vm/page-cluster
    echo 4 > /sys/block/mmcblk0/queue/nr_requests
    echo 4 > /sys/block/mmcblk1/queue/nr_requests

.. to avoid it, but IIRC it only makes the watchdog reset less likely
:-(.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] mm: add strictlimit knob
  2013-11-01 14:31             ` [PATCH] mm: add strictlimit knob Maxim Patlasov
@ 2013-11-04 22:01               ` Andrew Morton
  2013-11-06 14:30                 ` Maxim Patlasov
  2013-11-06 15:05                 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2013-11-04 22:01 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: karl.kiniger, jack, linux-kernel, t.artem, linux-mm, mgorman,
	tytso, fengguang.wu, torvalds, mpatlasov

On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
> 
> http://www.http.com//article.gmane.org/gmane.linux.kernel.mm/105809
> 
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
> 
> echo 1 > /sys/class/bdi/X:Y/strictlimit
> 
> When enabled, the feature enforces the bdi max_ratio limit even if the global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> max_ratio is decreased to some reasonable value.

I suggest replacing "max_ratio" here with the much more informative
"/sys/class/bdi/X:Y/max_ratio".

Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
please.

>  mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 

I'm not really sure what to make of the patch.  I assume you tested it
and observed some effect.  Could you please describe the test setup and
the effects in some detail?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:18 ` Linus Torvalds
  2013-10-25  8:30   ` Artem S. Tashkinov
@ 2013-11-05  0:50   ` Andreas Dilger
  2013-11-05  4:12     ` Dave Chinner
  1 sibling, 1 reply; 56+ messages in thread
From: Andreas Dilger @ 2013-11-05  0:50 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: Wu Fengguang, Linus Torvalds, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm


On Oct 25, 2013, at 2:18 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
>> 
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
>> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
>> really troubling me is that the x86-64 kernel has the following problem:
>> 
>> When I copy large files to any storage device, be it my HDD with ext4
>> partitions or flash drive with FAT32 partitions, the kernel first
>> caches them in memory entirely then flushes them some time later
>> (quite unpredictably though) or immediately upon invoking "sync".
> 
> Yeah, I think we default to a 10% "dirty background memory" (and
> allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> of dirty memory for writeout before we even start writing, and twice
> that before we start *waiting* for it.
> 
> On 32-bit x86, we only count the memory in the low 1GB (really
> actually up to about 890MB), so "10% dirty" really means just about
> 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> 
> And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> come from the old days of less memory (and perhaps servers that don't
> much care), and the fact that x86-32 ends up having much lower limits
> even if you end up having more memory.

I think the “delay writes for a long time” behavior is a holdover from the
days when e.g. /tmp was on a disk and compilers had lousy IO
patterns, then they deleted the file.  Today, /tmp is always in
RAM, and IMHO the “write and delete” workload tested by dbench
is not worth optimizing for.

With Lustre, we’ve long taken the approach that if there is enough
dirty data on a file to make a decent write (which is around 8MB
today even for very fast storage) then there isn’t much point to
hold back for more data before starting the IO.

Any decent allocator will be able to grow allocated extents to
handle following data, or allocate a new extent.  At 4-8MB extents,
even very seek-impaired media could do 400-800MB/s (likely much
faster than the underlying storage anyway).

This also avoids wasting (tens of?) seconds of idle disk bandwidth.
If the disk is already busy, then the IO will be delayed anyway.
If it is not busy, then why aggregate GB of dirty data in memory
before flushing it?

Something simple like “start writing at 16MB dirty on a single file”
would probably avoid a lot of complexity at little real-world cost.
That shouldn’t throttle dirtying memory above 16MB, but just start
writeout much earlier than it does today.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
       [not found]             ` <CAF7GXvpJVLYDS5NfH-NVuN9bOJjAS5c1MQqSTjoiVBHJt6bWcw@mail.gmail.com>
@ 2013-11-05  1:47               ` David Lang
  2013-11-05  2:08               ` NeilBrown
  1 sibling, 0 replies; 56+ messages in thread
From: David Lang @ 2013-11-05  1:47 UTC (permalink / raw)
  To: Figo.zhang
  Cc: NeilBrown, Artem S. Tashkinov, lkml, Linus Torvalds,
	linux-fsdevel, axboe, Linux-MM

On Tue, 5 Nov 2013, Figo.zhang wrote:

>>>
>>> Of course, if you don't use Linux on the desktop you don't really care -
>> well, I do. Also
>>> not everyone in this world has an UPS - which means such a huge buffer
>> can lead to a
>>> serious data loss in case of a power blackout.
>>
>> I don't have a desk (just a lap), but I use Linux on all my computers and
>> I've never really noticed the problem.  Maybe I'm just very patient, or
>> maybe
>> I don't work with large data sets and slow devices.
>>
>> However I don't think data-loss is really a related issue.  Any process
>> that
>> cares about data safety *must* use fsync at appropriate places.  This has
>> always been true.
>>
> => May I ask a question: on a filesystem like ext4, if an app modifies
> files, it creates some dirty data. If some metadata is being written to the
> journal when a power blackout hits, will important data be lost and the
> file be damaged?
>

With any filesystem and any OS, if you create dirty data but do not f*sync() the
data, there is a possibility that the system can go down between the time the
application creates the dirty data and the time the OS actually gets it on disk.
If the system goes down in this timeframe, the data will be lost, and it may
corrupt the file if only some of the data got written.

David Lang

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
       [not found]             ` <CAF7GXvpJVLYDS5NfH-NVuN9bOJjAS5c1MQqSTjoiVBHJt6bWcw@mail.gmail.com>
  2013-11-05  1:47               ` David Lang
@ 2013-11-05  2:08               ` NeilBrown
  1 sibling, 0 replies; 56+ messages in thread
From: NeilBrown @ 2013-11-05  2:08 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Artem S. Tashkinov, david, lkml, Linus Torvalds, linux-fsdevel,
	axboe, Linux-MM

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

On Tue, 5 Nov 2013 09:40:55 +0800 "Figo.zhang" <figo1802@gmail.com> wrote:

> > >
> > > Of course, if you don't use Linux on the desktop you don't really care -
> > well, I do. Also
> > > not everyone in this world has an UPS - which means such a huge buffer
> > can lead to a
> > > serious data loss in case of a power blackout.
> >
> > I don't have a desk (just a lap), but I use Linux on all my computers and
> > I've never really noticed the problem.  Maybe I'm just very patient, or
> > maybe
> > I don't work with large data sets and slow devices.
> >
> > However I don't think data-loss is really a related issue.  Any process
> > that
> > cares about data safety *must* use fsync at appropriate places.  This has
> > always been true.
> >
> => May I ask a question: on a filesystem like ext4, if an app modifies
> files, it creates some dirty data. If some metadata is being written to the
> journal when a power blackout hits, will important data be lost and the
> file be damaged?

If you modify a file, then you must take care that you can recover from a
crash at any point in the process.

If the file is small, the usual approach is to create a copy of the file with
the appropriate changes made, then 'fsync' the file and rename the new file
over the old file.
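
In shell terms the pattern is roughly the following sketch (the file names
are made up, and a real program would do open/write/fsync/rename itself):

    # write the new contents to a temporary file and flush it to disk
    dd if=settings.conf.new of=settings.conf.tmp conv=fsync 2>/dev/null
    # atomically replace the old file with the fully written new copy
    mv settings.conf.tmp settings.conf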

If the file is large you might need some sort of update log (in a small file)
so you can replay recent updates after a crash.

The  journalling that the filesystem provides only protects the filesystem
metadata.  It does not protect the consistency of the data in your file.

I hope  that helps.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-05  0:50   ` Andreas Dilger
@ 2013-11-05  4:12     ` Dave Chinner
  2013-11-07 13:48       ` Jan Kara
  0 siblings, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2013-11-05  4:12 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Artem S. Tashkinov, Wu Fengguang, Linus Torvalds, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm

On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> 
> On Oct 25, 2013, at 2:18 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote:
> >> 
> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
> >> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
> >> really troubling me is that the x86-64 kernel has the following problem:
> >> 
> >> When I copy large files to any storage device, be it my HDD with ext4
> >> partitions or flash drive with FAT32 partitions, the kernel first
> >> caches them in memory entirely then flushes them some time later
> >> (quite unpredictably though) or immediately upon invoking "sync".
> > 
> > Yeah, I think we default to a 10% "dirty background memory" (and
> > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> > of dirty memory for writeout before we even start writing, and twice
> > that before we start *waiting* for it.
> > 
> > On 32-bit x86, we only count the memory in the low 1GB (really
> > actually up to about 890MB), so "10% dirty" really means just about
> > 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> > 
> > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> > come from the old days of less memory (and perhaps servers that don't
> > much care), and the fact that x86-32 ends up having much lower limits
> > even if you end up having more memory.
> 
> I think the “delay writes for a long time” is a holdover from the
> days when e.g. /tmp was on a disk and compilers had lousy IO
> patterns, then they deleted the file.  Today, /tmp is always in
> RAM, and IMHO the “write and delete” workload tested by dbench
> is not worthwhile optimizing for.
> 
> With Lustre, we’ve long taken the approach that if there is enough
> dirty data on a file to make a decent write (which is around 8MB
> today even for very fast storage) then there isn’t much point to
> hold back for more data before starting the IO.

Agreed - write-through caching is much better for high throughput
streaming data environments than write back caching that can leave
the devices unnecessarily idle.

However, most systems are not running in high-throughput streaming
data environments... :/

> Any decent allocator will be able to grow allocated extents to
> handle following data, or allocate a new extent.  At 4-8MB extents,
> even very seek-impaired media could do 400-800MB/s (likely much
> faster than the underlying storage anyway).

True, but this makes the assumption that the filesystem you are
using is optimising purely for write throughput and your storage is
not seek limited on reads. That's simply not an assumption we can
allow the generic writeback code to make.

In more detail, if we simply implement "we have 8 MB of dirty pages
on a single file, write it" we can maximise write throughput by
allocating sequentially on disk for each subsequent write. The
problem with this comes when you are writing multiple files at a
time, and that leads to this pattern on disk:

 ABC...ABC....ABC....ABC....

And the result is a) fragmented files, b) a large number of seeks
during sequential read operations and c) filesystems that age and
degrade rapidly under workloads that concurrently write files with
different lifetimes (i.e. due to free space fragmentation).

In some situations this is acceptable, but the performance
degradation as the filesystem ages that this sort of allocation
causes in most environments is not. I'd say that >90% of filesystems
out there would suffer accelerated aging as a result of doing
writeback in this manner by default.

> This also avoids wasting (tens of?) seconds of idle disk bandwidth.
> If the disk is already busy, then the IO will be delayed anyway.
> If it is not busy, then why aggregate GB of dirty data in memory
> before flushing it?

There are plenty of workloads out there where delaying IO for a few
seconds can result in writeback that is an order of magnitude
faster. Similarly, I've seen other workloads where the writeback
delay results in files that can be *read* orders of magnitude
faster....

> Something simple like “start writing at 16MB dirty on a single file”
> would probably avoid a lot of complexity at little real-world cost.
> That shouldn’t throttle dirtying memory above 16MB, but just start
> writeout much earlier than it does today.

That doesn't solve the "slow device, large file" problem. We can
write data into the page cache at rates of over a GB/s, so it's
irrelevant to a device that can write at 5MB/s whether we start
writeback immediately or a second later when there is 500MB of dirty
pages in memory.  AFAIK, the only way to avoid that problem is to
use write-through caching for such devices - where they throttle to
the IO rate at very low levels of cached data.

Realistically, there is no "one right answer" for all combinations
of applications, filesystems and hardware, but writeback caching is
the best *general solution* we've got right now.

However, IMO users should not need to care about tuning BDI dirty
ratios or even have to understand what a BDI dirty ratio is to
select the right caching method for their devices and/or workload.
The difference between writeback and write through caching is easy
to explain and AFAICT those two modes suffice to solve the problems
being discussed here.  Further, if two modes suffice to solve the
problems, then we should be able to easily define a trigger to
automatically switch modes.

/me notes that if we look at random vs sequential IO and the impact
that has on writeback duration, then it's very similar to suddenly
having a very slow device. IOWs, fadvise(RANDOM) could be used to
switch an *inode* to write through mode rather than writeback mode
to solve the problem of aggregating massive amounts of random write IO
in the page cache...
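For reference, the userspace side of such a hint already exists via
posix_fadvise(); the per-inode write-through switch is the hypothetical part
above - today the flag mainly affects readahead. A small sketch (helper name
is my own):

#define _XOPEN_SOURCE 600	/* for posix_fadvise() */
#include <fcntl.h>

/* Mark an open file descriptor as random-access.  Currently this mostly
 * tunes down readahead; the suggestion above is that the same hint could
 * also flip the inode into write-through mode. */
static int mark_random_access(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}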

So rather than treating this as a "one size fits all" type of
problem, let's step back and:

	a) define 2-3 different caching behaviours we consider
	   optimal for the majority of workloads/hardware we care
	   about.
	b) determine optimal workloads for each caching
	   behaviour.
	c) develop reliable triggers to detect when we
	   should switch between caching behaviours.

e.g:

	a) write back caching
		- what we have now
	   write through caching
		- extremely low dirty threshold before writeback
		  starts, enough to optimise for, say, stripe width
		  of the underlying storage.

	b) write back caching:
		- general purpose workload
	   write through caching:
		- slow device, write large file, sync
		- extremely high bandwidth devices, multi-stream
		  sequential IO
		- random IO.

	c) write back caching:
		- default
		- fadvise(NORMAL, SEQUENTIAL, WILLNEED)
	   write through caching:
		- fadvise(NOREUSE, DONTNEED, RANDOM)
		- random IO
		- sequential IO, BDI write bandwidth <<< dirty threshold
		- sequential IO, BDI write bandwidth >>> dirty threshold

I think that covers most of the issues and use cases that have been
discussed in this thread. IMO, this is the level at which we need to
solve the problem (i.e. architectural), not at the level of "let's
add sysfs variables so we can tweak bdi ratios".

Indeed, the above implies that we need the caching behaviour to be a
property of the address space, not just a property of the backing
device.

IOWs, the implementation needs to trickle down from a coherent high
level design - that will define the knobs that we need to expose to
userspace. We should not be adding new writeback behaviours by
adding knobs to sysfs without first having some clue about whether
we are solving the right problem and solving it in a sane manner...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] mm: add strictlimit knob
  2013-11-04 22:01               ` Andrew Morton
@ 2013-11-06 14:30                 ` Maxim Patlasov
  2013-11-06 15:05                 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
  1 sibling, 0 replies; 56+ messages in thread
From: Maxim Patlasov @ 2013-11-06 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: karl.kiniger, jack, linux-kernel, t.artem, linux-mm, mgorman,
	tytso, fengguang.wu, torvalds

Hi Andrew,

On 11/05/2013 02:01 AM, Andrew Morton wrote:
> On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:
>
>> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
>> FUSE which sets bdi max_ratio to 1% by default:
>>
>> http://article.gmane.org/gmane.linux.kernel.mm/105809
>>
>> However the feature can be useful for other relatively slow or untrusted
>> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
>> feature:
>>
>> echo 1 > /sys/class/bdi/X:Y/strictlimit
>>
>> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
>> dirty limit is not reached. Of course, the effect is not visible until
>> max_ratio is decreased to some reasonable value.
> I suggest replacing "max_ratio" here with the much more informative
> "/sys/class/bdi/X:Y/max_ratio".
>
> Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
> please.

OK, I'll update it, fix patch description and re-send the patch.

>
>>   mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
>>   1 file changed, 35 insertions(+)
>>
> I'm not really sure what to make of the patch.  I assume you tested it
> and observed some effect.  Could you please describe the test setup and
> the effects in some detail?

I plugged a 16GB USB flash drive into a node with 8GB RAM running 3.12.0-rc7
and started writing a huge file to the USB-flash mount-point with "dd" (from
/dev/zero). While writing I was watching the "Dirty" counter as reported by
/proc/meminfo. As expected, it stabilized at about 1.2GB (15% of total RAM).
Immediately after dd completed, the "umount" command took about 5 minutes,
which corresponds to the 5MB/s write throughput of the flash drive.

Then I repeated the experiment after setting tunables:

echo 1 > /sys/class/bdi/8\:16/max_ratio
echo 1 > /sys/class/bdi/8\:16/strictlimit

This time, "Dirty" counter became 100 times lesser - about 12MB and 
"umount" took about a second.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH] mm: add strictlimit knob -v2
  2013-11-04 22:01               ` Andrew Morton
  2013-11-06 14:30                 ` Maxim Patlasov
@ 2013-11-06 15:05                 ` Maxim Patlasov
  2013-11-07 12:26                   ` Henrique de Moraes Holschuh
  2013-11-22 23:45                   ` Andrew Morton
  1 sibling, 2 replies; 56+ messages in thread
From: Maxim Patlasov @ 2013-11-06 15:05 UTC (permalink / raw)
  To: akpm
  Cc: karl.kiniger, tytso, linux-kernel, t.artem, linux-mm, mgorman,
	jack, fengguang.wu, torvalds, mpatlasov

"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

When enabled, the feature enforces the bdi max_ratio limit even if the global
(10%) dirty limit is not reached. Of course, the effect is not visible until
/sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.

Changed in v2:
 - updated patch description and documentation

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 Documentation/ABI/testing/sysfs-class-bdi |    8 +++++++
 mm/backing-dev.c                          |   35 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d56..3187a18 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -53,3 +53,11 @@ stable_pages_required (read-only)
 
 	If set, the backing device requires that all pages comprising a write
 	request must not be changed until writeout is complete.
+
+strictlimit (read-write)
+
+	Forces per-BDI checks for the share of given device in the write-back
+	cache even before the global background dirty limit is reached. This
+	is useful in situations where the global limit is much higher than
+	affordable for given relatively slow (or untrusted) device. Turning
+	strictlimit on has no visible effect if max_ratio is equal to 100%.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7..4ee1d64 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(stable_pages_required);
 
+static ssize_t strictlimit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int val;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	switch (val) {
+	case 0:
+		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
+		break;
+	case 1:
+		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+static ssize_t strictlimit_show(struct device *dev,
+		struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
+}
+static DEVICE_ATTR_RW(strictlimit);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
+	&dev_attr_strictlimit.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] mm: add strictlimit knob -v2
  2013-11-06 15:05                 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
@ 2013-11-07 12:26                   ` Henrique de Moraes Holschuh
  2013-11-22 23:45                   ` Andrew Morton
  1 sibling, 0 replies; 56+ messages in thread
From: Henrique de Moraes Holschuh @ 2013-11-07 12:26 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: akpm, karl.kiniger, tytso, linux-kernel, t.artem, linux-mm,
	mgorman, jack, fengguang.wu, torvalds

Is there a reason to not enforce strictlimit by default?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-05  4:12     ` Dave Chinner
@ 2013-11-07 13:48       ` Jan Kara
  2013-11-11  3:22         ` Dave Chinner
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2013-11-07 13:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, Artem S. Tashkinov, Wu Fengguang, Linus Torvalds,
	Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Jens Axboe, linux-mm

On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > Something simple like “start writing at 16MB dirty on a single file”
> > would probably avoid a lot of complexity at little real-world cost.
> > That shouldn’t throttle dirtying memory above 16MB, but just start
> > writeout much earlier than it does today.
> 
> That doesn't solve the "slow device, large file" problem. We can
> write data into the page cache at rates of over a GB/s, so it's
> irrelevant to a device that can write at 5MB/s whether we start
> writeback immediately or a second later when there is 500MB of dirty
> pages in memory.  AFAIK, the only way to avoid that problem is to
> use write-through caching for such devices - where they throttle to
> the IO rate at very low levels of cached data.
  Agreed.

> Realistically, there is no "one right answer" for all combinations
> of applications, filesystems and hardware, but writeback caching is
> the best *general solution* we've got right now.
> 
> However, IMO users should not need to care about tuning BDI dirty
> ratios or even have to understand what a BDI dirty ratio is to
> select the right caching method for their devices and/or workload.
> The difference between writeback and write through caching is easy
> to explain and AFAICT those two modes suffice to solve the problems
> being discussed here.  Further, if two modes suffice to solve the
> problems, then we should be able to easily define a trigger to
> automatically switch modes.
> 
> /me notes that if we look at random vs sequential IO and the impact
> that has on writeback duration, then it's very similar to suddenly
> having a very slow device. IOWs, fadvise(RANDOM) could be used to
> switch an *inode* to write through mode rather than writeback mode
> to solve the problem aggregating massive amounts of random write IO
> in the page cache...
  I disagree here. Writeback cache is also useful for aggregating random
writes and making semi-sequential writes out of them. There are quite a few
applications which rely on the fact that they can write a file in a rather
random manner (Berkeley DB, linker, ...) but the files are written out in
one large linear sweep. That is actually the reason why SLES (and I believe
RHEL as well) tune dirty_limit even higher than the default value.

So I think it's rather the other way around: If you can detect the file is
being written in a streaming manner, there's not much point in caching too
much data for it. And I agree with you that we also have to be careful not
to cache too little, because otherwise two streaming writes would be
interleaved too much. Currently, we have writeback_chunk_size() which
determines how much we ask to write from a single inode. So streaming
writers are going to be interleaved at this chunk size anyway (currently
that number is "measured bandwidth / 2"). So it would make sense to also
limit the amount of dirty cache for each file with a streaming pattern to
this number.

> So rather than treating this as a "one size fits all" type of
> problem, let's step back and:
> 
> 	a) define 2-3 different caching behaviours we consider
> 	   optimal for the majority of workloads/hardware we care
> 	   about.
> 	b) determine optimal workloads for each caching
> 	   behaviour.
> 	c) develop reliable triggers to detect when we
> 	   should switch between caching behaviours.
> 
> e.g:
> 
> 	a) write back caching
> 		- what we have now
> 	   write through caching
> 		- extremely low dirty threshold before writeback
> 		  starts, enough to optimise for, say, stripe width
> 		  of the underlying storage.
> 
> 	b) write back caching:
> 		- general purpose workload
> 	   write through caching:
> 		- slow device, write large file, sync
> 		- extremely high bandwidth devices, multi-stream
> 		  sequential IO
> 		- random IO.
> 
> 	c) write back caching:
> 		- default
> 		- fadvise(NORMAL, SEQUENTIAL, WILLNEED)
> 	   write through caching:
> 		- fadvise(NOREUSE, DONTNEED, RANDOM)
> 		- random IO
> 		- sequential IO, BDI write bandwidth <<< dirty threshold
> 		- sequential IO, BDI write bandwidth >>> dirty threshold
> 
> I think that covers most of the issues and use cases that have been
> discussed in this thread. IMO, this is the level at which we need to
> solve the problem (i.e. architectural), not at the level of "let's
> add sysfs variables so we can tweak bdi ratios".
> 
> Indeed, the above implies that we need the caching behaviour to be a
> property of the address space, not just a property of the backing
> device.
  Yes, and it would be interesting to implement that without making a mess
of the whole writeback logic, because the way we currently do writeback is
inherently BDI based. When we introduce some special per-inode limits,
flusher threads would have to pick more carefully what to write and what
not. We might be forced to go that way eventually anyway because of
memcg-aware writeback, but it's not a simple step.

> IOWs, the implementation needs to trickle down from a coherent high
> level design - that will define the knobs that we need to expose to
> userspace. We should not be adding new writeback behaviours by
> adding knobs to sysfs without first having some clue about whether
> we are solving the right problem and solving it in a sane manner...
  Agreed. But the ability to limit the amount of dirty pages outstanding
against a particular BDI seems like a sane one to me. It's not as flexible
and automatic as the approach you suggested, but it's much simpler and
solves most of the problems we currently have.

The biggest objection against the sysfs-tunable approach is that most
people won't have a clue meaning that the tunable is useless for them. But I
wonder if something like:
1) turn on strictlimit by default
2) don't allow the dirty cache of a BDI to grow beyond 5s of measured
   writeback speed

won't go a long way into solving our current problems without too much
complication...
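Purely as an illustration of point 2) - this is just the arithmetic, not
code from anywhere - the per-BDI cap would be derived from the measured
writeback bandwidth instead of from a fixed share of RAM:

#include <stdint.h>

/* Illustration only, not kernel code: cap a BDI's dirty cache at
 * "window_secs" worth of its measured writeback bandwidth, never
 * exceeding the global dirty limit. */
static uint64_t bdi_dirty_cap(uint64_t bw_bytes_per_sec,
			      unsigned int window_secs,
			      uint64_t global_dirty_limit)
{
	/* e.g. a 5MB/s USB stick with a 5s window is capped at ~25MB
	 * of dirty data instead of gigabytes. */
	uint64_t cap = bw_bytes_per_sec * window_secs;

	return cap < global_dirty_limit ? cap : global_dirty_limit;
}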

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-07 13:48       ` Jan Kara
@ 2013-11-11  3:22         ` Dave Chinner
  2013-11-11 19:31           ` Jan Kara
  0 siblings, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2013-11-11  3:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, Artem S. Tashkinov, Wu Fengguang, Linus Torvalds,
	Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Jens Axboe, linux-mm

On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Something simple like “start writing at 16MB dirty on a single file”
> > > would probably avoid a lot of complexity at little real-world cost.
> > > That shouldn’t throttle dirtying memory above 16MB, but just start
> > > writeout much earlier than it does today.
> > 
> > That doesn't solve the "slow device, large file" problem. We can
> > write data into the page cache at rates of over a GB/s, so it's
> > irrelevant to a device that can write at 5MB/s whether we start
> > writeback immediately or a second later when there is 500MB of dirty
> > pages in memory.  AFAIK, the only way to avoid that problem is to
> > use write-through caching for such devices - where they throttle to
> > the IO rate at very low levels of cached data.
>   Agreed.
> 
> > Realistically, there is no "one right answer" for all combinations
> > of applications, filesystems and hardware, but writeback caching is
> > the best *general solution* we've got right now.
> > 
> > However, IMO users should not need to care about tuning BDI dirty
> > ratios or even have to understand what a BDI dirty ratio is to
> > select the right caching method for their devices and/or workload.
> > The difference between writeback and write through caching is easy
> > to explain and AFAICT those two modes suffice to solve the problems
> > being discussed here.  Further, if two modes suffice to solve the
> > problems, then we should be able to easily define a trigger to
> > automatically switch modes.
> > 
> > /me notes that if we look at random vs sequential IO and the impact
> > that has on writeback duration, then it's very similar to suddenly
> > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > switch an *inode* to write through mode rather than writeback mode
> > to solve the problem aggregating massive amounts of random write IO
> > in the page cache...
>   I disagree here. Writeback cache is also useful for aggregating random
> writes and making semi-sequential writes out of them. There are quite some
> applications which rely on the fact that they can write a file in a rather
> random manner (Berkeley DB, linker, ...) but the files are written out in
> one large linear sweep. That is actually the reason why SLES (and I believe
> RHEL as well) tune dirty_limit even higher than what's the default value.

Right - but the correct behaviour really depends on the pattern of
randomness. The common case we get into trouble with is when no
clustering occurs and we end up with small, random IO for gigabytes
of cached data. That's the case where write-through caching for
random data is better.

It's also questionable whether writeback caching for aggregation is
faster for random IO on high-IOPS devices or not. Again, I think it
would depend very much on how random the patterns are...

> So I think it's rather the other way around: If you can detect the file is
> being written in a streaming manner, there's not much point in caching too
> much data for it.

But we're not talking about how much data we cache here - we are
considering how much data we allow to get dirty before writing it
back.  It doesn't matter if we use writeback or write through
caching, the page cache footprint for a given workload is likely to
be similar, but without any data we can't draw any conclusions here.

> And I agree with you that we also have to be careful not
> to cache too few because otherwise two streaming writes would be
> interleaved too much. Currently, we have writeback_chunk_size() which
> determines how much we ask to write from a single inode. So streaming
> writers are going to be interleaved at this chunk size anyway (currently
> that number is "measured bandwidth / 2"). So it would make sense to also
> limit amount of dirty cache for each file with streaming pattern at this
> number.

My experience says that for streaming IO we typically need at least
5s of cached *dirty* data to even out delays and latencies in the
writeback IO pipeline. Hence limiting a file to what we can write in
a second, given we might only write a file once a second, is likely
going to result in pipeline stalls...

Remember, writeback caching is about maximising throughput, not
minimising latency. The "sync latency" problem with caching too much
dirty data on slow block devices is really a corner case behaviour
and should not compromise the common case for bulk writeback
throughput.

> > Indeed, the above implies that we need the caching behaviour to be a
> > property of the address space, not just a property of the backing
> > device.
>   Yes, and that would be interesting to implement and not make a mess out
> of the whole writeback logic because the way we currently do writeback is
> inherently BDI based. When we introduce some special per-inode limits,
> flusher threads would have to pick more carefully what to write and what
> not. We might be forced to go that way eventually anyway because of memcg
> aware writeback but it's not a simple step.

Agreed, it's not simple, and that's why we need to start working
from the architectural level....

> > IOWs, the implementation needs to trickle down from a coherent high
> > level design - that will define the knobs that we need to expose to
> > userspace. We should not be adding new writeback behaviours by
> > adding knobs to sysfs without first having some clue about whether
> > we are solving the right problem and solving it in a sane manner...
>   Agreed. But the ability to limit amount of dirty pages outstanding
> against a particular BDI seems as a sane one to me. It's not as flexible
> and automatic as the approach you suggested but it's much simpler and
> solves most of problems we currently have.

That's true, but....

> The biggest objection against the sysfs-tunable approach is that most
> people won't have a clue meaning that the tunable is useless for them.

.... that's the big problem I see - nobody is going to know how to
use it, when to use it, or be able to tell if it's the root cause of
some weird performance problem they are seeing.

> But I
> wonder if something like:
> 1) turn on strictlimit by default
> 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
>    speed
> 
> won't go a long way into solving our current problems without too much
> complication...

Turning on strict limit by default is going to change behaviour
quite markedly. Again, it's not something I'd want to see done
without a bunch of data showing that it doesn't cause regressions
for common workloads...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-11  3:22         ` Dave Chinner
@ 2013-11-11 19:31           ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2013-11-11 19:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andreas Dilger, Artem S. Tashkinov, Wu Fengguang,
	Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Jens Axboe, linux-mm

On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > > 
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here.  Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > > 
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> >   I disagree here. Writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite some
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, linker, ...) but the files are written out in
> > one large linear sweep. That is actually the reason why SLES (and I believe
> > RHEL as well) tune dirty_limit even higher than what's the default value.
> 
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
> 
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> would depend very much on how random the patterns are...
  I agree the usefulness of writeback caching for random IO very much
depends on the working set size vs the cache size, how random the accesses
really are, and the HW characteristics. I just wanted to point out that
there are fairly common workloads & setups where writeback caching for
semi-random IO really helps (because you seemed to suggest that random IO
implies we should disable the writeback cache).

> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
> 
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
  Sorry, I was imprecise here. I really meant that IMO it doesn't make
sense to allow too much dirty data for sequentially written files.

> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
> 
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
> 
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second given we might only write a file once a second is likely
> going to result in pipeline stalls...
  I guess this begs for real data. We agree in principle but differ in
constants :).
 
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
  Agreed. As a primary goal we want to maximise throughput. But we want
to maintain sane latency as well (e.g. because we have a "promise" of
"dirty_writeback_centisecs" we have to cycle through dirty inodes
reasonably frequently).

> >   Agreed. But the ability to limit amount of dirty pages outstanding
> > against a particular BDI seems as a sane one to me. It's not as flexible
> > and automatic as the approach you suggested but it's much simpler and
> > solves most of problems we currently have.
> 
> That's true, but....
> 
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue meaning that the tunable is useless for them.
> 
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
> 
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
> >    speed
> > 
> > won't go a long way into solving our current problems without too much
> > complication...
> 
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
  Agreed.

							Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 23:32         ` Fengguang Wu
@ 2013-11-15 15:48           ` Diego Calleja
  0 siblings, 0 replies; 56+ messages in thread
From: Diego Calleja @ 2013-11-15 15:48 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Artem S. Tashkinov, david, neilb, linux-kernel, torvalds,
	linux-fsdevel, axboe, linux-mm

On Saturday, 26 October 2013 at 00:32:25, Fengguang Wu wrote:
> What's the kernel you are running? And it's writing to a hard disk?
> The stalls are most likely caused by either one of
> 
> 1) write IO starves read IO
> 2) direct page reclaim blocked when
>    - trying to writeout PG_dirty pages
>    - trying to lock PG_writeback pages
> 
> Which may be confirmed by running
> 
>         ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
> or
>         echo w > /proc/sysrq-trigger    # and check dmesg
> 
> during the stalls. The latter command works more reliably.


Sorry for the delay (background: rsync'ing large files from/to a hard disk
in a desktop with 16GB of RAM makes the whole desktop unresponsive)

I just triggered it today (running 3.12), and ran sysrq-w:

[ 5547.001505] SysRq : Show Blocked State
[ 5547.001509]   task                        PC stack   pid father
[ 5547.001516] btrfs-transacti D ffff880425d7a8a0     0   193      2 0x00000000
[ 5547.001519]  ffff880425eede10 0000000000000002 ffff880425eedfd8 0000000000012e40
[ 5547.001521]  ffff880425eedfd8 0000000000012e40 ffff880425d7a8a0 ffffea00104baa80
[ 5547.001523]  ffff880425eedd90 ffff880425eedd68 ffff880425eedd70 ffffffff81080edd
[ 5547.001525] Call Trace:
[ 5547.001530]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001533]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001535]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001552]  [<ffffffffa008a742>] ? btrfs_run_ordered_operations+0x212/0x2c0 [btrfs]
[ 5547.001554]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001556]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001557]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001559]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001566]  [<ffffffffa0072215>] btrfs_commit_transaction+0x265/0x9d0 [btrfs]
[ 5547.001569]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001575]  [<ffffffffa006982d>] transaction_kthread+0x19d/0x220 [btrfs]
[ 5547.001581]  [<ffffffffa0069690>] ? free_fs_root+0xc0/0xc0 [btrfs]
[ 5547.001583]  [<ffffffff81072e70>] kthread+0xc0/0xd0
[ 5547.001585]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.001587]  [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0
[ 5547.001588]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.001590] systemd-journal D ffff880426e19860     0   234      1 0x00000000
[ 5547.001592]  ffff880426d77d90 0000000000000002 ffff880426d77fd8 0000000000012e40
[ 5547.001593]  ffff880426d77fd8 0000000000012e40 ffff880426e19860 ffffffff8155d7cd
[ 5547.001595]  0000000000000001 0000000000000001 0000000000000000 ffffffff81572560
[ 5547.001596] Call Trace:
[ 5547.001598]  [<ffffffff8155d7cd>] ? retint_restore_args+0xe/0xe
[ 5547.001601]  [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0
[ 5547.001602]  [<ffffffff8122da9b>] ? blk_flush_plug_list+0x1eb/0x230
[ 5547.001604]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001606]  [<ffffffff8155bb88>] schedule_preempt_disabled+0x18/0x30
[ 5547.001607]  [<ffffffff8155a2f4>] __mutex_lock_slowpath+0x124/0x1f0
[ 5547.001613]  [<ffffffffa0071c9b>] ? btrfs_write_marked_extents+0xbb/0xe0 [btrfs]
[ 5547.001615]  [<ffffffff8155a3d7>] mutex_lock+0x17/0x30
[ 5547.001623]  [<ffffffffa00ae06a>] btrfs_sync_log+0x22a/0x690 [btrfs]
[ 5547.001630]  [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001632]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001634]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.001635]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001644] mysqld          D ffff8803f0901860     0   643    579 0x00000000
[ 5547.001645]  ffff8803f090de18 0000000000000002 ffff8803f090dfd8 0000000000012e40
[ 5547.001647]  ffff8803f090dfd8 0000000000012e40 ffff8803f0901860 ffff88016d038000
[ 5547.001648]  ffff880426908d00 0000000024119d80 0000000000000000 0000000000000000
[ 5547.001650] Call Trace:
[ 5547.001657]  [<ffffffffa0074d14>] ? btrfs_submit_bio_hook+0x84/0x1f0 [btrfs]
[ 5547.001659]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001660]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001662]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001663]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001669]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001671]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001677]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001680]  [<ffffffff8112632e>] ? do_writepages+0x1e/0x40
[ 5547.001686]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001693]  [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.001694]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001696]  [<ffffffff811abe43>] SyS_fdatasync+0x13/0x20
[ 5547.001697]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001701] virtuoso-t      D ffff88000310b0c0     0   617    609 0x00000000
[ 5547.001702]  ffff8803f4867c20 0000000000000002 ffff8803f4867fd8 0000000000012e40
[ 5547.001704]  ffff8803f4867fd8 0000000000012e40 ffff88000310b0c0 ffffffff813ce4af
[ 5547.001705]  ffffffff81860520 ffff8802d8ad8a00 ffff8803f4867ba0 ffffffff81231a0e
[ 5547.001707] Call Trace:
[ 5547.001709]  [<ffffffff813ce4af>] ? scsi_pool_alloc_command+0x3f/0x80
[ 5547.001712]  [<ffffffff81231a0e>] ? __blk_segment_map_sg+0x4e/0x120
[ 5547.001713]  [<ffffffff81231b6b>] ? blk_rq_map_sg+0x8b/0x1f0
[ 5547.001716]  [<ffffffff812481da>] ? cfq_dispatch_requests+0xba/0xc40
[ 5547.001718]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001721]  [<ffffffff81119d70>] ? filemap_fdatawait+0x30/0x30
[ 5547.001722]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001723]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.001725]  [<ffffffff81119d7e>] sleep_on_page+0xe/0x20
[ 5547.001727]  [<ffffffff81559142>] __wait_on_bit+0x62/0x90
[ 5547.001728]  [<ffffffff81119b2f>] wait_on_page_bit+0x7f/0x90
[ 5547.001730]  [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40
[ 5547.001732]  [<ffffffff81119cbb>] filemap_fdatawait_range+0x11b/0x1a0
[ 5547.001734]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001740]  [<ffffffffa0071d47>] btrfs_wait_marked_extents+0x87/0xe0 [btrfs]
[ 5547.001747]  [<ffffffffa00ae328>] btrfs_sync_log+0x4e8/0x690 [btrfs]
[ 5547.001754]  [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001756]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001758]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.001759]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001761] pool            D ffff88040db1c100     0   657    477 0x00000000
[ 5547.001763]  ffff8803ee809ba0 0000000000000002 ffff8803ee809fd8 0000000000012e40
[ 5547.001764]  ffff8803ee809fd8 0000000000012e40 ffff88040db1c100 0000000000000004
[ 5547.001766]  ffff8803ee809ae8 ffffffff8155cc86 ffff8803ee809bd0 ffffffffa005ada4
[ 5547.001767] Call Trace:
[ 5547.001769]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001775]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001776]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001778]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001779]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001781]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001783]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001784]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001790]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001792]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001798]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001804]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001810]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001813]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.001815]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.001817]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.001818]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.001820]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001822]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.001824]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.001827]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.001829]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.001831]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001833]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.001834]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001836]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.001839]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.001842]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.001844]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.001845]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001850] akregator       D ffff8803ed1d4100     0   875      1 0x00000000
[ 5547.001851]  ffff8803c7f1bba0 0000000000000002 ffff8803c7f1bfd8 0000000000012e40
[ 5547.001853]  ffff8803c7f1bfd8 0000000000012e40 ffff8803ed1d4100 0000000000000004
[ 5547.001854]  ffff8803c7f1bae8 ffffffff8155cc86 ffff8803c7f1bbd0 ffffffffa005ada4
[ 5547.001856] Call Trace:
[ 5547.001858]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001863]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001865]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001866]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001868]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001870]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001871]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001873]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001879]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001881]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001886]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001888]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001894]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001900]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001902]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.001904]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.001906]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.001907]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.001909]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001911]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.001912]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.001914]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.001916]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.001918]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001920]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.001921]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001923]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.001925]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.001927]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.001928]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.001930]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001931] mpegaudioparse3 D ffff880341d10820     0  5917      1 0x00000000
[ 5547.001933]  ffff88030f779ce0 0000000000000002 ffff88030f779fd8 0000000000012e40
[ 5547.001934]  ffff88030f779fd8 0000000000012e40 ffff880341d10820 ffffffff81122a28
[ 5547.001936]  ffff88043e5ddc00 ffff880400000002 ffff88043e2138d0 0000000000000000
[ 5547.001938] Call Trace:
[ 5547.001939]  [<ffffffff81122a28>] ? __alloc_pages_nodemask+0x158/0xb00
[ 5547.001941]  [<ffffffff8102af55>] ? native_send_call_func_single_ipi+0x35/0x40
[ 5547.001943]  [<ffffffff810b31a8>] ? generic_exec_single+0x98/0xa0
[ 5547.001945]  [<ffffffff81086a18>] ? __enqueue_entity+0x78/0x80
[ 5547.001947]  [<ffffffff8108a837>] ? enqueue_entity+0x197/0x780
[ 5547.001948]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001950]  [<ffffffff81119d90>] ? sleep_on_page+0x20/0x20
[ 5547.001951]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001953]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.001954]  [<ffffffff81119d9e>] sleep_on_page_killable+0xe/0x40
[ 5547.001956]  [<ffffffff8155925d>] __wait_on_bit_lock+0x5d/0xc0
[ 5547.001958]  [<ffffffff81119f2a>] __lock_page_killable+0x6a/0x70
[ 5547.001960]  [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40
[ 5547.001961]  [<ffffffff8111b9e5>] generic_file_aio_read+0x435/0x700
[ 5547.001963]  [<ffffffff8117d2ba>] do_sync_read+0x5a/0x90
[ 5547.001965]  [<ffffffff8117d85a>] vfs_read+0x9a/0x170
[ 5547.001967]  [<ffffffff8117e039>] SyS_read+0x49/0xa0
[ 5547.001968]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001970] mozStorage #2   D ffff8803b7aa1860     0   920    477 0x00000000
[ 5547.001972]  ffff8803b1473d80 0000000000000002 ffff8803b1473fd8 0000000000012e40
[ 5547.001974]  ffff8803b1473fd8 0000000000012e40 ffff8803b7aa1860 0000000000000004
[ 5547.001975]  ffff8803b1473cc8 ffffffff8155cc86 ffff8803b1473db0 ffffffffa005ada4
[ 5547.001977] Call Trace:
[ 5547.001978]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001984]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001990]  [<ffffffffa0084729>] ? __btrfs_buffered_write+0x3d9/0x490 [btrfs]
[ 5547.001992]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001994]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001995]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001997]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002003]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002004]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002010]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002016]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002023]  [<ffffffffa007c8a1>] btrfs_setattr+0x101/0x290 [btrfs]
[ 5547.002025]  [<ffffffff810d675c>] ? rcu_eqs_enter+0x5c/0xa0
[ 5547.002027]  [<ffffffff81198a6c>] notify_change+0x1dc/0x360
[ 5547.002029]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002030]  [<ffffffff8117bdcb>] do_truncate+0x6b/0xa0
[ 5547.002032]  [<ffffffff8117f8b9>] ? __sb_start_write+0x49/0x100
[ 5547.002033]  [<ffffffff8117c12b>] SyS_ftruncate+0x10b/0x160
[ 5547.002035]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002036] Cache I/O       D ffff8803b7aa28a0     0   922    477 0x00000000
[ 5547.002038]  ffff8803b1495e18 0000000000000002 ffff8803b1495fd8 0000000000012e40
[ 5547.002039]  ffff8803b1495fd8 0000000000012e40 ffff8803b7aa28a0 ffff8803b1495e08
[ 5547.002041]  ffff8803b1495db0 ffffffff8111a25a ffff8803b1495e40 ffff8803b1495df0
[ 5547.002043] Call Trace:
[ 5547.002045]  [<ffffffff8111a25a>] ? find_get_pages_tag+0xea/0x180
[ 5547.002047]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002048]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002050]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002051]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002057]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002059]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002065]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002071]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002077]  [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.002079]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.002080]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.002081]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002083] mozStorage #6   D ffff8803c0cfa8a0     0   982    477 0x00000000
[ 5547.002085]  ffff8803a10f5ba0 0000000000000002 ffff8803a10f5fd8 0000000000012e40
[ 5547.002086]  ffff8803a10f5fd8 0000000000012e40 ffff8803c0cfa8a0 0000000000000004
[ 5547.002088]  ffff8803a10f5ae8 ffffffff8155cc86 ffff8803a10f5bd0 ffffffffa005ada4
[ 5547.002089] Call Trace:
[ 5547.002091]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002096]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002098]  [<ffffffff8102b067>] ? native_smp_send_reschedule+0x47/0x60
[ 5547.002100]  [<ffffffff8107f7bc>] ? resched_task+0x5c/0x60
[ 5547.002101]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002103]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002104]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002106]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002112]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002113]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002119]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002125]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002131]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002133]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.002134]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.002136]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.002138]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.002139]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002141]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.002143]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.002145]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.002147]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002148]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002150]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.002152]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002153]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.002155]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.002157]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.002159]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.002160]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002164] rsync           D ffff8802dcde0820     0  5803   5802 0x00000000
[ 5547.002165]  ffff8802daeb1a90 0000000000000002 ffff8802daeb1fd8 0000000000012e40
[ 5547.002167]  ffff8802daeb1fd8 0000000000012e40 ffff8802dcde0820 ffff880100000002
[ 5547.002169]  ffff8802daeb19e0 ffffffff81080edd ffff880308b337e0 0000000000000000
[ 5547.002170] Call Trace:
[ 5547.002172]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002173]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002175]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002177]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002178]  [<ffffffff81560e8d>] ? add_preempt_count+0x3d/0x40
[ 5547.002180]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002181]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002182]  [<ffffffff81558f6a>] schedule_timeout+0x11a/0x230
[ 5547.002185]  [<ffffffff8105e0c0>] ? detach_if_pending+0x120/0x120
[ 5547.002187]  [<ffffffff810a5078>] ? ktime_get_ts+0x48/0xe0
[ 5547.002189]  [<ffffffff8155bd2b>] io_schedule_timeout+0x9b/0xf0
[ 5547.002191]  [<ffffffff811259a9>] balance_dirty_pages_ratelimited+0x3d9/0xa10
[ 5547.002198]  [<ffffffffa0c9ad84>] ? ext4_dirty_inode+0x54/0x60 [ext4]
[ 5547.002200]  [<ffffffff8111a8c8>] generic_file_buffered_write+0x1b8/0x290
[ 5547.002202]  [<ffffffff8111bfd9>] __generic_file_aio_write+0x1a9/0x3b0
[ 5547.002203]  [<ffffffff8111c238>] generic_file_aio_write+0x58/0xa0
[ 5547.002208]  [<ffffffffa0c8ef79>] ext4_file_write+0x99/0x3e0 [ext4]
[ 5547.002210]  [<ffffffff810ddaac>] ? acct_account_cputime+0x1c/0x20
[ 5547.002212]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002213]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002215]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002216]  [<ffffffff8117d34a>] do_sync_write+0x5a/0x90
[ 5547.002218]  [<ffffffff8117d9ed>] vfs_write+0xbd/0x1e0
[ 5547.002220]  [<ffffffff8117e0d9>] SyS_write+0x49/0xa0
[ 5547.002221]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002223] ktorrent        D ffff8802e7680820     0  5806      1 0x00000000
[ 5547.002224]  ffff8802daf7fba0 0000000000000002 ffff8802daf7ffd8 0000000000012e40
[ 5547.002226]  ffff8802daf7ffd8 0000000000012e40 ffff8802e7680820 0000000000000004
[ 5547.002227]  ffff8802daf7fae8 ffffffff8155cc86 ffff8802daf7fbd0 ffffffffa005ada4
[ 5547.002229] Call Trace:
[ 5547.002230]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002236]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002241]  [<ffffffffa004ae49>] ? btrfs_set_path_blocking+0x39/0x80 [btrfs]
[ 5547.002246]  [<ffffffffa004fe78>] ? btrfs_search_slot+0x498/0x970 [btrfs]
[ 5547.002247]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002249]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002251]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002252]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002258]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002260]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002266]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002268]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002273]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002280]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002281]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.002283]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.002285]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.002287]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.002288]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002290]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.002292]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.002293]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.002295]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002297]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002299]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.002300]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002302]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.002304]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.002306]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.002307]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.002309]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002311] kworker/u16:0   D ffff88035c5ac920     0  6043      2 0x00000000
[ 5547.002313] Workqueue: writeback bdi_writeback_workfn (flush-8:32)
[ 5547.002315]  ffff88036c9cb898 0000000000000002 ffff88036c9cbfd8 0000000000012e40
[ 5547.002316]  ffff88036c9cbfd8 0000000000012e40 ffff88035c5ac920 ffff8804281de048
[ 5547.002318]  ffff88036c9cb7e8 ffffffff81080edd 0000000000000001 ffff88036c9cb800
[ 5547.002319] Call Trace:
[ 5547.002321]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002323]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002324]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002326]  [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0
[ 5547.002328]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002329]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.002331]  [<ffffffff8122b8aa>] get_request+0x1aa/0x780
[ 5547.002332]  [<ffffffff8123099e>] ? ioc_lookup_icq+0x4e/0x80
[ 5547.002334]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002336]  [<ffffffff8122db58>] blk_queue_bio+0x78/0x3e0
[ 5547.002337]  [<ffffffff8122c5c2>] generic_make_request+0xc2/0x110
[ 5547.002338]  [<ffffffff8122c683>] submit_bio+0x73/0x160
[ 5547.002344]  [<ffffffffa0c9bae5>] ext4_io_submit+0x25/0x50 [ext4]
[ 5547.002348]  [<ffffffffa0c981d3>] ext4_writepages+0x823/0xe00 [ext4]
[ 5547.002350]  [<ffffffff8112632e>] do_writepages+0x1e/0x40
[ 5547.002352]  [<ffffffff811a6340>] __writeback_single_inode+0x40/0x330
[ 5547.002353]  [<ffffffff811a7392>] writeback_sb_inodes+0x262/0x450
[ 5547.002355]  [<ffffffff811a761f>] __writeback_inodes_wb+0x9f/0xd0
[ 5547.002357]  [<ffffffff811a797b>] wb_writeback+0x32b/0x360
[ 5547.002358]  [<ffffffff811a8111>] bdi_writeback_workfn+0x221/0x510
[ 5547.002361]  [<ffffffff8106b917>] process_one_work+0x167/0x450
[ 5547.002362]  [<ffffffff8106c6a1>] worker_thread+0x121/0x3a0
[ 5547.002364]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002366]  [<ffffffff8106c580>] ? manage_workers.isra.25+0x2a0/0x2a0
[ 5547.002367]  [<ffffffff81072e70>] kthread+0xc0/0xd0
[ 5547.002369]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.002371]  [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0
[ 5547.002372]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-30 12:01             ` Mel Gorman
@ 2013-11-19 17:17               ` Rob Landley
  2013-11-20 20:52                 ` One Thousand Gnomes
  0 siblings, 1 reply; 56+ messages in thread
From: Rob Landley @ 2013-11-19 17:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Theodore Ts'o,
	Artem S. Tashkinov, Wu Fengguang, Linux Kernel Mailing List

On 10/30/2013 07:01:52 AM, Mel Gorman wrote:
> We talked about this a
> few months ago but I still suspect that we will have to bite the  
> bullet and
> tune based on "do not dirty more data than it takes N seconds to  
> writeback"
> using per-bdi writeback estimations. It's just not that trivial to  
> implement
> as the writeback speeds can change for a variety of reasons (multiple  
> IO
> sources, random vs sequential etc).

Record "block writes finished this second" into an 8-entry ring buffer,
with a flag saying "device was partly idle this period" so you can  
ignore those entries. Keep a high water mark, which should converge to  
the device's linear write capacity.

This gives you recent thrashing speed and max capacity, and some  
weighted average of the two lets you avoid queuing up 10 minutes of  
writes all at once like 3.0 would to a terabyte USB2 disk. (And then  
vim calls sync() and hangs...)

The first tricky bit is the high water mark, but it's not too bad. If
the device reads and writes at the same rate you can populate it from
that, but even starting it with just one block should converge really
fast because A) the round-trip time should be well under a second, and
B) if you're submitting more than one period's worth of data (you can
dirty enough to keep the disk busy for 2 seconds), then it'll queue up 2
blocks at a time, then 4, then 8, and increase exponentially until you
hit the high water mark. (Which is measured, so it won't overshoot.)

The second tricky bit is weighting the average, but presumably you count
the high water mark as one entry, add in all the "device did not
actually go idle during this period" entries, and divide by the
number of entries considered... Reasonable first guess?

Obvious optimizations: instead of recording the "disk went idle" flag  
in the ring buffer, just don't advance the ring buffer at the end of  
that second, but zero out the entry and re-accumulate it. That way the  
ring buffer should always have 7 seconds of measured activity, even if  
it's not necessarily recent. And of course you don't have to wake  
anything up when there was no I/O, so it's nicely quiescent when the  
system is...

Lowering the high water mark in the case of a transient spurious
reading (maybe clock skew during suspend, or a virtualization glitch, or
some such) is fun, and could give you a 4-billion-block bad reading.
But if you always decrement the high water mark by 25% (x -= (x >> 2))
each second the disk didn't go idle (rounding up), and then queue up
more than one period's worth of data (but no more than, say, 8 seconds'
worth), such glitches should fix themselves and it'll work its way back
up or down to a reasonably accurate value. (Keep in mind you're
averaging the high water mark back down with 7 seconds of measured data
from the ring buffer. Maybe you can cap the high water mark at the sum
of all the measured values in the ring buffer as an extra check? You're
already calculating it to do the average, so...)
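
A rough sketch of the above in C (a userspace model only, assuming a
one-second tick and blocks-per-second as the unit; the names
wb_estimator, wb_account, wb_tick and wb_estimate are made up for
illustration and are not existing kernel code). It folds in the "don't
advance the ring on partly-idle periods" trick and the 25% decay:

#include <stdbool.h>

#define WB_RING_SLOTS 8

struct wb_estimator {
	unsigned long ring[WB_RING_SLOTS];	/* blocks completed per busy second */
	unsigned int head;			/* slot currently being accumulated */
	unsigned long high_water;		/* best sustained rate seen so far */
};

/* Called from the I/O completion path: account blocks that just finished. */
static void wb_account(struct wb_estimator *e, unsigned long blocks)
{
	e->ring[e->head] += blocks;
}

/* Called once per second. */
static void wb_tick(struct wb_estimator *e, bool device_was_idle)
{
	unsigned long done = e->ring[e->head];

	if (device_was_idle) {
		/* Partly-idle period: throw it away and re-accumulate the slot. */
		e->ring[e->head] = 0;
		return;
	}

	if (done > e->high_water)
		e->high_water = done;			/* converge upward */
	else
		e->high_water -= e->high_water >> 2;	/* decay 25% per busy second */

	e->head = (e->head + 1) % WB_RING_SLOTS;	/* ring keeps the last 7 busy seconds */
	e->ring[e->head] = 0;
}

/* Weighted estimate: high water mark averaged with the measured busy seconds. */
static unsigned long wb_estimate(const struct wb_estimator *e)
{
	unsigned long sum = e->high_water;
	unsigned int i, n = 1;

	for (i = 0; i < WB_RING_SLOTS; i++) {
		if (i == e->head || !e->ring[i])
			continue;	/* skip the filling slot and never-filled entries */
		sum += e->ring[i];
		n++;
	}
	return sum / n;
}

A caller like balance_dirty_pages() could then use something like
wb_estimate() as its pacing target instead of a fixed dirty ratio.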

This is assuming your hard drive _itself_ doesn't have bufferbloat, but  
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and  
tagged command queueing lets you see through that anyway so your  
"actually committed" numbers could presumably still be accurate if the  
manufacturers aren't totally lying.

Given how far behind I am on my email, I assume somebody's already  
suggested this by now. :)

Rob

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-11-19 17:17               ` Rob Landley
@ 2013-11-20 20:52                 ` One Thousand Gnomes
  0 siblings, 0 replies; 56+ messages in thread
From: One Thousand Gnomes @ 2013-11-20 20:52 UTC (permalink / raw)
  To: Rob Landley
  Cc: Mel Gorman, Jan Kara, Linus Torvalds, Andrew Morton,
	Theodore Ts'o, Artem S. Tashkinov, Wu Fengguang,
	Linux Kernel Mailing List

> This is assuming your hard drive _itself_ doesn't have bufferbloat, but  
> http://spritesmods.com/?art=hddhack&f=rss implies they don't, and  
> tagged command queueing lets you see through that anyway so your  
> "actually committed" numbers could presumably still be accurate if the  
> manufacturers aren't totally lying.

They don't, but they do have wildly variable completion rates and times.
Nothing like a drive having a seven-second hiccup to annoy people, but
they can do that at times.

There are two problems, though:

1. Disk performance, particularly in the rotating-rust world, is
operations/second, which is rarely related to volume.

2. If the block layer is trying to decide whether the drive is busy,
you've got it the wrong way up IMHO. Busy-ness is a property of the
device, and is often very device- and subsystem-specific, so the device
end of the chain should figure out how loaded it feels.


Beyond that, the entire problem is well understood, and there isn't any
real difference between an IPv4 network and a storage layer. In fact, in
some cases, like NFS, DRBD, AoE, and remote block device stuff, it's
even more so.

(TCP-based remote block devices, btw, are a prime example of why you
need the device end of the chain figuring out busy state: you'll
otherwise end up doing double backoff.)

Alan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] mm: add strictlimit knob -v2
  2013-11-06 15:05                 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
  2013-11-07 12:26                   ` Henrique de Moraes Holschuh
@ 2013-11-22 23:45                   ` Andrew Morton
  1 sibling, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2013-11-22 23:45 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: karl.kiniger, tytso, linux-kernel, t.artem, linux-mm, mgorman,
	jack, fengguang.wu, torvalds, mpatlasov

On Wed, 06 Nov 2013 19:05:57 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
> 
> http://article.gmane.org/gmane.linux.kernel.mm/105809
> 
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
> 
> echo 1 > /sys/class/bdi/X:Y/strictlimit
> 
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.
> 
> ...
>
> --- a/Documentation/ABI/testing/sysfs-class-bdi
> +++ b/Documentation/ABI/testing/sysfs-class-bdi
> @@ -53,3 +53,11 @@ stable_pages_required (read-only)
>  
>  	If set, the backing device requires that all pages comprising a write
>  	request must not be changed until writeout is complete.
> +
> +strictlimit (read-write)
> +
> +	Forces per-BDI checks for the share of given device in the write-back
> +	cache even before the global background dirty limit is reached. This
> +	is useful in situations where the global limit is much higher than
> +	affordable for given relatively slow (or untrusted) device. Turning
> +	strictlimit on has no visible effect if max_ratio is equal to 100%.
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index ce682f7..4ee1d64 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(stable_pages_required);
>  
> +static ssize_t strictlimit_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	unsigned int val;
> +	ssize_t ret;
> +
> +	ret = kstrtouint(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	switch (val) {
> +	case 0:
> +		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
> +		break;
> +	case 1:
> +		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return count;
> +}
> +static ssize_t strictlimit_show(struct device *dev,
> +		struct device_attribute *attr, char *page)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> +	return snprintf(page, PAGE_SIZE-1, "%d\n",
> +			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
> +}
> +static DEVICE_ATTR_RW(strictlimit);
> +
>  static struct attribute *bdi_dev_attrs[] = {
>  	&dev_attr_read_ahead_kb.attr,
>  	&dev_attr_min_ratio.attr,
>  	&dev_attr_max_ratio.attr,
>  	&dev_attr_stable_pages_required.attr,
> +	&dev_attr_strictlimit.attr,
>  	NULL,

Well, the patch is certainly simple and straightforward enough and
*seems* like it will be useful.  The main (and large!) downside is that
it adds to the user interface, so we'll have to maintain this feature
and its functionality forever.

Given this, my concern is that while potentially useful, the feature
might not be *sufficiently* useful to justify its inclusion.  In that
case we'll end up addressing these issues by other means, and then
we're left maintaining this obsolete legacy feature.

So I'm thinking that unless someone can show that this is good and
complete and sufficient for a "large enough" set of issues, I'll take a
pass on the patch[1].  What do people think?


[1] Actually, I'll stick it in -mm and maintain it, so next time
someone reports an issue I can say "hey, try this".
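
For anyone who wants to try it, here is a minimal userspace sketch of
poking the proposed knob. It assumes the patch above is applied; the
BDI name "8:32" and the 1% ratio are just example values, equivalent to
"echo 1 > /sys/class/bdi/8:32/strictlimit" from a shell:

#include <stdio.h>

/* Write a value to one sysfs attribute of a BDI; illustrative only. */
static int bdi_set(const char *bdi, const char *attr, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/bdi/%s/%s", bdi, attr);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Example: a slow USB disk whose backing device shows up as BDI 8:32. */
	bdi_set("8:32", "max_ratio", "1");	/* cap its share of the dirty cache at 1% */
	bdi_set("8:32", "strictlimit", "1");	/* enforce the cap before the global limit */
	return 0;
}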


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2013-11-22 23:45 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-25  7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov
2013-10-25  8:18 ` Linus Torvalds
2013-10-25  8:30   ` Artem S. Tashkinov
2013-10-25  8:43     ` Linus Torvalds
2013-10-25  9:15       ` Karl Kiniger
2013-10-29 20:30         ` Jan Kara
2013-10-29 20:43           ` Andrew Morton
2013-10-29 21:30             ` Jan Kara
2013-10-29 21:36             ` Linus Torvalds
2013-10-31 14:26           ` Karl Kiniger
2013-11-01 14:25             ` Maxim Patlasov
2013-11-01 14:31             ` [PATCH] mm: add strictlimit knob Maxim Patlasov
2013-11-04 22:01               ` Andrew Morton
2013-11-06 14:30                 ` Maxim Patlasov
2013-11-06 15:05                 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
2013-11-07 12:26                   ` Henrique de Moraes Holschuh
2013-11-22 23:45                   ` Andrew Morton
2013-10-25 11:28       ` Disabling in-memory write cache for x86-64 in Linux II David Lang
2013-10-25  9:18     ` Theodore Ts'o
2013-10-25  9:29       ` Andrew Morton
2013-10-25  9:32         ` Linus Torvalds
2013-10-26 11:32           ` Pavel Machek
2013-10-26 20:03             ` Linus Torvalds
2013-10-29 20:57           ` Jan Kara
2013-10-29 21:33             ` Linus Torvalds
2013-10-29 22:13               ` Jan Kara
2013-10-29 22:42                 ` Linus Torvalds
2013-11-01 17:22                   ` Fengguang Wu
2013-11-04 12:19                     ` Pavel Machek
2013-11-04 12:26                   ` Pavel Machek
2013-10-30 12:01             ` Mel Gorman
2013-11-19 17:17               ` Rob Landley
2013-11-20 20:52                 ` One Thousand Gnomes
2013-10-25 22:37         ` Fengguang Wu
2013-10-25 23:05       ` Fengguang Wu
2013-10-25 23:37         ` Theodore Ts'o
2013-10-29 20:40           ` Jan Kara
2013-10-30 10:07             ` Artem S. Tashkinov
2013-10-30 15:12               ` Jan Kara
2013-11-05  0:50   ` Andreas Dilger
2013-11-05  4:12     ` Dave Chinner
2013-11-07 13:48       ` Jan Kara
2013-11-11  3:22         ` Dave Chinner
2013-11-11 19:31           ` Jan Kara
2013-10-25 10:49 ` NeilBrown
2013-10-25 11:26   ` David Lang
2013-10-25 18:26     ` Artem S. Tashkinov
2013-10-25 19:40       ` Diego Calleja
2013-10-25 23:32         ` Fengguang Wu
2013-11-15 15:48           ` Diego Calleja
2013-10-25 20:43       ` NeilBrown
2013-10-25 21:03         ` Artem S. Tashkinov
2013-10-25 22:11           ` NeilBrown
     [not found]             ` <CAF7GXvpJVLYDS5NfH-NVuN9bOJjAS5c1MQqSTjoiVBHJt6bWcw@mail.gmail.com>
2013-11-05  1:47               ` David Lang
2013-11-05  2:08               ` NeilBrown
2013-10-29 20:49       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).