* [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
@ 2007-02-23 12:03 Tomoki Sekiyama
  2007-02-24  4:46 ` KAMEZAWA Hiroyuki
  2007-02-24 13:15 ` Nikita Danilov
  0 siblings, 2 replies; 12+ messages in thread
From: Tomoki Sekiyama @ 2007-02-23 12:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, miklos, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki

Hi,

I have observed a problem that write(2) can be blocked for a long time
if a system has several disks and is under heavy I/O pressure. This
patchset is to avoid the problem.

Example of the problem:

There are two processes on a system which has two disks. Process-A
writes heavily to disk-a, and process-B occasionally writes small data
(e.g. log files) to disk-b. A portion of system memory, whose size
depends on vm.dirty_ratio (typically 40%), is filled up with Dirty
and Writeback pages of disk-a.

In this situation, write(2) of process-B can be blocked for a very
long time (more than 60 seconds), although the load of disk-b is quite
low. In particular, the system becomes quite slow if disk-a is
slow (e.g. a backup to a USB disk).

This seems to be the same problem as discussed in LKML:
http://marc.theaimsgroup.com/?t=115559902900003
and
http://marc.theaimsgroup.com/?t=117182340400003

Root cause:

I found this problem is caused by balance_dirty_pages().

While Dirty+Writeback pages exceed 40% of memory, process-B is
blocked in balance_dirty_pages() until writeback of some (`write_chunk',
typically 1536) dirty pages on disk-b has been started.

However, because disk-b has only a few dirty pages, process-B will
be blocked until writeback to disk-a is completed and Dirty+Writeback
falls below 40%.

Solution:

If a process cannot write `write_chunk' pages in
balance_dirty_pages(), I regard all of the dirty pages for that disk
as already written back and consider the disk clean.

To avoid exhausting free memory with dirty pages by bypassing this
blocking, this patchset adds a new threshold named vm.dirty_limit_ratio
to sysctl.

It modifies balance_dirty_pages() not to block when the amount of
Dirty+Writeback is less than vm.dirty_limit_ratio percent of memory.
Otherwise, writers are throttled as current Linux does.


In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
used as the clamping level of Dirty+Writeback, and vm.dirty_ratio is
used as the level at which a writer will itself start writeback of the
dirty pages.


Testing Results:

In the situation explained in the "Example of the problem" section, I
measured the time of write(2) to disk-b.
The write completed in 30ms or less under a kernel with this
patchset.

When nr_requests is set too high (e.g. 8192), Dirty+Writeback grows
close to vm.dirty_limit_ratio (45% of system memory by default). In
that case, write(2) sometimes took about 1 second.


This patchset can be applied to 2.6.20-mm2.
It consists of 3 pieces:

1/3 - add a sysctl variable `vm.dirty_limit_ratio'
2/3 - modify get_dirty_limits() to return the limit of dirty pages.
3/3 - break out of the balance_dirty_pages() loop if the disk has no
      remaining dirty pages and Dirty+Writeback < vm.dirty_limit_ratio.

-- 
Tomoki Sekiyama
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: tomoki.sekiyama.qu@hitachi.com



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-23 12:03 [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers Tomoki Sekiyama
@ 2007-02-24  4:46 ` KAMEZAWA Hiroyuki
  2007-02-27  0:50   ` Tomoki Sekiyama
  2007-02-24 13:15 ` Nikita Danilov
  1 sibling, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-24  4:46 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki

On Fri, 23 Feb 2007 21:03:37 +0900
Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:

> Hi,
> 
> I have observed a problem that write(2) can be blocked for a long time
> if a system has several disks and is under heavy I/O pressure. This
> patchset is to avoid the problem.
> 
> Example of the problem:
> 
> There are two processes on a system which has two disks. Process-A
> writes heavily to disk-a, and process-B occasionally writes small data
> (e.g. log files) to disk-b. A portion of system memory, whose size
> depends on vm.dirty_ratio (typically 40%), is filled up with Dirty
> and Writeback pages of disk-a.
> 
> In this situation, write(2) of process-B can be blocked for a very
> long time (more than 60 seconds), although the load of disk-b is quite
> low. In particular, the system becomes quite slow if disk-a is
> slow (e.g. a backup to a USB disk).
> 
> This seems to be the same problem as discussed in LKML:
> http://marc.theaimsgroup.com/?t=115559902900003
> and
> http://marc.theaimsgroup.com/?t=117182340400003
> 
Interesting, but how about adjusting this parameter as below instead of
adding a new control knob? (This kind of knob is not easy to use.)

==
                struct writeback_control wbc = {
                        .bdi            = bdi,
                        .sync_mode      = WB_SYNC_NONE,
                        .older_than_this = NULL,
                        .nr_to_write    = 0,
                        .range_cyclic   = 1,
                };
<snip>
                if (nr_reclaimable) {
			/* Just do what I can do */
			dirty_pages_on_device = count_dirty_pages_on_device_limited(bdi, writechunk);
			wbc.nr_to_write = dirty_pages_on_device;
			writeback_inodes(&wbc);
 
==

count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
number of dirty pages on bdi; if the number of dirty pages on bdi is
larger than writechunk, it just returns writechunk.
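
In other words, the hypothetical helper is just a clamped count; a
minimal userspace sketch of the semantics described above (the function
name and signature are assumptions, as in this mail):

```c
/* Hypothetical semantics of count_dirty_pages_on_device_limited():
 * report the dirty pages on the backing device, clamped to writechunk,
 * so the caller never asks writeback_inodes() for more than writechunk. */
static unsigned long count_dirty_pages_limited(unsigned long bdi_dirty,
                                               unsigned long writechunk)
{
    return bdi_dirty < writechunk ? bdi_dirty : writechunk;
}
```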

-Kame







* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-23 12:03 [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers Tomoki Sekiyama
  2007-02-24  4:46 ` KAMEZAWA Hiroyuki
@ 2007-02-24 13:15 ` Nikita Danilov
  2007-02-27  0:52   ` Tomoki Sekiyama
  1 sibling, 1 reply; 12+ messages in thread
From: Nikita Danilov @ 2007-02-24 13:15 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki

Tomoki Sekiyama writes:
 > Hi,

Hello,

 > 

[...]

 > 
 > While Dirty+Writeback pages get more than 40% of memory, process-B is
 > blocked in balance_dirty_pages() until writeback of some (`write_chunk',
 > typically = 1536) dirty pages on disk-b is started.

Maybe the simpler solution is to use separate variables to control the
ratelimit and the write chunk?

writeback_set_ratelimit() adjusts ratelimit_pages to avoid too frequent
calls to balance_dirty_pages(), but once we are inside of
writeback_inodes(), there is no need to write especially many pages in
one go: overhead of any additional looping is negligible, when compared
with the cost of writing.

Speaking of which, now that expensive get_writeback_state() is gone from
page-writeback.c why do we need adjustable ratelimiting at all? It looks
like writeback_set_ratelimit() can be dropped, and fixed ratelimit used
instead.

Nikita.



* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-24  4:46 ` KAMEZAWA Hiroyuki
@ 2007-02-27  0:50   ` Tomoki Sekiyama
  2007-02-27  1:39     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: Tomoki Sekiyama @ 2007-02-27  0:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, nikita

Hi Kamezawa-san,

thanks for your reply.

KAMEZAWA Hiroyuki wrote:
> Interesting, but how about adjusting this parameter as below instead of
> adding a new control knob? (This kind of knob is not easy to use.)
>
> ==
>                 struct writeback_control wbc = {
>                         .bdi            = bdi,
>                         .sync_mode      = WB_SYNC_NONE,
>                         .older_than_this = NULL,
>                         .nr_to_write    = 0,
>                         .range_cyclic   = 1,
>                 };
> <snip>
>                 if (nr_reclaimable) {
> 			/* Just do what I can do */
> 			dirty_pages_on_device = count_dirty_pages_on_device_limited(bdi, writechunk);
> 			wbc.nr_to_write = dirty_pages_on_device;
> 			writeback_inodes(&wbc);
>
> ==
>
> count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
> number of dirty pages on bdi; if the number of dirty pages on bdi is
> larger than writechunk, it just returns writechunk.


I think that way is not enough to control the total amount of
Dirty+Writeback.

In that way, while writeback_inodes() scans for dirty pages and writes
them back, the caller will be blocked only if the length of the write-
requests queue is longer than nr_requests. If so, Writeback may consume
tens of MB of memory for each queue, because nr_requests is 128 and the
maximum size of a request is 512KB. If you have several devices, it can
consume more than a hundred MB of memory.

I was concerned about that, so I introduced dirty_limit_ratio to limit
the total amount of Dirty+Writeback pages.
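
The figures above work out as follows (a back-of-the-envelope sketch
using the 128-request and 512KB values quoted in this mail; the helper
name is illustrative):

```c
/* Worst-case Writeback memory pinned by one full device request queue:
 * 128 requests * 512KB/request = 64MB per queue; several devices can
 * therefore pin well over a hundred MB in Writeback pages. */
static unsigned long queue_writeback_bytes(unsigned long nr_requests,
                                           unsigned long max_request_bytes)
{
    return nr_requests * max_request_bytes;
}
```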


Regards
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-24 13:15 ` Nikita Danilov
@ 2007-02-27  0:52   ` Tomoki Sekiyama
  2007-03-01 12:47     ` Leroy van Logchem
  0 siblings, 1 reply; 12+ messages in thread
From: Tomoki Sekiyama @ 2007-02-27  0:52 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki

Hi Nikita,

thanks for your comments.

Nikita Danilov wrote:
>> While Dirty+Writeback pages exceed 40% of memory, process-B is
>> blocked in balance_dirty_pages() until writeback of some (`write_chunk',
>> typically 1536) dirty pages on disk-b has been started.
>
> Maybe the simpler solution is to use separate variables to control the
> ratelimit and the write chunk?

No, I think it's difficult to throttle the total Dirty+Writeback only
with write_chunk, because write_chunk just affects the Dirty and
Writeback of each device (in this case, throttling is done in the
write-requests queue of each backing device, as I said in another mail).

Throttling of the total Dirty+Writeback should also be done in the VM
itself, and to control that, I added `dirty_limit_ratio.'


> writeback_set_ratelimit() adjusts ratelimit_pages to avoid too frequent
> calls to balance_dirty_pages(), but once we are inside of
> writeback_inodes(), there is no need to write especially many pages in
> one go: overhead of any additional looping is negligible, when compared
> with the cost of writing.
>
> Speaking of which, now that expensive get_writeback_state() is gone from
> page-writeback.c why do we need adjustable ratelimiting at all? It looks
> like writeback_set_ratelimit() can be dropped, and fixed ratelimit used
> instead.

As far as I can see, adjustable ratelimiting is the actual cause of the
long wait on writes to a disk with a light load.
I think removing adjustable ratelimiting should be done in another patch...


Regards
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-27  0:50   ` Tomoki Sekiyama
@ 2007-02-27  1:39     ` KAMEZAWA Hiroyuki
  2007-03-02  1:26       ` Tomoki Sekiyama
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-27  1:39 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, nikita

On Tue, 27 Feb 2007 09:50:16 +0900
Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:

> Hi Kamezawa-san,
> 
> thanks for your reply.
> 
> KAMEZAWA Hiroyuki wrote:
> > Interesting, but how about adjusting this parameter as below instead of
> > adding a new control knob? (This kind of knob is not easy to use.)
> >
> > ==
> >                 struct writeback_control wbc = {
> >                         .bdi            = bdi,
> >                         .sync_mode      = WB_SYNC_NONE,
> >                         .older_than_this = NULL,
> >                         .nr_to_write    = 0,
> >                         .range_cyclic   = 1,
> >                 };
> > <snip>
> >                 if (nr_reclaimable) {
> > 			/* Just do what I can do */
> > 			dirty_pages_on_device = count_dirty_pages_on_device_limited(bdi, writechunk);
> > 			wbc.nr_to_write = dirty_pages_on_device;
> > 			writeback_inodes(&wbc);
> >
> > ==
> >
> > count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
> > number of dirty pages on bdi; if the number of dirty pages on bdi is
> > larger than writechunk, it just returns writechunk.
> 
> 
> I think that way is not enough to control the total amount of
> Dirty+Writeback.
> 
> In that way, while writeback_inodes() scans for dirty pages and writes
> them back, the caller will be blocked only if the length of the write-
> requests queue is longer than nr_requests.
What does nr_requests mean?
But OK, maybe I'm not understanding. What I want to ask you is to do
per-device write throttling rather than adding a new parameter.

Bye.
-Kame





* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-27  0:52   ` Tomoki Sekiyama
@ 2007-03-01 12:47     ` Leroy van Logchem
  2007-03-02  9:16       ` Brice Figureau
  2007-03-07 13:53       ` Yuji Kakutani
  0 siblings, 2 replies; 12+ messages in thread
From: Leroy van Logchem @ 2007-03-01 12:47 UTC (permalink / raw)
  To: linux-kernel

Tomoki Sekiyama <tomoki.sekiyama.qu <at> hitachi.com> writes:

> thanks for your comments.


The default dirty_ratio on most 2.6 kernels tends to be too large imo.
If you are going to do sustained writes of multiple times the size of
memory, you have at least two problems.

1) The precious dentry and inode caches will be dropped, leaving you with
a *very* unresponsive system
2) The amount of dirty pages which need to be flushed to disk is huge,
taking all VM while the i/o channel takes uninterruptible time to flush it.

What we really need is a 'cfq' for all processes -especially misbehaving
ones like dd if=/dev/zero of=/location/large bs=1M count=10000-.
If you want to DoS the 2.6 kernel, start an ever-running dd write and
you know what I mean. Huge latencies due to the fact that all name_to_inode
caches are lost and have to be fetched from disk again, only to be quickly
flushed again and again. I already explained this disaster scenario to
Linus, Andrew and Jens, hoping for an auto-tuning solution which takes
disk speed per partition into account.

At the moment we cope with this by preserving important caches with
sysctl vm.vfs_cache_pressure = 1 and vm.dirty_ratio = 2 combined with
vm.dirty_background_ratio = 1. Some benchmarks may get worse, but you
have a more resilient server.

I hope the VM subsystem will cope with applications which do not advise
it what to do with their cached pages. For now we use posix_fadvise
POSIX_FADV_DONTNEED as a patch to Samba 3 in order to at least be able
to write larger-than-memory files without discarding the important slab
caches.




* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-02-27  1:39     ` KAMEZAWA Hiroyuki
@ 2007-03-02  1:26       ` Tomoki Sekiyama
  0 siblings, 0 replies; 12+ messages in thread
From: Tomoki Sekiyama @ 2007-03-02  1:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, nikita

Hi Kamezawa-san,

KAMEZAWA Hiroyuki wrote:
> > > Interesting, but how about adjusting this parameter as below instead of
> > > adding a new control knob? (This kind of knob is not easy to use.)
<snip>
> > > count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
> > > number of dirty pages on bdi; if the number of dirty pages on bdi is
> > > larger than writechunk, it just returns writechunk.
> >
> > I think that way is not enough to control the total amount of
> > Dirty+Writeback.
> >
> > In that way, while writeback_inodes() scans for dirty pages and writes
> > them back, the caller will be blocked only if the length of the write-
> > requests queue is longer than nr_requests.

> What does nr_requests mean?

nr_requests is a parameter giving the upper limit on the length of the
I/O (read- and write-) request queue of a device, configurable via
/sys/block/<device>/queue/nr_requests. A process which performs I/O
when there are more than nr_requests requests in the queue will be
blocked.

> But OK, maybe I'm not understanding. What I want to ask you is to do
> per-device write throttling rather than adding a new parameter.

In the patchset, per-device write throttling is done by the behavior
of the write-requests queue described above.

When the queue of the disk becomes full during writeback of Dirty pages
in writeback_inodes(), heavy writers to the disk will be blocked.
In contrast, if writes are so occasional that the queue doesn't become
full, they will not be blocked.
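
The queue behavior described here amounts to a simple admission check
(a userspace model, not kernel code; 128 is the default nr_requests
quoted earlier in the thread):

```c
#include <stdbool.h>

/* Illustrative model of per-device throttling via the request queue:
 * a process submitting I/O blocks only when the device queue already
 * holds nr_requests in-flight requests. */
static bool submit_would_block(unsigned int in_flight, unsigned int nr_requests)
{
    return in_flight >= nr_requests;
}
```

An occasional writer rarely finds the queue full and so is never
throttled, while a heavy writer keeps it full and blocks.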


Regards
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-01 12:47     ` Leroy van Logchem
@ 2007-03-02  9:16       ` Brice Figureau
  2007-03-02 13:06         ` Leroy van Logchem
  2007-03-07 13:53       ` Yuji Kakutani
  1 sibling, 1 reply; 12+ messages in thread
From: Brice Figureau @ 2007-03-02  9:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Leroy van Logchem

Hi,

On Thu, 2007-03-01 at 12:47 +0000, Leroy van Logchem wrote:
> Tomoki Sekiyama <tomoki.sekiyama.qu <at> hitachi.com> writes:
> > thanks for your comments.
> 
> The default dirty_ratio on most 2.6 kernels tends to be too large imo.
> If you are going to do sustained writes of multiple times the size of
> memory, you have at least two problems.
> 
> 1) The precious dentry and inode caches will be dropped, leaving you with
> a *very* unresponsive system
> 2) The amount of dirty pages which need to be flushed to disk is huge,
> taking all VM while the i/o channel takes uninterruptible time to flush it.
> 
> What we really need is a 'cfq' for all processes -especially misbehaving
> ones like dd if=/dev/zero of=/location/large bs=1M count=10000-.
> If you want to DoS the 2.6 kernel, start an ever-running dd write and
> you know what I mean. Huge latencies due to the fact that all name_to_inode
> caches are lost and have to be fetched from disk again, only to be quickly
> flushed again and again. I already explained this disaster scenario to
> Linus, Andrew and Jens, hoping for an auto-tuning solution which takes
> disk speed per partition into account.
> 
> At the moment we cope with this by preserving important caches with
> sysctl vm.vfs_cache_pressure = 1 and vm.dirty_ratio = 2 combined with
> vm.dirty_background_ratio = 1. Some benchmarks may get worse, but you
> have a more resilient server.
> 
> I hope the VM subsystem will cope with applications which do not advise
> it what to do with their cached pages. For now we use posix_fadvise
> POSIX_FADV_DONTNEED as a patch to Samba 3 in order to at least be able
> to write larger-than-memory files without discarding the important slab
> caches.

I'm sorry to piggy-back this thread.

Could it be what I'm experiencing in the following bugzilla report:
http://bugzilla.kernel.org/show_bug.cgi?id=7372

As I explained in the report, I see this issue only since 2.6.18.
So if your concern is related to mine, what could have changed between
2.6.17 and 2.6.18 related to this?

Unfortunately it is not possible to git-bisect this issue, as my problem
appears on a live database server. One of the reporters could git-bisect
it to a SATA+SCSI patch, but looking at it I can't really see what's wrong.
As soon as I am able to build 2.6.18 minus this patch, we can verify
whether or not we are both dealing with the same issue.

I'm ready to help any kernel developer who wants to look at my issue.
Just ask what debug material you need and I'll try to provide it.

Please CC: me on replies, as I'm not subscribed to the list.

Regards,
-- 
Brice Figureau <brice+lklm@daysofwonder.com>



* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-02  9:16       ` Brice Figureau
@ 2007-03-02 13:06         ` Leroy van Logchem
  2007-03-02 16:04           ` Brice Figureau
  0 siblings, 1 reply; 12+ messages in thread
From: Leroy van Logchem @ 2007-03-02 13:06 UTC (permalink / raw)
  To: linux-kernel

> I'm sorry to piggy-back this thread.
> 
> Could it be what I'm experiencing in the following bugzilla report:
> http://bugzilla.kernel.org/show_bug.cgi?id=7372
> 
> As I explained in the report, I see this issue only since 2.6.18.
> So if your concern is related to mine, what could have changed between
> 2.6.17 and 2.6.18 related to this?

I don't think it's 2.6.x related; it's been under the sheets from the start.

Related to your problem in the 7372 bug:

Pages are kept in memory for re-use, which is fast and fine except for:
1) data without re-use value, or even single-use data
2) applications _do not_ advise the kernel how to cache pages related
   to their self-generated i/o. POSIX does provide mechanisms to
   properly do so, but the kernel should help these poor apps.

To minimize your MySQL backup cp(1) problem, try this workaround:

cat ./cp_direct.sh 
#!/bin/sh
dd if=$1 of=$2 bs=1M iflag=direct oflag=direct

Combine this with [/etc/sysctl.conf]:

vm.vfs_cache_pressure = 1
vm.dirty_ratio = 2
vm.dirty_background_ratio = 1

This should reduce both the stress on the vm and response latency during
interactive work.

--
Leroy



* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-02 13:06         ` Leroy van Logchem
@ 2007-03-02 16:04           ` Brice Figureau
  0 siblings, 0 replies; 12+ messages in thread
From: Brice Figureau @ 2007-03-02 16:04 UTC (permalink / raw)
  To: Leroy van Logchem; +Cc: linux-kernel

Hi,

On Fri, 2007-03-02 at 13:06 +0000, Leroy van Logchem wrote:
> > I'm sorry to piggy-back this thread.
> > 
> > Could it be what I'm experiencing in the following bugzilla report:
> > http://bugzilla.kernel.org/show_bug.cgi?id=7372
> > 
> > As I explained in the report, I see this issue only since 2.6.18.
> > So if your concern is related to mine, what could have changed between
> > 2.6.17 and 2.6.18 related to this?
> 
> I don't think it's 2.6.x related; it's been under the sheets from the start.

Maybe. Still, the issue was aggravated between 2.6.17 and 2.6.18.
Right now (running 2.6.17.13) I can back up and do whatever I want, even
with high memory pressure from mysql (which is consuming about 3.8GB of
4GB right now).

Under 2.6.18 and later it is simply impossible to do that (except
with the dd direct I/O trick).
This makes me think that something else has joined the party in
2.6.18...

> Related to your problem in the 7372 bug:
> 
> Pages are kept in memory for re-use, which is fast and fine except for:
> 1) data without re-use value, or even single-use data
> 2) applications _do not_ advise the kernel how to cache pages related
>    to their self-generated i/o. POSIX does provide mechanisms to
>    properly do so, but the kernel should help these poor apps.
>
> To minimize your MySQL backup cp(1) problem, try this workaround:
> 
> cat ./cp_direct.sh 
> #!/bin/sh
> dd if=$1 of=$2 bs=1M iflag=direct oflag=direct

Yes, I already did this, and it helped, see:
http://bugzilla.kernel.org/show_bug.cgi?id=7372#c16

> Combine this with [/etc/sysctl.conf]:
> 
> vm.vfs_cache_pressure = 1
> vm.dirty_ratio = 2
> vm.dirty_background_ratio = 1
> 
> This should reduce both the stress on the vm and response latency during
> interactive work.

I'll test vfs_cache_pressure (I was already using those dirty_*
settings) when I next reboot the server to 2.6.20.

Please CC: me on list replies as I'm not subscribed to the list.

Thanks for the tips.
Regards
-- 
Brice Figureau <brice+lklm@daysofwonder.com>



* Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-01 12:47     ` Leroy van Logchem
  2007-03-02  9:16       ` Brice Figureau
@ 2007-03-07 13:53       ` Yuji Kakutani
  1 sibling, 0 replies; 12+ messages in thread
From: Yuji Kakutani @ 2007-03-07 13:53 UTC (permalink / raw)
  To: Leroy van Logchem
  Cc: linux-kernel, akpm, miklos, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, tomoki.sekiyama.qu,
	soshima, haoki

Hi,
Thank you for your comments.

Leroy van Logchem wrote:
>The default dirty_ratio on most 2.6 kernels tends to be too large imo.
>If you are going to do sustained writes of multiple times the size of
>memory, you have at least two problems.
>
>1) The precious dentry and inode caches will be dropped, leaving you with
>a *very* unresponsive system
>2) The amount of dirty pages which need to be flushed to disk is huge,
>taking all VM while the i/o channel takes uninterruptible time to flush it.

I agree that the default dirty_ratio is too large.
What I have observed is that write(2) to a disk with only a light load
is blocked in balance_dirty_pages() due to heavy writes to *another*
disk, which doesn't seem to be the same issue as the one you describe.
But decreasing dirty_ratio also shortens the blocking time in our
experiments.

>What we really need is a 'cfq' for all processes -especially misbehaving
>ones like dd if=/dev/zero of=/location/large bs=1M count=10000-.
>If you want to DoS the 2.6 kernel, start an ever-running dd write and
>you know what I mean. Huge latencies due to the fact that all name_to_inode
>caches are lost and have to be fetched from disk again, only to be quickly
>flushed again and again. I already explained this disaster scenario to
>Linus, Andrew and Jens, hoping for an auto-tuning solution which takes
>disk speed per partition into account.

Thank you for the information.
Do you know of any progress on this issue?


Regards,
-- 
Tomoki Sekiyama & Yuji Kakutani
Hitachi, Ltd., Systems Development Laboratory


