* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05 19:28 Ananiev, Leonid I
  2006-07-05 20:14 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05 19:28 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

I have added the lines proposed by Nikita:
		if (pdflush_operation(background_writeout, 0))
                writeback_inodes(&wbc);
and tested them with iozone. The throughput is 50-53 MB/sec, which is less
than the 74-105 MB/sec results sent earlier.

Leonid

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Wednesday, July 05, 2006 6:57 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > suppose you have more than MAX_PDFLUSH_THREADS
 > Do you consider that the drawback of the patch is in that the value
 > MAX_PDFLUSH_THREADS is not well known high or this limit is not deleted

I am more concerned, that this patch _limits_ maximal possible writeback
concurrency to MAX_PDFLUSH_THREADS.

 > at all? The limit could be deleted after patching because the line 

That sounds a bit too extreme, given that pdflush is used for a lot of
things other than background write-out.

 > +	if (writeback_in_progress(bdi)) {
 > keeps off creating extra pdflush threads.

What about replacing

 		pdflush_operation(background_writeout, 0);

with

 		if (pdflush_operation(background_writeout, 0))
                /*
                 * Fall back to synchronous writeback if all pdflush
                 * threads are busy.
                 */
                writeback_inodes(&wbc);

? This will combine increased concurrency in your target case (single
writer) with some safety net in the case of multiple writers and
multiple devices.
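
For orientation, the fallback proposed here would sit inside the write-out
loop of balance_dirty_pages() roughly as sketched below. This is only a
reconstruction assembled from the fragments quoted elsewhere in this thread
(the wbc initializer and write_chunk come from the 2.6-era
mm/page-writeback.c); it is not the actual diff:

		struct writeback_control wbc = {
			.bdi		= bdi,
			.sync_mode	= WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write	= write_chunk,
		};

		/* Prefer handing the write-out to an idle pdflush worker... */
		if (pdflush_operation(background_writeout, 0))
			/*
			 * ...and only if no worker could be found (non-zero
			 * return) write back synchronously from this thread.
			 */
			writeback_inodes(&wbc);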

 > 
 > Leonid

Nikita.


* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05 19:28 [PATCH] mm: moving dirty pages balancing to pdfludh entirely Ananiev, Leonid I
@ 2006-07-05 20:14 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 20:14 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > I have added proposed by Nikita lines
 > 		if (pdflush_operation(background_writeout, 0))
 >                 writeback_inodes(&wbc);
 > and tested it with iozone. The throughput is 50-53 MB/sec. It is less
 > than 74-105 MB/sec results sent earlier.

If this change has any effect at all, then maximal number of pdflush
threads has been started. But there is only one device, so what are
these threads doing?

 > 
 > Leonid

Nikita.


* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-06 21:07 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-06 21:07 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
I used "patched up" to underline a fast, simple, superficial fix.

> and small number of pdflush threads is able to achieve necessary IO
> concurrency level.
If it is as you say, set MAX_PDFLUSH_THREADS=1000: you will see a huge
number of pdflush threads created. That is why I consider
MAX_PDFLUSH_THREADS=8 a quick patch-up which is now used as an
argument against parallelism.

> If asynchronous write-out cannot cope with the rate of dirtying,
> synchronous write-out starts.
Quite the contrary in the current design: synchronous write-out is started
first, and afterwards pdflush is woken up and sees that everything is
already done.

> You, on the other hand, are trying to use pdflush for the goal it wasn't
> designed for.
After patching, pdflush is REALLY used for writing out dirty pages
concurrently with the user thread's work.

> No wonder it doesn't work.

It doesn't work for you because you have not run it. Have you tried to
run any real kernel with the proposed patch and without it?

> You didn't look carefully enough. :-)
> ratelimit_pages = 32, hence, sync_writeback_pages() returns 48. It could
> be different, of course, if ratelimit_pages were.

I ask once more: why is it 48? Does it give the best performance for any
configuration? Your explanation is: 48 because it is 32+16. Is that an
explanation? Is that a proof?

This patch gives a performance benefit on current REAL hardware and
software and opens further performance benefits on SMP. The patch was
carefully benchmarked on several configurations: 1-8 CPUs; 1-8 GB memory;
1-15 disks.
You have proposed tens of far-fetched counter-arguments. I have explained
carefully why each of your arguments is ungrounded.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Thursday, July 06, 2006 9:37 PM
To: Ananiev, Leonid I
Cc: Bret Towe; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > >> by introducing MAX_PDFLUSH_THREADS=8. Why just 8?
 > > Sorry, I don't understand. pdflush.c appeared in 2.5.8 and
 > Thanks for explanation. Why just 8? Answer: it was introduced in 2.5.8.

No, no, I was commenting on your (incorrect) statement that kernel "was
patched up by introducing MAX_PDFLUSH_THREADS=8". It never was: this
limit is part of the original design, not some "patch".

 > Why the constant MAX_PDFLUSH_THREADS is needed? Is it because current
 > kernel may create huge number of pdflush threads and overload the
 > system? Why we can not set MAX_PDFLUSH_THREADS=512? 1000?

Because this is not needed for current pdflush usage patterns. pdflush
is used for light background write-out, and small number of pdflush
threads is able to achieve necessary IO concurrency level, because they
are mostly non-blocking, thanks to the current_is_pdflush() checks. If
asynchronous write-out cannot cope with the rate of dirtying,
synchronous write-out starts.

You, on the other hand, are trying to use pdflush for the goal it wasn't
designed for. No wonder it doesn't work.

 > 
 > >> Why 48?
 > > This is explained in comment just above sync_writeback_pages()
 > > definition. Basically, ratelimit_pages pages might be dirtied between
 > > calls to balance_dirty_pages(), and the latter tries to write out *more*
 > > pages to keep number of dirty pages under control: negative feedback
 > > control loop, of sorts.
 > 
 > I had asked "why 48?" is hard coded for any configuration. I do not see
 > "48" in your explanation.

You didn't look carefully enough. :-)


/*
 * When balance_dirty_pages decides that the caller needs to perform some
 * non-background writeback, this is how many pages it will attempt to write.
 * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
 * large amounts of I/O are submitted.
 */
static inline long sync_writeback_pages(void)
{
	return ratelimit_pages + ratelimit_pages / 2;
}

ratelimit_pages = 32, hence, sync_writeback_pages() returns 48. It could
be different, of course, if ratelimit_pages were.

 > 
 > You do not recommend to use hard coded constants but
 > MAX_PDFLUSH_THREADS=8 and write_chunk=48 are sacred according you mind.

No, I think the fewer hard-coded constants are here---the better. I'd
happily eliminate MAX_PDFLUSH_THREADS, but just dropping this constraint
is a non-solution in search of a problem, obviously.

But this discussion is degenerating, so I shall participate in it no
longer.

 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-06 16:33 Ananiev, Leonid I
@ 2006-07-06 17:37 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-06 17:37 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Bret Towe, Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > >> by introducing MAX_PDFLUSH_THREADS=8. Why just 8?
 > > Sorry, I don't understand. pdflush.c appeared in 2.5.8 and
 > Thanks for explanation. Why just 8? Answer: it was introduced in 2.5.8.

No, no, I was commenting on your (incorrect) statement that kernel "was
patched up by introducing MAX_PDFLUSH_THREADS=8". It never was: this
limit is part of the original design, not some "patch".

 > Why the constant MAX_PDFLUSH_THREADS is needed? Is it because current
 > kernel may create huge number of pdflush threads and overload the
 > system? Why we can not set MAX_PDFLUSH_THREADS=512? 1000?

Because this is not needed for current pdflush usage patterns. pdflush
is used for light background write-out, and small number of pdflush
threads is able to achieve necessary IO concurrency level, because they
are mostly non-blocking, thanks to the current_is_pdflush() checks. If
asynchronous write-out cannot cope with the rate of dirtying,
synchronous write-out starts.

You, on the other hand, are trying to use pdflush for the goal it wasn't
designed for. No wonder it doesn't work.

 > 
 > >> Why 48?
 > > This is explained in comment just above sync_writeback_pages()
 > > definition. Basically, ratelimit_pages pages might be dirtied between
 > > calls to balance_dirty_pages(), and the latter tries to write out
 > *more*
 > > pages to keep number of dirty pages under control: negative feedback
 > > control loop, of sorts.
 > 
 > I had asked "why 48?" is hard coded for any configuration. I do not see
 > "48" in your explanation.

You didn't look carefully enough. :-)

/*
 * When balance_dirty_pages decides that the caller needs to perform some
 * non-background writeback, this is how many pages it will attempt to write.
 * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
 * large amounts of I/O are submitted.
 */
static inline long sync_writeback_pages(void)
{
	return ratelimit_pages + ratelimit_pages / 2;
}

ratelimit_pages = 32, hence, sync_writeback_pages() returns 48. It could
be different, of course, if ratelimit_pages were.

 > 
 > You do not recommend to use hard coded constants but
 > MAX_PDFLUSH_THREADS=8 and write_chunk=48 are sacred according you mind. 

No, I think the fewer hard-coded constants are here---the better. I'd
happily eliminate MAX_PDFLUSH_THREADS, but just dropping this constraint
is a non-solution in search of a problem, obviously.

But this discussion is degenerating, so I shall participate in it no
longer.

 > 
 > Leonid

Nikita.



* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-06 16:33 Ananiev, Leonid I
  2006-07-06 17:37 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-06 16:33 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Bret Towe, Linux Kernel Mailing List

Nikita Danilov writes:
>> by introducing MAX_PDFLUSH_THREADS=8. Why just 8?
> Sorry, I don't understand. pdflush.c appeared in 2.5.8 and
Thanks for the explanation. Why just 8? Answer: it was introduced in 2.5.8.
Why is the constant MAX_PDFLUSH_THREADS needed? Is it because the current
kernel may create a huge number of pdflush threads and overload the
system? Why can we not set MAX_PDFLUSH_THREADS=512? 1000?

>> Why 48?
> This is explained in comment just above sync_writeback_pages()
> definition. Basically, ratelimit_pages pages might be dirtied between
> calls to balance_dirty_pages(), and the latter tries to write out *more*
> pages to keep number of dirty pages under control: negative feedback
> control loop, of sorts.

I had asked why "48" is hard-coded for any configuration. I do not see
"48" in your explanation.

You do not recommend using hard-coded constants, yet MAX_PDFLUSH_THREADS=8
and write_chunk=48 are sacred according to you.

> it will enter balance_dirty_pages() more often than A.
User B will get 1 second of additional latency 60 times per minute and user
A only 3 times per minute, but after each press of 'ENTER'.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Thursday, July 06, 2006 7:17 PM
To: Ananiev, Leonid I
Cc: Bret Towe; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > You are inhumane. :-)
 > OK. A lame argument was a joke.
 > 
 > > No. User thread will not write _all_ dirty pages (if it does---it's
a
 > > bug in the current code and should be fixed):
 > 
 > Some bugs are fixed by the patch.
 > Current kernel generates pdflush thread unlimitedly. It was patched
up
 > by introducing MAX_PDFLUSH_THREADS=8. Why just 8? Now

Sorry, I don't understand. pdflush.c appeared in 2.5.8 and

#define MAX_PDFLUSH_THREADS    8

was here from the very beginning:
http://www.kernelhq.cc/browse-view.py?fv_nr=107897

 > MAX_PDFLUSH_THREADS still present. But it could be removed.
 > Unlimited thread generation is fixed in proposed patch.
 > 
 > Now we are discussing other wonderful constant write_chunk=48. Why
48?

This is explained in comment just above sync_writeback_pages()
definition. Basically, ratelimit_pages pages might be dirtied between
calls to balance_dirty_pages(), and the latter tries to write out *more*
pages to keep number of dirty pages under control: negative feedback
control loop, of sorts.

 > The patch deletes this magic constant using:
 > -			.nr_to_write	= write_chunk,
 > It was needed in user thread only.
 > 
 > If user thread A writes 1 byte into disk it has to write not all but
48
 > dirty pages. User main work is paused. User thread B continues
 > generating of dirty pages. Thread B is main generator of dirty pages.
 > Thread A found was able to write-out 47 pages only because user B
locks
 > inodes. That is why user A forced to wait 0.1 sec and than repeat
 > compulsory community works. User thread lunch wide inodes scanning
and
 > list creating but interrupted after 48 pages. It is inefficient
method. 

It may so happen, right, but in the long term most of writeback will be
performed by thread B, because, being heavy writer, it will enter
balance_dirty_pages() more often than A.

If you want to get rid of write latencies, you need better accounting,
to know what pages were dirtied by the current thread (or at least their
number), so that synchronous writeback can be fairer.

[...]

 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-06 13:54 Ananiev, Leonid I
@ 2006-07-06 15:16 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-06 15:16 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Bret Towe, Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > You are inhumane. :-)
 > OK. A lame argument was a joke.
 > 
 > > No. User thread will not write _all_ dirty pages (if it does---it's a
 > > bug in the current code and should be fixed):
 > 
 > Some bugs are fixed by the patch.
 > Current kernel generates pdflush thread unlimitedly. It was patched up
 > by introducing MAX_PDFLUSH_THREADS=8. Why just 8? Now

Sorry, I don't understand. pdflush.c appeared in 2.5.8 and

#define MAX_PDFLUSH_THREADS    8

was here from the very beginning: http://www.kernelhq.cc/browse-view.py?fv_nr=107897

 > MAX_PDFLUSH_THREADS still present. But it could be removed.
 > Unlimited thread generation is fixed in proposed patch.
 > 
 > Now we are discussing other wonderful constant write_chunk=48. Why 48?

This is explained in comment just above sync_writeback_pages()
definition. Basically, ratelimit_pages pages might be dirtied between
calls to balance_dirty_pages(), and the latter tries to write out *more*
pages to keep number of dirty pages under control: negative feedback
control loop, of sorts.

 > The patch deletes this magic constant using:
 > -			.nr_to_write	= write_chunk,
 > It was needed in user thread only.
 > 
 > If user thread A writes 1 byte into disk it has to write not all but 48
 > dirty pages. User main work is paused. User thread B continues
 > generating of dirty pages. Thread B is main generator of dirty pages.
 > Thread A found was able to write-out 47 pages only because user B locks
 > inodes. That is why user A forced to wait 0.1 sec and than repeat
 > compulsory community works. User thread lunch wide inodes scanning and
 > list creating but interrupted after 48 pages. It is inefficient method. 

It may so happen, right, but in the long term most of writeback will be
performed by thread B, because, being heavy writer, it will enter
balance_dirty_pages() more often than A.

If you want to get rid of write latencies, you need better accounting,
to know what pages were dirtied by the current thread (or at least their
number), so that synchronous writeback can be fairer.
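
One minimal reading of "better accounting" is a per-task counter of recently
dirtied pages, used to size that task's synchronous write-out instead of the
global write_chunk, so a light writer only pays for what it dirtied. The
fragment below is purely illustrative: the nr_dirtied field and its use are
invented here and are not part of the patch under discussion (much later
mainline kernels did grow per-task dirty accounting along these lines):

	/* hypothetical: bump a per-task counter wherever a page is dirtied */
	current->nr_dirtied++;

	/* hypothetical: in balance_dirty_pages(), size the chunk per task */
	write_chunk = current->nr_dirtied + current->nr_dirtied / 2;
	current->nr_dirtied = 0;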

[...]

 > 
 > Leonid

Nikita.



* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-06 13:54 Ananiev, Leonid I
  2006-07-06 15:16 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-06 13:54 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Bret Towe, Linux Kernel Mailing List

Nikita Danilov writes:
> You are inhumane. :-)
OK. A lame argument was a joke.

> No. User thread will not write _all_ dirty pages (if it does---it's a
> bug in the current code and should be fixed):

Some bugs are fixed by the patch.
The current kernel generates pdflush threads without limit; it was patched
up by introducing MAX_PDFLUSH_THREADS=8. Why just 8?
MAX_PDFLUSH_THREADS is still present, but it could be removed.
Unlimited thread generation is fixed in the proposed patch.

Now we are discussing another wonderful constant, write_chunk=48. Why 48?
The patch deletes this magic constant:
-			.nr_to_write	= write_chunk,
It was needed only in the user thread.

If user thread A writes even 1 byte to disk, it has to write out not all,
but 48 dirty pages. The user's main work is paused. User thread B keeps
generating dirty pages; thread B is the main generator of dirty pages.
Thread A may find it was able to write out only 47 pages, because user B
holds inode locks. That is why user A is forced to wait 0.1 sec and then
repeat this compulsory community service. The user thread launches a wide
scan of inodes and builds lists, only to be interrupted after 48 pages.
It is an inefficient method.

You can see in the very first mail that the write(2) latencies are ...; 43;
292; 346; 368 ... 400 msec. Note that there was not a single latency with
43<t<292 msec while 130,000 writes were made during the benchmark run.

The proposed patch removes the long latency fluctuations and increases
throughput. The patch has been benchmarked on real present-day machines.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Thursday, July 06, 2006 10:34 AM
To: Ananiev, Leonid I
Cc: Bret Towe; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > Some people do, should they suffer? :-)
 > You - yes. You have used that example as an argument incorrectly.

You are inhumane. :-) What is incorrect in assuming people may have many
devices?

 > 
 > > Not _all_, only nr_to_write of them
 > 	Yes. User thread writes all dirty pages in the system calling

No. User thread will not write _all_ dirty pages (if it does---it's a
bug in the current code and should be fixed):

balance_dirty_pages():
			if (pages_written >= write_chunk)
				break;		/* We've done our duty */

writeback_inodes():
		if (wbc->nr_to_write <= 0)
			break;

sync_sb_inodes():
		if (wbc->nr_to_write <= 0)
			break;

mpage_writepages():
			if (ret || (--(wbc->nr_to_write) <= 0))
				done = 1;

Everywhere down call-chain ->nr_to_write is checked.

 > writeback_inodes() and after it tests
 > 			if (pages_written >= write_chunk)

[rants skipped.]

 > 
 > Leonid

Nikita.


* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-06 11:47 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-06 11:47 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

> If this change has any effect at all, then maximal number of pdflush
> threads has been started.

I have not observed more than 2 pdflush threads after patching.
The user thread is not stopped: pdflush starts writing out dirty pages
once the low dirty level is reached, while the user thread continues its
own work concurrently as long as the high dirty limit is not reached.
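
Concretely, the division of labour described above amounts to something like
the following on the dirtying path. This is a sketch of the described
behaviour using the thresholds and helpers of that era's
balance_dirty_pages() (background_thresh, dirty_thresh, blk_congestion_wait);
it is not the actual patch:

	/* background ("low") threshold crossed: only kick pdflush */
	if (nr_reclaimable > background_thresh &&
			!writeback_in_progress(bdi))
		pdflush_operation(background_writeout, 0);

	/* the dirtier itself is throttled only at the hard ("high") limit */
	if (nr_reclaimable > dirty_thresh)
		blk_congestion_wait(WRITE, HZ/10);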

Leonid

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Thursday, July 06, 2006 12:14 AM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > I have added proposed by Nikita lines
 > 		if (pdflush_operation(background_writeout, 0))
 >                 writeback_inodes(&wbc);
 > and tested it with iozone. The throughput is 50-53 MB/sec. It is less
 > than 74-105 MB/sec results sent earlier.

If this change has any effect at all, then maximal number of pdflush
threads has been started. But there is only one device, so what are
these threads doing?

 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-06  4:30 Ananiev, Leonid I
@ 2006-07-06  6:34 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-06  6:34 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Bret Towe, Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > Some people do, should they suffer? :-)
 > You - yes. You have used that example as an argument incorrectly.

You are inhumane. :-) What is incorrect in assuming people may have many
devices?

 > 
 > > Not _all_, only nr_to_write of them
 > 	Yes. User thread writes all dirty pages in the system calling

No. User thread will not write _all_ dirty pages (if it does---it's a
bug in the current code and should be fixed):

balance_dirty_pages():
			if (pages_written >= write_chunk)
				break;		/* We've done our duty */

writeback_inodes():
		if (wbc->nr_to_write <= 0)
			break;

sync_sb_inodes():
		if (wbc->nr_to_write <= 0)
			break;

mpage_writepages():
			if (ret || (--(wbc->nr_to_write) <= 0))
				done = 1;

Everywhere down call-chain ->nr_to_write is checked.

 > writeback_inodes() and after it tests
 > 			if (pages_written >= write_chunk)

[rants skipped.]

 > 
 > Leonid

Nikita.


* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-06  4:30 Ananiev, Leonid I
  2006-07-06  6:34 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-06  4:30 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Bret Towe, Linux Kernel Mailing List

Nikita Danilov writes:
> Some people do, should they suffer? :-)
You - yes. You have used that example as an argument incorrectly.

> Not _all_, only nr_to_write of them
	Yes. User thread writes all dirty pages in the system calling
writeback_inodes() and after it tests
			if (pages_written >= write_chunk)
	or has the user thread written more than 32+32/2 pages?

> In current design each thread is responsible for write-out.
	A casual thread performs common system work, and that casual thread
is throttled or frozen by this work.
	That is why the constant write_chunk==32+32/2 is used, but too
late.
	In the current design an unbounded number of pdflush threads may be
created, and an extra pdflush thread exits only afterwards, when it finds
that another pdflush is already processing the device. I have seen 6
pdflush threads during benchmarking while I have only 1 disk. That is why
MAX_PDFLUSH_THREADS is needed in the current design. The patch adds a test
that keeps extra pdflush threads from being created.

Summary:
Nikita Danilov writes:
> Wouldn't this interfere with current->backing_dev_info logic?
	Proved: the patch does not break that logic.
> Intent of this is to throttle writers, and reduce risk of running oom
	Proved: a generator of dirty pages is still throttled after
patching.
	The extra throttling is removed; that, together with parallelism,
gives the performance benefit.
Nikita Danilov writes that in the current design a pdflush thread
> performs page-out even if queue is congested
	Proved: the new pdflush does the same.
> With your patch, this work is done from pdflush, and won't be throttled
	Proved: pdflush is throttled by device congestion.
> when direct reclaim skips writing dirty pages from tail of the
> inactive list
	Proved: direct reclaim does not skip the inactive list in the
proposed design.
> You propose to limit write-out concurrency by MAX_PDFLUSH_THREADS
	Proved: the patch adds the line that keeps an unbounded number of
pdflush threads from being created. The maximum limit could be removed in
a follow-up patch.

Leonid

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Thursday, July 06, 2006 12:06 AM
To: Ananiev, Leonid I
Cc: Bret Towe; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > Exactly to the contrary: as I explained to you, if you have more
 > devices
 > > than pdflush threads
 > I do not believe that Bret Towe has more devices than
 > MAX_PDFLUSH_THREADS=8.

Some people do, should they suffer? :-)

 > 
 > > See how wbc.nr_to_write is set up by balance_dirty_pages().
 > It is number TO write but I said about number after what user has to
 > write-out all dirty pages. 

Not _all_, only nr_to_write of them:

			if (pages_written >= write_chunk)
				break;		/* We've done our duty */

 > 
 > > imagine that MAX_PDFLUSH_THREADS equals 1
 > Imagine that CONFIG_NR_CPUS=1 for smp.
 > Kernel has a lot of "big enough" constants.

Then why introduce more of them?

In current design each thread is responsible for write-out. This means
that write-out concurrency level scales together with the number of
writers. You propose to limit write-out concurrency by
MAX_PDFLUSH_THREADS. Obviously this is an artificial limit that will be
sub-optimal sometimes.


 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05 19:50 Ananiev, Leonid I
@ 2006-07-05 20:05 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 20:05 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Bret Towe, Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > Exactly to the contrary: as I explained to you, if you have more
 > devices
 > > than pdflush threads
 > I do not believe that Bret Towe has more devices than
 > MAX_PDFLUSH_THREADS=8.

Some people do, should they suffer? :-)

 > 
 > > See how wbc.nr_to_write is set up by balance_dirty_pages().
 > It is number TO write but I said about number after what user has to
 > write-out all dirty pages. 

Not _all_, only nr_to_write of them:

			if (pages_written >= write_chunk)
				break;		/* We've done our duty */

 > 
 > > imagine that MAX_PDFLUSH_THREADS equals 1
 > Imagine that CONFIG_NR_CPUS=1 for smp.
 > Kernel has a lot of "big enough" constants.

Then why introduce more of them?

In current design each thread is responsible for write-out. This means
that write-out concurrency level scales together with the number of
writers. You propose to limit write-out concurrency by
MAX_PDFLUSH_THREADS. Obviously this is an artificial limit that will be
sub-optimal sometimes.

 > 
 > Leonid

Nikita.


* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05 19:50 Ananiev, Leonid I
  2006-07-05 20:05 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05 19:50 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Bret Towe, Linux Kernel Mailing List

Nikita Danilov writes:
> Exactly to the contrary: as I explained to you, if you have more
devices
> than pdflush threads
I do not believe that Bret Towe has more devices than
MAX_PDFLUSH_THREADS=8.

> See how wbc.nr_to_write is set up by balance_dirty_pages().
It is the number TO write, but I was speaking about the number after which
the user has to write out all dirty pages.

> imagine that MAX_PDFLUSH_THREADS equals 1
Imagine that CONFIG_NR_CPUS=1 for smp.
Kernel has a lot of "big enough" constants.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Wednesday, July 05, 2006 11:07 PM
To: Ananiev, Leonid I
Cc: Bret Towe; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > 
 > Bret Towe writes:
 > > if say some gtk app wants to write to disk it will freeze
 > > until the usb hd is completely done
 > 
 > The proposed patch fixes one real cause of long latency: if a user

Exactly to the contrary: as I explained to you, if you have more devices
than pdflush threads, your patch will result in all system doing IO as
slow as slowest devices in the system. In addition, if you have more
than MAX_PDFLUSH_THREADS processors, some of them cannot be used to
concurrently perform writeback.

 > thread writes 1 byte only to disk it could happen that one has to
write
 > all pages dirtied by all threads in the system and wait for it. The

See how wbc.nr_to_write is set up by balance_dirty_pages().

 > patch is tested and gets real benefit on real systems. A common
system
 > work is performed in common system thread but not in casual user
thread.

To understand the problem I am trying to attract your attention to,
imagine that MAX_PDFLUSH_THREADS equals 1. See what you get? BSD
(pagedaemon everybody waits upon). But Linux is not BSD, thankfully.

 > 
 > The patch does not fix other (bazillion - 1) fictitious freezing
causes
 > for imaginary configurations.

Yes, and all the world is VAX (var. "my benchmarking suite"), as they
used to say. :-)

 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05 18:10 Ananiev, Leonid I
@ 2006-07-05 19:07 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 19:07 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Bret Towe, Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > 
 > Bret Towe writes:
 > > if say some gtk app wants to write to disk it will freeze
 > > until the usb hd is completely done
 > 
 > The proposed patch fixes one real cause of long latency: if a user

Exactly to the contrary: as I explained to you, if you have more devices
than pdflush threads, your patch will result in the whole system doing IO as
slowly as the slowest devices in the system. In addition, if you have more
than MAX_PDFLUSH_THREADS processors, some of them cannot be used to
concurrently perform writeback.

 > thread writes 1 byte only to disk it could happen that one has to write
 > all pages dirtied by all threads in the system and wait for it. The

See how wbc.nr_to_write is set up by balance_dirty_pages().

 > patch is tested and gets real benefit on real systems. A common system
 > work is performed in common system thread but not in casual user thread.

To understand the problem I am trying to attract your attention to,
imagine that MAX_PDFLUSH_THREADS equals 1. See what you get? BSD
(pagedaemon everybody waits upon). But Linux is not BSD, thankfully.

 > 
 > The patch does not fix other (bazillion - 1) fictitious freezing causes
 > for imaginary configurations.

Yes, and all the world is VAX (var. "my benchmarking suite"), as they
used to say. :-)

 > 
 > Leonid

Nikita.


* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05 18:10 Ananiev, Leonid I
  2006-07-05 19:07 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05 18:10 UTC (permalink / raw)
  To: Bret Towe, Nikita Danilov; +Cc: Linux Kernel Mailing List


Bret Towe writes:
> if say some gtk app wants to write to disk it will freeze
> until the usb hd is completely done

The proposed patch fixes one real cause of long latency: if a user
thread writes only 1 byte to disk, it can happen that it has to write out
all pages dirtied by all threads in the system and wait for that. The
patch is tested and gives real benefit on real systems. Common system
work is performed in a common system thread, not in a casual user thread.

The patch does not fix other (bazillion - 1) fictitious freezing causes
for imaginary configurations.

Leonid
-----Original Message-----
From: Bret Towe [mailto:magnade@gmail.com] 
Sent: Wednesday, July 05, 2006 9:19 PM
To: Nikita Danilov
Cc: Ananiev, Leonid I; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

On 7/5/06, Nikita Danilov <nikita@clusterfs.com> wrote:
> Bret Towe writes:
>
> [...]
>
>  >
>  > are you sure about that? cause that sounded alot like an issue
>  > i saw with slow usb devices (mainly a usb hd on a usb 1.1
connection)
>  > the usb device would fill up with write queue and local io to say
/dev/hda
>  > would basicly stop and the system would be rather useless till the
usb
>  > hd would finish writing out what it was doing
>  > usally would take several hundred megs of data to get it to do it
>
> There may be bazillion other reasons for slow device making system
> unresponsive in various ways. More details are needed (possibly in
> separate thread).

well at this time all i know is one will be writing to the usb hd its
queue
will fill up and if say some gtk app wants to write to disk it will
freeze
until the usb hd is completely done
i will look into it at some point when i have time and get more info on
it and post a proper report on the issue unless its already been fixed

>  >
>  > ive not tried it in ages so maybe its been fixed since ive last
tried it
>  > dont recall the kernel version at the time but it wasnt more than a
>  > year ago
>  >
>
> Nikita.
>


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05 10:20     ` Nikita Danilov
@ 2006-07-05 17:19       ` Bret Towe
  0 siblings, 0 replies; 36+ messages in thread
From: Bret Towe @ 2006-07-05 17:19 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Ananiev, Leonid I, Linux Kernel Mailing List

On 7/5/06, Nikita Danilov <nikita@clusterfs.com> wrote:
> Bret Towe writes:
>
> [...]
>
>  >
>  > are you sure about that? cause that sounded alot like an issue
>  > i saw with slow usb devices (mainly a usb hd on a usb 1.1 connection)
>  > the usb device would fill up with write queue and local io to say /dev/hda
>  > would basicly stop and the system would be rather useless till the usb
>  > hd would finish writing out what it was doing
>  > usally would take several hundred megs of data to get it to do it
>
> There may be bazillion other reasons for slow device making system
> unresponsive in various ways. More details are needed (possibly in
> separate thread).

well at this time all i know is one will be writing to the usb hd its queue
will fill up and if say some gtk app wants to write to disk it will freeze
until the usb hd is completely done
i will look into it at some point when i have time and get more info on
it and post a proper report on the issue unless its already been fixed

>  >
>  > ive not tried it in ages so maybe its been fixed since ive last tried it
>  > dont recall the kernel version at the time but it wasnt more than a
>  > year ago
>  >
>
> Nikita.
>


* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05 15:48 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05 15:48 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
> What about replacing
>		pdflush_operation(background_writeout, 0);
> with
>  		if (pdflush_operation(background_writeout, 0))
>
>                 writeback_inodes(&wbc);
There is a latency between the pdflush_operation() call and the moment
writeback_in_progress(bdi) starts returning true.
As a result writeback_inodes() may be called in the next loop iteration
before pdflush has started. Pdflush then detects that
writeback_in_progress() returns true for it and will exit.
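
That window can be modelled in a few lines of plain user-space C (pthreads;
not kernel code, only an illustration of the interleaving): the flag stands
in for writeback_in_progress(bdi) and the sleeps for the latency before the
queued pdflush worker actually runs. Build with cc -pthread race.c.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int writeback_in_progress;	/* stands in for writeback_in_progress(bdi) */

/* The queued pdflush worker: only runs after some wakeup latency. */
static void *pdflush_worker(void *arg)
{
	(void)arg;
	usleep(2000);			/* latency before background_writeout() runs */
	if (atomic_load(&writeback_in_progress)) {
		puts("pdflush: writeback already in progress, exiting");
		return NULL;
	}
	atomic_store(&writeback_in_progress, 1);
	puts("pdflush: doing the background write-out");
	atomic_store(&writeback_in_progress, 0);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	/* pdflush_operation() succeeded: a worker will run "soon". */
	pthread_create(&worker, NULL, pdflush_worker, NULL);

	/*
	 * Next loop iteration in the caller: the worker has not started
	 * yet, so the flag is still clear and the caller falls back to a
	 * synchronous writeback_inodes() of its own.
	 */
	if (!atomic_load(&writeback_in_progress)) {
		atomic_store(&writeback_in_progress, 1);
		puts("caller: no writeback in progress, writing synchronously");
		usleep(5000);		/* the synchronous write-out */
		atomic_store(&writeback_in_progress, 0);
	}

	pthread_join(worker, NULL);
	return 0;
}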

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Wednesday, July 05, 2006 6:57 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > suppose you have more than MAX_PDFLUSH_THREADS
 > Do you consider that the drawback of the patch is in that the value
 > MAX_PDFLUSH_THREADS is not well known high or this limit is not deleted

I am more concerned, that this patch _limits_ maximal possible writeback
concurrency to MAX_PDFLUSH_THREADS.

 > at all? The limit could be deleted after patching because the line 

That sounds a bit too extreme, given that pdflush is used for a lot of
things other than background write-out.

 > +	if (writeback_in_progress(bdi)) {
 > keeps off creating extra pdflush threads.

What about replacing

 		pdflush_operation(background_writeout, 0);

with

 		if (pdflush_operation(background_writeout, 0))
                /*
                 * Fall back to synchronous writeback if all pdflush
                 * threads are busy.
                 */
                writeback_inodes(&wbc);

? This will combine increased concurrency in your target case (single
writer) with some safety net in the case of multiple writers and
multiple devices.

 > 
 > Leonid

Nikita.


* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05 14:28 Ananiev, Leonid I
@ 2006-07-05 14:57 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 14:57 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > suppose you have more than MAX_PDFLUSH_THREADS
 > Do you consider that the drawback of the patch is in that the value
 > MAX_PDFLUSH_THREADS is not well known high or this limit is not deleted

I am more concerned, that this patch _limits_ maximal possible writeback
concurrency to MAX_PDFLUSH_THREADS.

 > at all? The limit could be deleted after patching because the line 

That sounds a bit too extreme, given that pdflush is used for a lot of
things other than background write-out.

 > +	if (writeback_in_progress(bdi)) {
 > keeps off creating extra pdflush threads.

What about replacing

 		pdflush_operation(background_writeout, 0);

with

 		if (pdflush_operation(background_writeout, 0))
                /*
                 * Fall back to synchronous writeback if all pdflush
                 * threads are busy.
                 */
                writeback_inodes(&wbc);

? This will combine increased concurrency in your target case (single
writer) with some safety net in the case of multiple writers and
multiple devices.

 > 
 > Leonid

Nikita.


* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05 14:28 Ananiev, Leonid I
  2006-07-05 14:57 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05 14:28 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
> suppose you have more than MAX_PDFLUSH_THREADS
Do you consider the drawback of the patch to be that the value
MAX_PDFLUSH_THREADS is not known to be high enough, or that this limit is
not deleted at all? The limit could be deleted after patching, because the
line
+	if (writeback_in_progress(bdi)) {
keeps extra pdflush threads from being created.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Wednesday, July 05, 2006 2:17 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > 
 > 
 > Nikita Danilov writes:
 > > Doing large amounts of writeback from pdflush threads makes
situation
 > > only worse: suppose you have more than MAX_PDFLUSH_THREADS devices
on
 > > the system, and large number of writing threads. If some devices
 > become
 > > congested, then *all* pdflush threads may easily stuck waiting on
 > queue
 > > congestion and there will be no IO going on against other devices.
 > This
 > > would be especially bad, if system is a mix of slow and fast
devices.
 > 
 > *all* pdflush threads may NOT waiting on single queue because
function

I specifically mentioned that multiple deviceS become congested: each
pdflush thread is stuck on its own congested device, while the rest of the
devices are idle.

 > balance_dirty_pages() tests it:
 > 
 > 	if (writeback_in_progress(bdi))
 > 		return;		/* pdflush is already working this queue
 > */
 > 
 > > Yes, that was silly proposal. I think your patch contains very
useful
 > > idea, but it cannot be applied to all file systems. Maybe
 > > wait-for-pdflush can be made optional, depending on the file system
 > > type?
 > 
 > I suppose MS DOS was the last operating system which had considered
 > that parallelism is not applicable.

It also was the last operating system that supported only one type of file
system, with assumptions about file system behaviour hard-coded into its
design. :-)

 > 
 > Leonid

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-04 22:24   ` Bret Towe
@ 2006-07-05 10:20     ` Nikita Danilov
  2006-07-05 17:19       ` Bret Towe
  0 siblings, 1 reply; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 10:20 UTC (permalink / raw)
  To: Bret Towe; +Cc: Ananiev, Leonid I, Linux Kernel Mailing List

Bret Towe writes:

[...]

 > 
 > are you sure about that? cause that sounded alot like an issue
 > i saw with slow usb devices (mainly a usb hd on a usb 1.1 connection)
 > the usb device would fill up with write queue and local io to say /dev/hda
 > would basicly stop and the system would be rather useless till the usb
 > hd would finish writing out what it was doing
 > usally would take several hundred megs of data to get it to do it

There may be bazillion other reasons for slow device making system
unresponsive in various ways. More details are needed (possibly in
separate thread).

 > 
 > ive not tried it in ages so maybe its been fixed since ive last tried it
 > dont recall the kernel version at the time but it wasnt more than a
 > year ago
 > 

Nikita.


* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-05  5:40 Ananiev, Leonid I
@ 2006-07-05 10:17 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-05 10:17 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > 
 > 
 > Nikita Danilov writes:
 > > Doing large amounts of writeback from pdflush threads makes situation
 > > only worse: suppose you have more than MAX_PDFLUSH_THREADS devices on
 > > the system, and large number of writing threads. If some devices
 > become
 > > congested, then *all* pdflush threads may easily stuck waiting on
 > queue
 > > congestion and there will be no IO going on against other devices.
 > This
 > > would be especially bad, if system is a mix of slow and fast devices.
 > 
 > *all* pdflush threads may NOT waiting on single queue because function

I specifically mentioned that multiple deviceS become congested: each
pdflush thread is stuck on its own congested device, while the rest of the
devices are idle.

 > balance_dirty_pages() tests it:
 > 
 > 	if (writeback_in_progress(bdi))
 > 		return;		/* pdflush is already working this queue
 > */
 > 
 > > Yes, that was silly proposal. I think your patch contains very useful
 > > idea, but it cannot be applied to all file systems. Maybe
 > > wait-for-pdflush can be made optional, depending on the file system
 > > type?
 > 
 > I suppose MS DOS was the last operating system which had considered
 > that parallelism is not applicable.

It also was the last operating system that supported only one type of file
system, with assumptions about file system behaviour hard-coded into its
design. :-)

 > 
 > Leonid

Nikita.


* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-05  5:40 Ananiev, Leonid I
  2006-07-05 10:17 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-05  5:40 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List



Nikita Danilov writes:
> Doing large amounts of writeback from pdflush threads makes situation
> only worse: suppose you have more than MAX_PDFLUSH_THREADS devices on
> the system, and large number of writing threads. If some devices become
> congested, then *all* pdflush threads may easily stuck waiting on queue
> congestion and there will be no IO going on against other devices. This
> would be especially bad, if system is a mix of slow and fast devices.

*all* pdflush threads may NOT be waiting on a single queue, because
balance_dirty_pages() tests for it:

	if (writeback_in_progress(bdi))
		return;		/* pdflush is already working this queue
*/

> Yes, that was silly proposal. I think your patch contains very useful
> idea, but it cannot be applied to all file systems. Maybe
> wait-for-pdflush can be made optional, depending on the file system
> type?

I suppose MS DOS was the last operating system which had considered
that parallelism is not applicable.

Leonid
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Wednesday, July 05, 2006 12:21 AM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > When queue is congested---it is, because meta-data (indirect blocks
in
 > > ext[23] case) have to be read in synchronously before page can be
 > paged
 > > out (see comment in mm/vmscan.c:pageout()).
 > 
 > Actually a queue is congested ---it is, because the queue is too long
or
 > bit BDI_write[read]_congested is set.
 > 
 > > But much more importantly: when direct reclaim skips writing dirty
 > pages
 > > from tail of the inactive list,
 > 
 > The  direct reclaim does not skip any page in pdflush thread because
 > may_write_to_queue() returns true
 > if (current->flags & PF_SWAPWRITE) and: __pdflush() sets this flag:
 > See pfflush.c: __pdflush() first line
 > current->flags |= PF_FLUSHER | PF_SWAPWRITE;

Hm.. indeed it is. But this is quite strange. This means, that if some
device queues are congested, pdflush threads will be stuck waiting for
these queues to drain, and as there is only limited number of pdflush
threads in the system, write-out to the non-congested devices will not
progress too.

Doing large amounts of writeback from pdflush threads makes situation
only worse: suppose you have more than MAX_PDFLUSH_THREADS devices on
the system, and large number of writing threads. If some devices become
congested, then *all* pdflush threads may easily stuck waiting on queue
congestion and there will be no IO going on against other devices. This
would be especially bad, if system is a mix of slow and fast devices.

In the original code, threads writing into fast devices are not impacted
by congestion of slow devices.

You can deal with that particular situation in your patch by checking
return value of

        pdflush_operation(background_writeout, 0);

and falling back to synchronous write-back if it fails to find worker
thread.

 > 
 > > Wouldn't this interfere with current->backing_dev_info logic?
 > > Maybe pdflush threads should set this field too?
 > It is not need to set current->backing_dev_info for pdflush because

Yes, that was silly proposal. I think your patch contains very useful
idea, but it cannot be applied to all file systems. Maybe
wait-for-pdflush can be made optional, depending on the file system
type?

 > PF_SWAPWRITE is set for pdflush.
 > The proposed patch does not concern of backing_dev_info logic.
 > 
 > Leonid 

Nikita.


* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-07-04 20:20 ` Nikita Danilov
@ 2006-07-04 22:24   ` Bret Towe
  2006-07-05 10:20     ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Bret Towe @ 2006-07-04 22:24 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Ananiev, Leonid I, Linux Kernel Mailing List

On 7/4/06, Nikita Danilov <nikita@clusterfs.com> wrote:
> Ananiev, Leonid I writes:
>  > Nikita Danilov writes:
>  > > When queue is congested---it is, because meta-data (indirect blocks in
>  > > ext[23] case) have to be read in synchronously before page can be
>  > paged
>  > > out (see comment in mm/vmscan.c:pageout()).
>  >
>  > Actually a queue is congested ---it is, because the queue is too long or
>  > bit BDI_write[read]_congested is set.
>  >
>  > > But much more importantly: when direct reclaim skips writing dirty
>  > pages
>  > > from tail of the inactive list,
>  >
>  > The  direct reclaim does not skip any page in pdflush thread because
>  > may_write_to_queue() returns true
>  > if (current->flags & PF_SWAPWRITE) and: __pdflush() sets this flag:
>  > See pfflush.c: __pdflush() first line
>  > current->flags |= PF_FLUSHER | PF_SWAPWRITE;
>
> Hm.. indeed it is. But this is quite strange. This means, that if some
> device queues are congested, pdflush threads will be stuck waiting for
> these queues to drain, and as there is only limited number of pdflush
> threads in the system, write-out to the non-congested devices will not
> progress too.
>
> Doing large amounts of writeback from pdflush threads makes situation
> only worse: suppose you have more than MAX_PDFLUSH_THREADS devices on
> the system, and large number of writing threads. If some devices become
> congested, then *all* pdflush threads may easily stuck waiting on queue
> congestion and there will be no IO going on against other devices. This
> would be especially bad, if system is a mix of slow and fast devices.
>
> In the original code, threads writing into fast devices are not impacted
> by congestion of slow devices.

are you sure about that? cause that sounded alot like an issue
i saw with slow usb devices (mainly a usb hd on a usb 1.1 connection)
the usb device would fill up with write queue and local io to say /dev/hda
would basicly stop and the system would be rather useless till the usb
hd would finish writing out what it was doing
usally would take several hundred megs of data to get it to do it

ive not tried it in ages so maybe its been fixed since ive last tried it
dont recall the kernel version at the time but it wasnt more than a
year ago

> You can deal with that particular situation in your patch by checking
> return value of
>
>         pdflush_operation(background_writeout, 0);
>
> and falling back to synchronous write-back if it fails to find worker
> thread.
>
>  >
>  > > Wouldn't this interfere with current->backing_dev_info logic?
>  > > Maybe pdflush threads should set this field too?
>  > It is not need to set current->backing_dev_info for pdflush because
>
> Yes, that was silly proposal. I think your patch contains very useful
> idea, but it cannot be applied to all file systems. Maybe
> wait-for-pdflush can be made optional, depending on the file system
> type?
>
>  > PF_SWAPWRITE is set for pdflush.
>  > The proposed patch does not concern of backing_dev_info logic.
>  >
>  > Leonid
>
> Nikita.
>


* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-04 19:12 Ananiev, Leonid I
@ 2006-07-04 20:20 ` Nikita Danilov
  2006-07-04 22:24   ` Bret Towe
  0 siblings, 1 reply; 36+ messages in thread
From: Nikita Danilov @ 2006-07-04 20:20 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > When queue is congested---it is, because meta-data (indirect blocks in
 > > ext[23] case) have to be read in synchronously before page can be
 > paged
 > > out (see comment in mm/vmscan.c:pageout()).
 > 
 > Actually a queue is congested ---it is, because the queue is too long or
 > bit BDI_write[read]_congested is set.
 > 
 > > But much more importantly: when direct reclaim skips writing dirty
 > pages
 > > from tail of the inactive list,
 > 
 > The  direct reclaim does not skip any page in pdflush thread because
 > may_write_to_queue() returns true
 > if (current->flags & PF_SWAPWRITE) and: __pdflush() sets this flag:
 > See pfflush.c: __pdflush() first line
 > current->flags |= PF_FLUSHER | PF_SWAPWRITE;

Hm.. indeed it is. But this is quite strange. This means, that if some
device queues are congested, pdflush threads will be stuck waiting for
these queues to drain, and as there is only limited number of pdflush
threads in the system, write-out to the non-congested devices will not
progress too.

Doing large amounts of writeback from pdflush threads makes situation
only worse: suppose you have more than MAX_PDFLUSH_THREADS devices on
the system, and large number of writing threads. If some devices become
congested, then *all* pdflush threads may easily get stuck waiting on queue
congestion and there will be no IO going on against other devices. This
would be especially bad, if system is a mix of slow and fast devices.

In the original code, threads writing into fast devices are not impacted
by congestion of slow devices.

You can deal with that particular situation in your patch by checking
return value of

        pdflush_operation(background_writeout, 0);

and falling back to synchronous write-back if it fails to find worker
thread.

 > 
 > > Wouldn't this interfere with current->backing_dev_info logic?
 > > Maybe pdflush threads should set this field too?
 > It is not need to set current->backing_dev_info for pdflush because

Yes, that was silly proposal. I think your patch contains very useful
idea, but it cannot be applied to all file systems. Maybe
wait-for-pdflush can be made optional, depending on the file system
type?
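
One conceivable shape for that (purely hypothetical; no such flag exists and
nothing like it was posted in this thread) is a per-filesystem opt-in bit,
so that file systems which can tolerate it delegate the balancing to pdflush
while the rest keep the current synchronous throttling:

	#define MS_BALANCE_VIA_PDFLUSH	(1 << 25)	/* invented bit, illustration only */

	/* in balance_dirty_pages(), with sb = mapping->host->i_sb */
	if (sb->s_flags & MS_BALANCE_VIA_PDFLUSH)
		pdflush_operation(background_writeout, 0);
	else
		writeback_inodes(&wbc);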

 > PF_SWAPWRITE is set for pdflush.
 > The proposed patch does not concern of backing_dev_info logic.
 > 
 > Leonid 

Nikita.


* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04 19:12 Ananiev, Leonid I
  2006-07-04 20:20 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-04 19:12 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
> When queue is congested---it is, because meta-data (indirect blocks in
> ext[23] case) have to be read in synchronously before page can be
paged
> out (see comment in mm/vmscan.c:pageout()).

Actually a queue is congested---it is, because the queue is too long or
the BDI_write[read]_congested bit is set.

> But much more importantly: when direct reclaim skips writing dirty
pages
> from tail of the inactive list,

Direct reclaim does not skip any page in a pdflush thread, because
may_write_to_queue() returns true if (current->flags & PF_SWAPWRITE),
and __pdflush() sets this flag; see the first line of __pdflush() in
pdflush.c:
current->flags |= PF_FLUSHER | PF_SWAPWRITE;
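
For reference, the helper being discussed lived in mm/vmscan.c and in
2.6-era kernels looked approximately like this (reproduced from memory, so
treat the exact body as approximate); the PF_SWAPWRITE test is the first
early return, which is why a pdflush thread is never refused:

static int may_write_to_queue(struct backing_dev_info *bdi)
{
	if (current->flags & PF_SWAPWRITE)
		return 1;
	if (!bdi_write_congested(bdi))
		return 1;
	if (bdi == current->backing_dev_info)
		return 1;
	return 0;
}

The last test is also where current->backing_dev_info enters the picture,
which is what the next quoted question is about.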

> Wouldn't this interfere with current->backing_dev_info logic?
> Maybe pdflush threads should set this field too?
There is no need to set current->backing_dev_info for pdflush because
PF_SWAPWRITE is set for pdflush.
The proposed patch does not touch the backing_dev_info logic.

Leonid 

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Tuesday, July 04, 2006 5:12 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > 
 > > With your patch, this work is done from
 > > pdflush, and won't be throttled by may_write_to_queue() check, thus
 > > increasing a risk of allocation failure.
 > ....
 > After Nikita Danilov agrees that
 > > pdflush is throttled through blk_congestion_wait(), but it is not
 > > throttled by writing dirty from the tail of inactive list
 > 
 > The 'writing dirty from the tail of inactive list' is asynchronous
 > writing and it is not applicable for throttling.

When queue is congested---it is, because meta-data (indirect blocks in
ext[23] case) have to be read in synchronously before page can be paged
out (see comment in mm/vmscan.c:pageout()).

But much more importantly: when direct reclaim skips writing dirty pages
from tail of the inactive list, it instead moves these pages to the head
of this list, and reclaims clean pages instead. These clean pages are
"hotter" than skipped dirty pages, and as has been checked many times
already, this is bad, because doing reclaim in LRU order is important.

 > 
 > Leonid

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
       [not found] <B41635854730A14CA71C92B36EC22AAC0541F6@mssmsx411>
@ 2006-07-04 13:12 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-04 13:12 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > 
 > > With your patch, this work is done from
 > > pdflush, and won't be throttled by may_write_to_queue() check, thus
 > > increasing a risk of allocation failure.
 > ....
 > After Nikita Danilov agrees that
 > > pdflush is throttled through blk_congestion_wait(), but it is not
 > > throttled by writing dirty from the tail of inactive list
 > 
 > The 'writing dirty from the tail of inactive list' is asynchronous
 > writing and it is not applicable for throttling.

When queue is congested---it is, because meta-data (indirect blocks in
ext[23] case) have to be read in synchronously before page can be paged
out (see comment in mm/vmscan.c:pageout()).

But much more importantly: when direct reclaim skips writing dirty pages
from tail of the inactive list, it instead moves these pages to the head
of this list, and reclaims clean pages instead. These clean pages are
"hotter" than skipped dirty pages, and as has been checked many times
already, this is bad, because doing reclaim in LRU order is important.
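
Roughly, the mm/vmscan.c logic being described (heavily abridged, not
verbatim source):

	if (PageDirty(page)) {
		/* pageout() returns PAGE_KEEP when, among other things,
		 * may_write_to_queue() refuses the congested queue. */
		switch (pageout(page, mapping)) {
		case PAGE_KEEP:
			goto keep_locked;  /* skipped: page goes back on the list */
		case PAGE_ACTIVATE:
			goto activate_locked;
		case PAGE_SUCCESS:
			goto keep;         /* write-back is now in flight */
		case PAGE_CLEAN:
			break;             /* clean now, try to free it below */
		}
	}

The skipped dirty page is kept while a clean, hotter page is freed
instead, which is the LRU inversion described above.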

 > 
 > Leonid

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04 13:07 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-04 13:07 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:

> With your patch, this work is done from
> pdflush, and won't be throttled by may_write_to_queue() check, thus
> increasing a risk of allocation failure.
....
After Nikita Danilov agrees that
> pdflush is throttled through blk_congestion_wait(), but it is not
> throttled by writing dirty from the tail of inactive list

The 'writing dirty from the tail of inactive list' is asynchronous
writing and it is not applicable for throttling.

Leonid 
 

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Tuesday, July 04, 2006 3:56 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > >> The pdflush thread functions as before the patch. pdflush tends to
 > >> make pages un-dirty without overloading memory or IO, and there is
 > >> no need to let pdflush
 > 
 > > This assumption is valid for ext2
 > 
 > The assumption that pdflush should make pages un-dirty without
 > overloading memory or IO is not about ext2; it is the sense of
 > pdflush. I'm working with

I am not sure what "sense" is being referred to. Some file systems do
allocate a lot of memory in ->writepages().

ext3 is still in the same ball-park as ext2.

 > ext3, which does a lot of work in writepages(). pdflush is throttled:
 > once vmscan has selected 32 pages for page-out it calls
 > blk_congestion_wait(), regardless of whether it put any of those 32
 > pages into a congested queue. So pdflush is throttled.

pdflush is throttled through blk_congestion_wait(), but it is not
throttled by writing dirty pages from the tail of the inactive list
while scanning for memory. This destroys LRU ordering.

 > 
 > Leonid
 >  

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-04 11:43 Ananiev, Leonid I
@ 2006-07-04 11:56 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-04 11:56 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > >> The pdflush thread functions as before the patch. pdflush tends to
 > >> make pages un-dirty without overloading memory or IO, and there is
 > >> no need to let pdflush
 > 
 > > This assumption is valid for ext2
 > 
 > The assumption that pdflush should make pages un-dirty without
 > overloading memory or IO is not about ext2; it is the sense of
 > pdflush. I'm working with

I am not sure what "sense" is being referred to. Some file systems do
allocate a lot of memory in ->writepages().

ext3 is still in the same ball-park as ext2.

 > ext3, which does a lot of work in writepages(). pdflush is throttled:
 > once vmscan has selected 32 pages for page-out it calls
 > blk_congestion_wait(), regardless of whether it put any of those 32
 > pages into a congested queue. So pdflush is throttled.

pdflush is throttled through blk_congestion_wait(), but it is not
throttled by writing dirty pages from the tail of the inactive list
while scanning for memory. This destroys LRU ordering.

 > 
 > Leonid
 >  

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04 11:43 Ananiev, Leonid I
  2006-07-04 11:56 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-04 11:43 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
>> The pdflush thread functions as before the patch. pdflush tends to
>> make pages un-dirty without overloading memory or IO, and there is no
>> need to let pdflush

> This assumption is valid for ext2

The assumption that pdflush should make pages un-dirty without
overloading memory or IO is not about ext2; it is the sense of pdflush.
I'm working with ext3, which does a lot of work in writepages().
pdflush is throttled: once vmscan has selected 32 pages for page-out it
calls blk_congestion_wait(), regardless of whether it put any of those
32 pages into a congested queue. So pdflush is throttled.

Leonid
 

-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Tuesday, July 04, 2006 1:55 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > performs page-out even if queue is congested.
 > 	Yes. If a user thread which generates dirty pages needs
 > reclaimed memory, it considers its own dirty pages as candidates for
 > page-out. It functions as before the patch.
 > 
 > > Intent of this is to throttle writers.
 > I suppose you mean the dirtier, i.e. the write(2) caller, not the
 > writepage() caller. The dirtier is throttled by the backing_dev_info
 > logic as before the patch.

I meant ->writepages() used by balance_dirty_pages(), see below.

 > 
 > 	While a pdflush thread selects pages for page-out it does not
 > consider a page as a candidate if it would be written to a congested
 > queue. The pdflush thread functions as before the patch. pdflush
 > tends to make pages un-dirty without overloading memory or IO, and
 > there is no need to let pdflush

This assumption is valid for ext2, where ->writepages() simply sends
pages to the storage, but other file systems (like reiser4) do a *lot*
of work in ->writepages() path, allocating quite an amount of memory
before starting write-out. With your patch, this work is done from
pdflush, and won't be throttled by may_write_to_queue() check, thus
increasing a risk of allocation failure.

 > do page-out to a congested queue as you have proposed.
 > 	
 > Leonid

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04  9:55 Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-04  9:55 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > Nikita Danilov writes:
 > > performs page-out even if queue is congested.
 > 	Yes. If a user thread which generates dirty pages needs
 > reclaimed memory, it considers its own dirty pages as candidates for
 > page-out. It functions as before the patch.
 > 
 > > Intent of this is to throttle writers.
 > I suppose you mean the dirtier, i.e. the write(2) caller, not the
 > writepage() caller. The dirtier is throttled by the backing_dev_info
 > logic as before the patch.

I meant ->writepages() used by balance_dirty_pages(), see below.

 > 
 > 	While a pdflush thread selects pages for page-out it does not
 > consider a page as a candidate if it would be written to a congested
 > queue. The pdflush thread functions as before the patch. pdflush
 > tends to make pages un-dirty without overloading memory or IO, and
 > there is no need to let pdflush

This assumption is valid for ext2, where ->writepages() simply sends
pages to the storage, but other file systems (like reiser4) do a *lot*
of work in ->writepages() path, allocating quite an amount of memory
before starting write-out. With your patch, this work is done from
pdflush, and won't be throttled by may_write_to_queue() check, thus
increasing a risk of allocation failure.

 > do page-out to a congested queue as you have proposed.
 > 	
 > Leonid

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04  9:44 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-04  9:44 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov writes:
> performs page-out even if queue is congested.
	Yes. If a user thread which generates dirty pages needs reclaimed
memory, it considers its own dirty pages as candidates for page-out. It
functions as before the patch.

> Intent of this is to throttle writers.
I suppose you mean the dirtier, i.e. the write(2) caller, not the
writepage() caller. The dirtier is throttled by the backing_dev_info
logic as before the patch.

	While a pdflush thread selects pages for page-out it does not
consider a page as a candidate if it would be written to a congested
queue. The pdflush thread functions as before the patch. pdflush tends
to make pages un-dirty without overloading memory or IO, and there is
no need to let pdflush do page-out to a congested queue as you have
proposed.
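
This is visible in the writeback_control that background_writeout()
builds (also present as unchanged context in the patch below); the
nonblocking flag makes the write-back skip congested queues rather
than block on them:

	struct writeback_control wbc = {
		.bdi		= NULL,		/* all queues */
		.sync_mode	= WB_SYNC_NONE,
		.older_than_this = NULL,
		.nonblocking	= 1,		/* skip congested queues */
	};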
	
Leonid
	
-----Original Message-----
From: Nikita Danilov [mailto:nikita@clusterfs.com] 
Sent: Tuesday, July 04, 2006 12:33 PM
To: Ananiev, Leonid I
Cc: Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely


[Please, don't trim CC/To fields: LKML is too high traffic to read in
its entirety.]

Ananiev, Leonid I writes:
 >  Nikita Danilov writes:
 > > Wouldn't this interfere with current->backing_dev_info logic?
 > 
 > The proposed patch does not modify that logic.

Indeed, it *interferes* with it: in the original code, process doing
direct reclaim during balance_dirty_pages()

 
generic_file_write()->balance_dirty_pages()->...->__alloc_pages()->...->pageout()

performs page-out even if queue is congested. Intent of this is to
throttle writers, and reduce risk of running oom (certain file systems,
especially ones with delayed allocation, tend to allocate a lot of
memory in balance_dirty_pages()->writepages() paths).

Your patch breaks this mechanism.

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-04  8:35 Ananiev, Leonid I
  0 siblings, 0 replies; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-04  8:35 UTC (permalink / raw)
  To: Antonio Vargas, Nikita Danilov, Linux Kernel Mailing List

Antonio Vargas writes:
>  Maybe we should keep the sync-write logic if there is only one online
> cpu, and thus avoiding extra context switches when they are not
> profitable?

Parallelism makes sense even with 1 CPU and 1 user task, because of IO.
On the other hand, if the user thread actually does the inode
write-back itself, it may wait on many events (fs, journal, IO queue)
and incur context switches.
	The results of 3-4 repeated runs of '/usr/bin/time -f "%c"
iozone -i 0 -r 4 -s 1200m' on a Pentium 4 HT with 1 GB RAM show that
the patch is useful for 1 CPU as well as for 2:

                       old_1cpu    new_1cpu    old_2cpu    new_2cpu
/usr/bin/time %c      1932-3400   1700-3003   1900-2728   2014-2700
'vmstat 1' (cs/sec)     506-621     693-753     708-715     679-752
throughput (MB/sec)       54-58       71-94       53-59      74-105

Leonid

-----Original Message-----
From: Antonio Vargas [mailto:windenntw@gmail.com] 
Sent: Monday, July 03, 2006 9:32 PM
To: Nikita Danilov; Ananiev, Leonid I; Linux Kernel Mailing List
Subject: Re: [PATCH] mm: moving dirty pages balancing to pdfludh
entirely

On 6/28/06, Nikita Danilov <nikita@clusterfs.com> wrote:
> Ananiev, Leonid I writes:
>  > >From Leonid Ananiev
>
> Hello,
>
>  >
 >  > Moving dirty-page balancing from the user thread entirely to the
 >  > kernel thread pdflush reduces extra-long write(2) latencies and
 >  > increases performance.
>  >
>
> [...]
>
 >  >      The IOzone and Sysbench benchmarks, for file sizes of 50% and
 >  > 120% of RAM on Pentium 4, Xeon and Itanium, have reported write and
 >  > mixed throughput increases of about 25%. The described IOzone > 0.1
 >  > sec write(2)
>
> Results are impressive.
>
> Wouldn't this interfere with current->backing_dev_info logic? This field
> is set by __generic_file_aio_write_nolock() and checked by
> is set by __generic_file_aio_write_nolock() and checked by
> may_write_to_queue() to force heavy writes to do more pageout. Maybe
> pdflush threads should set this field too?
>
 >  > latencies are eliminated. The condition writeback_in_progress() is
 >  > tested earlier now. As a result extra pdflush work items are not
 >  > created, and the increase in the number of context switches stays
 >  > within the variation limits.
>
> Nikita.

Maybe we should keep the sync-write logic if there is only one online
cpu, and thus avoiding extra context switches when they are not
profitable?

-- 
Greetz, Antonio Vargas aka winden of network

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-07-03 16:43 Ananiev, Leonid I
@ 2006-07-04  8:32 ` Nikita Danilov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikita Danilov @ 2006-07-04  8:32 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List


[Please, don't trim CC/To fields: LKML is too high traffic to read in
its entirety.]

Ananiev, Leonid I writes:
 >  Nikita Danilov writes:
 > > Wouldn't this interfere with current->backing_dev_info logic?
 > 
 > The proposed patch does not modify that logic.

Indeed, it *interferes* with it: in the original code, process doing
direct reclaim during balance_dirty_pages()

    generic_file_write()->balance_dirty_pages()->...->__alloc_pages()->...->pageout()

performs page-out even if queue is congested. Intent of this is to
throttle writers, and reduce risk of running oom (certain file systems,
especially ones with delayed allocation, tend to allocate a lot of
memory in balance_dirty_pages()->writepages() paths).

Your patch breaks this mechanism.

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm: moving dirty pages balancing to pdfludh entirely
  2006-06-28 11:33 ` Nikita Danilov
@ 2006-07-03 17:31   ` Antonio Vargas
  0 siblings, 0 replies; 36+ messages in thread
From: Antonio Vargas @ 2006-07-03 17:31 UTC (permalink / raw)
  To: Nikita Danilov, Ananiev, Leonid I, Linux Kernel Mailing List

On 6/28/06, Nikita Danilov <nikita@clusterfs.com> wrote:
> Ananiev, Leonid I writes:
>  > >From Leonid Ananiev
>
> Hello,
>
>  >
 >  > Moving dirty-page balancing from the user thread entirely to the
 >  > kernel thread pdflush reduces extra-long write(2) latencies and
 >  > increases performance.
>  >
>
> [...]
>
 >  >      The IOzone and Sysbench benchmarks, for file sizes of 50% and
 >  > 120% of RAM on Pentium 4, Xeon and Itanium, have reported write and
 >  > mixed throughput increases of about 25%. The described IOzone > 0.1
 >  > sec write(2)
>
> Results are impressive.
>
> Wouldn't this interfere with current->backing_dev_info logic? This field
> is set by __generic_file_aio_write_nolock() and checked by
> may_write_to_queue() to force heavy writes to do more pageout. Maybe
> pdflush threads should set this field too?
>
 >  > latencies are eliminated. The condition writeback_in_progress() is
 >  > tested earlier now. As a result extra pdflush work items are not
 >  > created, and the increase in the number of context switches stays
 >  > within the variation limits.
>
> Nikita.

Maybe we should keep the sync-write logic if there is only one online
cpu, and thus avoiding extra context switches when they are not
profitable?
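
A sketch of that idea (hypothetical, not part of the posted patch):

	/* With a single online CPU the hand-off to pdflush mostly adds
	 * context switches, so keep the old synchronous path there. */
	if (num_online_cpus() == 1)
		writeback_inodes(&wbc);				/* old behaviour */
	else
		pdflush_operation(background_writeout, 0);	/* new behaviour */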

-- 
Greetz, Antonio Vargas aka winden of network

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-07-03 16:43 Ananiev, Leonid I
  2006-07-04  8:32 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-07-03 16:43 UTC (permalink / raw)
  To: linux-kernel

 Nikita Danilov writes:
> Wouldn't this interfere with current->backing_dev_info logic?

The proposed patch does not modify that logic.

> Maybe pdflush threads should set this field too?

If pdflush set current->backing_dev_info, that would mean ignoring the
congested state of the device. It would not help pdflush get memory
reclaimed, but would increase congestion, and pdflush would become
dependent on a swap-out process that uses the congested queue. The
queue may be congested precisely because of the memory it consumes;
another page could be a better candidate for memory reclamation.

So, pdflush should not set this field.

Leonid

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
  2006-06-27  5:31 Ananiev, Leonid I
@ 2006-06-28 11:33 ` Nikita Danilov
  2006-07-03 17:31   ` Antonio Vargas
  0 siblings, 1 reply; 36+ messages in thread
From: Nikita Danilov @ 2006-06-28 11:33 UTC (permalink / raw)
  To: Ananiev, Leonid I; +Cc: Linux Kernel Mailing List

Ananiev, Leonid I writes:
 > >From Leonid Ananiev

Hello,

 > 
 > Moving dirty-page balancing from the user thread entirely to the
 > kernel thread pdflush reduces extra-long write(2) latencies and
 > increases performance.
 > 

[...]

 > 	The IOzone and Sysbench benchmarks, for file sizes of 50% and 120%
 > of RAM on Pentium 4, Xeon and Itanium, have reported write and mixed
 > throughput increases of about 25%. The described IOzone > 0.1 sec write(2)

Results are impressive.

Wouldn't this interfere with current->backing_dev_info logic? This field
is set by __generic_file_aio_write_nolock() and checked by
may_write_to_queue() to force heavy writes to do more pageout. Maybe
pdflush threads should set this field too?
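
For reference, the "set" side looks roughly like this in mm/filemap.c
(paraphrased, not verbatim):

	/* __generic_file_aio_write_nolock(): advertise the queue we are
	 * writing to, so page reclaim done from this process may write
	 * back to it even when congested (see may_write_to_queue()). */
	current->backing_dev_info = mapping->backing_dev_info;
	/* ... perform the buffered or direct write ... */
	current->backing_dev_info = NULL;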

 > latencies are eliminated. The condition writeback_in_progress() is
 > tested earlier now. As a result extra pdflush work items are not
 > created, and the increase in the number of context switches stays
 > within the variation limits.

Nikita.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH]  mm: moving dirty pages balancing to pdfludh entirely
@ 2006-06-27  5:31 Ananiev, Leonid I
  2006-06-28 11:33 ` Nikita Danilov
  0 siblings, 1 reply; 36+ messages in thread
From: Ananiev, Leonid I @ 2006-06-27  5:31 UTC (permalink / raw)
  To: linux-kernel

From Leonid Ananiev

Moving dirty-page balancing from the user thread entirely to the kernel
thread pdflush reduces extra-long write(2) latencies and increases
performance.

Signed-off-by: Leonid Ananiev <leonid.i.ananiev@intel.com>

	The time to write a file block with write(2) is sometimes very
long. The IOzone benchmark for a file size of 50% of RAM reports
per-record write latencies of up to 0.4 sec. See the sorted column 2
(usec) of 'iozone -Q' output:
$ sort -nk2 wol.dat | tail
    890676      43950       4096
     77040     292806       4096
    407812     346254       4096
   1015436     346382       4096
    632940     368620       4096
    278944     368910       4096
    890672     369545       4096
    761808     376660       4096
    152124     383555       4096
   1144300     400145       4096
Investigation shows that the long write(2) times are a result of
balance_dirty_pages() running. If any of the threads on the current CPU
has written ratelimit_pages and the dirty ratio is reached, the current
thread has to find and write all extra dirty pages in the whole system,
then has to wait for the IO to finish, and repeat until the dirty ratio
is OK. pdflush was woken up only after this balancing in the user
thread.
The proposed patch wakes up pdflush at the low dirty-ratio level and
the task continues. If the high dirty ratio is reached, the user task
waits for the IO to finish, so the writing process is still throttled.
As a result, dirty page write-back preparation, inode scanning and
journaling are performed concurrently with the user task while the
dirty ratio is between the low and high limits. That is why the default
high dirty-ratio limit is set to 60%. After the patch the kernel makes
more use of multiprocessing for performance; the extra throttling of
the task is removed; task run times become more stable and predictable.
	The IOzone and Sysbench benchmarks, for file sizes of 50% and 120%
of RAM on Pentium 4, Xeon and Itanium, have reported write and mixed
throughput increases of about 25%. The described IOzone > 0.1 sec write(2)
latencies are eliminated. The condition writeback_in_progress() is
tested earlier now. As a result extra pdflush work items are not
created, and the increase in the number of context switches stays
within the variation limits.


diff -rpu a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c	2006-06-02 16:59:27.000000000 +0400
+++ b/mm/page-writeback.c	2006-06-22 22:43:17.000000000 +0400
@@ -69,7 +69,7 @@ int dirty_background_ratio = 10;
 /*
  * The generator of dirty data starts writeback at this percentage
  */
-int vm_dirty_ratio = 40;
+int vm_dirty_ratio = 60;
 
 /*
  * The interval between `kupdate'-style writebacks, in jiffies
@@ -190,56 +190,14 @@ get_dirty_limits(struct writeback_state 
 static void balance_dirty_pages(struct address_space *mapping)
 {
 	struct writeback_state wbs;
-	long nr_reclaimable;
+	long nr_dirty;
 	long background_thresh;
 	long dirty_thresh;
-	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-
-	for (;;) {
-		struct writeback_control wbc = {
-			.bdi		= bdi,
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-		};
-
-		get_dirty_limits(&wbs, &background_thresh,
-					&dirty_thresh, mapping);
-		nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
-		if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
-			break;
-
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 */
-		if (nr_reclaimable) {
-			writeback_inodes(&wbc);
-			get_dirty_limits(&wbs, &background_thresh,
-					&dirty_thresh, mapping);
-			nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
-			if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
-				break;
-			pages_written += write_chunk - wbc.nr_to_write;
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		blk_congestion_wait(WRITE, HZ/10);
-	}
-
-	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
-		dirty_exceeded = 0;
-
-	if (writeback_in_progress(bdi))
-		return;		/* pdflush is already working this queue */
+	get_dirty_limits(&wbs, &background_thresh,
+				&dirty_thresh, mapping);
+	nr_dirty = wbs.nr_dirty + wbs.nr_unstable;
 
 	/*
 	 * In laptop mode, we wait until hitting the higher threshold before
@@ -249,8 +207,13 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (writeback_in_progress(bdi)) {
+	       if (nr_dirty > dirty_thresh) {
+			dirty_exceeded = 1;
+	                blk_congestion_wait(WRITE, HZ/50);
+		}
+	} else if ((laptop_mode && (nr_dirty + wbs.nr_writeback > dirty_thresh)) ||
+	     (!laptop_mode && (nr_dirty > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -325,6 +288,9 @@ void throttle_vm_writeout(void)
 static void background_writeout(unsigned long _min_pages)
 {
 	long min_pages = _min_pages;
+	struct writeback_state wbs;
+	long background_thresh;
+	long dirty_thresh;
 	struct writeback_control wbc = {
 		.bdi		= NULL,
 		.sync_mode	= WB_SYNC_NONE,
@@ -333,12 +299,11 @@ static void background_writeout(unsigned
 		.nonblocking	= 1,
 	};
 
+	get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
+	if (min_pages == 0) {
+		min_pages = wbs.nr_dirty + wbs.nr_unstable;
+	}
 	for ( ; ; ) {
-		struct writeback_state wbs;
-		long background_thresh;
-		long dirty_thresh;
-
-		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
 		if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
 				&& min_pages <= 0)
 			break;
@@ -347,13 +312,12 @@ static void background_writeout(unsigned
 		wbc.pages_skipped = 0;
 		writeback_inodes(&wbc);
 		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			/* Wrote less than expected */
+		if (wbc.nr_to_write > 0 || wbc.encountered_congestion) {
 			blk_congestion_wait(WRITE, HZ/10);
-			if (!wbc.encountered_congestion)
-				break;
 		}
+		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
 	}
+	dirty_exceeded = 0;
 }
 
 /*
@@ -361,14 +325,8 @@ static void background_writeout(unsigned
 * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush threads were busy.
  */
-int wakeup_pdflush(long nr_pages)
+int inline wakeup_pdflush(long nr_pages)
 {
-	if (nr_pages == 0) {
-		struct writeback_state wbs;
-
-		get_writeback_state(&wbs);
-		nr_pages = wbs.nr_dirty + wbs.nr_unstable;
-	}
 	return pdflush_operation(background_writeout, nr_pages);
 }
 

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2006-07-06 21:08 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-05 19:28 [PATCH] mm: moving dirty pages balancing to pdfludh entirely Ananiev, Leonid I
2006-07-05 20:14 ` Nikita Danilov
  -- strict thread matches above, loose matches on Subject: below --
2006-07-06 21:07 Ananiev, Leonid I
2006-07-06 16:33 Ananiev, Leonid I
2006-07-06 17:37 ` Nikita Danilov
2006-07-06 13:54 Ananiev, Leonid I
2006-07-06 15:16 ` Nikita Danilov
2006-07-06 11:47 Ananiev, Leonid I
2006-07-06  4:30 Ananiev, Leonid I
2006-07-06  6:34 ` Nikita Danilov
2006-07-05 19:50 Ananiev, Leonid I
2006-07-05 20:05 ` Nikita Danilov
2006-07-05 18:10 Ananiev, Leonid I
2006-07-05 19:07 ` Nikita Danilov
2006-07-05 15:48 Ananiev, Leonid I
2006-07-05 14:28 Ananiev, Leonid I
2006-07-05 14:57 ` Nikita Danilov
2006-07-05  5:40 Ananiev, Leonid I
2006-07-05 10:17 ` Nikita Danilov
2006-07-04 19:12 Ananiev, Leonid I
2006-07-04 20:20 ` Nikita Danilov
2006-07-04 22:24   ` Bret Towe
2006-07-05 10:20     ` Nikita Danilov
2006-07-05 17:19       ` Bret Towe
     [not found] <B41635854730A14CA71C92B36EC22AAC0541F6@mssmsx411>
2006-07-04 13:12 ` Nikita Danilov
2006-07-04 13:07 Ananiev, Leonid I
2006-07-04 11:43 Ananiev, Leonid I
2006-07-04 11:56 ` Nikita Danilov
2006-07-04  9:55 Nikita Danilov
2006-07-04  9:44 Ananiev, Leonid I
2006-07-04  8:35 Ananiev, Leonid I
2006-07-03 16:43 Ananiev, Leonid I
2006-07-04  8:32 ` Nikita Danilov
2006-06-27  5:31 Ananiev, Leonid I
2006-06-28 11:33 ` Nikita Danilov
2006-07-03 17:31   ` Antonio Vargas
