* what is the point of nr_pages information for the flusher thread?
From: Christoph Hellwig @ 2010-07-07 23:16 UTC
  To: fengguang.wu, mel, akpm, npiggin; +Cc: linux-fsdevel, linux-mm

Currently there are three possible values we pass into the flusher thread
for the nr_pages argument:

 - in sync_inodes_sb and bdi_start_background_writeback:

	LONG_MAX

 - in writeback_inodes_sb and wb_check_old_data_flush:

	global_page_state(NR_FILE_DIRTY) +
	global_page_state(NR_UNSTABLE_NFS) +
	(inodes_stat.nr_inodes - inodes_stat.nr_unused)

 - in wakeup_flusher_threads and laptop_mode_timer_fn:

	global_page_state(NR_FILE_DIRTY) +
	global_page_state(NR_UNSTABLE_NFS)

The LONG_MAX cases are trivially explained: we ignore the nr_to_write
value for data integrity writepage in the low-level writeback code, and
the for_background case in bdi_start_background_writeback has its own
check for the background threshold.  So far so good, and now it gets
interesting.

Why does writeback_inodes_sb add the number of used inodes into a value
that is in units of pages?  And why don't the other callers do this?

But seriously, how is the _global_ number of dirty and unstable pages
a good indicator for the amount of writeback per-bdi or superblock
anyway?

Somehow I'd feel much better about doing this calculation all the way
down in wb_writeback instead of the callers so we'll at least have
one documented place for these insanities.
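
One way to get there could be a tiny helper next to wb_writeback(), so the
guesswork lives in exactly one commented spot.  A rough sketch only; the
helper name is made up here and this is not existing kernel code:

/*
 * Sketch: one documented place for the "how much is dirty globally"
 * estimate.  The inode term is kept only because the current callers
 * use it; whether it belongs at all is exactly the question above.
 */
static long get_nr_dirty_pages(void)
{
	return global_page_state(NR_FILE_DIRTY) +
	       global_page_state(NR_UNSTABLE_NFS) +
	       (inodes_stat.nr_inodes - inodes_stat.nr_unused);
}

Callers could then simply ask for "everything currently dirty" and let
wb_writeback() fill in the number.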


* Re: what is the point of nr_pages information for the flusher thread?
From: Andrew Morton @ 2010-07-07 23:37 UTC
  To: Christoph Hellwig; +Cc: fengguang.wu, mel, npiggin, linux-fsdevel, linux-mm

On Wed, 7 Jul 2010 19:16:11 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> Currently there are three possible values we pass into the flusher thread
> for the nr_pages argument:

I assume you're referring to wakeup_flusher_threads().

>  - in sync_inodes_sb and bdi_start_background_writeback:
> 
> 	LONG_MAX
> 
>  - in writeback_inodes_sb and wb_check_old_data_flush:
> 
> 	global_page_state(NR_FILE_DIRTY) +
> 	global_page_state(NR_UNSTABLE_NFS) +
> 	(inodes_stat.nr_inodes - inodes_stat.nr_unused)
> 
>  - in wakeup_flusher_threads and laptop_mode_timer_fn:
> 
> 	global_page_state(NR_FILE_DIRTY) +
> 	global_page_state(NR_UNSTABLE_NFS)

There's also free_more_memory() and do_try_to_free_pages().

You'd need to do some deep git archeology to work out what the thinking
was at those two callsites.  My git machine is presently at the other
end of a slow link.

wakeup_flusher_threads() appears to have been borked.  It passes
nr_pages into *each* bdi, hence it can write back far more than it was
asked to.
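
Roughly what that looks like (paraphrased from memory, signature of the
per-bdi kick approximate, not the exact code):

void wakeup_flusher_threads(long nr_pages)
{
	struct backing_dev_info *bdi;

	if (!nr_pages)
		nr_pages = global_page_state(NR_FILE_DIRTY) +
				global_page_state(NR_UNSTABLE_NFS);

	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (!bdi_has_dirty_io(bdi))
			continue;
		/* every bdi gets the whole global estimate */
		bdi_start_writeback(bdi, nr_pages);
	}
	rcu_read_unlock();
}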

> The LONG_MAX cases are trivially explained: we ignore the nr_to_write
> value for data integrity writepage in the low-level writeback code, and
> the for_background case in bdi_start_background_writeback has its own
> check for the background threshold.  So far so good, and now it gets
> interesting.
> 
> Why does writeback_inodes_sb add the number of used inodes into a value
> that is in units of pages?  And why don't the other callers do this?

Again, git archeology is needed.  The code's been like that for some
time.  IIRC there was a bug long long ago wherein the system could have
lots of dirty inodes but zero dirty pages.  The writeback code would
say "gee, no dirty pages" and would bale out, thus failing to write the
dirty inodes.  Perhaps this hack was a "fix" for that behaviour.  Or
perhaps not.  Apparently it was so obvious that no code comment was
needed.

> But seriously, how is the _global_ number of dirty and unstable pages
> a good indicator for the amount of writeback per-bdi or superblock
> anyway?

It isn't.  This appears to have been an attempt to transport the
wakeup_pdflush() functionality into the new wakeup_flusher_threads()
regime.  Badly.
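
For comparison, the pdflush-era entry point kicked a single global
writeout pass, so handing it the global dirty count was at least
consistent there (again paraphrased from memory, not exact 2.6.30 code):

int wakeup_pdflush(long nr_pages)
{
	if (nr_pages == 0)
		nr_pages = global_page_state(NR_FILE_DIRTY) +
				global_page_state(NR_UNSTABLE_NFS);
	return pdflush_operation(background_writeout, nr_pages);
}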

> Somehow I'd feel much better about doing this calculation all the way
> down in wb_writeback instead of the callers so we'll at least have
> one documented place for these insanities.


* Re: what is the point of nr_pages information for the flusher thread?
From: Christoph Hellwig @ 2010-07-07 23:43 UTC
  To: Andrew Morton
  Cc: Christoph Hellwig, fengguang.wu, mel, npiggin, linux-fsdevel, linux-mm

On Wed, Jul 07, 2010 at 04:37:10PM -0700, Andrew Morton wrote:
> On Wed, 7 Jul 2010 19:16:11 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > Currently there's three possible values we pass into the flusher thread
> > for the nr_pages arguments:
> 
> I assume you're referring to wakeup_flusher_threads().

In that context I refer to everything using the per-bdi flusher thread.
That includes wakeup_flusher_threads() and the functions I've mentioned
below.

> There's also free_more_memory() and do_try_to_free_pages().

Indeed.  So we still have some special cases that want a specific
number to be written back globally.

> wakeup_flusher_threads() appears to have been borked.  It passes
> nr_pages into *each* bdi, hence it can write back far more than it was
> asked to.

> > But seriously, how is the _global_ number of dirty and unstable pages
> > a good indicator for the amount of writeback per-bdi or superblock
> > anyway?
> 
> It isn't.  This appears to have been an attempt to transport the
> wakeup_pdflush() functionality into the new wakeup_flusher_threads()
> regime.  Badly.

Unfortunately we don't just use it for wakeup_flusher_threads() but
also for various bits of per-bdi and per-sb writeback.



* Re: what is the point of nr_pages information for the flusher thread?
From: Andrew Morton @ 2010-07-07 23:55 UTC
  To: Christoph Hellwig; +Cc: fengguang.wu, mel, npiggin, linux-fsdevel, linux-mm

On Wed, 7 Jul 2010 19:43:16 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> > There's also free_more_memory() and do_try_to_free_pages().
> 
> Indeed.  So we still have some special cases that want a specific
> number to be written back globally.

It could be that those two callsites can be changed to NotDoThat.  I do
suggest that you dig through the git record and perhaps the email
archives to work out the thinking - that's old code.

Perhaps we could change things to write back down to the dirty limits,
but that might cause subtle breakage in low-memory situations where
dirty memory is uneven between zones, dunno.

Writing back the whole world would surely be a safe substitute, but
might be inefficient.

I doubt if a whole lot of rigorous thinking went into either one...


* Re: what is the point of nr_pages information for the flusher thread?
From: Wu Fengguang @ 2010-07-10 14:58 UTC
  To: Christoph Hellwig; +Cc: mel, akpm, npiggin, linux-fsdevel, linux-mm

Hi Christoph,

Here are some of my findings.

On Thu, Jul 08, 2010 at 07:16:11AM +0800, Christoph Hellwig wrote:
> Currently there's three possible values we pass into the flusher thread
> for the nr_pages arguments:

The current wb_writeback_work.nr_pages parameter semantic is actually quite
different from the _min_pages argument in 2.6.30.  The current semantic is
"max pages to write"; the old one was "min pages to write (until all written)".

current wb_writeback():

        for (;;) {
                /*
                 * Stop writeback when nr_pages has been consumed
                 */
                if (work->nr_pages <= 0)
                        break;

2.6.30 background_writeout(_min_pages):

                if (global_page_state(NR_FILE_DIRTY) +
                        global_page_state(NR_UNSTABLE_NFS) < background_thresh
                                && min_pages <= 0)
                        break;

>  - in sync_inodes_sb and bdi_start_background_writeback:
> 
> 	LONG_MAX
> 
>  - in writeback_inodes_sb and wb_check_old_data_flush:
> 
> 	global_page_state(NR_FILE_DIRTY) +
> 	global_page_state(NR_UNSTABLE_NFS) +
> 	(inodes_stat.nr_inodes - inodes_stat.nr_unused)
> 
>  - in wakeup_flusher_threads and laptop_mode_timer_fn:
> 
> 	global_page_state(NR_FILE_DIRTY) +
> 	global_page_state(NR_UNSTABLE_NFS)
> 
> The LONG_MAX cases are trivially explained: we ignore the nr_to_write
> value for data integrity writepage in the low-level writeback code, and
> the for_background case in bdi_start_background_writeback has its own
> check for the background threshold.  So far so good, and now it gets
> interesting.

Yeah.
 
> Why does writeback_inodes_sb add the number of used inodes into a value
> that is in units of pages?  And why don't the other callers do this?

The 2.6.30 sync_inodes_sb() has this comment:

 * We add in the number of potentially dirty inodes, because each inode write
 * can dirty pagecache in the underlying blockdev.
 
The periodic writeback also referenced it:

        nr_pages = global_page_state(NR_FILE_DIRTY) +
                        global_page_state(NR_UNSTABLE_NFS) +
                        (inodes_stat.nr_inodes - inodes_stat.nr_unused);

        if (nr_pages) {
                struct wb_writeback_work work = {
                        .nr_pages       = nr_pages,

Here it looks more sane to do

        if (wb_has_dirty_io(wb)) {
                struct wb_writeback_work work = {
                        .nr_pages       = LONG_MAX,
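
In context the periodic kupdate-style flush would then look something
like this; a sketch only, the surrounding function is paraphrased and
only the wb_has_dirty_io()/LONG_MAX part is the actual suggestion:

static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
	unsigned long expired;

	/* only run once per dirty_writeback_interval */
	expired = wb->last_old_flush +
			msecs_to_jiffies(dirty_writeback_interval * 10);
	if (time_before(jiffies, expired))
		return 0;

	wb->last_old_flush = jiffies;

	if (wb_has_dirty_io(wb)) {
		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_kupdate	= 1,
			.range_cyclic	= 1,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}

With LONG_MAX the loop in wb_writeback() stops on its own "nothing more
to write" checks instead of on the page budget.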


> But seriously, how is the _global_ number of dirty and unstable pages
> a good indicator for the amount of writeback per-bdi or superblock
> anyway?

Good point.
 
> Somehow I'd feel much better about doing this calculation all the way
> down in wb_writeback instead of the callers so we'll at least have
> one documented place for these insanities.

I guess the current "max pages to write" semantic serves as a poor
man's live-lock prevention guard.  sync() wants that semantic (in this
sense the old "min pages to write" has never worked as expected).  When
proper live-lock prevention is in place, this guard will no longer be
necessary.

However, the current semantic is not suitable for other users.  "Write at most
nr_pages until hitting the background dirty threshold" is basically a no-op,
because the callers may as well let the normal background writeback do the job
for them.

For example, laptop_mode_timer_fn() actually wants to write the whole world, so
it wants nr_pages=LONG_MAX with the old "min pages to write" semantic.

There are other cases that try to write some pages
- free_more_memory()
- do_try_to_free_pages() 
- ubifs shrink_liability()
- ext4 ext4_nonda_switch()

They don't really know or care about the exact nr_pages to write.
The latter two functions even sync everything for simplicity..

Thanks,
Fengguang

