All of lore.kernel.org
* [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
@ 2011-06-28 23:43 Curt Wohlgemuth
  2011-06-29  0:54 ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-06-28 23:43 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Curt Wohlgemuth

Contrary to the comment block atop writeback_inodes_sb_nr(),
we *were* calling

        wait_for_completion(&done);

which should not be done, as this is not called for data
integrity sync.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
---
 fs/fs-writeback.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0f015a0..3f711ac 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1186,17 +1186,14 @@ static void wait_sb_inodes(struct super_block *sb)
  */
 void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
 {
-	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_NONE,
-		.done		= &done,
 		.nr_pages	= nr,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 	bdi_queue_work(sb->s_bdi, &work);
-	wait_for_completion(&done);
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
 
-- 
1.7.3.1



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-28 23:43 [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr Curt Wohlgemuth
@ 2011-06-29  0:54 ` Dave Chinner
  2011-06-29  1:56     ` Curt Wohlgemuth
  2011-06-29  6:42   ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2011-06-29  0:54 UTC (permalink / raw)
  To: Curt Wohlgemuth; +Cc: Al Viro, linux-fsdevel, linux-kernel

On Tue, Jun 28, 2011 at 04:43:35PM -0700, Curt Wohlgemuth wrote:
> Contrary to the comment block atop writeback_inodes_sb_nr(),
> we *were* calling
> 
>         wait_for_completion(&done);
> 
> which should not be done, as this is not called for data
> integrity sync.
> 
> Signed-off-by: Curt Wohlgemuth <curtw@google.com>

The comment says it does not wait for IO to be -completed-.

The function as implemented waits for IO to be *submitted*.

This provides the callers with the same blocking semantics (i.e. request
queue full) as if they had submitted the IO themselves. The code
that uses this function relies on this blocking to delay the next set
of operations until after the IO has been started, so removing
the completion will change its behaviour significantly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29  0:54 ` Dave Chinner
@ 2011-06-29  1:56     ` Curt Wohlgemuth
  2011-06-29  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-06-29  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-fsdevel, linux-kernel

Hi Dave:

Thanks for the response.

On Tue, Jun 28, 2011 at 5:54 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jun 28, 2011 at 04:43:35PM -0700, Curt Wohlgemuth wrote:
>> Contrary to the comment block atop writeback_inodes_sb_nr(),
>> we *were* calling
>>
>>         wait_for_completion(&done);
>>
>> which should not be done, as this is not called for data
>> integrity sync.
>>
>> Signed-off-by: Curt Wohlgemuth <curtw@google.com>
>
> The comment says it does not wait for IO to be -completed-.
>
> The function as implemented waits for IO to be *submitted*.
>
> This provides the callers with the same blocking semantics (i.e. request
> queue full) as if they had submitted the IO themselves. The code
> that uses this function relies on this blocking to delay the next set
> of operations until after the IO has been started, so removing
> the completion will change its behaviour significantly.

I don't quite understand this.  It's true that all IO done as a result
of calling wb_writeback() on this work item won't finish before the
completion takes place, but sending all those pages in flight *will*
take place.  And that's a lot of time.  To wait on this before we then
call sync_inodes_sb(), and do it all over again, seems odd at best.

Pre-2.6.35 kernels would start non-integrity sync writeback and
immediately return, which would seem like a reasonable "prefetch-y"
thing to do, considering it's going to be immediately followed by a
data integrity sync writeback operation.

The post 2.6.35 semantics are fine; but then I don't understand why we
do both a __sync_filesystem(0) followed by a __sync_filesystem(1) (in
the case of sync(2)).  It doesn't seem to be any safer or more correct
to me; why not just do the data integrity sync writeback and call it a
day?

Thanks,
Curt

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29  0:54 ` Dave Chinner
  2011-06-29  1:56     ` Curt Wohlgemuth
@ 2011-06-29  6:42   ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-29  6:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel

On Wed, Jun 29, 2011 at 10:54:22AM +1000, Dave Chinner wrote:
> The comment says it does not wait for IO to be -completed-.
> 
> The function as implemented waits for IO to be *submitted*.
> 
> This provides the callers with the same blocking semantics (i.e. request
> queue full) as if they had submitted the IO themselves. The code
> that uses this function relies on this blocking to delay the next set
> of operations until after the IO has been started, so removing
> the completion will change its behaviour significantly.

The real importance is for locking.  If we don't wait for completion
we may still be doing writeback in the flusher thread after we have
already unlocked s_umount.  That would bring back the nasty writeback vs
umount races that this scheme fixed.  The commit logs that added
it should explain it in more detail (at least I usually write detailed
changelogs).



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29  1:56     ` Curt Wohlgemuth
@ 2011-06-29  8:11     ` Christoph Hellwig
  2011-06-29 16:57       ` Jan Kara
  2011-06-29 17:26         ` Curt Wohlgemuth
  -1 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-29  8:11 UTC (permalink / raw)
  To: Curt Wohlgemuth; +Cc: Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
> I don't quite understand this.  It's true that all IO done as a result
> of calling wb_writeback() on this work item won't finish before the
> completion takes place, but sending all those pages in flight *will*
> take place.  And that's a lot of time.  To wait on this before we then
> call sync_inodes_sb(), and do it all over again, seems odd at best.
> 
> Pre-2.6.35 kernels would start non-integrity sync writeback and
> immediately return, which would seem like a reasonable "prefetch-y"
> thing to do, considering it's going to be immediately followed by a
> data integrity sync writeback operation.

The only old kernel I have around is 2.6.16, so looking at that one
for old semantics:

 - it first does a wakeup_pdflush() which does a truly asynchronous
   writeback of everything in the system.  This is replaced with a
   wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
 - after that things change a bit, as we reordered sync quite a bit to
   a) improve data integrity by properly ordering the steps of the sync
   and b) share code between the global sync and per-fs sync as
   used by umount, sys_syncfs and a few other callers.  But both do
   a semi-sync pass and a sync pass on per-sb writeback.  In the old
   code it's done from the calling thread's context, while the newer code
   moved it to the flusher thread, but still waits for it with the
   completion.  No major change in semantics as far as I can see.

The idea of the async pass is to start writeback on as many pages as
possible before actively having to wait on them.  I'd agree with your
assessment that writeback_inodes_sb might not really help all that
much - given that a lot of the waiting might not actually be for I/O
completion but e.g. for block level throttling (or maybe cgroups
in your case?).

For sys_sync I'm pretty sure we could simply remove the
writeback_inodes_sb call and get just as good if not better performance,
but we'd still need a solution for the other sync_filesystem callers,
assuming the first write actually benefits them.  Maybe you can run
some sys_syncfs microbenchmarks to check it?  Note that doing multiple
passes generally doesn't actually help live-lock avoidance either, but
Wu has recently done some work in that area.

Another thing we've discussed a while ago is moving the writeback in
sync_inodes_sb, or at least the waiting, back into the calling thread's
context so as not to block the flusher threads from getting real work done.


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29  8:11     ` Christoph Hellwig
@ 2011-06-29 16:57       ` Jan Kara
  2011-06-29 17:55         ` Christoph Hellwig
  2011-06-29 17:26         ` Curt Wohlgemuth
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-06-29 16:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Wed 29-06-11 04:11:55, Christoph Hellwig wrote:
> On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
> > I don't quite understand this.  It's true that all IO done as a result
> > of calling wb_writeback() on this work item won't finish before the
> > completion takes place, but sending all those pages in flight *will*
> > take place.  And that's a lot of time.  To wait on this before we then
> > call sync_inodes_sb(), and do it all over again, seems odd at best.
> > 
> > Pre-2.6.35 kernels would start non-integrity sync writeback and
> > immediately return, which would seem like a reasonable "prefetch-y"
> > thing to do, considering it's going to be immediately followed by a
> > data integrity sync writeback operation.
> 
> The only old kernel I have around is 2.6.16, so looking at that one
> for old semantics:
> 
>  - it first does a wakeup_pdflush() which does a truly asynchronous
>    writeback of everything in the system.  This is replaced with a
>    wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
>  - after that things change a bit, as we reordered sync quite a bit to
>    a) improve data integrity by properly ordering the steps of the sync
>    and b) share code between the global sync and per-fs sync as
>    used by umount, sys_syncfs and a few other callers.  But both do
>    a semi-sync pass and a sync pass on per-sb writeback.  In the old
>    code it's done from the calling thread's context, while the newer code
>    moved it to the flusher thread, but still waits for it with the
>    completion.  No major change in semantics as far as I can see.
> 
> The idea of the async pass is to start writeback on as many pages as
> possible before actively having to wait on them.  I'd agree with your
> assessment that writeback_inodes_sb might not really help all that
> much - given that a lot of the waiting might not actually be for I/O
> completion but e.g. for block level throttling (or maybe cgroups
> in your case?).
> 
> For sys_sync I'm pretty sure we could simply remove the
> writeback_inodes_sb call and get just as good if not better performance,
  Actually, it won't with the current code, because WB_SYNC_ALL writeback
currently has the peculiarity that it looks like:
  for all inodes {
    write all inode data
    wait for inode data
  }
while to achieve good performance we actually need something like
  for all inodes
    write all inode data
  for all inodes
    wait for inode data
It makes a difference of an order of magnitude when there are lots of
smallish files - SLES had a bug like this, so I know from user reports ;)

But with the current inode list handling it will be hard to find all the
inodes that need to be waited for, which I guess is the reason why it is
done the way it is.

> but we'd still need a solution for the other sync_filesystem callers,
> assuming the first write actually benefits them.  Maybe you can run
> some sys_syncfs microbenchmarks to check it?  Note that doing multiple
> passes generally doesn't actually help live-lock avoidance either, but
> Wu has recently done some work in that area.
> 
> Another thing we've discussed a while ago is moving the writeback in
> sync_inodes_sb, or at least the waiting, back into the calling thread's
> context so as not to block the flusher threads from getting real work done.
  You mean that sync(1) would actually write the data itself? It would
certainly make some things simpler but it has its problems as well - for
example sync racing with flusher thread writing back inodes can create
rather bad IO pattern...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29  8:11     ` Christoph Hellwig
@ 2011-06-29 17:26         ` Curt Wohlgemuth
  2011-06-29 17:26         ` Curt Wohlgemuth
  1 sibling, 0 replies; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-06-29 17:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

Hi Christoph:

On Wed, Jun 29, 2011 at 1:11 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
>> I don't quite understand this.  It's true that all IO done as a result
>> of calling wb_writeback() on this work item won't finish before the
>> completion takes place, but sending all those pages in flight *will*
>> take place.  And that's a lot of time.  To wait on this before we then
>> call sync_inodes_sb(), and do it all over again, seems odd at best.
>>
>> Pre-2.6.35 kernels would start non-integrity sync writeback and
>> immediately return, which would seem like a reasonable "prefetch-y"
>> thing to do, considering it's going to be immediately followed by a
>> data integrity sync writeback operation.
>
> The only old kernel I have around is 2.6.16, so looking at that one
> for old semantics:
>
>  - it first does a wakeup_pdflush() which does a truly asynchronous
>   writeback of everything in the system.  This is replaced with a
>   wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
>  - after that things change a bit, as we reordered sync quite a bit to
>   a) improve data integrity by properly ordering the steps of the sync
>   and b) share code between the global sync and per-fs sync as
>   used by umount, sys_syncfs and a few other callers.  But both do
>   a semi-sync pass and a sync pass on per-sb writeback.  In the old
>   code it's done from the calling thread's context, while the newer code
>   moved it to the flusher thread, but still waits for it with the
>   completion.  No major change in semantics as far as I can see.

You're right about this.  (I'm looking at 2.6.26 as it happens, but
it's got the same behavior.)

The semantics did change between 2.6.34 and 2.6.35 though.  When the
work item queue was introduced in 2.6.32, the semantics changed from
what you describe above to what's present through 2.6.34:
writeback_inodes_sb() would enqueue a work item, and return.  Your
commit 83ba7b07 ("writeback: simplify the write back thread queue")
added the wait_for_completion() call, putting the semantics back to
where they were pre-2.6.32.

And in fact, this change has only a minor effect on writeback, at least
today -- since the WB_SYNC_ALL work item won't even get started until the
WB_SYNC_NONE one is finished.

Though there is one additional change between the old way (pre-2.6.32) and
today: with the old kernel, the pdflush thread would operate
concurrently with the first (and second?) sync pass through writeback.
Today of course, they're *all* serialized.  So really a call to
sys_sync() will enqueue 3 work items -- one from
wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
sync_inodes_sb().

> The idea of the async pass is to start writeback on as many pages as
> possible before actively having to wait on them.  I'd agree with your
> assessment that writeback_inodes_sb might not really help all that
> much - given that a lot of the waiting might not actually be for I/O
> completion but e.g. for block level throttling (or maybe cgroups
> in your case?).

What I'm seeing with an admittedly nasty workload -- two FIO scripts
writing as fast as possible to 1000 files each, with a sync 10 seconds
into the workload -- is that the "async pass" will write out a
(nearly) completely separate set of pages from the "sync pass."
Because the work items are serialized, so many new pages are dirtied
after the first work item starts that the sync pass has just about as
many pages to write out as the first.  So the "prefetch-y"
async pass is basically wasted time.

>
> For sys_sync I'm pretty sure we could simply remove the
> writeback_inodes_sb call and get just as good if not better performance,
> but we'd still need a solution for the other sync_filesystem callers,
> assuming the first write actually benefits them.  Maybe you can run
> some sys_syncfs microbenchmarks to check it?  Note that doing multiple
> passes generally doesn't actually help live-lock avoidance either, but
> Wu has recently done some work in that area.

I've been running tests using sys_sync, and seeing sync take far
longer than I'd expect, but haven't tried sys_syncfs yet.  I'll add
that to my list.

Still, both do two completely serialized passes: one "async pass"
followed by the "sync pass" -- during which other work items can get
queued, lots more pages get dirtied, etc.

>
> Another thing we've discussed a while ago is moving the writeback in
> sync_inodes_sb, or at least the waiting, back into the calling thread's
> context so as not to block the flusher threads from getting real work done.

This would make sense to ensure that sync finishes reasonably fast,
but as Jan points out, it would cause that much more interference with
the write pattern...

Thanks,
Curt


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 16:57       ` Jan Kara
@ 2011-06-29 17:55         ` Christoph Hellwig
  2011-06-29 19:15           ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-29 17:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> > For sys_sync I'm pretty sure we could simply remove the
> > writeback_inodes_sb call and get just as good if not better performance,
>   Actually, it won't with current code. Because WB_SYNC_ALL writeback
> currently has the peculiarity that it looks like:
>   for all inodes {
>     write all inode data
>     wait for inode data
>   }
> while to achieve good performance we actually need something like
>   for all inodes
>     write all inode data
>   for all inodes
>     wait for inode data
> It makes a difference of an order of magnitude when there are lots of
> smallish files - SLES had a bug like this, so I know from user reports ;)

I don't think that's true.  The WB_SYNC_ALL writeback is done using
sync_inodes_sb, which operates as:

  for all dirty inodes in bdi:
     if inode belongs to sb
        write all inode data

  for all inodes in sb:
     wait for inode data

We still do that in a big for-each-sb loop, though.

>   You mean that sync(1) would actually write the data itself? It would
> certainly make some things simpler but it has its problems as well - for
> example sync racing with flusher thread writing back inodes can create
> rather bad IO pattern...

Only the second pass.  The idea is that we first try to use the flusher
threads for good I/O patterns, but if we can't get that to work we only
block the caller and not everyone.  But that's just an idea so far;
it would need serious benchmarking.  And despite what I claimed before,
we actually do the wait in the caller's context already anyway, which
already gives you the easy part of the above effect.



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 17:26         ` Curt Wohlgemuth
@ 2011-06-29 18:00         ` Christoph Hellwig
  2011-06-29 21:30           ` Curt Wohlgemuth
  -1 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-29 18:00 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
> The semantics did change between 2.6.34 and 2.6.35 though.  When the
> work item queue was introduced in 2.6.32, the semantics changed from
> what you describe above to what's present through 2.6.34:
> writeback_inodes_sb() would enqueue a work item, and return.  Your
> commit 83ba7b07 ("writeback: simplify the write back thread queue")
> added the wait_for_completion() call, putting the semantics back to
> where they were pre-2.6.32.

Yes.  The kernels in between had those nasty writeback vs umount races
that we could trigger quite often.

> Though one additional change between the old way (pre-2.6.32) and
> today:  with the old kernel, the pdflush thread would operate
> concurrently with the first (and second?) sync path through writeback.
>  Today of course, they're *all* serialized.  So really a call to
> sys_sync() will enqueue 3 work items -- one from
> wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
> sync_inodes_sb().

Yes.  And having WB_SYNC_NONE items from both wakeup_flusher_threads and
writeback_inodes_sb is plain stupid.  The initial conversion to the
new sync_filesystem scheme had removed the wakeup_flusher_threads
call, but that caused a huge regression in some benchmarks.

As mentioned before, Wu has been working on some code to introduce tagging
so that the WB_SYNC_ALL pass won't start writing pages dirtied after
the sync call, which should help with your issue.  Although to
completely solve it we really need to get down to just two passes.


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 17:55         ` Christoph Hellwig
@ 2011-06-29 19:15           ` Jan Kara
  2011-06-29 20:12             ` Christoph Hellwig
  2011-07-01 22:55             ` Curt Wohlgemuth
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-06-29 19:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> > > For sys_sync I'm pretty sure we could simply remove the
> > > writeback_inodes_sb call and get just as good if not better performance,
> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
> > currently has the peculiarity that it looks like:
> >   for all inodes {
> >     write all inode data
> >     wait for inode data
> >   }
> > while to achieve good performance we actually need something like
> >   for all inodes
> >     write all inode data
> >   for all inodes
> >     wait for inode data
> > It makes a difference of an order of magnitude when there are lots of
> > smallish files - SLES had a bug like this, so I know from user reports ;)
> 
> I don't think that's true.  The WB_SYNC_ALL writeback is done using
> sync_inodes_sb, which operates as:
> 
>   for all dirty inodes in bdi:
>      if inode belongs to sb
>         write all inode data
> 
>   for all inodes in sb:
>      wait for inode data
> 
> we still do that in a big for each sb loop, though.
  True but writeback_single_inode() has in it:
        if (wbc->sync_mode == WB_SYNC_ALL) {
                int err = filemap_fdatawait(mapping);
                if (ret == 0)
                        ret = err;
        }
  So we end up waiting much earlier. Probably we should remove this wait
but that will need some audit I guess.
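As an illustration only (a userspace toy model, not kernel code), the two
orderings discussed here can be sketched like this, with sync_wait_per_inode()
standing in for the current writeback_single_inode() behaviour and
sync_wait_batched() for the write-all-then-wait-all shape:

```c
#define NR_INODES 3

enum op { OP_WRITE, OP_WAIT };
struct step { enum op op; int inode; };

/*
 * Current WB_SYNC_ALL behaviour: writeback_single_inode() calls
 * filemap_fdatawait() right after issuing each inode's writeback, so
 * every inode's I/O is waited on before the next inode is even started.
 */
static int sync_wait_per_inode(struct step *log)
{
	int n = 0, i;

	for (i = 0; i < NR_INODES; i++) {
		log[n++] = (struct step){ OP_WRITE, i };
		log[n++] = (struct step){ OP_WAIT, i };
	}
	return n;
}

/*
 * The ordering argued for here: issue all writes first so the device can
 * work on them concurrently, then wait once per inode (roughly what
 * sync_inodes_sb() + wait_sb_inodes() would do without the early wait).
 */
static int sync_wait_batched(struct step *log)
{
	int n = 0, i;

	for (i = 0; i < NR_INODES; i++)
		log[n++] = (struct step){ OP_WRITE, i };
	for (i = 0; i < NR_INODES; i++)
		log[n++] = (struct step){ OP_WAIT, i };
	return n;
}
```

With many small files, the batched shape keeps the queue full, which is where
the order-of-magnitude difference mentioned above comes from.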

> >   You mean that sync(1) would actually write the data itself? It would
> > certainly make some things simpler but it has its problems as well - for
> > example sync racing with flusher thread writing back inodes can create
> > rather bad IO pattern...
> 
> Only the second pass.  The idea is that we first try to use the flusher
> threads for good I/O patterns, but if we can't get that to work only
> block the caller and not everyone.  But that's just an idea so far,
it would need serious benchmarking.  And despite what I claimed before
> we actually do the wait in the caller context already anyway, which
> already gives you the easy part of the above effect.
  Modulo the writeback_single_inode() wait. But if that is dealt with I
agree.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 19:15           ` Jan Kara
@ 2011-06-29 20:12             ` Christoph Hellwig
  2011-06-30 12:15               ` Jan Kara
  2011-07-01 22:55             ` Curt Wohlgemuth
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-29 20:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Wed, Jun 29, 2011 at 09:15:18PM +0200, Jan Kara wrote:
>   True but writeback_single_inode() has in it:
>         if (wbc->sync_mode == WB_SYNC_ALL) {
>                 int err = filemap_fdatawait(mapping);
>                 if (ret == 0)
>                         ret = err;
>         }
>   So we end up waiting much earlier. Probably we should remove this wait
> but that will need some audit I guess.

Uhh, indeed.  We'll need this wait for things like sync_inode, though.



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 18:00         ` Christoph Hellwig
@ 2011-06-29 21:30           ` Curt Wohlgemuth
  2011-07-19 15:54             ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-06-29 21:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Wed, Jun 29, 2011 at 11:00 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
>> The semantics did change between 2.6.34 and 2.6.35 though.  When the
>> work item queue was introduced in 2.6.32, the semantics changed from
>> what you describe above to what's present through 2.6.34:
>> writeback_inodes_sb() would enqueue a work item, and return.  Your
>> commit 83ba7b07 ("writeback: simplify the write back thread queue")
>> added the wait_for_completion() call, putting the semantics back to
>> where they were pre-2.6.32.
>
> Yes.  The kernels inbetween had that nasty writeback vs umount races
> that we could trigger quite often.

My apologies for not being aware of these races (I tried to search
mailing lists) -- how did they manifest themselves?  I don't see
anything about it in your commit message of 83ba7b07.  I do see that
we are not taking s_umount for sys_sync any longer (only in
sync_filesystem(), e.g., for sys_syncfs)...

>> Though one additional change between the old way (pre-2.6.32) and
>> today:  with the old kernel, the pdflush thread would operate
>> concurrently with the first (and second?) sync path through writeback.
>>  Today of course, they're *all* serialized.  So really a call to
>> sys_sync() will enqueue 3 work items -- one from
>> wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
>> sync_inodes_sb().
>
> Yes.  And having WB_SYNC_NONE items from both wakeup_flusher_threads vs
> writeback_inodes_sb is plain stupid.  The initial conversion to the
> new sync_filesystem scheme had removed the wakeup_flusher_threads
> call, but that caused a huge regression in some benchmark.
>
> As mentioned before Wu was working on some code to introduce tagging
> so that the WB_SYNC_ALL call won't start writing pages dirtied after
> the sync call, which should help with your issue.  Although to
> completely solve it we really need to get down to just two passes.

I can definitely see how tagging with the start of sync would be
helpful; also how going from three to two passes seems like a
no-brainer.

But what's the real point of doing even two passes -- one SYNC_NONE
followed by one SYNC_ALL?  Is it to try to get through as many inodes as
followed by one SYNC_ALL?  Is it try to get through as many inodes as
possible before we potentially lock up by waiting on an inode on an
unavailable device?

Thanks,
Curt


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 20:12             ` Christoph Hellwig
@ 2011-06-30 12:15               ` Jan Kara
  2011-06-30 12:37                 ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-06-30 12:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

On Wed 29-06-11 16:12:34, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 09:15:18PM +0200, Jan Kara wrote:
> >   True but writeback_single_inode() has in it:
> >         if (wbc->sync_mode == WB_SYNC_ALL) {
> >                 int err = filemap_fdatawait(mapping);
> >                 if (ret == 0)
> >                         ret = err;
> >         }
> >   So we end up waiting much earlier. Probably we should remove this wait
> > but that will need some audit I guess.
> 
> Uhh, indeed.  We'll need this wait for things like sync_inode, though.
  Yes. Actually, specifically for filesystems like XFS, which update the
inode after IO completion, we would need more passes to be efficient and correct:
  for all inodes
    fdatawrite
  for all inodes
    fdatawait
  for all inodes
    write_inode
  for all inodes
    wait for inode IO

  But maybe this could be handled by having new WB_SYNC_ mode indicating
that writeback_single_inode() should not bother waiting (thus we'd really
end up waiting in sync_inodes_sb()) and then XFS and other filesystems that
need it would writeout inodes in their sync_fs() implementation (possibly
using a generic helper)?
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-30 12:15               ` Jan Kara
@ 2011-06-30 12:37                 ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-06-30 12:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Thu, Jun 30, 2011 at 02:15:58PM +0200, Jan Kara wrote:
>   Yes. Actually, specifically for filesystems like XFS which update inode
> after IO completion we would need more passes to be efficient and correct:
>   for all inodes
>     fdatawrite
>   for all inodes
>     fdatawait
>   for all inodes
>     write_inode
>   for all inodes
>     wait for inode IO
> 
>   But maybe this could be handled by having new WB_SYNC_ mode indicating
> that writeback_single_inode() should not bother waiting (thus we'd really
> end up waiting in sync_inodes_sb()) and then XFS and other filesystems that
> need it would writeout inodes in their sync_fs() implementation (possibly
> using a generic helper)?

We do very little in write_inode these days.  Basically we just log the
inode size and timestamps into the in-memory log.  The actual
writeout of the log happens in ->sync_fs already.



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 19:15           ` Jan Kara
  2011-06-29 20:12             ` Christoph Hellwig
@ 2011-07-01 22:55             ` Curt Wohlgemuth
  2011-07-02 11:32               ` Christoph Hellwig
  2011-07-11 17:00               ` Jan Kara
  1 sibling, 2 replies; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-07-01 22:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

Hi Jan:

On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> > > For sys_sync I'm pretty sure we could simply remove the
>> > > writeback_inodes_sb call and get just as good if not better performance,
>> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
>> > currently has the peculiarity that it looks like:
>> >   for all inodes {
>> >     write all inode data
>> >     wait for inode data
>> >   }
>> > while to achieve good performance we actually need something like
>> >   for all inodes
>> >     write all inode data
>> >   for all inodes
>> >     wait for inode data
>> > It makes a difference in an order of magnitude when there are lots of
>> > smallish files - SLES had a bug like this so I know from user reports ;)
>>
>> I don't think that's true.  The WB_SYNC_ALL writeback is done using
>> sync_inodes_sb, which operates as:
>>
>>   for all dirty inodes in bdi:
>>      if inode belongs to sb
>>         write all inode data
>>
>>   for all inodes in sb:
>>      wait for inode data
>>
>> we still do that in a big for each sb loop, though.
>  True but writeback_single_inode() has in it:
>        if (wbc->sync_mode == WB_SYNC_ALL) {
>                int err = filemap_fdatawait(mapping);
>                if (ret == 0)
>                        ret = err;
>        }
>  So we end up waiting much earlier. Probably we should remove this wait
> but that will need some audit I guess.

So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
  - queue a work item; this will
    - write all dirty inodes in a sb
      - write one inode's pages
      - wait on all inode's pages
  - wait for the work item
  - wait on all inodes in the sb (wait_sb_inodes)

I guess the point of wait_sb_inodes() is to wait on all inodes that
were written in a previous writeback pass.

One other issue I have with sync as it's structured is that we don't
do a WB_SYNC_ALL pass on any inode that's only associated with a block
device, and not on a mounted filesystem.  Blockdev mounts are
pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
if you've written directly to a block device and do a sync, the only
pass over the pages for this inode is via the
wakeup_flusher_threads() -- which operates on a BDI, regardless of the
superblock, and uses WB_SYNC_NONE.

All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
exclude pseudo-superblocks.

I've seen cases in our modified kernels here at Google in which
lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
/dev/sda (though I haven't been able to come up with a consistent test
case, nor reproduce this on an upstream kernel).

Thanks,
Curt

>
>> >   You mean that sync(1) would actually write the data itself? It would
>> > certainly make some things simpler but it has its problems as well - for
>> > example sync racing with flusher thread writing back inodes can create
>> > rather bad IO pattern...
>>
>> Only the second pass.  The idea is that we first try to use the flusher
>> threads for good I/O patterns, but if we can't get that to work only
>> block the caller and not everyone.  But that's just an idea so far,
>> it would need serious benchmark.  And despite what I claimed before
>> we actually do the wait in the caller context already anyway, which
>> already gives you the easy part of the above effect.
>  Modulo the writeback_single_inode() wait. But if that is dealt with I
> agree.
>
>                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-01 22:55             ` Curt Wohlgemuth
@ 2011-07-02 11:32               ` Christoph Hellwig
  2011-07-11 17:00               ` Jan Kara
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-02 11:32 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Fri, Jul 01, 2011 at 03:55:33PM -0700, Curt Wohlgemuth wrote:
> One other issue I have with sync as it's structured is that we don't
> do a WB_SYNC_ALL pass on any inode that's only associated with a block
> device, and not on a mounted filesystem.  Blockdev mounts are
> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
> if you've written directly to a block device and do a sync, the only
> pass over the pages for this inode are via the
> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
> superblock, and uses WB_SYNC_NONE.
> 
> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
> exclude pseudo-superblocks.

Interesting.  I think that's actually correct by the traditional
definition of sync, but not really useful.  I also doubt it's
intentional.

> I've seen cases in our modified kernels here at Google in which
> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
> /dev/sda (though I haven't been able to come up with a consistent test
> case, nor reproduce this on an upstream kernel).

lilo really should do an fsync of the partition it installed itself
into.  That would not only fix this issue, but also have a lot less impact
on the rest of the system.
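A minimal sketch of that suggestion, assuming a hypothetical installer helper
(write_and_flush() is made up for illustration, and the error handling is
deliberately simple):

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper in the spirit of the suggestion above: a boot
 * installer that writes directly to a block device should fsync() the
 * device fd itself instead of relying on sync(2).
 */
static int write_and_flush(const char *dev, const void *buf, size_t len)
{
	int fd = open(dev, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, buf, len);
	/* fsync() waits for this inode's dirty pages with integrity semantics */
	if (ret != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Since fsync() on the device fd waits on exactly that inode's pages, it does not
depend on whether sync(2) happens to cover the bdev pseudo-superblock.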



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-01 22:55             ` Curt Wohlgemuth
  2011-07-02 11:32               ` Christoph Hellwig
@ 2011-07-11 17:00               ` Jan Kara
  2011-07-11 17:11                 ` Curt Wohlgemuth
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-07-11 17:00 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

  Hi Curt,

On Fri 01-07-11 15:55:33, Curt Wohlgemuth wrote:
> On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
> >> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> >> > > For sys_sync I'm pretty sure we could simply remove the
> >> > > writeback_inodes_sb call and get just as good if not better performance,
> >> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
> >> > currently has the peculiarity that it looks like:
> >> >   for all inodes {
> >> >     write all inode data
> >> >     wait for inode data
> >> >   }
> >> > while to achieve good performance we actually need something like
> >> >   for all inodes
> >> >     write all inode data
> >> >   for all inodes
> >> >     wait for inode data
> >> > It makes a difference in an order of magnitude when there are lots of
> >> > smallish files - SLES had a bug like this so I know from user reports ;)
> >>
> >> I don't think that's true.  The WB_SYNC_ALL writeback is done using
> >> sync_inodes_sb, which operates as:
> >>
> >>   for all dirty inodes in bdi:
> >>      if inode belongs to sb
> >>         write all inode data
> >>
> >>   for all inodes in sb:
> >>      wait for inode data
> >>
> >> we still do that in a big for each sb loop, though.
> >  True but writeback_single_inode() has in it:
> >        if (wbc->sync_mode == WB_SYNC_ALL) {
> >                int err = filemap_fdatawait(mapping);
> >                if (ret == 0)
> >                        ret = err;
> >        }
> >  So we end up waiting much earlier. Probably we should remove this wait
> > but that will need some audit I guess.
> 
> So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
>   - queue a work item; this will
>     - write all dirty inodes in a sb
>       - write one inode's pages
>       - wait on all inode's pages
>   - wait for the work item
>   - wait on all inodes in the sb (wait_sb_inodes)
> 
> I guess the point of wait_sb_inodes() is to wait on all inodes that
> were written in a previous writeback pass.
  Yes. As we discussed with Christoph, waiting for each inode could be
avoided in principle but we have to be careful not to break data
consistency guarantees in some other path calling
writeback_single_inode()...

> One other issue I have with sync as it's structured is that we don't
> do a WB_SYNC_ALL pass on any inode that's only associated with a block
> device, and not on a mounted filesystem.  Blockdev mounts are
> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
> if you've written directly to a block device and do a sync, the only
> pass over the pages for this inode are via the
> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
> superblock, and uses WB_SYNC_NONE.
> 
> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
> exclude pseudo-superblocks.
> 
> I've seen cases in our modified kernels here at Google in which
> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
> /dev/sda (though I haven't been able to come up with a consistent test
> case, nor reproduce this on an upstream kernel).
  Ho hum, I wonder what would be the cleanest way to fix this. I guess we
could make an exception for the 'bdev' superblock in the code, but it's ugly.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-11 17:00               ` Jan Kara
@ 2011-07-11 17:11                 ` Curt Wohlgemuth
  2011-07-11 19:48                   ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-07-11 17:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

Hi Jan:

On Mon, Jul 11, 2011 at 10:00 AM, Jan Kara <jack@suse.cz> wrote:
>  Hi Curt,
>
> On Fri 01-07-11 15:55:33, Curt Wohlgemuth wrote:
>> On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> >> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> >> > > For sys_sync I'm pretty sure we could simply remove the
>> >> > > writeback_inodes_sb call and get just as good if not better performance,
>> >> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
>> >> > currently has the peculiarity that it looks like:
>> >> >   for all inodes {
>> >> >     write all inode data
>> >> >     wait for inode data
>> >> >   }
>> >> > while to achieve good performance we actually need something like
>> >> >   for all inodes
>> >> >     write all inode data
>> >> >   for all inodes
>> >> >     wait for inode data
>> >> > It makes a difference in an order of magnitude when there are lots of
>> >> > smallish files - SLES had a bug like this so I know from user reports ;)
>> >>
>> >> I don't think that's true.  The WB_SYNC_ALL writeback is done using
>> >> sync_inodes_sb, which operates as:
>> >>
>> >>   for all dirty inodes in bdi:
>> >>      if inode belongs to sb
>> >>         write all inode data
>> >>
>> >>   for all inodes in sb:
>> >>      wait for inode data
>> >>
>> >> we still do that in a big for each sb loop, though.
>> >  True but writeback_single_inode() has in it:
>> >        if (wbc->sync_mode == WB_SYNC_ALL) {
>> >                int err = filemap_fdatawait(mapping);
>> >                if (ret == 0)
>> >                        ret = err;
>> >        }
>> >  So we end up waiting much earlier. Probably we should remove this wait
>> > but that will need some audit I guess.
>>
>> So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
>>   - queue a work item; this will
>>     - write all dirty inodes in a sb
>>       - write one inode's pages
>>       - wait on all inode's pages
>>   - wait for the work item
>>   - wait on all inodes in the sb (wait_sb_inodes)
>>
>> I guess the point of wait_sb_inodes() is to wait on all inodes that
>> were written in a previous writeback pass.
>  Yes. As we discussed with Christoph, waiting for each inode could be
> avoided in principle but we have to be careful not to break data
> consistency guarantees in some other path calling
> writeback_single_inode()...
>
>> One other issue I have with sync as it's structured is that we don't
>> do a WB_SYNC_ALL pass on any inode that's only associated with a block
>> device, and not on a mounted filesystem.  Blockdev mounts are
>> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
>> if you've written directly to a block device and do a sync, the only
>> pass over the pages for this inode are via the
>> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
>> superblock, and uses WB_SYNC_NONE.
>>
>> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
>> exclude pseudo-superblocks.
>>
>> I've seen cases in our modified kernels here at Google in which
>> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
>> /dev/sda (though I haven't been able to come up with a consistent test
>> case, nor reproduce this on an upstream kernel).
>  Ho hum, I wonder what would be the cleanest way to fix this. I guess we
> could make an exception for 'bdev' superblock in the code but it's ugly.

Yes, it's ugly.

You could declare that sync(2) is just doing its job, that it's not
designed to write dirty pages from devices that don't have filesystems
mounted on them; that's what Christoph seems to be saying, and it's
what the man pages for sync(2) that I've seen say as well.  But
everybody I've talked to about this is surprised, so you could declare
it a bug as well :-) .

It seems to me that sys_sync *could* just dispense with iterating over
the superblocks, and just iterate over the BDIs, just as
wakeup_flusher_threads() does.  I.e., just do two passes over all BDIs
-- one with WB_SYNC_NONE, and one with WB_SYNC_ALL.  Well, not quite.
It would still have to go to each SB and call the quota_sync and
sync_fs callbacks, but those should be cheap.

And yes, this would make sys_sync and sys_syncfs more different than
they are today.

Thanks,
Curt

>                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-11 17:11                 ` Curt Wohlgemuth
@ 2011-07-11 19:48                   ` Jan Kara
  2011-07-11 19:51                       ` Curt Wohlgemuth
  2011-07-11 20:11                     ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-07-11 19:48 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Mon 11-07-11 10:11:17, Curt Wohlgemuth wrote:
> >> One other issue I have with sync as it's structured is that we don't
> >> do a WB_SYNC_ALL pass on any inode that's only associated with a block
> >> device, and not on a mounted filesystem.  Blockdev mounts are
> >> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
> >> if you've written directly to a block device and do a sync, the only
> >> pass over the pages for this inode are via the
> >> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
> >> superblock, and uses WB_SYNC_NONE.
> >>
> >> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
> >> exclude pseudo-superblocks.
> >>
> >> I've seen cases in our modified kernels here at Google in which
> >> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
> >> /dev/sda (though I haven't been able to come up with a consistent test
> >> case, nor reproduce this on an upstream kernel).
> >  Ho hum, I wonder what would be the cleanest way to fix this. I guess we
> > could make an exception for 'bdev' superblock in the code but it's ugly.
> 
> Yes, it's ugly.
> 
> You could declare that sync(2) is just doing it's job, that it's not
> designed to write dirty pages from devices that don't have filesystems
> mounted on them; that's what Christoph seems to be saying, and it's
> what the man pages for sync(2) that I've seen say as well.  But
> everybody I've talked to about this is surprised, so you could declare
> it a bug as well :-) .
> 
> It seems to me that sys_sync *could* just dispense with iterating over
> the superblocks, and just iterate over the BDIs, just as
> wakeup_flusher_threads() does.  I.e., just do two passes over all BDIs
> -- one with WB_SYNC_NONE, and one with WB_SYNC_ALL.  Well, not quite.
> It would still have to go to each SB and call the quota_sync and
> sync_fs callbacks, but those should be cheap.
  Well, they're not exactly cheap (journaling filesystems have to force
transaction commits and wait for them) but that does not really matter.
The real problem is that to wait for IO completion you need some list of
inodes you want to wait for and you can currently get such list only from
superblock.
 
> And yes, this would make sys_sync and sys_syncfs more different than
> they are today.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-11 19:48                   ` Jan Kara
@ 2011-07-11 19:51                       ` Curt Wohlgemuth
  2011-07-11 20:11                     ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-07-11 19:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Mon, Jul 11, 2011 at 12:48 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 11-07-11 10:11:17, Curt Wohlgemuth wrote:
>> >> One other issue I have with sync as it's structured is that we don't
>> >> do a WB_SYNC_ALL pass on any inode that's only associated with a block
>> >> device, and not on a mounted filesystem.  Blockdev mounts are
>> >> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
>> >> if you've written directly to a block device and do a sync, the only
>> >> pass over the pages for this inode are via the
>> >> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
>> >> superblock, and uses WB_SYNC_NONE.
>> >>
>> >> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
>> >> exclude pseudo-superblocks.
>> >>
>> >> I've seen cases in our modified kernels here at Google in which
>> >> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
>> >> /dev/sda (though I haven't been able to come up with a consistent test
>> >> case, nor reproduce this on an upstream kernel).
>> >  Ho hum, I wonder what would be the cleanest way to fix this. I guess we
>> > could make an exception for 'bdev' superblock in the code but it's ugly.
>>
>> Yes, it's ugly.
>>
>> You could declare that sync(2) is just doing it's job, that it's not
>> designed to write dirty pages from devices that don't have filesystems
>> mounted on them; that's what Christoph seems to be saying, and it's
>> what the man pages for sync(2) that I've seen say as well.  But
>> everybody I've talked to about this is surprised, so you could declare
>> it a bug as well :-) .
>>
>> It seems to me that sys_sync *could* just dispense with iterating over
>> the superblocks, and just iterate over the BDIs, just as
>> wakeup_flusher_threads() does.  I.e., just do two passes over all BDIs
>> -- one with WB_SYNC_NONE, and one with WB_SYNC_ALL.  Well, not quite.
>> It would still have to go to each SB and call the quota_sync and
>> sync_fs callbacks, but those should be cheap.
>  Well, they're not exactly cheap (journaling filesystems have to force
> transaction commits and wait for them) but that does not really matter.
> The real problem is that to wait for IO completion you need some list of
> inodes you want to wait for and you can currently get such list only from
> superblock.

Ah, of course, I realized that some days ago but forgot it --
something that seems to happen more than I'd like to admit to myself.

Thanks,
Curt

>
>> And yes, this would make sys_sync and sys_syncfs more different than
>> they are today.
>
>                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-11 19:48                   ` Jan Kara
  2011-07-11 19:51                       ` Curt Wohlgemuth
@ 2011-07-11 20:11                     ` Christoph Hellwig
  2011-07-12 10:34                       ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-11 20:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Curt Wohlgemuth, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Mon, Jul 11, 2011 at 09:48:35PM +0200, Jan Kara wrote:
>   Well, they're not exactly cheap (journaling filesystems have to force
> transaction commits and wait for them) but that does not really matter.
> The real problem is that to wait for IO completion you need some list of
> inodes you want to wait for and you can currently get such list only from
> superblock.

All block device inodes sit on blockdev_superblock; we got rid of inodes
without a superblock a long time ago.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-11 20:11                     ` Christoph Hellwig
@ 2011-07-12 10:34                       ` Jan Kara
  2011-07-12 10:41                         ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-07-12 10:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

On Mon 11-07-11 16:11:57, Christoph Hellwig wrote:
> On Mon, Jul 11, 2011 at 09:48:35PM +0200, Jan Kara wrote:
> >   Well, they're not exactly cheap (journaling filesystems have to force
> > transaction commits and wait for them) but that does not really matter.
> > The real problem is that to wait for IO completion you need some list of
> > inodes you want to wait for and you can currently get such list only from
> > superblock.
> 
> All block device inodes sit on blockdev_superblock, we got rid of inodes
> without a superblock long time ago.
  Sure, we can easily iterate also blockdev_superblock. What I meant is
that blockdev_superblock will need a special handling since we otherwise
ignore pseudo superblocks...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-12 10:34                       ` Jan Kara
@ 2011-07-12 10:41                         ` Christoph Hellwig
  2011-07-12 22:37                           ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-12 10:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Tue, Jul 12, 2011 at 12:34:53PM +0200, Jan Kara wrote:
> > All block device inodes sit on blockdev_superblock, we got rid of inodes
> > without a superblock long time ago.
>   Sure, we can easily iterate also blockdev_superblock. What I meant is
> that blockdev_superblock will need a special handling since we otherwise
> ignore pseudo superblocks...

Pseudo superblocks aren't ignored.  They are added to super_blocks like
all others, and iterate_supers doesn't skip over them.  The problem
is that blockdev_superblock doesn't have a proper s_bdi set, and thus
gets skipped over by __sync_filesystem. 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-12 10:41                         ` Christoph Hellwig
@ 2011-07-12 22:37                           ` Jan Kara
  2011-07-14 16:29                               ` Curt Wohlgemuth
  2011-07-19 16:51                             ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-07-12 22:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1315 bytes --]

On Tue 12-07-11 06:41:32, Christoph Hellwig wrote:
> On Tue, Jul 12, 2011 at 12:34:53PM +0200, Jan Kara wrote:
> > > All block device inodes sit on blockdev_superblock, we got rid of inodes
> > > without a superblock long time ago.
> >   Sure, we can easily iterate also blockdev_superblock. What I meant is
> > that blockdev_superblock will need a special handling since we otherwise
> > ignore pseudo superblocks...
> 
> Pseudo superblocks aren't ignored.  They are added to super_blocks like
> all others, and iterate_supers doesn't skip over them.  The problem
> is that blockdev_superblock doesn't have a proper s_bdi set, and thus
> gets skipped over by __sync_filesystem. 
  Yes. But even if it were not skipped, writeback_inodes_sb() doesn't have
a single flusher thread to kick to actually do the writeout (since each inode
on blockdev_superblock belongs to a different bdi). So it's perfectly fine
that we skip blockdev_superblock.

  If we want to fix the problem something like attached patch should do.
Comments?

								Honza

PS: While testing the patch, I've noticed that a block device can have
dirty data only if it is still open (__blkdev_put() writes out all dirty
pages), which somewhat limits how badly people can be burned by sync not
writing out block devices...
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-vfs-Make-sync-1-writeout-also-block-device-inodes.patch --]
[-- Type: text/x-patch, Size: 4609 bytes --]

>From 2834bd2727c93055bb7373d8849492044f70c530 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 12 Jul 2011 22:01:51 +0200
Subject: [PATCH] vfs: Make sync(1) writeout also block device inodes

In case a block device does not have a filesystem mounted on it, sync(1)
will just ignore it and not write out dirty pages, because it iterates over
filesystems with s_bdi != noop_backing_dev_info and thus avoids
blockdev_superblock.  Since it is unexpected that sync does not write out
dirty data for block devices, be nice to users and change the behavior to
do so.

This requires a change to how syncing is done. We now first traverse all
superblocks with s_bdi != noop_backing_dev_info, write out their inodes, and
call sync_fs; when this is done, we traverse all block devices and sync them.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/sync.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index c38ec16..f8f21d9 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,20 +23,13 @@
 
 /*
  * Do the filesystem syncing work. For simple filesystems
- * writeback_inodes_sb(sb) just dirties buffers with inodes so we have to
- * submit IO for these buffers via __sync_blockdev(). This also speeds up the
- * wait == 1 case since in that case write_inode() functions do
+ * writeback_inodes_sb(sb) just dirties buffers with inodes so the caller has
+ * to additionally submit IO for these buffers via __sync_blockdev(). This also
+ * speeds up the wait == 1 case since in that case write_inode() functions do
  * sync_dirty_buffer() and thus effectively write one block at a time.
  */
-static int __sync_filesystem(struct super_block *sb, int wait)
+static void __sync_filesystem(struct super_block *sb, int wait)
 {
-	/*
-	 * This should be safe, as we require bdi backing to actually
-	 * write out data in the first place
-	 */
-	if (sb->s_bdi == &noop_backing_dev_info)
-		return 0;
-
 	if (sb->s_qcop && sb->s_qcop->quota_sync)
 		sb->s_qcop->quota_sync(sb, -1, wait);
 
@@ -47,7 +40,6 @@ static int __sync_filesystem(struct super_block *sb, int wait)
 
 	if (sb->s_op->sync_fs)
 		sb->s_op->sync_fs(sb, wait);
-	return __sync_blockdev(sb->s_bdev, wait);
 }
 
 /*
@@ -71,16 +63,26 @@ int sync_filesystem(struct super_block *sb)
 	if (sb->s_flags & MS_RDONLY)
 		return 0;
 
-	ret = __sync_filesystem(sb, 0);
+	/*
+	 * This should be safe, as we require bdi backing to actually
+	 * write out data in the first place.
+	 */
+	if (sb->s_bdi == &noop_backing_dev_info)
+		return 0;
+
+	__sync_filesystem(sb, 0);
+	ret = __sync_blockdev(sb->s_bdev, 0);
 	if (ret < 0)
 		return ret;
-	return __sync_filesystem(sb, 1);
+	__sync_filesystem(sb, 1);
+	return __sync_blockdev(sb->s_bdev, 1);
 }
 EXPORT_SYMBOL_GPL(sync_filesystem);
 
 static void sync_one_sb(struct super_block *sb, void *arg)
 {
-	if (!(sb->s_flags & MS_RDONLY))
+	/* Avoid read-only filesystems and filesystems without backing device */
+	if (!(sb->s_flags & MS_RDONLY) && sb->s_bdi != &noop_backing_dev_info)
 		__sync_filesystem(sb, *(int *)arg);
 }
 /*
@@ -92,6 +94,42 @@ static void sync_filesystems(int wait)
 	iterate_supers(sync_one_sb, &wait);
 }
 
+static void sync_all_bdevs(int wait)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+		struct address_space *mapping = inode->i_mapping;
+
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+		    mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_sb_list_lock);
+		/*
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * be holding the last reference and we cannot iput it under
+		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * later.
+		 */
+		iput(old_inode);
+		old_inode = inode;
+
+		__sync_blockdev(I_BDEV(inode), wait);
+
+		spin_lock(&inode_sb_list_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+	iput(old_inode);
+}
+
 /*
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
@@ -101,6 +139,8 @@ SYSCALL_DEFINE0(sync)
 	wakeup_flusher_threads(0);
 	sync_filesystems(0);
 	sync_filesystems(1);
+	sync_all_bdevs(0);
+	sync_all_bdevs(1);
 	if (unlikely(laptop_mode))
 		laptop_sync_completion();
 	return 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-12 22:37                           ` Jan Kara
@ 2011-07-14 16:29                               ` Curt Wohlgemuth
  2011-07-19 16:51                             ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Curt Wohlgemuth @ 2011-07-14 16:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

Hi Jan:

On Tue, Jul 12, 2011 at 3:37 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 12-07-11 06:41:32, Christoph Hellwig wrote:
>> On Tue, Jul 12, 2011 at 12:34:53PM +0200, Jan Kara wrote:
>> > > All block device inodes sit on blockdev_superblock, we got rid of inodes
>> > > without a superblock long time ago.
>> >   Sure, we can easily iterate also blockdev_superblock. What I meant is
>> > that blockdev_superblock will need a special handling since we otherwise
>> > ignore pseudo superblocks...
>>
>> Pseudo superblocks aren't ignored.  They are added to super_blocks like
>> all others, and iterate_supers doesn't skip over them.  The problem
>> is that blockdev_superblock doesn't have a proper s_bdi set, and thus
>> gets skipped over by __sync_filesystem.
>  Yes. But even if it was not skipped writeback_inodes_sb() doesn't have
> one flusher thread to kick to actually do the writeout (since each inode on
> blockdev_superblock belongs to a different bdi). So it's perfectly fine we
> skip blockdev_superblock.
>
>  If we want to fix the problem something like attached patch should do.
> Comments?

Your patch looks good to me, in that it does hit all the bdevs with
both WB_SYNC_NONE and WB_SYNC_ALL.  However, I still say that the call to
wakeup_flusher_threads() in sys_sync() is superfluous, at least as
long as writeback_inodes_sb() waits for completion of the work item
that it enqueues.

Thanks,
Curt

>
>                                                                Honza
>
> PS: While testing the patch, I've noticed that a block device can have
> dirty data only if it is still open (__blkdev_put() writes out all dirty
> pages), which somewhat limits how badly people can be burned by sync not
> writing out block devices...
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-14 16:29                               ` Curt Wohlgemuth
  (?)
@ 2011-07-14 23:08                               ` Jan Kara
  2011-07-19 16:56                                 ` Christoph Hellwig
  -1 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-07-14 23:08 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1893 bytes --]

  Hi Curt,

On Thu 14-07-11 09:29:34, Curt Wohlgemuth wrote:
> On Tue, Jul 12, 2011 at 3:37 PM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 12-07-11 06:41:32, Christoph Hellwig wrote:
> >> On Tue, Jul 12, 2011 at 12:34:53PM +0200, Jan Kara wrote:
> >> > > All block device inodes sit on blockdev_superblock, we got rid of inodes
> >> > > without a superblock long time ago.
> >> >   Sure, we can easily iterate also blockdev_superblock. What I meant is
> >> > that blockdev_superblock will need a special handling since we otherwise
> >> > ignore pseudo superblocks...
> >>
> >> Pseudo superblocks aren't ignored.  They are added to super_blocks like
> >> all others, and iterate_supers doesn't skip over them.  The problem
> >> is that blockdev_superblock doesn't have a proper s_bdi set, and thus
> >> gets skipped over by __sync_filesystem.
> >  Yes. But even if it was not skipped writeback_inodes_sb() doesn't have
> > one flusher thread to kick to actually do the writeout (since each inode on
> > blockdev_superblock belongs to a different bdi). So it's perfectly fine we
> > skip blockdev_superblock.
> >
> >  If we want to fix the problem something like attached patch should do.
> > Comments?
> 
> Your patch looks good to me, in that it does hit all the bdevs with
> both WB_SYNC_NONE and SYNC_ALL.  However, I still say that the call to
> wakeup_flusher_threads() in sys_sync() is superfluous, at least as
> long as writeback_inodes_sb() waits for completion of the work item
> that it enqueues.
  Actually, it's the other way around: writeback_inodes_sb() is superfluous
because of wakeup_flusher_threads(). So something like the attached patch
could improve sync times (especially in the presence of other IO). So far I
have only checked that sync times look reasonable with it but didn't really
compare them with the original kernel...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-vfs-Avoid-unnecessary-WB_SYNC_NONE-writeback-during-.patch --]
[-- Type: text/x-patch, Size: 3592 bytes --]

>From 7d17da2ceed603a3cb013e06a2158e56029043a8 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 14 Jul 2011 23:42:19 +0200
Subject: [PATCH] vfs: Avoid unnecessary WB_SYNC_NONE writeback during sync(1)

wakeup_flusher_threads(0) will queue work doing complete writeback for
each flusher thread. Thus there is not much point in sync_filesystems()
submitting another work item doing full inode WB_SYNC_NONE writeback. So change
sync to do:
  wakeup_flusher_threads(0);
  for each filesystem
    async quota_sync()
    synchronous inode writeback
    async sync_fs()
  submit dirty buffers from all block devices
  for each filesystem
    synchronous quota_sync()
    synchronous sync_fs()
  synchronous writeout of all block devices

  Note that we do synchronous inode writeback before calling sync_fs().
Otherwise sync_fs() would be called in the middle of inode writeback done
by flusher thread and thus couldn't do very meaningful work.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/sync.c |   38 ++++++++++++++++++++++++++++++++------
 1 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index f8f21d9..fb6b8b2 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -21,6 +21,11 @@
 #define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
 			SYNC_FILE_RANGE_WAIT_AFTER)
 
+/* Wait flags for __sync_filesystem */
+#define WAIT_QUOTA	0x01
+#define WAIT_INODES	0x02
+#define WAIT_FS		0x04
+
 /*
  * Do the filesystem syncing work. For simple filesystems
  * writeback_inodes_sb(sb) just dirties buffers with inodes so the caller has
@@ -31,15 +36,15 @@
 static void __sync_filesystem(struct super_block *sb, int wait)
 {
 	if (sb->s_qcop && sb->s_qcop->quota_sync)
-		sb->s_qcop->quota_sync(sb, -1, wait);
+		sb->s_qcop->quota_sync(sb, -1, !!(wait & WAIT_QUOTA));
 
-	if (wait)
+	if (wait & WAIT_INODES)
 		sync_inodes_sb(sb);
 	else
 		writeback_inodes_sb(sb);
 
 	if (sb->s_op->sync_fs)
-		sb->s_op->sync_fs(sb, wait);
+		sb->s_op->sync_fs(sb, !!(wait & WAIT_FS));
 }
 
 /*
@@ -74,7 +79,7 @@ int sync_filesystem(struct super_block *sb)
 	ret = __sync_blockdev(sb->s_bdev, 0);
 	if (ret < 0)
 		return ret;
-	__sync_filesystem(sb, 1);
+	__sync_filesystem(sb, WAIT_QUOTA | WAIT_INODES | WAIT_FS);
 	return __sync_blockdev(sb->s_bdev, 1);
 }
 EXPORT_SYMBOL_GPL(sync_filesystem);
@@ -130,16 +135,37 @@ static void sync_all_bdevs(int wait)
 	iput(old_inode);
 }
 
+static void sync_one_fs_metadata(struct super_block *sb, void *arg)
+{
+	/* Avoid read-only filesystems and filesystems without backing device */
+	if (sb->s_flags & MS_RDONLY)
+		return;
+	if (sb->s_bdi == &noop_backing_dev_info)
+		return;
+	if (sb->s_qcop && sb->s_qcop->quota_sync)
+		sb->s_qcop->quota_sync(sb, -1, 1);
+	if (sb->s_op->sync_fs)
+		sb->s_op->sync_fs(sb, 1);
+}
+
 /*
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
 SYSCALL_DEFINE0(sync)
 {
+	/* Start flushing on all devices */
 	wakeup_flusher_threads(0);
-	sync_filesystems(0);
-	sync_filesystems(1);
+	/*
+	 * Above call queued work doing complete writeout on each filesystem.
+	 * So now we only have to queue work which guarantees data integrity
+	 * - not much should be left for it to write. The WB_SYNC_ALL inode
+	 * writeback also guarantees that sync_fs() is called after inodes
+	 * are written out and thus it can do meaningful work.
+	 */
+	sync_filesystems(WAIT_INODES);
 	sync_all_bdevs(0);
+	iterate_supers(sync_one_fs_metadata, NULL);
 	sync_all_bdevs(1);
 	if (unlikely(laptop_mode))
 		laptop_sync_completion();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-06-29 21:30           ` Curt Wohlgemuth
@ 2011-07-19 15:54             ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-19 15:54 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel, fengguang.wu

On Wed, Jun 29, 2011 at 02:30:17PM -0700, Curt Wohlgemuth wrote:
> I can definitely see how tagging with the start of sync would be
> helpful; also how going from three to two passes seems like a
> no-brainer.
> 
> But what's the real point of doing even two passes -- one SYNC_NONE
> followed by one SYNC_ALL?  Is it to try to get through as many inodes as
> possible before we potentially lock up by waiting on an inode on an
> unavailable device?

Yes.  Or at least that's the idea; I'd love to see actual
proof of it.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-12 22:37                           ` Jan Kara
  2011-07-14 16:29                               ` Curt Wohlgemuth
@ 2011-07-19 16:51                             ` Christoph Hellwig
  2011-07-20 22:00                               ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-19 16:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Wed, Jul 13, 2011 at 12:37:15AM +0200, Jan Kara wrote:
> -static int __sync_filesystem(struct super_block *sb, int wait)
> +static void __sync_filesystem(struct super_block *sb, int wait)
>  {
> -	/*
> -	 * This should be safe, as we require bdi backing to actually
> -	 * write out data in the first place
> -	 */
> -	if (sb->s_bdi == &noop_backing_dev_info)
> -		return 0;
> -
>  	if (sb->s_qcop && sb->s_qcop->quota_sync)
>  		sb->s_qcop->quota_sync(sb, -1, wait);
>  
> @@ -47,7 +40,6 @@ static int __sync_filesystem(struct super_block *sb, int wait)
>  
>  	if (sb->s_op->sync_fs)
>  		sb->s_op->sync_fs(sb, wait);
> -	return __sync_blockdev(sb->s_bdev, wait);

Removing this call breaks sys_syncfs and similar semantics on filesystems
that just push metadata into buffers in ->write_inode or ->sync_fs and
then expect the caller to write them out.  This list of filesystems
includes ext2 and, in general, most filesystems without journaling or
similar techniques.

I'm perfectly fine with pushing the sync_blockdev call into the
filesystem for these, but we'll need a way to handle them.

> +	/*
> +	 * This should be safe, as we require bdi backing to actually
> +	 * write out data in the first place.
> +	 */
> +	if (sb->s_bdi == &noop_backing_dev_info)
> +		return 0;

There's really no reason to do that check early.  There's no reason
not to have a filesystem that doesn't use the writeback code but
still has a ->sync_fs method.  IMHO this check should move into
sync_inodes_sb/writeback_inodes_sb.

> +static void sync_all_bdevs(int wait)
> +{
> +	struct inode *inode, *old_inode = NULL;
> +
> +	spin_lock(&inode_sb_list_lock);
> +	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> +		struct address_space *mapping = inode->i_mapping;
> +
> +		spin_lock(&inode->i_lock);
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> +		    mapping->nrpages == 0) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&inode_sb_list_lock);
> +		/*
> +		 * We hold a reference to 'inode' so it couldn't have been
> +		 * removed from s_inodes list while we dropped the
> +		 * inode_sb_list_lock.  We cannot iput the inode now as we can
> +		 * be holding the last reference and we cannot iput it under
> +		 * inode_sb_list_lock. So we keep the reference and iput it
> +		 * later.
> +		 */
> +		iput(old_inode);
> +		old_inode = inode;
> +
> +		__sync_blockdev(I_BDEV(inode), wait);
> +
> +		spin_lock(&inode_sb_list_lock);
> +	}
> +	spin_unlock(&inode_sb_list_lock);
> +	iput(old_inode);

At which point we could fold this code into a blkdev_sync_fs method for
now.  Long term we'll need to support multiple BDIs per SB anyway, at
which point the code can go away again.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-14 16:29                               ` Curt Wohlgemuth
  (?)
  (?)
@ 2011-07-19 16:53                               ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-19 16:53 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Thu, Jul 14, 2011 at 09:29:34AM -0700, Curt Wohlgemuth wrote:
> Your patch looks good to me, in that it does hit all the bdevs with
> both WB_SYNC_NONE and SYNC_ALL.  However, I still say that the call to
> wakeup_flusher_threads() in sys_sync() is superfluous, at least as
> long as writeback_inodes_sb() waits for completion of the work item
> that it enqueues.

Commit 3beab0b42413e83a7907db7176b54c840fc75a81

	"sys_sync(): fix 16% performance regression in ffsb create_4k test"

suggests we need the wakeup_flusher_threads.  We might be able to get
rid of the writeback_inodes_sb call if we're lucky.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-14 23:08                               ` Jan Kara
@ 2011-07-19 16:56                                 ` Christoph Hellwig
  2011-07-21 18:35                                   ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-19 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Curt Wohlgemuth, Christoph Hellwig, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Fri, Jul 15, 2011 at 01:08:54AM +0200, Jan Kara wrote:
> >   Actually, it's the other way around: writeback_inodes_sb() is superfluous
> because of wakeup_flusher_threads(). So something like attached patch could
> improve sync times (especially in presence of other IO). So far I have only
> checked that sync times look reasonable with it but didn't really compare
> them with original kernel...

This changes the order in which ->quota_sync is called relative to
other operations; see my other mail about it.  Also, the code gets really
confusing at this point. I think you're better off no longer trying to
share code between syncfs, umount & co. and sys_sync with these bits.

You're also skipping the ->sync_fs and ->quota_sync calls in the first round.
I know for XFS sync_fs without wait is already mostly a no-op, but we'll
need to benchmark and document this change, and apply it to the non-sync
caller as well.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-19 16:51                             ` Christoph Hellwig
@ 2011-07-20 22:00                               ` Jan Kara
  2011-07-22 15:11                                 ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-07-20 22:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

On Tue 19-07-11 12:51:49, Christoph Hellwig wrote:
> On Wed, Jul 13, 2011 at 12:37:15AM +0200, Jan Kara wrote:
> > -static int __sync_filesystem(struct super_block *sb, int wait)
> > +static void __sync_filesystem(struct super_block *sb, int wait)
> >  {
> > -	/*
> > -	 * This should be safe, as we require bdi backing to actually
> > -	 * write out data in the first place
> > -	 */
> > -	if (sb->s_bdi == &noop_backing_dev_info)
> > -		return 0;
> > -
> >  	if (sb->s_qcop && sb->s_qcop->quota_sync)
> >  		sb->s_qcop->quota_sync(sb, -1, wait);
> >  
> > @@ -47,7 +40,6 @@ static int __sync_filesystem(struct super_block *sb, int wait)
> >  
> >  	if (sb->s_op->sync_fs)
> >  		sb->s_op->sync_fs(sb, wait);
> > -	return __sync_blockdev(sb->s_bdev, wait);
> 
> Removing this call breaks sys_syncfs and similar semantics on filesystems
> that just push metadata into buffers in ->write_inode or ->sync_fs and
> then expect the caller to write them out.  This list of filesystems
> includes ext2 and, in general, most filesystems without journaling or
> similar techniques.
  Note that for sys_syncfs() the __sync_blockdev() has just moved from
__sync_filesystem() to sync_filesystem(). So nothing should change.

> I'm perfectly fine with pushing the sync_blockdev call into the
> filesystem for these, but we'll need a way to handle them.
  Yes, calling sync_blockdev() from sync_fs() might actually make sense
(e.g. ext3 and ext4 don't need it) but I didn't want to go that far in
this patch.

> > +	/*
> > +	 * This should be safe, as we require bdi backing to actually
> > +	 * write out data in the first place.
> > +	 */
> > +	if (sb->s_bdi == &noop_backing_dev_info)
> > +		return 0;
> 
> There's really no reason to do that check early.  There's no reason
> not to have a filesystem that doesn't use the writeback code but
> still has a ->sync_fs method.  IMHO this check should move into
> sync_inodes_sb/writeback_inodes_sb.
  OK, will change.

> > +static void sync_all_bdevs(int wait)
> > +{
> > +	struct inode *inode, *old_inode = NULL;
> > +
> > +	spin_lock(&inode_sb_list_lock);
> > +	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> > +		struct address_space *mapping = inode->i_mapping;
> > +
> > +		spin_lock(&inode->i_lock);
> > +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> > +		    mapping->nrpages == 0) {
> > +			spin_unlock(&inode->i_lock);
> > +			continue;
> > +		}
> > +		__iget(inode);
> > +		spin_unlock(&inode->i_lock);
> > +		spin_unlock(&inode_sb_list_lock);
> > +		/*
> > +		 * We hold a reference to 'inode' so it couldn't have been
> > +		 * removed from s_inodes list while we dropped the
> > +		 * inode_sb_list_lock.  We cannot iput the inode now as we can
> > +		 * be holding the last reference and we cannot iput it under
> > +		 * inode_sb_list_lock. So we keep the reference and iput it
> > +		 * later.
> > +		 */
> > +		iput(old_inode);
> > +		old_inode = inode;
> > +
> > +		__sync_blockdev(I_BDEV(inode), wait);
> > +
> > +		spin_lock(&inode_sb_list_lock);
> > +	}
> > +	spin_unlock(&inode_sb_list_lock);
> > +	iput(old_inode);
> 
> At which point we could fold this code into a blkdev_sync_fs method for
> now.  Long term we'll need to support multiple BDIs per SB anyway, at
> which point the code can go away again.
  Ah, I had to think a bit before I understood what you mean :). It's kind
of elegant but also slightly subtle (it's not immediately obvious how
blockdevs are synced during sync when you look at the code). Umm, and you
don't have any guarantee in which order superblocks are on the sb list so
you could sync block devices before some filesystems are finished. So I
don't think using blkdev_sync_fs() is a good idea after all.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-19 16:56                                 ` Christoph Hellwig
@ 2011-07-21 18:35                                   ` Jan Kara
  2011-07-22 15:16                                     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-07-21 18:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Curt Wohlgemuth, Al Viro, linux-fsdevel, linux-kernel,
	fengguang.wu

On Tue 19-07-11 12:56:39, Christoph Hellwig wrote:
> On Fri, Jul 15, 2011 at 01:08:54AM +0200, Jan Kara wrote:
> >   Actually, it's the other way around: writeback_inodes_sb() is superfluous
> > because of wakeup_flusher_threads(). So something like attached patch could
> > improve sync times (especially in presence of other IO). So far I have only
> > checked that sync times look reasonable with it but didn't really compare
> > them with original kernel...
> 
> This changes the order in which ->quota_sync is called relative to
> other operations, see my other mail about it.  Also the code gets really
> confusing at this point; I think you're better off not trying to share
> code between syncfs, umount & co and sys_sync with these bits.
  Good point. I'll stop sharing the code for these two cases.

> You're also skipping the ->sync_fs and quotasync calls the first round.
  Well, kind of. Since writeback is running asynchronously, we have no way
to call ->sync_fs(sb, 0) just after async inode writeback is done (and it
doesn't really make sense to call it before that moment). So we call
synchronous inode writeback first and only then ->sync_fs().

> I know for XFS sync_fs without wait is already mostly a no-op, but we'll
> need to benchmark and document this change, and apply it to the non-sync
> caller as well.
  I already have some numbers:
I've checked pure sync overhead by
  sync; time for (( i = 0; i < 100; i++ )); do sync; done
the results of 5 runs are:
  avg 1.130400s, stddev 0.027369 (unpatched kernel)
  avg 1.073200s, stddev 0.040848 (patched kernel)
so as expected slightly better for patched kernel.

And test with 20000 4k files in 100 directories on xfs:
  avg 155.995600s, stddev 1.879084 (unpatched kernel)
  avg 155.942200s, stddev 1.843881 (patched kernel)
so no real difference.

The same test with ext3:
  avg 154.597200s, stddev 1.556965 (unpatched kernel)
  avg 153.109800s, stddev 1.339094 (patched kernel)
again almost no difference.

  Where would you like to document the change?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-20 22:00                               ` Jan Kara
@ 2011-07-22 15:11                                 ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-22 15:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Thu, Jul 21, 2011 at 12:00:12AM +0200, Jan Kara wrote:
> > Removing this call breaks sys_syncfs and similar semantics on filesystems
> > that just push metadata into buffers in ->write_inode or ->sync_fs and
> > then expect the caller to write them out.  This list of filesystems
> > includes ext2 and, in general, most filesystems without journaling or
> > similar techniques.
>   Note that for sys_syncfs() the __sync_blockdev() has just moved from
> __sync_filesystem() to sync_filesystem(). So nothing should change.

Sorry, I missed that.

> > I'm perfectly fine with pushing the sync_blockdev call into the
> > filesystem for these, but we'll need a way to handle them.
>   Yes, calling sync_blockdev() from sync_fs() might actually make sense
> (e.g. ext3 and ext4 don't need it), but I didn't want to go that far in
> this patch.

The more fine-grained we can split the patches, the better.  It'll help
to explain what we're doing to those trying to figure it out a few years
down the road.

> > At which point we could fold this code into a blkdev_sync_fs method for
> > now.  Long term we'll need to support multiple BDIs per SB anyway, at
> > which point the code can go away again.
>   Ah, I had to think a bit before I understood what you mean :). It's kind
> of elegant but also slightly subtle (it's not immediately obvious how
> blockdevs are synced during sync when you look at the code). Umm, and you
> don't have any guarantee in which order superblocks are on the sb list so
> you could sync block devices before some filesystems are finished. So I
> don't think using blkdev_sync_fs() is a good idea after all.

This would require calling sync_blockdev() from the ->sync_fs of those
filesystems that actually need it, as a prerequisite of course.



* Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
  2011-07-21 18:35                                   ` Jan Kara
@ 2011-07-22 15:16                                     ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-22 15:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Curt Wohlgemuth, Al Viro, linux-fsdevel,
	linux-kernel, fengguang.wu

On Thu, Jul 21, 2011 at 08:35:23PM +0200, Jan Kara wrote:
> > You're also skipping the ->sync_fs and quotasync calls the first round.
>   Well, kind of. Since writeback is running asynchronously, we have no way
> to call ->sync_fs(sb, 0) just after async inode writeback is done (and it
> doesn't really make sense to call it before that moment). So we call
> synchronous inode writeback first and only then ->sync_fs().

I don't necessarily say it's a bad thing, but when you make subtle
behaviour changes like this it at least needs to be clearly documented
in the changelog, or, even better, split into a patch of its own.

>   Where would you like to document the change?

commit log.



end of thread, other threads:[~2011-07-22 15:16 UTC | newest]

Thread overview: 38+ messages
2011-06-28 23:43 [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr Curt Wohlgemuth
2011-06-29  0:54 ` Dave Chinner
2011-06-29  1:56   ` Curt Wohlgemuth
2011-06-29  1:56     ` Curt Wohlgemuth
2011-06-29  8:11     ` Christoph Hellwig
2011-06-29 16:57       ` Jan Kara
2011-06-29 17:55         ` Christoph Hellwig
2011-06-29 19:15           ` Jan Kara
2011-06-29 20:12             ` Christoph Hellwig
2011-06-30 12:15               ` Jan Kara
2011-06-30 12:37                 ` Christoph Hellwig
2011-07-01 22:55             ` Curt Wohlgemuth
2011-07-02 11:32               ` Christoph Hellwig
2011-07-11 17:00               ` Jan Kara
2011-07-11 17:11                 ` Curt Wohlgemuth
2011-07-11 19:48                   ` Jan Kara
2011-07-11 19:51                     ` Curt Wohlgemuth
2011-07-11 19:51                       ` Curt Wohlgemuth
2011-07-11 20:11                     ` Christoph Hellwig
2011-07-12 10:34                       ` Jan Kara
2011-07-12 10:41                         ` Christoph Hellwig
2011-07-12 22:37                           ` Jan Kara
2011-07-14 16:29                             ` Curt Wohlgemuth
2011-07-14 16:29                               ` Curt Wohlgemuth
2011-07-14 23:08                               ` Jan Kara
2011-07-19 16:56                                 ` Christoph Hellwig
2011-07-21 18:35                                   ` Jan Kara
2011-07-22 15:16                                     ` Christoph Hellwig
2011-07-19 16:53                               ` Christoph Hellwig
2011-07-19 16:51                             ` Christoph Hellwig
2011-07-20 22:00                               ` Jan Kara
2011-07-22 15:11                                 ` Christoph Hellwig
2011-06-29 17:26       ` Curt Wohlgemuth
2011-06-29 17:26         ` Curt Wohlgemuth
2011-06-29 18:00         ` Christoph Hellwig
2011-06-29 21:30           ` Curt Wohlgemuth
2011-07-19 15:54             ` Christoph Hellwig
2011-06-29  6:42   ` Christoph Hellwig
