* dirty balancing deadlock
@ 2007-02-18 18:28 ` Miklos Szeredi
  0 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-18 18:28 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: akpm

I was testing the new fuse shared writable mmap support, and finding
that bash-shared-mapping deadlocks (which isn't so strange ;).  What
is more strange is that this is not an OOM situation at all, with
plenty of free and cached pages.

A little more investigation shows that a similar deadlock happens
reliably with bash-shared-mapping on a loopback mount, even if only
half the total memory is used.

The cause is slightly different in the two cases:

  - loopback mount: allocation by the underlying filesystem is stalled
    on throttle_vm_writeout()

  - fuse-loop: page dirtying on the underlying filesystem is stalled on
    balance_dirty_pages()

In both cases the underlying fs is totally innocent, with no
dirty/writeback pages, yet it's waiting for the global dirty+writeback
to go below the threshold, which obviously won't, until the
allocation/dirtying succeeds.

I'm not quite sure what the solution is, and asking for thoughts.

Ideas:

  - per filesystem dirty counters.  If filesystem is clean (or dirty
    is below some minimum), then balance_dirty_pages() should not wait
    any more

  - throttle_vm_writeout() was meant to throttle swapping, no?  So in
    that case there should be a separate swap-writeback counter
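
A minimal sketch of the first idea, written as a toy Python model rather than kernel code.  The names Filesystem, must_throttle_global, must_throttle_per_fs and the fs_min parameter are all hypothetical; fs_min stands in for "some minimum" above.  It only illustrates the decision rule and assumes the current check looks at the global dirty+writeback total alone.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    name: str
    dirty: int = 0       # pages dirtied but not yet under writeback
    writeback: int = 0   # pages currently under writeback


def must_throttle_global(fs, all_fs, dirty_thresh):
    """Global rule: throttle any writer while the total is over the threshold."""
    total = sum(f.dirty + f.writeback for f in all_fs)
    return total > dirty_thresh


def must_throttle_per_fs(fs, all_fs, dirty_thresh, fs_min=0):
    """Per-filesystem rule: a filesystem at or below fs_min is never throttled."""
    if fs.dirty + fs.writeback <= fs_min:
        return False
    return must_throttle_global(fs, all_fs, dirty_thresh)


# Hypothetical state: the upper fs holds all the dirty pages, the lower is clean.
upper = Filesystem("upper", dirty=1099)
lower = Filesystem("lower")
print(must_throttle_global(lower, [upper, lower], 1000))   # True
print(must_throttle_per_fs(lower, [upper, lower], 1000))   # False
```

With the per-filesystem rule the clean lower filesystem is allowed to proceed, while the upper one is still throttled as before.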

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 18:28 ` Miklos Szeredi
@ 2007-02-18 20:53   ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2007-02-18 20:53 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> I was testing the new fuse shared writable mmap support, and finding
> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> is more strange is that this is not an OOM situation at all, with
> plenty of free and cached pages.
> 
> A little more investigation shows that a similar deadlock happens
> reliably with bash-shared-mapping on a loopback mount, even if only
> half the total memory is used.
> 
> The cause is slightly different in the two cases:
> 
>   - loopback mount: allocation by the underlying filesystem is stalled
>     on throttle_vm_writeout()
> 
>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
>     balance_dirty_pages()
> 
> In both cases the underlying fs is totally innocent, with no
> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> to go below the threshold, which obviously won't, until the
> allocation/dirtying succeeds.
> 
> I'm not quite sure what the solution is, and asking for thoughts.

But....  these things don't just throttle.  They also perform large amounts
of writeback, which causes the dirty levels to subside.

From your description it appears that this writeback isn't happening, or
isn't working.  How come?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 20:53   ` Andrew Morton
@ 2007-02-18 21:25     ` Rik van Riel
  -1 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2007-02-18 21:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miklos Szeredi, linux-kernel, linux-mm

Andrew Morton wrote:
> On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
>> I was testing the new fuse shared writable mmap support, and finding
>> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
>> is more strange is that this is not an OOM situation at all, with
>> plenty of free and cached pages.
>>
>> A little more investigation shows that a similar deadlock happens
>> reliably with bash-shared-mapping on a loopback mount, even if only
>> half the total memory is used.
>>
>> The cause is slightly different in the two cases:
>>
>>   - loopback mount: allocation by the underlying filesystem is stalled
>>     on throttle_vm_writeout()
>>
>>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
>>     balance_dirty_pages()
>>
>> In both cases the underlying fs is totally innocent, with no
>> dirty/writeback pages, yet it's waiting for the global dirty+writeback
>> to go below the threshold, which obviously won't, until the
>> allocation/dirtying succeeds.
>>
>> I'm not quite sure what the solution is, and asking for thoughts.
> 
> But....  these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
> 
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

Is the fuse daemon trying to do writeback to itself, perhaps?

That is, trying to write out data to the FUSE filesystem, for which
it is also the server.


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 20:53   ` Andrew Morton
@ 2007-02-18 22:50     ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-18 22:50 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> > I was testing the new fuse shared writable mmap support, and finding
> > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > is more strange is that this is not an OOM situation at all, with
> > plenty of free and cached pages.
> > 
> > A little more investigation shows that a similar deadlock happens
> > reliably with bash-shared-mapping on a loopback mount, even if only
> > half the total memory is used.
> > 
> > The cause is slightly different in the two cases:
> > 
> >   - loopback mount: allocation by the underlying filesystem is stalled
> >     on throttle_vm_writeout()
> > 
> >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >     balance_dirty_pages()
> > 
> > In both cases the underlying fs is totally innocent, with no
> > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > to go below the threshold, which obviously won't, until the
> > allocation/dirtying succeeds.
> > 
> > I'm not quite sure what the solution is, and asking for thoughts.
> 
> But....  these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
> 
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

 - filesystems A and B
 - write to A will end up as write to B
 - dirty pages in A manage to go over dirty_threshold
 - page writeback is started from A
 - this triggers writeback for a couple of pages in B
 - writeback finishes normally, but dirty+writeback pages are still
   over threshold
 - balance_dirty_pages in B gets stuck, nothing ever moves after this

At least this is my theory for what happens.

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 21:25     ` Rik van Riel
@ 2007-02-18 22:54       ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-18 22:54 UTC (permalink / raw)
  To: riel; +Cc: akpm, miklos, linux-kernel, linux-mm

> Andrew Morton wrote:
> > On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> >> I was testing the new fuse shared writable mmap support, and finding
> >> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> >> is more strange is that this is not an OOM situation at all, with
> >> plenty of free and cached pages.
> >>
> >> A little more investigation shows that a similar deadlock happens
> >> reliably with bash-shared-mapping on a loopback mount, even if only
> >> half the total memory is used.
> >>
> >> The cause is slightly different in the two cases:
> >>
> >>   - loopback mount: allocation by the underlying filesystem is stalled
> >>     on throttle_vm_writeout()
> >>
> >>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >>     balance_dirty_pages()
> >>
> >> In both cases the underlying fs is totally innocent, with no
> >> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> >> to go below the threshold, which obviously won't, until the
> >> allocation/dirtying succeeds.
> >>
> >> I'm not quite sure what the solution is, and asking for thoughts.
> > 
> > But....  these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> > 
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
> 
> Is the fuse daemon trying to do writeback to itself, perhaps?
> 
> That is, trying to write out data to the FUSE filesystem, for which
> it is also the server.

No.  It's trying to write out data to a different filesystem.

Trying to write out data to itself very obviously deadlocks, but that
doesn't affect anything beside the stupid filesystem itself, and there
are mechanisms for aborting such a situation (forced umount, abort
through fuse-control filesystem).

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 22:50     ` Miklos Szeredi
@ 2007-02-18 22:59       ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2007-02-18 22:59 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Sun, 18 Feb 2007 23:50:14 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > I was testing the new fuse shared writable mmap support, and finding
> > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > is more strange is that this is not an OOM situation at all, with
> > > plenty of free and cached pages.
> > > 
> > > A little more investigation shows that a similar deadlock happens
> > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > half the total memory is used.
> > > 
> > > The cause is slightly different in the two cases:
> > > 
> > >   - loopback mount: allocation by the underlying filesystem is stalled
> > >     on throttle_vm_writeout()
> > > 
> > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > >     balance_dirty_pages()
> > > 
> > > In both cases the underlying fs is totally innocent, with no
> > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > to go below the threshold, which obviously won't, until the
> > > allocation/dirtying succeeds.
> > > 
> > > I'm not quite sure what the solution is, and asking for thoughts.
> > 
> > But....  these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> > 
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
> 
>  - filesystems A and B
>  - write to A will end up as write to B
>  - dirty pages in A manage to go over dirty_threshold
>  - page writeback is started from A
>  - this triggers writeback for a couple of pages in B
>  - writeback finishes normally, but dirty+writeback pages are still
>    over threshold
>  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> 
> At least this is my theory for what happens.
> 

Is B a real filesystem?  If so, writes to B will decrease the dirty memory
threshold.

The writeout code _should_ just sit there transferring dirtiness from A to
B and cleaning pages via B, looping around, alternating between both.

What does sysrq-t say?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 22:59       ` Andrew Morton
@ 2007-02-18 23:22         ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-18 23:22 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > > > I was testing the new fuse shared writable mmap support, and finding
> > > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > > is more strange is that this is not an OOM situation at all, with
> > > > plenty of free and cached pages.
> > > > 
> > > > A little more investigation shows that a similar deadlock happens
> > > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > > half the total memory is used.
> > > > 
> > > > The cause is slightly different in the two cases:
> > > > 
> > > >   - loopback mount: allocation by the underlying filesystem is stalled
> > > >     on throttle_vm_writeout()
> > > > 
> > > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > > >     balance_dirty_pages()
> > > > 
> > > > In both cases the underlying fs is totally innocent, with no
> > > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > > to go below the threshold, which obviously won't, until the
> > > > allocation/dirtying succeeds.
> > > > 
> > > > I'm not quite sure what the solution is, and asking for thoughts.
> > > 
> > > But....  these things don't just throttle.  They also perform large amounts
> > > of writeback, which causes the dirty levels to subside.
> > > 
> > > From your description it appears that this writeback isn't happening, or
> > > isn't working.  How come?
> > 
> >  - filesystems A and B
> >  - write to A will end up as write to B
> >  - dirty pages in A manage to go over dirty_threshold
> >  - page writeback is started from A
> >  - this triggers writeback for a couple of pages in B
> >  - writeback finishes normally, but dirty+writeback pages are still
> >    over threshold
> >  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> > 
> > At least this is my theory for what happens.
> > 
> 
> Is B a real filesystem?

Yes.

> If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties 1100 pages and the limit is 1000.
Some pages are queued for writeback (it doesn't matter how many).  B writes
back 1, so 1099 dirty pages remain in A and zero in B.  balance_dirty_pages()
for B doesn't know that there's nothing more to write back for B; it's
just waiting there for those 1099, which will never get written.
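
The numbers above can be replayed in a toy Python model.  This is only a sketch under stated assumptions, not the kernel's logic: simulate() and its parameters are hypothetical, and it assumes that A's pages can only be cleaned by a write going through B, and that B's writer sleeps whenever the global count is over the limit.

```python
def simulate(dirty_a=1100, limit=1000, max_rounds=5):
    """Return ("stuck" | "progress", dirty_a, dirty_b) for the A-over-B setup."""
    dirty_b = 0

    # One page makes it through B before B's writer hits the throttle check.
    dirty_a -= 1

    for _ in range(max_rounds):
        if dirty_a + dirty_b <= limit:
            # Global count is under the limit: B's writer is not throttled.
            return ("progress", dirty_a, dirty_b)

        # B's writer is throttled.  It has nothing of its own to flush,
        # so waiting cannot lower the global count from B's side.
        flushed_from_b = min(dirty_b, 1)   # always 0 in this scenario
        dirty_b -= flushed_from_b

        # A's writeback needs B's writer, which is asleep: dirty_a is unchanged.

    return ("stuck", dirty_a, dirty_b)


print(simulate())              # ('stuck', 1099, 0)
print(simulate(dirty_a=900))   # ('progress', 899, 0)
```

No number of extra rounds changes the first result, since neither counter can move while B's writer is asleep.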

> The writeout code _should_ just sit there transferring dirtiness from A to
> B and cleaning pages via B, looping around, alternating between both.
> 
> What does sysrq-t say?

This is the fuse daemon thread that got stuck.  There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401     0   527    493           533   495 (NOTLB)
088d55f8 00000001 00000000 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000
       08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000
       0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680
Call Trace:
08dcfb00:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08dcfb18:  [<0805a38a>] _switch_to+0x49/0x99
08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08dcff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08dcff64:  [<0806d56c>] handle_trap+0x27/0x121
08dcff8c:  [<0806dc65>] userspace+0x1de/0x226
08dcffe4:  [<0805da19>] fork_handler+0x76/0x88
08dcfffc:  [<d4cf0007>] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0)
    at mm/page-writeback.c:202
202                             dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)
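
As a quick cross-check of the figures above, the following sketch compares the /proc/vmstat counters with the dirty_thresh value printed by gdb.  It assumes the throttle check sums nr_dirty, nr_unstable and nr_writeback; pages_over_threshold() is a hypothetical helper, not a kernel function.

```python
# Values copied from the /proc/vmstat dump and the gdb session above.
vmstat = {"nr_dirty": 2174, "nr_writeback": 10, "nr_unstable": 0}
dirty_thresh = 2113


def pages_over_threshold(counters, thresh):
    """How far dirty + unstable + writeback pages exceed the threshold."""
    total = (counters["nr_dirty"]
             + counters["nr_unstable"]
             + counters["nr_writeback"])
    return total - thresh


print(pages_over_threshold(vmstat, dirty_thresh))   # 71
```

So the system sits 71 pages over the threshold, and nr_dirty alone (2174) is already above 2113.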

For completeness' sake, here's the backtrace for the stuck loopback as
well:

loop0         D BFFFE101     0   499      5           500    59 (L-TLB)
088cc578 00000001 00000000 09197c4c 0805d8cb 084fe6f8 088cc578 09190000
       09190000 09197c74 0805a38a 084fe200 088cc080 09197c64 09190000 09190000
       086d9c80 088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200
Call Trace:
09197c38:  [<0805d8cb>] switch_to_skas+0x3b/0x83
09197c50:  [<0805a38a>] _switch_to+0x49/0x99
09197c78:  [<08182ab6>] schedule+0x246/0x547
09197cd0:  [<081834d3>] schedule_timeout+0x4e/0xb6
09197d04:  [<08183461>] io_schedule_timeout+0x11/0x20
09197d0c:  [<080a0c62>] congestion_wait+0x72/0x87
09197d3c:  [<0809c7e8>] throttle_vm_writeout+0x27/0x6a
09197d60:  [<0809faec>] shrink_zone+0xaf/0x103
09197d8c:  [<0809fbb2>] shrink_zones+0x72/0x8a
09197db0:  [<0809fc87>] try_to_free_pages+0xbd/0x185
09197dfc:  [<0809ba76>] __alloc_pages+0x155/0x335
09197e50:  [<080975eb>] find_or_create_page+0x85/0x99
09197e78:  [<0812785e>] do_lo_send_aops+0x8d/0x233
09197ee4:  [<08127c56>] lo_send+0x92/0x10d
09197f20:  [<08127ee6>] do_bio_filebacked+0x6d/0x74
09197f44:  [<081280e0>] loop_thread+0x89/0x188
09197f84:  [<0808a03a>] kthread+0xa7/0xab
09197fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
09197fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
09197ffc:  [<00000000>] nosmp+0xf7fb7000/0x14

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 23:22         ` Miklos Szeredi
@ 2007-02-18 23:59           ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2007-02-18 23:59 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
> 
> Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how many).  B writes
> back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.
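
The arithmetic in the quoted example can be reproduced with a toy model.
This is only a sketch under stated assumptions: the `toy_*` names, the
batch size of 32 and the bounded number of rounds are invented for
illustration (the real balance_dirty_pages() loop is not bounded like
this), and only page counts are modelled, not kernel structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a filesystem is reduced to its count of dirty pages. */
struct toy_fs_state {
    long dirty;
};

/* Global check as described in the thread: the writer is throttled while
 * the sum of dirty pages over all filesystems exceeds the limit. */
static bool toy_over_limit(const struct toy_fs_state *a,
                           const struct toy_fs_state *b, long limit)
{
    return a->dirty + b->dirty > limit;
}

/* Write back up to nr pages of one filesystem; returns pages cleaned. */
static long toy_write_back(struct toy_fs_state *fs, long nr)
{
    assert(nr >= 0);
    long done = nr < fs->dirty ? nr : fs->dirty;
    fs->dirty -= done;
    return done;
}

/* A writer to B can only clean B's own pages.  Returns true if the global
 * count drops to the limit within max_rounds, false if it stays stuck. */
static bool toy_balance_for_b(struct toy_fs_state *a, struct toy_fs_state *b,
                              long limit, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        if (!toy_over_limit(a, b, limit))
            return true;
        toy_write_back(b, 32);   /* hypothetical batch size */
    }
    return !toy_over_limit(a, b, limit);
}
```

With 1099 dirty pages in A, one in B and a limit of 1000, the writer to B
cleans its single page and then has nothing left to write, so the global
count never drops below the limit no matter how many rounds it is given.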

hm, OK, arguable.  I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
+		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
 			list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush?  It should be busily transferring dirtiness from A to
B.

> > The writeout code _should_ just sit there transferring dirtiness from A to
> > B and cleaning pages via B, looping around, alternating between both.
> > 
> > What does sysrq-t say?
> 
> This is the fuse daemon thread that got stuck.

Where's pdflush?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 23:59           ` Andrew Morton
@ 2007-02-19  0:25             ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19  0:25 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > > If so, writes to B will decrease the dirty memory threshold.
> > 
> > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > Some pages queued for writeback (doesn't matter how many).  B writes
> > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > B doesn't know that there's nothing more to write back for B, it's
> > just waiting there for those 1099, which'll never get written.
> 
> hm, OK, arguable.  I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678     0   505    494           506   504 (NOTLB)
08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
       08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
       085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
Call Trace:
08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
08f9f9e0:  [<08183006>] schedule+0x246/0x547
08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
08f9faac:  [<0809ce3f>] __writepage+0x1e/0x3d
08f9fac0:  [<0809cd39>] write_cache_pages+0x222/0x30a
08f9fb44:  [<0809ce8d>] generic_writepages+0x2f/0x35
08f9fb5c:  [<0809ced6>] do_writepages+0x43/0x45
08f9fb70:  [<080cb8d2>] __writeback_single_inode+0xbc/0x173
08f9fbb8:  [<080cbb30>] sync_sb_inodes+0x1a7/0x260
08f9fbe8:  [<080cbc54>] writeback_inodes+0x6b/0x81
08f9fc04:  [<0809c640>] balance_dirty_pages+0x55/0x153
08f9fc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08f9fddc:  [<080ea206>] ext3_file_write+0x39/0xaf
08f9fe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08f9febc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08f9feec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08f9ff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08f9ff64:  [<0806d56c>] handle_trap+0x27/0x121
08f9ff8c:  [<0806dc65>] userspace+0x1de/0x226
08f9ffe4:  [<0805da19>] fork_handler+0x76/0x88
08f9fffc:  [<00000000>] nosmp+0xf7fb7000/0x14


> but where's pdflush?  It should be busily transferring dirtiness from A to
> B.

The transfer of dirtiness from A to B goes through the narrow channel
of i_mutex.  And once that is plugged by the stuck balance_dirty_pages(),
nothing else can pass through.

> > > The writeout code _should_ just sit there transferring dirtiness from A to
> > > B and cleaning pages via B, looping around, alternating between both.
> > > 
> > > What does sysrq-t say?
> > 
> > This is the fuse daemon thread that got stuck.
> 
> Where's pdflush?

Doing nothing I guess.  The request queue for the fuse filesystem is
full, so writepage with wbc->nonblocking=1 will be skipped.

pdflush       D 40045401     0    23      5            24    12 (L-TLB)
088d5bf8 00000001 00000000 08907df8 0805d8cb 088d55f8 088d5bf8 08900000
       08900000 08907e20 0805a38a 088d5100 088d5700 08907e10 08900000 08900000
       0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100
Call Trace:
08907de4:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08907dfc:  [<0805a38a>] _switch_to+0x49/0x99
08907e24:  [<08182fe6>] schedule+0x246/0x547
08907e7c:  [<08183a03>] schedule_timeout+0x4e/0xb6
08907eb0:  [<08183991>] io_schedule_timeout+0x11/0x20
08907eb8:  [<080a0cf2>] congestion_wait+0x72/0x87
08907ee8:  [<0809c860>] background_writeout+0x35/0xa4
08907f38:  [<0809d41e>] __pdflush+0xae/0x152
08907f54:  [<0809d4f5>] pdflush+0x33/0x39
08907f84:  [<0808a03a>] kthread+0xa7/0xab
08907fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
08907fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
08907ffc:  [<00000000>] nosmp+0xf7fb7000/0x14

pdflush       D 40045401     0    24      5            25    23 (L-TLB)
081e1458 00000001 00000000 088ffe00 0805d8cb 088d5bf8 081e1458 088f8000
       088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000
       0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700
Call Trace:
088ffdec:  [<0805d8cb>] switch_to_skas+0x3b/0x83
088ffe04:  [<0805a38a>] _switch_to+0x49/0x99
088ffe2c:  [<08182fe6>] schedule+0x246/0x547
088ffe84:  [<08183a03>] schedule_timeout+0x4e/0xb6
088ffeb8:  [<08183991>] io_schedule_timeout+0x11/0x20
088ffec0:  [<080a0cf2>] congestion_wait+0x72/0x87
088ffef0:  [<0809c98c>] wb_kupdate+0x93/0xd9
088fff38:  [<0809d41e>] __pdflush+0xae/0x152
088fff54:  [<0809d4f5>] pdflush+0x33/0x39
088fff84:  [<0808a03a>] kthread+0xa7/0xab
088fffb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
088fffe0:  [<0805d975>] new_thread_handler+0x62/0x8b
088ffffc:  [<00000000>] nosmp+0xf7fb7000/0x14
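
The pdflush behaviour described above, where a writepage with
wbc->nonblocking=1 is skipped because the request queue is full, can be
illustrated with a small model.  This is a sketch only: `toy_queue`,
`toy_writepage`, `toy_background_pass` and the result codes are made-up
names, not fuse or kernel interfaces, and a caller that would sleep is
reported with a return code instead of actually blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for the toy model. */
enum toy_wb_result { TOY_WB_SUBMITTED, TOY_WB_SKIPPED, TOY_WB_WOULD_BLOCK };

/* A request queue with a fixed number of slots. */
struct toy_queue {
    int capacity;
    int in_flight;
};

/* Try to submit one page.  With nonblocking set, the page is skipped (and
 * stays dirty) when the queue is full.  Without it the caller would have
 * to sleep, which the model reports as TOY_WB_WOULD_BLOCK. */
static enum toy_wb_result toy_writepage(struct toy_queue *q, bool nonblocking)
{
    assert(q->capacity >= 0 && q->in_flight >= 0);
    if (q->in_flight >= q->capacity)
        return nonblocking ? TOY_WB_SKIPPED : TOY_WB_WOULD_BLOCK;
    q->in_flight++;
    return TOY_WB_SUBMITTED;
}

/* One background pass over nr_dirty pages in nonblocking mode.
 * Returns how many pages were actually submitted. */
static int toy_background_pass(struct toy_queue *q, int nr_dirty)
{
    int submitted = 0;
    for (int i = 0; i < nr_dirty; i++)
        if (toy_writepage(q, true) == TOY_WB_SUBMITTED)
            submitted++;
    return submitted;
}
```

With a full queue the background pass submits nothing and returns at once,
which matches the traces above: both pdflush threads are merely waiting in
congestion_wait() rather than writing anything.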

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  0:25             ` Miklos Szeredi
@ 2007-02-19  0:30               ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19  0:30 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>  			continue;		/* Skip a congested blockdev */
>  		}
>  
> -		if (wbc->bdi && bdi != wbc->bdi) {
> +		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>  			if (!sb_is_blkdev_sb(sb))
>  				break;		/* fs has the wrong queue */
>  			list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.
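
The window between the check and the submission can be made explicit in a
toy model.  This is a sketch under assumptions: `toy_bdi`,
`toy_skip_inode` and `toy_check_then_submit` are hypothetical names, the
skip condition only mirrors the logic of the quoted hunk, and the
concurrent submitters are simulated by a counter bumped between the two
steps instead of by real threads.

```c
#include <assert.h>
#include <stdbool.h>

/* A backing device reduced to a slot counter. */
struct toy_bdi {
    int capacity;   /* request slots */
    int in_flight;  /* slots currently in use */
};

static bool toy_congested(const struct toy_bdi *bdi)
{
    return bdi->in_flight >= bdi->capacity;
}

/* Logic of the patched condition: an inode on a foreign queue is skipped
 * only if that queue is congested at the time of the check. */
static bool toy_skip_inode(const struct toy_bdi *target,
                           const struct toy_bdi *inode_bdi)
{
    return target && inode_bdi != target && toy_congested(inode_bdi);
}

/* Check, let other submitters queue `racing` requests, then submit.
 * Returns true if the submission would have had to block. */
static bool toy_check_then_submit(const struct toy_bdi *target,
                                  struct toy_bdi *inode_bdi, int racing)
{
    assert(racing >= 0);
    if (toy_skip_inode(target, inode_bdi))
        return false;                  /* inode skipped, nothing blocks */
    inode_bdi->in_flight += racing;    /* queue fills after the check */
    return toy_congested(inode_bdi);   /* state seen by the submission */
}
```

A queue with three of four slots in use passes the check, yet a single
racing request is enough to make the following submission block.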

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  0:25             ` Miklos Szeredi
@ 2007-02-19  0:45               ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19  0:45 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> > > > If so, writes to B will decrease the dirty memory threshold.
> > > 
> > > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how many).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> > 
> > hm, OK, arguable.  I guess something like this..
> 
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No, sorry, it doesn't even help the loopback deadlock.  The deadlock
sometimes just takes quite a while to trigger...

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  0:25             ` Miklos Szeredi
@ 2007-02-19  0:45               ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-19  0:45 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote:
> > > > If so, writes to B will decrease the dirty memory threshold.
> > > 
> > > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how many).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> > 
> > hm, OK, arguable.  I guess something like this..
> 
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.
> 
> For fuse it's worse with the patch: now the write triggered by the
> balance recurses into fuse, with disastrous results, since the fuse
> writeback is now blocked on the userspace queue.
> 
> fusexmp_fh_no D 40136678     0   505    494           506   504 (NOTLB)
> 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
>        08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
>        085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> Call Trace:
> 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c

In general, writepage is supposed to do work without blocking on
expensive locks that will get pdflush and dirty reclaim stuck in this
fashion.  You'll probably have to take the same approach reiserfs does
in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
is going to block without making progress.

Queue it somewhere else (i.e. an internal fs cleaning thread) and leave
the page dirty so that we can move on to other pages that have a chance
of being cleaned.
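
The approach suggested above (never sleep in writepage, leave the page
dirty, and hand it to a cleaning thread) might look roughly like the
following model.  It is a sketch, not reiserfs or fuse code: all `toy_*`
names and the fixed-size deferred list are assumptions, and taking a
request slot is treated as cleaning the page immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEFERRED 16

struct toy_page {
    bool dirty;
    bool deferred;   /* already handed to the cleaner */
};

struct toy_fs {
    int free_slots;                              /* request slots free now */
    struct toy_page *deferred[TOY_MAX_DEFERRED]; /* pages left for the cleaner */
    size_t nr_deferred;
};

/* Never sleeps.  Returns true if the page was cleaned, false if it was
 * left dirty so the caller can move on to other pages. */
static bool toy_writepage_nowait(struct toy_fs *fs, struct toy_page *page)
{
    assert(page->dirty);
    if (fs->free_slots > 0) {
        fs->free_slots--;
        page->dirty = false;
        return true;
    }
    /* Would block: keep the page dirty and remember it for the cleaner. */
    if (!page->deferred && fs->nr_deferred < TOY_MAX_DEFERRED) {
        fs->deferred[fs->nr_deferred++] = page;
        page->deferred = true;
    }
    return false;
}

/* Body of the cleaning thread, run when slots may have become free.
 * Returns the number of deferred pages cleaned in this run. */
static size_t toy_cleaner_run(struct toy_fs *fs)
{
    size_t cleaned = 0, kept = 0;
    for (size_t i = 0; i < fs->nr_deferred; i++) {
        struct toy_page *page = fs->deferred[i];
        if (fs->free_slots > 0) {
            fs->free_slots--;
            page->dirty = false;
            page->deferred = false;
            cleaned++;
        } else {
            fs->deferred[kept++] = page;   /* still waiting for a slot */
        }
    }
    fs->nr_deferred = kept;
    return cleaned;
}
```

The point of the split is that only the cleaner ever waits for a slot, so
callers such as pdflush or reclaim are never parked inside writepage.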

-chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  0:45               ` Chris Mason
@ 2007-02-19  0:54                 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19  0:54 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > 
> > > > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > > > Some pages queued for writeback (doesn't matter how many).  B writes
> > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > B doesn't know that there's nothing more to write back for B, it's
> > > > just waiting there for those 1099, which'll never get written.
> > > 
> > > hm, OK, arguable.  I guess something like this..
> > 
> > Doesn't help the fuse case, but does seem to help the loopback mount
> > one.
> > 
> > For fuse it's worse with the patch: now the write triggered by the
> > balance recurses into fuse, with disastrous results, since the fuse
> > writeback is now blocked on the userspace queue.
> > 
> > fusexmp_fh_no D 40136678     0   505    494           506   504 (NOTLB)
> > 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> >        08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> >        085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > Call Trace:
> > 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> > 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> > 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> > 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
> 
> In general, writepage is supposed to do work without blocking on
> expensive locks that will get pdflush and dirty reclaim stuck in this
> fashion.  You'll probably have to take the same approach reiserfs does
> in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> is going to block without making progress.

Pdflush and dirty reclaim set wbc->nonblocking to true;
balance_dirty_pages() and fsync don't.  The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  0:54                 ` Miklos Szeredi
@ 2007-02-19  1:01                   ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-19  1:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote:
> > > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > > 
> > > > > > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > > > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > > B doesn't know that there's nothing more to write back for B, it's
> > > > > just waiting there for those 1099, which'll never get written.
> > > > 
> > > > hm, OK, arguable.  I guess something like this..
> > > 
> > > Doesn't help the fuse case, but does seem to help the loopback mount
> > > one.
> > > 
> > > For fuse it's worse with the patch: now the write triggered by the
> > > balance recurses into fuse, with disastrous results, since the fuse
> > > writeback is now blocked on the userspace queue.
> > > 
> > > fusexmp_fh_no D 40136678     0   505    494           506   504 (NOTLB)
> > > 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > >        08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > >        085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 Call Trace:
> > > 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> > > 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> > > 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> > > 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > > 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
> > 
> > In general, writepage is supposed to do work without blocking on
> > expensive locks that will get pdflush and dirty reclaim stuck in this
> > fashion.  You'll probably have to take the same approach reiserfs does
> > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > is going to block without making progress.
> 
> Pdflush and dirty reclaim set wbc->nonblocking to true.
> balance_dirty_pages and fsync don't.  The problem here is that
> Andrew's patch is wrong to let balance_dirty_pages() try to write back
> pages from a different queue.

async or sync, writepage is supposed to either make progress or bail.
loopback aside, if the fuse call is blocking long term, you're going to
run into problems.

-chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  1:01                   ` Chris Mason
@ 2007-02-19  1:14                     ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19  1:14 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion.  You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > is going to block without making progress.
> > 
> > Pdflush and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't.  The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
> 
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-18 23:59           ` Andrew Morton
@ 2007-02-19 17:11             ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19 17:11 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

How about this?

Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
I'll try to tackle that one as well.

If the per-bdi dirty counter goes below 16, balance_dirty_pages()
returns.

Does the constant need to be tunable?  If it's too large, then the global
threshold is more easily exceeded.  If it's too small, then in a tight
situation progress will be slower.

Thanks,
Miklos

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
@@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
 			dirty_thresh)
 				break;
 
+		/*
+		 * Acquit this producer if there's little or nothing
+		 * to write back to this particular queue
+		 *
+		 * Without this check a deadlock is possible in the
+		 * following case:
+		 *
+		 * - filesystem A writes data through filesystem B
+		 * - filesystem A has dirty pages over dirty_thresh
+		 * - writeback is started, this triggers a write in B
+		 * - balance_dirty_pages() is called synchronously
+		 * - the write to B blocks
+		 * - the writeback completes, but dirty is still over threshold
+		 * - the blocking write prevents further writes from happening
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 16)
+			break;
+
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 

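The exit condition can be sketched in user space as follows.  This is
only an illustration under assumptions: toy_bdi and toy_may_proceed are
hypothetical names, and the numbers in the note below come from the A/B
example earlier in the thread, not from a measurement.

```c
#include <stdbool.h>

#define BDI_MIN_PAGES 16	/* the constant used in the patch above */

/* Hypothetical per-queue counters, mirroring bdi->nr_dirty and
 * bdi->nr_writeback from the patch. */
struct toy_bdi {
	long nr_dirty;
	long nr_writeback;
};

/*
 * Returns true if a writer to @bdi may stop throttling: either the
 * global count is at or below the threshold, or this queue has
 * (almost) nothing left that it could write back itself.
 */
static bool toy_may_proceed(const struct toy_bdi *bdi,
			    long global_dirty, long dirty_thresh)
{
	if (global_dirty <= dirty_thresh)
		return true;
	return bdi->nr_dirty + bdi->nr_writeback < BDI_MIN_PAGES;
}
```

With global dirty at 1099 against a threshold of 1000, a writer to the
queue holding all 1099 pages stays throttled, while a writer to a clean
queue is let through instead of waiting for pages it cannot clean.
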
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19 17:11             ` Miklos Szeredi
@ 2007-02-19 23:12               ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19 23:12 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Similar in spirit, this should solve the deadlock on throttle_vm_writeout().
Totally untested.

Does this approach look workable?

Thanks,
Miklos


Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2007-02-19 23:39:36.000000000 +0100
+++ linux/include/linux/swap.h	2007-02-20 00:03:38.000000000 +0100
@@ -277,10 +277,14 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+	atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
 #define total_swapcache_pages			0UL
+#define nr_swap_writeback			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-19 23:43:03.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-20 00:03:49.000000000 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -332,6 +333,9 @@ void throttle_vm_writeout(void)
                 if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
+
+		if (nr_swap_writeback < 16)
+			break;
                 congestion_wait(WRITE, HZ/10);
         }
 }
Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c	2007-02-19 23:24:23.000000000 +0100
+++ linux/mm/page_io.c	2007-02-19 23:42:21.000000000 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+	atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
 	bio_put(bio);
 	return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
+	atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);

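The accounting idea in the patch, reduced to a user-space sketch with
C11 atomics.  The toy_* names are hypothetical; the real patch bumps
the nr_writeback counter of the swap backing_dev_info, and, like the
patch, this sketch is an illustration rather than tested kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SWAP_WB_MIN 16	/* same constant as in the patch above */

/* Hypothetical counter standing in for the swap bdi's nr_writeback. */
static atomic_long toy_swap_writeback;

/* Corresponds to the point where swap_writepage() submits the bio. */
static void toy_swap_write_start(void)
{
	atomic_fetch_add(&toy_swap_writeback, 1);
}

/* Corresponds to the write completion handler. */
static void toy_swap_write_end(void)
{
	atomic_fetch_sub(&toy_swap_writeback, 1);
}

/* The extra exit condition added to throttle_vm_writeout(). */
static bool toy_throttle_may_exit(void)
{
	return atomic_load(&toy_swap_writeback) < SWAP_WB_MIN;
}
```

Every start must be paired with exactly one end, otherwise the counter
drifts and the exit condition either never fires or always fires.
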
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19 17:11             ` Miklos Szeredi
@ 2007-02-20  0:13               ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20  0:13 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 06:11:55PM +0100, Miklos Szeredi wrote:
> How about this?
> 
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Ok, what is supposed to happen here is that filesystems are supposed to
be throttled from making more dirty pages when the system is over the
threshold.  Even if filesystem A doesn't have much to contribute, and
filesystem B is the cause of 99% of the dirty pages, the goal of the
threshold is to prevent more dirty data from happening, and filesystem A
should block.

But, with the producer consumer setup of fuse, I think this is a pretty
good compromise.  16 dirty/writeback pages shouldn't hurt the overall
limits too badly.

-chris


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-19  1:14                     ` Miklos Szeredi
@ 2007-02-20  0:16                       ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20  0:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 02:14:15AM +0100, Miklos Szeredi wrote:
> > > > In general, writepage is supposed to do work without blocking on
> > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > is going to block without making progress.
> > > 
> > > Pdflush and dirty reclaim set wbc->nonblocking to true.
> > > balance_dirty_pages and fsync don't.  The problem here is that
> > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > pages from a different queue.
> > 
> > async or sync, writepage is supposed to either make progress or bail.
> > loopback aside, if the fuse call is blocking long term, you're going to
> > run into problems.
> 
> Hmm, like what?

Something a little different from what you're seeing.  Basically if the
PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
transaction is waiting for more ram, the system will eventually grind to
a halt.  data=journal is the easiest way to hit it, since writepage
always logs at least 4k.

WB_SYNC_NONE and wbc->nonblocking aren't a great test; in reiser I
resorted to testing PF_MEMALLOC.
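
Roughly, the shape of such a check, as a user-space sketch.  This is not
the reiserfs code: toy_task, TOY_PF_MEMALLOC and toy_should_bail are
hypothetical names, and the flag value is arbitrary.

```c
#include <stdbool.h>

#define TOY_PF_MEMALLOC 0x1u	/* hypothetical bit, not the kernel's value */

struct toy_task {
	unsigned int flags;
};

/*
 * Decide whether writepage should leave the page dirty instead of
 * waiting on a transaction.  If the caller is itself running on
 * behalf of memory reclaim, waiting for something that needs more
 * RAM cannot make progress, so bail.
 */
static bool toy_should_bail(const struct toy_task *current_task,
			    bool transaction_would_block)
{
	if (!transaction_would_block)
		return false;
	return (current_task->flags & TOY_PF_MEMALLOC) != 0;
}
```

The decision keys off who is calling rather than off the wbc fields,
which is the difference from a WB_SYNC_NONE or nonblocking test.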

-chris


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-20  0:13               ` Chris Mason
@ 2007-02-20  8:47                 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-20  8:47 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > How about this?
> > 
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> > 
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> > 
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> 
> Ok, what is supposed to happen here is that filesystems are supposed to
> be throttled from making more dirty pages when the system is over the
> threshold.  Even if filesystem A doesn't have much to contribute, and
> filesystem B is the cause of 99% of the dirty pages, the goal of the
> threshold is to prevent more dirty data from happening, and filesystem A
> should block.

Which is the cause of the current deadlock.  But if we allow
filesystem A to go into the red just a little, the deadlock is
avoided, because it can continue to make progress with cleaning the
dirtiness produced by B.

The maximum that filesystems can go over the limit will be

  (16 + epsilon) * number-of-queues

This is usually insignificant compared to the limit itself (~2000
pages on a machine with 32MB of RAM).

However with thousands of fuse mounts this may become a problem, as
each filesystem gets a separate queue.  In theory, just 2 pages are
enough to always make progress, but current dirty balancing can't
enforce this, as the ratelimit is at least 8 pages.
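
A back-of-the-envelope version of that bound, as a sketch.  The function
name is hypothetical, and taking epsilon to be the 8-page ratelimit is
an assumption made here for illustration; the mail above does not pin
epsilon down.

```c
/*
 * Upper bound on pages over the global limit:
 *   (per_queue_min + epsilon) * nr_queues
 * per_queue_min is the 16 from the patch; epsilon is assumed.
 */
static long toy_max_overshoot(long per_queue_min, long epsilon,
			      long nr_queues)
{
	return (per_queue_min + epsilon) * nr_queues;
}
```

With epsilon = 8, four queues give 96 pages, small next to a limit of
about 2000 pages; a thousand queues give 24000 pages, far above it.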

So there may have to be some more strict page accounting within fuse
itself, but that doesn't change the overall concept I think.

Miklos

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: dirty balancing deadlock
  2007-02-20  0:16                       ` Chris Mason
@ 2007-02-20  8:53                         ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-20  8:53 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > > > In general, writepage is supposed to do work without blocking on
> > > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > > is going to block without making progress.
> > > > 
> > > > Pdflush and dirty reclaim set wbc->nonblocking to true.
> > > > balance_dirty_pages and fsync don't.  The problem here is that
> > > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > > pages from a different queue.
> > > 
> > > async or sync, writepage is supposed to either make progress or bail.
> > > loopback aside, if the fuse call is blocking long term, you're going to
> > > run into problems.
> > 
> > Hmm, like what?
> 
> Something a little different from what you're seeing.  Basically if the
> PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
> transaction is waiting for more ram, the system will eventually grind to
> a halt.  data=journal is the easiest way to hit it, since writepage
> always logs at least 4k.
> 
> WB_SYNC_NONE and wbc->nonblocking aren't a great test; in reiser I
> resorted to testing PF_MEMALLOC.

I'm not pretending to understand how journaling filesystems work, but
this shouldn't be an issue with fuse.  Can you show me a call path
where PF_MEMALLOC is set and .nonblocking is not?

Thanks,
Miklos

* Re: dirty balancing deadlock
  2007-02-20  8:47                 ` Miklos Szeredi
@ 2007-02-20 11:30                   ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20 11:30 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Tue, Feb 20, 2007 at 09:47:11AM +0100, Miklos Szeredi wrote:
> > > How about this?
> > > 
> > > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > > I'll try to tackle that one as well.
> > > 
> > > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > > returns.
> > > 
> > > Does the constant need to be tunable?  If it's too large, then the global
> > > threshold is more easily exceeded.  If it's too small, then in a tight
> > > situation progress will be slower.
> > 
> > Ok, what is supposed to happen here is that filesystems are supposed to
> > be throttled from making more dirty pages when the system is over the
> > threshold.  Even if filesystem A doesn't have much to contribute, and
> > filesystem B is the cause of 99% of the dirty pages, the goal of the
> > threshold is to prevent more dirty data from happening, and filesystem A
> > should block.
> 
> Which is the cause of the current deadlock.  But if we allow
> filesystem A to go into the red just a little, the deadlock is
> avoided, because it can continue to make progress with cleaning the
> dirtiness produced by B.
> 
> The maximum that filesystems can go over the limit will be
> 
>   (16 + epsilon) * number-of-queues

Right, even for thousands of mounted filesystems ~16 pages per FS
effectively pinned is not horrible.

-chris
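
To put a number on that bound, here is a minimal sketch in userspace C
(not kernel code).  The function names and the epsilon_pages parameter
are assumptions made for illustration; only the constant 16 and the
formula (16 + epsilon) * number-of-queues come from the thread.

```c
#include <assert.h>

/* Per-queue exemption limit: the "16" in the proposed per-bdi check. */
#define BDI_EXEMPT_PAGES 16UL

/*
 * Upper bound, in pages, on how far the global dirty count can exceed
 * dirty_thresh when every queue is allowed (16 + epsilon) pages.
 * epsilon_pages is a hypothetical per-queue slack; the thread does not
 * give a value for it.
 */
unsigned long max_overshoot_pages(unsigned long nr_queues,
                                  unsigned long epsilon_pages)
{
    return (BDI_EXEMPT_PAGES + epsilon_pages) * nr_queues;
}

/* Convert a page count to KiB for a caller-supplied page size in bytes. */
unsigned long pages_to_kib(unsigned long pages, unsigned long page_size_bytes)
{
    /* Assumption: the page size is a whole number of KiB. */
    assert(page_size_bytes % 1024 == 0);
    return pages * (page_size_bytes / 1024);
}
```

Under these assumptions, 1000 queues with epsilon = 0 can overshoot by
at most 16000 pages, which is 64000 KiB with 4 KiB pages.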

* Re: dirty balancing deadlock
  2007-02-19 17:11             ` Miklos Szeredi
@ 2007-02-21 21:36               ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2007-02-21 21:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Mon, 19 Feb 2007 18:11:55 +0100
Miklos Szeredi <miklos@szeredi.hu> wrote:

> How about this?

I still don't understand this bug.

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.
> 
> Thanks,
> Miklos
> 
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
> @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
>  			dirty_thresh)
>  				break;
>  
> +		/*
> +		 * Acquit this producer if there's little or nothing
> +		 * to write back to this particular queue
> +		 *
> +		 * Without this check a deadlock is possible in the
> +		 * following case:
> +		 *
> +		 * - filesystem A writes data through filesystem B
> +		 * - filesystem A has dirty pages over dirty_thresh
> +		 * - writeback is started, this triggers a write in B
> +		 * - balance_dirty_pages() is called synchronously
> +		 * - the write to B blocks
> +		 * - the writeback completes, but dirty is still over threshold
> +		 * - the blocking write prevents further writes from happening
> +		 */
> +		if (atomic_long_read(&bdi->nr_dirty) +
> +		    atomic_long_read(&bdi->nr_writeback) < 16)
> +			break;
> +

The problem seems to be that little "- the write to B blocks".

How come it blocks?  I mean, if we cannot retire writes to that filesystem
then we're screwed anyway.

Anyway, I think I'll think about this issue a little later on.  You might
as well prepare full changelogs for your proposed changes, because we'll be
needing them anyway.


* Re: dirty balancing deadlock
  2007-02-21 21:36               ` Andrew Morton
@ 2007-02-22  7:42                 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-22  7:42 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> > How about this?
> 
> I still don't understand this bug.
> 
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> > 
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> > 
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> > 
> > Thanks,
> > Miklos
> > 
> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
> > +++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
> > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> >  			dirty_thresh)
> >  				break;
> >  
> > +		/*
> > +		 * Acquit this producer if there's little or nothing
> > +		 * to write back to this particular queue
> > +		 *
> > +		 * Without this check a deadlock is possible in the
> > +		 * following case:
> > +		 *
> > +		 * - filesystem A writes data through filesystem B
> > +		 * - filesystem A has dirty pages over dirty_thresh
> > +		 * - writeback is started, this triggers a write in B
> > +		 * - balance_dirty_pages() is called synchronously
> > +		 * - the write to B blocks
> > +		 * - the writeback completes, but dirty is still over threshold
> > +		 * - the blocking write prevents further writes from happening
> > +		 */
> > +		if (atomic_long_read(&bdi->nr_dirty) +
> > +		    atomic_long_read(&bdi->nr_writeback) < 16)
> > +			break;
> > +
> 
> The problem seems to be that little "- the write to B blocks".
> 
> How come it blocks?  I mean, if we cannot retire writes to that filesystem
> then we're screwed anyway.

Sorry about the sloppy description.  I mean, it's not the low-level
write that will block, but rather the VFS one
(generic_file_aio_write).  It will block (or rather loop forever with
0.1 second sleeps) in balance_dirty_pages().  That means that, for
this inode, i_mutex is held and no other writer can continue the work.

Miklos
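
The effect of the proposed check can be illustrated with a toy
userspace model of the exit conditions in the balance_dirty_pages()
loop.  This is only a sketch: struct bdi_model, must_throttle() and the
per_bdi_check flag are hypothetical names, and the real function also
starts writeback and sleeps in congestion_wait() between iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-queue counters used by the patch. */
struct bdi_model {
    long nr_dirty;
    long nr_writeback;
};

/* The "16" from the proposed check. */
enum { BDI_EXEMPT_PAGES = 16 };

/*
 * Returns true if a writer to this queue would have to keep looping.
 * per_bdi_check selects whether the proposed per-queue exemption is
 * applied in addition to the global threshold test.
 */
bool must_throttle(long global_dirty, long global_writeback,
                   long dirty_thresh, const struct bdi_model *bdi,
                   bool per_bdi_check)
{
    assert(bdi != NULL);

    /* Existing exit: global dirty + writeback is back under the limit. */
    if (global_dirty + global_writeback <= dirty_thresh)
        return false;

    /* Proposed exit: this queue has little or nothing to write back. */
    if (per_bdi_check &&
        bdi->nr_dirty + bdi->nr_writeback < BDI_EXEMPT_PAGES)
        return false;

    return true;
}
```

With the global count over the threshold and a clean queue (the state
of filesystem B in the deadlock), must_throttle() returns true without
the check and false with it.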

* Re: dirty balancing deadlock
  2007-02-22  7:42                 ` Miklos Szeredi
@ 2007-02-22  7:55                   ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2007-02-22  7:55 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

> On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > 
> > > Index: linux/mm/page-writeback.c
> > > ===================================================================
> > > --- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
> > > +++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
> > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > >  			dirty_thresh)
> > >  				break;
> > >  
> > > +		/*
> > > +		 * Acquit this producer if there's little or nothing
> > > +		 * to write back to this particular queue
> > > +		 *
> > > +		 * Without this check a deadlock is possible in the
> > > +		 * following case:
> > > +		 *
> > > +		 * - filesystem A writes data through filesystem B
> > > +		 * - filesystem A has dirty pages over dirty_thresh
> > > +		 * - writeback is started, this triggers a write in B
> > > +		 * - balance_dirty_pages() is called synchronously
> > > +		 * - the write to B blocks
> > > +		 * - the writeback completes, but dirty is still over threshold
> > > +		 * - the blocking write prevents further writes from happening
> > > +		 */
> > > +		if (atomic_long_read(&bdi->nr_dirty) +
> > > +		    atomic_long_read(&bdi->nr_writeback) < 16)
> > > +			break;
> > > +
> > 
> > The problem seems to be that little "- the write to B blocks".
> > 
> > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > then we're screwed anyway.
> 
> Sorry about the sloppy description.  I mean, it's not the lowlevel
> write that will block, but rather the VFS one
> (generic_file_aio_write).  It will block (or rather loop forever with
> 0.1 second sleeps) in balance_dirty_pages().  That means, that for
> this inode, i_mutex is held and no other writer can continue the work.

"this inode" I assume is the inode against filesystem A?

Why does holding that inode's i_mutex prevent further writeback of pages in A?



* Re: dirty balancing deadlock
  2007-02-22  7:55                   ` Andrew Morton
@ 2007-02-22  8:02                     ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-22  8:02 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > 
> > > > Index: linux/mm/page-writeback.c
> > > > ===================================================================
> > > > --- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
> > > > +++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
> > > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > > >  			dirty_thresh)
> > > >  				break;
> > > >  
> > > > +		/*
> > > > +		 * Acquit this producer if there's little or nothing
> > > > +		 * to write back to this particular queue
> > > > +		 *
> > > > +		 * Without this check a deadlock is possible in the
> > > > +		 * following case:
> > > > +		 *
> > > > +		 * - filesystem A writes data through filesystem B
> > > > +		 * - filesystem A has dirty pages over dirty_thresh
> > > > +		 * - writeback is started, this triggers a write in B
> > > > +		 * - balance_dirty_pages() is called synchronously
> > > > +		 * - the write to B blocks
> > > > +		 * - the writeback completes, but dirty is still over threshold
> > > > +		 * - the blocking write prevents further writes from happening
> > > > +		 */
> > > > +		if (atomic_long_read(&bdi->nr_dirty) +
> > > > +		    atomic_long_read(&bdi->nr_writeback) < 16)
> > > > +			break;
> > > > +
> > > 
> > > The problem seems to be that little "- the write to B blocks".
> > > 
> > > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > > then we're screwed anyway.
> > 
> > Sorry about the sloppy description.  I mean, it's not the lowlevel
> > write that will block, but rather the VFS one
> > (generic_file_aio_write).  It will block (or rather loop forever with
> > 0.1 second sleeps) in balance_dirty_pages().  That means, that for
> > this inode, i_mutex is held and no other writer can continue the work.
> 
> "this inode" I assume is the inode against filesystem A?

No, the one in B.

> Why does holding that inode's i_mutex prevent further writeback of
> pages in A?

It is generic_file_aio_write() that is holding the mutex.

Here's the stack for the filesystem daemon trying to write back a page:

08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69

Miklos

