* dirty balancing deadlock
@ 2007-02-18 18:28 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-18 18:28 UTC
To: linux-kernel, linux-mm; +Cc: akpm

I was testing the new fuse shared writable mmap support, and finding
that bash-shared-mapping deadlocks (which isn't so strange ;).  What
is more strange is that this is not an OOM situation at all, with
plenty of free and cached pages.

A little more investigation shows that a similar deadlock happens
reliably with bash-shared-mapping on a loopback mount, even if only
half the total memory is used.

The cause is slightly different in the two cases:

  - loopback mount: allocation by the underlying filesystem is stalled
    on throttle_vm_writeout()

  - fuse-loop: page dirtying on the underlying filesystem is stalled on
    balance_dirty_pages()

In both cases the underlying fs is totally innocent, with no
dirty/writeback pages, yet it's waiting for the global dirty+writeback
to go below the threshold, which obviously won't, until the
allocation/dirtying succeeds.

I'm not quite sure what the solution is, and asking for thoughts.

Ideas:

  - per filesystem dirty counters.  If filesystem is clean (or dirty
    is below some minimum), then balance_dirty_pages() should not wait
    any more

  - throttle_vm_writeout() was meant to throttle swapping, no?  So in
    that case there should be a separate swap-writeback counter

Thanks,
Miklos

* Re: dirty balancing deadlock
  2007-02-18 18:28 ` Miklos Szeredi
@ 2007-02-18 20:53 ` Andrew Morton
From: Andrew Morton @ 2007-02-18 20:53 UTC
To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> I was testing the new fuse shared writable mmap support, and finding
> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> is more strange is that this is not an OOM situation at all, with
> plenty of free and cached pages.
>
> A little more investigation shows that a similar deadlock happens
> reliably with bash-shared-mapping on a loopback mount, even if only
> half the total memory is used.
>
> The cause is slightly different in the two cases:
>
>   - loopback mount: allocation by the underlying filesystem is stalled
>     on throttle_vm_writeout()
>
>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
>     balance_dirty_pages()
>
> In both cases the underlying fs is totally innocent, with no
> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> to go below the threshold, which obviously won't, until the
> allocation/dirtying succeeds.
>
> I'm not quite sure what the solution is, and asking for thoughts.

But.... these things don't just throttle.  They also perform large amounts
of writeback, which causes the dirty levels to subside.

From your description it appears that this writeback isn't happening, or
isn't working.  How come?

* Re: dirty balancing deadlock
  2007-02-18 20:53 ` Andrew Morton
@ 2007-02-18 21:25 ` Rik van Riel
From: Rik van Riel @ 2007-02-18 21:25 UTC
To: Andrew Morton; +Cc: Miklos Szeredi, linux-kernel, linux-mm

Andrew Morton wrote:
> On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> I was testing the new fuse shared writable mmap support, and finding
>> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
>> is more strange is that this is not an OOM situation at all, with
>> plenty of free and cached pages.
>>
>> A little more investigation shows that a similar deadlock happens
>> reliably with bash-shared-mapping on a loopback mount, even if only
>> half the total memory is used.
>>
>> The cause is slightly different in the two cases:
>>
>>   - loopback mount: allocation by the underlying filesystem is stalled
>>     on throttle_vm_writeout()
>>
>>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
>>     balance_dirty_pages()
>>
>> In both cases the underlying fs is totally innocent, with no
>> dirty/writeback pages, yet it's waiting for the global dirty+writeback
>> to go below the threshold, which obviously won't, until the
>> allocation/dirtying succeeds.
>>
>> I'm not quite sure what the solution is, and asking for thoughts.
>
> But.... these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
>
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

Is the fuse daemon trying to do writeback to itself, perhaps?

That is, trying to write out data to the FUSE filesystem, for which
it is also the server.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

* Re: dirty balancing deadlock
  2007-02-18 21:25 ` Rik van Riel
@ 2007-02-18 22:54 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-18 22:54 UTC
To: riel; +Cc: akpm, miklos, linux-kernel, linux-mm

> Andrew Morton wrote:
> > On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> >> I was testing the new fuse shared writable mmap support, and finding
> >> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> >> is more strange is that this is not an OOM situation at all, with
> >> plenty of free and cached pages.
> >>
> >> A little more investigation shows that a similar deadlock happens
> >> reliably with bash-shared-mapping on a loopback mount, even if only
> >> half the total memory is used.
> >>
> >> The cause is slightly different in the two cases:
> >>
> >>   - loopback mount: allocation by the underlying filesystem is stalled
> >>     on throttle_vm_writeout()
> >>
> >>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >>     balance_dirty_pages()
> >>
> >> In both cases the underlying fs is totally innocent, with no
> >> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> >> to go below the threshold, which obviously won't, until the
> >> allocation/dirtying succeeds.
> >>
> >> I'm not quite sure what the solution is, and asking for thoughts.
> >
> > But.... these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> >
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
>
> Is the fuse daemon trying to do writeback to itself, perhaps?
>
> That is, trying to write out data to the FUSE filesystem, for which
> it is also the server.

No.  It's trying to write out data to a different filesystem.

Trying to write out data to itself very obviously deadlocks, but that
doesn't affect anything besides the stupid filesystem itself, and
there are mechanisms for aborting such a situation (forced umount,
abort through fuse-control filesystem).

Miklos

* Re: dirty balancing deadlock
  2007-02-18 20:53 ` Andrew Morton
@ 2007-02-18 22:50 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-18 22:50 UTC
To: akpm; +Cc: linux-kernel, linux-mm

> > I was testing the new fuse shared writable mmap support, and finding
> > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > is more strange is that this is not an OOM situation at all, with
> > plenty of free and cached pages.
> >
> > A little more investigation shows that a similar deadlock happens
> > reliably with bash-shared-mapping on a loopback mount, even if only
> > half the total memory is used.
> >
> > The cause is slightly different in the two cases:
> >
> >   - loopback mount: allocation by the underlying filesystem is stalled
> >     on throttle_vm_writeout()
> >
> >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >     balance_dirty_pages()
> >
> > In both cases the underlying fs is totally innocent, with no
> > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > to go below the threshold, which obviously won't, until the
> > allocation/dirtying succeeds.
> >
> > I'm not quite sure what the solution is, and asking for thoughts.
>
> But.... these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
>
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

  - filesystems A and B
  - write to A will end up as write to B
  - dirty pages in A manage to go over dirty_threshold
  - page writeback is started from A
  - this triggers writeback for a couple of pages in B
  - writeback finishes normally, but dirty+writeback pages are still
    over threshold
  - balance_dirty_pages in B gets stuck, nothing ever moves after this

At least this is my theory for what happens.

Miklos

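The sequence above can be modelled in a few lines of Python.  This is
a toy sketch, not kernel code: simulate(), per_fs_check and
unchecked_writes are hypothetical names, the page counts are
illustrative, and unchecked_writes is an assumed stand-in for the few
writes to B that get through before the throttle is checked.  The
per_fs_check flag models the "per filesystem dirty counters" idea from
the first message.

```python
def simulate(dirty_a, thresh, per_fs_check=False, unchecked_writes=1):
    """Model writeback of A's dirty pages, each of which needs a write to B.

    Returns (state, dirty_a), where state is "clean" or "stuck".
    """
    dirty_b = 0  # assumption: B flushes its own pages promptly and stays clean
    writes = 0   # writes to B completed so far
    while dirty_a > 0:
        if writes >= unchecked_writes:
            # Throttle check applied to the thread writing to B.
            over_limit = dirty_a + dirty_b > thresh
            exempt = per_fs_check and dirty_b == 0
            if over_limit and not exempt:
                # The writer waits for the global count to drop, but only
                # this very write could clean another page of A.
                return "stuck", dirty_a
        dirty_a -= 1  # one page of A cleaned through B
        writes += 1
    return "clean", 0


if __name__ == "__main__":
    print(simulate(1100, 1000))                     # ('stuck', 1099)
    print(simulate(1100, 1000, per_fs_check=True))  # ('clean', 0)
    print(simulate(900, 1000))                      # ('clean', 0)
```

With only the global check, the model stops making progress as soon as
the throttle is consulted while A is over the limit.  Exempting a
clean B lets the loop drain A completely.
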
* Re: dirty balancing deadlock
  2007-02-18 22:50 ` Miklos Szeredi
@ 2007-02-18 22:59 ` Andrew Morton
From: Andrew Morton @ 2007-02-18 22:59 UTC
To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Sun, 18 Feb 2007 23:50:14 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > I was testing the new fuse shared writable mmap support, and finding
> > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > is more strange is that this is not an OOM situation at all, with
> > > plenty of free and cached pages.
> > >
> > > A little more investigation shows that a similar deadlock happens
> > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > half the total memory is used.
> > >
> > > The cause is slightly different in the two cases:
> > >
> > >   - loopback mount: allocation by the underlying filesystem is stalled
> > >     on throttle_vm_writeout()
> > >
> > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > >     balance_dirty_pages()
> > >
> > > In both cases the underlying fs is totally innocent, with no
> > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > to go below the threshold, which obviously won't, until the
> > > allocation/dirtying succeeds.
> > >
> > > I'm not quite sure what the solution is, and asking for thoughts.
> >
> > But.... these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> >
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
>
>  - filesystems A and B
>  - write to A will end up as write to B
>  - dirty pages in A manage to go over dirty_threshold
>  - page writeback is started from A
>  - this triggers writeback for a couple of pages in B
>  - writeback finishes normally, but dirty+writeback pages are still
>    over threshold
>  - balance_dirty_pages in B gets stuck, nothing ever moves after this
>
> At least this is my theory for what happens.
>

Is B a real filesystem?

If so, writes to B will decrease the dirty memory threshold.

The writeout code _should_ just sit there transferring dirtiness from A to
B and cleaning pages via B, looping around, alternating between both.

What does sysrq-t say?

* Re: dirty balancing deadlock
  2007-02-18 22:59 ` Andrew Morton
@ 2007-02-18 23:22 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-18 23:22 UTC
To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > > > I was testing the new fuse shared writable mmap support, and finding
> > > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > > is more strange is that this is not an OOM situation at all, with
> > > > plenty of free and cached pages.
> > > >
> > > > A little more investigation shows that a similar deadlock happens
> > > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > > half the total memory is used.
> > > >
> > > > The cause is slightly different in the two cases:
> > > >
> > > >   - loopback mount: allocation by the underlying filesystem is stalled
> > > >     on throttle_vm_writeout()
> > > >
> > > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > > >     balance_dirty_pages()
> > > >
> > > > In both cases the underlying fs is totally innocent, with no
> > > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > > to go below the threshold, which obviously won't, until the
> > > > allocation/dirtying succeeds.
> > > >
> > > > I'm not quite sure what the solution is, and asking for thoughts.
> > >
> > > But.... these things don't just throttle.  They also perform large amounts
> > > of writeback, which causes the dirty levels to subside.
> > >
> > > From your description it appears that this writeback isn't happening, or
> > > isn't working.  How come?
> >
> >  - filesystems A and B
> >  - write to A will end up as write to B
> >  - dirty pages in A manage to go over dirty_threshold
> >  - page writeback is started from A
> >  - this triggers writeback for a couple of pages in B
> >  - writeback finishes normally, but dirty+writeback pages are still
> >    over threshold
> >  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> >
> > At least this is my theory for what happens.
> >
>
> Is B a real filesystem?

Yes.

> If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
Some pages queued for writeback (doesn't matter how many).  B writes
back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
B doesn't know that there's nothing more to write back for B, it's
just waiting there for those 1099, which'll never get written.

> The writeout code _should_ just sit there transferring dirtiness from A to
> B and cleaning pages via B, looping around, alternating between both.
>
> What does sysrq-t say?

This is the fuse daemon thread that got stuck.  There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401     0   527    493   533   495 (NOTLB)
088d55f8 00000001 00000000 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000
08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000
0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680
Call Trace:
08dcfb00:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08dcfb18:  [<0805a38a>] _switch_to+0x49/0x99
08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08dcff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08dcff64:  [<0806d56c>] handle_trap+0x27/0x121
08dcff8c:  [<0806dc65>] userspace+0x1de/0x226
08dcffe4:  [<0805da19>] fork_handler+0x76/0x88
08dcfffc:  [<d4cf0007>] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0) at mm/page-writeback.c:202
202                     dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)

For completeness' sake, here's the backtrace for the stuck loopback as
well:

loop0         D BFFFE101     0   499      5           500    59 (L-TLB)
088cc578 00000001 00000000 09197c4c 0805d8cb 084fe6f8 088cc578 09190000
09190000 09197c74 0805a38a 084fe200 088cc080 09197c64 09190000 09190000
086d9c80 088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200
Call Trace:
09197c38:  [<0805d8cb>] switch_to_skas+0x3b/0x83
09197c50:  [<0805a38a>] _switch_to+0x49/0x99
09197c78:  [<08182ab6>] schedule+0x246/0x547
09197cd0:  [<081834d3>] schedule_timeout+0x4e/0xb6
09197d04:  [<08183461>] io_schedule_timeout+0x11/0x20
09197d0c:  [<080a0c62>] congestion_wait+0x72/0x87
09197d3c:  [<0809c7e8>] throttle_vm_writeout+0x27/0x6a
09197d60:  [<0809faec>] shrink_zone+0xaf/0x103
09197d8c:  [<0809fbb2>] shrink_zones+0x72/0x8a
09197db0:  [<0809fc87>] try_to_free_pages+0xbd/0x185
09197dfc:  [<0809ba76>] __alloc_pages+0x155/0x335
09197e50:  [<080975eb>] find_or_create_page+0x85/0x99
09197e78:  [<0812785e>] do_lo_send_aops+0x8d/0x233
09197ee4:  [<08127c56>] lo_send+0x92/0x10d
09197f20:  [<08127ee6>] do_bio_filebacked+0x6d/0x74
09197f44:  [<081280e0>] loop_thread+0x89/0x188
09197f84:  [<0808a03a>] kthread+0xa7/0xab
09197fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
09197fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
09197ffc:  [<00000000>] nosmp+0xf7fb7000/0x14

Miklos

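The numbers in the dump above can be checked against the gdb value of
dirty_thresh.  The following Python sketch assumes the throttle
compares nr_dirty + nr_unstable + nr_writeback against dirty_thresh.
The thread only says "global dirty+writeback", so including
nr_unstable is an assumption (it is 0 here either way).
parse_vmstat() and over_dirty_limit() are hypothetical helper names.

```python
def parse_vmstat(text):
    """Parse whitespace-separated 'name value' pairs into a dict of ints."""
    tokens = text.split()
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


def over_dirty_limit(stats, dirty_thresh):
    """Return (pending_pages, exceeded) for the assumed throttle condition."""
    # Assumption: these three counters make up the throttled page count.
    pending = stats["nr_dirty"] + stats["nr_unstable"] + stats["nr_writeback"]
    return pending, pending > dirty_thresh


if __name__ == "__main__":
    # Values copied from the /proc/vmstat dump and the gdb session above.
    sample = "nr_dirty 2174 nr_writeback 10 nr_unstable 0"
    print(over_dirty_limit(parse_vmstat(sample), dirty_thresh=2113))
    # (2184, True)
```

Under that assumption 2184 pages are pending against a threshold of
2113, which is consistent with the writer sitting in
balance_dirty_pages().
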
* Re: dirty balancing deadlock
  2007-02-18 23:22 ` Miklos Szeredi
@ 2007-02-18 23:59 ` Andrew Morton
From: Andrew Morton @ 2007-02-18 23:59 UTC
To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
>
> Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how much). B writes
> back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.

hm, OK, arguable. I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
+		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
 			list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush? It should be busily transferring dirtiness from A to
B.

> > The writeout code _should_ just sit there transferring dirtyiness from A to
> > B and cleaning pages via B, looping around, alternating between both.
> >
> > What does sysrq-t say?
>
> This is the fuse daemon thread that got stuck.

Where's pdflush?
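The 1100/1000 arithmetic quoted above can be replayed in a few lines of
user-space C. This is only a sketch of the accounting, not kernel code:
struct fs_state, must_wait_global() and must_wait_per_fs() are hypothetical
names, and the per-filesystem rule is one possible reading of the "per
filesystem dirty counters" idea from the first message of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-filesystem page counts; not a kernel structure. */
struct fs_state {
	long dirty;      /* pages dirty on this filesystem */
	long writeback;  /* pages under writeback on this filesystem */
};

/*
 * Global rule as described in the thread: a writer waits while the
 * system-wide dirty + writeback count is above the threshold.
 */
bool must_wait_global(const struct fs_state *self,
		      const struct fs_state *other, long thresh)
{
	return self->dirty + self->writeback +
	       other->dirty + other->writeback > thresh;
}

/*
 * Assumed per-filesystem variant: a writer to a filesystem that has
 * nothing left to clean is never made to wait.
 */
bool must_wait_per_fs(const struct fs_state *self,
		      const struct fs_state *other, long thresh)
{
	if (self->dirty + self->writeback == 0)
		return false;
	return must_wait_global(self, other, thresh);
}
```

With A holding 1099 dirty pages, B holding none and a threshold of 1000,
must_wait_global() tells a writer to B to wait although B has nothing to
write back, while must_wait_per_fs() lets that writer through and still
holds back a writer to A.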
* Re: dirty balancing deadlock
  2007-02-18 23:59 ` Andrew Morton
@ 2007-02-19 0:25 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-19 0:25 UTC
To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > > If so, writes to B will decrease the dirty memory threshold.
> >
> > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> > Some pages queued for writeback (doesn't matter how much). B writes
> > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > B doesn't know that there's nothing more to write back for B, it's
> > just waiting there for those 1099, which'll never get written.
>
> hm, OK, arguable. I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678 0 505 494 506 504 (NOTLB)
08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
Call Trace:
08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
08f9f9e0: [<08183006>] schedule+0x246/0x547
08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
08f9faac: [<0809ce3f>] __writepage+0x1e/0x3d
08f9fac0: [<0809cd39>] write_cache_pages+0x222/0x30a
08f9fb44: [<0809ce8d>] generic_writepages+0x2f/0x35
08f9fb5c: [<0809ced6>] do_writepages+0x43/0x45
08f9fb70: [<080cb8d2>] __writeback_single_inode+0xbc/0x173
08f9fbb8: [<080cbb30>] sync_sb_inodes+0x1a7/0x260
08f9fbe8: [<080cbc54>] writeback_inodes+0x6b/0x81
08f9fc04: [<0809c640>] balance_dirty_pages+0x55/0x153
08f9fc5c: [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68: [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20: [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8: [<08099cb6>] generic_file_aio_write+0x55/0xc7
08f9fddc: [<080ea206>] ext3_file_write+0x39/0xaf
08f9fe04: [<080b060b>] do_sync_write+0xd8/0x10e
08f9febc: [<080b06e3>] vfs_write+0xa2/0x1cb
08f9feec: [<080b09b8>] sys_pwrite64+0x65/0x69
08f9ff10: [<0805dd54>] handle_syscall+0x90/0xbc
08f9ff64: [<0806d56c>] handle_trap+0x27/0x121
08f9ff8c: [<0806dc65>] userspace+0x1de/0x226
08f9ffe4: [<0805da19>] fork_handler+0x76/0x88
08f9fffc: [<00000000>] nosmp+0xf7fb7000/0x14

> but where's pdflush? It should be busily transferring dirtiness from A to
> B.

The transfer of dirtiness from A to B goes through the narrow channel
of i_mutex. And once that is plugged by the stuck
balance_dirty_pages(), nothing else can pass through.

> > > The writeout code _should_ just sit there transferring dirtyiness from A to
> > > B and cleaning pages via B, looping around, alternating between both.
> > >
> > > What does sysrq-t say?
> >
> > This is the fuse daemon thread that got stuck.
>
> Where's pdflush?

Doing nothing, I guess. The request queue for the fuse filesystem is
full, so writepage with wbc->nonblocking=1 will be skipped.

pdflush D 40045401 0 23 5 24 12 (L-TLB)
088d5bf8 00000001 00000000 08907df8 0805d8cb 088d55f8 088d5bf8 08900000
08900000 08907e20 0805a38a 088d5100 088d5700 08907e10 08900000 08900000
0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100
Call Trace:
08907de4: [<0805d8cb>] switch_to_skas+0x3b/0x83
08907dfc: [<0805a38a>] _switch_to+0x49/0x99
08907e24: [<08182fe6>] schedule+0x246/0x547
08907e7c: [<08183a03>] schedule_timeout+0x4e/0xb6
08907eb0: [<08183991>] io_schedule_timeout+0x11/0x20
08907eb8: [<080a0cf2>] congestion_wait+0x72/0x87
08907ee8: [<0809c860>] background_writeout+0x35/0xa4
08907f38: [<0809d41e>] __pdflush+0xae/0x152
08907f54: [<0809d4f5>] pdflush+0x33/0x39
08907f84: [<0808a03a>] kthread+0xa7/0xab
08907fb4: [<0806a0f1>] run_kernel_thread+0x41/0x50
08907fe0: [<0805d975>] new_thread_handler+0x62/0x8b
08907ffc: [<00000000>] nosmp+0xf7fb7000/0x14

pdflush D 40045401 0 24 5 25 23 (L-TLB)
081e1458 00000001 00000000 088ffe00 0805d8cb 088d5bf8 081e1458 088f8000
088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000
0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700
Call Trace:
088ffdec: [<0805d8cb>] switch_to_skas+0x3b/0x83
088ffe04: [<0805a38a>] _switch_to+0x49/0x99
088ffe2c: [<08182fe6>] schedule+0x246/0x547
088ffe84: [<08183a03>] schedule_timeout+0x4e/0xb6
088ffeb8: [<08183991>] io_schedule_timeout+0x11/0x20
088ffec0: [<080a0cf2>] congestion_wait+0x72/0x87
088ffef0: [<0809c98c>] wb_kupdate+0x93/0xd9
088fff38: [<0809d41e>] __pdflush+0xae/0x152
088fff54: [<0809d4f5>] pdflush+0x33/0x39
088fff84: [<0808a03a>] kthread+0xa7/0xab
088fffb4: [<0806a0f1>] run_kernel_thread+0x41/0x50
088fffe0: [<0805d975>] new_thread_handler+0x62/0x8b
088ffffc: [<00000000>] nosmp+0xf7fb7000/0x14
* Re: dirty balancing deadlock
  2007-02-19 0:25 ` Miklos Szeredi
@ 2007-02-19 0:30 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-19 0:30 UTC
To: akpm; +Cc: linux-kernel, linux-mm

> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>  			continue;		/* Skip a congested blockdev */
>  		}
>
> -		if (wbc->bdi && bdi != wbc->bdi) {
> +		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>  			if (!sb_is_blkdev_sb(sb))
>  				break;		/* fs has the wrong queue */
>  			list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.

Miklos
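The objection is the usual check-then-act problem. As a rough user-space
illustration (not something proposed in the thread, and not the kernel's
congestion API), the sketch below contrasts an advisory congestion check
with a reservation that checks and claims a slot in one atomic step.
struct req_queue, queue_init(), queue_congested() and queue_try_reserve()
are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request queue; a stand-in, not the kernel's bdi. */
struct req_queue {
	atomic_int in_flight;  /* requests currently queued */
	int limit;             /* queue counts as congested at this depth */
};

void queue_init(struct req_queue *q, int limit)
{
	atomic_init(&q->in_flight, 0);
	q->limit = limit;
}

/* Advisory check: the answer can be stale as soon as it is returned. */
bool queue_congested(struct req_queue *q)
{
	return atomic_load(&q->in_flight) >= q->limit;
}

/* Check and claim a slot in one atomic step; false means "would block". */
bool queue_try_reserve(struct req_queue *q)
{
	int cur = atomic_load(&q->in_flight);

	while (cur < q->limit) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&q->in_flight, &cur, cur + 1))
			return true;
	}
	return false;
}
```

A caller that sees queue_congested() return false can still find the queue
full a moment later if another writer takes the last slot in between; only
the result of queue_try_reserve() says whether this particular request can
proceed without waiting.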
* Re: dirty balancing deadlock
  2007-02-19 0:25 ` Miklos Szeredi
@ 2007-02-19 0:45 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-19 0:45 UTC
To: akpm; +Cc: linux-kernel, linux-mm

> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much). B writes
> > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable. I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No sorry, it doesn't even help the loopback deadlock. It sometimes
takes quite a while to trigger...

Miklos
* Re: dirty balancing deadlock
  2007-02-19 0:25 ` Miklos Szeredi
@ 2007-02-19 0:45 ` Chris Mason
From: Chris Mason @ 2007-02-19 0:45 UTC
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote:
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much). B writes
> > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable. I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.
>
> For fuse it's worse with the patch: now the write triggered by the
> balance recurses into fuse, with disastrous results, since the fuse
> writeback is now blocked on the userspace queue.
>
> fusexmp_fh_no D 40136678 0 505 494 506 504 (NOTLB)
> 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> Call Trace:
> 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> 08f9f9e0: [<08183006>] schedule+0x246/0x547
> 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c

In general, writepage is supposed to do work without blocking on
expensive locks that will get pdflush and dirty reclaim stuck in this
fashion. You'll probably have to take the same approach reiserfs does
in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
is going to block without making progress. Queue it somewhere else (i.e.
an internal fs cleaning thread) and leave the page dirty so that we can
move on to other pages that have a chance of being cleaned.

-chris
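A minimal user-space sketch of the behaviour suggested above, assuming a
bounded pool of request slots: when no slot is free, the write routine
leaves the page dirty and returns, so the pass over the remaining pages is
never stalled. The types and functions (struct sim_page, struct req_pool,
writepage_or_bail(), writeback_pass()) are hypothetical; the hand-off to an
internal cleaning thread is not modelled, and none of this is the actual
fuse or reiserfs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a page and a bounded request pool. */
struct sim_page {
	bool dirty;
	bool writeback;
};

struct req_pool {
	int free_slots;  /* requests that can be issued without waiting */
};

/*
 * Assumed writepage shape: submit if a request slot is free, otherwise
 * leave the page dirty and return instead of sleeping.
 * Returns true if the page was submitted for writeback.
 */
bool writepage_or_bail(struct sim_page *pg, struct req_pool *pool)
{
	if (pool->free_slots == 0)
		return false;        /* page stays dirty for a later pass */

	pool->free_slots--;
	pg->dirty = false;
	pg->writeback = true;
	return true;
}

/* Walk all dirty pages; skipped pages never stall the rest of the walk. */
size_t writeback_pass(struct sim_page *pages, size_t n, struct req_pool *pool)
{
	size_t submitted = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].dirty && writepage_or_bail(&pages[i], pool))
			submitted++;
	}
	return submitted;
}
```

With three dirty pages and two free slots, one pass submits two pages and
leaves the third dirty; a second pass with the pool still exhausted submits
nothing but also returns without blocking.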
* Re: dirty balancing deadlock
  2007-02-19 0:45 ` Chris Mason
@ 2007-02-19 0:54 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-19 0:54 UTC
To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > > > If so, writes to B will decrease the dirty memory threshold.
> > > >
> > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> > > > Some pages queued for writeback (doesn't matter how much). B writes
> > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > > B doesn't know that there's nothing more to write back for B, it's
> > > > just waiting there for those 1099, which'll never get written.
> > >
> > > hm, OK, arguable. I guess something like this..
> >
> > Doesn't help the fuse case, but does seem to help the loopback mount
> > one.
> >
> > For fuse it's worse with the patch: now the write triggered by the
> > balance recurses into fuse, with disastrous results, since the fuse
> > writeback is now blocked on the userspace queue.
> >
> > fusexmp_fh_no D 40136678 0 505 494 506 504 (NOTLB)
> > 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > Call Trace:
> > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> > 08f9f9e0: [<08183006>] schedule+0x246/0x547
> > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
>
> In general, writepage is supposed to do work without blocking on
> expensive locks that will get pdflush and dirty reclaim stuck in this
> fashion. You'll probably have to take the same approach reiserfs does
> in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> is going to block without making progress.

Pdflush and dirty reclaim set wbc->nonblocking to true.
balance_dirty_pages and fsync don't. The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.

Miklos
* Re: dirty balancing deadlock
  2007-02-19 0:54 ` Miklos Szeredi
@ 2007-02-19 1:01 ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-19 1:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote:
> > > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > >
> > > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000.
> > > > > Some pages queued for writeback (doesn't matter how much). B writes
> > > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > > > B doesn't know that there's nothing more to write back for B, it's
> > > > > just waiting there for those 1099, which'll never get written.
> > > >
> > > > hm, OK, arguable. I guess something like this..
> > >
> > > Doesn't help the fuse case, but does seem to help the loopback mount
> > > one.
> > >
> > > For fuse it's worse with the patch: now the write triggered by the
> > > balance recurses into fuse, with disastrous results, since the fuse
> > > writeback is now blocked on the userspace queue.
> > >
> > > fusexmp_fh_no D 40136678 0 505 494 506 504 (NOTLB)
> > > 08982b78 00000001 00000000 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > > 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > > 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 Call Trace:
> > > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> > > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> > > 08f9f9e0: [<08183006>] schedule+0x246/0x547
> > > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
> >
> > In general, writepage is supposed to do work without blocking on
> > expensive locks that will get pdflush and dirty reclaim stuck in this
> > fashion. You'll probably have to take the same approach reiserfs does
> > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > is going to block without making progress.
>
> Pdflush, and dirty reclaim set wbc->nonblocking to true.
> balance_dirty_pages and fsync don't. The problem here is that
> Andrew's patch is wrong to let balance_dirty_pages() try to write back
> pages from a different queue.

async or sync, writepage is supposed to either make progress or bail.
loopback aside, if the fuse call is blocking long term, you're going to
run into problems.

-chris

^ permalink raw reply [flat|nested] 52+ messages in thread
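The "make progress or bail" rule Chris describes can be sketched as a small userspace model. This is not kernel or fuse code: struct model_page, struct model_queue and model_writepage() are hypothetical stand-ins invented for illustration, and the "free slot" idea is only an assumption about what blocking in fuse_get_req_wp amounts to.

```c
#include <stdbool.h>

/* Userspace model only: hypothetical stand-ins, not the kernel's
 * struct page or the fuse request queue. */
struct model_page {
	bool dirty;
	bool writeback;
};

struct model_queue {
	int free_slots;	/* request slots that can be taken without sleeping */
};

enum model_wp_result { MODEL_WP_STARTED, MODEL_WP_REDIRTIED };

/*
 * "Make progress or bail": if no request slot is free, leave the page
 * dirty and return at once instead of sleeping on the queue.
 */
static enum model_wp_result model_writepage(struct model_page *page,
					    struct model_queue *q)
{
	if (q->free_slots == 0) {
		page->dirty = true;	/* the caller can retry later */
		return MODEL_WP_REDIRTIED;
	}
	q->free_slots--;
	page->dirty = false;
	page->writeback = true;		/* I/O is now in flight */
	return MODEL_WP_STARTED;
}
```

With one free slot, the first call starts writeback and the second leaves its page dirty, so the caller never sleeps inside writepage.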
* Re: dirty balancing deadlock
  2007-02-19 1:01 ` Chris Mason
@ 2007-02-19 1:14 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19 1:14 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion. You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > is going to block without making progress.
> >
> > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't. The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
>
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: dirty balancing deadlock
  2007-02-19 1:14 ` Miklos Szeredi
@ 2007-02-20 0:16 ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20 0:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 02:14:15AM +0100, Miklos Szeredi wrote:
> > > > In general, writepage is supposed to do work without blocking on
> > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > fashion. You'll probably have to take the same approach reiserfs does
> > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > is going to block without making progress.
> > >
> > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > balance_dirty_pages and fsync don't. The problem here is that
> > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > pages from a different queue.
> >
> > async or sync, writepage is supposed to either make progress or bail.
> > loopback aside, if the fuse call is blocking long term, you're going to
> > run into problems.
>
> Hmm, like what?

Something a little different from what you're seeing. Basically if the
PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
transaction is waiting for more ram, the system will eventually grind to
a halt. data=journal is the easiest way to hit it, since writepage
always logs at least 4k.

WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
resorted to testing PF_MEMALLOC.

-chris

^ permalink raw reply [flat|nested] 52+ messages in thread
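The idea of keying the decision on the caller's context rather than on the writeback flags might be modelled as below. This is a userspace sketch under stated assumptions, not reiserfs code: MODEL_PF_MEMALLOC, struct model_task, struct model_journal and model_log_page() are all hypothetical names, and "log space" is a simplification of what a journal transaction needs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's PF_MEMALLOC task flag. */
#define MODEL_PF_MEMALLOC 0x1u

struct model_task {
	unsigned int flags;
};

struct model_journal {
	long free_blocks;	/* log space available without waiting */
};

/*
 * Returns true if the page was logged, false if it was left dirty.
 * A caller in the memory-reclaim path must not wait for log space,
 * because freeing log space may itself need memory.
 */
static bool model_log_page(const struct model_task *task,
			   struct model_journal *journal)
{
	if (journal->free_blocks == 0) {
		if (task->flags & MODEL_PF_MEMALLOC)
			return false;	/* bail: leave the page dirty */
		/* An ordinary caller could sleep here; the model simply
		 * pretends that a commit freed one block. */
		journal->free_blocks = 1;
	}
	journal->free_blocks--;
	return true;
}
```

In this model a reclaim-context caller facing a full log returns immediately, while an ordinary writer is allowed to wait.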
* Re: dirty balancing deadlock
  2007-02-20 0:16 ` Chris Mason
@ 2007-02-20 8:53 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-20 8:53 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > > > > In general, writepage is supposed to do work without blocking on
> > > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > > fashion. You'll probably have to take the same approach reiserfs does
> > > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > > is going to block without making progress.
> > > >
> > > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > > balance_dirty_pages and fsync don't. The problem here is that
> > > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > > pages from a different queue.
> > >
> > > async or sync, writepage is supposed to either make progress or bail.
> > > loopback aside, if the fuse call is blocking long term, you're going to
> > > run into problems.
> >
> > Hmm, like what?
>
> Something a little different from what you're seeing. Basically if the
> PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
> transaction is waiting for more ram, the system will eventually grind to
> a halt. data=journal is the easiest way to hit it, since writepage
> always logs at least 4k.
>
> WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
> resorted to testing PF_MEMALLOC.

I'm not pretending to understand how journaling filesystems work, but
this shouldn't be an issue with fuse.

Can you show me a call path, where PF_MEMALLOC is set and .nonblocking
is not?

Thanks,
Miklos

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: dirty balancing deadlock
  2007-02-18 23:59 ` Andrew Morton
@ 2007-02-19 17:11 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19 17:11 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

How about this?

Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
I'll try to tackle that one as well.

If the per-bdi dirty counter goes below 16, balance_dirty_pages()
returns.

Does the constant need to be tunable? If it's too large, then the global
threshold is more easily exceeded. If it's too small, then in a tight
situation progress will be slower.

Thanks,
Miklos

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-19 18:05:28.000000000 +0100
@@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
 					dirty_thresh)
 				break;
 
+		/*
+		 * Acquit this producer if there's little or nothing
+		 * to write back to this particular queue
+		 *
+		 * Without this check a deadlock is possible in the
+		 * following case:
+		 *
+		 * - filesystem A writes data through filesystem B
+		 * - filesystem A has dirty pages over dirty_thresh
+		 * - writeback is started, this triggers a write in B
+		 * - balance_dirty_pages() is called synchronously
+		 * - the write to B blocks
+		 * - the writeback completes, but dirty is still over threshold
+		 * - the blocking write prevents further writes from happening
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 16)
+			break;
+
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 

^ permalink raw reply [flat|nested] 52+ messages in thread
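The effect of the per-queue escape can be illustrated with a small userspace model of the throttling decision. It is a sketch only: struct model_bdi and model_must_throttle() are hypothetical names, the single global_dirty argument stands in for the global dirty plus writeback count, and the real balance_dirty_pages() loop does considerably more than this.

```c
#include <stdbool.h>

#define MODEL_BDI_MIN_DIRTY 16	/* the constant 16 from the patch */

/* Hypothetical per-queue counters, modelled on the patch's bdi fields. */
struct model_bdi {
	long nr_dirty;
	long nr_writeback;
};

/* Returns true if a writer to this queue would have to keep waiting. */
static bool model_must_throttle(long global_dirty, long dirty_thresh,
				const struct model_bdi *bdi,
				bool per_queue_check)
{
	if (global_dirty <= dirty_thresh)
		return false;	/* under the global limit: no throttling */
	if (per_queue_check &&
	    bdi->nr_dirty + bdi->nr_writeback < MODEL_BDI_MIN_DIRTY)
		return false;	/* this queue has next to nothing to write back */
	return true;
}
```

Using the numbers quoted earlier in the thread (1099 dirty pages in A, none in B, limit 1000): without the check a writer to B waits indefinitely, with the check it is released, and a writer to A is still throttled.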
* Re: dirty balancing deadlock
  2007-02-19 17:11 ` Miklos Szeredi
@ 2007-02-19 23:12 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-19 23:12 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable? If it's too large, then the global
> threshold is more easily exceeded. If it's too small, then in a tight
> situation progress will be slower.

Similar in spirit, this should solve the deadlock on
throttle_vm_writeout(). Totally untested.

Does this approach look workable?

Thanks,
Miklos

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2007-02-19 23:39:36.000000000 +0100
+++ linux/include/linux/swap.h	2007-02-20 00:03:38.000000000 +0100
@@ -277,10 +277,14 @@ static inline void disable_swap_token(vo
 		put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+	atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
 #define total_swapcache_pages			0UL
+#define nr_swap_writeback			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-19 23:43:03.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-20 00:03:49.000000000 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -332,6 +333,9 @@ void throttle_vm_writeout(void)
 		if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
 				break;
+
+		if (nr_swap_writeback < 16)
+			break;
 		congestion_wait(WRITE, HZ/10);
 	}
 }
Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c	2007-02-19 23:24:23.000000000 +0100
+++ linux/mm/page_io.c	2007-02-19 23:42:21.000000000 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+	atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
 	bio_put(bio);
 	return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
+	atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);

^ permalink raw reply [flat|nested] 52+ messages in thread
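The pairing of the counter updates with the extra exit in throttle_vm_writeout() could be modelled in userspace roughly as follows. Every name here is a hypothetical stand-in (a plain long replaces the atomic swap bdi counter, and global_writeback stands in for the sum of unstable NFS and writeback pages), so this shows only the shape of the logic, not the patch itself.

```c
#include <stdbool.h>

#define MODEL_SWAP_WB_MIN 16	/* the constant 16 from the patch */

/* Stand-in for the swap queue's writeback counter (not atomic here). */
static long model_nr_swap_writeback;

/* Incremented when a swap write is submitted ... */
static void model_swap_write_start(void)
{
	model_nr_swap_writeback++;
}

/* ... and decremented when that write completes. */
static void model_swap_write_end(void)
{
	model_nr_swap_writeback--;
}

/* Returns true if the reclaim path would keep waiting for writeout. */
static bool model_vm_writeout_must_wait(long global_writeback,
					long dirty_thresh)
{
	if (global_writeback <= dirty_thresh)
		return false;
	if (model_nr_swap_writeback < MODEL_SWAP_WB_MIN)
		return false;	/* little or no swap I/O in flight */
	return true;
}
```

The waiter is held back only while the system is over the threshold and at least 16 swap writes are in flight; completing one of them releases it.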
* Re: dirty balancing deadlock
  2007-02-19 17:11 ` Miklos Szeredi
@ 2007-02-20 0:13 ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20 0:13 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Mon, Feb 19, 2007 at 06:11:55PM +0100, Miklos Szeredi wrote:
> How about this?
>
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable? If it's too large, then the global
> threshold is more easily exceeded. If it's too small, then in a tight
> situation progress will be slower.

Ok, what is supposed to happen here is that filesystems are supposed to
be throttled from making more dirty pages when the system is over the
threshold. Even if filesystem A doesn't have much to contribute, and
filesystem B is the cause of 99% of the dirty pages, the goal of the
threshold is to prevent more dirty data from happening, and filesystem A
should block.

But, with the producer consumer setup of fuse, I think this is a pretty
good compromise. 16 dirty/writeback pages shouldn't hurt the overall
limits too badly.

-chris

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: dirty balancing deadlock
  2007-02-20 0:13 ` Chris Mason
@ 2007-02-20 8:47 ` Miklos Szeredi
  -1 siblings, 0 replies; 52+ messages in thread
From: Miklos Szeredi @ 2007-02-20 8:47 UTC (permalink / raw)
  To: chris.mason; +Cc: akpm, linux-kernel, linux-mm

> > How about this?
> >
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> >
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> >
> > Does the constant need to be tunable? If it's too large, then the global
> > threshold is more easily exceeded. If it's too small, then in a tight
> > situation progress will be slower.
>
> Ok, what is supposed to happen here is that filesystems are supposed to
> be throttled from making more dirty pages when the system is over the
> threshold. Even if filesystem A doesn't have much to contribute, and
> filesystem B is the cause of 99% of the dirty pages, the goal of the
> threshold is to prevent more dirty data from happening, and filesystem A
> should block.

Which is the cause of the current deadlock. But if we allow
filesystem A to go into the red just a little, the deadlock is
avoided, because it can continue to make progress with cleaning the
dirtiness produced by B.

The maximum that filesystems can go over the limit will be

  (16 + epsilon) * number-of-queues

This is usually insignificant compared to the limit itself (~2000
pages on a machine with 32MB).

However with thousands of fuse mounts this may become a problem, as
each filesystem gets a separate queue. In theory, just 2 pages are
enough to always make progress, but current dirty balancing can't
enforce this, as the ratelimit is at least 8 pages.

So there may have to be some more strict page accounting within fuse
itself, but that doesn't change the overall concept I think.

Miklos

^ permalink raw reply [flat|nested] 52+ messages in thread
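To put numbers on the bound (16 + epsilon) * number-of-queues, here is a minimal arithmetic sketch. The function name model_max_overshoot() is hypothetical, and epsilon is left as a parameter because the thread does not pin down its value.

```c
/*
 * Worst-case number of pages over the global limit, following the
 * bound quoted in the thread: (per_queue + epsilon) * queues.
 * epsilon is a caller-supplied assumption.
 */
static long model_max_overshoot(long queues, long per_queue, long epsilon)
{
	return (per_queue + epsilon) * queues;
}
```

Taking epsilon as 0 for simplicity, 4 queues can overshoot by 64 pages, which is about 3% of the ~2000-page limit mentioned above, while 1000 queues could overshoot by 16000 pages, eight times that limit. This is the case where the per-queue allowance stops being insignificant.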
* Re: dirty balancing deadlock
  2007-02-20 8:47 ` Miklos Szeredi
@ 2007-02-20 11:30 ` Chris Mason
  -1 siblings, 0 replies; 52+ messages in thread
From: Chris Mason @ 2007-02-20 11:30 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-mm

On Tue, Feb 20, 2007 at 09:47:11AM +0100, Miklos Szeredi wrote:
> > > How about this?
> > >
> > > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > > I'll try to tackle that one as well.
> > >
> > > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > > returns.
> > >
> > > Does the constant need to be tunable? If it's too large, then the global
> > > threshold is more easily exceeded. If it's too small, then in a tight
> > > situation progress will be slower.
> >
> > Ok, what is supposed to happen here is that filesystems are supposed to
> > be throttled from making more dirty pages when the system is over the
> > threshold. Even if filesystem A doesn't have much to contribute, and
> > filesystem B is the cause of 99% of the dirty pages, the goal of the
> > threshold is to prevent more dirty data from happening, and filesystem A
> > should block.
>
> Which is the cause of the current deadlock. But if we allow
> filesystem A to go into the red just a little, the deadlock is
> avoided, because it can continue to make progress with cleaning the
> dirtiness produced by B.
>
> The maximum that filesystems can go over the limit will be
>
> (16 + epsilon) * number-of-queues

Right, even for thousands of mounted filesystems ~16 pages per FS
effectively pinned is not horrible.

-chris

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: dirty balancing deadlock
  2007-02-19 17:11 ` Miklos Szeredi
@ 2007-02-21 21:36 ` Andrew Morton
From: Andrew Morton @ 2007-02-21 21:36 UTC
To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

On Mon, 19 Feb 2007 18:11:55 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> How about this?

I still don't understand this bug.

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.
>
> Thanks,
> Miklos
>
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.000000000 +0100
> +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.000000000 +0100
> @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
>                                          dirty_thresh)
>                                  break;
>
> +                /*
> +                 * Acquit this producer if there's little or nothing
> +                 * to write back to this particular queue
> +                 *
> +                 * Without this check a deadlock is possible in the
> +                 * following case:
> +                 *
> +                 * - filesystem A writes data through filesystem B
> +                 * - filesystem A has dirty pages over dirty_thresh
> +                 * - writeback is started, this triggers a write in B
> +                 * - balance_dirty_pages() is called synchronously
> +                 * - the write to B blocks
> +                 * - the writeback completes, but dirty is still over threshold
> +                 * - the blocking write prevents further writes from happening
> +                 */
> +                if (atomic_long_read(&bdi->nr_dirty) +
> +                    atomic_long_read(&bdi->nr_writeback) < 16)
> +                        break;
> +

The problem seems to be that little "- the write to B blocks".

How come it blocks?  I mean, if we cannot retire writes to that filesystem
then we're screwed anyway.

Anyway, I think I'll think about this issue a little later on.  You might
as well prepare full changelogs for your proposed changes, because we'll
be needing them anyway.
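[Editorial note] The exit conditions under discussion can be restated as a user-space sketch, which may help when reading the quoted hunk. This is not kernel code: `struct bdi_counts` and the function names are hypothetical stand-ins, plain `long` values replace the atomic counters, and the form of the global check is inferred from the hunk's context lines rather than quoted from the source.

```c
#include <assert.h>
#include <stdbool.h>

#define BDI_MIN_PAGES 16	/* constant from the quoted patch */

/* Hypothetical stand-in for the per-queue counters the patch reads. */
struct bdi_counts {
	long nr_dirty;
	long nr_writeback;
};

/* Existing exit condition (assumed form): the system as a whole is at or
 * under the dirty threshold. */
static bool under_global_limit(long global_dirty, long global_writeback,
			       long dirty_thresh)
{
	return global_dirty + global_writeback <= dirty_thresh;
}

/* Exit condition added by the patch: this queue has little or nothing to
 * write back, so throttling its writer cannot help. */
static bool bdi_nearly_clean(const struct bdi_counts *bdi)
{
	assert(bdi);
	return bdi->nr_dirty + bdi->nr_writeback < BDI_MIN_PAGES;
}

/* A writer leaves the throttling loop if either condition holds. */
static bool writer_may_proceed(const struct bdi_counts *bdi,
			       long global_dirty, long global_writeback,
			       long dirty_thresh)
{
	return under_global_limit(global_dirty, global_writeback, dirty_thresh) ||
	       bdi_nearly_clean(bdi);
}
```

Under these assumptions, a writer to a clean queue proceeds even when the global counters are over the threshold, while a writer to a queue with many dirty pages is still held back.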
* Re: dirty balancing deadlock
  2007-02-21 21:36 ` Andrew Morton
@ 2007-02-22  7:42 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-22 7:42 UTC
To: akpm; +Cc: linux-kernel, linux-mm

> > How about this?
>
> I still don't understand this bug.
>
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> >
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> >
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> >
> > Thanks,
> > Miklos
> >
> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.000000000 +0100
> > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.000000000 +0100
> > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> >                                          dirty_thresh)
> >                                  break;
> >
> > +                /*
> > +                 * Acquit this producer if there's little or nothing
> > +                 * to write back to this particular queue
> > +                 *
> > +                 * Without this check a deadlock is possible in the
> > +                 * following case:
> > +                 *
> > +                 * - filesystem A writes data through filesystem B
> > +                 * - filesystem A has dirty pages over dirty_thresh
> > +                 * - writeback is started, this triggers a write in B
> > +                 * - balance_dirty_pages() is called synchronously
> > +                 * - the write to B blocks
> > +                 * - the writeback completes, but dirty is still over threshold
> > +                 * - the blocking write prevents further writes from happening
> > +                 */
> > +                if (atomic_long_read(&bdi->nr_dirty) +
> > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > +                        break;
> > +
>
> The problem seems to be that little "- the write to B blocks".
>
> How come it blocks?  I mean, if we cannot retire writes to that filesystem
> then we're screwed anyway.

Sorry about the sloppy description.  I mean, it's not the low-level
write that will block, but rather the VFS one
(generic_file_aio_write).  It will block (or rather loop forever with
0.1 second sleeps) in balance_dirty_pages().  That means that for
this inode, i_mutex is held and no other writer can continue the work.

Miklos
* Re: dirty balancing deadlock
  2007-02-22  7:42 ` Miklos Szeredi
@ 2007-02-22  7:55 ` Andrew Morton
From: Andrew Morton @ 2007-02-22 7:55 UTC
To: Miklos Szeredi; +Cc: linux-kernel, linux-mm

> On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > Index: linux/mm/page-writeback.c
> > > ===================================================================
> > > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.000000000 +0100
> > > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.000000000 +0100
> > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > >                                          dirty_thresh)
> > >                                  break;
> > >
> > > +                /*
> > > +                 * Acquit this producer if there's little or nothing
> > > +                 * to write back to this particular queue
> > > +                 *
> > > +                 * Without this check a deadlock is possible in the
> > > +                 * following case:
> > > +                 *
> > > +                 * - filesystem A writes data through filesystem B
> > > +                 * - filesystem A has dirty pages over dirty_thresh
> > > +                 * - writeback is started, this triggers a write in B
> > > +                 * - balance_dirty_pages() is called synchronously
> > > +                 * - the write to B blocks
> > > +                 * - the writeback completes, but dirty is still over threshold
> > > +                 * - the blocking write prevents further writes from happening
> > > +                 */
> > > +                if (atomic_long_read(&bdi->nr_dirty) +
> > > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > > +                        break;
> > > +
> >
> > The problem seems to be that little "- the write to B blocks".
> >
> > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > then we're screwed anyway.
>
> Sorry about the sloppy description.  I mean, it's not the low-level
> write that will block, but rather the VFS one
> (generic_file_aio_write).  It will block (or rather loop forever with
> 0.1 second sleeps) in balance_dirty_pages().  That means that for
> this inode, i_mutex is held and no other writer can continue the work.

"this inode" I assume is the inode against filesystem A?

Why does holding that inode's i_mutex prevent further writeback of pages in A?
* Re: dirty balancing deadlock
  2007-02-22  7:55 ` Andrew Morton
@ 2007-02-22  8:02 ` Miklos Szeredi
From: Miklos Szeredi @ 2007-02-22 8:02 UTC
To: akpm; +Cc: miklos, linux-kernel, linux-mm

> > On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > Index: linux/mm/page-writeback.c
> > > > ===================================================================
> > > > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.000000000 +0100
> > > > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.000000000 +0100
> > > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > > >                                          dirty_thresh)
> > > >                                  break;
> > > >
> > > > +                /*
> > > > +                 * Acquit this producer if there's little or nothing
> > > > +                 * to write back to this particular queue
> > > > +                 *
> > > > +                 * Without this check a deadlock is possible in the
> > > > +                 * following case:
> > > > +                 *
> > > > +                 * - filesystem A writes data through filesystem B
> > > > +                 * - filesystem A has dirty pages over dirty_thresh
> > > > +                 * - writeback is started, this triggers a write in B
> > > > +                 * - balance_dirty_pages() is called synchronously
> > > > +                 * - the write to B blocks
> > > > +                 * - the writeback completes, but dirty is still over threshold
> > > > +                 * - the blocking write prevents further writes from happening
> > > > +                 */
> > > > +                if (atomic_long_read(&bdi->nr_dirty) +
> > > > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > > > +                        break;
> > > > +
> > >
> > > The problem seems to be that little "- the write to B blocks".
> > >
> > > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > > then we're screwed anyway.
> >
> > Sorry about the sloppy description.  I mean, it's not the low-level
> > write that will block, but rather the VFS one
> > (generic_file_aio_write).  It will block (or rather loop forever with
> > 0.1 second sleeps) in balance_dirty_pages().  That means that for
> > this inode, i_mutex is held and no other writer can continue the work.
>
> "this inode" I assume is the inode against filesystem A?

No, the one in B.

> Why does holding that inode's i_mutex prevent further writeback of
> pages in A?

It is generic_file_aio_write() that is holding the mutex.

Here's the stack for the filesystem daemon trying to write back a
page:

08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69

Miklos
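[Editorial note] The cycle described in this exchange can be illustrated with a toy model. It is a deliberate simplification and not kernel code. It assumes that all dirty pages belong to A, that cleaning one page of A requires one buffered write into B, and that nothing else cleans A while B's writer sits in the throttling loop holding i_mutex. The function name, the `MAX_RETRIES` cap (standing in for "loops forever with 0.1 second sleeps") and the single merged dirty counter per filesystem are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define BDI_MIN_PAGES 16	/* constant from the quoted patch */
#define MAX_RETRIES 100		/* stands in for "loops forever" */

/* Returns the number of A's pages still dirty when writeback either
 * finishes (0) or gets stuck (non-zero). */
static long simulate_writeback(long dirty_a, long dirty_thresh,
			       bool per_bdi_exemption)
{
	long dirty_b = 0;

	assert(dirty_a >= 0 && dirty_thresh >= 0);

	while (dirty_a > 0) {
		int retries = 0;

		/* Writing back one page of A dirties one page of B. */
		dirty_b++;

		/* B's writer enters the throttling loop with i_mutex held. */
		for (;;) {
			if (dirty_a + dirty_b <= dirty_thresh)
				break;		/* global check passes */
			if (per_bdi_exemption && dirty_b < BDI_MIN_PAGES)
				break;		/* B is nearly clean */
			/* Nobody else can clean A, so nothing changes. */
			if (++retries == MAX_RETRIES)
				return dirty_a;	/* stuck */
		}

		/* The writer returned: B's page is written out and A's
		 * page becomes clean. */
		dirty_b--;
		dirty_a--;
	}
	return 0;
}
```

In this model, `simulate_writeback(200, 100, false)` gets stuck with all 200 pages still dirty, while the same call with the exemption enabled drains to 0. When A starts under the threshold, both variants finish.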