* [PATCH][5/?] count writeback pages in nr_scanned @ 2005-01-03 17:25 Rik van Riel 2005-01-05 10:08 ` Andrew Morton 2005-01-05 23:26 ` Andrew Morton 0 siblings, 2 replies; 43+ messages in thread From: Rik van Riel @ 2005-01-03 17:25 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Andrea Arcangeli, Marcelo Tosatti Still untested, but posting the concept here anyway, since this could explain a lot... OOM kills have been observed with 70% of the pages in lowmem being in the writeback state. If we count those pages in sc->nr_scanned, the VM should throttle and wait for IO completion, instead of OOM killing. Signed-off-by: Rik van Riel <riel@redhat.com> --- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500 +++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500 @@ -376,10 +376,10 @@ BUG_ON(PageActive(page)); + sc->nr_scanned++; if (PageWriteback(page)) goto keep_locked; - sc->nr_scanned++; /* Double the slab pressure for mapped and swapcache pages */ if (page_mapped(page) || PageSwapCache(page)) sc->nr_scanned++; ^ permalink raw reply [flat|nested] 43+ messages in thread
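[Editor's note: for readers without the tree at hand, the effect of moving the two lines can be modeled outside the kernel. The following is a hypothetical userspace sketch, not the actual mm/vmscan.c code; the struct and flag names are invented for illustration, and only the ordering of the nr_scanned increment relative to the writeback test mirrors the real change.]

```c
/* Toy model of the 2.6.9 shrink_list() accounting after the patch.
 * All names are invented; only the increment ordering matters here. */
struct scan_control_model {
	int nr_scanned;
};

enum page_flags_model {
	MODEL_WRITEBACK = 1 << 0,
	MODEL_MAPPED    = 1 << 1,
};

/* Returns 1 if the page was "reclaimed", 0 if it was kept. */
static int scan_one_page(struct scan_control_model *sc, int flags)
{
	sc->nr_scanned++;		/* the fix: count before the writeback check */
	if (flags & MODEL_WRITEBACK)
		return 0;		/* keep_locked: page is kept, but still counted */
	if (flags & MODEL_MAPPED)
		sc->nr_scanned++;	/* double slab pressure for mapped pages */
	return 1;
}

/* With the old ordering, writeback pages contributed nothing to
 * nr_scanned, so a lowmem zone full of writeback pages looked
 * "unscannable" and the VM escalated to the OOM killer instead of
 * throttling on I/O completion. */
static int model_scan(int nr_writeback, int nr_mapped)
{
	struct scan_control_model sc = { 0 };
	int i;

	for (i = 0; i < nr_writeback; i++)
		scan_one_page(&sc, MODEL_WRITEBACK);
	for (i = 0; i < nr_mapped; i++)
		scan_one_page(&sc, MODEL_MAPPED);
	return sc.nr_scanned;
}
```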
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-03 17:25 [PATCH][5/?] count writeback pages in nr_scanned Rik van Riel @ 2005-01-05 10:08 ` Andrew Morton 2005-01-05 18:06 ` Andrea Arcangeli 2005-01-05 23:26 ` Andrew Morton 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-05 10:08 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, andrea, marcelo.tosatti Rik van Riel <riel@redhat.com> wrote: > > Still untested, but posting the concept here anyway, since this > could explain a lot... > > OOM kills have been observed with 70% of the pages in lowmem being > in the writeback state. If we count those pages in sc->nr_scanned, > the VM should throttle and wait for IO completion, instead of OOM > killing. > > Signed-off-by: Rik van Riel <riel@redhat.com> > > --- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500 > +++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500 > @@ -376,10 +376,10 @@ > > BUG_ON(PageActive(page)); > > + sc->nr_scanned++; > if (PageWriteback(page)) > goto keep_locked; > > - sc->nr_scanned++; Patch looks very sane. It in fact restores that which we were doing until 12 June 2004, when the rampant `struct scan_control' depredations violated the tree. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 10:08 ` Andrew Morton @ 2005-01-05 18:06 ` Andrea Arcangeli 2005-01-05 18:50 ` Rik van Riel 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-05 18:06 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, linux-kernel, marcelo.tosatti On Wed, Jan 05, 2005 at 02:08:59AM -0800, Andrew Morton wrote: > Rik van Riel <riel@redhat.com> wrote: > > > > Still untested, but posting the concept here anyway, since this > > could explain a lot... > > > > OOM kills have been observed with 70% of the pages in lowmem being > > in the writeback state. If we count those pages in sc->nr_scanned, > > the VM should throttle and wait for IO completion, instead of OOM > > killing. > > > > Signed-off-by: Rik van Riel <riel@redhat.com> > > > > --- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500 > > +++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500 > > @@ -376,10 +376,10 @@ > > > > BUG_ON(PageActive(page)); > > > > + sc->nr_scanned++; > > if (PageWriteback(page)) > > goto keep_locked; > > > > - sc->nr_scanned++; > > Patch looks very sane. It in fact restores that which we were doing until > 12 June 2004, when the rampant `struct scan_control' depredations violated > the tree. Agreed. Another unrelated problem I have in this same area and that can explain VM troubles at least theoretically, is that blk_congestion_wait is broken by design. First we cannot wait on random I/O not related to write back. Second blk_congestion_wait gets trivially fooled by direct-io for example. Plus the timeout may cause it to return too early with slow blkdev. blk_congestion_wait is a fundamental piece to get oom detection right during writeback and unfortunately it's fundamentally fragile in 2.6 (this as usual wasn't the case in 2.4). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 18:06 ` Andrea Arcangeli @ 2005-01-05 18:50 ` Rik van Riel 2005-01-05 17:49 ` Marcelo Tosatti 0 siblings, 1 reply; 43+ messages in thread From: Rik van Riel @ 2005-01-05 18:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, marcelo.tosatti On Wed, 5 Jan 2005, Andrea Arcangeli wrote: > Another unrelated problem I have in this same area and that can explain > VM troubles at least theoretically, is that blk_congestion_wait is > broken by design. First we cannot wait on random I/O not related to > write back. Second blk_congestion_wait gets trivially fooled by > direct-io for example. Plus the timeout may cause it to return too early > with slow blkdev. Or the IO that just finished, finished for pages in another memory zone, or pages we won't scan again in our current go-around through the VM... > blk_congestion_wait is a fundamental piece to get oom detection right > during writeback and unfortunately it's fundamentally fragile in 2.6 > (this as usual wasn't the case in 2.4). Indeed ;( -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 18:50 ` Rik van Riel @ 2005-01-05 17:49 ` Marcelo Tosatti 2005-01-05 21:44 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Marcelo Tosatti @ 2005-01-05 17:49 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel On Wed, Jan 05, 2005 at 01:50:51PM -0500, Rik van Riel wrote: > On Wed, 5 Jan 2005, Andrea Arcangeli wrote: > > >Another unrelated problem I have in this same area and that can explain > >VM troubles at least theoretically, is that blk_congestion_wait is > >broken by design. First we cannot wait on random I/O not related to > >write back. Second blk_congestion_wait gets trivially fooled by > >direct-io for example. Plus the timeout may cause it to return too early > >with slow blkdev. > > Or the IO that just finished, finished for pages in > another memory zone, or pages we won't scan again in > our current go-around through the VM... The thing is, at the block level there is no distinction between pages based on the purpose for which they were written out. One can conjecture the following: a per-zone waitqueue to be awakened from end_page_writeback() (for PG_reclaim pages only of course), and a function to wait on the per-zone waitqueue: wait_vm_writeback (zone, timeout); Instead of the current blk_congestion_wait() on try_to_free_pages/balance_pgdat. It will probably slow the reclaiming procedure in general, though, and has other side effects, but it's the only way of strictly following VM writeback progress from VM reclaiming routines. Does it make any sense? > >blk_congestion_wait is a fundamental piece to get oom detection right > >during writeback and unfortunately it's fundamentally fragile in 2.6 > >(this as usual wasn't the case in 2.4). > > Indeed ;( ^ permalink raw reply [flat|nested] 43+ messages in thread
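[Editor's note: the semantics Marcelo proposes can be sketched as a single-threaded userspace model. The signature wait_vm_writeback(zone, timeout) comes from the message above; the struct fields and the "one completion per tick" simulation are invented so the behaviour can be exercised without a kernel.]

```c
/* Single-threaded model of the proposed per-zone writeback throttle. */
struct zone_model {
	int nr_reclaim_writeback;	/* PG_reclaim pages still under writeback */
};

/* In the real proposal this would be called from end_page_writeback()
 * for PG_reclaim pages, and would wake_up() the per-zone waitqueue. */
static void model_end_page_writeback(struct zone_model *z)
{
	if (z->nr_reclaim_writeback > 0)
		z->nr_reclaim_writeback--;
}

/* Sleep until a reclaim-writeback page in *this* zone completes or the
 * timeout expires; returns the number of ticks actually waited.  The
 * model pretends exactly one write completes per tick of waiting. */
static int wait_vm_writeback(struct zone_model *z, int timeout)
{
	int waited = 0;

	while (waited < timeout && z->nr_reclaim_writeback > 0) {
		model_end_page_writeback(z);	/* stand-in for a wakeup */
		waited++;
	}
	return waited;
}
```

The key difference from blk_congestion_wait() is visible in the model: the wait is tied to writeback completions in the allocator's own zone, so unrelated I/O (direct-io, other zones) cannot cut the throttle short.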
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 17:49 ` Marcelo Tosatti @ 2005-01-05 21:44 ` Andrew Morton 2005-01-05 20:32 ` Marcelo Tosatti 0 siblings, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-05 21:44 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: riel, andrea, linux-kernel Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > On Wed, Jan 05, 2005 at 01:50:51PM -0500, Rik van Riel wrote: > > On Wed, 5 Jan 2005, Andrea Arcangeli wrote: > > > > >Another unrelated problem I have in this same area and that can explain > > >VM troubles at least theoretically, is that blk_congestion_wait is > > >broken by design. First we cannot wait on random I/O not related to > > >write back. Second blk_congestion_wait gets trivially fooled by > > >direct-io for example. Plus the timeout may cause it to return too early > > >with slow blkdev. That's true, as we discussed a couple of months back. But the current code is nice and simple and has been there for a couple of years with no observed problems. > > Or the IO that just finished, finished for pages in > > another memory zone, or pages we won't scan again in > > our current go-around through the VM... > > Thing is there is no distinction between pages which have been written out > for what purpose at the block level. > > One can conjecture the following: per-zone waitqueue to be awakened from > end_page_writeback() (for PG_reclaim pages only of course), and a function > to wait on the perzone waitqueue: > > wait_vm_writeback (zone, timeout); > > Instead of the current blk_congestion_wait() on try_to_free_pages/balance_pgdat. > The caller would need to wait on all the zones which can satisfy the caller's allocation request. A bit messy, although not rocket science. One would have to be careful to avoid additional CPU consumption due to delivery of multiple wakeups at each I/O completion. We should be able to demonstrate that such a change really fixes some problem though. 
Otherwise, why bother? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 21:44 ` Andrew Morton @ 2005-01-05 20:32 ` Marcelo Tosatti 2005-01-05 23:51 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Marcelo Tosatti @ 2005-01-05 20:32 UTC (permalink / raw) To: Andrew Morton; +Cc: riel, andrea, linux-kernel > The caller would need to wait on all the zones which can satisfy the > caller's allocation request. A bit messy, although not rocket science. > One would have to be careful to avoid additional CPU consumption due to > delivery of multiple wakeups at each I/O completion. > > We should be able to demonstrate that such a change really fixes some > problem though. Otherwise, why bother? Agreed. The current scheme works well enough: we don't have spurious OOM kills anymore, which is the only "problem" such a change ought to fix. You might see a performance increase in some situations, I believe (because you have per-zone waitqueues), but I agree it does not seem to be worth the trouble. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 20:32 ` Marcelo Tosatti @ 2005-01-05 23:51 ` Nick Piggin 2005-01-06 1:27 ` Rik van Riel 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-05 23:51 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, riel, andrea, linux-kernel Marcelo Tosatti wrote: >>The caller would need to wait on all the zones which can satisfy the >>caller's allocation request. A bit messy, although not rocket science. >>One would have to be careful to avoid additional CPU consumption due to >>delivery of multiple wakeups at each I/O completion. >> >>We should be able to demonstrate that such a change really fixes some >>problem though. Otherwise, why bother? > > > Agreed. The current scheme works well enough, we dont have spurious OOM kills > anymore, which is the only "problem" such change ought to fix. > > You might have performance increase in some situations I believe (because you > have perzone waitqueues), but I agree its does not seem to be worth the > trouble. I think what Andrea is worried about is that blk_congestion_wait is fairly vague, and can be a source of instability in the scanning implementation. For example, if you have a heavy IO workload that is saturating your disks, blk_congestion_wait may do the right thing and sleep until they become uncongested and writeout can continue. But at 2:00 am, when your backup job is trickling writes into another block device, blk_congestion_wait returns much earlier, and before many pages have been cleaned. Bad example? Yeah maybe, but I think this is what Andrea is getting at. Would it be a problem to replace those blk_congestion_waits with unconditional io_schedule_timeout()s? That would be the dumb-but-more-deterministic solution. ^ permalink raw reply [flat|nested] 43+ messages in thread
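[Editor's note: the determinism argument can be made concrete with a toy model. The tick counts below are arbitrary and the function names only mirror the kernel functions being discussed; these are standalone sketches, not the block-layer code.]

```c
/* blk_congestion_wait()-like behaviour: returns as soon as *any* write
 * in the system completes, or when the timeout expires, whichever comes
 * first.  Unrelated trickle writes therefore cut the throttle short. */
static int model_congestion_wait(int timeout, int ticks_until_any_completion)
{
	return ticks_until_any_completion < timeout
		? ticks_until_any_completion : timeout;
}

/* io_schedule_timeout()-like behaviour: always sleeps the full timeout,
 * regardless of what I/O happens to complete in the meantime. */
static int model_io_schedule_timeout(int timeout)
{
	return timeout;
}
```

In the 2:00 am backup scenario above, a single unrelated write completing after a couple of ticks wakes the congestion waiter almost immediately, while the unconditional timeout sleeps the full interval and gives writeback real time to make progress.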
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-05 23:51 ` Nick Piggin @ 2005-01-06 1:27 ` Rik van Riel 2005-01-06 1:33 ` Nick Piggin 2005-01-06 1:36 ` Andrew Morton 0 siblings, 2 replies; 43+ messages in thread From: Rik van Riel @ 2005-01-06 1:27 UTC (permalink / raw) To: Nick Piggin; +Cc: Marcelo Tosatti, Andrew Morton, andrea, linux-kernel On Thu, 6 Jan 2005, Nick Piggin wrote: > I think what Andrea is worried about is that blk_congestion_wait is > fairly vague, and can be a source of instability in the scanning > implementation. The recent OOM kill problem has been happening: 1) with cache pressure on lowmem only, due to a block device write 2) with no block congestion at all 3) with pretty much all pageable lowmem pages in writeback state It appears the VM has trouble dealing with the situation where there is no block congestion to wait on... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:27 ` Rik van Riel @ 2005-01-06 1:33 ` Nick Piggin 2005-01-06 1:37 ` Andrew Morton 2005-01-06 1:36 ` Andrew Morton 1 sibling, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 1:33 UTC (permalink / raw) To: Rik van Riel; +Cc: Marcelo Tosatti, Andrew Morton, andrea, linux-kernel Rik van Riel wrote: > On Thu, 6 Jan 2005, Nick Piggin wrote: > >> I think what Andrea is worried about is that blk_congestion_wait is >> fairly vague, and can be a source of instability in the scanning >> implementation. > > > The recent OOM kill problem has been happening: > 1) with cache pressure on lowmem only, due to a block device write > 2) with no block congestion at all > 3) with pretty much all pageable lowmme pages in writeback state > > It appears the VM has trouble dealing with the situation where > there is no block congestion to wait on... > Try, together with your nr_scanned patch, to replace blk_congestion_wait with io_schedule_timeout. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:33 ` Nick Piggin @ 2005-01-06 1:37 ` Andrew Morton 2005-01-06 1:40 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 1:37 UTC (permalink / raw) To: Nick Piggin; +Cc: riel, marcelo.tosatti, andrea, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Rik van Riel wrote: > > On Thu, 6 Jan 2005, Nick Piggin wrote: > > > >> I think what Andrea is worried about is that blk_congestion_wait is > >> fairly vague, and can be a source of instability in the scanning > >> implementation. > > > > > > The recent OOM kill problem has been happening: > > 1) with cache pressure on lowmem only, due to a block device write > > 2) with no block congestion at all > > 3) with pretty much all pageable lowmme pages in writeback state > > > > It appears the VM has trouble dealing with the situation where > > there is no block congestion to wait on... > > > > Try, together with your nr_scanned patch, to replace blk_congestion_wait > with io_schedule_timeout. Why? Is the nr_scanned fix insufficient? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:37 ` Andrew Morton @ 2005-01-06 1:40 ` Nick Piggin 2005-01-06 1:52 ` Andrea Arcangeli 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 1:40 UTC (permalink / raw) To: Andrew Morton; +Cc: riel, marcelo.tosatti, andrea, linux-kernel Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>Rik van Riel wrote: >> >>>On Thu, 6 Jan 2005, Nick Piggin wrote: >>> >>> >>>>I think what Andrea is worried about is that blk_congestion_wait is >>>>fairly vague, and can be a source of instability in the scanning >>>>implementation. >>> >>> >>>The recent OOM kill problem has been happening: >>>1) with cache pressure on lowmem only, due to a block device write >>>2) with no block congestion at all >>>3) with pretty much all pageable lowmme pages in writeback state >>> >>>It appears the VM has trouble dealing with the situation where >>>there is no block congestion to wait on... >>> >> >>Try, together with your nr_scanned patch, to replace blk_congestion_wait >>with io_schedule_timeout. > > > Why? Is the nr_scanned fix insufficient? > I thought it sounded like he implied that nr_scanned was insufficient (otherwise he might have said "to wait on ... but my patch fixes it"). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:40 ` Nick Piggin @ 2005-01-06 1:52 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 1:52 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel On Thu, Jan 06, 2005 at 12:40:47PM +1100, Nick Piggin wrote: > I thought it sounded like he implied that nr_scanned was insufficient > (otherwise he might have said "to wait on ... but my patch fixes it"). BTW, from my part I can't reproduce it (even without the nr_scanned, that I'm going to apply too anyway just in case), but my hardware may have different timings dunno. All I could reproduce were the swap-token issues (I had a flood of reports about that too, all fixed by now with my 1-6/4 patches). The other patches agreed here looks good but I don't have actual pending bugs on that side and I can't reproduce problems either in basic testing, so they're applied now for correctness. Thanks. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:27 ` Rik van Riel 2005-01-06 1:33 ` Nick Piggin @ 2005-01-06 1:36 ` Andrew Morton 2005-01-06 3:42 ` Rik van Riel 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 1:36 UTC (permalink / raw) To: Rik van Riel; +Cc: nickpiggin, marcelo.tosatti, andrea, linux-kernel Rik van Riel <riel@redhat.com> wrote: > > On Thu, 6 Jan 2005, Nick Piggin wrote: > > > I think what Andrea is worried about is that blk_congestion_wait is > > fairly vague, and can be a source of instability in the scanning > > implementation. > > The recent OOM kill problem has been happening: > 1) with cache pressure on lowmem only, due to a block device write > 2) with no block congestion at all > 3) with pretty much all pageable lowmme pages in writeback state You must have a wild number of requests configured in the queue. Is this CFQ? I've done testing with "all of memory under writeback" before and it went OK. It's certainly a design objective to handle this well. But that testing was before we broke it. > It appears the VM has trouble dealing with the situation where > there is no block congestion to wait on... It's misnamed. We don't "wait for the queue to come out of congestion". We simply throttle until a write completes, or, rarely, timeout. The bug which you fixed would cause the VM to scan itself to death without throttling. Did the fix fix things? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 1:36 ` Andrew Morton @ 2005-01-06 3:42 ` Rik van Riel 2005-01-06 3:50 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Rik van Riel @ 2005-01-06 3:42 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, marcelo.tosatti, andrea, linux-kernel On Wed, 5 Jan 2005, Andrew Morton wrote: > Rik van Riel <riel@redhat.com> wrote: >> The recent OOM kill problem has been happening: >> 1) with cache pressure on lowmem only, due to a block device write >> 2) with no block congestion at all >> 3) with pretty much all pageable lowmme pages in writeback state > > You must have a wild number of requests configured in the queue. Is > this CFQ? Yes, it is with CFQ. Around 650MB of lowmem is in writeback stage, which is over 99% of the active and inactive lowmem pages... > I've done testing with "all of memory under writeback" before and it > went OK. It's certainly a design objective to handle this well. But > that testing was before we broke it. I suspect something might still be broken. It may take a few days of continuous testing to trigger the bug, though ... > The bug which you fixed would cause the VM to scan itself to death > without throttling. Did the fix fix things? Mostly fixed. Things are down to the point where it often takes over a day to bring out the bug. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 3:42 ` Rik van Riel @ 2005-01-06 3:50 ` Nick Piggin 2005-01-06 4:26 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 3:50 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, marcelo.tosatti, andrea, linux-kernel Rik van Riel wrote: > On Wed, 5 Jan 2005, Andrew Morton wrote: > >> Rik van Riel <riel@redhat.com> wrote: > > >>> The recent OOM kill problem has been happening: >>> 1) with cache pressure on lowmem only, due to a block device write >>> 2) with no block congestion at all >>> 3) with pretty much all pageable lowmem pages in writeback state >> >> >> You must have a wild number of requests configured in the queue. Is >> this CFQ? > > > Yes, it is with CFQ. Around 650MB of lowmem is in writeback > stage, which is over 99% of the active and inactive lowmem > pages... > >> I've done testing with "all of memory under writeback" before and it >> went OK. It's certainly a design objective to handle this well. But >> that testing was before we broke it. > > > I suspect something might still be broken. It may take a few > days of continuous testing to trigger the bug, though ... > It is possible we are hitting those blk_congestion_wait paths, because the queue simply won't be congested. So doing io_schedule_timeout might help. I wonder if reducing the size of the write queue in CFQ would help too? IIRC, it only really wants a huge read queue. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 3:50 ` Nick Piggin @ 2005-01-06 4:26 ` Andrew Morton 2005-01-06 4:35 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 4:26 UTC (permalink / raw) To: Nick Piggin; +Cc: riel, marcelo.tosatti, andrea, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > > I suspect something might still be broken. It may take a few > > days of continuous testing to trigger the bug, though ... > > > > It is possible to be those blk_congestion_wait paths, because > the queue simply won't be congested. So doing io_schedule_timeout > might help. If the queue is not congested, blk_congestion_wait() will still sleep. See freed_request(). > I wonder if reducing the size of the write queue in CFQ would help > too? IIRC, it only really wants a huge read queue. Surely it will help - but we need to be able to handle the situation because memory can still become full of PageWriteback pages if there are many disks. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:26 ` Andrew Morton @ 2005-01-06 4:35 ` Nick Piggin 2005-01-06 4:47 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 4:35 UTC (permalink / raw) To: Andrew Morton; +Cc: riel, marcelo.tosatti, andrea, linux-kernel Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>>I suspect something might still be broken. It may take a few >>>days of continuous testing to trigger the bug, though ... >>> >> >>It is possible to be those blk_congestion_wait paths, because >>the queue simply won't be congested. So doing io_schedule_timeout >>might help. > > > If the queue is not congested, blk_congestion_wait() will still sleep. See > freed_request(). > Hmm... doesn't look like it to me: if (rl->count[rw] < queue_congestion_off_threshold(q)) clear_queue_congested(q, rw); And clear_queue_congested does an unconditional wakeup (if there is someone sleeping on the congestion queue). > >>I wonder if reducing the size of the write queue in CFQ would help >>too? IIRC, it only really wants a huge read queue. > > > Surely it will help - but we need to be able to handle the situation > because memory can still become full of PageWriteback pages if there are > many disks. > Yep. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:35 ` Nick Piggin @ 2005-01-06 4:47 ` Andrew Morton 2005-01-06 4:55 ` Nick Piggin 2005-01-06 4:59 ` [PATCH][5/?] count writeback pages in nr_scanned Andrea Arcangeli 0 siblings, 2 replies; 43+ messages in thread From: Andrew Morton @ 2005-01-06 4:47 UTC (permalink / raw) To: Nick Piggin; +Cc: riel, marcelo.tosatti, andrea, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > If the queue is not congested, blk_congestion_wait() will still sleep. See > > freed_request(). > > > > Hmm... doesn't look like it to me: > > if (rl->count[rw] < queue_congestion_off_threshold(q)) > clear_queue_congested(q, rw); > > And clear_queue_congested does an unconditional wakeup (if there > is someone sleeping on the congestion queue). That's my point. blk_congestion_wait() will always sleep, regardless of the queue's congestion state. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:47 ` Andrew Morton @ 2005-01-06 4:55 ` Nick Piggin 2005-01-06 5:03 ` Andrea Arcangeli 2005-01-06 8:06 ` Jens Axboe 2005-01-06 4:59 ` [PATCH][5/?] count writeback pages in nr_scanned Andrea Arcangeli 1 sibling, 2 replies; 43+ messages in thread From: Nick Piggin @ 2005-01-06 4:55 UTC (permalink / raw) To: Andrew Morton; +Cc: riel, marcelo.tosatti, andrea, linux-kernel, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 1052 bytes --] Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> > If the queue is not congested, blk_congestion_wait() will still sleep. See >> > freed_request(). >> > >> >> Hmm... doesn't look like it to me: >> >> if (rl->count[rw] < queue_congestion_off_threshold(q)) >> clear_queue_congested(q, rw); >> >> And clear_queue_congested does an unconditional wakeup (if there >> is someone sleeping on the congestion queue). > > > That's my point. blk_congestion_wait() will always sleep, regardless of > the queue's congestion state. > Oh yes, but it will return as soon as a single request is finished. Which is probably a couple of milliseconds, rather than the 100 we had hoped for. So the allocators will wake up again and go around the loop and still make no progress. However, if you had a plain io_schedule_timeout there, at least you would sleep for the full extent of the specified timeout. BTW. Jens, now that I look at freed_request, is the memory barrier required? If so, what is it protecting? [-- Attachment #2: blk-no-mb.patch --] [-- Type: text/plain, Size: 1275 bytes --] This memory barrier is not needed because the waitqueue will only get waiters on it in the following situations: rl->count has exceeded the threshold - however all manipulations of ->count are performed under the queue lock, and so we will correctly pick up any waiter. Memory allocation for the request fails. In this case, there is no additional help provided by the memory barrier.
We are guaranteed to eventually wake up waiters because the request allocation mempool guarantees that if the mem allocation for a request fails, there must be some requests in flight. They will wake up waiters when they are retired. --- linux-2.6-npiggin/drivers/block/ll_rw_blk.c | 1 - 1 files changed, 1 deletion(-) diff -puN drivers/block/ll_rw_blk.c~blk-no-mb drivers/block/ll_rw_blk.c --- linux-2.6/drivers/block/ll_rw_blk.c~blk-no-mb 2005-01-06 15:37:05.000000000 +1100 +++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c 2005-01-06 15:37:12.000000000 +1100 @@ -1630,7 +1630,6 @@ static void freed_request(request_queue_ if (rl->count[rw] < queue_congestion_off_threshold(q)) clear_queue_congested(q, rw); if (rl->count[rw]+1 <= q->nr_requests) { - smp_mb(); if (waitqueue_active(&rl->wait[rw])) wake_up(&rl->wait[rw]); blk_clear_queue_full(q, rw); _ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:55 ` Nick Piggin @ 2005-01-06 5:03 ` Andrea Arcangeli 2005-01-06 8:06 ` Jens Axboe 1 sibling, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:03 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel, Jens Axboe On Thu, Jan 06, 2005 at 03:55:34PM +1100, Nick Piggin wrote: > However, if you had a plain io_schedule_timeout there, at least you > would sleep for the full extent of the specified timeout. I agree it sure would be safer, but OTOH it may screw up performance by waiting for unnecessarily long times on fast storage. So it's ok for a test, but still it wouldn't be a final fix, since the timeout may be still too short in some cases. Waiting on one (or more) PG_writeback bitflags to go away should fix it completely. This is how 2.4 throttles on the writeback I/O too, of course. How we choose which random page to pick may vary though. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:55 ` Nick Piggin 2005-01-06 5:03 ` Andrea Arcangeli @ 2005-01-06 8:06 ` Jens Axboe 2005-01-06 8:16 ` memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) Nick Piggin 1 sibling, 1 reply; 43+ messages in thread From: Jens Axboe @ 2005-01-06 8:06 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, andrea, linux-kernel On Thu, Jan 06 2005, Nick Piggin wrote: > Andrew Morton wrote: > >Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > >>> If the queue is not congested, blk_congestion_wait() will still sleep. > >>See > >>> freed_request(). > >>> > >> > >>Hmm... doesn't look like it to me: > >> > >> if (rl->count[rw] < queue_congestion_off_threshold(q)) > >> clear_queue_congested(q, rw); > >> > >>And clear_queue_congested does an unconditional wakeup (if there > >>is someone sleeping on the congestion queue). > > > > > >That's my point. blk_congestion_wait() will always sleep, regardless of > >the queue's congestion state. > > > > Oh yes, but it will return as soon as a single request is finished. > Which is probably a couple of milliseconds, rather than the 100 we > had hoped for. So the allocators will wake up again and go around > the loop and still make no progress. > > However, if you had a plain io_schedule_timeout there, at least you > would sleep for the full extend of the specified timeout. > > BTW. Jens, now that I look at freed_request, is the memory barrier > required? If so, what is it protecting? > > > > This memory barrier is not needed because the waitqueue will only get > waiters on it in the following situations: > > rq->count has exceeded the threshold - however all manipulations of ->count > are performed under the runqueue lock, and so we will correctly pick up any > waiter. > > Memory allocation for the request fails. In this case, there is no additional > help provided by the memory barrier. 
We are guaranteed to eventually wake > up waiters because the request allocation mempool guarantees that if the mem > allocation for a request fails, there must be some requests in flight. They > will wake up waiters when they are retired. Not sure I agree completely. Yes it will work, but only because it tests <= q->nr_requests and I don't think that 'eventually' is good enough :-) The actual waitqueue manipulation doesn't happen under the queue lock, so the memory barrier is needed to pickup the change on SMP. So I'd like to keep the barrier. I'd prefer to add smp_mb() to waitqueue_active() actually! -- Jens Axboe ^ permalink raw reply [flat|nested] 43+ messages in thread
* memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) 2005-01-06 8:06 ` Jens Axboe @ 2005-01-06 8:16 ` Nick Piggin 2005-01-06 8:32 ` Jens Axboe 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 8:16 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, riel, marcelo.tosatti, andrea, linux-kernel Jens Axboe wrote: > On Thu, Jan 06 2005, Nick Piggin wrote: >> >>This memory barrier is not needed because the waitqueue will only get >>waiters on it in the following situations: >> >>rq->count has exceeded the threshold - however all manipulations of ->count >>are performed under the runqueue lock, and so we will correctly pick up any >>waiter. >> >>Memory allocation for the request fails. In this case, there is no additional >>help provided by the memory barrier. We are guaranteed to eventually wake >>up waiters because the request allocation mempool guarantees that if the mem >>allocation for a request fails, there must be some requests in flight. They >>will wake up waiters when they are retired. > > > Not sure I agree completely. Yes it will work, but only because it tests > <= q->nr_requests and I don't think that 'eventually' is good enough :-) > > The actual waitqueue manipulation doesn't happen under the queue lock, > so the memory barrier is needed to pickup the change on SMP. So I'd like > to keep the barrier. > No that's right... but between the prepare_to_wait and the io_schedule, get_request takes the lock and checks nr_requests. I think we are safe? > I'd prefer to add smp_mb() to waitqueue_active() actually! > That may be a good idea (I haven't really taken much notice of how other code uses it). I'm not worried about any possible performance advantages of removing it, rather just having a memory barrier without comments can be perplexing. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) 2005-01-06 8:16 ` memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) Nick Piggin @ 2005-01-06 8:32 ` Jens Axboe 2005-01-06 8:53 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Jens Axboe @ 2005-01-06 8:32 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, andrea, linux-kernel On Thu, Jan 06 2005, Nick Piggin wrote: > Jens Axboe wrote: > >On Thu, Jan 06 2005, Nick Piggin wrote: > > >> > >>This memory barrier is not needed because the waitqueue will only get > >>waiters on it in the following situations: > >> > >>rq->count has exceeded the threshold - however all manipulations of > >>->count > >>are performed under the runqueue lock, and so we will correctly pick up > >>any > >>waiter. > >> > >>Memory allocation for the request fails. In this case, there is no > >>additional > >>help provided by the memory barrier. We are guaranteed to eventually wake > >>up waiters because the request allocation mempool guarantees that if the > >>mem > >>allocation for a request fails, there must be some requests in flight. > >>They > >>will wake up waiters when they are retired. > > > > > >Not sure I agree completely. Yes it will work, but only because it tests > ><= q->nr_requests and I don't think that 'eventually' is good enough :-) > > > >The actual waitqueue manipulation doesn't happen under the queue lock, > >so the memory barrier is needed to pickup the change on SMP. So I'd like > >to keep the barrier. > > > > No that's right... but between the prepare_to_wait and the io_schedule, > get_request takes the lock and checks nr_requests. I think we are safe? It looks like it, yes you are right. 
But it looks to be needed a few lines further down instead, though :-) ===== drivers/block/ll_rw_blk.c 1.281 vs edited ===== --- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00 +++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00 @@ -1630,11 +1630,11 @@ if (rl->count[rw] < queue_congestion_off_threshold(q)) clear_queue_congested(q, rw); if (rl->count[rw]+1 <= q->nr_requests) { - smp_mb(); if (waitqueue_active(&rl->wait[rw])) wake_up(&rl->wait[rw]); blk_clear_queue_full(q, rw); } + smp_mb(); if (unlikely(waitqueue_active(&rl->drain)) && !rl->count[READ] && !rl->count[WRITE]) wake_up(&rl->drain); > >I'd prefer to add smp_mb() to waitqueue_active() actually! > > > > That may be a good idea (I haven't really taken much notice of how other > code uses it). > > I'm not worried about any possible performance advantages of removing it, > rather just having a memory barrier without comments can be perplexing. I fully agree, subtle things like that should always be commented. -- Jens Axboe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) 2005-01-06 8:32 ` Jens Axboe @ 2005-01-06 8:53 ` Nick Piggin 2005-01-06 12:00 ` Jens Axboe 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 8:53 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, riel, marcelo.tosatti, andrea, linux-kernel Jens Axboe wrote: > On Thu, Jan 06 2005, Nick Piggin wrote: > >>No that's right... but between the prepare_to_wait and the io_schedule, >>get_request takes the lock and checks nr_requests. I think we are safe? > > > It looks like it, yes you are right. But it looks to be needed a few > lines further down instead, though :-) > > ===== drivers/block/ll_rw_blk.c 1.281 vs edited ===== > --- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00 > +++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00 > @@ -1630,11 +1630,11 @@ > if (rl->count[rw] < queue_congestion_off_threshold(q)) > clear_queue_congested(q, rw); > if (rl->count[rw]+1 <= q->nr_requests) { > - smp_mb(); > if (waitqueue_active(&rl->wait[rw])) > wake_up(&rl->wait[rw]); > blk_clear_queue_full(q, rw); > } > + smp_mb(); > if (unlikely(waitqueue_active(&rl->drain)) && > !rl->count[READ] && !rl->count[WRITE]) > wake_up(&rl->drain); > Yes, looks like you're right there. Any point in doing it like this if (!rl->count[READ] && !rl->count[WRITE]) { smb_mb(); if (unlikely(waitqueue_active(...))) wake_up() } I wonder? I don't have any feeling of how memory barriers impact performance on a very parallel system with CPUs that do lots of memory reordering like POWER5. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) 2005-01-06 8:53 ` Nick Piggin @ 2005-01-06 12:00 ` Jens Axboe 0 siblings, 0 replies; 43+ messages in thread From: Jens Axboe @ 2005-01-06 12:00 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, andrea, linux-kernel On Thu, Jan 06 2005, Nick Piggin wrote: > Jens Axboe wrote: > >On Thu, Jan 06 2005, Nick Piggin wrote: > > > > >>No that's right... but between the prepare_to_wait and the io_schedule, > >>get_request takes the lock and checks nr_requests. I think we are safe? > > > > > >It looks like it, yes you are right. But it looks to be needed a few > >lines further down instead, though :-) > > > >===== drivers/block/ll_rw_blk.c 1.281 vs edited ===== > >--- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00 > >+++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00 > >@@ -1630,11 +1630,11 @@ > > if (rl->count[rw] < queue_congestion_off_threshold(q)) > > clear_queue_congested(q, rw); > > if (rl->count[rw]+1 <= q->nr_requests) { > >- smp_mb(); > > if (waitqueue_active(&rl->wait[rw])) > > wake_up(&rl->wait[rw]); > > blk_clear_queue_full(q, rw); > > } > >+ smp_mb(); > > if (unlikely(waitqueue_active(&rl->drain)) && > > !rl->count[READ] && !rl->count[WRITE]) > > wake_up(&rl->drain); > > > > Yes, looks like you're right there. > > Any point in doing it like this > > if (!rl->count[READ] && !rl->count[WRITE]) { > smb_mb(); > if (unlikely(waitqueue_active(...))) > wake_up() > } > > I wonder? I don't have any feeling of how memory barriers impact performance > on a very parallel system with CPUs that do lots of memory reordering like > POWER5. I had the same idea - I think it's a good idea, it's definitely an optimization worth doing. -- Jens Axboe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:47 ` Andrew Morton 2005-01-06 4:55 ` Nick Piggin @ 2005-01-06 4:59 ` Andrea Arcangeli 2005-01-06 5:05 ` Andrew Morton 2005-01-06 5:06 ` Nick Piggin 1 sibling, 2 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 4:59 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, riel, marcelo.tosatti, linux-kernel On Wed, Jan 05, 2005 at 08:47:06PM -0800, Andrew Morton wrote: > That's my point. blk_congestion_wait() will always sleep, regardless of Since I've no pending bugs at all with the mainline codebase I rate this just a theoretical issue: but even sleeping for a fixed amount of time is unreliable there, for example if the storage is very slow. That's why using io_schedule_timeout for that isn't going to be a fix. The fix is very simple and it is to call wait_on_page_writeback on one of the pages under writeback. That guarantees some _writeback_ progress has been made before retrying. That way some random direct-io or a too short timeout, can't cause malfunction. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:59 ` [PATCH][5/?] count writeback pages in nr_scanned Andrea Arcangeli @ 2005-01-06 5:05 ` Andrew Morton 2005-01-06 5:17 ` Andrea Arcangeli 2005-01-06 5:06 ` Nick Piggin 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 5:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > The fix is very simple and it is to call wait_on_page_writeback on one > of the pages under writeback. eek, no. That was causing waits of five seconds or more. Fixing this caused the single greatest improvement in page allocator latency in early 2.5. We're totally at the mercy of the elevator algorithm this way. If we're to improve things in there we want to wait on _any_ eligible page becoming reclaimable, not on a particular page. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:05 ` Andrew Morton @ 2005-01-06 5:17 ` Andrea Arcangeli 2005-01-06 5:19 ` Nick Piggin 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:17 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > The fix is very simple and it is to call wait_on_page_writeback on one > > of the pages under writeback. > > eek, no. That was causing waits of five seconds or more. Fixing this > caused the single greatest improvement in page allocator latency in early > 2.5. We're totally at the mercy of the elevator algorithm this way. > > If we're to improve things in there we want to wait on _any_ eligible page > becoming reclaimable, not on a particular page. I told you one way to fix it. I didn't guarantee it was the most efficient one. I sure agree waiting on any page to complete writeback is going to fix it too. Exactly because this page was a "random" page anyway. Still my point is that this is a bug, and I prefer to be slow and safe like 2.4, than fast and unreliable like 2.6. The slight improvement you suggested of waiting on _any_ random PG_writeback to go away (instead of one particular one as I did in 2.4) is going to fix the write throttling equally too as well as the 2.4 logic, but without introducing slowdown that 2.4 had. It's easy to demonstrate: exactly because the page we pick is random anyway, we can pick the first random one that has seen PG_writeback transitioning from 1 to 0. The guarantee we get is the same in terms of safety of the write throttling, but we also guarantee the best possible latency this way. And the HZ/x hacks to avoid deadlocks will magically go away too. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:17 ` Andrea Arcangeli @ 2005-01-06 5:19 ` Nick Piggin 2005-01-06 5:25 ` Andrea Arcangeli 2005-01-06 5:32 ` Andrew Morton 0 siblings, 2 replies; 43+ messages in thread From: Nick Piggin @ 2005-01-06 5:19 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli wrote: > On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote: > >>Andrea Arcangeli <andrea@suse.de> wrote: >> >>>The fix is very simple and it is to call wait_on_page_writeback on one >>> of the pages under writeback. >> >>eek, no. That was causing waits of five seconds or more. Fixing this >>caused the single greatest improvement in page allocator latency in early >>2.5. We're totally at the mercy of the elevator algorithm this way. >> >>If we're to improve things in there we want to wait on _any_ eligible page >>becoming reclaimable, not on a particular page. > > > I told you one way to fix it. I didn't guarantee it was the most > efficient one. > > I sure agree waiting on any page to complete writeback is going to fix > it too. Exactly because this page was a "random" page anyway. > > Still my point is that this is a bug, and I prefer to be slow and safe > like 2.4, than fast and unreliable like 2.6. > > The slight improvement you suggested of waiting on _any_ random > PG_writeback to go away (instead of one particular one as I did in 2.4) > is going to fix the write throttling equally too as well as the 2.4 > logic, but without introducing slowdown that 2.4 had. > > It's easy to demonstrate: exactly because the page we pick is random > anyway, we can pick the first random one that has seen PG_writeback > transitioning from 1 to 0. The guarantee we get is the same in terms of > safety of the write throttling, but we also guarantee the best possible > latency this way. And the HZ/x hacks to avoid deadlocks will magically > go away too. 
This is practically what blk_congestion_wait does when the queue isn't congested though, isn't it? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:19 ` Nick Piggin @ 2005-01-06 5:25 ` Andrea Arcangeli 2005-01-06 5:36 ` Nick Piggin 2005-01-06 5:37 ` Andrew Morton 2005-01-06 5:32 ` Andrew Morton 1 sibling, 2 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:25 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote: > This is practically what blk_congestion_wait does when the queue > isn't congested though, isn't it? The fundamental difference that makes it reliable is that: 1) only the I/O we're throttling against will be considered for the wakeup event, which means only clearing PG_writeback will be considered eligible for wakeup Currently _all_ unrelated write I/O was considered eligible for wakeup events and that could cause spurious oom kills. 2) we won't need unreliable anti-deadlock timeouts anymore ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:25 ` Andrea Arcangeli @ 2005-01-06 5:36 ` Nick Piggin 2005-01-06 5:44 ` Nick Piggin 2005-01-06 5:37 ` Andrew Morton 1 sibling, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 5:36 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli wrote: > On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote: > >>This is practically what blk_congestion_wait does when the queue >>isn't congested though, isn't it? > > > The fundamental difference that makes it reliable is that: > > 1) only the I/O we're throttling against will be considered for the > wakeup event, which means only clearing PG_writeback will be > considered eligible for wakeup > Currently _all_ unrelated write I/O was considered eligible > for wakeup events and that could cause spurious oom kills. I'm not entirely convinced. In Rik's case it didn't matter, because all his writeout was in the same zone that reclaim was happening against (ZONE_NORMAL), so in that case, PG_writeback throttling will do exactly the same thing as blk_congestion_wait. I do like your PG_writeback throttling idea for the other reason that it should behave better on NUMA systems with lots of zones and lots of disks. > 2) we won't need unreliable anti-deadlock timeouts anymore > Well I think you do need *something*. If you wake up each time a single page (or request) has completed, you only complete what, 12 (DEF_PRIORITY) requests before going OOM? In the worst possible case scenario, which looks like what Rik's running into. Nick ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:36 ` Nick Piggin @ 2005-01-06 5:44 ` Nick Piggin 0 siblings, 0 replies; 43+ messages in thread From: Nick Piggin @ 2005-01-06 5:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel Nick Piggin wrote: > Andrea Arcangeli wrote: > >> On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote: >> >>> This is practically what blk_congestion_wait does when the queue >>> isn't congested though, isn't it? >> >> >> >> The fundamental difference that makes it reliable is that: >> >> 1) only the I/O we're throttling against will be considered for the >> wakeup event, which means only clearing PG_writeback will be >> considered eligible for wakeup >> Currently _all_ unrelated write I/O was considered eligible >> for wakeup events and that could cause spurious oom kills. > > > I'm not entirely convinced. In Rik's case it didn't matter, because > all his writeout was in the same zone that reclaim was happening > against (ZONE_NORMAL), so in that case, PG_writeback throttling > will do exactly the same thing as blk_congestion_wait. > > I do like your PG_writeback throttling idea for the other reason > that it should behave better on NUMA systems with lots of zones > and lots of disks. > ... or Andrew's described fix. I think both would result in pretty similar behaviour, but Andrew's is probably a bit nicer because it doesn't require the scanner to have initiated the write. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:25 ` Andrea Arcangeli 2005-01-06 5:36 ` Nick Piggin @ 2005-01-06 5:37 ` Andrew Morton 2005-01-06 5:59 ` Andrea Arcangeli 2005-01-06 13:28 ` Rik van Riel 1 sibling, 2 replies; 43+ messages in thread From: Andrew Morton @ 2005-01-06 5:37 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > 2) we won't need unreliable anti-deadlock timeouts anymore The timeouts are for: a) A fallback for backing stores which don't wake up waiters in blk_congestion_wait() (eg: NFS). b) handling the race case where the request queue suddenly goes empty before the sleeper gets onto the waitqueue. It can probably be removed with some work, and additional locking. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:37 ` Andrew Morton @ 2005-01-06 5:59 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:59 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel On Wed, Jan 05, 2005 at 09:37:04PM -0800, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > 2) we won't need unreliable anti-deadlock timeouts anymore > > The timeouts are for: > > a) A fallback for backing stores which don't wake up waiters in > blk_congestion_wait() (eg: NFS). that anti-deadlock will be unnecessary too with the new logic. > b) handling the race case where the request queue suddenly goes empty > before the sleeper gets onto the waitqueue. As I mentioned, with proper locking (setting the task state to uninterruptible and then registering on the new per-classzone waitqueue), the timeout will be unnecessary even for this. > It can probably be removed with some work, and additional locking. The additional locking will then remove the current locking in blk_congestion_wait, so it's new locking, but it will replace the current locking. But I agree registering in the waitqueue inside blk_congestion_wait was simpler. It's just that I have a hard time liking the timeout. A timeout is always wrong when it triggers: it always triggers either too late (wasted resources) or too early (early oom kills). So unless it messes everything up, it'd be nice to keep the locking strict. Anyway, points 1 and 2 can be implemented separately; at first we can leave the timeout in if the race is too hard to handle. Ideally, if we keep the total number of outstanding writebacks per-classzone (not sure if we account for it already somewhere; I guess if anything we have the global number and not the per-classzone one), we could remove the timeout without having to expose the locking outside blk_congestion_wait. 
With the number of outstanding writebacks per-classzone we could truly fix the race and obsolete the timeout in a self-contained manner, though it will require a proper set of memory barriers around the counter increment/decrement/read. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:37 ` Andrew Morton 2005-01-06 5:59 ` Andrea Arcangeli @ 2005-01-06 13:28 ` Rik van Riel 1 sibling, 0 replies; 43+ messages in thread From: Rik van Riel @ 2005-01-06 13:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, nickpiggin, marcelo.tosatti, linux-kernel On Wed, 5 Jan 2005, Andrew Morton wrote: > The timeouts are for: > > a) A fallback for backing stores which don't wake up waiters in > blk_congestion_wait() (eg: NFS). This is buggy, btw. NFS has a "fun" deadlock where it can send so many writes over the wire at once that the PF_MEMALLOC allocations eat up all memory and there's no memory left to receive ACKs from the NFS server ... NFS will need to act congested when memory is low and there are already some writeouts underway. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:19 ` Nick Piggin 2005-01-06 5:25 ` Andrea Arcangeli @ 2005-01-06 5:32 ` Andrew Morton 2005-01-06 5:46 ` Andrea Arcangeli 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 5:32 UTC (permalink / raw) To: Nick Piggin; +Cc: andrea, riel, marcelo.tosatti, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Andrea Arcangeli wrote: > > On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote: > > > >>Andrea Arcangeli <andrea@suse.de> wrote: > >> > >>>The fix is very simple and it is to call wait_on_page_writeback on one > >>> of the pages under writeback. > >> > >>eek, no. That was causing waits of five seconds or more. Fixing this > >>caused the single greatest improvement in page allocator latency in early > >>2.5. We're totally at the mercy of the elevator algorithm this way. > >> > >>If we're to improve things in there we want to wait on _any_ eligible page > >>becoming reclaimable, not on a particular page. > > > > > > I told you one way to fix it. I didn't guarantee it was the most > > efficient one. > > And I've already described the efficient way to "fix" it. Twice. > > I sure agree waiting on any page to complete writeback is going to fix > > it too. Exactly because this page was a "random" page anyway. > > > > Still my point is that this is a bug, and I prefer to be slow and safe > > like 2.4, than fast and unreliable like 2.6. > > > > The slight improvement you suggested of waiting on _any_ random > > PG_writeback to go away (instead of one particular one as I did in 2.4) It's a HUGE improvement. "Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 took 35 minutes to compile a kernel. With this patch, it took three minutes, 45 seconds." Plus this was the change which precipitated all the I/O scheduler development, because it caused us to keep the queues full all the time and the old I/O scheduler collapsed. 
> > is going to fix the write throttling equally too as well as the 2.4 > > logic, but without introducing slowdown that 2.4 had. > > > > It's easy to demonstrate: exactly because the page we pick is random > > anyway, we can pick the first random one that has seen PG_writeback > > transitioning from 1 to 0. The guarantee we get is the same in terms of > > safety of the write throttling, but we also guarantee the best possible > > latency this way. And the HZ/x hacks to avoid deadlocks will magically > > go away too. > > > > This is practically what blk_congestion_wait does when the queue > isn't congested though, isn't it? Pretty much. Except: - Doing a wakeup when a write request is retired corresponds to releasing a batch of pages, not a single page. Usually. - direct-io writes could confuse it. For the third time: "fixing" this involves delivering a wakeup to all zones in the page's classzone in end_page_writeback(), and passing the zone* into blk_congestion_wait(). Only deliver the wakeup on every Nth page to get a bit of batching and to reduce CPU consumption. Then demonstrating that the change actually improves something. ^ permalink raw reply [flat|nested] 43+ messages in thread
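Andrew's description can be rendered as a hypothetical kernel-flavored sketch rather than a patch: zone->wb_completed, zone->wb_wait, WB_WAKEUP_BATCH, and the classzone iterator are all invented names, the existing per-page wakeup in end_page_writeback() is elided, and real code would need to settle the locking and keep the current congestion wakeups intact.

```c
#define WB_WAKEUP_BATCH 16      /* wake on every Nth completion only */

void end_page_writeback(struct page *page)
{
    struct zone *zone = page_zone(page);

    /* ... existing TestClearPageWriteback() + per-page wakeup ... */

    /* Batched to keep CPU cost down; racy increments only blur the
     * batch boundary, which is harmless here. */
    if (++zone->wb_completed % WB_WAKEUP_BATCH == 0) {
        struct zone *z;

        for_each_zone_in_classzone(z, zone)     /* invented iterator */
            if (waitqueue_active(&z->wb_wait))
                wake_up(&z->wb_wait);
    }
}

/* The allocator passes the zone it is reclaiming for, so it sleeps on
 * progress it can actually use, with the timeout kept as a fallback
 * for backing stores like NFS that deliver no wakeups. */
long blk_congestion_wait(int rw, long timeout, struct zone *zone)
{
    DEFINE_WAIT(wait);

    prepare_to_wait(&zone->wb_wait, &wait, TASK_UNINTERRUPTIBLE);
    timeout = io_schedule_timeout(timeout);
    finish_wait(&zone->wb_wait, &wait);
    return timeout;
}
```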
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:32 ` Andrew Morton @ 2005-01-06 5:46 ` Andrea Arcangeli 2005-01-06 5:59 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, riel, marcelo.tosatti, linux-kernel On Wed, Jan 05, 2005 at 09:32:07PM -0800, Andrew Morton wrote: > > > The slight improvement you suggested of waiting on _any_ random > > > PG_writeback to go away (instead of one particular one as I did in 2.4) > > It's a HUGE improvement. I didn't want to question the improvement in wall clock time terms. > For the third time: "fixing" this involves delivering a wakeup to all zones > in the page's classzone in end_page_writeback(), and passing the zone* into > blk_congestion_wait(). Only deliver the wakeup on every Nth page to get a > bit of batching and to reduce CPU consumption. Then demonstrating that the > change actually improves something. Since I cannot reproduce oom kills with writeback, I sure can't demonstrate it on bare hardware with unmodified kernel. But I dislike code that works by luck, and sure I could demonstrate it if I bothered to write an artificial testcase on simulated hardware. This is the only reason I mentioned this bug in the first place, not because I'm reproducing it. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:46 ` Andrea Arcangeli @ 2005-01-06 5:59 ` Andrew Morton 2005-01-06 6:16 ` Andrea Arcangeli 0 siblings, 1 reply; 43+ messages in thread From: Andrew Morton @ 2005-01-06 5:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > But I dislike code that works by luck, yup, it is code which works by luck. And it's a bit inaccurate - there is potential for (small) further latency improvement in there. But we do need to watch the CPU consumption in there - end_page_writeback() already figures fairly high sometimes. > and sure I could demonstrate it > if I bothered to write an artificial testcase on simulated hardware. Could well be so. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:59 ` Andrew Morton @ 2005-01-06 6:16 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 6:16 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, riel, marcelo.tosatti, linux-kernel On Wed, Jan 05, 2005 at 09:59:09PM -0800, Andrew Morton wrote: > But we do need to watch the CPU consumption in there - end_page_writeback() > already figures fairly high sometimes. The check for waitqueue_active will avoid all wake_up calls if nobody is waiting in blk_congestion_wait. Plus it won't need to wake up on all classzones if we have the other side register on all the zones composing the classzone. The wakeup is going to be more frequent than the waiter registration. So we can register the waiter on _all_ the "interested" zones that compose the classzone, instead of the other way around that you suggested in the previous email. That will reduce the wakeup to a single waitqueue_active, and it shouldn't be too bad. Same goes for the number of outstanding writebacks: we can collect that number per zone and have blk_congestion_wait return immediately if it went to zero on all zones composing the classzone after registering. ^ permalink raw reply [flat|nested] 43+ messages in thread
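Andrea's waiter-side refinement, again only as a hypothetical sketch (zone->nr_writeback, zone->wb_wait, and classzone_writeback_wait() are invented names): the waiter registers on every zone composing the classzone first, then re-checks the per-zone outstanding-writeback counts, so it can return immediately when nothing is in flight. That ordering is what would make the anti-deadlock timeout removable.

```c
/* Sleep until writeback completes in any zone of the classzone; the
 * NULL-terminated zones[] array is the classzone's zone list. */
long classzone_writeback_wait(struct zone **zones, long timeout)
{
    wait_queue_t waits[MAX_NR_ZONES];
    int i, pending = 0;

    /* 1. Register everywhere first ... */
    for (i = 0; zones[i] != NULL; i++) {
        init_wait(&waits[i]);
        prepare_to_wait(&zones[i]->wb_wait, &waits[i],
                        TASK_UNINTERRUPTIBLE);
    }

    /* 2. ... then re-check.  Registering before checking is what
     * closes the race with the completion-side wakeup, given suitable
     * memory barriers between counter updates and waitqueue checks. */
    for (i = 0; zones[i] != NULL; i++)
        pending += atomic_read(&zones[i]->nr_writeback);

    if (pending)
        timeout = io_schedule_timeout(timeout);

    for (i = 0; zones[i] != NULL; i++)
        finish_wait(&zones[i]->wb_wait, &waits[i]);
    return timeout;
}
```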
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 4:59 ` [PATCH][5/?] count writeback pages in nr_scanned Andrea Arcangeli 2005-01-06 5:05 ` Andrew Morton @ 2005-01-06 5:06 ` Nick Piggin 2005-01-06 5:21 ` Andrea Arcangeli 1 sibling, 1 reply; 43+ messages in thread From: Nick Piggin @ 2005-01-06 5:06 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel Andrea Arcangeli wrote: > On Wed, Jan 05, 2005 at 08:47:06PM -0800, Andrew Morton wrote: > >>That's my point. blk_congestion_wait() will always sleep, regardless of > > > Since I've no pending bugs at all with the mainline codebase I rate this > just a theoretical issue: but even sleeping for a fixed amount of time > is unreliable there, for example if the storage is very slow. That's why > using io_schedule_timeout for that isn't going to be a fix. > > The fix is very simple and it is to call wait_on_page_writeback on one > of the pages under writeback. That guarantees some _writeback_ progress > has been made before retrying. That way some random direct-io or a too > short timeout, can't cause malfunction. > Yeah I guess that isn't a bad idea. Doesn't that have the theoretical problem of slowing down reclaim of dirty pages backed by fast devices when you have slow devices writing? But presumably you'd usually have pdflush working away anyway. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned 2005-01-06 5:06 ` Nick Piggin @ 2005-01-06 5:21 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2005-01-06 5:21 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, riel, marcelo.tosatti, linux-kernel On Thu, Jan 06, 2005 at 04:06:34PM +1100, Nick Piggin wrote: > Yeah I guess that isn't a bad idea. > > Doesn't that have the theoretical problem of slowing down reclaim > of dirty pages backed by fast devices when you have slow devices > writing? Andrew just suggested waiting on any random page. That's a minor modification to that logic. Effectively it's stupid to wait on a specific random page when we can wait on _any_ random page to complete the writeback I/O. With the locking done right, by setting the state to uninterruptible and then registering in the new waitqueue when we know at least one page has PG_writeback set, we can then drop the HZ anti-deadlock hacks too. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH][5/?] count writeback pages in nr_scanned
  2005-01-03 17:25 [PATCH][5/?] count writeback pages in nr_scanned Rik van Riel
  2005-01-05 10:08 ` Andrew Morton
@ 2005-01-05 23:26 ` Andrew Morton
  1 sibling, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2005-01-05 23:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, andrea, marcelo.tosatti

Rik van Riel <riel@redhat.com> wrote:
>
> OOM kills have been observed with 70% of the pages in lowmem being
> in the writeback state.  If we count those pages in sc->nr_scanned,
> the VM should throttle and wait for IO completion, instead of OOM
> killing.

I'll queue this up:

From: Rik van Riel <riel@redhat.com>

OOM kills have been observed with 70% of the pages in lowmem being in the
writeback state.  If we count those pages in sc->nr_scanned, the VM should
throttle and wait for IO completion, instead of OOM killing.

(akpm: this is how the code was designed to work - we broke it six months
ago).

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/mm/vmscan.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~vmscan-count-writeback-pages-in-nr_scanned mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-count-writeback-pages-in-nr_scanned	2005-01-05 15:24:53.730874336 -0800
+++ 25-akpm/mm/vmscan.c	2005-01-05 15:25:48.286580608 -0800
@@ -369,14 +369,14 @@ static int shrink_list(struct list_head
 
 	BUG_ON(PageActive(page));
 
-	if (PageWriteback(page))
-		goto keep_locked;
-
 	sc->nr_scanned++;
 	/* Double the slab pressure for mapped and swapcache pages */
 	if (page_mapped(page) || PageSwapCache(page))
 		sc->nr_scanned++;
 
+	if (PageWriteback(page))
+		goto keep_locked;
+
 	referenced = page_referenced(page, 1, sc->priority <= 0);
 	/* In active use or really unfreeable? Activate it. */
 	if (referenced && page_mapping_inuse(page))
_

^ permalink raw reply	[flat|nested] 43+ messages in thread
end of thread, other threads:[~2005-01-06 13:29 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-03 17:25 [PATCH][5/?] count writeback pages in nr_scanned Rik van Riel
2005-01-05 10:08 ` Andrew Morton
2005-01-05 18:06 ` Andrea Arcangeli
2005-01-05 18:50 ` Rik van Riel
2005-01-05 17:49 ` Marcelo Tosatti
2005-01-05 21:44 ` Andrew Morton
2005-01-05 20:32 ` Marcelo Tosatti
2005-01-05 23:51 ` Nick Piggin
2005-01-06  1:27 ` Rik van Riel
2005-01-06  1:33 ` Nick Piggin
2005-01-06  1:37 ` Andrew Morton
2005-01-06  1:40 ` Nick Piggin
2005-01-06  1:52 ` Andrea Arcangeli
2005-01-06  1:36 ` Andrew Morton
2005-01-06  3:42 ` Rik van Riel
2005-01-06  3:50 ` Nick Piggin
2005-01-06  4:26 ` Andrew Morton
2005-01-06  4:35 ` Nick Piggin
2005-01-06  4:47 ` Andrew Morton
2005-01-06  4:55 ` Nick Piggin
2005-01-06  5:03 ` Andrea Arcangeli
2005-01-06  8:06 ` Jens Axboe
2005-01-06  8:16 ` memory barrier in ll_rw_blk.c (was Re: [PATCH][5/?] count writeback pages in nr_scanned) Nick Piggin
2005-01-06  8:32 ` Jens Axboe
2005-01-06  8:53 ` Nick Piggin
2005-01-06 12:00 ` Jens Axboe
2005-01-06  4:59 ` [PATCH][5/?] count writeback pages in nr_scanned Andrea Arcangeli
2005-01-06  5:05 ` Andrew Morton
2005-01-06  5:17 ` Andrea Arcangeli
2005-01-06  5:19 ` Nick Piggin
2005-01-06  5:25 ` Andrea Arcangeli
2005-01-06  5:36 ` Nick Piggin
2005-01-06  5:44 ` Nick Piggin
2005-01-06  5:37 ` Andrew Morton
2005-01-06  5:59 ` Andrea Arcangeli
2005-01-06 13:28 ` Rik van Riel
2005-01-06  5:32 ` Andrew Morton
2005-01-06  5:46 ` Andrea Arcangeli
2005-01-06  5:59 ` Andrew Morton
2005-01-06  6:16 ` Andrea Arcangeli
2005-01-06  5:06 ` Nick Piggin
2005-01-06  5:21 ` Andrea Arcangeli
2005-01-05 23:26 ` Andrew Morton