linux-kernel.vger.kernel.org archive mirror
* Race in shrink_cache
@ 2002-09-05  5:04 Daniel Phillips
  2002-09-05  6:36 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05  5:04 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel

Hi Marcelo,

This looks really suspicious, vmscan.c#435:

	spin_unlock(&pagemap_lru_lock);
							if (put_page_testzero(page))
								__free_pages_ok(page, 0);
	/* avoid to free a locked page */
	page_cache_get(page);

	/* whoops, double free coming */

I suggest you bump the page count before releasing the lru lock.  The race
shown above may not in fact be possible, but the current code is fragile.
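For the archives, here is a little user-space model of the ordering I mean. The names (lru_lock, page count) mirror the kernel's, but this is only an illustrative sketch of the locking discipline, not the actual vmscan.c code:

```c
#include <assert.h>
#include <pthread.h>

struct page { int count; };

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * The suspect ordering drops lru_lock first and bumps the count
 * afterwards, leaving a window where another CPU's put_page_testzero()
 * can take the count to zero and free the page underneath us.
 *
 * The safe ordering pins the page while still holding the lru lock,
 * so the count can never reach zero inside the window.
 */
static void isolate_page_safe(struct page *page)
{
	pthread_mutex_lock(&lru_lock);
	page->count++;		/* page_cache_get() under the lock */
	pthread_mutex_unlock(&lru_lock);
	/* we hold a reference; the page cannot be freed under us */
}
```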

-- 
Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Race in shrink_cache
  2002-09-05  6:36 ` Andrew Morton
@ 2002-09-05  6:36   ` Daniel Phillips
  2002-09-05  7:07     ` Andrew Morton
  2002-09-05 13:33     ` Rik van Riel
  0 siblings, 2 replies; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05  6:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, linux-kernel

On Thursday 05 September 2002 08:36, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > Hi Marcelo,
> > 
> > This looks really suspicious, vmscan.c#435:
> > 
> >         spin_unlock(&pagemap_lru_lock);
> >                                                         if (put_page_testzero(page))
> >                                                                 __free_pages_ok(page, 0);
> >         /* avoid to free a locked page */
> >         page_cache_get(page);
> > 
> >         /* whoops, double free coming */
> > 
> > I suggest you bump the page count before releasing the lru lock.  The race
> > shown above may not in fact be possible, but the current code is fragile.
> > 
> 
> That's OK.  The page has a ref because of nonzero ->buffers.  And it
> is locked, which pins page->buffers.

Yes, true.  Calm down ladies and gentlemen, and move away from the exits,
there is no fire.  While we're in here, do you have any idea what this is
about:

/*
 * We must not allow an anon page
 * with no buffers to be visible on
 * the LRU, so we unlock the page after
 * taking the lru lock
 */

That is, what's scary about an anon page without buffers?

-- 
Daniel


* Re: Race in shrink_cache
  2002-09-05  5:04 Race in shrink_cache Daniel Phillips
@ 2002-09-05  6:36 ` Andrew Morton
  2002-09-05  6:36   ` Daniel Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-05  6:36 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Marcelo Tosatti, linux-kernel

Daniel Phillips wrote:
> 
> Hi Marcelo,
> 
> This looks really suspicious, vmscan.c#435:
> 
>         spin_unlock(&pagemap_lru_lock);
>                                                         if (put_page_testzero(page))
>                                                                 __free_pages_ok(page, 0);
>         /* avoid to free a locked page */
>         page_cache_get(page);
> 
>         /* whoops, double free coming */
> 
> I suggest you bump the page count before releasing the lru lock.  The race
> shown above may not in fact be possible, but the current code is fragile.
> 

That's OK.  The page has a ref because of nonzero ->buffers.  And it
is locked, which pins page->buffers.


* Re: Race in shrink_cache
  2002-09-05  6:36   ` Daniel Phillips
@ 2002-09-05  7:07     ` Andrew Morton
  2002-09-05  7:28       ` Daniel Phillips
  2002-09-05 13:33     ` Rik van Riel
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-05  7:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Marcelo Tosatti, linux-kernel

Daniel Phillips wrote:
> 
> ...
> /*
>  * We must not allow an anon page
>  * with no buffers to be visible on
>  * the LRU, so we unlock the page after
>  * taking the lru lock
>  */
> 
> That is, what's scary about an anon page without buffers?

ooop.  That's an akpm comment.  umm, err..

It solves this BUG:

http://www.cs.helsinki.fi/linux/linux-kernel/2001-37/0594.html

Around the 2.4.10 timeframe, Andrea started putting anon pages
on the LRU.  Then he backed that out, then put it in again.  I
think this comment dates from the time when anon pages were
not on the LRU.  So there's a little window there where the
page is unlocked, we've just dropped its swapdev buffers, the page is
on the LRU and pagemap_lru_lock is not held.

So another CPU came in, found the page on the LRU, saw that it had
no ->mapping and no ->buffers and went BUG.

The fix was to take pagemap_lru_lock before unlocking the page.

The comment is stale.


* Re: Race in shrink_cache
  2002-09-05  7:07     ` Andrew Morton
@ 2002-09-05  7:28       ` Daniel Phillips
  2002-09-05  7:53         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05  7:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thursday 05 September 2002 09:07, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > ...
> > /*
> >  * We must not allow an anon page
> >  * with no buffers to be visible on
> >  * the LRU, so we unlock the page after
> >  * taking the lru lock
> >  */
> > 
> > That is, what's scary about an anon page without buffers?
> 
> ooop.  That's an akpm comment.  umm, err..
> 
> It solves this BUG:
> 
> http://www.cs.helsinki.fi/linux/linux-kernel/2001-37/0594.html
> 
> Around the 2.4.10 timeframe, Andrea started putting anon pages
> on the LRU.  Then he backed that out, then put it in again.  I
> think this comment dates from the time when anon pages were
> not on the LRU.  So there's a little window there where the
> page is unlocked, we've just dropped its swapdev buffers, the page is
> on the LRU and pagemap_lru_lock is not held.
> 
> So another CPU came in, found the page on the LRU, saw that it had
> no ->mapping and no ->buffers and went BUG.
> 
> The fix was to take pagemap_lru_lock before unlocking the page.
> 
> The comment is stale.

With the atomic_dec_and_lock strategy, the page would be freed immediately on 
the buffers being released, and with the lru=1 strategy it doesn't matter 
in terms of correctness whether the page ends up on the lru or not, so I was 
inclined not to worry about this anyway.  Still, when you see a dire-looking 
comment like that...

You said something about your lru locking strategy in 2.5.33-mm2.  I have not
reverse engineered it yet; would you care to wax poetic?

-- 
Daniel


* Re: Race in shrink_cache
  2002-09-05  7:28       ` Daniel Phillips
@ 2002-09-05  7:53         ` Andrew Morton
  2002-09-05 18:41           ` Daniel Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-05  7:53 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips wrote:
> 
> ...
> You said something about your lru locking strategy in 2.5.33-mm2.  I have not
> reverse engineered it yet, would you care to wax poetic?

I'm not sure what you're after here?  Strategy is to make the locks per-zone,
per-node, to not take them too long, to not take them too frequently, to do
as much *needed* work as possible for a single acquisition of the lock and
to not do unnecessary work while holding it - and that includes not servicing
ethernet interrupts.

Not having to bump page counts when moving pages from the LRU into a private
list would be nice.

The strategy for fixing the double-free race is to wait until you
buy an IDE disk ;)

The elaborate changelogs are at

http://linux.bkbits.net:8080/linux-2.5/user=akpm/ChangeSet@-4w?nav=!-|index.html|stats|!+|index.html


* Re: Race in shrink_cache
  2002-09-05  6:36   ` Daniel Phillips
  2002-09-05  7:07     ` Andrew Morton
@ 2002-09-05 13:33     ` Rik van Riel
  1 sibling, 0 replies; 12+ messages in thread
From: Rik van Riel @ 2002-09-05 13:33 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, Marcelo Tosatti, linux-kernel

On Thu, 5 Sep 2002, Daniel Phillips wrote:

> /*
>  * We must not allow an anon page
>  * with no buffers to be visible on
>  * the LRU, so we unlock the page after
>  * taking the lru lock
>  */
>
> That is, what's scary about an anon page without buffers?

Nothing, those are placed on the LRU all the time.

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: Race in shrink_cache
  2002-09-05  7:53         ` Andrew Morton
@ 2002-09-05 18:41           ` Daniel Phillips
  2002-09-05 18:51             ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05 18:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thursday 05 September 2002 09:53, Andrew Morton wrote:
> Not having to bump page counts when moving pages from the LRU into a private
> list would be nice.

I'm not sure what your intended application is here.  It's easy enough to 
change the lru state bit to a scalar, the transitions of which are protected 
naturally by the lru lock.  This gives you N partitions of the lru list 
(times M zones) and page_cache_release does the right thing for all of them.

On the other hand, if what you want is a private list that page_cache_release 
doesn't act on automatically, all you have to do is set the lru state to zero,
leave the page count incremented and move to the private list.  You then take
explicit responsibility for freeing the page or moving it back onto a 
mainstream lru list.
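As a sketch (user-space C; the names like LRU_NONE and page_cache_release are illustrative only, not actual kernel API):

```c
#include <assert.h>

enum lru_state { LRU_NONE = 0, LRU_INACTIVE, LRU_ACTIVE };

struct page {
	int count;
	enum lru_state lru;
};

/*
 * With the lru state as a scalar, the release path can do the right
 * thing for whichever mainstream list the page is on.  lru == LRU_NONE
 * means the page is on a private list: the release path leaves it
 * alone, and the holder of the private list must free it or requeue it
 * explicitly.  Returns nonzero when the page is actually freed.
 */
static int page_cache_release(struct page *page)
{
	if (--page->count == 0)
		return 1;	/* last ref gone; free the page */
	return 0;
}

static void move_to_private_list(struct page *page)
{
	page->lru = LRU_NONE;	/* release path will not act on it */
	page->count++;		/* the private list owns this ref */
}
```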

An example of an application of the latter technique is a short delay list to 
(finally) implement the use-once concept properly.  A newly instantiated page 
goes onto the hot end of this list instead of the inactive list as it does 
now, and after a short delay dependent on the allocation activity in the 
system, is moved either to the (per zone) active or inactive list, depending 
on whether it was referenced.  Thus the use-once list is not per-zone and 
removal from it is always explicit.  So it's not like the other lru lists, 
even though it uses the same link field.

While I'm meandering here, I'll mention that the above approach finally makes 
use-once work properly for swap pages, which always had the problem that we 
couldn't detect the second, activating reference (and this was fudged by 
always marking a swapped-in page as referenced, i.e., kludging away the 
mechanism).  It also solves the problem of detecting clustered references, 
such as reading through a page a byte at a time, which should only count as a 
single reference.  Right now we do a stupid hack that works in a lot of 
cases, but fails in enough cases to be annoying, all in the name of trying to 
get by without implementing a dedicated list.

Readahead on the other hand needs to be handled with dedicated per-zone lru 
lists, so that we can conveniently and accurately claw back readahead that 
happens to have gone too far ahead.  So this is an example of the first kind 
of usage.  Once read, a readahead page moves to the used-once queue, and from 
there either to the inactive or active queue as above.

-- 
Daniel


* Re: Race in shrink_cache
  2002-09-05 18:41           ` Daniel Phillips
@ 2002-09-05 18:51             ` Andrew Morton
  2002-09-05 19:08               ` Daniel Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-05 18:51 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips wrote:
> 
> ..
> On the other hand, if what you want is a private list that page_cache_release
> doesn't act on automatically, all you have to do is set the lru state to zero,
> leave the page count incremented and move to the private list.  You then take
> explicit responsibility for freeing the page or moving it back onto a
> mainstream lru list.

That's the one.  Page reclaim speculatively removes a chunk (typically 32) of
pages from the LRU, works on them, and puts back any unfreeable ones later
on.  And the rest of the VM was taught to play correctly with pages that
can be off the LRU.  This was to avoid hanging onto the LRU lock while
running page reclaim.

And when those 32 pages are speculatively removed, their refcounts are
incremented.  Maybe that isn't necessary - I'd need to think about
that.  If it isn't, then the double-free thing is fixed.  If it is
necessary then the lru-adds-a-ref approach is nice, because shrink_cache
doesn't need to page_cache_get each page while holding the LRU lock,
as you say.
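The batched isolate/putback pattern looks roughly like this (user-space sketch; the list handling and names are illustrative, not the 2.5 code):

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32

struct page {
	struct page *next;
	int count;
	int freeable;	/* stand-in for "reclaim succeeded" */
};

/*
 * Detach up to max pages from the head of *lru onto a private chain,
 * pinning each one so nothing can free it while it is off the list.
 * In the kernel this whole loop runs under the lru lock.
 */
static struct page *isolate_batch(struct page **lru, int max)
{
	struct page *batch = NULL;

	while (*lru && max--) {
		struct page *page = *lru;

		*lru = page->next;
		page->count++;		/* pin while off the lru */
		page->next = batch;
		batch = page;
	}
	return batch;
}

/*
 * Work on the batch with no locks held: free what we can, and put the
 * unfreeable pages back on the lru.  Returns the number freed.
 */
static int reclaim_batch(struct page **lru, struct page *batch)
{
	int freed = 0;

	while (batch) {
		struct page *page = batch;

		batch = page->next;
		page->count--;		/* drop the isolation ref */
		if (page->freeable && page->count == 0) {
			freed++;	/* page goes back to the allocator */
		} else {
			page->next = *lru;
			*lru = page;	/* putback for another pass */
		}
	}
	return freed;
}
```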


* Re: Race in shrink_cache
  2002-09-05 18:51             ` Andrew Morton
@ 2002-09-05 19:08               ` Daniel Phillips
  2002-09-05 19:22                 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05 19:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > ..
> > On the other hand, if what you want is a private list that page_cache_release
> > doesn't act on automatically, all you have to do is set the lru state to zero,
> > leave the page count incremented and move to the private list.  You then take
> > explicit responsibility for freeing the page or moving it back onto a
> > mainstream lru list.
> 
> That's the one.  Page reclaim speculatively removes a chunk (typically 32) of
> pages from the LRU, works on them, and puts back any unfreeable ones later
> on.

Convergent evolution.  That's exactly the same number I'm handling as a
chunk in my rmap sharing project (backgrounded for the moment).  In this
case, the 32 comes from the number of bits you can store in a long, and
it conveniently happens to fall in the sweet spot of performance as well.

In this case I only have to manipulate one list element though, because
I'm taking the lru list itself out onto the shared rmap node, saving 8
bytes in struct page as a side effect.  So other than the size of the
chunk, the resemblance is slight.  ETA for this is 2.7, unless you
suddenly start threatening to back out rmap again.

> And the rest of the VM was taught to play correctly with pages that
> can be off the LRU.  This was to avoid hanging onto the LRU lock while
> running page reclaim.
> 
> And when those 32 pages are speculatively removed, their refcounts are
> incremented.  Maybe that isn't necessary - I'd need to think about
> that.  If it isn't, then the double-free thing is fixed.  If it is
> necessary then then lru-adds-a-ref approach is nice, because shrink_cache
> doesn't need to page_cache_get each page while holding the LRU lock,
> as you say.

I think the extra refcount strategy is inherently stronger, and this
is an example of why.  The other would require you to take/drop an
extra count explicitly for your private list.

-- 
Daniel


* Re: Race in shrink_cache
  2002-09-05 19:08               ` Daniel Phillips
@ 2002-09-05 19:22                 ` Andrew Morton
  2002-09-05 20:00                   ` Daniel Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-05 19:22 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips wrote:
> 
> On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > ..
> > > On the other hand, if what you want is a private list that page_cache_release
> > > doesn't act on automatically, all you have to do is set the lru state to zero,
> > > leave the page count incremented and move to the private list.  You then take
> > > explicit responsibility for freeing the page or moving it back onto a
> > > mainstream lru list.
> >
> > That's the one.  Page reclaim speculatively removes a chunk (typically 32) of
> > pages from the LRU, works on them, and puts back any unfreeable ones later
> > on.
> 
> Convergent evolution.  That's exactly the same number I'm handling as a
> chunk in my rmap sharing project (backgrounded for the moment).  In this
> case, the 32 comes from the number of bits you can store in a long, and
> it conveniently happens to fall in the sweet spot of performance as well.

That's SWAP_CLUSTER_MAX.  I've never really seen a reason to change
its value.  On 4k pagesize.

The pagevecs use 16 pages.  The thinking here is that we want it to
be large enough to be efficient, but small enough so that all the
pageframes are still in L1 when we come around and hit on them again.
Plus pagevecs are placed on the stack.

> ...
> I think the extra refcount strategy is inherently stronger, and this
> is an example of why.  The other would require you to take/drop an
> extra count explicitly for your private list.

OK.  I assume you taught invalidate_inode_pages[2] about the extra ref?


* Re: Race in shrink_cache
  2002-09-05 19:22                 ` Andrew Morton
@ 2002-09-05 20:00                   ` Daniel Phillips
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Phillips @ 2002-09-05 20:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thursday 05 September 2002 21:22, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> > > Daniel Phillips wrote:
> > > >
> > > > ..
> > > > On the other hand, if what you want is a private list that page_cache_release
> > > > doesn't act on automatically, all you have to do is set the lru state to zero,
> > > > leave the page count incremented and move to the private list.  You then take
> > > > explicit responsibility for freeing the page or moving it back onto a
> > > > mainstream lru list.
> > >
> > > That's the one.  Page reclaim speculatively removes a chunk (typically 32) of
> > > pages from the LRU, works on them, and puts back any unfreeable ones later
> > > on.
> > 
> > Convergent evolution.  That's exactly the same number I'm handling as a
> > chunk in my rmap sharing project (backgrounded for the moment).  In this
> > case, the 32 comes from the number of bits you can store in a long, and
> > it conveniently happens to fall in the sweet spot of performance as well.
> 
> That's SWAP_CLUSTER_MAX.  I've never really seen a reason to change
> its value.  On 4k pagesize.
> 
> The pagevecs use 16 pages.  The thinking here is that we want it to
> be large enough to be efficient, but small enough so that all the
> pageframes are still in L1 when we come around and hit on them again.
> Plus pagevecs are placed on the stack.
> 
> > ...
> > I think the extra refcount strategy is inherently stronger, and this
> > is an example of why.  The other would require you to take/drop an
> > extra count explicitly for your private list.
> 
> OK.  I assume you taught invalidate_inode_pages[2] about the extra ref?

Yes, all the magic tests have been replaced with versions that depend on
the LRU_PLUS_CACHE define (1 or 2).  As a side effect, this lets you find
all the places that care about this with a simple grep, so I'd be inclined
to leave the symbol in, even after we decide which setting we prefer.

-- 
Daniel

