linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: implement WasActive page flag (for improving cleancache)
@ 2012-01-25 21:58 Dan Magenheimer
  2012-01-26 17:28 ` Dave Hansen
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-25 21:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Andrew Morton, Konrad Wilk
  Cc: Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, Dave Hansen, riel, Chris Mason

(Feedback welcome if there is a different/better way to do this
without using a page flag!)

Since about 2.6.27, the page replacement algorithm maintains
an "active" bit to help decide which pages are most eligible
to reclaim, see http://linux-mm.org/PageReplacementDesign 

This "active' information is also useful to cleancache but is lost
by the time that cleancache has the opportunity to preserve the
pageful of data.  This patch adds a new page flag "WasActive" to
retain the state.  The flag may possibly be useful elsewhere.

It is up to each cleancache backend to utilize the bit as
it desires.  The matching patch for zcache is included here
for clarification/discussion purposes, though it will need to
go through GregKH and the staging tree.

The patch resolves issues reported with cleancache which occur
especially during streaming workloads on older processors,
see https://lkml.org/lkml/2011/8/17/351 

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..0f5e86a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -51,6 +51,9 @@
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
+ *
+ * PG_wasactive reflects that a page previously was promoted to active status.
+ * Such pages should be considered higher priority for cleancache backends.
  */
 
 /*
@@ -107,6 +110,9 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_CLEANCACHE
+	PG_was_active,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -209,6 +215,10 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_CLEANCACHE
+PAGEFLAG(WasActive, was_active)
+#endif
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..fdd9e88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -636,6 +636,8 @@ void putback_lru_page(struct page *page)
 	int was_unevictable = PageUnevictable(page);
 
 	VM_BUG_ON(PageLRU(page));
+	if (active)
+		SetPageWasActive(page);
 
 redo:
 	ClearPageUnevictable(page);
@@ -1429,6 +1431,7 @@ update_isolated_counts(struct mem_cgroup_zone *mz,
 		if (PageActive(page)) {
 			lru += LRU_ACTIVE;
 			ClearPageActive(page);
+			SetPageWasActive(page);
 			nr_active += numpages;
 		}
 		count[lru] += numpages;
@@ -1755,6 +1758,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWasActive(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index 642840c..8c81ec2 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -1696,6 +1696,8 @@ static void zcache_cleancache_put_page(int pool_id,
 	u32 ind = (u32) index;
 	struct tmem_oid oid = *(struct tmem_oid *)&key;
 
+	if (!PageWasActive(page))
+		return;
 	if (likely(ind == index))
 		(void)zcache_put_page(LOCAL_CLIENT, pool_id, &oid, index, page);
 }
@@ -1710,6 +1712,8 @@ static int zcache_cleancache_get_page(int pool_id,
 
 	if (likely(ind == index))
 		ret = zcache_get_page(LOCAL_CLIENT, pool_id, &oid, index, page);
+	if (ret == 0)
+		SetPageWasActive(page);
 	return ret;
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-25 21:58 [PATCH] mm: implement WasActive page flag (for improving cleancache) Dan Magenheimer
@ 2012-01-26 17:28 ` Dave Hansen
  2012-01-26 21:28   ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2012-01-26 17:28 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, Andrew Morton, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason

On 01/25/2012 01:58 PM, Dan Magenheimer wrote:
> (Feedback welcome if there is a different/better way to do this
> without using a page flag!)
> 
> Since about 2.6.27, the page replacement algorithm maintains
> an "active" bit to help decide which pages are most eligible
> to reclaim, see http://linux-mm.org/PageReplacementDesign 
> 
> This "active' information is also useful to cleancache but is lost
> by the time that cleancache has the opportunity to preserve the
> pageful of data.  This patch adds a new page flag "WasActive" to
> retain the state.  The flag may possibly be useful elsewhere.

I guess cleancache itself is clearing the bit, right?  I didn't see any
clearing going on in the patch.

I do think it also needs to get cleared on the way in to the page
allocator.  Otherwise:

	PageSetWasActive(page);
	free_page(page);
	...
	another_user_page = get_free_page()
	// now cleancache sees the active bit for the prev user

Or am I missing somewhere it gets cleared non-explicitly somewhere?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-26 17:28 ` Dave Hansen
@ 2012-01-26 21:28   ` Dan Magenheimer
  2012-01-27  0:31     ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-26 21:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Andrew Morton, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason

> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)

Thanks for the review Dave!
 
> On 01/25/2012 01:58 PM, Dan Magenheimer wrote:
> > (Feedback welcome if there is a different/better way to do this
> > without using a page flag!)
> >
> > Since about 2.6.27, the page replacement algorithm maintains
> > an "active" bit to help decide which pages are most eligible
> > to reclaim, see http://linux-mm.org/PageReplacementDesign
> >
> > This "active' information is also useful to cleancache but is lost
> > by the time that cleancache has the opportunity to preserve the
> > pageful of data.  This patch adds a new page flag "WasActive" to
> > retain the state.  The flag may possibly be useful elsewhere.
> 
> I guess cleancache itself is clearing the bit, right?  I didn't see any
> clearing going on in the patch.

No, there are no changes in cleancache.c so it isn't clearing
the bit.

> I do think it also needs to get cleared on the way in to the page
> allocator.  Otherwise:
> 
> 	PageSetWasActive(page);
> 	free_page(page);
> 	...
> 	another_user_page = get_free_page()
> 	// now cleancache sees the active bit for the prev user
> 
> Or am I missing somewhere it gets cleared non-explicitly somewhere?

True, it is not getting cleared and it should be, good catch!

I'll find the place to add the call to ClearPageWasActive() for v2.

Thanks,
Dan


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-26 21:28   ` Dan Magenheimer
@ 2012-01-27  0:31     ` Andrew Morton
  2012-01-27  0:56       ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2012-01-27  0:31 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Dave Hansen, linux-kernel, linux-mm, Konrad Wilk, Seth Jennings,
	Nitin Gupta, Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, riel,
	Chris Mason

On Thu, 26 Jan 2012 13:28:02 -0800 (PST)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > I do think it also needs to get cleared on the way in to the page
> > allocator.  Otherwise:
> > 
> > 	PageSetWasActive(page);
> > 	free_page(page);
> > 	...
> > 	another_user_page = get_free_page()
> > 	// now cleancache sees the active bit for the prev user
> > 
> > Or am I missing somewhere it gets cleared non-explicitly somewhere?
> 
> True, it is not getting cleared and it should be, good catch!

It should be added to PAGE_FLAGS_CHECK_AT_FREE.

> I'll find the place to add the call to ClearPageWasActive() for v2.

AFAICT this patch consumes our second-last page flag, or close to it. 
We'll all be breaking out in hysterics when the final one is gone.

This does appear to be a make or break thing for cleancache - if we
can't fix https://lkml.org/lkml/2012/1/22/61 then cleancache is pretty
much a dead duck.  But I'm going to ask for great effort to avoid
consuming another page flag.  Either fix cleancache via other means or,
much less desirably, find an existing page flag and overload it.

And I'm afraid that neither I nor other MM developers are likely to
help you with "fix cleancache via other means" because we weren't
provided with any description of what the problem is within cleancache,
nor how it will be fixed.  All we are given is the assertion "cleancache
needs this".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  0:31     ` Andrew Morton
@ 2012-01-27  0:56       ` Dan Magenheimer
  2012-01-27  1:15         ` Andrew Morton
  2012-01-27  3:28         ` Rik van Riel
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27  0:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, linux-kernel, linux-mm, Konrad Wilk, Seth Jennings,
	Nitin Gupta, Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, riel,
	Chris Mason

> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)

Thanks for the reply!
 
> On Thu, 26 Jan 2012 13:28:02 -0800 (PST)
> Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> 
> > > I do think it also needs to get cleared on the way in to the page
> > > allocator.  Otherwise:
> > >
> > > 	PageSetWasActive(page);
> > > 	free_page(page);
> > > 	...
> > > 	another_user_page = get_free_page()
> > > 	// now cleancache sees the active bit for the prev user
> > >
> > > Or am I missing somewhere it gets cleared non-explicitly somewhere?
> >
> > True, it is not getting cleared and it should be, good catch!
> 
> It should be added to PAGE_FLAGS_CHECK_AT_FREE.

I was thinking of clearing it in free_pages_prepare() before the
call to free_pages_check().  Does that make sense?  If so, then
it could also be added to PAGE_FLAGS_CHECK_AT_FREE, though it might
be a bit redundant.

> > I'll find the place to add the call to ClearPageWasActive() for v2.
> 
> AFAICT this patch consumes our second-last page flag, or close to it.
> We'll all be breaking out in hysterics when the final one is gone.

I'd be OK with only using this on 64-bit systems, though there
are ARM folks playing with zcache that might disagree.  Am I
correct in assuming that your "second-last page flag" concern
applies only to 32-bit systems?

> This does appear to be a make or break thing for cleancache - if we
> can't fix https://lkml.org/lkml/2012/1/22/61 then cleancache is pretty
> much a dead duck.

Hmmm... is that URL correct?  If so, there is some subtlety in
that thread that I am missing as I don't understand the relationship
to cleancache at all?

> But I'm going to ask for great effort to avoid
> consuming another page flag.  Either fix cleancache via other means or,
> much less desirably, find an existing page flag and overload it.

Cleancache isn't broken.  The fix is not a requirement for other
cleancache users (Xen and RAMster), though it is definitely useful.
It's not a _requirement_ for zcache either but definitely helps on
certain workloads and systems, see below.

> And I'm afraid that neither I nor other MM developers are likely to
> help you with "fix cleancache via other means" because we weren't
> provided with any description of what the problem is within cleancache,
> nor how it will be fixed.  All we are given is the assertion "cleancache
> needs this".

The patch comment says:

The patch resolves issues reported with cleancache which occur
especially during streaming workloads on older processors,
see https://lkml.org/lkml/2011/8/17/351

I can see that may not be sufficient, so let me expand on it.

First, just as page replacement worked prior to the active/inactive
redesign at 2.6.27, cleancache works without the WasActive page flag.
However, just as pre-2.6.27 page replacement had problems on
streaming workloads, so does cleancache.  The WasActive page flag
is an attempt to pass the same active/inactive info gathered by
the post-2.6.27 kernel into cleancache, with the same objectives and
presumably the same result: improving the "quality" of pages preserved
in memory thus reducing refaults.

Is that clearer?  If so, I'll do better on the description at v2.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  0:56       ` Dan Magenheimer
@ 2012-01-27  1:15         ` Andrew Morton
  2012-01-27  2:43           ` Dan Magenheimer
  2012-01-27  3:28         ` Rik van Riel
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2012-01-27  1:15 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Dave Hansen, linux-kernel, linux-mm, Konrad Wilk, Seth Jennings,
	Nitin Gupta, Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, riel,
	Chris Mason

On Thu, 26 Jan 2012 16:56:34 -0800 (PST)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > > I'll find the place to add the call to ClearPageWasActive() for v2.
> > 
> > AFAICT this patch consumes our second-last page flag, or close to it.
> > We'll all be breaking out in hysterics when the final one is gone.
> 
> I'd be OK with only using this on 64-bit systems, though there
> are ARM folks playing with zcache that might disagree.

64-bit only is pretty lame and will reduce the appeal of cleancache and
will increase the maintenance burden by causing different behavior on
different CPU types.  Most Linux machines are 32-bit!  (My cheerily
unsubstantiated assertion of the day).

>  Am I
> correct in assuming that your "second-last page flag" concern
> applies only to 32-bit systems?

Sort-of.  Usually a flag which is 64-bit-only causes the above issues.

> > This does appear to be a make or break thing for cleancache - if we
> > can't fix https://lkml.org/lkml/2012/1/22/61 then cleancache is pretty
> > much a dead duck.
> 
> Hmmm... is that URL correct?  If so, there is some subtlety in
> that thread that I am missing as I don't understand the relationship
> to cleancache at all?

err, sorry, I meant your https://lkml.org/lkml/2011/8/17/351.

> > And I'm afraid that neither I nor other MM developers are likely to
> > help you with "fix cleancache via other means" because we weren't
> > provided with any description of what the problem is within cleancache,
> > nor how it will be fixed.  All we are given is the assertion "cleancache
> > needs this".
> 
> The patch comment says:
> 
> The patch resolves issues reported with cleancache which occur
> especially during streaming workloads on older processors,
> see https://lkml.org/lkml/2011/8/17/351
> 
> I can see that may not be sufficient, so let me expand on it.
> 
> First, just as page replacement worked prior to the active/inactive
> redesign at 2.6.27, cleancache works without the WasActive page flag.
> However, just as pre-2.6.27 page replacement had problems on
> streaming workloads, so does cleancache.  The WasActive page flag
> is an attempt to pass the same active/inactive info gathered by
> the post-2.6.27 kernel into cleancache, with the same objectives and
> presumably the same result: improving the "quality" of pages preserved
> in memory thus reducing refaults.
> 
> Is that clearer?  If so, I'll do better on the description at v2.

It really didn't tell us anything, apart from referring to vague
"problems on streaming workloads", which forces everyone to go off and
do an hour or two's kernel archeology, probably in the area of
readahead.

Just describe the problem!  Why is it slow?  Where's the time being
spent?  How does the proposed fix (which we haven't actually seen)
address the problem?  If you inform us of these things then perhaps
someone will have a useful suggestion.  And as a side-effect, we'll
understand cleancache better.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  1:15         ` Andrew Morton
@ 2012-01-27  2:43           ` Dan Magenheimer
  2012-01-27  3:33             ` Rik van Riel
  2012-01-27 13:43             ` James Bottomley
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27  2:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, linux-kernel, linux-mm, Konrad Wilk, Seth Jennings,
	Nitin Gupta, Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, riel,
	Chris Mason

> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On Thu, 26 Jan 2012 16:56:34 -0800 (PST)
> Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> 
> > > > I'll find the place to add the call to ClearPageWasActive() for v2.
> > >
> > > AFAICT this patch consumes our second-last page flag, or close to it.
> > > We'll all be breaking out in hysterics when the final one is gone.
> >
> > I'd be OK with only using this on 64-bit systems, though there
> > are ARM folks playing with zcache that might disagree.
> 
> 64-bit only is pretty lame and will reduce the appeal of cleancache and
> will increase the maintenance burden by causing different behavior on
> different CPU types.  Most Linux machines are 32-bit!  (My cheerily
> unsubstantiated assertion of the day).

Three years ago, I would have agreed.  I suspect the tipping point
has been crossed, especially if one is counting only servers (not
smartphones) though my opinion is also unsubstantiated.

But...

> >  Am I
> > correct in assuming that your "second-last page flag" concern
> > applies only to 32-bit systems?
> 
> Sort-of.  Usually a flag which is 64-bit-only causes the above issues.

Agreed so a non-page-flag approach or overloading a page flag would
be better (though I'd still settle for 64-bit-only if it's the
only option).

Maybe the Active page bit could be overloaded with some minor
rewriting?  IOW, perhaps the Active bit could be ignored when
the page is moved to the inactive LRU?  (Confusing I know, but I am
just brainstorming...)

> > I can see that may not be sufficient, so let me expand on it.
> >
> > First, just as page replacement worked prior to the active/inactive
> > redesign at 2.6.27, cleancache works without the WasActive page flag.
> > However, just as pre-2.6.27 page replacement had problems on
> > streaming workloads, so does cleancache.  The WasActive page flag
> > is an attempt to pass the same active/inactive info gathered by
> > the post-2.6.27 kernel into cleancache, with the same objectives and
> > presumably the same result: improving the "quality" of pages preserved
> > in memory thus reducing refaults.
> >
> > Is that clearer?  If so, I'll do better on the description at v2.
> 
> It really didn't tell us anything, apart from referring to vague
> "problems on streaming workloads", which forces everyone to go off and
> do an hour or two's kernel archeology, probably in the area of
> readahead.
> 
> Just describe the problem!  Why is it slow?  Where's the time being
> spent?  How does the proposed fix (which we haven't actually seen)
> address the problem?  If you inform us of these things then perhaps
> someone will have a useful suggestion.  And as a side-effect, we'll
> understand cleancache better.

Sorry, I'm often told that my explanations are long-winded so
as a result I sometimes err on the side of brevity...

The problem is that if a pageframe is used for a page that is
very unlikely (or never going) to be used again instead of for
a page that IS likely to be used again, it results in more
refaults (which means more I/O which means poorer performance).
So we want to keep pages that are most likely to be used again.
And pages that were active are more likely to be used again than
pages that were never active... at least the post-2.6.27 kernel
makes that assumption.  A cleancache backend can keep or discard
any page it pleases... it makes sense for it to keep pages
that were previously active rather than pages that were never
active.

For zcache, we can store twice as many pages per pageframe.
But if we are storing two pages that are very unlikely
(or never going) to be used again instead of one page
that IS likely to be used again, that's probably still a bad choice.
Further, for every page that never gets used again (or gets reclaimed
before it can be used again because there's so much data streaming
through cleancache), zcache wastes the cpu cost of a page compression.
On newer machines, compression is suitably fast that this additional
cpu cost is small-ish.  On older machines, it adds up fast and that's
what Nebojsa was seeing in https://lkml.org/lkml/2011/8/17/351 

Page replacement algorithms are all about heuristics and
heuristics require information.  The WasActive flag provides
information that has proven useful to the kernel (as proven
by the 2.6.27 page replacement design rewrite) to cleancache
backends (such as zcache).

> How does the proposed fix (which we haven't actually seen)
> address the problem?

Not sure I understand... The patch as posted is the entire proposed
fix (except for the missing page bit clearing as Dave pointed out),
including for one cleancache backend (zcache).  RAMster can also
use WasActive.  It's not clear yet whether Xen tmem (or a future
KVM implementation) needs to.

Thanks again for the great feedback... after being buried in
tmem for so long, it's difficult to see another's perspective.

Dan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  0:56       ` Dan Magenheimer
  2012-01-27  1:15         ` Andrew Morton
@ 2012-01-27  3:28         ` Rik van Riel
  2012-01-27  5:11           ` Dan Magenheimer
  1 sibling, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2012-01-27  3:28 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, Chris Mason

On 01/26/2012 07:56 PM, Dan Magenheimer wrote:

> The patch resolves issues reported with cleancache which occur
> especially during streaming workloads on older processors,
> see https://lkml.org/lkml/2011/8/17/351
>
> I can see that may not be sufficient, so let me expand on it.
>
> First, just as page replacement worked prior to the active/inactive
> redesign at 2.6.27, cleancache works without the WasActive page flag.
> However, just as pre-2.6.27 page replacement had problems on
> streaming workloads, so does cleancache.  The WasActive page flag
> is an attempt to pass the same active/inactive info gathered by
> the post-2.6.27 kernel into cleancache, with the same objectives and
> presumably the same result: improving the "quality" of pages preserved
> in memory thus reducing refaults.
>
> Is that clearer?  If so, I'll do better on the description at v2.

Whether or not this patch improves things would depend
entirely on the workload, no?

I can imagine a workload where we have a small virtual
machine and a large cleancache buffer in the host.

Due to the small size of the virtual machine, pages
might not stay on the inactive list long enough to get
accessed twice in a row.

When the page gets rescued from the cleancache, we
know it was recently evicted and we can immediately
put it onto the active file list.

This is almost the opposite problem (and solution) of
what you ran into.

Both seem equally likely (and probable)...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  2:43           ` Dan Magenheimer
@ 2012-01-27  3:33             ` Rik van Riel
  2012-01-27  5:15               ` Dan Magenheimer
  2012-01-27 13:43             ` James Bottomley
  1 sibling, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2012-01-27  3:33 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, Chris Mason

On 01/26/2012 09:43 PM, Dan Magenheimer wrote:

> Maybe the Active page bit could be overloaded with some minor
> rewriting?  IOW, perhaps the Active bit could be ignored when
> the page is moved to the inactive LRU?  (Confusing I know, but I am
> just brainstorming...)

The PG_referenced bit is already overloaded.  We keep
the bit set when we move a page from the active to the
inactive list, so a page that was previously active
only needs to be referenced once to become active again.

The LRU bits (PG_lru, PG_active, etc) are needed to
figure out which LRU list the page is on.  I don't
think we can overload those...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  3:28         ` Rik van Riel
@ 2012-01-27  5:11           ` Dan Magenheimer
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27  5:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, Chris Mason

> From: Rik van Riel [mailto:riel@redhat.com]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On 01/26/2012 07:56 PM, Dan Magenheimer wrote:
> 
> > The patch resolves issues reported with cleancache which occur
> > especially during streaming workloads on older processors,
> > see https://lkml.org/lkml/2011/8/17/351
> >
> > I can see that may not be sufficient, so let me expand on it.
> >
> > First, just as page replacement worked prior to the active/inactive
> > redesign at 2.6.27, cleancache works without the WasActive page flag.
> > However, just as pre-2.6.27 page replacement had problems on
> > streaming workloads, so does cleancache.  The WasActive page flag
> > is an attempt to pass the same active/inactive info gathered by
> > the post-2.6.27 kernel into cleancache, with the same objectives and
> > presumably the same result: improving the "quality" of pages preserved
> > in memory thus reducing refaults.
> >
> > Is that clearer?  If so, I'll do better on the description at v2.
> 
> Whether or not this patch improves things would depend
> entirely on the workload, no?
> 
> I can imagine a workload where we have a small virtual
> machine and a large cleancache buffer in the host.
> 
> Due to the small size of the virtual machine, pages
> might not stay on the inactive list long enough to get
> accessed twice in a row.
> 
> This is almost the opposite problem (and solution) of
> what you ran into.
> 
> Both seem equally likely (and probable)...

Hi Rik --

Thanks for the reply!

Yes, that's right, in your example, the advantage of
cleancache would be lost.  But the cost would also be
nil because the cleancache backend (zcache) would be rejecting
the inactive pages so would never incur any compression
cost and never use any space.  So "first, do no harm"
is held true.

To get the best of both (like the post-2.6.27 kernel page
replacement algorithm), the cleancache backend could implement
some kind of active/inactive balancing... but that can be
done later with no mm change beyond the proposed patch.

> When the page gets rescued from the cleancache, we
> know it was recently evicted and we can immediately
> put it onto the active file list.

True, that would be another refinement.  The proposed
patch does, however, turn on WasActive so, even if the
page never makes it back to the active lru, it will
still go back into cleancache when evicted from the
pagecache.

Dan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  3:33             ` Rik van Riel
@ 2012-01-27  5:15               ` Dan Magenheimer
  2012-01-30  8:57                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27  5:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, Chris Mason

> From: Rik van Riel [mailto:riel@redhat.com]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On 01/26/2012 09:43 PM, Dan Magenheimer wrote:
> 
> > Maybe the Active page bit could be overloaded with some minor
> > rewriting?  IOW, perhaps the Active bit could be ignored when
> > the page is moved to the inactive LRU?  (Confusing I know, but I am
> > just brainstorming...)
> 
> The PG_referenced bit is already overloaded.  We keep
> the bit set when we move a page from the active to the
> inactive list, so a page that was previously active
> only needs to be referenced once to become active again.
> 
> The LRU bits (PG_lru, PG_active, etc) are needed to
> figure out which LRU list the page is on.  I don't
> think we can overload those...

I suspected that was true, but was just brainstorming.
Thanks for confirming.

Are there any other page bits that are dont-care when
a page is on an LRU list?

I'd also be interested in your/RedHat's opinion on the
64-bit vs 32-bit market.  Will RHEL7 even support 32-bit?

Dan


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  2:43           ` Dan Magenheimer
  2012-01-27  3:33             ` Rik van Riel
@ 2012-01-27 13:43             ` James Bottomley
  2012-01-27 17:32               ` Dan Magenheimer
  1 sibling, 1 reply; 20+ messages in thread
From: James Bottomley @ 2012-01-27 13:43 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason, lsf-pc

On Thu, 2012-01-26 at 18:43 -0800, Dan Magenheimer wrote:
> > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > It really didn't tell us anything, apart from referring to vague
> > "problems on streaming workloads", which forces everyone to go off and
> > do an hour or two's kernel archeology, probably in the area of
> > readahead.
> > 
> > Just describe the problem!  Why is it slow?  Where's the time being
> > spent?  How does the proposed fix (which we haven't actually seen)
> > address the problem?  If you inform us of these things then perhaps
> > someone will have a useful suggestion.  And as a side-effect, we'll
> > understand cleancache better.
> 
> Sorry, I'm often told that my explanations are long-winded so
> as a result I sometimes err on the side of brevity...
> 
> The problem is that if a pageframe is used for a page that is
> very unlikely (or never going) to be used again instead of for
> a page that IS likely to be used again, it results in more
> refaults (which means more I/O which means poorer performance).
> So we want to keep pages that are most likely to be used again.
> And pages that were active are more likely to be used again than
> pages that were never active... at least the post-2.6.27 kernel
> makes that assumption.  A cleancache backend can keep or discard
> any page it pleases... it makes sense for it to keep pages
> that were previously active rather than pages that were never
> active.
> 
> For zcache, we can store twice as many pages per pageframe.
> But if we are storing two pages that are very unlikely
> (or never going) to be used again instead of one page
> that IS likely to be used again, that's probably still a bad choice.
> Further, for every page that never gets used again (or gets reclaimed
> before it can be used again because there's so much data streaming
> through cleancache), zcache wastes the cpu cost of a page compression.
> On newer machines, compression is suitably fast that this additional
> cpu cost is small-ish.  On older machines, it adds up fast and that's
> what Nebojsa was seeing in https://lkml.org/lkml/2011/8/17/351 
> 
> Page replacement algorithms are all about heuristics and
> heuristics require information.  The WasActive flag provides
> information that has proven useful to the kernel (as proven
> by the 2.6.27 page replacement design rewrite) to cleancache
> backends (such as zcache).

So this sounds very similar to the recent discussion which I cc'd to
this list about readahead:

http://marc.info/?l=linux-scsi&m=132750980203130

It sounds like we want to measure something similar (whether a page has
been touched since it was brought in).  It isn't exactly your WasActive
flag because we want to know after we bring a page in for readahead was
it ever actually used, but it's very similar.

What I was wondering was instead of using a flag, could we make the LRU
lists do this for us ... something like have a special LRU list for
pages added to the page cache but never referenced since added?  It
sounds like you can get your WasActive information from the same type of
LRU list tricks (assuming we can do them).

I think the memory pressure eviction heuristic is: referenced but not
recently used pages first followed by unreferenced and not recently used
readahead pages.  The key being to keep recently read in readahead pages
until last because there's a time between doing readahead and getting
the page accessed and we don't want to trash a recently red in readahead
page only to have the process touch it and find it has to be read in
again.

James




^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27 13:43             ` James Bottomley
@ 2012-01-27 17:32               ` Dan Magenheimer
  2012-01-27 17:54                 ` James Bottomley
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27 17:32 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason, lsf-pc

> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On Thu, 2012-01-26 at 18:43 -0800, Dan Magenheimer wrote:
> > > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > > It really didn't tell us anything, apart from referring to vague
> > > "problems on streaming workloads", which forces everyone to go off and
> > > do an hour or two's kernel archeology, probably in the area of
> > > readahead.
> > >
> > > Just describe the problem!  Why is it slow?  Where's the time being
> > > spent?  How does the proposed fix (which we haven't actually seen)
> > > address the problem?  If you inform us of these things then perhaps
> > > someone will have a useful suggestion.  And as a side-effect, we'll
> > > understand cleancache better.
> >
> > Sorry, I'm often told that my explanations are long-winded so
> > as a result I sometimes err on the side of brevity...
> >
> > The problem is that if a pageframe is used for a page that is
> > very unlikely (or never going) to be used again instead of for
> > a page that IS likely to be used again, it results in more
> > refaults (which means more I/O which means poorer performance).
> > So we want to keep pages that are most likely to be used again.
> > And pages that were active are more likely to be used again than
> > pages that were never active... at least the post-2.6.27 kernel
> > makes that assumption.  A cleancache backend can keep or discard
> > any page it pleases... it makes sense for it to keep pages
> > that were previously active rather than pages that were never
> > active.
> >
> > For zcache, we can store twice as many pages per pageframe.
> > But if we are storing two pages that are very unlikely
> > (or never going) to be used again instead of one page
> > that IS likely to be used again, that's probably still a bad choice.
> > Further, for every page that never gets used again (or gets reclaimed
> > before it can be used again because there's so much data streaming
> > through cleancache), zcache wastes the cpu cost of a page compression.
> > On newer machines, compression is suitably fast that this additional
> > cpu cost is small-ish.  On older machines, it adds up fast and that's
> > what Nebojsa was seeing in https://lkml.org/lkml/2011/8/17/351
> >
> > Page replacement algorithms are all about heuristics and
> > heuristics require information.  The WasActive flag provides
> > information that has proven useful to the kernel (as proven
> > by the 2.6.27 page replacement design rewrite) to cleancache
> > backends (such as zcache).
> 
> So this sounds very similar to the recent discussion which I cc'd to
> this list about readahead:
> 
> http://marc.info/?l=linux-scsi&m=132750980203130
> 
> It sounds like we want to measure something similar (whether a page has
> been touched since it was brought in).  It isn't exactly your WasActive
> flag because we want to know after we bring a page in for readahead was
> it ever actually used, but it's very similar.

Agreed, there is similarity here.
 
> What I was wondering was instead of using a flag, could we make the LRU
> lists do this for us ... something like have a special LRU list for
> pages added to the page cache but never referenced since added?  It
> sounds like you can get your WasActive information from the same type of
> LRU list tricks (assuming we can do them).

Hmmm... I think this would mean more LRU queues but that may be
the right long term answer.  Something like?

	read but not used yet LRU
	readahead but not used yet LRU
	active LRU
	previously active LRU

Naturally, this then complicates the eviction selection process.

> I think the memory pressure eviction heuristic is: referenced but not
> recently used pages first followed by unreferenced and not recently used
> readahead pages.  The key being to keep recently read in readahead pages
> until last because there's a time between doing readahead and getting
> the page accessed and we don't want to trash a recently red in readahead
> page only to have the process touch it and find it has to be read in
> again.

I suspect that any further progress on the page replacement heuristics
is going to require more per-page data to be stored.  Which means
that it probably won't work under the constraints of 32-bit systems.
So it might also make sense to revisit when/whether to allow the
heuristics to be better on a 64-bit system than on a 32-bit.
(Even ARM now has 64-bit processors!)  I put this on my topic list
for LSF/MM, though I have no history of previous discussion so
this may already have previously been decided.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27 17:32               ` Dan Magenheimer
@ 2012-01-27 17:54                 ` James Bottomley
  2012-01-27 18:46                   ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: James Bottomley @ 2012-01-27 17:54 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason, lsf-pc

On Fri, 2012-01-27 at 09:32 -0800, Dan Magenheimer wrote:
> > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> > What I was wondering was instead of using a flag, could we make the LRU
> > lists do this for us ... something like have a special LRU list for
> > pages added to the page cache but never referenced since added?  It
> > sounds like you can get your WasActive information from the same type of
> > LRU list tricks (assuming we can do them).
> 
> Hmmm... I think this would mean more LRU queues but that may be
> the right long term answer.  Something like?
> 
> 	read but not used yet LRU
> 	readahead but not used yet LRU

What's the difference between these two?  I think read but not used is
some form of readahead regardless of where it came from.

> 	active LRU
> 	previously active LRU

I don't quite understand why you need two queues here, either.  Surely
active is logically at the bottom of the LRU and previously active at
the top (assuming we've separated the unused pages to a different LRU
list).

> Naturally, this then complicates the eviction selection process.
> 
> > I think the memory pressure eviction heuristic is: referenced but not
> > recently used pages first followed by unreferenced and not recently used
> > readahead pages.  The key being to keep recently read in readahead pages
> > until last because there's a time between doing readahead and getting
> > the page accessed and we don't want to trash a recently red in readahead
> > page only to have the process touch it and find it has to be read in
> > again.
> 
> I suspect that any further progress on the page replacement heuristics
> is going to require more per-page data to be stored.  Which means
> that it probably won't work under the constraints of 32-bit systems.
> So it might also make sense to revisit when/whether to allow the
> heuristics to be better on a 64-bit system than on a 32-bit.
> (Even ARM now has 64-bit processors!)  I put this on my topic list
> for LSF/MM, though I have no history of previous discussion so
> this may already have previously been decided.

I've got to say why? on this.  The object is to find a simple solution
that's good enough.  I think separating the LRU list into two based on
once referenced/never referenced might be enough to derive all the
information we need.  Until that theory is disproved, there's not much
benefit to developing ever more complex heuristics.

James



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27 17:54                 ` James Bottomley
@ 2012-01-27 18:46                   ` Dan Magenheimer
  2012-01-27 21:49                     ` James Bottomley
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-27 18:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason, lsf-pc

> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On Fri, 2012-01-27 at 09:32 -0800, Dan Magenheimer wrote:
> > > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> > > What I was wondering was instead of using a flag, could we make the LRU
> > > lists do this for us ... something like have a special LRU list for
> > > pages added to the page cache but never referenced since added?  It
> > > sounds like you can get your WasActive information from the same type of
> > > LRU list tricks (assuming we can do them).
> >
> > Hmmm... I think this would mean more LRU queues but that may be
> > the right long term answer.  Something like?
> >
> > 	read but not used yet LRU
> > 	readahead but not used yet LRU
> 
> What's the difference between these two?  I think read but not used is
> some form of readahead regardless of where it came from.

Oops, I meant "read but used only once (to resolve a page fault)" LRU.

> > 	active LRU
> > 	previously active LRU
> 
> I don't quite understand why you need two queues here, either.  Surely
> active is logically at the bottom of the LRU and previously active at
> the top (assuming we've separated the unused pages to a different LRU
> list).

Cleancache only sees pages when they "fall off the end of the world",
so would like to differentiate between clean pages that were used just
once and pages that were used more than once, but not recently.
There may be better heuristics for cleancache too.
 
> > Naturally, this then complicates the eviction selection process.
> >
> > > I think the memory pressure eviction heuristic is: referenced but not
> > > recently used pages first followed by unreferenced and not recently used
> > > readahead pages.  The key being to keep recently read in readahead pages
> > > until last because there's a time between doing readahead and getting
> > > the page accessed and we don't want to trash a recently red in readahead
> > > page only to have the process touch it and find it has to be read in
> > > again.
> >
> > I suspect that any further progress on the page replacement heuristics
> > is going to require more per-page data to be stored.  Which means
> > that it probably won't work under the constraints of 32-bit systems.
> > So it might also make sense to revisit when/whether to allow the
> > heuristics to be better on a 64-bit system than on a 32-bit.
> > (Even ARM now has 64-bit processors!)  I put this on my topic list
> > for LSF/MM, though I have no history of previous discussion so
> > this may already have previously been decided.
> 
> I've got to say why? on this.  The object is to find a simple solution
> that's good enough.  I think separating the LRU list into two based on
> once referenced/never referenced might be enough to derive all the
> information we need.

Maybe I'm misunderstanding (in which case a complete list of the LRU
queues you are proposing would be good), but doesn't the existing design
already do this?  Actually, IIUC the existing file cache design has a
	"referenced more than once" LRU ("active")
and a second ("inactive") LRU queue which combines
	"referenced only once" OR
	"readahead and never referenced" OR
	"referenced more than once but not referenced recently"
Are you proposing to add more queues or define the existing two queues
differently?

So my strawman list of LRU queues separates the three ORs in the inactive
LRU queue to three separate LRU queues, for a total of four.

 INACTIVE-typeA) read but used only once to resolve a page fault LRU
 INACTIVE-typeB) readahead but not used yet LRU
 INACTIVE-typeC) previously active LRU
 ACTIVE) active LRU

> Until that theory is disproved, there's not much
> benefit to developing ever more complex heuristics.

I think we are both "disproving" the theory that the existing
two LRU queues are sufficient.  You need to differentiate between
typeA and typeB (to ensure typeB doesn't get evicted too quickly)
and I would like to differentiate typeC from typeA/typeB to have
heuristic data for cleancache at the time the kernel evicts the page.

Does that make sense?

P.S. I'm interpreting from http://linux-mm.org/PageReplacementDesign 
rather from full understanding of the current code, so please correct
me if I got it wrong.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27 18:46                   ` Dan Magenheimer
@ 2012-01-27 21:49                     ` James Bottomley
  2012-01-29  0:50                       ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: James Bottomley @ 2012-01-27 21:49 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andrew Morton, Dave Hansen, linux-kernel, linux-mm, Konrad Wilk,
	Seth Jennings, Nitin Gupta, Nebojsa Trpkovic, minchan,
	KAMEZAWA Hiroyuki, riel, Chris Mason, lsf-pc

On Fri, 2012-01-27 at 10:46 -0800, Dan Magenheimer wrote:
> > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> > Subject: RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> > 
> > On Fri, 2012-01-27 at 09:32 -0800, Dan Magenheimer wrote:
> > > > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> > > > What I was wondering was instead of using a flag, could we make the LRU
> > > > lists do this for us ... something like have a special LRU list for
> > > > pages added to the page cache but never referenced since added?  It
> > > > sounds like you can get your WasActive information from the same type of
> > > > LRU list tricks (assuming we can do them).
> > >
> > > Hmmm... I think this would mean more LRU queues but that may be
> > > the right long term answer.  Something like?
> > >
> > > 	read but not used yet LRU
> > > 	readahead but not used yet LRU
> > 
> > What's the difference between these two?  I think read but not used is
> > some form of readahead regardless of where it came from.
> 
> Oops, I meant "read but used only once (to resolve a page fault)" LRU.
> 
> > > 	active LRU
> > > 	previously active LRU
> > 
> > I don't quite understand why you need two queues here, either.  Surely
> > active is logically at the bottom of the LRU and previously active at
> > the top (assuming we've separated the unused pages to a different LRU
> > list).
> 
> Cleancache only sees pages when they "fall off the end of the world",
> so would like to differentiate between clean pages that were used just
> once and pages that were used more than once, but not recently.
> There may be better heuristics for cleancache too.
>  
> > > Naturally, this then complicates the eviction selection process.
> > >
> > > > I think the memory pressure eviction heuristic is: referenced but not
> > > > recently used pages first followed by unreferenced and not recently used
> > > > readahead pages.  The key being to keep recently read in readahead pages
> > > > until last because there's a time between doing readahead and getting
> > > > the page accessed and we don't want to trash a recently red in readahead
> > > > page only to have the process touch it and find it has to be read in
> > > > again.
> > >
> > > I suspect that any further progress on the page replacement heuristics
> > > is going to require more per-page data to be stored.  Which means
> > > that it probably won't work under the constraints of 32-bit systems.
> > > So it might also make sense to revisit when/whether to allow the
> > > heuristics to be better on a 64-bit system than on a 32-bit.
> > > (Even ARM now has 64-bit processors!)  I put this on my topic list
> > > for LSF/MM, though I have no history of previous discussion so
> > > this may already have previously been decided.
> > 
> > I've got to say why? on this.  The object is to find a simple solution
> > that's good enough.  I think separating the LRU list into two based on
> > once referenced/never referenced might be enough to derive all the
> > information we need.
> 
> Maybe I'm misunderstanding (in which case a complete list of the LRU
> queues you are proposing would be good), but doesn't the existing design
> already do this?  Actually, IIUC the existing file cache design has a
> 	"referenced more than once" LRU ("active")

So here, I was just saying your desire to store more data in the page
table and expand the page flags looks complex.

Perhaps we do have a fundamental misunderstanding:  For readahead, I
don't really care about the referenced part.  referenced just means
pointed to by one or more vmas and active means pointed to by two or
more vmas (unless executable in which case it's one).

What I think we care about for readahead is accessed.  This means a page
that got touched regardless of how many references it has.  An
unaccessed unaged RA page is a less good candidate for reclaim because
it should soon be accessed (under the RA heuristics) than an accessed RA
page.  Obviously if the heuristics misfire, we end up with futile RA
pages, which we read in expecting to be accessed, but which in fact
never were (so an unaccessed aged RA page) and need to be evicted.

But for me, perhaps it's enough to put unaccessed RA pages into the
active list on instantiation and then actually put them in the inactive
list when they're accessed (following the usual active rules, so they
become inactive if they only have a single reference, but remain active
if they have two or more).

I'm less clear on why you think a WasActive() flag is needed.  I think
you mean a member of the inactive list that was at some point previously
active.

Perhaps now I've said all this, it's less clear what readahead wants and
what you want with your flag are similar.

> and a second ("inactive") LRU queue which combines
> 	"referenced only once" OR
> 	"readahead and never referenced" OR
> 	"referenced more than once but not referenced recently"
> Are you proposing to add more queues or define the existing two queues
> differently?

Actually, I think I'm just proposing different insertion criteria.

Incidentally, we don't have two LRU lists, we have five:  we have a four
way split between active/inactive and anon/file backed and then we have
the unreclaimable lru list.

Obviously, for readahead I'm ignoring all the anon and unreclaimable
lists.

> So my strawman list of LRU queues separates the three ORs in the inactive
> LRU queue to three separate LRU queues, for a total of four.
> 
>  INACTIVE-typeA) read but used only once to resolve a page fault LRU
>  INACTIVE-typeB) readahead but not used yet LRU
>  INACTIVE-typeC) previously active LRU
>  ACTIVE) active LRU

Um, that's complex.  Doesn't your inactive-C list really just identify
pages that were shared but have sunk in the LRU lists due to lack of
use?  Inactive-A would appear to identify either streaming pages, or
single application data pages.  Like I said, I think Inactive-B is easy
to work around just by putting RA pages into the active list and letting
the current algorithms run their course.

If that's what you really want, isn't the simplest thing to do to add a
PAGECACHE_TAG_ACTIVE in the page cache radix tree?  We're only using
three of the radix tag bits, so we're not as pressed for them as we are
for actual page bits.  You set it at the same time you activate the page
(mostly, you probably don't want to set it for RA pages).  Then it
remains in the radix tree even if the page sinks to the inactive list.

> > Until that theory is disproved, there's not much
> > benefit to developing ever more complex heuristics.
> 
> I think we are both "disproving" the theory that the existing
> two LRU queues are sufficient.

What I meant was begin with a theory that something simple works first
and then disprove it before moving on to something more complex ...
rather than start out with complex first.

James

>   You need to differentiate between
> typeA and typeB (to ensure typeB doesn't get evicted too quickly)
> and I would like to differentiate typeC from typeA/typeB to have
> heuristic data for cleancache at the time the kernel evicts the page.
> 
> Does that make sense?
> 
> P.S. I'm interpreting from http://linux-mm.org/PageReplacementDesign 
> rather from full understanding of the current code, so please correct
> me if I got it wrong.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27 21:49                     ` James Bottomley
@ 2012-01-29  0:50                       ` Rik van Riel
  2012-01-29 22:25                         ` James Bottomley
  0 siblings, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2012-01-29  0:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dan Magenheimer, Andrew Morton, Dave Hansen, linux-kernel,
	linux-mm, Konrad Wilk, Seth Jennings, Nitin Gupta,
	Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, Chris Mason,
	lsf-pc

On 01/27/2012 04:49 PM, James Bottomley wrote:

> So here, I was just saying your desire to store more data in the page
> table and expand the page flags looks complex.
>
> Perhaps we do have a fundamental misunderstanding:  For readahead, I
> don't really care about the referenced part.  referenced just means
> pointed to by one or more vmas and active means pointed to by two or
> more vmas (unless executable in which case it's one).

That is not at all what "referenced" means everywhere
else in the VM.

If you write theories on what Dan should use, it would
help if you limited yourself to stuff the VM provides
and/or could provide :)

> What I think we care about for readahead is accessed.  This means a page
> that got touched regardless of how many references it has.  An
> unaccessed unaged RA page is a less good candidate for reclaim because
> it should soon be accessed (under the RA heuristics) than an accessed RA
> page.  Obviously if the heuristics misfire, we end up with futile RA
> pages, which we read in expecting to be accessed, but which in fact
> never were (so an unaccessed aged RA page) and need to be evicted.
>
> But for me, perhaps it's enough to put unaccessed RA pages into the
> active list on instantiation and then actually put them in the inactive
> list when they're accessed

That is an absolutely terrible idea for many obvious reasons.

Having readahead pages displace the working set wholesale
is the absolute last thing we want.

> I'm less clear on why you think a WasActive() flag is needed.  I think
> you mean a member of the inactive list that was at some point previously
> active.

> Um, that's complex.  Doesn't your inactive-C list really just identify
> pages that were shared but have sunk in the LRU lists due to lack of
> use?

Nope. Pages that are not mapped can still end up on the active
list, by virtue of getting accessed multiple times in a "short"
period of time (the residence on the inactive list).

We want to cache frequently accessed pages with preference over
streaming IO data that gets accessed infrequently.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-29  0:50                       ` Rik van Riel
@ 2012-01-29 22:25                         ` James Bottomley
  0 siblings, 0 replies; 20+ messages in thread
From: James Bottomley @ 2012-01-29 22:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dan Magenheimer, Andrew Morton, Dave Hansen, linux-kernel,
	linux-mm, Konrad Wilk, Seth Jennings, Nitin Gupta,
	Nebojsa Trpkovic, minchan, KAMEZAWA Hiroyuki, Chris Mason,
	lsf-pc

On Sat, 2012-01-28 at 19:50 -0500, Rik van Riel wrote:
> On 01/27/2012 04:49 PM, James Bottomley wrote:
> 
> > So here, I was just saying your desire to store more data in the page
> > table and expand the page flags looks complex.
> >
> > Perhaps we do have a fundamental misunderstanding:  For readahead, I
> > don't really care about the referenced part.  referenced just means
> > pointed to by one or more vmas and active means pointed to by two or
> > more vmas (unless executable in which case it's one).
> 
> That is not at all what "referenced" means everywhere
> else in the VM.

I'm aware there's more subtlety, but I think it's a reasonable
generality: your one sentence summary of page_referenced() seems
conspicuously absent; care to provide it ... or would you prefer the VM
internals remain inaccessible to mere mortals? 

> If you write theories on what Dan should use, it would
> help if you limited yourself to stuff the VM provides
> and/or could provide :)

I didn't give any theories at all about what he should or shouldn't do.
I'm trying to think out loud about whether what he wants and what I
think would help readahead are the same thing (I started of thinking
they were and I talked myself out of it by the end of the previous
email).

> > What I think we care about for readahead is accessed.  This means a page
> > that got touched regardless of how many references it has.  An
> > unaccessed unaged RA page is a less good candidate for reclaim because
> > it should soon be accessed (under the RA heuristics) than an accessed RA
> > page.  Obviously if the heuristics misfire, we end up with futile RA
> > pages, which we read in expecting to be accessed, but which in fact
> > never were (so an unaccessed aged RA page) and need to be evicted.
> >
> > But for me, perhaps it's enough to put unaccessed RA pages into the
> > active list on instantiation and then actually put them in the inactive
> > list when they're accessed
> 
> That is an absolutely terrible idea for many obvious reasons.
> 
> Having readahead pages displace the working set wholesale
> is the absolute last thing we want.

Um, only if you assume you place them at the most recently used head of
the active list ... for obvious reasons, that's not what I was thinking.
I'm still not sure it's more feasible than having separate lists, though
since most recently used tail is  nasty because it's reverse ordering
them and probably not providing sufficient boost and middle insertion
looks just plain wrong.

> > I'm less clear on why you think a WasActive() flag is needed.  I think
> > you mean a member of the inactive list that was at some point previously
> > active.
> 
> > Um, that's complex.  Doesn't your inactive-C list really just identify
> > pages that were shared but have sunk in the LRU lists due to lack of
> > use?
> 
> Nope. Pages that are not mapped can still end up on the active
> list, by virtue of getting accessed multiple times in a "short"
> period of time (the residence on the inactive list).
> 
> We want to cache frequently accessed pages with preference over
> streaming IO data that gets accessed infrequently.

Well, no, that's what I'm trying to argue against.  The chances are that
Streaming RA I/O gets accessed once (the classic movie scenario).  So
the idea is that if you can identify RA as streaming, it should be kept
while unaccessed but discarded after it's been accessed.  To get the LRU
lists to identify this, we want to give a boost to unaccessed unaged RA,
a suppression to accessed once RA and standard heuristics if RA gets
accessed more than once.

James



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-27  5:15               ` Dan Magenheimer
@ 2012-01-30  8:57                 ` KAMEZAWA Hiroyuki
  2012-01-30 22:03                   ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  8:57 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Rik van Riel, Andrew Morton, Dave Hansen, linux-kernel, linux-mm,
	Konrad Wilk, Seth Jennings, Nitin Gupta, Nebojsa Trpkovic,
	minchan, Chris Mason

On Thu, 26 Jan 2012 21:15:16 -0800 (PST)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > From: Rik van Riel [mailto:riel@redhat.com]
> > Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> > 
> > On 01/26/2012 09:43 PM, Dan Magenheimer wrote:
> > 
> > > Maybe the Active page bit could be overloaded with some minor
> > > rewriting?  IOW, perhaps the Active bit could be ignored when
> > > the page is moved to the inactive LRU?  (Confusing I know, but I am
> > > just brainstorming...)
> > 
> > The PG_referenced bit is already overloaded.  We keep
> > the bit set when we move a page from the active to the
> > inactive list, so a page that was previously active
> > only needs to be referenced once to become active again.
> > 
> > The LRU bits (PG_lru, PG_active, etc) are needed to
> > figure out which LRU list the page is on.  I don't
> > think we can overload those...
> 
> I suspected that was true, but was just brainstorming.
> Thanks for confirming.
> 
> Are there any other page bits that are dont-care when
> a page is on an LRU list?
> 
> I'd also be interested in your/RedHat's opinion on the
> 64-bit vs 32-bit market.  Will RHEL7 even support 32-bit?
> 

How about replacing PG_slab ? 

I think  PageSlab(page) be implemented as

#define SLABMAGIC		(some value)
#define PageSlab(page)		(page->mapping == SLABMAGIC) 

or some...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm: implement WasActive page flag (for improving cleancache)
  2012-01-30  8:57                 ` KAMEZAWA Hiroyuki
@ 2012-01-30 22:03                   ` Dan Magenheimer
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Magenheimer @ 2012-01-30 22:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Andrew Morton, Dave Hansen, linux-kernel, linux-mm,
	Konrad Wilk, Seth Jennings, Nitin Gupta, Nebojsa Trpkovic,
	minchan, Chris Mason

> From: KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com]
> Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> 
> On Thu, 26 Jan 2012 21:15:16 -0800 (PST)
> Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> 
> > > From: Rik van Riel [mailto:riel@redhat.com]
> > > Subject: Re: [PATCH] mm: implement WasActive page flag (for improving cleancache)
> > >
> > > On 01/26/2012 09:43 PM, Dan Magenheimer wrote:
> > >
> > > > Maybe the Active page bit could be overloaded with some minor
> > > > rewriting?  IOW, perhaps the Active bit could be ignored when
> > > > the page is moved to the inactive LRU?  (Confusing I know, but I am
> > > > just brainstorming...)
> > >
> > > The PG_referenced bit is already overloaded.  We keep
> > > the bit set when we move a page from the active to the
> > > inactive list, so a page that was previously active
> > > only needs to be referenced once to become active again.
> > >
> > > The LRU bits (PG_lru, PG_active, etc) are needed to
> > > figure out which LRU list the page is on.  I don't
> > > think we can overload those...
> >
> > I suspected that was true, but was just brainstorming.
> > Thanks for confirming.
> >
> > Are there any other page bits that are dont-care when
> > a page is on an LRU list?
> 
> How about replacing PG_slab ?
> 
> I think  PageSlab(page) be implemented as
> 
> #define SLABMAGIC		(some value)
> #define PageSlab(page)		(page->mapping == SLABMAGIC)
> 
> or some...

Hi Kame --

Sounds like a great idea!  It looks like the PG_slab bit is part
of the kernel<->user ABI (see fs/proc/page.c: stable_page_flags())
but I think it can be simulated without actually using the physical
bit in struct pageflags.  If so, PG_slab is completely free
to be used/overloaded!

Here's a possible patch... compile/boot tested but nothing else (and
memory-failure.c isn't even compiled and may need more work):

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index dee29fa..ef8498e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -80,7 +80,7 @@ enum pageflags {
 	PG_dirty,
 	PG_lru,
 	PG_active,
-	PG_slab,
+	PG_slab,		/* for legacy kernel<->user ABI only */
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
 	PG_reserved,
@@ -206,7 +206,6 @@ PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
 	TESTCLEARFLAG(Active, active)
-__PAGEFLAG(Slab, slab)
 PAGEFLAG(Checked, checked)		/* Used by some filesystems */
 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
@@ -220,6 +219,28 @@ PAGEFLAG(WasActive, was_active)
 #endif
 
 /*
+ * for legacy ABI purposes, PG_slab remains defined but all attempted
+ * uses of the bit are now simulated without using the actual page-flag bit
+ */
+struct address_space;
+#define SLAB_MAGIC ((struct address_space *)0x80758075)
+static inline bool PageSlab(struct page *page)
+{
+	return page->mapping == SLAB_MAGIC;
+}
+
+static inline void __SetPageSlab(struct page *page)
+{
+	page->mapping = SLAB_MAGIC;
+}
+
+static inline void __ClearPageSlab(struct page *page)
+{
+	page->mapping = NULL;
+}
+
+
+/*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
  * - PG_private and PG_private_2 cause releasepage() and co to be invoked
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..b4dde77 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -742,7 +742,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define head		(1UL << PG_head)
 #define tail		(1UL << PG_tail)
 #define compound	(1UL << PG_compound)
-#define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
 static struct page_state {
@@ -757,13 +756,6 @@ static struct page_state {
 	 * PG_buddy pages only make a small fraction of all free pages.
 	 */
 
-	/*
-	 * Could in theory check if slab page is free or if we can drop
-	 * currently unused objects without touching them. But just
-	 * treat it as standard kernel for now.
-	 */
-	{ slab,		slab,		"kernel slab",	me_kernel },
-
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		"huge",		me_huge_page },
 	{ tail,		tail,		"huge",		me_huge_page },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b8ba3a..48451a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5787,11 +5787,12 @@ static struct trace_print_flags pageflag_names[] = {
 	{-1UL,				NULL		},
 };
 
-static void dump_page_flags(unsigned long flags)
+static void dump_page_flags(struct page *page)
 {
 	const char *delim = "";
 	unsigned long mask;
 	int i;
+	unsigned long flags = page->flags;
 
 	printk(KERN_ALERT "page flags: %#lx(", flags);
 
@@ -5801,7 +5802,10 @@ static void dump_page_flags(unsigned long flags)
 	for (i = 0; pageflag_names[i].name && flags; i++) {
 
 		mask = pageflag_names[i].mask;
-		if ((flags & mask) != mask)
+		if (mask == PG_slab) {
+			if (!PageSlab(page))
+				continue;
+		} else if ((flags & mask) != mask)
 			continue;
 
 		flags &= ~mask;
@@ -5822,6 +5826,6 @@ void dump_page(struct page *page)
 	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
 		page, atomic_read(&page->_count), page_mapcount(page),
 		page->mapping, page->index);
-	dump_page_flags(page->flags);
+	dump_page_flags(page);
 	mem_cgroup_print_bad_page(page);
 }
diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..a0fdca1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1361,7 +1361,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
-	page->flags |= 1 << PG_slab;
+	page->mapping = SLAB_MAGIC;
 
 	start = page_address(page);



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-30 22:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-25 21:58 [PATCH] mm: implement WasActive page flag (for improving cleancache) Dan Magenheimer
2012-01-26 17:28 ` Dave Hansen
2012-01-26 21:28   ` Dan Magenheimer
2012-01-27  0:31     ` Andrew Morton
2012-01-27  0:56       ` Dan Magenheimer
2012-01-27  1:15         ` Andrew Morton
2012-01-27  2:43           ` Dan Magenheimer
2012-01-27  3:33             ` Rik van Riel
2012-01-27  5:15               ` Dan Magenheimer
2012-01-30  8:57                 ` KAMEZAWA Hiroyuki
2012-01-30 22:03                   ` Dan Magenheimer
2012-01-27 13:43             ` James Bottomley
2012-01-27 17:32               ` Dan Magenheimer
2012-01-27 17:54                 ` James Bottomley
2012-01-27 18:46                   ` Dan Magenheimer
2012-01-27 21:49                     ` James Bottomley
2012-01-29  0:50                       ` Rik van Riel
2012-01-29 22:25                         ` James Bottomley
2012-01-27  3:28         ` Rik van Riel
2012-01-27  5:11           ` Dan Magenheimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).