* [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-05  2:40 ` Wu Fengguang
From: Wu Fengguang @ 2009-08-05  2:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Greetings,

Jeff Dike found that many KVM pages are being refaulted in 2.6.29:

"Lots of pages between discarded due to memory pressure only to be
faulted back in soon after. These pages are nearly all stack pages.
This is not consistent - sometimes there are relatively few such pages
and they are spread out between processes."

The refaults can be drastically reduced by the following patch, which
respects the referenced bit of all anonymous pages (including the KVM
pages).

However it risks reintroducing the problem addressed by commit 7e9cd4842
(fix reclaim scalability problem by ignoring the referenced bit,
mainly the pte young bit). I wonder if there are better solutions?

Thanks,
Fengguang

---
 mm/vmscan.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned 
 			 * Identify referenced, file-backed active pages and
 			 * give them one more trip around the active list. So
 			 * that executable code get better chances to stay in
-			 * memory under moderate memory pressure.  Anon pages
-			 * are not likely to be evicted by use-once streaming
-			 * IO, plus JVM can create lots of anon VM_EXEC pages,
-			 * so we ignore them here.
+			 * memory under moderate memory pressure.
+			 *
+			 * Also protect anon pages: swapping could be costly,
+			 * and KVM guest's referenced bit is helpful.
 			 */
-			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
+			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}
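
For illustration, here is a toy user-space model of the logic change
(hypothetical names; only the boolean condition mirrors the hunk above,
the real decision lives in shrink_active_list()):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: "referenced" stands in for page_referenced() != 0,
 * "exec" for vm_flags & VM_EXEC, "anon" for PageAnon(page). */
static bool keep_active_old(bool referenced, bool exec, bool anon)
{
	return referenced && exec && !anon;	/* before the patch */
}

static bool keep_active_new(bool referenced, bool exec, bool anon)
{
	return referenced && (exec || anon);	/* after the patch */
}

int main(void)
{
	/* A referenced KVM guest page: anonymous, not VM_EXEC. */
	printf("old: %d\n", keep_active_old(true, false, true));	/* 0 */
	printf("new: %d\n", keep_active_new(true, false, true));	/* 1 */
	return 0;
}

With the old check a referenced anonymous page is deactivated anyway;
with the new one it gets another trip around the active list.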


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  2:40 ` Wu Fengguang
@ 2009-08-05  4:15   ` KOSAKI Motohiro
From: KOSAKI Motohiro @ 2009-08-05  4:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Hi

> Greetings,
> 
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
> 
> "Lots of pages between discarded due to memory pressure only to be
> faulted back in soon after. These pages are nearly all stack pages.
> This is not consistent - sometimes there are relatively few such pages
> and they are spread out between processes."

This result really surprises me.

  - Why does this issue happen only on KVM?
  - Why can't shrink_inactive_list() find the pte young bit?
    Is this really unused stack?

> 
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
> 
> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?
> 
> Thanks,
> Fengguang
> 
> ---
>  mm/vmscan.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned 
>  			 * Identify referenced, file-backed active pages and
>  			 * give them one more trip around the active list. So
>  			 * that executable code get better chances to stay in
> -			 * memory under moderate memory pressure.  Anon pages
> -			 * are not likely to be evicted by use-once streaming
> -			 * IO, plus JVM can create lots of anon VM_EXEC pages,
> -			 * so we ignore them here.
> +			 * memory under moderate memory pressure.
> +			 *
> +			 * Also protect anon pages: swapping could be costly,
> +			 * and KVM guest's referenced bit is helpful.
>  			 */
> -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>  				list_add(&page->lru, &l_active);
>  				continue;
>  			}





* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  4:15   ` KOSAKI Motohiro
@ 2009-08-05  4:41     ` Wu Fengguang
From: Wu Fengguang @ 2009-08-05  4:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 12:15:40PM +0800, KOSAKI Motohiro wrote:
> Hi
> 
> > Greetings,
> > 
> > Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
> > 
> > "Lots of pages between discarded due to memory pressure only to be
> > faulted back in soon after. These pages are nearly all stack pages.
> > This is not consistent - sometimes there are relatively few such pages
> > and they are spread out between processes."
> 
> This result really surprises me.
> 
>   - Why does this issue happen only on KVM?

Maybe because:
- they take up a large portion of memory
- their access patterns/frequencies vary a lot

>   - Why can't shrink_inactive_list() find the pte young bit?

It can, but I guess the grace period would be much shorter than with
this patch.

>     Is this really unused stack?

They were actually being refaulted.  So they should be pages that are
neither too hot nor too cold.

Thanks,
Fengguang

> > 
> > The refaults can be drastically reduced by the following patch, which
> > respects the referenced bit of all anonymous pages (including the KVM
> > pages).
> > 
> > However it risks reintroducing the problem addressed by commit 7e9cd4842
> > (fix reclaim scalability problem by ignoring the referenced bit,
> > mainly the pte young bit). I wonder if there are better solutions?
> > 
> > Thanks,
> > Fengguang
> > 
> > ---
> >  mm/vmscan.c |   10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned 
> >  			 * Identify referenced, file-backed active pages and
> >  			 * give them one more trip around the active list. So
> >  			 * that executable code get better chances to stay in
> > -			 * memory under moderate memory pressure.  Anon pages
> > -			 * are not likely to be evicted by use-once streaming
> > -			 * IO, plus JVM can create lots of anon VM_EXEC pages,
> > -			 * so we ignore them here.
> > +			 * memory under moderate memory pressure.
> > +			 *
> > +			 * Also protect anon pages: swapping could be costly,
> > +			 * and KVM guest's referenced bit is helpful.
> >  			 */
> > -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >  				list_add(&page->lru, &l_active);
> >  				continue;
> >  			}
> 
> 


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  2:40 ` Wu Fengguang
@ 2009-08-05  7:58   ` Avi Kivity
From: Avi Kivity @ 2009-08-05  7:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 05:40 AM, Wu Fengguang wrote:
> Greetings,
>
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>
> "Lots of pages between discarded due to memory pressure only to be
> faulted back in soon after. These pages are nearly all stack pages.
> This is not consistent - sometimes there are relatively few such pages
> and they are spread out between processes."
>
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
>
> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?
>    

How do you distinguish between kvm pages and non-kvm anonymous pages?  
More importantly, why should you?

Jeff, do you see the refaults on Nehalem systems?  If so, that's likely 
due to the lack of an accessed bit on EPT pagetables.  It would be 
interesting to compare with Barcelona  (which does).

If that's indeed the case, we can have the EPT ageing mechanism give 
pages a bit more time around by using an available bit in the EPT PTEs 
to return accessed on the first pass and not-accessed on the second.
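
As a rough user-space sketch of that idea (hypothetical names; bit 63
stands in for a software-available EPT PTE bit, as in the patch in the
follow-up mail):

#include <stdint.h>
#include <stdio.h>

#define SOFT_ACCESSED (1ull << 63)	/* software-available PTE bit */

/* Set at fault time; ageing reports young once and clears the bit,
 * so the page survives one extra reclaim scan before eviction. */
static int age_spte(uint64_t *spte)
{
	if (*spte & SOFT_ACCESSED) {
		*spte &= ~SOFT_ACCESSED;	/* first pass: young */
		return 1;
	}
	return 0;				/* second pass: old */
}

int main(void)
{
	uint64_t spte = SOFT_ACCESSED;		/* just faulted in */
	printf("pass 1: %d\n", age_spte(&spte));	/* 1 */
	printf("pass 2: %d\n", age_spte(&spte));	/* 0 */
	return 0;
}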

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  7:58   ` Avi Kivity
@ 2009-08-05  8:17     ` Avi Kivity
From: Avi Kivity @ 2009-08-05  8:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, Andrea Arcangeli,
	KVM list

[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]

On 08/05/2009 10:58 AM, Avi Kivity wrote:
> On 08/05/2009 05:40 AM, Wu Fengguang wrote:
>> Greetings,
>>
>> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>>
>> "Lots of pages between discarded due to memory pressure only to be
>> faulted back in soon after. These pages are nearly all stack pages.
>> This is not consistent - sometimes there are relatively few such pages
>> and they are spread out between processes."
>>
>> The refaults can be drastically reduced by the following patch, which
>> respects the referenced bit of all anonymous pages (including the KVM
>> pages).
>>
>> However it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit,
>> mainly the pte young bit). I wonder if there are better solutions?
>
> How do you distinguish between kvm pages and non-kvm anonymous pages?  
> More importantly, why should you?
>
> Jeff, do you see the refaults on Nehalem systems?  If so, that's 
> likely due to the lack of an accessed bit on EPT pagetables.  It would 
> be interesting to compare with Barcelona  (which does).
>
> If that's indeed the case, we can have the EPT ageing mechanism give 
> pages a bit more time around by using an available bit in the EPT PTEs 
> to return accessed on the first pass and not-accessed on the second.
>

The attached patch implements this.

-- 
error compiling committee.c: too many arguments to function


[-- Attachment #2: ept-emulate-accessed-bit.patch --]
[-- Type: text/x-patch, Size: 2115 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7b53614..310938a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -195,6 +195,7 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static int __read_mostly shadow_accessed_shift;
 
 static inline u64 rsvd_bits(int s, int e)
 {
@@ -219,6 +220,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 {
 	shadow_user_mask = user_mask;
 	shadow_accessed_mask = accessed_mask;
+	shadow_accessed_shift
+		= find_first_bit((void *)&shadow_accessed_mask, 64);
 	shadow_dirty_mask = dirty_mask;
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
@@ -817,11 +820,11 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 	while (spte) {
 		int _young;
 		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
+		BUG_ON(!(_spte & shadow_accessed_mask));
+		_young = _spte & shadow_accessed_mask;
 		if (_young) {
 			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+			clear_bit(shadow_accessed_shift, (unsigned long *)spte);
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
@@ -2572,7 +2575,7 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 	    && shadow_accessed_mask
 	    && !(*spte & shadow_accessed_mask)
 	    && is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+		set_bit(shadow_accessed_shift, (unsigned long *)spte);
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0ba706e..bc99367 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4029,7 +4029,7 @@ static int __init vmx_init(void)
 		bypass_guest_pf = 0;
 		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
 			VMX_EPT_WRITABLE_MASK);
-		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
+		kvm_mmu_set_mask_ptes(0ull, 1ull << 63, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  7:58   ` Avi Kivity
@ 2009-08-05 14:15     ` Rik van Riel
From: Rik van Riel @ 2009-08-05 14:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:

>> However it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit,
>> mainly the pte young bit). I wonder if there are better solutions?

Agreed, we need to figure out what the real problem is,
and how to solve it better.

> Jeff, do you see the refaults on Nehalem systems?  If so, that's likely 
> due to the lack of an accessed bit on EPT pagetables.  It would be 
> interesting to compare with Barcelona  (which does).

Not having a hardware accessed bit would explain why
the VM is not reactivating the pages that were accessed
while on the inactive list.

> If that's indeed the case, we can have the EPT ageing mechanism give 
> pages a bit more time around by using an available bit in the EPT PTEs 
> to return accessed on the first pass and not-accessed on the second.

Can we find out which pages are EPT pages?

If so, we could unmap them when they get moved from the
active to the inactive list, and soft fault them back in
on access, emulating the referenced bit for EPT pages and
making page replacement on them work like it should.

Your approximation of pretending the page is accessed the
first time and pretending it's not the second time sounds
like it will just lead to less efficient FIFO replacement,
not to anything even vaguely approximating LRU.

-- 
All rights reversed.


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  8:17     ` Avi Kivity
@ 2009-08-05 14:33       ` Rik van Riel
From: Rik van Riel @ 2009-08-05 14:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list

Avi Kivity wrote:

> The attached patch implements this.

The attached patch requires each page to go around twice
before it is evicted, but they will still get evicted in
the order in which they were made present.

FIFO page replacement was shown to be a bad idea in the
1960's and it is still a terrible idea today.

-- 
All rights reversed.


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 14:15     ` Rik van Riel
@ 2009-08-05 15:12       ` Avi Kivity
From: Avi Kivity @ 2009-08-05 15:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 05:15 PM, Rik van Riel wrote:
>> If that's indeed the case, we can have the EPT ageing mechanism give 
>> pages a bit more time around by using an available bit in the EPT 
>> PTEs to return accessed on the first pass and not-accessed on the 
>> second.
>
> Can we find out which pages are EPT pages?
>

No need to (see below).

> If so, we could unmap them when they get moved from the
> active to the inactive list, and soft fault them back in
> on access, emulating the referenced bit for EPT pages and
> making page replacement on them work like it should.

It should be easy to implement via the mmu notifier callback: when the 
mm calls clear_flush_young(), mark it as young, and unmap it from the 
EPT pagetable.

> Your approximation of pretending the page is accessed the
> first time and pretending it's not the second time sounds
> like it will just lead to less efficient FIFO replacement,
> not to anything even vaguely approximating LRU.

Right, it's just a hack that gives EPT pages higher priority, like the 
original patch suggested.  Note that LRU for VMs is not a good 
algorithm, since the VM will also reference the least recently used 
page, leading to thrashing.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:12       ` Avi Kivity
@ 2009-08-05 15:15         ` Rik van Riel
From: Rik van Riel @ 2009-08-05 15:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:

>> If so, we could unmap them when they get moved from the
>> active to the inactive list, and soft fault them back in
>> on access, emulating the referenced bit for EPT pages and
>> making page replacement on them work like it should.
> 
> It should be easy to implement via the mmu notifier callback: when the 
> mm calls clear_flush_young(), mark it as young, and unmap it from the 
> EPT pagetable.

You mean "mark it as old"?

>> Your approximation of pretending the page is accessed the
>> first time and pretending it's not the second time sounds
>> like it will just lead to less efficient FIFO replacement,
>> not to anything even vaguely approximating LRU.
> 
> Right, it's just a hack that gives EPT pages higher priority, like the 
> original patch suggested.  Note that LRU for VMs is not a good 
> algorithm, since the VM will also reference the least recently used 
> page, leading to thrashing.

That is one of the reasons we use a very coarse two-handed
clock algorithm instead of true LRU.

LRU has more overhead and more artifacts :)
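
For reference, a toy sketch of the classic two-handed clock
(hypothetical and heavily simplified; the kernel's scan is list-based,
not a fixed array):

#include <stdbool.h>
#include <stdio.h>

#define NPAGES	8
#define GAP	3	/* fixed distance between the two hands */

static bool referenced[NPAGES];

/* The front hand clears referenced bits; the back hand, GAP slots
 * behind, evicts the first page whose bit is still clear, i.e. one
 * not touched again since the front hand passed over it. */
static int pick_victim(int *hand)
{
	for (;;) {
		int front = *hand;
		int back = (front + NPAGES - GAP) % NPAGES;

		referenced[front] = false;	/* front hand clears */
		*hand = (front + 1) % NPAGES;
		if (!referenced[back])		/* back hand inspects */
			return back;
	}
}

int main(void)
{
	int hand = 0;

	referenced[0] = referenced[1] = true;	/* recently touched */
	printf("victim: %d\n", pick_victim(&hand));
	return 0;
}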

-- 
All rights reversed.


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:15         ` Rik van Riel
@ 2009-08-05 15:25           ` Avi Kivity
From: Avi Kivity @ 2009-08-05 15:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 06:15 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>>> If so, we could unmap them when they get moved from the
>>> active to the inactive list, and soft fault them back in
>>> on access, emulating the referenced bit for EPT pages and
>>> making page replacement on them work like it should.
>>
>> It should be easy to implement via the mmu notifier callback: when 
>> the mm calls clear_flush_young(), mark it as young, and unmap it from 
>> the EPT pagetable.
>
> You mean "mark it as old"?

I meant 'return young, and drop it from the EPT pagetable'.

If we use the present bit as a replacement for the accessed bit, present 
means young, and clear_flush_young means "if present, return young and 
unmap, otherwise return old".

See kvm_age_rmapp() in arch/x86/kvm/mmu.c.
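
A toy model of that protocol (hypothetical names; the real logic
lives in the kvm_age_rmapp()/mmu notifier path):

#include <stdbool.h>
#include <stdio.h>

/* "mapped" stands in for the spte being present in the EPT table.
 * Present means young: ageing unmaps and reports young; a later
 * guest access would soft-fault the page back in. */
struct toy_spte { bool mapped; };

static int clear_flush_young(struct toy_spte *s)
{
	if (s->mapped) {
		s->mapped = false;	/* unmap: next access refaults */
		return 1;		/* young */
	}
	return 0;			/* old */
}

int main(void)
{
	struct toy_spte s = { .mapped = true };
	printf("%d\n", clear_flush_young(&s));	/* 1: young, unmapped */
	printf("%d\n", clear_flush_young(&s));	/* 0: old */
	s.mapped = true;			/* guest touched it again */
	printf("%d\n", clear_flush_young(&s));	/* 1: young again */
	return 0;
}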

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 14:33       ` Rik van Riel
@ 2009-08-05 15:37         ` Avi Kivity
From: Avi Kivity @ 2009-08-05 15:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list

On 08/05/2009 05:33 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>> The attached patch implements this.
>
> The attached patch requires each page to go around twice
> before it is evicted, but they will still get evicted in
> the order in which they were made present.
>
> FIFO page replacement was shown to be a bad idea in the
> 1960's and it is still a terrible idea today.
>

Which is why we have accessed bits in page tables... but emulating the 
accessed bit via RWX (note no present bit in EPT) is better than 
ignoring it.

-- 
error compiling committee.c: too many arguments to function



* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  7:58   ` Avi Kivity
@ 2009-08-05 15:45     ` Dike, Jeffrey G
From: Dike, Jeffrey G @ 2009-08-05 15:45 UTC (permalink / raw)
  To: Avi Kivity, Wu, Fengguang
  Cc: Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
	Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
	Mel Gorman, LKML, linux-mm

> Jeff, do you see the refaults on Nehalem systems?

My test box is pre-Nehalem - no EPT.

					Jeff


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  2:40 ` Wu Fengguang
@ 2009-08-05 15:58   ` Andrea Arcangeli
From: Andrea Arcangeli @ 2009-08-05 15:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>  			 */
> -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>  				list_add(&page->lru, &l_active);
>  				continue;
>  			}
> 

Please nuke the whole check and do an unconditional list_add;
continue; there.
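
A sketch of what that would look like on top of the posted patch
(hypothetical hunk, not something Andrea posted):

-			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
-				list_add(&page->lru, &l_active);
-				continue;
-			}
+			list_add(&page->lru, &l_active);
+			continue;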


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  7:58   ` Avi Kivity
@ 2009-08-05 16:05     ` Andrea Arcangeli
From: Andrea Arcangeli @ 2009-08-05 16:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:58:10AM +0300, Avi Kivity wrote:
> How do you distinguish between kvm pages and non-kvm anonymous pages?  
> More importantly, why should you?

It can't distinguish. Besides, the pages being refaulted (as minor
faults) implies they weren't collected yet, so whether or not they are
allowed to stay on the active list can't matter to the refaulting
issue.

> Jeff, do you see the refaults on Nehalem systems?  If so, that's likely 
> due to the lack of an accessed bit on EPT pagetables.  It would be 
> interesting to compare with Barcelona  (which does).

It seems it wasn't using EPT.

Refaulting as minor faults is still possible with or without EPT and
the young bit... when the young bit is found not set, we just unmap
the spte/pte and leave the page on the lru for a while until it is
collected. So it can be refaulted even with a perfectly functional
young bit in the spte and pte.

But the _whole_ point of the NPT young bit (shame on EPT), and of the
pte young bit, is to avoid unmapping the pagetables to get the aging
information. So there's one more pass with the young bit functional
compared to without it, but that doesn't mean we immediately free the
page when the young bit is found clear on the second pass; the page
just goes into the "refaulting lru cache waiting to be collected". And
if the page isn't actually collected, it doesn't matter whether it's
on the active or inactive list, so the patch can't matter if it's
"minor" refaults we're talking about here :).


* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 16:05     ` Andrea Arcangeli
@ 2009-08-05 16:12       ` Dike, Jeffrey G
From: Dike, Jeffrey G @ 2009-08-05 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity
  Cc: Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
	Mel Gorman, LKML, linux-mm

> It can't distinguish. Besides, the pages being refaulted (as minor
> faults) implies they weren't collected yet, so whether or not they
> are allowed to stay on the active list can't matter to the
> refaulting issue.

Sounds like there's some terminology confusion.  A refault is a page
being discarded due to memory pressure and subsequently being faulted
back in.  I was counting the number of faults between the discard and
faulting back in for each affected page.  For a large number of
predominantly stack pages, that number was very small.
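
A toy sketch of that measurement (hypothetical, not the actual
instrumentation): stamp each page with a global fault counter when it
is discarded, and report the distance when it faults back in.

#include <stdio.h>

#define NPAGES 4

static unsigned long fault_clock = 1;		/* global fault counter */
static unsigned long discarded_at[NPAGES];	/* 0 = resident */

static void discard(int page)
{
	discarded_at[page] = fault_clock;	/* reclaimed */
}

static void fault(int page)
{
	fault_clock++;
	if (discarded_at[page]) {
		printf("page %d refaulted after %lu faults\n",
		       page, fault_clock - discarded_at[page]);
		discarded_at[page] = 0;		/* resident again */
	}
}

int main(void)
{
	fault(0);
	fault(1);
	discard(0);	/* discarded under memory pressure */
	fault(2);
	fault(0);	/* prints: page 0 refaulted after 2 faults */
	return 0;
}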

					Jeff


* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 16:12       ` Dike, Jeffrey G
@ 2009-08-05 16:19         ` Andrea Arcangeli
From: Andrea Arcangeli @ 2009-08-05 16:19 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Avi Kivity, Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 09:12:39AM -0700, Dike, Jeffrey G wrote:
> Sounds like there's some terminology confusion.  A refault is a page
> being discarded due to memory pressure and subsequently being
> faulted back in.  I was counting the number of faults between the
> discard and faulting back in for each affected page.  For a large
> number of predominantly stack pages, that number was very small.

Hmm ok, but if it's anonymous pages we're talking about here (I see
KVM in the equation so it has to be!), normally we call that thing
swapin, to imply I/O is involved, not refault...  Refault to me means
a minor fault from the swapcache (clean or dirty), and that's about
it...

An anon page becomes swapcache, is unmapped if the young bit permits,
and is eventually collected from the lru; if it is collected, I/O
will be generated as swapin during the next page fault.

If it's too much swapin, then yes, it could be that patch that
prevents young bit to keep the anon pages in active list. But fix is
to remove the whole check, not just to enable list_add for anon pages.
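
To make that lifecycle concrete, a tiny user-space C model of the
states above; the enum, the helper, and the fault costs are
illustrative assumptions for this sketch, not kernel code:

#include <stdio.h>

/* Toy model of the anon-page lifecycle sketched above; the state
 * names are made up for the sketch, not kernel symbols. */
enum anon_state {
        ANON_MAPPED,    /* pte present, page in use */
        ANON_SWAPCACHE, /* added to swapcache, still in memory */
        ANON_UNMAPPED,  /* pte gone, page survives only in swapcache */
        ANON_SWAPPED,   /* collected from the LRU, data now on disk */
};

/* What the next touch of the page costs, depending on how far
 * reclaim got before the page was accessed again. */
static const char *next_fault(enum anon_state s)
{
        switch (s) {
        case ANON_MAPPED:       return "no fault";
        case ANON_SWAPCACHE:
        case ANON_UNMAPPED:     return "minor fault (refault, no I/O)";
        case ANON_SWAPPED:      return "major fault (swapin I/O)";
        }
        return "?";
}

int main(void)
{
        printf("%s\n", next_fault(ANON_UNMAPPED));
        printf("%s\n", next_fault(ANON_SWAPPED));
        return 0;
}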

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 14:15     ` Rik van Riel
@ 2009-08-05 16:31       ` Andrea Arcangeli
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
> Not having a hardware accessed bit would explain why
> the VM is not reactivating the pages that were accessed
> while on the inactive list.

Problem is, even with the young bit functional the VM isn't
reactivating those pages anyway because of that broken check... That
check should be nuked entirely in my view, as it fundamentally thinks
it can outsmart the VM intelligence by checking a bit in the vma...
quite absurd.

> Can we find out which pages are EPT pages?
> 
> If so, we could unmap them when they get moved from the
> active to the inactive list, and soft fault them back in
> on access, emulating the referenced bit for EPT pages and
> making page replacement on them work like it should.
> 
> Your approximation of pretending the page is accessed the
> first time and pretending it's not the second time sounds
> like it will just lead to less efficient FIFO replacement,
> not to anything even vaguely approximating LRU.

I think it'll still be better than the current situation, as the young
bit is always set for ptes. Otherwise EPT pages are too penalized; we
need them to stay one more round in the active list like everything
else. They are too penalized anyway, because at the second pass
they'll be forced out of the active list and unmapped.

This is what alpha and all the other archs without a hardware-set
young bit have to do. They set the young bit in software, clear it in
software, and then set it again in software if there's a page fault
(hopefully a minor fault). Returning "not young" the first time sounds
worse to me.
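
For illustration, a minimal user-space sketch of that software
young-bit emulation, assuming that clearing young also drops the
mapping so the next access takes a (hopefully minor) fault; all names
here are invented for the sketch:

#include <stdbool.h>
#include <stdio.h>

/* Toy pte for the sketch; not a real kernel structure. */
struct soft_pte {
        bool present;   /* mapping installed */
        bool young;     /* software-maintained accessed bit */
};

/* Reclaim scan: harvest the young bit, clear it, and drop the
 * mapping so the next access faults and can set it again. */
static bool scan_clear_young(struct soft_pte *pte)
{
        bool was_young = pte->young;

        pte->young = false;
        pte->present = false;
        return was_young;
}

/* Fault path: reinstall the mapping and mark the page young. */
static void soft_fault(struct soft_pte *pte)
{
        pte->present = true;
        pte->young = true;
}

int main(void)
{
        struct soft_pte pte = { .present = true, .young = true };

        printf("scan 1: young=%d\n", scan_clear_young(&pte)); /* 1 */
        printf("scan 2: young=%d\n", scan_clear_young(&pte)); /* 0 */
        soft_fault(&pte);       /* the page gets touched again */
        printf("scan 3: young=%d\n", scan_clear_young(&pte)); /* 1 */
        return 0;
}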

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:25           ` Avi Kivity
@ 2009-08-05 16:35             ` Andrea Arcangeli
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 06:25:28PM +0300, Avi Kivity wrote:
> On 08/05/2009 06:15 PM, Rik van Riel wrote:
> > Avi Kivity wrote:
> >
> >>> If so, we could unmap them when they get moved from the
> >>> active to the inactive list, and soft fault them back in
> >>> on access, emulating the referenced bit for EPT pages and
> >>> making page replacement on them work like it should.
> >>
> >> It should be easy to implement via the mmu notifier callback: when 
> >> the mm calls clear_flush_young(), mark it as young, and unmap it from 
> >> the EPT pagetable.
> >
> > You mean "mark it as old"?
> 
> I meant 'return young, and drop it from the EPT pagetable'.
> 
> If we use the present bit as a replacement for the accessed bit, present 
> means young, and clear_flush_young means "if present, return young and 
> unmap, otherwise return old'.

This is the only way to provide accurate information, and it's still a
minor fault, so not very different from returning young the first time
around and old the second time around without invalidating the
spte... but the reason I like it more is that it is done at the right
time, like for the ptes, so it's probably best to implement it this
way to ensure total fairness of the VM regardless of whether it's the
guest or qemu-kvm touching the virtual memory.
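
A minimal sketch of the clear_flush_young semantics being agreed on
here ("if present, return young and unmap, otherwise return old");
this is a user-space model with invented names, not the actual KVM
mmu-notifier code:

#include <stdbool.h>

/* Toy shadow/EPT entry for the sketch; not a real KVM structure. */
struct ept_spte {
        bool present;
};

/* Present stands in for the accessed bit: if a mapping exists, the
 * guest may have touched the page since the last scan, so report
 * young and drop the spte; the next guest access soft-faults it
 * back in. */
static int ept_clear_flush_young(struct ept_spte *spte)
{
        if (!spte->present)
                return 0;       /* old: no mapping, no possible access */

        spte->present = false;  /* unmap; reinstalled on the next fault */
        return 1;               /* young */
}

int main(void)
{
        struct ept_spte spte = { .present = true };
        int first  = ept_clear_flush_young(&spte);      /* 1: young */
        int second = ept_clear_flush_young(&spte);      /* 0: old */

        return !(first == 1 && second == 0);
}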

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:58   ` Andrea Arcangeli
@ 2009-08-05 17:20     ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>>  			 */
>> -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>>  				list_add(&page->lru, &l_active);
>>  				continue;
>>  			}
>>
> 
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

That would reinstate the bug where the VM has no pages
available for eviction.  There are very good reasons
why only VM_EXEC file pages get moved to the back of
the active list when they were referenced, and nothing
else.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 16:31       ` Andrea Arcangeli
@ 2009-08-05 17:25         ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
>> Not having a hardware accessed bit would explain why
>> the VM is not reactivating the pages that were accessed
>> while on the inactive list.
> 
> Problem is, even with young bit functional the VM isn't reactivating
> those pages anyway because of that broken check... 

That check is only done when active pages are moved to the
inactive list!  Inactive pages that were referenced always
get moved to the active list (except for unmapped file pages).

> I think it'll still better than current situation, as young bit is
> always set for ptes. Otherwise EPT pages are too penalized, we need
> them to stay one round more in active list like everything else.

NOTHING ELSE stays on the active anon list for two rounds,
for very good reasons.  Please read up on what has changed
in the VM since 2.6.27.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:58   ` Andrea Arcangeli
@ 2009-08-05 17:42     ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>>  			 */
>> -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>>  				list_add(&page->lru, &l_active);
>>  				continue;
>>  			}
>>
> 
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

<riel> aa: so you're saying we should _never_ add pages to the active 
list at this point in the code
<aa> right
<riel> aa: and remove the list_add and continue completely
<aa> yes
<riel> aa: your email says the opposite :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05  2:40 ` Wu Fengguang
@ 2009-08-05 17:53   ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:

> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).

The big question is, which referenced bit?

All anonymous pages get the referenced bit set when they are
initially created.  Acting on that bit is pretty useless, since
it does not add any information at all.

> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?

Reintroducing that problem is disastrous for large systems
running eg. JVMs or certain scientific computing workloads.

When you have a 256GB system that is low on memory, you need
to be able to find a page to swap out soon.  If all 64 million
pages in your system are "recently referenced", you run into
BIG trouble.

I do not believe we can afford to reintroduce that problem.

Also, the inactive list (where references to anonymous pages
_do_ count) is pretty big.  Is it not big enough in Jeff's
test case?

Jeff, what kind of workloads are you running in the guests?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 17:53   ` Rik van Riel
@ 2009-08-05 19:00     ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 19:00 UTC (permalink / raw)
  To: Rik van Riel, Wu, Fengguang
  Cc: Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity,
	Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
	Mel Gorman, LKML, linux-mm

> Also, the inactive list (where references to anonymous pages
> _do_ count) is pretty big.  Is it not big enough in Jeff's
> test case?

> Jeff, what kind of workloads are you running in the guests?

I'm looking at KVM on small systems.  My "small system" is a 128M memory compartment on a 4G server.

The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance.

					Jeff


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 19:00     ` Dike, Jeffrey G
@ 2009-08-05 19:07       ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 19:07 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Dike, Jeffrey G wrote:
>> Also, the inactive list (where references to anonymous pages
>> _do_ count) is pretty big.  Is it not big enough in Jeff's
>> test case?
> 
>> Jeff, what kind of workloads are you running in the guests?
> 
> I'm looking at KVM on small systems.  My "small system" is a 128M memory compartment on a 4G server.

How did you create that 128M memory compartment?

Did you use cgroups on the host system?

> The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance.

How much memory do you give your virtual machine?

That is, how much memory does it think it has?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 19:07       ` Rik van Riel
@ 2009-08-05 19:18         ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 19:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

> How did you create that 128M memory compartment?
> 
> Did you use cgroups on the host system?

Yup.

> How much memory do you give your virtual machine?
>
> That is, how much memory does it think it has?

256M.

					Jeff


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 19:18         ` Dike, Jeffrey G
@ 2009-08-06  9:22           ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06  9:22 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Rik van Riel, Wu, Fengguang, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote:
>> How did you create that 128M memory compartment?
>>
>> Did you use cgroups on the host system?
>>      
>
> Yup.
>
>    
>> How much memory do you give your virtual machine?
>>
>> That is, how much memory does it think it has?
>>      
>
> 256M.
>    

So you're effectively running a 256M guest on a 128M host?

Do cgroups have private active/inactive lists?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:22           ` Avi Kivity
@ 2009-08-06  9:25             ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06  9:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 05:22:23PM +0800, Avi Kivity wrote:
> On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote:
> >> How did you create that 128M memory compartment?
> >>
> >> Did you use cgroups on the host system?
> >>      
> >
> > Yup.
> >
> >    
> >> How much memory do you give your virtual machine?
> >>
> >> That is, how much memory does it think it has?
> >>      
> >
> > 256M.
> >    
> 
> So you're effectively running a 256M guest on a 128M host?
> 
> Do cgroups have private active/inactive lists?

Yes, and they reuse the same page reclaim routines as the global
LRU lists.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:35               ` Avi Kivity
@ 2009-08-06  9:35                 ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06  9:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
> >> So you're effectively running a 256M guest on a 128M host?
> >>
> >> Do cgroups have private active/inactive lists?
> >>      
> >
> > Yes, and they reuse the same page reclaim routines with the global
> > LRU lists.
> >    
> 
> Then this looks like a bug in the shadow accessed bit handling.

Yes. One question is: why do only stack pages hurt if it is a
general page reclaim problem?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:25             ` Wu Fengguang
@ 2009-08-06  9:35               ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06  9:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 12:25 PM, Wu Fengguang wrote:
>> So you're effectively running a 256M guest on a 128M host?
>>
>> Do cgroups have private active/inactive lists?
>>      
>
> Yes, and they reuse the same page reclaim routines with the global
> LRU lists.
>    

Then this looks like a bug in the shadow accessed bit handling.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:59                   ` Avi Kivity
@ 2009-08-06  9:59                     ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06  9:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 05:59:53PM +0800, Avi Kivity wrote:
> On 08/06/2009 12:35 PM, Wu Fengguang wrote:
> > On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
> >    
> >> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
> >>      
> >>>> So you're effectively running a 256M guest on a 128M host?
> >>>>
> >>>> Do cgroups have private active/inactive lists?
> >>>>
> >>>>          
> >>> Yes, and they reuse the same page reclaim routines with the global
> >>> LRU lists.
> >>>
> >>>        
> >> Then this looks like a bug in the shadow accessed bit handling.
> >>      
> >
> > Yes. One question is: why only stack pages hurts if it is a
> > general page reclaim problem?
> >    
> 
> Do we know for a fact that only stack pages suffer, or is it what has 
> been noticed?

It should be the first case: "These pages are nearly all stack
pages," as Jeff said.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:35                 ` Wu Fengguang
@ 2009-08-06  9:59                   ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06  9:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 12:35 PM, Wu Fengguang wrote:
> On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
>    
>> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
>>      
>>>> So you're effectively running a 256M guest on a 128M host?
>>>>
>>>> Do cgroups have private active/inactive lists?
>>>>
>>>>          
>>> Yes, and they reuse the same page reclaim routines with the global
>>> LRU lists.
>>>
>>>        
>> Then this looks like a bug in the shadow accessed bit handling.
>>      
>
> Yes. One question is: why only stack pages hurts if it is a
> general page reclaim problem?
>    

Do we know for a fact that only stack pages suffer, or is that just
what has been noticed?


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 15:58   ` Andrea Arcangeli
@ 2009-08-06 10:08     ` Andrea Arcangeli
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 05:58:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> >  			 */
> > -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >  				list_add(&page->lru, &l_active);
> >  				continue;
> >  			}
> > 
> 
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

After some conversation it seems that reactivating on large systems
causes trouble for the VM, as the young bits have excessive time to
be set again, making it hard to shrink the active list. I see that,
so the check should still be nuked, but the unconditional
deactivation should happen instead. Otherwise it's trivial to bring
the VM to its knees and DoS it with a simple mmap of a file with
PROT_EXEC passed to mmap. My whole point is that deciding whether to
activate or deactivate pages can't be a function of VM_EXEC. Clearly
it helps on desktops, but then that is probably a signal that the VM
isn't good enough by itself to identify the important working set
using young bits and such on desktop systems, and if there's a good
reason not to activate, we shouldn't activate the VM_EXEC pages
either, as anything and anybody can generate a file mapping with
VM_EXEC set...

Likely we need a cut-off point: if we detect it takes more than X
seconds to scan the whole active list, we start ignoring young bits,
as young bits don't provide any meaningful information then and they
just hang the VM, preventing it from shrinking the active list and
looping over it endlessly with millions of pages inside that list.
But on small systems, if the inactive list is short, it may be too
quick to just clear the young bit, giving it too little time to be
set again in the inactive list. That may be the source of the
problem. Actually I'm speculating here, because I barely understood
that this is swapin... I'm not sure exactly what this regression is
about, but testing the posted patch is a good idea and it will tell
us whether we just need to dynamically differentiate the algorithm
between large and small systems and start ignoring young bits only at
some point.
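
A rough sketch of that cut-off, assuming a hypothetical one-second
scan budget and invented names; it only illustrates the shape of the
heuristic, not actual kernel code:

#include <stdbool.h>
#include <time.h>

#define SCAN_BUDGET_NS  1000000000LL    /* assumed: 1s per full scan */

struct scan_state {
        struct timespec start;
        bool ignore_young;      /* latched once the scan overruns */
};

static void scan_begin(struct scan_state *st)
{
        clock_gettime(CLOCK_MONOTONIC, &st->start);
        st->ignore_young = false;
}

/* Called per page: once a full active-list scan overruns its budget,
 * stop honoring young bits so the list can actually be shrunk. */
static bool honor_young_bit(struct scan_state *st)
{
        struct timespec now;
        long long elapsed;

        if (st->ignore_young)
                return false;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (long long)(now.tv_sec - st->start.tv_sec) * 1000000000LL
                  + (now.tv_nsec - st->start.tv_nsec);
        if (elapsed > SCAN_BUDGET_NS)
                st->ignore_young = true;

        return !st->ignore_young;
}

int main(void)
{
        struct scan_state st;

        scan_begin(&st);
        return honor_young_bit(&st) ? 0 : 1;    /* fresh scan honors young */
}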

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06  9:59                     ` Wu Fengguang
@ 2009-08-06 10:14                       ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 10:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
	Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 12:59 PM, Wu Fengguang wrote:
>> Do we know for a fact that only stack pages suffer, or is it what has
>> been noticed?
>>      
>
> It shall be the first case: "These pages are nearly all stack pages.",
> Jeff said.
>    

Ok.  I can't explain it.  There's no special treatment for guest stack 
pages.  The accessed bit should be maintained for them exactly like all 
other pages.

Are they kernel-mode stack pages, or user-mode stack pages (the 
difference being that kernel mode stack pages are accessed through large 
ptes, whereas user mode stack pages are accessed through normal ptes).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-05 17:42     ` Rik van Riel
@ 2009-08-06 10:15       ` Andrea Arcangeli
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 01:42:30PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> > On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> >>  			 */
> >> -			if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> >> +			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >>  				list_add(&page->lru, &l_active);
> >>  				continue;
> >>  			}
> >>
> > 
> > Please nuke the whole check and do an unconditional list_add;
> > continue; there.
> 
> <riel> aa: so you're saying we should _never_ add pages to the active 
> list at this point in the code
> <aa> right
> <riel> aa: and remove the list_add and continue completely
> <aa> yes
> <riel> aa: your email says the opposite :)

I posted a more meaningful explanation in a self-reply to the email
that said the opposite, which tries to explain why I changed my mind
(well, my focus really was on VM_EXEC and I haven't changed my mind
about it yet, but then I'm flexible, so I'm listening if somebody
thinks it's a good thing to keep it). The irc quote was greatly out
of context and it missed all the previous conversation... I hope my
mail explains my point in more detail than the above.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:08     ` Andrea Arcangeli
@ 2009-08-06 10:18       ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 10:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 01:08 PM, Andrea Arcangeli wrote:
> After some conversation it seems reactivating on large systems
> generates troubles to the VM as young bit have excessive time to be
> reactivated, giving troubles to shrink active list. I see that, so
> then the check should be still nuked, but the unconditional
> deactivation should happen instead. Otherwise it's trivial to put the
> VM to its knees and DoS it with a simple mmap of a file with MAP_EXEC
> as parameter of mmap. My whole point is that deciding if activating or
> deactivating pages can't be in function  of VM_EXEC, and clearly it
> helps on desktops but then it probably is a signal that the VM isn't
> good enough by itself to identify the important working set using
> young bits and stuff on desktop systems, and if there's a good reason
> to not activate, we shouldn't activate the VM_EXEC either as anything
> and anybody can generate a file mapping with VM_EXEC set...
>    

Reasonable; if you depend on a hint from userspace, that hint can be 
used against you.

> Likely we need a cut-off point, if we detect it takes more than X
> seconds to scan the whole active list, we start ignoring young bits,
> as young bits don't provide any meaningful information then and they
> just hang the VM in preventing it to shrink active list and looping
> over it endlessy with million pages inside that list. But on small
> systems if inactive list is short it may be too quick to just clear
> the young bit and only giving it time to be re-enabled in inactive
> list. That may be the source of the problem. Actually I'm speculating
> here, because I barely understood that this is swapin... not sure
> exactly what this regression is about but testing the patch posted is
> good idea and it will tell us if we just need to dynamically
> differentiating the algorithm between large and small systems and start
> ignoring young bits only at some point.
>    

How about, for every N pages that you scan, evict at least 1 page,
regardless of young bit status?  That limits overscanning to an N:1
ratio.  With N=250 we'll spend at most 25 usec in order to locate one
page to evict.
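
A toy model of that N:1 limit, assuming N=250 and a flat page array;
the names and structure are invented for the sketch:

#include <stdbool.h>
#include <stddef.h>

#define OVERSCAN_RATIO  250     /* assumed N: rescans allowed per eviction */

struct page { bool referenced; };

/* Scan a non-empty list and return the index to evict: normally the
 * first unreferenced page, but after OVERSCAN_RATIO referenced pages
 * in a row, evict one anyway so the scan stays bounded. */
static size_t pick_victim(struct page *pages, size_t npages)
{
        size_t scanned = 0;
        size_t i = 0;

        for (;;) {
                if (!pages[i].referenced)
                        return i;               /* easy case: an old page */

                pages[i].referenced = false;    /* give it one more round */
                if (++scanned >= OVERSCAN_RATIO)
                        return i;               /* quota hit: evict anyway */

                i = (i + 1) % npages;
        }
}

int main(void)
{
        struct page pages[4] = { {true}, {true}, {true}, {true} };

        /* All referenced: the quota forces an eviction after N scans. */
        return (int)pick_victim(pages, 4);
}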


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:18       ` Avi Kivity
@ 2009-08-06 10:20         ` Andrea Arcangeli
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> Reasonable; if you depend on a hint from userspace, that hint can be 
> used against you.

Correct, that is my whole point. Also we never know if applications
are mmapping huge files with MAP_EXEC just because they might need to
trampoline once in a while, or do some little JIT thing once in a
while. Sometimes people open files with O_RDWR even if they only need
O_RDONLY. It's not a bug, but radically altering VM behavior because
of a bitflag doesn't sound good to me.

I certainly see this tends to help as it will reactivate all
.text. But this signals that current VM behavior is not ok for small
systems IMHO, if such a hack is required. I prefer a dynamic algorithm
that, when the active list grows too much, stops reactivating pages and
reduces the time for young bit activation to only the time the page
sits on the inactive list. And if the active list is small (like a 128M
system) we fully trust the young bit, and if it is set, we don't allow
the page to go to the inactive list, as it's quick enough to scan the
whole active list, and the young bit is meaningful there.

The issue I can see is with huge systems and millions of pages in the
active list: by the time we scan it all, too much time has passed and
we don't get any meaningful information out of the young bit. Things
are radically different on all regular workstations, and frankly
regular workstations are very important too, as I suspect there are
more users running on <64G systems than on >64G systems.

> How about, for every N pages that you scan, evict at least 1 page, 
> regardless of young bit status?  That limits overscanning to an N:1 
> ratio.  With N=250 we'll spend at most 25 usec in order to locate one 
> page to evict.

Yes, exactly; something like that I think will be dynamic, and then we
can drop the VM_EXEC check and solve the issues on large systems
without almost totally ignoring the young bit on small systems.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:20         ` Andrea Arcangeli
@ 2009-08-06 10:59           ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 10:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 06:20:57PM +0800, Andrea Arcangeli wrote:
> On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> > Reasonable; if you depend on a hint from userspace, that hint can be 
> > used against you.
> 
> Correct, that is my whole point. Also we never know if applications
> are mmapping huge files with MAP_EXEC just because they might need to
> trampoline once in a while, or do some little JIT thing once in a
> while. Sometimes people open files with O_RDWR even if they only need
> O_RDONLY. It's not a bug, but radically altering VM behavior because
> of a bitflag doesn't sound good to me.
> 
> I certainly see this tends to help as it will reactivate all
> .text. But this signals that current VM behavior is not ok for small
> systems IMHO, if such a hack is required. I prefer a dynamic algorithm
> that, when the active list grows too much, stops reactivating pages
> and reduces the time for young bit activation to only the time the
> page sits on the inactive list. And if the active list is small (like
> a 128M system) we fully trust the young bit, and if it is set, we
> don't allow the page to go to the inactive list, as it's quick enough
> to scan the whole active list, and the young bit is meaningful there.
> 
> The issue I can see is with huge systems and millions of pages in the
> active list: by the time we scan it all, too much time has passed and
> we don't get any meaningful information out of the young bit. Things
> are radically different on all regular workstations, and frankly
> regular workstations are very important too, as I suspect there are
> more users running on <64G systems than on >64G systems.
> 
> > How about, for every N pages that you scan, evict at least 1 page, 
> > regardless of young bit status?  That limits overscanning to an N:1 
> > ratio.  With N=250 we'll spend at most 25 usec in order to locate one 
> > page to evict.
> 
> Yes, exactly; something like that I think will be dynamic, and then
> we can drop the VM_EXEC check and solve the issues on large systems
> without almost totally ignoring the young bit on small systems.

This is a quick hack to materialize the idea. It remembers roughly
the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
and if _all of them_ are referenced, then the referenced bit is
probably meaningless and should not be taken seriously.

As a refinement, the static variable 'recent_all_referenced' could be
moved to struct zone or made a per-cpu variable.

Thanks,
Fengguang

---
 mm/vmscan.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

--- linux.orig/mm/vmscan.c	2009-08-06 18:31:20.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-06 18:51:58.000000000 +0800
@@ -1239,6 +1239,10 @@ static void move_active_pages_to_lru(str
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
+	static unsigned int recent_all_referenced;
+	int all_referenced = 1;
+	int referenced;
+	int referenced_bit_ok;
 	unsigned long pgmoved;
 	unsigned long pgscanned;
 	unsigned long vm_flags;
@@ -1267,6 +1271,8 @@ static void shrink_active_list(unsigned 
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
 	else
 		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
+
+	referenced_bit_ok = !recent_all_referenced;
 	spin_unlock_irq(&zone->lru_lock);
 
 	pgmoved = 0;  /* count referenced (mapping) mapped pages */
@@ -1281,19 +1287,15 @@ static void shrink_active_list(unsigned 
 		}
 
 		/* page_referenced clears PageReferenced */
-		if (page_mapping_inuse(page) &&
-		    page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
-			pgmoved++;
-			/*
-			 * Identify referenced, file-backed active pages and
-			 * give them one more trip around the active list. So
-			 * that executable code get better chances to stay in
-			 * memory under moderate memory pressure.
-			 *
-			 * Also protect anon pages: swapping could be costly,
-			 * and KVM guest's referenced bit is helpful.
-			 */
-			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
+		if (page_mapping_inuse(page)) {
+			referenced = page_referenced(page, 0, sc->mem_cgroup,
+						     &vm_flags);
+			if (referenced)
+				pgmoved++;
+			else
+				all_referenced = 0;
+
+			if (referenced && referenced_bit_ok) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}
@@ -1319,6 +1321,7 @@ static void shrink_active_list(unsigned 
 	move_active_pages_to_lru(zone, &l_inactive,
 						LRU_BASE   + file * LRU_FILE);
 
+	recent_all_referenced = (recent_all_referenced << 1) | all_referenced;
 	spin_unlock_irq(&zone->lru_lock);
 }
 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:59           ` Wu Fengguang
@ 2009-08-06 11:44             ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 11:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>
> This is a quick hack to materialize the idea. It remembers roughly
> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> and if _all of them_ are referenced, then the referenced bit is
> probably meaningless and should not be taken seriously.
>
>    

I don't think we should ignore the referenced bit. There could still be 
a large batch of unreferenced pages later on that we should 
preferentially swap. If we swap at least 1 page for every 250 scanned, 
after 4K swaps we will have traversed 1M pages, enough to find them.

> As a refinement, the static variable 'recent_all_referenced' could be
> moved to struct zone or made a per-cpu variable.
>
>    

Definitely this should be made part of the zone structure, consider the 
original report where the problem occurs in a 128MB zone (where we can 
expect many pages to have their referenced bit set).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 11:44             ` Avi Kivity
@ 2009-08-06 13:06               ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 13:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote:
> On 08/06/2009 01:59 PM, Wu Fengguang wrote:

scheme KEEP_MOST:

>> How about, for every N pages that you scan, evict at least 1 page,
>> regardless of young bit status?  That limits overscanning to an N:1
>> ratio.  With N=250 we'll spend at most 25 usec in order to locate one
>> page to evict.

scheme DROP_CONTINUOUS:

> > This is a quick hack to materialize the idea. It remembers roughly
> > the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> > and if _all of them_ are referenced, then the referenced bit is
> > probably meaningless and should not be taken seriously.

> I don't think we should ignore the referenced bit. There could still be 
> a large batch of unreferenced pages later on that we should 
> preferentially swap. If we swap at least 1 page for every 250 scanned, 
> after 4K swaps we will have traversed 1M pages, enough to find them.

I guess both schemes have unacceptable flaws.

For a JVM/BIGMEM workload, most pages would be found referenced _all the time_.
So the KEEP_MOST scheme could increase reclaim overhead by N=250 times,
while the DROP_CONTINUOUS scheme is effectively zero cost.

However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
It can behave vastly differently with a single active task versus multiple ones.
It is short-sighted and can be cheated by bursty activity.

> > As a refinement, the static variable 'recent_all_referenced' could be
> > moved to struct zone or made a per-cpu variable.
> 
> Definitely this should be made part of the zone structure, consider the 
> original report where the problem occurs in a 128MB zone (where we can 
> expect many pages to have their referenced bit set).

Good point. Here the cgroup list is highly stressed, while the global
zones are idling.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:08     ` Andrea Arcangeli
@ 2009-08-06 13:08       ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:

> Likely we need a cut-off point, if we detect it takes more than X
> seconds to scan the whole active list, we start ignoring young bits,

We could just make this depend on the calculated inactive_ratio,
which depends on the size of the list.

For small systems, it may make sense to make every accessed bit
count, because the working set will often approach the size of
memory.

On very large systems, the working set may also approach the
size of memory, but the inactive list only contains a small
percentage of the pages, so there is enough space for everything.

Say, if the inactive_ratio is 3 or less, make the accessed bit
on the active lists count.
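
(A minimal sketch of that gate inside shrink_active_list(), assuming
the existing zone->inactive_ratio field; the threshold of 3 is the
suggested value above, not a tested constant:)

		/*
		 * Only honor the referenced bit on the active list
		 * when the list is small enough to be rescanned
		 * quickly, i.e. when inactive_ratio is low.
		 */
		if (referenced && zone->inactive_ratio <= 3) {
			list_add(&page->lru, &l_active);
			continue;
		}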

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:59           ` Wu Fengguang
@ 2009-08-06 13:11             ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrea Arcangeli, Avi Kivity, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:

> This is a quick hack to materialize the idea. It remembers roughly
> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> and if _all of them_ are referenced, then the referenced bit is
> probably meaningless and should not be taken seriously.

This has the potential to increase the number of active
pages scanned by almost a factor of 1024.  Let me whip up an
alternative idea when I get to the office later today.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 11:44             ` Avi Kivity
@ 2009-08-06 13:13               ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:
> On 08/06/2009 01:59 PM, Wu Fengguang wrote:

>> As a refinement, the static variable 'recent_all_referenced' could be
>> moved to struct zone or made a per-cpu variable.
> 
> Definitely this should be made part of the zone structure, consider the 
> original report where the problem occurs in a 128MB zone (where we can 
> expect many pages to have their referenced bit set).

The problem did not occur in a 128MB zone, but in a 128MB cgroup.

Putting it in the zone means that the cgroup, which may have
different behaviour from the rest of the zone, due to excessive
memory pressure inside the cgroup, does not get the right
statistics.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:06               ` Wu Fengguang
@ 2009-08-06 13:16                 ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:

> I guess both schemes have unacceptable flaws.
> 
> For a JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> So the KEEP_MOST scheme could increase reclaim overhead by N=250 times,
> while the DROP_CONTINUOUS scheme is effectively zero cost.

The higher overhead may not be an issue on smaller systems,
or inside smaller cgroups inside large systems, when doing
cgroup reclaim.

> However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
> It can behave vastly differently with a single active task versus multiple ones.
> It is short-sighted and can be cheated by bursty activity.

The split LRU VM tries to avoid bursty page aging as
much as possible, by doing background deactivation of
anonymous pages whenever we reclaim page cache pages and
the number of anonymous pages in the zone (or cgroup) is
low.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:06               ` Wu Fengguang
@ 2009-08-06 13:46                 ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 13:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 04:06 PM, Wu Fengguang wrote:
> On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote:
>    
>> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>>      
>
> scheme KEEP_MOST:
>
>    
>>> How about, for every N pages that you scan, evict at least 1 page,
>>> regardless of young bit status?  That limits overscanning to an N:1
>>> ratio.  With N=250 we'll spend at most 25 usec in order to locate one
>>> page to evict.
>>>        
>
> scheme DROP_CONTINUOUS:
>
>    
>>> This is a quick hack to materialize the idea. It remembers roughly
>>> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
>>> and if _all of them_ are referenced, then the referenced bit is
>>> probably meaningless and should not be taken seriously.
>>>        
>
>    

Or one scheme, with N=parameter.

>> I don't think we should ignore the referenced bit. There could still be
>> a large batch of unreferenced pages later on that we should
>> preferentially swap. If we swap at least 1 page for every 250 scanned,
>> after 4K swaps we will have traversed 1M pages, enough to find them.
>>      
>
> I guess both schemes have unacceptable flaws.
>
> For a JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> So the KEEP_MOST scheme could increase reclaim overhead by N=250 times,
> while the DROP_CONTINUOUS scheme is effectively zero cost.
>    

Maybe 250 is an exaggeration.  But even with 250, the cost is still 
pretty low compared to the cpu cost of evicting a page (with IPIs and 
tlb flushes).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:13               ` Rik van Riel
@ 2009-08-06 13:49                 ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 13:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 04:13 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>
>>> As a refinement, the static variable 'recent_all_referenced' could be
>>> moved to struct zone or made a per-cpu variable.
>>
>> Definitely this should be made part of the zone structure, consider 
>> the original report where the problem occurs in a 128MB zone (where 
>> we can expect many pages to have their referenced bit set).
>
> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
>
> Putting it in the zone means that the cgroup, which may have
> different behaviour from the rest of the zone, due to excessive
> memory pressure inside the cgroup, does not get the right
> statistics.
>

Well, it should be per inactive list, whether it's a zone or a cgroup.  
What's the name of this thing? ("inactive list"?)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:06               ` Wu Fengguang
@ 2009-08-06 21:09                 ` Jeff Dike
  -1 siblings, 0 replies; 243+ messages in thread
From: Jeff Dike @ 2009-08-06 21:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Side question -
	Is there a good reason for this to be in shrink_active_list()
as opposed to __isolate_lru_page?

		if (unlikely(!page_evictable(page, NULL))) {
			putback_lru_page(page);
			continue;
		}

Maybe we want to minimize the amount of code under the lru lock or
avoid duplicate logic in the isolate_page functions.

But if there are important mlock-heavy workloads, this could make the
scan come up empty, or at least emptier than we might like.

				Jeff

-- 
Work email - jdike at linux dot intel dot com

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 10:14                       ` Avi Kivity
@ 2009-08-07  1:25                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-07  1:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen,
	Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, 06 Aug 2009 13:14:09 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 08/06/2009 12:59 PM, Wu Fengguang wrote:
> >> Do we know for a fact that only stack pages suffer, or is it what has
> >> been noticed?
> >>      
> >
> > It shall be the first case: "These pages are nearly all stack pages.",
> > Jeff said.
> >    
> 
> Ok.  I can't explain it.  There's no special treatment for guest stack 
> pages.  The accessed bit should be maintained for them exactly like all 
> other pages.
> 
> Are they kernel-mode stack pages, or user-mode stack pages (the 
> difference being that kernel mode stack pages are accessed through large 
> ptes, whereas user mode stack pages are accessed through normal ptes).
> 


Hmm, finally, a memcg problem?
Just as an experiment, how does the following work?

 - memory.limit_in_bytes = 128MB
 - memory.memsw.limit_in_bytes = 160MB

By this, if memory+swap usage hits 160MB, there is no more swapping.
But please take care of OOM.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:13               ` Rik van Riel
@ 2009-08-07  3:11                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-07  3:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Avi Kivity, Wu Fengguang, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	KAMEZAWA Hiroyuki, Balbir Singh

(cc to memcg folks)

> Avi Kivity wrote:
> > On 08/06/2009 01:59 PM, Wu Fengguang wrote:
> 
> >> As a refinement, the static variable 'recent_all_referenced' could be
> >> moved to struct zone or made a per-cpu variable.
> > 
> > Definitely this should be made part of the zone structure, consider the 
> > original report where the problem occurs in a 128MB zone (where we can 
> > expect many pages to have their referenced bit set).
> 
> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
> 
> Putting it in the zone means that the cgroup, which may have
> different behaviour from the rest of the zone, due to excessive
> memory pressure inside the cgroup, does not get the right
> statistics.

Maybe I haven't caught your point.

Current memcg logic also uses the recent_scan/recent_rotate statistics.
Isn't that enough?





^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:08       ` Rik van Riel
@ 2009-08-07  3:17         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-07  3:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Andrea Arcangeli, Wu Fengguang, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

> Andrea Arcangeli wrote:
> 
> > Likely we need a cut-off point, if we detect it takes more than X
> > seconds to scan the whole active list, we start ignoring young bits,
> 
> We could just make this depend on the calculated inactive_ratio,
> which depends on the size of the list.
> 
> For small systems, it may make sense to make every accessed bit
> count, because the working set will often approach the size of
> memory.
> 
> On very large systems, the working set may also approach the
> size of memory, but the inactive list only contains a small
> percentage of the pages, so there is enough space for everything.
> 
> Say, if the inactive_ratio is 3 or less, make the accessed bit
> on the active lists count.

Sounds reasonable. How do we confirm the idea's correctness?
Wu, is your X focus-switching benchmark a sufficient test?




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-07  3:11                 ` KOSAKI Motohiro
@ 2009-08-07  7:54                   ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-07  7:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Avi Kivity, Wu Fengguang, Andrea Arcangeli, Dike,
	Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm, KAMEZAWA Hiroyuki

On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> (cc to memcg folks)
>
>> Avi Kivity wrote:
>> > On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>>
>> >> As a refinement, the static variable 'recent_all_referenced' could be
>> >> moved to struct zone or made a per-cpu variable.
>> >
>> > Definitely this should be made part of the zone structure, consider the
>> > original report where the problem occurs in a 128MB zone (where we can
>> > expect many pages to have their referenced bit set).
>>
>> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
>>
>> Putting it in the zone means that the cgroup, which may have
>> different behaviour from the rest of the zone, due to excessive
>> memory pressure inside the cgroup, does not get the right
>> statistics.
>
> Maybe I haven't caught your point.
>
> Current memcg logic also uses the recent_scan/recent_rotate statistics.
> Isn't that enough?

I don't understand the context; I'll look at the problem when I am
back (I am away from work for the next few days).

Balbir

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-07  7:54                   ` Balbir Singh
@ 2009-08-07  8:24                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-07  8:24 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KOSAKI Motohiro, Rik van Riel, Avi Kivity, Wu Fengguang,
	Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
	linux-mm

On Fri, 7 Aug 2009 13:24:34 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI
> Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > The current memcg logic also uses the recent_scanned/recent_rotated statistics.
> > Isn't that enough?
> 
> I don't understand the context, I'll look at the problem when I am
> back (I am away from work for the next few days).
> 
Brief summary (please point out if this is not correct):

Prepare a memcg with
	memory.limit_in_bytes=128MB

Run kvm in it, and use apps whose working set is near 256MB (then, heavy swap).
In this case,
  - Anon memory is swapped out even while there are file caches.
    Especially, a stack page which is frequently accessed can be
    easily swapped out, again and again.

One of the reasons is the recent change:
"a page mapped with VM_EXEC is not paged out even if it has no reference"

Without memcg, where a user can use gigabytes of memory, the above
change is very welcome.

So the current question is "how can we handle this case without bad effects?".

One possibility I wonder about is that this is a configuration mistake:
setting memory.memsw.limit_in_bytes to a proper value may change the behavior.
But that seems like just a workaround.

Can't we find an algorithmic/heuristic way to avoid too much swap-out?
I think memcg can check the # of swap-ins, but right now we have no tag
to detect the "recently swapped-out page is reused" case, or to see
that there are too many executable file pages.

I wonder whether we can compare
    # of paged-out file caches vs. # of swapped-out anon pages
and keep "# of paged-out file caches < # of swapped-out anon pages" (or use swappiness).
This state can be checked via reclaim_stat (per memcg).
Hmm?
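
Something like the following, perhaps (purely hypothetical: the two
counters do not exist in reclaim_stat today and are invented here for
illustration; which way the inequality should tip, and how swappiness
weighs in, is exactly the open question):

	/*
	 * Hypothetical per-memcg helper: has anon swap-out already
	 * outpaced file page-out?  Both fields are assumed additions
	 * to struct zone_reclaim_stat, not existing members.
	 */
	static bool anon_swapout_outpaces_file_pageout(struct zone_reclaim_stat *stat)
	{
		return stat->nr_swapped_out_anon > stat->nr_paged_out_file;
	}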

I'm sorry, I'll be on a trip Aug/11-Aug/17, so my responses will be delayed.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-07  3:17         ` KOSAKI Motohiro
  (?)
@ 2009-08-12  7:48         ` Wu Fengguang
  2009-08-12 14:31             ` Rik van Riel
  -1 siblings, 1 reply; 243+ messages in thread
From: Wu Fengguang @ 2009-08-12  7:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

[-- Attachment #1: Type: text/plain, Size: 28582 bytes --]

On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
> > Andrea Arcangeli wrote:
> > 
> > > Likely we need a cut-off point, if we detect it takes more than X
> > > seconds to scan the whole active list, we start ignoring young bits,
> > 
> > We could just make this depend on the calculated inactive_ratio,
> > which depends on the size of the list.
> > 
> > For small systems, it may make sense to make every accessed bit
> > count, because the working set will often approach the size of
> > memory.
> > 
> > On very large systems, the working set may also approach the
> > size of memory, but the inactive list only contains a small
> > percentage of the pages, so there is enough space for everything.
> > 
> > Say, if the inactive_ratio is 3 or less, make the accessed bit
> > on the active lists count.
> 
> Sounds reasonable.

Yes, that kind of global measurement would be much better.

> How do we confirm the idea's correctness?

In general the active list tends to grow large on an under-scanned LRU.
I guess Rik is pretty familiar with typical inactive_ratio values on
large-memory systems and may even have some real numbers :)

> Wu, is your X focus-switching benchmark a sufficient test?

It is a major test case for a memory-tight desktop.  Jeff presents
another interesting one for KVM, hehe.

Anyway I collected the active/inactive list sizes, and the numbers
show that the inactive_ratio is roughly 1 when the LRU is scanned
actively and may go very high when it is under-scanned.

Thanks,
Fengguang


4GB desktop, kernel 2.6.30
--------------------------

1) fresh startup:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
                0            80255            68932            24066

2) read 10GB sparse file:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            48096            52312           830142            10971

3) kvm -m 512M:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            82606           155588           684375            15380

4) exit kvm:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            66364            35275           679033            17009


512MB desktop, kernel 2.6.31
----------------------------

1) fresh startup, console:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
                0             1870             7082             2075

2) fresh startx:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
                0            30021            31551             6893

3) start many x apps, no swap:
   (script attached)

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
                0            56475            29608             9707
             4074            54886            27431             9743
             5452            54025            26685             9950
             7417            53428            25394             9963
             8522            52388            24717            10553
            10684            51955            22055            11384
            11644            51597            21329            11342
            12341            51221            20822            11513
            13874            49738            19916            11516
            13874            50494            19916            11517
            15284            48778            19739            12127
            15668            49037            19196            12380
            16821            48571            17661            13133
            18329            49175            14470            14490
            18961            49652            13081            14432
            18961            49608            13236            14414
            20563            51379            11171            13823
            21044            50281            10311            13948
            21426            49906            10268            13984
            21771            50479             9734            14019
            23246            49062             9672            13431
            23984            49490            10083            12763
            24479            49373            10332            12446
            25782            49053             9655            12101

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            26970            48078             9891            11415
            28041            47873             9617            11079
            29485            51183             8445             8293
            30484            50140             8441             7997
            31841            50578             6904             7413
            32579            49873             6937             7804
            34117            49336             6447             7440
            35380            48300             5816             7471
            38055            46486             4778             7546
            39528            45227             5043             7417
            40777            44681             4148             7325
            41902            44468             3967             6534
            43107            43378             4630             5846
            43418            43538             5019             5698
            43563            43514             4839             5514
            43660            43587             5228             5431
            43645            43315             4919             5886
            43618            43555             4531             5704
            43751            43646             4584             5600
            43839            43703             4507             5569
            44015            44057             4757             5378
            44115            44089             4707             4724
            44331            44184             4710             4701
            44577            44554             4221             4265

[...]

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            47945            47876             1594             1547
            47944            47888             1944             1494
            47944            47888             1351             1392
            47974            47844             1976             1498
            47974            47858             1411             1549
            47974            47857             1482             1423
            47973            47874             2105             1435
            47969            47349             1884             1592
            47966            47353             1993             1700
            47966            47343             1913             1882
            47965            47306             1683             1746
            47960            47373             1598             1583
            47960            47375             1808             1677
            47960            47004             2444             1625
            47959            47060             2017             1825
            47956            47047             1866             1742
            47955            47080             2039             1987
            47954            47072             1734             1822
            47954            47092             1963             1867
            47954            47130             1851             1846
            47954            47154             2134             1813
            47954            47181             1952             1813
            47953            47138             1678             1810
            47951            47125             1848             1951

4) start many x apps, with swap:

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
                0             6571            13646             3251
                0             6823            14637             3900
                0             7426            17187             3935
                0             8188            19989             3959
                0             9994            21582             4148
                0            12556            21889             5157
                0            13846            23764             5249
                0            20383            25393             5546
                0            21830            26019             5696
                0            22856            26608             5972
                0            28651            28128             6146
                0            28058            28482             6309
                0            27726            28595             6312
                0            27634            28775             6471
                0            27636            28774             6464
                0            31299            28848             6834
                0            35102            29539             6886
                0            39561            29980             6915
                0            41573            30008             6917
                0            47562            30041             6917
                0            54603            30041             6917
             3040            55528            29273             6945
            16937            44916            23406             7675
            16937            44932            23416             7670

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            16937            44961            23416             7670
            16937            40583            23416             7670
            16937            40596            23417             7670
            16937            40607            23417             7670
            16937            40139            23404             7668
            12181            11794            22932             8144
            12181            11794            22932             8144
            12181            11794            22946             8144
            12181            11794            22946             8144
            12147            13063            23148             8280
            12146            15994            22842             8565
            12146            17491            22654             8718
            12146            17488            22654             8718
            12146            17653            22634             8733
            12146            18656            21030            10513
            12146            19717            20778            10770
            12146            20341            20859            10846
            12146            21134            21096            11133
            12146            22692            21129            11453
            12144            24698            22225            11476
            12144            27726            22609            11536
            12144            27774            22648            11555
            12144            28447            22844            11564
            12144            30286            23238            11567

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            12121            31489            23350            11761
            12099            33117            23336            11779
            12099            33632            23555            11787
            12099            35393            23566            11806
            12099            35828            23490            11882
            12099            35879            23486            11887
            12099            35889            23486            11888
            12099            36078            24124            11890
            12099            36449            25079            11895
            12099            37782            25334            11898
            12099            39494            25564            11904
            12099            40620            25657            11905
            14200            41298            25399            12069
            21555            35228            22969            12495
            22829            33097            22703            12617
            25519            31496            22115            12552
            28590            28947            21617            13051
            28940            29076            19806            13270
            29430            29344            19153            13825
            30183            30399            17643            13418
            32242            32203            13535            13969
            33319            33294            12236            13659
            33154            33085            11431            13482
            33572            33569            11315            13102

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            36246            35355             8033             9355
            35659            35558             8491             8394
            35330            35142             8233             8278
            35788            35561             8460             8454
            36129            36359             8413             8627
            36727            36365             8311             8509
            36672            36870             8437             8479
            36772            36656             8090             8354
            36754            36614             8237             8378
            37591            36065             8352             8470
            36530            36383             7607             7611
            36113            35992             7271             7296
            36149            35667             7092             7052
            36014            35350             7408             7206
            36409            35890             8027             7396
            36300            35418             7892             7704
            36369            36589             7723             7838
            36243            36168             7576             7793
            35804            35622             7422             7726
            35498            35435             7443             7557
            35078            35159             7542             7243
            35478            35415             8199             7552
            35143            35025             7828             7763
            35312            34754             7745             7545

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            35093            34933             7166             7748
            36253            36236             7171             7408
            36225            36929             8236             7532
            36197            36169             7562             7632
            35711            35647             7312             7471
            35210            35144             7202             7227
            35052            35021             7073             7084
            35263            35047             7128             6963
            35359            35177             7572             7048
            35665            35523             7927             7025
            34988            34788             7279             7340
            34678            34438             7352             7141
            34352            34270             7033             6980
            34307            34175             6881             6809
            34038            34469             7603             6700
            34169            33854             7105             6868
            34048            34124             7051             6869
            33630            33445             6821             6875
            33047            32992             6617             6554
            33232            33012             7114             6659
            33442            33217             7408             6700
            32942            32707             6830             7257
            32672            32593             6801             7207
            32406            32142             6656             6960

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            32127            32036             6641             6798
            31929            31769             6567             6664
            31786            31968             7532             6670
            32208            31228             7448             6859
            30904            30835             6503             6774
            30543            30559             6345             6709
            30394            30278             6235             6288
            30541            30239             6470             6243
            30463            30656             6959             6587
            31020            29794             7393             6897
            30169            30128             6295             6905
            29755            29644             6236             6598
            29765            29617             6342             6475
            29874            29748             6215             6335
            29654            29491             6355             6358
            29972            29853             7079             6607
            29437            29267             6670             7205
            29160            28956             6602             6982
            29411            29017             6578             6937
            29069            28952             6539             6717
            29570            29342             6982             6850
            28882            28927             6912             6809
            29326            28731             6928             6814
            28883            28817             6762             6819

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            29072            28696             6756             6803
            29296            29120             6993             6972
            28426            28167             6238             7182
            28071            27862             6197             6953
            27944            27767             6872             6780
            28141            27819             6839             6654
            27547            27309             6285             6209
            27578            27842             7730             6465
            27741            27470             7180             6665
            27481            27217             7566             6919
            27568            27405             7696             7027
            27274            27004             7416             7120
            27110            26920             7111             7303
            27282            27056             7476             7046
            27549            27044             7779             7074
            27325            26968             7290             6972
            27665            27528             8465             7058
            27093            26974             7662             7243
            27155            27068             7299             7344
            26638            26553             6925             7325
            26718            26383             6571             7425
            26264            26150             6470             6960
            26463            26176             6590             6803
            27155            26396             7387             6709

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            26532            26408             6722             6702
            26491            26421             6789             6731
            26783            26950             8389             6849
            27129            26584             7713             6991
            26791            26228             7316             7202
            26208            26168             7115             7172
            26031            25907             6957             7118
            25980            25675             6764             7216
            25608            25535             6779             7042
            25571            25501             6520             6943
            26068            25287             6574             6948
            25734            25305             6778             6776
            25442            25134             6629             6556
            25514            25217             7469             6543
            25659            25552             8561             6620
            26082            24784             8494             6676
            25312            25194             7052             7026
            25386            25267             7422             6973
            25070            24965             6716             6886
            25143            24801             6597             6785
            24971            24866             6643             6786
            25223            25212             6829             6757
            25504            24778             7589             6840
            25531            24786             8068             6896

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            25343            25169             7227             7042
            25195            25129             6804             7149
            25355            25071             6958             6941
            25294            25202             6676             6850
            25688            25050             6743             6694
            25736            25268             6910             6580
            25750            25530             7299             6557
            25401            25273             6622             6810
            25672            25525             6798             6770
            25192            25067             6226             6486
            26011            25360             6540             6466
            26673            25768             6444             6411
            27211            26326             6370             6423
            27527            26764             6615             6534
            27355            26820             6337             6467
            27385            26962             6098             6446
            27528            27431             5832             6303
            26955            26918             6016             6015
            26816            26469             5847             5894
            26961            26390             6077             5866
            26781            26664             5625             5815
            26755            26454             6114             5806
            26994            26552             6016             5784
            27482            26910             5945             5714

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            27609            27143             5929             5715
            27885            27443             7168             5947
            27684            27635             8231             6145
            27320            27205             7359             6415
            27679            27265             7898             6445
            27655            26342             7651             6574
            27033            26853             7385             6831
            26696            26468             6533             6721
            26464            26310             6374             6465
            26192            26084             6261             6417
            26182            25801             6511             6367
            26010            25880             6251             6266
            26130            26032             5974             6280
            26417            26175             5830             6610
            26558            26450             6002             6623
            26758            26016             6141             6526
            26481            26363             5911             6356
            26765            26622             6401             6266
            27022            26534             6593             6210
            27587            26515             6560             6193
            27156            27029             6123             6109
            27284            26926             6159             5776
            27153            26996             5698             5642
            26712            26603             6151             5541

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            26697            27024             7919             5565
            26651            26471             7134             5965
            27021            26479             7617             5996
            26323            26024             7091             6273
            26081            25894             6267             6527
            25605            25487             5814             6407
            25564            25447             5613             6422
            25460            25406             5630             6374
            25380            25380             5776             6358
            25661            25653             6045             6037
            25790            24706             6069             6045
            25512            25043             6024             5982
            25440            25102             6067             5807
            25802            25181             5953             5838
            25864            25314             5694             5711
            26022            25737             5592             5510
            25964            25741             6784             5376
            26092            25952             7929             5537
            26110            25990             7120             5789
            25311            25252             6157             6146
            25432            25379             6658             6197
            25552            25390             6176             6357
            25388            25237             5742             6303
            25841            25173             5932             6325

 nr_inactive_anon   nr_active_anon nr_inactive_file   nr_active_file
            25054            24965             5532             6072
            24754            25191             7503             5941
            25529            25133             7319             6000
            25350            24510             6993             6078
            24672            24541             6027             6068
            24610            24492             5811             5804
            24819            24674             5841             5820
            24775            24394             5719             5696
            24991            25179             6639             5816
            25282            24538             6870             6088
            25172            24727             6628             6090
            25363            24721             6644             6091
            25676            24705             6672             6102
            24998            24909             5683             5957
            24762            24736             6034             5869
            24965            24890             6374             6614
            25050            24895             6436             6616
            25087            24932             6436             6617
            25139            24860             6435             6619
            25159            24903             6435             6620
            25168            25362             6004             7051
            25209            25524             6004             7052
            25209            25504             6004             7053
            25262            25447             6011             7054

[-- Attachment #2: run-many-x-apps.sh --]
[-- Type: application/x-sh, Size: 1735 bytes --]

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-12  7:48         ` Wu Fengguang
@ 2009-08-12 14:31             ` Rik van Riel
  0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-12 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:
> On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
>>> Andrea Arcangeli wrote:
>>>
>>>> Likely we need a cut-off point, if we detect it takes more than X
>>>> seconds to scan the whole active list, we start ignoring young bits,
>>> We could just make this depend on the calculated inactive_ratio,
>>> which depends on the size of the list.
>>>
>>> For small systems, it may make sense to make every accessed bit
>>> count, because the working set will often approach the size of
>>> memory.
>>>
>>> On very large systems, the working set may also approach the
>>> size of memory, but the inactive list only contains a small
>>> percentage of the pages, so there is enough space for everything.
>>>
>>> Say, if the inactive_ratio is 3 or less, make the accessed bit
>>> on the active lists count.
>> Sounds reasonable.
> 
> Yes, that kind of global measurement would be much better.
> 
>> How do we confirm the idea's correctness?
> 
> In general the active list tends to grow large on an under-scanned LRU.
> I guess Rik is pretty familiar with typical inactive_ratio values on
> large-memory systems and may even have some real numbers :)
> 
>> Wu, is your X focus-switching benchmark a sufficient test?
> 
> It is a major test case for a memory-tight desktop.  Jeff presents
> another interesting one for KVM, hehe.
> 
> Anyway I collected the active/inactive list sizes, and the numbers
> show that the inactive_ratio is roughly 1 when the LRU is scanned
> actively and may go very high when it is under-scanned.

inactive_ratio is based on the zone (or cgroup) size.

For zones it is a fixed value, which is available in
/proc/zoneinfo
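
The fixed value is derived from the zone size at boot, roughly like
this (a paraphrase of setup_per_zone_inactive_ratio() from 2.6.3x
mm/page_alloc.c; close to, but not guaranteed to match, the exact
code):

	static void setup_per_zone_inactive_ratio(void)
	{
		struct zone *zone;

		for_each_zone(zone) {
			unsigned int gb, ratio;

			/* zone size in gigabytes */
			gb = zone->present_pages >> (30 - PAGE_SHIFT);
			ratio = int_sqrt(10 * gb);
			if (!ratio)
				ratio = 1;	/* small zones: 1:1 split */

			zone->inactive_ratio = ratio;
		}
	}

So a zone under 1GB gets ratio 1, and the ratio grows with the square
root of the zone size, which roughly matches the numbers Wu posts in
the follow-up.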

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-12 14:31             ` Rik van Riel
@ 2009-08-13  1:03               ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-13  1:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
> >>> Andrea Arcangeli wrote:
> >>>
> >>>> Likely we need a cut-off point, if we detect it takes more than X
> >>>> seconds to scan the whole active list, we start ignoring young bits,
> >>> We could just make this depend on the calculated inactive_ratio,
> >>> which depends on the size of the list.
> >>>
> >>> For small systems, it may make sense to make every accessed bit
> >>> count, because the working set will often approach the size of
> >>> memory.
> >>>
> >>> On very large systems, the working set may also approach the
> >>> size of memory, but the inactive list only contains a small
> >>> percentage of the pages, so there is enough space for everything.
> >>>
> >>> Say, if the inactive_ratio is 3 or less, make the accessed bit
> >>> on the active lists count.
> >> Sounds reasonable.
> > 
> > Yes, that kind of global measurement would be much better.
> > 
> >> How do we confirm the idea's correctness?
> > 
> > In general the active list tends to grow large on an under-scanned LRU.
> > I guess Rik is pretty familiar with typical inactive_ratio values on
> > large-memory systems and may even have some real numbers :)
> > 
> >> Wu, is your X focus-switching benchmark a sufficient test?
> > 
> > It is a major test case for a memory-tight desktop.  Jeff presents
> > another interesting one for KVM, hehe.
> > 
> > Anyway I collected the active/inactive list sizes, and the numbers
> > show that the inactive_ratio is roughly 1 when the LRU is scanned
> > actively and may go very high when it is under-scanned.
> 
> inactive_ratio is based on the zone (or cgroup) size.

Ah, sorry, by 'inactive_ratio' I meant the runtime active:inactive ratio.

> For zones it is a fixed value, which is available in
> /proc/zoneinfo

On my 64bit desktop with 4GB memory:

        DMA     inactive_ratio:    1
        DMA32   inactive_ratio:    4
        Normal  inactive_ratio:    1

The biggest zone DMA32 has inactive_ratio=4. But I guess the
referenced bit should not be ignored on this typical desktop
configuration?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13  1:03               ` Wu Fengguang
@ 2009-08-13 15:46                 ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-13 15:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:
> On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote:

>> For zones it is a fixed value, which is available in
>> /proc/zoneinfo
> 
> On my 64bit desktop with 4GB memory:
> 
>         DMA     inactive_ratio:    1
>         DMA32   inactive_ratio:    4
>         Normal  inactive_ratio:    1
> 
> The biggest zone DMA32 has inactive_ratio=4. But I guess the
> referenced bit should not be ignored on this typical desktop
> configuration?

We need to ignore the referenced bit on active anon pages
on very large systems, but it could indeed be helpful to
respect the referenced bit on smaller systems.

I have no idea where the cut-off between them would be.

Maybe at inactive_ratio <= 4 ?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13 15:46                 ` Rik van Riel
@ 2009-08-13 16:12                   ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-13 16:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On 08/13/2009 06:46 PM, Rik van Riel wrote:
> We need to ignore the referenced bit on active anon pages
> on very large systems, but it could indeed be helpful to
> respect the referenced bit on smaller systems.
>
> I have no idea where the cut-off between them would be.
>
> Maybe at inactive_ratio <= 4 ?

Why do we need to ignore the referenced bit in such cases?  To avoid 
overscanning?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13 16:12                   ` Avi Kivity
@ 2009-08-13 16:26                     ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-13 16:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:
> On 08/13/2009 06:46 PM, Rik van Riel wrote:
>> We need to ignore the referenced bit on active anon pages
>> on very large systems, but it could indeed be helpful to
>> respect the referenced bit on smaller systems.
>>
>> I have no idea where the cut-off between them would be.
>>
>> Maybe at inactive_ratio <= 4 ?
> 
> Why do we need to ignore the referenced bit in such cases?  To avoid 
> overscanning?

Because swapping out anonymous pages tends to be a relatively
rare operation, we'll have many gigabytes of anonymous pages
that all have the referenced bit set (because there was lots
of time between swapout bursts).

Ignoring the referenced bit on active anon pages makes no
difference on these systems, because all active anon pages
have the referenced bit set, anyway.

All we need to do is put the pages on the inactive list and
give them a chance to get referenced.

However, on smaller systems (and cgroups!), the speed at
which we can do pageout IO is higher relative to the amount
of memory.  This means we can cycle through the pages more
quickly and we may want to count references on the active
list, too.

Yes, on smaller systems we'll also often end up with bursty
swapout loads and all pages referenced - but since we have
fewer pages to begin with, it won't hurt as much.

I suspect that an inactive_ratio of 3 or 4 might make a
good cutoff value.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13 16:26                     ` Rik van Riel
@ 2009-08-13 19:12                       ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-13 19:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On 08/13/2009 07:26 PM, Rik van Riel wrote:
>> Why do we need to ignore the referenced bit in such cases?  To avoid 
>> overscanning?
>
>
> Because swapping out anonymous pages tends to be a relatively
> rare operation, we'll have many gigabytes of anonymous pages
> that all have the referenced bit set (because there was lots
> of time between swapout bursts).
>
> Ignoring the referenced bit on active anon pages makes no
> difference on these systems, because all active anon pages
> have the referenced bit set, anyway.
>
> All we need to do is put the pages on the inactive list and
> give them a chance to get referenced.
>
> However, on smaller systems (and cgroups!), the speed at
> which we can do pageout IO is higher relative to the amount
> of memory.  This means we can cycle through the pages more
> quickly and we may want to count references on the active
> list, too.
>
> Yes, on smaller systems we'll also often end up with bursty
> swapout loads and all pages referenced - but since we have
> fewer pages to begin with, it won't hurt as much.
>
> I suspect that an inactive_ratio of 3 or 4 might make a
> good cutoff value.
>

Thanks for the explanation.  I think my earlier idea of

- do not ignore the referenced bit
- if you see a run of N pages which all have the referenced bit set, do 
swap one

has merit.  It means we cycle more quickly (by a factor of N) through 
the list, looking for unreferenced pages.  If we don't find any we've 
spent some more CPU, but if we do find an unreferenced page, we win by
swapping a truly unneeded page.

Cycling faster also means reducing the time between examinations of any 
particular page, so it increases the meaningfulness of the check on 
large systems (otherwise even rarely used pages will always show up as 
referenced).
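
A rough sketch of that policy as a helper for the active list walk.
Everything here is hypothetical: the helper, the caller-supplied run
counter, and the run length N are made up for illustration; only the
page_referenced() call follows this era's API.

#define REFERENCED_RUN_MAX	32	/* N: made-up tuning knob */

/*
 * Deactivate every unreferenced page, plus one page out of each run
 * of N consecutive referenced pages, so that long runs of referenced
 * pages cannot pin the whole active list.
 */
static bool deactivate_this_page(struct page *page,
				 struct scan_control *sc,
				 unsigned long *run)
{
	unsigned long vm_flags;

	if (!page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
		*run = 0;
		return true;		/* unreferenced: deactivate */
	}
	if (++*run >= REFERENCED_RUN_MAX) {
		*run = 0;
		return true;		/* break up a long referenced run */
	}
	return false;			/* referenced: keep on active list */
}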

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13 19:12                       ` Avi Kivity
@ 2009-08-13 21:16                         ` Johannes Weiner
  -1 siblings, 0 replies; 243+ messages in thread
From: Johannes Weiner @ 2009-08-13 21:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Thu, Aug 13, 2009 at 10:12:01PM +0300, Avi Kivity wrote:
> On 08/13/2009 07:26 PM, Rik van Riel wrote:
> >>Why do we need to ignore the referenced bit in such cases?  To avoid 
> >>overscanning?
> >
> >
> >Because swapping out anonymous pages tends to be a relatively
> >rare operation, we'll have many gigabytes of anonymous pages
> >that all have the referenced bit set (because there was lots
> >of time between swapout bursts).
> >
> >Ignoring the referenced bit on active anon pages makes no
> >difference on these systems, because all active anon pages
> >have the referenced bit set, anyway.
> >
> >All we need to do is put the pages on the inactive list and
> >give them a chance to get referenced.
> >
> >However, on smaller systems (and cgroups!), the speed at
> >which we can do pageout IO is larger, compared to the amount
> >of memory.  This means we can cycle through the pages more
> >quickly and we may want to count references on the active
> >list, too.
> >
> >Yes, on smaller systems we'll also often end up with bursty
> >swapout loads and all pages referenced - but since we have
> >fewer pages to begin with, it won't hurt as much.
> >
> >I suspect that an inactive_ratio of 3 or 4 might make a
> >good cutoff value.
> >
> 
> Thanks for the explanation.  I think my earlier idea of
> 
> - do not ignore the referenced bit
> - if you see a run of N pages which all have the referenced bit set, do 
> swap one
> 
> has merit.  It means we cycle more quickly (by a factor of N) through 
> the list, looking for unreferenced pages.  If we don't find any we've 
> spent some more CPU, but if we do find an unreferenced page, we win by
> swapping a truly unneeded page.

But it also means destroying the LRU order.

Okay, we ignore the referenced bit, but we keep LRU buddies together
which then get reactivated together as well, if they are indeed in
active use.

I could imagine the VM going nuts when you separate them by a
predicate that is rather unrelated to the pages' actual usage
patterns.

After all, the list order is the primary input to selecting pages for
eviction.

It would need actual testing, of course, but I bet Rik's idea of using
the referenced bit always or never is going to show better results.

	Hannes

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-13 21:16                         ` Johannes Weiner
@ 2009-08-14  7:16                           ` Avi Kivity
  -1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-14  7:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On 08/14/2009 12:16 AM, Johannes Weiner wrote:
>
>> - do not ignore the referenced bit
>> - if you see a run of N pages which all have the referenced bit set, do
>> swap one
>>
>>      
>
> But it also means destroying the LRU order.
>
>    

True, it would, but if we ignore the referenced bit, LRU order is really 
FIFO.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14  7:16                           ` Avi Kivity
@ 2009-08-14  9:10                             ` Johannes Weiner
  -1 siblings, 0 replies; 243+ messages in thread
From: Johannes Weiner @ 2009-08-14  9:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote:
> On 08/14/2009 12:16 AM, Johannes Weiner wrote:
> >
> >>- do not ignore the referenced bit
> >>- if you see a run of N pages which all have the referenced bit set, do
> >>swap one
> >>
> >>     
> >
> >But it also means destroying the LRU order.
> >
> >   
> 
> True, it would, but if we ignore the referenced bit, LRU order is really 
> FIFO.

For the active list, yes.  But it's not that we degrade to First Fault
First Out in a global scope; we still update the order from
mark_page_accessed() and by activating referenced pages in
shrink_page_list() etc.

So even with the active list being a FIFO, we keep usage information
gathered from the inactive list.  If we deactivate pages in arbitrary
list intervals, we throw this away.

And while global FIFO-based reclaim does not work too well, initial
fault order is a valuable hint with respect to referential locality,
as the pages get used in groups and thus move around the lists in
groups.

Our granularity for regrouping decisions is pretty coarse, for
non-filecache pages it's basically 'referenced or not referenced in the
last list round-trip', so it will take quite some time to regroup
pages that are used in truly similar intervals.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14  9:10                             ` Johannes Weiner
@ 2009-08-14  9:51                               ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-14  9:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote:
> > On 08/14/2009 12:16 AM, Johannes Weiner wrote:
> > >
> > >>- do not ignore the referenced bit
> > >>- if you see a run of N pages which all have the referenced bit set, do
> > >>swap one
> > >>
> > >>     
> > >
> > >But it also means destroying the LRU order.
> > >
> > >   
> > 
> > True, it would, but if we ignore the referenced bit, LRU order is really 
> > FIFO.
> 
> For the active list, yes.  But it's not that we degrade to First Fault
> First Out in a global scope; we still update the order from
> mark_page_accessed() and by activating referenced pages in
> shrink_page_list() etc.
> 
> So even with the active list being a FIFO, we keep usage information
> gathered from the inactive list.  If we deactivate pages in arbitrary
> list intervals, we throw this away.

We do have the danger of FIFO: if the inactive list is small enough,
(unconditionally) deactivated pages quickly get reclaimed and their
life window in the inactive list is too small to be useful.

> And while global FIFO-based reclaim does not work too well, initial
> fault order is a valuable hint with respect to referential locality,
> as the pages get used in groups and thus move around the lists in
> groups.
> 
> Our granularity for regrouping decisions is pretty coarse, for
> non-filecache pages it's basically 'referenced or not referenced in the
> last list round-trip', so it will take quite some time to regroup
> pages that are used in truly similar intervals.

Agreed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14  9:51                               ` Wu Fengguang
@ 2009-08-14 13:19                                 ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-14 13:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:
> On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:

>> So even with the active list being a FIFO, we keep usage information
>> gathered from the inactive list.  If we deactivate pages in arbitrary
>> list intervals, we throw this away.
> 
> We do have the danger of FIFO: if the inactive list is small enough,
> (unconditionally) deactivated pages quickly get reclaimed and their
> life window in the inactive list is too small to be useful.

This is one of the reasons why we unconditionally deactivate
the active anon pages, and do background scanning of the
active anon list when reclaiming page cache pages.

We want to always move some pages to the inactive anon
list, so it does not get too small.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14  9:51                               ` Wu Fengguang
@ 2009-08-14 21:42                                 ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-14 21:42 UTC (permalink / raw)
  To: Wu, Fengguang, Johannes Weiner
  Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

A side note - I've been doing some tracing and shrink_active_list is
called a humongous number of times (25000-ish during a ~90 kvm run),
with a net result of zero pages moved nearly all the time.  Your test
is rescuing essentially all candidate pages from the inactive list.
Right now, I have the VM_EXEC || PageAnon version of your test.

						Jeff

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14 21:42                                 ` Dike, Jeffrey G
@ 2009-08-14 22:37                                   ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-14 22:37 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Dike, Jeffrey G wrote:
> A side note - I've been doing some tracing and shrink_active_list is
> called a humongous number of times (25000-ish during a ~90 kvm run),
> with a net result of zero pages moved nearly all the time.  Your test
> is rescuing essentially all candidate pages from the inactive list.
> Right now, I have the VM_EXEC || PageAnon version of your test.

That is exactly why the split LRU VM does an unconditional
deactivation of active anon pages :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14 22:37                                   ` Rik van Riel
@ 2009-08-15  5:32                                     ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-15  5:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dike, Jeffrey G, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Sat, Aug 15, 2009 at 06:37:22AM +0800, Rik van Riel wrote:
> Dike, Jeffrey G wrote:
> > A side note - I've been doing some tracing and shrink_active_list
> > is called a humongous number of times (25000-ish during a ~90 kvm
> > run), with a net result of zero pages moved nearly all the time.

You mean "no pages get deactivated at all in most invocations"?
This is possible in the steady (thrashing) state of a memory-tight
system (the working set is bigger than the memory size).

> > Your test is rescuing essentially all candidate pages from the
> > inactive list.  Right now, I have the VM_EXEC || PageAnon version
> > of your test.
> 
> That is exactly why the split LRU VM does an unconditional
> deactivation of active anon pages :)

In general it is :)  However, in Jeff's small-memory case, there
will be many refaults without the "PageAnon" protection. But the
patch does not imply that I'm happy with the "PageAnon" test ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14 13:19                                 ` Rik van Riel
@ 2009-08-15  5:45                                   ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-15  5:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> 
> >> So even with the active list being a FIFO, we keep usage information
> >> gathered from the inactive list.  If we deactivate pages in arbitrary
> >> list intervals, we throw this away.
> > 
> > We do have the danger of FIFO: if the inactive list is small enough,
> > (unconditionally) deactivated pages quickly get reclaimed and their
> > life window in the inactive list is too small to be useful.
> 
> This is one of the reasons why we unconditionally deactivate
> the active anon pages, and do background scanning of the
> active anon list when reclaiming page cache pages.
> 
> We want to always move some pages to the inactive anon
> list, so it does not get too small.

Right, the current code tries to pull the inactive list out of its
smallish-size state as long as there is vmscan activity.

However, there is a possible (and tricky) hole: mem cgroups don't do
batched vmscan.  shrink_zone() may call shrink_list() with
nr_to_scan=1, in which case shrink_list() _still_ calls
isolate_pages() with the much larger SWAP_CLUSTER_MAX.

It effectively scales up the inactive list scan rate by 10 times when
it is still small, and may thus prevent it from ever growing.

In that case, LRU becomes FIFO.
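
(For reference, the batching that the patch below extends to cgroups
is done by nr_scan_try_batch(), which accumulates small scan requests
until a full batch builds up.  Roughly, from mm/vmscan.c of this era;
treat the exact body as a sketch:)

static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				       unsigned long *nr_saved_scan,
				       unsigned long swap_cluster_max)
{
	unsigned long nr = *nr_saved_scan + nr_to_scan;

	if (nr >= swap_cluster_max)
		*nr_saved_scan = 0;	/* emit one full batch */
	else {
		*nr_saved_scan = nr;	/* too small: save for later */
		nr = 0;
	}
	return nr;
}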

Jeff, can you confirm if the mem cgroup's inactive list is small?
If so, this patch should help.

Thanks,
Fengguang
---

mm: do batched scans for mem_cgroup

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   12 ++++++++++++
 mm/vmscan.c                |    9 +++++----
 3 files changed, 20 insertions(+), 4 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2009-08-15 13:12:49.000000000 +0800
+++ linux/include/linux/memcontrol.h	2009-08-15 13:18:13.000000000 +0800
@@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone);
 struct zone_reclaim_stat*
--- linux.orig/mm/memcontrol.c	2009-08-15 13:07:34.000000000 +0800
+++ linux/mm/memcontrol.c	2009-08-15 13:17:56.000000000 +0800
@@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
 	 */
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
+	unsigned long		nr_saved_scan[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
 };
@@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return &mz->nr_saved_scan[lru];
+}
+
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
 {
--- linux.orig/mm/vmscan.c	2009-08-15 13:04:54.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-15 13:19:03.000000000 +0800
@@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
 	for_each_evictable_lru(l) {
 		int file = is_file_lru(l);
 		unsigned long scan;
+		unsigned long *saved_scan;
 
 		scan = zone_nr_pages(zone, sc, l);
 		if (priority || noswap) {
@@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
 			scan = (scan * percent[file]) / 100;
 		}
 		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
+			saved_scan = &zone->lru[l].nr_saved_scan;
 		else
-			nr[l] = scan;
+			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
+							       zone, l);
+		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 21:09                 ` Jeff Dike
@ 2009-08-16  3:18                   ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  3:18 UTC (permalink / raw)
  To: Jeff Dike
  Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> Side question -
> 	Is there a good reason for this to be in shrink_active_list()
> as opposed to __isolate_lru_page?
> 
> 		if (unlikely(!page_evictable(page, NULL))) {
> 			putback_lru_page(page);
> 			continue;
> 		}
> 
> Maybe we want to minimize the amount of code under the lru lock or
> avoid duplicate logic in the isolate_page functions.

I guess the quick test means to avoid the expensive page_referenced()
call that follows it. But that should be mostly a one-shot cost - the
unevictable pages are unlikely to cycle through the active/inactive
lists again and again.

> But if there are important mlock-heavy workloads, this could make the
> scan come up empty, or at least emptier than we might like.

Yes, if the above 'if' block is removed, the inactive lists might get
more expensive to reclaim.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-06 13:16                 ` Rik van Riel
@ 2009-08-16  3:28                   ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  3:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 09:16:14PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> 
> > I guess both schemes have unacceptable flaws.
> > 
> > For JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> > So the KEEP_MOST scheme could increase reclaim overheads by N=250 times;
> > while the DROP_CONTINUOUS scheme is effectively zero cost.
> 
> The higher overhead may not be an issue on smaller systems,
> or inside smaller cgroups inside large systems, when doing
> cgroup reclaim.

Right.

> > However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
> > It can behave vastly differently on a single active task and on
> > multiple ones.  It is short-sighted and can be cheated by bursty
> > activities.
> 
> The split LRU VM tries to avoid the bursty page aging as
> much as possible, by doing background deactivating of
> anonymous pages whenever we reclaim page cache pages and
> the number of anonymous pages in the zone (or cgroup) is
> low.

Right, but I meant bursty page allocations and accesses on them, which
can make a large contiguous segment of referenced pages in the LRU list,
say 50MB.  They may or may not be valuable as a whole; however, a local
algorithm may keep the first 4MB and drop the remaining 46MB.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  3:18                   ` Wu Fengguang
@ 2009-08-16  3:53                     ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-16  3:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:
> On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> Side question -
>> 	Is there a good reason for this to be in shrink_active_list()
>> as opposed to __isolate_lru_page?
>>
>> 		if (unlikely(!page_evictable(page, NULL))) {
>> 			putback_lru_page(page);
>> 			continue;
>> 		}
>>
>> Maybe we want to minimize the amount of code under the lru lock or
>> avoid duplicate logic in the isolate_page functions.
> 
> I guess the quick test means to avoid the expensive page_referenced()
> call that follows it. But that should be mostly a one-shot cost - the
> unevictable pages are unlikely to cycle through the active/inactive
> lists again and again.

Please read what putback_lru_page does.

It moves the page onto the unevictable list, so that
it will not end up in this scan again.
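
(Condensed for reference; the real function also rechecks evictability
to handle races.  A simplified sketch using this era's helpers:)

void putback_lru_page(struct page *page)	/* simplified sketch */
{
	if (page_evictable(page, NULL))
		/* back onto the regular (in)active LRU lists */
		lru_cache_add_lru(page, page_lru(page));
	else
		/* parked on the unevictable list: not scanned again */
		add_page_to_unevictable_list(page);
	put_page(page);		/* drop the reference taken at isolation */
}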

>> But if there are important mlock-heavy workloads, this could make the
>> scan come up empty, or at least emptier than we might like.
> 
> Yes, if the above 'if' block is removed, the inactive lists might get
> more expensive to reclaim.

Why?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  3:28                   ` Wu Fengguang
@ 2009-08-16  3:56                     ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-16  3:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Wu Fengguang wrote:

> Right, but I meant bursty page allocations and accesses on them, which
> can make a large contiguous segment of referenced pages in the LRU list,
> say 50MB.  They may or may not be valuable as a whole; however, a local
> algorithm may keep the first 4MB and drop the remaining 46MB.

I wonder if the problem is that we simply do not keep a large
enough inactive list in Jeff's test.  If we do not, pages do
not have a chance to be referenced again before the reclaim
code comes in.

The cgroup stats should show how many active anon and inactive
anon pages there are in the cgroup.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  3:56                     ` Rik van Riel
@ 2009-08-16  4:43                       ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16  4:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wu Fengguang, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

* Rik van Riel <riel@redhat.com> [2009-08-15 23:56:39]:

> Wu Fengguang wrote:
>
>> Right, but I meant bursty page allocations and accesses on them, which
>> can make a large contiguous segment of referenced pages in the LRU list,
>> say 50MB.  They may or may not be valuable as a whole; however, a local
>> algorithm may keep the first 4MB and drop the remaining 46MB.
>
> I wonder if the problem is that we simply do not keep a large
> enough inactive list in Jeff's test.  If we do not, pages do
> not have a chance to be referenced again before the reclaim
> code comes in.
>
> The cgroup stats should show how many active anon and inactive
> anon pages there are in the cgroup.
>

Yes, we do show active and inactive anon pages in the mem cgroup
controller in the memory.stat file.


-- 
	Balbir

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  3:56                     ` Rik van Riel
@ 2009-08-16  4:55                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  4:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> 
> > Right, but I meant busty page allocations and accesses on them, which
> > can make a large continuous segment of referenced pages in LRU list,
> > say 50MB.  They may or may not be valuable as a whole, however a local
> > algorithm may keep the first 4MB and drop the remaining 46MB.
> 
> I wonder if the problem is that we simply do not keep a large
> enough inactive list in Jeff's test.  If we do not, pages do
> not have a chance to be referenced again before the reclaim
> code comes in.

Exactly, that's the case I call the list FIFO.

> The cgroup stats should show how many active anon and inactive
> anon pages there are in the cgroup.

Jeff, can you have a look at these stats? Thanks!

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-15  5:45                                   ` Wu Fengguang
@ 2009-08-16  5:09                                     ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16  5:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
	linux-mm

* Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:

> On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > 
> > >> So even with the active list being a FIFO, we keep usage information
> > >> gathered from the inactive list.  If we deactivate pages in arbitrary
> > >> list intervals, we throw this away.
> > > 
> > > We do have the danger of FIFO, if inactive list is small enough, so
> > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > their life window in inactive list is too small to be useful.
> > 
> > This is one of the reasons why we unconditionally deactivate
> > the active anon pages, and do background scanning of the
> > active anon list when reclaiming page cache pages.
> > 
> > We want to always move some pages to the inactive anon
> > list, so it does not get too small.
> 
> Right, the current code tries to pull inactive list out of
> smallish-size state as long as there are vmscan activities.
> 
> However there is a possible (and tricky) hole: mem cgroups
> don't do batched vmscan. shrink_zone() may call shrink_list()
> with nr_to_scan=1, in which case shrink_list() _still_ calls
> isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> 

I think we may need to export some scanning data under DEBUG_VM
to cross-verify.

> In that case, LRU becomes FIFO.
> 
> Jeff, can you confirm if the mem cgroup's inactive list is small?
> If so, this patch should help.
> 

> Thanks,
> Fengguang
> ---
> 
> mm: do batched scans for mem_cgroup
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/memcontrol.h |    3 +++
>  mm/memcontrol.c            |   12 ++++++++++++
>  mm/vmscan.c                |    9 +++++----
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> --- linux.orig/include/linux/memcontrol.h	2009-08-15 13:12:49.000000000 +0800
> +++ linux/include/linux/memcontrol.h	2009-08-15 13:18:13.000000000 +0800
> @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru);
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> +					 struct zone *zone,
> +					 enum lru_list lru);
>  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>  						      struct zone *zone);
>  struct zone_reclaim_stat*
> --- linux.orig/mm/memcontrol.c	2009-08-15 13:07:34.000000000 +0800
> +++ linux/mm/memcontrol.c	2009-08-15 13:17:56.000000000 +0800
> @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
>  	 */
>  	struct list_head	lists[NR_LRU_LISTS];
>  	unsigned long		count[NR_LRU_LISTS];
> +	unsigned long		nr_saved_scan[NR_LRU_LISTS];
> 
>  	struct zone_reclaim_stat reclaim_stat;
>  };
> @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
>  	return MEM_CGROUP_ZSTAT(mz, lru);
>  }
> 
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> +					 struct zone *zone,
> +					 enum lru_list lru)
> +{
> +	int nid = zone->zone_pgdat->node_id;
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	return &mz->nr_saved_scan[lru];
> +}
> +
>  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>  						      struct zone *zone)
>  {
> --- linux.orig/mm/vmscan.c	2009-08-15 13:04:54.000000000 +0800
> +++ linux/mm/vmscan.c	2009-08-15 13:19:03.000000000 +0800
> @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
>  	for_each_evictable_lru(l) {
>  		int file = is_file_lru(l);
>  		unsigned long scan;
> +		unsigned long *saved_scan;
> 
>  		scan = zone_nr_pages(zone, sc, l);
>  		if (priority || noswap) {
> @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
>  			scan = (scan * percent[file]) / 100;
>  		}
>  		if (scanning_global_lru(sc))
> -			nr[l] = nr_scan_try_batch(scan,
> -						  &zone->lru[l].nr_saved_scan,
> -						  swap_cluster_max);
> +			saved_scan = &zone->lru[l].nr_saved_scan;
>  		else
> -			nr[l] = scan;
> +			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> +							       zone, l);
> +		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
>  	}
>

This might be a concern (although not a big one ATM), since we can't
afford to miss limits by much. If a cgroup is near its limit and we
skip scanning it, we'll have to work out what that means for the end
user. Maybe a more fundamental look at the priority-based logic that
decides how much to scan is required, I don't know.
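
For reference, the batching helper the patch routes the memcg counters
through behaves roughly like this (a sketch of the current
nr_scan_try_batch(); the exact code may differ):

	static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
					       unsigned long *nr_saved_scan,
					       unsigned long swap_cluster_max)
	{
		unsigned long nr;

		/* accumulate small scan requests across calls... */
		*nr_saved_scan += nr_to_scan;
		nr = *nr_saved_scan;

		/* ...and only release them as one full batch */
		if (nr >= swap_cluster_max)
			*nr_saved_scan = 0;
		else
			nr = 0;

		return nr;
	}

So a stream of nr_to_scan=1 requests returns 0 until a whole
swap_cluster_max worth has accumulated, which is exactly what the
patch extends from the global zones to mem cgroups.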

-- 
	Balbir

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  3:53                     ` Rik van Riel
@ 2009-08-16  5:15                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  5:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> Side question -
> >> 	Is there a good reason for this to be in shrink_active_list()
> >> as opposed to __isolate_lru_page?
> >>
> >> 		if (unlikely(!page_evictable(page, NULL))) {
> >> 			putback_lru_page(page);
> >> 			continue;
> >> 		}
> >>
> >> Maybe we want to minimize the amount of code under the lru lock or
> >> avoid duplicate logic in the isolate_page functions.
> > 
> > I guess the quick test means to avoid the expensive page_referenced()
> > call that follows it. But that should be mostly one shot cost - the
> > unevictable pages are unlikely to cycle in active/inactive list again
> > and again.
> 
> Please read what putback_lru_page does.
> 
> It moves the page onto the unevictable list, so that
> it will not end up in this scan again.

Yes it does. I said 'mostly' because there is a small hole where an
unevictable page may be scanned but still not moved to the unevictable
list: when a page is mapped in two places, the first pte has the
referenced bit set and the _second_ VMA has the VM_LOCKED bit set, so
page_referenced() will return 1 and shrink_page_list() will move it
onto the active list instead of the unevictable list. Shall we fix
this rare case?
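
In other words (pseudo-flow, not actual kernel code):

	/*
	 * page_referenced() walks the mappings of the page:
	 *   mapping 1: pte young      -> referenced++
	 *   mapping 2: vma VM_LOCKED  -> break early from the loop
	 *
	 * shrink_page_list() then sees referenced > 0 and re-activates
	 * the page, so it never reaches try_to_unmap(), which is where
	 * mlocked pages are normally culled to the unevictable list.
	 */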

> >> But if there are important mlock-heavy workloads, this could make the
> >> scan come up empty, or at least emptier than we might like.
> > 
> > Yes, if the above 'if' block is removed, the inactive lists might get
> > more expensive to reclaim.
> 
> Why?

Without the 'if' block, an unevictable page may well be deactivated
onto the inactive list (and some time later be moved from there to the
unevictable list), increasing the inactive list's scanned:reclaimed ratio.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  5:09                                     ` Balbir Singh
@ 2009-08-16  5:41                                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  5:41 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
	linux-mm

On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
> 
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > > 
> > > >> So even with the active list being a FIFO, we keep usage information
> > > >> gathered from the inactive list.  If we deactivate pages in arbitrary
> > > >> list intervals, we throw this away.
> > > > 
> > > > We do have the danger of FIFO, if inactive list is small enough, so
> > > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > > their life window in inactive list is too small to be useful.
> > > 
> > > This is one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > > 
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> > 
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> > 
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> > 
> 
> I think we may need to export some scanning data under DEBUG_VM
> to cross-verify.

Maybe we can add more general debugging code, but here is a quick patch
for examining the cgroup case. Note that even for the global zones,
max_scan may well not be a multiple of SWAP_CLUSTER_MAX, so
shrink_inactive_list() will scan a little more in its last loop iteration.

---
 mm/vmscan.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- linux.orig/mm/vmscan.c	2009-08-16 13:24:25.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-16 13:38:32.000000000 +0800
@@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int lumpy_reclaim = 0;
 
+	if (!scanning_global_lru(sc))
+		printk("shrink inactive %s count=%lu scan=%lu\n",
+		       file ? "file" : "anon",
+		       mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
+						LRU_INACTIVE_ANON + !!file),
+		       max_scan);
+
 	/*
 	 * If we need a large contiguous chunk of memory, or have
 	 * trouble getting a small set of contiguous pages, we
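
With this applied, output like the following (values hypothetical)
would confirm the theory:

	shrink inactive anon count=25 scan=32
	shrink inactive anon count=18 scan=32

i.e. a count that stays persistently below the scan batch would mean
deactivated pages are reclaimed before they can be referenced again -
the FIFO behavior.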

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  5:09                                     ` Balbir Singh
@ 2009-08-16  5:50                                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16  5:50 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
	linux-mm

On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
> 
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > > 
> > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> >  			scan = (scan * percent[file]) / 100;
> >  		}
> >  		if (scanning_global_lru(sc))
> > -			nr[l] = nr_scan_try_batch(scan,
> > -						  &zone->lru[l].nr_saved_scan,
> > -						  swap_cluster_max);
> > +			saved_scan = &zone->lru[l].nr_saved_scan;
> >  		else
> > -			nr[l] = scan;
> > +			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> > +							       zone, l);
> > +		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> >  	}
> >
> 
> This might be a concern (although not a big ATM), since we can't
> afford to miss limits by much. If a cgroup is near its limit and we
> drop scanning it. We'll have to work out what this means for the end
> user. May be more fundamental look through is required at the priority
> based logic of exposing how much to scan, I don't know.

I also had this worry at first, then dismissed it because page
reclaim should be driven by "pages reclaimed" rather than "pages
scanned". So when shrink_zone() decides to cancel one smallish scan,
it may well be called again and accumulate nr_saved_scan: for example,
with swap_cluster_max=32, thirty-two nr_to_scan=1 calls add up to one
full batch instead of being lost.

So it should only be a problem for a very small mem_cgroup (which may
be _fully_ scanned too many times in order to accumulate nr_saved_scan).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  4:55                       ` Wu Fengguang
@ 2009-08-16  5:59                         ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16  5:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

* Wu Fengguang <fengguang.wu@intel.com> [2009-08-16 12:55:22]:

> On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> > 
> > > Right, but I meant bursty page allocations and accesses on them, which
> > > can make a large continuous segment of referenced pages in LRU list,
> > > say 50MB.  They may or may not be valuable as a whole, however a local
> > > algorithm may keep the first 4MB and drop the remaining 46MB.
> > 
> > I wonder if the problem is that we simply do not keep a large
> > enough inactive list in Jeff's test.  If we do not, pages do
> > not have a chance to be referenced again before the reclaim
> > code comes in.
> 
> Exactly, that's the case I call the list FIFO.
> 
> > The cgroup stats should show how many active anon and inactive
> > anon pages there are in the cgroup.
> 
> Jeff, can you have a look at these stats? Thanks!

Another experiment would be to toy with memory.swappiness (although
defaults should work well). Could you compare the in-guest values of
nr_*active* with the cgroup values as seen by the host?
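
Something like the following would do for the comparison (assuming a
split-LRU guest kernel; the cgroup mount point below is setup-specific):

	# inside the guest
	grep -E 'nr_(in)?active_(anon|file)' /proc/vmstat

	# on the host, against the guest's cgroup
	grep -E '^(in)?active_(anon|file)' <cgroup mount>/<guest>/memory.stat

Keep in mind /proc/vmstat counts pages while memory.stat counts bytes.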

-- 
	Balbir

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  5:15                       ` Wu Fengguang
@ 2009-08-16 11:29                         ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 11:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> Side question -
> > >> 	Is there a good reason for this to be in shrink_active_list()
> > >> as opposed to __isolate_lru_page?
> > >>
> > >> 		if (unlikely(!page_evictable(page, NULL))) {
> > >> 			putback_lru_page(page);
> > >> 			continue;
> > >> 		}
> > >>
> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> avoid duplicate logic in the isolate_page functions.
> > > 
> > > I guess the quick test means to avoid the expensive page_referenced()
> > > call that follows it. But that should be mostly one shot cost - the
> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > and again.
> > 
> > Please read what putback_lru_page does.
> > 
> > It moves the page onto the unevictable list, so that
> > it will not end up in this scan again.
> 
> Yes it does. I said 'mostly' because there is a small hole that an
> unevictable page may be scanned but still not moved to unevictable
> list: when a page is mapped in two places, the first pte has the
> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> page_referenced() will return 1 and shrink_page_list() will move it
> into active list instead of unevictable list. Shall we fix this rare
> case?

How about this fix?

---
mm: stop circulating of referenced mlocked pages

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---

--- linux.orig/mm/rmap.c	2009-08-16 19:11:13.000000000 +0800
+++ linux/mm/rmap.c	2009-08-16 19:22:46.000000000 +0800
@@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
 		*mapcount = 1;	/* break early from loop */
+		*vm_flags |= VM_LOCKED;
 		goto out_unmap;
 	}
 
@@ -482,6 +483,8 @@ static int page_referenced_file(struct p
 	}
 
 	spin_unlock(&mapping->i_mmap_lock);
+	if (*vm_flags & VM_LOCKED)
+		referenced = 0;
 	return referenced;
 }
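
The intended effect of the two hunks together (my reading, to be
verified):

	/*
	 * page_referenced_one():  VM_LOCKED vma  -> *vm_flags |= VM_LOCKED
	 * page_referenced_file(): VM_LOCKED set  -> report referenced == 0
	 * shrink_page_list():     not referenced -> proceeds to unmap
	 * try_to_unmap():         returns SWAP_MLOCK for the mlocked page
	 * putback_lru_page():     moves it to the unevictable list at last
	 */

So a young pte in one mapping can no longer keep a page that is
mlocked through another mapping circulating on the active list.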
 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16 11:29                         ` Wu Fengguang
@ 2009-08-17 14:33                           ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-17 14:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Hi, Wu.

On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> > Wu Fengguang wrote:
>> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> > >> Side question -
>> > >>  Is there a good reason for this to be in shrink_active_list()
>> > >> as opposed to __isolate_lru_page?
>> > >>
>> > >>          if (unlikely(!page_evictable(page, NULL))) {
>> > >>                  putback_lru_page(page);
>> > >>                  continue;
>> > >>          }
>> > >>
>> > >> Maybe we want to minimize the amount of code under the lru lock or
>> > >> avoid duplicate logic in the isolate_page functions.
>> > >
>> > > I guess the quick test means to avoid the expensive page_referenced()
>> > > call that follows it. But that should be mostly one shot cost - the
>> > > unevictable pages are unlikely to cycle in active/inactive list again
>> > > and again.
>> >
>> > Please read what putback_lru_page does.
>> >
>> > It moves the page onto the unevictable list, so that
>> > it will not end up in this scan again.
>>
>> Yes it does. I said 'mostly' because there is a small hole that an
>> unevictable page may be scanned but still not moved to unevictable
>> list: when a page is mapped in two places, the first pte has the
>> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> page_referenced() will return 1 and shrink_page_list() will move it
>> into active list instead of unevictable list. Shall we fix this rare
>> case?

I think it's not a big deal.

As you mentioned, it's a rare case, so there would be only a few pages
on the active list instead of the unevictable list.
When the next scan comes around, we can try again to move those pages
onto the unevictable list.

As far as I know, mlocked pages already have some race conditions like
this; such pages get rescued in the same way.

>
> How about this fix?
>
> ---
> mm: stop circulating of referenced mlocked pages
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>
> --- linux.orig/mm/rmap.c        2009-08-16 19:11:13.000000000 +0800
> +++ linux/mm/rmap.c     2009-08-16 19:22:46.000000000 +0800
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
>         */
>        if (vma->vm_flags & VM_LOCKED) {
>                *mapcount = 1;  /* break early from loop */
> +               *vm_flags |= VM_LOCKED;
>                goto out_unmap;
>        }
>
> @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
>        }
>
>        spin_unlock(&mapping->i_mmap_lock);
> +       if (*vm_flags & VM_LOCKED)
> +               referenced = 0;
>        return referenced;
>  }
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-15  5:45                                   ` Wu Fengguang
@ 2009-08-17 18:04                                     ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-17 18:04 UTC (permalink / raw)
  To: Wu, Fengguang, Rik van Riel
  Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

> Jeff, can you confirm if the mem cgroup's inactive list is small?

Nope.  I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M).

The inactive mapped list is much smaller - 0 to ~700 pages.

The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped.

						Jeff


^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  4:55                       ` Wu Fengguang
@ 2009-08-17 19:47                         ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-17 19:47 UTC (permalink / raw)
  To: Wu, Fengguang, Rik van Riel
  Cc: Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
	Mel Gorman, LKML, linux-mm

> Jeff, can you have a look at these stats? Thanks!

Yeah, I just did, after adding some tracing which dumped out the same data.  It looks pretty much the same.  Inactive anon and active anon are pretty similar.  Inactive file and active file are smaller and fluctuate more, but don't look horribly unbalanced.

Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening.

					Jeff
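
(Columns below: inactive_anon, active_anon, inactive_file, active_file,
in bytes.)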




114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784

# Fire up instance

20480 4403200 2699264 647168
20480 4411392 2740224 647168
20480 4411392 2740224 647168
20480 11436032 3289088 651264
20480 11587584 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 25387008 4263936 872448
20480 25387008 4263936 872448
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25411584 4198400 937984
20480 25411584 4198400 937984
20480 40665088 7573504 946176
20480 43606016 7573504 946176
20480 45346816 7581696 946176
20480 45752320 7581696 946176
20480 46575616 7581696 946176
20480 46682112 7581696 946176
20480 48975872 9920512 1073152
20480 64536576 38457344 1826816

# Booted, X is starting
# Run a browser and editor, then shut them down and halt the instance

10964992 72454144 47714304 3067904
16797696 71151616 42893312 3244032
16797696 73035776 41037824 3272704
16797696 73547776 40525824 3272704
16797696 73609216 40402944 3276800
16797696 73719808 40337408 3289088
16797696 73920512 40079360 3289088
16797696 78016512 36036608 3297280
16797696 78016512 36036608 3297280
16797696 80203776 33755136 3387392
16797696 86904832 26972160 3526656
16797696 93523968 19927040 3837952
29011968 90546176 10276864 4308992
45670400 83685376 991232 3854336
66400256 66416640 368640 933888
66715648 66654208 376832 471040
64811008 64802816 3416064 1114112
65236992 65085440 2535424 1228800
65212416 65011712 2519040 1343488
64626688 64610304 3534848 1429504
63807488 63758336 4829184 1695744
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
64045056 63946752 4440064 1757184
64069632 63946752 4440064 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64147456 64016384 4222976 1757184
64638976 64733184 2801664 1900544
65208320 65605632 1310720 1892352
64946176 65863680 1335296 1998848
62701568 68599808 843776 1945600
66068480 66023424 778240 978944
65568768 65511424 2093056 1044480
66183168 66056192 966656 1011712
66478080 66555904 241664 864256
66912256 66899968 135168 270336
66646016 66539520 577536 262144
66134016 66179072 1421312 319488
66125824 66183168 1273856 380928
66330624 66445312 933888 475136
65970176 65966080 1581056 548864
66158592 66158592 1175552 708608
65781760 66084864 1503232 806912
66084864 66048000 1118208 843776
66420736 66449408 376832 851968
66351104 66138112 757760 864256
66285568 66138112 921600 868352
65945600 65847296 1495040 888832
66002944 65839104 1363968 888832
66002944 65839104 1363968 888832
66039808 65839104 1363968 888832
66043904 65839104 1363968 888832
66043904 65839104 1372160 888832
65523712 65490944 2224128 929792
66031616 66297856 827392 946176
64913408 68141056 188416 933888
64770048 68325376 73728 917504
65216512 67932160 81920 909312
65470464 67678208 81920 909312
65036288 67973120 356352 790528
63492096 69877760 110592 647168
63111168 70508544 73728 413696
66895872 66883584 16384 348160
66650112 67203072 20480 344064
66830336 67002368 28672 335872
66785280 67002368 32768 331776
67084288 66736128 45056 331776
67104768 66736128 45056 331776
66916352 66801664 45056 331776
66883584 66863104 45056 331776
66883584 66863104 45056 331776
66891776 66863104 45056 331776
66899968 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66715648 66641920 385024 339968
66617344 66629632 237568 364544
66633728 66629632 237568 364544
66641920 66629632 237568 364544
66641920 66629632 237568 364544
66662400 66461696 589824 389120
66588672 66527232 659456 389120
66252800 66252800 1105920 413696
66277376 66297856 983040 413696
66498560 66285568 884736 413696
66129920 66183168 1163264 442368
66138112 66183168 1163264 442368
66256896 66465792 921600 442368
66560000 66465792 606208 442368
66662400 66445312 589824 446464
66560000 66490368 708608 446464
66629632 66441216 577536 446464
66711552 66441216 520192 462848
66617344 66531328 577536 466944
66412544 66605056 606208 466944
66605056 66637824 475136 471040
66297856 67018752 344064 483328
65863680 67588096 212992 483328
66334720 67129344 159744 516096
65204224 68337664 159744 516096
61399040 72212480 98304 499712
62427136 71114752 98304 499712
62775296 70680576 155648 503808
62595072 70209536 823296 565248
66490368 66576384 458752 577536
66818048 66625536 196608 577536
66699264 66793472 135168 507904
66707456 66736128 151552 512000
66314240 67293184 53248 483328
56832000 76705792 180224 495616
59273216 73998336 368640 499712
59351040 73928704 368640 499712
59400192 73928704 368640 499712
59678720 73478144 241664 495616
59875328 73531392 241664 495616
60690432 72830976 110592 503808
61919232 71634944 110592 503808
65523712 68096000 49152 483328
66748416 66801664 102400 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
66285568 66195456 1093632 520192
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66060288 65994752 1548288 557056

# Instance halted

118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-17 19:47                         ` Dike, Jeffrey G
  0 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-17 19:47 UTC (permalink / raw)
  To: Wu, Fengguang, Rik van Riel
  Cc: Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
	Mel Gorman, LKML, linux-mm

> Jeff, can you have a look at these stats? Thanks!

Yeah, I just did after adding some tracing which dumped out the same data.  It looks pretty much the same.  Inactive anon and active anon are pretty similar.  Inactive file and active file are smaller and fluctuate more, but doesn't look horribly unbalanced.

Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening.

					Jeff




114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784

# Fire up instance

20480 4403200 2699264 647168
20480 4411392 2740224 647168
20480 4411392 2740224 647168
20480 11436032 3289088 651264
20480 11587584 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 25387008 4263936 872448
20480 25387008 4263936 872448
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25411584 4198400 937984
20480 25411584 4198400 937984
20480 40665088 7573504 946176
20480 43606016 7573504 946176
20480 45346816 7581696 946176
20480 45752320 7581696 946176
20480 46575616 7581696 946176
20480 46682112 7581696 946176
20480 48975872 9920512 1073152
20480 64536576 38457344 1826816

# Booted, X is starting
# Run a browser and editor, then shut them down and halt the instance

10964992 72454144 47714304 3067904
16797696 71151616 42893312 3244032
16797696 73035776 41037824 3272704
16797696 73547776 40525824 3272704
16797696 73609216 40402944 3276800
16797696 73719808 40337408 3289088
16797696 73920512 40079360 3289088
16797696 78016512 36036608 3297280
16797696 78016512 36036608 3297280
16797696 80203776 33755136 3387392
16797696 86904832 26972160 3526656
16797696 93523968 19927040 3837952
29011968 90546176 10276864 4308992
45670400 83685376 991232 3854336
66400256 66416640 368640 933888
66715648 66654208 376832 471040
64811008 64802816 3416064 1114112
65236992 65085440 2535424 1228800
65212416 65011712 2519040 1343488
64626688 64610304 3534848 1429504
63807488 63758336 4829184 1695744
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
64045056 63946752 4440064 1757184
64069632 63946752 4440064 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64147456 64016384 4222976 1757184
64638976 64733184 2801664 1900544
65208320 65605632 1310720 1892352
64946176 65863680 1335296 1998848
62701568 68599808 843776 1945600
66068480 66023424 778240 978944
65568768 65511424 2093056 1044480
66183168 66056192 966656 1011712
66478080 66555904 241664 864256
66912256 66899968 135168 270336
66646016 66539520 577536 262144
66134016 66179072 1421312 319488
66125824 66183168 1273856 380928
66330624 66445312 933888 475136
65970176 65966080 1581056 548864
66158592 66158592 1175552 708608
65781760 66084864 1503232 806912
66084864 66048000 1118208 843776
66420736 66449408 376832 851968
66351104 66138112 757760 864256
66285568 66138112 921600 868352
65945600 65847296 1495040 888832
66002944 65839104 1363968 888832
66002944 65839104 1363968 888832
66039808 65839104 1363968 888832
66043904 65839104 1363968 888832
66043904 65839104 1372160 888832
65523712 65490944 2224128 929792
66031616 66297856 827392 946176
64913408 68141056 188416 933888
64770048 68325376 73728 917504
65216512 67932160 81920 909312
65470464 67678208 81920 909312
65036288 67973120 356352 790528
63492096 69877760 110592 647168
63111168 70508544 73728 413696
66895872 66883584 16384 348160
66650112 67203072 20480 344064
66830336 67002368 28672 335872
66785280 67002368 32768 331776
67084288 66736128 45056 331776
67104768 66736128 45056 331776
66916352 66801664 45056 331776
66883584 66863104 45056 331776
66883584 66863104 45056 331776
66891776 66863104 45056 331776
66899968 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66715648 66641920 385024 339968
66617344 66629632 237568 364544
66633728 66629632 237568 364544
66641920 66629632 237568 364544
66641920 66629632 237568 364544
66662400 66461696 589824 389120
66588672 66527232 659456 389120
66252800 66252800 1105920 413696
66277376 66297856 983040 413696
66498560 66285568 884736 413696
66129920 66183168 1163264 442368
66138112 66183168 1163264 442368
66256896 66465792 921600 442368
66560000 66465792 606208 442368
66662400 66445312 589824 446464
66560000 66490368 708608 446464
66629632 66441216 577536 446464
66711552 66441216 520192 462848
66617344 66531328 577536 466944
66412544 66605056 606208 466944
66605056 66637824 475136 471040
66297856 67018752 344064 483328
65863680 67588096 212992 483328
66334720 67129344 159744 516096
65204224 68337664 159744 516096
61399040 72212480 98304 499712
62427136 71114752 98304 499712
62775296 70680576 155648 503808
62595072 70209536 823296 565248
66490368 66576384 458752 577536
66818048 66625536 196608 577536
66699264 66793472 135168 507904
66707456 66736128 151552 512000
66314240 67293184 53248 483328
56832000 76705792 180224 495616
59273216 73998336 368640 499712
59351040 73928704 368640 499712
59400192 73928704 368640 499712
59678720 73478144 241664 495616
59875328 73531392 241664 495616
60690432 72830976 110592 503808
61919232 71634944 110592 503808
65523712 68096000 49152 483328
66748416 66801664 102400 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
66285568 66195456 1093632 520192
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66060288 65994752 1548288 557056

# Instance halted

118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-17 18:04                                     ` Dike, Jeffrey G
@ 2009-08-18  2:26                                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18  2:26 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 02:04:46AM +0800, Dike, Jeffrey G wrote:
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> 
> Nope.  I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M).
>
> The inactive mapped list is much smaller - 0 to ~700 pages.
> 
> The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped.

The anon inactive list is "over scanned".  Take a 16k-page list for
example: with DEF_PRIORITY=12, (16k >> 12) = 4.  So when shrink_zone()
asks to scan 4 pages of the active/inactive list, SWAP_CLUSTER_MAX=32
pages will be scanned in effect.

This triggers the background aging of the active anon list, because
inactive_anon_is_low() is found to be true; that aging keeps the
active:inactive ratio in balance.

So the anon inactive list is over scanned => the anon active list is
over scanned => the anon lists are over scanned relative to the file
lists. (The inactive file list may or may not be over scanned,
depending on whether its size is larger or smaller than (1<<prio)
pages.)
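
To make the arithmetic concrete, here is a minimal userspace sketch
(my illustration, not the kernel code) of the effect: the scan work is
done in SWAP_CLUSTER_MAX batches, so a smaller per-priority target
still costs one full batch.

#include <stdio.h>

#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32

int main(void)
{
	unsigned long list_pages = 16 * 1024;		/* 16k-page list */
	unsigned long want = list_pages >> DEF_PRIORITY;	/* = 4 */
	unsigned long scanned = want < SWAP_CLUSTER_MAX ?
					SWAP_CLUSTER_MAX : want;

	printf("want %lu, scanned %lu => %lux over-scan\n",
	       want, scanned, scanned / want);
	return 0;
}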

Anyway, this is not how vmscan is expected to work, and batching up
the cgroup vmscan could get rid of the mess.
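
For example, a sketch of the batching idea (my illustration; the
global lists already do something similar with the zone->lru[].nr_scan
accumulator): carry the tiny per-priority targets over between passes
and only scan once a full batch has accumulated.

#include <stdio.h>

#define SWAP_CLUSTER_MAX	32

static unsigned long nr_saved_scan;	/* accumulated, not yet scanned */

static unsigned long batched_scan(unsigned long want)
{
	nr_saved_scan += want;
	if (nr_saved_scan >= SWAP_CLUSTER_MAX) {
		unsigned long scan = nr_saved_scan;

		nr_saved_scan = 0;
		return scan;		/* do one honest batch now */
	}
	return 0;			/* too little work, defer it */
}

int main(void)
{
	int pass;

	/* eight passes asking for 4 pages each add up to one 32-page
	 * batch, instead of eight 32-page over-scans */
	for (pass = 1; pass <= 8; pass++)
		printf("pass %d: scan %lu pages\n", pass, batched_scan(4));
	return 0;
}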

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-17 14:33                           ` Minchan Kim
@ 2009-08-18  2:34                             ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18  2:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Minchan,

On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> >> > Wu Fengguang wrote:
> >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> > >> Side question -
> >> > >>  Is there a good reason for this to be in shrink_active_list()
> >> > >> as opposed to __isolate_lru_page?
> >> > >>
> >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> >> > >>                  putback_lru_page(page);
> >> > >>                  continue;
> >> > >>          }
> >> > >>
> >> > >> Maybe we want to minimize the amount of code under the lru lock or
> >> > >> avoid duplicate logic in the isolate_page functions.
> >> > >
> >> > > I guess the quick test means to avoid the expensive page_referenced()
> >> > > call that follows it. But that should be mostly one shot cost - the
> >> > > unevictable pages are unlikely to cycle in active/inactive list again
> >> > > and again.
> >> >
> >> > Please read what putback_lru_page does.
> >> >
> >> > It moves the page onto the unevictable list, so that
> >> > it will not end up in this scan again.
> >>
> >> Yes it does. I said 'mostly' because there is a small hole that an
> >> unevictable page may be scanned but still not moved to unevictable
> >> list: when a page is mapped in two places, the first pte has the
> >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> >> page_referenced() will return 1 and shrink_page_list() will move it
> >> into active list instead of unevictable list. Shall we fix this rare
> >> case?
> 
> I think it's not a big deal.

Maybe, otherwise I should have brought this issue up long before :)

> As you mentioned, it's rare case so there would be few pages in active
> list instead of unevictable list.

Yes.

> When next time to scan comes, we can try to move the pages into
> unevictable list, again.

Will PG_mlocked be set by then? Otherwise the situation is not likely
to change, and the VM_LOCKED pages may circulate on the active/inactive
lists countless times.
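
To illustrate, here is a tiny userspace model (my simplification, not
the rmap code itself) of the hole and of what the patch quoted below
changes: a VM_LOCKED vma vetoes the referenced count, so the page can
fall through to try_to_unmap() and be parked on the unevictable list.

#include <stdio.h>

struct vma_model {	/* hypothetical stand-in for vm_area_struct */
	int referenced;	/* a referenced pte was found in this vma */
	int vm_locked;	/* the vma has VM_LOCKED set */
};

/* before the patch: any referenced pte re-activates the page */
static int page_referenced_old(const struct vma_model *v, int n)
{
	int referenced = 0, i;

	for (i = 0; i < n; i++)
		referenced += v[i].referenced;
	return referenced;
}

/* after the patch: a VM_LOCKED vma discards the referenced count */
static int page_referenced_new(const struct vma_model *v, int n)
{
	int referenced = 0, locked = 0, i;

	for (i = 0; i < n; i++) {
		referenced += v[i].referenced;
		locked |= v[i].vm_locked;
	}
	return locked ? 0 : referenced;
}

int main(void)
{
	/* the rare case above: first pte referenced, second vma locked */
	struct vma_model v[2] = { { 1, 0 }, { 0, 1 } };

	printf("old: %d (re-activated), new: %d (can go unevictable)\n",
	       page_referenced_old(v, 2), page_referenced_new(v, 2));
	return 0;
}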

> As I know about mlock pages, we already had some race conditions.
> They will be rescued as above.

Thanks,
Fengguang

> >
> > How about this fix?
> >
> > ---
> > mm: stop circulating referenced mlocked pages
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >
> > --- linux.orig/mm/rmap.c        2009-08-16 19:11:13.000000000 +0800
> > +++ linux/mm/rmap.c     2009-08-16 19:22:46.000000000 +0800
> > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> >         */
> >        if (vma->vm_flags & VM_LOCKED) {
> >                *mapcount = 1;  /* break early from loop */
> > +               *vm_flags |= VM_LOCKED;
> >                goto out_unmap;
> >        }
> >
> > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> >        }
> >
> >        spin_unlock(&mapping->i_mmap_lock);
> > +       if (*vm_flags & VM_LOCKED)
> > +               referenced = 0;
> >        return referenced;
> >  }
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18  2:34                             ` Wu Fengguang
@ 2009-08-18  4:17                               ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18  4:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
	LKML, linux-mm

On Tue, 18 Aug 2009 10:34:38 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Minchan,
> 
> On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > >> > Wu Fengguang wrote:
> > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> > >> Side question -
> > >> > >>  Is there a good reason for this to be in shrink_active_list()
> > >> > >> as opposed to __isolate_lru_page?
> > >> > >>
> > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> > >> > >>                  putback_lru_page(page);
> > >> > >>                  continue;
> > >> > >>          }
> > >> > >>
> > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> > >> avoid duplicate logic in the isolate_page functions.
> > >> > >
> > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > >> > > call that follows it. But that should be mostly one shot cost - the
> > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > >> > > and again.
> > >> >
> > >> > Please read what putback_lru_page does.
> > >> >
> > >> > It moves the page onto the unevictable list, so that
> > >> > it will not end up in this scan again.
> > >>
> > >> Yes it does. I said 'mostly' because there is a small hole that an
> > >> unevictable page may be scanned but still not moved to unevictable
> > >> list: when a page is mapped in two places, the first pte has the
> > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > >> page_referenced() will return 1 and shrink_page_list() will move it
> > >> into active list instead of unevictable list. Shall we fix this rare
> > >> case?
> > 
> > I think it's not a big deal.
> 
> Maybe, otherwise I should have brought this issue up long before :)
> 
> > As you mentioned, it's rare case so there would be few pages in active
> > list instead of unevictable list.
> 
> Yes.
> 
> > When next time to scan comes, we can try to move the pages into
> > unevictable list, again.
> 
> Will PG_mlocked be set by then? Otherwise the situation is not likely
> to change, and the VM_LOCKED pages may circulate on the active/inactive
> lists countless times.

PG_mlocked is not important in that case.
The important thing is the VM_LOCKED vma.
I think the annotation below can help you understand my point. :)

----

/*
 * called from munlock()/munmap() path with page supposedly on the LRU.
 *
 * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
 * [in try_to_munlock()] and then attempt to isolate the page.  We must
 * isolate the page to keep others from messing with its unevictable
 * and mlocked state while trying to munlock.  However, we pre-clear the
 * mlocked state anyway as we might lose the isolation race and we might
 * not get another chance to clear PageMlocked.  If we successfully
 * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
 * mapping the page, it will restore the PageMlocked state, unless the page
 * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
 * perhaps redundantly.
 * If we lose the isolation race, and the page is mapped by other VM_LOCKED
 * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
 * either of which will restore the PageMlocked state by calling
 * mlock_vma_page() above, if it can grab the vma's mmap sem.
 */
static void munlock_vma_page(struct page *page)
{
...

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18  4:17                               ` Minchan Kim
@ 2009-08-18  9:31                                 ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18  9:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> On Tue, 18 Aug 2009 10:34:38 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Minchan,
> > 
> > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > >> > Wu Fengguang wrote:
> > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > >> > >> Side question -
> > > >> > >>  Is there a good reason for this to be in shrink_active_list()
> > > >> > >> as opposed to __isolate_lru_page?
> > > >> > >>
> > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> > > >> > >>                  putback_lru_page(page);
> > > >> > >>                  continue;
> > > >> > >>          }
> > > >> > >>
> > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > >> > >
> > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > >> > > and again.
> > > >> >
> > > >> > Please read what putback_lru_page does.
> > > >> >
> > > >> > It moves the page onto the unevictable list, so that
> > > >> > it will not end up in this scan again.
> > > >>
> > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > >> unevictable page may be scanned but still not moved to unevictable
> > > >> list: when a page is mapped in two places, the first pte has the
> > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > >> into active list instead of unevictable list. Shall we fix this rare
> > > >> case?
> > > 
> > > I think it's not a big deal.
> > 
> > Maybe, otherwise I should have brought this issue up long before :)
> > 
> > > As you mentioned, it's rare case so there would be few pages in active
> > > list instead of unevictable list.
> > 
> > Yes.
> > 
> > > When next time to scan comes, we can try to move the pages into
> > > unevictable list, again.
> > 
> > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > to change, and the VM_LOCKED pages may circulate on the active/inactive
> > lists countless times.
> 
> PG_mlocked is not important in that case.
> The important thing is the VM_LOCKED vma.
> I think the annotation below can help you understand my point. :)

Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
PG_mlocked set, and so will be caught by page_evictable(). Is that so?
Then I was worrying about a non-existent problem. Sorry for the confusion!

Thanks,
Fengguang

> ----
> 
> /*
>  * called from munlock()/munmap() path with page supposedly on the LRU.
>  *
>  * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
>  * [in try_to_munlock()] and then attempt to isolate the page.  We must
>  * isolate the page to keep others from messing with its unevictable
>  * and mlocked state while trying to munlock.  However, we pre-clear the
>  * mlocked state anyway as we might lose the isolation race and we might
>  * not get another chance to clear PageMlocked.  If we successfully
>  * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
>  * mapping the page, it will restore the PageMlocked state, unless the page
>  * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
>  * perhaps redundantly.
>  * If we lose the isolation race, and the page is mapped by other VM_LOCKED
>  * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
>  * either of which will restore the PageMlocked state by calling
>  * mlock_vma_page() above, if it can grab the vma's mmap sem.
>  */
> static void munlock_vma_page(struct page *page)
> {
> ...
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18  9:31                                 ` Wu Fengguang
@ 2009-08-18  9:52                                   ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18  9:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
	LKML, linux-mm

On Tue, 18 Aug 2009 17:31:19 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > On Tue, 18 Aug 2009 10:34:38 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Minchan,
> > > 
> > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > > >> > Wu Fengguang wrote:
> > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > > >> > >> Side question -
> > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
> > > > >> > >> as opposed to __isolate_lru_page?
> > > > >> > >>
> > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> > > > >> > >>                  putback_lru_page(page);
> > > > >> > >>                  continue;
> > > > >> > >>          }
> > > > >> > >>
> > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > > >> > >
> > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > > >> > > and again.
> > > > >> >
> > > > >> > Please read what putback_lru_page does.
> > > > >> >
> > > > >> > It moves the page onto the unevictable list, so that
> > > > >> > it will not end up in this scan again.
> > > > >>
> > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > > >> unevictable page may be scanned but still not moved to unevictable
> > > > >> list: when a page is mapped in two places, the first pte has the
> > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > > >> into active list instead of unevictable list. Shall we fix this rare
> > > > >> case?
> > > > 
> > > > I think it's not a big deal.
> > > 
> > > Maybe, otherwise I should have brought this issue up long before :)
> > > 
> > > > As you mentioned, it's rare case so there would be few pages in active
> > > > list instead of unevictable list.
> > > 
> > > Yes.
> > > 
> > > > When next time to scan comes, we can try to move the pages into
> > > > unevictable list, again.
> > > 
> > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > > to change, and the VM_LOCKED pages may circulate on the active/inactive
> > > lists countless times.
> > 
> > PG_mlocked is not important in that case.
> > The important thing is the VM_LOCKED vma.
> > I think the annotation below can help you understand my point. :)
> 
> Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> PG_mlocked set, and so will be caught by page_evictable(). Is that so?

No. I am sorry for not making my point clear.
I meant the following:
the next time the page is scanned,

shrink_page_list 
-> try_to_unmap
	-> try_to_unmap_xxx
		-> if (vma->vm_flags & VM_LOCKED)
		-> try_to_mlock_page
			-> TestSetPageMlocked
			-> putback_lru_page

So in the end, the page will land on the unevictable list.

> Then I was worrying about a null problem. Sorry for the confusion!
> 
> Thanks,
> Fengguang
> 
> > ----
> > 
> > /*
> >  * called from munlock()/munmap() path with page supposedly on the LRU.
> >  *
> >  * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
> >  * [in try_to_munlock()] and then attempt to isolate the page.  We must
> >  * isolate the page to keep others from messing with its unevictable
> >  * and mlocked state while trying to munlock.  However, we pre-clear the
> >  * mlocked state anyway as we might lose the isolation race and we might
> >  * not get another chance to clear PageMlocked.  If we successfully
> >  * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> >  * mapping the page, it will restore the PageMlocked state, unless the page
> >  * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
> >  * perhaps redundantly.
> >  * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> >  * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> >  * either of which will restore the PageMlocked state by calling
> >  * mlock_vma_page() above, if it can grab the vma's mmap sem.
> >  */
> > static void munlock_vma_page(struct page *page)
> > {
> > ...
> > 
> > -- 
> > Kind regards,
> > Minchan Kim


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18  9:52                                   ` Minchan Kim
@ 2009-08-18 10:00                                     ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 10:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> On Tue, 18 Aug 2009 17:31:19 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > > On Tue, 18 Aug 2009 10:34:38 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Minchan,
> > > > 
> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > > > >> > Wu Fengguang wrote:
> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > > > >> > >> Side question -
> > > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
> > > > > >> > >> as opposed to __isolate_lru_page?
> > > > > >> > >>
> > > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> > > > > >> > >>                  putback_lru_page(page);
> > > > > >> > >>                  continue;
> > > > > >> > >>          }
> > > > > >> > >>
> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > > > >> > >
> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > > > >> > > and again.
> > > > > >> >
> > > > > >> > Please read what putback_lru_page does.
> > > > > >> >
> > > > > >> > It moves the page onto the unevictable list, so that
> > > > > >> > it will not end up in this scan again.
> > > > > >>
> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > > > >> unevictable page may be scanned but still not moved to unevictable
> > > > > >> list: when a page is mapped in two places, the first pte has the
> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> > > > > >> case?
> > > > > 
> > > > > I think it's not a big deal.
> > > > 
> > > > Maybe, otherwise I should have brought this issue up long before :)
> > > > 
> > > > > As you mentioned, it's rare case so there would be few pages in active
> > > > > list instead of unevictable list.
> > > > 
> > > > Yes.
> > > > 
> > > > > When next time to scan comes, we can try to move the pages into
> > > > > unevictable list, again.
> > > > 
> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > > > to change, and the VM_LOCKED pages may circulate on the active/inactive
> > > > lists countless times.
> > > 
> > > PG_mlocked is not important in that case.
> > > The important thing is the VM_LOCKED vma.
> > > I think the annotation below can help you understand my point. :)
> > 
> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> > PG_mlocked set, and so will be caught by page_evictable(). Is that so?
> 
> No. I am sorry for not making my point clear.
> I meant the following:
> the next time the page is scanned,
> 
> shrink_page_list 
  -> 
                referenced = page_referenced(page, 1,
                                                sc->mem_cgroup, &vm_flags);
                /* In active use or really unfreeable?  Activate it. */
                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
                                        referenced && page_mapping_inuse(page))
                        goto activate_locked;
  
> -> try_to_unmap
     ~~~~~~~~~~~~ this line won't be reached if the page is found to be
     referenced in the lines above?
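
(A tiny userspace model of that ordering - my sketch, not the kernel
source - to show why: VM_LOCKED is only consulted when the page was
not already found referenced.)

#include <stdbool.h>
#include <stdio.h>

enum action { ACTIVATE, UNEVICTABLE, RECLAIM };

/* models the shrink_page_list() ordering quoted above: the
 * page_referenced() test runs before try_to_unmap(), so a
 * referenced page never reaches the VM_LOCKED rescue */
static enum action scan_page(bool referenced, bool vm_locked)
{
	if (referenced)		/* goto activate_locked */
		return ACTIVATE;
	if (vm_locked)		/* try_to_unmap() -> unevictable list */
		return UNEVICTABLE;
	return RECLAIM;
}

int main(void)
{
	/* a referenced page in a VM_LOCKED vma bounces back to the
	 * active list and never reaches the unevictable list */
	printf("%s\n", scan_page(true, true) == ACTIVATE ?
			"ACTIVATE" : "other");
	return 0;
}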

Thanks,
Fengguang

> 	-> try_to_unmap_xxx
> 		-> if (vma->vm_flags & VM_LOCKED)
> 		-> try_to_mlock_page
> 			-> TestSetPageMlocked
> 			-> putback_lru_page
> 
> So in the end, the page will land on the unevictable list.
> 
> > Then I was worrying about a null problem. Sorry for the confusion!
> > 
> > Thanks,
> > Fengguang
> > 
> > > ----
> > > 
> > > /*
> > >  * called from munlock()/munmap() path with page supposedly on the LRU.
> > >  *
> > >  * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
> > >  * [in try_to_munlock()] and then attempt to isolate the page.  We must
> > >  * isolate the page to keep others from messing with its unevictable
> > >  * and mlocked state while trying to munlock.  However, we pre-clear the
> > >  * mlocked state anyway as we might lose the isolation race and we might
> > >  * not get another chance to clear PageMlocked.  If we successfully
> > >  * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> > >  * mapping the page, it will restore the PageMlocked state, unless the page
> > >  * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
> > >  * perhaps redundantly.
> > >  * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> > >  * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> > >  * either of which will restore the PageMlocked state by calling
> > >  * mlock_vma_page() above, if it can grab the vma's mmap sem.
> > >  */
> > > static void munlock_vma_page(struct page *page)
> > > {
> > > ...
> > > 
> > > -- 
> > > Kind regards,
> > > Minchan Kim
> 
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 10:00                                     ` Wu Fengguang
@ 2009-08-18 11:00                                       ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 11:00 UTC (permalink / raw)
  To: Wu Fengguang, Lee Schermerhorn
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
>> On Tue, 18 Aug 2009 17:31:19 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
>> > > On Tue, 18 Aug 2009 10:34:38 +0800
>> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > >
>> > > > Minchan,
>> > > >
>> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
>> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> > > > > >> > Wu Fengguang wrote:
>> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> > > > > >> > >> Side question -
>> > > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
>> > > > > >> > >> as opposed to __isolate_lru_page?
>> > > > > >> > >>
>> > > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
>> > > > > >> > >>                  putback_lru_page(page);
>> > > > > >> > >>                  continue;
>> > > > > >> > >>          }
>> > > > > >> > >>
>> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
>> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
>> > > > > >> > >
>> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
>> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
>> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
>> > > > > >> > > and again.
>> > > > > >> >
>> > > > > >> > Please read what putback_lru_page does.
>> > > > > >> >
>> > > > > >> > It moves the page onto the unevictable list, so that
>> > > > > >> > it will not end up in this scan again.
>> > > > > >>
>> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
>> > > > > >> unevictable page may be scanned but still not moved to unevictable
>> > > > > >> list: when a page is mapped in two places, the first pte has the
>> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
>> > > > > >> into active list instead of unevictable list. Shall we fix this rare
>> > > > > >> case?
>> > > > >
>> > > > > I think it's not a big deal.
>> > > >
>> > > > Maybe; otherwise I should have brought this issue up long before :)
>> > > >
>> > > > > As you mentioned, it's a rare case, so there would be few pages in the
>> > > > > active list instead of the unevictable list.
>> > > >
>> > > > Yes.
>> > > >
>> > > > > When the next scan comes, we can try to move the pages into the
>> > > > > unevictable list again.
>> > > >
>> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
>> > > > to change, and the VM_LOCKED pages may circulate in the active/inactive
>> > > > lists countless times.
>> > >
>> > > PG_mlocked is not important in that case.
>> > > The important thing is the VM_LOCKED vma.
>> > > I think the annotation below can help you understand my point. :)
>> >
>> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
>> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
>>
>> No. I am sorry for not making my point clear.
>> I meant the following.
>> When the next scan happens:
>>
>> shrink_page_list
>  ->
>                referenced = page_referenced(page, 1,
>                                                sc->mem_cgroup, &vm_flags);
>                /* In active use or really unfreeable?  Activate it. */
>                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
>                                        referenced && page_mapping_inuse(page))
>                        goto activate_locked;
>
>> -> try_to_unmap
>     ~~~~~~~~~~~~ this line won't be reached if the page is found to be
>     referenced in the lines above?

Indeed! In fact, I was worried about that.
It looks like a livelock problem.
But I think it's a very small race window, so there hasn't been any report until now.
Let's Cc Lee.

If we have to fix it, how about this?
This version has less overhead than yours, since
shrink_page_list is called less often than page_referenced.

diff --git a/mm/rmap.c b/mm/rmap.c
index ed63894..283266c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
         */
        if (vma->vm_flags & VM_LOCKED) {
                *mapcount = 1;  /* break early from loop */
+               *vm_flags |= VM_LOCKED;
                goto out_unmap;
        }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d224b28..d156e1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
                                                sc->mem_cgroup, &vm_flags);
                /* In active use or really unfreeable?  Activate it. */
                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-                                       referenced && page_mapping_inuse(page))
+                                       referenced && page_mapping_inuse(page)
+                                       && !(vm_flags & VM_LOCKED))
                        goto activate_locked;
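
For clarity, here is a sketch of how the shrink_page_list() check would read
with both hunks applied (illustrative only, using the 2.6.29-era names quoted
in this thread, not a literal copy of the patched file):

	referenced = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
	/* In active use or really unfreeable?  Activate it. */
	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
			referenced && page_mapping_inuse(page) &&
			!(vm_flags & VM_LOCKED))
		goto activate_locked;
	/*
	 * A page referenced only through a VM_LOCKED vma now falls
	 * through to try_to_unmap(), which, per the call chain quoted
	 * earlier, sets PG_mlocked and ends in putback_lru_page(), so
	 * the page lands on the unevictable list.
	 */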




>
> Thanks,
> Fengguang
>
>>       -> try_to_unmap_xxx
>>               -> if (vma->vm_flags & VM_LOCKED)
>>               -> try_to_mlock_page
>>                       -> TestSetPageMlocked
>>                       -> putback_lru_page
>>
>> So in the end, the page will be placed on the unevictable list.
>>
>> > Then I was worrying about a null problem. Sorry for the confusion!
>> >
>> > Thanks,
>> > Fengguang
>> >
>> > > ----
>> > >
>> > > /*
>> > >  * called from munlock()/munmap() path with page supposedly on the LRU.
>> > >  *
>> > >  * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
>> > >  * [in try_to_munlock()] and then attempt to isolate the page.  We must
>> > >  * isolate the page to keep others from messing with its unevictable
>> > >  * and mlocked state while trying to munlock.  However, we pre-clear the
>> > >  * mlocked state anyway as we might lose the isolation race and we might
>> > >  * not get another chance to clear PageMlocked.  If we successfully
>> > >  * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
>> > >  * mapping the page, it will restore the PageMlocked state, unless the page
>> > >  * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
>> > >  * perhaps redundantly.
>> > >  * If we lose the isolation race, and the page is mapped by other VM_LOCKED
>> > >  * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
>> > >  * either of which will restore the PageMlocked state by calling
>> > >  * mlock_vma_page() above, if it can grab the vma's mmap sem.
>> > >  */
>> > > static void munlock_vma_page(struct page *page)
>> > > {
>> > > ...
>> > >
>> > > --
>> > > Kind regards,
>> > > Minchan Kim
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 11:00                                       ` Minchan Kim
@ 2009-08-18 11:11                                         ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 11:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
	LKML, linux-mm

On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
> On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> >> On Tue, 18 Aug 2009 17:31:19 +0800
> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >>
> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> >> > > On Tue, 18 Aug 2009 10:34:38 +0800
> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> > >
> >> > > > Minchan,
> >> > > >
> >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> >> > > > > >> > Wu Fengguang wrote:
> >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> > > > > >> > >> Side question -
> >> > > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
> >> > > > > >> > >> as opposed to __isolate_lru_page?
> >> > > > > >> > >>
> >> > > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> >> > > > > >> > >>                  putback_lru_page(page);
> >> > > > > >> > >>                  continue;
> >> > > > > >> > >>          }
> >> > > > > >> > >>
> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> >> > > > > >> > >
> >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> >> > > > > >> > > and again.
> >> > > > > >> >
> >> > > > > >> > Please read what putback_lru_page does.
> >> > > > > >> >
> >> > > > > >> > It moves the page onto the unevictable list, so that
> >> > > > > >> > it will not end up in this scan again.
> >> > > > > >>
> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> >> > > > > >> unevictable page may be scanned but still not moved to unevictable
> >> > > > > >> list: when a page is mapped in two places, the first pte has the
> >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> >> > > > > >> case?
> >> > > > >
> >> > > > > I think it's not a big deal.
> >> > > >
> >> > > > Maybe; otherwise I should have brought this issue up long before :)
> >> > > >
> >> > > > > As you mentioned, it's a rare case, so there would be few pages in the
> >> > > > > active list instead of the unevictable list.
> >> > > >
> >> > > > Yes.
> >> > > >
> >> > > > > When the next scan comes, we can try to move the pages into the
> >> > > > > unevictable list again.
> >> > > >
> >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> >> > > > to change, and the VM_LOCKED pages may circulate in the active/inactive
> >> > > > lists countless times.
> >> > >
> >> > > PG_mlocked is not important in that case.
> >> > > The important thing is the VM_LOCKED vma.
> >> > > I think the annotation below can help you understand my point. :)
> >> >
> >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> >> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
> >>
> >> No. I am sorry for not making my point clear.
> >> I meant the following.
> >> When the next scan happens:
> >>
> >> shrink_page_list
> >  ->
> >                referenced = page_referenced(page, 1,
> >                                                sc->mem_cgroup, &vm_flags);
> >                /* In active use or really unfreeable?  Activate it. */
> >                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> >                                        referenced && page_mapping_inuse(page))
> >                        goto activate_locked;
> >
> >> -> try_to_unmap
> >     ~~~~~~~~~~~~ this line won't be reached if the page is found to be
> >     referenced in the lines above?
> 
> Indeed! In fact, I was worried about that.
> It looks like a livelock problem.
> But I think it's a very small race window, so there hasn't been any report until now.
> Let's Cc Lee.
> 
> If we have to fix it, how about this?
> This version has less overhead than yours, since
> shrink_page_list is called less often than page_referenced.

Yeah, it looks better. However, I still wonder whether (VM_LOCKED && !PG_mlocked)
is possible and somehow persistent. Does anyone have the answer? Thanks!

Thanks,
Fengguang

> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ed63894..283266c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
>          */
>         if (vma->vm_flags & VM_LOCKED) {
>                 *mapcount = 1;  /* break early from loop */
> +               *vm_flags |= VM_LOCKED;
>                 goto out_unmap;
>         }
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d224b28..d156e1d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                                                 sc->mem_cgroup, &vm_flags);
>                 /* In active use or really unfreeable?  Activate it. */
>                 if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> -                                       referenced && page_mapping_inuse(page))
> +                                       referenced && page_mapping_inuse(page)
> +                                       && !(vm_flags & VM_LOCKED))
>                         goto activate_locked;
> 
> 
> 
> >
> > Thanks,
> > Fengguang
> >
> >>       -> try_to_unmap_xxx
> >>               -> if (vma->vm_flags & VM_LOCKED)
> >>               -> try_to_mlock_page
> >>                       -> TestSetPageMlocked
> >>                       -> putback_lru_page
> >>
> >> So in the end, the page will be placed on the unevictable list.
> >>
> >> > Then I was worrying about a null problem. Sorry for the confusion!
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> > > ----
> >> > >
> >> > > /*
> >> > >  * called from munlock()/munmap() path with page supposedly on the LRU.
> >> > >  *
> >> > >  * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
> >> > >  * [in try_to_munlock()] and then attempt to isolate the page.  We must
> >> > >  * isolate the page to keep others from messing with its unevictable
> >> > >  * and mlocked state while trying to munlock.  However, we pre-clear the
> >> > >  * mlocked state anyway as we might lose the isolation race and we might
> >> > >  * not get another chance to clear PageMlocked.  If we successfully
> >> > >  * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> >> > >  * mapping the page, it will restore the PageMlocked state, unless the page
> >> > >  * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
> >> > >  * perhaps redundantly.
> >> > >  * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> >> > >  * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> >> > >  * either of which will restore the PageMlocked state by calling
> >> > >  * mlock_vma_page() above, if it can grab the vma's mmap sem.
> >> > >  */
> >> > > static void munlock_vma_page(struct page *page)
> >> > > {
> >> > > ...
> >> > >
> >> > > --
> >> > > Kind regards,
> >> > > Minchan Kim
> >>
> >>
> >> --
> >> Kind regards,
> >> Minchan Kim
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 11:11                                         ` Wu Fengguang
@ 2009-08-18 14:03                                           ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 14:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
	LKML, linux-mm

On Tue, Aug 18, 2009 at 8:11 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
>> On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
>> >> On Tue, 18 Aug 2009 17:31:19 +0800
>> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >>
>> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
>> >> > > On Tue, 18 Aug 2009 10:34:38 +0800
>> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> > >
>> >> > > > Minchan,
>> >> > > >
>> >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
>> >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> >> > > > > >> > Wu Fengguang wrote:
>> >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> >> > > > > >> > >> Side question -
>> >> > > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
>> >> > > > > >> > >> as opposed to __isolate_lru_page?
>> >> > > > > >> > >>
>> >> > > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
>> >> > > > > >> > >>                  putback_lru_page(page);
>> >> > > > > >> > >>                  continue;
>> >> > > > > >> > >>          }
>> >> > > > > >> > >>
>> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
>> >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
>> >> > > > > >> > >
>> >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
>> >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
>> >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
>> >> > > > > >> > > and again.
>> >> > > > > >> >
>> >> > > > > >> > Please read what putback_lru_page does.
>> >> > > > > >> >
>> >> > > > > >> > It moves the page onto the unevictable list, so that
>> >> > > > > >> > it will not end up in this scan again.
>> >> > > > > >>
>> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
>> >> > > > > >> unevictable page may be scanned but still not moved to unevictable
>> >> > > > > >> list: when a page is mapped in two places, the first pte has the
>> >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
>> >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
>> >> > > > > >> case?
>> >> > > > >
>> >> > > > > I think it's not a big deal.
>> >> > > >
>> >> > > > Maybe; otherwise I should have brought this issue up long before :)
>> >> > > >
>> >> > > > > As you mentioned, it's a rare case, so there would be few pages in the
>> >> > > > > active list instead of the unevictable list.
>> >> > > >
>> >> > > > Yes.
>> >> > > >
>> >> > > > > When the next scan comes, we can try to move the pages into the
>> >> > > > > unevictable list again.
>> >> > > >
>> >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
>> >> > > > to change, and the VM_LOCKED pages may circulate in the active/inactive
>> >> > > > lists countless times.
>> >> > >
>> >> > > PG_mlocked is not important in that case.
>> >> > > The important thing is the VM_LOCKED vma.
>> >> > > I think the annotation below can help you understand my point. :)
>> >> >
>> >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
>> >> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
>> >>
>> >> No. I am sorry for not making my point clear.
>> >> I meant the following.
>> >> When the next scan happens:
>> >>
>> >> shrink_page_list
>> >  ->
>> >                referenced = page_referenced(page, 1,
>> >                                                sc->mem_cgroup, &vm_flags);
>> >                /* In active use or really unfreeable?  Activate it. */
>> >                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
>> >                                        referenced && page_mapping_inuse(page))
>> >                        goto activate_locked;
>> >
>> >> -> try_to_unmap
>> >     ~~~~~~~~~~~~ this line won't be reached if the page is found to be
>> >     referenced in the lines above?
>>
>> Indeed! In fact, I was worried about that.
>> It looks like a livelock problem.
>> But I think it's a very small race window, so there hasn't been any report until now.
>> Let's Cc Lee.
>>
>> If we have to fix it, how about this?
>> This version has less overhead than yours, since
>> shrink_page_list is called less often than page_referenced.
>
> Yeah, it looks better. However, I still wonder whether (VM_LOCKED && !PG_mlocked)
> is possible and somehow persistent. Does anyone have the answer? Thanks!

I think it's possible.
munlock_vma_page pre-clears the page's PG_mlocked.
And then, if isolate_lru_page fails, the page has no PG_mlocked but is still
mapped by a vma that has VM_LOCKED.

As munlock_vma_page's comment says, we hope the page will be rescued by
try_to_unmap. But as you pointed out, if the page has PG_referenced, it can't
reach try_to_unmap, so it will go onto the active list.
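
A minimal sketch of that window, using the names from the munlock_vma_page
comment quoted earlier (simplified and hypothetical, not the actual
mm/mlock.c code):

	/* munlock path: */
	if (TestClearPageMlocked(page)) {	/* PG_mlocked pre-cleared */
		if (isolate_lru_page(page)) {
			/*
			 * Lost the isolation race: the page is now
			 * !PG_mlocked yet still mapped by a VM_LOCKED
			 * vma.  We count on vmscan's try_to_unmap() to
			 * restore PG_mlocked later, which never happens
			 * while page_referenced() keeps sending the
			 * page back to the active list first.
			 */
		}
	}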

What do others think?

> Thanks,
> Fengguang



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16  5:09                                     ` Balbir Singh
@ 2009-08-18 15:57                                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, Wu Fengguang, Rik van Riel, Johannes Weiner,
	Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	Mel Gorman, LKML, linux-mm

> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
> 
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > > 
> > > >> So even with the active list being a FIFO, we keep usage information
> > > >> gathered from the inactive list.  If we deactivate pages in arbitrary
> > > >> list intervals, we throw this away.
> > > > 
> > > > We do have the danger of FIFO, if the inactive list is small enough, so
> > > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > > their life window in the inactive list is too short to be useful.
> > > 
> > > This is one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > > 
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> > 
> > Right, the current code tries to pull the inactive list out of its
> > smallish-size state as long as there is vmscan activity.
> > 
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> > 
> 
> I think we possibly need to export some scanning data under DEBUG_VM
> to cross-verify.

Sorry for the delay.
How about this?

=======================================
Subject: [PATCH] vmscan: show recent_scanned/rotated stat

In a recent discussion, Balbir Singh pointed out that VM developers should
be able to see the recent_scanned/rotated statistics.

This patch exports them in /proc/zoneinfo under CONFIG_DEBUG_VM.

output example
--------------------
% cat /proc/zoneinfo
Node 0, zone    DMA32
  pages free     347590
        min      613
        low      766
        high     919
(snip)
  inactive_ratio:    3
  recent_rotated_anon: 127305
  recent_rotated_file: 67439
  recent_scanned_anon: 135591
  recent_scanned_file: 180399



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmstat.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: b/mm/vmstat.c
===================================================================
--- a/mm/vmstat.c	2009-08-08 14:16:53.000000000 +0900
+++ b/mm/vmstat.c	2009-08-18 22:07:25.000000000 +0900
@@ -762,6 +762,20 @@ static void zoneinfo_show_print(struct s
 		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
+
+#ifdef CONFIG_DEBUG_VM
+	seq_printf(m,
+		   "\n  recent_rotated_anon: %lu"
+		   "\n  recent_rotated_file: %lu"
+		   "\n  recent_scanned_anon: %lu"
+		   "\n  recent_scanned_file: %lu",
+		   zone->reclaim_stat.recent_rotated[0],
+		   zone->reclaim_stat.recent_rotated[1],
+		   zone->reclaim_stat.recent_scanned[0],
+		   zone->reclaim_stat.recent_scanned[1]
+		);
+#endif
+
 	seq_putc(m, '\n');
 }
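
For context, these counters are what get_scan_ratio() feeds into the
anon/file balancing decision. Roughly (sketched from the vmscan.c of this
era for illustration; not part of the patch):

	/*
	 * The more of a list's recently scanned pages were referenced
	 * (rotated), the smaller its share of the next scan.
	 */
	ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
	ap /= reclaim_stat->recent_rotated[0] + 1;

	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
	fp /= reclaim_stat->recent_rotated[1] + 1;

So being able to read recent_scanned/recent_rotated from /proc/zoneinfo
makes it much easier to see why reclaim favors one list over the other.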
 






^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-15  5:45                                   ` Wu Fengguang
@ 2009-08-18 15:57                                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Rik van Riel, Johannes Weiner, Avi Kivity,
	Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
	linux-mm


> > This is one of the reasons why we unconditionally deactivate
> > the active anon pages, and do background scanning of the
> > active anon list when reclaiming page cache pages.
> > 
> > We want to always move some pages to the inactive anon
> > list, so it does not get too small.
> 
> Right, the current code tries to pull the inactive list out of its
> smallish-size state as long as there is vmscan activity.
> 
> However there is a possible (and tricky) hole: mem cgroups
> don't do batched vmscan. shrink_zone() may call shrink_list()
> with nr_to_scan=1, in which case shrink_list() _still_ calls
> isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> 
> It effectively scales up the inactive list scan rate by 10 times when
> it is still small, and may thus prevent it from ever growing.
> 
> In that case, LRU becomes FIFO.
> 
> Jeff, can you confirm if the mem cgroup's inactive list is small?
> If so, this patch should help.

This patch does the right thing.
However, let me explain why the memcg folks and I didn't do that in the past.

Strangely, some memcg struct declarations are hidden in *.c files. Thus we
can't make them inline functions, and we hesitated to introduce a lot of
function call overhead.

So, can we move some of the memcg structure declarations to *.h and make
mem_cgroup_get_saved_scan() an inline function?
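
A hypothetical sketch of that suggestion, reusing the names from the patch
quoted below (it assumes struct mem_cgroup_per_zone and
mem_cgroup_zoneinfo() were made visible in a header, which is not the case
today):

	static inline unsigned long *
	mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
				  struct zone *zone, enum lru_list lru)
	{
		struct mem_cgroup_per_zone *mz;

		mz = mem_cgroup_zoneinfo(memcg, zone->zone_pgdat->node_id,
					 zone_idx(zone));
		return &mz->nr_saved_scan[lru];
	}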


> 
> Thanks,
> Fengguang
> ---
> 
> mm: do batched scans for mem_cgroup
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/memcontrol.h |    3 +++
>  mm/memcontrol.c            |   12 ++++++++++++
>  mm/vmscan.c                |    9 +++++----
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> --- linux.orig/include/linux/memcontrol.h	2009-08-15 13:12:49.000000000 +0800
> +++ linux/include/linux/memcontrol.h	2009-08-15 13:18:13.000000000 +0800
> @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru);
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> +					 struct zone *zone,
> +					 enum lru_list lru);
>  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>  						      struct zone *zone);
>  struct zone_reclaim_stat*
> --- linux.orig/mm/memcontrol.c	2009-08-15 13:07:34.000000000 +0800
> +++ linux/mm/memcontrol.c	2009-08-15 13:17:56.000000000 +0800
> @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
>  	 */
>  	struct list_head	lists[NR_LRU_LISTS];
>  	unsigned long		count[NR_LRU_LISTS];
> +	unsigned long		nr_saved_scan[NR_LRU_LISTS];
>  
>  	struct zone_reclaim_stat reclaim_stat;
>  };
> @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
>  	return MEM_CGROUP_ZSTAT(mz, lru);
>  }
>  
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> +					 struct zone *zone,
> +					 enum lru_list lru)
> +{
> +	int nid = zone->zone_pgdat->node_id;
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	return &mz->nr_saved_scan[lru];
> +}

I think this function is a bit strange.
shrink_zone doesn't hold any lock, so shouldn't we worry about the memcg
removal race?



> +
>  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>  						      struct zone *zone)
>  {
> --- linux.orig/mm/vmscan.c	2009-08-15 13:04:54.000000000 +0800
> +++ linux/mm/vmscan.c	2009-08-15 13:19:03.000000000 +0800
> @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
>  	for_each_evictable_lru(l) {
>  		int file = is_file_lru(l);
>  		unsigned long scan;
> +		unsigned long *saved_scan;
>  
>  		scan = zone_nr_pages(zone, sc, l);
>  		if (priority || noswap) {
> @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
>  			scan = (scan * percent[file]) / 100;
>  		}
>  		if (scanning_global_lru(sc))
> -			nr[l] = nr_scan_try_batch(scan,
> -						  &zone->lru[l].nr_saved_scan,
> -						  swap_cluster_max);
> +			saved_scan = &zone->lru[l].nr_saved_scan;
>  		else
> -			nr[l] = scan;
> +			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> +							       zone, l);
> +		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
>  	}
>  
>  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
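
For reference, the batching helper the patch reuses, nr_scan_try_batch(),
accumulates sub-batch scan requests until they reach swap_cluster_max.
Roughly (reconstructed here to match how the thread describes it; see
mm/vmscan.c of this era for the authoritative version):

	static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
					       unsigned long *nr_saved_scan,
					       unsigned long swap_cluster_max)
	{
		unsigned long nr;

		/* Remember small requests instead of scanning right away. */
		*nr_saved_scan += nr_to_scan;
		nr = *nr_saved_scan;

		if (nr >= swap_cluster_max)
			*nr_saved_scan = 0;	/* batch is full: scan it all */
		else
			nr = 0;			/* defer until the batch fills */

		return nr;
	}

This is why giving each memcg per-zone LRU its own nr_saved_scan matters:
without it, a shrink_list() call with nr_to_scan=1 still ends up isolating
SWAP_CLUSTER_MAX pages at once.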




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-16 11:29                         ` Wu Fengguang
@ 2009-08-18 15:57                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

> > Yes it does. I said 'mostly' because there is a small hole that an
> > unevictable page may be scanned but still not moved to unevictable
> > list: when a page is mapped in two places, the first pte has the
> > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > page_referenced() will return 1 and shrink_page_list() will move it
> > into active list instead of unevictable list. Shall we fix this rare
> > case?
> 
> How about this fix?

Good spotting.
Yes, this is a rare case, but I also don't think your patch introduces a
performance regression.

However, I think your patch has one bug.

> 
> ---
> mm: stop circulating of referenced mlocked pages
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> 
> --- linux.orig/mm/rmap.c	2009-08-16 19:11:13.000000000 +0800
> +++ linux/mm/rmap.c	2009-08-16 19:22:46.000000000 +0800
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
>  		*mapcount = 1;	/* break early from loop */
> +		*vm_flags |= VM_LOCKED;
>  		goto out_unmap;
>  	}
>  
> @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
>  	}
>  
>  	spin_unlock(&mapping->i_mmap_lock);
> +	if (*vm_flags & VM_LOCKED)
> +		referenced = 0;
>  	return referenced;
>  }
>  

page_referenced_file?
I think we should change page_referenced().
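
(For reference: page_referenced() fans out by mapping type, so a VM_LOCKED
check added only in page_referenced_file() would miss anonymous pages.
Simplified from the 2.6.31-era mm/rmap.c, with the locking elided, so
details may differ:)

int page_referenced(struct page *page, int is_locked,
		    struct mem_cgroup *mem_cont, unsigned long *vm_flags)
{
	int referenced = 0;

	*vm_flags = 0;
	if (page_mapped(page) && page->mapping) {
		if (PageAnon(page))
			referenced += page_referenced_anon(page, mem_cont,
							   vm_flags);
		else	/* file-backed; page lock handling elided */
			referenced += page_referenced_file(page, mem_cont,
							   vm_flags);
	}
	if (page_test_and_clear_young(page))
		referenced++;

	return referenced;
}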


Instead, how about this?
==============================================

Subject: [PATCH] mm: stop circulating of referenced mlocked pages

Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked,
because some races prevent grabbing the page.
In that case, vmscan moves the page to the unevictable LRU instead.

However, Wu Fengguang recently pointed out that the current vmscan logic
isn't efficient here:
an mlocked page can circulate between the active and inactive lists, because
vmscan checks whether the page is referenced _before_ culling mlocked pages.

Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
Otherwise the vm statistics show strange numbers.

This patch does that.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/internal.h |    5 +++--
 mm/rmap.c     |    8 +++++++-
 mm/vmscan.c   |    2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)

Index: b/mm/internal.h
===================================================================
--- a/mm/internal.h	2009-06-26 21:06:43.000000000 +0900
+++ b/mm/internal.h	2009-08-18 23:31:11.000000000 +0900
@@ -91,7 +91,8 @@ static inline void unevictable_migrate_p
  * to determine if it's being mapped into a LOCKED vma.
  * If so, mark page as mlocked.
  */
-static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+static inline int try_set_page_mlocked(struct vm_area_struct *vma,
+				       struct page *page)
 {
 	VM_BUG_ON(PageLRU(page));
 
@@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st
 }
 
 #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
-static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p)
 {
 	return 0;
 }
Index: b/mm/rmap.c
===================================================================
--- a/mm/rmap.c	2009-08-18 19:48:14.000000000 +0900
+++ b/mm/rmap.c	2009-08-18 23:47:34.000000000 +0900
@@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 1;		/* break early from loop */
+		*vm_flags |= VM_LOCKED;	/* prevent moving to the active list */
+		try_set_page_mlocked(vma, page);
 		goto out_unmap;
 	}
 
@@ -531,6 +533,9 @@ int page_referenced(struct page *page,
 	if (page_test_and_clear_young(page))
 		referenced++;
 
+	if (unlikely(*vm_flags & VM_LOCKED))
+		referenced = 0;
+
 	return referenced;
 }
 
@@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page 
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
+			try_set_page_mlocked(vma, page);
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2009-08-18 19:48:14.000000000 +0900
+++ b/mm/vmscan.c	2009-08-18 23:30:51.000000000 +0900
@@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st
 	if (mapping_unevictable(page_mapping(page)))
 		return 0;
 
-	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+	if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page)))
 		return 0;
 
 	return 1;
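
(For reference, the helper being renamed already has try-and-set
semantics; its body is approximately the following, from memory of that
era's mm/internal.h, so details may differ:)

static inline int try_set_page_mlocked(struct vm_area_struct *vma,
				       struct page *page)
{
	VM_BUG_ON(PageLRU(page));

	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
		return 0;

	/* set PG_mlocked at most once, and account the transition */
	if (!TestSetPageMlocked(page)) {
		inc_zone_page_state(page, NR_MLOCK);
		count_vm_event(UNEVICTABLE_PGMLOCKED);
	}
	return 1;
}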









^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 11:11                                         ` Wu Fengguang
@ 2009-08-18 16:27                                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 16:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Minchan Kim, Lee Schermerhorn, Rik van Riel,
	Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
	Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman,
	LKML, linux-mm

> On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
> > On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> > >> On Tue, 18 Aug 2009 17:31:19 +0800
> > >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >>
> > >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > >> > > On Tue, 18 Aug 2009 10:34:38 +0800
> > >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >> > >
> > >> > > > Minchan,
> > >> > > >
> > >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > >> > > > > >> > Wu Fengguang wrote:
> > >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> > > > > >> > >> Side question -
> > >> > > > > >> > >>  Is there a good reason for this to be in shrink_active_list()
> > >> > > > > >> > >> as opposed to __isolate_lru_page?
> > >> > > > > >> > >>
> > >> > > > > >> > >>          if (unlikely(!page_evictable(page, NULL))) {
> > >> > > > > >> > >>                  putback_lru_page(page);
> > >> > > > > >> > >>                  continue;
> > >> > > > > >> > >>          }
> > >> > > > > >> > >>
> > >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > >> > > > > >> > >
> > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > >> > > > > >> > > and again.
> > >> > > > > >> >
> > >> > > > > >> > Please read what putback_lru_page does.
> > >> > > > > >> >
> > >> > > > > >> > It moves the page onto the unevictable list, so that
> > >> > > > > >> > it will not end up in this scan again.
> > >> > > > > >>
> > >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > >> > > > > >> unevictable page may be scanned but still not moved to unevictable
> > >> > > > > >> list: when a page is mapped in two places, the first pte has the
> > >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> > >> > > > > >> case?
> > >> > > > >
> > >> > > > > I think it's not a big deal.
> > >> > > >
> > >> > > > Maybe; otherwise I would have brought this issue up long ago :)
> > >> > > >
> > >> > > > > As you mentioned, it's rare case so there would be few pages in active
> > >> > > > > list instead of unevictable list.
> > >> > > >
> > >> > > > Yes.
> > >> > > >
> > >> > > > > When next time to scan comes, we can try to move the pages into
> > >> > > > > unevictable list, again.
> > >> > > >
> > >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > >> > > > to change and the VM_LOCKED pages may circulate in active/inactive
> > >> > > > list for countless times.
> > >> > >
> > >> > > PG_mlocked is not important in that case.
> > >> > > The important thing is the VM_LOCKED vma.
> > >> > > I think the annotation below can help you understand my point. :)
> > >> >
> > >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> > >> > PG_mlocked set, and so will be caught by page_evictable(). Is that so?
> > >>
> > >> No. I am sorry for not making my point clear.
> > >> I meant the following.
> > >> When the next scan comes:
> > >>
> > >> shrink_page_list
> > >  ->
> > >                referenced = page_referenced(page, 1,
> > >                                                sc->mem_cgroup, &vm_flags);
> > >                /* In active use or really unfreeable?  Activate it. */
> > >                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > >                                        referenced && page_mapping_inuse(page))
> > >                        goto activate_locked;
> > >
> > >> -> try_to_unmap
> > >     ~~~~~~~~~~~~ this line won't be reached if page is found to be
> > >     referenced in the above lines?
> > 
> > Indeed! In fact, I was worried about that.
> > It looks like a livelock problem.
> > But I think it's a very small race window, so there hasn't been any report
> > until now.
> > Let's Cc Lee.
> > 
> > If we have to fix it, how about this?
> > This version has less overhead than yours, since
> > shrink_page_list() is called less often than page_referenced().
> 
> Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked)
> is possible and somehow persistent. Does anyone have the answer? Thanks!

Hehe, that's a bug. You spotted a very good thing IMHO ;)
I posted a fixed patch. Can you see it?




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 15:57                           ` KOSAKI Motohiro
@ 2009-08-19 12:01                             ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 11:57:54PM +0800, KOSAKI Motohiro wrote:
> > > Yes it does. I said 'mostly' because there is a small hole that an
> > > unevictable page may be scanned but still not moved to unevictable
> > > list: when a page is mapped in two places, the first pte has the
> > > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > page_referenced() will return 1 and shrink_page_list() will move it
> > > into active list instead of unevictable list. Shall we fix this rare
> > > case?
> > 
> > How about this fix?
> 
> Good spotting.
> Yes, this is a rare case, but I also don't think your patch introduces a
> performance regression.

Thanks.

> However, I think your patch has one bug.

Hehe, sorry for being careless :)

> > 
> > ---
> > mm: stop circulating of referenced mlocked pages
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > 
> > --- linux.orig/mm/rmap.c	2009-08-16 19:11:13.000000000 +0800
> > +++ linux/mm/rmap.c	2009-08-16 19:22:46.000000000 +0800
> > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> >  	 */
> >  	if (vma->vm_flags & VM_LOCKED) {
> >  		*mapcount = 1;	/* break early from loop */
> > +		*vm_flags |= VM_LOCKED;
> >  		goto out_unmap;
> >  	}
> >  
> > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> >  	}
> >  
> >  	spin_unlock(&mapping->i_mmap_lock);
> > +	if (*vm_flags & VM_LOCKED)
> > +		referenced = 0;
> >  	return referenced;
> >  }
> >  
> 
> page_referenced_file?
> I think we should change page_referenced().

Yeah, good catch.

> 
> Instead, how about this?
> ==============================================
> 
> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> 
> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked,

                                                        mark PG_mlocked

> because some races prevent grabbing the page.
> In that case, vmscan moves the page to the unevictable LRU instead.
> 
> However, Wu Fengguang recently pointed out that the current vmscan logic
> isn't efficient here:
> an mlocked page can circulate between the active and inactive lists, because
> vmscan checks whether the page is referenced _before_ culling mlocked pages.
> 
> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.

                           PG_mlocked

> Otherwise the vm statistics show strange numbers.
> 
> This patch does that.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

> Reported-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/internal.h |    5 +++--
>  mm/rmap.c     |    8 +++++++-
>  mm/vmscan.c   |    2 +-
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> Index: b/mm/internal.h
> ===================================================================
> --- a/mm/internal.h	2009-06-26 21:06:43.000000000 +0900
> +++ b/mm/internal.h	2009-08-18 23:31:11.000000000 +0900
> @@ -91,7 +91,8 @@ static inline void unevictable_migrate_p
>   * to determine if it's being mapped into a LOCKED vma.
>   * If so, mark page as mlocked.
>   */
> -static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
> +static inline int try_set_page_mlocked(struct vm_area_struct *vma,
> +				       struct page *page)
>  {
>  	VM_BUG_ON(PageLRU(page));
>  
> @@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st
>  }
>  
>  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> -static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> +static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p)
>  {
>  	return 0;
>  }
> Index: b/mm/rmap.c
> ===================================================================
> --- a/mm/rmap.c	2009-08-18 19:48:14.000000000 +0900
> +++ b/mm/rmap.c	2009-08-18 23:47:34.000000000 +0900
> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 1;		/* break early from loop */
> +		*vm_flags |= VM_LOCKED;	/* prevent moving to the active list */

> +		try_set_page_mlocked(vma, page);

That call is not absolutely necessary, is it?

Thanks,
Fengguang

>  		goto out_unmap;
>  	}
>  
> @@ -531,6 +533,9 @@ int page_referenced(struct page *page,
>  	if (page_test_and_clear_young(page))
>  		referenced++;
>  
> +	if (unlikely(*vm_flags & VM_LOCKED))
> +		referenced = 0;
> +
>  	return referenced;
>  }
>  
> @@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page 
>  	 */
>  	if (!(flags & TTU_IGNORE_MLOCK)) {
>  		if (vma->vm_flags & VM_LOCKED) {
> +			try_set_page_mlocked(vma, page);
>  			ret = SWAP_MLOCK;
>  			goto out_unmap;
>  		}
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c	2009-08-18 19:48:14.000000000 +0900
> +++ b/mm/vmscan.c	2009-08-18 23:30:51.000000000 +0900
> @@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st
>  	if (mapping_unevictable(page_mapping(page)))
>  		return 0;
>  
> -	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
> +	if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page)))
>  		return 0;
>  
>  	return 1;
> 
> 
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 12:01                             ` Wu Fengguang
@ 2009-08-19 12:05                               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-19 12:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

>> page_referenced_file?
>> I think we should change page_referenced().
>
> Yeah, good catch.
>
>>
>> Instead, how about this?
>> ==============================================
>>
>> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>
>> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked,
>
>                                                    mark PG_mlocked
>
>> because some races prevent grabbing the page.
>> In that case, vmscan moves the page to the unevictable LRU instead.
>>
>> However, Wu Fengguang recently pointed out that the current vmscan logic
>> isn't efficient here:
>> an mlocked page can circulate between the active and inactive lists, because
>> vmscan checks whether the page is referenced _before_ culling mlocked pages.
>>
>> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
>
>                           PG_mlocked
>
>> Otherwise the vm statistics show strange numbers.
>>
>> This patch does that.
>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks.



>> Index: b/mm/rmap.c
>> ===================================================================
>> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
>> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
>> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>>        * unevictable list.
>>        */
>>       if (vma->vm_flags & VM_LOCKED) {
>> -             *mapcount = 1;  /* break early from loop */
>> +             *mapcount = 1;          /* break early from loop */
>> +             *vm_flags |= VM_LOCKED; /* prevent moving to the active list */
>
>> +             try_set_page_mlocked(vma, page);
>
> That call is not absolutely necessary, is it?

Why? I haven't caught your point.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18 15:57                                     ` KOSAKI Motohiro
@ 2009-08-19 12:08                                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> 
> > > This one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > > 
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> > 
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> > 
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from growing up for ever.
> > 
> > In that case, LRU becomes FIFO.
> > 
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > If so, this patch should help.
> 
> This patch does the right thing.
> However, let me explain why the memcg folks and I didn't do that in the past.
> 
> Strangely, some memcg struct declarations are hidden in *.c files. Thus we
> can't make inline functions, and we hesitated to introduce the overhead of
> many function calls.
> 
> So, can we move some memcg structure declarations to *.h and make
> mem_cgroup_get_saved_scan() an inline function?

Good idea, I'll do that btw.

> 
> > 
> > Thanks,
> > Fengguang
> > ---
> > 
> > mm: do batched scans for mem_cgroup
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/memcontrol.h |    3 +++
> >  mm/memcontrol.c            |   12 ++++++++++++
> >  mm/vmscan.c                |    9 +++++----
> >  3 files changed, 20 insertions(+), 4 deletions(-)
> > 
> > --- linux.orig/include/linux/memcontrol.h	2009-08-15 13:12:49.000000000 +0800
> > +++ linux/include/linux/memcontrol.h	2009-08-15 13:18:13.000000000 +0800
> > @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >  				       struct zone *zone,
> >  				       enum lru_list lru);
> > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> > +					 struct zone *zone,
> > +					 enum lru_list lru);
> >  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> >  						      struct zone *zone);
> >  struct zone_reclaim_stat*
> > --- linux.orig/mm/memcontrol.c	2009-08-15 13:07:34.000000000 +0800
> > +++ linux/mm/memcontrol.c	2009-08-15 13:17:56.000000000 +0800
> > @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
> >  	 */
> >  	struct list_head	lists[NR_LRU_LISTS];
> >  	unsigned long		count[NR_LRU_LISTS];
> > +	unsigned long		nr_saved_scan[NR_LRU_LISTS];
> >  
> >  	struct zone_reclaim_stat reclaim_stat;
> >  };
> > @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
> >  	return MEM_CGROUP_ZSTAT(mz, lru);
> >  }
> >  
> > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> > +					 struct zone *zone,
> > +					 enum lru_list lru)
> > +{
> > +	int nid = zone->zone_pgdat->node_id;
> > +	int zid = zone_idx(zone);
> > +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> > +
> > +	return &mz->nr_saved_scan[lru];
> > +}
> 
> I think this function is a bit strange.
> shrink_zone() doesn't hold any lock, so shouldn't we care about a race with
> memcg removal?

We've been doing that racy computation for a long time. It may hurt the
balancing a bit, but the balanced vmscan was never perfect, and is not
required to be perfect. So let's just go with it?
 
Thanks,
Fengguang

> 
> > +
> >  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> >  						      struct zone *zone)
> >  {
> > --- linux.orig/mm/vmscan.c	2009-08-15 13:04:54.000000000 +0800
> > +++ linux/mm/vmscan.c	2009-08-15 13:19:03.000000000 +0800
> > @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
> >  	for_each_evictable_lru(l) {
> >  		int file = is_file_lru(l);
> >  		unsigned long scan;
> > +		unsigned long *saved_scan;
> >  
> >  		scan = zone_nr_pages(zone, sc, l);
> >  		if (priority || noswap) {
> > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> >  			scan = (scan * percent[file]) / 100;
> >  		}
> >  		if (scanning_global_lru(sc))
> > -			nr[l] = nr_scan_try_batch(scan,
> > -						  &zone->lru[l].nr_saved_scan,
> > -						  swap_cluster_max);
> > +			saved_scan = &zone->lru[l].nr_saved_scan;
> >  		else
> > -			nr[l] = scan;
> > +			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> > +							       zone, l);
> > +		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> >  	}
> >  
> >  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> 
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 12:05                               ` KOSAKI Motohiro
@ 2009-08-19 12:10                                 ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> page_referenced_file?
> >> I think we should change page_referenced().
> >
> > Yeah, good catch.
> >
> >>
> >> Instead, how about this?
> >> ==============================================
> >>
> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >>
> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked,
> >
> >                                                    mark PG_mlocked
> >
> >> because some races prevent grabbing the page.
> >> In that case, vmscan moves the page to the unevictable LRU instead.
> >>
> >> However, Wu Fengguang recently pointed out that the current vmscan logic
> >> isn't efficient here:
> >> an mlocked page can circulate between the active and inactive lists, because
> >> vmscan checks whether the page is referenced _before_ culling mlocked pages.
> >>
> >> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
> >
> >                           PG_mlocked
> >
> >> Otherwise the vm statistics show strange numbers.
> >>
> >> This patch does that.
> >
> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> Thanks.
> 
> 
> 
> >> Index: b/mm/rmap.c
> >> ===================================================================
> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >>        * unevictable list.
> >>        */
> >>       if (vma->vm_flags & VM_LOCKED) {
> >> -             *mapcount = 1;  /* break early from loop */
> >> +             *mapcount = 1;          /* break early from loop */
> >> +             *vm_flags |= VM_LOCKED; /* prevent moving to the active list */
> >
> >> +             try_set_page_mlocked(vma, page);
> >
> > That call is not absolutely necessary, is it?
> 
> Why? I haven't caught your point.

Because we'll eventually hit another try_set_page_mlocked() when
trying to unmap the page, i.e. it duplicates the other call you added
in this patch.
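
(For reference, the duplication in question, as a schematic of the reclaim
path with the patch applied; assumed flow, simplified:)

/*
 * shrink_page_list(page)
 *   page_referenced(page, ...)
 *     page_referenced_one(page, ...)
 *       vma->vm_flags & VM_LOCKED:
 *         try_set_page_mlocked(vma, page);   <-- first call
 *         *vm_flags |= VM_LOCKED;            (referenced forced to 0)
 *   try_to_unmap(page, ...)
 *     try_to_unmap_one(page, ...)
 *       vma->vm_flags & VM_LOCKED:
 *         try_set_page_mlocked(vma, page);   <-- second call
 *         return SWAP_MLOCK;
 *   putback_lru_page(page)   -> unevictable list
 */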

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 12:10                                 ` Wu Fengguang
@ 2009-08-19 12:25                                   ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 12:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>> >> page_referenced_file?
>> >> I think we should change page_referenced().
>> >
>> > Yeah, good catch.
>> >
>> >>
>> >> Instead, How about this?
>> >> ==============================================
>> >>
>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>> >>
>> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>> >
>> >                                                    mark PG_mlocked
>> >
>> >> because some race prevent page grabbing.
>> >> In that case, instead vmscan move the page to unevictable lru.
>> >>
>> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>> >> efficient.
>> >> mlocked page can move circulatly active and inactive list because
>> >> vmscan check the page is referenced _before_ cull mlocked page.
>> >>
>> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>> >
>> >                           PG_mlocked
>> >
>> >> Otherwise vm stastics show strange number.
>> >>
>> >> This patch does that.
>> >
>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>
>> Thanks.
>>
>>
>>
>> >> Index: b/mm/rmap.c
>> >> ===================================================================
>> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
>> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> >>        * unevictable list.
>> >>        */
>> >>       if (vma->vm_flags & VM_LOCKED) {
>> >> -             *mapcount = 1;  /* break early from loop */
>> >> +             *mapcount = 1;          /* break early from loop */
>> >> +             *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>> >
>> >> +             try_set_page_mlocked(vma, page);
>> >
>> > That call is not absolutely necessary?
>>
>> Why? I haven't catch your point.
>
> Because we'll eventually hit another try_set_page_mlocked() when
> trying to unmap the page. Ie. duplicated with another call you added
> in this patch.

Yes, we don't have to call it, and we can make the patch simpler.
I already sent a patch yesterday.

http://marc.info/?l=linux-mm&m=125059325722370&w=2

I think it's simpler than KOSAKI's idea.
Is there any problem with my patch?


>
> Thanks,
> Fengguang
>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 12:25                                   ` Minchan Kim
@ 2009-08-19 13:19                                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-19 13:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

2009/8/19 Minchan Kim <minchan.kim@gmail.com>:
> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>>> >> page_referenced_file?
>>> >> I think we should change page_referenced().
>>> >
>>> > Yeah, good catch.
>>> >
>>> >>
>>> >> Instead, How about this?
>>> >> ==============================================
>>> >>
>>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>> >>
>>> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>>> >
>>> >                                                    mark PG_mlocked
>>> >
>>> >> because some race prevent page grabbing.
>>> >> In that case, instead vmscan move the page to unevictable lru.
>>> >>
>>> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>>> >> efficient.
>>> >> mlocked page can move circulatly active and inactive list because
>>> >> vmscan check the page is referenced _before_ cull mlocked page.
>>> >>
>>> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>>> >
>>> >                           PG_mlocked
>>> >
>>> >> Otherwise vm stastics show strange number.
>>> >>
>>> >> This patch does that.
>>> >
>>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>>
>>> Thanks.
>>>
>>>
>>>
>>> >> Index: b/mm/rmap.c
>>> >> ===================================================================
>>> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
>>> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
>>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>>> >>        * unevictable list.
>>> >>        */
>>> >>       if (vma->vm_flags & VM_LOCKED) {
>>> >> -             *mapcount = 1;  /* break early from loop */
>>> >> +             *mapcount = 1;          /* break early from loop */
>>> >> +             *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>>> >
>>> >> +             try_set_page_mlocked(vma, page);
>>> >
>>> > That call is not absolutely necessary?
>>>
>>> Why? I haven't catch your point.
>>
>> Because we'll eventually hit another try_set_page_mlocked() when
>> trying to unmap the page. Ie. duplicated with another call you added
>> in this patch.

Correct.


> Yes. we don't have to call it and we can make patch simple.
> I already sent patch on yesterday.
>
> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>
> I think It's more simple than KOSAKI's idea.
> Is any problem in my patch ?

Hmm, I think:

1. Anyway, we need to turn on PG_mlocked.
2. PG_mlocked prevents livelock, because the page_evictable() check is
called very early in shrink_page_list().
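
To illustrate point 2: the cull happens at the top of the scan loop,
before any page_referenced()/try_to_unmap() work is done. A simplified
sketch of the current shrink_page_list() (from memory, so treat the
details as approximate):

	while (!list_empty(page_list)) {
		page = lru_to_page(page_list);
		list_del(&page->lru);

		if (!trylock_page(page))
			goto keep;

		sc->nr_scanned++;

		/* a PG_mlocked page leaves the loop immediately */
		if (unlikely(!page_evictable(page, NULL)))
			goto cull_mlocked;
		...
	}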

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 12:25                                   ` Minchan Kim
@ 2009-08-19 13:24                                     ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 13:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> >> page_referenced_file?
> >> >> I think we should change page_referenced().
> >> >
> >> > Yeah, good catch.
> >> >
> >> >>
> >> >> Instead, How about this?
> >> >> ==============================================
> >> >>
> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >> >>
> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
> >> >
> >> >                                                    mark PG_mlocked
> >> >
> >> >> because some race prevent page grabbing.
> >> >> In that case, instead vmscan move the page to unevictable lru.
> >> >>
> >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
> >> >> efficient.
> >> >> mlocked page can move circulatly active and inactive list because
> >> >> vmscan check the page is referenced _before_ cull mlocked page.
> >> >>
> >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
> >> >
> >> >                           PG_mlocked
> >> >
> >> >> Otherwise vm stastics show strange number.
> >> >>
> >> >> This patch does that.
> >> >
> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >>
> >> Thanks.
> >>
> >>
> >>
> >> >> Index: b/mm/rmap.c
> >> >> ===================================================================
> >> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
> >> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> >>        * unevictable list.
> >> >>        */
> >> >>       if (vma->vm_flags & VM_LOCKED) {
> >> >> -             *mapcount = 1;  /* break early from loop */
> >> >> +             *mapcount = 1;          /* break early from loop */
> >> >> +             *vm_flags |= VM_LOCKED; /* for prevent to move active list */
> >> >
> >> >> +             try_set_page_mlocked(vma, page);
> >> >
> >> > That call is not absolutely necessary?
> >>
> >> Why? I haven't catch your point.
> >
> > Because we'll eventually hit another try_set_page_mlocked() when
> > trying to unmap the page. Ie. duplicated with another call you added
> > in this patch.
> 
> Yes. we don't have to call it and we can make patch simple.
> I already sent patch on yesterday.
> 
> http://marc.info/?l=linux-mm&m=125059325722370&w=2
> 
> I think It's more simple than KOSAKI's idea.
> Is any problem in my patch ?

No, IMHO your patch is simple and good, while KOSAKI's is more
complete :)

- the try_set_page_mlocked() rename is suitable
- the call to try_set_page_mlocked() is necessary in try_to_unmap()
- the "if (VM_LOCKED) referenced = 0" in page_referenced() could
  cover both the active and inactive vmscan paths

I do like your proposed

                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-                                       referenced && page_mapping_inuse(page))
+                                       referenced && page_mapping_inuse(page)
+                                       && !(vm_flags & VM_LOCKED))
                        goto activate_locked;

which looks more intuitive and less confusing.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 13:19                                     ` KOSAKI Motohiro
@ 2009-08-19 13:28                                       ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 13:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 10:19 PM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> 2009/8/19 Minchan Kim <minchan.kim@gmail.com>:
>> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>>> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>>>> >> page_referenced_file?
>>>> >> I think we should change page_referenced().
>>>> >
>>>> > Yeah, good catch.
>>>> >
>>>> >>
>>>> >> Instead, How about this?
>>>> >> ==============================================
>>>> >>
>>>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>>> >>
>>>> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>>>> >
>>>> >                                                    mark PG_mlocked
>>>> >
>>>> >> because some race prevent page grabbing.
>>>> >> In that case, instead vmscan move the page to unevictable lru.
>>>> >>
>>>> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>>>> >> efficient.
>>>> >> mlocked page can move circulatly active and inactive list because
>>>> >> vmscan check the page is referenced _before_ cull mlocked page.
>>>> >>
>>>> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>>>> >
>>>> >                           PG_mlocked
>>>> >
>>>> >> Otherwise vm stastics show strange number.
>>>> >>
>>>> >> This patch does that.
>>>> >
>>>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> >> Index: b/mm/rmap.c
>>>> >> ===================================================================
>>>> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
>>>> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
>>>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>>>> >>        * unevictable list.
>>>> >>        */
>>>> >>       if (vma->vm_flags & VM_LOCKED) {
>>>> >> -             *mapcount = 1;  /* break early from loop */
>>>> >> +             *mapcount = 1;          /* break early from loop */
>>>> >> +             *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>>>> >
>>>> >> +             try_set_page_mlocked(vma, page);
>>>> >
>>>> > That call is not absolutely necessary?
>>>>
>>>> Why? I haven't catch your point.
>>>
>>> Because we'll eventually hit another try_set_page_mlocked() when
>>> trying to unmap the page. Ie. duplicated with another call you added
>>> in this patch.
>
> Correct.
>
>
>> Yes. we don't have to call it and we can make patch simple.
>> I already sent patch on yesterday.
>>
>> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>>
>> I think It's more simple than KOSAKI's idea.
>> Is any problem in my patch ?
>
> Hmm, I think
>
> 1. Anyway, we need turn on PG_mlock.

Let me add my patch again to explain.

diff --git a/mm/rmap.c b/mm/rmap.c
index ed63894..283266c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
         */
        if (vma->vm_flags & VM_LOCKED) {
                *mapcount = 1;  /* break early from loop */
+               *vm_flags |= VM_LOCKED;
                goto out_unmap;
        }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d224b28..d156e1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
                                                sc->mem_cgroup, &vm_flags);
                /* In active use or really unfreeable?  Activate it. */
                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-                                       referenced && page_mapping_inuse(page))
+                                       referenced && page_mapping_inuse(page)
+                                       && !(vm_flags & VM_LOCKED))
                        goto activate_locked;

With this check, the page can reach try_to_unmap() after
page_referenced() in shrink_page_list(). At that point PG_mlocked will
be set.
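
Concretely, the rescue then happens via the try_to_unmap() return
value. A sketch of the relevant part of shrink_page_list() (from
memory, details are approximate):

	switch (try_to_unmap(page, TTU_UNMAP)) {
	case SWAP_FAIL:
		goto activate_locked;
	case SWAP_AGAIN:
		goto keep_locked;
	case SWAP_MLOCK:
		/* a VM_LOCKED vma was found: the page is now
		 * PG_mlocked and goes to the unevictable list */
		goto cull_mlocked;
	case SWAP_SUCCESS:
		; /* try to free the page below */
	}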

> 2. PG_mlock prevent livelock because page_evictable() check is called
> at very early in shrink_page_list().

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 13:24                                     ` Wu Fengguang
@ 2009-08-19 13:38                                       ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 13:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
>> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>> >> >> page_referenced_file?
>> >> >> I think we should change page_referenced().
>> >> >
>> >> > Yeah, good catch.
>> >> >
>> >> >>
>> >> >> Instead, How about this?
>> >> >> ==============================================
>> >> >>
>> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>> >> >>
>> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>> >> >
>> >> >                                                    mark PG_mlocked
>> >> >
>> >> >> because some race prevent page grabbing.
>> >> >> In that case, instead vmscan move the page to unevictable lru.
>> >> >>
>> >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>> >> >> efficient.
>> >> >> mlocked page can move circulatly active and inactive list because
>> >> >> vmscan check the page is referenced _before_ cull mlocked page.
>> >> >>
>> >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>> >> >
>> >> >                           PG_mlocked
>> >> >
>> >> >> Otherwise vm stastics show strange number.
>> >> >>
>> >> >> This patch does that.
>> >> >
>> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> >>
>> >> Thanks.
>> >>
>> >>
>> >>
>> >> >> Index: b/mm/rmap.c
>> >> >> ===================================================================
>> >> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
>> >> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
>> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> >> >>        * unevictable list.
>> >> >>        */
>> >> >>       if (vma->vm_flags & VM_LOCKED) {
>> >> >> -             *mapcount = 1;  /* break early from loop */
>> >> >> +             *mapcount = 1;          /* break early from loop */
>> >> >> +             *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>> >> >
>> >> >> +             try_set_page_mlocked(vma, page);
>> >> >
>> >> > That call is not absolutely necessary?
>> >>
>> >> Why? I haven't catch your point.
>> >
>> > Because we'll eventually hit another try_set_page_mlocked() when
>> > trying to unmap the page. Ie. duplicated with another call you added
>> > in this patch.
>>
>> Yes. we don't have to call it and we can make patch simple.
>> I already sent patch on yesterday.
>>
>> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>>
>> I think It's more simple than KOSAKI's idea.
>> Is any problem in my patch ?
>
> No, IMHO your patch is simple and good, while KOSAKI's is more
> complete :)
>
> - the try_set_page_mlocked() rename is suitable
> - the call to try_set_page_mlocked() is necessary on try_to_unmap()

We don't need the try_set_page_mlocked() call in try_to_unmap().
That's because try_to_unmap_xxx() will call try_to_mlock_page() if the
page is included in any VM_LOCKED vma. Eventually, it can be moved to
the unevictable list.

> - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
>  cover both active/inactive vmscan

The sooner we set PG_mlocked on the page, the more unnecessary vmscan
cost we save moving it from the active list to the inactive list. But
I think it's a rare case, so there would be few such pages, and the
overhead will not be big.

As far as I know, rescuing pages that lost the isolation race from
vmscan was Lee's design. But as you pointed out, it has a bug: vmscan
can't rescue the page until it reaches try_to_unmap().

So I think this approach is proper. :)

> I did like your proposed
>
>                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> -                                       referenced && page_mapping_inuse(page))
> +                                       referenced && page_mapping_inuse(page)
> +                                       && !(vm_flags & VM_LOCKED))
>                        goto activate_locked;
>
> which looks more intuitive and less confusing.
>
> Thanks,
> Fengguang
>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [RFC] memcg: move definitions to .h and inline some functions
  2009-08-18 15:57                                     ` KOSAKI Motohiro
@ 2009-08-19 13:40                                       ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 13:40 UTC (permalink / raw)
  To: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> 
> > > This one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > > 
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> > 
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> > 
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from growing up for ever.
> > 
> > In that case, LRU becomes FIFO.
> > 
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > If so, this patch should help.
> 
> This patch does right thing.
> However, I would explain why I and memcg folks didn't do that in past days.
> 
> Strangely, some memcg struct declaration is hide in *.c. Thus we can't
> make inline function and we hesitated to introduce many function calling
> overhead.
> 
> So, Can we move some memcg structure declaration to *.h and make 
> mem_cgroup_get_saved_scan() inlined function?

OK, here it is. I had to move big chunks to make it compile, and it
does reduce the code by a dozen lines :)

Is this big copy&paste acceptable? (memcg developers CCed).

Thanks,
Fengguang
---

memcg: move definitions to .h and inline some functions

So that these accessors can be made inline functions.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/memcontrol.h |  154 ++++++++++++++++++++++++++++++-----
 mm/memcontrol.c            |  131 -----------------------------
 2 files changed, 134 insertions(+), 151 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2009-08-19 20:18:55.000000000 +0800
+++ linux/include/linux/memcontrol.h	2009-08-19 20:51:06.000000000 +0800
@@ -20,11 +20,144 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
-struct mem_cgroup;
+#include <linux/res_counter.h>
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_MAPPED_FILE,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
+struct mem_cgroup_stat_cpu {
+	s64 count[MEM_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+	struct mem_cgroup_stat_cpu cpustat[0];
+};
+
+/*
+ * per-zone information in memory controller.
+ */
+struct mem_cgroup_per_zone {
+	/*
+	 * spin_lock to protect the per cgroup LRU
+	 */
+	struct list_head	lists[NR_LRU_LISTS];
+	unsigned long		count[NR_LRU_LISTS];
+
+	struct zone_reclaim_stat reclaim_stat;
+};
+/* Macro for accessing counter */
+#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
+
+struct mem_cgroup_per_node {
+	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_lru_info {
+	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
+};
+
+/*
+ * The memory controller data structure. The memory controller controls both
+ * page cache and RSS per cgroup. We would eventually like to provide
+ * statistics based on the statistics developed by Rik Van Riel for clock-pro,
+ * to help the administrator determine what knobs to tune.
+ *
+ * TODO: Add a water mark for the memory controller. Reclaim will begin when
+ * we hit the water mark. May be even add a low water mark, such that
+ * no reclaim occurs from a cgroup at it's low water mark, this is
+ * a feature that will be implemented much later in the future.
+ */
+struct mem_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for memory usage
+	 */
+	struct res_counter res;
+	/*
+	 * the counter to account for mem+swap usage.
+	 */
+	struct res_counter memsw;
+	/*
+	 * Per cgroup active and inactive list, similar to the
+	 * per zone LRU lists.
+	 */
+	struct mem_cgroup_lru_info info;
+
+	/*
+	  protect against reclaim related member.
+	*/
+	spinlock_t reclaim_param_lock;
+
+	int	prev_priority;	/* for recording reclaim priority */
+
+	/*
+	 * While reclaiming in a hiearchy, we cache the last child we
+	 * reclaimed from.
+	 */
+	int last_scanned_child;
+	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	bool use_hierarchy;
+	unsigned long	last_oom_jiffies;
+	atomic_t	refcnt;
+
+	unsigned int	swappiness;
+
+	/* set when res.limit == memsw.limit */
+	bool		memsw_is_minimum;
+
+	/*
+	 * statistics. This must be placed at the end of memcg.
+	 */
+	struct mem_cgroup_stat stat;
+};
+
+static inline struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static inline unsigned long
+mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+			 struct zone *zone,
+			 enum lru_list lru)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
+static inline struct zone_reclaim_stat *
+mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return &mz->reclaim_stat;
+}
+
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr
 							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
-				       struct zone *zone,
-				       enum lru_list lru);
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m
 	return 1;
 }
 
-static inline unsigned long
-mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
-			 enum lru_list lru)
-{
-	return 0;
-}
-
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
-{
-	return NULL;
-}
-
 static inline struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {
--- linux.orig/mm/memcontrol.c	2009-08-19 20:14:56.000000000 +0800
+++ linux/mm/memcontrol.c	2009-08-19 20:46:50.000000000 +0800
@@ -55,30 +55,6 @@ static int really_do_swap_account __init
 static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
 
 /*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
-struct mem_cgroup_stat_cpu {
-	s64 count[MEM_CGROUP_STAT_NSTATS];
-} ____cacheline_aligned_in_smp;
-
-struct mem_cgroup_stat {
-	struct mem_cgroup_stat_cpu cpustat[0];
-};
-
-/*
  * For accounting under irq disable, no need for increment preempt count.
  */
 static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
@@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct
 	return ret;
 }
 
-/*
- * per-zone information in memory controller.
- */
-struct mem_cgroup_per_zone {
-	/*
-	 * spin_lock to protect the per cgroup LRU
-	 */
-	struct list_head	lists[NR_LRU_LISTS];
-	unsigned long		count[NR_LRU_LISTS];
-
-	struct zone_reclaim_stat reclaim_stat;
-};
-/* Macro for accessing counter */
-#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
-
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_lru_info {
-	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
-};
-
-/*
- * The memory controller data structure. The memory controller controls both
- * page cache and RSS per cgroup. We would eventually like to provide
- * statistics based on the statistics developed by Rik Van Riel for clock-pro,
- * to help the administrator determine what knobs to tune.
- *
- * TODO: Add a water mark for the memory controller. Reclaim will begin when
- * we hit the water mark. May be even add a low water mark, such that
- * no reclaim occurs from a cgroup at it's low water mark, this is
- * a feature that will be implemented much later in the future.
- */
-struct mem_cgroup {
-	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for memory usage
-	 */
-	struct res_counter res;
-	/*
-	 * the counter to account for mem+swap usage.
-	 */
-	struct res_counter memsw;
-	/*
-	 * Per cgroup active and inactive list, similar to the
-	 * per zone LRU lists.
-	 */
-	struct mem_cgroup_lru_info info;
-
-	/*
-	  protect against reclaim related member.
-	*/
-	spinlock_t reclaim_param_lock;
-
-	int	prev_priority;	/* for recording reclaim priority */
-
-	/*
-	 * While reclaiming in a hiearchy, we cache the last child we
-	 * reclaimed from.
-	 */
-	int last_scanned_child;
-	/*
-	 * Should the accounting and control be hierarchical, per subtree?
-	 */
-	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
-	atomic_t	refcnt;
-
-	unsigned int	swappiness;
-
-	/* set when res.limit == memsw.limit */
-	bool		memsw_is_minimum;
-
-	/*
-	 * statistics. This must be placed at the end of memcg.
-	 */
-	struct mem_cgroup_stat stat;
-};
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics
 }
 
 static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
-{
-	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
-}
-
-static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem = pc->mem_cgroup;
@@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
-				       struct zone *zone,
-				       enum lru_list lru)
-{
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return MEM_CGROUP_ZSTAT(mz, lru);
-}
-
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone)
-{
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return &mz->reclaim_stat;
-}
-
 struct zone_reclaim_stat *
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {

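For reference, the point of the move: with the definitions visible in
the header, an accessor like the mem_cgroup_get_saved_scan() from the
earlier patch in this thread can become a static inline, so that

	saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, zone, l);
	nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);

in shrink_zone() costs no function call per LRU list. (Illustration
only; that inlining itself is a follow-up, not part of this patch.)
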
^ permalink raw reply	[flat|nested] 243+ messages in thread

* [RFC] memcg: move definitions to .h and inline some functions
@ 2009-08-19 13:40                                       ` Wu Fengguang
  0 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 13:40 UTC (permalink / raw)
  To: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> 
> > > This one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > > 
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> > 
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> > 
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > 
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from growing up for ever.
> > 
> > In that case, LRU becomes FIFO.
> > 
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > If so, this patch should help.
> 
> This patch does the right thing.
> However, let me explain why I and the memcg folks didn't do that in the past.
> 
> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
> make inline functions, and we hesitated to introduce a lot of function call
> overhead.
> 
> So, can we move some memcg structure declarations to *.h and make
> mem_cgroup_get_saved_scan() an inline function?

OK, here it is. I had to move big chunks to make it compile, and it
did reduce the code by a dozen lines :)

Is this big copy&paste acceptable? (memcg developers CCed).
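
To make the hole concrete, here is a sketch of the saved-scan batching
in plain C. It is only an illustration of the idea, not the kernel
code: the function shape and the nr_saved_scan counter are assumptions
for the example, with SWAP_CLUSTER_MAX standing in for the batch size.

/*
 * Accumulate small nr_to_scan requests and report a scan batch only
 * once a full SWAP_CLUSTER_MAX has built up, so that a small memcg
 * inactive list is not scanned at an inflated rate.
 */
static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                       unsigned long *nr_saved_scan)
{
        unsigned long nr;

        *nr_saved_scan += nr_to_scan;
        nr = *nr_saved_scan;

        if (nr >= SWAP_CLUSTER_MAX)
                *nr_saved_scan = 0;     /* release one full batch */
        else
                nr = 0;                 /* keep accumulating */

        return nr;
}

Without such accumulation, a request for nr_to_scan=1 still isolates
SWAP_CLUSTER_MAX pages, which is exactly the inflated scan rate
described in the quote above.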

Thanks,
Fengguang
---

memcg: move definitions to .h and inline some functions

So that these helpers can be inlined.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/memcontrol.h |  154 ++++++++++++++++++++++++++++++-----
 mm/memcontrol.c            |  131 -----------------------------
 2 files changed, 134 insertions(+), 151 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2009-08-19 20:18:55.000000000 +0800
+++ linux/include/linux/memcontrol.h	2009-08-19 20:51:06.000000000 +0800
@@ -20,11 +20,144 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
-struct mem_cgroup;
+#include <linux/res_counter.h>
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_MAPPED_FILE,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
+struct mem_cgroup_stat_cpu {
+	s64 count[MEM_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+	struct mem_cgroup_stat_cpu cpustat[0];
+};
+
+/*
+ * per-zone information in memory controller.
+ */
+struct mem_cgroup_per_zone {
+	/*
+	 * spin_lock to protect the per cgroup LRU
+	 */
+	struct list_head	lists[NR_LRU_LISTS];
+	unsigned long		count[NR_LRU_LISTS];
+
+	struct zone_reclaim_stat reclaim_stat;
+};
+/* Macro for accessing counter */
+#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
+
+struct mem_cgroup_per_node {
+	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_lru_info {
+	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
+};
+
+/*
+ * The memory controller data structure. The memory controller controls both
+ * page cache and RSS per cgroup. We would eventually like to provide
+ * statistics based on the statistics developed by Rik Van Riel for clock-pro,
+ * to help the administrator determine what knobs to tune.
+ *
+ * TODO: Add a water mark for the memory controller. Reclaim will begin when
+ * we hit the water mark. May be even add a low water mark, such that
+ * no reclaim occurs from a cgroup at it's low water mark, this is
+ * a feature that will be implemented much later in the future.
+ */
+struct mem_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for memory usage
+	 */
+	struct res_counter res;
+	/*
+	 * the counter to account for mem+swap usage.
+	 */
+	struct res_counter memsw;
+	/*
+	 * Per cgroup active and inactive list, similar to the
+	 * per zone LRU lists.
+	 */
+	struct mem_cgroup_lru_info info;
+
+	/*
+	  protect against reclaim related member.
+	*/
+	spinlock_t reclaim_param_lock;
+
+	int	prev_priority;	/* for recording reclaim priority */
+
+	/*
+	 * While reclaiming in a hiearchy, we cache the last child we
+	 * reclaimed from.
+	 */
+	int last_scanned_child;
+	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	bool use_hierarchy;
+	unsigned long	last_oom_jiffies;
+	atomic_t	refcnt;
+
+	unsigned int	swappiness;
+
+	/* set when res.limit == memsw.limit */
+	bool		memsw_is_minimum;
+
+	/*
+	 * statistics. This must be placed at the end of memcg.
+	 */
+	struct mem_cgroup_stat stat;
+};
+
+static inline struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static inline unsigned long
+mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+			 struct zone *zone,
+			 enum lru_list lru)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
+static inline struct zone_reclaim_stat *
+mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return &mz->reclaim_stat;
+}
+
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr
 							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
-				       struct zone *zone,
-				       enum lru_list lru);
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m
 	return 1;
 }
 
-static inline unsigned long
-mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
-			 enum lru_list lru)
-{
-	return 0;
-}
-
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
-{
-	return NULL;
-}
-
 static inline struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {
--- linux.orig/mm/memcontrol.c	2009-08-19 20:14:56.000000000 +0800
+++ linux/mm/memcontrol.c	2009-08-19 20:46:50.000000000 +0800
@@ -55,30 +55,6 @@ static int really_do_swap_account __init
 static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
 
 /*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
-struct mem_cgroup_stat_cpu {
-	s64 count[MEM_CGROUP_STAT_NSTATS];
-} ____cacheline_aligned_in_smp;
-
-struct mem_cgroup_stat {
-	struct mem_cgroup_stat_cpu cpustat[0];
-};
-
-/*
  * For accounting under irq disable, no need for increment preempt count.
  */
 static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
@@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct
 	return ret;
 }
 
-/*
- * per-zone information in memory controller.
- */
-struct mem_cgroup_per_zone {
-	/*
-	 * spin_lock to protect the per cgroup LRU
-	 */
-	struct list_head	lists[NR_LRU_LISTS];
-	unsigned long		count[NR_LRU_LISTS];
-
-	struct zone_reclaim_stat reclaim_stat;
-};
-/* Macro for accessing counter */
-#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
-
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_lru_info {
-	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
-};
-
-/*
- * The memory controller data structure. The memory controller controls both
- * page cache and RSS per cgroup. We would eventually like to provide
- * statistics based on the statistics developed by Rik Van Riel for clock-pro,
- * to help the administrator determine what knobs to tune.
- *
- * TODO: Add a water mark for the memory controller. Reclaim will begin when
- * we hit the water mark. May be even add a low water mark, such that
- * no reclaim occurs from a cgroup at it's low water mark, this is
- * a feature that will be implemented much later in the future.
- */
-struct mem_cgroup {
-	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for memory usage
-	 */
-	struct res_counter res;
-	/*
-	 * the counter to account for mem+swap usage.
-	 */
-	struct res_counter memsw;
-	/*
-	 * Per cgroup active and inactive list, similar to the
-	 * per zone LRU lists.
-	 */
-	struct mem_cgroup_lru_info info;
-
-	/*
-	  protect against reclaim related member.
-	*/
-	spinlock_t reclaim_param_lock;
-
-	int	prev_priority;	/* for recording reclaim priority */
-
-	/*
-	 * While reclaiming in a hiearchy, we cache the last child we
-	 * reclaimed from.
-	 */
-	int last_scanned_child;
-	/*
-	 * Should the accounting and control be hierarchical, per subtree?
-	 */
-	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
-	atomic_t	refcnt;
-
-	unsigned int	swappiness;
-
-	/* set when res.limit == memsw.limit */
-	bool		memsw_is_minimum;
-
-	/*
-	 * statistics. This must be placed at the end of memcg.
-	 */
-	struct mem_cgroup_stat stat;
-};
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics
 }
 
 static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
-{
-	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
-}
-
-static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem = pc->mem_cgroup;
@@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
-				       struct zone *zone,
-				       enum lru_list lru)
-{
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return MEM_CGROUP_ZSTAT(mz, lru);
-}
-
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone)
-{
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return &mz->reclaim_stat;
-}
-
 struct zone_reclaim_stat *
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 13:38                                       ` Minchan Kim
@ 2009-08-19 14:00                                         ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 14:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 19, 2009 at 09:38:05PM +0800, Minchan Kim wrote:
> On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
> >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> >> >> page_referenced_file?
> >> >> >> I think we should change page_referenced().
> >> >> >
> >> >> > Yeah, good catch.
> >> >> >
> >> >> >>
> >> >> >> Instead, how about this?
> >> >> >> ==============================================
> >> >> >>
> >> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >> >> >>
> >> >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked
> >> >> >
> >> >> >                                                    mark PG_mlocked
> >> >> >
> >> >> >> because some races prevent grabbing the page.
> >> >> >> In that case, vmscan instead moves the page to the unevictable lru.
> >> >> >>
> >> >> >> However, recently Wu Fengguang pointed out that the current vmscan
> >> >> >> logic isn't so efficient:
> >> >> >> an mlocked page can circulate between the active and inactive lists
> >> >> >> because vmscan checks whether the page is referenced _before_
> >> >> >> culling mlocked pages.
> >> >> >>
> >> >> >> Plus, vmscan should mark PG_Mlocked when culling a mlocked page.
> >> >> >
> >> >> >                           PG_mlocked
> >> >> >
> >> >> >> Otherwise vm statistics show strange numbers.
> >> >> >>
> >> >> >> This patch does that.
> >> >> >
> >> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >>
> >> >> Thanks.
> >> >>
> >> >>
> >> >>
> >> >> >> Index: b/mm/rmap.c
> >> >> >> ===================================================================
> >> >> >> --- a/mm/rmap.c       2009-08-18 19:48:14.000000000 +0900
> >> >> >> +++ b/mm/rmap.c       2009-08-18 23:47:34.000000000 +0900
> >> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> >> >>        * unevictable list.
> >> >> >>        */
> >> >> >>       if (vma->vm_flags & VM_LOCKED) {
> >> >> >> -             *mapcount = 1;  /* break early from loop */
> >> >> >> +             *mapcount = 1;          /* break early from loop */
> >> >> >> +             *vm_flags |= VM_LOCKED; /* prevent moving to the active list */
> >> >> >
> >> >> >> +             try_set_page_mlocked(vma, page);
> >> >> >
> >> >> > That call is not absolutely necessary?
> >> >>
> >> >> Why? I haven't caught your point.
> >> >
> >> > Because we'll eventually hit another try_set_page_mlocked() when
> >> > trying to unmap the page, i.e. it is duplicated with another call
> >> > you added in this patch.
> >>
> >> Yes, we don't have to call it, and we can make the patch simpler.
> >> I already sent a patch yesterday.
> >>
> >> http://marc.info/?l=linux-mm&m=125059325722370&w=2
> >>
> >> I think it's simpler than KOSAKI's idea.
> >> Is there any problem with my patch?
> >
> > No, IMHO your patch is simple and good, while KOSAKI's is more
> > complete :)
> >
> > - the try_set_page_mlocked() rename is suitable
> > - the call to try_set_page_mlocked() is necessary on try_to_unmap()
> 
> We don't need the try_set_page_mlocked() call in try_to_unmap.
> That's because try_to_unmap_xxx() will call try_to_mlock_page() if the
> page is included in any VM_LOCKED vma. Eventually, it can move to the
> unevictable list.

Yes, indeed!

> > - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
> >  cover both active/inactive vmscan
> 
> The sooner we set PG_mlocked on the page, the more unnecessary vmscan
> cost we save between the active and inactive lists. But I think it's a
> rare case, so there would be few such pages.
> So I think that will not be a big overhead.

The active list case can be persistent, when the mlocked (but without
PG_mlocked) page is executable and referenced by 2+ processes. But I
admit that executable pages are relatively rare.

> As far as I know, rescuing pages that lost the isolation race in vmscan
> was Lee's design.
> But as you pointed out, it has a bug: vmscan can't rescue the page
> until it reaches try_to_unmap().
> 
> So I think this approach is proper. :)

Now you decide :)

Thanks,
Fengguang

> > I did like your proposed
> >
> >                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > -                                       referenced && page_mapping_inuse(page))
> > +                                       referenced && page_mapping_inuse(page)
> > +                                       && !(vm_flags & VM_LOCKED))
> >                        goto activate_locked;
> >
> > which looks more intuitive and less confusing.
> >
> > Thanks,
> > Fengguang
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
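
(Pulling this sub-thread together: the control flow being converged on
looks roughly like the sketch below. It is an illustration against the
shrink_page_list() of that era, not a tested patch, and the
page_referenced() signature is assumed from the hunks quoted above.)

        unsigned long vm_flags;
        int referenced;

        referenced = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);

        /*
         * A reference coming from a VM_LOCKED vma must not re-activate
         * the page: let it fall through to try_to_unmap(), which spots
         * the mlocked vma, sets PG_mlocked and moves the page to the
         * unevictable list.
         */
        if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
                        referenced && page_mapping_inuse(page) &&
                        !(vm_flags & VM_LOCKED))
                goto activate_locked;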

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] memcg: move definitions to .h and inline some functions
  2009-08-19 13:40                                       ` Wu Fengguang
@ 2009-08-19 14:18                                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-19 14:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura, lizf,
	menage

Wu Fengguang wrote:
> On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
>>
>> > > This is one of the reasons why we unconditionally deactivate
>> > > the active anon pages, and do background scanning of the
>> > > active anon list when reclaiming page cache pages.
>> > >
>> > > We want to always move some pages to the inactive anon
>> > > list, so it does not get too small.
>> >
>> > Right, the current code tries to pull the inactive list out of
>> > its smallish-size state as long as there is vmscan activity.
>> >
>> > However, there is a possible (and tricky) hole: mem cgroups
>> > don't do batched vmscan. shrink_zone() may call shrink_list()
>> > with nr_to_scan=1, in which case shrink_list() _still_ calls
>> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
>> >
>> > It effectively scales up the inactive list scan rate by 10 times when
>> > it is still small, and may thus prevent it from ever growing.
>> >
>> > In that case, the LRU becomes a FIFO.
>> >
>> > Jeff, can you confirm if the mem cgroup's inactive list is small?
>> > If so, this patch should help.
>>
>> This patch does the right thing.
>> However, let me explain why I and the memcg folks didn't do that in the
>> past.
>>
>> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
>> make inline functions, and we hesitated to introduce a lot of function call
>> overhead.
>>
>> So, can we move some memcg structure declarations to *.h and make
>> mem_cgroup_get_saved_scan() an inline function?
>
> OK, here it is. I had to move big chunks to make it compile, and it
> did reduce the code by a dozen lines :)
>
> Is this big copy&paste acceptable? (memcg developers CCed).
>
> Thanks,
> Fengguang

I don't like this. Please add hooks to the necessary places at this stage.
This will be too big for an inlined function anyway.
Please move this only after you find the overhead is too big.

Thanks,
-Kame


> ---
>
> memcg: move definitions to .h and inline some functions
>
> So that these helpers can be inlined.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/memcontrol.h |  154 ++++++++++++++++++++++++++++++-----
>  mm/memcontrol.c            |  131 -----------------------------
>  2 files changed, 134 insertions(+), 151 deletions(-)
>
> --- linux.orig/include/linux/memcontrol.h	2009-08-19 20:18:55.000000000
> +0800
> +++ linux/include/linux/memcontrol.h	2009-08-19 20:51:06.000000000 +0800
> @@ -20,11 +20,144 @@
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
>  #include <linux/cgroup.h>
> -struct mem_cgroup;
> +#include <linux/res_counter.h>
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_MAPPED_FILE,	/* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
> +struct mem_cgroup_stat_cpu {
> +	s64 count[MEM_CGROUP_STAT_NSTATS];
> +} ____cacheline_aligned_in_smp;
> +
> +struct mem_cgroup_stat {
> +	struct mem_cgroup_stat_cpu cpustat[0];
> +};
> +
> +/*
> + * per-zone information in memory controller.
> + */
> +struct mem_cgroup_per_zone {
> +	/*
> +	 * spin_lock to protect the per cgroup LRU
> +	 */
> +	struct list_head	lists[NR_LRU_LISTS];
> +	unsigned long		count[NR_LRU_LISTS];
> +
> +	struct zone_reclaim_stat reclaim_stat;
> +};
> +/* Macro for accessing counter */
> +#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> +
> +struct mem_cgroup_per_node {
> +	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> +};
> +
> +struct mem_cgroup_lru_info {
> +	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
> +};
> +
> +/*
> + * The memory controller data structure. The memory controller controls
> both
> + * page cache and RSS per cgroup. We would eventually like to provide
> + * statistics based on the statistics developed by Rik Van Riel for
> clock-pro,
> + * to help the administrator determine what knobs to tune.
> + *
> + * TODO: Add a water mark for the memory controller. Reclaim will begin
> when
> + * we hit the water mark. May be even add a low water mark, such that
> + * no reclaim occurs from a cgroup at it's low water mark, this is
> + * a feature that will be implemented much later in the future.
> + */
> +struct mem_cgroup {
> +	struct cgroup_subsys_state css;
> +	/*
> +	 * the counter to account for memory usage
> +	 */
> +	struct res_counter res;
> +	/*
> +	 * the counter to account for mem+swap usage.
> +	 */
> +	struct res_counter memsw;
> +	/*
> +	 * Per cgroup active and inactive list, similar to the
> +	 * per zone LRU lists.
> +	 */
> +	struct mem_cgroup_lru_info info;
> +
> +	/*
> +	  protect against reclaim related member.
> +	*/
> +	spinlock_t reclaim_param_lock;
> +
> +	int	prev_priority;	/* for recording reclaim priority */
> +
> +	/*
> +	 * While reclaiming in a hiearchy, we cache the last child we
> +	 * reclaimed from.
> +	 */
> +	int last_scanned_child;
> +	/*
> +	 * Should the accounting and control be hierarchical, per subtree?
> +	 */
> +	bool use_hierarchy;
> +	unsigned long	last_oom_jiffies;
> +	atomic_t	refcnt;
> +
> +	unsigned int	swappiness;
> +
> +	/* set when res.limit == memsw.limit */
> +	bool		memsw_is_minimum;
> +
> +	/*
> +	 * statistics. This must be placed at the end of memcg.
> +	 */
> +	struct mem_cgroup_stat stat;
> +};
> +
> +static inline struct mem_cgroup_per_zone *
> +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> +}
> +
> +static inline unsigned long
> +mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> +			 struct zone *zone,
> +			 enum lru_list lru)
> +{
> +	int nid = zone->zone_pgdat->node_id;
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	return MEM_CGROUP_ZSTAT(mz, lru);
> +}
> +
> +static inline struct zone_reclaim_stat *
> +mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone->zone_pgdat->node_id;
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	return &mz->reclaim_stat;
> +}
> +
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr
>  							int priority);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> -				       struct zone *zone,
> -				       enum lru_list lru);
> -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup
> *memcg,
> -						      struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> @@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m
>  	return 1;
>  }
>
> -static inline unsigned long
> -mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> -			 enum lru_list lru)
> -{
> -	return 0;
> -}
> -
> -
> -static inline struct zone_reclaim_stat*
> -mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
> -{
> -	return NULL;
> -}
> -
>  static inline struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  {
> --- linux.orig/mm/memcontrol.c	2009-08-19 20:14:56.000000000 +0800
> +++ linux/mm/memcontrol.c	2009-08-19 20:46:50.000000000 +0800
> @@ -55,30 +55,6 @@ static int really_do_swap_account __init
>  static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
>
>  /*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
> -struct mem_cgroup_stat_cpu {
> -	s64 count[MEM_CGROUP_STAT_NSTATS];
> -} ____cacheline_aligned_in_smp;
> -
> -struct mem_cgroup_stat {
> -	struct mem_cgroup_stat_cpu cpustat[0];
> -};
> -
> -/*
>   * For accounting under irq disable, no need for increment preempt count.
>   */
>  static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu
> *stat,
> @@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct
>  	return ret;
>  }
>
> -/*
> - * per-zone information in memory controller.
> - */
> -struct mem_cgroup_per_zone {
> -	/*
> -	 * spin_lock to protect the per cgroup LRU
> -	 */
> -	struct list_head	lists[NR_LRU_LISTS];
> -	unsigned long		count[NR_LRU_LISTS];
> -
> -	struct zone_reclaim_stat reclaim_stat;
> -};
> -/* Macro for accessing counter */
> -#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> -
> -struct mem_cgroup_per_node {
> -	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_lru_info {
> -	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
> -};
> -
> -/*
> - * The memory controller data structure. The memory controller controls
> both
> - * page cache and RSS per cgroup. We would eventually like to provide
> - * statistics based on the statistics developed by Rik Van Riel for
> clock-pro,
> - * to help the administrator determine what knobs to tune.
> - *
> - * TODO: Add a water mark for the memory controller. Reclaim will begin
> when
> - * we hit the water mark. May be even add a low water mark, such that
> - * no reclaim occurs from a cgroup at it's low water mark, this is
> - * a feature that will be implemented much later in the future.
> - */
> -struct mem_cgroup {
> -	struct cgroup_subsys_state css;
> -	/*
> -	 * the counter to account for memory usage
> -	 */
> -	struct res_counter res;
> -	/*
> -	 * the counter to account for mem+swap usage.
> -	 */
> -	struct res_counter memsw;
> -	/*
> -	 * Per cgroup active and inactive list, similar to the
> -	 * per zone LRU lists.
> -	 */
> -	struct mem_cgroup_lru_info info;
> -
> -	/*
> -	  protect against reclaim related member.
> -	*/
> -	spinlock_t reclaim_param_lock;
> -
> -	int	prev_priority;	/* for recording reclaim priority */
> -
> -	/*
> -	 * While reclaiming in a hiearchy, we cache the last child we
> -	 * reclaimed from.
> -	 */
> -	int last_scanned_child;
> -	/*
> -	 * Should the accounting and control be hierarchical, per subtree?
> -	 */
> -	bool use_hierarchy;
> -	unsigned long	last_oom_jiffies;
> -	atomic_t	refcnt;
> -
> -	unsigned int	swappiness;
> -
> -	/* set when res.limit == memsw.limit */
> -	bool		memsw_is_minimum;
> -
> -	/*
> -	 * statistics. This must be placed at the end of memcg.
> -	 */
> -	struct mem_cgroup_stat stat;
> -};
> -
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics
>  }
>
>  static struct mem_cgroup_per_zone *
> -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> -{
> -	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> -}
> -
> -static struct mem_cgroup_per_zone *
>  page_cgroup_zoneinfo(struct page_cgroup *pc)
>  {
>  	struct mem_cgroup *mem = pc->mem_cgroup;
> @@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru
>  	return (active > inactive);
>  }
>
> -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> -				       struct zone *zone,
> -				       enum lru_list lru)
> -{
> -	int nid = zone->zone_pgdat->node_id;
> -	int zid = zone_idx(zone);
> -	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -
> -	return MEM_CGROUP_ZSTAT(mz, lru);
> -}
> -
> -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup
> *memcg,
> -						      struct zone *zone)
> -{
> -	int nid = zone->zone_pgdat->node_id;
> -	int zid = zone_idx(zone);
> -	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -
> -	return &mz->reclaim_stat;
> -}
> -
>  struct zone_reclaim_stat *
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  {



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] memcg: move definitions to .h and inline some functions
  2009-08-19 14:18                                         ` KAMEZAWA Hiroyuki
@ 2009-08-19 14:27                                           ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-19 14:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Wu Fengguang, KOSAKI Motohiro, Rik van Riel, Johannes Weiner,
	Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	Mel Gorman, LKML, linux-mm, nishimura, lizf, menage

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]:

> Wu Fengguang wrote:
> > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> >>
> >> > > This is one of the reasons why we unconditionally deactivate
> >> > > the active anon pages, and do background scanning of the
> >> > > active anon list when reclaiming page cache pages.
> >> > >
> >> > > We want to always move some pages to the inactive anon
> >> > > list, so it does not get too small.
> >> >
> >> > Right, the current code tries to pull the inactive list out of
> >> > its smallish-size state as long as there is vmscan activity.
> >> >
> >> > However, there is a possible (and tricky) hole: mem cgroups
> >> > don't do batched vmscan. shrink_zone() may call shrink_list()
> >> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >> >
> >> > It effectively scales up the inactive list scan rate by 10 times when
> >> > it is still small, and may thus prevent it from ever growing.
> >> >
> >> > In that case, the LRU becomes a FIFO.
> >> >
> >> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> >> > If so, this patch should help.
> >>
> >> This patch does the right thing.
> >> However, let me explain why I and the memcg folks didn't do that in the
> >> past.
> >>
> >> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
> >> make inline functions, and we hesitated to introduce a lot of function call
> >> overhead.
> >>
> >> So, can we move some memcg structure declarations to *.h and make
> >> mem_cgroup_get_saved_scan() an inline function?
> >
> > OK, here it is. I had to move big chunks to make it compile, and it
> > did reduce the code by a dozen lines :)
> >
> > Is this big copy&paste acceptable? (memcg developers CCed).
> >
> > Thanks,
> > Fengguang
> 
> I don't like this. Please add hooks to the necessary places at this stage.
> This will be too big for an inlined function anyway.
> Please move this only after you find the overhead is too big.

Me too. To be honest, I want to keep the implementation abstracted
within memcontrol.c (I am concerned that someone might include
memcontrol.h and access its structure members, which scares me).
Hiding it within memcontrol.c provides the right level of abstraction.
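
(A short sketch of the abstraction being argued for, using only
declarations that already appear in this thread's patches: the header
exposes an opaque type plus accessor prototypes, and the struct layout
stays private to mm/memcontrol.c.)

/* include/linux/memcontrol.h */
struct mem_cgroup;      /* opaque outside mm/memcontrol.c */

unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
                                       struct zone *zone,
                                       enum lru_list lru);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
                                                      struct zone *zone);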

Could you please explain your motivation for this change? I got cc'ed
on a few emails; is this for the patch that exports the nr_save_scanned
approach?

-- 
	Balbir

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] memcg: move definitions to .h and inline some functions
  2009-08-19 14:27                                           ` Balbir Singh
@ 2009-08-20  1:34                                             ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-20  1:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura, lizf,
	menage

On Wed, Aug 19, 2009 at 10:27:05PM +0800, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]:
> 
> > Wu Fengguang wrote:
> > > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> > >>
> > >> > > This is one of the reasons why we unconditionally deactivate
> > >> > > the active anon pages, and do background scanning of the
> > >> > > active anon list when reclaiming page cache pages.
> > >> > >
> > >> > > We want to always move some pages to the inactive anon
> > >> > > list, so it does not get too small.
> > >> >
> > >> > Right, the current code tries to pull the inactive list out of
> > >> > its smallish-size state as long as there is vmscan activity.
> > >> >
> > >> > However, there is a possible (and tricky) hole: mem cgroups
> > >> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > >> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > >> >
> > >> > It effectively scales up the inactive list scan rate by 10 times when
> > >> > it is still small, and may thus prevent it from ever growing.
> > >> >
> > >> > In that case, the LRU becomes a FIFO.
> > >> >
> > >> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > >> > If so, this patch should help.
> > >>
> > >> This patch does the right thing.
> > >> However, let me explain why I and the memcg folks didn't do that in
> > >> the past.
> > >>
> > >> Strangely, some memcg struct declarations are hidden in *.c files. Thus
> > >> we can't make inline functions, and we hesitated to introduce a lot of
> > >> function call overhead.
> > >>
> > >> So, can we move some memcg structure declarations to *.h and make
> > >> mem_cgroup_get_saved_scan() an inline function?
> > >
> > > OK, here it is. I had to move big chunks to make it compile, and it
> > > did reduce a dozen lines of code :)
> > >
> > > Is this big copy&paste acceptable? (memcg developers CCed).
> > >
> > > Thanks,
> > > Fengguang
> > 
> > I don't like this. Please just add hooks in the necessary places at
> > this stage. This will be too big for an inline function anyway.
> > Please move it only after you find the overhead is too big.

It should not cause a performance regression, since the text size is
slightly smaller with the patch:

            text      data        bss        dec      hex      filename
before      8732148   2771858   11048432   22552438   1581f76  vmlinux
after       8731972   2771858   11048432   22552262   1581ec6  vmlinux    

> Me too. To be honest, I want to keep the implementation abstracted
> within memcontrol.c (I am concerned that someone might include
> memcontrol.h and access its structure members, which scares me).
> Hiding it within memcontrol.c provides the right level of abstraction.

Yeah, quite reasonable.
 
> Could you please explain your motivation for this change? I got cc'ed
> on a few emails; is this for the patch that exports the
> nr_save_scanned approach?

Yes, KOSAKI proposed inlining the mem_cgroup_get_saved_scan() function
introduced in that patch, which requires moving the structs into the
header.
I'll submit the original (un-inlined) patch.

Thanks,
Fengguang
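
For readers following along, the idea under discussion can be sketched
as below. This is an illustrative sketch only, not the actual patch:
the real change keeps a per-LRU nr_saved_scan counter in mm/vmscan.c,
and the helper name here is hypothetical.

	/*
	 * Sketch: defer sub-batch scan requests until they add up to a
	 * full SWAP_CLUSTER_MAX batch, so that a small memcg LRU list
	 * is not scanned in oversized batches.
	 */
	static unsigned long defer_scan(unsigned long *nr_saved_scan,
					unsigned long nr_to_scan)
	{
		*nr_saved_scan += nr_to_scan;
		if (*nr_saved_scan < SWAP_CLUSTER_MAX)
			return 0;	/* too few pages: scan nothing yet */

		nr_to_scan = *nr_saved_scan;	/* release accumulated quota */
		*nr_saved_scan = 0;
		return nr_to_scan;
	}

With this, a memcg that repeatedly asks for nr_to_scan=1 isolates
nothing until 32 requests have accumulated, instead of isolating a
full SWAP_CLUSTER_MAX batch on every call.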

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-19 13:28                                       ` Minchan Kim
@ 2009-08-21 11:17                                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-21 11:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

>> Hmm, I think
>>
>> 1. Anyway, we need to turn on PG_mlocked.
>
> I'm adding my patch again to explain.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ed63894..283266c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
>         */
>        if (vma->vm_flags & VM_LOCKED) {
>                *mapcount = 1;  /* break early from loop */
> +               *vm_flags |= VM_LOCKED;
>                goto out_unmap;
>        }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d224b28..d156e1d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct
> list_head *page_list,
>                                                sc->mem_cgroup, &vm_flags);
>                /* In active use or really unfreeable?  Activate it. */
>                if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> -                                       referenced && page_mapping_inuse(page))
> +                                       referenced && page_mapping_inuse(page)
> +                                       && !(vm_flags & VM_LOCKED))
>                        goto activate_locked;
>
> With this check, the page can reach try_to_unmap() after
> page_referenced() in shrink_page_list(). At that point PG_mlocked
> will be set.

You are right.
Please add my Reviewed-by tag to your patch.

Thanks.
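
To spell out the flow reviewed above, here is a simplified sketch of
the intended path (not the exact kernel code; the flags argument of
try_to_unmap() and surrounding details are elided):

	/* Sketch of the intended path through shrink_page_list() */
	referenced = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
	/* In active use or really unfreeable?  Activate it -- but not
	 * when the only reference comes from a VM_LOCKED vma. */
	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
	    referenced && page_mapping_inuse(page) &&
	    !(vm_flags & VM_LOCKED))
		goto activate_locked;

	/* ... */
	switch (try_to_unmap(page, 0)) {
	case SWAP_MLOCK:
		/* try_to_unmap() saw the VM_LOCKED vma: it sets
		 * PG_mlocked, and the page is culled to the
		 * unevictable list instead of being re-activated. */
		goto cull_mlocked;
	/* ... */
	}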

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-17 19:47                         ` Dike, Jeffrey G
@ 2009-08-21 18:24                           ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-21 18:24 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

* Dike, Jeffrey G <jeffrey.g.dike@intel.com> [2009-08-17 12:47:29]:

> > Jeff, can you have a look at these stats? Thanks!
> 
> Yeah, I just did after adding some tracing which dumped out the same data.  It looks pretty much the same.  Inactive anon and active anon are pretty similar.  Inactive file and active file are smaller and fluctuate more, but it doesn't look horribly unbalanced.
> 
> Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening.
> 

Interesting.. there seems to be a sufficient amount of inactive memory,
specifically inactive_file. My biggest suspicion now is the passing of
reference info from the shadow page tables to the host (although, to be
honest, I've never looked at that code).

What do the stats for / from within kvm look like?

-- 
	Balbir
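
For reference, the plumbing in question looks roughly like this in
kernels of this era (a from-memory sketch; exact names and signatures
may differ by version). page_referenced_one() uses
ptep_clear_flush_young_notify(), which calls into the mmu notifier
chain, and KVM's notifier consults the shadow/EPT page tables:

	/* Sketch: how host page aging reaches KVM's shadow page tables */
	static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
						      struct mm_struct *mm,
						      unsigned long address)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);

		/* Test and clear the accessed bits in the shadow/EPT
		 * entries that map this host virtual address. */
		return kvm_age_hva(kvm, address);
	}

Note that whether the shadow/EPT entries carry usable accessed bits is
hardware dependent, which may be exactly where the reference info gets
lost.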

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-21 18:24                           ` Balbir Singh
@ 2009-08-31 19:43                             ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 19:43 UTC (permalink / raw)
  To: balbir
  Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

> What do the stats for / from within kvm look like?

Interesting - what they show is that inactive_anon is always zero.  Details below - I took the host numbers at the same time, and they are similar to what I reported before.

					Jeff

The fields are inactive_anon, active_anon, inactive_file, active_file - shortly after the data started being collected, I started firefox and an editor thingy.  The data continues as far into the shutdown as collection could run.

0 10858 13516 3279
0 10872 13516 3286
0 10867 13513 3286
0 11455 13268 3552
0 13068 12871 3949
0 13281 12810 4012
0 13701 12719 4103
0 14133 12631 4191
0 10878 11742 5087
0 10878 11741 5085
0 10878 11741 5085
0 10877 11741 5085
0 10877 11741 5085
0 10878 11741 5085
0 10878 11741 5085
0 10877 11741 5085
0 10905 11741 5085
0 11118 11776 5106
0 11594 14541 5169
0 11084 15314 5248
0 12022 15686 5300
0 12813 16379 5608
0 13614 16744 5915
0 14230 16849 5936
0 14461 16943 5953
0 14706 17412 5967
0 15574 17445 6011
0 15623 17459 6011
0 15596 17461 6015
0 15941 17523 6048
0 16508 17684 6048
0 17095 18154 6056
0 18635 18175 6056
0 18867 18195 6060
0 18972 18195 6060
0 18975 18185 6073
0 19220 18234 6073
0 19809 18276 6076
0 19571 18276 6076
0 19567 18276 6076
0 19588 18276 6076
0 19588 18276 6076
0 19588 18276 6076
0 19589 18276 6076
0 19603 18276 6076
0 19607 18277 6077
0 19600 18277 6077
0 19034 18235 6119
0 19041 18235 6119
0 19040 18233 6121
0 19040 18233 6121
0 18724 18240 6121
0 11674 16376 7977
0 11674 16376 7977
0 11673 16376 7977
0 11708 16376 7977
0 11703 16374 7979
0 11703 16374 7979
0 11702 16374 7979
0 11702 16374 7979
0 11716 16374 7979
0 11716 16374 7979
0 11718 16374 7979
0 11711 16374 7979
0 11811 16413 7986
0 11811 16413 7986
0 11897 16413 7986
0 12247 16434 7986
0 12495 16457 7990
0 12495 16457 7990
0 12491 16457 7990
0 12491 16457 7990
0 12737 16457 7990
0 11844 16457 7990
0 10969 16436 8011
0 9586 16140 8328
0 9209 16253 8333
0 8467 16120 8550
0 7857 16504 8592
0 7215 16467 8681
0 7194 16481 8723
0 7155 16475 8730

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-31 19:43                             ` Dike, Jeffrey G
@ 2009-08-31 19:52                               ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-31 19:52 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Dike, Jeffrey G wrote:
>> What do the stats for / from within kvm look like?
> 
> Interesting - what they look like is inactive_anon is always zero.

This will be because the VM does not start aging pages
from the active to the inactive list unless there is
some memory pressure.
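
In code terms, this gating is visible in the reclaim path itself: the
deactivation Rik mentions runs only from shrink_zone(), which is only
called under memory pressure. A simplified sketch of the 2.6.3x-era
logic (details elided):

	/* Sketch: anon pages age only while reclaim is running */
	static void shrink_zone(int priority, struct zone *zone,
				struct scan_control *sc)
	{
		/* ... shrink the LRU lists according to nr[] ... */

		/*
		 * Even with nothing reclaimed, trickle some active anon
		 * pages to the inactive list so it does not get too
		 * small -- but this only ever happens during reclaim.
		 */
		if (inactive_anon_is_low(zone, sc))
			shrink_active_list(SWAP_CLUSTER_MAX, zone, sc,
					   priority, 0);
	}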

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-31 19:52                               ` Rik van Riel
@ 2009-08-31 20:06                                 ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 20:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

> This will be because the VM does not start aging pages
> from the active to the inactive list unless there is
> some memory pressure.

Which is the reason I gave the VM a puny amount of memory.  We know the thing is under memory pressure because I've been complaining about page discards.  I didn't collect that data on this run, but I'll do it again to make sure.

					Jeff


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-31 20:06                                 ` Dike, Jeffrey G
@ 2009-08-31 20:09                                   ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-31 20:09 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Dike, Jeffrey G wrote:
>> This will be because the VM does not start aging pages
>> from the active to the inactive list unless there is
>> some memory pressure.
> 
> Which is the reason I gave the VM a puny amount of memory.
> We know the thing is under memory pressure because I've been
> complaining about page discards.

Page discards by the host, which are invisible to the guest
OS.

The guest OS thinks it has enough pages.  The host disagrees
and swaps out some guest memory.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-31 20:09                                   ` Rik van Riel
@ 2009-08-31 20:11                                     ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 20:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
	Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
	KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

> Page discards by the host, which are invisible to the guest
> OS.

Duh.  Right - I can't keep my VM systems straight...

					Jeff


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-31 20:11                                     ` Dike, Jeffrey G
@ 2009-08-31 20:42                                       ` Balbir Singh
  -1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-31 20:42 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Rik van Riel, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu,
	Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
	Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On Tue, Sep 1, 2009 at 1:41 AM, Dike, Jeffrey G<jeffrey.g.dike@intel.com> wrote:
>> Page discards by the host, which are invisible to the guest
>> OS.
>
> Duh.  Right - I can't keep my VM systems straight...
>

Sounds like we need a way of indicating reference information. Guest
page hinting (cough, cough), anyone? Maybe a simpler version?

Balbir Singh.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-18  2:26                                       ` Wu Fengguang
@ 2009-09-02 19:30                                         ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-09-02 19:30 UTC (permalink / raw)
  To: Wu, Fengguang
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

I'm trying to better understand the motivation for your make-mapped-exec-pages-first-class-citizens patch.  As I read your (very detailed!) description, you are diagnosing a threshold effect from Rik's evict-use-once-pages-first patch where if the inactive list is slightly smaller than the active list, the active list will start being scanned, pushing text (and other) pages onto the inactive list where they will be quickly kicked out to swap.

As I read Rik's patch, if the active list is one page larger than the inactive list, then a batch of pages will get moved from one to the other.  For this to have a noticeable effect on the system once the streaming is done, there must be something continuing to keep the active list larger than the inactive list.  Maybe there is a consistent percentage of the streamed pages which are use-twice. 

So, we have a threshold effect where a small change in input (the size of the streaming file vs the number of active pages) causes a large change in output (lots of text pages suddenly start getting thrown out).   My immediate reaction to that is that there shouldn't be this sudden change in behavior, and that maybe there should only be enough scanning in shrink_active_list to bring the two lists back to parity.  However, if there's something keeping the active list bigger than the inactive list, this will just put off the inevitable required scanning.

As for your patch, it seems like we have a problem with scanning I/O, and instead of looking at those pages, you are looking to protect some other set of pages (mapped text).  That, in turn, increases pressure on anonymous pages (which is where I came in).  Wouldn't it be a better idea to keep looking at those streaming pages and figure out how to get them out of memory quickly?

						Jeff
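
For context, the "active larger than inactive" test being discussed is
a ratio check in kernels of this era; a simplified from-memory sketch
of the global-reclaim variant:

	/* Sketch: when is the active anon list considered too large? */
	static int inactive_anon_is_low_global(struct zone *zone)
	{
		unsigned long active   = zone_page_state(zone, NR_ACTIVE_ANON);
		unsigned long inactive = zone_page_state(zone, NR_INACTIVE_ANON);

		/* inactive_ratio grows with zone size: 1 below ~1GB,
		 * 3 at 1GB, 10 at 10GB, ... */
		return inactive * zone->inactive_ratio < active;
	}

On small zones (and small memcgs) inactive_ratio is 1, so exact parity
between the two lists really is the tipping point described above.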



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-09-02 19:30                                         ` Dike, Jeffrey G
@ 2009-09-03  2:04                                           ` Wu Fengguang
  -1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-09-03  2:04 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Jeff,

On Thu, Sep 03, 2009 at 03:30:59AM +0800, Dike, Jeffrey G wrote:
> I'm trying to better understand the motivation for your
> make-mapped-exec-pages-first-class-citizens patch.  As I read your
> (very detailed!) description, you are diagnosing a threshold effect
> from Rik's evict-use-once-pages-first patch where if the inactive
> list is slightly smaller than the active list, the active list will
> start being scanned, pushing text (and other) pages onto the
> inactive list where they will be quickly kicked out to swap.

Right.

> As I read Rik's patch, if the active list is one page larger than
> the inactive list, then a batch of pages will get moved from one to
> the other.  For this to have a noticeable effect on the system once
> the streaming is done, there must be something continuing to keep
> the active list larger than the inactive list.  Maybe there is a
> consistent percentage of the streamed pages which are use-twice. 

Right. Besides the use-twice case, I also explored the
desktop-working-set-cannot-fit-in-memory case in the patch.

> So, we have a threshold effect where a small change in input (the size of
> the streaming file vs the number of active pages) causes a large
> change in output (lots of text pages suddenly start getting thrown
> out).   My immediate reaction to that is that there shouldn't be
> this sudden change in behavior, and that maybe there should only be
> enough scanning in shrink_active_list to bring the two lists back to
> parity.  However, if there's something keeping the active list
> bigger than the inactive list, this will just put off the inevitable
> required scanning.

Yes, there will be a sudden "behavior change" as soon as the active
list grows larger than the inactive list. However, the "output change"
is bounded and not as large, because the extra active list scanning
stops as soon as the two lists are back to parity.

> As for your patch, it seems like we have a problem with scanning
> I/O, and instead of looking at those pages, you are looking to
> protect some other set of pages (mapped text).  That, in turn,
> increases pressure on anonymous pages (which is where I came in).
> Wouldn't it be a better idea to keep looking at those streaming
> pages and figure out how to get them out of memory quickly?

The scanning I/O problem has been largely addressed by Rik's patch.
It is not optimal (optimality is hard), but fair enough for common cases.

Your kvm test case sounds like desktop-working-set-cannot-fit-in-memory.
In that case, protecting the exec-mapped pages is an obvious benefit,
and there are not that many kvm exec-mapped pages, so they put little
extra pressure on anon pages.

I ran a kvm instance and collected the numbers for its exec-mapped
pages as follows. They sum up to ~3MB, which is not big enough to add
much thrashing pressure.

Thanks,
Fengguang
---

Rss of kvm:

% grep -A2 x /proc/7640/smaps | grep -v Size
00400000-005fe000 r-xp 00000000 08:02 1890389                            /usr/bin/kvm
Rss:                 680 kB
--
7f6c029f9000-7f6c02a0f000 r-xp 00000000 08:02 458771                     /lib/libgcc_s.so.1
Rss:                  16 kB
--
7f6c02c10000-7f6c02d00000 r-xp 00000000 08:02 1885409                    /usr/lib/libstdc++.so.6.0.10
Rss:                 364 kB
--
7f6c03bf2000-7f6c03bf7000 r-xp 00000000 08:02 1887873                    /usr/lib/libXdmcp.so.6.0.0
Rss:                  12 kB
--
7f6c03df7000-7f6c03df9000 r-xp 00000000 08:02 1887871                    /usr/lib/libXau.so.6.0.0
Rss:                   8 kB
--
7f6c03ff9000-7f6c04019000 r-xp 00000000 08:02 458890                     /lib/libx86.so.1
Rss:                  36 kB
--
7f6c04019000-7f6c04218000 ---p 00020000 08:02 458890                     /lib/libx86.so.1
Rss:                   0 kB
--
7f6c04218000-7f6c0421a000 rw-p 0001f000 08:02 458890                     /lib/libx86.so.1
Rss:                   8 kB
--
7f6c0421b000-7f6c0421f000 r-xp 00000000 08:02 458861                     /lib/libattr.so.1.1.0
Rss:                  12 kB
--
7f6c0441f000-7f6c04434000 r-xp 00000000 08:02 460739                     /lib/libnsl-2.9.so
Rss:                  24 kB
--
7f6c04637000-7f6c04647000 r-xp 00000000 08:02 1897259                    /usr/lib/libXext.so.6.4.0
Rss:                  20 kB
--
7f6c04647000-7f6c04847000 ---p 00010000 08:02 1897259                    /usr/lib/libXext.so.6.4.0
Rss:                   0 kB
--
7f6c04847000-7f6c04848000 rw-p 00010000 08:02 1897259                    /usr/lib/libXext.so.6.4.0
Rss:                   4 kB
--
7f6c04848000-7f6c04978000 r-xp 00000000 08:02 1889103                    /usr/lib/libicuuc.so.38.1
Rss:                 244 kB
--
7f6c04b89000-7f6c04ba4000 r-xp 00000000 08:02 1886322                    /usr/lib/libxcb.so.1.1.0
Rss:                  28 kB
--
7f6c04ba4000-7f6c04da4000 ---p 0001b000 08:02 1886322                    /usr/lib/libxcb.so.1.1.0
Rss:                   0 kB
--
7f6c04da4000-7f6c04da5000 rw-p 0001b000 08:02 1886322                    /usr/lib/libxcb.so.1.1.0
Rss:                   4 kB
--
7f6c04da5000-7f6c04df3000 r-xp 00000000 08:02 1899899                    /usr/lib/libvga.so.1.4.3
Rss:                  68 kB
--
7f6c05004000-7f6c05018000 r-xp 00000000 08:02 1896343                    /usr/lib/libdirect-1.0.so.0.1.0
Rss:                  24 kB
--
7f6c05219000-7f6c05221000 r-xp 00000000 08:02 1892825                    /usr/lib/libfusion-1.0.so.0.1.0
Rss:                  16 kB
--
7f6c05421000-7f6c0548d000 r-xp 00000000 08:02 1892826                    /usr/lib/libdirectfb-1.0.so.0.1.0
Rss:                  64 kB
--
7f6c05691000-7f6c056a4000 r-xp 00000000 08:02 460720                     /lib/libresolv-2.9.so
Rss:                  20 kB
--
7f6c058a7000-7f6c058aa000 r-xp 00000000 08:02 1895568                    /usr/lib/libgpg-error.so.0.4.0
Rss:                   4 kB
--
7f6c05aaa000-7f6c05b1d000 r-xp 00000000 08:02 1890493                    /usr/lib/libgcrypt.so.11.5.2
Rss:                  36 kB
--
7f6c05d20000-7f6c05d30000 r-xp 00000000 08:02 2081187                    /usr/lib/libtasn1.so.3.1.2
Rss:                  12 kB
--
7f6c05f30000-7f6c05f34000 r-xp 00000000 08:02 458960                     /lib/libcap.so.2.11
Rss:                  12 kB
--
7f6c06134000-7f6c06170000 r-xp 00000000 08:02 1889247                    /usr/lib/libdbus-1.so.3.4.0
Rss:                  36 kB
--
7f6c06372000-7f6c06376000 r-xp 00000000 08:02 1891735                    /usr/lib/libasyncns.so.0.1.0
Rss:                  12 kB
--
7f6c06576000-7f6c0657e000 r-xp 00000000 08:02 458834                     /lib/libwrap.so.0.7.6
Rss:                  20 kB
--
7f6c0677f000-7f6c06784000 r-xp 00000000 08:02 1888031                    /usr/lib/libXtst.so.6.1.0
Rss:                  12 kB
--
7f6c06985000-7f6c0698d000 r-xp 00000000 08:02 1885331                    /usr/lib/libSM.so.6.0.0
Rss:                  12 kB
--
7f6c06b8d000-7f6c06ba3000 r-xp 00000000 08:02 1897238                    /usr/lib/libICE.so.6.3.0
Rss:                  28 kB
--
7f6c06da8000-7f6c06dee000 r-xp 00000000 08:02 2147381                    /usr/lib/libpulsecommon-0.9.15.so
Rss:                  56 kB
--
7f6c06fef000-7f6c06ff1000 r-xp 00000000 08:02 460735                     /lib/libdl-2.9.so
Rss:                   8 kB
--
7f6c071f3000-7f6c0722d000 r-xp 00000000 08:02 2147380                    /usr/lib/libpulse.so.0.8.0
Rss:                  44 kB
--
7f6c0742f000-7f6c07578000 r-xp 00000000 08:02 460721                     /lib/libc-2.9.so
Rss:                 464 kB
--
7f6c07782000-7f6c07786000 r-xp 00000000 08:02 1892933                    /usr/lib/libvdeplug.so.2.1.0
Rss:                  12 kB
--
7f6c07986000-7f6c0798f000 r-xp 00000000 08:02 459991                     /lib/libbrlapi.so.0.5.1
Rss:                  20 kB
--
7f6c07b91000-7f6c07bcb000 r-xp 00000000 08:02 458804                     /lib/libncurses.so.5.6
Rss:                  76 kB
--
7f6c07dd0000-7f6c07f05000 r-xp 00000000 08:02 1886324                    /usr/lib/libX11.so.6.2.0
Rss:                 120 kB
--
7f6c0810b000-7f6c08173000 r-xp 00000000 08:02 1892212                    /usr/lib/libSDL-1.2.so.0.11.1
Rss:                  36 kB
--
7f6c083c1000-7f6c083c3000 r-xp 00000000 08:02 460719                     /lib/libutil-2.9.so
Rss:                   8 kB
--
7f6c085c4000-7f6c085cb000 r-xp 00000000 08:02 460727                     /lib/librt-2.9.so
Rss:                  24 kB
--
7f6c087cc000-7f6c087e2000 r-xp 00000000 08:02 460725                     /lib/libpthread-2.9.so
Rss:                  68 kB
--
7f6c089e7000-7f6c089f2000 r-xp 00000000 08:02 1889339                    /usr/lib/libpci.so.3.1.2
Rss:                  16 kB
--
7f6c08bf2000-7f6c08c0c000 r-xp 00000000 08:02 2146929                    /usr/lib/libbluetooth.so.3.2.5
Rss:                  32 kB
--
7f6c08e0d000-7f6c08eb4000 r-xp 00000000 08:02 1896347                    /usr/lib/libgnutls.so.26.11.5
Rss:                 136 kB
--
7f6c090bf000-7f6c090c2000 r-xp 00000000 08:02 2147382                    /usr/lib/libpulse-simple.so.0.0.2
Rss:                  12 kB
--
7f6c092c3000-7f6c093a0000 r-xp 00000000 08:02 1885450                    /usr/lib/libasound.so.2.0.0
Rss:                 176 kB
--
7f6c095a7000-7f6c095bd000 r-xp 00000000 08:02 1885377                    /usr/lib/libz.so.1.2.3.3
Rss:                  16 kB
--
7f6c097be000-7f6c09840000 r-xp 00000000 08:02 460724                     /lib/libm-2.9.so
Rss:                  20 kB
--
7f6c09a41000-7f6c09a5e000 r-xp 00000000 08:02 459078                     /lib/ld-2.9.so
Rss:                  96 kB
--
7f6c09b2c000-7f6c09b31000 r-xp 00000000 08:02 1886943                    /usr/lib/libgdbm.so.3.0.0
Rss:                  12 kB
--
7fff54d4f000-7fff54d50000 r-xp 00000000 00:00 0                          [vdso]
Rss:                   4 kB
--
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Rss:                   0 kB

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [RFC] respect the referenced bit of KVM guest pages?
  2009-09-03  2:04                                           ` Wu Fengguang
@ 2009-09-04 20:06                                             ` Dike, Jeffrey G
  -1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-09-04 20:06 UTC (permalink / raw)
  To: Wu, Fengguang
  Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Stupid question - what in your patch allows a text page get kicked out to the inactive list after you've given it an extra pass through the active list?

					Jeff

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-09-04 20:06                                             ` Dike, Jeffrey G
@ 2009-09-04 20:57                                               ` Rik van Riel
  -1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-09-04 20:57 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Dike, Jeffrey G wrote:
> Stupid question - what in your patch allows a text page get kicked out to the inactive list after you've given it an extra pass through the active list?

If it did not get referenced during its second pass through
the active list, it will get deactivated.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [RFC] respect the referenced bit of KVM guest pages?
  2009-08-14 21:42                                 ` Dike, Jeffrey G
@ 2009-09-13 16:23                                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-09-13 16:23 UTC (permalink / raw)
  To: Dike, Jeffrey G
  Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, Rik van Riel,
	Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

Hi Jeff,

> A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time.  Your test is rescuing essentially all candidate pages from the inactive list.  Right now, I have the VM_EXEC || PageAnon version of your test.

Sorry for the long-delayed reply.
I set up a reproduction environment today, but had no luck: I could not
reproduce the stack refault issue.
Could you please describe your reproduction steps and your analysis
method in detail?

My environment is:
  x86_64, 4 CPUs, 6 GB RAM
  userland: Fedora 11
  kernel: latest mmotm

  cgroup size: 128M
  guest mem: 256M
  CONFIG_KSM=n

My results:
  - Plenty of anon and file faults happen, but that is expected: they
are caused by demand paging (see the sketch below).
  - do_anonymous_page() handles almost no stack faults, on both the
host and the guest.
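
(As a minimal illustration of that first point, a hypothetical program,
not from this thread: every first touch of a cold anonymous page takes
one minor fault through do_anonymous_page(), so freshly allocated guest
memory naturally generates a burst of anon faults.)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 4096;	/* 16 pages */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * Each page faults on its first write: demand paging via
	 * do_anonymous_page(), one minor fault per page.
	 */
	memset(buf, 1, len);
	munmap(buf, len);
	return 0;
}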

^ permalink raw reply	[flat|nested] 243+ messages in thread

end of thread

Thread overview: 243+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-05  2:40 [RFC] respect the referenced bit of KVM guest pages? Wu Fengguang
2009-08-05  4:15 ` KOSAKI Motohiro
2009-08-05  4:41   ` Wu Fengguang
2009-08-05  7:58 ` Avi Kivity
2009-08-05  8:17   ` Avi Kivity
2009-08-05 14:33     ` Rik van Riel
2009-08-05 15:37       ` Avi Kivity
2009-08-05 14:15   ` Rik van Riel
2009-08-05 15:12     ` Avi Kivity
2009-08-05 15:15       ` Rik van Riel
2009-08-05 15:25         ` Avi Kivity
2009-08-05 16:35           ` Andrea Arcangeli
2009-08-05 16:31     ` Andrea Arcangeli
2009-08-05 17:25       ` Rik van Riel
2009-08-05 15:45   ` Dike, Jeffrey G
2009-08-05 16:05   ` Andrea Arcangeli
2009-08-05 16:12     ` Dike, Jeffrey G
2009-08-05 16:19       ` Andrea Arcangeli
2009-08-05 15:58 ` Andrea Arcangeli
2009-08-05 17:20   ` Rik van Riel
2009-08-05 17:42   ` Rik van Riel
2009-08-06 10:15     ` Andrea Arcangeli
2009-08-06 10:08   ` Andrea Arcangeli
2009-08-06 10:18     ` Avi Kivity
2009-08-06 10:20       ` Andrea Arcangeli
2009-08-06 10:59         ` Wu Fengguang
2009-08-06 11:44           ` Avi Kivity
2009-08-06 13:06             ` Wu Fengguang
2009-08-06 13:16               ` Rik van Riel
2009-08-16  3:28                 ` Wu Fengguang
2009-08-16  3:56                   ` Rik van Riel
2009-08-16  4:43                     ` Balbir Singh
2009-08-16  4:55                     ` Wu Fengguang
2009-08-16  5:59                       ` Balbir Singh
2009-08-17 19:47                       ` Dike, Jeffrey G
2009-08-21 18:24                         ` Balbir Singh
2009-08-31 19:43                           ` Dike, Jeffrey G
2009-08-31 19:52                             ` Rik van Riel
2009-08-31 20:06                               ` Dike, Jeffrey G
2009-08-31 20:09                                 ` Rik van Riel
2009-08-31 20:11                                   ` Dike, Jeffrey G
2009-08-31 20:42                                     ` Balbir Singh
2009-08-06 13:46               ` Avi Kivity
2009-08-06 21:09               ` Jeff Dike
2009-08-16  3:18                 ` Wu Fengguang
2009-08-16  3:53                   ` Rik van Riel
2009-08-16  5:15                     ` Wu Fengguang
2009-08-16 11:29                       ` Wu Fengguang
2009-08-17 14:33                         ` Minchan Kim
2009-08-18  2:34                           ` Wu Fengguang
2009-08-18  4:17                             ` Minchan Kim
2009-08-18  9:31                               ` Wu Fengguang
2009-08-18  9:52                                 ` Minchan Kim
2009-08-18 10:00                                   ` Wu Fengguang
2009-08-18 11:00                                     ` Minchan Kim
2009-08-18 11:11                                       ` Wu Fengguang
2009-08-18 14:03                                         ` Minchan Kim
2009-08-18 16:27                                         ` KOSAKI Motohiro
2009-08-18 15:57                         ` KOSAKI Motohiro
2009-08-19 12:01                           ` Wu Fengguang
2009-08-19 12:05                             ` KOSAKI Motohiro
2009-08-19 12:10                               ` Wu Fengguang
2009-08-19 12:25                                 ` Minchan Kim
2009-08-19 13:19                                   ` KOSAKI Motohiro
2009-08-19 13:28                                     ` Minchan Kim
2009-08-21 11:17                                       ` KOSAKI Motohiro
2009-08-19 13:24                                   ` Wu Fengguang
2009-08-19 13:38                                     ` Minchan Kim
2009-08-19 14:00                                       ` Wu Fengguang
2009-08-06 13:13             ` Rik van Riel
2009-08-06 13:49               ` Avi Kivity
2009-08-07  3:11               ` KOSAKI Motohiro
2009-08-07  7:54                 ` Balbir Singh
2009-08-07  8:24                   ` KAMEZAWA Hiroyuki
2009-08-06 13:11           ` Rik van Riel
2009-08-06 13:08     ` Rik van Riel
2009-08-07  3:17       ` KOSAKI Motohiro
2009-08-12  7:48         ` Wu Fengguang
2009-08-12 14:31           ` Rik van Riel
2009-08-13  1:03             ` Wu Fengguang
2009-08-13 15:46               ` Rik van Riel
2009-08-13 16:12                 ` Avi Kivity
2009-08-13 16:26                   ` Rik van Riel
2009-08-13 19:12                     ` Avi Kivity
2009-08-13 21:16                       ` Johannes Weiner
2009-08-14  7:16                         ` Avi Kivity
2009-08-14  9:10                           ` Johannes Weiner
2009-08-14  9:51                             ` Wu Fengguang
2009-08-14 13:19                               ` Rik van Riel
2009-08-15  5:45                                 ` Wu Fengguang
2009-08-16  5:09                                   ` Balbir Singh
2009-08-16  5:41                                     ` Wu Fengguang
2009-08-16  5:50                                     ` Wu Fengguang
2009-08-18 15:57                                     ` KOSAKI Motohiro
2009-08-17 18:04                                   ` Dike, Jeffrey G
2009-08-18  2:26                                     ` Wu Fengguang
2009-09-02 19:30                                       ` Dike, Jeffrey G
2009-09-03  2:04                                         ` Wu Fengguang
2009-09-04 20:06                                           ` Dike, Jeffrey G
2009-09-04 20:57                                             ` Rik van Riel
2009-08-18 15:57                                   ` KOSAKI Motohiro
2009-08-19 12:08                                     ` Wu Fengguang
2009-08-19 13:40                                     ` [RFC] memcg: move definitions to .h and inline some functions Wu Fengguang
2009-08-19 14:18                                       ` KAMEZAWA Hiroyuki
2009-08-19 14:27                                         ` Balbir Singh
2009-08-20  1:34                                           ` Wu Fengguang
2009-08-14 21:42                               ` [RFC] respect the referenced bit of KVM guest pages? Dike, Jeffrey G
2009-08-14 22:37                                 ` Rik van Riel
2009-08-15  5:32                                   ` Wu Fengguang
2009-09-13 16:23                                 ` KOSAKI Motohiro
2009-08-05 17:53 ` Rik van Riel
2009-08-05 19:00   ` Dike, Jeffrey G
2009-08-05 19:07     ` Rik van Riel
2009-08-05 19:18       ` Dike, Jeffrey G
2009-08-06  9:22         ` Avi Kivity
2009-08-06  9:25           ` Wu Fengguang
2009-08-06  9:35             ` Avi Kivity
2009-08-06  9:35               ` Wu Fengguang
2009-08-06  9:59                 ` Avi Kivity
2009-08-06  9:59                   ` Wu Fengguang
2009-08-06 10:14                     ` Avi Kivity
2009-08-07  1:25                       ` KAMEZAWA Hiroyuki
